[
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: Bug report\nabout: Create a Bug report to help us improve\ntitle: \"[BUG] Function X is raising error Y\"\nlabels: bug\nassignees: jmcorreia\n\n---\n\n**Describe the bug**\nA clear and concise description of what the bug is.\n\n**Environment Details**\n- Lakehouse Engine Version\n- Environment where you are using the Lakehouse Engine (Ex. Databricks 13.3LTS)\n\n**To Reproduce**\nPlease include all the necessary details to reproduce the problem, including the full ACON or functions that are being used and at what point the problem is occurring. \n\n**Expected behavior**\nA clear and concise description of what you expected to happen.\n\n**Screenshots**\nIf applicable, add screenshots to help explain your problem.\n\n**Additional context**\nAdd any other context about the problem here.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: Feature request\nabout: Suggest an idea for this project\ntitle: \"[FEATURE] I would like to have the capability to do X\"\nlabels: enhancement\nassignees: jmcorreia\n\n---\n\n**Is your feature request related to a problem? Please describe.**\nA clear and concise description of what the problem is. Ex. I'm always frustrated when [...]\n\n**Describe the solution you'd like**\nA clear and concise description of what you want to happen.\n\n**Describe alternatives you've considered**\nA clear and concise description of any alternative solutions or features you've considered.\n\n**Additional context**\nAdd any other context, useful links or screenshots about the feature request here.\n"
  },
  {
    "path": ".github/pull_request_template.md",
    "content": "- [ ] Description of PR changes above includes a link to [an existing GitHub issue](https://github.com/adidas/lakehouse-engine/issues)\n- [ ] PR title is prefixed with one of: [BUGFIX], [FEATURE]\n- [ ] Appropriate tests and docs have been updated\n- [ ] Code is linted and tested -\n```\n  make style\n  make lint\n  make test\n  make test-security\n```\n\nFor more information about contributing, see [Contribute](https://github.com/adidas/lakehouse-engine/blob/master/CONTRIBUTING.md).\n\nAfter you submit your PR, keep **monitoring its statuses and discuss/apply fixes for any issues or suggestions coming from the PR Reviews**. \n\nThanks for contributing!"
  },
  {
    "path": ".gitignore",
    "content": "# mac os hidden files\n.DS_Store\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checer\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# intellij and vscode\n.idea/\n**.iml\n.vscode/\n\n# credentials\n**credential**\n\n# lakehouse and spark\n/tests/lakehouse/**\n*derby.log*\n**/metastore_db/\n/metastore_db/\n**/spark-warehouse/\n/spark-warehouse/\n\n/artefacts/\ntmp_os/"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "# How to Contribute\n\n📖 Search algorithms, transformations and check implementation details & examples in our [documentation](https://adidas.github.io/lakehouse-engine-docs/lakehouse_engine.html).\n\n💭 In case you have doubts, ideas, want to ask for help or want to discuss different approach and usages, feel free to create a [discussion](https://github.com/adidas/lakehouse-engine/discussions).\n\n⚠️ Are you facing any issues? Open an issue on [GitHub](https://github.com/adidas/lakehouse-engine/issues).\n\n💡 Do you have ideas for new features? Open a feature request on [GitHub](https://github.com/adidas/lakehouse-engine/issues).\n\n🚀 Want to find the available releases? Check our release notes on [GitHub](https://github.com/adidas/lakehouse-engine/releases) and [PyPi](https://pypi.org/project/lakehouse-engine/).\n\n## Prerequisites\n\n1. Git.\n2. Your IDE of choice with a Python 3 environment (e.g., virtualenv created from the requirements_cicd.txt file).\n3. Docker. **Warning:** The default spark driver memory limit for the tests is set at 2g. This limit is configurable but your\n   testing docker setup **MUST** always have **at least** 2 * spark driver memory limit + 1 gb configured.\n4. GNU make.\n\n## General steps for contributing\n1. Fork the project.\n2. Clone the forked project into your working environment.\n3. Create your feature branch following the convention [feature|bugfix]/ISSUE_ID_short_name.\n4. Apply your changes in the recently created branch. It is **mandatory** to add tests covering the feature of fix contributed.\n5. Style, lint, test and test security:\n    ```\n    make style\n    make lint\n    make test\n    make test-security\n    ```\n---\n> ***Note:*** To use the make targets with another docker-compatible cli other than docker you can pass the parameter \"container_cli\". \nExample: `make test container_cli=nerdctl`\n\n---\n\n---\n> ***Note:*** Most make target commands are running on docker. If you face any problem, you can also check the code of the respective make targets and directly execute the code in your python virtual environment.\n\n---\n\n6. (optional) You can build the wheel locally with `make build`.\n7. (optional) Install the wheel you have just generated and test it.\n8. If you have changed or added new requirements, you should run `make build-lock-files`, to rebuild the lock files. \n9. If the transitive dependencies have not been updated for a while, and you want to upgrade them, you can use `make upgrade-lock-files` to update them. This will update the transitive dependencies even if you have not changed the requirements.\n10. When you're ready with your changes, open a Pull Request (PR) to develop.\n11. Ping the team through the preferred communication channel.\n12. The team will come together to review it and approve it (2 approvals required).\n13. Your changes will be tested internally, promoted to master and included in the next release.\n\n> 🚀🚀🚀\n>\n> **Pull Requests are welcome from anyone**. However, before opening one, please make sure to open an issue on [GitHub](https://github.com/adidas/lakehouse-engine/issues)\n> and link it.\n> Moreover, if the Pull Request intends to cover big changes or features, it is recommended to first discuss it on a [GitHub issue](https://github.com/adidas/lakehouse-engine/issues) or [Discussion](https://github.com/adidas/lakehouse-engine/discussions).\n>\n> 🚀🚀🚀"
  },
  {
    "path": "LICENSE.txt",
    "content": "                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. 
For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. 
Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. (Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright 2023 adidas AG\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "Makefile",
    "content": "SHELL := /bin/bash -euxo pipefail\n\ncontainer_cli := docker\nimage_name := lakehouse-engine\ndeploy_env := dev\nproject_version := $(shell cat cicd/.bumpversion.cfg | grep \"current_version =\" | cut -f 3 -d \" \")\nversion := $(project_version)\n# Gets system information in upper case\nsystem_information := $(shell uname -mvp | tr a-z A-Z)\nmeta_conf_file := cicd/meta.yaml\nmeta_os_conf_file := cicd/meta_os.yaml\ngroup_id := $(shell id -g ${USER})\nengine_conf_file := lakehouse_engine/configs/engine.yaml\nengine_os_conf_file := lakehouse_engine/configs/engine_os.yaml\nremove_files_from_os := $(engine_conf_file) $(meta_conf_file) CODEOWNERS sonar-project.properties CONTRIBUTING.md CHANGELOG.md assets/img/os_strategy.png\nlast_commit_msg := \"$(shell git log -1 --pretty=%B)\"\ngit_tag := $(shell git describe --tags --abbrev=0)\ncommits_url := $(shell cat $(meta_conf_file) | grep commits_url | cut -f 2 -d \" \")\n\nifneq ($(project_version), $(version))\nwheel_version := $(project_version)+$(subst _,.,$(subst -,.,$(version)))\nproject_name := lakehouse-engine-experimental\nelse\nwheel_version := $(version)\nproject_name := lakehouse-engine\nendif\n\n# Add \\ to make reg safe comparisons (e.g. in the perl commands)\nwheel_version_reg_safe := $(subst +,\\+,$(subst .,\\.,$(wheel_version)))\nproject_version_reg_safe := $(subst .,\\.,$(project_version))\n\n# Condition to define the Python image to be built based on the machine CPU architecture.\n# The base Python image only changes if the identified CPU architecture is ARM.\nifneq (,$(findstring ARM,$(system_information)))\npython_image := $(shell cat $(meta_conf_file) | grep arm_python_image | cut -f 2 -d \" \")\ncpu_architecture := arm64\nelse\npython_image := $(shell cat $(meta_conf_file) | grep amd_python_image | cut -f 2 -d \" \")\ncpu_architecture := amd64\nendif\n\n# Condition to define the spark driver memory limit to be used in the tests\n# In order to change this limit you can use the spark_driver_memory parameter\n# Example: make test spark_driver_memory=3g\n#\n# WARNING: When the tests are being run 2 spark nodes are created, so despite\n# the default value being 2g, your configured docker environment should have\n# extra memory for communication and overhead.\nifndef $(spark_driver_memory)\n\tspark_driver_memory := \"2g\"\nendif\n\n# A requirements_full.lock file is created based on all the requirements of the project (core, dq, os, azure, sftp, cicd and sharepoint).\n# The requirements_full.lock file is then used as a constraints file to build the other lock file so that we ensure dependencies are consistent and compatible\n# with each other, otherwise, the the installations would likely fail.\n# Moreover, the requirement_full.lock file is also used in the dockerfile to install all project dependencies.\nfull_requirements := -o requirements_full.lock requirements.txt requirements_os.txt requirements_dq.txt requirements_azure.txt requirements_sftp.txt requirements_cicd.txt requirements_sharepoint.txt\nrequirements := -o requirements.lock requirements.txt -c requirements_full.lock\nos_requirements := -o requirements_os.lock requirements_os.txt -c requirements_full.lock\ndq_requirements = -o requirements_dq.lock requirements_dq.txt -c requirements_full.lock\nazure_requirements = -o requirements_azure.lock requirements_azure.txt -c requirements_full.lock\nsftp_requirements = -o requirements_sftp.lock requirements_sftp.txt -c requirements_full.lock\nsharepoint_requirements = -o requirements_sharepoint.lock 
requirements_sharepoint.txt -c requirements_full.lock\nos_deployment := False\ncontainer_user_dir := /home/appuser\ntrust_git_host := ssh -oStrictHostKeyChecking=no -i $(container_user_dir)/.ssh/id_rsa git@github.com\nifeq ($(os_deployment), True)\nbuild_src_dir := tmp_os/lakehouse-engine\nelse\nbuild_src_dir := .\nendif\n\nbuild-image:\n\t$(container_cli) build \\\n\t\t--build-arg USER_ID=$(shell id -u ${USER}) \\\n\t\t--build-arg GROUP_ID=$(group_id)  \\\n\t\t--build-arg PYTHON_IMAGE=$(python_image) \\\n\t\t--build-arg CPU_ARCHITECTURE=$(cpu_architecture) \\\n\t\t-t $(image_name):$(version) . -f cicd/Dockerfile\n\nbuild-image-windows:\n\t$(container_cli) build \\\n\t\t--build-arg PYTHON_IMAGE=$(python_image) \\\n        --build-arg CPU_ARCHITECTURE=$(cpu_architecture) \\\n        -t $(image_name):$(version) . -f cicd/Dockerfile\n\n# The build target is used to build the wheel package.\n# It makes use of some `perl` commands to change the project wheel version in the pyproject.toml file,\n# whenever the goal is to release a package for testing, instead of an official release.\n# Ex: if you run 'make build version=feature-x-1276', and the current project version is 1.20.0, the generated wheel will be: lakehouse_engine_experimental-1.20.0+feature.x.1276-py3-none-any.whl,\n# while for the official 1.20.0 release, the wheel will be: lakehouse_engine-1.20.0-py3-none-any.whl.\nbuild:\n\tperl -pi -e 's/version = \"$(project_version_reg_safe)\"/version = \"$(wheel_version)\"/g' pyproject.toml && \\\n\tperl -pi -e 's/name = \"lakehouse-engine\"/name = \"$(project_name)\"/g' pyproject.toml && \\\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'python -m build --wheel $(build_src_dir)' && \\\n\tperl -pi -e 's/version = \"$(wheel_version_reg_safe)\"/version = \"$(project_version)\"/g' pyproject.toml && \\\n\tperl -pi -e 's/name = \"$(project_name)\"/name = \"lakehouse-engine\"/g' pyproject.toml\n\n\ndeploy: build\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t-v $(artifactory_credentials_file):$(container_user_dir)/.pypirc \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'twine upload -r artifactory dist/$(subst -,_,$(project_name))-$(wheel_version)-py3-none-any.whl --skip-existing'\n\ndocs:\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'cd $(build_src_dir) && pip install . && python ./cicd/code_doc/render_docs.py'\n\n# mypy incremental mode is used by default, so in case there is any cache related issue,\n# you can modify the command to include --no-incremental flag or you can delete mypy_cache folder.\nlint:\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n        -v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'flake8 --docstring-convention google --config=cicd/flake8.conf lakehouse_engine tests cicd/code_doc/render_docs.py \\\n\t\t&& mypy --no-incremental lakehouse_engine tests'\n\n# useful to print and use make variables. 
Usage: make print-variable var=variable_to_print.\nprint-variable:\n\t@echo $($(var))\n\nstyle:\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c '''isort lakehouse_engine tests cicd/code_doc/render_docs.py && \\\n        black lakehouse_engine tests cicd/code_doc/render_docs.py'''\n\nterminal:\n\t$(container_cli) run \\\n\t\t-it \\\n\t\t--rm \\\n\t  \t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash\n\n# Can use test only: ```make test test_only=\"tests/feature/test_delta_load_record_mode_cdc.py\"```.\n# You can also hack it by doing ```make test test_only=\"-rx tests/feature/test_delta_load_record_mode_cdc.py\"```\n# to show complete output even of passed tests.\n# We also fix the coverage filepaths, using perl, so that report has the correct paths\ntest:\n\t$(container_cli) run \\\n\t\t--rm \\\n\t\t-w /app \\\n        -v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c \"pytest \\\n            --junitxml=artefacts/tests.xml \\\n            --cov-report xml --cov-report xml:artefacts/coverage.xml \\\n            --cov-report term-missing --cov=lakehouse_engine \\\n            --log-cli-level=INFO --color=yes -x -vv \\\n            --spark_driver_memory=$(spark_driver_memory) $(test_only)\" && \\\n\tperl -pi -e 's/filename=\\\"/filename=\\\"lakehouse_engine\\//g' artefacts/coverage.xml\n\ntest-security:\n\t$(container_cli) run \\\n\t\t--rm \\\n\t\t-w /app \\\n        -v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'bandit -c cicd/bandit.yaml -r lakehouse_engine tests'\n\n#####################################\n##### Dependency Management Targets #####\n#####################################\n\naudit-dep-safety:\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n        -v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'pip-audit -r cicd/requirements_full.lock --desc on -f json --fix --dry-run -o artefacts/safety_analysis.json'\n\n# This target will build the lock files to be used for building the wheel and delivering it.\nbuild-lock-files:\n\t$(container_cli) run --rm \\\n\t    -w /app \\\n\t    -v \"$$PWD\":/app \\\n\t    $(image_name):$(version) \\\n\t    /bin/bash -c 'cd cicd && pip-compile --resolver=backtracking $(full_requirements) && \\\n\t    pip-compile --resolver=backtracking $(requirements) && \\\n\t    pip-compile --resolver=backtracking $(os_requirements) && \\\n\t    pip-compile --resolver=backtracking $(dq_requirements) && \\\n\t\tpip-compile --resolver=backtracking $(azure_requirements) && \\\n\t\tpip-compile --resolver=backtracking $(sftp_requirements) && \\\n\t\tpip-compile --resolver=backtracking $(sharepoint_requirements)'\n\n# We test the dependencies to check if they need to be updated because requirements.txt files have changed.\n# On top of that, we also test if we will be able to install the base and the extra packages together, \n# as their lock files are built separately and therefore dependency constraints might be too restricted. \n# If that happens, pip install will fail because it cannot solve the dependency resolution process, and therefore\n# we need to pin those conflict dependencies in the requirements.txt files to a version that fits both the base and \n# extra packages.\ntest-deps:\n\t@GIT_STATUS=\"$$(git status --porcelain --ignore-submodules cicd/)\"; \\\n\tif [ ! \"x$$GIT_STATUS\" = \"x\"  ]; then \\\n\t    echo \"!!! 
Requirements lists has been updated but lock file was not rebuilt !!!\"; \\\n\t    echo \"!!! Run make build-lock-files !!!\"; \\\n\t    echo -e \"$${GIT_STATUS}\"; \\\n\t    git diff cicd/; \\\n\t    exit 1; \\\n\tfi\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n        -v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'pip install -e .[azure,dq,sftp,os] --dry-run --ignore-installed'\n\n# This will update the transitive dependencies even if there were no changes in the requirements files.\n# This should be a recurrent activity to make sure transitive dependencies are kept up to date.\nupgrade-lock-files:\n\t$(container_cli) run --rm \\\n\t    -w /app \\\n\t    -v \"$$PWD\":/app \\\n\t    $(image_name):$(version) \\\n\t    /bin/bash -c 'cd cicd && pip-compile --resolver=backtracking --upgrade $(full_requirements) && \\\n\t    pip-compile --resolver=backtracking --upgrade $(requirements) && \\\n\t    pip-compile --resolver=backtracking --upgrade $(os_requirements) && \\\n\t    pip-compile --resolver=backtracking --upgrade $(dq_requirements) && \\\n\t\tpip-compile --resolver=backtracking --upgrade $(azure_requirements) && \\\n\t\tpip-compile --resolver=backtracking --upgrade $(sftp_requirements) && \\\n\t\tpip-compile --resolver=backtracking --upgrade $(sharepoint_requirements)'\n\n#####################################\n##### GitHub Deployment Targets #####\n#####################################\n\nprepare-github-repo:\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t-v $(git_credentials_file):$(container_user_dir)/.ssh/id_rsa \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c \"\"\"mkdir -p tmp_os/$(repository); \\\n\t\tcd tmp_os/$(repository); \\\n\t\tgit init -b master; \\\n\t\tgit config pull.rebase false; \\\n\t\tgit config user.email 'lak-engine@adidas.com'; \\\n\t\tgit config user.name 'Lakehouse Engine'; \\\n\t\t$(trust_git_host); \\\n\t\tgit remote add origin git@github.com:adidas/$(repository).git; \\\n\t\tgit pull origin master --tags\"\"\"\n\nsync-to-github: prepare-github-repo\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t-v $(git_credentials_file):$(container_user_dir)/.ssh/id_rsa \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c \"\"\"cd tmp_os/lakehouse-engine; \\\n\t\trsync -r --exclude=.git --exclude=.*cache* --exclude=venv --exclude=dist --exclude=tmp_os /app/ . ; \\\n\t\trm $(remove_files_from_os); \\\n\t\tmv $(engine_os_conf_file) $(engine_conf_file); \\\n\t\tmv $(meta_os_conf_file) $(meta_conf_file); \\\n\t\tmv CONTRIBUTING_OS.md CONTRIBUTING.md; \\\n\t\t$(trust_git_host); \\\n\t\tgit add . ; \\\n\t\tgit commit -m \"'${last_commit_msg}'\"; \\\n\t\tgit tag -a $(git_tag) -m 'Release $(git_tag)' ; \\\n\t\tgit push origin master --follow-tags;\"\"\"\n\ndeploy-docs-to-github: docs prepare-github-repo\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t-v $(git_credentials_file):$(container_user_dir)/.ssh/id_rsa \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c \"\"\"cp -r tmp_os/lakehouse-engine/artefacts/docs/site/* tmp_os/lakehouse-engine-docs/ ; \\\n\t\tcd tmp_os/lakehouse-engine-docs; \\\n\t\t$(trust_git_host); \\\n\t\tgit add . ; \\\n\t\tgit commit -m 'Lakehouse Engine $(version) documentation'; \\\n\t\tgit push origin master ; \\\n\t\tcd .. 
&& rm -rf tmp_os/lakehouse-engine-docs\"\"\"\n\ndeploy-to-pypi: build\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t-v $(pypi_credentials_file):$(container_user_dir)/.pypirc \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'twine upload tmp_os/lakehouse-engine/dist/lakehouse_engine-$(project_version)-py3-none-any.whl --skip-existing'\n\ndeploy-to-pypi-and-clean: deploy-to-pypi\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'rm -rf tmp_os/lakehouse-engine'\n\n###########################\n##### Release Targets #####\n###########################\ncreate-changelog:\n\techo \"# Changelog - $(shell date +\"%Y-%m-%d\") v$(shell cat cicd/.bumpversion.cfg | grep \"current_version =\" | cut -f 3 -d \" \")\" > CHANGELOG.md && \\\n\techo \"All notable changes to this project will be documented in this file automatically. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).\" >> CHANGELOG.md && \\\n\techo \"\" >> CHANGELOG.md && \\\n\tgit log --no-decorate --pretty=format:\"#### [%cs] [%(describe)]%n [%h]($(commits_url)%H) %s\" -n 1000 >> CHANGELOG.md\n\nbump-up-version:\n\t$(container_cli) run --rm \\\n\t\t-w /app \\\n\t\t-v \"$$PWD\":/app \\\n\t\t$(image_name):$(version) \\\n\t\t/bin/bash -c 'bump2version --config-file cicd/.bumpversion.cfg $(increment)'\n\nprepare-release: bump-up-version create-changelog\n\techo \"Prepared version and changelog to release!\"\n\ncommit-release:\n\tgit commit -a -m 'Create release $(version)' && \\\n    git tag -a 'v$(version)' -m 'Release $(version)'\n\npush-release:\n\tgit push --follow-tags\n\ndelete-tag:\n\tgit push --delete origin $(tag)\n\n.PHONY: $(MAKECMDGOALS)\n"
  },
  {
    "path": "README.md",
    "content": "<img align=\"right\" src=\"assets/img/lakehouse_engine_logo_symbol_small.png\" alt=\"Lakehouse Engine Logo\">\n\n# Lakehouse Engine\nA configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.\n\n---\n> ***Note:*** whenever you read Data Product or Data Product team, we want to refer to Teams and use cases, whose main focus is on \nleveraging the power of data, on a particular topic, end-to-end (ingestion, consumption...) to achieve insights, supporting faster and better decisions, \nwhich generate value for their businesses. These Teams should not be focusing on building reusable frameworks, but on re-using the existing frameworks to achieve their goals.\n\n---\n\n## Main Goals\nThe goal of the Lakehouse Engine is to bring some advantages, such as:\n\n- offer cutting-edge, standard, governed and battle-tested foundations that several Data Product teams can benefit from;\n- avoid that Data Product teams develop siloed solutions, reducing technical debts and high operating costs (redundant developments across teams);\n- allow Data Product teams to focus mostly on data-related tasks, avoiding wasting time & resources on developing the same code for different use cases;\n- benefit from the fact that many teams are reusing the same code, which increases the likelihood that common issues are surfaced and solved faster;\n- decrease the dependency and learning curve to Spark and other technologies that the Lakehouse Engine abstracts;\n- speed up repetitive tasks;\n- reduced vendor lock-in.\n\n---\n  > ***Note:*** even though you will see a focus on AWS and Databricks, this is just due to the lack of use cases for other technologies like GCP and Azure, but we are open for contribution.\n\n---\n\n## Key Features\n⭐ **Data Loads:** perform data loads from diverse source types and apply transformations and data quality validations, \nensuring trustworthy data, before integrating it into distinct target types. Additionally, people can also define termination \nactions like optimisations or notifications. [On the usage section](#load-data-usage-example) you will find an example using all the supported keywords for data loads.\n\n---\n> ***Note:*** The Lakehouse \nEngine supports different types of sources and targets, such as, kafka, jdbc, dataframes, files (csv, parquet, json, delta...), sftp, sap bw, sap b4...\n\n---\n\n⭐ **Transformations:** configuration driven transformations without the need to write any spark code. Transformations can be applied by using the `transform_specs` in the Data Loads.\n\n---\n> ***Note:*** you can search all the available transformations, as well as checking implementation details and examples [here](reference/packages/transformers/index.md).\n\n---\n\n⭐ **Data Quality Validations:** the Lakehouse Engine uses Great Expectations as a backend and abstracts any implementation\ndetails by offering people the capability to specify what validations to apply on the data, solely using dict/json based configurations.\nThe Data Quality validations can be applied on:\n\n- post-mortem (static) data, using the DQ Validator algorithm (`execute_dq_validation`)\n- data in-motion, using the `dq_specs` keyword in the Data Loads, to add it as one more step while loading data. 
\n\n⭐ **Reconciliation:** a useful algorithm to compare two sources of data, by defining one version of the `truth` to compare\nagainst the `current` version of the data. It can be particularly useful during migration phases, to compare a few KPIs\nand ensure the new version of a table (`current`), for example, delivers the same vision of the data as the old one (`truth`).\nFind usage examples [here](lakehouse_engine_usage/reconciliator/reconciliator.md).\n\n⭐ **Sensors:** an abstraction to otherwise complex Spark code that can be executed in very small single-node clusters\nto check if an upstream system or Data Product contains new data since the last execution. With this feature, people can\ntrigger jobs to run at more frequent intervals and, if the upstream does not contain new data, the rest of the job\nexits without creating bigger clusters to execute more intensive data ETL (Extraction, Transformation, and Loading).\nFind usage examples [here](lakehouse_engine_usage/sensors/sensors.md).\n\n⭐ **Terminators:** this feature allows people to specify what to do as a last action before finishing a Data Load.\nSome examples of actions are: optimising the target table, vacuuming, computing stats, exposing the change data feed to an external location\nor even sending e-mail notifications. Thus, it is specifically used in Data Loads, using the `terminate_specs` keyword.\n[On the usage section](#load-data-usage-example) you will find an example using terminators.\n\n⭐ **Table Manager:** the function `manage_table` offers a set of actions to manipulate tables/views in several ways, such as:\n\n- compute table statistics;\n- create/drop tables and views;\n- delete/truncate/repair tables;\n- vacuum delta tables or locations;\n- optimize table;\n- describe table;\n- show table properties;\n- execute SQL.\n\n⭐ **File Manager:** the function `manage_files` offers a set of actions to manipulate files in several ways, such as:\n\n- delete objects in S3;\n- copy objects in S3;\n- restore objects from S3 Glacier;\n- check the status of a restore from S3 Glacier;\n- request a restore of objects from S3 Glacier and wait for them to be copied to a destination.\n\n\n⭐ **Notifications:** you can configure and send email notifications.\n\n---\n> ***Note:*** it can be used as an independent function (`send_notification`) or as a `terminator_spec`, using the function `notify`.\n\n---\n\n📖 In case you want further details, you can check the documentation of the [Lakehouse Engine facade](reference/packages/engine.md).\n\n## Installation\nAs the Lakehouse Engine is built as a wheel (look into our **build** and **deploy** make targets), you can install it like any other Python package using **pip**.\n\n```\npip install lakehouse-engine\n```\n\nAlternatively, you can also upload the wheel to any target of your liking (e.g. S3) and perform a pip installation pointing to that target location.\n\n---\n> ***Note:*** The Lakehouse Engine is packaged with plugins or optional dependencies, which are not installed by default. The goal is\n> to make its installation lighter and to avoid unnecessary dependencies. You can check all the optional dependencies in\n> the [tool.setuptools.dynamic] section of the [pyproject.toml](pyproject.toml) file. They are currently: os, dq, azure, sharepoint and sftp. 
So,\n> in case you want to make use of the Data Quality features offered in the Lakehouse Engine, instead of running the previous command, you should run\n> the command below, which will bring the core functionalities, plus DQ.\n> ```\n> pip install lakehouse-engine[dq]\n> ```\n> In case you are in an environment without pre-installed Spark and Delta, you will also want to install the `os` optional dependencies, like so:\n> ```\n> pip install lakehouse-engine[os]\n> ```\n> And in case you want to install several optional dependencies, you can run a command like:\n> ```\n> pip install lakehouse-engine[dq,sftp]\n> ```\n> It is advisable for a Data Product to pin a specific version of the Lakehouse Engine (and have recurring upgrading activities)\n> to avoid breaking changes in a new release.\n> In case you don't want to be so conservative, you can pin to a major version, which usually shouldn't include changes that break backwards compatibility.\n\n---\n\n## How do Data Products use the Lakehouse Engine Framework?\n<img src=\"assets/img/lakehouse_dp_usage.drawio.png?raw=true\" style=\"max-width: 800px; height: auto; \"/>\n\nThe Lakehouse Engine is a configuration-first Data Engineering framework, using the concept of ACONs to configure algorithms. \nAn ACON stands for Algorithm Configuration and is a JSON representation, as the [Load Data Usage Example](#load-data-usage-example) demonstrates. \n\nBelow you can find the main keywords you can use to configure an ACON for a Data Load.\n\n---\n> ***Note:*** the usage logic for the other [algorithms/features presented](#key-features) will always be similar, but using different keywords, \nwhich you can search for in the examples and documentation provided in the [Key Features](#key-features) and [Community Support and Contributing](#community-support-and-contributing) sections.\n\n---\n\n- **Input specifications (input_specs):** specify how to read data. This is a **mandatory** keyword.\n- **Transform specifications (transform_specs):** specify how to transform data.\n- **Data quality specifications (dq_specs):** specify how to execute the data quality process.\n- **Output specifications (output_specs):** specify how to write data to the target. This is a **mandatory** keyword.\n- **Terminate specifications (terminate_specs):** specify what to do after writing into the target (e.g., optimising the target table, vacuum, compute stats, expose change data feed to an external location, etc.).\n- **Execution environment (exec_env):** custom Spark session configurations to be provided for your algorithm (configurations can also be provided from your job/cluster configuration, which we highly advise you to do instead of passing performance-related configs here, for example).\n\n## Load Data Usage Example\n\nYou can use the Lakehouse Engine in a **pyspark script** or **notebook**.\nBelow you can find an example of how to execute a Data Load using the Lakehouse Engine, which does the following:\n\n1. Read CSV files, from a specified location, in a streaming fashion, providing a specific schema and some additional \noptions to properly read the files (e.g. header, delimiter...);\n2. Apply two transformations on the input data:\n    1. Add a new column having the Row ID;\n    2. Add a new column `extraction_date`, which extracts the date from the `lhe_extraction_filepath`, based on a regex.\n3. Apply Data Quality validations and store the result of their execution in the table `your_database.order_events_dq_checks`:\n    1. 
Check if the column `omnihub_locale_code` is not having null values;\n    2. Check if the distinct value count for the column `product_division` is between 10 and 100;\n    3. Check if the max of the column `so_net_value` is between 10 and 1000;\n    4. Check if the length of the values in the column `omnihub_locale_code` is between 1 and 10;\n    5. Check if the mean of the values for the column `coupon_code` is between 15 and 20.\n4. Write the output into the table `your_database.order_events_with_dq` in a delta format, partitioned by `order_date_header`\nand applying a merge predicate condition, ensuring the data is only inserted into the table if it does not match the predicate\n(meaning the data is not yet available in the table). Moreover, the `insert_only` flag is used to specify that there should not \nbe any updates or deletes in the target table, only inserts;\n5. Optimize the Delta Table that we just wrote in (e.g. z-ordering);\n6. Specify 3 custom Spark Session configurations.\n\n---\n> ⚠️ ***Note:*** `spec_id` is one of the main concepts to ensure you can chain the steps of the algorithm,\nso, for example, you can specify the transformations (in `transform_specs`) of a DataFrame that was read in the `input_specs`.\n\n---\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"orders_bronze\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"csv\",\n            \"schema_path\": \"s3://my-data-product-bucket/artefacts/metadata/bronze/schemas/orders.json\",\n            \"with_filepath\": True,\n            \"options\": {\n                \"badRecordsPath\": \"s3://my-data-product-bucket/badrecords/order_events_with_dq/\",\n                \"header\": False,\n                \"delimiter\": \"\\u005E\",\n                \"dateFormat\": \"yyyyMMdd\",\n            },\n            \"location\": \"s3://my-data-product-bucket/bronze/orders/\",\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"orders_bronze_with_extraction_date\",\n            \"input_id\": \"orders_bronze\",\n            \"transformers\": [\n                {\"function\": \"with_row_id\"},\n                {\n                    \"function\": \"with_regex_value\",\n                    \"args\": {\n                        \"input_col\": \"lhe_extraction_filepath\",\n                        \"output_col\": \"extraction_date\",\n                        \"drop_input_col\": True,\n                        \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\",\n                    },\n                },\n            ],\n        }\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"check_orders_bronze_with_extraction_date\",\n            \"input_id\": \"orders_bronze_with_extraction_date\",\n            \"dq_type\": \"validator\",\n            \"result_sink_db_table\": \"your_database.order_events_dq_checks\",\n            \"fail_on_error\": False,\n            \"dq_functions\": [\n                {\n                  \"dq_function\": \"expect_column_values_to_not_be_null\", \n                  \"args\": {\n                    \"column\": \"omnihub_locale_code\"\n                  }\n                },\n                {\n                    \"dq_function\": \"expect_column_unique_value_count_to_be_between\",\n                    \"args\": {\n                      \"column\": \"product_division\", \n                      \"min_value\": 10,\n                      \"max_value\": 100\n     
               },\n                },\n                {\n                    \"dq_function\": \"expect_column_max_to_be_between\", \n                    \"args\": {\n                      \"column\": \"so_net_value\", \n                      \"min_value\": 10, \n                      \"max_value\": 1000\n                    }\n                },\n                {\n                    \"dq_function\": \"expect_column_value_lengths_to_be_between\",\n                    \"args\": {\n                      \"column\": \"omnihub_locale_code\", \n                      \"min_value\": 1, \n                      \"max_value\": 10\n                    },\n                },\n                {\n                  \"dq_function\": \"expect_column_mean_to_be_between\", \n                  \"args\": {\n                    \"column\": \"coupon_code\", \n                    \"min_value\": 15, \n                    \"max_value\": 20\n                  }\n                },\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"orders_silver\",\n            \"input_id\": \"check_orders_bronze_with_extraction_date\",\n            \"data_format\": \"delta\",\n            \"write_type\": \"merge\",\n            \"partitions\": [\"order_date_header\"],\n            \"merge_opts\": {\n                \"merge_predicate\": \"\"\"\n                    new.sales_order_header = current.sales_order_header\n                    AND new.sales_order_schedule = current.sales_order_schedule\n                    AND new.sales_order_item=current.sales_order_item\n                    AND new.epoch_status=current.epoch_status\n                    AND new.changed_on=current.changed_on\n                    AND new.extraction_date=current.extraction_date\n                    AND new.lhe_batch_id=current.lhe_batch_id\n                    AND new.lhe_row_id=current.lhe_row_id\n                \"\"\",\n                \"insert_only\": True,\n            },\n            \"db_table\": \"your_database.order_events_with_dq\",\n            \"options\": {\n                \"checkpointLocation\": \"s3://my-data-product-bucket/checkpoints/template_order_events_with_dq/\"\n            },\n        }\n    ],\n    \"terminate_specs\": [\n        {\n            \"function\": \"optimize_dataset\",\n            \"args\": {\n              \"db_table\": \"your_database.order_events_with_dq\"\n            }\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n---\n> ***Note:*** Although it is possible to interact with the Lakehouse Engine functions directly from your python code, \ninstead of relying on creating an ACON dict and use the engine api, we do not ensure the stability across new \nLakehouse Engine releases when calling internal functions (not exposed in the facade) directly.\n\n---\n\n---\n> ***Note:*** ACON structure might change across releases, please test your Data Product first before updating to a \nnew version of the Lakehouse Engine in your Production environment.\n\n---\n## Overwriting default configurations\n\nWe use a YAML file to specify various configurations needed for different functionalities. 
You can overwrite these \nconfigurations using a dictionary with new settings or by providing a path to a YAML file.\n\nThis functionality can be particularly useful for the open-source community as it unlocks \nthe usage of several functionalities like Prisma and engine usage logs.\n\nCheck the default configurations:\n```\nfrom lakehouse_engine.core import exec_env\nprint(exec_env.ExecEnv.ENGINE_CONFIG.dq_dev_bucket)\n   > default-bucket\n```\n\nChange the dq_dev_bucket configuration:\n```\nexec_env.ExecEnv.set_default_engine_config(custom_configs_dict={\"dq_dev_bucket\": \"your-dq-bucket\"})\nprint(exec_env.ExecEnv.ENGINE_CONFIG.dq_dev_bucket)\n   > your-dq-bucket\n```\nReset to the default configurations:\n```\nexec_env.ExecEnv.set_default_engine_config()\nprint(exec_env.ExecEnv.ENGINE_CONFIG.dq_dev_bucket)\n   > default-bucket\n```\n\n---\n\n## Who maintains the Lakehouse Engine?\nThe Lakehouse Engine is under active development and production usage by the adidas Lakehouse Foundations Engineering team. \n\n## Community Support and Contributing\n\n🤝 Do you want to contribute or need any support? Check out all the details in [CONTRIBUTING.md](https://github.com/adidas/lakehouse-engine/blob/master/CONTRIBUTING.md).\n\n## License and Software Information\n\n© adidas AG\n\nadidas AG publishes this software and accompanied documentation (if any) subject to the terms of the [license](https://github.com/adidas/lakehouse-engine/blob/master/LICENSE.txt)\nwith the aim of helping the community with our tools and libraries which we think can be also useful for other people.\nYou will find a copy of the [license](https://github.com/adidas/lakehouse-engine/blob/master/LICENSE.txt) in the root folder of this package. All rights not explicitly granted\nto you under the [license](https://github.com/adidas/lakehouse-engine/blob/master/LICENSE.txt) remain the sole and exclusive property of adidas AG.\n\n---\n> ***NOTICE:*** The software has been designed solely for the purposes described in this ReadMe file. The software is NOT designed,\ntested or verified for productive use whatsoever, nor for any use related to high-risk environments, such as health care,\nhighly or fully autonomous driving, power plants, or other critical infrastructures or services.\n\n---\n\nIf you want to contact adidas regarding the software, you can mail us at software.engineering@adidas.com.\n\nFor further information open the [adidas terms and conditions](https://github.com/adidas/adidas-contribution-guidelines/wiki/Terms-and-conditions) page.\n"
  },
  {
    "path": "assets/gab/metadata/gab/f_agg_dummy_sales_kpi/1_article_category.sql",
    "content": "SELECT\n    \"category_a\" AS category_name\n   ,\"article1\" AS article_id\nUNION\nSELECT\n     \"category_a\" AS category_name\n    ,\"article2\" AS article_id\nUNION\nSELECT\n     \"category_a\" AS category_name\n    ,\"article3\" AS article_id\nUNION\nSELECT\n     \"category_a\" AS category_name\n    ,\"article4\" AS article_id\nUNION\nSELECT\n     \"category_b\" AS category_name\n    ,\"article5\" AS article_id\nUNION\nSELECT\n     \"category_b\" AS category_name\n    ,\"article6\" AS article_id\nUNION\nSELECT\n     \"category_b\" AS category_name\n    ,\"article7\" AS article_id\n"
  },
  {
    "path": "assets/gab/metadata/gab/f_agg_dummy_sales_kpi/2_f_agg_dummy_sales_kpi.sql",
    "content": "SELECT\n    {% if replace_offset_value == 0 %} {{ project_date_column }}\n    {% else %} ({{ project_date_column }} + interval '{{offset_value}}' hour)\n    {% endif %} AS order_date,\n    {{ to_date }} AS to_date,\n    b.category_name,\n    COUNT(a.article_id) qty_articles,\n    SUM(amount) total_amount\nFROM\n  `{{ database }}`.`dummy_sales_kpi` a {{ joins }}\n  LEFT JOIN article_categories b\n    ON a.article_id = b.article_id\nWHERE\n  TO_DATE({{ filter_date_column }}, 'yyyyMMdd') >= (\n          '{{start_date}}' + interval '{{offset_value}}' hour\n  )\n  AND TO_DATE({{ filter_date_column }}, 'yyyyMMdd') < (\n          '{{ end_date}}' + interval '{{offset_value}}' hour\n  )\nGROUP BY\n  1,2,3\n\n\n\n"
  },
  {
    "path": "assets/gab/metadata/tables/dim_calendar.sql",
    "content": "DROP TABLE IF EXISTS  `database`.dim_calendar;\nCREATE EXTERNAL TABLE `database`.dim_calendar (\n  calendar_date DATE COMMENT 'Full calendar date in the format yyyyMMdd.',\n  day_en STRING COMMENT 'Name of the day of the week.',\n  weeknum_mon INT COMMENT 'Week number where the week starts on Monday.',\n  weekstart_mon DATE COMMENT 'First day of the week where the week starts on Monday.',\n  weekend_mon DATE COMMENT 'Last day of the week where the week starts on Monday.',\n  weekstart_sun DATE COMMENT 'First day of the week where the week starts on Sunday.',\n  weekend_sun DATE COMMENT 'Last day of the week where the week starts on Sunday.',\n  month_start DATE COMMENT 'First day of the Month.',\n  month_end DATE COMMENT 'Last day of the Month.',\n  quarter_start DATE COMMENT 'First day of the Quarter.',\n  quarter_end DATE COMMENT 'Last day of the Quarter.',\n  year_start DATE COMMENT 'First day of the Year.',\n  year_end DATE COMMENT 'Last day of the Year.'\n)\nUSING DELTA\nLOCATION 's3://my-data-product-bucket/dim_calendar'\nCOMMENT 'This table stores the calendar information.'\nTBLPROPERTIES(\n  'lakehouse.primary_key'='calendar_date',\n  'delta.enableChangeDataFeed'='false'\n)"
  },
  {
    "path": "assets/gab/metadata/tables/dummy_sales_kpi.sql",
    "content": "DROP TABLE IF EXISTS `database`.`dummy_sales_kpi`;\nCREATE EXTERNAL TABLE `database`.`dummy_sales_kpi` (\n  `order_date` DATE COMMENT 'date of the orders',\n  `article_id` STRING COMMENT 'article id',\n  `amount` INT COMMENT 'quantity/amount sold on this date'\n)\nUSING DELTA\nPARTITIONED BY (order_date)\nLOCATION 's3://my-data-product-bucket/dummy_sales_kpi'\nCOMMENT 'Dummy sales KPI (articles sold per date).'\nTBLPROPERTIES(\n  'lakehouse.primary_key'='article_id, order_date',\n  'delta.enableChangeDataFeed'='true'\n)\n"
  },
  {
    "path": "assets/gab/metadata/tables/gab_log_events.sql",
    "content": "DROP TABLE IF EXISTS `database`.`gab_log_events`;\nCREATE EXTERNAL TABLE `database`.`gab_log_events`\n(\n`run_start_time` TIMESTAMP COMMENT 'Run start time for the use case',\n`run_end_time` TIMESTAMP COMMENT 'Run end time for the use case',\n`input_start_date` TIMESTAMP COMMENT 'The start time set for the use case process',\n`input_end_date` TIMESTAMP COMMENT 'The end time set for the use case process',\n`query_id` STRING COMMENT 'Query ID for the use case',\n`query_label` STRING COMMENT 'Query label for the use case',\n`cadence` STRING COMMENT 'This field stores the cadence of data granularity (Day/Week/Month/Quarter/Year)',\n`stage_name` STRING COMMENT 'Intermediate stage',\n`stage_query` STRING COMMENT 'Query run as part of stage',\n`status` STRING COMMENT 'Status of the stage',\n`error_code` STRING COMMENT 'Error code'\n)\nUSING DELTA\nPARTITIONED BY (query_id)\nLOCATION 's3://my-data-product-bucket/gab_log_events'\nCOMMENT 'This table stores the log for all use cases in gab'\nTBLPROPERTIES(\n  'lakehouse.primary_key'='run_start_time,query_id,stage_name',\n  'delta.enableChangeDataFeed'='false'\n)"
  },
  {
    "path": "assets/gab/metadata/tables/gab_use_case_results.sql",
    "content": "DROP TABLE IF EXISTS `database`.`gab_use_case_results`;\nCREATE EXTERNAL TABLE `database`.`gab_use_case_results`\n(\n`query_id` STRING COMMENT 'Query ID for the use case',\n`cadence` STRING COMMENT 'Cadence of data granularity (Day/Week/Month/Quarter/Year)',\n`from_date` DATE COMMENT 'Aggregate based on the date column',\n`to_date` DATE COMMENT 'Snapshot end date',\n`d1` STRING COMMENT 'Dimension 1',\n`d2` STRING COMMENT 'Dimension 2',\n`d3` STRING COMMENT 'Dimension 3',\n`d4` STRING COMMENT 'Dimension 4',\n`d5` STRING COMMENT 'Dimension 5',\n`d6` STRING COMMENT 'Dimension 6',\n`d7` STRING COMMENT 'Dimension 7',\n`d8` STRING COMMENT 'Dimension 8',\n`d9` STRING COMMENT 'Dimension 9',\n`d10` STRING COMMENT 'Dimension 10',\n`d11` STRING COMMENT 'Dimension 11',\n`d12` STRING COMMENT 'Dimension 12',\n`d13` STRING COMMENT 'Dimension 13',\n`d14` STRING COMMENT 'Dimension 14',\n`d15` STRING COMMENT 'Dimension 15',\n`d16` STRING COMMENT 'Dimension 16',\n`d17` STRING COMMENT 'Dimension 17',\n`d18` STRING COMMENT 'Dimension 18',\n`d19` STRING COMMENT 'Dimension 19',\n`d20` STRING COMMENT 'Dimension 20',\n`d21` STRING COMMENT 'Dimension 21',\n`d22` STRING COMMENT 'Dimension 22',\n`d23` STRING COMMENT 'Dimension 23',\n`d24` STRING COMMENT 'Dimension 24',\n`d25` STRING COMMENT 'Dimension 25',\n`d26` STRING COMMENT 'Dimension 26',\n`d27` STRING COMMENT 'Dimension 27',\n`d28` STRING COMMENT 'Dimension 28',\n`d29` STRING COMMENT 'Dimension 29',\n`d30` STRING COMMENT 'Dimension 30',\n`d31` STRING COMMENT 'Dimension 31',\n`d32` STRING COMMENT 'Dimension 32',\n`d33` STRING COMMENT 'Dimension 33',\n`d34` STRING COMMENT 'Dimension 34',\n`d35` STRING COMMENT 'Dimension 35',\n`d36` STRING COMMENT 'Dimension 36',\n`d37` STRING COMMENT 'Dimension 37',\n`d38` STRING COMMENT 'Dimension 38',\n`d39` STRING COMMENT 'Dimension 39',\n`d40` STRING COMMENT 'Dimension 40',\n`m1` DOUBLE COMMENT 'Metric 1',\n`m2` DOUBLE COMMENT 'Metric 2',\n`m3` DOUBLE COMMENT 'Metric 3',\n`m4` DOUBLE COMMENT 'Metric 4',\n`m5` DOUBLE COMMENT 'Metric 5',\n`m6` DOUBLE COMMENT 'Metric 6',\n`m7` DOUBLE COMMENT 'Metric 7',\n`m8` DOUBLE COMMENT 'Metric 8',\n`m9` DOUBLE COMMENT 'Metric 9',\n`m10` DOUBLE COMMENT 'Metric 10',\n`m11` DOUBLE COMMENT 'Metric 11',\n`m12` DOUBLE COMMENT 'Metric 12',\n`m13` DOUBLE COMMENT 'Metric 13',\n`m14` DOUBLE COMMENT 'Metric 14',\n`m15` DOUBLE COMMENT 'Metric 15',\n`m16` DOUBLE COMMENT 'Metric 16',\n`m17` DOUBLE COMMENT 'Metric 17',\n`m18` DOUBLE COMMENT 'Metric 18',\n`m19` DOUBLE COMMENT 'Metric 19',\n`m20` DOUBLE COMMENT 'Metric 20',\n`m21` DOUBLE COMMENT 'Metric 21',\n`m22` DOUBLE COMMENT 'Metric 22',\n`m23` DOUBLE COMMENT 'Metric 23',\n`m24` DOUBLE COMMENT 'Metric 24',\n`m25` DOUBLE COMMENT 'Metric 25',\n`m26` DOUBLE COMMENT 'Metric 26',\n`m27` DOUBLE COMMENT 'Metric 27',\n`m28` DOUBLE COMMENT 'Metric 28',\n`m29` DOUBLE COMMENT 'Metric 29',\n`m30` DOUBLE COMMENT 'Metric 30',\n`m31` DOUBLE COMMENT 'Metric 31',\n`m32` DOUBLE COMMENT 'Metric 32',\n`m33` DOUBLE COMMENT 'Metric 33',\n`m34` DOUBLE COMMENT 'Metric 34',\n`m35` DOUBLE COMMENT 'Metric 35',\n`m36` DOUBLE COMMENT 'Metric 36',\n`m37` DOUBLE COMMENT 'Metric 37',\n`m38` DOUBLE COMMENT 'Metric 38',\n`m39` DOUBLE COMMENT 'Metric 39',\n`m40` DOUBLE COMMENT 'Metric 40',\n`lh_created_on` TIMESTAMP COMMENT 'This field stores the created_on in lakehouse'\n)\nUSING DELTA\nPARTITIONED BY (query_id)\nLOCATION 's3://my-data-product-bucket/gab_use_case_results'\nCOMMENT 'This table is the common table for all use cases and stores all the dimensions and 
metrics'\nTBLPROPERTIES(\n  'lakehouse.primary_key'='query_id,cadence,to_date,from_date',\n  'delta.enableChangeDataFeed'='false'\n)"
  },
  {
    "path": "assets/gab/metadata/tables/lkp_query_builder.sql",
    "content": "DROP TABLE IF EXISTS `database`.`lkp_query_builder`;\nCREATE EXTERNAL TABLE `database`.`lkp_query_builder`\n(\n`query_id` INT COMMENT 'Query ID for the use case which is a sequence of numbers',\n`query_label` STRING COMMENT 'Summarized description of the use case',\n`query_type` STRING COMMENT 'Type of use case based on region',\n`mappings` STRING COMMENT 'Dictionary of mappings for dimensions and metrics',\n`intermediate_stages` STRING COMMENT 'All the stages and their configs such as storageLevel repartitioning date columns',\n`recon_window` STRING COMMENT 'Configurations for Cadence and Reconciliation Windows',\n`timezone_offset` INT COMMENT 'Timezone offsets can be configured by a positive or negative integer',\n`start_of_the_week` STRING COMMENT 'Sunday or Monday can be configured as the start of the week',\n`is_active` STRING COMMENT 'Active Flag - Can be set to Y or N',\n`queue` STRING COMMENT 'Can be set to High/Medium/Low based on the cluster computation requirement',\n`lh_created_on` TIMESTAMP COMMENT 'This field stores the created_on in lakehouse'\n)\nUSING DELTA\nLOCATION 's3://my-data-product-bucket/lkp_query_builder'\nCOMMENT 'This table stores the configuration for the gab framework'\nTBLPROPERTIES(\n  'lakehouse.primary_key'='query_id',\n  'delta.enableChangeDataFeed'='false'\n)"
  },
  {
    "path": "assets/gab/notebooks/gab.py",
    "content": "# Databricks notebook source\nfrom datetime import datetime, timedelta\n\nfrom lakehouse_engine.engine import execute_gab\nfrom pyspark.sql.functions import collect_list, collect_set, lit\n\n# COMMAND ----------\n\ndbutils.widgets.text(\"lookup_table\", \"lkp_query_builder\")\nlookup_table = dbutils.widgets.get(\"lookup_table\")\ndbutils.widgets.text(\"source_database\", \"source_database\")\nsource_database = dbutils.widgets.get(\"source_database\")\ndbutils.widgets.text(\"target_database\", \"target_database\")\ntarget_database = dbutils.widgets.get(\"target_database\")\n\n# COMMAND ----------\n\n\ndef flatten_extend(list_to_flatten: list) -> list:\n    \"\"\"Flatten python list.\n\n    Args:\n        list_to_flatten: list to be flattened.\n    Returns:\n        A list containing the flatten values.\n    \"\"\"\n    flat_list = []\n    for row in list_to_flatten:\n        flat_list.extend(row)\n    return flat_list\n\n\nlkp_query_builder_df = spark.read.table(\n    \"{}.{}\".format(target_database, lookup_table)\n)\n\nquery_label_and_queue = (\n    lkp_query_builder_df.groupBy(lit(1)).agg(collect_list(\"query_label\"), collect_set(\"queue\")).collect()\n)\nquery_list = flatten_extend(query_label_and_queue)[1]\nqueue_list = flatten_extend(query_label_and_queue)[2]\n\n# COMMAND ----------\n\ndbutils.widgets.text(\"start_date\", \"\", label=\"Start Date\")\ndbutils.widgets.text(\"end_date\", \"\", label=\"End Date\")\ndbutils.widgets.text(\"rerun_flag\", \"N\", label=\"Re-Run Flag\")\ndbutils.widgets.text(\"look_back\", \"1\", label=\"Look Back Window\")\ndbutils.widgets.multiselect(\n    \"cadence_filter\",\n    \"All\",\n    [\"All\", \"DAY\", \"WEEK\", \"MONTH\", \"QUARTER\", \"YEAR\"],\n    label=\"Cadence\",\n)\ndbutils.widgets.multiselect(\"query_label_filter\", \"All\", query_list + [\"All\"], label=\"Use Case\")\ndbutils.widgets.multiselect(\"queue_filter\", \"All\", queue_list + [\"All\"], label=\"Query Categorization\")\ndbutils.widgets.text(\"gab_base_path\", \"\", label=\"Base Path Use Cases\")\ndbutils.widgets.text(\"target_table\", \"\", label=\"Target Table\")\n\n# Input Parameters\nlookback_days = \"1\" if dbutils.widgets.get(\"look_back\") == \"\" else dbutils.widgets.get(\"look_back\")\n\n# COMMAND ----------\n\nend_date_str = (\n    datetime.today().strftime(\"%Y-%m-%d\") if dbutils.widgets.get(\"end_date\") == \"\" else dbutils.widgets.get(\"end_date\")\n)\nend_date = datetime.strptime(end_date_str, \"%Y-%m-%d\")\n\n# As part of daily run, when no end_date is given, program always runs\n# for yesterday date (Unless custom end date is given)\nif dbutils.widgets.get(\"end_date\") == \"\":\n    end_date = end_date - timedelta(days=1)\n\nstart_date_str = (\n    datetime.date(end_date - timedelta(days=int(lookback_days))).strftime(\"%Y-%m-%d\")\n    if dbutils.widgets.get(\"start_date\") == \"\"\n    else dbutils.widgets.get(\"start_date\")\n)\nstart_date = datetime.strptime(start_date_str, \"%Y-%m-%d\")\nend_date_str = end_date.strftime(\"%Y-%m-%d\")\nrerun_flag = dbutils.widgets.get(\"rerun_flag\")\n\nquery_label_filter = dbutils.widgets.get(\"query_label_filter\")\nrecon_filter = dbutils.widgets.get(\"cadence_filter\")\nqueue_filter = dbutils.widgets.get(\"queue_filter\")\ngab_base_path = dbutils.widgets.get(\"gab_base_path\")\n\n# COMMAND ----------\n\nquery_label_filter = [x.strip() for x in list(set(query_label_filter.split(\",\")))]\nqueue_filter = list(set(queue_filter.split(\",\")))\nrecon_filter = list(set(recon_filter.split(\",\")))\n\nif \"All\" in 
query_label_filter:\n    query_label_filter = query_list\n\nif \"All\" in queue_filter:\n    queue_filter = queue_list\n\n# COMMAND ----------\n\ntarget_table = (\n    \"gab_use_case_results\" if dbutils.widgets.get(\"target_table\") == \"\" else dbutils.widgets.get(\"target_table\")\n)\n\n# COMMAND ----------\n\nprint(f\"Query Label: {query_label_filter}\")\nprint(f\"Queue Filter: {queue_filter}\")\nprint(f\"Cadence Filter: {recon_filter}\")\nprint(f\"Target Database: {target_database}\")\nprint(f\"Start Date: {start_date}\")\nprint(f\"End Date: {end_date}\")\nprint(f\"Look Back Days: {lookback_days}\")\nprint(f\"Re-run Flag: {rerun_flag}\")\nprint(f\"Target Table: {target_table}\")\nprint(f\"Source Database: {source_database}\")\nprint(f\"Path Use Cases: {gab_base_path}\")\n\n# COMMAND ----------\n\ngab_acon = {\n    \"query_label_filter\": query_label_filter,\n    \"queue_filter\": queue_filter,\n    \"cadence_filter\": recon_filter,\n    \"target_database\": target_database,\n    \"start_date\": start_date,\n    \"end_date\": end_date,\n    \"rerun_flag\": rerun_flag,\n    \"target_table\": target_table,\n    \"source_database\": source_database,\n    \"gab_base_path\": gab_base_path,\n    \"lookup_table\": lookup_table,\n    \"calendar_table\": \"dim_calendar\",\n}\n\n# COMMAND ----------\n\nexecute_gab(acon=gab_acon)\n"
  },
  {
    "path": "assets/gab/notebooks/gab_dim_calendar.py",
    "content": "# Databricks notebook source\n# MAGIC %md\n# MAGIC # This notebook holds the calendar used as part of the GAB framework.\n\n# COMMAND ----------\n\n# Import the required libraries\nfrom datetime import datetime, timedelta\n\nfrom pyspark.sql.functions import to_date\nfrom pyspark.sql.types import StringType\n\n# COMMAND ----------\n\nDIM_CALENDAR_LOCATION = \"s3://my-data-product-bucket/dim_calendar\"\n\n# COMMAND ----------\n\ninitial_date = datetime.strptime(\"1990-01-01\", \"%Y-%m-%d\")\n\ndates_list = [datetime.strftime(initial_date, \"%Y-%m-%d\")]\n\nfor _ in range(1, 200000):\n    initial_date = initial_date + timedelta(days=1)\n    next_date = datetime.strftime(initial_date, \"%Y-%m-%d\")\n    dates_list.append(next_date)\n\n# COMMAND ----------\n\ndf_date_completed = spark.createDataFrame(dates_list, StringType())\ndf_date_completed = df_date_completed.withColumn(\"calendar_date\", to_date(df_date_completed.value, \"yyyy-MM-dd\")).drop(\n    df_date_completed.value\n)\ndf_date_completed.createOrReplaceTempView(\"dates_completed\")\n\n# COMMAND ----------\n\ndf_cal = spark.sql(\n    \"\"\"\n    WITH monday_calendar AS (\n        SELECT\n            calendar_date,\n            WEEKOFYEAR(calendar_date) AS weeknum_mon,\n            DATE_FORMAT(calendar_date, 'E') AS day_en,\n            MIN(calendar_date) OVER (PARTITION BY CONCAT(DATE_PART('YEAROFWEEK', calendar_date),\n            WEEKOFYEAR(calendar_date)) ORDER BY calendar_date) AS weekstart_mon\n        FROM dates_completed\n        ORDER BY\n            calendar_date\n    ),\n    monday_calendar_plus_week_num_sunday AS (\n        SELECT\n            monday_calendar.*,\n            LEAD(weeknum_mon) OVER(ORDER BY calendar_date) AS weeknum_sun\n        FROM monday_calendar\n    ),\n    calendar_complementary_values AS (\n        SELECT\n            calendar_date,\n            weeknum_mon,\n            day_en,\n            weekstart_mon,\n            weekstart_mon+6 AS weekend_mon,\n            LEAD(weekstart_mon-1) OVER(ORDER BY calendar_date) AS weekstart_sun,\n            DATE(DATE_TRUNC('MONTH', calendar_date)) AS month_start,\n            DATE(DATE_TRUNC('QUARTER', calendar_date)) AS quarter_start,\n            DATE(DATE_TRUNC('YEAR', calendar_date)) AS year_start\n        FROM monday_calendar_plus_week_num_sunday\n    )\n    SELECT\n        calendar_date,\n        day_en,\n        weeknum_mon,\n        weekstart_mon,\n        weekend_mon,\n        weekstart_sun,\n        weekstart_sun+6 AS weekend_sun,\n        month_start,\n        add_months(month_start, 1)-1 AS month_end,\n        quarter_start,\n        ADD_MONTHS(quarter_start, 3)-1 AS quarter_end,\n        year_start,\n        ADD_MONTHS(year_start, 12)-1 AS year_end\n    FROM calendar_complementary_values\n    \"\"\"\n)\ndf_cal.createOrReplaceTempView(\"df_cal\")\n\n# COMMAND ----------\n\ndf_cal.write.format(\"delta\").mode(\"overwrite\").save(DIM_CALENDAR_LOCATION)\n"
  },
  {
    "path": "assets/gab/notebooks/gab_job_manager.py",
    "content": "# Databricks notebook source\nimport os\n\nNOTEBOOK_CONTEXT = dbutils.notebook.entry_point.getDbutils().notebook().getContext()\n\n# Import the required libraries\nimport datetime\nimport json\nimport time\nimport uuid\nimport ast\n\nfrom pyspark.sql.functions import col, lit, upper\n\n# COMMAND ----------\n\n# MAGIC %run ../utils/databricks_job_utils\n\n# COMMAND ----------\n\nAUTH_TOKEN = NOTEBOOK_CONTEXT.apiToken().getOrElse(None)\n\nHOST_NAME = spark.conf.get(\"spark.databricks.workspaceUrl\")\n\nDATABRICKS_JOB_UTILS = DatabricksJobs(databricks_instance=HOST_NAME, auth=AUTH_TOKEN)\n\n# COMMAND ----------\n\ndbutils.widgets.text(\"gab_job_schedule\", \"{'hour': {07: 'GLOBAL'}}\")\ngab_job_schedule = ast.literal_eval(dbutils.widgets.get(\"gab_job_schedule\"))\n\ndbutils.widgets.text(\"source_database\", \"\")\nsource_database = dbutils.widgets.get(\"source_database\")\n\ndbutils.widgets.text(\"target_database\", \"\")\ntarget_database = dbutils.widgets.get(\"target_database\")\n\ndbutils.widgets.text(\"gab_base_path\", \"\")\ngab_base_path = dbutils.widgets.get(\"gab_base_path\")\n\ndbutils.widgets.text(\"gab_max_jobs_limit_high_job\", \"\")\ngab_max_jobs_limit_high_job = dbutils.widgets.get(\"gab_max_jobs_limit_high_job\")\n\ndbutils.widgets.text(\"gab_max_jobs_limit_medium_job\", \"\")\ngab_max_jobs_limit_medium_job = dbutils.widgets.get(\"gab_max_jobs_limit_medium_job\")\n\ndbutils.widgets.text(\"gab_max_jobs_limit_low_job\", \"\")\ngab_max_jobs_limit_low_job = dbutils.widgets.get(\"gab_max_jobs_limit_low_job\")\n\n\n# COMMAND ----------\n\n# functions\n\n\ndef divide_chunks(input_list: list, max_number_of_jobs: int) -> list:\n    \"\"\"Split list into predefined chunks, accordingly to the number of jobs.\n\n        This function reads the maximum job limit defined by the parameter for each queue type in order to determine\n            the number of parallel runs for each queue and divides the use cases into chunks for each run.\n        For example, if the maximum job limit is set to 30 for the high queue and there are 60 use cases for the\n            high queue, then each run will handle 2 use cases.\n    Args:\n        input_list: Input list to be split.\n        max_number_of_jobs: Max job number.\n\n    Returns:\n        Split chunk list.\n    \"\"\"\n    avg_chunk_size = len(input_list) // max_number_of_jobs\n    remainder = len(input_list) % max_number_of_jobs\n\n    chunks = [\n        input_list[i * avg_chunk_size + min(i, remainder) : (i + 1) * avg_chunk_size + min(i + 1, remainder)]\n        for i in range(max_number_of_jobs)\n    ]\n    chunks = list(filter(None, chunks))\n    return chunks\n\n\ndef get_run_regions(job_schedule: dict, job_info: dict) -> list:\n    \"\"\"Get run regions accordingly to job_manager trigger time.\n\n    Args:\n        job_schedule: Markets schedule list from the parameter `gab_job_schedule`.\n        job_info: Job manager info to match.\n\n    Returns:\n        Markets run list.\n    \"\"\"\n    q_type_match = \"\"\n    for keys in job_schedule[\"hour\"].keys():\n        if keys == int(datetime.datetime.fromtimestamp(job_info[\"start_time\"] / 1000).strftime(\"%H\")):\n            q_type_match = job_schedule[\"hour\"][keys]\n    try:\n        print(\"Matched regions are: \", q_type_match)\n        return list(q_type_match.split(\",\"))\n    except Exception:\n        raise Exception(\"None of the query types are configured to be run at this time\")\n\n\n# COMMAND ----------\n\n\ncontext_json = 
json.loads(NOTEBOOK_CONTEXT.safeToJson())\n\nrun_id = \"\"\nif context_json.get(\"attributes\") and context_json[\"attributes\"].get(\"rootRunId\"):\n    run_id = context_json[\"attributes\"][\"rootRunId\"]\nprint(f\"Job Run Id: {run_id}\")\n\njob_status = DATABRICKS_JOB_UTILS.get_job(run_id)\nprint(\"Job Status: \", job_status)\n\n# COMMAND ----------\n\nlist_q_type_match = get_run_regions(gab_job_schedule, job_status)\n\njob_queues = {\n    \"High\": {\"queue\": \"gab_high_queue\", \"max_jobs\": gab_max_jobs_limit_high_job},\n    \"Medium\": {\n        \"queue\": \"gab_medium_queue\",\n        \"max_jobs\": gab_max_jobs_limit_medium_job,\n    },\n    \"Low\": {\"queue\": \"gab_low_queue\", \"max_jobs\": gab_max_jobs_limit_low_job},\n}\n\ndf = spark.read.table(f\"{target_database}.lkp_query_builder\")\n\nfor queue_type, queue_config in job_queues.items():\n    lst = (\n        df.filter(upper(col(\"queue\")) == lit(queue_type.upper()))\n        .filter(col(\"query_type\").isin(list_q_type_match))\n        .select(col(\"query_label\"))\n        .collect()\n    )\n    query_list = [job_queues[0] for job_queues in lst]\n\n    chunk = divide_chunks(query_list, int(queue_config[\"max_jobs\"]))\n    chunk = [i for i in chunk if i]\n\n    if chunk:\n        for i in range(0, len(chunk)):\n            chunk_split = \",\".join(chunk[i])\n            print(chunk_split)\n            time.sleep(2)\n\n            idempotency_token = uuid.uuid4()\n            print(idempotency_token)\n\n            result = DATABRICKS_JOB_UTILS.run_now(\n                DATABRICKS_JOB_UTILS.job_id_extraction(queue_config[\"queue\"]),\n                {\n                    \"query_label_filter\": chunk_split,\n                    \"start_date\": \"\",\n                    \"look_back\": \"\",\n                    \"end_date\": \"\",\n                    \"cadence_filter\": \"All\",\n                    \"queue_filter\": queue_type,\n                    \"rerun_flag\": \"N\",\n                    \"target_database\": target_database,\n                    \"source_database\": source_database,\n                    \"gab_base_path\": gab_base_path,\n                },\n                idempotency_token=idempotency_token,\n            )\n            print(f\"{result}\\n\")\n"
  },
  {
    "path": "assets/gab/notebooks/query_builder_helper.py",
    "content": "# Databricks notebook source\n# MAGIC %md\n# MAGIC # Import Utils\n\n# COMMAND ----------\n\n# MAGIC %run ../utils/query_builder_utils\n\n# COMMAND ----------\n\nQUERY_BUILDER_UTILS = QueryBuilderUtils()\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC <h1>Use Case Setup\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC\n# MAGIC The Global Asset Builder (GAB) has been developed to help you automate the creations of aggregate tables for\n# MAGIC dashboards on top of base fact tables. It reduce the efforts and time to production for new aggregate tables.\n# MAGIC Users don't need to create separate pipeline for all such cases.\n# MAGIC\n# MAGIC This notebook has been developed to help users to create their use cases configurations easily.\n# MAGIC\n# MAGIC There is some mandatory information that must be completed for the use case to work correctly:\n# MAGIC\n# MAGIC **Use case name:** This parameter must not contain spaces or special characters.\n# MAGIC The suggestion is to use lowercase and underlined alphanumeric characters.\n# MAGIC\n# MAGIC **Market:** Related to the job schedule, example: GLOBAL starting at 07AM UTC\n# MAGIC It gets the complete coverage of last day for the market.\n# MAGIC - GLOBAL - 07AM UTC\n# MAGIC\n# MAGIC **Reference date:** Reference date of the use case. The parameter should be the column name.\n# MAGIC The selected column should have the date/datetime format.\n# MAGIC\n# MAGIC **To date:** This parameter is used in the template, by default its value must be \"to_date\".\n# MAGIC You can change it if you have managed this in your SQL files.\n# MAGIC The values stored in this column depend on the use case behavior:\n# MAGIC - if snapshots are enabled, it will contain the snapshot end day.\n# MAGIC - If snapshot is not enabled, it will contain the last day of the cadence.\n# MAGIC The snapshot behaviour is set in the reconciliation steps.\n# MAGIC\n# MAGIC **How many dimensions?** An integer input of the number of dimensions (columns) expected in the use case.\n# MAGIC Do not consider the reference date or metrics here, as they have their own parameters.\n# MAGIC\n# MAGIC **Time Offset:** The time zone offset that you want to apply to the reference date column.\n# MAGIC It should be a number to decrement or add to the date (e.g., -8 or 8). The default value is zero,\n# MAGIC which means that any time zone transformation will be applied to the date.\n# MAGIC\n# MAGIC **Week start:** The start of the business week of the use case. Two options are available SUNDAY or MONDAY.\n# MAGIC\n# MAGIC **Is Active:** Flag to make the use case active or not. Default value is \"Y\".\n# MAGIC\n# MAGIC **How many views?** Defines how many consumption views you want to have for the use case.\n# MAGIC You can have as many as you want. However, they will have exactly the same structure\n# MAGIC (metrics, columns, timelines, etc.), the only change will be the filter applied to them.\n# MAGIC The default value is 1.\n# MAGIC\n# MAGIC **Complexity:** Defines the complexity of your use case. 
You should mainly consider the volume of data.\n# MAGIC This parameter directly affects the number of workers that will be spin up to execute the use case.\n# MAGIC - High\n# MAGIC - Medium\n# MAGIC - Low\n# MAGIC\n# MAGIC **SQL File Names:** Name of the SQL files used in the use case.\n# MAGIC You can combine different layers of dependencies between them as shown in the example,\n# MAGIC where the \"2_combined.sql\" file depends on \"1_product_category.sql\" file.\n# MAGIC The file name should follow the pattern x_file_name (where x is an integer digit) and be separated by a comma\n# MAGIC (e.g.: 1_first_query.sql, 2_second_query.sql).\n# MAGIC\n# MAGIC **DEV - Database Schema Name** Refers to the name of the development environment database where the\n# MAGIC \"lkp_query_builder\" table resides. This parameter is used at the end of the notebook to insert data into\n# MAGIC the \"lkp_query_builder\" table.\n\n# COMMAND ----------\n\ndbutils.widgets.removeAll()\ndbutils.widgets.text(name=\"usecase_name\", defaultValue=\"\", label=\"Use Case Name\")\ndbutils.widgets.dropdown(\n    name=\"market\", defaultValue=\"GLOBAL\", label=\"Market\", choices=[\"APAC\", \"GLOBAL\", \"NAM\", \"NIGHTLY\"]\n)\ndbutils.widgets.text(name=\"from_date\", defaultValue=\"\", label=\"Reference Date\")\ndbutils.widgets.text(name=\"to_date\", defaultValue=\"to_date\", label=\"Snapshot End Date\")\ndbutils.widgets.text(name=\"num_dimensions\", defaultValue=\"\", label=\"How many dimensions?\")\ndbutils.widgets.text(name=\"time_offset\", defaultValue=\"0\", label=\"Time Offset\")\ndbutils.widgets.dropdown(name=\"week_start\", defaultValue=\"MONDAY\", label=\"Week start\", choices=[\"SUNDAY\", \"MONDAY\"])\ndbutils.widgets.dropdown(name=\"is_active\", defaultValue=\"Y\", label=\"Is Active\", choices=[\"Y\", \"N\"])\ndbutils.widgets.text(name=\"num_of_views\", defaultValue=\"1\", label=\"How many views?\")\ndbutils.widgets.dropdown(\n    name=\"complexity\", defaultValue=\"Medium\", label=\"Complexity\", choices=[\"Low\", \"Medium\", \"High\"]\n)\ndbutils.widgets.text(name=\"sql_files\", defaultValue=\"\", label=\"SQL File Names\")\ndbutils.widgets.text(name=\"db_schema\", defaultValue=\"\", label=\"DEV - Database Schema Name\")\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC Set configurations and validate.\n\n# COMMAND ----------\n\nusecase_name = dbutils.widgets.get(\"usecase_name\").lower().strip()\nmarket = dbutils.widgets.get(\"market\")\nfrom_date = dbutils.widgets.get(\"from_date\")\nto_date = dbutils.widgets.get(\"to_date\")\nnum_dimensions = dbutils.widgets.get(\"num_dimensions\")\ntime_offset = dbutils.widgets.get(\"time_offset\")\nweek_start = dbutils.widgets.get(\"week_start\")\nis_active = dbutils.widgets.get(\"is_active\")\nnum_of_views = dbutils.widgets.get(\"num_of_views\")\ncomplexity = dbutils.widgets.get(\"complexity\")\nsql_files = dbutils.widgets.get(\"sql_files\").replace(\".sql\", \"\")\ndb_schema = dbutils.widgets.get(\"db_schema\")\nnum_of_metrics = \"\"\n\nQUERY_BUILDER_UTILS.check_config_inputs(\n    usecase_name, from_date, num_dimensions, sql_files, num_of_views, to_date, time_offset, db_schema\n)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC Set Dimensions.\n# MAGIC\n# MAGIC In this step you will have to map the dimension columns with their respective order.\n# MAGIC The options available in the widgets to fill are based on the number of dimensions previously defined.\n# MAGIC For example, if you have two dimensions to analyze, such as country and category,\n# MAGIC values must be set 
to D1 and D2.\n# MAGIC For example:\n# MAGIC D1. Dimension name = country\n# MAGIC D2. Dimension name = category\n\n# COMMAND ----------\n\nQUERY_BUILDER_UTILS.set_dimensions(num_dimensions)\n\n# COMMAND ----------\n\ndimensions = QUERY_BUILDER_UTILS.get_dimensions(num_dimensions)\n\n# COMMAND ----------\n\nQUERY_BUILDER_UTILS.print_definitions(\n    usecase_name=usecase_name,\n    market=market,\n    from_date=from_date,\n    to_date=to_date,\n    dimensions=dimensions,\n    time_offset=time_offset,\n    week_start=week_start,\n    is_active=is_active,\n    num_of_views=num_of_views,\n    complexity=complexity,\n    sql_files=sql_files,\n    db_schema=db_schema,\n)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC <h1> 1 - Configure view(s) name(s) and filter(s)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC The filters defined in this step will be based on the dimensions defined in the previous step.\n# MAGIC\n# MAGIC So, if you have set the country as D1, the filter here should be D1 = \"Germany\".\n# MAGIC The commands allowed for the filter step are the same as those used in the where clause in SQL language.\n\n# COMMAND ----------\n\nQUERY_BUILDER_UTILS.set_views(num_of_views)\n\n# COMMAND ----------\n\ndims_dict = QUERY_BUILDER_UTILS.get_view_information(num_of_views)\n\n# COMMAND ----------\n\nQUERY_BUILDER_UTILS.print_definitions(\n    usecase_name=usecase_name,\n    market=market,\n    from_date=from_date,\n    to_date=to_date,\n    dimensions=dimensions,\n    time_offset=time_offset,\n    week_start=week_start,\n    is_active=is_active,\n    num_of_views=num_of_views,\n    complexity=complexity,\n    sql_files=sql_files,\n    db_schema=db_schema,\n    dims_dict=dims_dict,\n)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC # 2 - Configure Reconciliation\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC The reconciliation configuration (recon) is mandatory.\n# MAGIC In this section you will set the cadence, recon and snapshot behaviour of your use case.\n# MAGIC\n# MAGIC CADENCE - The cadence sets how often the data will be calculated. E.g: DAY, WEEK, MONTH, QUARTER, YEAR.\n# MAGIC\n# MAGIC RECON - The reconciliation for the cadence set.\n# MAGIC\n# MAGIC IS SNAPSHOT? 
- Set yes or no for the combination of cadence and reconciliation.\n# MAGIC\n# MAGIC Combination examples:\n# MAGIC - DAILY CADENCE = DAY - This configuration means that only daily data will be refreshed.\n# MAGIC - MONTHLY CADENCE - WEEKLY RECONCILIATION - WITHOUT SNAPSHOT = MONTH-WEEK-N -\n# MAGIC This means after every week, the whole month data is refreshed without snapshot.\n# MAGIC - WEEKLY CADENCE - DAY RECONCILIATION - WITH SNAPSHOT = WEEK-DAY-Y -\n# MAGIC This means that every day, the entire week's data (week to date) is refreshed with snapshot.\n# MAGIC It will generate a record for each day with the specific position of the value for the week.\n\n# COMMAND ----------\n\ndbutils.widgets.removeAll()\ndbutils.widgets.multiselect(\n    name=\"recon_cadence\",\n    defaultValue=\"DAY\",\n    label=\"Recon Cadence\",\n    choices=QUERY_BUILDER_UTILS.get_recon_choices(),\n)\n\n# COMMAND ----------\n\nrecon_list = list(filter(None, dbutils.widgets.get(name=\"recon_cadence\").split(\",\")))\nprint(f\"List of chosen reconciliation values: {recon_list}\")\n\n# COMMAND ----------\n\nrecon_dict = QUERY_BUILDER_UTILS.get_recon_config(recon_list)\n\n# COMMAND ----------\n\nQUERY_BUILDER_UTILS.print_definitions(\n    usecase_name=usecase_name,\n    market=market,\n    from_date=from_date,\n    to_date=to_date,\n    dimensions=dimensions,\n    time_offset=time_offset,\n    week_start=week_start,\n    is_active=is_active,\n    num_of_views=num_of_views,\n    complexity=complexity,\n    sql_files=sql_files,\n    db_schema=db_schema,\n    dims_dict=dims_dict,\n    recon_dict=recon_dict,\n)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC <h1> 3 - Configure METRICS\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC Define how many metrics your SQL files contain. For example, you have a sum (amount) as total_amount\n# MAGIC and a count(*) as total_records, you will need to set 2 here.\n# MAGIC\n# MAGIC The metrics column must be configured in the same order they appear in the sql files.\n# MAGIC\n# MAGIC For example:\n# MAGIC 1. Metric name = total_amount\n# MAGIC 2. 
Metric name = total_records\n\n# COMMAND ----------\n\ndbutils.widgets.removeAll()\ndbutils.widgets.text(name=\"num_of_metrics\", defaultValue=\"1\", label=\"How many metrics?\")\n\n# COMMAND ----------\n\nnum_of_metrics = dbutils.widgets.get(\"num_of_metrics\")\n\nQUERY_BUILDER_UTILS.set_metric(num_of_metrics)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC Based on the metric setup, it is possible to derive 4 new columns based on each metric.\n# MAGIC Those new columns will be based on cadences like last_cadence, last_year_cadence and window function.\n# MAGIC But also, you can create a derived column, which is a SQL statement that you can write on your own\n# MAGIC by selecting the option of \"derived_metric\".\n\n# COMMAND ----------\n\nmetrics_dict = QUERY_BUILDER_UTILS.get_metric_configuration(num_of_metrics)\n\n# COMMAND ----------\n\nQUERY_BUILDER_UTILS.set_extra_metric_config(num_of_metrics, metrics_dict)\n\n# COMMAND ----------\n\nQUERY_BUILDER_UTILS.print_definitions(\n    usecase_name=usecase_name,\n    market=market,\n    from_date=from_date,\n    to_date=to_date,\n    dimensions=dimensions,\n    time_offset=time_offset,\n    week_start=week_start,\n    is_active=is_active,\n    num_of_views=num_of_views,\n    complexity=complexity,\n    sql_files=sql_files,\n    db_schema=db_schema,\n    dims_dict=dims_dict,\n    recon_dict=recon_dict,\n    metrics_dict=metrics_dict,\n)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC <h1> 4 - Configure STAGES\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC The parameters available for this step are:\n# MAGIC\n# MAGIC - Filter Date Column - This column will be used to filter the data of your use case.\n# MAGIC This information will be replaced in the placeholder of the GAB template.\n# MAGIC - Project Date Column - This column will be used as reference date for the query given.\n# MAGIC This information will be replaced in the placeholder of the GAB template.\n# MAGIC - Repartition Value - This parameter only has effect when used with Repartition Type parameter.\n# MAGIC It sets the way of repartitioning the data while processing.\n# MAGIC - Repartition Type - Type of repartitioning the data of the query.\n# MAGIC Available values are Key and Number. When use Key, it expects column names separated by a comma.\n# MAGIC When set number it expects and integer of how many partitions the user want.\n# MAGIC - Storage Level - Defines the type of spark persistence storage levels you want to define\n# MAGIC (e.g. Memory Only, Memory and Disk etc).\n# MAGIC - Table Alias - The alias name of the sql file that will run.\n# MAGIC\n\n# COMMAND ----------\n\nsql_files_list = QUERY_BUILDER_UTILS.set_stages(sql_files=sql_files)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC According to the number of sql files provided in the use case, a set of widgets will appear to be configured.\n# MAGIC Remember that the configuration index matches the given sql file order.\n# MAGIC\n# MAGIC For example: 1_categories.sql, 2_fact_kpi.sql. Settings starting with index “1”.\n# MAGIC will be set to sql file 1_categories.sql. 
The same will happen with index “2.”.\n\n# COMMAND ----------\n\nstages_dict = QUERY_BUILDER_UTILS.get_stages(sql_files_list, usecase_name)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC <h1> BUILD AND INSERT SQL INSTRUCTION\n\n# COMMAND ----------\n\ndelete_sttmt, insert_sttmt = QUERY_BUILDER_UTILS.create_sql_statement(\n    usecase_name,\n    market,\n    stages_dict,\n    recon_dict,\n    time_offset,\n    week_start,\n    is_active,\n    complexity,\n    db_schema,\n    dims_dict,\n    dimensions,\n    from_date,\n    to_date,\n    metrics_dict,\n)\n\nprint(delete_sttmt + \"\\n\" + insert_sttmt)\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC <h1> INSERT CONFIGURATION DATA\n# MAGIC\n# MAGIC **Note:** This insert will have effect just on dev/uat, to execute it on prod\n# MAGIC it will need to use the Table/SQL Manager or another job.\n\n# COMMAND ----------\n\nQUERY_BUILDER_UTILS.insert_data_into_lkp_query_builder(delete_sttmt, insert_sttmt)\n"
  },
  {
    "path": "assets/gab/utils/databricks_job_utils.py",
    "content": "# Databricks notebook source\n# imports\nimport enum\nfrom typing import Tuple\nfrom uuid import UUID\n\nimport requests\n\n\n# COMMAND ----------\n\nclass BearerAuth:\n    \"\"\"Create authorisation object to be used in the requests header.\"\"\"\n\n    def __init__(self, token):\n        \"\"\"Create auth object with personal access token.\"\"\"\n        self.token = token\n\n    def __call__(self, r):\n        \"\"\"Add bearer token to header.\n\n        This function is internally called by get or post method of requests.\n        \"\"\"\n        r.headers[\"authorization\"] = \"Bearer \" + self.token\n        return r\n\n\nclass ResultState(str, enum.Enum):\n    \"\"\"Possible values for result state of a job run.\"\"\"\n\n    SUCCESS = \"SUCCESS\"\n    CANCELED = \"CANCELED\"\n    FAILED = \"FAILED\"\n    SKIPPED = \"SKIPPED\"\n\n\nclass DatabricksJobs:\n    \"\"\"Class with methods to execute databricks jobs API commands.\n        Refer documentation for details: https://docs.databricks.com/dev-tools/api/latest/jobs.html#.\n    \"\"\"\n\n    # api endpoints\n    RUN_NOW = \"/2.1/jobs/run-now\"\n    GET_OUTPUT = \"/2.1/jobs/runs/get-output\"\n    GET_JOB = \"/2.1/jobs/runs/get\"\n    GET_LIST_JOBS = \"/2.1/jobs/list\"\n    CANCEL_JOB = \"/2.1/jobs/runs/cancel\"\n\n    headers = {\"Content-type\": \"application/json\"}\n\n    def __init__(self, databricks_instance: str, auth: str):\n        \"\"\"\n        Construct a databricks jobs object using databricks instance and api token.\n\n        Parameters:\n            databricks_instance: domain name of databricks deployment. Use the form <account>.cloud.databricks.com\n            auth: personal access token\n        \"\"\"\n        self.databricks_instance = databricks_instance\n        self.auth = BearerAuth(auth)\n\n    @staticmethod\n    def _check_response(response):\n        if response.status_code != 200:\n            raise Exception(f\"Response Code: {response.status_code} \\n {response.content}\")\n\n    def list_jobs(self, name: str = None, limit: int = 20, offset: int = 0, expand_tasks: bool = False) -> dict:\n        \"\"\"\n        List the databricks jobs corresponding to given `name`.\n\n        for details refer API documentation:\n            https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsList\n\n        Parameters:\n            name: optional, to filter jobs as per name (case-insensitive)\n            limit: optional, The number of jobs to return, valid range 0 to 25.\n            offset: The offset of the first job to return, relative to the most recently created job\n            expand_tasks: Whether to include task and cluster details in the response.\n        Returns:\n            A dictionary of job ids matching the name (if provided) else returns in chunks\n        \"\"\"\n        params = {\"limit\": limit, \"offset\": offset, \"expand_tasks\": expand_tasks}\n\n        if name:\n            params.update({\"name\": name})\n        response = requests.get(\n            f\"https://{self.databricks_instance}/api{self.GET_LIST_JOBS}\",\n            params=params,\n            headers=self.headers,\n            auth=self.auth,\n        )\n        self._check_response(response)  # Raises exception if not successful\n        return response.json()\n\n    def run_now(self, job_id: int, notebook_params: dict, idempotency_token: UUID = None) -> dict:\n        \"\"\"\n        Trigger the job specified by the job id.\n\n        Note: currently it expects notebook tasks in a job, 
but can be extended for other tasks\n\n        Parameters:\n            job_id: databricks job identifier\n            notebook_params: key value pairs of the parameter name and its value to be passed to the job\n            idempotency_token: An optional token to guarantee the idempotency of job run requests,\n                it should have at most 64 characters\n        Returns:\n            A dictionary consisting of run_id and number_in_job\n\n        \"\"\"\n        data = {\"job_id\": job_id, \"notebook_params\": notebook_params}\n        if idempotency_token:\n            data.update({\"idempotency_token\": str(idempotency_token)})\n\n        response = requests.post(\n            f\"https://{self.databricks_instance}/api{self.RUN_NOW}\",\n            json=data,\n            headers=self.headers,\n            auth=self.auth,\n        )\n        self._check_response(response)  # Raises exception if not successful\n        return response.json()\n\n    def get_output(self, run_id: int) -> dict:\n        \"\"\"\n        Fetch the single job run output and metadata for a single task.\n\n        Reference: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsGetOutput\n\n        Parameters:\n            run_id: identifier for the job run\n        Returns:\n            A dictionary containing the output and metadata from task\n        \"\"\"\n        params = {}\n\n        if run_id:\n            params.update({\"run_id\": run_id})\n        response = requests.get(\n            f\"https://{self.databricks_instance}/api{self.GET_OUTPUT}\",\n            params=params,\n            headers=self.headers,\n            auth=self.auth,\n        )\n        self._check_response(response)  # Raises exception if not successful\n        return response.json()\n\n    def get_job(self, run_id: int) -> dict:\n        \"\"\"\n        Retrieve the metadata of a job run identified by run_id.\n\n        Parameters:\n            run_id: identifier for the job run\n        Returns:\n            A dictionary containing the metadata of a job\n        \"\"\"\n        params = {}\n\n        if run_id:\n            params.update({\"run_id\": run_id})\n        response = requests.get(\n            f\"https://{self.databricks_instance}/api{self.GET_JOB}\", params=params, headers=self.headers, auth=self.auth\n        )\n        self._check_response(response)  # Raises exception if not successful\n        return response.json()\n\n    def cancel_job(self, run_id: int) -> dict:\n        \"\"\"\n        Cancel job specified by run_id.\n\n        Parameters:\n            run_id: job run identifier\n\n        Returns:\n            Response received from endpoint\n        \"\"\"\n        response = requests.post(\n            f\"https://{self.databricks_instance}/api{self.CANCEL_JOB}\",\n            json={\"run_id\": run_id},\n            headers=self.headers,\n            auth=self.auth,\n        )\n        self._check_response(response)  # Raises exception if not successful\n        return response.json()\n\n    def trigger_job_by_name(self, job_name: str, notebook_params: dict, idempotency_token: UUID = None) -> dict:\n        \"\"\"\n        Triggers a job as specified by the job name, if found.\n\n        Parameters:\n            job_name: name of the job\n            notebook_params: key value pairs of the parameter name and its value to be passed to the job\n            idempotency_token: Optional token to guarantee the idempotency of job run requests, 64 characters max\n        Returns:\n    
        A dictionary consisting of run_id and number_in_job\n        \"\"\"\n        result = self.list_jobs(name=job_name)\n        if result.get(\"jobs\") is None:\n            raise Exception(f\"job with name {job_name} not found.\")\n\n        return self.run_now(int(result.get(\"jobs\")[0].get(\"job_id\")), notebook_params, idempotency_token)\n\n    def get_job_status(self, run_id: int) -> Tuple[bool, dict]:\n        \"\"\"\n        Fetch the status of the job run id.\n\n        Parameters:\n            run_id: identifier for the job run\n        Returns:\n            Tuple bool and dict containing whether the job run has succeeded and its state\n        \"\"\"\n        state = self.get_job(run_id)[\"state\"]\n        result_state = state.get(\"result_state\") or state.get(\"life_cycle_state\")\n        return result_state == ResultState.SUCCESS, state\n\n    def job_id_extraction(self, job_name: str) -> int:\n        \"\"\"Extract the job id from the job run.\n\n        Args:\n            job_name: Job name.\n\n        Returns:\n            Job ID number.\n        \"\"\"\n        jobs_list = self.list_jobs(name=job_name)\n        if jobs_list.get(\"jobs\") is None:\n            raise Exception(\"No jobs found.\")\n        return int(jobs_list.get(\"jobs\")[0].get(\"job_id\"))\n"
  },
  {
    "path": "assets/gab/utils/query_builder_utils.py",
    "content": "# Databricks notebook source\nimport json\nimport re\n\nfrom databricks.sdk.runtime import *\n\n\nclass QueryBuilderUtils:\n    \"\"\"Class with methods to create GAB use case configuration.\"\"\"\n\n    def __init__(self):\n        \"\"\"Instantiate objects of the class QueryBuilderUtils.\"\"\"\n        self.regex_no_special_characters = \"^[a-zA-Z0-9]+(_[a-zA-Z0-9]+)*$\"\n        self.cadences = [\"DAY\", \"WEEK\", \"MONTH\", \"QUARTER\", \"YEAR\"]\n\n    def check_config_inputs(\n            self,\n            usecase_name: str,\n            from_date: str,\n            num_dimensions: str,\n            sql_files: str,\n            num_of_views: str,\n            to_date: str,\n            time_offset: str,\n            db_schema: str\n    ) -> str:\n        \"\"\"\n        Check the parameters input.\n\n        Args:\n            usecase_name: The use case name.\n            from_date: The reference date of the use case.\n            num_dimensions: The number of dimensions chosen for analysis.\n            sql_files: Name of the SQL files that will be submitted for the framework\n                to process (e.g. file1.sql, file2.sql).\n            num_of_views: Number of views the use case has.\n            to_date: The end date of the snapshot configuration.\n            time_offset: Hours related to the timezone (e.g. 8, -8).\n            db_schema: Database name that lkp_query_builder is located.\n\n        Returns:\n            A message with the status of the validation.\n        \"\"\"\n        message = \"\"\n        if (\n                usecase_name.strip() == \"\"\n                or from_date.strip() == \"\"\n                or num_dimensions.strip() == \"\"\n                or sql_files.strip() == \"\"\n                or num_of_views.strip() == \"\"\n                or to_date.strip() == \"\"\n                or db_schema.strip() == \"\"\n        ):\n            message = \"WRONG CONFIGURATION:\"\n            if usecase_name.strip() == \"\":\n                message += \"\\n\\t - Please, add the Use Case Name.\"\n            if from_date.strip() == \"\":\n                message += \"\\n\\t - Please, add the From Date.\"\n            if num_dimensions.strip() == \"\":\n                message += \"\\n\\t - Please, add the Number of Dimensions.\"\n            if sql_files.strip() == \"\":\n                message += \"\\n\\t - Please, add the SQL File Names.\"\n            if num_of_views.strip() == \"\":\n                message += \"\\n\\t - Please, add the number of views.\"\n            if to_date.strip() == \"\":\n                message += \"\\n\\t - Please, add the to date value. This information is mandatory. \"\n                message += \"Keep it as 'to_date' unless you change its name in your SQL files.\"\n            if db_schema.strip() == \"\":\n                message += \"\\n\\t - Please, add the database schema where the lkp_query_builder table is located.\"\n\n        if time_offset.strip():\n            try:\n                int(re.findall('-?\\d+\\.?\\d*',time_offset.strip())[0])\n            except Exception:\n                if message:\n                    message += \"\\n\\t The timezone offset must be a number (e.g. 0, 12 or -8).\"\n                else:\n                    message = \"WRONG CONFIGURATION:\"\n                    message += \"\\n\\t - The timezone offset must be a number (e.g. 
0, 12 or -8).\"\n\n        if num_dimensions.strip():\n            try:\n                int(num_dimensions)\n                if int(num_dimensions) == 0:\n                    message = \"WRONG CONFIGURATION:\"\n                    message += \"\\n\\t - The number of dimensions must be greater than zero.\"\n            except Exception:\n                if message:\n                    message += \"\\n\\t - The number of dimensions must be an integer.\"\n                else:\n                    message = \"WRONG CONFIGURATION:\"\n                    message += \"\\n\\t - The number of dimensions must be an integer.\"\n\n        if sql_files.strip():\n            files_list = self._sort_files(sql_files)\n            for file in files_list:\n                sql_files_err = f\"\"\"\\n\\t - Check the SQL file name '{file}'. \"\"\"\n                sql_files_err += \"It must follow the pattern x_file_name (X is an integer digit).\" \"\"\n                try:\n                    int(re.match(\"(.*?)_\", file).group()[:-1])\n                except Exception:\n                    if message:\n                        message += sql_files_err\n                    else:\n                        message = \"WRONG CONFIGURATION:\"\n                        message += sql_files_err\n        if not message:\n            message = \"Validation status: OK\"\n\n        return print(message)\n\n    def create_sql_statement(\n            self,\n            usecase_name: str,\n            market: str,\n            stages_dict: dict,\n            recon_dict: dict,\n            time_offset: str,\n            week_start: str,\n            is_active: str,\n            complexity: str,\n            db_schema: str,\n            dims_dict: dict,\n            dimensions: str,\n            from_date: str,\n            to_date: str,\n            metrics_dict: dict,\n    ) -> tuple[str, str]:\n        \"\"\"\n        Create the SQL statement to insert data into lkp_query_builder_table.\n\n        Args:\n            usecase_name: The name of use case.\n            market: The market used for the use case (APAC, GLOBAL, NAM, NIGHTLY).\n            stages_dict: A dictionary of stages and it's configurations.\n            recon_dict: A dictionary of reconciliation setup.\n            time_offset: Hours related to the timezone (e.g. 8, -8).\n            week_start: Day of the start of the week (e.g. Sunday, Monday)\n            is_active: If the use case is active or not. (e.g. 
Y, N)\n            complexity: The categories are directly related to the number of workers in each cluster.\n                That is, High = 10 workers, Medium = 6 workers and Low = 4 workers.\n            db_schema: Database name that lkp_query_builder is located.\n            dims_dict: The dictionary of views and it's setup.\n            dimensions: Store supporting information to the fact table.\n            from_date: Aggregating date column for the use case.\n            to_date: Contains the current date (default value is to_date).\n                Information used as template for the framework.\n            metrics_dict: The dictionary of metrics and it's setup.\n\n        Returns:\n            A tuple with a text formatted with the delete and insert statement.\n\n        \"\"\"\n        dbutils.widgets.removeAll()\n\n        mapping_dict = self._get_mapping(dims_dict, dimensions, from_date, to_date, metrics_dict)\n\n        query_id = self._generate_query_id(usecase_name)\n        query_label = f\"'{usecase_name}'\"\n        query_type = f\"'{market}'\"\n        mapping_str = json.dumps(mapping_dict, indent=4)\n        mappings = '\"\"\"' + mapping_str.replace('\"', \"'\").replace(\"#+#-#\", '\\\\\"') + '\"\"\"'\n        steps_str = json.dumps(stages_dict, indent=4)\n        intermediate_stages = '\"\"\"' + steps_str.replace('\"', \"'\") + '\"\"\"'\n        recon_str = json.dumps(recon_dict)\n        recon_window = '\"\"\"' + recon_str.replace('\"', \"'\") + '\"\"\"'\n        col_time_offset = f\"'{time_offset}'\"\n        start_of_week = f\"'{week_start}'\"\n        col_is_active = f\"'{is_active}'\"\n        queue = f\"'{complexity}'\"\n\n        delete_sttmt = f\"\"\"DELETE FROM {db_schema}.lkp_query_builder WHERE QUERY_LABEL = {query_label};\"\"\"\n        insert_sttmt = f\"\"\"INSERT INTO {db_schema}.lkp_query_builder VALUES (\n            {query_id},\n            {query_label},\n            {query_type},\n            {mappings},\n            {intermediate_stages},\n            {recon_window},\n            {col_time_offset},\n            {start_of_week},\n            {col_is_active},\n            {queue},\n            current_timestamp());\"\"\"\n\n        return delete_sttmt, insert_sttmt\n\n    def get_dimensions(self, num_dimensions: str) -> str:\n        \"\"\"\n        Get the dimensions set on the widgets and validate.\n\n        Args:\n            num_dimensions: The number of dimensions set.\n\n        Returns:\n            A string with comma-separated dimensions names.\n\n        \"\"\"\n        dimensions = \"\"\n        list_status = []\n        for i in range(int(num_dimensions)):\n            i = i + 1\n            if re.match(self.regex_no_special_characters, dbutils.widgets.get(f\"D{i}\").strip()):\n                dimensions += \",\" + dbutils.widgets.get(f\"D{i}\").strip()\n                list_status.append(\"success\")\n            else:\n                print(\"WRONG CONFIGURATION:\")\n                print(f\"\\t- {dbutils.widgets.get(f'D{i}')} is empty of malformed!\")\n                print(\n                    \"\\t Names can contain only alphanumeric characters and must begin with \"\n                    \"an alphabetic character or an underscore (_).\"\n                )\n                list_status.append(\"fail\")\n        if \"fail\" not in list_status:\n            print(\"Dimensions validation status: OK\")\n            return dimensions[1:]\n\n    @classmethod\n    def get_recon_choices(cls) -> list:\n        \"\"\"\n        Return all 
possible combinations for cadences, reconciliations and the snapshot flag value (Y,N).\n\n        Returns:\n            List used to generate a multiselect widget for the users to interact with.\n\n        \"\"\"\n        return [\n            \"DAY\",\n            \"DAY-WEEK-N\",\n            \"DAY-MONTH-N\",\n            \"DAY-QUARTER-N\",\n            \"DAY-YEAR-N\",\n            \"WEEK\",\n            \"WEEK-DAY-N\",\n            \"WEEK-DAY-Y\",\n            \"WEEK-MONTH-N\",\n            \"WEEK-QUARTER-N\",\n            \"WEEK-YEAR-N\",\n            \"MONTH\",\n            \"MONTH-DAY-N\",\n            \"MONTH-DAY-Y\",\n            \"MONTH-WEEK-Y\",\n            \"MONTH-WEEK-N\",\n            \"MONTH-QUARTER-N\",\n            \"MONTH-YEAR-N\",\n            \"QUARTER\",\n            \"QUARTER-DAY-N\",\n            \"QUARTER-DAY-Y\",\n            \"QUARTER-WEEK-N\",\n            \"QUARTER-WEEK-Y\",\n            \"QUARTER-MONTH-N\",\n            \"QUARTER-MONTH-Y\",\n            \"QUARTER-YEAR-N\",\n            \"YEAR\",\n            \"YEAR-DAY-N\",\n            \"YEAR-DAY-Y\",\n            \"YEAR-WEEK-N\",\n            \"YEAR-WEEK-Y\",\n            \"YEAR-MONTH-N\",\n            \"YEAR-MONTH-Y\",\n            \"YEAR-QUARTER-N\",\n            \"YEAR-QUARTER-Y\",\n        ]\n\n    @classmethod\n    def get_metric_configuration(cls, num_of_metrics: str) -> dict:\n        \"\"\"\n        Get metrics information based on the widget setup.\n\n        Args:\n            num_of_metrics: Number of metrics selected.\n\n        Returns:\n            metrics_dict: The dictionary of metrics and their setup.\n\n        \"\"\"\n        metrics_dict = {}\n        for i in range(int(num_of_metrics)):\n            i = i + 1\n            if dbutils.widgets.get(f\"metric_name{i}\"):\n                metrics_dict[f\"m{i}\"] = {\n                    \"metric_name\": dbutils.widgets.get(f\"metric_name{i}\"),\n                    \"calculated_metric\": {},\n                    \"derived_metric\": {},\n                }\n                calculated_metric_list = list(filter(None, dbutils.widgets.get(f\"calculated_metric{i}\").split(\",\")))\n                for calc_metric in calculated_metric_list:\n                    if calc_metric == \"last_cadence\":\n                        metrics_dict[f\"m{i}\"][\"calculated_metric\"].update({calc_metric: {}})\n                        # add label and window for last_cadence\n                        dbutils.widgets.text(\n                            name=f\"{i}_{calc_metric}_label\", defaultValue=\"\", label=f\"{i}_{calc_metric}.Label\"\n                        )\n                        dbutils.widgets.text(\n                            name=f\"{i}_{calc_metric}_window\", defaultValue=\"\", label=f\"{i}_{calc_metric}.Window\"\n                        )\n                    if calc_metric == \"last_year_cadence\":\n                        metrics_dict[f\"m{i}\"][\"calculated_metric\"].update({calc_metric: {}})\n                        # add label and window for last_cadence\n                        dbutils.widgets.text(\n                            name=f\"{i}_{calc_metric}_label\", defaultValue=\"\", label=f\"{i}_{calc_metric}.Label\"\n                        )\n                    if calc_metric == \"window_function\":\n                        metrics_dict[f\"m{i}\"][\"calculated_metric\"].update({calc_metric: {}})\n                        # add label and window for window_function\n                        dbutils.widgets.text(\n                            
name=f\"{i}_{calc_metric}_label\", defaultValue=\"\", label=f\"{i}_{calc_metric}.Label\"\n                        )\n                        dbutils.widgets.text(\n                            name=f\"{i}_{calc_metric}_window\",\n                            defaultValue=\"\",\n                            label=f\"{i}_{calc_metric}.Window Interval\",\n                        )\n                        dbutils.widgets.dropdown(\n                            name=f\"{i}_{calc_metric}_agg_func\",\n                            defaultValue=\"sum\",\n                            label=f\"{i}_{calc_metric}.Agg Func\",\n                            choices=[\"sum\", \"avg\", \"max\", \"min\", \"count\"],\n                        )\n                    # add label and window for derived_metric\n                    if calc_metric == \"derived_metric\":\n                        dbutils.widgets.text(\n                            name=f\"{i}_{calc_metric}_label\", defaultValue=\"\", label=f\"{i}_{calc_metric}.Label\"\n                        )\n                        dbutils.widgets.text(\n                            name=f\"{i}_{calc_metric}_formula\", defaultValue=\"\", label=f\"{i}_{calc_metric}.Formula\"\n                        )\n                print(\"Metric configuration status: OK\")\n            else:\n                print(\"WRONG CONFIGURATION:\")\n                print(\"\\t- The metric name is mandatory!\")\n\n        return metrics_dict\n\n    def get_recon_config(self, recon_list: list) -> dict:\n        \"\"\"\n        Get reconciliation information based on the widget setup.\n\n        Args:\n            recon_list: List of cadences setup for the reconciliation.\n\n        Returns:\n            A dictionary of reconciliation setup.\n\n        \"\"\"\n        cadence_list = []\n        # create a list with the distinct cadences values.\n        for cadence in recon_list:\n            cadence_name = cadence.split(\"-\")[0]\n            cadence_list.append(cadence_name)\n\n        cadence_list = list(dict.fromkeys(cadence_list))\n\n        # create a dict with the structure of each cadence.\n        recon_dict = {}\n        for cad in cadence_list:\n            recon_dict[f\"{cad}\"] = {}\n            recon_dict[f\"{cad}\"][\"recon_window\"] = {}\n\n        # updates the dict of each cadence with the recon configurations selected.\n        for cadence in recon_list:\n            if cadence in self.cadences:\n                recon_dict[f\"{cad}\"][\"recon_window\"] = {}\n            else:\n                cadence_name = cadence.split(\"-\")[0]\n                recon = cadence.split(\"-\")[1]\n                snapshot = cadence.split(\"-\")[2]\n                for cad in cadence_list:\n                    if cadence_name == cad:\n                        recon_dict[cad][\"recon_window\"].update({recon: {\"snapshot\": snapshot}})\n\n        # remove empty recon_window when the selected just cadence.\n        for cadence in recon_list:\n            if cadence in [\"DAY\", \"WEEK\", \"MONTH\", \"QUARTER\", \"YEAR\"]:\n                if recon_dict[f\"{cadence}\"][\"recon_window\"] == {}:\n                    del recon_dict[f\"{cadence}\"][\"recon_window\"]\n\n        if recon_dict:\n            print(\"Reconciliation configuration status: OK\")\n        else:\n            print(\"WRONG CONFIGURATION:\")\n            print(\"\\t- The recon information is mandatory!\")\n        return recon_dict\n\n    def get_stages(self, sql_files_list: list, usecase_name: str) -> dict:\n        \"\"\"\n        
Set stages based on the widget setup.\n\n        Args:\n            sql_files_list: A list of sql files and their setup.\n            usecase_name: The use case name.\n\n        Returns:\n            stages_dict: A dictionary of stages and their setup.\n\n        \"\"\"\n        stages_dict = {}\n        i = 0\n        list_status = []\n        for file in sql_files_list:\n            i = i + 1\n            if dbutils.widgets.get(name=f\"{i}_script_table_alias\"):\n                stages_dict[f\"{i}\"] = {\n                    \"file_path\": usecase_name + \"/\" + file.strip() + \".sql\",\n                    \"table_alias\": dbutils.widgets.get(name=f\"{i}_script_table_alias\"),\n                    \"storage_level\": dbutils.widgets.get(name=f\"{i}_script_storage_level\"),\n                    \"project_date_column\": dbutils.widgets.get(name=f\"{i}_script_project_dt_col\"),\n                    \"filter_date_column\": dbutils.widgets.get(name=f\"{i}_script_filter_dt_col\"),\n                }\n\n                repartition_value = self._format_keys_list(dbutils.widgets.get(name=f\"{i}_script_repartition_value\"))\n\n                stages_dict[f\"{i}\"][\"repartition\"] = {}\n                if dbutils.widgets.get(name=f\"{i}_script_repartition_type\") == \"NUMBER\":\n                    try:\n                        int(dbutils.widgets.get(name=f\"{i}_script_repartition_value\").split(\",\")[0])\n                        stages_dict[f\"{i}\"][\"repartition\"] = {\n                            \"numPartitions\": dbutils.widgets.get(name=f\"{i}_script_repartition_value\")\n                            .split(\",\")[0]\n                            .replace(\"'\", \"\")\n                        }\n                    except Exception:\n                        print(\"The repartition value must be INTEGER when the type is defined as NUMBER.\")\n                        list_status.append(\"fail\")\n\n                elif dbutils.widgets.get(name=f\"{i}_script_repartition_type\") == \"KEY\":\n                    stages_dict[f\"{i}\"][\"repartition\"] = {\"keys\": repartition_value}\n            else:\n                print(f\"The field script alias is missing for {i}.Script Table Alias. 
This field is mandatory!\")\n                stages_dict = {}\n                list_status.append(\"fail\")\n\n        if \"fail\" not in list_status:\n            print(\"Stages configuration status: OK\")\n        return stages_dict\n\n    def get_view_information(self, num_of_views: str) -> dict:\n        \"\"\"\n        Get the views information based on the widget setup.\n\n        Args:\n            num_of_views: Number of views selected.\n\n        Returns:\n            The dictionary of views and their setup.\n\n        \"\"\"\n        dims_dict = {}\n        for i in range(int(num_of_views)):\n            i = i + 1\n            if re.match(self.regex_no_special_characters, dbutils.widgets.get(f\"view_name{i}\")):\n                dims_dict[f\"view_name{i}\"] = {\n                    \"name\": dbutils.widgets.get(f\"view_name{i}\"),\n                    \"filter\": dbutils.widgets.get(f\"view_filter{i}\").replace(\"'\", \"#+#-#\").replace('\"', \"#+#-#\"),\n                }\n                print(\"Views validation status: OK\")\n            else:\n                print(\"WRONG CONFIGURATION:\")\n                print(\"\\t- View name is empty of malformed!\")\n                print(\n                    \"\\t Names can contain only alphanumeric characters and must begin with \"\n                    \"an alphabetic character or an underscore (_).\"\n                )\n        return dims_dict\n\n    @classmethod\n    def insert_data_into_lkp_query_builder(cls, delete_sttmt: str, insert_sttmt: str):\n        \"\"\"\n        Insert data into the lkp query builder table.\n\n        Args:\n            delete_sttmt: The delete statement.\n            insert_sttmt: The insert statement.\n\n        \"\"\"\n        try:\n            spark.sql(f\"{delete_sttmt}\")\n            spark.sql(f\"{insert_sttmt}\")\n            print(\"CONFIGURATION INSERTED SUCCESSFULLY!\")\n        except Exception as e:\n            print(e)\n\n    def print_definitions(\n            self,\n            usecase_name,\n            market,\n            from_date,\n            to_date,\n            dimensions,\n            time_offset,\n            week_start,\n            is_active,\n            num_of_views,\n            complexity,\n            sql_files,\n            db_schema,\n            dims_dict: dict = None,\n            recon_dict: dict = None,\n            metrics_dict: dict = None,\n            stages_dict: dict = None,\n    ):\n        \"\"\"\n        Print the definitions set on widgets.\n\n        Args:\n            usecase_name: The name of use case.\n            market: The market used for the use case (APAC, GLOBAL, NAM, NIGHTLY).\n            from_date: Aggregating date column for the use case.\n            to_date: Contains the current date (default value is to_date).\n                Information used as template for the framework.\n            dimensions: Store supporting information to the fact table\n            time_offset: Hours related to the timezone (e.g. 8, -8).\n            week_start: Day of the start of the week (e.g. Sunday, Monday)\n            is_active: If the use case is active or not. (e.g. Y, N)\n            num_of_views: Number of views desired for the use case (e.g. 1, 2, 3).\n            complexity: The categories are directly related to the number of workers in each cluster.\n            That is, High = 10 workers, Medium = 6 workers and Low = 4 workers\n            sql_files: Name of the SQL files that will be submitted for the framework\n            to process (e.g. 
file1.sql, file2.sql).\n            Database name that lkp_query_builder is located.\n            dims_dict: A dictionary of dimensions.\n            recon_dict: A dictionary of reconciliation setup.\n            metrics_dict: The dictionary of metrics and their setup.\n            stages_dict: A dictionary of stages and their setup.\n\n        \"\"\"\n        print(\"USE CASE DEFINITIONS:\")\n        print(\"Use Case Name:\", usecase_name)\n        print(\"Market:\", market)\n        print(\"From Date:\", from_date)\n        print(\"To Date:\", to_date)\n        print(\"Dimensions:\", dimensions)\n        print(\"Time Offset:\", time_offset)\n        print(\"Week Start:\", week_start)\n        print(\"Is Active:\", is_active)\n        print(\"How many views?\", num_of_views)\n        print(\"Complexity:\", complexity)\n        print(\"SQL Files:\", sql_files)\n        print(\"Database Schema Name:\", db_schema)\n        self._print_dims_dict(dims_dict)\n        self._print_recon_dict(recon_dict)\n        if metrics_dict:\n            print(\"METRICS CONFIGURED:\")\n            for key_metrics in metrics_dict:\n                self._print_metrics_dict(key_metrics, metrics_dict)\n        self._print_stages_dict(stages_dict)\n\n    @classmethod\n    def set_dimensions(cls, num_dimensions: str):\n        \"\"\"\n        Set the dimension mappings based on the widget setup.\n\n        Args:\n            num_dimensions: Number of dimensions selected.\n\n        \"\"\"\n        dbutils.widgets.removeAll()\n\n        for i in range(int(num_dimensions)):\n            i = i + 1\n            dbutils.widgets.text(name=f\"D{i}\", defaultValue=\"\", label=f\"D{i}.Dimension Name\")\n        print(\"Please, configure the dimensions using the widgets and proceed to the next cmd.\")\n\n    def set_extra_metric_config(self, num_of_metrics: str, metrics_dict: dict):\n        \"\"\"\n        Set extra metrics information based on the widget setup.\n\n        Args:\n            num_of_metrics: Number of metrics selected.\n\n        \"\"\"\n        for i in range(int(num_of_metrics)):\n            i = i + 1\n            calculated_metric_list = list(filter(None, dbutils.widgets.get(f\"calculated_metric{i}\").split(\",\")))\n            if calculated_metric_list:\n                for calc_metric in calculated_metric_list:\n                    self._validate_metrics_config(calc_metric, metrics_dict, i)\n            else:\n                print(\"Extra metrics configuration status: OK\")\n\n    @classmethod\n    def set_metric(cls, num_of_metrics: str):\n        \"\"\"\n        Set metrics information based on the widget setup.\n\n        Args:\n            num_of_metrics: Number of metrics selected.\n\n        \"\"\"\n        dbutils.widgets.removeAll()\n        for i in range(1, int(num_of_metrics) + 1):\n            dbutils.widgets.text(name=f\"metric_name{i}\", defaultValue=\"\", label=f\"{i}.Metric Name\")\n            dbutils.widgets.multiselect(\n                name=f\"calculated_metric{i}\",\n                defaultValue=\"\",\n                label=f\"{i}.Calculated Metric\",\n                choices=[\"\", \"last_cadence\", \"last_year_cadence\", \"window_function\", \"derived_metric\"],\n            )\n        print(\"Please, configure the metrics using the widgets and proceed to the next cmd.\")\n\n    def set_stages(self, sql_files: list) -> list:\n        \"\"\"\n        Set stages based on the widget setup.\n\n        Args:\n            sql_files: The SQL file names that will be used in the use 
case.\n\n        Returns:\n            sql_files_list: A list of sql files and their setup.\n\n        \"\"\"\n        dbutils.widgets.removeAll()\n        sql_files_list = self._sort_files(sql_files)\n\n        for i in range(1, len(sql_files_list) + 1):\n            dbutils.widgets.dropdown(\n                name=f\"{i}_script_storage_level\",\n                defaultValue=\"MEMORY_ONLY\",\n                label=f\"{i}.Storage Level\",\n                choices=[\n                    \"DISK_ONLY\",\n                    \"DISK_ONLY_2\",\n                    \"DISK_ONLY_3\",\n                    \"MEMORY_AND_DISK\",\n                    \"MEMORY_AND_DISK_2\",\n                    \"MEMORY_AND_DISK_DESER\",\n                    \"MEMORY_ONLY\",\n                    \"MEMORY_ONLY_2\",\n                    \"OFF_HEAP\",\n                ],\n            )\n            dbutils.widgets.text(name=f\"{i}_script_table_alias\", defaultValue=\"\", label=f\"{i}.Table Alias\")\n            dbutils.widgets.text(name=f\"{i}_script_project_dt_col\", defaultValue=\"\", label=f\"{i}.Project Date Column\")\n            dbutils.widgets.text(name=f\"{i}_script_filter_dt_col\", defaultValue=\"\", label=f\"{i}.Filter Date Column\")\n            dbutils.widgets.dropdown(\n                name=f\"{i}_script_repartition_type\",\n                defaultValue=\"\",\n                label=f\"{i}.Repartition Type\",\n                choices=[\"\", \"KEY\", \"NUMBER\"],\n            )\n            dbutils.widgets.text(name=f\"{i}_script_repartition_value\", defaultValue=\"\", label=f\"{i}.Repartition Value\")\n\n        print(\"Please, configure the stages using the widgets and proceed to the next cmd.\")\n        return sql_files_list\n\n    @classmethod\n    def set_views(cls, num_of_views: str):\n        \"\"\"\n        Set views that will be used in the use case.\n\n        Args:\n            num_of_views: Number of views selected.\n\n        \"\"\"\n        dbutils.widgets.removeAll()\n\n        for i in range(1, int(num_of_views) + 1):\n            dbutils.widgets.text(name=f\"view_name{i}\", defaultValue=\"\", label=f\"{i}.View Name\")\n            dbutils.widgets.text(name=f\"view_filter{i}\", defaultValue=\"\", label=f\"{i}.View Filter\")\n\n        print(\"Please, configure the views using the widgets and proceed to the next cmd.\")\n\n    @classmethod\n    def _format_keys_list(cls, key_str: str) -> list:\n        \"\"\"\n        Format the list of keys based on the widget keys data provided.\n\n        Args:\n            key_str: Input text with key column names.\n\n        Returns:\n            A formatted list with the keys selected for repartitioning.\n\n        \"\"\"\n        key_list = key_str.strip().split(\",\")\n        output_list = []\n        for key in key_list:\n            output_list.append(key.replace(\"'\", \"\").replace('\"', \"\").strip())\n        return output_list\n\n    @classmethod\n    def _generate_query_id(cls, usecase_name: str) -> int:\n        \"\"\"\n        Generate the query id for the lookup query builder table.\n\n        The logic to create the ID is a hash of the use case name converted to an integer.\n\n        Args:\n            usecase_name: The name of use case.\n\n        Returns:\n            The use case name hashed.\n\n        \"\"\"\n        hash_val = int(str(hash(usecase_name))[0:9])\n        return hash_val if hash_val > 0 else hash_val * -1\n\n    @classmethod\n    def _get_mapping(cls, dims_dict: dict, dimensions: str, from_date: str, to_date: str, 
metrics_dict: dict) -> dict:\n        \"\"\"\n        Get mappings based on the dimensions defined on the widget setup.\n\n        Args:\n            dims_dict: A dictionary of dimensions.\n            dimensions: Store supporting information to the fact table.\n            from_date: Aggregating date column for the use case.\n            to_date: Contains the current date (default value is to_date).\n            Information used as template for the framework.\n            metrics_dict: The dictionary of metrics and their setup.\n\n        Returns:\n            mapping_dict: A dictionary of mappings configuration.\n\n        \"\"\"\n        mapping_dict = {}\n        for key in dims_dict:\n            mapping_dict.update({dims_dict[key][\"name\"]: {\"dimensions\": {}, \"metric\": {}, \"filter\": {}}})\n            i = 0\n            for d in dimensions.split(\",\"):\n                i = i + 1\n                mapping_dict[dims_dict[key][\"name\"]][\"dimensions\"].update(\n                    {\"from_date\": from_date, \"to_date\": to_date, f\"d{i}\": d.strip()}\n                )\n                mapping_dict[dims_dict[key][\"name\"]][\"metric\"].update(metrics_dict)\n                if dims_dict[key][\"filter\"]:\n                    mapping_dict[dims_dict[key][\"name\"]][\"filter\"] = dims_dict[key][\"filter\"]\n\n        return mapping_dict\n\n    @classmethod\n    def _print_dims_dict(cls, dims_dict: dict):\n        \"\"\"\n        Print the dictionary of dimensions and views formatted.\n\n        Args:\n            dims_dict: The dictionary of views and their setup.\n        \"\"\"\n        if dims_dict:\n            print(\"VIEWS CONFIGURED:\")\n            for key in dims_dict:\n                print(f\"{key}:\")\n                keys = [k for k, v in dims_dict[key].items()]\n                for k in keys:\n                    print(f\"\\t{k}:\", dims_dict[key][k].replace(\"#+#-#\", '\"'))\n\n    @classmethod\n    def _print_derived_metrics(cls, key_metrics: str, derived_metric: str, metrics_dict: dict):\n        \"\"\"\n        Print the derived dict formatted.\n\n        Args:\n            key_metrics: The key name of each metric configured (e.g. m1, m2, m3).\n            derived_metric: The name of the derived metric configuration (e.g. last_cadence, last_year_cadence,\n                            derived_metric, window_function).\n            metrics_dict: The dictionary of metrics and their setup.\n        \"\"\"\n        if derived_metric == \"derived_metric\":\n            if metrics_dict[key_metrics][derived_metric]:\n                print(f\"\\t- {derived_metric}:\")\n                derived_metric_val_list = [k for k, v in metrics_dict[key_metrics][derived_metric][0].items()]\n                for derived_metric_val in derived_metric_val_list:\n                    print(\n                        f\"\\t  - {derived_metric_val} = \"\n                        f\"{metrics_dict[key_metrics][derived_metric][0][derived_metric_val]}\"\n                    )\n\n    def _print_metrics_dict(self, key_metrics: str, metrics_dict: dict):\n        \"\"\"\n        Print the metrics configured formatted.\n\n        Args:\n            key_metrics: The key name of each metric configured (e.g. 
m1, m2, m3).\n            metrics_dict: The dictionary of metrics and their setup.\n        \"\"\"\n        print(f\"{key_metrics}:\")\n        list_key_metrics = [k for k, v in metrics_dict[key_metrics].items()]\n        if list_key_metrics:\n            for metric in list_key_metrics:\n                if metric == \"metric_name\":\n                    print(f\"  {metric} = {metrics_dict[key_metrics][metric]}\")\n                else:\n                    for derived_metric in metrics_dict[key_metrics][metric]:\n                        if derived_metric in [\"last_cadence\", \"last_year_cadence\", \"window_function\"]:\n                            print(f\"\\t- {derived_metric}:\")\n                            derived_metric_val_list = [\n                                k for k, v in metrics_dict[key_metrics][metric][derived_metric][0].items()\n                            ]\n                            for derived_metric_val in derived_metric_val_list:\n                                print(\n                                    f\"\\t  - {derived_metric_val} = \"\n                                    f\"{metrics_dict[key_metrics][metric][derived_metric][0][derived_metric_val]}\"\n                                )\n                        else:\n                            self._print_derived_metrics(key_metrics, metric, metrics_dict)\n\n    @classmethod\n    def _print_recon_dict(cls, recon_dict: dict):\n        \"\"\"\n        Print the recon dict formatted.\n\n        Args:\n            recon_dict: A dictionary of reconciliation setup.\n        \"\"\"\n        if recon_dict:\n            print(\"RECON CONFIGURED:\")\n            for key_cadence in recon_dict:\n                if recon_dict[f\"{key_cadence}\"] == {}:\n                    print(f\"{key_cadence}\")\n                else:\n                    print(f\"{key_cadence}:\")\n                keys_recon = [k for k, v in recon_dict[key_cadence].items()]\n                if keys_recon:\n                    for k_recon in keys_recon:\n                        print(f\"  {k_recon}:\")\n                        keys_recon = [k for k, v in recon_dict[key_cadence][k_recon].items()]\n                        for recon_val in keys_recon:\n                            print(\n                                f\"\\t- {recon_val}:snapshot = {recon_dict[key_cadence][k_recon][recon_val]['snapshot']}\"\n                            )\n\n    @classmethod\n    def _print_stages_dict(cls, stages_dict: dict):\n        \"\"\"\n        Print the dictionary of stages formatted.\n\n        Args:\n            stages_dict: A dictionary of stages and their setup.\n        \"\"\"\n        if stages_dict:\n            print(\"STEPS CONFIGURED:\")\n            for key_stages in stages_dict:\n                print(f\"step {key_stages}:\")\n                keys_stages = [k for k, v in stages_dict[key_stages].items()]\n                for k_stages in keys_stages:\n                    if k_stages != \"repartition\":\n                        print(f\"  - {k_stages} = {stages_dict[key_stages][k_stages]}\")\n                    else:\n                        repartition_stages = [k for k, v in stages_dict[key_stages][k_stages].items()]\n                        for stg in repartition_stages:\n                            print(\"  - repartition_type:\")\n                            print(f\"\\t {stg} = {stages_dict[key_stages][k_stages][stg]}\")\n\n    @classmethod\n    def _sort_files(cls, sql_files: str) -> list:\n        \"\"\"\n        Create a list sorted alphabetically 
based on the sql files provided.\n\n        Args:\n            sql_files: Name of the SQL files that will be sent to the framework\n            to process (e.g. file1.sql, file2.sql).\n\n        Returns:\n            A list of sql files sorted alphabetically.\n\n        \"\"\"\n        fileslist = sql_files.split(\",\")\n        # remove extra spaces from items in the list\n        fileslist = [x.strip() for x in fileslist]\n        for file in range(len(fileslist)):\n            fileslist[file] = fileslist[file].lower().strip()\n            # apply bubble sort to sort the words\n            for n in range(len(fileslist) - 1, 0, -1):\n                for i in range(n):\n                    if fileslist[i] > fileslist[i + 1]:\n                        # swap data if the element is less than the next element in the array\n                        fileslist[i], fileslist[i + 1] = fileslist[i + 1], fileslist[i]\n        return fileslist\n\n    @classmethod\n    def _validate_metrics_config(cls, calc_metric: str, metrics_dict: dict, widget_index: int):\n        \"\"\"\n        Validate the metrics widgets setup.\n\n        Args:\n            calc_metric: Name of the metric calculation set (e.g. last_cadence, last_year_cadence).\n            metrics_dict: The dictionary of metrics and their setup.\n            widget_index: Index of the widget selected to be validated.\n\n        \"\"\"\n        if calc_metric == \"last_cadence\":\n            if dbutils.widgets.get(f\"{widget_index}_{calc_metric}_label\").strip() != \"\":\n                try:\n                    int(dbutils.widgets.get(f\"{widget_index}_{calc_metric}_window\"))\n                    metrics_dict[f\"m{widget_index}\"][\"calculated_metric\"].update(\n                        {\n                            f\"{calc_metric}\": [\n                                {\n                                    \"label\": dbutils.widgets.get(f\"{widget_index}_{calc_metric}_label\"),\n                                    \"window\": dbutils.widgets.get(f\"{widget_index}_{calc_metric}_window\"),\n                                }\n                            ]\n                        }\n                    )\n                    print(f\"{calc_metric} configuration status: OK\")\n                except Exception:\n                    print(f\"{calc_metric} - WRONG CONFIGURATION:\")\n                    print(f\"\\t- The {calc_metric} window value must be INTEGER.\")\n            else:\n                print(f\"{calc_metric} - WRONG CONFIGURATION:\")\n                print(f\"\\t- The {calc_metric} label is mandatory.\")\n        elif calc_metric == \"last_year_cadence\":\n            if dbutils.widgets.get(f\"{widget_index}_{calc_metric}_label\").strip() != \"\":\n                metrics_dict[f\"m{widget_index}\"][\"calculated_metric\"].update(\n                    {\n                        f\"{calc_metric}\": [\n                            {\n                                \"label\": dbutils.widgets.get(f\"{widget_index}_{calc_metric}_label\"),\n                                \"window\": 1,\n                            }\n                        ]\n                    }\n                )\n                print(f\"{calc_metric} configuration status: OK\")\n            else:\n                print(f\"{calc_metric} - WRONG CONFIGURATION:\")\n                print(f\"\\t- The {calc_metric} label is mandatory.\")\n        elif calc_metric == \"window_function\":\n            if dbutils.widgets.get(f\"{widget_index}_{calc_metric}_label\").strip() 
!= \"\":\n                window_list = dbutils.widgets.get(f\"{widget_index}_{calc_metric}_window\").split(\",\")\n                if len(window_list) > 1:\n                    metrics_dict[f\"m{widget_index}\"][\"calculated_metric\"].update(\n                        {\n                            f\"{calc_metric}\": [\n                                {\n                                    \"label\": dbutils.widgets.get(f\"{widget_index}_{calc_metric}_label\"),\n                                    \"window\": [int(x.strip()) for x in window_list],\n                                    \"agg_func\": dbutils.widgets.get(name=f\"{widget_index}_{calc_metric}_agg_func\"),\n                                }\n                            ]\n                        }\n                    )\n                    print(f\"{calc_metric} configuration status: OK\")\n                else:\n                    print(f\"{calc_metric} - WRONG CONFIGURATION:\")\n                    print(\n                        \"\\t- The window function must follow the pattern of \"\n                        \"two integer digits separated with comma (e.g. 3,1).\"\n                    )\n            else:\n                print(f\"{calc_metric} - WRONG CONFIGURATION:\")\n                print(\"\\t- The window_function label is mandatory.\")\n        elif calc_metric == \"derived_metric\":\n            if (\n                    dbutils.widgets.get(name=f\"{widget_index}_{calc_metric}_label\").strip() != \"\"\n                    and dbutils.widgets.get(name=f\"{widget_index}_{calc_metric}_formula\").strip() != \"\"\n            ):\n                metrics_dict[f\"m{widget_index}\"].update(\n                    {\n                        f\"{calc_metric}\": [\n                            {\n                                \"label\": dbutils.widgets.get(name=f\"{widget_index}_{calc_metric}_label\"),\n                                \"formula\": dbutils.widgets.get(name=f\"{widget_index}_{calc_metric}_formula\"),\n                            }\n                        ]\n                    }\n                )\n                print(f\"{calc_metric} configuration status: OK\")\n            else:\n                print(f\"{calc_metric} - WRONG CONFIGURATION:\")\n                print(\"\\t- The derived_metric label and formula are mandatory.\")"
  },
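The notebook helper above relies on Databricks globals (`dbutils`, `spark`) and ultimately boils down to a delete-then-insert upsert into `lkp_query_builder`, keyed by a query id hashed from the use case name. Below is a minimal, self-contained sketch of that flow outside Databricks; the table column layout and the example values are illustrative assumptions, not the real lookup table schema.

```python
# Minimal sketch (not the notebook itself): the delete-then-insert upsert the
# helper hands to spark.sql(). Column layout and values are assumptions.
import json


def generate_query_id(usecase_name: str) -> int:
    """Mirror of the helper's id logic: first 9 digits of hash(), made positive."""
    hash_val = int(str(hash(usecase_name))[0:9])
    return hash_val if hash_val > 0 else hash_val * -1


def build_statements(db_schema: str, usecase_name: str, mapping_dict: dict) -> tuple:
    """Build the DELETE/INSERT pair for the lkp_query_builder table."""
    query_label = f"'{usecase_name}'"
    mappings = '"""' + json.dumps(mapping_dict, indent=4).replace('"', "'") + '"""'
    delete_sttmt = f"DELETE FROM {db_schema}.lkp_query_builder WHERE QUERY_LABEL = {query_label};"
    insert_sttmt = (
        f"INSERT INTO {db_schema}.lkp_query_builder VALUES ("
        f"{generate_query_id(usecase_name)}, {query_label}, {mappings}, current_timestamp());"
    )
    return delete_sttmt, insert_sttmt


if __name__ == "__main__":
    delete_sql, insert_sql = build_statements(
        "my_schema", "demo_usecase", {"view_a": {"dimensions": {}, "metric": {}}}
    )
    print(delete_sql)
    print(insert_sql)
```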
  {
    "path": "cicd/.bumpversion.cfg",
    "content": "[bumpversion]\ncurrent_version = 2.0.0\ncommit = False\ntag = False\n\n[bumpversion:file:pyproject.toml]\nsearch = version = \"{current_version}\"\nreplace = version = \"{new_version}\"\n"
  },
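The bumpversion config above only rewrites the version literal in `pyproject.toml` (no commit, no tag). A rough sketch of the substitution it performs, assuming the file path is resolved from wherever bumpversion is invoked and using an illustrative version pair:

```python
# Sketch of what the [bumpversion:file:pyproject.toml] entry effectively does:
# replace 'version = "<current>"' with 'version = "<new>"' in pyproject.toml.
# The path and the version pair below are assumptions for illustration.
from pathlib import Path

current_version = "2.0.0"
new_version = "2.0.1"

pyproject = Path("pyproject.toml")
text = pyproject.read_text()
pyproject.write_text(
    text.replace(f'version = "{current_version}"', f'version = "{new_version}"')
)
```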
  {
    "path": "cicd/Dockerfile",
    "content": "ARG PYTHON_IMAGE=python:3.12-slim-bullseye\n\nFROM $PYTHON_IMAGE\n\nARG USER_ID=1000\nARG GROUP_ID=1000\nARG CPU_ARCHITECTURE\n\n# Install Prerequisites\nRUN mkdir -p /usr/share/man/man1 && \\\n    apt-get -y update && \\\n    apt-get install -y wget=1.21* gnupg2=2.2* git=1:2* g++=4:10.2.1* rsync=3.2* && \\\n    apt-get -y clean\n\n# Install jdk\nRUN mkdir -p /etc/apt/keyrings && \\\n    wget -qO - https://packages.adoptium.net/artifactory/api/gpg/key/public | gpg --dearmor | tee /etc/apt/trusted.gpg.d/adoptium.gpg > /dev/null && \\\n    echo \"deb https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main\" | tee /etc/apt/sources.list.d/adoptium.list && \\\n    apt-get -y update && \\\n    apt-get -y install temurin-17-jdk && \\\n    apt-get -y clean\nENV JAVA_HOME=/usr/lib/jvm/temurin-17-jdk-${CPU_ARCHITECTURE}\n\n# useradd -l is necessary to avoid docker build hanging in export image phase when using large uids\nRUN groupadd -g ${GROUP_ID} appuser && \\\n    useradd -rm -l -u ${USER_ID} -d /home/appuser -s /bin/bash -g appuser appuser\n\nCOPY cicd/requirements_full.lock /tmp/requirements.txt\n\nUSER appuser\n\nENV PATH=\"/home/appuser/.local/bin:$PATH\"\nRUN python -m pip install --upgrade pip==25.2 setuptools==74.* --user\nRUN python -m pip install --user -r /tmp/requirements.txt\n\nRUN mkdir /home/appuser/.ssh/ && touch /home/appuser/.ssh/known_hosts\n\nRUN echo Image built for $CPU_ARCHITECTURE with python image $PYTHON_IMAGE.\n"
  },
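The Dockerfile above parameterises the user/group ids and the CPU architecture (the latter must match the Temurin JDK directory referenced by `JAVA_HOME`, e.g. `amd64` or `arm64`). A hedged sketch of invoking the build with those build args from Python; the image tag, context path and the docker CLI being available are assumptions.

```python
# Sketch: building the CICD image with the build args the Dockerfile declares
# (USER_ID, GROUP_ID, CPU_ARCHITECTURE). Tag and context path are assumptions.
import os
import subprocess

subprocess.run(
    [
        "docker", "build",
        "-f", "cicd/Dockerfile",
        "--build-arg", f"USER_ID={os.getuid()}",
        "--build-arg", f"GROUP_ID={os.getgid()}",
        "--build-arg", "CPU_ARCHITECTURE=amd64",
        "-t", "lakehouse-engine:local",
        ".",
    ],
    check=True,
)
```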
  {
    "path": "cicd/Jenkinsfile",
    "content": "@Library(['GlobalJenkinsLibrary']) _\n\npipeline {\n    options {\n        buildDiscarder(logRotator(numToKeepStr: '30', artifactNumToKeepStr: '30'))\n        timeout(time: 2, unit: 'HOURS')\n        disableConcurrentBuilds()\n        skipDefaultCheckout(true)\n        ansiColor('xterm')\n        timestamps()\n    }\n\n    agent {\n        node {\n            label 'lakehouse_base'\n        }\n    }\n\n    environment {\n        VERSION = env.BRANCH_NAME.replaceAll(\"[/-]\", \"_\").toLowerCase()\n        GIT_CREDENTIALS_ID = \"git-lakehouse-cicd\"\n    }\n\n    stages {\n        stage('cleanup workspace') {\n            steps {\n                cleanWs(disableDeferredWipeout: true, deleteDirs: true)\n            }\n        }\n\n        stage('Clone') {\n            steps {\n                retry(3) {\n                    script {\n                        checkout([\n                                $class           : 'GitSCM',\n                                branches         : scm.branches,\n                                userRemoteConfigs: [[url: 'https://bitbucket.tools.3stripes.net/scm/lak/lakehouse-engine.git', credentialsId: GIT_CREDENTIALS_ID]]\n                        ])\n                    }\n                }\n            }\n        }\n\n        stage('Build Image') {\n            steps {\n                sh 'make build-image version=$VERSION'\n            }\n        }\n\n        stage('Create Docs') {\n            steps {\n                sh 'make docs version=$VERSION'\n            }\n        }\n\n        stage('Parallel') {\n            parallel {\n                stage('Lint') {\n                    steps {\n                        sh 'make lint version=$VERSION'\n                    }\n                }\n\n                stage('Test Security') {\n                    steps {\n                        sh 'make test-security version=$VERSION'\n                    }\n                }\n\n                stage('Audit Dependency Safety'){\n                    steps{\n                        catchError(message: \"${STAGE_NAME} is unstable\", buildResult: 'SUCCESS', stageResult: 'UNSTABLE') {\n                            sh 'make audit-dep-safety version=$VERSION'\n                        }\n                    }\n                }\n\n                stage('Test dependencies') {\n                    steps {\n                        sh 'make test-deps version=$VERSION'\n                    }\n                }\n\n                stage('Test') {\n                    steps {\n                        sh 'make test version=$VERSION'\n                    }\n                }\n            }\n        }\n\n        stage('Sonar') {\n            steps {\n                script {\n                    tools.sonar.run(env: 'COMMUNITY-PRD', version: '1.0', branch: env.BRANCH_NAME)\n                }\n            }\n        }\n    }\n\n    post {\n        always {\n            archiveArtifacts artifacts: 'artefacts/docs/**/*'\n            archiveArtifacts artifacts: 'artefacts/*.json'\n            junit 'artefacts/tests.xml'\n            step([$class: 'CoberturaPublisher', coberturaReportFile: 'artefacts/coverage.xml'])\n        }\n    }\n}"
  },
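The validation stages of the pipeline above all delegate to make targets. A small sketch of the local equivalent, running the same targets sequentially; the version value mirrors the branch-derived VERSION variable and is an assumption here.

```python
# Sketch: local equivalent of the pipeline's validation stages, calling the same
# make targets the Jenkinsfile uses. The version string is an assumption.
import subprocess

version = "my_feature_branch"
for target in ("lint", "test-security", "test-deps", "test"):
    subprocess.run(["make", target, f"version={version}"], check=True)
```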
  {
    "path": "cicd/Jenkinsfile_deploy",
    "content": "pipeline {\n    parameters {\n        string(name: 'BRANCH', defaultValue: 'master', description: 'Branch to use for the deployment process.')\n        string(name: 'VERSION', defaultValue: null, description: 'Version to deploy (git tag in master branch without the \"v\"). E.g., 0.2.0. If you are deploying to dev, from your branch, ignore this.')\n        booleanParam(name: 'SKIP_VALIDATIONS', defaultValue: false, description: 'Whether to skip the validations. Only applicable for feature releases to make them faster.')\n        booleanParam(name: 'SKIP_OS_DEPLOYMENT', defaultValue: false, description: 'Whether to skip the OS Deployment related stages or not.')\n        booleanParam(name: 'NOTIFY', defaultValue: true, description: 'Whether to notify the release or not.')\n    }\n\n    options {\n        buildDiscarder(logRotator(numToKeepStr: '100', artifactNumToKeepStr: '30'))\n        timeout(time: 2, unit: 'HOURS')\n        disableConcurrentBuilds()\n        skipDefaultCheckout(true)\n        ansiColor('xterm')\n        timestamps()\n    }\n\n    agent {\n        node {\n            label 'lakehouse_base'\n        }\n    }\n\n    environment {\n        PYPI_CREDENTIALS = credentials('pypi-credentials')\n        ARTIFACTORY_CREDENTIALS = credentials('artifactory-credentials')\n        GIT_CREDENTIALS_ID = \"git-lakehouse-cicd\"\n        GIT_CREDENTIALS_LAK = credentials('push-to-github-lak')\n        GIT_CREDENTIALS_LAK_DOCS = credentials('push-to-github-lak-docs')\n        DEPLOY_VERSION = getDeploymentVersion()\n        DEPLOY_GIT_OBJECT = getDeploymentGitObject()\n    }\n\n    stages {\n        stage('cleanup workspace') {\n            steps {\n                cleanWs(disableDeferredWipeout: true, deleteDirs: true)\n            }\n        }\n\n        stage('Clone') {\n            steps {\n                retry(3) {\n                    script {\n                        checkout([\n                                $class           : 'GitSCM',\n                                branches         : [['name': env.DEPLOY_GIT_OBJECT]],\n                                userRemoteConfigs: [[url: 'https://bitbucket.tools.3stripes.net/scm/lak/lakehouse-engine.git', credentialsId: GIT_CREDENTIALS_ID]]\n                        ])\n                    }\n                }\n            }\n        }\n\n        stage('Build Image') {\n            steps {\n                sh 'make build-image version=' + \"${env.DEPLOY_VERSION}\"\n            }\n        }\n\n        stage('Parallel') {\n            when {\n                expression {\n                    (!params.SKIP_VALIDATIONS && params.BRANCH != 'master')\n                }\n            }\n            parallel {\n\n                stage('Lint') {\n                    steps {\n                        sh 'make lint version=' + \"${env.DEPLOY_VERSION}\"\n                    }\n                }\n\n                stage('Test Security') {\n                    steps {\n                        sh 'make test-security version=' + \"${env.DEPLOY_VERSION}\"\n                    }\n                }\n\n                stage('Audit Dependency Safety'){\n                    steps{\n                        catchError(message: \"${STAGE_NAME} is unstable\", buildResult: 'SUCCESS', stageResult: 'UNSTABLE') {\n                            sh 'make audit-dep-safety version=$VERSION'\n                        }\n                    }\n                }\n\n                stage('Test dependencies') {\n                    steps {\n                        sh 
'make test-deps version=' + \"${env.DEPLOY_VERSION}\"\n                    }\n                }\n\n                stage('Test') {\n                    steps {\n                        sh 'make test version=' + \"${env.DEPLOY_VERSION}\"\n                    }\n                }\n\n            }\n        }\n\n        stage('Deploy') {\n            steps {\n                script {\n                    sh 'make deploy version=' + \"${env.DEPLOY_VERSION}\" + ' artifactory_credentials_file=$ARTIFACTORY_CREDENTIALS'\n                }\n            }\n        }\n\n        stage('Open Source Deployment') {\n            when {\n                expression {\n                    (params.BRANCH == 'master' && !params.SKIP_OS_DEPLOYMENT)\n                }\n            }\n            stages {\n                stage('Sync Code with GitHub') {\n                    steps {\n                        script {\n                            sh 'make sync-to-github version=' + \"${env.DEPLOY_VERSION}\" + ' git_credentials_file=$GIT_CREDENTIALS_LAK repository=lakehouse-engine'\n                        }\n                    }\n                }\n\n                stage('Deploy Docs to Github') {\n                    steps {\n                        script {\n                            sh 'make deploy-docs-to-github version=' + \"${env.DEPLOY_VERSION}\" + ' git_credentials_file=$GIT_CREDENTIALS_LAK_DOCS repository=lakehouse-engine-docs os_deployment=True'\n                        }\n                    }\n                }\n\n                stage('Deploy to Pypi') {\n                    steps {\n                        script {\n                            // we are forcing make build as it was not happening sometimes, for no reason.\n                            sh 'make build os_deployment=True'\n                            sh 'make deploy-to-pypi-and-clean os_deployment=True version=' + \"${env.DEPLOY_VERSION}\" + ' pypi_credentials_file=$PYPI_CREDENTIALS'\n                        }\n                    }\n                }\n            }\n        }\n\n        stage('Notify') {\n            when {\n                expression {\n                    params.BRANCH == 'master' && params.NOTIFY\n                }\n            }\n            steps {\n                script {\n                    params = readYaml file: 'cicd/meta.yaml'\n                    release_notes = sh(script:'cat CHANGELOG.md | cut -d \")\" -f 2 | head -n 10', returnStdout: true).trim()\n                    recipients = params[\"mail_recipients\"].join(\";\")\n                    emailext(\n                            attachLog: false,\n                            compressLog: true,\n                            body: \"\"\"\n                            <BR>A new version <b>$env.DEPLOY_VERSION</b> of the <b>Lakehouse Engine</b> was deployed into Artifactory.<BR><BR>\n                            You can install it just like any other python library, either notebook scoped with pip install or cluster scoped\n                            by specifying the library in the cluster configuration.: \n                            You can check the lakehouse-engine documentation here: ${params[\"engine_docs\"]}.\n                            Check the latest updates here:<BR>\n                            <pre>\n                            ${release_notes}\n                            </pre><BR>\n                            For more details, please check the complete changelog and/or the additional resources listed below:\n                            
<ul>\n                              <li>${params[\"changelog_url\"]}</li>\n                              <li>${params[\"code_url\"]}</li>\n                              <li>${params[\"confluence_url\"]}</li>\n                            </ul>\n                            \"\"\",\n                            mimeType: 'text/html',\n                            replyTo: \"${params['reply_to']}\",\n                            from: \"${params['from']}\",\n                            to: recipients,\n                            subject: \"Lakehouse Engine Updates - $env.DEPLOY_VERSION\"\n                    )\n                }\n            }\n        }\n    }\n}\n\n/**\n * Get deployment git object (branch name or tag reference) given certain Jenkins parameters and the team's deployment guidelines.\n * @return git object (branch or tag)\n */\ndef String getDeploymentGitObject() {\n    gitObject = params.BRANCH\n\n    if (params.BRANCH == 'master') {\n        if (params.VERSION ==~ '[\\\\d]{1,3}\\\\.[\\\\d]{1,3}\\\\.[\\\\d]{1,3}') {\n            // force the git object to checkout to be a version tag\n            gitObject = \"refs/tags/v${params.VERSION}\"\n            return gitObject\n        }\n        else {\n            throw new Exception(\"Version ${params.VERSION} does not match valid git version tag. It should be in the form of <major>.<minor>.<patch>.\")\n        }\n    } else {\n        return gitObject\n    }\n}\n\n/**\n * Get deployment version given certain Jenkins parameters and the team's deployment guidelines.\n * @return deployment version\n */\ndef String getDeploymentVersion() {\n    version = params.VERSION\n\n    if (params.BRANCH == 'master') {\n        if (version ==~ '[\\\\d]{1,3}\\\\.[\\\\d]{1,3}\\\\.[\\\\d]{1,3}') {\n            return version\n        }\n        else {\n            throw new Exception(\"Version ${version} does not match valid git version tag. It should be in the form of <major>.<minor>.<patch>.\")\n        }\n    } else {\n        // force branch as the version to be deployed when we are dealing with feature branches.\n        return params.BRANCH.replaceAll(\"[/-]\", \"_\").toLowerCase()\n    }\n}"
  },
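The two Groovy helpers at the end of the deploy pipeline decide what to check out and which version string to publish: on master, VERSION must match `<major>.<minor>.<patch>` and resolves to the tag `refs/tags/v<version>`; on feature branches, the branch name (with `/` and `-` replaced by `_`, lowercased) is used as the version. A re-expression of that logic in Python, purely as a sketch:

```python
# Re-expression (in Python) of the Jenkinsfile_deploy helpers that pick the git
# object to check out and the version to publish. Not the pipeline itself.
import re

VERSION_PATTERN = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}$")


def deployment_git_object(branch: str, version: str) -> str:
    if branch == "master":
        if version and VERSION_PATTERN.match(version):
            return f"refs/tags/v{version}"
        raise ValueError(
            f"Version {version} does not match a valid git version tag "
            "(<major>.<minor>.<patch>)."
        )
    return branch


def deployment_version(branch: str, version: str) -> str:
    if branch == "master":
        if version and VERSION_PATTERN.match(version):
            return version
        raise ValueError(
            f"Version {version} does not match a valid git version tag "
            "(<major>.<minor>.<patch>)."
        )
    # feature branches: the normalised branch name becomes the deployed version
    return re.sub(r"[/-]", "_", branch).lower()


if __name__ == "__main__":
    print(deployment_git_object("master", "2.0.0"))    # refs/tags/v2.0.0
    print(deployment_version("feature/JIRA-123", ""))  # feature_jira_123
```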
  {
    "path": "cicd/bandit.yaml",
    "content": "assert_used:\n  skips: ['*test*']"
  },
  {
    "path": "cicd/code_doc/content.css",
    "content": "/*\nThis CSS file contains all style definitions for documentation content.\n\nAll selectors are scoped with \".pdoc\".\nThis makes sure that the pdoc styling doesn't leak to the rest of the page when pdoc is embedded.\n*/\n\n.pdoc {\n    color: var(--text);\n    /* enforce some styling even if bootstrap reboot is not included */\n    box-sizing: border-box;\n    line-height: 1.5;\n    /* override background from pygments */\n    /*unnecessary since pdoc 10, only left here to keep old custom templates working. */\n    background: none;\n}\n\n.pdoc .pdoc-button {\n    cursor: pointer;\n    display: inline-block;\n    border: solid black 1px;\n    border-radius: 2px;\n    font-size: .75rem;\n    padding: calc(0.5em - 1px) 1em;\n    transition: 100ms all;\n}\n\n\n/* Admonitions */\n.pdoc .pdoc-alert {\n    padding: 1rem 1rem 1rem calc(1.5rem + 24px);\n    border: 1px solid transparent;\n    border-radius: .25rem;\n    background-repeat: no-repeat;\n    background-position: 1rem center;\n    margin-bottom: 1rem;\n}\n\n.pdoc .pdoc-alert > *:last-child {\n    margin-bottom: 0;\n}\n\n/* Admonitions are currently not stylable via theme.css */\n.pdoc .pdoc-alert-note  {\n    color: #000000;\n    background-color: #f1efef;\n    border-color: #f1f1f1;\n    background-image: url(\"data:image/svg+xml,{% filter urlencode %}{% include 'resources/info-circle-fill.svg' %}{% endfilter %}\");\n}\n\n.pdoc .pdoc-alert-warning {\n    color: #664d03;\n    background-color: #fff3cd;\n    border-color: #ffecb5;\n    background-image: url(\"data:image/svg+xml,{% filter urlencode %}{% include 'resources/exclamation-triangle-fill.svg' %}{% endfilter %}\");\n}\n\n.pdoc .pdoc-alert-danger {\n    color: #842029;\n    background-color: #f8d7da;\n    border-color: #f5c2c7;\n    background-image: url(\"data:image/svg+xml,{% filter urlencode %}{% include 'resources/lightning-fill.svg' %}{% endfilter %}\");\n}\n\n.pdoc .visually-hidden {\n    position: absolute !important;\n    width: 1px !important;\n    height: 1px !important;\n    padding: 0 !important;\n    margin: -1px !important;\n    overflow: hidden !important;\n    clip: rect(0, 0, 0, 0) !important;\n    white-space: nowrap !important;\n    border: 0 !important;\n}\n\n.pdoc h1, .pdoc h2, .pdoc h3 {\n    font-weight: 300;\n    margin: .3em 0;\n    padding: .2em 0;\n}\n\n.pdoc > section:not(.module-info) h1 {\n    font-size: 1.5rem;\n    font-weight: 500;\n}\n.pdoc > section:not(.module-info) h2 {\n    font-size: 1.4rem;\n    font-weight: 500;\n}\n.pdoc > section:not(.module-info) h3 {\n    font-size: 1.3rem;\n    font-weight: 500;\n}\n.pdoc > section:not(.module-info) h4 {\n    font-size: 1.2rem;\n}\n.pdoc > section:not(.module-info) h5 {\n    font-size: 1.1rem;\n}\n\n.pdoc a {\n    text-decoration: none;\n    color: var(--link);\n}\n\n.pdoc a:hover {\n    color: var(--link-hover);\n}\n\n.pdoc blockquote {\n    margin-left: 2rem;\n}\n\n.pdoc pre {\n    border-top: 1px solid var(--accent2);\n    border-bottom: 1px solid var(--accent2);\n    margin-top: 0;\n    margin-bottom: 1em;\n    padding: .5rem 0 .5rem .5rem;\n    overflow-x: auto;\n    /*unnecessary since pdoc 10, only left here to keep old custom templates working. 
*/\n    background-color: var(--code);\n}\n\n.pdoc code {\n    color: var(--text);\n    padding: .2em .4em;\n    margin: 0;\n    font-size: 85%;\n    background-color: var(--accent);\n    border-radius: 6px;\n}\n\n.pdoc a > code {\n    color: inherit;\n}\n\n.pdoc pre > code {\n    display: inline-block;\n    font-size: inherit;\n    background: none;\n    border: none;\n    padding: 0;\n}\n\n.pdoc > section:not(.module-info) {\n    /* this margin should collapse with docstring margin,\n       but not for the module docstr which is followed by view_source. */\n    margin-bottom: 1.5rem;\n}\n\n/* Page Heading */\n.pdoc .modulename {\n    margin-top: 0;\n    font-weight: bold;\n}\n\n.pdoc .modulename a {\n    color: var(--link);\n    transition: 100ms all;\n}\n\n/* GitHub Button */\n.pdoc .git-button {\n    float: right;\n    border: solid var(--link) 1px;\n}\n\n.pdoc .git-button:hover {\n    background-color: var(--link);\n    color: var(--pdoc-background);\n}\n\n.view-source-toggle-state,\n.view-source-toggle-state ~ .pdoc-code {\n    display: none;\n}\n.view-source-toggle-state:checked ~ .pdoc-code {\n    display: block;\n}\n\n.view-source-button {\n    display: inline-block;\n    float: right;\n    font-size: .75rem;\n    line-height: 1.5rem;\n    color: var(--muted);\n    padding: 0 .4rem 0 1.3rem;\n    cursor: pointer;\n    /* odd hack to reduce space between \"bullet\" and text */\n    text-indent: -2px;\n}\n.view-source-button > span {\n    visibility: hidden;\n}\n.module-info .view-source-button {\n    float: none;\n    display: flex;\n    justify-content: flex-end;\n    margin: -1.2rem .4rem -.2rem 0;\n}\n.view-source-button::before {\n    /* somewhat awkward recreation of a <summary> element. ideally we'd just use `display: inline list-item`, but\n     that does not work in Chrome (yet), see https://crbug.com/995106. 
*/\n    position: absolute;\n    content: \"View Source\";\n    display: list-item;\n    list-style-type: disclosure-closed;\n}\n.view-source-toggle-state:checked ~ .attr .view-source-button::before,\n.view-source-toggle-state:checked ~ .view-source-button::before {\n    list-style-type: disclosure-open;\n}\n\n/* Docstrings */\n.pdoc .docstring {\n    margin-bottom: 1.5rem;\n}\n\n.pdoc section:not(.module-info) .docstring {\n    margin-left: clamp(0rem, 5vw - 2rem, 1rem);\n}\n\n.pdoc .docstring .pdoc-code {\n    margin-left: 1em;\n    margin-right: 1em;\n}\n\n/* Highlight focused element */\n.pdoc h1:target,\n.pdoc h2:target,\n.pdoc h3:target,\n.pdoc h4:target,\n.pdoc h5:target,\n.pdoc h6:target,\n.pdoc .pdoc-code > pre > span:target {\n    background-color: var(--active);\n    box-shadow: -1rem 0 0 0 var(--active);\n}\n\n.pdoc .pdoc-code > pre > span:target {\n    /* make the highlighted line full width so that the background extends */\n    display: block;\n}\n\n.pdoc div:target > .attr,\n.pdoc section:target > .attr,\n.pdoc dd:target > a {\n    background-color: var(--active);\n}\n\n.pdoc * {\n    scroll-margin: 2rem;\n}\n\n.pdoc .pdoc-code .linenos {\n    user-select: none;\n}\n\n.pdoc .attr:hover {\n    filter: contrast(0.95);\n}\n\n/* Header link */\n.pdoc section, .pdoc .classattr {\n    position: relative;\n}\n\n.pdoc .headerlink {\n    --width: clamp(1rem, 3vw, 2rem);\n    position: absolute;\n    top: 0;\n    left: calc(0rem - var(--width));\n    transition: all 100ms ease-in-out;\n    opacity: 0;\n}\n.pdoc .headerlink::before {\n    content: \"#\";\n    display: block;\n    text-align: center;\n    width: var(--width);\n    height: 2.3rem;\n    line-height: 2.3rem;\n    font-size: 1.5rem;\n}\n\n.pdoc .attr:hover ~ .headerlink,\n.pdoc *:target > .headerlink,\n.pdoc .headerlink:hover {\n    opacity: 1;\n}\n\n/* Attributes */\n.pdoc .attr {\n    display: block;\n    margin: .5rem 0 .5rem;\n    padding: .4rem .4rem .4rem 1rem;\n    background-color: var(--accent);\n    overflow-x: auto;\n}\n\n.pdoc .classattr {\n    margin-left: 2rem;\n}\n\n.pdoc .name {\n    color: var(--name);\n    font-weight: bold;\n}\n\n.pdoc .def {\n    color: var(--def);\n    font-weight: bold;\n}\n\n.pdoc .signature {\n    /* override pygments background color */\n    background-color: transparent;\n}\n\n.pdoc .param, .pdoc .return-annotation {\n    white-space: pre;\n}\n.pdoc .signature.multiline .param {\n    display: block;\n}\n.pdoc .signature.condensed .param {\n    display:inline-block;\n}\n\n.pdoc .annotation {\n    color: var(--annotation);\n}\n\n/* Show/Hide buttons for long default values */\n.pdoc .view-value-toggle-state,\n.pdoc .view-value-toggle-state ~ .default_value {\n    display: none;\n}\n.pdoc .view-value-toggle-state:checked ~ .default_value {\n    display: inherit;\n}\n.pdoc .view-value-button {\n    font-size: .5rem;\n    vertical-align: middle;\n    border-style: dashed;\n    margin-top: -0.1rem;\n}\n.pdoc .view-value-button:hover {\n    background: white;\n}\n.pdoc .view-value-button::before {\n    content: \"show\";\n    text-align: center;\n    width: 2.2em;\n    display: inline-block;\n}\n.pdoc .view-value-toggle-state:checked ~ .view-value-button::before {\n    content: \"hide\";\n}\n\n/* Inherited Members */\n.pdoc .inherited {\n    margin-left: 2rem;\n}\n\n.pdoc .inherited dt {\n    font-weight: 700;\n}\n\n.pdoc .inherited dt, .pdoc .inherited dd {\n    display: inline;\n    margin-left: 0;\n    margin-bottom: .5rem;\n}\n\n.pdoc .inherited dd:not(:last-child):after {\n    
content: \", \";\n}\n\n.pdoc .inherited .class:before {\n    content: \"class \";\n}\n\n.pdoc .inherited .function a:after {\n    content: \"()\";\n}\n\n/* Search results */\n.pdoc .search-result .docstring {\n    overflow: auto;\n    max-height: 25vh;\n}\n\n.pdoc .search-result.focused > .attr {\n    background-color: var(--active);\n}\n\n/* \"built with pdoc\" attribution */\n.pdoc .attribution {\n    margin-top: 2rem;\n    display: block;\n    opacity: 0.5;\n    transition: all 200ms;\n    filter: grayscale(100%);\n}\n\n.pdoc .attribution:hover {\n    opacity: 1;\n    filter: grayscale(0%);\n}\n\n.pdoc .attribution img {\n    margin-left: 5px;\n    height: 35px;\n    vertical-align: middle;\n    width: 70px;\n    transition: all 200ms;\n}\n\n.pdoc table {\n    display: block;\n    width: max-content;\n    max-width: 150%;\n    overflow: auto;\n    margin-bottom: 1rem;\n}\n\n.pdoc table th, .pdoc table td {\n    padding: 12px 13px;\n    border: 1px solid var(--accent2);\n}\n\n.pdoc table th {\n    font-weight: 600;\n}"
  },
  {
    "path": "cicd/code_doc/custom_example_macros.py",
    "content": "\"\"\"Macro methods to be used on Lakehouse Engine Docs.\"\"\"\nimport warnings\nimport json\nimport pygments.formatters.html\nfrom markupsafe import Markup\n\nSTACK_LEVEL = 2\n\n\ndef _search_files(file: dict, search_string: str) -> list:\n    \"\"\"Searches for a string and outputs the line.\n\n    Search for a given string in a file and output the line where it is first\n    found.\n\n    Args:\n        file: path of the file to be searched.\n        search_string: string that will be searched for.\n\n    Returns:\n        The number of the first line where a given search_string appears.\n    \"\"\"\n    range_lines = []\n    with open(file) as f:\n        for num, line in enumerate(f, 1):\n            if search_string in line:\n                range_lines.append(num - 1)\n    return range_lines[0]\n\n\ndef _link_example(method_name: str) -> str or None:\n    \"\"\"Searches for a link in a dict.\n\n    Searches for the link of a given method_name, in a specific config file and\n    outputs it.\n\n    Args:\n        method_name: name of the method to be searched for.\n\n    Returns:\n        None or the example link for the given method_name.\n    \"\"\"\n    if method_name in list(lakehouse_engine_examples.keys()):\n        file_link = lakehouse_engine_examples[str(method_name)]\n\n        return lakehouse_engine_examples[\"base_link\"] + file_link if file_link != \"\" else None\n    else:\n        warnings.warn(\n                \"No entry provided for the following transformer: \"\n                + method_name,\n                RuntimeWarning,\n                STACK_LEVEL,\n        )\n\n        return None\n\n\ndef _get_dict_transformer(dict_to_search: dict, transformer: str) -> dict:\n    \"\"\"Searches for a transformer and returns the first dictionary occurrence.\n\n    Search for a given transformer in a dictionary and return the first occurrence.\n\n    Args:\n        dict_to_search: path of the file to be searched.\n        transformer: string that will be searched for.\n\n    Returns:\n        First dictionary where a given transformer is found.\n    \"\"\"\n    dict_transformer = []\n    for spec in dict_to_search[\"transform_specs\"]:\n        for transformer_dict in spec[\"transformers\"]:\n            if transformer_dict[\"function\"] == transformer:\n                dict_transformer.append(transformer_dict)\n\n    return json.dumps(dict_transformer[0], indent=4)\n\n\ndef _highlight_examples(method_name: str) -> str or None:\n    \"\"\"Creates a code snippet.\n\n    Constructs and exposes the code snippet of a given method_name.\n\n    Args:\n        method_name: name of the module to be searched for.\n\n    Returns:\n        None or the code snippet wrapped in html tags.\n    \"\"\"\n    for key, item in lakehouse_engine_examples.items():\n        if method_name == key:\n            file_path = f\"../../{item}\"\n            if file_path == \"../../\":\n                warnings.warn(\n                    \"No unit testing for the following transformer: \" + method_name,\n                    RuntimeWarning,\n                    STACK_LEVEL,\n                    )\n                return None\n\n            first_line = _search_files(file_path, f'\"function\": \"{method_name}\"')\n            with open(file_path) as json_file:\n                acon_file = json.load(json_file)\n            code_snippet = _get_dict_transformer(acon_file, method_name)\n\n            # Defining the lexer which will parse through the snippet of code we want\n            # to 
highlight\n            lexer = pygments.lexers.JsonLexer()\n            # Defining the format that will be outputted by the pygments library\n            # (on our case it will output the code within html tags)\n            formatter = pygments.formatters.html.HtmlFormatter(\n                linenos=\"inline\",\n                anchorlinenos=True,\n            )\n            formatter.linenostart = first_line\n\n            return Markup(pygments.highlight(code_snippet, lexer, formatter))\n\n\ndef get_example(method_name: str) -> str:\n    \"\"\"Get example based on given argument.\n\n    Args:\n        method_name: name of the module to be searched for.\n\n    Returns:\n        A example.\n    \"\"\"\n    example_link = _link_example(method_name=method_name)\n    json_example = _highlight_examples(method_name=method_name)\n\n    if example_link:\n        return (\n            \"\"\"<details class=\"example\">\\n\"\"\"\n            f\"\"\"<summary>View Example of {method_name} (See full example <a href=\"{example_link}\">here</a>)</summary>\"\"\"\n            f\"\"\"<div class=\"language-json highlight\"><pre><span></span><code>{json_example}</code></pre></div>\\n\"\"\"\n            \"\"\"</details>\"\"\"\n        )\n    else:\n        return \"\"\n\n\nwith open(\"./examples.json\") as json_file:\n    lakehouse_engine_examples = json.load(json_file)\n\ndef define_env(env):\n    \"Declare environment for jinja2 templates for markdown\"\n\n    for fn in [get_example]:\n        env.macro(fn)\n\n    # get mkdocstrings' Python handler\n    python_handler = env.conf[\"plugins\"][\"mkdocstrings\"].get_handler(\"python\")\n\n    # get the `update_env` method of the Python handler\n    update_env = python_handler.update_env\n\n    # override the `update_env` method of the Python handler\n    def patched_update_env(md, config):\n        update_env(md, config)\n\n        # get the `convert_markdown` filter of the env\n        convert_markdown = python_handler.env.filters[\"convert_markdown\"]\n\n        # build a chimera made of macros+mkdocstrings\n        def render_convert(markdown: str, *args, **kwargs):\n            return convert_markdown(env.render(markdown), *args, **kwargs)\n\n        # patch the filter\n        python_handler.env.filters[\"convert_markdown\"] = render_convert\n\n    # patch the method\n    python_handler.update_env = patched_update_env\n"
  },
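The macro module above looks a transformer up inside a test ACON by walking `transform_specs` -> `transformers` -> `function` and pretty-printing the first match (see `_get_dict_transformer`); documentation pages can then call the registered macro through mkdocs-macros, e.g. `{{ get_example("with_literals") }}`. Below is a minimal, standalone sketch of that lookup on a hypothetical ACON fragment (the dict literal is illustrative, not one of the repo's test resources):

```python
import json

# Hypothetical ACON fragment, mimicking the shape _get_dict_transformer expects:
# a "transform_specs" list whose entries hold a "transformers" list of
# {"function": ..., "args": ...} dicts.
toy_acon = {
    "transform_specs": [
        {
            "spec_id": "transformed_sales",
            "input_id": "sales",
            "transformers": [
                {"function": "with_literals", "args": {"literals": {"country": "PT"}}},
                {"function": "repartition", "args": {"num_partitions": 10}},
            ],
        }
    ]
}


def first_transformer_occurrence(acon: dict, transformer: str) -> str:
    """Return the first matching transformer dict as pretty-printed JSON."""
    for spec in acon["transform_specs"]:
        for transformer_dict in spec["transformers"]:
            if transformer_dict["function"] == transformer:
                return json.dumps(transformer_dict, indent=4)
    raise KeyError(f"transformer {transformer!r} not found")


print(first_transformer_occurrence(toy_acon, "with_literals"))
```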
  {
    "path": "cicd/code_doc/examples.json",
    "content": "{\n  \"base_link\":\"https://github.com/adidas/lakehouse-engine/blob/master/\",\n  \"get_max_value\": \"tests/resources/feature/delta_load/merge_options/update_column_set/batch_delta.json\",\n  \"with_row_id\": \"tests/resources/feature/transformations/chain_transformations/acons/streaming_batch.json\",\n  \"with_auto_increment_id\": \"tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch_delta.json\",\n  \"with_literals\": \"tests/resources/feature/transformations/column_creators/batch.json\",\n  \"cast\": \"tests/resources/feature/schema_evolution/delta_load/batch_delta_disabled.json\",\n  \"column_selector\": \"\",\n  \"flatten_schema\": \"tests/resources/feature/transformations/column_reshapers/flatten_schema/batch.json\",\n  \"explode_columns\": \"tests/resources/feature/transformations/column_reshapers/explode_arrays/batch.json\",\n  \"with_expressions\": \"tests/resources/feature/transformations/column_reshapers/flatten_schema/batch.json\",\n  \"rename\": \"tests/resources/feature/schema_evolution/append_load/batch_append_disabled.json\",\n  \"from_avro\": \"\",\n  \"from_avro_with_registry\": \"\",\n  \"from_json\": \"tests/resources/feature/transformations/column_reshapers/flatten_schema/batch.json\",\n  \"to_json\": \"tests/resources/feature/transformations/column_reshapers/flatten_schema/batch.json\",\n  \"condense_record_mode_cdc\": \"tests/resources/feature/delta_load/record_mode_cdc/backfill/batch_init.json\",\n  \"group_and_rank\": \"tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch_delta.json\",\n  \"hash_masker\": \"tests/resources/feature/transformations/data_maskers/hash_masking.json\",\n  \"column_dropper\": \"tests/resources/feature/transformations/data_maskers/drop_columns.json\",\n  \"add_current_date\": \"tests/resources/feature/transformations/date_transformers/streaming.json\",\n  \"convert_to_date\": \"tests/resources/feature/transformations/date_transformers/streaming.json\",\n  \"convert_to_timestamp\": \"tests/resources/feature/transformations/date_transformers/streaming.json\",\n  \"format_date\": \"tests/resources/feature/transformations/date_transformers/streaming.json\",\n  \"get_date_hierarchy\": \"tests/resources/feature/transformations/date_transformers/streaming.json\",\n  \"incremental_filter\": \"tests/resources/feature/delta_load/record_mode_cdc/backfill/batch_delta.json\",\n  \"expression_filter\": \"tests/resources/feature/full_load/with_filter/batch.json\",\n  \"column_filter_exp\": \"tests/resources/feature/transformations/multiple_transform/batch.json\",\n  \"join\": \"tests/resources/feature/transformations/joiners/batch.json\",\n  \"replace_nulls\": \"tests/resources/feature/transformations/null_handlers/replace_nulls_col_subset.json\",\n  \"with_regex_value\": \"tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch_delta.json\",\n  \"coalesce\": \"tests/resources/feature/writers/acons/write_batch_console.json\",\n  \"repartition\": \"tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/streaming_delta.json\",\n  \"get_transformer\": \"\",\n  \"with_watermark\": \"tests/resources/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/streaming_drop_duplicates_overall_watermark.json\"\n}"
  },
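Each key in `examples.json` names a transformer and points to the test resource that exercises it; `_link_example` prepends `base_link` to build the GitHub URL, and an empty value means no example is rendered. A small sketch of that resolution, using a trimmed copy of the mapping (only two entries reproduced here):

```python
# Trimmed copy of the mapping; the real file lives at cicd/code_doc/examples.json.
examples = {
    "base_link": "https://github.com/adidas/lakehouse-engine/blob/master/",
    "with_literals": "tests/resources/feature/transformations/column_creators/batch.json",
    "column_selector": "",  # empty entry -> no example link is rendered
}


def link_example(method_name: str) -> str | None:
    """Resolve the full GitHub URL for a transformer, or None when absent/empty."""
    file_link = examples.get(method_name)
    if not file_link:
        return None
    return examples["base_link"] + file_link


print(link_example("with_literals"))
print(link_example("column_selector"))  # None
```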
  {
    "path": "cicd/code_doc/gen_ref_nav.py",
    "content": "\"\"\"Module to generate code reference docs.\"\"\"\n\n# Import necessary libraries\nfrom pathlib import Path\nimport mkdocs_gen_files\n\n# Create a new navigation structure\nnav = mkdocs_gen_files.Nav()\n\n# Define the root directory and the source directory\nroot = Path(__file__).parent\nsrc = root / \"mkdocs/lakehouse_engine\"\n\nprint(f\"Looking for files in {src}\")\n\n# Loop over all Python files in the source directory\nfor path in sorted(src.rglob(\"*.py\")):\n    # Get the module path and the documentation path for each file\n    module_path = path.relative_to(src).with_suffix(\"\")\n    doc_path = path.relative_to(src / \"\").with_suffix(\".md\")\n    full_doc_path = Path(\"reference\", doc_path)\n\n    # Split the module path into parts\n    parts = tuple(module_path.parts)\n\n    # Skip files that start with an underscore or have no parts\n    if not parts:\n        continue\n\n    # If the file is an __init__.py file, remove the last part and rename the doc file to index.md\n    if parts[-1] == \"__init__\" and str(parts[:-1]) != \"()\":\n        parts = parts[:-1]\n        doc_path = doc_path.with_name(\"index.md\")\n        full_doc_path = full_doc_path.with_name(\"index.md\")\n    elif parts[-1].startswith(\"_\"):\n        continue\n\n    # Skip the loop iteration if there is no doc path\n    if not doc_path:\n        continue\n\n    # If the doc path has at least one part, add it to the navigation\n    if len(doc_path.parts) >= 1:\n        nav_parts = [f\"{part}\" for part in parts]\n        nav[tuple(nav_parts)] = doc_path.as_posix()\n\n        # Open the full doc path and write the module identifier to it\n        with mkdocs_gen_files.open(full_doc_path, \"w\") as fd:\n            ident = \".\".join(parts)\n            fd.write(f\"::: {ident}\")\n\n        # Set the edit path for the file\n        mkdocs_gen_files.set_edit_path(\n            full_doc_path, \"..\" / path.relative_to(root))\n\n# Open the index.md file and write the built navigation to it\nwith mkdocs_gen_files.open(\"reference/index.md\", \"w\") as nav_file:\n    nav_file.writelines(nav.build_literate_nav())"
  },
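`gen_ref_nav.py` maps every source file under `mkdocs/lakehouse_engine` to a generated doc page and a mkdocstrings identifier (the `::: pkg.module` directive it writes). A minimal sketch of just that path-to-identifier mapping, without the `mkdocs_gen_files` wiring (the file names below are illustrative):

```python
from pathlib import Path


def doc_target(module_file: Path, src: Path) -> tuple[Path, str]:
    """Map a source file to its generated doc path and mkdocstrings identifier."""
    module_path = module_file.relative_to(src).with_suffix("")
    parts = module_path.parts
    doc_path = module_path.with_suffix(".md")
    # __init__.py files become the package's index.md and use the package identifier.
    if parts[-1] == "__init__":
        parts = parts[:-1]
        doc_path = doc_path.with_name("index.md")
    return Path("reference", doc_path), ".".join(parts)


src = Path("mkdocs/lakehouse_engine")
print(doc_target(src / "algorithms" / "data_loader.py", src))
# -> reference/algorithms/data_loader.md, identifier "algorithms.data_loader"
print(doc_target(src / "algorithms" / "__init__.py", src))
# -> reference/algorithms/index.md, identifier "algorithms"
```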
  {
    "path": "cicd/code_doc/index.html.jinja2",
    "content": "{% set root_module_name = \"\" %}\n{% extends \"default/index.html.jinja2\" %}\n{% block title %}Lakehouse Engine Documentation{% endblock %}\n{% block nav %}\n    <img src=\"{{ logo }}\" class=\"logo\" alt=\"project logo\"/>\n    <input type=\"search\" placeholder=\"Search...\" role=\"searchbox\" aria-label=\"search\"\n                   pattern=\".+\" required>\n    <h2>Available Modules</h2>\n    <ul>\n        {% for submodule in all_modules if \".\" not in submodule and not submodule.startswith(\"_\") %}\n            <li><a href=\"{{ submodule.replace(\".\",\"/\") }}.html\">{{ submodule.replace(\"_\",\" \").title() }}</a></li>\n        {% endfor %}\n    </ul>\n{% endblock %}\n{% block content %}\n    <header class=\"pdoc\">\n        <h1>Lakehouse Engine Documentation</h1>\n    </header>\n    <main class=\"pdoc\">\n        {% filter to_html %}\n\n{% include \"README.md\" %}\n\n        {% endfilter %}\n    </main>\n    {% if search %}\n        {% include \"search.html.jinja2\" %}\n    {% endif %}\n{% endblock %}"
  },
  {
    "path": "cicd/code_doc/mkdocs.yml",
    "content": "site_name: Lakehouse Engine Documentation\nsite_url: https://adidas.github.io/lakehouse-engine-docs\nrepo_url: https://github.com/adidas/lakehouse-engine\nrepo_name: lakehouse-engine\ndocs_dir: \"mkdocs/docs\"\n\nnav:\n  - Lakehouse Engine: index.md\n  - How to use the Lakehouse Engine?:\n    - Overview: lakehouse_engine_usage/lakehouse_engine_usage.md\n    - Algorithms:\n      - Data Loader:\n        - Overview: lakehouse_engine_usage/data_loader/data_loader.md\n        - Scenarios:\n          - Append Load from JDBC with PERMISSIVE mode (default): lakehouse_engine_usage/data_loader/append_load_from_jdbc_with_permissive_mode/append_load_from_jdbc_with_permissive_mode.md\n          - Append Load with FAILFAST: lakehouse_engine_usage/data_loader/append_load_with_failfast/append_load_with_failfast.md\n          - Batch Delta Load Init, Delta and Backfill with Merge: lakehouse_engine_usage/data_loader/batch_delta_load_init_delta_backfill_with_merge/batch_delta_load_init_delta_backfill_with_merge.md\n          - Custom Transformer: lakehouse_engine_usage/data_loader/custom_transformer/custom_transformer.md\n          - Custom Transformer (SQL): lakehouse_engine_usage/data_loader/custom_transformer_sql/custom_transformer_sql.md\n          - Extract from SAP B4 ADSOs: lakehouse_engine_usage/data_loader/extract_from_sap_b4_adso/extract_from_sap_b4_adso.md\n          - Extract from SAP BW DSOs: lakehouse_engine_usage/data_loader/extract_from_sap_bw_dso/extract_from_sap_bw_dso.md\n          - Extract from SFTP: lakehouse_engine_usage/data_loader/extract_from_sftp/extract_from_sftp.md\n          - Extract using JDBC connection: lakehouse_engine_usage/data_loader/extract_using_jdbc_connection/extract_using_jdbc_connection.md\n          - Filtered Full Load: lakehouse_engine_usage/data_loader/filtered_full_load/filtered_full_load.md\n          - Filtered Full Load with Selective Replace: lakehouse_engine_usage/data_loader/filtered_full_load_with_selective_replace/filtered_full_load_with_selective_replace.md\n          - Flatten Schema and Explode Columns: lakehouse_engine_usage/data_loader/flatten_schema_and_explode_columns/flatten_schema_and_explode_columns.md\n          - Full Load: lakehouse_engine_usage/data_loader/full_load/full_load.md\n          - Read from Dataframe: lakehouse_engine_usage/data_loader/read_from_dataframe/read_from_dataframe.md\n          - Read from Sharepoint: lakehouse_engine_usage/data_loader/read_from_sharepoint/read_from_sharepoint.md\n          - Streaming Append Load with DROPMALFORMED: lakehouse_engine_usage/data_loader/streaming_append_load_with_malformed/streaming_append_load_with_malformed.md\n          - Streaming Append Load with Optimize Dataset Terminator: lakehouse_engine_usage/data_loader/streaming_append_load_with_terminator/streaming_append_load_with_terminator.md\n          - Streaming Delta Load with Group and Rank Condensation: lakehouse_engine_usage/data_loader/streaming_delta_load_with_group_and_rank_condensation/streaming_delta_load_with_group_and_rank_condensation.md\n          - Streaming Delta Load with Late Arriving and Out of Order Events (with and without watermarking): lakehouse_engine_usage/data_loader/streaming_delta_with_late_arriving_and_out_of_order_events/streaming_delta_with_late_arriving_and_out_of_order_events.md\n          - Write and Read Dataframe: lakehouse_engine_usage/data_loader/write_and_read_dataframe/write_and_read_dataframe.md\n          - Write to Console: 
lakehouse_engine_usage/data_loader/write_to_console/write_to_console.md\n          - Write to REST API: lakehouse_engine_usage/data_loader/write_to_rest_api/write_to_rest_api.md\n          - Write to Sharepoint: lakehouse_engine_usage/data_loader/write_to_sharepoint/write_to_sharepoint.md\n      - Data Quality:\n        - Overview: lakehouse_engine_usage/data_quality/data_quality.md\n        - Scenarios:\n          - Custom Expectations: lakehouse_engine_usage/data_quality/custom_expectations/custom_expectations.md\n          - Data Quality Validator: lakehouse_engine_usage/data_quality/data_quality_validator/data_quality_validator.md\n          - Minimal Example: lakehouse_engine_usage/data_quality/minimal_example/minimal_example.md\n          - Prisma: lakehouse_engine_usage/data_quality/prisma/prisma.md\n          - Result Sink: lakehouse_engine_usage/data_quality/result_sink/result_sink.md\n          - Row Tagging: lakehouse_engine_usage/data_quality/row_tagging/row_tagging.md\n          - Validations Failing: lakehouse_engine_usage/data_quality/validations_failing/validations_failing.md\n      - Reconciliator:\n        - Overview: lakehouse_engine_usage/reconciliator/reconciliator.md\n      - Sensors:\n          - Overview: lakehouse_engine_usage/sensors/sensors.md\n          - Sensor:\n              - Overview: lakehouse_engine_usage/sensors/sensor/sensor.md\n              - Supported Sources:\n                  - Delta Table: lakehouse_engine_usage/sensors/sensor/delta_table/delta_table.md\n                  - Sensor from other Sensor Delta Table: lakehouse_engine_usage/sensors/sensor/delta_upstream_sensor_table/delta_upstream_sensor_table.md\n                  - Sensor from Files: lakehouse_engine_usage/sensors/sensor/file/file.md\n                  - Sensor from JDBC: lakehouse_engine_usage/sensors/sensor/jdbc_table/jdbc_table.md\n                  - Sensor from Kafka: lakehouse_engine_usage/sensors/sensor/kafka/kafka.md\n                  - Sensor from SAP: lakehouse_engine_usage/sensors/sensor/sap_bw_b4/sap_bw_b4.md\n              - Update Sensor control Delta Table after processing the data: lakehouse_engine_usage/sensors/sensor/update_sensor_status/update_sensor_status.md\n          - Heartbeat Sensor:\n              - Overview: lakehouse_engine_usage/sensors/heartbeat/heartbeat.md\n              - Supported Sources:\n                  - Delta Table: lakehouse_engine_usage/sensors/heartbeat/delta_table/delta_table.md\n                  - Kafka: lakehouse_engine_usage/sensors/heartbeat/kafka/kafka.md\n                  - Manual Table: lakehouse_engine_usage/sensors/heartbeat/manual_table/manual_table.md\n                  - SAP BW/4HANA: lakehouse_engine_usage/sensors/heartbeat/sap_bw_b4/sap_bw_b4.md\n                  - Trigger File: lakehouse_engine_usage/sensors/heartbeat/trigger_file/trigger_file.md\n              - Feed Heartbeat Sensor Control Delta Table: lakehouse_engine_usage/sensors/heartbeat/heartbeat_sensor_data_feed/heartbeat_sensor_data_feed.md\n              - Update Heartbeat Sensor control Delta Table after processing the data: lakehouse_engine_usage/sensors/heartbeat/update_heartbeat_sensor_status/update_heartbeat_sensor_status.md\n      - GAB:\n        - Overview: lakehouse_engine_usage/gab/gab.md\n        - Step-by-Step: lakehouse_engine_usage/gab/step_by_step/step_by_step.md\n  - Tools:\n    - Table & File Manager Helper: lakehouse_engine_usage/managerhelper/managerhelper.md\n  - API Documentation: reference/ # (1)!\n\ntheme:\n  name: material\n  language: 
en\n  logo: assets/img/lakehouse_engine_logo.png\n  favicon: assets/img/lakehouse_engine_logo_symbol_large.png\n  icon:\n    repo: fontawesome/brands/github-alt\n  palette:\n    - media: \"(prefers-color-scheme: light)\"\n      scheme: default\n      primary: blue\n      accent: yellow\n      toggle:\n        icon: material/toggle-switch\n        name: Switch to dark mode\n    - media: \"(prefers-color-scheme: dark), (prefers-color-scheme: no-preference)\"\n      scheme: slate\n      primary: blue\n      accent: yellow\n      toggle:\n        icon: material/toggle-switch-off\n        name: Switch to light mode\n  features:\n    - content.code.annotate\n    - content.code.annotation\n    - content.code.copy\n    - content.code.select\n    - content.tabs.link\n    - content.tooltips\n    - navigation.indexes\n    - navigation.path\n    - navigation.tabs\n    - navigation.tabs.instant\n    - navigation.tabs.sticky\n    - navigation.top\n    - navigation.sections\n    - toc.follow\n    - toc.integrate\n    - search.highlight\n    - search.suggest\n\nextra:\n  social:\n    - icon: fontawesome/brands/github-alt\n      link: https://adidas.github.io/lakehouse-engine\n  version:\n    provider: mike\n    name: Version\n\nplugins:\n  - search\n  - markdown-exec\n  - offline\n  - section-index\n  - mkdocstrings:\n      enabled: !ENV [ENABLE_MKDOCSTRINGS, true]\n      default_handler: python\n      handlers:\n        python:\n          paths: [mkdocs/lakehouse_engine]\n          options:\n            show_source: true\n  - macros:\n      module_name: mkdocs_macros\n  - gen-files:\n      scripts:\n        - gen_ref_nav.py\n  - literate-nav:\n      nav_file: SUMMARY.md\n  - mike:\n      alias_type: symlink\n      canonical_version: latest\n\nextra:\n  social:\n    - icon: fontawesome/brands/github-alt\n      link: https://adidas.github.io/lakehouse-engine\n\nmarkdown_extensions:\n  - admonition\n  - attr_list\n  - extra\n  - footnotes\n  - markdown_include.include:\n      base_path: mkdocs/docs\n  - md_in_html\n  - pymdownx.arithmatex:\n      generic: true\n  - pymdownx.details\n  - pymdownx.emoji:\n      emoji_index: !!python/name:materialx.emoji.twemoji\n      emoji_generator: !!python/name:materialx.emoji.to_svg\n  - pymdownx.highlight:\n      anchor_linenums: true\n      line_spans: __span\n      pygments_lang_class: true\n  - pymdownx.inlinehilite\n  - pymdownx.mark\n  - pymdownx.tabbed:\n      alternate_style: true\n  - pymdownx.snippets\n  - pymdownx.superfences:\n      custom_fences:\n        - name: mermaid\n          class: mermaid\n          format: !!python/name:pymdownx.superfences.fence_code_format ''\n  - toc:\n      permalink: true\n\ncopyright: |\n  &copy; 2025 <a href=\"https://github.com/adidas\"  target=\"_blank\" rel=\"noopener\">adidas</a>"
  },
  {
    "path": "cicd/code_doc/mkdocs_macros.py",
    "content": "\"\"\"Macro methods to be used on Lakehouse Engine Docs.\"\"\"\nimport warnings\nimport json\nimport pygments.formatters.html\nfrom markupsafe import Markup\n\nSTACK_LEVEL = 2\n\n\ndef _search_files(file: dict, search_string: str) -> list:\n    \"\"\"Searches for a string and outputs the line.\n\n    Search for a given string in a file and output the line where it is first\n    found.\n\n    Args:\n        file: path of the file to be searched.\n        search_string: string that will be searched for.\n\n    Returns:\n        The number of the first line where a given search_string appears.\n    \"\"\"\n    range_lines = []\n    with open(file) as f:\n        for num, line in enumerate(f, 1):\n            if search_string in line:\n                range_lines.append(num - 1)\n    return range_lines[0]\n\n\ndef _link_example(method_name: str) -> str or None:\n    \"\"\"Searches for a link in a dict.\n\n    Searches for the link of a given method_name, in a specific config file and\n    outputs it.\n\n    Args:\n        method_name: name of the method to be searched for.\n\n    Returns:\n        None or the example link for the given method_name.\n    \"\"\"\n    if method_name in list(lakehouse_engine_examples.keys()):\n        file_link = lakehouse_engine_examples[str(method_name)]\n\n        return lakehouse_engine_examples[\"base_link\"] + file_link if file_link != \"\" else None\n    else:\n        warnings.warn(\n                \"No entry provided for the following transformer: \"\n                + method_name,\n                RuntimeWarning,\n                STACK_LEVEL,\n        )\n\n        return None\n\n\ndef _get_dict_transformer(dict_to_search: dict, transformer: str) -> dict:\n    \"\"\"Searches for a transformer and returns the first dictionary occurrence.\n\n    Search for a given transformer in a dictionary and return the first occurrence.\n\n    Args:\n        dict_to_search: path of the file to be searched.\n        transformer: string that will be searched for.\n\n    Returns:\n        First dictionary where a given transformer is found.\n    \"\"\"\n    dict_transformer = []\n    for spec in dict_to_search[\"transform_specs\"]:\n        for transformer_dict in spec[\"transformers\"]:\n            if transformer_dict[\"function\"] == transformer:\n                dict_transformer.append(transformer_dict)\n\n    return json.dumps(dict_transformer[0], indent=4)\n\n\ndef _highlight_examples(method_name: str) -> str or None:\n    \"\"\"Creates a code snippet.\n\n    Constructs and exposes the code snippet of a given method_name.\n\n    Args:\n        method_name: name of the module to be searched for.\n\n    Returns:\n        None or the code snippet wrapped in html tags.\n    \"\"\"\n    for key, item in lakehouse_engine_examples.items():\n        if method_name == key:\n            file_path = f\"../../{item}\"\n            if file_path == \"../../\":\n                warnings.warn(\n                    \"No unit testing for the following transformer: \" + method_name,\n                    RuntimeWarning,\n                    STACK_LEVEL,\n                    )\n                return None\n\n            first_line = _search_files(file_path, f'\"function\": \"{method_name}\"')\n            with open(file_path) as json_file:\n                acon_file = json.load(json_file)\n            code_snippet = _get_dict_transformer(acon_file, method_name)\n\n            # Defining the lexer which will parse through the snippet of code we want\n            # to 
highlight\n            lexer = pygments.lexers.JsonLexer()\n            # Defining the format that will be outputted by the pygments library\n            # (on our case it will output the code within html tags)\n            formatter = pygments.formatters.html.HtmlFormatter(\n                linenos=\"inline\",\n                anchorlinenos=True,\n            )\n            formatter.linenostart = first_line\n\n            return Markup(pygments.highlight(code_snippet, lexer, formatter))\n\n\ndef get_example(method_name: str) -> str:\n    \"\"\"Get example based on given argument.\n\n    Args:\n        method_name: name of the module to be searched for.\n\n    Returns:\n        A example.\n    \"\"\"\n    example_link = _link_example(method_name=method_name)\n    json_example = _highlight_examples(method_name=method_name)\n\n    if example_link:\n        return (\n            \"\"\"<details class=\"example\">\\n\"\"\"\n            f\"\"\"<summary>View Example of {method_name} (See full example <a href=\"{example_link}\">here</a>)</summary>\"\"\"\n            f\"\"\"<div class=\"language-json highlight\"><pre><span></span><code>{json_example}</code></pre></div>\\n\"\"\"\n            \"\"\"</details>\"\"\"\n        )\n    else:\n        return \"\"\n\n\nwith open(\"./examples.json\") as json_file:\n    lakehouse_engine_examples = json.load(json_file)\n\n\ndef format_operations_table(operations_dict: dict) -> str:\n    \"\"\"Format operations dictionary into a markdown table.\n\n    Args:\n        operations_dict: Dictionary containing operations and their parameters.\n\n    Returns:\n        A markdown formatted table with operation details.\n    \"\"\"\n    if not operations_dict:\n        return \"\"\n\n    markdown_output = \"\\n\\n**Available Operations:**\\n\\n\"\n    markdown_output += \"| Operation | Parameters | Type | Mandatory |\\n\"\n    markdown_output += \"|-----------|------------|------|----------|\\n\"\n\n    for operation, params in sorted(operations_dict.items()):\n        if not params:\n            markdown_output += f\"| `{operation}` | - | - | - |\\n\"\n        else:\n            first_param = True\n            for param_name, param_info in params.items():\n                if first_param:\n                    markdown_output += f\"| `{operation}` | `{param_name}` | {param_info.get('type', 'N/A')} | {param_info.get('mandatory', False)} |\\n\"\n                    first_param = False\n                else:\n                    markdown_output += f\"|  | `{param_name}` | {param_info.get('type', 'N/A')} | {param_info.get('mandatory', False)} |\\n\"\n\n    return markdown_output\n\n\ndef get_table_manager_operations() -> str:\n    \"\"\"Get formatted table of TableManager operations.\n\n    Returns:\n        A markdown formatted table with TableManager operations.\n    \"\"\"\n    from lakehouse_engine.core.definitions import TABLE_MANAGER_OPERATIONS\n    return format_operations_table(TABLE_MANAGER_OPERATIONS)\n\n\ndef get_file_manager_operations() -> str:\n    \"\"\"Get formatted table of FileManager operations.\n\n    Returns:\n        A markdown formatted table with FileManager operations.\n    \"\"\"\n    from lakehouse_engine.core.definitions import FILE_MANAGER_OPERATIONS\n    return format_operations_table(FILE_MANAGER_OPERATIONS)\n\n\ndef define_env(env):\n    \"Declare environment for jinja2 templates for markdown\"\n\n    for fn in [get_example, get_table_manager_operations, get_file_manager_operations]:\n        env.macro(fn)\n\n    # get mkdocstrings' Python 
handler\n    python_handler = env.conf[\"plugins\"][\"mkdocstrings\"].get_handler(\"python\")\n\n    # get the `update_env` method of the Python handler\n    update_env = python_handler.update_env\n\n    # override the `update_env` method of the Python handler\n    def patched_update_env(config):\n        update_env(config)\n\n        # get the `convert_markdown` filter of the env\n        convert_markdown = python_handler.env.filters[\"convert_markdown\"]\n\n        # build a chimera made of macros+mkdocstrings\n        def render_convert(markdown: str, *args, **kwargs):\n            return convert_markdown(env.render(markdown), *args, **kwargs)\n\n        # patch the filter\n        python_handler.env.filters[\"convert_markdown\"] = render_convert\n\n    # patch the method\n    python_handler.update_env = patched_update_env\n"
  },
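`format_operations_table` flattens an operations dictionary into a markdown table, printing the operation name only on its first parameter row. A standalone sketch mirroring that row-building logic (the operations dict below is hypothetical; the real ones are `TABLE_MANAGER_OPERATIONS` and `FILE_MANAGER_OPERATIONS` from `lakehouse_engine.core.definitions`):

```python
# Hypothetical operations dict: operation name -> {param name -> {"type", "mandatory"}}.
operations = {
    "vacuum": {},
    "create": {
        "table": {"type": "str", "mandatory": True},
        "disable_dbfs_retry": {"type": "bool", "mandatory": False},
    },
}

lines = [
    "| Operation | Parameters | Type | Mandatory |",
    "|-----------|------------|------|----------|",
]
for operation, params in sorted(operations.items()):
    if not params:
        # Operations without parameters get a single placeholder row.
        lines.append(f"| `{operation}` | - | - | - |")
        continue
    for i, (name, info) in enumerate(params.items()):
        # Only the first parameter row repeats the operation name.
        op_cell = f"`{operation}`" if i == 0 else ""
        lines.append(
            f"| {op_cell} | `{name}` | {info.get('type', 'N/A')} | {info.get('mandatory', False)} |"
        )

print("\n".join(lines))
```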
  {
    "path": "cicd/code_doc/module.html.jinja2",
    "content": "{#\nOn this Jinja template we're extending a pre-existing template,\ncopying the block on which we would like to make changes and\nadding both the \"View Example\" summary tag and the \"View Full Acon\" button.\n#}\n{% extends \"default/module.html.jinja2\" %}\n{% block title %}{{ module.modulename }}{% endblock %}\n{% block nav_submodules %}\n        {% if module.submodules %}\n            <h2>Submodules</h2>\n            <ul>\n                {% for submodule in module.submodules if is_public(submodule) | trim %}\n                    <li><a href=\"./{{ module.name }}/{{ submodule.name }}.html\">{{ submodule.name.replace(\"_\",\" \").title() }}</a></li>\n                {% endfor %}\n            </ul>\n        {% endif %}\n    {% endblock %}\n{% block module_contents %}\n    {% for m in module.flattened_own_members if is_public(m) | trim %}\n        <section id=\"{{ m.qualname or m.name }}\">\n            {{ member(m) }}\n            {% if m.type == \"class\" %}\n                {% for m in m.own_members if m.type != \"class\" and is_public(m) | trim %}\n                    <div id=\"{{ m.qualname }}\" class=\"classattr\">\n                        {{ member(m) }}\n                        {% if m.fullname | highlight_examples %}\n                            {{ view_example(m.fullname) }}\n                        {% endif %}\n                        {% if m.fullname | link_example %}\n                            {{ view_full_acon(m.fullname) }}\n                        {% endif %}\n                    </div>\n                {% endfor %}\n                {% set inherited_members = inherited(m) | trim %}\n                {% if inherited_members %}\n                    <div class=\"inherited\">\n                        <h5>Inherited Members</h5>\n                        <dl>\n                            {{ inherited_members }}\n                        </dl>\n                    </div>\n                {% endif %}\n            {% endif %}\n        </section>\n    {% endfor %}\n{% endblock %}\n{% block attribution %}\n{% endblock %}\n\n{% block module_info %}\n    <section class=\"module-info\">\n        {% block edit_button %}\n            {% if edit_url %}\n                {% if \"github.com\" in edit_url %}\n                    {% set edit_text = \"Edit on GitHub\" %}\n                {% elif \"gitlab\" in edit_url %}\n                    {% set edit_text = \"Edit on GitLab\" %}\n                {% else %}\n                    {% set edit_text = \"Edit Source\" %}\n                {% endif %}\n                <a class=\"pdoc-button git-button\" href=\"{{ edit_url }}\">{{ edit_text }}</a>\n            {% endif %}\n        {% endblock %}\n\n        {% if \"lakehouse_engine\" == module.modulename.split(\".\")[0] %}\n            {{ module_name() }}\n        {% endif %}\n        {{ docstring(module) }}\n        {% if \"lakehouse_engine\" == module.modulename.split(\".\")[0] %}\n            {{ view_source_state(module) }}\n            {{ view_source_button(module) }}\n            {{ view_source_code(module) }}\n        {% endif %}\n    </section>\n{% endblock %}\n\n{#\nOn this macro we're creating the \"View Example\" structure.\n#}\n{% defaultmacro view_example(doc) %}\n    <details>\n    <summary>View Example</summary>\n    {{ doc | highlight_examples }}\n    </details>\n{% enddefaultmacro %}\n\n{#\nOn this macro we're creating the \"View Full Acon\" structure.\n#}\n{% defaultmacro view_full_acon(doc) %}\n    <section>\n        {% set edit_text = \"View Full Acon\" %}\n        <a 
class=\"pdoc-button git-button\" href=\"{{ doc | link_example }}\" target=\"_blank\">{{ edit_text }}</a>\n    </section>\n    </br>\n    </br>\n{% enddefaultmacro %}\n"
  },
  {
    "path": "cicd/code_doc/render_doc.py",
    "content": "\"\"\"Module for customizing pdoc documentation.\"\"\"\n\nimport json\nimport os\nimport shutil\nimport warnings\nfrom pathlib import Path\n\nimport pygments.formatters.html\nfrom markupsafe import Markup\nfrom pdoc import pdoc, render\n\nSTACK_LEVEL = 2\n\nlogo_path = (\n    \"https://github.com/adidas/lakehouse-engine/blob/master/assets/img/\"\n    \"lakehouse_engine_logo_no_bg_160.png?raw=true\"\n)\n\n\ndef _get_project_version() -> str:\n    version = (\n        os.popen(\n            \"cat cicd/.bumpversion.cfg | grep 'current_version =' | cut -f 3 -d ' '\"\n        )\n        .read()\n        .replace(\"\\n\", \"\")\n    )\n    return version\n\n\ndef _search_files(file: dict, search_string: str) -> list:\n    \"\"\"Searches for a string and outputs the line.\n\n    Search for a given string in a file and output the line where it is first\n    found.\n\n    :param file: path of the file to be searched.\n    :param search_string: string that will be searched for.\n\n    :returns: the number of the first line where a given search_string appears.\n    \"\"\"\n    range_lines = []\n    with open(file) as f:\n        for num, line in enumerate(f, 1):\n            if search_string in line:\n                range_lines.append(num - 1)\n    return range_lines[0]\n\n\ndef _get_dict_transformer(dict_to_search: dict, transformer: str) -> dict:\n    \"\"\"Searches for a transformer and returns the first dictionary occurrence.\n\n    Search for a given transformer in a dictionary and return the first occurrence.\n\n    :param dict_to_search: path of the file to be searched.\n    :param transformer: string that will be searched for.\n\n    :returns: first dictionary where a given transformer is found.\n    \"\"\"\n    dict_transformer = []\n    for spec in dict_to_search[\"transform_specs\"]:\n        for transformer_dict in spec[\"transformers\"]:\n            if transformer_dict[\"function\"] == transformer:\n                dict_transformer.append(transformer_dict)\n    return json.dumps(dict_transformer[0], indent=4)\n\n\ndef _link_example(module_name: str) -> str or None:\n    \"\"\"Searches for a link in a dict.\n\n    Searches for the link of a given module_name, in a specific config file and\n    outputs it.\n\n    :param module_name: name of the module to be searched for.\n\n    :returns: None or the example link for the given module_name.\n    \"\"\"\n    if module_name in list(link_dict.keys()):\n        file_link = link_dict[str(module_name)]\n        return link_dict[\"base_link\"] + file_link if file_link != \"\" else None\n    else:\n        return None\n\n\ndef _highlight_examples(module_name: str) -> str or None:\n    \"\"\"Creates a code snippet.\n\n    Constructs and exposes the code snippet of a given module_name.\n\n    :param module_name: name of the module to be searched for.\n\n    :returns: None or the code snippet wrapped in html tags.\n    \"\"\"\n    transformers_to_ignore = [\n        \"UNSUPPORTED_STREAMING_TRANSFORMERS\",\n        \"AVAILABLE_TRANSFORMERS\",\n        \"__init__\",\n    ]\n    if module_name.split(\".\")[1] == \"transformers\":\n        if module_name not in list(link_dict.keys()):\n            if module_name.split(\".\")[-1] not in list(transformers_to_ignore):\n                warnings.warn(\n                    \"No entry provided for the following transformer: \"\n                    + module_name.split(\".\")[-1],\n                    RuntimeWarning,\n                    STACK_LEVEL,\n                )\n                return 
None\n\n    for key, item in link_dict.items():\n        if module_name == key:\n            file_path = f\"./{item}\"\n            transformer = key.split(\".\")[-1].lower()\n            if file_path == \"./\":\n                warnings.warn(\n                    \"No unit testing for the following transformer: \" + transformer,\n                    RuntimeWarning,\n                    STACK_LEVEL,\n                )\n                return None\n\n            first_line = _search_files(file_path, f'\"function\": \"{transformer}\"')\n            with open(file_path) as json_file:\n                acon_file = json.load(json_file)\n            code_snippet = _get_dict_transformer(acon_file, transformer)\n            # Defining the lexer which will parse through the snippet of code we want\n            # to highlight\n            lexer = pygments.lexers.JsonLexer()\n            # Defining the format that will be outputted by the pygments library\n            # (on our case it will output the code within html tags)\n            formatter = pygments.formatters.html.HtmlFormatter(\n                cssclass=\"pdoc-code codehilite\",\n                linenos=\"inline\",\n                anchorlinenos=True,\n            )\n            formatter.linenostart = first_line\n            return Markup(pygments.highlight(code_snippet, lexer, formatter))\n\n\nwith open(\"./cicd/code_doc/examples.json\") as json_file:\n    link_dict = json.load(json_file)\n\n# Adding our custom filters to jinja environment\nenv_jinja = render.env\nenv_jinja.filters[\"link_example\"] = _link_example\nenv_jinja.filters[\"highlight_examples\"] = _highlight_examples\n\n\nroot_path = Path(__file__).parents[2]\ndocumentation_path = root_path / \"artefacts\" / \"docs\"\n# Tell pdoc's render to use our jinja template\nrender.configure(\n    template_directory=root_path / \"cicd\" / \"code_doc\" / \".\",\n    docformat=\"google\",\n    logo=logo_path,\n    favicon=logo_path,\n    footer_text=f\"Lakehouse Engine v{_get_project_version()}\",\n    mermaid=True,\n)\n# Temporarily copy README file to be used in index.html page\nshutil.copyfile(\"README.md\", root_path / \"cicd\" / \"code_doc\" / \"README.md\")\n\n# Render pdoc's documentation into artefacts/docs\npdoc(\n    \"./lakehouse_engine/\",\n    \"./lakehouse_engine_usage/\",\n    output_directory=documentation_path,\n)\n\n# Copy the images used on the documentation, to the path where we have the rendered\n# html pages.\nshutil.copytree(\"./assets\", documentation_path / \"assets\", dirs_exist_ok=True)\n\n# Remove the temporary copy README file\nos.remove(root_path / \"cicd\" / \"code_doc\" / \"README.md\")\n"
  },
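`_get_project_version` shells out to a `cat | grep | cut` pipeline to read `current_version`. The same value can be read with `configparser`, assuming `cicd/.bumpversion.cfg` keeps the standard bump2version layout with a `[bumpversion]` section (a sketch, not the script's actual implementation):

```python
import configparser


def get_project_version(cfg_path: str = "cicd/.bumpversion.cfg") -> str:
    """Read current_version from the bump2version config without shelling out."""
    config = configparser.ConfigParser()
    config.read(cfg_path)
    return config["bumpversion"]["current_version"]


# Example usage (run from the repository root):
# print(get_project_version())
```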
  {
    "path": "cicd/code_doc/render_docs.py",
    "content": "\"\"\"Module for customizing mkdocs documentation.\"\"\"\n\n# Import necessary libraries\nimport os\nimport shutil\nfrom pathlib import Path\n\n# Define the root directory and the necessary directories\nroot_path = Path(__file__).parents[2]\ncode_doc_path = root_path / \"cicd\" / \"code_doc\"\nmkdocs_base_path = code_doc_path / \"mkdocs\"\nmkdocs_build_path = mkdocs_base_path / \"docs\"\ndocumentation_path = root_path / \"artefacts\" / \"docs\"\n\n# Files and directories to be copied to build the mkdocs documentation\ndocumentation_to_copy = {\n    \"directories_to_copy\": [\n        {\n            \"source\": root_path / \"lakehouse_engine_usage\",\n            \"target\": mkdocs_build_path / \"lakehouse_engine_usage\",\n        },\n        {\n            \"source\": root_path / \"lakehouse_engine\",\n            \"target\": mkdocs_base_path / \"lakehouse_engine\" / \"packages\",\n        },\n        {\n            \"source\": \"./assets\",\n            \"target\": mkdocs_build_path / \"assets\",\n        },\n    ],\n    \"files_to_copy\": [\n        {\n            \"source\": \"README.md\",\n            \"target\": mkdocs_build_path / \"index.md\",\n        },\n        {\n            \"source\": \"pyproject.toml\",\n            \"target\": mkdocs_build_path / \"pyproject.toml\",\n        },\n    ],\n}\n\n\ndef _copy_documentation(directories: list = \"\", files: list = \"\"):\n    \"\"\"Copy files to other directory based on given parameters.\n\n    Args:\n        directories (list): list of directories to copy.\n        files (list): list of files to copy.\n    \"\"\"\n    if directories:\n        for directory in directories:\n            shutil.copytree(\n                directory.get(\"source\"), directory.get(\"target\"), dirs_exist_ok=True\n            )\n    if files:\n        for file in files:\n            shutil.copyfile(file.get(\"source\"), file.get(\"target\"))\n\n\n_copy_documentation(\n    directories=documentation_to_copy.get(\"directories_to_copy\"),\n    files=documentation_to_copy.get(\"files_to_copy\"),\n)\n\n# Use mkdocs build command to build the documentation into the \"site\" folder\nos.system(f\"cd {code_doc_path} && mkdocs build --site-dir {documentation_path}/site\")\n\n# Remove the temporary docs directory mkdocs_base_path\nshutil.rmtree(mkdocs_base_path)\n"
  },
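The build step above uses `os.system`, which ignores the exit code of `mkdocs build`. A sketch of the equivalent call with `subprocess.run`, assuming it sits inside `render_docs.py` so the path variables resolve the same way; with `check=True` a failed build raises instead of silently producing no site:

```python
import subprocess
from pathlib import Path

root_path = Path(__file__).parents[2]
code_doc_path = root_path / "cicd" / "code_doc"
site_dir = root_path / "artefacts" / "docs" / "site"

# check=True raises CalledProcessError if mkdocs build fails, so CI does not
# publish a half-built site.
subprocess.run(
    ["mkdocs", "build", "--site-dir", str(site_dir)],
    cwd=code_doc_path,
    check=True,
)
```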
  {
    "path": "cicd/flake8.conf",
    "content": "[flake8]\nmax-line-length = 88\nextend-ignore = E203\ninline-quotes=double\ndocstring-quotes=\"\"\"\nmax-expression-complexity=11\nmax-cognitive-complexity=15\n# there is a python module with same name as io engine module, so\n# we need to ignore this error\nper-file-ignores =\n    lakehouse_engine/io/__init__.py:A005"
  },
  {
    "path": "cicd/meta.yaml",
    "content": "dev_deploy_bucket: s3://sample-dev-bucket\nprod_deploy_bucket: s3://sample-prod-bucket\narm_python_image: arm64v8/python:3.12-slim-bullseye\namd_python_image: python:3.12-slim-bullseye\nengine_docs: https://adidas.github.io/lakehouse-engine-docs/lakehouse_engine.html\ncode_url: https://github.com/adidas/lakehouse-engine"
  },
  {
    "path": "cicd/requirements.txt",
    "content": "# The main dependencies without which the core functionalities of the project will not work.\n# These dependencies are not optional and are always installed when people install the lakehouse-engine library.\n#\n# ! Do not forget running `make build-lock-files` after updating dependency list !\n#\nboto3==1.40.23\nJinja2==3.1.6\npyyaml==6.0.2\npendulum==3.1.0\nimportlib-resources==6.5.2"
  },
  {
    "path": "cicd/requirements_azure.txt",
    "content": "# Dependencies necessary for azure related features to work (ex: mail notifications using o365).\n#\n# ! Do not forget running `make build-lock-files` after updating dependency list !\n#\nmsgraph-sdk==1.40.0\naiohttp==3.13.3 # msgraph-sdk uses a version with known vulnerabilities\nh2==4.3.0 # msgraph-sdk uses a version with known vulnerabilities\nazure-core==1.38.0\nnest-asyncio==1.6.0\nmsal==1.32.3\nurllib3==2.6.3 # msal uses a version with known vulnerabilities\n# Fixing the version to solve known vulnerabilities\nrequests==2.32.4 # when updating also update in all files"
  },
  {
    "path": "cicd/requirements_cicd.txt",
    "content": "# Dependencies necessary for the Lakehouse Engine CICD (tests, linting, deployment,...).\n#\n# ! Do not forget running `make build-lock-files` after updating dependency list !\n#\n\n# cicd\npytest==8.4.1\npytest-cov==6.2.1\nisort==6.0.1\nflake8==7.3.0\nflake8-black==0.3.6\nblack==24.4.0 # fixed because flake8-black points always to the latest black\nflake8-builtins==3.0.0\nflake8-bugbear==24.12.12\nflake8-isort==6.1.2\nflake8-comprehensions==3.16.0\nflake8-docstrings==1.7.0\nflake8-eradicate==1.5.0\nflake8-quotes==3.4.0\nflake8-mutable==1.2.0\nflake8-cognitive-complexity==0.1.0\nflake8-expression-complexity==0.0.11\nmypy==1.17.1\nbandit==1.8.6\nbump2version==1.0.1\nlxml==6.0.0\npytest-sftpserver==1.3.0\npip-tools==7.5.0\npip-audit==2.10.0\ncachecontrol==0.14.4\nfilelock==3.20.3\nbuild==1.3.0\naiosmtpd==1.4.6\n\n# docs\ndistlib==0.3.6\nghp-import==2.1.0\ngriffe==1.15.0\nMarkdown==3.10\nmarkdown-callouts==0.4.0\nmarkdown-exec==1.12.1\nmarkdown-include==0.8.1\nmergedeep==1.3.4\nmike==2.0.0\nmkdocs==1.6.1\nmkdocs-autorefs==1.4.3\nmkdocs-material==9.7.1\nmkdocs-material-extensions==1.3.1\nmkdocstrings-crystal==0.3.9\nmkdocs-macros-plugin==1.5.0\nmkdocstrings-python==2.0.1\nmkdocstrings[python]==1.0.0\nmkdocs-gen-files==0.6.0\nmkdocs-section-index==0.3.10\nmkdocs-literate-nav==0.6.2\npymdown-extensions==10.20\npyyaml_env_tag==0.1\nregex==2023.6.3\nwatchdog==3.0.0\n# Fixing the version to solve known vulnerabilities\nrequests==2.32.4 # when updating also update in all files\n\n\n# types\ntypes-boto3==1.40.23\ntypes-paramiko==2.12.0\ntypes-requests<2.31.0.7\n\n# test\nmoto==4.2.14\nWerkzeug==3.1.6\n\n# deploy to pypi\ntwine==5.1.1\n"
  },
  {
    "path": "cicd/requirements_dq.txt",
    "content": "# Dependencies necessary for the Data Quality features to work.\n#\n# ! Do not forget running `make build-lock-files` after updating dependency list !\n#\ngreat-expectations==1.11.0\nmarshmallow==3.26.2\n# Note: Numpy is not a direct dependency.\n# It is included temporarily to prevent version conflicts.\n#numpy==1.26.4 #dbr17 uses 2.1.3\n# Fixing the version to solve known vulnerabilities\nrequests==2.32.4 # when updating also update in all files dbr17 uses 2.32.3\n"
  },
  {
    "path": "cicd/requirements_os.txt",
    "content": "# Special requirements from which the project depends on, but for which some use cases might use environments with\n# these dependencies pre-installed from the vendors. Thus, they are delivered as optional OS dependencies.\n#\n# ! Do not forget running `make build-lock-files` after updating dependency list !\n#\npyspark==4.0.0\ndelta-spark==4.0.0"
  },
  {
    "path": "cicd/requirements_sftp.txt",
    "content": "#\n# ! Do not forget running `make build-lock-files` after updating dependency list !\n#\nparamiko==4.0.0\npynacl==1.6.2"
  },
  {
    "path": "cicd/requirements_sharepoint.txt",
    "content": "#\n# ! Do not forget running `make build-lock-files` after updating dependency list !\n#\ntenacity==9.0.0\nmsal==1.32.3\nazure-core==1.38.0"
  },
  {
    "path": "lakehouse_engine/__init__.py",
    "content": "\"\"\"Lakehouse engine package containing all the system subpackages.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/algorithms/__init__.py",
    "content": "\"\"\"Package containing all the lakehouse engine algorithms.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/algorithms/algorithm.py",
    "content": "\"\"\"Module containing the Algorithm class.\"\"\"\n\nfrom typing import List, Tuple\n\nfrom lakehouse_engine.core.definitions import (\n    DQDefaults,\n    DQFunctionSpec,\n    DQSpec,\n    OutputFormat,\n)\nfrom lakehouse_engine.core.executable import Executable\n\n\nclass Algorithm(Executable):\n    \"\"\"Class to define the behavior of every algorithm based on ACONs.\"\"\"\n\n    def __init__(self, acon: dict):\n        \"\"\"Construct Algorithm instances.\n\n        Args:\n            acon: algorithm configuration.\n        \"\"\"\n        self.acon = acon\n\n    @classmethod\n    def get_dq_spec(\n        cls, spec: dict\n    ) -> Tuple[DQSpec, List[DQFunctionSpec], List[DQFunctionSpec]]:\n        \"\"\"Get data quality specification object from acon.\n\n        Args:\n            spec: data quality specifications.\n\n        Returns:\n            The DQSpec and the List of DQ Functions Specs.\n        \"\"\"\n        dq_spec = DQSpec(\n            spec_id=spec[\"spec_id\"],\n            input_id=spec[\"input_id\"],\n            dq_type=spec[\"dq_type\"],\n            dq_functions=[],\n            dq_db_table=spec.get(\"dq_db_table\"),\n            dq_table_table_filter=spec.get(\"dq_table_table_filter\"),\n            dq_table_extra_filters=spec.get(\n                \"dq_table_extra_filters\", DQSpec.dq_table_extra_filters\n            ),\n            execution_point=spec.get(\"execution_point\"),\n            unexpected_rows_pk=spec.get(\n                \"unexpected_rows_pk\", DQSpec.unexpected_rows_pk\n            ),\n            gx_result_format=spec.get(\"gx_result_format\", DQSpec.gx_result_format),\n            tbl_to_derive_pk=spec.get(\"tbl_to_derive_pk\", DQSpec.tbl_to_derive_pk),\n            tag_source_data=spec.get(\"tag_source_data\", DQSpec.tag_source_data),\n            data_asset_name=spec.get(\"data_asset_name\", DQSpec.data_asset_name),\n            expectation_suite_name=spec.get(\n                \"expectation_suite_name\", DQSpec.expectation_suite_name\n            ),\n            store_backend=spec.get(\"store_backend\", DQDefaults.STORE_BACKEND.value),\n            local_fs_root_dir=spec.get(\"local_fs_root_dir\", DQSpec.local_fs_root_dir),\n            bucket=spec.get(\"bucket\", DQSpec.bucket),\n            checkpoint_store_prefix=spec.get(\n                \"checkpoint_store_prefix\", DQDefaults.CHECKPOINT_STORE_PREFIX.value\n            ),\n            expectations_store_prefix=spec.get(\n                \"expectations_store_prefix\",\n                DQDefaults.EXPECTATIONS_STORE_PREFIX.value,\n            ),\n            validations_store_prefix=spec.get(\n                \"validations_store_prefix\",\n                DQDefaults.VALIDATIONS_STORE_PREFIX.value,\n            ),\n            result_sink_db_table=spec.get(\n                \"result_sink_db_table\", DQSpec.result_sink_db_table\n            ),\n            result_sink_location=spec.get(\n                \"result_sink_location\", DQSpec.result_sink_location\n            ),\n            processed_keys_location=spec.get(\n                \"processed_keys_location\", DQSpec.processed_keys_location\n            ),\n            result_sink_partitions=spec.get(\n                \"result_sink_partitions\", DQSpec.result_sink_partitions\n            ),\n            result_sink_chunk_size=spec.get(\n                \"result_sink_chunk_size\", DQSpec.result_sink_chunk_size\n            ),\n            result_sink_format=spec.get(\n                \"result_sink_format\", 
OutputFormat.DELTAFILES.value\n            ),\n            result_sink_options=spec.get(\n                \"result_sink_options\", DQSpec.result_sink_options\n            ),\n            result_sink_explode=spec.get(\n                \"result_sink_explode\", DQSpec.result_sink_explode\n            ),\n            result_sink_extra_columns=spec.get(\"result_sink_extra_columns\", []),\n            source=spec.get(\"source\", spec[\"input_id\"]),\n            fail_on_error=spec.get(\"fail_on_error\", DQSpec.fail_on_error),\n            cache_df=spec.get(\"cache_df\", DQSpec.cache_df),\n            critical_functions=spec.get(\n                \"critical_functions\", DQSpec.critical_functions\n            ),\n            max_percentage_failure=spec.get(\n                \"max_percentage_failure\", DQSpec.max_percentage_failure\n            ),\n            enable_row_condition=spec.get(\n                \"enable_row_condition\", DQSpec.enable_row_condition\n            ),\n        )\n\n        dq_functions = cls._get_dq_functions(spec, \"dq_functions\")\n\n        critical_functions = cls._get_dq_functions(spec, \"critical_functions\")\n\n        cls._validate_dq_tag_strategy(dq_spec)\n\n        return dq_spec, dq_functions, critical_functions\n\n    @staticmethod\n    def _get_dq_functions(spec: dict, function_key: str) -> List[DQFunctionSpec]:\n        \"\"\"Get DQ Functions from a DQ Spec, based on a function_key.\n\n        Args:\n            spec: data quality specifications.\n            function_key: dq function key (\"dq_functions\" or\n                \"critical_functions\").\n\n        Returns:\n            a list of DQ Function Specs.\n        \"\"\"\n        functions = []\n\n        if spec.get(function_key, []):\n            for f in spec.get(function_key, []):\n                dq_fn_spec = DQFunctionSpec(\n                    function=f[\"function\"],\n                    args=f.get(\"args\", {}),\n                )\n                functions.append(dq_fn_spec)\n\n        return functions\n\n    @staticmethod\n    def _validate_dq_tag_strategy(spec: DQSpec) -> None:\n        \"\"\"Validate DQ Spec arguments related with the data tagging strategy.\n\n        Args:\n            spec: data quality specifications.\n        \"\"\"\n        if spec.tag_source_data:\n            spec.gx_result_format = DQSpec.gx_result_format\n            spec.fail_on_error = False\n            spec.result_sink_explode = DQSpec.result_sink_explode\n        elif spec.gx_result_format != DQSpec.gx_result_format:\n            spec.tag_source_data = False\n"
  },
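`Algorithm.get_dq_spec` turns one entry of an ACON's `dq_specs` into a `DQSpec` plus the lists of regular and critical DQ function specs. A hedged sketch of calling it with a minimal spec dict; the `dq_type` value, expectation name and arguments are illustrative only, and the snippet assumes the lakehouse-engine package is installed:

```python
from lakehouse_engine.algorithms.algorithm import Algorithm

# Minimal, illustrative dq_spec entry as it would appear inside an ACON's "dq_specs".
spec = {
    "spec_id": "dq_sales",
    "input_id": "sales",
    "dq_type": "validator",
    "dq_functions": [
        {"function": "expect_column_values_to_not_be_null", "args": {"column": "order_id"}},
    ],
}

dq_spec, dq_functions, critical_functions = Algorithm.get_dq_spec(spec)
print(dq_spec.spec_id, dq_spec.dq_type)
print([f.function for f in dq_functions])  # regular DQ function specs
print(critical_functions)                  # empty: no "critical_functions" key given
```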
  {
    "path": "lakehouse_engine/algorithms/data_loader.py",
    "content": "\"\"\"Module to define DataLoader class.\"\"\"\n\nfrom collections import OrderedDict\nfrom copy import deepcopy\nfrom logging import Logger\nfrom typing import List, Optional\n\nfrom lakehouse_engine.algorithms.algorithm import Algorithm\nfrom lakehouse_engine.core.definitions import (\n    DQFunctionSpec,\n    DQSpec,\n    DQType,\n    InputSpec,\n    MergeOptions,\n    OutputFormat,\n    OutputSpec,\n    ReadType,\n    SharepointOptions,\n    TerminatorSpec,\n    TransformerSpec,\n    TransformSpec,\n)\nfrom lakehouse_engine.dq_processors.exceptions import DQDuplicateRuleIdException\nfrom lakehouse_engine.io.reader_factory import ReaderFactory\nfrom lakehouse_engine.io.writer_factory import WriterFactory\nfrom lakehouse_engine.terminators.notifier_factory import NotifierFactory\nfrom lakehouse_engine.terminators.terminator_factory import TerminatorFactory\nfrom lakehouse_engine.transformers.transformer_factory import TransformerFactory\nfrom lakehouse_engine.utils.dq_utils import PrismaUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass DataLoader(Algorithm):\n    \"\"\"Load data using an algorithm configuration (ACON represented as dict).\n\n    This algorithm focuses on the cases where users will be specifying all the algorithm\n    steps and configurations through a dict based configuration, which we name ACON\n    in our framework.\n\n    Since an ACON is a dict you can pass a custom transformer through a python function\n    and, therefore, the DataLoader can also be used to load data with custom\n    transformations not provided in our transformers package.\n\n    As the algorithm base class of the lakehouse-engine framework is based on the\n    concept of ACON, this DataLoader algorithm simply inherits from Algorithm,\n    without overriding anything. We designed the codebase like this to avoid\n    instantiating the Algorithm class directly, which was always meant to be an\n    abstraction for any specific algorithm included in the lakehouse-engine framework.\n    \"\"\"\n\n    def __init__(self, acon: dict):\n        \"\"\"Construct DataLoader algorithm instances.\n\n        A data loader needs several specifications to work properly,\n        but some of them might be optional. 
The available specifications are:\n\n        - input specifications (mandatory): specify how to read data.\n        - transform specifications (optional): specify how to transform data.\n        - data quality specifications (optional): specify how to execute the data\n            quality process.\n        - output specifications (mandatory): specify how to write data to the\n            target.\n        - terminate specifications (optional): specify what to do after writing into\n            the target (e.g., optimizing target table, vacuum, compute stats, etc).\n\n        Args:\n            acon: algorithm configuration.\n        \"\"\"\n        self._logger: Logger = LoggingHandler(self.__class__.__name__).get_logger()\n        super().__init__(acon)\n        self.input_specs: List[InputSpec] = self._get_input_specs()\n        # the streaming transformers plan is needed to future change the\n        # execution specification to accommodate streaming mode limitations in invoking\n        # certain functions (e.g., sort, window, generate row ids/auto increments, ...).\n        self._streaming_micro_batch_transformers_plan: dict = {}\n        self.transform_specs: List[TransformSpec] = self._get_transform_specs()\n        # our data quality process is not compatible with streaming mode, hence we\n        # have to run it in micro batches, similar to what happens to certain\n        # transformation functions not supported in streaming mode.\n        self._streaming_micro_batch_dq_plan: dict = {}\n        self.dq_specs: List[DQSpec] = self._get_dq_specs()\n        self.output_specs: List[OutputSpec] = self._get_output_specs()\n        self.terminate_specs: List[TerminatorSpec] = self._get_terminate_specs()\n\n    def read(self) -> OrderedDict:\n        \"\"\"Read data from an input location into a distributed dataframe.\n\n        Returns:\n             An ordered dict with all the dataframes that were read.\n        \"\"\"\n        read_dfs: OrderedDict = OrderedDict({})\n        for spec in self.input_specs:\n            self._logger.info(f\"Found input specification: {spec}\")\n            read_dfs[spec.spec_id] = ReaderFactory.get_data(spec)\n        return read_dfs\n\n    def transform(self, data: OrderedDict) -> OrderedDict:\n        \"\"\"Transform (optionally) the data that was read.\n\n        If there isn't a transformation specification this step will be skipped, and the\n        original dataframes that were read will be returned.\n        Transformations can have dependency from another transformation result, however\n        we need to keep in mind if we are using streaming source and for some reason we\n        need to enable micro batch processing, this result cannot be used as input to\n        another transformation. 
Micro batch processing in pyspark streaming is only\n        available in .write(), which means this transformation with micro batch needs\n        to be the end of the process.\n\n        Args:\n            data: input dataframes in an ordered dict.\n\n        Returns:\n            Another ordered dict with the transformed dataframes, according to the\n            transformation specification.\n        \"\"\"\n        if not self.transform_specs:\n            return data\n        else:\n            transformed_dfs = OrderedDict(data)\n            for spec in self.transform_specs:\n                self._logger.info(f\"Found transform specification: {spec}\")\n                transformed_df = transformed_dfs[spec.input_id]\n                for transformer in spec.transformers:\n                    transformed_df = transformed_df.transform(\n                        TransformerFactory.get_transformer(transformer, transformed_dfs)\n                    )\n                transformed_dfs[spec.spec_id] = transformed_df\n            return transformed_dfs\n\n    def process_dq(\n        self, data: OrderedDict\n    ) -> tuple[OrderedDict, Optional[dict[str, str]]]:\n        \"\"\"Process the data quality tasks for the data that was read and/or transformed.\n\n        It supports multiple input dataframes. Although just one is advisable.\n\n        It is possible to use data quality validators/expectations that will validate\n        your data and fail the process in case the expectations are not met. The DQ\n        process also generates and keeps updating a site containing the results of the\n        expectations that were done on your data. The location of the site is\n        configurable and can either be on file system or S3. If you define it to be\n        stored on S3, you can even configure your S3 bucket to serve the site so that\n        people can easily check the quality of your data. 
Moreover, it is also\n        possible to store the result of the DQ process into a defined result sink.\n\n        Args:\n            data: dataframes from previous steps of the algorithm that we wish to\n                run the DQ process on.\n\n        Returns:\n            Another ordered dict with the validated dataframes and\n            a dictionary with the errors if they exist, or None.\n        \"\"\"\n        if not self.dq_specs:\n            return data, None\n\n        dq_processed_dfs, error = self._verify_dq_rule_id_uniqueness(\n            data, self.dq_specs\n        )\n        if error:\n            return dq_processed_dfs, error\n        else:\n            from lakehouse_engine.dq_processors.dq_factory import DQFactory\n\n            dq_processed_dfs = OrderedDict(data)\n            for spec in self.dq_specs:\n                df_processed_df = dq_processed_dfs[spec.input_id]\n                self._logger.info(f\"Found data quality specification: {spec}\")\n                if (\n                    spec.dq_type == DQType.PRISMA.value or spec.dq_functions\n                ) and spec.spec_id not in self._streaming_micro_batch_dq_plan:\n\n                    if spec.cache_df:\n                        df_processed_df.cache()\n                    dq_processed_dfs[spec.spec_id] = DQFactory.run_dq_process(\n                        spec, df_processed_df\n                    )\n                else:\n                    dq_processed_dfs[spec.spec_id] = df_processed_df\n\n            return dq_processed_dfs, None\n\n    def write(self, data: OrderedDict) -> OrderedDict:\n        \"\"\"Write the data that was read and transformed (if applicable).\n\n        It supports writing multiple datasets. However, we only recommend writing one\n        dataframe. This recommendation is based on easy debugging and reproducibility,\n        since mixing several datasets fed by the same algorithm quickly creates\n        reproducibility issues, as well as tight coupling and\n        dependencies between datasets. 
Having said that, there may be cases where\n        writing multiple datasets is desirable according to the use case requirements.\n        Use it accordingly.\n\n        Args:\n            data: dataframes that were read and transformed (if applicable).\n\n        Returns:\n            Dataframes that were written.\n        \"\"\"\n        written_dfs: OrderedDict = OrderedDict({})\n        for spec in self.output_specs:\n            self._logger.info(f\"Found output specification: {spec}\")\n\n            written_output = WriterFactory.get_writer(\n                spec, data[spec.input_id], data\n            ).write()\n            if written_output:\n                written_dfs.update(written_output)\n            else:\n                written_dfs[spec.spec_id] = data[spec.input_id]\n\n        return written_dfs\n\n    def terminate(self, data: OrderedDict) -> None:\n        \"\"\"Terminate the algorithm.\n\n        Args:\n            data: dataframes that were written.\n        \"\"\"\n        if self.terminate_specs:\n            for spec in self.terminate_specs:\n                self._logger.info(f\"Found terminate specification: {spec}\")\n                TerminatorFactory.execute_terminator(\n                    spec, data[spec.input_id] if spec.input_id else None\n                )\n\n    def execute(self) -> Optional[OrderedDict]:\n        \"\"\"Define the algorithm execution behaviour.\"\"\"\n        try:\n            self._logger.info(\"Starting read stage...\")\n            read_dfs = self.read()\n            self._logger.info(\"Starting transform stage...\")\n            transformed_dfs = self.transform(read_dfs)\n            self._logger.info(\"Starting data quality stage...\")\n            validated_dfs, errors = self.process_dq(transformed_dfs)\n            self._logger.info(\"Starting write stage...\")\n            written_dfs = self.write(validated_dfs)\n            self._logger.info(\"Starting terminate stage...\")\n            self.terminate(written_dfs)\n            self._logger.info(\"Execution of the algorithm has finished!\")\n        except Exception as e:\n            NotifierFactory.generate_failure_notification(self.terminate_specs, e)\n            raise e\n\n        if errors:\n            raise DQDuplicateRuleIdException(\n                \"Data Written Successfully, but DQ Process Encountered an Issue.\\n\"\n                \"We detected a duplicate dq_rule_id in the dq_spec definition. \"\n                \"As a result, none of the Data Quality (DQ) processes (dq_spec) \"\n                \"were executed.\\n\"\n                \"Please review and verify the following dq_rules:\\n\"\n                f\"{errors}\"\n            )\n\n        return written_dfs\n\n    def _get_input_specs(self) -> List[InputSpec]:\n        \"\"\"Get the input specifications from an acon.\n\n        Returns:\n            List of input specifications.\n        \"\"\"\n        return [InputSpec(**spec) for spec in self.acon[\"input_specs\"]]\n\n    def _get_transform_specs(self) -> List[TransformSpec]:\n        \"\"\"Get the transformation specifications from an acon.\n\n        If we are executing the algorithm in streaming mode and if the\n        transformer function is not supported in streaming mode, it is\n        important to note that ONLY those unsupported operations will\n        go into the streaming_micro_batch_transformers (see if in the function code),\n        in the same order that they appear in the list of transformations. 
This means\n        that other supported transformations that appear after an\n        unsupported one continue to stay on the normal execution plan,\n        i.e., outside the foreachBatch function. Therefore, this may\n        make your algorithm execute different logic than the one you\n        originally intended. For this reason:\n            1) ALWAYS PLACE UNSUPPORTED STREAMING TRANSFORMATIONS LAST;\n            2) USE THE force_streaming_foreach_batch_processing OPTION IN THE\n            transform_spec SECTION;\n            3) USE THE CUSTOM_TRANSFORMATION AND WRITE ALL YOUR TRANSFORMATION LOGIC\n            THERE.\n\n        Check the list of unsupported Spark streaming operations here:\n        https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations\n\n        Returns:\n            List of transformation specifications.\n        \"\"\"\n        input_read_types = self._get_input_read_types(self.acon[\"input_specs\"])\n        transform_input_ids = self._get_transform_input_ids(\n            self.acon.get(\"transform_specs\", [])\n        )\n        prev_spec_read_types = self._get_previous_spec_read_types(\n            input_read_types, transform_input_ids\n        )\n        transform_specs = []\n        for spec in self.acon.get(\"transform_specs\", []):\n            transform_spec = TransformSpec(\n                spec_id=spec[\"spec_id\"],\n                input_id=spec[\"input_id\"],\n                transformers=[],\n                force_streaming_foreach_batch_processing=spec.get(\n                    \"force_streaming_foreach_batch_processing\", False\n                ),\n            )\n\n            for s in spec[\"transformers\"]:\n                transformer_spec = TransformerSpec(\n                    function=s[\"function\"], args=s.get(\"args\", {})\n                )\n                if (\n                    prev_spec_read_types[transform_spec.input_id]\n                    == ReadType.STREAMING.value\n                    and s[\"function\"]\n                    in TransformerFactory.UNSUPPORTED_STREAMING_TRANSFORMERS\n                ) or (\n                    prev_spec_read_types[transform_spec.input_id]\n                    == ReadType.STREAMING.value\n                    and transform_spec.force_streaming_foreach_batch_processing\n                ):\n                    self._move_to_streaming_micro_batch_transformers(\n                        transform_spec, transformer_spec\n                    )\n                else:\n                    transform_spec.transformers.append(transformer_spec)\n\n            transform_specs.append(transform_spec)\n\n        return transform_specs\n\n    def _get_dq_specs(self) -> List[DQSpec]:\n        \"\"\"Get list of data quality specification objects from acon.\n\n        In streaming mode, we automatically convert the data quality specification into\n        the streaming_micro_batch_dq_processors list of the respective output spec.\n        This is needed because our dq process cannot be executed using native streaming\n        functions.\n\n        Returns:\n            List of data quality spec objects.\n        \"\"\"\n        input_read_types = self._get_input_read_types(self.acon[\"input_specs\"])\n        transform_input_ids = self._get_transform_input_ids(\n            self.acon.get(\"transform_specs\", [])\n        )\n        prev_spec_read_types = self._get_previous_spec_read_types(\n            input_read_types, transform_input_ids\n        )\n\n        dq_specs = 
[]\n        for spec in self.acon.get(\"dq_specs\", []):\n\n            dq_spec, dq_functions, critical_functions = Algorithm.get_dq_spec(spec)\n\n            if prev_spec_read_types[dq_spec.input_id] == ReadType.STREAMING.value:\n                # we need to use deepcopy to explicitly create a copy of the dict\n                # otherwise python only create binding for dicts, and we would be\n                # modifying the original dict, which we don't want to.\n                self._move_to_streaming_micro_batch_dq_processors(\n                    deepcopy(dq_spec), dq_functions, critical_functions\n                )\n            else:\n                dq_spec.dq_functions = dq_functions\n                dq_spec.critical_functions = critical_functions\n\n            self._logger.info(\n                f\"Streaming Micro Batch DQ Plan: \"\n                f\"{str(self._streaming_micro_batch_dq_plan)}\"\n            )\n            dq_specs.append(dq_spec)\n\n        return dq_specs\n\n    def _get_output_specs(self) -> List[OutputSpec]:\n        \"\"\"Get the output specifications from an acon.\n\n        Returns:\n            List of output specifications.\n        \"\"\"\n        return [\n            OutputSpec(\n                spec_id=spec[\"spec_id\"],\n                input_id=spec[\"input_id\"],\n                write_type=spec.get(\"write_type\", None),\n                data_format=spec.get(\"data_format\", OutputFormat.DELTAFILES.value),\n                db_table=spec.get(\"db_table\", None),\n                location=spec.get(\"location\", None),\n                merge_opts=(\n                    MergeOptions(**spec[\"merge_opts\"])\n                    if spec.get(\"merge_opts\")\n                    else None\n                ),\n                sharepoint_opts=(\n                    SharepointOptions(**spec[\"sharepoint_opts\"])\n                    if spec.get(\"sharepoint_opts\")\n                    else None\n                ),\n                partitions=spec.get(\"partitions\", []),\n                streaming_micro_batch_transformers=self._get_streaming_transformer_plan(\n                    spec[\"input_id\"], self.dq_specs\n                ),\n                streaming_once=spec.get(\"streaming_once\", None),\n                streaming_processing_time=spec.get(\"streaming_processing_time\", None),\n                streaming_available_now=spec.get(\n                    \"streaming_available_now\",\n                    (\n                        False\n                        if (\n                            spec.get(\"streaming_once\", None)\n                            or spec.get(\"streaming_processing_time\", None)\n                            or spec.get(\"streaming_continuous\", None)\n                        )\n                        else True\n                    ),\n                ),\n                streaming_continuous=spec.get(\"streaming_continuous\", None),\n                streaming_await_termination=spec.get(\n                    \"streaming_await_termination\", True\n                ),\n                streaming_await_termination_timeout=spec.get(\n                    \"streaming_await_termination_timeout\", None\n                ),\n                with_batch_id=spec.get(\"with_batch_id\", False),\n                options=spec.get(\"options\", None),\n                streaming_micro_batch_dq_processors=(\n                    self._streaming_micro_batch_dq_plan.get(spec[\"input_id\"], [])\n                ),\n            )\n            for spec in 
self.acon[\"output_specs\"]\n        ]\n\n    def _get_streaming_transformer_plan(\n        self, input_id: str, dq_specs: Optional[List[DQSpec]]\n    ) -> List[TransformerSpec]:\n        \"\"\"Gets the plan for transformations to be applied on streaming micro batches.\n\n        When running both DQ processes and transformations in streaming micro batches,\n        the _streaming_micro_batch_transformers_plan to consider is the one associated\n        with the transformer spec_id and not with the dq spec_id. Thus, on those cases,\n        this method maps the input id of the output_spec (which is the spec_id of a\n        dq_spec) with the dependent transformer spec_id.\n\n        Args:\n            input_id: id of the corresponding input specification.\n            dq_specs: data quality specifications.\n\n        Returns:\n            a list of TransformerSpec, representing the transformations plan.\n        \"\"\"\n        transformer_id = (\n            [dq_spec.input_id for dq_spec in dq_specs if dq_spec.spec_id == input_id][0]\n            if self._streaming_micro_batch_dq_plan.get(input_id)\n            and self._streaming_micro_batch_transformers_plan\n            else input_id\n        )\n\n        streaming_micro_batch_transformers_plan: list[TransformerSpec] = (\n            self._streaming_micro_batch_transformers_plan.get(transformer_id, [])\n        )\n\n        return streaming_micro_batch_transformers_plan\n\n    def _get_terminate_specs(self) -> List[TerminatorSpec]:\n        \"\"\"Get the terminate specifications from an acon.\n\n        Returns:\n            List of terminate specifications.\n        \"\"\"\n        return [TerminatorSpec(**spec) for spec in self.acon.get(\"terminate_specs\", [])]\n\n    def _move_to_streaming_micro_batch_transformers(\n        self, transform_spec: TransformSpec, transformer_spec: TransformerSpec\n    ) -> None:\n        \"\"\"Move the transformer to the list of streaming micro batch transformations.\n\n        If the transform specs contain functions that cannot be executed in streaming\n        mode, this function sends those functions to the output specs\n        streaming_micro_batch_transformers, where they will be executed inside the\n        stream foreachBatch function.\n\n        To accomplish that we use an instance variable that associates the\n        streaming_micro_batch_transformers to each output spec, in order to do reverse\n        lookup when creating the OutputSpec.\n\n        Args:\n            transform_spec: transform specification (overall\n                transformation specification - a transformation may contain multiple\n                transformers).\n            transformer_spec: the specific transformer function and arguments.\n        \"\"\"\n        if transform_spec.spec_id not in self._streaming_micro_batch_transformers_plan:\n            self._streaming_micro_batch_transformers_plan[transform_spec.spec_id] = []\n\n        self._streaming_micro_batch_transformers_plan[transform_spec.spec_id].append(\n            transformer_spec\n        )\n\n    def _move_to_streaming_micro_batch_dq_processors(\n        self,\n        dq_spec: DQSpec,\n        dq_functions: List[DQFunctionSpec],\n        critical_functions: List[DQFunctionSpec],\n    ) -> None:\n        \"\"\"Move the dq function to the list of streaming micro batch transformations.\n\n        If the dq specs contain functions that cannot be executed in streaming mode,\n        this function sends those functions to the output specs\n        
streaming_micro_batch_dq_processors, where they will be executed inside the\n        stream foreachBatch function.\n\n        To accomplish that we use an instance variable that associates the\n        streaming_micro_batch_dq_processors to each output spec, in order to do reverse\n        lookup when creating the OutputSpec.\n\n        Args:\n            dq_spec: dq specification (overall dq process specification).\n            dq_functions: the list of dq functions to be considered.\n            critical_functions: list of critical functions to be considered.\n        \"\"\"\n        if dq_spec.spec_id not in self._streaming_micro_batch_dq_plan:\n            self._streaming_micro_batch_dq_plan[dq_spec.spec_id] = []\n\n        dq_spec.dq_functions = dq_functions\n        dq_spec.critical_functions = critical_functions\n        self._streaming_micro_batch_dq_plan[dq_spec.spec_id].append(dq_spec)\n\n    @staticmethod\n    def _get_input_read_types(list_of_specs: List) -> dict:\n        \"\"\"Get a dict of spec ids and read types from a list of input specs.\n\n        Args:\n            list_of_specs: list of input specs ([{k:v}]).\n\n        Returns:\n            Dict of {input_spec_id: read_type}.\n        \"\"\"\n        return {item[\"spec_id\"]: item[\"read_type\"] for item in list_of_specs}\n\n    @staticmethod\n    def _get_transform_input_ids(list_of_specs: List) -> dict:\n        \"\"\"Get a dict of transform spec ids and input ids from list of transform specs.\n\n        Args:\n            list_of_specs: list of transform specs ([{k:v}]).\n\n        Returns:\n            Dict of {transform_spec_id: input_id}.\n        \"\"\"\n        return {item[\"spec_id\"]: item[\"input_id\"] for item in list_of_specs}\n\n    @staticmethod\n    def _get_previous_spec_read_types(\n        input_read_types: dict, transform_input_ids: dict\n    ) -> dict:\n        \"\"\"Get the read types of the previous specification: input and/or transform.\n\n        For the chaining transformations and for DQ process to work seamlessly in batch\n        and streaming mode, we have to figure out if the previous spec to the transform\n        or dq spec(e.g., input spec or transform spec) refers to a batch read type or\n        a streaming read type.\n\n        Args:\n            input_read_types: dict of {input_spec_id: read_type}.\n            transform_input_ids: dict of {transform_spec_id: input_id}.\n\n        Returns:\n            Dict of {input_spec_id or transform_spec_id: read_type}\n        \"\"\"\n        combined_read_types = input_read_types\n        for spec_id, input_id in transform_input_ids.items():\n            combined_read_types[spec_id] = combined_read_types[input_id]\n\n        return combined_read_types\n\n    @staticmethod\n    def _verify_dq_rule_id_uniqueness(\n        data: OrderedDict, dq_specs: list[DQSpec]\n    ) -> tuple[OrderedDict, dict[str, str]]:\n        \"\"\"Verify the uniqueness of dq_rule_id.\n\n        Verify the existence of duplicate dq_rule_id values\n        and prepare the DataFrame for the next stage.\n\n        Args:\n            data: dataframes.\n            dq_specs: a list of DQSpec to be validated.\n\n        Returns:\n             processed df and error if existed.\n        \"\"\"\n        error_dict = PrismaUtils.validate_rule_id_duplication(dq_specs)\n        dq_processed_dfs = OrderedDict(data)\n        for spec in dq_specs:\n            df_processed_df = dq_processed_dfs[spec.input_id]\n            dq_processed_dfs[spec.spec_id] = df_processed_df\n        
return dq_processed_dfs, error_dict\n"
  },
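The pipeline above is driven entirely by the ACON dict. As a rough illustration of how the spec groups chain together via `spec_id`/`input_id`, here is a minimal, hypothetical ACON sketch: the top-level keys and the id chaining mirror the code above, but every concrete value (locations, formats, the transformer function name) is a placeholder and not guaranteed to match the engine's actual readers, writers, or transformers.

```python
# Hypothetical ACON sketch (illustrative only): one input spec is read,
# transformed, and written; downstream specs reference upstream specs by input_id.
acon = {
    "input_specs": [
        {
            "spec_id": "orders_bronze",  # id that other specs point to
            "read_type": "batch",  # or "streaming"
            "data_format": "csv",  # placeholder reader options
            "location": "s3://my-bucket/bronze/orders/",
        }
    ],
    "transform_specs": [
        {
            "spec_id": "orders_transformed",
            "input_id": "orders_bronze",
            "transformers": [
                # placeholder transformer name and args
                {"function": "repartition", "args": {"num_partitions": 10}}
            ],
        }
    ],
    "output_specs": [
        {
            "spec_id": "orders_silver",
            "input_id": "orders_transformed",
            "write_type": "append",
            "data_format": "delta",
            "location": "s3://my-bucket/silver/orders/",
        }
    ],
}
```

Optional `dq_specs` and `terminate_specs` entries would slot into the same chain, with each dq_spec's `input_id` pointing at the spec it validates.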
  {
    "path": "lakehouse_engine/algorithms/dq_validator.py",
    "content": "\"\"\"Module to define Data Validator class.\"\"\"\n\nfrom delta.tables import DeltaTable\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.utils import StreamingQueryException\n\nfrom lakehouse_engine.algorithms.algorithm import Algorithm\nfrom lakehouse_engine.core.definitions import DQSpec, DQValidatorSpec, InputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.dq_processors.dq_factory import DQFactory\nfrom lakehouse_engine.dq_processors.exceptions import (\n    DQDuplicateRuleIdException,\n    DQValidationsFailedException,\n)\nfrom lakehouse_engine.io.reader_factory import ReaderFactory\nfrom lakehouse_engine.utils.dq_utils import PrismaUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass DQValidator(Algorithm):\n    \"\"\"Validate data using an algorithm configuration (ACON represented as dict).\n\n    This algorithm focuses on isolate Data Quality Validations from loading,\n    applying a set of data quality functions to a specific input dataset,\n    without the need to define any output specification.\n    You can use any input specification compatible with the lakehouse engine\n    (dataframe, table, files, etc).\n    \"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, acon: dict):\n        \"\"\"Construct DQValidator algorithm instances.\n\n        A data quality validator needs the following specifications to work properly:\n\n        - input specification (mandatory): specify how and what data to\n        read.\n        - data quality specification (mandatory): specify how to execute\n        the data quality process.\n        - restore_prev_version (optional): specify if, having\n        delta table/files as input, they should be restored to the\n        previous version if the data quality process fails. Note: this\n        is only considered if fail_on_error is kept as True.\n\n        Args:\n            acon: algorithm configuration.\n        \"\"\"\n        self.spec: DQValidatorSpec = DQValidatorSpec(\n            input_spec=InputSpec(**acon[\"input_spec\"]),\n            dq_spec=self._get_dq_spec(acon[\"dq_spec\"]),\n            restore_prev_version=acon.get(\"restore_prev_version\", None),\n        )\n\n    def read(self) -> DataFrame:\n        \"\"\"Read data from an input location into a distributed dataframe.\n\n        Returns:\n             Dataframe with data that was read.\n        \"\"\"\n        current_df = ReaderFactory.get_data(self.spec.input_spec)\n\n        return current_df\n\n    def process_dq(self, data: DataFrame) -> DataFrame:\n        \"\"\"Process the data quality tasks for the data that was read.\n\n        It supports a single input dataframe.\n\n        It is possible to use data quality validators/expectations that will validate\n        your data and fail the process in case the expectations are not met. The DQ\n        process also generates and keeps updating a site containing the results of the\n        expectations that were done on your data. The location of the site is\n        configurable and can either be on file system or S3. If you define it to be\n        stored on S3, you can even configure your S3 bucket to serve the site so that\n        people can easily check the quality of your data. 
Moreover, it is also\n        possible to store the result of the DQ process into a defined result sink.\n\n        Args:\n            data: input dataframe on which to run the DQ process.\n\n        Returns:\n            Validated dataframe.\n        \"\"\"\n        return DQFactory.run_dq_process(self.spec.dq_spec, data)\n\n    def execute(self) -> None:\n        \"\"\"Define the algorithm execution behaviour.\"\"\"\n        self._LOGGER.info(\"Starting read stage...\")\n        read_df = self.read()\n\n        self._LOGGER.info(\"Starting data quality validator...\")\n\n        self._LOGGER.info(\"Validating DQ definitions\")\n        error_dict = PrismaUtils.validate_rule_id_duplication(specs=[self.spec.dq_spec])\n        if error_dict:\n            raise DQDuplicateRuleIdException(\n                \"Duplicate dq_rule_id detected in dq_spec definition.\\n\"\n                \"We have identified one or more duplicate dq_rule_id \"\n                \"entries in the dq_spec definition. \"\n                \"Please review and verify the following dq_rules:\\n\"\n                f\"{error_dict}\"\n            )\n        try:\n            if read_df.isStreaming:\n                # To handle streaming, and although we are not interested in\n                # writing any data, we still need to start the streaming and\n                # execute the data quality process in micro batches of data.\n                def write_dq_validator_micro_batch(\n                    batch_df: DataFrame, batch_id: int\n                ) -> None:\n                    ExecEnv.get_for_each_batch_session(batch_df)\n                    self.process_dq(batch_df)\n\n                read_df.writeStream.trigger(once=True).foreachBatch(\n                    write_dq_validator_micro_batch\n                ).start().awaitTermination()\n\n            else:\n                self.process_dq(read_df)\n        except (DQValidationsFailedException, StreamingQueryException):\n            if not self.spec.input_spec.df_name and self.spec.restore_prev_version:\n                self._LOGGER.info(\"Restoring delta table/files to previous version...\")\n\n                self._restore_prev_version()\n\n                raise DQValidationsFailedException(\n                    \"Data Quality Validations Failed! 
The delta \"\n                    \"table/files were restored to the previous version!\"\n                )\n\n            elif self.spec.dq_spec.fail_on_error:\n                raise DQValidationsFailedException(\"Data Quality Validations Failed!\")\n        else:\n            self._LOGGER.info(\"Execution of the algorithm has finished!\")\n\n    @staticmethod\n    def _get_dq_spec(input_dq_spec: dict) -> DQSpec:\n        \"\"\"Get data quality specification from acon.\n\n        Args:\n            input_dq_spec: data quality specification.\n\n        Returns:\n            Data quality spec.\n        \"\"\"\n        dq_spec, dq_functions, critical_functions = Algorithm.get_dq_spec(input_dq_spec)\n\n        dq_spec.dq_functions = dq_functions\n        dq_spec.critical_functions = critical_functions\n\n        return dq_spec\n\n    def _restore_prev_version(self) -> None:\n        \"\"\"Restore delta table or delta files to previous version.\"\"\"\n        if self.spec.input_spec.db_table:\n            delta_table = DeltaTable.forName(\n                ExecEnv.SESSION, self.spec.input_spec.db_table\n            )\n        else:\n            delta_table = DeltaTable.forPath(\n                ExecEnv.SESSION, self.spec.input_spec.location\n            )\n\n        previous_version = (\n            delta_table.history().agg({\"version\": \"max\"}).collect()[0][0] - 1\n        )\n\n        delta_table.restoreToVersion(previous_version)\n"
  },
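For context on how DQValidator is typically configured, here is a hedged sketch of an ACON for it. The keys `input_spec`, `dq_spec` and `restore_prev_version` come from the constructor above; the spec contents (read options, the dq_type value, the expectation name and its args) are illustrative placeholders rather than a verified configuration.

```python
# Hypothetical DQValidator ACON sketch (illustrative only).
acon = {
    "input_spec": {
        "spec_id": "orders_input",
        "read_type": "batch",
        "db_table": "my_database.orders",  # placeholder table name
    },
    "dq_spec": {
        "spec_id": "orders_dq",
        "input_id": "orders_input",
        "dq_type": "validator",  # placeholder dq_type value
        "fail_on_error": True,
        "dq_functions": [
            # placeholder expectation name and args
            {
                "function": "expect_column_values_to_not_be_null",
                "args": {"column": "order_id"},
            }
        ],
    },
    # only meaningful for delta table/files inputs, as described above
    "restore_prev_version": True,
}
```

If the validations fail and the input is a delta table or delta files, `_restore_prev_version` rolls the data back one version before the failure is re-raised, as implemented above.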
  {
    "path": "lakehouse_engine/algorithms/exceptions.py",
    "content": "\"\"\"Package defining all the algorithm custom exceptions.\"\"\"\n\n\nclass ReconciliationFailedException(Exception):\n    \"\"\"Exception for when the reconciliation process fails.\"\"\"\n\n    pass\n\n\nclass NoNewDataException(Exception):\n    \"\"\"Exception for when no new data is available.\"\"\"\n\n    pass\n\n\nclass SensorAlreadyExistsException(Exception):\n    \"\"\"Exception for when a sensor with same sensor id already exists.\"\"\"\n\n    pass\n\n\nclass RestoreTypeNotFoundException(Exception):\n    \"\"\"Exception for when the restore type is not found.\"\"\"\n\n    pass\n"
  },
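These exceptions are raised by other algorithms in the package (e.g., sensors and reconciliation flows). Below is a small hedged sketch of how a caller might handle them; the `poll_upstream` helper is a hypothetical stand-in for whichever engine call raises `NoNewDataException` in practice.

```python
from lakehouse_engine.algorithms.exceptions import NoNewDataException


def poll_upstream() -> None:
    """Hypothetical stand-in for a sensor call that found no new data."""
    raise NoNewDataException("No new data available upstream.")


try:
    poll_upstream()
except NoNewDataException as err:
    # "No new data" is often a benign condition: skip the run instead of failing.
    print(f"Skipping run: {err}")
```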
  {
    "path": "lakehouse_engine/algorithms/gab.py",
    "content": "\"\"\"Module to define Gold Asset Builder algorithm behavior.\"\"\"\n\nimport copy\nfrom datetime import datetime, timedelta\n\nimport pendulum\nfrom jinja2 import Template\nfrom pyspark import Row\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import lit\n\nfrom lakehouse_engine.algorithms.algorithm import Algorithm\nfrom lakehouse_engine.core.definitions import (\n    GABCadence,\n    GABCombinedConfiguration,\n    GABDefaults,\n    GABKeys,\n    GABReplaceableKeys,\n    GABSpec,\n    GABStartOfWeek,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.gab_manager import GABCadenceManager, GABViewManager\nfrom lakehouse_engine.core.gab_sql_generator import (\n    GABDeleteGenerator,\n    GABInsertGenerator,\n)\nfrom lakehouse_engine.utils.gab_utils import GABPartitionUtils, GABUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass GAB(Algorithm):\n    \"\"\"Class representing the gold asset builder.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n    _SPARK_DEFAULT_PARALLELISM_CONFIG = (\n        \"spark.sql.sources.parallelPartitionDiscovery.parallelism\"\n    )\n    _SPARK_DEFAULT_PARALLELISM_VALUE = \"10000\"\n\n    def __init__(self, acon: dict):\n        \"\"\"Construct GAB instances.\n\n        Args:\n            acon: algorithm configuration.\n        \"\"\"\n        self.spec: GABSpec = GABSpec.create_from_acon(acon=acon)\n\n    def execute(self) -> None:\n        \"\"\"Execute the Gold Asset Builder.\"\"\"\n        self._LOGGER.info(f\"Reading {self.spec.lookup_table} as lkp_query_builder\")\n        lookup_query_builder_df = ExecEnv.SESSION.read.table(self.spec.lookup_table)\n        ExecEnv.SESSION.read.table(self.spec.calendar_table).createOrReplaceTempView(\n            \"df_cal\"\n        )\n        self._LOGGER.info(f\"Generating calendar from {self.spec.calendar_table}\")\n\n        query_label = self.spec.query_label_filter\n        queue = self.spec.queue_filter\n        cadence = self.spec.cadence_filter\n\n        self._LOGGER.info(f\"Query Label Filter {query_label}\")\n        self._LOGGER.info(f\"Queue Filter {queue}\")\n        self._LOGGER.info(f\"Cadence Filter {cadence}\")\n\n        gab_path = self.spec.gab_base_path\n        self._LOGGER.info(f\"Gab Base Path {gab_path}\")\n\n        lookup_query_builder_df = lookup_query_builder_df.filter(\n            (\n                (lookup_query_builder_df.query_label.isin(query_label))\n                & (lookup_query_builder_df.queue.isin(queue))\n                & (lookup_query_builder_df.is_active != lit(\"N\"))\n            )\n        )\n\n        cached = True\n        try:\n            lookup_query_builder_df.cache()\n        except Exception as e:\n            cached = False\n            self._LOGGER.warning(\n                \"Could not cache lookup_query_builder_df dataframe. \"\n                f\"Continuing without caching. 
Exception: {e}\"\n            )\n\n        for use_case in lookup_query_builder_df.collect():\n            self._process_use_case(\n                use_case=use_case,\n                lookup_query_builder=lookup_query_builder_df,\n                selected_cadences=cadence,\n                gab_path=gab_path,\n            )\n\n        if cached:\n            lookup_query_builder_df.unpersist()\n\n    def _process_use_case(\n        self,\n        use_case: Row,\n        lookup_query_builder: DataFrame,\n        selected_cadences: list[str],\n        gab_path: str,\n    ) -> None:\n        \"\"\"Process each gab use case.\n\n        Args:\n            use_case: gab use case to process.\n            lookup_query_builder: gab configuration data.\n            selected_cadences: selected cadences to process.\n            gab_path: gab base path used to get the use case stages sql files.\n        \"\"\"\n        self._LOGGER.info(f\"Executing use case: {use_case['query_label']}\")\n\n        reconciliation = GABUtils.get_json_column_as_dict(\n            lookup_query_builder=lookup_query_builder,\n            query_id=use_case[\"query_id\"],\n            query_column=\"recon_window\",\n        )\n        self._LOGGER.info(f\"reconcilation window - {reconciliation}\")\n        configured_cadences = list(reconciliation.keys())\n\n        stages = GABUtils.get_json_column_as_dict(\n            lookup_query_builder=lookup_query_builder,\n            query_id=use_case[\"query_id\"],\n            query_column=\"intermediate_stages\",\n        )\n        self._LOGGER.info(f\"intermediate stages - {stages}\")\n\n        self._LOGGER.info(f\"selected_cadences: {selected_cadences}\")\n        self._LOGGER.info(f\"configured_cadences: {configured_cadences}\")\n        cadences = self._get_filtered_cadences(selected_cadences, configured_cadences)\n        self._LOGGER.info(f\"filtered cadences - {cadences}\")\n\n        latest_run_date, latest_config_date = self._get_latest_usecase_data(\n            use_case[\"query_id\"]\n        )\n        self._LOGGER.info(f\"latest_config_date: {latest_config_date}\")\n        self._LOGGER.info(f\"latest_run_date: - {latest_run_date}\")\n        self._set_use_case_stage_template_file(stages, gab_path, use_case)\n        processed_cadences = []\n\n        for cadence in cadences:\n            is_cadence_processed = self._process_use_case_query_cadence(\n                cadence,\n                reconciliation,\n                use_case,\n                stages,\n                lookup_query_builder,\n            )\n            if is_cadence_processed:\n                processed_cadences.append(is_cadence_processed)\n\n        if processed_cadences:\n            self._generate_ddl(\n                latest_config_date=latest_config_date,\n                latest_run_date=latest_run_date,\n                query_id=use_case[\"query_id\"],\n                lookup_query_builder=lookup_query_builder,\n            )\n        else:\n            self._LOGGER.info(\n                f\"Skipping use case {use_case['query_label']}. 
No cadence processed \"\n                \"for the use case.\"\n            )\n\n    @classmethod\n    def _set_use_case_stage_template_file(\n        cls, stages: dict, gab_path: str, use_case: Row\n    ) -> None:\n        \"\"\"Set templated file for each stage.\n\n        Args:\n            stages: use case stages with their configuration.\n            gab_path: gab base path used to get the use case stages SQL files.\n            use_case: gab use case to process.\n        \"\"\"\n        cls._LOGGER.info(\"Reading templated file for each stage...\")\n\n        for i in range(1, len(stages) + 1):\n            stage = stages[str(i)]\n            stage_file_path = stage[\"file_path\"]\n            full_path = gab_path + stage_file_path\n            cls._LOGGER.info(f\"Stage file path is: {full_path}\")\n            file_read = open(full_path, \"r\").read()\n            templated_file = file_read.replace(\n                \"replace_offset_value\", str(use_case[\"timezone_offset\"])\n            )\n            stage[\"templated_file\"] = templated_file\n            stage[\"full_file_path\"] = full_path\n\n    def _process_use_case_query_cadence(\n        self,\n        cadence: str,\n        reconciliation: dict,\n        use_case: Row,\n        stages: dict,\n        lookup_query_builder: DataFrame,\n    ) -> bool:\n        \"\"\"Identify use case reconciliation window and cadence.\n\n        Args:\n            cadence:  cadence to process.\n            reconciliation: configured use case reconciliation window.\n            use_case: gab use case to process.\n            stages: use case stages with their configuration.\n            lookup_query_builder: gab configuration data.\n        \"\"\"\n        selected_reconciliation_window = {}\n        selected_cadence = reconciliation.get(cadence)\n        self._LOGGER.info(f\"Processing cadence: {cadence}\")\n        self._LOGGER.info(f\"Reconciliation Window - {selected_cadence}\")\n\n        if selected_cadence:\n            selected_reconciliation_window = selected_cadence.get(\"recon_window\")\n\n        self._LOGGER.info(f\"{cadence}: {self.spec.start_date} - {self.spec.end_date}\")\n\n        start_of_week = use_case[\"start_of_the_week\"]\n\n        self._set_week_configuration_by_uc_start_of_week(start_of_week)\n\n        cadence_configuration_at_end_date = (\n            GABUtils.get_cadence_configuration_at_end_date(self.spec.end_date)\n        )\n\n        reconciliation_cadences = GABUtils().get_reconciliation_cadences(\n            cadence=cadence,\n            selected_reconciliation_window=selected_reconciliation_window,\n            cadence_configuration_at_end_date=cadence_configuration_at_end_date,\n            rerun_flag=self.spec.rerun_flag,\n        )\n\n        start_date_str = GABUtils.format_datetime_to_default(self.spec.start_date)\n        end_date_str = GABUtils.format_datetime_to_default(self.spec.end_date)\n\n        for reconciliation_cadence, snapshot_flag in reconciliation_cadences.items():\n            self._process_reconciliation_cadence(\n                reconciliation_cadence=reconciliation_cadence,\n                snapshot_flag=snapshot_flag,\n                cadence=cadence,\n                start_date_str=start_date_str,\n                end_date_str=end_date_str,\n                use_case=use_case,\n                lookup_query_builder=lookup_query_builder,\n                stages=stages,\n            )\n\n        return (cadence in reconciliation.keys()) or (\n            reconciliation_cadences is 
not None\n        )\n\n    def _process_reconciliation_cadence(\n        self,\n        reconciliation_cadence: str,\n        snapshot_flag: str,\n        cadence: str,\n        start_date_str: str,\n        end_date_str: str,\n        use_case: Row,\n        lookup_query_builder: DataFrame,\n        stages: dict,\n    ) -> None:\n        \"\"\"Process use case reconciliation window.\n\n        Reconcile the pre-aggregated data to cover the late events.\n\n        Args:\n            reconciliation_cadence: reconciliation to process.\n            snapshot_flag: flag indicating if for this cadence the snapshot is enabled.\n            cadence: cadence to process.\n            start_date_str: start date of the period to process.\n            end_date_str: end date of the period to process.\n            use_case: gab use case to process.\n            lookup_query_builder: gab configuration data.\n            stages: use case stages with their configuration.\n\n        Example:\n            Cadence: week;\n            Reconciliation: monthly;\n            This means every weekend previous week aggregations will be calculated and\n                on month end we will reconcile the numbers calculated for last 4 weeks\n                to readjust the number for late events.\n        \"\"\"\n        (\n            window_start_date,\n            window_end_date,\n            filter_start_date,\n            filter_end_date,\n        ) = GABCadenceManager().extended_window_calculator(\n            cadence,\n            reconciliation_cadence,\n            self.spec.current_date,\n            start_date_str,\n            end_date_str,\n            use_case[\"query_type\"],\n            self.spec.rerun_flag,\n            snapshot_flag,\n        )\n\n        if use_case[\"timezone_offset\"]:\n            filter_start_date = filter_start_date + timedelta(\n                hours=use_case[\"timezone_offset\"]\n            )\n            filter_end_date = filter_end_date + timedelta(\n                hours=use_case[\"timezone_offset\"]\n            )\n\n        filter_start_date_str = GABUtils.format_datetime_to_default(filter_start_date)\n        filter_end_date_str = GABUtils.format_datetime_to_default(filter_end_date)\n\n        partition_end = GABUtils.format_datetime_to_default(\n            (window_end_date - timedelta(days=1))\n        )\n\n        window_start_date_str = GABUtils.format_datetime_to_default(window_start_date)\n        window_end_date_str = GABUtils.format_datetime_to_default(window_end_date)\n\n        partition_filter = GABPartitionUtils.get_partition_condition(\n            filter_start_date_str, partition_end\n        )\n\n        self._LOGGER.info(\n            \"extended window for start and end dates are: \"\n            f\"{filter_start_date_str} - {filter_end_date_str}\"\n        )\n\n        unpersist_list = []\n\n        for i in range(1, len(stages) + 1):\n            stage = stages[str(i)]\n            templated_file = stage[\"templated_file\"]\n            stage_file_path = stage[\"full_file_path\"]\n\n            templated = self._process_use_case_query_step(\n                stage=stages[str(i)],\n                templated_file=templated_file,\n                use_case=use_case,\n                reconciliation_cadence=reconciliation_cadence,\n                cadence=cadence,\n                snapshot_flag=snapshot_flag,\n                window_start_date=window_start_date_str,\n                partition_end=partition_end,\n                
filter_start_date=filter_start_date_str,\n                filter_end_date=filter_end_date_str,\n                partition_filter=partition_filter,\n            )\n\n            temp_stage_view_name = self._create_stage_view(\n                templated,\n                stages[str(i)],\n                window_start_date_str,\n                window_end_date_str,\n                use_case[\"query_id\"],\n                use_case[\"query_label\"],\n                cadence,\n                stage_file_path,\n            )\n            unpersist_list.append(temp_stage_view_name)\n\n        insert_success = self._generate_view_statement(\n            query_id=use_case[\"query_id\"],\n            cadence=cadence,\n            temp_stage_view_name=temp_stage_view_name,\n            lookup_query_builder=lookup_query_builder,\n            window_start_date=window_start_date_str,\n            window_end_date=window_end_date_str,\n            query_label=use_case[\"query_label\"],\n        )\n        self._LOGGER.info(f\"Inserted data to generate the view: {insert_success}\")\n\n        self._unpersist_cached_views(unpersist_list)\n\n    def _process_use_case_query_step(\n        self,\n        stage: dict,\n        templated_file: str,\n        use_case: Row,\n        reconciliation_cadence: str,\n        cadence: str,\n        snapshot_flag: str,\n        window_start_date: str,\n        partition_end: str,\n        filter_start_date: str,\n        filter_end_date: str,\n        partition_filter: str,\n    ) -> str:\n        \"\"\"Process each use case step.\n\n        Process any intermediate view defined in the gab configuration table as step for\n            the use case.\n\n        Args:\n            stage: stage to process.\n            templated_file: sql file to process at this stage.\n            use_case: gab use case to process.\n            reconciliation_cadence: configured use case reconciliation window.\n            cadence: cadence to process.\n            snapshot_flag: flag indicating if for this cadence the snapshot is enabled.\n            window_start_date: start date for the configured stage.\n            partition_end: end date for the configured stage.\n            filter_start_date: filter start date to replace in the stage query.\n            filter_end_date: filter end date to replace in the stage query.\n            partition_filter: partition condition.\n        \"\"\"\n        filter_col = stage[\"project_date_column\"]\n        if stage[\"filter_date_column\"]:\n            filter_col = stage[\"filter_date_column\"]\n\n        # dummy value to avoid empty error if empty on the configuration\n        project_col = stage.get(\"project_date_column\", \"X\")\n\n        gab_base_configuration_copy = copy.deepcopy(\n            GABCombinedConfiguration.COMBINED_CONFIGURATION.value\n        )\n\n        for item in gab_base_configuration_copy.values():\n            self._update_rendered_item_cadence(\n                reconciliation_cadence, cadence, project_col, item  # type: ignore\n            )\n\n        (\n            rendered_date,\n            rendered_to_date,\n            join_condition,\n        ) = self._get_cadence_configuration(\n            gab_base_configuration_copy,\n            cadence,\n            reconciliation_cadence,\n            snapshot_flag,\n            use_case[\"start_of_the_week\"],\n            project_col,\n            window_start_date,\n            partition_end,\n        )\n\n        rendered_file = self._render_template_query(\n            
templated=templated_file,\n            cadence=cadence,\n            start_of_the_week=use_case[\"start_of_the_week\"],\n            query_id=use_case[\"query_id\"],\n            rendered_date=rendered_date,\n            filter_start_date=filter_start_date,\n            filter_end_date=filter_end_date,\n            filter_col=filter_col,\n            timezone_offset=use_case[\"timezone_offset\"],\n            join_condition=join_condition,\n            partition_filter=partition_filter,\n            rendered_to_date=rendered_to_date,\n        )\n\n        return rendered_file\n\n    @classmethod\n    def _get_filtered_cadences(\n        cls, selected_cadences: list[str], configured_cadences: list[str]\n    ) -> list[str]:\n        \"\"\"Get filtered cadences.\n\n        Get the intersection of user selected cadences and use case configured cadences.\n\n        Args:\n            selected_cadences: user selected cadences.\n            configured_cadences: use case configured cadences.\n        \"\"\"\n        return (\n            configured_cadences\n            if \"All\" in selected_cadences\n            else GABCadence.order_cadences(\n                list(set(selected_cadences).intersection(configured_cadences))\n            )\n        )\n\n    def _get_latest_usecase_data(self, query_id: str) -> tuple[datetime, datetime]:\n        \"\"\"Get latest use case data.\n\n        Args:\n            query_id: use case query id.\n        \"\"\"\n        return (\n            self._get_latest_run_date(query_id),\n            self._get_latest_use_case_date(query_id),\n        )\n\n    def _get_latest_run_date(self, query_id: str) -> datetime:\n        \"\"\"Get latest use case run date.\n\n        Args:\n            query_id: use case query id.\n        \"\"\"\n        last_success_run_sql = \"\"\"\n            SELECT run_start_time\n            FROM {database}.gab_log_events\n            WHERE query_id = {query_id}\n            AND stage_name = 'Final Insert'\n            AND status = 'Success'\n            ORDER BY 1 DESC\n            LIMIT 1\n            \"\"\".format(  # nosec: B608\n            database=self.spec.target_database, query_id=query_id\n        )\n        try:\n            latest_run_date: datetime = ExecEnv.SESSION.sql(\n                last_success_run_sql\n            ).collect()[0][0]\n        except Exception:\n            latest_run_date = datetime.strptime(\n                \"2020-01-01\", GABDefaults.DATE_FORMAT.value\n            )\n\n        return latest_run_date\n\n    def _get_latest_use_case_date(self, query_id: str) -> datetime:\n        \"\"\"Get latest use case configured date.\n\n        Args:\n            query_id: use case query id.\n        \"\"\"\n        query_config_sql = \"\"\"\n            SELECT lh_created_on\n            FROM {lkp_query_builder}\n            WHERE query_id = {query_id}\n        \"\"\".format(  # nosec: B608\n            lkp_query_builder=self.spec.lookup_table,\n            query_id=query_id,\n        )\n\n        latest_config_date: datetime = ExecEnv.SESSION.sql(query_config_sql).collect()[\n            0\n        ][0]\n\n        return latest_config_date\n\n    @classmethod\n    def _set_week_configuration_by_uc_start_of_week(cls, start_of_week: str) -> None:\n        \"\"\"Set week configuration by use case start of week.\n\n        Args:\n            start_of_week: use case start of week (MONDAY or SUNDAY).\n        \"\"\"\n        if start_of_week.upper() == \"MONDAY\":\n            pendulum.week_starts_at(pendulum.MONDAY)\n     
       pendulum.week_ends_at(pendulum.SUNDAY)\n        elif start_of_week.upper() == \"SUNDAY\":\n            pendulum.week_starts_at(pendulum.SUNDAY)\n            pendulum.week_ends_at(pendulum.SATURDAY)\n        else:\n            raise NotImplementedError(\n                f\"The requested {start_of_week} is not implemented.\"\n                \"Supported `start_of_week` values: [MONDAY, SUNDAY]\"\n            )\n\n    @classmethod\n    def _update_rendered_item_cadence(\n        cls, reconciliation_cadence: str, cadence: str, project_col: str, item: dict\n    ) -> None:\n        \"\"\"Override item properties based in the rendered item cadence.\n\n        Args:\n            reconciliation_cadence: configured use case reconciliation window.\n            cadence: cadence to process.\n            project_col: use case projection date column name.\n            item: predefined use case combination.\n        \"\"\"\n        rendered_item = cls._get_rendered_item_cadence(\n            reconciliation_cadence, cadence, project_col, item\n        )\n        item[\"join_select\"] = rendered_item[\"join_select\"]\n        item[\"project_start\"] = rendered_item[\"project_start\"]\n        item[\"project_end\"] = rendered_item[\"project_end\"]\n\n    @classmethod\n    def _get_rendered_item_cadence(\n        cls, reconciliation_cadence: str, cadence: str, project_col: str, item: dict\n    ) -> dict:\n        \"\"\"Update pre-configured gab parameters with use case data.\n\n        Args:\n            reconciliation_cadence: configured use case reconciliation window.\n            cadence: cadence to process.\n            project_col: use case projection date column name.\n            item: predefined use case combination.\n        \"\"\"\n        return {\n            GABKeys.JOIN_SELECT: (\n                item[GABKeys.JOIN_SELECT]\n                .replace(GABReplaceableKeys.CONFIG_WEEK_START, \"Monday\")\n                .replace(\n                    GABReplaceableKeys.RECONCILIATION_CADENCE,\n                    reconciliation_cadence,\n                )\n                .replace(GABReplaceableKeys.CADENCE, cadence)\n            ),\n            GABKeys.PROJECT_START: (\n                item[GABKeys.PROJECT_START]\n                .replace(GABReplaceableKeys.CADENCE, cadence)\n                .replace(GABReplaceableKeys.DATE_COLUMN, project_col)\n            ),\n            GABKeys.PROJECT_END: (\n                item[GABKeys.PROJECT_END]\n                .replace(GABReplaceableKeys.CADENCE, cadence)\n                .replace(GABReplaceableKeys.DATE_COLUMN, project_col)\n            ),\n        }\n\n    @classmethod\n    def _get_cadence_configuration(\n        cls,\n        use_case_configuration: dict,\n        cadence: str,\n        reconciliation_cadence: str,\n        snapshot_flag: str,\n        start_of_week: str,\n        project_col: str,\n        window_start_date: str,\n        partition_end: str,\n    ) -> tuple[str, str, str]:\n        \"\"\"Get use case configuration fields to replace pre-configured parameters.\n\n        Args:\n            use_case_configuration: use case configuration.\n            cadence: cadence to process.\n            reconciliation_cadence: cadence to be reconciliated.\n            snapshot_flag: flag indicating if for this cadence the snapshot is enabled.\n            start_of_week: use case start of week (MONDAY or SUNDAY).\n            project_col: use case projection date column name.\n            window_start_date: start date for the configured 
stage.\n            partition_end: end date for the configured stage.\n\n        Returns:\n            rendered_from_date: projection start date.\n            rendered_to_date: projection end date.\n            join_condition: string containing the join condition to replace in the\n                templated query by jinja substitution.\n        \"\"\"\n        cadence_dict = next(\n            (\n                dict(configuration)\n                for configuration in use_case_configuration.values()\n                if (\n                    (cadence in configuration[\"cadence\"])\n                    and (reconciliation_cadence in configuration[\"recon\"])\n                    and (snapshot_flag in configuration[\"snap_flag\"])\n                    and (\n                        GABStartOfWeek.get_start_of_week()[start_of_week.upper()]\n                        in configuration[\"week_start\"]\n                    )\n                )\n            ),\n            None,\n        )\n        rendered_from_date = None\n        rendered_to_date = None\n        join_condition = None\n\n        if cadence_dict:\n            rendered_from_date = (\n                cadence_dict[GABKeys.PROJECT_START]\n                .replace(GABReplaceableKeys.CADENCE, cadence)\n                .replace(GABReplaceableKeys.DATE_COLUMN, project_col)\n            )\n            rendered_to_date = (\n                cadence_dict[GABKeys.PROJECT_END]\n                .replace(GABReplaceableKeys.CADENCE, cadence)\n                .replace(GABReplaceableKeys.DATE_COLUMN, project_col)\n            )\n\n            if cadence_dict[GABKeys.JOIN_SELECT]:\n                join_condition = \"\"\"\n                 inner join (\n                     {join_select} from df_cal\n                     where calendar_date\n                     between '{bucket_start}' and '{bucket_end}'\n                 )\n                 df_cal on date({date_column})\n                     between df_cal.cadence_start_date and df_cal.cadence_end_date\n                 \"\"\".format(\n                    join_select=cadence_dict[GABKeys.JOIN_SELECT],\n                    bucket_start=window_start_date,\n                    bucket_end=partition_end,\n                    date_column=project_col,\n                )\n\n        return rendered_from_date, rendered_to_date, join_condition\n\n    def _render_template_query(\n        self,\n        templated: str,\n        cadence: str,\n        start_of_the_week: str,\n        query_id: str,\n        rendered_date: str,\n        filter_start_date: str,\n        filter_end_date: str,\n        filter_col: str,\n        timezone_offset: str,\n        join_condition: str,\n        partition_filter: str,\n        rendered_to_date: str,\n    ) -> str:\n        \"\"\"Replace jinja templated parameters in the SQL with the actual data.\n\n        Args:\n            templated: templated sql file to process at this stage.\n            cadence: cadence to process.\n            start_of_the_week: use case start of week (MONDAY or SUNDAY).\n            query_id: gab configuration table use case identifier.\n            rendered_date: projection start date.\n            filter_start_date: filter start date to replace in the stage query.\n            filter_end_date: filter end date to replace in the stage query.\n            filter_col: use case projection date column name.\n            timezone_offset: timezone offset configured in the use case.\n            join_condition: string containing the join condition.\n        
    partition_filter: partition condition.\n            rendered_to_date: projection end date.\n        \"\"\"\n        return Template(templated).render(\n            cadence=\"'{cadence}' as cadence\".format(cadence=cadence),\n            cadence_run=cadence,\n            week_start=start_of_the_week,\n            query_id=\"'{query_id}' as query_id\".format(query_id=query_id),\n            project_date_column=rendered_date,\n            target_table=self.spec.target_table,\n            database=self.spec.source_database,\n            start_date=filter_start_date,\n            end_date=filter_end_date,\n            filter_date_column=filter_col,\n            offset_value=timezone_offset,\n            joins=join_condition if join_condition else \"\",\n            partition_filter=partition_filter,\n            to_date=rendered_to_date,\n        )\n\n    def _create_stage_view(\n        self,\n        rendered_template: str,\n        stage: dict,\n        window_start_date: str,\n        window_end_date: str,\n        query_id: str,\n        query_label: str,\n        cadence: str,\n        stage_file_path: str,\n    ) -> str:\n        \"\"\"Create each use case stage view.\n\n        Each stage has a specific order and refer to a specific SQL to be executed.\n\n        Args:\n            rendered_template: rendered stage SQL file.\n            stage: stage to process.\n            window_start_date: start date for the configured stage.\n            window_end_date: end date for the configured stage.\n            query_id: gab configuration table use case identifier.\n            query_label: gab configuration table use case name.\n            cadence: cadence to process.\n            stage_file_path: full stage file path (gab path + stage path).\n        \"\"\"\n        run_start_time = datetime.now()\n        creation_status: str\n        error_message: Exception | str\n\n        try:\n            tmp = ExecEnv.SESSION.sql(rendered_template)\n            num_partitions = ExecEnv.SESSION.conf.get(\n                self._SPARK_DEFAULT_PARALLELISM_CONFIG,\n                self._SPARK_DEFAULT_PARALLELISM_VALUE,\n            )\n\n            if stage[\"repartition\"]:\n                if stage[\"repartition\"].get(\"numPartitions\"):\n                    num_partitions = stage[\"repartition\"][\"numPartitions\"]\n\n                if stage[\"repartition\"].get(\"keys\"):\n                    tmp = tmp.repartition(\n                        int(num_partitions), *stage[\"repartition\"][\"keys\"]\n                    )\n                    self._LOGGER.info(\"Repartitioned on given Key(s)\")\n                else:\n                    tmp = tmp.repartition(int(num_partitions))\n                    self._LOGGER.info(\"Repartitioned on given partition count\")\n\n            temp_step_view_name: str = stage[\"table_alias\"]\n            tmp.createOrReplaceTempView(temp_step_view_name)\n\n            if stage[\"storage_level\"]:\n                ExecEnv.SESSION.sql(\n                    \"CACHE TABLE {tbl} \"\n                    \"OPTIONS ('storageLevel' '{type}')\".format(\n                        tbl=temp_step_view_name,\n                        type=stage[\"storage_level\"],\n                    )\n                )\n                ExecEnv.SESSION.sql(\n                    \"SELECT COUNT(*) FROM {tbl}\".format(  # nosec: B608\n                        tbl=temp_step_view_name\n                    )\n                )\n                self._LOGGER.info(f\"Cached stage view - {temp_step_view_name} 
\")\n\n            creation_status = \"Success\"\n            error_message = \"NA\"\n        except Exception as err:\n            creation_status = \"Failed\"\n            error_message = err\n            raise err\n        finally:\n            run_end_time = datetime.now()\n            GABUtils().logger(\n                run_start_time,\n                run_end_time,\n                window_start_date,\n                window_end_date,\n                query_id,\n                query_label,\n                cadence,\n                stage_file_path,\n                rendered_template,\n                creation_status,\n                error_message,\n                self.spec.target_database,\n            )\n\n        return temp_step_view_name\n\n    def _generate_view_statement(\n        self,\n        query_id: str,\n        cadence: str,\n        temp_stage_view_name: str,\n        lookup_query_builder: DataFrame,\n        window_start_date: str,\n        window_end_date: str,\n        query_label: str,\n    ) -> bool:\n        \"\"\"Feed use case data to the insights table (default: unified use case table).\n\n        Args:\n            query_id: gab configuration table use case identifier.\n            cadence: cadence to process.\n            temp_stage_view_name: name of the temp view generated by the stage.\n            lookup_query_builder: gab configuration data.\n            window_start_date: start date for the configured stage.\n            window_end_date: end date for the configured stage.\n            query_label: gab configuration table use case name.\n        \"\"\"\n        run_start_time = datetime.now()\n        creation_status: str\n        error_message: Exception | str\n\n        GABDeleteGenerator(\n            query_id=query_id,\n            cadence=cadence,\n            temp_stage_view_name=temp_stage_view_name,\n            lookup_query_builder=lookup_query_builder,\n            target_database=self.spec.target_database,\n            target_table=self.spec.target_table,\n        ).generate_sql()\n\n        gen_ins = GABInsertGenerator(\n            query_id=query_id,\n            cadence=cadence,\n            final_stage_table=temp_stage_view_name,\n            lookup_query_builder=lookup_query_builder,\n            target_database=self.spec.target_database,\n            target_table=self.spec.target_table,\n        ).generate_sql()\n        try:\n            ExecEnv.SESSION.sql(gen_ins)\n\n            creation_status = \"Success\"\n            error_message = \"NA\"\n            inserted = True\n        except Exception as err:\n            creation_status = \"Failed\"\n            error_message = err\n            raise\n        finally:\n            run_end_time = datetime.now()\n            GABUtils().logger(\n                run_start_time,\n                run_end_time,\n                window_start_date,\n                window_end_date,\n                query_id,\n                query_label,\n                cadence,\n                \"Final Insert\",\n                gen_ins,\n                creation_status,\n                error_message,\n                self.spec.target_database,\n            )\n\n        return inserted\n\n    @classmethod\n    def _unpersist_cached_views(cls, unpersist_list: list[str]) -> None:\n        \"\"\"Unpersist cached views.\n\n        Args:\n            unpersist_list: list containing the view names to unpersist.\n        \"\"\"\n        [\n            ExecEnv.SESSION.sql(\"UNCACHE TABLE {tbl}\".format(tbl=i))\n     
       for i in unpersist_list\n        ]\n\n    def _generate_ddl(\n        self,\n        latest_config_date: datetime,\n        latest_run_date: datetime,\n        query_id: str,\n        lookup_query_builder: DataFrame,\n    ) -> None:\n        \"\"\"Generate the actual gold asset.\n\n        It will create and return the view containing all specified dimensions, metrics\n            and computed metric for each cadence/reconciliation window.\n\n        Args:\n            latest_config_date: latest use case configuration date.\n            latest_run_date: latest use case run date.\n            query_id: gab configuration table use case identifier.\n            lookup_query_builder: gab configuration data.\n        \"\"\"\n        if str(latest_config_date) > str(latest_run_date):\n            GABViewManager(\n                query_id=query_id,\n                lookup_query_builder=lookup_query_builder,\n                target_database=self.spec.target_database,\n                target_table=self.spec.target_table,\n            ).generate_use_case_views()\n        else:\n            self._LOGGER.info(\n                \"View is not being re-created as there are no changes in the \"\n                \"configuration after the latest run\"\n            )\n"
  },
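The stage SQL rendering in the GAB algorithm above substitutes use-case parameters (cadence, query_id, date window, joins, partition filter) into a templated query before it is executed and registered as a temp view. A minimal sketch of that substitution is shown below, assuming jinja2's Template (the `Template(templated).render(...)` call above matches its API); the template text and all parameter values are hypothetical.

```python
# Minimal sketch of rendering a stage SQL template, assuming jinja2;
# the template and the parameter values below are illustrative only.
from jinja2 import Template

templated = """
SELECT {{ cadence }}, {{ query_id }}, {{ project_date_column }} AS from_date
FROM {{ database }}.{{ target_table }}
WHERE {{ filter_date_column }} BETWEEN '{{ start_date }}' AND '{{ end_date }}'
{{ joins }}
"""

rendered = Template(templated).render(
    cadence="'WEEK' as cadence",
    query_id="'use_case_1' as query_id",
    project_date_column="to_date(order_date)",
    database="source_db",                 # hypothetical source database
    target_table="unified_use_case_table",  # hypothetical target table
    filter_date_column="order_date",
    start_date="2024-01-01",
    end_date="2024-01-08",
    joins="",  # optional join condition, empty when not configured
)
print(rendered)
```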
  {
    "path": "lakehouse_engine/algorithms/reconciliator.py",
    "content": "\"\"\"Module containing the Reconciliator class.\"\"\"\n\nfrom enum import Enum\nfrom typing import List\n\nimport pyspark.sql.functions as spark_fns\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import (  # noqa: A004\n    abs,\n    coalesce,\n    col,\n    lit,\n    try_divide,\n    when,\n)\nfrom pyspark.sql.types import FloatType\n\nfrom lakehouse_engine.algorithms.exceptions import ReconciliationFailedException\nfrom lakehouse_engine.core.definitions import InputSpec, ReconciliatorSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.executable import Executable\nfrom lakehouse_engine.io.reader_factory import ReaderFactory\nfrom lakehouse_engine.transformers.optimizers import Optimizers\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass ReconciliationType(Enum):\n    \"\"\"Type of Reconciliation.\"\"\"\n\n    PCT = \"percentage\"\n    ABS = \"absolute\"\n\n\nclass ReconciliationTransformers(Enum):\n    \"\"\"Transformers Available for the Reconciliation Algorithm.\"\"\"\n\n    AVAILABLE_TRANSFORMERS = {\n        \"cache\": Optimizers.cache,\n        \"persist\": Optimizers.persist,\n    }\n\n\nclass Reconciliator(Executable):\n    \"\"\"Class to define the behavior of an algorithm that checks if data reconciles.\n\n    Checking if data reconciles, using this algorithm, is a matter of reading the\n    'truth' data and the 'current' data. You can use any input specification compatible\n    with the lakehouse engine to read 'truth' or 'current' data. On top of that, you\n    can pass a 'truth_preprocess_query' and a 'current_preprocess_query' so you can\n    preprocess the data before it goes into the actual reconciliation process.\n    Moreover, you can use the 'truth_preprocess_query_args' and\n    'current_preprocess_query_args' to pass additional arguments to be used to apply\n    additional operations on top of the dataframe, resulting from the previous steps.\n    With these arguments you can apply additional operations like caching or persisting\n    the Dataframe. The way to pass the additional arguments for the operations is\n    similar to the TransformSpec, but only a few operations are allowed. Those are\n    defined in ReconciliationTransformers.AVAILABLE_TRANSFORMERS.\n\n    The reconciliation process is focused on joining 'truth' with 'current' by all\n    provided columns except the ones passed as 'metrics'. After that it calculates the\n    differences in the metrics attributes (either percentage or absolute difference).\n    Finally, it aggregates the differences, using the supplied aggregation function\n    (e.g., sum, avg, min, max, etc).\n\n    All of these configurations are passed via the ACON to instantiate a\n    ReconciliatorSpec object.\n\n    !!! note\n        It is crucial that both the current and truth datasets have exactly the same\n        structure.\n    !!! note\n        You should not use 0 as yellow or red threshold, as the algorithm will verify\n        if the difference between the truth and current values is bigger\n        or equal than those thresholds.\n    !!! note\n        The reconciliation does not produce any negative values or percentages, as we\n        use the absolute value of the differences. 
This means that the recon result\n        will not indicate if it was the current values that were bigger or smaller\n        than the truth values, or vice versa.\n    \"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, acon: dict):\n        \"\"\"Construct Algorithm instances.\n\n        Args:\n            acon: algorithm configuration.\n        \"\"\"\n        self.spec: ReconciliatorSpec = ReconciliatorSpec(\n            metrics=acon[\"metrics\"],\n            truth_input_spec=InputSpec(**acon[\"truth_input_spec\"]),\n            current_input_spec=InputSpec(**acon[\"current_input_spec\"]),\n            truth_preprocess_query=acon.get(\"truth_preprocess_query\", None),\n            truth_preprocess_query_args=acon.get(\"truth_preprocess_query_args\", None),\n            current_preprocess_query=acon.get(\"current_preprocess_query\", None),\n            current_preprocess_query_args=acon.get(\n                \"current_preprocess_query_args\", None\n            ),\n            ignore_empty_df=acon.get(\"ignore_empty_df\", False),\n        )\n\n    def get_source_of_truth(self) -> DataFrame:\n        \"\"\"Get the source of truth (expected result) for the reconciliation process.\n\n        Returns:\n            DataFrame containing the source of truth.\n        \"\"\"\n        truth_df = ReaderFactory.get_data(self.spec.truth_input_spec)\n        if self.spec.truth_preprocess_query:\n            truth_df.createOrReplaceTempView(\"truth\")\n            truth_df = ExecEnv.SESSION.sql(self.spec.truth_preprocess_query)\n\n        return truth_df\n\n    def get_current_results(self) -> DataFrame:\n        \"\"\"Get the current results from the table that we are checking if it reconciles.\n\n        Returns:\n            DataFrame containing the current results.\n        \"\"\"\n        current_df = ReaderFactory.get_data(self.spec.current_input_spec)\n        if self.spec.current_preprocess_query:\n            current_df.createOrReplaceTempView(\"current\")\n            current_df = ExecEnv.SESSION.sql(self.spec.current_preprocess_query)\n\n        return current_df\n\n    def execute(self) -> None:\n        \"\"\"Reconcile the current results against the truth dataset.\"\"\"\n        truth_df = self.get_source_of_truth()\n        self._apply_preprocess_query_args(\n            truth_df, self.spec.truth_preprocess_query_args\n        )\n        self._logger.info(\"Source of truth:\")\n        truth_df.show(1000, truncate=False)\n\n        current_results_df = self.get_current_results()\n        self._apply_preprocess_query_args(\n            current_results_df, self.spec.current_preprocess_query_args\n        )\n        self._logger.info(\"Current results:\")\n        current_results_df.show(1000, truncate=False)\n\n        status = \"green\"\n\n        # if ignore_empty_df is true, run empty check on truth_df and current_results_df\n        # if both the dataframes are empty then exit with green\n        if (\n            self.spec.ignore_empty_df\n            and truth_df.isEmpty()\n            and current_results_df.isEmpty()\n        ):\n            self._logger.info(\n                f\"ignore_empty_df is {self.spec.ignore_empty_df}, \"\n                f\"truth_df and current_results_df are empty, \"\n                f\"hence ignoring reconciliation\"\n            )\n            self._logger.info(\"The Reconciliation process has succeeded.\")\n            return\n\n        recon_results = self._get_recon_results(\n            truth_df, 
current_results_df, self.spec.metrics\n        )\n        self._logger.info(f\"Reconciliation result: {recon_results}\")\n\n        for m in self.spec.metrics:\n            metric_name = f\"{m['metric']}_{m['type']}_diff_{m['aggregation']}\"\n            if m[\"yellow\"] <= recon_results[metric_name] < m[\"red\"]:\n                if status == \"green\":\n                    # only switch to yellow if it was green before, otherwise we want\n                    # to preserve 'red' as the final status.\n                    status = \"yellow\"\n            elif m[\"red\"] <= recon_results[metric_name]:\n                status = \"red\"\n\n        if status != \"green\":\n            raise ReconciliationFailedException(\n                f\"The Reconciliation process has failed with status: {status}.\"\n            )\n        else:\n            self._logger.info(\"The Reconciliation process has succeeded.\")\n\n    @staticmethod\n    def _apply_preprocess_query_args(\n        df: DataFrame, preprocess_query_args: List[dict]\n    ) -> DataFrame:\n        \"\"\"Apply transformers on top of the preprocessed query.\n\n        Args:\n            df: dataframe being transformed.\n            preprocess_query_args: dict having the functions/transformations to\n                apply and respective arguments.\n\n        Returns: the transformed Dataframe.\n        \"\"\"\n        transformed_df = df\n\n        if preprocess_query_args is None:\n            try:\n                transformed_df = df.transform(Optimizers.cache())\n            except Exception as e:\n                Reconciliator._logger.warning(\n                    f\"Could not apply default caching to the dataframe.\"\n                    f\"Continuing without caching. Exception: {e}\"\n                )\n        elif len(preprocess_query_args) > 0:\n            for transformation in preprocess_query_args:\n                rec_func = ReconciliationTransformers.AVAILABLE_TRANSFORMERS.value[\n                    transformation[\"function\"]\n                ](\n                    **transformation.get(\"args\", {})\n                )  # type: ignore\n\n                transformed_df = df.transform(rec_func)\n        else:\n            transformed_df = df\n\n        return transformed_df\n\n    def _get_recon_results(\n        self, truth_df: DataFrame, current_results_df: DataFrame, metrics: List[dict]\n    ) -> dict:\n        \"\"\"Get the reconciliation results by comparing truth_df with current_results_df.\n\n        Args:\n            truth_df: dataframe with the truth data to reconcile against. 
It is\n                typically an aggregated dataset to use as baseline and then we match the\n                current_results_df (Aggregated at the same level) against this truth.\n            current_results_df: dataframe with the current results of the dataset we\n                are trying to reconcile.\n            metrics: list of dicts containing metric, aggregation, yellow threshold and\n                red threshold.\n\n        Return:\n            dictionary with the results (difference between truth and current results)\n        \"\"\"\n        if len(truth_df.head(1)) == 0 or len(current_results_df.head(1)) == 0:\n            raise ReconciliationFailedException(\n                \"The reconciliation has failed because either the truth dataset or the \"\n                \"current results dataset was empty.\"\n            )\n\n        # truth and current are joined on all columns except the metrics\n        joined_df = truth_df.alias(\"truth\").join(\n            current_results_df.alias(\"current\"),\n            [\n                truth_df[c] == current_results_df[c]\n                for c in current_results_df.columns\n                if c not in [m[\"metric\"] for m in metrics]\n            ],\n            how=\"full\",\n        )\n\n        for m in metrics:\n            if m[\"type\"] == ReconciliationType.PCT.value:\n                joined_df = joined_df.withColumn(\n                    f\"{m['metric']}_{m['type']}_diff\",\n                    coalesce(\n                        (\n                            # we need to make sure we don't produce negative values\n                            # because our thresholds only accept > or >= comparisons.\n                            abs(\n                                try_divide(\n                                    (\n                                        col(f\"current.{m['metric']}\")\n                                        - col(f\"truth.{m['metric']}\")\n                                    ),\n                                    abs(col(f\"truth.{m['metric']}\")),\n                                )\n                            )\n                        ),\n                        # if the formula above produces null, we need to consider where\n                        # it came from: we check below if the values were the same,\n                        # and if so the diff is 0, if not the diff is 1 (e.g., the null\n                        # result might have come from a division by 0).\n                        when(\n                            col(f\"current.{m['metric']}\").eqNullSafe(\n                                col(f\"truth.{m['metric']}\")\n                            ),\n                            lit(0),\n                        ).otherwise(lit(1)),\n                    ),\n                )\n            elif m[\"type\"] == ReconciliationType.ABS.value:\n                joined_df = joined_df.withColumn(\n                    f\"{m['metric']}_{m['type']}_diff\",\n                    abs(\n                        coalesce(col(f\"current.{m['metric']}\"), lit(0))\n                        - coalesce(col(f\"truth.{m['metric']}\"), lit(0))\n                    ),\n                )\n            else:\n                raise NotImplementedError(\n                    \"The requested reconciliation type is not yet implemented.\"\n                )\n\n            joined_df = joined_df.withColumn(\n                f\"{m['metric']}_{m['type']}_diff\",\n                
col(f\"{m['metric']}_{m['type']}_diff\").cast(FloatType()),\n            )\n\n        results_df = joined_df.agg(\n            *[\n                getattr(spark_fns, m[\"aggregation\"])(\n                    f\"{m['metric']}_{m['type']}_diff\"\n                ).alias(f\"{m['metric']}_{m['type']}_diff_{m['aggregation']}\")\n                for m in metrics\n            ]\n        )\n\n        return results_df.collect()[0].asDict()\n"
  },
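For reference, a minimal ACON for the Reconciliator above might look like the sketch below. The keys mirror what `__init__` and `execute()` read (metrics with metric/type/aggregation/yellow/red, truth and current input specs, ignore_empty_df); the table and metric names are hypothetical.

```python
# Hypothetical ACON sketch for the Reconciliator; keys follow what __init__
# reads above, table/column names are illustrative only.
acon = {
    "metrics": [
        {
            "metric": "net_sales",   # metric column present in both datasets
            "type": "percentage",    # ReconciliationType.PCT ("absolute" also supported)
            "aggregation": "max",    # any pyspark.sql.functions aggregation name
            "yellow": 0.05,          # diff >= yellow and < red -> status "yellow"
            "red": 0.1,              # diff >= red -> status "red"
        }
    ],
    "truth_input_spec": {
        "spec_id": "truth",
        "read_type": "batch",
        "data_format": "delta",
        "db_table": "gold_db.sales_truth",
    },
    "current_input_spec": {
        "spec_id": "current",
        "read_type": "batch",
        "data_format": "delta",
        "db_table": "gold_db.sales_current",
    },
    "ignore_empty_df": False,
}

# Reconciliator(acon).execute() raises ReconciliationFailedException whenever
# the final status is not "green", i.e. when any metric's aggregated
# difference reaches its yellow or red threshold.
```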
  {
    "path": "lakehouse_engine/algorithms/sensor.py",
    "content": "\"\"\"Module to define Sensor algorithm behavior.\"\"\"\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.algorithms.algorithm import Algorithm\nfrom lakehouse_engine.algorithms.exceptions import (\n    NoNewDataException,\n    SensorAlreadyExistsException,\n)\nfrom lakehouse_engine.core.definitions import (\n    SENSOR_ALLOWED_DATA_FORMATS,\n    InputFormat,\n    ReadType,\n    SensorSpec,\n    SensorStatus,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.sensor_manager import (\n    SensorControlTableManager,\n    SensorUpstreamManager,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Sensor(Algorithm):\n    \"\"\"Class representing a sensor to check if the upstream has new data.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, acon: dict):\n        \"\"\"Construct Sensor instances.\n\n        Args:\n            acon: algorithm configuration.\n        \"\"\"\n        self.spec: SensorSpec = SensorSpec.create_from_acon(acon=acon)\n        self._validate_sensor_spec()\n\n        if self._check_if_sensor_already_exists():\n            raise SensorAlreadyExistsException(\n                \"There's already a sensor registered with same id or assets!\"\n            )\n\n    def execute(self) -> bool:\n        \"\"\"Execute the sensor.\"\"\"\n        self._LOGGER.info(f\"Starting {self.spec.input_spec.data_format} sensor...\")\n\n        new_data_df = SensorUpstreamManager.read_new_data(sensor_spec=self.spec)\n        if self.spec.input_spec.read_type == ReadType.STREAMING.value:\n            Sensor._run_streaming_sensor(sensor_spec=self.spec, new_data_df=new_data_df)\n        elif self.spec.input_spec.read_type == ReadType.BATCH.value:\n            Sensor._run_batch_sensor(\n                sensor_spec=self.spec,\n                new_data_df=new_data_df,\n            )\n\n        has_new_data = SensorControlTableManager.check_if_sensor_has_acquired_data(\n            self.spec.sensor_id,\n            self.spec.control_db_table_name,\n        )\n\n        self._LOGGER.info(\n            f\"Sensor {self.spec.sensor_id} has previously \"\n            f\"acquired data? 
{has_new_data}\"\n        )\n\n        if self.spec.fail_on_empty_result and not has_new_data:\n            raise NoNewDataException(\n                f\"No data was acquired by {self.spec.sensor_id} sensor.\"\n            )\n\n        return has_new_data\n\n    def _check_if_sensor_already_exists(self) -> bool:\n        \"\"\"Check if sensor already exists in the table to avoid duplicates.\"\"\"\n        row = SensorControlTableManager.read_sensor_table_data(\n            sensor_id=self.spec.sensor_id,\n            control_db_table_name=self.spec.control_db_table_name,\n        )\n\n        if row and row.assets != self.spec.assets:\n            return True\n        else:\n            row = SensorControlTableManager.read_sensor_table_data(\n                assets=self.spec.assets,\n                control_db_table_name=self.spec.control_db_table_name,\n            )\n            return row is not None and row.sensor_id != self.spec.sensor_id\n\n    @classmethod\n    def _run_streaming_sensor(\n        cls, sensor_spec: SensorSpec, new_data_df: DataFrame\n    ) -> None:\n        \"\"\"Run sensor in streaming mode (internally runs in batch mode).\"\"\"\n\n        def foreach_batch_check_new_data(df: DataFrame, batch_id: int) -> None:\n            # forcing session to be available inside forEachBatch on\n            # Spark Connect\n            ExecEnv.get_or_create()\n\n            Sensor._run_batch_sensor(\n                sensor_spec=sensor_spec,\n                new_data_df=df,\n            )\n\n        new_data_df.writeStream.trigger(availableNow=True).option(\n            \"checkpointLocation\", sensor_spec.checkpoint_location\n        ).foreachBatch(foreach_batch_check_new_data).start().awaitTermination()\n\n    @classmethod\n    def _run_batch_sensor(\n        cls,\n        sensor_spec: SensorSpec,\n        new_data_df: DataFrame,\n    ) -> None:\n        \"\"\"Run sensor in batch mode.\n\n        Args:\n            sensor_spec: sensor spec containing all sensor information.\n            new_data_df: DataFrame possibly containing new data.\n        \"\"\"\n        new_data_first_row = SensorUpstreamManager.get_new_data(new_data_df)\n\n        cls._LOGGER.info(\n            f\"Sensor {sensor_spec.sensor_id} has new data from upstream? 
\"\n            f\"{new_data_first_row is not None}\"\n        )\n\n        if new_data_first_row:\n            SensorControlTableManager.update_sensor_status(\n                sensor_spec=sensor_spec,\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                upstream_key=(\n                    new_data_first_row.UPSTREAM_KEY\n                    if \"UPSTREAM_KEY\" in new_data_df.columns\n                    else None\n                ),\n                upstream_value=(\n                    new_data_first_row.UPSTREAM_VALUE\n                    if \"UPSTREAM_VALUE\" in new_data_df.columns\n                    else None\n                ),\n            )\n            cls._LOGGER.info(\n                f\"Successfully updated sensor status for sensor \"\n                f\"{sensor_spec.sensor_id}...\"\n            )\n\n    def _validate_sensor_spec(self) -> None:\n        \"\"\"Validate if sensor spec Read Type is allowed for the selected Data Format.\"\"\"\n        if InputFormat.exists(self.spec.input_spec.data_format):\n            if (\n                self.spec.input_spec.data_format\n                not in SENSOR_ALLOWED_DATA_FORMATS[self.spec.input_spec.read_type]\n            ):\n                raise NotImplementedError(\n                    f\"A sensor has not been implemented yet for this data format or, \"\n                    f\"this data format is not available for the read_type\"\n                    f\" {self.spec.input_spec.read_type}. \"\n                    f\"Check the allowed combinations of read_type and data_formats:\"\n                    f\" {SENSOR_ALLOWED_DATA_FORMATS}\"\n                )\n        else:\n            raise NotImplementedError(\n                f\"Data format {self.spec.input_spec.data_format} isn't implemented yet.\"\n            )\n"
  },
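A minimal Sensor ACON could look like the sketch below; the keys mirror the dictionary that `Heartbeat._get_sensor_acon_from_heartbeat` builds further down in this package (sensor_id, assets, control_db_table_name, input_spec, preprocess_query, base_checkpoint_location, fail_on_empty_result), while the table names and paths are hypothetical.

```python
# Hypothetical Sensor ACON sketch; keys follow the dict built by the Heartbeat
# algorithm below, with illustrative table and path names.
acon = {
    "sensor_id": "upstream_sales_table",
    "assets": ["sales_asset"],
    "control_db_table_name": "lakehouse_db.sensors",
    "input_spec": {
        "spec_id": "sensor_upstream",
        "read_type": "streaming",
        "data_format": "delta",
        "db_table": "upstream_db.sales",
    },
    "preprocess_query": None,
    "base_checkpoint_location": "s3://bucket/checkpoints",  # hypothetical path
    "fail_on_empty_result": True,  # raise NoNewDataException when nothing new
}

# Sensor(acon).execute() returns True when new upstream data was acquired and
# the control table was updated with SensorStatus.ACQUIRED_NEW_DATA.
```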
  {
    "path": "lakehouse_engine/algorithms/sensors/__init__.py",
    "content": "\"\"\"Package containing all the lakehouse engine Sensor Heartbeat algorithms.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/algorithms/sensors/heartbeat.py",
    "content": "\"\"\"Module to define Heartbeat Sensor algorithm behavior.\"\"\"\n\nimport re\nfrom typing import Optional\n\nfrom delta import DeltaTable\nfrom pyspark import Row\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.column import Column\nfrom pyspark.sql.functions import (\n    col,\n    concat_ws,\n    count,\n    current_timestamp,\n    lit,\n    regexp_replace,\n    row_number,\n    trim,\n    upper,\n)\nfrom pyspark.sql.window import Window\n\nfrom lakehouse_engine.algorithms.algorithm import Algorithm\nfrom lakehouse_engine.algorithms.sensors.sensor import Sensor\nfrom lakehouse_engine.core.definitions import (\n    HEARTBEAT_SENSOR_UPDATE_SET,\n    HeartbeatConfigSpec,\n    HeartbeatSensorSource,\n    HeartbeatStatus,\n    SensorStatus,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.sensor_manager import (\n    SensorJobRunManager,\n    SensorUpstreamManager,\n)\nfrom lakehouse_engine.terminators.sensor_terminator import SensorTerminator\nfrom lakehouse_engine.utils.databricks_utils import DatabricksUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Heartbeat(Algorithm):\n    \"\"\"Class representing a Heartbeat to check if the upstream has new data.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, acon: dict):\n        \"\"\"Construct Heartbeat instances.\n\n        Args:\n            acon: algorithm configuration.\n        \"\"\"\n        self.spec: HeartbeatConfigSpec = HeartbeatConfigSpec.create_from_acon(acon=acon)\n\n    def execute(self) -> None:\n        \"\"\"Execute the heartbeat.\"\"\"\n        latest_event_current_timestamp = current_timestamp()\n        heartbeat_sensor_delta_table = DeltaTable.forName(\n            ExecEnv.SESSION,\n            self.spec.heartbeat_sensor_db_table,\n        )\n        sensor_source = self.spec.sensor_source\n\n        active_jobs_from_heartbeat_control_table_df = self._get_active_heartbeat_jobs(\n            heartbeat_sensor_delta_table, sensor_source\n        )\n\n        for (\n            control_table_df_row\n        ) in active_jobs_from_heartbeat_control_table_df.collect():\n\n            sensor_acon = self._get_sensor_acon_from_heartbeat(\n                self.spec, control_table_df_row\n            )\n\n            sensors_with_new_data = self._execute_batch_of_sensor(\n                sensor_acon, control_table_df_row\n            )\n\n            if sensors_with_new_data:\n\n                self._update_heartbeat_status_with_sensor_info(\n                    active_jobs_from_heartbeat_control_table_df,\n                    heartbeat_sensor_delta_table,\n                    self._get_heartbeat_sensor_condition(sensors_with_new_data),\n                    latest_event_current_timestamp,\n                    sensor_source,\n                )\n\n    @classmethod\n    def _get_active_heartbeat_jobs(\n        cls, heartbeat_sensor_delta_table: DeltaTable, sensor_source: str\n    ) -> DataFrame:\n        \"\"\"Get UNPAUSED and NULL or COMPLETED status record from control table.\n\n        :param heartbeat_sensor_delta_table: DeltaTable for heartbeat sensor.\n        :param sensor_source: source system from Spec(e.g. 
sap_b4, delta, kafka etc.).\n\n        Returns:\n            A control table DataFrame containing records for specified sensor_source\n            that are UNPAUSED and have a status of either NULL or COMPLETED.\n        \"\"\"\n        full_control_table = heartbeat_sensor_delta_table.toDF()\n\n        filtered_control_table = full_control_table.filter(\n            f\"lower(sensor_source) == '{sensor_source}'\"\n        ).filter(\n            \"job_state == 'UNPAUSED' and (status is null OR status == 'COMPLETED')\"\n        )\n\n        return filtered_control_table\n\n    @classmethod\n    def generate_unique_column_values(cls, main_col: str, col_to_append: str) -> str:\n        \"\"\"Generate a unique value by appending columns and replacing specific chars.\n\n        Generate a unique value by appending another column and replacing spaces,\n        dots, and colons with underscores for consistency.\n\n        :param main_col: The primary column value.\n        :param col_to_append: Column value to append for uniqueness.\n\n        Returns:\n            A unique, combined column value.\n        \"\"\"\n        return f\"{re.sub(r'[ :.]', '_', main_col)}_{col_to_append}\"\n\n    @classmethod\n    def _get_sensor_acon_from_heartbeat(\n        cls, heartbeat_spec: HeartbeatConfigSpec, control_table_df_row: Row\n    ) -> dict:\n        \"\"\"Create sensor acon from heartbeat config and specifications.\n\n        :param heartbeat_spec: Heartbeat specifications.\n        :param control_table_df_row: Control table active records Dataframe Row.\n\n        Returns:\n            The sensor acon dict.\n        \"\"\"\n        sensors_to_execute: dict = {\n            \"sensor_id\": (\n                cls.generate_unique_column_values(\n                    control_table_df_row[\"sensor_id\"],\n                    control_table_df_row[\"trigger_job_id\"],\n                )\n            ),  # 1. sensor_id can be same for two or more different trigger_job_id\n            # 2. Replacing colon,space,dot(.) 
with underscore(_) is required to get the\n            # checkpoint_location fixed in case of delta_table and kafka source\n            \"assets\": [\n                cls.generate_unique_column_values(\n                    control_table_df_row[\"asset_description\"],\n                    control_table_df_row[\"trigger_job_id\"],\n                )\n            ],\n            \"control_db_table_name\": heartbeat_spec.lakehouse_engine_sensor_db_table,\n            \"input_spec\": {\n                \"spec_id\": \"sensor_upstream\",\n                \"read_type\": control_table_df_row[\"sensor_read_type\"],\n                \"data_format\": heartbeat_spec.data_format,\n                \"db_table\": (\n                    control_table_df_row[\"sensor_id\"]\n                    if heartbeat_spec.data_format == \"delta\"\n                    else None\n                ),\n                \"options\": heartbeat_spec.options,\n                \"location\": (\n                    (\n                        heartbeat_spec.base_trigger_file_location\n                        + \"/\"\n                        + control_table_df_row[\"sensor_id\"]\n                    )\n                    if heartbeat_spec.base_trigger_file_location is not None\n                    else None\n                ),\n                \"schema\": heartbeat_spec.schema_dict,\n            },\n            \"preprocess_query\": control_table_df_row[\"preprocess_query\"],\n            \"base_checkpoint_location\": heartbeat_spec.base_checkpoint_location,\n            \"fail_on_empty_result\": False,\n        }\n\n        final_sensors_to_execute = cls._enhance_sensor_acon_extra_options(\n            heartbeat_spec, control_table_df_row, sensors_to_execute\n        )\n\n        return final_sensors_to_execute\n\n    @classmethod\n    def _enhance_sensor_acon_extra_options(\n        cls,\n        heartbeat_spec: HeartbeatConfigSpec,\n        control_table_df_row: Row,\n        sensors_to_execute: dict,\n    ) -> dict:\n        \"\"\"Enhance sensor acon with extra options for specific source system.\n\n        :param heartbeat_spec: Heartbeat specifications.\n        :param control_table_df_row: Control table active records Dataframe Row.\n        :param sensors_to_execute: sensor acon dictionary from previous step.\n\n        Returns:\n            The sensor acon dict having enhanced options for specific sensor_source.\n        \"\"\"\n        LATEST_FETCH_EVENT_TIMESTAMP = (\n            control_table_df_row.latest_event_fetched_timestamp\n        )\n\n        upstream_key = control_table_df_row[\"upstream_key\"]\n\n        upstream_value = (\n            LATEST_FETCH_EVENT_TIMESTAMP.strftime(\"%Y%m%d%H%M%S\")\n            if LATEST_FETCH_EVENT_TIMESTAMP is not None\n            else \"19000101000000\"\n        )\n\n        if control_table_df_row.sensor_source.lower() in [\n            HeartbeatSensorSource.SAP_B4.value,\n            HeartbeatSensorSource.SAP_BW.value,\n        ]:\n\n            sensors_to_execute[\"input_spec\"][\"options\"][\"prepareQuery\"] = (\n                SensorUpstreamManager.generate_sensor_sap_logchain_query(\n                    chain_id=control_table_df_row.sensor_id,\n                    dbtable=heartbeat_spec.jdbc_db_table,\n                )\n            )\n            sensors_to_execute[\"input_spec\"][\"options\"][\"query\"] = (\n                SensorUpstreamManager.generate_filter_exp_query(\n                    sensor_id=control_table_df_row.sensor_id,\n                    
filter_exp=\"?upstream_key > '?upstream_value'\",\n                    control_db_table_name=(\n                        heartbeat_spec.lakehouse_engine_sensor_db_table\n                    ),\n                    upstream_key=upstream_key,\n                    upstream_value=upstream_value,\n                )\n            )\n\n        elif (\n            control_table_df_row.sensor_source.lower()\n            == HeartbeatSensorSource.LMU_DELTA_TABLE.value\n        ):\n\n            sensors_to_execute[\"preprocess_query\"] = (\n                SensorUpstreamManager.generate_filter_exp_query(\n                    sensor_id=control_table_df_row.sensor_id,\n                    filter_exp=\"?upstream_key > '?upstream_value'\",\n                    control_db_table_name=(\n                        heartbeat_spec.lakehouse_engine_sensor_db_table\n                    ),\n                    upstream_key=upstream_key,\n                    upstream_value=upstream_value,\n                )\n            )\n\n        elif (\n            control_table_df_row.sensor_source.lower()\n            == HeartbeatSensorSource.KAFKA.value\n        ):\n\n            kafka_options = cls._get_all_kafka_options(\n                heartbeat_spec.kafka_configs,\n                control_table_df_row[\"sensor_id\"],\n                heartbeat_spec.kafka_secret_scope,\n            )\n\n            sensors_to_execute[\"input_spec\"][\"options\"] = kafka_options\n\n        return sensors_to_execute\n\n    @classmethod\n    def _get_all_kafka_options(\n        cls,\n        kafka_configs: dict,\n        kafka_sensor_id: str,\n        kafka_secret_scope: str,\n    ) -> dict:\n        \"\"\"Get all Kafka extra options for sensor ACON.\n\n        Read all heartbeat sensor related kafka config dynamically based on\n        data product name or any other prefix which should match with sensor_id prefix.\n\n        :param kafka_configs: kafka config read from yaml file.\n        :param kafka_sensor_id: kafka topic for which new event to be fetched.\n        :param kafka_secret_scope: secret scope used for kafka processing.\n\n        Returns:\n            The sensor acon dict having enhanced options for kafka source.\n        \"\"\"\n        sensor_id_desc = kafka_sensor_id.split(\":\")\n        dp_name_filter = sensor_id_desc[0].strip()\n        KAFKA_TOPIC = sensor_id_desc[1].strip()\n\n        KAFKA_BOOTSTRAP_SERVERS = kafka_configs[dp_name_filter][\n            \"kafka_bootstrap_servers_list\"\n        ]\n        KAFKA_TRUSTSTORE_LOCATION = kafka_configs[dp_name_filter][\n            \"kafka_ssl_truststore_location\"\n        ]\n        KAFKA_KEYSTORE_LOCATION = kafka_configs[dp_name_filter][\n            \"kafka_ssl_keystore_location\"\n        ]\n        KAFKA_TRUSTSTORE_PSWD_SECRET_KEY = kafka_configs[dp_name_filter][\n            \"truststore_pwd_secret_key\"\n        ]\n        KAFKA_TRUSTSTORE_PSWD = (\n            DatabricksUtils.get_db_utils(ExecEnv.SESSION).secrets.get(\n                scope=kafka_secret_scope,\n                key=KAFKA_TRUSTSTORE_PSWD_SECRET_KEY,\n            )\n            if KAFKA_TRUSTSTORE_PSWD_SECRET_KEY\n            else None\n        )\n        KAFKA_KEYSTORE_PSWD_SECRET_KEY = kafka_configs[dp_name_filter][\n            \"keystore_pwd_secret_key\"\n        ]\n        KAFKA_KEYSTORE_PSWD = (\n            DatabricksUtils.get_db_utils(ExecEnv.SESSION).secrets.get(\n                scope=kafka_secret_scope,\n                key=KAFKA_KEYSTORE_PSWD_SECRET_KEY,\n            )\n            if 
KAFKA_KEYSTORE_PSWD_SECRET_KEY\n            else None\n        )\n\n        kafka_options_dict = {\n            \"kafka.bootstrap.servers\": KAFKA_BOOTSTRAP_SERVERS,\n            \"subscribe\": KAFKA_TOPIC,\n            \"startingOffsets\": \"earliest\",\n            \"kafka.security.protocol\": \"SSL\",\n            \"kafka.ssl.truststore.location\": KAFKA_TRUSTSTORE_LOCATION,\n            \"kafka.ssl.truststore.password\": KAFKA_TRUSTSTORE_PSWD,\n            \"kafka.ssl.keystore.location\": KAFKA_KEYSTORE_LOCATION,\n            \"kafka.ssl.keystore.password\": KAFKA_KEYSTORE_PSWD,\n        }\n\n        return kafka_options_dict\n\n    @classmethod\n    def _execute_batch_of_sensor(\n        cls, sensor_acon: dict, control_table_df_row: Row\n    ) -> dict:\n        \"\"\"Execute sensor acon to fetch NEW EVENT AVAILABLE for sensor source system.\n\n        :param sensor_acon: sensor acon created from heartbeat config and specs.\n        :param control_table_df_row: Control table active records Dataframe Row.\n\n        Returns:\n            Dict containing sensor_id and trigger_job_id for sensor with new data.\n        \"\"\"\n        sensors_with_new_data: dict = {}\n\n        cls._LOGGER.info(f\"Executing sensor: {sensor_acon}\")\n        has_new_data = Sensor(sensor_acon).execute()\n\n        if has_new_data:\n            sensors_with_new_data[\"sensor_id\"] = control_table_df_row[\"sensor_id\"]\n            sensors_with_new_data[\"trigger_job_id\"] = control_table_df_row[\n                \"trigger_job_id\"\n            ]\n\n        return sensors_with_new_data\n\n    @classmethod\n    def _get_heartbeat_sensor_condition(\n        cls,\n        sensors_with_new_data: dict,\n    ) -> Optional[str]:\n        \"\"\"Get heartbeat sensor new event available condition.\n\n        :param sensors_with_new_data: dict having NEW_EVENT_AVAILABLE sensor_id record.\n\n        Returns:\n            String having condition for sensor having new data available.\n        \"\"\"\n        heartbeat_sensor_with_new_event_available = (\n            f\"(sensor_id = '{sensors_with_new_data['sensor_id']}' AND \"\n            f\"trigger_job_id = '{sensors_with_new_data['trigger_job_id']}')\"\n        )\n\n        return heartbeat_sensor_with_new_event_available\n\n    @classmethod\n    def _update_heartbeat_status_with_sensor_info(\n        cls,\n        heartbeat_sensor_jobs: DataFrame,\n        heartbeat_sensor_delta_table: DeltaTable,\n        heartbeat_with_new_event_available_condition: str,\n        latest_event_current_timestamp: Column,\n        sensor_source: str,\n    ) -> None:\n        \"\"\"Update heartbeat status with sensor info.\n\n        :param heartbeat_sensor_jobs: active UNPAUSED jobs from Control table dataframe.\n        :param heartbeat_sensor_delta_table: heartbeat sensor Delta table.\n        :param heartbeat_with_new_event_available_condition: new event available cond.\n        :param latest_event_current_timestamp: timestamp when new event was captured.\n        \"\"\"\n        if heartbeat_with_new_event_available_condition:\n            sensors_with_new_event_available = (\n                heartbeat_sensor_jobs.filter(\n                    heartbeat_with_new_event_available_condition\n                )\n                .withColumn(\"status\", lit(HeartbeatStatus.NEW_EVENT_AVAILABLE.value))\n                .withColumn(\"status_change_timestamp\", current_timestamp())\n                .withColumn(\n                    \"latest_event_fetched_timestamp\", 
latest_event_current_timestamp\n                )\n            )\n\n            new_event_merge_condition = f\"\"\"target.sensor_id = src.sensor_id AND\n                target.trigger_job_id = src.trigger_job_id AND\n                target.sensor_source = '{sensor_source}'\"\"\"\n\n            if sensors_with_new_event_available.count() > 0:\n                cls.update_heartbeat_control_table(\n                    heartbeat_sensor_delta_table,\n                    sensors_with_new_event_available,\n                    new_event_merge_condition,\n                )\n        else:\n            cls._LOGGER.info(\"No sensors to execute!\")\n\n    @classmethod\n    def update_heartbeat_control_table(\n        cls,\n        heartbeat_sensor_delta_table: DeltaTable,\n        updated_data: DataFrame,\n        heartbeat_control_table_merge_condition: str,\n    ) -> None:\n        \"\"\"Update heartbeat control table with the new data.\n\n        :param heartbeat_sensor_delta_table: db_table heartbeat sensor control table.\n        :param updated_data: data to update the control table.\n        :param heartbeat_control_table_merge_condition: merge condition for table.\n        \"\"\"\n        cls._LOGGER.info(f\"updated data: {updated_data}\")\n\n        heartbeat_sensor_delta_table.alias(\"target\").merge(\n            updated_data.alias(\"src\"),\n            (heartbeat_control_table_merge_condition),\n        ).whenMatchedUpdate(\n            set=HEARTBEAT_SENSOR_UPDATE_SET\n        ).whenNotMatchedInsertAll().execute()\n\n    @classmethod\n    def get_heartbeat_jobs_to_trigger(\n        cls,\n        heartbeat_sensor_db_table: str,\n        heartbeat_sensor_control_table_df: DataFrame,\n    ) -> list[Row]:\n        \"\"\"Get heartbeat jobs to trigger.\n\n        Check if all the dependencies are satisfied to trigger the job.\n        dependency_flag column to be checked for all sensor_id and\n        trigger_job_id combination keeping status as NEW_EVENT_AVAILABLE in mind.\n\n        Check dependencies based trigger_job_id. From all control table record having\n        status as NEW_EVENT_AVAILABLE, then it will fetch status and dependency_flag\n        for all records having same trigger_job_id. If trigger_job_id, status,\n        dependency_flag combination is same for all dependencies, Get distinct record\n        and do count level aggregation for trigger_job_id, dependency_flag.\n\n        Count level aggregation based on trigger_job_id, dependency_flag picks all\n        those trigger_job_id which doesn`t satisfy dependency as it denotes there are\n        more than one record present having dependency_flag = \"TRUE\" and status is\n        different for same trigger_job_id. 
If count is not more than 1, means condition\n        satisfied, Job id will be considered for triggering.\n\n        If trigger_job_id, status, dependency_flag combination is not same for all\n        dependencies, aggregated count will result in more than one record and it will\n        go under jobs_to_not_trigger and will not trigger job.\n\n        :param heartbeat_sensor_db_table: heartbeat sensor table name.\n        :param heartbeat_sensor_control_table_df: Dataframe for heartbeat control table.\n        :return: list of jobs to be triggered.\n        \"\"\"\n        # Get all distinct trigger_job_id where status is NEW_EVENT_AVAILABLE\n        trigger_jobs_new_events_df = (\n            heartbeat_sensor_control_table_df.filter(\n                f\"status == '{HeartbeatStatus.NEW_EVENT_AVAILABLE.value}'\"\n            )\n            .select(col(\"trigger_job_id\"))\n            .distinct()\n        )\n\n        # Get distinct trigger_job_id, status, dependency_flag for control table records\n        full_data_df = (\n            ExecEnv.SESSION.table(heartbeat_sensor_db_table)\n            .select(\n                col(\"trigger_job_id\"),\n                col(\"status\"),\n                upper(col(\"dependency_flag\")).alias(\"dependency_flag\"),\n            )\n            .distinct()\n        )\n\n        # Join NEW_EVENT_AVAILABLE records with full table to get all dependencies\n        # based on trigger_job_id. dependency_flag = \"TRUE\" needs to be checked as\n        # we are only concerned with records where dependencies needs to be checked.\n        full_data_trigger_job_id = col(\"full_data.trigger_job_id\")\n        dep_flag_comparison = trim(upper(col(\"dependency_flag\"))) == \"TRUE\"\n        jobs_with_new_events_df = (\n            full_data_df.alias(\"full_data\")\n            .join(\n                trigger_jobs_new_events_df.alias(\"jobs_with_new_events\"),\n                col(\"jobs_with_new_events.trigger_job_id\") == full_data_trigger_job_id,\n                \"inner\",\n            )\n            .select(\n                full_data_trigger_job_id,\n                col(\"full_data.status\"),\n                col(\"full_data.dependency_flag\"),\n            )\n        ).filter(dep_flag_comparison)\n\n        # Count level aggregation based on trigger_job_id, dependency_flag picks all\n        # those trigger_job_id which doesn`t satisfy dependency as it denotes there\n        # are more than one record present having dependency_flag = \"TRUE\" and status\n        # is different for same trigger_job_id.\n        jobs_to_not_trigger_with_new_event_df = (\n            jobs_with_new_events_df.filter(dep_flag_comparison)\n            .groupBy(\"trigger_job_id\", \"dependency_flag\")\n            .agg(count(\"trigger_job_id\").alias(\"count\"))\n            .where(col(\"count\") > 1)\n        )\n\n        jobs_to_trigger_df = (\n            jobs_with_new_events_df.alias(\"full_data\")\n            .join(\n                jobs_to_not_trigger_with_new_event_df.alias(\"jobs_to_not_trigger\"),\n                (col(\"jobs_to_not_trigger.trigger_job_id\") == full_data_trigger_job_id),\n                \"left_anti\",\n            )\n            .groupBy(\"trigger_job_id\", \"status\")\n            .agg(count(\"trigger_job_id\").alias(\"count\"))\n            .where(col(\"count\") == 1)\n        )\n\n        jobs_to_trigger_df = jobs_to_trigger_df.select(\"trigger_job_id\").distinct()\n        jobs_to_trigger = jobs_to_trigger_df.collect()\n\n        return 
jobs_to_trigger\n\n    @classmethod\n    def get_anchor_job_record(\n        cls, heartbeat_sensor_table_df: DataFrame, job_id: str, sensor_source: str\n    ) -> DataFrame:\n        \"\"\"Identify anchor jobs from the control table.\n\n        Using trigger_job_id as the partition key, ordered by status_change_timestamp\n        in descending order and sensor_id in ascending order, filtered by the specific\n        sensor_source.\n\n        This method partitions records by trigger_job_id, orders them by\n        status_change_timestamp (descending) and sensor_id (ascending), and filters\n        by the specified sensor_source. Filtering on sensor_source makes sure if\n        current source is eligible for triggering the job and updates or not. This\n        process ensures that only the appropriate single record triggers the job and\n        the control table is updated accordingly. This approach eliminates redundant\n        triggers and unnecessary updates.\n\n        :param heartbeat_sensor_table_df: Heartbeat sensor control table Dataframe.\n        :param job_id: Trigger job_id from table for which dependency also satisfies.\n        :param sensor_source: source of the heartbeat sensor record.\n\n        Returns:\n            Control table DataFrame containing anchor job records valid for triggering.\n        \"\"\"\n        heartbeat_anchor_records_df = heartbeat_sensor_table_df.filter(\n            col(\"trigger_job_id\") == job_id\n        ).withColumn(\n            \"row_no\",\n            row_number().over(\n                Window.partitionBy(\"trigger_job_id\").orderBy(\n                    col(\"status_change_timestamp\").desc(), col(\"sensor_id\").asc()\n                )\n            ),\n        )\n\n        heartbeat_anchor_records_df = heartbeat_anchor_records_df.filter(\n            f\"row_no = 1 AND sensor_source = '{sensor_source}'\"\n        ).drop(\"row_no\")\n\n        return heartbeat_anchor_records_df\n\n    def heartbeat_sensor_trigger_jobs(self) -> None:\n        \"\"\"Get heartbeat jobs to trigger.\n\n        :param self.spec: HeartbeatConfigSpec having config and control table spec.\n        \"\"\"\n        heartbeat_sensor_db_table = self.spec.heartbeat_sensor_db_table\n        sensor_source = self.spec.sensor_source\n\n        heartbeat_sensor_delta_table = DeltaTable.forName(\n            ExecEnv.SESSION, heartbeat_sensor_db_table\n        )\n\n        heartbeat_sensor_control_table_df = ExecEnv.SESSION.table(\n            heartbeat_sensor_db_table\n        ).filter(\n            f\"lower(sensor_source) == '{sensor_source}' and (job_state == 'UNPAUSED')\"\n        )\n\n        jobs_to_trigger = self.get_heartbeat_jobs_to_trigger(\n            heartbeat_sensor_db_table, heartbeat_sensor_control_table_df\n        )\n\n        heartbeat_sensor_table_df = ExecEnv.SESSION.table(heartbeat_sensor_db_table)\n        final_df: Optional[DataFrame] = None\n\n        for row in jobs_to_trigger:\n            run_id = None\n            exception = None\n\n            heartbeat_anchor_job_records_df = self.get_anchor_job_record(\n                heartbeat_sensor_table_df, row[\"trigger_job_id\"], sensor_source\n            )\n\n            if heartbeat_anchor_job_records_df.take(1):\n                run_id, exception = SensorJobRunManager.run_job(\n                    row[\"trigger_job_id\"], self.spec.token, self.spec.domain\n                )\n\n                if exception is None and run_id is not None:\n                    status_df = (\n                        
heartbeat_sensor_table_df.filter(\n                            (col(\"trigger_job_id\") == row[\"trigger_job_id\"])\n                        )\n                        .withColumn(\"job_start_timestamp\", current_timestamp())\n                        .withColumn(\"status\", lit(HeartbeatStatus.IN_PROGRESS.value))\n                        .withColumn(\"status_change_timestamp\", current_timestamp())\n                    )\n                    final_df = final_df.union(status_df) if final_df else status_df\n\n        if final_df is not None:\n            in_progress_merge_condition = \"\"\"target.sensor_id = src.sensor_id AND\n                target.trigger_job_id = src.trigger_job_id AND\n                target.sensor_source = src.sensor_source\"\"\"\n\n            self.update_heartbeat_control_table(\n                heartbeat_sensor_delta_table, final_df, in_progress_merge_condition\n            )\n\n    @classmethod\n    def _read_heartbeat_sensor_data_feed_csv(\n        cls, heartbeat_sensor_data_feed_path: str\n    ) -> DataFrame:\n        \"\"\"Get rows to insert or delete in heartbeat_sensor table.\n\n        It reads the CSV file stored from the `heartbeat_sensor_data_feed_path` and\n        perform UPSERT and DELETE in control table.\n        - **heartbeat_sensor_data_feed_path**: path where CSV file is stored.\n        \"\"\"\n        data_feed_csv_df = (\n            ExecEnv.SESSION.read.format(\"csv\")\n            .option(\"header\", True)\n            .load(heartbeat_sensor_data_feed_path)\n        )\n        data_feed_csv_df = data_feed_csv_df.withColumn(\n            \"job_state\", upper(col(\"job_state\"))\n        )\n        return data_feed_csv_df\n\n    @classmethod\n    def merge_control_table_data_feed_records(\n        cls,\n        heartbeat_sensor_control_table: str,\n        heartbeat_sensor_data_feed_csv_df: DataFrame,\n    ) -> None:\n        \"\"\"Perform merge operation based on the condition.\n\n        It reads the CSV file stored at `heartbeat_sensor_data_feed_path` folder\n        and perform UPSERT and DELETE in control table.\n        - **heartbeat_sensor_control_table**: Heartbeat sensor control table.\n        - **heartbeat_sensor_data_feed_csv_df**: Dataframe after reading CSV file.\n        \"\"\"\n        delta_table = DeltaTable.forName(\n            ExecEnv.SESSION, heartbeat_sensor_control_table\n        )\n\n        delta_table.alias(\"trgt\").merge(\n            heartbeat_sensor_data_feed_csv_df.alias(\"source\"),\n            (\n                \"\"\"source.sensor_id = trgt.sensor_id and\n                trgt.trigger_job_id = source.trigger_job_id\"\"\"\n            ),\n        ).whenNotMatchedInsert(\n            values={\n                \"sensor_source\": \"source.sensor_source\",\n                \"sensor_id\": \"source.sensor_id\",\n                \"sensor_read_type\": \"source.sensor_read_type\",\n                \"asset_description\": \"source.asset_description\",\n                \"upstream_key\": \"source.upstream_key\",\n                \"preprocess_query\": \"source.preprocess_query\",\n                \"latest_event_fetched_timestamp\": \"null\",\n                \"trigger_job_id\": \"source.trigger_job_id\",\n                \"trigger_job_name\": \"source.trigger_job_name\",\n                \"status\": \"null\",\n                \"status_change_timestamp\": \"null\",\n                \"job_start_timestamp\": \"null\",\n                \"job_end_timestamp\": \"null\",\n                \"job_state\": \"source.job_state\",\n         
       \"dependency_flag\": \"source.dependency_flag\",\n            }\n        ).whenMatchedUpdate(\n            set={\n                \"sensor_source\": \"source.sensor_source\",\n                \"sensor_id\": \"source.sensor_id\",\n                \"sensor_read_type\": \"source.sensor_read_type\",\n                \"asset_description\": \"source.asset_description\",\n                \"upstream_key\": \"source.upstream_key\",\n                \"preprocess_query\": \"source.preprocess_query\",\n                \"latest_event_fetched_timestamp\": \"trgt.latest_event_fetched_timestamp\",\n                \"trigger_job_id\": \"source.trigger_job_id\",\n                \"trigger_job_name\": \"source.trigger_job_name\",\n                \"status\": \"trgt.status\",\n                \"status_change_timestamp\": \"trgt.status_change_timestamp\",\n                \"job_start_timestamp\": \"trgt.job_start_timestamp\",\n                \"job_end_timestamp\": \"trgt.job_end_timestamp\",\n                \"job_state\": \"source.job_state\",\n                \"dependency_flag\": \"source.dependency_flag\",\n            }\n        ).whenNotMatchedBySourceDelete().execute()\n\n    @classmethod\n    def heartbeat_sensor_control_table_data_feed(\n        cls,\n        heartbeat_sensor_data_feed_path: str,\n        heartbeat_sensor_control_table: str,\n    ) -> None:\n        \"\"\"Control table Data feeder.\n\n        It reads the CSV file stored at `heartbeat_sensor_data_feed_path` and\n        perform UPSERT and DELETE in control table.\n        - **heartbeat_sensor_data_feed_path**: path where CSV file is stored.\n        - **heartbeat_sensor_control_table**: CONTROL table of Heartbeat sensor.\n        \"\"\"\n        heartbeat_sensor_data_feed_csv_df = cls._read_heartbeat_sensor_data_feed_csv(\n            heartbeat_sensor_data_feed_path\n        )\n\n        cls.merge_control_table_data_feed_records(\n            heartbeat_sensor_control_table, heartbeat_sensor_data_feed_csv_df\n        )\n\n    @classmethod\n    def update_sensor_processed_status(\n        cls,\n        sensor_table: str,\n        job_id_filter_control_table_df: DataFrame,\n    ) -> None:\n        \"\"\"UPDATE sensor PROCESSED_NEW_DATA status.\n\n        Update sensor control table with PROCESSED_NEW_DATA status and\n        status_change_timestamp for the triggered job.\n\n        Args:\n            sensor_table: lakehouse engine sensor table name.\n            job_id_filter_control_table_df: Job Id filtered Heartbeat sensor\n            control table dataframe.\n        \"\"\"\n        sensor_id_df = job_id_filter_control_table_df.withColumn(\n            \"sensor_table_sensor_id\",\n            concat_ws(\n                \"_\",\n                regexp_replace(col(\"sensor_id\"), r\"[ :\\.]\", \"_\"),\n                col(\"trigger_job_id\"),\n            ),\n        )\n\n        for row in sensor_id_df.select(\"sensor_table_sensor_id\").collect():\n            SensorTerminator.update_sensor_status(\n                sensor_id=row[\"sensor_table_sensor_id\"],\n                control_db_table_name=sensor_table,\n                status=SensorStatus.PROCESSED_NEW_DATA.value,\n                assets=None,\n            )\n\n    @classmethod\n    def update_heartbeat_sensor_completion_status(\n        cls,\n        heartbeat_sensor_control_table: str,\n        sensor_table: str,\n        job_id: str,\n    ) -> None:\n        \"\"\"UPDATE heartbeat sensor status.\n\n        Update heartbeat sensor control table with COMPLETE status 
and\n        job_end_timestamp for the triggered job.\n        Update sensor control table with PROCESSED_NEW_DATA status and\n        status_change_timestamp for the triggered job.\n\n        Args:\n            job_id: job_id of the running job. It will refer to\n            trigger_job_id in Control table.\n            sensor_table: lakehouse engine sensor table name.\n            heartbeat_sensor_control_table: Heartbeat sensor control table.\n        \"\"\"\n        job_id_filter_control_table_df = (\n            ExecEnv.SESSION.table(heartbeat_sensor_control_table)\n            .filter(col(\"trigger_job_id\") == job_id)\n            .withColumn(\"status\", lit(HeartbeatStatus.COMPLETED.value))\n            .withColumn(\"status_change_timestamp\", current_timestamp())\n            .withColumn(\"job_end_timestamp\", current_timestamp())\n        )\n\n        cls.update_sensor_processed_status(sensor_table, job_id_filter_control_table_df)\n\n        delta_table = DeltaTable.forName(\n            ExecEnv.SESSION, heartbeat_sensor_control_table\n        )\n\n        (\n            delta_table.alias(\"target\")\n            .merge(\n                job_id_filter_control_table_df.alias(\"source\"),\n                (\n                    f\"\"\"target.sensor_source = source.sensor_source and\n                target.sensor_id = source.sensor_id and\n                target.trigger_job_id = '{job_id}'\"\"\"\n                ),\n            )\n            .whenMatchedUpdate(\n                set={\n                    \"target.status\": \"source.status\",\n                    \"target.status_change_timestamp\": \"source.status_change_timestamp\",\n                    \"target.job_end_timestamp\": \"source.job_end_timestamp\",\n                }\n            )\n            .execute()\n        )\n"
  },
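For orientation, a minimal sketch of how the control-table data feed above could be driven from a job. The CSV column list is derived from the `source.*` fields referenced in the merge; `HeartbeatSensorManager` is a hypothetical handle for the class that owns these classmethods (use the class actually defined earlier in this module), and the path and table names are placeholders.

```python
# Hypothetical sketch: HeartbeatSensorManager and its import path stand in for the
# class that owns heartbeat_sensor_control_table_data_feed; names are examples only.
from lakehouse_engine.core.heartbeat_sensor_manager import HeartbeatSensorManager  # hypothetical

# The data feed CSV is expected to carry the columns the merge reads from `source`:
# sensor_source, sensor_id, sensor_read_type, asset_description, upstream_key,
# preprocess_query, trigger_job_id, trigger_job_name, job_state, dependency_flag.
# The remaining control-table columns are populated by the engine at run time.
HeartbeatSensorManager.heartbeat_sensor_control_table_data_feed(
    heartbeat_sensor_data_feed_path="s3://my-bucket/heartbeat/data_feed/",
    heartbeat_sensor_control_table="my_db.heartbeat_sensor",
)
```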
  {
    "path": "lakehouse_engine/algorithms/sensors/sensor.py",
    "content": "\"\"\"Module to define Sensor algorithm behavior.\"\"\"\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.algorithms.algorithm import Algorithm\nfrom lakehouse_engine.algorithms.exceptions import (\n    NoNewDataException,\n    SensorAlreadyExistsException,\n)\nfrom lakehouse_engine.core.definitions import (\n    SENSOR_ALLOWED_DATA_FORMATS,\n    InputFormat,\n    ReadType,\n    SensorSpec,\n    SensorStatus,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.sensor_manager import (\n    SensorControlTableManager,\n    SensorUpstreamManager,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Sensor(Algorithm):\n    \"\"\"Class representing a sensor to check if the upstream has new data.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, acon: dict):\n        \"\"\"Construct Sensor instances.\n\n        Args:\n            acon: algorithm configuration.\n        \"\"\"\n        self.spec: SensorSpec = SensorSpec.create_from_acon(acon=acon)\n        self._validate_sensor_spec()\n\n        if self._check_if_sensor_already_exists():\n            raise SensorAlreadyExistsException(\n                \"There's already a sensor registered with same id or assets!\"\n            )\n\n    def execute(self) -> bool:\n        \"\"\"Execute the sensor.\"\"\"\n        self._LOGGER.info(f\"Starting {self.spec.input_spec.data_format} sensor...\")\n\n        new_data_df = SensorUpstreamManager.read_new_data(sensor_spec=self.spec)\n        if self.spec.input_spec.read_type == ReadType.STREAMING.value:\n            Sensor._run_streaming_sensor(sensor_spec=self.spec, new_data_df=new_data_df)\n        elif self.spec.input_spec.read_type == ReadType.BATCH.value:\n            Sensor._run_batch_sensor(\n                sensor_spec=self.spec,\n                new_data_df=new_data_df,\n            )\n\n        has_new_data = SensorControlTableManager.check_if_sensor_has_acquired_data(\n            self.spec.sensor_id,\n            self.spec.control_db_table_name,\n        )\n\n        self._LOGGER.info(\n            f\"Sensor {self.spec.sensor_id} has previously \"\n            f\"acquired data? 
{has_new_data}\"\n        )\n\n        if self.spec.fail_on_empty_result and not has_new_data:\n            raise NoNewDataException(\n                f\"No data was acquired by {self.spec.sensor_id} sensor.\"\n            )\n\n        return has_new_data\n\n    def _check_if_sensor_already_exists(self) -> bool:\n        \"\"\"Check if sensor already exists in the table to avoid duplicates.\"\"\"\n        row = SensorControlTableManager.read_sensor_table_data(\n            sensor_id=self.spec.sensor_id,\n            control_db_table_name=self.spec.control_db_table_name,\n        )\n\n        if row and row.assets != self.spec.assets:\n            return True\n        else:\n            row = SensorControlTableManager.read_sensor_table_data(\n                assets=self.spec.assets,\n                control_db_table_name=self.spec.control_db_table_name,\n            )\n            return row is not None and row.sensor_id != self.spec.sensor_id\n\n    @classmethod\n    def _run_streaming_sensor(\n        cls, sensor_spec: SensorSpec, new_data_df: DataFrame\n    ) -> None:\n        \"\"\"Run sensor in streaming mode (internally runs in batch mode).\"\"\"\n\n        def foreach_batch_check_new_data(df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(df)\n\n            Sensor._run_batch_sensor(\n                sensor_spec=sensor_spec,\n                new_data_df=df,\n            )\n\n        new_data_df.writeStream.trigger(availableNow=True).option(\n            \"checkpointLocation\", sensor_spec.checkpoint_location\n        ).foreachBatch(foreach_batch_check_new_data).start().awaitTermination()\n\n    @classmethod\n    def _run_batch_sensor(\n        cls,\n        sensor_spec: SensorSpec,\n        new_data_df: DataFrame,\n    ) -> None:\n        \"\"\"Run sensor in batch mode.\n\n        Args:\n            sensor_spec: sensor spec containing all sensor information.\n            new_data_df: DataFrame possibly containing new data.\n        \"\"\"\n        new_data_first_row = SensorUpstreamManager.get_new_data(new_data_df)\n\n        cls._LOGGER.info(\n            f\"Sensor {sensor_spec.sensor_id} has new data from upstream? 
\"\n            f\"{new_data_first_row is not None}\"\n        )\n\n        if new_data_first_row:\n            SensorControlTableManager.update_sensor_status(\n                sensor_spec=sensor_spec,\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                upstream_key=(\n                    new_data_first_row.UPSTREAM_KEY\n                    if \"UPSTREAM_KEY\" in new_data_df.columns\n                    else None\n                ),\n                upstream_value=(\n                    new_data_first_row.UPSTREAM_VALUE\n                    if \"UPSTREAM_VALUE\" in new_data_df.columns\n                    else None\n                ),\n            )\n            cls._LOGGER.info(\n                f\"Successfully updated sensor status for sensor \"\n                f\"{sensor_spec.sensor_id}...\"\n            )\n\n    def _validate_sensor_spec(self) -> None:\n        \"\"\"Validate if sensor spec Read Type is allowed for the selected Data Format.\"\"\"\n        if InputFormat.exists(self.spec.input_spec.data_format):\n            if (\n                self.spec.input_spec.data_format\n                not in SENSOR_ALLOWED_DATA_FORMATS[self.spec.input_spec.read_type]\n            ):\n                raise NotImplementedError(\n                    f\"A sensor has not been implemented yet for this data format or, \"\n                    f\"this data format is not available for the read_type\"\n                    f\" {self.spec.input_spec.read_type}. \"\n                    f\"Check the allowed combinations of read_type and data_formats:\"\n                    f\" {SENSOR_ALLOWED_DATA_FORMATS}\"\n                )\n        else:\n            raise NotImplementedError(\n                f\"Data format {self.spec.input_spec.data_format} isn't implemented yet.\"\n            )\n"
  },
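As a brief illustration of how the Sensor algorithm above is typically driven: build it from an ACON and use the boolean result (or the `NoNewDataException` raised when `fail_on_empty_result` is set) to decide whether downstream processing should run. The ACON body is left as a placeholder because its keys are defined by `SensorSpec.create_from_acon`, which lives outside this excerpt.

```python
from lakehouse_engine.algorithms.exceptions import NoNewDataException
from lakehouse_engine.algorithms.sensors.sensor import Sensor

acon: dict = {}  # populate with the sensor ACON expected by SensorSpec.create_from_acon

try:
    # execute() returns True when the sensor has acquired new data from the upstream.
    has_new_data = Sensor(acon).execute()
except NoNewDataException:
    # Only raised when fail_on_empty_result is True and nothing was acquired.
    has_new_data = False

if has_new_data:
    ...  # trigger the downstream load
```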
  {
    "path": "lakehouse_engine/configs/__init__.py",
    "content": "\"\"\"This module receives a config file which is included in the wheel.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/configs/engine.yaml",
    "content": "dq_bucket: s3://sample-dq-bucket\ndq_dev_bucket: s3://sample-dq-dev-bucket\ndq_functions_column_list:\n  - dq_rule_id\n  - execution_point\n  - filters\n  - schema\n  - table\n  - column\n  - dimension\ndq_result_sink_columns_to_delete:\n  - partial_unexpected_list\n  - partial_unexpected_counts\n  - partial_unexpected_index_list\n  - unexpected_list\nsharepoint_authority: https://login.microsoftonline.com\nsharepoint_api_domain: https://graph.microsoft.com\nsharepoint_company_domain: your_company_name.sharepoint.com\nnotif_disallowed_email_servers:\n  - sample.blocked.email_server\nengine_usage_path: s3://sample-log-bucket\nengine_dev_usage_path: s3://sample-log-dev-bucket\nraise_on_config_not_available: False\nprod_catalog: sample_catalog\nenvironment: prod"
  },
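The keys in this sample `engine.yaml` line up with the fields of the `EngineConfig` dataclass defined in `lakehouse_engine/core/definitions.py`. As a rough sketch of that correspondence (the engine's actual config-loading code may differ), the file could be materialised like this:

```python
import yaml  # assumes PyYAML is available in the environment

from lakehouse_engine.core.definitions import EngineConfig

# Illustrative only: map the YAML keys straight onto the EngineConfig fields.
with open("lakehouse_engine/configs/engine.yaml") as config_file:
    engine_config = EngineConfig(**yaml.safe_load(config_file))

print(engine_config.dq_bucket)  # -> s3://sample-dq-bucket
```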
  {
    "path": "lakehouse_engine/core/__init__.py",
    "content": "\"\"\"Package with the core behaviour of the lakehouse engine.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/core/dbfs_file_manager.py",
    "content": "\"\"\"File manager module using dbfs.\"\"\"\n\nfrom lakehouse_engine.core.file_manager import FileManager\nfrom lakehouse_engine.utils.databricks_utils import DatabricksUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\ndef _dry_run(bucket: str, object_paths: list) -> dict:\n    \"\"\"Build the dry run request return format.\n\n    Args:\n        bucket: name of bucket to perform operation.\n        object_paths: paths of object to list.\n\n    Returns:\n        A dict with a list of objects that would be copied/deleted.\n    \"\"\"\n    response = {}\n\n    for path in object_paths:\n        path = _get_path(bucket, path)\n\n        object_list: list = []\n        object_list = _list_objects(path, object_list)\n\n        if object_list:\n            response[path] = object_list\n        else:\n            response[path] = [\"No such key\"]\n\n    return response\n\n\ndef _list_objects(path: str, objects_list: list) -> list:\n    \"\"\"List all the objects in a path.\n\n    Args:\n        path: path to be used to perform the list.\n        objects_list: A list of object names, empty by default.\n\n    Returns:\n         A list of object names.\n    \"\"\"\n    from lakehouse_engine.core.exec_env import ExecEnv\n\n    ls_objects_list = DatabricksUtils.get_db_utils(ExecEnv.SESSION).fs.ls(path)\n\n    for file_or_directory in ls_objects_list:\n        if file_or_directory.isDir():\n            _list_objects(file_or_directory.path, objects_list)\n        else:\n            objects_list.append(file_or_directory.path)\n    return objects_list\n\n\ndef _get_path(bucket: str, path: str) -> str:\n    \"\"\"Get complete path.\n\n    For s3 path, the bucket (e.g. bucket-example) and path\n    (e.g. folder1/folder2) will be filled with part of the path.\n    For dbfs path, the path will have the complete path\n    (dbfs:/example) and bucket as null.\n\n    Args:\n        bucket: bucket for s3 objects.\n        path: path to access the directory of file.\n\n    Returns:\n         The complete path with or without bucket.\n    \"\"\"\n    if bucket.strip():\n        path = f\"s3://{bucket}/{path}\".strip()\n    else:\n        path = path.strip()\n\n    return path\n\n\nclass DBFSFileManager(FileManager):\n    \"\"\"Set of actions to manipulate dbfs files in several ways.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    def get_function(self) -> None:\n        \"\"\"Get a specific function to execute.\"\"\"\n        available_functions = {\n            \"delete_objects\": self.delete_objects,\n            \"copy_objects\": self.copy_objects,\n            \"move_objects\": self.move_objects,\n        }\n\n        self._logger.info(\"Function being executed: {}\".format(self.function))\n        if self.function in available_functions.keys():\n            func = available_functions[self.function]\n            func()\n        else:\n            raise NotImplementedError(\n                f\"The requested function {self.function} is not implemented.\"\n            )\n\n    @staticmethod\n    def _delete_objects(bucket: str, objects_paths: list) -> None:\n        \"\"\"Delete objects recursively.\n\n        Params:\n            bucket: name of bucket to perform the delete operation.\n            objects_paths: objects to be deleted.\n        \"\"\"\n        from lakehouse_engine.core.exec_env import ExecEnv\n\n        for path in objects_paths:\n            path = _get_path(bucket, path)\n\n            DBFSFileManager._logger.info(f\"Deleting: 
{path}\")\n\n            try:\n                delete_operation = DatabricksUtils.get_db_utils(ExecEnv.SESSION).fs.rm(\n                    path, True\n                )\n\n                if delete_operation:\n                    DBFSFileManager._logger.info(f\"Deleted: {path}\")\n                else:\n                    DBFSFileManager._logger.info(f\"Not able to delete: {path}\")\n            except Exception as e:\n                DBFSFileManager._logger.error(f\"Error deleting {path} - {e}\")\n                raise e\n\n    def delete_objects(self) -> None:\n        \"\"\"Delete objects and 'directories'.\n\n        If dry_run is set to True the function will print a dict with all the\n        paths that would be deleted based on the given keys.\n        \"\"\"\n        bucket = self.configs[\"bucket\"]\n        objects_paths = self.configs[\"object_paths\"]\n        dry_run = self.configs[\"dry_run\"]\n\n        if dry_run:\n            response = _dry_run(bucket=bucket, object_paths=objects_paths)\n\n            self._logger.info(\"Paths that would be deleted:\")\n            self._logger.info(response)\n        else:\n            self._delete_objects(bucket, objects_paths)\n\n    def copy_objects(self) -> None:\n        \"\"\"Copies objects and 'directories'.\n\n        If dry_run is set to True the function will print a dict with all the\n        paths that would be copied based on the given keys.\n        \"\"\"\n        source_bucket = self.configs[\"bucket\"]\n        source_object = self.configs[\"source_object\"]\n        destination_bucket = self.configs[\"destination_bucket\"]\n        destination_object = self.configs[\"destination_object\"]\n        dry_run = self.configs[\"dry_run\"]\n\n        if dry_run:\n            response = _dry_run(bucket=source_bucket, object_paths=[source_object])\n\n            self._logger.info(\"Paths that would be copied:\")\n            self._logger.info(response)\n        else:\n            self._copy_objects(\n                source_bucket=source_bucket,\n                source_object=source_object,\n                destination_bucket=destination_bucket,\n                destination_object=destination_object,\n            )\n\n    @staticmethod\n    def _copy_objects(\n        source_bucket: str,\n        source_object: str,\n        destination_bucket: str,\n        destination_object: str,\n    ) -> None:\n        \"\"\"Copies objects and 'directories'.\n\n        Args:\n            source_bucket: name of bucket to perform the copy.\n            source_object: object/folder to be copied.\n            destination_bucket: name of the target bucket to copy.\n            destination_object: target object/folder to copy.\n        \"\"\"\n        from lakehouse_engine.core.exec_env import ExecEnv\n\n        copy_from = _get_path(source_bucket, source_object)\n        copy_to = _get_path(destination_bucket, destination_object)\n\n        DBFSFileManager._logger.info(f\"Copying: {copy_from} to {copy_to}\")\n\n        try:\n            DatabricksUtils.get_db_utils(ExecEnv.SESSION).fs.cp(\n                copy_from, copy_to, True\n            )\n\n            DBFSFileManager._logger.info(f\"Copied: {copy_from} to {copy_to}\")\n        except Exception as e:\n            DBFSFileManager._logger.error(\n                f\"Error copying file {copy_from} to {copy_to} - {e}\"\n            )\n            raise e\n\n    def move_objects(self) -> None:\n        \"\"\"Moves objects and 'directories'.\n\n        If dry_run is set to True the function 
will print a dict with all the\n        paths that would be moved based on the given keys.\n        \"\"\"\n        source_bucket = self.configs[\"bucket\"]\n        source_object = self.configs[\"source_object\"]\n        destination_bucket = self.configs[\"destination_bucket\"]\n        destination_object = self.configs[\"destination_object\"]\n        dry_run = self.configs[\"dry_run\"]\n\n        if dry_run:\n            response = _dry_run(bucket=source_bucket, object_paths=[source_object])\n\n            self._logger.info(\"Paths that would be moved:\")\n            self._logger.info(response)\n        else:\n            self._move_objects(\n                source_bucket=source_bucket,\n                source_object=source_object,\n                destination_bucket=destination_bucket,\n                destination_object=destination_object,\n            )\n\n    @staticmethod\n    def _move_objects(\n        source_bucket: str,\n        source_object: str,\n        destination_bucket: str,\n        destination_object: str,\n    ) -> None:\n        \"\"\"Moves objects and 'directories'.\n\n        Args:\n            source_bucket: name of bucket to perform the move.\n            source_object: object/folder to be moved.\n            destination_bucket: name of the target bucket to move.\n            destination_object: target object/folder to move.\n        \"\"\"\n        from lakehouse_engine.core.exec_env import ExecEnv\n\n        move_from = _get_path(source_bucket, source_object)\n        move_to = _get_path(destination_bucket, destination_object)\n\n        DBFSFileManager._logger.info(f\"Moving: {move_from} to {move_to}\")\n\n        try:\n            DatabricksUtils.get_db_utils(ExecEnv.SESSION).fs.mv(\n                move_from, move_to, True\n            )\n\n            DBFSFileManager._logger.info(f\"Moved: {move_from} to {move_to}\")\n        except Exception as e:\n            DBFSFileManager._logger.error(\n                f\"Error moving file {move_from} to {move_to} - {e}\"\n            )\n            raise e\n"
  },
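To make the path handling in `dbfs_file_manager.py` concrete, here is a small sketch of what `_get_path` resolves for an S3-style input versus a DBFS-style input (bucket and folder names are the illustrative ones from the docstring):

```python
from lakehouse_engine.core.dbfs_file_manager import _get_path

# With a bucket, the result is prefixed with the s3 scheme and the bucket name.
assert _get_path("bucket-example", "folder1/folder2") == "s3://bucket-example/folder1/folder2"

# With an empty bucket, the path is returned as-is (e.g. a dbfs:/ path).
assert _get_path("", "dbfs:/example") == "dbfs:/example"
```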
  {
    "path": "lakehouse_engine/core/definitions.py",
    "content": "\"\"\"Definitions of standard values and structures for core components.\"\"\"\n\nfrom dataclasses import dataclass\nfrom datetime import datetime\nfrom enum import Enum\nfrom pathlib import Path\nfrom typing import ClassVar, Collection, List, Optional, Tuple\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.types import (\n    ArrayType,\n    BooleanType,\n    StringType,\n    StructField,\n    StructType,\n    TimestampType,\n)\n\nfrom lakehouse_engine.io.exceptions import InputNotFoundException\n\n\nclass CollectEngineUsage(Enum):\n    \"\"\"Options for collecting engine usage stats.\n\n    - enabled, enables the collection and storage of Lakehouse Engine\n    usage statistics for any environment.\n    - prod_only, enables the collection and storage of Lakehouse Engine\n    usage statistics for production environment only.\n    - disabled, disables the collection and storage of Lakehouse Engine\n    usage statistics, for all environments.\n    \"\"\"\n\n    ENABLED = \"enabled\"\n    PROD_ONLY = \"prod_only\"\n    DISABLED = \"disabled\"\n\n\n@dataclass\nclass EngineConfig(object):\n    \"\"\"Definitions that can come from the Engine Config file.\n\n    - dq_bucket: S3 prod bucket used to store data quality related artifacts.\n    - dq_dev_bucket: S3 dev bucket used to store data quality related artifacts.\n    - notif_disallowed_email_servers: email servers not allowed to be used\n        for sending notifications.\n    - engine_usage_path: path where the engine prod usage stats are stored.\n    - engine_dev_usage_path: path where the engine dev usage stats are stored.\n    - collect_engine_usage: whether to enable the collection of lakehouse\n        engine usage stats or not.\n    - dq_functions_column_list: list of columns to be added to the meta argument\n        of GX when using PRISMA.\n    - raise_on_config_not_available: whether to raise an exception if a spark config\n        is not available.\n    - prod_catalog: name of the prod catalog being used. This is useful to derive\n        whether the environment is prod or dev, so the dev or prod buckets/paths can be\n        used for storing engine usage stats and dq artifacts.\n    - environment: environment that the engine is being executed on. Takes precedence\n        over prod_catalog when defining if the environment is prod or dev.\n    - sharepoint_authority: authority for the Sharepoint api.\n    - sharepoint_company_domain: company domain for the Sharepoint api.\n    - sharepoint_api_domain: api domain for the Sharepoint api.\n    \"\"\"\n\n    dq_bucket: Optional[str] = None\n    dq_dev_bucket: Optional[str] = None\n    notif_disallowed_email_servers: Optional[list] = None\n    engine_usage_path: Optional[str] = None\n    engine_dev_usage_path: Optional[str] = None\n    collect_engine_usage: str = CollectEngineUsage.ENABLED.value\n    dq_functions_column_list: Optional[list] = None\n    dq_result_sink_columns_to_delete: Optional[list] = None\n    sharepoint_authority: Optional[str] = None\n    sharepoint_company_domain: Optional[str] = None\n    sharepoint_api_domain: Optional[str] = None\n    raise_on_config_not_available: bool = False\n    prod_catalog: Optional[str] = None\n    environment: Optional[str] = None\n\n\nclass EngineStats(object):\n    \"\"\"Definitions for collection of Lakehouse Engine Stats.\n\n    !!! 
note\n        whenever the value comes from a key inside a Spark Config\n        that returns an array, it can be specified with a '#' so that it\n        is adequately processed.\n    \"\"\"\n\n    CLUSTER_USAGE_TAGS = \"spark.databricks.clusterUsageTags\"\n    DEF_SPARK_CONFS = {\n        \"dp_name\": f\"{CLUSTER_USAGE_TAGS}.clusterAllTags#accountName\",\n        \"environment\": f\"{CLUSTER_USAGE_TAGS}.clusterAllTags#environment\",\n        \"workspace_id\": f\"{CLUSTER_USAGE_TAGS}.orgId\",\n        \"job_id\": f\"{CLUSTER_USAGE_TAGS}.clusterAllTags#JobId\",\n        \"job_name\": f\"{CLUSTER_USAGE_TAGS}.clusterAllTags#RunName\",\n        \"run_id\": f\"{CLUSTER_USAGE_TAGS}.clusterAllTags#ClusterName\",\n    }\n    DEF_DATABRICKS_CONTEXT_KEYS = {\n        \"environment\": \"environment\",\n        \"dp_name\": \"jobName\",\n        \"run_id\": \"runId\",\n        \"job_id\": \"jobId\",\n        \"job_name\": \"jobName\",\n        \"workspace_id\": \"workspaceId\",\n        \"policy_id\": \"usagePolicyId\",\n    }\n\n\nclass InputFormat(Enum):\n    \"\"\"Formats of algorithm input.\"\"\"\n\n    JDBC = \"jdbc\"\n    AVRO = \"avro\"\n    JSON = \"json\"\n    CSV = \"csv\"\n    PARQUET = \"parquet\"\n    DELTAFILES = \"delta\"\n    CLOUDFILES = \"cloudfiles\"\n    KAFKA = \"kafka\"\n    SQL = \"sql\"\n    SAP_BW = \"sap_bw\"\n    SAP_B4 = \"sap_b4\"\n    DATAFRAME = \"dataframe\"\n    SFTP = \"sftp\"\n    SHAREPOINT = \"sharepoint\"\n\n    @classmethod\n    def values(cls):  # type: ignore\n        \"\"\"Generates a list containing all enum values.\n\n        Returns:\n            A list with all enum values.\n        \"\"\"\n        return (c.value for c in cls)\n\n    @classmethod\n    def exists(cls, input_format: str) -> bool:\n        \"\"\"Checks if the input format exists in the enum values.\n\n        Args:\n            input_format: format to check if exists.\n\n        Returns:\n            If the input format exists in our enum.\n        \"\"\"\n        return input_format in cls.values()\n\n\n# Formats of input that are considered files.\nFILE_INPUT_FORMATS = [\n    InputFormat.AVRO.value,\n    InputFormat.JSON.value,\n    InputFormat.PARQUET.value,\n    InputFormat.CSV.value,\n    InputFormat.DELTAFILES.value,\n    InputFormat.CLOUDFILES.value,\n]\n\nSHAREPOINT_SUPPORTED_EXTENSIONS = {\".csv\", \".xlsx\"}\n\n\n@dataclass\nclass SharepointFile:\n    \"\"\"Represents a file from Sharepoint with metadata and optional content.\"\"\"\n\n    file_name: str\n    time_created: str\n    time_modified: str\n    content: Optional[bytes] = None\n    _folder: Optional[str] = None\n    skip_rename: bool = False\n    _already_archived: bool = False\n\n    @property\n    def file_extension(self) -> str:\n        \"\"\"Returns the file extension of the stored file.\"\"\"\n        return Path(self.file_name).suffix\n\n    @property\n    def file_path(self) -> str:\n        \"\"\"Full Sharepoint path including folder and file name.\"\"\"\n        if not self._folder:\n            raise AttributeError(\"file_path unavailable; _folder not set.\")\n        return f\"{self._folder}/{self.file_name}\"\n\n    @property\n    def is_csv(self) -> bool:\n        \"\"\"True if file is a CSV.\"\"\"\n        return self.file_extension.lower() == \".csv\"\n\n    @property\n    def is_excel(self) -> bool:\n        \"\"\"True if file is an Excel file.\"\"\"\n        return self.file_extension.lower() == \".xlsx\"\n\n    @property\n    def content_size(self) -> int:\n        \"\"\"Size of content in 
bytes.\"\"\"\n        return len(self.content) if self.content else 0\n\n\n@dataclass\nclass SharepointOptions(object):\n    \"\"\"Options for Sharepoint I/O (used by both reader and writer).\n\n    This dataclass is shared by the Sharepoint reader and writer. Some fields\n    are required/used only in *read* mode, others only in *write* mode.\n    Use `validate_for_reader()` / `validate_for_writer()` to enforce the\n    correct subsets.\n\n    Common (reader & writer):\n      - client_id (str): Azure AD application (client) ID.\n      - tenant_id (str): Azure AD tenant (directory) ID.\n      - site_name (str): Sharepoint site name.\n      - drive_name (str): Document library/drive name.\n      - secret (str): Client secret.\n      - local_path (str): Local/volume path for staging (read/write temp).\n      - api_version (str): Microsoft Graph API version (default: \"v1.0\").\n      - conflict_behaviour (Optional[str]): e.g. 'replace', 'fail'.\n      - allowed_extensions (Optional[Collection[str]]):\n          Defaults to SHAREPOINT_SUPPORTED_EXTENSIONS {\".csv\", \".xlsx\"}.\n\n    Reader-specific:\n      - folder_relative_path (Optional[str]): Folder (or full file path)\n          to read from.\n      - file_name (Optional[str]): Name of a single file inside the folder\n          to read. If `folder_relative_path` already points to a file,\n          `file_name` must be None.\n      - file_type (Optional[str]): \"csv\" or \"xlsx\" when reading a folder.\n      - file_pattern (Optional[str]): Glob (e.g. '*.csv') when reading a folder.\n      - local_options (Optional[dict]): Spark CSV read options (e.g. header, sep).\n      - chunk_size (Optional[int]): Download chunk size (bytes).\n\n    Writer-specific:\n      - file_name (Optional[str]): Target file name to upload.\n      - local_options (Optional[dict]): Spark CSV write options.\n      - chunk_size (Optional[int]): Upload chunk size (bytes).\n\n    Archiving (reader):\n      - archive_enabled (bool): Whether to move files after a successful/failed read.\n          Default: True.\n      - archive_success_subfolder (Optional[str]): Success folder (default \"done\").\n          Set None to keep in place.\n      - archive_error_subfolder (Optional[str]): Error folder (default \"error\").\n          Set None to keep in place.\n    \"\"\"\n\n    # Common\n    client_id: str\n    tenant_id: str\n    site_name: str\n    drive_name: str\n    secret: str\n    local_path: str\n    file_name: Optional[str] = None  # used by reader (optional) and writer (target)\n    api_version: str = \"v1.0\"\n    conflict_behaviour: Optional[str] = None\n    allowed_extensions: Optional[Collection[str]] = None\n\n    # Reader\n    file_type: Optional[str] = None\n    folder_relative_path: Optional[str] = None\n    file_pattern: Optional[str] = None\n    chunk_size: Optional[int] = 100 * 1024 * 1024  # 100 MB (read & write)\n    local_options: Optional[dict] = None  # (read & write)\n\n    # Reader archiving\n    archive_enabled: bool = True\n    archive_success_subfolder: Optional[str] = \"done\"\n    archive_error_subfolder: Optional[str] = \"error\"\n\n    REQUIRED_READER_OPTS: ClassVar[Tuple[str, ...]] = (\n        \"site_name\",\n        \"drive_name\",\n        \"folder_relative_path\",\n    )\n    REQUIRED_WRITER_OPTS: ClassVar[Tuple[str, ...]] = (\n        \"site_name\",\n        \"drive_name\",\n        \"local_path\",\n    )\n\n    def __post_init__(self) -> None:\n        \"\"\"Normalize and validate Sharepoint options (types, extensions, etc).\"\"\"\n 
       allowed_extensions = self._get_allowed_extensions()\n        allowed_file_types = {extension.lstrip(\".\") for extension in allowed_extensions}\n\n        self._validate_file_type(allowed_file_types)\n        self._normalize_folder_relative_path()\n\n        self._validate_folder_relative_path_extension_if_looks_like_file(\n            allowed_extensions\n        )\n        self._validate_single_file_mode_constraints_if_folder_is_file_path(\n            allowed_extensions\n        )\n\n        self._validate_file_name_and_file_pattern_are_not_both_set()\n\n    def _get_allowed_extensions(self) -> set[str]:\n        \"\"\"Return the supported file extensions (lowercased).\"\"\"\n        return {\n            extension.lower()\n            for extension in (\n                self.allowed_extensions or SHAREPOINT_SUPPORTED_EXTENSIONS\n            )\n        }\n\n    def _validate_file_type(self, allowed_file_types: set[str]) -> None:\n        \"\"\"Validate that `file_type` is supported when provided.\"\"\"\n        if not self.file_type:\n            return\n\n        if self.file_type.lower() not in allowed_file_types:\n            raise ValueError(\n                f\"`file_type` must be one of {sorted(allowed_file_types)}. \"\n                f\"Got: '{self.file_type}'\"\n            )\n\n    def _normalize_folder_relative_path(self) -> None:\n        \"\"\"Strip leading and trailing slashes from `folder_relative_path`.\"\"\"\n        if self.folder_relative_path:\n            self.folder_relative_path = self.folder_relative_path.strip(\"/\")\n\n    def _ends_with_supported_extension(\n        self,\n        path_value: str,\n        allowed_extensions: set[str],\n    ) -> bool:\n        \"\"\"Return True if the path ends with any supported extension.\"\"\"\n        lowered_path_value = path_value.lower()\n        return any(\n            lowered_path_value.endswith(extension) for extension in allowed_extensions\n        )\n\n    def _validate_single_file_mode_constraints_if_folder_is_file_path(\n        self,\n        allowed_extensions: set[str],\n    ) -> None:\n        \"\"\"Forbid file name, pattern, and type when folder_relative_path end is file.\"\"\"\n        if not self.folder_relative_path:\n            return\n\n        if not self._ends_with_supported_extension(\n            self.folder_relative_path, allowed_extensions\n        ):\n            return\n\n        if self.file_name:\n            raise ValueError(\n                \"When `folder_relative_path` points to a file, `file_name` must \"\n                \"be None.\"\n            )\n        if self.file_pattern:\n            raise ValueError(\n                \"When `folder_relative_path` points to a file, `file_pattern` must \"\n                \"be None.\"\n            )\n        if self.file_type:\n            raise ValueError(\n                \"When `folder_relative_path` points to a file, `file_type` must \"\n                \"be None (it's derived from file_path extension)\"\n            )\n\n    def _validate_file_name_extension(self, allowed_extensions: set[str]) -> None:\n        \"\"\"Validate that `file_name` ends with a supported extension when provided.\"\"\"\n        if not self.file_name:\n            return\n\n        if not self._ends_with_supported_extension(self.file_name, allowed_extensions):\n            raise ValueError(\n                f\"`file_name` must end with one of {sorted(allowed_extensions)},\"\n                f\" got: {self.file_name}\"\n            )\n\n    def 
_validate_file_name_and_file_pattern_are_not_both_set(self) -> None:\n        \"\"\"Validate that `file_name` and `file_pattern` are not both set.\"\"\"\n        if self.file_name and self.file_pattern:\n            raise ValueError(\n                \"Conflicting options: provide either `file_name` or `file_pattern`\"\n                \", not both.\"\n            )\n\n    def _validate_folder_relative_path_extension_if_looks_like_file(\n        self,\n        allowed_extensions: set[str],\n    ) -> None:\n        \"\"\"Fail if folder_relative_path is a file path but has unsupported extension.\"\"\"\n        if not self.folder_relative_path:\n            return\n\n        last_segment = self.folder_relative_path.split(\"/\")[-1]\n        looks_like_file = \".\" in last_segment\n        if not looks_like_file:\n            return\n\n        if self._ends_with_supported_extension(last_segment, allowed_extensions):\n            return\n\n        raise ValueError(\n            f\"`folder_relative_path` appears to be a file path but does not end \"\n            f\"with one of {sorted(allowed_extensions)}: {self.folder_relative_path}\"\n        )\n\n    def validate_for_reader(self) -> None:\n        \"\"\"Validate Sharepoint options required for reading.\"\"\"\n        missing = [opt for opt in self.REQUIRED_READER_OPTS if not getattr(self, opt)]\n        if missing:\n            raise InputNotFoundException(\n                f\"Missing required Sharepoint options for reader: {', '.join(missing)}\"\n            )\n        allowed_extensions = self._get_allowed_extensions()\n        if self.file_name and not self._ends_with_supported_extension(\n            self.file_name, allowed_extensions\n        ):\n            raise ValueError(\n                f\"`file_name` must end with one of {sorted(allowed_extensions)}, \"\n                f\"got: {self.file_name}\"\n            )\n\n    def validate_for_writer(self) -> None:\n        \"\"\"Validate Sharepoint options required for writing.\"\"\"\n        missing = [opt for opt in self.REQUIRED_WRITER_OPTS if not getattr(self, opt)]\n        if missing:\n            raise InputNotFoundException(\n                f\"Missing required Sharepoint options for writer: {', '.join(missing)}\"\n            )\n\n\nclass OutputFormat(Enum):\n    \"\"\"Formats of algorithm output.\"\"\"\n\n    JDBC = \"jdbc\"\n    AVRO = \"avro\"\n    JSON = \"json\"\n    CSV = \"csv\"\n    PARQUET = \"parquet\"\n    DELTAFILES = \"delta\"\n    KAFKA = \"kafka\"\n    CONSOLE = \"console\"\n    NOOP = \"noop\"\n    DATAFRAME = \"dataframe\"\n    REST_API = \"rest_api\"\n    FILE = \"file\"  # Internal use only\n    TABLE = \"table\"  # Internal use only\n    SHAREPOINT = \"sharepoint\"\n\n    @classmethod\n    def values(cls):  # type: ignore\n        \"\"\"Generates a list containing all enum values.\n\n        Returns:\n            A list with all enum values.\n        \"\"\"\n        return (c.value for c in cls)\n\n    @classmethod\n    def exists(cls, output_format: str) -> bool:\n        \"\"\"Checks if the output format exists in the enum values.\n\n        Args:\n            output_format: format to check if exists.\n\n        Returns:\n            If the output format exists in our enum.\n        \"\"\"\n        return output_format in cls.values()\n\n\n# Formats of output that are considered files.\nFILE_OUTPUT_FORMATS = [\n    OutputFormat.AVRO.value,\n    OutputFormat.JSON.value,\n    OutputFormat.PARQUET.value,\n    OutputFormat.CSV.value,\n    
OutputFormat.DELTAFILES.value,\n]\n\n\nclass NotifierType(Enum):\n    \"\"\"Type of notifier available.\"\"\"\n\n    EMAIL = \"email\"\n\n\nclass NotificationRuntimeParameters(Enum):\n    \"\"\"Parameters to be replaced in runtime.\"\"\"\n\n    DATABRICKS_JOB_NAME = \"databricks_job_name\"\n    DATABRICKS_WORKSPACE_ID = \"databricks_workspace_id\"\n    JOB_EXCEPTION = \"exception\"\n\n\nNOTIFICATION_RUNTIME_PARAMETERS = [\n    NotificationRuntimeParameters.DATABRICKS_JOB_NAME.value,\n    NotificationRuntimeParameters.DATABRICKS_WORKSPACE_ID.value,\n    NotificationRuntimeParameters.JOB_EXCEPTION.value,\n]\n\n\nclass ReadType(Enum):\n    \"\"\"Define the types of read operations.\n\n    - BATCH - read the data in batch mode (e.g., Spark batch).\n    - STREAMING - read the data in streaming mode (e.g., Spark streaming).\n    \"\"\"\n\n    BATCH = \"batch\"\n    STREAMING = \"streaming\"\n\n\nclass ReadMode(Enum):\n    \"\"\"Different modes that control how we handle compliance to the provided schema.\n\n    These read modes map to Spark's read modes at the moment.\n    \"\"\"\n\n    PERMISSIVE = \"PERMISSIVE\"\n    FAILFAST = \"FAILFAST\"\n    DROPMALFORMED = \"DROPMALFORMED\"\n\n\nclass DQDefaults(Enum):\n    \"\"\"Defaults used on the data quality process.\"\"\"\n\n    FILE_SYSTEM_STORE = \"file_system\"\n    FILE_SYSTEM_S3_STORE = \"s3\"\n    DQ_BATCH_IDENTIFIERS = [\"spec_id\", \"input_id\", \"timestamp\"]\n    DATASOURCE_CLASS_NAME = \"Datasource\"\n    DATASOURCE_EXECUTION_ENGINE = \"SparkDFExecutionEngine\"\n    DATA_CONNECTORS_CLASS_NAME = \"RuntimeDataConnector\"\n    DATA_CONNECTORS_MODULE_NAME = \"great_expectations.datasource.data_connector\"\n    STORE_BACKEND = \"s3\"\n    EXPECTATIONS_STORE_PREFIX = \"dq/expectations/\"\n    VALIDATIONS_STORE_PREFIX = \"dq/validations/\"\n    CHECKPOINT_STORE_PREFIX = \"dq/checkpoints/\"\n    CUSTOM_EXPECTATION_LIST = [\n        \"expect_column_values_to_be_date_not_older_than\",\n        \"expect_column_pair_a_to_be_smaller_or_equal_than_b\",\n        \"expect_multicolumn_column_a_must_equal_b_or_c\",\n        \"expect_queried_column_agg_value_to_be\",\n        \"expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b\",\n        \"expect_column_pair_a_to_be_not_equal_to_b\",\n        \"expect_column_values_to_not_be_null_or_empty_string\",\n    ]\n    DQ_COLUMNS_TO_KEEP_TYPES = [\n        \"success\",\n        \"run_time\",\n        \"validation_results\",\n        \"expectation_success\",\n        \"exception_info\",\n        \"meta\",\n        \"run_time_year\",\n        \"run_time_month\",\n        \"run_time_day\",\n        \"source_primary_key\",\n        \"evaluated_expectations\",\n        \"success_percent\",\n        \"successful_expectations\",\n        \"unsuccessful_expectations\",\n        \"unexpected_index_list\",\n    ]\n    DQ_VALIDATIONS_SCHEMA = StructType(\n        [\n            StructField(\n                \"dq_validations\",\n                StructType(\n                    [\n                        StructField(\"run_name\", StringType()),\n                        StructField(\"run_success\", BooleanType()),\n                        StructField(\"raised_exceptions\", BooleanType()),\n                        StructField(\"run_row_success\", BooleanType()),\n                        StructField(\n                            \"dq_failure_details\",\n                            ArrayType(\n                                StructType(\n                                    [\n                                        
StructField(\"expectation_type\", StringType()),\n                                        StructField(\"kwargs\", StringType()),\n                                    ]\n                                ),\n                            ),\n                        ),\n                    ]\n                ),\n            )\n        ]\n    )\n\n\nclass WriteType(Enum):\n    \"\"\"Types of write operations.\"\"\"\n\n    OVERWRITE = \"overwrite\"\n    COMPLETE = \"complete\"\n    APPEND = \"append\"\n    UPDATE = \"update\"\n    MERGE = \"merge\"\n    ERROR_IF_EXISTS = \"error\"\n    IGNORE_IF_EXISTS = \"ignore\"\n\n\n@dataclass\nclass InputSpec(object):\n    \"\"\"Specification of an algorithm input.\n\n    This is very aligned with the way the execution environment connects to the sources\n    (e.g., spark sources).\n\n    - spec_id: spec_id of the input specification\n    - read_type: ReadType type of read\n        operation.\n    - data_format: format of the input.\n    - sftp_files_format: format of the files (csv, fwf, json, xml...) in a sftp\n        directory.\n    - df_name: dataframe name.\n    - db_table: table name in the form of `<db>.<table>`.\n    - location: uri that identifies from where to read data in the\n      specified format.\n    - sharepoint_opts: Options to apply when reading from Sharepoint.\n    - enforce_schema_from_table: if we want to enforce the table schema or not,\n    by providing a table name in the form of `<db>.<table>`.\n    - query: sql query to execute and return the dataframe. Use it if you do not want\n        to read from a file system nor from a table, but rather from a sql query.\n    - schema: dict representation of a schema of the input (e.g., Spark struct\n    type schema).\n    - schema_path: path to a file with a representation of a schema of the\n    input (e.g., Spark struct type schema).\n    - disable_dbfs_retry: optional flag to disable file storage dbfs.\n    - with_filepath: if we want to include the path of the file that is being\n    read. 
Only\n        works with the file reader (batch and streaming modes are supported).\n    - options: dict with other relevant options according to the execution\n        environment (e.g., spark) possible sources.\n    - calculate_upper_bound: whether to calculate the upper bound to extract from SAP BW\n        or not.\n    - calc_upper_bound_schema: specific schema for the calculated upper_bound.\n    - generate_predicates: whether to generate predicates to extract from SAP BW or not.\n    - predicates_add_null: if we want to include is null on partition by predicates.\n    - temp_view: optional name of a view to point to the input dataframe to be used\n        to create or replace a temp view on top of the dataframe.\n    \"\"\"\n\n    spec_id: str\n    read_type: str\n    data_format: Optional[str] = None\n    sftp_files_format: Optional[str] = None\n    df_name: Optional[DataFrame] = None\n    db_table: Optional[str] = None\n    location: Optional[str] = None\n    sharepoint_opts: Optional[SharepointOptions] = None\n    query: Optional[str] = None\n    enforce_schema_from_table: Optional[str] = None\n    schema: Optional[dict] = None\n    schema_path: Optional[str] = None\n    disable_dbfs_retry: bool = False\n    with_filepath: bool = False\n    options: Optional[dict] = None\n    jdbc_args: Optional[dict] = None\n    calculate_upper_bound: bool = False\n    calc_upper_bound_schema: Optional[str] = None\n    generate_predicates: bool = False\n    predicates_add_null: bool = True\n    temp_view: Optional[str] = None\n\n    def __post_init__(self) -> None:\n        \"\"\"Normalize Sharepoint options if passed as a raw dictionary.\n\n        Args:\n            self: Instance of the class where `sharepoint_opts` attribute\n                may be either a dictionary or a SharepointOptions object.\n        \"\"\"\n        if isinstance(self.sharepoint_opts, dict):\n            self.sharepoint_opts = SharepointOptions(**self.sharepoint_opts)\n\n\n@dataclass\nclass TransformerSpec(object):\n    \"\"\"Transformer Specification, i.e., a single transformation amongst many.\n\n    - function: name of the function (or callable function) to be executed.\n    - args: (not applicable if using a callable function) dict with the arguments\n        to pass to the function as `<k,v>` pairs with the name of the parameter of\n        the function and the respective value.\n    \"\"\"\n\n    function: str\n    args: dict\n\n\n@dataclass\nclass TransformSpec(object):\n    \"\"\"Transformation Specification.\n\n    I.e., the specification that defines the many transformations to be done to the data\n    that was read.\n\n    - spec_id: id of the transform specification\n    - input_id: id of the corresponding input\n    specification.\n    - transformers: list of transformers to execute.\n    - force_streaming_foreach_batch_processing: sometimes, when using streaming, we want\n        to force the transform to be executed in the foreachBatch function to ensure\n        non-supported streaming operations can be properly executed.\n    \"\"\"\n\n    spec_id: str\n    input_id: str\n    transformers: List[TransformerSpec]\n    force_streaming_foreach_batch_processing: bool = False\n\n\nclass DQType(Enum):\n    \"\"\"Available data quality tasks.\"\"\"\n\n    VALIDATOR = \"validator\"\n    PRISMA = \"prisma\"\n\n\nclass DQResultFormat(Enum):\n    \"\"\"Available data quality result formats.\"\"\"\n\n    COMPLETE = \"COMPLETE\"\n\n\nclass DQExecutionPoint(Enum):\n    \"\"\"Available data quality execution points.\"\"\"\n\n    
IN_MOTION = \"in_motion\"\n    AT_REST = \"at_rest\"\n\n\nclass DQTableBaseParameters(Enum):\n    \"\"\"Base parameters for importing DQ rules from a table.\"\"\"\n\n    PRISMA_BASE_PARAMETERS = [\"arguments\", \"dq_tech_function\"]\n\n\n@dataclass\nclass DQFunctionSpec(object):\n    \"\"\"Defines a data quality function specification.\n\n    - function - name of the data quality function (expectation) to execute.\n    It follows the great_expectations api https://greatexpectations.io/expectations/.\n    - args - args of the function (expectation). Follow the same api as above.\n    \"\"\"\n\n    function: str\n    args: Optional[dict] = None\n\n\n@dataclass\nclass DQSpec(object):\n    \"\"\"Data quality overall specification.\n\n    - spec_id - id of the specification.\n    - input_id - id of the input specification.\n    - dq_type - type of DQ process to execute (e.g. validator).\n    - dq_functions - list of function specifications to execute.\n    - dq_db_table - name of table to derive the dq functions from.\n    - dq_table_table_filter - name of the table which rules are to be applied in the\n        validations (Only used when deriving dq functions).\n    - dq_table_extra_filters - extra filters to be used when deriving dq functions.\n        This is a sql expression to be applied to the dq_db_table.\n    - execution_point - execution point of the dq functions. [at_rest, in_motion].\n        This is set during the load_data or dq_validator functions.\n    - unexpected_rows_pk - the list of columns composing the primary key of the\n        source data to identify the rows failing the DQ validations. Note: only one\n        of tbl_to_derive_pk or unexpected_rows_pk arguments need to be provided. It\n        is mandatory to provide one of these arguments when using tag_source_data\n        as True. When tag_source_data is False, this is not mandatory, but still\n        recommended.\n    - tbl_to_derive_pk - db.table to automatically derive the unexpected_rows_pk from.\n        Note: only one of tbl_to_derive_pk or unexpected_rows_pk arguments need to\n        be provided. It is mandatory to provide one of these arguments when using\n        tag_source_data as True. When tag_source_data is False, this is not\n        mandatory, but still recommended.\n    - gx_result_format - great expectations result format. Default: \"COMPLETE\".\n    - tag_source_data - when set to true, this will ensure that the DQ process ends by\n        tagging the source data with an additional column with information about the\n        DQ results. This column makes it possible to identify if the DQ run\n        succeeded in general and, if not, it unlocks the insights to know what\n        specific rows have made the DQ validations fail and why. Default: False.\n        Note: it only works if result_sink_explode is True, gx_result_format is\n        COMPLETE, fail_on_error is False (which is done automatically when\n        you specify tag_source_data as True) and tbl_to_derive_pk or\n        unexpected_rows_pk is configured.\n    - store_backend - which store_backend to use (e.g. s3 or file_system).\n    - local_fs_root_dir - path of the root directory. Note: only applicable for\n        store_backend file_system.\n    - bucket - the bucket name to consider for the store_backend (store DQ artefacts).\n        Note: only applicable for store_backend s3.\n    - expectations_store_prefix - prefix where to store expectations' data. 
Note: only\n        applicable for store_backend s3.\n    - validations_store_prefix - prefix where to store validations' data. Note: only\n        applicable for store_backend s3.\n    - checkpoint_store_prefix - prefix where to store checkpoints' data. Note: only\n        applicable for store_backend s3.\n    - data_asset_name - name of the data asset to consider when configuring the great\n        expectations' data source.\n    - expectation_suite_name - name to consider for great expectations' suite.\n    - result_sink_db_table - db.table_name indicating the database and table in which\n        to save the results of the DQ process.\n    - result_sink_location - file system location in which to save the results of the\n        DQ process.\n    - result_sink_chunk_size - number of records per chunk when writing the results of\n        the DQ process. Default: 1000000 records.\n    - processed_keys_location - file system location where the keys processed by the\n        DQ Process are saved. This is specifically used when the DQ Type is PRISMA.\n        Note that this location is always constructed during the process, so any\n        value defined in the configuration will be overwritten.\n    - data_product_name - name of the data product.\n    - result_sink_partitions - the list of partitions to consider.\n    - result_sink_format - format of the result table (e.g. delta, parquet, kafka...).\n    - result_sink_options - extra spark options for configuring the result sink.\n        E.g: can be used to configure a Kafka sink if result_sink_format is kafka.\n    - result_sink_explode - flag to determine if the output table/location should have\n        the columns exploded (as True) or not (as False). Default: True.\n    - result_sink_extra_columns - list of extra columns to be exploded (following\n        the pattern \"<name>.*\") or columns to be selected. It is only used when\n        result_sink_explode is set to True.\n    - source - name of data source, to be easier to identify in analysis. If not\n        specified, it is set as default <input_id>. This will be only used\n        when result_sink_explode is set to True.\n    - fail_on_error - whether to fail the algorithm if the validations of your data in\n        the DQ process failed.\n    - cache_df - whether to cache the dataframe before running the DQ process or not.\n    - critical_functions - functions that should not fail. When this argument is\n        defined, fail_on_error is nullified.\n    - max_percentage_failure - percentage of failure that should be allowed.\n        This argument has priority over both fail_on_error and critical_functions.\n    - enable_row_condition - flag to determine if the row_conditions should be\n    enabled or not. row_conditions allow you to filter the rows that are\n    processed by the DQ functions. This is useful when you want to run the\n    DQ functions only on a subset of the data. Default: False. Note: When using PRISMA,\n    if you enable this flag, bear in mind that the number of processed keys will be\n    numerically different from the evaluated keys. 
This happens because the\n    row_conditions limit the number of rows that are processed by the DQ functions,\n    but we consider processed keys as all the keys that are passed to the dq_spec.\n    \"\"\"\n\n    spec_id: str\n    input_id: str\n    dq_type: str\n    dq_functions: Optional[List[DQFunctionSpec]] = None\n    dq_db_table: Optional[str] = None\n    dq_table_table_filter: Optional[str] = None\n    dq_table_extra_filters: Optional[str] = None\n    execution_point: Optional[str] = None\n    unexpected_rows_pk: Optional[List[str]] = None\n    tbl_to_derive_pk: Optional[str] = None\n    gx_result_format: Optional[str] = DQResultFormat.COMPLETE.value\n    tag_source_data: Optional[bool] = False\n    store_backend: str = DQDefaults.STORE_BACKEND.value\n    local_fs_root_dir: Optional[str] = None\n    bucket: Optional[str] = None\n    expectations_store_prefix: str = DQDefaults.EXPECTATIONS_STORE_PREFIX.value\n    validations_store_prefix: str = DQDefaults.VALIDATIONS_STORE_PREFIX.value\n    checkpoint_store_prefix: str = DQDefaults.CHECKPOINT_STORE_PREFIX.value\n    data_asset_name: Optional[str] = None\n    expectation_suite_name: Optional[str] = None\n    result_sink_db_table: Optional[str] = None\n    result_sink_location: Optional[str] = None\n    result_sink_chunk_size: Optional[int] = 1000000\n    processed_keys_location: Optional[str] = None\n    data_product_name: Optional[str] = None\n    result_sink_partitions: Optional[List[str]] = None\n    result_sink_format: str = OutputFormat.DELTAFILES.value\n    result_sink_options: Optional[dict] = None\n    result_sink_explode: bool = True\n    result_sink_extra_columns: Optional[List[str]] = None\n    source: Optional[str] = None\n    fail_on_error: bool = True\n    cache_df: bool = False\n    critical_functions: Optional[List[DQFunctionSpec]] = None\n    max_percentage_failure: Optional[float] = None\n    enable_row_condition: bool = False\n\n\n@dataclass\nclass MergeOptions(object):\n    \"\"\"Options for a merge operation.\n\n    - merge_predicate: predicate to apply to the merge operation so that we can\n        check if a new record corresponds to a record already included in the\n        historical data.\n    - insert_only: indicates if the merge should only insert data (e.g., deduplicate\n        scenarios).\n    - delete_predicate: predicate to apply to the delete operation.\n    - update_predicate: predicate to apply to the update operation.\n    - insert_predicate: predicate to apply to the insert operation.\n    - update_column_set: rules to apply to the update operation which allows to\n        set the value for each column to be updated.\n        (e.g. {\"data\": \"new.data\", \"count\": \"current.count + 1\"} )\n    - insert_column_set: rules to apply to the insert operation which allows to\n        set the value for each column to be inserted.\n        (e.g. 
{\"date\": \"updates.date\", \"count\": \"1\"} )\n    \"\"\"\n\n    merge_predicate: str\n    insert_only: bool = False\n    delete_predicate: Optional[str] = None\n    update_predicate: Optional[str] = None\n    insert_predicate: Optional[str] = None\n    update_column_set: Optional[dict] = None\n    insert_column_set: Optional[dict] = None\n\n\n@dataclass\nclass OutputSpec(object):\n    \"\"\"Specification of an algorithm output.\n\n    This is very aligned with the way the execution environment connects to the output\n    systems (e.g., spark outputs).\n\n    - spec_id: id of the output specification.\n    - input_id: id of the corresponding input specification.\n    - write_type: type of write operation.\n    - data_format: format of the output. Defaults to DELTA.\n    - db_table: table name in the form of `<db>.<table>`.\n    - location: uri that identifies from where to write data in the specified format.\n    - sharepoint_opts: options to apply on writing on Sharepoint operations.\n    - partitions: list of partition input_col names.\n    - merge_opts: options to apply to the merge operation.\n    - streaming_micro_batch_transformers: transformers to invoke for each streaming\n        micro batch, before writing (i.e., in Spark's foreachBatch structured\n        streaming function). Note: the lakehouse engine manages this for you, so\n        you don't have to manually specify streaming transformations here, so we don't\n        advise you to manually specify transformations through this parameter. Supply\n        them as regular transformers in the transform_specs sections of an ACON.\n    - streaming_once: if the streaming query is to be executed just once, or not,\n        generating just one micro batch.\n    - streaming_processing_time: if streaming query is to be kept alive, this indicates\n        the processing time of each micro batch.\n    - streaming_available_now: if set to True, set a trigger that processes all\n        available data in multiple batches then terminates the query.\n        When using streaming, this is the default trigger that the lakehouse-engine will\n        use, unless you configure a different one.\n    - streaming_continuous: set a trigger that runs a continuous query with a given\n        checkpoint interval.\n    - streaming_await_termination: whether to wait (True) for the termination of the\n        streaming query (e.g. timeout or exception) or not (False). Default: True.\n    - streaming_await_termination_timeout: a timeout to set to the\n        streaming_await_termination. Default: None.\n    - with_batch_id: whether to include the streaming batch id in the final data,\n        or not. It only takes effect in streaming mode.\n    - options: dict with other relevant options according to the execution environment\n        (e.g., spark) possible outputs.  E.g.,: JDBC options, checkpoint location for\n        streaming, etc.\n    - streaming_micro_batch_dq_processors: similar to streaming_micro_batch_transformers\n        but for the DQ functions to be executed. Used internally by the lakehouse\n        engine, so you don't have to supply DQ functions through this parameter. 
Use the\n        dq_specs of the acon instead.\n    \"\"\"\n\n    spec_id: str\n    input_id: str\n    write_type: str\n    data_format: str = OutputFormat.DELTAFILES.value\n    db_table: Optional[str] = None\n    location: Optional[str] = None\n    sharepoint_opts: Optional[SharepointOptions] = None\n    merge_opts: Optional[MergeOptions] = None\n    partitions: Optional[List[str]] = None\n    streaming_micro_batch_transformers: Optional[List[TransformerSpec]] = None\n    streaming_once: Optional[bool] = None\n    streaming_processing_time: Optional[str] = None\n    streaming_available_now: bool = True\n    streaming_continuous: Optional[str] = None\n    streaming_await_termination: bool = True\n    streaming_await_termination_timeout: Optional[int] = None\n    with_batch_id: bool = False\n    options: Optional[dict] = None\n    streaming_micro_batch_dq_processors: Optional[List[DQSpec]] = None\n\n\n@dataclass\nclass TerminatorSpec(object):\n    \"\"\"Terminator Specification.\n\n    I.e., the specification that defines a terminator operation to be executed. Examples\n    are compute statistics, vacuum, optimize, etc.\n\n    - function: terminator function to execute.\n    - args: arguments of the terminator function.\n    - input_id: id of the corresponding output specification (Optional).\n    \"\"\"\n\n    function: str\n    args: Optional[dict] = None\n    input_id: Optional[str] = None\n\n\n@dataclass\nclass ReconciliatorSpec(object):\n    \"\"\"Reconciliator Specification.\n\n    - metrics: list of metrics in the form of:\n        [{\n            metric: name of the column present in both truth and current datasets,\n            aggregation: sum, avg, max, min, ...,\n            type: percentage or absolute,\n            yellow: value,\n            red: value\n        }].\n    - recon_type: reconciliation type (percentage or absolute). Percentage calculates\n        the difference between truth and current results as a percentage (x-y/x), and\n        absolute calculates the raw difference (x - y).\n    - truth_input_spec: input specification of the truth data.\n    - current_input_spec: input specification of the current results data\n    - truth_preprocess_query: additional query on top of the truth input data to\n        preprocess the truth data before it gets fueled into the reconciliation process.\n        Important note: you need to assume that the data out of\n        the truth_input_spec is referencable by a table called 'truth'.\n    - truth_preprocess_query_args: optional dict having the functions/transformations to\n        apply on top of the truth_preprocess_query and respective arguments. Note: cache\n        is being applied on the Dataframe, by default. For turning the default behavior\n        off, pass `\"truth_preprocess_query_args\": []`.\n    - current_preprocess_query: additional query on top of the current results input\n        data to preprocess the current results data before it gets fueled into the\n        reconciliation process. Important note: you need to assume that the data out of\n        the current_results_input_spec is referencable by a table called 'current'.\n    - current_preprocess_query_args: optional dict having the\n        functions/transformations to apply on top of the current_preprocess_query\n        and respective arguments. Note: cache is being applied on the Dataframe,\n        by default. 
For turning the default behavior off, pass\n        `\"current_preprocess_query_args\": []`.\n    - ignore_empty_df: optional boolean to ignore the recon process if the source &\n       target dataframes are empty; in that case, recon exits with a success code\n       (passed).\n    \"\"\"\n\n    metrics: List[dict]\n    truth_input_spec: InputSpec\n    current_input_spec: InputSpec\n    truth_preprocess_query: Optional[str] = None\n    truth_preprocess_query_args: Optional[List[dict]] = None\n    current_preprocess_query: Optional[str] = None\n    current_preprocess_query_args: Optional[List[dict]] = None\n    ignore_empty_df: Optional[bool] = False\n\n\n@dataclass\nclass DQValidatorSpec(object):\n    \"\"\"Data Quality Validator Specification.\n\n    - input_spec: input specification of the data to be checked/validated.\n    - dq_spec: data quality specification.\n    - restore_prev_version: specify if, having\n        delta table/files as input, they should be restored to the\n        previous version if the data quality process fails. Note: this\n        is only considered if fail_on_error is kept as True.\n    \"\"\"\n\n    input_spec: InputSpec\n    dq_spec: DQSpec\n    restore_prev_version: Optional[bool] = False\n\n\nclass SQLDefinitions(Enum):\n    \"\"\"SQL definitions statements.\"\"\"\n\n    compute_table_stats = \"ANALYZE TABLE {} COMPUTE STATISTICS\"\n    drop_table_stmt = \"DROP TABLE IF EXISTS\"\n    drop_view_stmt = \"DROP VIEW IF EXISTS\"\n    truncate_stmt = \"TRUNCATE TABLE\"\n    describe_stmt = \"DESCRIBE TABLE\"\n    optimize_stmt = \"OPTIMIZE\"\n    show_tbl_props_stmt = \"SHOW TBLPROPERTIES\"\n    delete_where_stmt = \"DELETE FROM {} WHERE {}\"\n\n\nclass FileManagerAPIKeys(Enum):\n    \"\"\"File Manager s3 api keys.\"\"\"\n\n    CONTENTS = \"Contents\"\n    KEY = \"Key\"\n    CONTINUATION = \"NextContinuationToken\"\n    BUCKET = \"Bucket\"\n    OBJECTS = \"Objects\"\n\n\n@dataclass\nclass SensorSpec(object):\n    \"\"\"Sensor Specification.\n\n    - sensor_id: sensor id.\n    - assets: a list of assets that are considered as available to\n        consume downstream after this sensor has status\n        PROCESSED_NEW_DATA.\n    - control_db_table_name: db.table to store sensor metadata.\n    - input_spec: input specification of the source to be checked for new data.\n    - preprocess_query: SQL query to transform/filter the result from the\n        upstream. Consider that we should refer to 'new_data' whenever\n        we are referring to the input of the sensor. E.g.:\n            \"SELECT dummy_col FROM new_data WHERE ...\"\n    - checkpoint_location: optional location to store checkpoints to resume\n        from. These checkpoints use the same strategy as Spark checkpoints.\n        For Spark readers that do not support checkpoints, use the\n        preprocess_query parameter to form a SQL query to filter the result\n        from the upstream accordingly.\n    - fail_on_empty_result: if the sensor should throw an error if there is no new data\n    in the upstream. 
Default: True.\n    \"\"\"\n\n    sensor_id: str\n    assets: List[str]\n    control_db_table_name: str\n    input_spec: InputSpec\n    preprocess_query: Optional[str]\n    checkpoint_location: Optional[str]\n    fail_on_empty_result: bool = True\n\n    @classmethod\n    def create_from_acon(cls, acon: dict):  # type: ignore\n        \"\"\"Create SensorSpec from acon.\n\n        Args:\n            acon: sensor ACON.\n        \"\"\"\n        checkpoint_location = acon.get(\"base_checkpoint_location\")\n        if checkpoint_location:\n            checkpoint_location = (\n                f\"{checkpoint_location.rstrip('/')}/lakehouse_engine/\"\n                f\"sensors/{acon['sensor_id']}\"\n            )\n\n        return cls(\n            sensor_id=acon[\"sensor_id\"],\n            assets=acon[\"assets\"],\n            control_db_table_name=acon[\"control_db_table_name\"],\n            input_spec=InputSpec(**acon[\"input_spec\"]),\n            preprocess_query=acon.get(\"preprocess_query\"),\n            checkpoint_location=checkpoint_location,\n            fail_on_empty_result=acon.get(\"fail_on_empty_result\", True),\n        )\n\n\nclass SensorStatus(Enum):\n    \"\"\"Status for a sensor.\"\"\"\n\n    ACQUIRED_NEW_DATA = \"ACQUIRED_NEW_DATA\"\n    PROCESSED_NEW_DATA = \"PROCESSED_NEW_DATA\"\n\n\nSENSOR_SCHEMA = StructType(\n    [\n        StructField(\"sensor_id\", StringType(), False),\n        StructField(\"assets\", ArrayType(StringType(), False), True),\n        StructField(\"status\", StringType(), False),\n        StructField(\"status_change_timestamp\", TimestampType(), False),\n        StructField(\"checkpoint_location\", StringType(), True),\n        StructField(\"upstream_key\", StringType(), True),\n        StructField(\"upstream_value\", StringType(), True),\n    ]\n)\n\nSENSOR_UPDATE_SET: dict = {\n    \"sensors.sensor_id\": \"updates.sensor_id\",\n    \"sensors.status\": \"updates.status\",\n    \"sensors.status_change_timestamp\": \"updates.status_change_timestamp\",\n}\n\nSENSOR_ALLOWED_DATA_FORMATS = {\n    ReadType.STREAMING.value: [InputFormat.KAFKA.value, *FILE_INPUT_FORMATS],\n    ReadType.BATCH.value: [\n        InputFormat.DELTAFILES.value,\n        InputFormat.JDBC.value,\n    ],\n}\n\n\nclass SAPLogchain(Enum):\n    \"\"\"Defaults used on consuming data from SAP Logchain.\"\"\"\n\n    DBTABLE = \"SAPPHA.RSPCLOGCHAIN\"\n    GREEN_STATUS = \"G\"\n    ENGINE_TABLE = \"sensor_new_data\"\n\n\nclass RestoreType(Enum):\n    \"\"\"Archive types.\"\"\"\n\n    BULK = \"Bulk\"\n    STANDARD = \"Standard\"\n    EXPEDITED = \"Expedited\"\n\n    @classmethod\n    def values(cls):  # type: ignore\n        \"\"\"Generates a list containing all enum values.\n\n        Returns:\n            A list with all enum values.\n        \"\"\"\n        return (c.value for c in cls)\n\n    @classmethod\n    def exists(cls, restore_type: str) -> bool:\n        \"\"\"Checks if the restore type exists in the enum values.\n\n        Args:\n            restore_type: restore type to check if exists.\n\n        Returns:\n            If the restore type exists in our enum.\n        \"\"\"\n        return restore_type in cls.values()\n\n\nclass RestoreStatus(Enum):\n    \"\"\"Archive types.\"\"\"\n\n    NOT_STARTED = \"not_started\"\n    ONGOING = \"ongoing\"\n    RESTORED = \"restored\"\n\n\nARCHIVE_STORAGE_CLASS = [\n    \"GLACIER\",\n    \"DEEP_ARCHIVE\",\n    \"GLACIER_IR\",\n]\n\n\nclass SQLParser(Enum):\n    \"\"\"Defaults to use for parsing.\"\"\"\n\n    DOUBLE_QUOTES = '\"'\n    
SINGLE_QUOTES = \"'\"\n    BACKSLASH = \"\\\\\"\n    SINGLE_TRACE = \"-\"\n    DOUBLE_TRACES = \"--\"\n    SLASH = \"/\"\n    OPENING_MULTIPLE_LINE_COMMENT = \"/*\"\n    CLOSING_MULTIPLE_LINE_COMMENT = \"*/\"\n    PARAGRAPH = \"\\n\"\n    STAR = \"*\"\n\n    MULTIPLE_LINE_COMMENT = [\n        OPENING_MULTIPLE_LINE_COMMENT,\n        CLOSING_MULTIPLE_LINE_COMMENT,\n    ]\n\n\nclass GABDefaults(Enum):\n    \"\"\"Defaults used on the GAB process.\"\"\"\n\n    DATE_FORMAT = \"%Y-%m-%d\"\n    DIMENSIONS_DEFAULT_COLUMNS = [\"from_date\", \"to_date\"]\n    DEFAULT_DIMENSION_CALENDAR_TABLE = \"dim_calendar\"\n    DEFAULT_LOOKUP_QUERY_BUILDER_TABLE = \"lkp_query_builder\"\n\n\nclass GABStartOfWeek(Enum):\n    \"\"\"Representation of start of week values on GAB.\"\"\"\n\n    SUNDAY = \"S\"\n    MONDAY = \"M\"\n\n    @classmethod\n    def get_start_of_week(cls) -> dict:\n        \"\"\"Get the start of week enum as a dict.\n\n        Returns:\n            dict containing all enum entries as `{name:value}`.\n        \"\"\"\n        return {\n            start_of_week.name: start_of_week.value for start_of_week in GABStartOfWeek\n        }\n\n    @classmethod\n    def get_values(cls) -> set[str]:\n        \"\"\"Get the start of week enum values as set.\n\n        Returns:\n            set containing all possible values `{value}`.\n        \"\"\"\n        return {start_of_week.value for start_of_week in GABStartOfWeek}\n\n\n@dataclass\nclass GABSpec(object):\n    \"\"\"Gab Specification.\n\n    - query_label_filter: query use-case label to execute.\n    - queue_filter: queue to execute the job.\n    - cadence_filter: selected cadences to build the asset.\n    - target_database: target database to write.\n    - curr_date: current date.\n    - start_date: period start date.\n    - end_date: period end date.\n    - rerun_flag: rerun flag.\n    - target_table: target table to write.\n    - source_database: source database.\n    - gab_base_path: base path to read the use cases.\n    - lookup_table: gab configuration table.\n    - calendar_table: gab calendar table.\n    \"\"\"\n\n    query_label_filter: list[str]\n    queue_filter: list[str]\n    cadence_filter: list[str]\n    target_database: str\n    current_date: datetime\n    start_date: datetime\n    end_date: datetime\n    rerun_flag: str\n    target_table: str\n    source_database: str\n    gab_base_path: str\n    lookup_table: str\n    calendar_table: str\n\n    @classmethod\n    def create_from_acon(cls, acon: dict):  # type: ignore\n        \"\"\"Create GabSpec from acon.\n\n        Args:\n            acon: gab ACON.\n        \"\"\"\n        lookup_table = f\"{acon['source_database']}.\" + (\n            acon.get(\n                \"lookup_table\", GABDefaults.DEFAULT_LOOKUP_QUERY_BUILDER_TABLE.value\n            )\n        )\n\n        calendar_table = f\"{acon['source_database']}.\" + (\n            acon.get(\n                \"calendar_table\", GABDefaults.DEFAULT_DIMENSION_CALENDAR_TABLE.value\n            )\n        )\n\n        def format_date(date_to_format: datetime | str) -> datetime:\n            if isinstance(date_to_format, str):\n                return datetime.strptime(date_to_format, GABDefaults.DATE_FORMAT.value)\n            else:\n                return date_to_format\n\n        return cls(\n            query_label_filter=acon[\"query_label_filter\"],\n            queue_filter=acon[\"queue_filter\"],\n            cadence_filter=acon[\"cadence_filter\"],\n            target_database=acon[\"target_database\"],\n            
current_date=datetime.now(),\n            start_date=format_date(acon[\"start_date\"]),\n            end_date=format_date(acon[\"end_date\"]),\n            rerun_flag=acon[\"rerun_flag\"],\n            target_table=acon[\"target_table\"],\n            source_database=acon[\"source_database\"],\n            gab_base_path=acon[\"gab_base_path\"],\n            lookup_table=lookup_table,\n            calendar_table=calendar_table,\n        )\n\n\nclass GABCadence(Enum):\n    \"\"\"Representation of the supported cadences on GAB.\"\"\"\n\n    DAY = 1\n    WEEK = 2\n    MONTH = 3\n    QUARTER = 4\n    YEAR = 5\n\n    @classmethod\n    def get_ordered_cadences(cls) -> dict:\n        \"\"\"Get the cadences ordered by the value.\n\n        Returns:\n            dict containing ordered cadences as `{name:value}`.\n        \"\"\"\n        return {\n            cadence.name: cadence.value\n            for cadence in sorted(GABCadence, key=lambda gab_cadence: gab_cadence.value)\n        }\n\n    @classmethod\n    def get_cadences(cls) -> set[str]:\n        \"\"\"Get the cadence names as a set.\n\n        Returns:\n            set containing all possible cadence names as `{name}`.\n        \"\"\"\n        return {cadence.name for cadence in GABCadence}\n\n    @classmethod\n    def order_cadences(cls, cadences_to_order: list[str]) -> list[str]:\n        \"\"\"Order a list of cadences by value.\n\n        Returns:\n            ordered list containing the received cadences.\n        \"\"\"\n        return sorted(\n            cadences_to_order,\n            key=lambda item: cls.get_ordered_cadences().get(item),  # type: ignore\n        )\n\n\nclass GABKeys:\n    \"\"\"Constants used to update pre-configured gab dict keys.\"\"\"\n\n    JOIN_SELECT = \"join_select\"\n    PROJECT_START = \"project_start\"\n    PROJECT_END = \"project_end\"\n\n\nclass GABReplaceableKeys:\n    \"\"\"Constants used to replace pre-configured gab dict values.\"\"\"\n\n    CADENCE = \"${cad}\"\n    DATE_COLUMN = \"${date_column}\"\n    CONFIG_WEEK_START = \"${config_week_start}\"\n    RECONCILIATION_CADENCE = \"${rec_cadence}\"\n\n\nclass GABCombinedConfiguration(Enum):\n    \"\"\"GAB combined configuration.\n\n    Based on the use case configuration, return the values to override in the SQL file.\n    This enum aims to exhaustively map each combination of `cadence`, `reconciliation`,\n        `week_start` and `snap_flag` to the corresponding values `join_select`,\n        `project_start` and `project_end` to replace these values in the stages SQL file.\n\n    Return corresponding configuration (join_select, project_start, project_end) for\n        each combination (cadence x recon x week_start x snap_flag).\n    \"\"\"\n\n    _PROJECT_DATE_COLUMN_TRUNCATED_BY_CADENCE = (\n        \"date(date_trunc('${cad}',${date_column}))\"\n    )\n    _DEFAULT_PROJECT_START = \"df_cal.cadence_start_date\"\n    _DEFAULT_PROJECT_END = \"df_cal.cadence_end_date\"\n\n    COMBINED_CONFIGURATION = {\n        # Combination of:\n        # - cadence: `DAY`\n        # - reconciliation_window: `DAY`, `WEEK`, `MONTH`, `QUARTER`, `YEAR`\n        # - week_start: `S`, `M`\n        # - snapshot_flag: `Y`, `N`\n        1: {\n            \"cadence\": GABCadence.DAY.name,\n            \"recon\": GABCadence.get_cadences(),\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": {\"Y\", \"N\"},\n            \"join_select\": \"\",\n            \"project_start\": _PROJECT_DATE_COLUMN_TRUNCATED_BY_CADENCE,\n            \"project_end\": 
_PROJECT_DATE_COLUMN_TRUNCATED_BY_CADENCE,\n        },\n        # Combination of:\n        # - cadence: `WEEK`\n        # - reconciliation_window: `DAY`\n        # - week_start: `S`, `M`\n        # - snapshot_flag: `Y`\n        2: {\n            \"cadence\": GABCadence.WEEK.name,\n            \"recon\": GABCadence.DAY.name,\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct case\n                when '${config_week_start}' = 'Monday' then weekstart_mon\n                when '${config_week_start}' = 'Sunday' then weekstart_sun\n            end as cadence_start_date,\n            calendar_date as cadence_end_date\n        \"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        # Combination of:\n        # - cadence: `WEEK`\n        # - reconciliation_window: `DAY, `MONTH`, `QUARTER`, `YEAR`\n        # - week_start: `M`\n        # - snapshot_flag: `Y`, `N`\n        3: {\n            \"cadence\": GABCadence.WEEK.name,\n            \"recon\": {\n                GABCadence.DAY.name,\n                GABCadence.MONTH.name,\n                GABCadence.QUARTER.name,\n                GABCadence.YEAR.name,\n            },\n            \"week_start\": \"M\",\n            \"snap_flag\": {\"Y\", \"N\"},\n            \"join_select\": \"\"\"\n            select distinct case\n                when '${config_week_start}'  = 'Monday' then weekstart_mon\n                when '${config_week_start}' = 'Sunday' then weekstart_sun\n            end as cadence_start_date,\n            case\n                when '${config_week_start}' = 'Monday' then weekend_mon\n                when '${config_week_start}' = 'Sunday' then weekend_sun\n            end as cadence_end_date\"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        4: {\n            \"cadence\": GABCadence.MONTH.name,\n            \"recon\": GABCadence.DAY.name,\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct month_start as cadence_start_date,\n            calendar_date as cadence_end_date\n        \"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        5: {\n            \"cadence\": GABCadence.MONTH.name,\n            \"recon\": GABCadence.WEEK.name,\n            \"week_start\": GABStartOfWeek.MONDAY.value,\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct month_start as cadence_start_date,\n            case\n                when date(\n                    date_trunc('MONTH',add_months(calendar_date, 1))\n                )-1 < weekend_mon\n                    then date(date_trunc('MONTH',add_months(calendar_date, 1)))-1\n                else weekend_mon\n            end as cadence_end_date\"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        6: {\n            \"cadence\": GABCadence.MONTH.name,\n            \"recon\": GABCadence.WEEK.name,\n            \"week_start\": GABStartOfWeek.SUNDAY.value,\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct month_start as cadence_start_date,\n            case\n                when 
date(\n                    date_trunc('MONTH',add_months(calendar_date, 1))\n                )-1 < weekend_sun\n                    then date(date_trunc('MONTH',add_months(calendar_date, 1)))-1\n                else weekend_sun\n            end as cadence_end_date\"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        7: {\n            \"cadence\": GABCadence.MONTH.name,\n            \"recon\": GABCadence.get_cadences(),\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": {\"Y\", \"N\"},\n            \"join_select\": \"\",\n            \"project_start\": _PROJECT_DATE_COLUMN_TRUNCATED_BY_CADENCE,\n            \"project_end\": \"date(date_trunc('MONTH',add_months(${date_column}, 1)))-1\",\n        },\n        8: {\n            \"cadence\": GABCadence.QUARTER.name,\n            \"recon\": GABCadence.DAY.name,\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct quarter_start as cadence_start_date,\n            calendar_date as cadence_end_date\n        \"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        9: {\n            \"cadence\": GABCadence.QUARTER.name,\n            \"recon\": GABCadence.WEEK.name,\n            \"week_start\": GABStartOfWeek.MONDAY.value,\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct quarter_start as cadence_start_date,\n            case\n                when weekend_mon > date(\n                    date_trunc('QUARTER',add_months(calendar_date, 3))\n                )-1\n                    then date(date_trunc('QUARTER',add_months(calendar_date, 3)))-1\n                else weekend_mon\n            end as cadence_end_date\"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        10: {\n            \"cadence\": GABCadence.QUARTER.name,\n            \"recon\": GABCadence.WEEK.name,\n            \"week_start\": GABStartOfWeek.SUNDAY.value,\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct quarter_start as cadence_start_date,\n            case\n                when weekend_sun > date(\n                    date_trunc('QUARTER',add_months(calendar_date, 3))\n                )-1\n                    then date(date_trunc('QUARTER',add_months(calendar_date, 3)))-1\n                else weekend_sun\n            end as cadence_end_date\"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        11: {\n            \"cadence\": GABCadence.QUARTER.name,\n            \"recon\": GABCadence.MONTH.name,\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct quarter_start as cadence_start_date,\n            month_end as cadence_end_date\n        \"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        12: {\n            \"cadence\": GABCadence.QUARTER.name,\n            \"recon\": GABCadence.YEAR.name,\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": \"N\",\n            \"join_select\": \"\",\n            
\"project_start\": _PROJECT_DATE_COLUMN_TRUNCATED_BY_CADENCE,\n            \"project_end\": \"\"\"\n            date(\n                date_trunc(\n                    '${cad}',add_months(date(date_trunc('${cad}',${date_column})), 3)\n                )\n            )-1\n        \"\"\",\n        },\n        13: {\n            \"cadence\": GABCadence.QUARTER.name,\n            \"recon\": GABCadence.get_cadences(),\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": \"N\",\n            \"join_select\": \"\",\n            \"project_start\": _PROJECT_DATE_COLUMN_TRUNCATED_BY_CADENCE,\n            \"project_end\": \"\"\"\n            date(\n                date_trunc(\n                    '${cad}',add_months( date(date_trunc('${cad}',${date_column})), 3)\n                )\n            )-1\n        \"\"\",\n        },\n        14: {\n            \"cadence\": GABCadence.YEAR.name,\n            \"recon\": GABCadence.WEEK.name,\n            \"week_start\": GABStartOfWeek.MONDAY.value,\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct year_start as cadence_start_date,\n            case\n                when weekend_mon > date(\n                    date_trunc('YEAR',add_months(calendar_date, 12))\n                )-1\n                    then date(date_trunc('YEAR',add_months(calendar_date, 12)))-1\n                else weekend_mon\n            end as cadence_end_date\"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        15: {\n            \"cadence\": GABCadence.YEAR.name,\n            \"recon\": GABCadence.WEEK.name,\n            \"week_start\": GABStartOfWeek.SUNDAY.value,\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct year_start as cadence_start_date,\n            case\n                when weekend_sun > date(\n                    date_trunc('YEAR',add_months(calendar_date, 12))\n                )-1\n                    then date(date_trunc('YEAR',add_months(calendar_date, 12)))-1\n                else weekend_sun\n            end as cadence_end_date\"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        16: {\n            \"cadence\": GABCadence.YEAR.name,\n            \"recon\": GABCadence.get_cadences(),\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": \"N\",\n            \"inverse_flag\": \"Y\",\n            \"join_select\": \"\",\n            \"project_start\": _PROJECT_DATE_COLUMN_TRUNCATED_BY_CADENCE,\n            \"project_end\": \"\"\"\n            date(\n                date_trunc(\n                    '${cad}',add_months(date(date_trunc('${cad}',${date_column})), 12)\n                )\n            )-1\n        \"\"\",\n        },\n        17: {\n            \"cadence\": GABCadence.YEAR.name,\n            \"recon\": {\n                GABCadence.DAY.name,\n                GABCadence.MONTH.name,\n                GABCadence.QUARTER.name,\n            },\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": \"Y\",\n            \"join_select\": \"\"\"\n            select distinct year_start as cadence_start_date,\n            case\n                when '${rec_cadence}' = 'DAY' then calendar_date\n                when '${rec_cadence}' = 'MONTH' then month_end\n                when '${rec_cadence}' = 'QUARTER' then 
quarter_end\n            end as cadence_end_date\n        \"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n        18: {\n            \"cadence\": GABCadence.get_cadences(),\n            \"recon\": GABCadence.get_cadences(),\n            \"week_start\": GABStartOfWeek.get_values(),\n            \"snap_flag\": {\"Y\", \"N\"},\n            \"join_select\": \"\"\"\n            select distinct\n            case\n                when '${cad}' = 'WEEK' and '${config_week_start}' = 'Monday'\n                    then weekstart_mon\n                when  '${cad}' = 'WEEK' and '${config_week_start}' = 'Sunday'\n                    then weekstart_sun\n                else\n                    date(date_trunc('${cad}',calendar_date))\n            end as cadence_start_date,\n            case\n                when '${cad}' = 'WEEK' and '${config_week_start}' = 'Monday'\n                    then weekend_mon\n                when  '${cad}' = 'WEEK' and '${config_week_start}' = 'Sunday'\n                    then weekend_sun\n                when '${cad}' = 'DAY'\n                    then date(date_trunc('${cad}',calendar_date))\n                when '${cad}' = 'MONTH'\n                    then date(\n                        date_trunc(\n                            'MONTH',\n                            add_months(date(date_trunc('${cad}',calendar_date)), 1)\n                        )\n                    )-1\n                when '${cad}' = 'QUARTER'\n                    then date(\n                        date_trunc(\n                            'QUARTER',\n                            add_months(date(date_trunc('${cad}',calendar_date)) , 3)\n                        )\n                    )-1\n                when '${cad}' = 'YEAR'\n                    then date(\n                        date_trunc(\n                            'YEAR',\n                            add_months(date(date_trunc('${cad}',calendar_date)), 12)\n                        )\n                    )-1\n            end as cadence_end_date\n        \"\"\",\n            \"project_start\": _DEFAULT_PROJECT_START,\n            \"project_end\": _DEFAULT_PROJECT_END,\n        },\n    }\n\n\n@dataclass\nclass HeartbeatConfigSpec(object):\n    \"\"\"Heartbeat Configurations and control table specifications.\n\n    This provides the way in which the Heartbeat can pass environment and\n    specific quantum related config information to sensor acon.\n\n    - sensor_source: specifies the source system of sensor, for e.g.\n        sap_b4, sap_bw, delta_table, kafka, lmu_delta_table, trigger_file etc.\n        It is also a part of heartbeat control table, Therefore it is useful for\n        filtering out data from Heartbeat control table based on template source system.\n    - data_format: format of the input source, e.g jdbc, delta, kafka, cloudfiles etc.\n    - heartbeat_sensor_db_table: heartbeat control table along\n        with database from config.\n    - lakehouse_engine_sensor_db_table: Control table along with database(config).\n    - options: dict with other relevant options for reading data from specified input\n        data_format. This can vary for each source system.\n        For e.g. 
For SAP systems, DRIVER, URL, USERNAME, PASSWORD are required which are\n        all read from the quantum config file.\n    - jdbc_db_table: schema and table name of JDBC sources.\n    - token: token to access the Databricks Job API (read from config).\n    - domain: workspace domain url for quantum (read from config).\n    - base_checkpoint_location: checkpoint location for streaming sources(from config).\n    - kafka_configs: configs required for kafka, read from config as JSON.\n        config hierarchy is [sensor_kafka --> <dp_name/prefix> --> main kafka options].\n    - kafka_secret_scope: secret scope for kafka (read from config).\n    - base_trigger_file_location: location where all the trigger files are being\n        created (read from config).\n    - schema_dict: dict representation of schema of the trigger file (e.g. Spark struct\n        type schema).\n    \"\"\"\n\n    sensor_source: str\n    data_format: str\n    heartbeat_sensor_db_table: str\n    lakehouse_engine_sensor_db_table: str\n    token: str\n    domain: str\n    options: Optional[dict]\n    jdbc_db_table: Optional[str]\n    base_checkpoint_location: Optional[str]\n    kafka_configs: Optional[dict]\n    kafka_secret_scope: Optional[str]\n    base_trigger_file_location: Optional[str]\n    schema_dict: Optional[dict]\n\n    @classmethod\n    def create_from_acon(cls, acon: dict):  # type: ignore\n        \"\"\"Create HeartbeatConfigSpec from acon.\n\n        Args:\n            acon: Heartbeat ACON.\n        \"\"\"\n        return cls(\n            sensor_source=acon[\"sensor_source\"],\n            data_format=acon[\"data_format\"],\n            heartbeat_sensor_db_table=acon[\"heartbeat_sensor_db_table\"],\n            lakehouse_engine_sensor_db_table=acon[\"lakehouse_engine_sensor_db_table\"],\n            token=acon[\"token\"],\n            domain=acon[\"domain\"],\n            options=acon.get(\"options\"),\n            jdbc_db_table=acon.get(\"jdbc_db_table\"),\n            base_checkpoint_location=acon.get(\"base_checkpoint_location\"),\n            kafka_configs=acon.get(\"kafka_configs\"),\n            kafka_secret_scope=acon.get(\"kafka_secret_scope\"),\n            base_trigger_file_location=acon.get(\"base_trigger_file_location\"),\n            schema_dict=acon.get(\"schema_dict\"),\n        )\n\n\nclass HeartbeatSensorSource(Enum):\n    \"\"\"Supported Heartbeat sensor source systems.\"\"\"\n\n    SAP_BW = \"sap_bw\"\n    SAP_B4 = \"sap_b4\"\n    DELTA_TABLE = \"delta_table\"\n    KAFKA = \"kafka\"\n    LMU_DELTA_TABLE = \"lmu_delta_table\"\n    TRIGGER_FILE = \"trigger_file\"\n\n    @classmethod\n    def values(cls):  # type: ignore\n        \"\"\"Generates a list containing all enum values.\n\n        Returns:\n            A list with all enum values.\n        \"\"\"\n        return (c.value for c in cls)\n\n\nclass HeartbeatStatus(Enum):\n    \"\"\"Status for a Heartbeat sensor.\"\"\"\n\n    NEW_EVENT_AVAILABLE = \"NEW_EVENT_AVAILABLE\"\n    IN_PROGRESS = \"IN_PROGRESS\"\n    COMPLETED = \"COMPLETED\"\n\n\nHEARTBEAT_SENSOR_UPDATE_SET: dict = {\n    \"target.sensor_source\": \"src.sensor_source\",\n    \"target.sensor_id\": \"src.sensor_id\",\n    \"target.asset_description\": \"src.asset_description\",\n    \"target.upstream_key\": \"src.upstream_key\",\n    \"target.preprocess_query\": \"src.preprocess_query\",\n    \"target.latest_event_fetched_timestamp\": \"src.latest_event_fetched_timestamp\",\n    \"target.trigger_job_id\": \"src.trigger_job_id\",\n    \"target.trigger_job_name\": \"src.trigger_job_name\",\n    
\"target.status\": \"src.status\",\n    \"target.status_change_timestamp\": \"src.status_change_timestamp\",\n    \"target.job_start_timestamp\": \"src.job_start_timestamp\",\n    \"target.job_end_timestamp\": \"src.job_end_timestamp\",\n    \"target.job_state\": \"src.job_state\",\n    \"target.dependency_flag\": \"src.dependency_flag\",\n    \"target.sensor_read_type\": \"src.sensor_read_type\",\n}\n\n\nTABLE_MANAGER_OPERATIONS = {\n    \"compute_table_statistics\": {\"table_or_view\": {\"type\": \"str\", \"mandatory\": True}},\n    \"create_table\": {\n        \"path\": {\"type\": \"str\", \"mandatory\": True},\n        \"disable_dbfs_retry\": {\"type\": \"bool\", \"mandatory\": False},\n        \"delimiter\": {\"type\": \"str\", \"mandatory\": False},\n        \"advanced_parser\": {\"type\": \"bool\", \"mandatory\": False},\n    },\n    \"create_tables\": {\n        \"path\": {\"type\": \"str\", \"mandatory\": True},\n        \"disable_dbfs_retry\": {\"type\": \"bool\", \"mandatory\": False},\n        \"delimiter\": {\"type\": \"str\", \"mandatory\": False},\n        \"advanced_parser\": {\"type\": \"bool\", \"mandatory\": False},\n    },\n    \"create_view\": {\n        \"path\": {\"type\": \"str\", \"mandatory\": True},\n        \"disable_dbfs_retry\": {\"type\": \"bool\", \"mandatory\": False},\n        \"delimiter\": {\"type\": \"str\", \"mandatory\": False},\n        \"advanced_parser\": {\"type\": \"bool\", \"mandatory\": False},\n    },\n    \"drop_table\": {\"table_or_view\": {\"type\": \"str\", \"mandatory\": True}},\n    \"drop_view\": {\"table_or_view\": {\"type\": \"str\", \"mandatory\": True}},\n    \"execute_sql\": {\n        \"sql\": {\"type\": \"str\", \"mandatory\": True},\n        \"delimiter\": {\"type\": \"str\", \"mandatory\": False},\n        \"advanced_parser\": {\"type\": \"bool\", \"mandatory\": False},\n    },\n    \"truncate\": {\"table_or_view\": {\"type\": \"str\", \"mandatory\": True}},\n    \"vacuum\": {\n        \"table_or_view\": {\"type\": \"str\", \"mandatory\": False},\n        \"path\": {\"type\": \"str\", \"mandatory\": False},\n        \"vacuum_hours\": {\"type\": \"int\", \"mandatory\": False},\n    },\n    \"describe\": {\"table_or_view\": {\"type\": \"str\", \"mandatory\": True}},\n    \"optimize\": {\n        \"table_or_view\": {\"type\": \"str\", \"mandatory\": False},\n        \"path\": {\"type\": \"str\", \"mandatory\": False},\n        \"where_clause\": {\"type\": \"str\", \"mandatory\": False},\n        \"optimize_zorder_col_list\": {\"type\": \"str\", \"mandatory\": False},\n    },\n    \"show_tbl_properties\": {\"table_or_view\": {\"type\": \"str\", \"mandatory\": True}},\n    \"get_tbl_pk\": {\"table_or_view\": {\"type\": \"str\", \"mandatory\": True}},\n    \"repair_table\": {\n        \"table_or_view\": {\"type\": \"str\", \"mandatory\": True},\n        \"sync_metadata\": {\"type\": \"bool\", \"mandatory\": True},\n    },\n    \"delete_where\": {\n        \"table_or_view\": {\"type\": \"str\", \"mandatory\": True},\n        \"where_clause\": {\"type\": \"str\", \"mandatory\": True},\n    },\n}\n\n\nFILE_MANAGER_OPERATIONS = {\n    \"delete_objects\": {\n        \"bucket\": {\"type\": \"str\", \"mandatory\": True},\n        \"object_paths\": {\"type\": \"list\", \"mandatory\": True},\n        \"dry_run\": {\"type\": \"bool\", \"mandatory\": True},\n    },\n    \"copy_objects\": {\n        \"bucket\": {\"type\": \"str\", \"mandatory\": True},\n        \"source_object\": {\"type\": \"str\", \"mandatory\": True},\n        
\"destination_bucket\": {\"type\": \"str\", \"mandatory\": True},\n        \"destination_object\": {\"type\": \"str\", \"mandatory\": True},\n        \"dry_run\": {\"type\": \"bool\", \"mandatory\": True},\n    },\n    \"move_objects\": {\n        \"bucket\": {\"type\": \"str\", \"mandatory\": True},\n        \"source_object\": {\"type\": \"str\", \"mandatory\": True},\n        \"destination_bucket\": {\"type\": \"str\", \"mandatory\": True},\n        \"destination_object\": {\"type\": \"str\", \"mandatory\": True},\n        \"dry_run\": {\"type\": \"bool\", \"mandatory\": True},\n    },\n    \"request_restore\": {\n        \"bucket\": {\"type\": \"str\", \"mandatory\": True},\n        \"source_object\": {\"type\": \"str\", \"mandatory\": True},\n        \"restore_expiration\": {\"type\": \"int\", \"mandatory\": True},\n        \"retrieval_tier\": {\"type\": \"str\", \"mandatory\": True},\n        \"dry_run\": {\"type\": \"bool\", \"mandatory\": True},\n    },\n    \"check_restore_status\": {\n        \"bucket\": {\"type\": \"str\", \"mandatory\": True},\n        \"source_object\": {\"type\": \"str\", \"mandatory\": True},\n    },\n    \"request_restore_to_destination_and_wait\": {\n        \"bucket\": {\"type\": \"str\", \"mandatory\": True},\n        \"source_object\": {\"type\": \"str\", \"mandatory\": True},\n        \"destination_bucket\": {\"type\": \"str\", \"mandatory\": True},\n        \"destination_object\": {\"type\": \"str\", \"mandatory\": True},\n        \"restore_expiration\": {\"type\": \"int\", \"mandatory\": True},\n        \"retrieval_tier\": {\"type\": \"str\", \"mandatory\": True},\n        \"dry_run\": {\"type\": \"bool\", \"mandatory\": True},\n    },\n}\n"
  },
  {
    "path": "lakehouse_engine/core/exec_env.py",
    "content": "\"\"\"Module to take care of creating a singleton of the execution environment class.\"\"\"\n\nfrom dataclasses import replace\n\nfrom pyspark.sql import DataFrame, SparkSession\n\nfrom lakehouse_engine.core.definitions import EngineConfig\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.databricks_utils import DatabricksUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass ExecEnv(object):\n    \"\"\"Represents the basic resources regarding the engine execution environment.\n\n    Currently, it is used to encapsulate both the logic to get the Spark\n    session and the engine configurations.\n    \"\"\"\n\n    SESSION: SparkSession\n    _LOGGER = LoggingHandler(__name__).get_logger()\n    ENGINE_CONFIG: EngineConfig = EngineConfig(**ConfigUtils.get_config())\n    IS_SERVERLESS = DatabricksUtils.is_serverless_workload()\n\n    @classmethod\n    def set_default_engine_config(\n        cls,\n        package: str = \"lakehouse_engine.configs\",\n        custom_configs_dict: dict = None,\n        custom_configs_file_path: str = None,\n    ) -> None:\n        \"\"\"Set default engine configurations.\n\n        The function set the default engine configurations by reading\n        them from a specified package and overwrite them if the user\n        pass a dictionary or a file path with new configurations.\n\n        Args:\n            package: package where the engine default configurations can be found.\n            custom_configs_dict: a dictionary with custom configurations\n            to overwrite the default ones.\n            custom_configs_file_path: path for the file with custom\n            configurations to overwrite the default ones.\n        \"\"\"\n        cls.ENGINE_CONFIG = EngineConfig(**ConfigUtils.get_config(package))\n        if custom_configs_dict:\n            cls.ENGINE_CONFIG = replace(cls.ENGINE_CONFIG, **custom_configs_dict)\n        if custom_configs_file_path:\n            cls.ENGINE_CONFIG = replace(\n                cls.ENGINE_CONFIG,\n                **ConfigUtils.get_config_from_file(custom_configs_file_path),\n            )\n\n    @classmethod\n    def get_or_create(\n        cls,\n        session: SparkSession = None,\n        enable_hive_support: bool = True,\n        app_name: str = None,\n        config: dict = None,\n    ) -> None:\n        \"\"\"Get or create an execution environment session (currently Spark).\n\n        It instantiates a singleton session that can be accessed anywhere from the\n        lakehouse engine. By default, if there is an existing Spark Session in\n        the environment (getActiveSession()), this function re-uses it. 
It can\n        be further extended in the future to support forcing the creation of new\n        isolated sessions even when a Spark Session is already active.\n\n        Args:\n            session: spark session.\n            enable_hive_support: whether to enable hive support or not.\n            app_name: application name.\n            config: extra spark configs to supply to the spark session.\n        \"\"\"\n        if not cls.IS_SERVERLESS:\n            default_config = {\n                \"spark.databricks.delta.optimizeWrite.enabled\": True,\n                \"spark.sql.adaptive.enabled\": True,\n                \"spark.databricks.delta.merge.enableLowShuffle\": True,\n            }\n            cls._LOGGER.info(\n                f\"Using the following default configs you may want to override them \"\n                f\"for your job: {default_config}\"\n            )\n        else:\n            default_config = {}\n        final_config: dict = {**default_config, **(config if config else {})}\n        cls._LOGGER.info(f\"Final config is: {final_config}\")\n\n        if session:\n            cls.SESSION = session\n        elif SparkSession.getActiveSession():\n            cls.SESSION = SparkSession.getActiveSession()\n            cls._set_spark_configs(final_config)\n        else:\n            cls._LOGGER.info(\"Creating a new Spark Session\")\n\n            session_builder = SparkSession.builder.appName(app_name)\n            cls._set_spark_configs(final_config, session_builder)\n\n            if enable_hive_support:\n                session_builder = session_builder.enableHiveSupport()\n            cls.SESSION = session_builder.getOrCreate()\n\n    @classmethod\n    def get_for_each_batch_session(cls, df: DataFrame) -> None:\n        \"\"\"Get the execution environment session for foreachBatch operations.\n\n        For Spark connect scenarios, spark is not able to re-use the Spark session\n        from an external scope as it cannot serialise it, so the session\n        needs to be retrieved and stored again in the ExecEnv class.\n        \"\"\"\n        cls.SESSION = df.sparkSession.getActiveSession()\n\n    @classmethod\n    def _set_spark_configs(\n        cls, final_config: dict, session_builder: SparkSession.Builder = None\n    ) -> None:\n        \"\"\"Set Spark session configurations based on final_config.\n\n        This method attempts to set each configuration key-value pair in the provided\n        final_config dictionary to the Spark session. 
If a configuration key is not\n        available in the current environment, it logs a warning and skips that key.\n\n        Args:\n            final_config: dictionary with spark configurations to set.\n            session_builder: spark session builder.\n        \"\"\"\n        for key, value in final_config.items():\n            try:\n                if session_builder:\n                    session_builder.config(key, value)\n                else:\n                    cls.SESSION.conf.set(key, value)\n            except Exception as e:\n                if (\n                    \"[CONFIG_NOT_AVAILABLE]\" in str(e)\n                    and not ExecEnv.ENGINE_CONFIG.raise_on_config_not_available\n                ):\n                    cls._LOGGER.warning(\n                        f\"Spark config '{key}' is not available in this \"\n                        f\"environment and will be skipped.\"\n                    )\n                else:\n                    raise e\n\n    @classmethod\n    def get_environment(cls) -> str:\n        \"\"\"Get the environment where the process is running.\n\n        Returns:\n            Name of the environment.\n        \"\"\"\n        if cls.ENGINE_CONFIG.environment:\n            return cls.ENGINE_CONFIG.environment\n\n        catalog = cls.SESSION.sql(\"SELECT current_catalog()\").collect()[0][0]\n        if catalog.lower() == cls.ENGINE_CONFIG.prod_catalog:\n            return \"prod\"\n        else:\n            return \"dev\"\n"
  },
  {
    "path": "lakehouse_engine/core/executable.py",
    "content": "\"\"\"Module representing an executable lakehouse engine component.\"\"\"\n\nfrom abc import ABC, abstractmethod\nfrom typing import Any, Optional\n\n\nclass Executable(ABC):\n    \"\"\"Abstract class defining the behaviour of an executable component.\"\"\"\n\n    @abstractmethod\n    def execute(self) -> Optional[Any]:\n        \"\"\"Define the executable component behaviour.\n\n        E.g., the behaviour of an algorithm inheriting from this.\n        \"\"\"\n        pass\n"
  },
  {
    "path": "lakehouse_engine/core/file_manager.py",
    "content": "\"\"\"Module for abstract representation of a file manager system.\"\"\"\n\nfrom abc import ABC, abstractmethod\nfrom typing import Any\n\nfrom lakehouse_engine.algorithms.exceptions import RestoreTypeNotFoundException\nfrom lakehouse_engine.utils.storage.file_storage_functions import FileStorageFunctions\n\n\nclass FileManager(ABC):  # noqa: B024\n    \"\"\"Abstract file manager class.\n\n    {{ get_file_manager_operations() }}\n    \"\"\"\n\n    def __init__(self, configs: dict):\n        \"\"\"Construct FileManager algorithm instances.\n\n        Args:\n            configs: configurations for the FileManager algorithm.\n        \"\"\"\n        self.configs = configs\n        self.function = self.configs[\"function\"]\n\n    @abstractmethod\n    def delete_objects(self) -> None:\n        \"\"\"Delete objects and 'directories'.\n\n        If dry_run is set to True the function will print a dict with all the\n        paths that would be deleted based on the given keys.\n        \"\"\"\n        pass\n\n    @abstractmethod\n    def copy_objects(self) -> None:\n        \"\"\"Copies objects and 'directories'.\n\n        If dry_run is set to True the function will print a dict with all the\n        paths that would be copied based on the given keys.\n        \"\"\"\n        pass\n\n    @abstractmethod\n    def move_objects(self) -> None:\n        \"\"\"Moves objects and 'directories'.\n\n        If dry_run is set to True the function will print a dict with all the\n        paths that would be moved based on the given keys.\n        \"\"\"\n        pass\n\n\nclass FileManagerFactory(ABC):  # noqa: B024\n    \"\"\"Class for file manager factory.\"\"\"\n\n    @staticmethod\n    def execute_function(configs: dict) -> Any:\n        \"\"\"Get a specific File Manager and function to execute.\"\"\"\n        from lakehouse_engine.core.dbfs_file_manager import DBFSFileManager\n        from lakehouse_engine.core.s3_file_manager import S3FileManager\n\n        disable_dbfs_retry = (\n            configs[\"disable_dbfs_retry\"]\n            if \"disable_dbfs_retry\" in configs.keys()\n            else False\n        )\n\n        if disable_dbfs_retry:\n            S3FileManager(configs).get_function()\n        elif FileStorageFunctions.is_boto3_configured():\n            try:\n                S3FileManager(configs).get_function()\n            except (ValueError, NotImplementedError, RestoreTypeNotFoundException):\n                raise\n            except Exception:\n                DBFSFileManager(configs).get_function()\n        else:\n            DBFSFileManager(configs).get_function()\n"
  },
  {
    "path": "lakehouse_engine/core/gab_manager.py",
    "content": "\"\"\"Module to define GAB Manager classes.\"\"\"\n\nimport calendar\nfrom datetime import datetime, timedelta\nfrom typing import Tuple, cast\n\nimport pendulum\nfrom pendulum import DateTime\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import GABCadence, GABDefaults\nfrom lakehouse_engine.core.gab_sql_generator import GABViewGenerator\nfrom lakehouse_engine.utils.gab_utils import GABUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass GABCadenceManager(object):\n    \"\"\"Class to control the GAB Cadence Window.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def extended_window_calculator(\n        self,\n        cadence: str,\n        reconciliation_cadence: str,\n        current_date: datetime,\n        start_date_str: str,\n        end_date_str: str,\n        query_type: str,\n        rerun_flag: str,\n        snapshot_flag: str,\n    ) -> tuple[datetime, datetime, datetime, datetime]:\n        \"\"\"extended_window_calculator function.\n\n        Calculates the extended window of any cadence despite the user providing\n        custom dates which are not the exact start and end dates of a cadence.\n\n        Args:\n            cadence: cadence to process\n            reconciliation_cadence: reconciliation to process.\n            current_date: current date.\n            start_date_str: start date of the period to process.\n            end_date_str: end date of the period to process.\n            query_type: use case query type.\n            rerun_flag: flag indicating if it's a rerun or a normal run.\n            snapshot_flag: flag indicating if for this cadence the snapshot is enabled.\n        \"\"\"\n        cad_order = GABCadence.get_ordered_cadences()\n\n        derived_cadence = self._get_reconciliation_cadence(\n            cad_order, rerun_flag, cadence, reconciliation_cadence, snapshot_flag\n        )\n\n        self._LOGGER.info(f\"cadence passed to extended window: {derived_cadence}\")\n\n        start_date = datetime.strptime(start_date_str, GABDefaults.DATE_FORMAT.value)\n        end_date = datetime.strptime(end_date_str, GABDefaults.DATE_FORMAT.value)\n\n        bucket_start_date, bucket_end_date = self.get_cadence_start_end_dates(\n            cadence, derived_cadence, start_date, end_date, query_type, current_date\n        )\n\n        self._LOGGER.info(f\"bucket dates: {bucket_start_date} - {bucket_end_date}\")\n\n        filter_start_date, filter_end_date = self.get_cadence_start_end_dates(\n            cadence,\n            (\n                reconciliation_cadence\n                if cad_order[cadence] < cad_order[reconciliation_cadence]\n                else cadence\n            ),\n            start_date,\n            end_date,\n            query_type,\n            current_date,\n        )\n\n        self._LOGGER.info(f\"filter dates: {filter_start_date} - {filter_end_date}\")\n\n        return bucket_start_date, bucket_end_date, filter_start_date, filter_end_date\n\n    @classmethod\n    def _get_reconciliation_cadence(\n        cls,\n        cadence_order: dict,\n        rerun_flag: str,\n        cadence: str,\n        reconciliation_cadence: str,\n        snapshot_flag: str,\n    ) -> str:\n        \"\"\"Get bigger cadence when rerun_flag or snapshot.\n\n        Args:\n            cadence_order: ordered cadences.\n            rerun_flag: flag indicating if it's a rerun or a normal run.\n            cadence: cadence to process.\n            
reconciliation_cadence: reconciliation to process.\n            snapshot_flag: flag indicating if for this cadence the snapshot is enabled.\n        \"\"\"\n        derived_cadence = reconciliation_cadence\n\n        if rerun_flag == \"Y\":\n            if cadence_order[cadence] > cadence_order[reconciliation_cadence]:\n                derived_cadence = cadence\n            elif cadence_order[cadence] < cadence_order[reconciliation_cadence]:\n                derived_cadence = reconciliation_cadence\n        else:\n            if (\n                cadence_order[cadence] > cadence_order[reconciliation_cadence]\n                and snapshot_flag == \"Y\"\n            ) or (cadence_order[cadence] < cadence_order[reconciliation_cadence]):\n                derived_cadence = reconciliation_cadence\n            elif (\n                cadence_order[cadence] > cadence_order[reconciliation_cadence]\n                and snapshot_flag == \"N\"\n            ):\n                derived_cadence = cadence\n\n        return derived_cadence\n\n    def get_cadence_start_end_dates(\n        self,\n        cadence: str,\n        derived_cadence: str,\n        start_date: datetime,\n        end_date: datetime,\n        query_type: str,\n        current_date: datetime,\n    ) -> tuple[datetime, datetime]:\n        \"\"\"Generate the new set of extended start and end dates based on the cadence.\n\n        The week cadence is processed again to extend to the correct week start and end\n            dates when a recon window for the Week cadence is present.\n        For end_date 2022-12-31, in case a Quarter recon window is present for the Week\n            cadence, start and end dates are recalculated to 2022-10-01 and 2022-12-31.\n        But these are not the start and end dates of a week. Hence, to correct this, new\n            dates are passed again to get the correct dates.\n\n        Args:\n            cadence: cadence to process.\n            derived_cadence: cadence reconciliation to process.\n            start_date: start date of the period to process.\n            end_date: end date of the period to process.\n            query_type: use case query type.\n            current_date: current date to be used as the end date, in case the computed\n                end date is greater than the current date.\n        \"\"\"\n        new_start_date = self._get_cadence_calculated_date(\n            derived_cadence=derived_cadence, base_date=start_date, is_start=True\n        )\n        new_end_date = self._get_cadence_calculated_date(\n            derived_cadence=derived_cadence, base_date=end_date, is_start=False\n        )\n\n        if cadence.upper() == \"WEEK\":\n            new_start_date = (\n                pendulum.datetime(\n                    int(new_start_date.strftime(\"%Y\")),\n                    int(new_start_date.strftime(\"%m\")),\n                    int(new_start_date.strftime(\"%d\")),\n                )\n                .start_of(\"week\")\n                .replace(tzinfo=None)\n            )\n            new_end_date = (\n                pendulum.datetime(\n                    int(new_end_date.strftime(\"%Y\")),\n                    int(new_end_date.strftime(\"%m\")),\n                    int(new_end_date.strftime(\"%d\")),\n                )\n                .end_of(\"week\")\n                .replace(hour=0, minute=0, second=0, microsecond=0)\n                .replace(tzinfo=None)\n            )\n\n        new_end_date = new_end_date + timedelta(days=1)\n\n        if new_end_date >= 
current_date:\n            new_end_date = current_date\n\n        if query_type == \"NAM\":\n            new_end_date = new_end_date + timedelta(days=1)\n\n        return new_start_date, new_end_date\n\n    @classmethod\n    def _get_cadence_calculated_date(\n        cls, derived_cadence: str, base_date: datetime, is_start: bool\n    ) -> datetime | DateTime:  # type: ignore\n        cadence_base_date = cls._get_cadence_base_date(derived_cadence, base_date)\n        cadence_date_calculated: DateTime | datetime\n\n        if derived_cadence.upper() == \"WEEK\":\n            cadence_date_calculated = cls._get_calculated_week_date(\n                cast(DateTime, cadence_base_date), is_start\n            )\n        elif derived_cadence.upper() == \"MONTH\":\n            cadence_date_calculated = cls._get_calculated_month_date(\n                cast(datetime, cadence_base_date), is_start\n            )\n        elif derived_cadence.upper() in [\"QUARTER\", \"YEAR\"]:\n            cadence_date_calculated = cls._get_calculated_quarter_or_year_date(\n                cast(DateTime, cadence_base_date), is_start, derived_cadence\n            )\n        else:\n            cadence_date_calculated = cadence_base_date  # type: ignore\n\n        return cadence_date_calculated  # type: ignore\n\n    @classmethod\n    def _get_cadence_base_date(\n        cls, derived_cadence: str, base_date: datetime\n    ) -> datetime | DateTime | str:  # type: ignore\n        \"\"\"Get start date for the selected cadence.\n\n        Args:\n            derived_cadence: cadence reconciliation to process.\n            base_date: base date used to compute the start date of the cadence.\n        \"\"\"\n        if derived_cadence.upper() in [\"DAY\", \"MONTH\"]:\n            cadence_date_calculated = base_date\n        elif derived_cadence.upper() in [\"WEEK\", \"QUARTER\", \"YEAR\"]:\n            cadence_date_calculated = pendulum.datetime(\n                int(base_date.strftime(\"%Y\")),\n                int(base_date.strftime(\"%m\")),\n                int(base_date.strftime(\"%d\")),\n            )\n        else:\n            cadence_date_calculated = \"0\"  # type: ignore\n\n        return cadence_date_calculated\n\n    @classmethod\n    def _get_calculated_week_date(\n        cls, cadence_date_calculated: DateTime, is_start: bool\n    ) -> DateTime:\n        \"\"\"Get WEEK start/end date.\n\n        Args:\n            cadence_date_calculated: base date to compute the week date.\n            is_start: flag indicating if we should get the start or end for the cadence.\n        \"\"\"\n        if is_start:\n            cadence_date_calculated = cadence_date_calculated.start_of(\"week\").replace(\n                tzinfo=None\n            )\n        else:\n            cadence_date_calculated = (\n                cadence_date_calculated.end_of(\"week\")\n                .replace(hour=0, minute=0, second=0, microsecond=0)\n                .replace(tzinfo=None)\n            )\n\n        return cadence_date_calculated\n\n    @classmethod\n    def _get_calculated_month_date(\n        cls, cadence_date_calculated: datetime, is_start: bool\n    ) -> datetime:\n        \"\"\"Get MONTH start/end date.\n\n        Args:\n            cadence_date_calculated: base date to compute the month date.\n            is_start: flag indicating if we should get the start or end for the cadence.\n        \"\"\"\n        if is_start:\n            cadence_date_calculated = cadence_date_calculated - timedelta(\n                
days=(int(cadence_date_calculated.strftime(\"%d\")) - 1)\n            )\n        else:\n            cadence_date_calculated = datetime(\n                int(cadence_date_calculated.strftime(\"%Y\")),\n                int(cadence_date_calculated.strftime(\"%m\")),\n                calendar.monthrange(\n                    int(cadence_date_calculated.strftime(\"%Y\")),\n                    int(cadence_date_calculated.strftime(\"%m\")),\n                )[1],\n            )\n\n        return cadence_date_calculated\n\n    @classmethod\n    def _get_calculated_quarter_or_year_date(\n        cls, cadence_date_calculated: DateTime, is_start: bool, cadence: str\n    ) -> DateTime:\n        \"\"\"Get QUARTER/YEAR start/end date.\n\n        Args:\n            cadence_date_calculated: base date to compute the quarter/year date.\n            is_start: flag indicating if we should get the start or end for the cadence.\n            cadence: selected cadence (possible values: QUARTER or YEAR).\n        \"\"\"\n        if is_start:\n            cadence_date_calculated = cadence_date_calculated.first_of(\n                cadence.lower()\n            ).replace(tzinfo=None)\n        else:\n            cadence_date_calculated = cadence_date_calculated.last_of(\n                cadence.lower()\n            ).replace(tzinfo=None)\n\n        return cadence_date_calculated\n\n\nclass GABViewManager(object):\n    \"\"\"Class to control the GAB View creation.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def __init__(\n        self,\n        query_id: str,\n        lookup_query_builder: DataFrame,\n        target_database: str,\n        target_table: str,\n    ):\n        \"\"\"Construct GABViewManager instances.\n\n        Args:\n            query_id: gab configuration table use case identifier.\n            lookup_query_builder: gab configuration data.\n            target_database: target database to write.\n            target_table: target table to write.\n        \"\"\"\n        self.query_id = query_id\n        self.lookup_query_builder = lookup_query_builder\n        self.target_database = target_database\n        self.target_table = target_table\n\n    def generate_use_case_views(self) -> None:\n        \"\"\"Generate all the use case views.\n\n        Generates the DDLs for each of the views. 
This DDL is dynamically built based on\n        the mappings provided in the config table.\n        \"\"\"\n        reconciliation_window = GABUtils.get_json_column_as_dict(\n            self.lookup_query_builder, self.query_id, \"recon_window\"\n        )\n\n        cadence_snapshot_status = self._get_cadence_snapshot_status(\n            reconciliation_window\n        )\n\n        (\n            cadences_with_snapshot,\n            cadences_without_snapshot,\n        ) = self._split_cadence_by_snapshot(cadence_snapshot_status)\n\n        mappings = GABUtils.get_json_column_as_dict(\n            self.lookup_query_builder, self.query_id, \"mappings\"\n        )\n\n        for view_name in mappings.keys():\n            self._generate_use_case_view(\n                mappings,\n                view_name,\n                cadence_snapshot_status,\n                cadences_with_snapshot,\n                cadences_without_snapshot,\n                self.target_database,\n                self.target_table,\n                self.query_id,\n            )\n\n    @classmethod\n    def _generate_use_case_view(\n        cls,\n        mappings: dict,\n        view_name: str,\n        cadence_snapshot_status: dict,\n        cadences_with_snapshot: list[str],\n        cadences_without_snapshot: list[str],\n        target_database: str,\n        target_table: str,\n        query_id: str,\n    ) -> None:\n        \"\"\"Generate the selected use case views.\n\n        Args:\n            mappings: use case mappings configuration.\n            view_name: name of the view to be generated.\n            cadence_snapshot_status: cadences to execute with the information if it has\n                snapshot.\n            cadences_with_snapshot: cadences to execute with snapshot.\n            cadences_without_snapshot: cadences to execute without snapshot.\n            target_database: target database to write.\n            target_table: target table to write.\n            query_id: gab configuration table use case identifier.\n        \"\"\"\n        view_configuration = mappings[view_name]\n\n        view_dimensions = view_configuration[\"dimensions\"]\n        view_metrics = view_configuration[\"metric\"]\n        custom_filter = view_configuration[\"filter\"]\n\n        view_filter = \" \"\n        if custom_filter:\n            view_filter = \" AND \" + custom_filter\n\n        (\n            dimensions,\n            dimensions_and_metrics,\n            dimensions_and_metrics_with_alias,\n        ) = cls._get_dimensions_and_metrics_from_use_case_view(\n            view_dimensions, view_metrics\n        )\n\n        (\n            final_cols,\n            final_calculated_script,\n            final_calculated_script_snapshot,\n        ) = cls._get_calculated_and_derived_metrics_from_use_case_view(\n            view_metrics, view_dimensions, cadence_snapshot_status\n        )\n\n        GABViewGenerator(\n            cadence_snapshot_status=cadence_snapshot_status,\n            target_database=target_database,\n            view_name=view_name,\n            final_cols=final_cols,\n            target_table=target_table,\n            dimensions_and_metrics_with_alias=dimensions_and_metrics_with_alias,\n            dimensions=dimensions,\n            dimensions_and_metrics=dimensions_and_metrics,\n            final_calculated_script=final_calculated_script,\n            query_id=query_id,\n            view_filter=view_filter,\n            final_calculated_script_snapshot=final_calculated_script_snapshot,\n            
without_snapshot_cadences=cadences_without_snapshot,\n            with_snapshot_cadences=cadences_with_snapshot,\n        ).generate_sql()\n\n    @classmethod\n    def _get_dimensions_and_metrics_from_use_case_view(\n        cls, view_dimensions: dict, view_metrics: dict\n    ) -> Tuple[str, str, str]:\n        \"\"\"Get dimensions and metrics from use case.\n\n        Args:\n            view_dimensions: use case configured dimensions.\n            view_metrics: use case configured metrics.\n        \"\"\"\n        (\n            extracted_dimensions_with_alias,\n            extracted_dimensions_without_alias,\n        ) = GABUtils.extract_columns_from_mapping(\n            columns=view_dimensions,\n            is_dimension=True,\n            extract_column_without_alias=True,\n            table_alias=\"a\",\n            is_extracted_value_as_name=False,\n        )\n\n        dimensions_without_default_columns = [\n            extracted_dimension\n            for extracted_dimension in extracted_dimensions_without_alias\n            if extracted_dimension not in GABDefaults.DIMENSIONS_DEFAULT_COLUMNS.value\n        ]\n\n        dimensions = \",\".join(dimensions_without_default_columns)\n        dimensions_with_alias = \",\".join(extracted_dimensions_with_alias)\n\n        (\n            extracted_metrics_with_alias,\n            extracted_metrics_without_alias,\n        ) = GABUtils.extract_columns_from_mapping(\n            columns=view_metrics,\n            is_dimension=False,\n            extract_column_without_alias=True,\n            table_alias=\"a\",\n            is_extracted_value_as_name=False,\n        )\n        metrics = \",\".join(extracted_metrics_without_alias)\n        metrics_with_alias = \",\".join(extracted_metrics_with_alias)\n\n        dimensions_and_metrics_with_alias = (\n            dimensions_with_alias + \",\" + metrics_with_alias\n        )\n        dimensions_and_metrics = dimensions + \",\" + metrics\n\n        return dimensions, dimensions_and_metrics, dimensions_and_metrics_with_alias\n\n    @classmethod\n    def _get_calculated_and_derived_metrics_from_use_case_view(\n        cls, view_metrics: dict, view_dimensions: dict, cadence_snapshot_status: dict\n    ) -> Tuple[str, str, str]:\n        \"\"\"Get calculated and derived metrics from use case.\n\n        Args:\n            view_dimensions: use case configured dimensions.\n            view_metrics: use case configured metrics.\n            cadence_snapshot_status: cadences to execute with the information if it has\n                snapshot.\n        \"\"\"\n        calculated_script = []\n        calculated_script_snapshot = []\n        derived_script = []\n        for metric_key, metric_value in view_metrics.items():\n            (\n                calculated_metrics_script,\n                calculated_metrics_script_snapshot,\n                derived_metrics_script,\n            ) = cls._get_calculated_metrics(\n                metric_key, metric_value, view_dimensions, cadence_snapshot_status\n            )\n            calculated_script += [*calculated_metrics_script]\n            calculated_script_snapshot += [*calculated_metrics_script_snapshot]\n            derived_script += [*derived_metrics_script]\n\n        joined_calculated_script = cls._join_list_to_string_when_present(\n            calculated_script\n        )\n        joined_calculated_script_snapshot = cls._join_list_to_string_when_present(\n            calculated_script_snapshot\n        )\n\n        joined_derived = 
cls._join_list_to_string_when_present(\n            to_join=derived_script, starting_value=\"*,\", default_value=\"*\"\n        )\n\n        return (\n            joined_derived,\n            joined_calculated_script,\n            joined_calculated_script_snapshot,\n        )\n\n    @classmethod\n    def _join_list_to_string_when_present(\n        cls,\n        to_join: list[str],\n        separator: str = \",\",\n        starting_value: str = \",\",\n        default_value: str = \"\",\n    ) -> str:\n        \"\"\"Join list to string when has values, otherwise return the default value.\n\n        Args:\n            to_join: values to join.\n            separator: separator to be used in the join.\n            starting_value: value to be started before the join.\n            default_value: value to be returned if the list is empty.\n        \"\"\"\n        return starting_value + separator.join(to_join) if to_join else default_value\n\n    @classmethod\n    def _get_cadence_snapshot_status(cls, result: dict) -> dict:\n        cadence_snapshot_status = {}\n        for k, v in result.items():\n            cadence_snapshot_status[k] = next(\n                (\n                    next(\n                        (\n                            snap_list[\"snapshot\"]\n                            for snap_list in loop_outer_cad.values()\n                            if snap_list[\"snapshot\"] == \"Y\"\n                        ),\n                        \"N\",\n                    )\n                    for loop_outer_cad in v.values()\n                    if v\n                ),\n                \"N\",\n            )\n\n        return cadence_snapshot_status\n\n    @classmethod\n    def _split_cadence_by_snapshot(\n        cls, cadence_snapshot_status: dict\n    ) -> tuple[list[str], list[str]]:\n        \"\"\"Split cadences by the snapshot value.\n\n        Args:\n            cadence_snapshot_status: cadences to be split by snapshot status.\n        \"\"\"\n        with_snapshot_cadences = []\n        without_snapshot_cadences = []\n\n        for key_snap_status, value_snap_status in cadence_snapshot_status.items():\n            if value_snap_status == \"Y\":\n                with_snapshot_cadences.append(key_snap_status)\n            else:\n                without_snapshot_cadences.append(key_snap_status)\n\n        return with_snapshot_cadences, without_snapshot_cadences\n\n    @classmethod\n    def _get_calculated_metrics(\n        cls,\n        metric_key: str,\n        metric_value: dict,\n        view_dimensions: dict,\n        cadence_snapshot_status: dict,\n    ) -> tuple[list[str], list[str], list[str]]:\n        \"\"\"Get calculated metrics from use case.\n\n        Args:\n            metric_key: use case metric name.\n            metric_value: use case metric value.\n            view_dimensions: use case configured dimensions.\n            cadence_snapshot_status: cadences to execute with the information if it has\n                snapshot.\n        \"\"\"\n        dim_partition = \",\".join([str(i) for i in view_dimensions.keys()][2:])\n        dim_partition = \"cadence,\" + dim_partition\n        calculated_metrics = metric_value[\"calculated_metric\"]\n        derived_metrics = metric_value[\"derived_metric\"]\n        calculated_metrics_script: list[str] = []\n        calculated_metrics_script_snapshot: list[str] = []\n        derived_metrics_script: list[str] = []\n\n        if calculated_metrics:\n            (\n                calculated_metrics_script,\n                
calculated_metrics_script_snapshot,\n            ) = cls._get_calculated_metric(\n                metric_key, calculated_metrics, dim_partition, cadence_snapshot_status\n            )\n\n        if derived_metrics:\n            derived_metrics_script = cls._get_derived_metrics(derived_metrics)\n\n        return (\n            calculated_metrics_script,\n            calculated_metrics_script_snapshot,\n            derived_metrics_script,\n        )\n\n    @classmethod\n    def _get_derived_metrics(cls, derived_metric: dict) -> list[str]:\n        \"\"\"Get derived metrics from use case.\n\n        Args:\n            derived_metric: use case derived metrics.\n        \"\"\"\n        derived_metric_script = []\n\n        for i in range(0, len(derived_metric)):\n            derived_formula = str(derived_metric[i][\"formula\"])\n            derived_label = derived_metric[i][\"label\"]\n            derived_metric_script.append(derived_formula + \" AS \" + derived_label)\n\n        return derived_metric_script\n\n    @classmethod\n    def _get_calculated_metric(\n        cls,\n        metric_key: str,\n        calculated_metric: dict,\n        dimension_partition: str,\n        cadence_snapshot_status: dict,\n    ) -> tuple[list[str], list[str]]:\n        \"\"\"Get calculated metrics from use case.\n\n        Args:\n            metric_key: use case metric name.\n            calculated_metric: use case calculated metrics.\n            dimension_partition: dimension partition.\n            cadence_snapshot_status: cadences to execute with the information if it has\n                snapshot.\n        \"\"\"\n        last_cadence_script: list[str] = []\n        last_year_cadence_script: list[str] = []\n        window_script: list[str] = []\n        last_cadence_script_snapshot: list[str] = []\n        last_year_cadence_script_snapshot: list[str] = []\n        window_script_snapshot: list[str] = []\n\n        if \"last_cadence\" in calculated_metric:\n            (\n                last_cadence_script,\n                last_cadence_script_snapshot,\n            ) = cls._get_cadence_calculated_metric(\n                metric_key,\n                dimension_partition,\n                calculated_metric,\n                cadence_snapshot_status,\n                \"last_cadence\",\n            )\n        if \"last_year_cadence\" in calculated_metric:\n            (\n                last_year_cadence_script,\n                last_year_cadence_script_snapshot,\n            ) = cls._get_cadence_calculated_metric(\n                metric_key,\n                dimension_partition,\n                calculated_metric,\n                cadence_snapshot_status,\n                \"last_year_cadence\",\n            )\n        if \"window_function\" in calculated_metric:\n            window_script, window_script_snapshot = cls._get_window_calculated_metric(\n                metric_key,\n                dimension_partition,\n                calculated_metric,\n                cadence_snapshot_status,\n            )\n\n        calculated_script = [\n            *last_cadence_script,\n            *last_year_cadence_script,\n            *window_script,\n        ]\n        calculated_script_snapshot = [\n            *last_cadence_script_snapshot,\n            *last_year_cadence_script_snapshot,\n            *window_script_snapshot,\n        ]\n\n        return calculated_script, calculated_script_snapshot\n\n    @classmethod\n    def _get_window_calculated_metric(\n        cls,\n        metric_key: str,\n        
dimension_partition: str,\n        calculated_metric: dict,\n        cadence_snapshot_status: dict,\n    ) -> tuple[list, list]:\n        \"\"\"Get window calculated metrics from use case.\n\n        Args:\n            metric_key: use case metric name.\n            dimension_partition: dimension partition.\n            calculated_metric: use case calculated metrics.\n            cadence_snapshot_status: cadences to execute with the information if it has\n                snapshot.\n        \"\"\"\n        calculated_script = []\n        calculated_script_snapshot = []\n\n        for i in range(0, len(calculated_metric[\"window_function\"])):\n            window_function = calculated_metric[\"window_function\"][i][\"agg_func\"]\n            window_function_start = calculated_metric[\"window_function\"][i][\"window\"][0]\n            window_function_end = calculated_metric[\"window_function\"][i][\"window\"][1]\n            window_label = calculated_metric[\"window_function\"][i][\"label\"]\n\n            calculated_script.append(\n                f\"\"\"\n                NVL(\n                    {window_function}({metric_key}) OVER\n                    (\n                        PARTITION BY {dimension_partition}\n                        order by from_date ROWS BETWEEN\n                            {str(window_function_start)} PRECEDING\n                            AND {str(window_function_end)} PRECEDING\n                    ),\n                    0\n                ) AS\n                {window_label}\n                \"\"\"\n            )\n\n            if \"Y\" in cadence_snapshot_status.values():\n                calculated_script_snapshot.append(\n                    f\"\"\"\n                    NVL(\n                        {window_function}({metric_key}) OVER\n                        (\n                            PARTITION BY {dimension_partition} ,rn\n                            order by from_date ROWS BETWEEN\n                                {str(window_function_start)} PRECEDING\n                                AND {str(window_function_end)} PRECEDING\n                        ),\n                        0\n                    ) AS\n                    {window_label}\n                    \"\"\"\n                )\n\n        return calculated_script, calculated_script_snapshot\n\n    @classmethod\n    def _get_cadence_calculated_metric(\n        cls,\n        metric_key: str,\n        dimension_partition: str,\n        calculated_metric: dict,\n        cadence_snapshot_status: dict,\n        cadence: str,\n    ) -> tuple[list, list]:\n        \"\"\"Get cadence calculated metrics from use case.\n\n        Args:\n            metric_key: use case metric name.\n            calculated_metric: use case calculated metrics.\n            dimension_partition: dimension partition.\n            cadence_snapshot_status: cadences to execute with the information if it has\n                snapshot.\n            cadence: cadence to process.\n        \"\"\"\n        calculated_script = []\n        calculated_script_snapshot = []\n\n        for i in range(0, len(calculated_metric[cadence])):\n            cadence_lag = cls._get_cadence_item_lag(calculated_metric, cadence, i)\n            cadence_label = calculated_metric[cadence][i][\"label\"]\n\n            calculated_script.append(\n                cls._get_cadence_lag_statement(\n                    metric_key,\n                    cadence_lag,\n                    dimension_partition,\n                    cadence_label,\n                    
snapshot=False,\n                    cadence=cadence,\n                )\n            )\n\n            if \"Y\" in cadence_snapshot_status.values():\n                calculated_script_snapshot.append(\n                    cls._get_cadence_lag_statement(\n                        metric_key,\n                        cadence_lag,\n                        dimension_partition,\n                        cadence_label,\n                        snapshot=True,\n                        cadence=cadence,\n                    )\n                )\n\n        return calculated_script, calculated_script_snapshot\n\n    @classmethod\n    def _get_cadence_item_lag(\n        cls, calculated_metric: dict, cadence: str, item: int\n    ) -> str:\n        \"\"\"Get calculated metric item lag.\n\n        Args:\n            calculated_metric: use case calculated metrics.\n            cadence: cadence to process.\n            item: metric item.\n        \"\"\"\n        return str(calculated_metric[cadence][item][\"window\"])\n\n    @classmethod\n    def _get_cadence_lag_statement(\n        cls,\n        metric_key: str,\n        cadence_lag: str,\n        dimension_partition: str,\n        cadence_label: str,\n        snapshot: bool,\n        cadence: str,\n    ) -> str:\n        \"\"\"Get cadence lag statement.\n\n        Args:\n            metric_key: use case metric name.\n            cadence_lag: cadence window lag.\n            dimension_partition: dimension partition.\n            cadence_label: cadence name.\n            snapshot: indicate if the snapshot is enabled.\n            cadence: cadence to process.\n        \"\"\"\n        cadence_lag_statement = \"\"\n        if cadence == \"last_cadence\":\n            cadence_lag_statement = (\n                \"NVL(LAG(\"\n                + metric_key\n                + \",\"\n                + cadence_lag\n                + \") OVER(PARTITION BY \"\n                + dimension_partition\n                + (\",rn\" if snapshot else \"\")\n                + \" order by from_date),0) AS \"\n                + cadence_label\n            )\n        elif cadence == \"last_year_cadence\":\n            cadence_lag_statement = (\n                \"NVL(LAG(\"\n                + metric_key\n                + \",\"\n                + cadence_lag\n                + \") OVER(PARTITION BY \"\n                + dimension_partition\n                + (\",rn\" if snapshot else \"\")\n                + \"\"\",\n                    case\n                        when cadence in ('DAY','MONTH','QUARTER')\n                            then struct(month(from_date), day(from_date))\n                        when cadence in('WEEK')\n                            then struct(weekofyear(from_date+1),1)\n                    end order by from_date),0) AS \"\"\"\n                + cadence_label\n            )\n        else:\n            cls._LOGGER.error(f\"Cadence {cadence} not implemented yet\")\n\n        return cadence_lag_statement\n"
  },
  {
    "path": "lakehouse_engine/core/gab_sql_generator.py",
    "content": "\"\"\"Module to define GAB SQL classes.\"\"\"\n\nimport ast\nimport json\nfrom abc import ABC, abstractmethod\nfrom typing import Any, Callable, Optional\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col, lit, struct, to_json\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.gab_utils import GABUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\ndef _execute_sql(func) -> Callable:  # type: ignore\n    \"\"\"Execute the SQL resulting from the function.\n\n    This function is protected to be used just in this module.\n    It's used to decorate functions that returns a SQL statement.\n\n    Args:\n        func: function that will return the sql to execute\n    \"\"\"\n\n    def inner(*args: Any) -> None:\n        generated_sql = func(*args)\n        if generated_sql:\n            ExecEnv.SESSION.sql(generated_sql)\n\n    return inner\n\n\nclass GABSQLGenerator(ABC):\n    \"\"\"Abstract class defining the behaviour of a GAB SQL Generator.\"\"\"\n\n    @abstractmethod\n    def generate_sql(self) -> Optional[str]:\n        \"\"\"Define the generate sql command.\n\n        E.g., the behaviour of gab generate sql inheriting from this.\n        \"\"\"\n        pass\n\n\nclass GABInsertGenerator(GABSQLGenerator):\n    \"\"\"GAB insert generator.\n\n    Creates the insert statement based on the dimensions and metrics provided in\n    the configuration table.\n    \"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def __init__(\n        self,\n        query_id: str,\n        cadence: str,\n        final_stage_table: str,\n        lookup_query_builder: DataFrame,\n        target_database: str,\n        target_table: str,\n    ):\n        \"\"\"Construct GABInsertGenerator instances.\n\n        Args:\n            query_id: gab configuration table use case identifier.\n            cadence:  inputted cadence to process.\n            final_stage_table: stage view name.\n            lookup_query_builder: gab configuration data.\n            target_database: target database to write.\n            target_table: target table to write.\n        \"\"\"\n        self.query_id = query_id\n        self.cadence = cadence\n        self.final_stage_table = final_stage_table\n        self.lookup_query_builder = lookup_query_builder\n        self.target_database = target_database\n        self.target_table = target_table\n\n    def generate_sql(self) -> Optional[str]:\n        \"\"\"Generate insert sql statement to the insights table.\"\"\"\n        insert_sql_statement = self._insert_statement_generator()\n\n        return insert_sql_statement\n\n    def _insert_statement_generator(self) -> str:\n        \"\"\"Generate GAB insert statement.\n\n        Creates the insert statement based on the dimensions and metrics provided in\n        the configuration table.\n        \"\"\"\n        result = GABUtils.get_json_column_as_dict(\n            self.lookup_query_builder, self.query_id, \"mappings\"\n        )\n\n        for result_key in result.keys():\n            joined_dimensions, joined_metrics = self._get_mapping_columns(\n                mapping=result[result_key]\n            )\n            gen_ins = f\"\"\"\n                INSERT INTO {self.target_database}.{self.target_table}\n                SELECT\n                    {self.query_id} as query_id,\n                    '{self.cadence}' as cadence,\n                    {joined_dimensions},\n                    {joined_metrics},\n                   
 current_timestamp() as lh_created_on\n                FROM {self.final_stage_table}\n                \"\"\"  # nosec: B608\n\n        return gen_ins\n\n    @classmethod\n    def _get_mapping_columns(cls, mapping: dict) -> tuple[str, str]:\n        \"\"\"Get mapping columns(dimensions and metrics) as joined string.\n\n        Args:\n            mapping: use case mappings configuration.\n        \"\"\"\n        dimensions_mapping = mapping[\"dimensions\"]\n        metrics_mapping = mapping[\"metric\"]\n\n        joined_dimensions = cls._join_extracted_column_with_filled_columns(\n            columns=dimensions_mapping, is_dimension=True\n        )\n        joined_metrics = cls._join_extracted_column_with_filled_columns(\n            columns=metrics_mapping, is_dimension=False\n        )\n\n        return joined_dimensions, joined_metrics\n\n    @classmethod\n    def _join_extracted_column_with_filled_columns(\n        cls, columns: dict, is_dimension: bool\n    ) -> str:\n        \"\"\"Join extracted columns with empty filled columns.\n\n        Args:\n            columns: use case columns and values.\n            is_dimension: flag identifying if is a dimension or a metric.\n        \"\"\"\n        extracted_columns_with_alias = (\n            GABUtils.extract_columns_from_mapping(  # type: ignore\n                columns=columns, is_dimension=is_dimension\n            )\n        )\n\n        filled_columns = cls._fill_empty_columns(\n            extracted_columns=extracted_columns_with_alias,  # type: ignore\n            is_dimension=is_dimension,\n        )\n\n        joined_columns = [*extracted_columns_with_alias, *filled_columns]\n\n        return \",\".join(joined_columns)\n\n    @classmethod\n    def _fill_empty_columns(\n        cls, extracted_columns: list[str], is_dimension: bool\n    ) -> list[str]:\n        \"\"\"Fill empty columns as null.\n\n        As the data is expected to have 40 columns we have to fill the unused columns.\n\n        Args:\n            extracted_columns: use case extracted columns.\n            is_dimension: flag identifying if is a dimension or a metric.\n        \"\"\"\n        filled_columns = []\n\n        for ins in range(\n            (\n                len(extracted_columns) - 1\n                if is_dimension\n                else len(extracted_columns) + 1\n            ),\n            41,\n        ):\n            filled_columns.append(\n                \" null as {}{}\".format(\"d\" if is_dimension else \"m\", ins)\n            )\n\n        return filled_columns\n\n\nclass GABViewGenerator(GABSQLGenerator):\n    \"\"\"GAB view generator.\n\n    Creates the use case view statement to be consumed.\n    \"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def __init__(\n        self,\n        cadence_snapshot_status: dict,\n        target_database: str,\n        view_name: str,\n        final_cols: str,\n        target_table: str,\n        dimensions_and_metrics_with_alias: str,\n        dimensions: str,\n        dimensions_and_metrics: str,\n        final_calculated_script: str,\n        query_id: str,\n        view_filter: str,\n        final_calculated_script_snapshot: str,\n        without_snapshot_cadences: list[str],\n        with_snapshot_cadences: list[str],\n    ):\n        \"\"\"Construct GABViewGenerator instances.\n\n        Args:\n            cadence_snapshot_status: each cadence with the corresponding snapshot\n                status.\n            target_database: target database to write.\n            view_name: name 
of the view to be generated.\n            final_cols: columns to return in the view.\n            target_table: target table to write.\n            dimensions_and_metrics_with_alias: configured dimensions and metrics with\n                alias to compute in the view.\n            dimensions: use case configured dimensions.\n            dimensions_and_metrics: use case configured dimensions and metrics.\n            final_calculated_script: use case calculated metrics.\n            query_id: gab configuration table use case identifier.\n            view_filter: filter to add in the view.\n            final_calculated_script_snapshot: use case calculated metrics with snapshot.\n            without_snapshot_cadences: cadences without snapshot.\n            with_snapshot_cadences: cadences with snapshot.\n        \"\"\"\n        self.cadence_snapshot_status = cadence_snapshot_status\n        self.target_database = target_database\n        self.result_key = view_name\n        self.final_cols = final_cols\n        self.target_table = target_table\n        self.dimensions_and_metrics_with_alias = dimensions_and_metrics_with_alias\n        self.dimensions = dimensions\n        self.dimensions_and_metrics = dimensions_and_metrics\n        self.final_calculated_script = final_calculated_script\n        self.query_id = query_id\n        self.view_filter = view_filter\n        self.final_calculated_script_snapshot = final_calculated_script_snapshot\n        self.without_snapshot_cadences = without_snapshot_cadences\n        self.with_snapshot_cadences = with_snapshot_cadences\n\n    @_execute_sql\n    def generate_sql(self) -> Optional[str]:\n        \"\"\"Generate use case view sql statement.\"\"\"\n        consumption_view_sql = self._create_consumption_view()\n\n        return consumption_view_sql\n\n    def _create_consumption_view(self) -> str:\n        \"\"\"Create consumption view.\"\"\"\n        final_view_query = self._generate_consumption_view_statement(\n            self.cadence_snapshot_status,\n            self.target_database,\n            self.final_cols,\n            self.target_table,\n            self.dimensions_and_metrics_with_alias,\n            self.dimensions,\n            self.dimensions_and_metrics,\n            self.final_calculated_script,\n            self.query_id,\n            self.view_filter,\n            self.final_calculated_script_snapshot,\n            without_snapshot_cadences=\",\".join(\n                f'\"{w}\"' for w in self.without_snapshot_cadences\n            ),\n            with_snapshot_cadences=\",\".join(\n                f'\"{w}\"' for w in self.with_snapshot_cadences\n            ),\n        )\n\n        rendered_query = \"\"\"\n            CREATE OR REPLACE VIEW {database}.{view_name} AS {final_view_query}\n            \"\"\".format(\n            database=self.target_database,\n            view_name=self.result_key,\n            final_view_query=final_view_query,\n        )\n        self._LOGGER.info(f\"Consumption view statement: {rendered_query}\")\n        return rendered_query\n\n    @classmethod\n    def _generate_consumption_view_statement(\n        cls,\n        cadence_snapshot_status: dict,\n        target_database: str,\n        final_cols: str,\n        target_table: str,\n        dimensions_and_metrics_with_alias: str,\n        dimensions: str,\n        dimensions_and_metrics: str,\n        final_calculated_script: str,\n        query_id: str,\n        view_filter: str,\n        final_calculated_script_snapshot: str,\n        
without_snapshot_cadences: str,\n        with_snapshot_cadences: str,\n    ) -> str:\n        \"\"\"Generate consumption view.\n\n        Args:\n            cadence_snapshot_status: cadences to execute with the information if it has\n                snapshot.\n            target_database: target database to write.\n            final_cols: use case columns exposed in the consumption view.\n            target_table: target table to write.\n            dimensions_and_metrics_with_alias: dimensions and metrics as string columns\n                with alias.\n            dimensions: dimensions as string columns.\n            dimensions_and_metrics: dimensions and metrics as string columns\n                without alias.\n            final_calculated_script: final calculated metrics script.\n            query_id: gab configuration table use case identifier.\n            view_filter: filter to execute on the view.\n            final_calculated_script_snapshot: final calculated metrics with snapshot\n                script.\n            without_snapshot_cadences: cadences without snapshot.\n            with_snapshot_cadences: cadences with snapshot.\n        \"\"\"\n        cls._LOGGER.info(\"Generating consumption view statement...\")\n        cls._LOGGER.info(\n            f\"\"\"\n            {{\n                target_database: {target_database},\n                target_table: {target_table},\n                query_id: {query_id},\n                cadence_and_snapshot_status: {cadence_snapshot_status},\n                cadences_without_snapshot: [{without_snapshot_cadences}],\n                cadences_with_snapshot: [{with_snapshot_cadences}],\n                final_cols: {final_cols},\n                dimensions_and_metrics_with_alias: {dimensions_and_metrics_with_alias},\n                dimensions: {dimensions},\n                dimensions_with_metrics: {dimensions_and_metrics},\n                final_calculated_script: {final_calculated_script},\n                final_calculated_script_snapshot: {final_calculated_script_snapshot},\n                view_filter: {view_filter}\n            }}\"\"\"\n        )\n        if (\n            \"Y\" in cadence_snapshot_status.values()\n            and \"N\" in cadence_snapshot_status.values()\n        ):\n            consumption_view_query = f\"\"\"\n                WITH TEMP1 AS (\n                    SELECT\n                        a.cadence,\n                        {dimensions_and_metrics_with_alias}{final_calculated_script}\n                    FROM {target_database}.{target_table} a\n                    WHERE a.query_id = {query_id}\n                    AND cadence IN ({without_snapshot_cadences})\n                    {view_filter}\n                ),\n                TEMP_RN AS (\n                    SELECT\n                        a.cadence,\n                        a.from_date,\n                        a.to_date,\n                        {dimensions_and_metrics},\n                        row_number() over(\n                            PARTITION BY\n                                a.cadence,\n                                {dimensions},\n                                a.from_date\n                            order by to_date\n                        ) as rn\n                    FROM {target_database}.{target_table} a\n                    WHERE a.query_id = {query_id}\n                    AND cadence IN ({with_snapshot_cadences})\n                    {view_filter}\n                ),\n                TEMP2 AS (\n                    SELECT\n   
                     a.cadence,\n                        {dimensions_and_metrics_with_alias}{final_calculated_script_snapshot}\n                    FROM TEMP_RN a\n                ),\n                TEMP3 AS (SELECT * FROM TEMP1 UNION SELECT * from TEMP2)\n                SELECT {final_cols} FROM TEMP3\n            \"\"\"  # nosec: B608\n        elif \"N\" in cadence_snapshot_status.values():\n            consumption_view_query = f\"\"\"\n                WITH TEMP1 AS (\n                    SELECT\n                        a.cadence,\n                        {dimensions_and_metrics_with_alias}{final_calculated_script}\n                    FROM {target_database}.{target_table} a\n                    WHERE a.query_id = {query_id}\n                    AND cadence IN ({without_snapshot_cadences})  {view_filter}\n                )\n                SELECT {final_cols} FROM TEMP1\n            \"\"\"  # nosec: B608\n        else:\n            consumption_view_query = f\"\"\"\n                WITH TEMP_RN AS (\n                    SELECT\n                        a.cadence,\n                        a.from_date,\n                        a.to_date,\n                        {dimensions_and_metrics},\n                        row_number() over(\n                            PARTITION BY\n                                a.cadence,\n                                a.from_date,\n                                a.to_date,\n                                {dimensions},\n                                a.from_date\n                        order by to_date) as rn\n                    FROM {target_database}.{target_table} a\n                    WHERE a.query_id = {query_id}\n                    AND cadence IN ({with_snapshot_cadences})\n                    {view_filter}\n                ),\n                TEMP2 AS (\n                    SELECT\n                        a.cadence,\n                        {dimensions_and_metrics_with_alias}{final_calculated_script_snapshot}\n                    FROM TEMP_RN a\n                )\n                SELECT {final_cols} FROM TEMP2\n            \"\"\"  # nosec: B608\n\n        return consumption_view_query\n\n\nclass GABDeleteGenerator(GABSQLGenerator):\n    \"\"\"GAB delete generator.\n\n    Creates the delete statement to clean the use case base data on the insights table.\n    \"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def __init__(\n        self,\n        query_id: str,\n        cadence: str,\n        temp_stage_view_name: str,\n        lookup_query_builder: DataFrame,\n        target_database: str,\n        target_table: str,\n    ):\n        \"\"\"Construct GABViewGenerator instances.\n\n        Args:\n            query_id: gab configuration table use case identifier.\n            cadence:  inputted cadence to process.\n            temp_stage_view_name: stage view name.\n            lookup_query_builder: gab configuration data.\n            target_database: target database to write.\n            target_table: target table to write.\n        \"\"\"\n        self.query_id = query_id\n        self.cadence = cadence\n        self.temp_stage_view_name = temp_stage_view_name\n        self.lookup_query_builder = lookup_query_builder\n        self.target_database = target_database\n        self.target_table = target_table\n\n    @_execute_sql\n    def generate_sql(self) -> Optional[str]:\n        \"\"\"Generate delete sql statement.\n\n        This statement is to clean the insights table for the corresponding use case.\n        \"\"\"\n        
delete_sql_statement = self._delete_statement_generator()\n\n        return delete_sql_statement\n\n    def _delete_statement_generator(self) -> str:\n        df_filtered = self.lookup_query_builder.filter(\n            col(\"query_id\") == lit(self.query_id)\n        )\n\n        df_map = df_filtered.select(col(\"mappings\"))\n        view_df = df_map.select(\n            to_json(struct([df_map[x] for x in df_map.columns]))\n        ).collect()[0][0]\n        line = json.loads(view_df)\n\n        for line_v in line.values():\n            result = ast.literal_eval(line_v)\n\n        for result_key in result.keys():\n            result_new = result[result_key]\n            dim_from_date = result_new[\"dimensions\"][\"from_date\"]\n            dim_to_date = result_new[\"dimensions\"][\"to_date\"]\n\n        self._LOGGER.info(f\"temp stage view name: {self.temp_stage_view_name}\")\n\n        min_from_date = ExecEnv.SESSION.sql(\n            \"\"\"\n            SELECT\n                MIN({from_date}) as min_from_date\n            FROM {iter_stages}\"\"\".format(  # nosec: B608\n                iter_stages=self.temp_stage_view_name, from_date=dim_from_date\n            )\n        ).collect()[0][0]\n        max_from_date = ExecEnv.SESSION.sql(\n            \"\"\"\n            SELECT\n                MAX({from_date}) as max_from_date\n            FROM {iter_stages}\"\"\".format(  # nosec: B608\n                iter_stages=self.temp_stage_view_name, from_date=dim_from_date\n            )\n        ).collect()[0][0]\n\n        min_to_date = ExecEnv.SESSION.sql(\n            \"\"\"\n            SELECT\n                MIN({to_date}) as min_to_date\n            FROM {iter_stages}\"\"\".format(  # nosec: B608\n                iter_stages=self.temp_stage_view_name, to_date=dim_to_date\n            )\n        ).collect()[0][0]\n        max_to_date = ExecEnv.SESSION.sql(\n            \"\"\"\n            SELECT\n                MAX({to_date}) as max_to_date\n            FROM {iter_stages}\"\"\".format(  # nosec: B608\n                iter_stages=self.temp_stage_view_name, to_date=dim_to_date\n            )\n        ).collect()[0][0]\n\n        gen_del = \"\"\"\n        DELETE FROM {target_database}.{target_table} a\n            WHERE query_id = {query_id}\n            AND cadence = '{cadence}'\n            AND from_date BETWEEN '{min_from_date}' AND '{max_from_date}'\n            AND to_date BETWEEN '{min_to_date}' AND '{max_to_date}'\n        \"\"\".format(  # nosec: B608\n            target_database=self.target_database,\n            target_table=self.target_table,\n            query_id=self.query_id,\n            cadence=self.cadence,\n            min_from_date=min_from_date,\n            max_from_date=max_from_date,\n            min_to_date=min_to_date,\n            max_to_date=max_to_date,\n        )\n\n        return gen_del\n"
  },
  {
    "path": "lakehouse_engine/core/s3_file_manager.py",
    "content": "\"\"\"File manager module using boto3.\"\"\"\n\nimport time\nfrom typing import Any, Optional, Tuple\n\nimport boto3\n\nfrom lakehouse_engine.algorithms.exceptions import RestoreTypeNotFoundException\nfrom lakehouse_engine.core.definitions import (\n    ARCHIVE_STORAGE_CLASS,\n    FileManagerAPIKeys,\n    RestoreStatus,\n    RestoreType,\n)\nfrom lakehouse_engine.core.file_manager import FileManager\nfrom lakehouse_engine.utils.file_utils import get_directory_path\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\ndef _dry_run(bucket: str, object_paths: list) -> dict:\n    \"\"\"Build the dry run request return format.\n\n    Args:\n        bucket: name of bucket to perform operation.\n        object_paths: paths of object to list.\n\n    Returns:\n        A dict with a list of objects that would be copied/deleted.\n    \"\"\"\n    response = {}\n\n    for path in object_paths:\n        if _check_directory(bucket, path):\n            path = get_directory_path(path)\n\n        res = _list_objects_recursively(bucket=bucket, path=path)\n\n        if res:\n            response[path] = res\n        else:\n            response[path] = [\"No such key\"]\n\n    return response\n\n\ndef _list_objects(\n    s3_client: Any, bucket: str, path: str, paginator: str = \"\"\n) -> Tuple[list, str]:\n    \"\"\"List 1000 objects in a bucket given a prefix and paginator in s3.\n\n    Args:\n        bucket: name of bucket to perform the list.\n        path: path to be used as a prefix.\n        paginator: paginator token to be used.\n\n    Returns:\n         A list of object names.\n    \"\"\"\n    object_list = []\n\n    if not paginator:\n        list_response = s3_client.list_objects_v2(Bucket=bucket, Prefix=path)\n    else:\n        list_response = s3_client.list_objects_v2(\n            Bucket=bucket,\n            Prefix=path,\n            ContinuationToken=paginator,\n        )\n\n    if FileManagerAPIKeys.CONTENTS.value in list_response:\n        for obj in list_response[FileManagerAPIKeys.CONTENTS.value]:\n            object_list.append(obj[FileManagerAPIKeys.KEY.value])\n\n    if FileManagerAPIKeys.CONTINUATION.value in list_response:\n        pagination = list_response[FileManagerAPIKeys.CONTINUATION.value]\n    else:\n        pagination = \"\"\n\n    return object_list, pagination\n\n\ndef _list_objects_recursively(bucket: str, path: str) -> list:\n    \"\"\"Recursively list all objects given a prefix in s3.\n\n    Args:\n        bucket: name of bucket to perform the list.\n        path: path to be used as a prefix.\n\n    Returns:\n        A list of object names fetched recursively.\n    \"\"\"\n    object_list = []\n    more_objects = True\n    paginator = \"\"\n\n    s3 = boto3.client(\"s3\")\n\n    while more_objects:\n        temp_list, paginator = _list_objects(s3, bucket, path, paginator)\n\n        object_list.extend(temp_list)\n\n        if not paginator:\n            more_objects = False\n\n    return object_list\n\n\ndef _check_directory(bucket: str, path: str) -> bool:\n    \"\"\"Checks if the object is a 'directory' in s3.\n\n    Args:\n        bucket: name of bucket to perform the check.\n        path: path to be used as a prefix.\n\n    Returns:\n        If path represents a 'directory'.\n    \"\"\"\n    s3 = boto3.client(\"s3\")\n    objects, _ = _list_objects(s3, bucket, path)\n    return len(objects) > 1\n\n\nclass S3FileManager(FileManager):\n    \"\"\"Set of actions to manipulate s3 files in several ways.\"\"\"\n\n    _logger = 
LoggingHandler(__name__).get_logger()\n\n    def get_function(self) -> None:\n        \"\"\"Get a specific function to execute.\"\"\"\n        available_functions = {\n            \"delete_objects\": self.delete_objects,\n            \"copy_objects\": self.copy_objects,\n            \"request_restore\": self.request_restore,\n            \"check_restore_status\": self.check_restore_status,\n            \"request_restore_to_destination_and_wait\": (\n                self.request_restore_to_destination_and_wait\n            ),\n        }\n\n        self._logger.info(\"Function being executed: {}\".format(self.function))\n        if self.function in available_functions.keys():\n            func = available_functions[self.function]\n            func()\n        else:\n            raise NotImplementedError(\n                f\"The requested function {self.function} is not implemented.\"\n            )\n\n    def _delete_objects(self, bucket: str, objects_paths: list) -> None:\n        \"\"\"Delete objects recursively in s3.\n\n        Params:\n            bucket: name of bucket to perform the delete operation.\n            objects_paths: objects to be deleted.\n        \"\"\"\n        s3 = boto3.client(\"s3\")\n\n        for path in objects_paths:\n            if _check_directory(bucket, path):\n                path = get_directory_path(path)\n            else:\n                path = path.strip()\n\n            more_objects = True\n            paginator = \"\"\n            objects_to_delete = []\n\n            while more_objects:\n                objects_found, paginator = _list_objects(\n                    s3_client=s3, bucket=bucket, path=path, paginator=paginator\n                )\n                for obj in objects_found:\n                    objects_to_delete.append({FileManagerAPIKeys.KEY.value: obj})\n\n                if not paginator:\n                    more_objects = False\n\n                response = s3.delete_objects(\n                    Bucket=bucket,\n                    Delete={FileManagerAPIKeys.OBJECTS.value: objects_to_delete},\n                )\n                self._logger.info(response)\n                objects_to_delete = []\n\n    def delete_objects(self) -> None:\n        \"\"\"Delete objects and 'directories'.\n\n        If dry_run is set to True the function will print a dict with all the\n        paths that would be deleted based on the given keys.\n        \"\"\"\n        bucket = self.configs[\"bucket\"]\n        objects_paths = self.configs[\"object_paths\"]\n        dry_run = self.configs[\"dry_run\"]\n\n        if dry_run:\n            response = _dry_run(bucket=bucket, object_paths=objects_paths)\n\n            self._logger.info(\"Paths that would be deleted:\")\n            self._logger.info(response)\n        else:\n            self._delete_objects(bucket, objects_paths)\n\n    def copy_objects(self) -> None:\n        \"\"\"Copies objects and 'directories'.\n\n        If dry_run is set to True the function will print a dict with all the\n        paths that would be copied based on the given keys.\n        \"\"\"\n        source_bucket = self.configs[\"bucket\"]\n        source_object = self.configs[\"source_object\"]\n        destination_bucket = self.configs[\"destination_bucket\"]\n        destination_object = self.configs[\"destination_object\"]\n        dry_run = self.configs[\"dry_run\"]\n\n        S3FileManager._copy_objects(\n            source_bucket=source_bucket,\n            source_object=source_object,\n            
destination_bucket=destination_bucket,\n            destination_object=destination_object,\n            dry_run=dry_run,\n        )\n\n    def move_objects(self) -> None:\n        \"\"\"Moves objects and 'directories'.\n\n        If dry_run is set to True the function will print a dict with all the\n        paths that would be moved based on the given keys.\n        \"\"\"\n        pass\n\n    def request_restore(self) -> None:\n        \"\"\"Request the restore of archived data.\"\"\"\n        source_bucket = self.configs[\"bucket\"]\n        source_object = self.configs[\"source_object\"]\n        restore_expiration = self.configs[\"restore_expiration\"]\n        retrieval_tier = self.configs[\"retrieval_tier\"]\n        dry_run = self.configs[\"dry_run\"]\n\n        ArchiveFileManager.request_restore(\n            source_bucket,\n            source_object,\n            restore_expiration,\n            retrieval_tier,\n            dry_run,\n        )\n\n    def check_restore_status(self) -> None:\n        \"\"\"Check the restore status of archived data.\"\"\"\n        source_bucket = self.configs[\"bucket\"]\n        source_object = self.configs[\"source_object\"]\n\n        restore_status = ArchiveFileManager.check_restore_status(\n            source_bucket, source_object\n        )\n\n        self._logger.info(\n            f\"\"\"\n            Restore status:\n            - Not Started: {restore_status.get('not_started_objects')}\n            - Ongoing: {restore_status.get('ongoing_objects')}\n            - Restored: {restore_status.get('restored_objects')}\n            Total objects in this restore process: {restore_status.get('total_objects')}\n            \"\"\"\n        )\n\n    def request_restore_to_destination_and_wait(self) -> None:\n        \"\"\"Request and wait for the restore to complete, polling the restore status.\n\n        After the restore is done, copy the restored files to destination\n        \"\"\"\n        source_bucket = self.configs[\"bucket\"]\n        source_object = self.configs[\"source_object\"]\n        destination_bucket = self.configs[\"destination_bucket\"]\n        destination_object = self.configs[\"destination_object\"]\n        restore_expiration = self.configs[\"restore_expiration\"]\n        retrieval_tier = self.configs[\"retrieval_tier\"]\n        dry_run = self.configs[\"dry_run\"]\n\n        ArchiveFileManager.request_restore_and_wait(\n            source_bucket=source_bucket,\n            source_object=source_object,\n            restore_expiration=restore_expiration,\n            retrieval_tier=retrieval_tier,\n            dry_run=dry_run,\n        )\n\n        S3FileManager._logger.info(\n            f\"Restoration complete for {source_bucket} and {source_object}\"\n        )\n        S3FileManager._logger.info(\n            f\"Starting to copy data from {source_bucket}/{source_object} to \"\n            f\"{destination_bucket}/{destination_object}\"\n        )\n        S3FileManager._copy_objects(\n            source_bucket=source_bucket,\n            source_object=source_object,\n            destination_bucket=destination_bucket,\n            destination_object=destination_object,\n            dry_run=dry_run,\n        )\n        S3FileManager._logger.info(\n            f\"Finished copying data, data should be available on {destination_bucket}/\"\n            f\"{destination_object}\"\n        )\n\n    @staticmethod\n    def _copy_objects(\n        source_bucket: str,\n        source_object: str,\n        destination_bucket: str,\n       
 destination_object: str,\n        dry_run: bool,\n    ) -> None:\n        \"\"\"Copies objects and 'directories' in s3.\n\n        Args:\n            source_bucket: name of bucket to perform the copy.\n            source_object: object/folder to be copied.\n            destination_bucket: name of the target bucket to copy.\n            destination_object: target object/folder to copy.\n            dry_run: if dry_run is set to True the function will print a dict with\n                all the paths that would be copied based on the given keys.\n        \"\"\"\n        s3 = boto3.client(\"s3\")\n\n        if dry_run:\n            response = _dry_run(bucket=source_bucket, object_paths=[source_object])\n\n            S3FileManager._logger.info(\"Paths that would be copied:\")\n            S3FileManager._logger.info(response)\n        else:\n            original_object_name = source_object.split(\"/\")[-1]\n\n            if _check_directory(source_bucket, source_object):\n                source_object = get_directory_path(source_object)\n\n                copy_object = _list_objects_recursively(\n                    bucket=source_bucket, path=source_object\n                )\n\n                for obj in copy_object:\n                    S3FileManager._logger.info(f\"Copying obj: {obj}\")\n\n                    final_path = obj.replace(source_object, \"\")\n\n                    response = s3.copy_object(\n                        Bucket=destination_bucket,\n                        CopySource={\n                            FileManagerAPIKeys.BUCKET.value: source_bucket,\n                            FileManagerAPIKeys.KEY.value: obj,\n                        },\n                        Key=f\"{destination_object}/{original_object_name}/{final_path}\",\n                    )\n                    S3FileManager._logger.info(response)\n            else:\n                S3FileManager._logger.info(f\"Copying obj: {source_object}\")\n\n                response = s3.copy_object(\n                    Bucket=destination_bucket,\n                    CopySource={\n                        FileManagerAPIKeys.BUCKET.value: source_bucket,\n                        FileManagerAPIKeys.KEY.value: source_object,\n                    },\n                    Key=f\"\"\"{destination_object}/{original_object_name}\"\"\",\n                )\n                S3FileManager._logger.info(response)\n\n\nclass ArchiveFileManager(object):\n    \"\"\"Set of actions to restore archives.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @staticmethod\n    def _get_archived_object(bucket: str, object_key: str) -> Optional[Any]:\n        \"\"\"Get the archived object if it's an object.\n\n        Args:\n            bucket: name of the bucket to get the object from.\n            object_key: object to get.\n\n        Returns:\n            S3 Object if it's an archived object, otherwise None.\n        \"\"\"\n        s3 = boto3.resource(\"s3\")\n        object_to_restore = s3.Object(bucket, object_key)\n\n        if (\n            object_to_restore.storage_class is not None\n            and object_to_restore.storage_class in ARCHIVE_STORAGE_CLASS\n        ):\n            return object_to_restore\n        else:\n            return None\n\n    @staticmethod\n    def _check_object_restore_status(\n        bucket: str, object_key: str\n    ) -> Optional[RestoreStatus]:\n        \"\"\"Check the restore status of the archive.\n\n        Args:\n            bucket: name of bucket to check the restore status.\n            
object_key: object to check the restore status.\n\n        Returns:\n            The restore status represented by an enum, possible values are:\n                NOT_STARTED, ONGOING or RESTORED\n        \"\"\"\n        archived_object = ArchiveFileManager._get_archived_object(bucket, object_key)\n\n        if archived_object is None:\n            status = None\n        elif archived_object.restore is None:\n            status = RestoreStatus.NOT_STARTED\n        elif 'ongoing-request=\"true\"' in archived_object.restore:\n            status = RestoreStatus.ONGOING\n        else:\n            status = RestoreStatus.RESTORED\n\n        return status\n\n    @staticmethod\n    def check_restore_status(source_bucket: str, source_object: str) -> dict:\n        \"\"\"Check the restore status of archived data.\n\n        Args:\n            source_bucket: name of bucket to check the restore status.\n            source_object: object to check the restore status.\n\n        Returns:\n            A dict containing the amount of objects in each status.\n        \"\"\"\n        not_started_objects = 0\n        ongoing_objects = 0\n        restored_objects = 0\n        total_objects = 0\n\n        if _check_directory(source_bucket, source_object):\n            source_object = get_directory_path(source_object)\n\n        objects_to_restore = _list_objects_recursively(\n            bucket=source_bucket, path=source_object\n        )\n\n        for obj in objects_to_restore:\n            ArchiveFileManager._logger.info(f\"Checking restore status for: {obj}\")\n\n            restore_status = ArchiveFileManager._check_object_restore_status(\n                source_bucket, obj\n            )\n            if not restore_status:\n                ArchiveFileManager._logger.warning(\n                    f\"Restore status not found for {source_bucket}/{obj}\"\n                )\n            else:\n                total_objects += 1\n\n                if RestoreStatus.NOT_STARTED == restore_status:\n                    not_started_objects += 1\n                elif RestoreStatus.ONGOING == restore_status:\n                    ongoing_objects += 1\n                else:\n                    restored_objects += 1\n\n                ArchiveFileManager._logger.info(\n                    f\"{obj} restore status is {restore_status.value}\"\n                )\n\n        return {\n            \"total_objects\": total_objects,\n            \"not_started_objects\": not_started_objects,\n            \"ongoing_objects\": ongoing_objects,\n            \"restored_objects\": restored_objects,\n        }\n\n    @staticmethod\n    def _request_restore_object(\n        bucket: str, object_key: str, expiration: int, retrieval_tier: str\n    ) -> None:\n        \"\"\"Request a restore of the archive.\n\n        Args:\n            bucket: name of bucket to perform the restore.\n            object_key: object to be restored.\n            expiration: restore expiration.\n            retrieval_tier: type of restore, possible values are:\n                Bulk, Standard or Expedited.\n        \"\"\"\n        if not RestoreType.exists(retrieval_tier):\n            raise RestoreTypeNotFoundException(\n                f\"Restore type {retrieval_tier} not supported.\"\n            )\n\n        if _check_directory(bucket, object_key):\n            object_key = get_directory_path(object_key)\n\n        archived_object = ArchiveFileManager._get_archived_object(bucket, object_key)\n\n        if archived_object and archived_object.restore is None:\n 
            ArchiveFileManager._logger.info(f\"Restoring archive {bucket}/{object_key}.\")\n            archived_object.restore_object(\n                RestoreRequest={\n                    \"Days\": expiration,\n                    \"GlacierJobParameters\": {\"Tier\": retrieval_tier},\n                }\n            )\n        else:\n            ArchiveFileManager._logger.info(\n                f\"Restore request for {bucket}/{object_key} not performed.\"\n            )\n\n    @staticmethod\n    def request_restore(\n        source_bucket: str,\n        source_object: str,\n        restore_expiration: int,\n        retrieval_tier: str,\n        dry_run: bool,\n    ) -> None:\n        \"\"\"Request the restore of archived data.\n\n        Args:\n            source_bucket: name of bucket to perform the restore.\n            source_object: object to be restored.\n            restore_expiration: restore expiration in days.\n            retrieval_tier: type of restore, possible values are:\n                Bulk, Standard or Expedited.\n            dry_run: if dry_run is set to True the function will print a dict with\n                all the paths that would be restored based on the given keys.\n        \"\"\"\n        if _check_directory(source_bucket, source_object):\n            source_object = get_directory_path(source_object)\n\n        if dry_run:\n            response = _dry_run(bucket=source_bucket, object_paths=[source_object])\n\n            ArchiveFileManager._logger.info(\"Paths that would be restored:\")\n            ArchiveFileManager._logger.info(response)\n        else:\n            objects_to_restore = _list_objects_recursively(\n                bucket=source_bucket, path=source_object\n            )\n\n            for obj in objects_to_restore:\n                ArchiveFileManager._request_restore_object(\n                    source_bucket,\n                    obj,\n                    restore_expiration,\n                    retrieval_tier,\n                )\n\n    @staticmethod\n    def request_restore_and_wait(\n        source_bucket: str,\n        source_object: str,\n        restore_expiration: int,\n        retrieval_tier: str,\n        dry_run: bool,\n    ) -> None:\n        \"\"\"Request and wait for the restore to complete, polling the restore status.\n\n        Args:\n            source_bucket: name of bucket to perform the restore.\n            source_object: object to be restored.\n            restore_expiration: restore expiration in days.\n            retrieval_tier: type of restore, possible values are:\n                Bulk, Standard or Expedited.\n            dry_run: if dry_run is set to True the function will print a dict with\n                all the paths that would be restored based on the given keys.\n        \"\"\"\n        if retrieval_tier != RestoreType.EXPEDITED.value:\n            ArchiveFileManager._logger.error(\n                f\"Retrieval Tier {retrieval_tier} not allowed on this operation! This \"\n                \"kind of restore should be used just with `Expedited` retrieval tier \"\n                \"to save cluster costs.\"\n            )\n            raise ValueError(\n                f\"Retrieval Tier {retrieval_tier} not allowed on this operation! This \"\n                \"kind of restore should be used just with `Expedited` retrieval tier \"\n                \"to save cluster costs.\"\n            )\n\n        ArchiveFileManager.request_restore(\n            source_bucket=source_bucket,\n            source_object=source_object,\n            restore_expiration=restore_expiration,\n            retrieval_tier=retrieval_tier,\n            dry_run=dry_run,\n        )\n        restore_status = ArchiveFileManager.check_restore_status(\n            source_bucket, source_object\n        )\n        ArchiveFileManager._logger.info(f\"Restore status: {restore_status}\")\n\n        if not dry_run:\n            ArchiveFileManager._logger.info(\"Checking the restore status in 5 minutes.\")\n            wait_time = 300\n            while restore_status.get(\"total_objects\") > restore_status.get(\n                \"restored_objects\"\n            ):\n                ArchiveFileManager._logger.info(\n                    \"Not all objects have been restored yet, checking the status again \"\n                    f\"in {wait_time} seconds.\"\n                )\n                time.sleep(wait_time)\n                wait_time = 30\n                restore_status = ArchiveFileManager.check_restore_status(\n                    source_bucket, source_object\n                )\n                ArchiveFileManager._logger.info(f\"Restore status: {restore_status}\")\n"
  },
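  {
    "path": "examples/file_manager_restore_sketch.py",
    "content": "\"\"\"Hypothetical usage sketch for the archive restore helpers (not part of the library).\n\nA minimal sketch, assuming the import path of ArchiveFileManager and that AWS\ncredentials are already configured; bucket and prefix names are placeholders.\n\"\"\"\n\nfrom lakehouse_engine.core.file_manager import ArchiveFileManager  # assumed import path\n\nSOURCE_BUCKET = \"my-data-bucket\"  # placeholder bucket name\nSOURCE_PREFIX = \"archive/sales/2023\"  # placeholder object/prefix\n\n# Request the restore of every archived object under the prefix.\n# retrieval_tier must be one of: Bulk, Standard or Expedited.\nArchiveFileManager.request_restore(\n    source_bucket=SOURCE_BUCKET,\n    source_object=SOURCE_PREFIX,\n    restore_expiration=7,  # keep the restored copies available for 7 days\n    retrieval_tier=\"Standard\",\n    dry_run=True,  # only log the paths that would be restored\n)\n\n# Check how many objects are not started / ongoing / restored.\nstatus = ArchiveFileManager.check_restore_status(SOURCE_BUCKET, SOURCE_PREFIX)\nprint(status)\n"
  },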
  {
    "path": "lakehouse_engine/core/sensor_manager.py",
    "content": "\"\"\"Module to define Sensor Manager classes.\"\"\"\n\nimport json\nfrom datetime import datetime\nfrom typing import List, Optional, Tuple\n\nimport requests\nfrom delta.tables import DeltaTable\nfrom pyspark.sql import DataFrame, Row\nfrom pyspark.sql.functions import array, col, lit\n\nfrom lakehouse_engine.core.definitions import (\n    SENSOR_SCHEMA,\n    SENSOR_UPDATE_SET,\n    SAPLogchain,\n    SensorSpec,\n    SensorStatus,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader_factory import ReaderFactory\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass SensorControlTableManager(object):\n    \"\"\"Class to control the Sensor execution.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def check_if_sensor_has_acquired_data(\n        cls,\n        sensor_id: str,\n        control_db_table_name: str,\n    ) -> bool:\n        \"\"\"Check if sensor has acquired new data.\n\n        Args:\n            sensor_id: sensor id.\n            control_db_table_name: `db.table` to control sensor runs.\n\n        Returns:\n            True if acquired new data, otherwise False\n        \"\"\"\n        sensor_table_data = cls.read_sensor_table_data(\n            sensor_id=sensor_id, control_db_table_name=control_db_table_name\n        )\n        cls._LOGGER.info(f\"sensor_table_data = {sensor_table_data}\")\n\n        return (\n            sensor_table_data is not None\n            and sensor_table_data.status == SensorStatus.ACQUIRED_NEW_DATA.value\n        )\n\n    @classmethod\n    def update_sensor_status(\n        cls,\n        sensor_spec: SensorSpec,\n        status: str,\n        upstream_key: str = None,\n        upstream_value: str = None,\n    ) -> None:\n        \"\"\"Control sensor execution storing the execution data in a delta table.\n\n        Args:\n            sensor_spec: sensor spec containing all sensor\n                information we need to update the control status.\n            status: status of the sensor.\n            upstream_key: upstream key (e.g., used to store an attribute\n                name from the upstream so that new data can be detected\n                automatically).\n            upstream_value: upstream value (e.g., used to store the max\n                attribute value from the upstream so that new data can be\n                detected automatically).\n        \"\"\"\n        cls._LOGGER.info(\n            f\"Updating sensor status for sensor {sensor_spec.sensor_id}...\"\n        )\n\n        data = cls._convert_sensor_to_data(\n            spec=sensor_spec,\n            status=status,\n            upstream_key=upstream_key,\n            upstream_value=upstream_value,\n        )\n\n        sensor_update_set = cls._get_sensor_update_set(\n            assets=sensor_spec.assets,\n            checkpoint_location=sensor_spec.checkpoint_location,\n            upstream_key=upstream_key,\n            upstream_value=upstream_value,\n        )\n\n        cls._update_sensor_control(\n            data=data,\n            sensor_update_set=sensor_update_set,\n            sensor_control_table=sensor_spec.control_db_table_name,\n            sensor_id=sensor_spec.sensor_id,\n        )\n\n    @classmethod\n    def _update_sensor_control(\n        cls,\n        data: List[dict],\n        sensor_update_set: dict,\n        sensor_control_table: str,\n        sensor_id: str,\n    ) -> None:\n        \"\"\"Update sensor control delta table.\n\n        Args:\n     
       data: data to be updated.\n            sensor_update_set: columns to be updated.\n            sensor_control_table: control table name.\n            sensor_id: sensor_id to be updated.\n        \"\"\"\n        sensors_delta_table = DeltaTable.forName(\n            ExecEnv.SESSION,\n            sensor_control_table,\n        )\n        sensors_updates = ExecEnv.SESSION.createDataFrame(data, SENSOR_SCHEMA)\n        sensors_delta_table.alias(\"sensors\").merge(\n            sensors_updates.alias(\"updates\"),\n            f\"sensors.sensor_id = '{sensor_id}' AND \"\n            \"sensors.sensor_id = updates.sensor_id\",\n        ).whenMatchedUpdate(set=sensor_update_set).whenNotMatchedInsertAll().execute()\n\n    @classmethod\n    def _convert_sensor_to_data(\n        cls,\n        spec: SensorSpec,\n        status: str,\n        upstream_key: str,\n        upstream_value: str,\n        status_change_timestamp: Optional[datetime] = None,\n    ) -> List[dict]:\n        \"\"\"Convert sensor data to dataframe input data.\n\n        Args:\n            spec: sensor spec containing sensor identifier data.\n            status: new sensor data status.\n            upstream_key: key used to acquire data from the upstream.\n            upstream_value: max value from the upstream_key\n                acquired from the upstream.\n            status_change_timestamp: timestamp at which we commit\n                this change in the sensor control table.\n\n        Returns:\n            Sensor data as list[dict], used to create a\n                dataframe to store the data into the sensor_control_table.\n        \"\"\"\n        status_change_timestamp = (\n            datetime.now()\n            if status_change_timestamp is None\n            else status_change_timestamp\n        )\n        return [\n            {\n                \"sensor_id\": spec.sensor_id,\n                \"assets\": spec.assets,\n                \"status\": status,\n                \"status_change_timestamp\": status_change_timestamp,\n                \"checkpoint_location\": spec.checkpoint_location,\n                \"upstream_key\": str(upstream_key),\n                \"upstream_value\": str(upstream_value),\n            }\n        ]\n\n    @classmethod\n    def _get_sensor_update_set(cls, **kwargs: Optional[str] | List[str]) -> dict:\n        \"\"\"Get the sensor update set.\n\n        Args:\n            kwargs: Containing the following keys:\n            - assets\n            - checkpoint_location\n            - upstream_key\n            - upstream_value\n\n        Returns:\n            A dict containing the fields to update in the control_table.\n        \"\"\"\n        sensor_update_set = dict(SENSOR_UPDATE_SET)\n        for key, value in kwargs.items():\n            if value:\n                sensor_update_set[f\"sensors.{key}\"] = f\"updates.{key}\"\n\n        return sensor_update_set\n\n    @classmethod\n    def read_sensor_table_data(\n        cls,\n        control_db_table_name: str,\n        sensor_id: str = None,\n        assets: list = None,\n    ) -> Optional[Row]:\n        \"\"\"Read data from delta table containing sensor status info.\n\n        Args:\n            sensor_id: sensor id. If this parameter is defined, the search occurs\n                only considering this parameter. 
Otherwise, it considers sensor\n                assets and checkpoint location.\n            control_db_table_name: db.table to control sensor runs.\n            assets: list of assets that are fueled by the pipeline\n                where this sensor is.\n\n        Returns:\n            Row containing the data for the provided sensor_id.\n        \"\"\"\n        df = DeltaTable.forName(\n            ExecEnv.SESSION,\n            control_db_table_name,\n        ).toDF()\n\n        if sensor_id:\n            df = df.where(col(\"sensor_id\") == sensor_id)\n        elif assets:\n            df = df.where(col(\"assets\") == array(*[lit(asset) for asset in assets]))\n        else:\n            raise ValueError(\n                \"Either sensor_id or assets need to be provided as arguments.\"\n            )\n\n        return df.first()\n\n\nclass SensorUpstreamManager(object):\n    \"\"\"Class to deal with Sensor Upstream data.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def generate_filter_exp_query(\n        cls,\n        sensor_id: str,\n        filter_exp: str,\n        control_db_table_name: str = None,\n        upstream_key: str = None,\n        upstream_value: str = None,\n        upstream_table_name: str = None,\n    ) -> str:\n        \"\"\"Generates a sensor preprocess query based on timestamp logic.\n\n        Args:\n            sensor_id: sensor id.\n            filter_exp: expression to filter incoming new data.\n                You can use the placeholder `?upstream_value` so that\n                it can be replaced by the upstream_value in the\n                control_db_table_name for this specific sensor_id.\n            control_db_table_name: db.table to retrieve the last status change\n                timestamp. 
This is only relevant for the jdbc sensor.\n            upstream_key: the key of custom sensor information\n                to control how to identify new data from the\n                upstream (e.g., a time column in the upstream).\n            upstream_value: value for custom sensor\n                to identify new data from the upstream\n                (e.g., the value of a time present in the upstream)\n                If none we will set the default value.\n                Note: This parameter is used just to override the\n                default value `-2147483647`.\n            upstream_table_name: value for custom sensor\n                to query new data from the upstream.\n                If none we will set the default value,\n                our `sensor_new_data` view.\n\n        Returns:\n            The query string.\n        \"\"\"\n        source_table = upstream_table_name if upstream_table_name else \"sensor_new_data\"\n        select_exp = \"SELECT COUNT(1) as count\"\n        if control_db_table_name:\n            if not upstream_key:\n                raise ValueError(\n                    \"If control_db_table_name is defined, upstream_key should \"\n                    \"also be defined!\"\n                )\n\n            default_upstream_value: str = \"-2147483647\"\n            trigger_name = upstream_key\n            trigger_value = (\n                default_upstream_value if upstream_value is None else upstream_value\n            )\n            sensor_table_data = SensorControlTableManager.read_sensor_table_data(\n                sensor_id=sensor_id, control_db_table_name=control_db_table_name\n            )\n\n            if sensor_table_data and sensor_table_data.upstream_value:\n                trigger_value = sensor_table_data.upstream_value\n\n            filter_exp = filter_exp.replace(\"?upstream_key\", trigger_name).replace(\n                \"?upstream_value\", trigger_value\n            )\n            select_exp = (\n                f\"SELECT COUNT(1) as count, '{trigger_name}' as UPSTREAM_KEY, \"\n                f\"max({trigger_name}) as UPSTREAM_VALUE\"\n            )\n\n        query = (\n            f\"{select_exp} \"\n            f\"FROM {source_table} \"\n            f\"WHERE {filter_exp} \"\n            f\"HAVING COUNT(1) > 0\"\n        )\n\n        return query\n\n    @classmethod\n    def generate_sensor_table_preprocess_query(\n        cls,\n        sensor_id: str,\n    ) -> str:\n        \"\"\"Generates a query to be used for a sensor having other sensor as upstream.\n\n        Args:\n            sensor_id: sensor id.\n\n        Returns:\n            The query string.\n        \"\"\"\n        query = (\n            f\"SELECT * \"  # nosec\n            f\"FROM sensor_new_data \"\n            f\"WHERE\"\n            f\" _change_type in ('insert', 'update_postimage')\"\n            f\" and sensor_id = '{sensor_id}'\"\n            f\" and status = '{SensorStatus.PROCESSED_NEW_DATA.value}'\"\n        )\n\n        return query\n\n    @classmethod\n    def read_new_data(cls, sensor_spec: SensorSpec) -> DataFrame:\n        \"\"\"Read new data from the upstream into the sensor 'new_data_df'.\n\n        Args:\n            sensor_spec: sensor spec containing all sensor information.\n\n        Returns:\n            An empty dataframe if it doesn't have new data otherwise the new data\n        \"\"\"\n        new_data_df = ReaderFactory.get_data(sensor_spec.input_spec)\n\n        if sensor_spec.preprocess_query:\n            
new_data_df.createOrReplaceTempView(\"sensor_new_data\")\n            new_data_df = ExecEnv.SESSION.sql(sensor_spec.preprocess_query)\n\n        return new_data_df\n\n    @classmethod\n    def get_new_data(\n        cls,\n        new_data_df: DataFrame,\n    ) -> Optional[Row]:\n        \"\"\"Get new data from upstream df if it's present.\n\n        Args:\n            new_data_df: DataFrame possibly containing new data.\n\n        Returns:\n            Optional row, present if there is new data in the upstream,\n            absent otherwise.\n        \"\"\"\n        return new_data_df.first()\n\n    @classmethod\n    def generate_sensor_sap_logchain_query(\n        cls,\n        chain_id: str,\n        dbtable: str = SAPLogchain.DBTABLE.value,\n        status: str = SAPLogchain.GREEN_STATUS.value,\n        engine_table_name: str = SAPLogchain.ENGINE_TABLE.value,\n    ) -> str:\n        \"\"\"Generates a sensor query based on the SAP Logchain table.\n\n        Args:\n            chain_id: chain id to query the status on SAP.\n            dbtable: db.table to retrieve the data to\n                check if the SAP chain is already finished.\n            status: status value used to check if the SAP chain has\n                already finished successfully.\n            engine_table_name: table name exposed with the SAP LOGCHAIN data.\n                This table will be used in the jdbc query.\n\n        Returns:\n            The query string.\n        \"\"\"\n        if not chain_id:\n            raise ValueError(\n                \"To query on log chain SAP table the chain id should be defined!\"\n            )\n\n        select_exp = (\n            \"SELECT CHAIN_ID, CONCAT(DATUM, ZEIT) AS LOAD_DATE, ANALYZED_STATUS\"\n        )\n        filter_exp = (\n            f\"UPPER(CHAIN_ID) = UPPER('{chain_id}') \"\n            f\"AND UPPER(ANALYZED_STATUS) = UPPER('{status}')\"\n        )\n\n        query = (\n            f\"WITH {engine_table_name} AS (\"\n            f\"{select_exp} \"\n            f\"FROM {dbtable} \"\n            f\"WHERE {filter_exp}\"\n            \")\"\n        )\n\n        return query\n\n\nclass SensorJobRunManager(object):\n    \"\"\"Class to manage triggering of Jobs via Job Run API.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def run_job(cls, job_id: str, token: str, host: str) -> Tuple[int, Optional[str]]:\n        \"\"\"Trigger the job based on its id.\n\n        Args:\n            job_id: the id of the job to trigger.\n            token: token required to access Databricks API.\n            host: host for workspace.\n\n        Returns:\n            A tuple with the run id of the triggered job and an error message in\n            case the trigger failed.\n        \"\"\"\n        run_id = None\n        ex = None\n\n        headers = {\"Authorization\": f\"Bearer {token}\"}\n        body = json.dumps(\n            {\n                \"job_id\": job_id,\n                \"notebook_params\": {\"msg\": \"triggered via heartbeat sensor\"},\n            }\n        )\n\n        res = requests.post(\n            f\"https://{host}/api/2.1/jobs/run-now\",\n            data=body,\n            headers=headers,\n            timeout=3600,\n        )\n\n        if res.status_code == 200:\n            run_id = (json.loads(res.text))[\"run_id\"]\n            cls._LOGGER.info(\n                f\"Job : {str(job_id)} triggered successfully... RUN ID : {str(run_id)}\"\n            )\n        else:\n            ex = str(res.json()[\"error_code\"]) + \"  \" + res.json()[\"message\"]\n            cls._LOGGER.error(f\"An error has occurred: {ex}\")\n\n        return run_id, ex\n"
  },
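  {
    "path": "examples/sensor_filter_query_sketch.py",
    "content": "\"\"\"Hypothetical sketch of building a sensor filter expression query (not part of the library).\n\nA minimal sketch, assuming ExecEnv.SESSION is already initialised with a Spark\nsession and that the sensor control Delta table exists; the sensor id, tables\nand column names below are placeholders.\n\"\"\"\n\nfrom lakehouse_engine.core.sensor_manager import SensorUpstreamManager\n\n# Build the preprocess query for a JDBC-style sensor. The ?upstream_key and\n# ?upstream_value placeholders are replaced with the trigger column and with the\n# last value stored in the control table (or the default -2147483647).\nquery = SensorUpstreamManager.generate_filter_exp_query(\n    sensor_id=\"sales_orders_sensor\",\n    filter_exp=\"?upstream_key > '?upstream_value'\",\n    control_db_table_name=\"control_db.sensor_control\",\n    upstream_key=\"changed_on\",\n    upstream_table_name=\"source_db.sales_orders\",\n)\nprint(query)\n"
  },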
  {
    "path": "lakehouse_engine/core/table_manager.py",
    "content": "\"\"\"Table manager module.\"\"\"\n\nfrom typing import List\n\nfrom delta.tables import DeltaTable\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import translate\n\nfrom lakehouse_engine.core.definitions import SQLDefinitions\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.sql_parser_utils import SQLParserUtils\n\n\nclass TableManager(object):\n    \"\"\"Set of actions to manipulate tables/views in several ways.\n\n    {{ get_table_manager_operations() }}\n    \"\"\"\n\n    def __init__(self, configs: dict):\n        \"\"\"Construct TableManager algorithm instances.\n\n        Args:\n            configs: configurations for the TableManager algorithm.\n        \"\"\"\n        self._logger = LoggingHandler(__name__).get_logger()\n        self.configs = configs\n        self.function = self.configs[\"function\"]\n\n    def get_function(self) -> None:\n        \"\"\"Get a specific function to execute.\"\"\"\n        available_functions = {\n            \"compute_table_statistics\": self.compute_table_statistics,\n            \"create_table\": self.create,\n            \"create_tables\": self.create_many,\n            \"create_view\": self.create,\n            \"drop_table\": self.drop_table,\n            \"drop_view\": self.drop_view,\n            \"execute_sql\": self.execute_sql,\n            \"truncate\": self.truncate,\n            \"vacuum\": self.vacuum,\n            \"describe\": self.describe,\n            \"optimize\": self.optimize,\n            \"show_tbl_properties\": self.show_tbl_properties,\n            \"get_tbl_pk\": self.get_tbl_pk,\n            \"repair_table\": self.repair_table,\n            \"delete_where\": self.delete_where,\n        }\n\n        self._logger.info(\"Function being executed: {}\".format(self.function))\n\n        if self.function in available_functions.keys():\n            func = available_functions[self.function]\n            func()\n        else:\n            raise NotImplementedError(\n                f\"The requested function {self.function} is not implemented.\"\n            )\n\n    def create(self) -> None:\n        \"\"\"Create a new table or view on metastore.\"\"\"\n        disable_dbfs_retry = (\n            self.configs[\"disable_dbfs_retry\"]\n            if \"disable_dbfs_retry\" in self.configs.keys()\n            else False\n        )\n        sql = ConfigUtils.read_sql(self.configs[\"path\"], disable_dbfs_retry)\n        try:\n            sql_commands = SQLParserUtils().split_sql_commands(\n                sql_commands=sql,\n                delimiter=self.configs.get(\"delimiter\", \";\"),\n                advanced_parser=self.configs.get(\"advanced_parser\", False),\n            )\n            for command in sql_commands:\n                if command.strip():\n                    self._logger.info(f\"sql command: {command}\")\n                    ExecEnv.SESSION.sql(command)\n            self._logger.info(f\"{self.function} successfully executed!\")\n        except Exception as e:\n            self._logger.error(e)\n            raise\n\n    def create_many(self) -> None:\n        \"\"\"Create multiple tables or views on metastore.\n\n        In this function the path to the ddl files can be separated by comma.\n        \"\"\"\n        self.execute_multiple_sql_files()\n\n    def compute_table_statistics(self) -> None:\n        
\"\"\"Compute table statistics.\"\"\"\n        sql = SQLDefinitions.compute_table_stats.value.format(\n            self.configs[\"table_or_view\"]\n        )\n        try:\n            self._logger.info(f\"sql command: {sql}\")\n            ExecEnv.SESSION.sql(sql)\n            self._logger.info(f\"{self.function} successfully executed!\")\n        except Exception as e:\n            self._logger.error(e)\n            raise\n\n    def drop_table(self) -> None:\n        \"\"\"Delete table function deletes table from metastore and erases all data.\"\"\"\n        drop_stmt = \"{} {}\".format(\n            SQLDefinitions.drop_table_stmt.value,\n            self.configs[\"table_or_view\"],\n        )\n\n        self._logger.info(f\"sql command: {drop_stmt}\")\n        ExecEnv.SESSION.sql(drop_stmt)\n        self._logger.info(\"Table successfully dropped!\")\n\n    def drop_view(self) -> None:\n        \"\"\"Delete view function deletes view from metastore and erases all data.\"\"\"\n        drop_stmt = \"{} {}\".format(\n            SQLDefinitions.drop_view_stmt.value,\n            self.configs[\"table_or_view\"],\n        )\n\n        self._logger.info(f\"sql command: {drop_stmt}\")\n        ExecEnv.SESSION.sql(drop_stmt)\n        self._logger.info(\"View successfully dropped!\")\n\n    def truncate(self) -> None:\n        \"\"\"Truncate function erases all data but keeps metadata.\"\"\"\n        truncate_stmt = \"{} {}\".format(\n            SQLDefinitions.truncate_stmt.value,\n            self.configs[\"table_or_view\"],\n        )\n\n        self._logger.info(f\"sql command: {truncate_stmt}\")\n        ExecEnv.SESSION.sql(truncate_stmt)\n        self._logger.info(\"Table successfully truncated!\")\n\n    def vacuum(self) -> None:\n        \"\"\"Vacuum function erases older versions from Delta Lake tables or locations.\"\"\"\n        if not self.configs.get(\"table_or_view\", None):\n            delta_table = DeltaTable.forPath(ExecEnv.SESSION, self.configs[\"path\"])\n\n            self._logger.info(f\"Vacuuming location: {self.configs['path']}\")\n            delta_table.vacuum(self.configs.get(\"vacuum_hours\", 168))\n        else:\n            delta_table = DeltaTable.forName(\n                ExecEnv.SESSION, self.configs[\"table_or_view\"]\n            )\n\n            self._logger.info(f\"Vacuuming table: {self.configs['table_or_view']}\")\n            delta_table.vacuum(self.configs.get(\"vacuum_hours\", 168))\n\n    def describe(self) -> None:\n        \"\"\"Describe function describes metadata from some table or view.\"\"\"\n        describe_stmt = \"{} {}\".format(\n            SQLDefinitions.describe_stmt.value,\n            self.configs[\"table_or_view\"],\n        )\n\n        self._logger.info(f\"sql command: {describe_stmt}\")\n        output = ExecEnv.SESSION.sql(describe_stmt)\n        self._logger.info(output)\n\n    def optimize(self) -> None:\n        \"\"\"Optimize function optimizes the layout of Delta Lake data.\"\"\"\n        if self.configs.get(\"where_clause\", None):\n            where_exp = \"WHERE {}\".format(self.configs[\"where_clause\"].strip())\n        else:\n            where_exp = \"\"\n\n        if self.configs.get(\"optimize_zorder_col_list\", None):\n            zorder_exp = \"ZORDER BY ({})\".format(\n                self.configs[\"optimize_zorder_col_list\"].strip()\n            )\n        else:\n            zorder_exp = \"\"\n\n        optimize_stmt = \"{} {} {} {}\".format(\n            SQLDefinitions.optimize_stmt.value,\n            (\n           
     f\"delta.`{self.configs.get('path', None)}`\"\n                if not self.configs.get(\"table_or_view\", None)\n                else self.configs.get(\"table_or_view\", None)\n            ),\n            where_exp,\n            zorder_exp,\n        )\n\n        self._logger.info(f\"sql command: {optimize_stmt}\")\n        output = ExecEnv.SESSION.sql(optimize_stmt)\n        self._logger.info(output)\n\n    def execute_multiple_sql_files(self) -> None:\n        \"\"\"Execute multiple statements in multiple sql files.\n\n        In this function the path to the files is separated by comma.\n        \"\"\"\n        for table_metadata_file in self.configs[\"path\"].split(\",\"):\n            disable_dbfs_retry = (\n                self.configs[\"disable_dbfs_retry\"]\n                if \"disable_dbfs_retry\" in self.configs.keys()\n                else False\n            )\n            sql = ConfigUtils.read_sql(table_metadata_file.strip(), disable_dbfs_retry)\n            sql_commands = SQLParserUtils().split_sql_commands(\n                sql_commands=sql,\n                delimiter=self.configs.get(\"delimiter\", \";\"),\n                advanced_parser=self.configs.get(\"advanced_parser\", False),\n            )\n            for command in sql_commands:\n                if command.strip():\n                    self._logger.info(f\"sql command: {command}\")\n                    ExecEnv.SESSION.sql(command)\n            self._logger.info(\"sql file successfully executed!\")\n\n    def execute_sql(self) -> None:\n        \"\"\"Execute sql commands separated by semicolon (;).\"\"\"\n        sql_commands = SQLParserUtils().split_sql_commands(\n            sql_commands=self.configs.get(\"sql\"),\n            delimiter=self.configs.get(\"delimiter\", \";\"),\n            advanced_parser=self.configs.get(\"advanced_parser\", False),\n        )\n        for command in sql_commands:\n            if command.strip():\n                self._logger.info(f\"sql command: {command}\")\n                ExecEnv.SESSION.sql(command)\n        self._logger.info(\"sql successfully executed!\")\n\n    def show_tbl_properties(self) -> DataFrame:\n        \"\"\"Show Table Properties.\n\n        Returns:\n            A dataframe with the table properties.\n        \"\"\"\n        show_tbl_props_stmt = \"{} {}\".format(\n            SQLDefinitions.show_tbl_props_stmt.value,\n            self.configs[\"table_or_view\"],\n        )\n\n        self._logger.info(f\"sql command: {show_tbl_props_stmt}\")\n        output = ExecEnv.SESSION.sql(show_tbl_props_stmt)\n        self._logger.info(output)\n        return output\n\n    def get_tbl_pk(self) -> List[str]:\n        \"\"\"Get the primary key of a particular table.\n\n        Returns:\n            The list of columns that are part of the primary key.\n        \"\"\"\n        output: List[str] = (\n            self.show_tbl_properties()\n            .filter(\"key == 'lakehouse.primary_key'\")\n            .select(\"value\")\n            .withColumn(\"value\", translate(\"value\", \" `\", \"\"))\n            .first()[0]\n            .split(\",\")\n        )\n        self._logger.info(output)\n\n        return output\n\n    def repair_table(self) -> None:\n        \"\"\"Run the repair table command.\"\"\"\n        table_name = self.configs[\"table_or_view\"]\n        sync_metadata = self.configs[\"sync_metadata\"]\n\n        repair_stmt = (\n            f\"MSCK REPAIR TABLE {table_name} \"\n            f\"{'SYNC METADATA' if sync_metadata else ''}\"\n        )\n\n    
    self._logger.info(f\"sql command: {repair_stmt}\")\n        output = ExecEnv.SESSION.sql(repair_stmt)\n        self._logger.info(output)\n\n    def delete_where(self) -> None:\n        \"\"\"Run the delete where command.\"\"\"\n        table_name = self.configs[\"table_or_view\"]\n        delete_where = self.configs[\"where_clause\"].strip()\n\n        delete_stmt = SQLDefinitions.delete_where_stmt.value.format(\n            table_name, delete_where\n        )\n\n        self._logger.info(f\"sql command: {delete_stmt}\")\n        output = ExecEnv.SESSION.sql(delete_stmt)\n        self._logger.info(output)\n"
  },
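  {
    "path": "examples/table_manager_optimize_sketch.py",
    "content": "\"\"\"Hypothetical sketch of driving TableManager directly (not part of the library).\n\nA minimal sketch, assuming ExecEnv.SESSION already holds a Spark session and\nthat the referenced Delta table exists; database, table and column names are\nplaceholders.\n\"\"\"\n\nfrom lakehouse_engine.core.table_manager import TableManager\n\nconfigs = {\n    \"function\": \"optimize\",  # one of the operations exposed by get_function()\n    \"table_or_view\": \"my_database.my_table\",  # placeholder table name\n    \"where_clause\": \"year = 2024\",  # optional filter for the OPTIMIZE statement\n    \"optimize_zorder_col_list\": \"customer_id\",  # optional ZORDER column(s)\n}\n\n# get_function() looks up the requested operation and executes it.\nTableManager(configs).get_function()\n"
  },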
  {
    "path": "lakehouse_engine/dq_processors/__init__.py",
    "content": "\"\"\"Package to define data quality processes available in the lakehouse engine.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/custom_expectations/__init__.py",
    "content": "\"\"\"Package containing custom DQ expectations available in the lakehouse engine.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b.py",
    "content": "\"\"\"Expectation to check if column 'a' is not equal to column 'b'.\"\"\"\n\nfrom typing import Any, Dict, Optional\n\nfrom great_expectations.execution_engine import ExecutionEngine, SparkDFExecutionEngine\nfrom great_expectations.expectations.expectation import ColumnPairMapExpectation\nfrom great_expectations.expectations.metrics.map_metric_provider import (\n    ColumnPairMapMetricProvider,\n    column_pair_condition_partial,\n)\n\nfrom lakehouse_engine.utils.expectations_utils import validate_result\n\n\nclass ColumnPairCustom(ColumnPairMapMetricProvider):\n    \"\"\"Asserts that column 'A' is not equal to column 'B'.\n\n    Additionally, It compares Null as well.\n    \"\"\"\n\n    condition_metric_name = \"column_pair_values.a_not_equal_to_b\"\n    condition_domain_keys = (\n        \"batch_id\",\n        \"table\",\n        \"column_A\",\n        \"column_B\",\n        \"ignore_row_if\",\n    )\n    condition_value_keys = ()\n\n    @column_pair_condition_partial(engine=SparkDFExecutionEngine)\n    def _spark(\n        self: ColumnPairMapMetricProvider,\n        column_A: Any,\n        column_B: Any,\n        **kwargs: dict,\n    ) -> Any:\n        \"\"\"Implementation of the expectation's logic.\n\n        Args:\n            column_A: Value of the row of column_A.\n            column_B: Value of the row of column_B.\n            kwargs: dict with additional parameters.\n\n        Returns:\n            If the condition is met.\n        \"\"\"\n        return ((column_A.isNotNull()) | (column_B.isNotNull())) & (\n            column_A != column_B\n        )  # noqa: E501\n\n\nclass ExpectColumnPairAToBeNotEqualToB(ColumnPairMapExpectation):\n    \"\"\"Expect values in column A to be not equal to column B.\n\n    Args:\n        column_A: The first column name.\n        column_B: The second column name.\n\n    Keyword Args:\n        allow_cross_type_comparisons: If True, allow\n            comparisons between types (e.g. integer and string).\n            Otherwise, attempting such comparisons will raise an exception.\n        ignore_row_if: \"both_values_are_missing\",\n            \"either_value_is_missing\", \"neither\" (default).\n        result_format: Which output mode to use:\n            `BOOLEAN_ONLY`, `BASIC` (default), `COMPLETE`, or `SUMMARY`.\n        include_config: If True (default), then include the expectation config\n            as part of the result object.\n        catch_exceptions: If True, then catch exceptions and\n            include them as part of the result object. 
Default: False.\n        meta: A JSON-serializable dictionary (nesting allowed)\n            that will be included in the output without modification.\n\n    Returns:\n        An ExpectationSuiteValidationResult.\n    \"\"\"\n\n    mostly: float = 1.0\n    ignore_row_if: str = \"neither\"\n    result_format: dict = {\"result_format\": \"BASIC\"}\n    include_config: bool = True\n    catch_exceptions: bool = False\n    column_A: Any = None\n    column_B: Any = None\n\n    examples = [\n        {\n            \"dataset_name\": \"Test Dataset\",\n            \"data\": [\n                {\n                    \"data\": {\n                        \"a\": [\"IE4019\", \"IM6092\", \"IE1405\"],\n                        \"b\": [\"IE4019\", \"IM6092\", \"IE1405\"],\n                        \"c\": [\"IE1404\", \"IN6192\", \"842075\"],\n                    },\n                    \"schemas\": {\n                        \"spark\": {\n                            \"a\": \"StringType\",\n                            \"b\": \"StringType\",\n                            \"c\": \"StringType\",\n                        }\n                    },\n                }\n            ],\n            \"tests\": [\n                {\n                    \"title\": \"negative_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_A\": \"a\",\n                        \"column_B\": \"b\",\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"b\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": False,\n                        \"unexpected_index_list\": [\n                            {\"b\": \"IE4019\", \"a\": \"IE4019\"},\n                            {\"b\": \"IM6092\", \"a\": \"IM6092\"},\n                            {\"b\": \"IE1405\", \"a\": \"IE1405\"},\n                        ],\n                    },\n                },\n                {\n                    \"title\": \"positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_A\": \"a\",\n                        \"column_B\": \"c\",\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"a\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": True,\n                        \"unexpected_index_list\": [],\n                    },\n                },\n            ],\n        },\n    ]\n\n    map_metric = \"column_pair_values.a_not_equal_to_b\"\n    success_keys = (\n        \"column_A\",\n        \"column_B\",\n        \"ignore_row_if\",\n        \"mostly\",\n    )\n\n    def _validate(\n        self,\n        metrics: Dict,\n        runtime_configuration: Optional[dict] = None,\n        execution_engine: Optional[ExecutionEngine] = None,\n    ) -> Any:\n        \"\"\"Custom implementation of the GE _validate method.\n\n        This method is used on the tests to validate both the result\n        of the tests themselves and if the unexpected index list\n        is correctly generated.\n        The GE test logic does not do this validation, and thus\n  
      we need to make it manually.\n\n        Args:\n            metrics: Test result metrics.\n            runtime_configuration: Configuration used when running the expectation.\n            execution_engine: Execution Engine where the expectation was run.\n\n        Returns:\n            Dictionary with the result of the validation.\n        \"\"\"\n        validate_result(\n            self,\n            metrics,\n        )\n\n        return super()._validate(metrics, runtime_configuration, execution_engine)\n\n\n\"\"\"Mandatory block of code. If it is removed the expectation will not be available.\"\"\"\nif __name__ == \"__main__\":\n    # test the custom expectation with the function `print_diagnostic_checklist()`\n    ExpectColumnPairAToBeNotEqualToB().print_diagnostic_checklist()\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b.py",
    "content": "\"\"\"Expectation to check if column 'a' is lower or equal than column 'b'.\"\"\"\n\nfrom typing import Any, Dict, Optional\n\nfrom great_expectations.execution_engine import ExecutionEngine, SparkDFExecutionEngine\nfrom great_expectations.expectations.expectation import ColumnPairMapExpectation\nfrom great_expectations.expectations.metrics.map_metric_provider import (\n    ColumnPairMapMetricProvider,\n    column_pair_condition_partial,\n)\n\nfrom lakehouse_engine.utils.expectations_utils import validate_result\n\n\nclass ColumnPairCustom(ColumnPairMapMetricProvider):\n    \"\"\"Asserts that column 'A' is lower or equal than column 'B'.\n\n    Additionally, the 'margin' parameter can be used to add a margin to the\n    check between column 'A' and 'B': 'A' <= 'B' + 'margin'.\n    \"\"\"\n\n    condition_metric_name = \"column_pair_values.a_smaller_or_equal_than_b\"\n    condition_domain_keys = (\n        \"batch_id\",\n        \"table\",\n        \"column_A\",\n        \"column_B\",\n        \"ignore_row_if\",\n    )\n    condition_value_keys = (\"margin\",)\n\n    @column_pair_condition_partial(engine=SparkDFExecutionEngine)\n    def _spark(\n        self: ColumnPairMapMetricProvider,\n        column_A: Any,\n        column_B: Any,\n        **kwargs: dict,\n    ) -> Any:\n        \"\"\"Implementation of the expectation's logic.\n\n        Args:\n            column_A: Value of the row of column_A.\n            column_B: Value of the row of column_B.\n            kwargs: dict with additional parameters.\n\n        Returns:\n            If the condition is met.\n        \"\"\"\n        margin = kwargs.get(\"margin\") or None\n        if margin is None:\n            approx = 0\n        elif not isinstance(margin, (int, float, complex)):\n            raise TypeError(\n                f\"margin must be one of int, float, complex.\"\n                f\" Found: {margin} as {type(margin)}\"\n            )\n        else:\n            approx = margin  # type: ignore\n\n        return column_A <= column_B + approx  # type: ignore\n\n\nclass ExpectColumnPairAToBeSmallerOrEqualThanB(ColumnPairMapExpectation):\n    \"\"\"Expect values in column A to be lower or equal than column B.\n\n    Args:\n        column_A: The first column name.\n        column_B: The second column name.\n        margin: additional approximation to column B value.\n\n    Keyword Args:\n        allow_cross_type_comparisons: If True, allow\n            comparisons between types (e.g. integer and string).\n            Otherwise, attempting such comparisons will raise an exception.\n        ignore_row_if: \"both_values_are_missing\",\n            \"either_value_is_missing\", \"neither\" (default).\n        result_format: Which output mode to use:\n            `BOOLEAN_ONLY`, `BASIC` (default), `COMPLETE`, or `SUMMARY`.\n        include_config: If True (default), then include the expectation config\n            as part of the result object.\n        catch_exceptions: If True, then catch exceptions and\n            include them as part of the result object. 
Default: False.\n        meta: A JSON-serializable dictionary (nesting allowed)\n            that will be included in the output without modification.\n\n    Returns:\n        An ExpectationSuiteValidationResult.\n    \"\"\"\n\n    mostly: float = 1.0\n    ignore_row_if: str = \"neither\"\n    result_format: dict = {\"result_format\": \"BASIC\"}\n    include_config: bool = True\n    catch_exceptions: bool = False\n    margin: Any = None\n    column_A: Any = None\n    column_B: Any = None\n\n    examples = [\n        {\n            \"dataset_name\": \"Test Dataset\",\n            \"data\": [\n                {\n                    \"data\": {\n                        \"a\": [11, 22, 50],\n                        \"b\": [10, 21, 100],\n                        \"c\": [9, 21, 30],\n                    },\n                    \"schemas\": {\n                        \"spark\": {\n                            \"a\": \"IntegerType\",\n                            \"b\": \"IntegerType\",\n                            \"c\": \"IntegerType\",\n                        }\n                    },\n                }\n            ],\n            \"tests\": [\n                {\n                    \"title\": \"negative_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_A\": \"a\",\n                        \"column_B\": \"c\",\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"c\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": False,\n                        \"unexpected_index_list\": [\n                            {\"c\": 9, \"a\": 11},\n                            {\"c\": 21, \"a\": 22},\n                            {\"c\": 30, \"a\": 50},\n                        ],\n                    },\n                },\n                {\n                    \"title\": \"positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_A\": \"a\",\n                        \"column_B\": \"b\",\n                        \"margin\": 1,\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"a\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": True,\n                        \"unexpected_index_list\": [],\n                    },\n                },\n            ],\n        },\n    ]\n\n    map_metric = \"column_pair_values.a_smaller_or_equal_than_b\"\n    success_keys = (\n        \"column_A\",\n        \"column_B\",\n        \"ignore_row_if\",\n        \"margin\",\n        \"mostly\",\n    )\n\n    def _validate(\n        self,\n        metrics: Dict,\n        runtime_configuration: Optional[dict] = None,\n        execution_engine: Optional[ExecutionEngine] = None,\n    ) -> Any:\n        \"\"\"Custom implementation of the GE _validate method.\n\n        This method is used on the tests to validate both the result\n        of the tests themselves and if the unexpected index list\n        is correctly generated.\n        The GE test logic does not do this validation, and thus\n        we need to make 
it manually.\n\n        Args:\n            metrics: Test result metrics.\n            runtime_configuration: Configuration used when running the expectation.\n            execution_engine: Execution Engine where the expectation was run.\n\n        Returns:\n            Dictionary with the result of the validation.\n        \"\"\"\n        validate_result(\n            self,\n            metrics,\n        )\n\n        return super()._validate(metrics, runtime_configuration, execution_engine)\n\n\n\"\"\"Mandatory block of code. If it is removed the expectation will not be available.\"\"\"\nif __name__ == \"__main__\":\n    # test the custom expectation with the function `print_diagnostic_checklist()`\n    ExpectColumnPairAToBeSmallerOrEqualThanB().print_diagnostic_checklist()\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b.py",
    "content": "\"\"\"Expectation to check if date column 'a' is greater or equal to date column 'b'.\"\"\"\n\nimport datetime\nfrom typing import Any, Dict, Optional\n\nfrom great_expectations.execution_engine import ExecutionEngine, SparkDFExecutionEngine\nfrom great_expectations.expectations.expectation import ColumnPairMapExpectation\nfrom great_expectations.expectations.metrics.map_metric_provider import (\n    ColumnPairMapMetricProvider,\n    column_pair_condition_partial,\n)\n\nfrom lakehouse_engine.utils.expectations_utils import validate_result\n\n\n# This class defines a Metric to support your Expectation\nclass ColumnPairDateAToBeGreaterOrEqualToDateB(ColumnPairMapMetricProvider):\n    \"\"\"Asserts that date column 'A' is greater or equal to date column 'B'.\"\"\"\n\n    # This is the id string that will be used to refer your metric.\n    condition_metric_name = \"column_pair_values.date_a_greater_or_equal_to_date_b\"\n    condition_domain_keys = (\n        \"batch_id\",\n        \"table\",\n        \"column_A\",\n        \"column_B\",\n        \"ignore_row_if\",\n    )\n\n    @column_pair_condition_partial(engine=SparkDFExecutionEngine)\n    def _spark(\n        self: ColumnPairMapMetricProvider,\n        column_A: Any,\n        column_B: Any,\n        **kwargs: dict,\n    ) -> Any:\n        \"\"\"Implementation of the expectation's logic.\n\n        Args:\n            column_A: Value of the row of column_A.\n            column_B: Value of the row of column_B.\n            kwargs: dict with additional parameters.\n\n        Returns:\n            Boolean on the basis of condition.\n        \"\"\"\n        return (\n            (column_A.isNotNull()) & (column_B.isNotNull()) & (column_A >= column_B)\n        )  # type: ignore\n\n\nclass ExpectColumnPairDateAToBeGreaterThanOrEqualToDateB(ColumnPairMapExpectation):\n    \"\"\"Expect values in date column A to be greater than or equal to date column B.\n\n    Args:\n        column_A: The first date column name.\n        column_B: The second date column name.\n\n    Keyword Args:\n        ignore_row_if: \"both_values_are_missing\",\n            \"either_value_is_missing\", \"neither\" (default).\n        result_format: Which output mode to use:\n            `BOOLEAN_ONLY`, `BASIC` (default), `COMPLETE`, or `SUMMARY`.\n        include_config: If True (default), then include the\n            expectation config as part of the result object.\n        catch_exceptions: If True, then catch exceptions and\n            include them as part of the result object. 
Default: False.\n        meta: A JSON-serializable dictionary (nesting allowed)\n            that will be included in the output without modification.\n\n    Returns:\n        An ExpectationSuiteValidationResult.\n    \"\"\"\n\n    mostly: float = 1.0\n    ignore_row_if: str = \"neither\"\n    result_format: dict = {\"result_format\": \"BASIC\"}\n    include_config: bool = True\n    catch_exceptions: bool = True\n    column_A: Any = None\n    column_B: Any = None\n\n    examples = [\n        {\n            \"dataset_name\": \"Test Dataset\",\n            \"data\": [\n                {\n                    \"data\": {\n                        \"a\": [\n                            \"2029-01-12\",\n                            \"2024-11-21\",\n                            \"2022-01-01\",\n                        ],\n                        \"b\": [\n                            \"2019-02-11\",\n                            \"2014-12-22\",\n                            \"2012-09-09\",\n                        ],\n                        \"c\": [\n                            \"2010-02-11\",\n                            \"2015-12-22\",\n                            \"2022-09-09\",\n                        ],\n                    },\n                    \"schemas\": {\n                        \"spark\": {\n                            \"a\": \"DateType\",\n                            \"b\": \"DateType\",\n                            \"c\": \"DateType\",\n                        }\n                    },\n                }\n            ],\n            \"tests\": [\n                {\n                    \"title\": \"positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_A\": \"a\",\n                        \"column_B\": \"b\",\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"a\", \"b\"],\n                        },\n                    },\n                    \"out\": {\"success\": True, \"unexpected_index_list\": []},\n                },\n                {\n                    \"title\": \"negative_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_A\": \"b\",\n                        \"column_B\": \"c\",\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"a\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": False,\n                        \"unexpected_index_list\": [\n                            {\n                                \"a\": datetime.date(2024, 11, 21),\n                                \"b\": datetime.date(2014, 12, 22),\n                                \"c\": datetime.date(2015, 12, 22),\n                            },\n                            {\n                                \"a\": datetime.date(2022, 1, 1),\n                                \"b\": datetime.date(2012, 9, 9),\n                                \"c\": datetime.date(2022, 9, 9),\n                            },\n                        ],\n                    },\n                },\n            ],\n        }\n    ]\n\n    map_metric = 
\"column_pair_values.date_a_greater_or_equal_to_date_b\"\n    success_keys = (\n        \"column_A\",\n        \"column_B\",\n        \"ignore_row_if\",\n        \"mostly\",\n    )\n\n    def _validate(\n        self,\n        metrics: Dict,\n        runtime_configuration: Optional[dict] = None,\n        execution_engine: Optional[ExecutionEngine] = None,\n    ) -> Any:\n        \"\"\"Custom implementation of the GE _validate method.\n\n        This method is used on the tests to validate both the result\n        of the tests themselves and if the unexpected index list\n        is correctly generated.\n        The GE test logic does not do this validation, and thus\n        we need to make it manually.\n\n        Args:\n            metrics: Test result metrics.\n            runtime_configuration: Configuration used when running the expectation.\n            execution_engine: Execution Engine where the expectation was run.\n\n        Returns:\n            Dictionary with the result of the validation.\n        \"\"\"\n        validate_result(\n            self,\n            metrics,\n        )\n\n        return super()._validate(metrics, runtime_configuration, execution_engine)\n\n\n\"\"\"Mandatory block of code. If it is removed the expectation will not be available.\"\"\"\nif __name__ == \"__main__\":\n    # test the custom expectation with the function `print_diagnostic_checklist()`\n    ExpectColumnPairDateAToBeGreaterThanOrEqualToDateB().print_diagnostic_checklist()\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/custom_expectations/expect_column_values_to_be_date_not_older_than.py",
    "content": "\"\"\"Expectation to check if column value is a date within a timeframe.\"\"\"\n\nimport datetime\nfrom datetime import timedelta\nfrom typing import Any, Dict, Optional\n\nfrom great_expectations.execution_engine import ExecutionEngine, SparkDFExecutionEngine\nfrom great_expectations.expectations.expectation import ColumnMapExpectation\nfrom great_expectations.expectations.metrics import ColumnMapMetricProvider\nfrom great_expectations.expectations.metrics.map_metric_provider import (\n    column_condition_partial,\n)\n\nfrom lakehouse_engine.utils.expectations_utils import validate_result\n\n\nclass ColumnValuesDateNotOlderThan(ColumnMapMetricProvider):\n    \"\"\"Asserts that column values are a date that isn't older than a given date.\"\"\"\n\n    condition_metric_name = \"column_values.date_is_not_older_than\"\n    condition_domain_keys = (\n        \"batch_id\",\n        \"table\",\n        \"column\",\n        \"ignore_row_if\",\n    )  # type: ignore\n    condition_value_keys = (\"timeframe\",)\n\n    @column_condition_partial(engine=SparkDFExecutionEngine)\n    def _spark(\n        self: ColumnMapMetricProvider,\n        column: Any,\n        **kwargs: dict,\n    ) -> Any:\n        \"\"\"Implementation of the expectation's logic.\n\n        Since timedelta can only define an interval up to weeks, a month is defined\n        as 4 weeks and a year is defined as 52 weeks.\n\n        Args:\n            column: Name of column to validate.\n            kwargs: dict with additional parameters.\n\n        Returns:\n            If the condition is met.\n        \"\"\"\n        timeframe = kwargs.get(\"timeframe\") or None\n        weeks = (\n            timeframe.get(\"weeks\", 0)\n            + (timeframe.get(\"months\", 0) * 4)\n            + (timeframe.get(\"years\", 0) * 52)\n        )\n\n        delta = timedelta(\n            days=timeframe.get(\"days\", 0),\n            seconds=timeframe.get(\"seconds\", 0),\n            microseconds=timeframe.get(\"microseconds\", 0),\n            milliseconds=timeframe.get(\"milliseconds\", 0),\n            minutes=timeframe.get(\"minutes\", 0),\n            hours=timeframe.get(\"hours\", 0),\n            weeks=weeks,\n        )\n\n        return delta > (datetime.datetime.now() - column)\n\n\nclass ExpectColumnValuesToBeDateNotOlderThan(ColumnMapExpectation):\n    \"\"\"Expect value in column to be date that is not older than a given time.\n\n    Since timedelta can only define an interval up to weeks, a month is defined\n    as 4 weeks and a year is defined as 52 weeks.\n\n    Args:\n        column: Name of column to validate\n        Note: Column must be of type Date, Timestamp or String (with Timestamp format).\n            Format: yyyy-MM-ddTHH:mm:ss\n        timeframe: dict with the definition of the timeframe.\n        kwargs: dict with additional parameters.\n\n    Keyword Args:\n        allow_cross_type_comparisons: If True, allow\n            comparisons between types (e.g. 
integer and string).\n            Otherwise, attempting such comparisons will raise an exception.\n        ignore_row_if: \"both_values_are_missing\",\n            \"either_value_is_missing\", \"neither\" (default).\n        result_format: Which output mode to use:\n            `BOOLEAN_ONLY`, `BASIC` (default), `COMPLETE`, or `SUMMARY`.\n        include_config: If True (default), then include the expectation config\n            as part of the result object.\n        catch_exceptions: If True, then catch exceptions and\n            include them as part of the result object. Default: False.\n        meta: A JSON-serializable dictionary (nesting allowed)\n            that will be included in the output without modification.\n\n    Returns:\n        An ExpectationSuiteValidationResult.\n    \"\"\"\n\n    mostly: float = 1.0\n    ignore_row_if: str = \"neither\"\n    result_format: dict = {\"result_format\": \"BASIC\"}\n    include_config: bool = True\n    catch_exceptions: bool = False\n    timeframe: Any = {}\n    column: Any = None\n\n    examples = [\n        {\n            \"dataset_name\": \"Test Dataset\",\n            \"data\": [\n                {\n                    \"data\": {\n                        \"a\": [\n                            datetime.datetime(2023, 6, 1, 12, 0, 0),\n                            datetime.datetime(2023, 6, 2, 12, 0, 0),\n                            datetime.datetime(2023, 6, 3, 12, 0, 0),\n                        ],\n                        \"b\": [\n                            datetime.datetime(1800, 6, 1, 12, 0, 0),\n                            datetime.datetime(2023, 6, 2, 12, 0, 0),\n                            datetime.datetime(1800, 6, 3, 12, 0, 0),\n                        ],\n                    }\n                }\n            ],\n            \"schemas\": {\"spark\": {\"a\": \"TimestampType\", \"b\": \"TimestampType\"}},\n            \"tests\": [\n                {\n                    \"title\": \"positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column\": \"a\",\n                        \"timeframe\": {\"years\": 100},\n                        \"result_format\": {\n                            \"result_format\": \"BASIC\",\n                            \"unexpected_index_column_names\": [\"b\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": True,\n                        \"unexpected_index_list\": [],\n                    },\n                },\n                {\n                    \"title\": \"negative_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column\": \"b\",\n                        \"timeframe\": {\"years\": 100},\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"a\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": False,\n                        \"unexpected_index_list\": [\n                            {\n                                \"a\": datetime.datetime(2023, 6, 1, 12, 0),\n                                \"b\": datetime.datetime(1800, 6, 1, 12, 0),\n                            },\n                            {\n   
                             \"a\": datetime.datetime(2023, 6, 3, 12, 0),\n                                \"b\": datetime.datetime(1800, 6, 3, 12, 0),\n                            },\n                        ],\n                    },\n                },\n            ],\n        },\n    ]\n\n    map_metric = \"column_values.date_is_not_older_than\"\n    success_keys = (\"column\", \"ignore_row_if\", \"timeframe\", \"mostly\")\n\n    def _validate(\n        self,\n        metrics: Dict,\n        runtime_configuration: Optional[dict] = None,\n        execution_engine: Optional[ExecutionEngine] = None,\n    ) -> Any:\n        \"\"\"Custom implementation of the GE _validate method.\n\n        This method is used on the tests to validate both the result\n        of the tests themselves and if the unexpected index list\n        is correctly generated.\n        The GE test logic does not do this validation, and thus\n        we need to make it manually.\n\n        Args:\n            metrics: Test result metrics.\n            runtime_configuration: Configuration used when running the expectation.\n            execution_engine: Execution Engine where the expectation was run.\n\n        Returns:\n            Dictionary with the result of the validation.\n        \"\"\"\n        validate_result(\n            self,\n            metrics,\n        )\n\n        return super()._validate(metrics, runtime_configuration, execution_engine)\n\n\n\"\"\"Mandatory block of code. If it is removed the expectation will not be available.\"\"\"\nif __name__ == \"__main__\":\n    # test the custom expectation with the function `print_diagnostic_checklist()`\n    ExpectColumnValuesToBeDateNotOlderThan().print_diagnostic_checklist()\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/custom_expectations/expect_column_values_to_not_be_null_or_empty_string.py",
    "content": "\"\"\"Expectation to check if column value is not null or empty string.\"\"\"\n\nfrom typing import Any, Dict, Optional\n\nfrom great_expectations.execution_engine import ExecutionEngine, SparkDFExecutionEngine\nfrom great_expectations.expectations.expectation import ColumnMapExpectation\nfrom great_expectations.expectations.metrics import ColumnMapMetricProvider\nfrom great_expectations.expectations.metrics.map_metric_provider import (\n    column_condition_partial,\n)\n\nfrom lakehouse_engine.utils.expectations_utils import validate_result\n\n\nclass ColumnValuesNotNullOrEpmtyString(ColumnMapMetricProvider):\n    \"\"\"Asserts that column values are not null or empty string.\"\"\"\n\n    condition_metric_name = \"column_values.not_null_or_empty_string\"\n    filter_column_isnull = False\n    condition_domain_keys = (\n        \"batch_id\",\n        \"table\",\n        \"column\",\n        \"ignore_row_if\",\n    )  # type: ignore\n    condition_value_keys = ()\n\n    @column_condition_partial(engine=SparkDFExecutionEngine)\n    def _spark(\n        self: ColumnMapMetricProvider,\n        column: Any,\n        **kwargs: dict,\n    ) -> Any:\n        \"\"\"Implementation of the expectation's logic.\n\n        Args:\n            column: Name of column to validate.\n            kwargs: dict with additional parameters.\n\n        Returns:\n            If the condition is met.\n        \"\"\"\n        return (column.isNotNull()) & (column != \"\")\n\n\nclass ExpectColumnValuesToNotBeNullOrEmptyString(ColumnMapExpectation):\n    \"\"\"Expect value in column to be not null or empty string.\n\n    Args:\n        column: Name of column to validate.\n        kwargs: dict with additional parameters.\n\n    Keyword Args:\n        allow_cross_type_comparisons: If True, allow\n            comparisons between types (e.g. integer and string).\n            Otherwise, attempting such comparisons will raise an exception.\n        ignore_row_if: \"both_values_are_missing\",\n            \"either_value_is_missing\", \"neither\" (default).\n        result_format: Which output mode to use:\n            `BOOLEAN_ONLY`, `BASIC` (default), `COMPLETE`, or `SUMMARY`.\n        include_config: If True (default), then include the expectation config\n            as part of the result object.\n        catch_exceptions: If True, then catch exceptions and\n            include them as part of the result object. 
Default: False.\n        meta: A JSON-serializable dictionary (nesting allowed)\n            that will be included in the output without modification.\n\n    Returns:\n        An ExpectationSuiteValidationResult.\n    \"\"\"\n\n    mostly: float = 1.0\n    ignore_row_if: str = \"neither\"\n    result_format: dict = {\"result_format\": \"BASIC\"}\n    include_config: bool = True\n    catch_exceptions: bool = False\n    column: Any = None\n\n    examples = [\n        {\n            \"dataset_name\": \"Test Dataset\",\n            \"data\": [\n                {\n                    \"data\": {\n                        \"a\": [\n                            \"4061622965678\",\n                            \"4061622965679\",\n                            \"4061622965680\",\n                        ],\n                        \"b\": [\n                            \"4061622965678\",\n                            \"\",\n                            \"4061622965680\",\n                        ],\n                    }\n                }\n            ],\n            \"schemas\": {\"spark\": {\"a\": \"StringType\", \"b\": \"StringType\"}},\n            \"tests\": [\n                {\n                    \"title\": \"positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column\": \"a\",\n                        \"result_format\": {\n                            \"result_format\": \"BASIC\",\n                            \"unexpected_index_column_names\": [\"b\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": True,\n                        \"unexpected_index_list\": [],\n                    },\n                },\n                {\n                    \"title\": \"negative_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column\": \"b\",\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"a\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": False,\n                        \"unexpected_index_list\": [\n                            {\n                                \"a\": \"4061622965679\",\n                                \"b\": \"\",\n                            }\n                        ],\n                    },\n                },\n            ],\n        },\n    ]\n\n    map_metric = \"column_values.not_null_or_empty_string\"\n    success_keys = (\"column\", \"ignore_row_if\", \"mostly\")\n\n    def _validate(\n        self,\n        metrics: Dict,\n        runtime_configuration: Optional[dict] = None,\n        execution_engine: Optional[ExecutionEngine] = None,\n    ) -> Any:\n        \"\"\"Custom implementation of the GE _validate method.\n\n        This method is used on the tests to validate both the result\n        of the tests themselves and if the unexpected index list\n        is correctly generated.\n        The GE test logic does not do this validation, and thus\n        we need to make it manually.\n\n        Args:\n            metrics: Test result metrics.\n            runtime_configuration: Configuration used when running the expectation.\n            execution_engine: Execution Engine 
where the expectation was run.\n\n        Returns:\n            Dictionary with the result of the validation.\n        \"\"\"\n        validate_result(\n            self,\n            metrics,\n        )\n\n        return super()._validate(metrics, runtime_configuration, execution_engine)\n\n\n\"\"\"Mandatory block of code. If it is removed the expectation will not be available.\"\"\"\nif __name__ == \"__main__\":\n    # test the custom expectation with the function `print_diagnostic_checklist()`\n    ExpectColumnValuesToNotBeNullOrEmptyString().print_diagnostic_checklist()\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c.py",
    "content": "\"\"\"Expectation to check if column 'a' equals 'b', or 'c'.\"\"\"\n\nfrom typing import Any, Dict, Literal, Optional\n\nfrom great_expectations.execution_engine import ExecutionEngine, SparkDFExecutionEngine\nfrom great_expectations.expectations.expectation import MulticolumnMapExpectation\nfrom great_expectations.expectations.metrics.map_metric_provider import (\n    MulticolumnMapMetricProvider,\n    multicolumn_condition_partial,\n)\n\nfrom lakehouse_engine.utils.expectations_utils import validate_result\n\n\nclass MulticolumnCustomMetric(MulticolumnMapMetricProvider):\n    \"\"\"Expectation metric definition.\n\n    This expectation asserts that column 'a' must equal to column 'b' or column 'c'.\n    In addition to this it is possible to validate that column 'b' or 'c' match a regex.\n    \"\"\"\n\n    condition_metric_name = \"multicolumn_values.column_a_must_equal_b_or_c\"\n    condition_domain_keys = (\n        \"batch_id\",\n        \"table\",\n        \"column_list\",\n        \"ignore_row_if\",\n    )\n\n    condition_value_keys = (\"validation_regex_b\", \"validation_regex_c\")\n\n    @multicolumn_condition_partial(engine=SparkDFExecutionEngine)\n    def _spark(\n        self: MulticolumnMapMetricProvider, column_list: list, **kwargs: dict\n    ) -> Any:\n        validation_regex_b = (\n            kwargs.get(\"validation_regex_b\") if \"validation_regex_b\" in kwargs else \".*\"\n        )\n        validation_regex_c = (\n            kwargs.get(\"validation_regex_c\") if \"validation_regex_c\" in kwargs else \".*\"\n        )\n\n        return (column_list[0].isNotNull()) & (\n            (\n                column_list[1].isNotNull()\n                & (column_list[1].rlike(validation_regex_b))\n                & (column_list[0] == column_list[1])\n            )\n            | (\n                (column_list[1].isNull())\n                & (column_list[2].rlike(validation_regex_c))\n                & (column_list[0] == column_list[2])\n            )\n        )\n\n\nclass ExpectMulticolumnColumnAMustEqualBOrC(MulticolumnMapExpectation):\n    \"\"\"Expect that the column 'a' is equal to 'b' when this is not empty; otherwise 'a' must be equal to 'c'.\n\n    Args:\n        column_list: The column names to evaluate.\n\n    Keyword Args:\n        ignore_row_if: default to \"never\".\n        result_format:  Which output mode to use:\n            `BOOLEAN_ONLY`, `BASIC`, `COMPLETE`, or `SUMMARY`.\n            Default set to `BASIC`.\n        include_config: If True, then include the expectation\n            config as part of the result object.\n            Default set to True.\n        catch_exceptions: If True, then catch exceptions\n            and include them as part of the result object.\n            Default set to False.\n\n    Returns:\n        An ExpectationSuiteValidationResult.\n    \"\"\"  # noqa: E501\n\n    ignore_row_if: Literal[\n        \"all_values_are_missing\", \"any_value_is_missing\", \"never\"\n    ] = \"never\"\n    result_format: dict = {\"result_format\": \"BASIC\"}\n    include_config: bool = True\n    catch_exceptions: bool = False\n    mostly: float = 1.0\n    column_list: Any = None\n    validation_regex_c: Any = None\n\n    examples = [\n        {\n            \"dataset_name\": \"Test Dataset\",\n            \"data\": [\n                {\n                    \"data\": {\n                        \"a\": [\"d001\", \"1000\", \"1001\"],\n                        \"b\": [None, \"1000\", \"1001\"],\n                        \"c\": [\"d001\", 
\"d002\", \"d002\"],\n                        \"d\": [\"d001\", \"d002\", \"1001\"],\n                    },\n                    \"schemas\": {\n                        \"spark\": {\n                            \"a\": \"StringType\",\n                            \"b\": \"StringType\",\n                            \"c\": \"StringType\",\n                            \"d\": \"StringType\",\n                        }\n                    },\n                }\n            ],\n            \"tests\": [\n                {\n                    \"title\": \"negative_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_list\": [\"d\", \"b\", \"c\"],\n                        \"validation_regex_c\": \"d[0-9]{3}$\",\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"d\", \"b\", \"c\"],\n                        },\n                    },\n                    \"out\": {\n                        \"success\": False,\n                        \"unexpected_index_list\": [\n                            {\n                                \"d\": \"d002\",\n                                \"b\": \"1000\",\n                                \"c\": \"d002\",\n                            }\n                        ],\n                    },\n                },\n                {\n                    \"title\": \"positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_list\": [\"a\", \"b\", \"c\"],\n                        \"validation_regex_c\": \"d[0-9]{3}$\",\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"a\", \"b\", \"c\"],\n                        },\n                    },\n                    \"out\": {\"success\": True},\n                },\n            ],\n        },\n    ]\n\n    map_metric = \"multicolumn_values.column_a_must_equal_b_or_c\"\n    success_keys = (\n        \"validation_regex_b\",\n        \"validation_regex_c\",\n        \"mostly\",\n    )  # type: ignore\n\n    def _validate(\n        self,\n        metrics: Dict,\n        runtime_configuration: Optional[dict] = None,\n        execution_engine: Optional[ExecutionEngine] = None,\n    ) -> Any:\n        \"\"\"Custom implementation of the GE _validate method.\n\n        This method is used on the tests to validate both the result\n        of the tests themselves and if the unexpected index list\n        is correctly generated.\n        The GE test logic does not do this validation, and thus\n        we need to make it manually.\n\n        Args:\n            metrics: Test result metrics.\n            runtime_configuration: Configuration used when running the expectation.\n            execution_engine: Execution Engine where the expectation was run.\n\n        Returns:\n            Dictionary with the result of the validation.\n        \"\"\"\n        validate_result(\n            self,\n            metrics,\n        )\n\n        return super()._validate(metrics, runtime_configuration, execution_engine)\n\n\nif __name__ == \"__main__\":\n    # test the custom expectation with the function `print_diagnostic_checklist()`\n    
ExpectMulticolumnColumnAMustEqualBOrC().print_diagnostic_checklist()\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/custom_expectations/expect_queried_column_agg_value_to_be.py",
    "content": "\"\"\"Expectation to check if aggregated column satisfy the condition.\"\"\"\n\nfrom typing import Any, Dict, Optional\n\nfrom great_expectations.execution_engine import ExecutionEngine\nfrom great_expectations.expectations.expectation import (\n    ExpectationValidationResult,\n    QueryExpectation,\n)\nfrom great_expectations.expectations.expectation_configuration import (\n    ExpectationConfiguration,\n)\n\n\nclass ExpectQueriedColumnAggValueToBe(QueryExpectation):\n    \"\"\"Expect agg of column to satisfy the condition specified.\n\n    Args:\n        template_dict: dict with the following keys:\n            - column (column to check sum).\n            - group_column_list (group by column names to be listed).\n            - condition (how to validate the aggregated value eg: between,\n                greater, lesser).\n            - max_value (maximum allowed value).\n            - min_value (minimum allowed value).\n            - agg_type (sum/count/max/min).\n    \"\"\"\n\n    metric_dependencies = (\"query.template_values\",)\n    query_temp = \"\"\"\n        SELECT {group_column_list}, {agg_type}({column})\n        FROM {batch}\n        GROUP BY {group_column_list}\n    \"\"\"\n\n    include_config: bool = True\n    mostly: float = 1.0\n    result_format: dict = {\"result_format\": \"BASIC\"}\n    catch_exceptions: bool = False\n    meta: Any = None\n    query: str = query_temp\n    template_dict: Any = None\n\n    success_keys = (\"template_dict\", \"query\")\n    condition_domain_keys = (\n        \"query\",\n        \"template_dict\",\n        \"batch_id\",\n        \"row_condition\",\n        \"condition_parser\",\n    )\n\n    def validate_configuration(\n        self, configuration: Optional[ExpectationConfiguration] = None\n    ) -> None:\n        \"\"\"Validates that a configuration has been set.\n\n        Args:\n            configuration (OPTIONAL[ExpectationConfiguration]):\n                An optional Expectation Configuration entry.\n\n        Returns:\n            None. 
Raises InvalidExpectationConfigurationError\n        \"\"\"\n        super().validate_configuration(configuration)\n\n    @staticmethod\n    def _validate_between(\n        x: str, y: int, expected_max_value: int, expected_min_value: int\n    ) -> dict:\n        \"\"\"Method to check whether value satisfy the between condition.\n\n        Args:\n            x: contains key of dict(query_result).\n            y: contains value of dict(query_result).\n            expected_max_value: max value passed.\n            expected_min_value: min value passed.\n\n        Returns:\n            dict with the results after being validated.\n        \"\"\"\n        if expected_min_value <= y <= expected_max_value:\n            return {\n                \"info\": f\"Value is within range\\\n                    {expected_min_value} and {expected_max_value}\",\n                \"success\": True,\n            }\n        else:\n            return {\n                \"success\": False,\n                \"result\": {\n                    \"info\": f\"Value not in range\\\n                        {expected_min_value} and {expected_max_value}\",\n                    \"observed_value\": (x, y),\n                },\n            }\n\n    @staticmethod\n    def _validate_lesser(x: str, y: int, expected_max_value: int) -> dict:\n        \"\"\"Method to check whether value satisfy the less condition.\n\n        Args:\n            x: contains key of dict(query_result).\n            y: contains value of dict(query_result).\n            expected_max_value: max value passed.\n\n        Returns:\n            dict with the results after being validated.\n        \"\"\"\n        if y < expected_max_value:\n            return {\n                \"info\": f\"Value is lesser than {expected_max_value}\",\n                \"success\": True,\n            }\n        else:\n            return {\n                \"success\": False,\n                \"result\": {\n                    \"info\": f\"Value is greater than {expected_max_value}\",\n                    \"observed_value\": (x, y),\n                },\n            }\n\n    @staticmethod\n    def _validate_greater(x: str, y: int, expected_min_value: int) -> dict:\n        \"\"\"Method to check whether value satisfy the greater condition.\n\n        Args:\n            x: contains key of dict(query_result).\n            y: contains value of dict(query_result).\n            expected_min_value: min value passed.\n\n        Returns:\n            dict with the results after being validated.\n        \"\"\"\n        if y > expected_min_value:\n            return {\n                \"info\": f\"Value is greater than {expected_min_value}\",\n                \"success\": True,\n            }\n        else:\n            return {\n                \"success\": False,\n                \"result\": {\n                    \"info\": f\"Value is less than {expected_min_value}\",\n                    \"observed_value\": (x, y),\n                },\n            }\n\n    def _validate_condition(self, query_result: dict, template_dict: dict) -> dict:\n        \"\"\"Method to check whether value satisfy the expected result.\n\n        Args:\n            query_result: contains dict of key and value.\n            template_dict: contains dict of input provided.\n\n        Returns:\n            dict with the results after being validated.\n        \"\"\"\n        result: Dict[Any, Any] = {}\n        for x, y in query_result.items():\n            condition_check = template_dict[\"condition\"]\n            if 
condition_check == \"between\":\n                _max = template_dict[\"max_value\"]\n                _min = template_dict[\"min_value\"]\n                result = self._validate_between(x, y, _max, _min)\n            elif condition_check == \"lesser\":\n                _max = template_dict[\"max_value\"]\n                result = self._validate_lesser(x, y, _max)\n            else:\n                _min = template_dict[\"min_value\"]\n                result = self._validate_greater(x, y, _min)\n\n        return result\n\n    @staticmethod\n    def _generate_dict(query_result: list) -> dict:\n        \"\"\"Generate a dict from a list of dicts and merge the group by columns values.\n\n        Args:\n            query_result: contains list of dict values obtained from query.\n\n        Returns:\n            Dict\n\n        Example:\n            input: [dict_values(['Male', 25, 3500]), dict_values(['Female', 25, 6200]),\n                dict_values(['Female', 20, 3500]), dict_values(['Male', 20, 6900])].\n            output: {'Male|25': 3500, 'Female|25': 6200,\n                'Female|20': 3500, 'Male|20': 6900}.\n        \"\"\"\n        intermediate_list = []\n        final_list = []\n        for i in range(len(query_result)):\n            intermediate_list.append(list(query_result[i]))\n            for element in intermediate_list:\n                if type(element) is list:\n                    output = \"|\".join(map(str, element))\n                    key = \"|\".join(map(str, element[0:-1]))\n                    value = output.replace(key + \"|\", \"\")\n                    final_list.append(key)\n                    final_list.append(value)\n\n        new_result = {\n            final_list[i]: int(final_list[i + 1]) for i in range(0, len(final_list), 2)\n        }\n\n        return new_result\n\n    def _validate(\n        self,\n        metrics: dict,\n        runtime_configuration: Optional[dict] = None,\n        execution_engine: Optional[ExecutionEngine] = None,\n    ) -> ExpectationValidationResult | dict:\n        \"\"\"Implementation of the GE _validate method.\n\n        This method is used on the tests to validate the result\n        of the query output.\n\n        Args:\n            metrics: Test result metrics.\n            runtime_configuration: Configuration used when running the expectation.\n            execution_engine: Execution Engine where the expectation was run.\n\n        Returns:\n            Dictionary with the result of the validation.\n        \"\"\"\n        query_result = metrics.get(\"query.template_values\")\n        query_result = [element.values() for element in query_result]\n        query_result = self._generate_dict(query_result)\n        template_dict = self._validate_template_dict(self)\n        output = self._validate_condition(query_result, template_dict)\n\n        return output\n\n    @staticmethod\n    def _validate_template_dict(self: Any) -> dict:\n        \"\"\"Validate the template dict.\n\n        Returns:\n            Dict. 
Raises TypeError and KeyError\n        \"\"\"\n        template_dict = self.template_dict\n\n        if not isinstance(template_dict, dict):\n            raise TypeError(\"template_dict must be supplied as a dict\")\n\n        if not all(\n            [\n                \"column\" in template_dict,\n                \"group_column_list\" in template_dict,\n                \"agg_type\" in template_dict,\n                \"condition\" in template_dict,\n            ]\n        ):\n            raise KeyError(\n                \"The following keys have to be in the \\\n                    template dict: column, group_column_list, condition, agg_type\"\n            )\n\n        return template_dict\n\n    examples = [\n        {\n            \"dataset_name\": \"Test Dataset\",\n            \"data\": [\n                {\n                    \"data\": {\n                        \"ID\": [1, 2, 3, 4, 5, 6],\n                        \"Names\": [\n                            \"Ramesh\",\n                            \"Nasser\",\n                            \"Jessica\",\n                            \"Komal\",\n                            \"Jude\",\n                            \"Muffy\",\n                        ],\n                        \"Age\": [25, 25, 25, 20, 20, 25],\n                        \"Gender\": [\n                            \"Male\",\n                            \"Male\",\n                            \"Female\",\n                            \"Female\",\n                            \"Male\",\n                            \"Female\",\n                        ],\n                        \"Salary\": [1000, 2500, 5000, 3500, 6900, 1200],\n                    },\n                    \"schemas\": {\n                        \"spark\": {\n                            \"ID\": \"IntegerType\",\n                            \"Names\": \"StringType\",\n                            \"Age\": \"IntegerType\",\n                            \"Gender\": \"StringType\",\n                            \"Salary\": \"IntegerType\",\n                        }\n                    },\n                }\n            ],\n            \"tests\": [\n                {\n                    \"title\": \"basic_positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"template_dict\": {\n                            \"column\": \"Salary\",\n                            \"group_column_list\": \"Gender\",\n                            \"agg_type\": \"sum\",\n                            \"condition\": \"greater\",\n                            \"min_value\": 2000,\n                        },\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                        },\n                    },\n                    \"out\": {\"success\": True},\n                    \"only_for\": [\"spark\"],\n                },\n                {\n                    \"title\": \"basic_positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"template_dict\": {\n                            \"column\": \"Salary\",\n                            \"group_column_list\": \"Gender,Age\",\n                            \"agg_type\": \"sum\",\n                            \"condition\": \"between\",\n                            \"max_value\": 7000,\n                            
\"min_value\": 2000,\n                        },\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                        },\n                    },\n                    \"out\": {\"success\": True},\n                    \"only_for\": [\"spark\"],\n                },\n                {\n                    \"title\": \"basic_positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"template_dict\": {\n                            \"column\": \"Salary\",\n                            \"group_column_list\": \"Age\",\n                            \"agg_type\": \"max\",\n                            \"condition\": \"lesser\",\n                            \"max_value\": 10000,\n                        },\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                        },\n                    },\n                    \"out\": {\"success\": True},\n                    \"only_for\": [\"spark\"],\n                },\n                {\n                    \"title\": \"basic_negative_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"template_dict\": {\n                            \"column\": \"Salary\",\n                            \"group_column_list\": \"Gender\",\n                            \"agg_type\": \"count\",\n                            \"condition\": \"greater\",\n                            \"min_value\": 4,\n                        },\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                        },\n                    },\n                    \"out\": {\"success\": False},\n                    \"only_for\": [\"sqlite\", \"spark\"],\n                },\n                {\n                    \"title\": \"basic_negative_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"template_dict\": {\n                            \"column\": \"Salary\",\n                            \"group_column_list\": \"Gender,Age\",\n                            \"agg_type\": \"sum\",\n                            \"condition\": \"between\",\n                            \"max_value\": 2000,\n                            \"min_value\": 1000,\n                        },\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                        },\n                    },\n                    \"out\": {\"success\": False},\n                    \"only_for\": [\"spark\"],\n                },\n            ],\n        },\n    ]\n\n    library_metadata = {\n        \"tags\": [\"query-based\"],\n    }\n\n\nif __name__ == \"__main__\":\n    ExpectQueriedColumnAggValueToBe().print_diagnostic_checklist()\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/dq_factory.py",
    "content": "\"\"\"Module containing the class definition of the Data Quality Factory.\"\"\"\n\nimport importlib\nimport json\nimport random\nfrom copy import deepcopy\nfrom datetime import datetime, timezone\nfrom json import dumps, loads\nfrom typing import Optional, Tuple\n\nimport great_expectations as gx\nfrom great_expectations import ExpectationSuite\nfrom great_expectations.checkpoint import CheckpointResult\nfrom great_expectations.core.batch_definition import BatchDefinition\nfrom great_expectations.core.run_identifier import RunIdentifier\nfrom great_expectations.data_context import EphemeralDataContext\nfrom great_expectations.data_context.types.base import (\n    DataContextConfig,\n    FilesystemStoreBackendDefaults,\n    S3StoreBackendDefaults,\n)\nfrom great_expectations.expectations.expectation_configuration import (\n    ExpectationConfiguration,\n)\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import (\n    col,\n    dayofmonth,\n    explode,\n    from_json,\n    lit,\n    month,\n    schema_of_json,\n    struct,\n    to_json,\n    to_timestamp,\n    transform,\n    year,\n)\nfrom pyspark.sql.types import FloatType, StringType\n\nfrom lakehouse_engine.core.definitions import (\n    DQDefaults,\n    DQFunctionSpec,\n    DQResultFormat,\n    DQSpec,\n    DQType,\n    OutputSpec,\n    WriteType,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.table_manager import TableManager\nfrom lakehouse_engine.dq_processors.exceptions import DQValidationsFailedException\nfrom lakehouse_engine.dq_processors.validator import Validator\nfrom lakehouse_engine.io.writer_factory import WriterFactory\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass DQFactory(object):\n    \"\"\"Class for the Data Quality Factory.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n    _TIMESTAMP = datetime.now(timezone.utc).strftime(\"%Y%m%d%H%M%S\")\n\n    @classmethod\n    def _add_critical_function_tag(cls, args: dict) -> dict:\n        \"\"\"Add tags to function considered critical.\n\n        Adds a tag to each of the functions passed on the dq_specs to\n        denote that they are critical_functions. 
This means that if any\n        of them fails, the dq process will fail, even if the threshold\n        is not surpassed.\n        This is done by adding a tag to the meta dictionary of the\n        expectation configuration.\n\n        Args:\n            args: arguments passed on the dq_spec\n\n        Returns:\n            A dictionary with the args with the critical function tag.\n        \"\"\"\n        if \"meta\" in args.keys():\n            meta = args[\"meta\"]\n\n            if isinstance(meta[\"notes\"], str):\n                meta[\"notes\"] = meta[\"notes\"] + \" **Critical function**.\"\n            else:\n                meta[\"notes\"][\"content\"] = (\n                    meta[\"notes\"][\"content\"] + \" **Critical function**.\"\n                )\n\n            args[\"meta\"] = meta\n            return args\n\n        else:\n            args[\"meta\"] = {\n                \"notes\": {\n                    \"format\": \"markdown\",\n                    \"content\": \"**Critical function**.\",\n                }\n            }\n            return args\n\n    @classmethod\n    def _configure_checkpoint(\n        cls,\n        context: EphemeralDataContext,\n        dataframe_bd: BatchDefinition,\n        suite: ExpectationSuite,\n        dq_spec: DQSpec,\n        data: DataFrame,\n        checkpoint_run_time: str,\n    ) -> Tuple[CheckpointResult, Optional[list]]:\n        \"\"\"Create and configure the validation checkpoint.\n\n        Creates and configures a validation definition based on the suite\n        and then creates, configures and runs the checkpoint, returning,\n        at the end, the result as well as the primary key from the dq_specs.\n\n        Args:\n            context: The data context from GX\n            dataframe_bd: The dataframe with the batch definition to validate\n            suite: A group of expectations to validate\n            dq_spec: The arguments directly passed from the acon in the dq_spec key\n            data: Input dataframe to run the dq process on.\n            checkpoint_run_time: A string with the time in milliseconds\n\n        Returns:\n            A tuple with the result from the checkpoint run and the primary key\n            from the dq_spec.\n        \"\"\"\n        validation_definition = context.validation_definitions.add(\n            gx.ValidationDefinition(\n                data=dataframe_bd,\n                suite=suite,\n                name=f\"{dq_spec.spec_id}-{dq_spec.input_id}\"\n                f\"-validation-{checkpoint_run_time}\",\n            )\n        )\n\n        source_pk = cls._get_unexpected_rows_pk(dq_spec)\n        result_format: dict = {\n            \"result_format\": DQResultFormat.COMPLETE.value,\n        }\n\n        # If the source primary key is defined, we add it to the result format\n        # so that it is included in the results from GX.\n        if source_pk:\n            result_format = {\n                **result_format,\n                \"unexpected_index_column_names\": source_pk,\n            }\n\n        checkpoint = context.checkpoints.add(\n            gx.Checkpoint(\n                name=f\"{dq_spec.spec_id}-{dq_spec.input_id}\"\n                f\"-checkpoint-{checkpoint_run_time}\"\n                f\"-{str(random.randint(1, 100))}\",  # nosec B311\n                validation_definitions=[validation_definition],\n                actions=[],\n                result_format=result_format,\n            )\n        )\n\n        result = checkpoint.run(\n            
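# The dataframe is passed in at runtime as a batch parameter, while the\n            # run_id (run_name and run_time) uniquely identifies this checkpoint run.\n            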
batch_parameters={\"dataframe\": data},\n            run_id=RunIdentifier(\n                run_name=f\"{checkpoint_run_time}\"\n                f\"-{dq_spec.spec_id}-{dq_spec.input_id}\"\n                f\"-{str(random.randint(1, 100))}-checkpoint\",  # nosec B311\n                run_time=datetime.strptime(checkpoint_run_time, \"%Y%m%d-%H%M%S%f\"),\n            ),\n        )\n\n        return result, source_pk\n\n    @classmethod\n    def _check_row_condition(\n        cls, dq_spec: DQSpec, dq_function: DQFunctionSpec\n    ) -> DQFunctionSpec:\n        \"\"\"Enables/disables row_conditions.\n\n        Checks for row_condition arguments in the definition of expectations\n        and enables/disables their usage based on the enable_row_condition\n        argument. row_conditions allow you to filter the rows that are\n        processed by the DQ functions. This is useful when you want to run the\n        DQ functions only on a subset of the data.\n\n        Args:\n            dq_spec: The arguments directly passed from the acon in the dq_spec key\n            dq_function: A DQFunctionSpec with the definition of a dq function.\n\n        Returns:\n            The definition of a dq_function with or without the row_condition key.\n        \"\"\"\n        if (\n            not dq_spec.enable_row_condition\n            and \"row_condition\" in dq_function.args.keys()\n        ):\n            del dq_function.args[\"row_condition\"]\n            cls._LOGGER.info(\n                f\"Disabling row_condition for function: {dq_function.function}\"\n            )\n        return dq_function\n\n    @classmethod\n    def _add_suite(\n        cls, context: EphemeralDataContext, dq_spec: DQSpec, checkpoint_run_time: str\n    ) -> ExpectationSuite:\n        \"\"\"Create and configure an ExpectationSuite.\n\n        Creates and configures an expectation suite, adding the dq functions\n        passed on the dq_spec as well as the critical_functions also passed\n        on the dq_spec, if they exist. 
Finally, return the configured suite.\n\n        Args:\n            context: The data context from GX\n            dq_spec: The arguments directly passed from the acon in the dq_spec key\n            checkpoint_run_time: A string with the time in milliseconds\n\n        Returns:\n            A configured ExpectationSuite object.\n        \"\"\"\n        expectation_suite_name = (\n            dq_spec.expectation_suite_name\n            if dq_spec.expectation_suite_name\n            else f\"{dq_spec.spec_id}-{dq_spec.input_id}\"\n            f\"-{dq_spec.dq_type}-{checkpoint_run_time}\"\n        )\n        suite = context.suites.add(gx.ExpectationSuite(name=expectation_suite_name))\n\n        for dq_function in dq_spec.dq_functions:\n            dq_function = cls._check_row_condition(dq_spec, dq_function)\n            suite.add_expectation_configuration(\n                ExpectationConfiguration(\n                    type=dq_function.function,\n                    kwargs=dq_function.args if dq_function.args else {},\n                    meta=dq_function.args.get(\"meta\") if dq_function.args else {},\n                )\n            )\n        if dq_spec.critical_functions:\n            for critical_function in dq_spec.critical_functions:\n                meta_args = cls._add_critical_function_tag(critical_function.args)\n                suite.add_expectation_configuration(\n                    ExpectationConfiguration(\n                        type=critical_function.function,\n                        kwargs=(\n                            critical_function.args if critical_function.args else {}\n                        ),\n                        meta=meta_args,\n                    )\n                )\n\n        suite.save()\n        return suite\n\n    @classmethod\n    def _check_expectation_result(cls, result_dict: dict) -> dict:\n        \"\"\"Add an empty dict if the unexpected_index_list key is empty.\n\n        Checks if the unexpected_index_list key has any element; if it doesn't,\n        adds an empty dictionary to the result key. This is needed due to some\n        edge cases that appeared with the GX update to version 1.3.13, where\n        the unexpected_index_list would sometimes exist even for successful\n        validation runs.\n\n        Args:\n            result_dict: A dict with the result_dict from a checkpoint run.\n\n        Returns:\n            The configured result_dict\n        \"\"\"\n        for expectation_result in result_dict[\"results\"]:\n            if \"unexpected_index_list\" in expectation_result[\"result\"].keys():\n                if len(expectation_result[\"result\"][\"unexpected_index_list\"]) < 1:\n                    expectation_result[\"result\"] = {}\n        return result_dict\n\n    @classmethod\n    def run_dq_process(cls, dq_spec: DQSpec, data: DataFrame) -> DataFrame:\n        \"\"\"Run the specified data quality process on a dataframe.\n\n        Based on the dq_specs we apply the defined expectations on top of the dataframe\n        in order to run the necessary validations and then output the result of\n        the data quality process.\n\n        The logic of the function is as follows:\n        1. Import the custom expectations defined in the engine.\n        2. Create the context based on the dq_spec. - The context is the base class for\n        the GX, an ephemeral context means that it does not store/load the\n        configuration of the environment in a configuration file.\n        3. Add the data source to the context. 
- This is the data source that will be\n        used to run the dq process, in our case Spark.\n        4. Create the dataframe asset and batch definition. - The asset represents the\n        data where the expectations are applied and the batch definition is the\n        way how the data should be split, in the case of dataframes it is always\n        the whole dataframe.\n        5. Create the expectation suite. - This is the group of expectations that will\n        be applied to the data.\n        6. Create the checkpoint and run it. - The checkpoint is the object that will\n        run the expectations on the data and return the results.\n        7. Transform the results and write them to the result sink. - The results are\n        transformed to a more readable format and then written to the result sink.\n        8. Log the results and raise an exception if needed. - The results are logged\n        and if there are any failed expectations the process will raise an exception\n        based on the dq_spec.\n        9. Tag the source data if needed. - If the dq_spec has the tag_source_data\n        argument set to True, the source data will be tagged with the dq results.\n\n        Args:\n            dq_spec: data quality specification.\n            data: input dataframe to run the dq process on.\n\n        Returns:\n            The DataFrame containing the results of the DQ process.\n        \"\"\"\n        # Creating the context\n        if dq_spec.dq_type == \"validator\" or dq_spec.dq_type == \"prisma\":\n\n            for expectation in DQDefaults.CUSTOM_EXPECTATION_LIST.value:\n                importlib.__import__(\n                    \"lakehouse_engine.dq_processors.custom_expectations.\" + expectation\n                )\n\n            context = gx.get_context(\n                cls._get_data_context_config(dq_spec), mode=\"ephemeral\"\n            )\n\n            # Adding data source to context\n            dataframe_data_source = context.data_sources.add_spark(\n                name=f\"{dq_spec.spec_id}-{dq_spec.input_id}-datasource\",\n                persist=False,\n            )\n            dataframe_asset = dataframe_data_source.add_dataframe_asset(\n                name=f\"{dq_spec.spec_id}-{dq_spec.input_id}-asset\"\n            )\n            dataframe_bd = dataframe_asset.add_batch_definition_whole_dataframe(\n                name=f\"{dq_spec.spec_id}-{dq_spec.input_id}-batch\"\n            )\n\n            checkpoint_run_time = datetime.today().strftime(\"%Y%m%d-%H%M%S%f\")\n\n            suite = cls._add_suite(context, dq_spec, checkpoint_run_time)\n\n            result, source_pk = cls._configure_checkpoint(\n                context, dataframe_bd, suite, dq_spec, data, checkpoint_run_time\n            )\n\n            expectation_result_key = list(result.run_results.keys())[0]\n\n            result_dict = result.run_results[expectation_result_key].to_json_dict()\n\n            result_dict = cls._check_expectation_result(result_dict)\n\n            data = cls._transform_checkpoint_results(\n                data, source_pk, result_dict, dq_spec\n            )\n\n            # Processed keys are only added for the PRISMA dq type\n            # because they are being used to calculate the good\n            # records that were processed in a run.\n            if dq_spec.dq_type == DQType.PRISMA.value:\n\n                keys = data.select(\n                    [col(c).cast(StringType()).alias(c) for c in source_pk]\n                )\n                keys = 
keys.withColumn(\n                    \"run_name\", lit(result_dict[\"meta\"][\"run_id\"][\"run_name\"])\n                )\n\n                cls._write_to_location(dq_spec, keys, processed_keys=True)\n\n        else:\n            raise TypeError(\n                f\"Type of Data Quality '{dq_spec.dq_type}' is not supported.\"\n            )\n\n        return data\n\n    @classmethod\n    def _check_critical_functions_tags(cls, failed_expectations: dict) -> list:\n        critical_failure = []\n\n        for expectation in failed_expectations.values():\n            meta = expectation[\"meta\"]\n            if meta and (\n                (\"notes\" in meta.keys() and \"Critical function\" in meta[\"notes\"])\n                or (\n                    \"content\" in meta[\"notes\"].keys()\n                    and \"Critical function\" in meta[\"notes\"][\"content\"]\n                )\n            ):\n                critical_failure.append(expectation[\"type\"])\n\n        return critical_failure\n\n    @classmethod\n    def _check_chunk_usage(cls, results_dict: dict, dq_spec: DQSpec) -> bool:\n        \"\"\"Check if the results should be split into chunks.\n\n        If the size of the results dictionary is too big, we will split it into\n        smaller chunks. This is needed to avoid memory issues when processing\n        large datasets.\n\n        Args:\n            results_dict: The results dictionary to be checked.\n            dq_spec: data quality specification.\n\n        Returns:\n            True if the results dictionary is too big, False otherwise.\n        \"\"\"\n        for ele in results_dict[\"results\"]:\n            if (\n                \"unexpected_index_list\" in ele[\"result\"].keys()\n                and len(ele[\"result\"][\"unexpected_index_list\"])\n                > dq_spec.result_sink_chunk_size\n            ):\n                return True\n\n        return False\n\n    @classmethod\n    def _explode_results(\n        cls,\n        df: DataFrame,\n        dq_spec: DQSpec,\n    ) -> DataFrame:\n        \"\"\"Transform dq results dataframe exploding a set of columns.\n\n        Args:\n            df: dataframe with dq results to be exploded.\n            dq_spec: data quality specification.\n        \"\"\"\n        df = df.withColumn(\"validation_results\", explode(\"results\")).withColumn(\n            \"source\", lit(dq_spec.source)\n        )\n\n        if (\n            not df.schema[\"validation_results\"]\n            .dataType.fieldNames()  # type: ignore\n            .__contains__(\"result\")\n        ):\n            df = df.withColumn(\n                \"validation_results\",\n                col(\"validation_results\").withField(\n                    \"result\", struct(lit(None).alias(\"observed_value\"))\n                ),\n            )\n\n        kwargs_columns = [\n            f\"validation_results.expectation_config.kwargs.{col_name}\"\n            for col_name in df.select(\n                \"validation_results.expectation_config.kwargs.*\"\n            ).columns\n        ]\n\n        cols_to_cast = [\"max_value\", \"min_value\", \"sum_total\"]\n        for col_name in kwargs_columns:\n            if col_name.split(\".\")[-1] in cols_to_cast:\n                df = df.withColumn(\n                    \"validation_results\",\n                    col(\"validation_results\").withField(\n                        \"expectation_config\",\n                        col(\"validation_results.expectation_config\").withField(\n                            
\"kwargs\",\n                            col(\n                                \"validation_results.expectation_config.kwargs\"\n                            ).withField(\n                                col_name.split(\".\")[-1],\n                                col(col_name).cast(FloatType()),\n                            ),\n                        ),\n                    ),\n                )\n\n        new_columns = [\n            \"validation_results.expectation_config.kwargs.*\",\n            \"validation_results.expectation_config.type as expectation_type\",\n            \"validation_results.success as expectation_success\",\n            \"validation_results.exception_info\",\n            \"statistics.*\",\n        ] + dq_spec.result_sink_extra_columns\n\n        df_exploded = df.selectExpr(*df.columns, *new_columns).drop(\n            *[c.replace(\".*\", \"\").split(\" as\")[0] for c in new_columns]\n        )\n\n        df_exploded = df_exploded.drop(\n            \"statistics\", \"id\", \"results\", \"meta\", \"suite_name\"\n        )\n\n        if (\n            \"meta\"\n            in df_exploded.select(\"validation_results.expectation_config.*\").columns\n        ):\n            df_exploded = df_exploded.withColumn(\n                \"meta\", col(\"validation_results.expectation_config.meta\")\n            )\n\n        schema = df_exploded.schema.simpleString()\n\n        if (\n            dq_spec.gx_result_format.upper() == DQResultFormat.COMPLETE.value\n            and \"unexpected_index_list\" in schema\n        ):\n            df_exploded = df_exploded.withColumn(\n                \"unexpected_index_list\",\n                transform(\n                    col(\"validation_results.result.unexpected_index_list\"),\n                    lambda y: y.withField(\"run_success\", lit(False)),\n                ),\n            )\n\n        if \"observed_value\" in schema:\n            df_exploded = df_exploded.withColumn(\n                \"observed_value\", col(\"validation_results.result.observed_value\")\n            )\n\n        return (\n            df_exploded.withColumn(\"run_time_year\", year(to_timestamp(\"run_time\")))\n            .withColumn(\"run_time_month\", month(to_timestamp(\"run_time\")))\n            .withColumn(\"run_time_day\", dayofmonth(to_timestamp(\"run_time\")))\n            .withColumn(\n                \"kwargs\", to_json(col(\"validation_results.expectation_config.kwargs\"))\n            )\n            .withColumn(\"validation_results\", to_json(col(\"validation_results\")))\n        )\n\n    @classmethod\n    def _get_data_context_config(cls, dq_spec: DQSpec) -> DataContextConfig:\n        \"\"\"Get the configuration of the data context.\n\n        Based on the configuration it is possible to define the backend to be\n        the file system (e.g. 
local file system) or S3, meaning that the DQ artefacts\n        will be stored according to this configuration.\n\n        Args:\n            dq_spec: data quality process specification.\n\n        Returns:\n            The DataContextConfig object configuration.\n        \"\"\"\n        store_backend: FilesystemStoreBackendDefaults | S3StoreBackendDefaults\n\n        if dq_spec.store_backend == DQDefaults.FILE_SYSTEM_STORE.value:\n            store_backend = FilesystemStoreBackendDefaults(\n                root_directory=dq_spec.local_fs_root_dir\n            )\n        elif dq_spec.store_backend == DQDefaults.FILE_SYSTEM_S3_STORE.value:\n            store_backend = S3StoreBackendDefaults(\n                default_bucket_name=dq_spec.bucket,\n                validation_results_store_prefix=dq_spec.validations_store_prefix,\n                checkpoint_store_prefix=dq_spec.checkpoint_store_prefix,\n                expectations_store_prefix=dq_spec.expectations_store_prefix,\n            )\n\n        return DataContextConfig(\n            store_backend_defaults=store_backend,\n            analytics_enabled=False,\n        )\n\n    @classmethod\n    def _get_data_source_defaults(cls, dq_spec: DQSpec) -> dict:\n        \"\"\"Get the configuration for a datasource.\n\n        Args:\n            dq_spec: data quality specification.\n\n        Returns:\n            The python dictionary with the datasource configuration.\n        \"\"\"\n        return {\n            \"name\": f\"{dq_spec.spec_id}-{dq_spec.input_id}-datasource\",\n            \"class_name\": DQDefaults.DATASOURCE_CLASS_NAME.value,\n            \"execution_engine\": {\n                \"class_name\": DQDefaults.DATASOURCE_EXECUTION_ENGINE.value,\n                \"persist\": False,\n            },\n            \"data_connectors\": {\n                f\"{dq_spec.spec_id}-{dq_spec.input_id}-data_connector\": {\n                    \"module_name\": DQDefaults.DATA_CONNECTORS_MODULE_NAME.value,\n                    \"class_name\": DQDefaults.DATA_CONNECTORS_CLASS_NAME.value,\n                    \"assets\": {\n                        (\n                            dq_spec.data_asset_name\n                            if dq_spec.data_asset_name\n                            else f\"{dq_spec.spec_id}-{dq_spec.input_id}\"\n                        ): {\"batch_identifiers\": DQDefaults.DQ_BATCH_IDENTIFIERS.value}\n                    },\n                }\n            },\n        }\n\n    @classmethod\n    def _get_failed_expectations(\n        cls,\n        results: dict,\n        dq_spec: DQSpec,\n        failed_expectations: dict,\n        evaluated_expectations: dict,\n        is_final_chunk: bool,\n    ) -> Tuple[dict, dict]:\n        \"\"\"Get the failed expectations of a Checkpoint result.\n\n        Args:\n            results: the results of the DQ process.\n            dq_spec: data quality specification.\n            failed_expectations: dict of failed expectations.\n            evaluated_expectations: dict of evaluated expectations.\n            is_final_chunk: boolean indicating if this is the final chunk.\n\n        Returns: a tuple with a dict of failed expectations\n                and a dict of evaluated expectations.\n        \"\"\"\n        expectations_results = results[\"results\"]\n        for result in expectations_results:\n            evaluated_expectations[result[\"expectation_config\"][\"id\"]] = result[\n                \"expectation_config\"\n            ]\n            if not result[\"success\"]:\n                
failed_expectations[result[\"expectation_config\"][\"id\"]] = result[\n                    \"expectation_config\"\n                ]\n                if result[\"exception_info\"][\"raised_exception\"]:\n                    cls._LOGGER.error(\n                        f\"\"\"The expectation {str(result[\"expectation_config\"])}\n                        raised the following exception:\n                        {result[\"exception_info\"][\"exception_message\"]}\"\"\"\n                    )\n        cls._LOGGER.error(\n            f\"{len(failed_expectations)} out of {len(evaluated_expectations)} \"\n            f\"Data Quality Expectation(s) have failed! Failed Expectations: \"\n            f\"{failed_expectations}\"\n        )\n\n        percentage_failure = 1 - (results[\"statistics\"][\"success_percent\"] / 100)\n\n        if (\n            dq_spec.max_percentage_failure is not None\n            and dq_spec.max_percentage_failure < percentage_failure\n            and is_final_chunk\n        ):\n            raise DQValidationsFailedException(\n                f\"Max error threshold is being surpassed! \"\n                f\"Expected: {dq_spec.max_percentage_failure} \"\n                f\"Got: {percentage_failure}\"\n            )\n\n        return failed_expectations, evaluated_expectations\n\n    @classmethod\n    def _get_unexpected_rows_pk(cls, dq_spec: DQSpec) -> Optional[list]:\n        \"\"\"Get primary key for using on rows failing DQ validations.\n\n        Args:\n            dq_spec: data quality specification.\n\n        Returns: the list of columns that are part of the primary key.\n        \"\"\"\n        if dq_spec.unexpected_rows_pk:\n            return dq_spec.unexpected_rows_pk\n        elif dq_spec.tbl_to_derive_pk:\n            return TableManager(\n                {\"function\": \"get_tbl_pk\", \"table_or_view\": dq_spec.tbl_to_derive_pk}\n            ).get_tbl_pk()\n        elif dq_spec.tag_source_data:\n            raise ValueError(\n                \"You need to provide either the argument \"\n                \"'unexpected_rows_pk' or 'tbl_to_derive_pk'.\"\n            )\n        else:\n            return None\n\n    @classmethod\n    def _log_or_fail(\n        cls,\n        results: dict,\n        dq_spec: DQSpec,\n        failed_expectations: dict,\n        evaluated_expectations: dict,\n        is_final_chunk: bool,\n    ) -> Tuple[dict, dict]:\n        \"\"\"Log the execution of the Data Quality process.\n\n        Args:\n            results: the results of the DQ process.\n            dq_spec: data quality specification.\n            failed_expectations: list of failed expectations.\n            evaluated_expectations: list of evaluated expectations.\n            is_final_chunk: boolean indicating if this is the final chunk.\n\n        Returns: a tuple with a dict of failed expectations\n                and a dict of evaluated expectations.\n        \"\"\"\n        if results[\"success\"]:\n            cls._LOGGER.info(\n                \"The data passed all the expectations defined. 
Everything looks good!\"\n            )\n        else:\n            failed_expectations, evaluated_expectations = cls._get_failed_expectations(\n                results,\n                dq_spec,\n                failed_expectations,\n                evaluated_expectations,\n                is_final_chunk,\n            )\n\n        if dq_spec.critical_functions and is_final_chunk:\n            critical_failure = cls._check_critical_functions_tags(failed_expectations)\n\n            if critical_failure:\n                raise DQValidationsFailedException(\n                    f\"Data Quality Validations Failed, the following critical \"\n                    f\"expectations failed: {critical_failure}.\"\n                )\n        if dq_spec.fail_on_error and is_final_chunk and failed_expectations:\n            raise DQValidationsFailedException(\"Data Quality Validations Failed!\")\n\n        return failed_expectations, evaluated_expectations\n\n    @classmethod\n    def _transform_checkpoint_results(\n        cls,\n        data: DataFrame,\n        source_pk: list,\n        checkpoint_results: dict,\n        dq_spec: DQSpec,\n    ) -> DataFrame:\n        \"\"\"Transforms the checkpoint results and creates new entries.\n\n        All the items of the dictionary are cast to a JSON-like format.\n        After that, the dictionary is converted into a dataframe.\n\n        Args:\n            data: input dataframe to run the dq process on.\n            source_pk: list of columns that are part of the primary key.\n            checkpoint_results: dict with results of the checkpoint run.\n            dq_spec: data quality specification.\n\n        Returns:\n            Transformed results dataframe.\n        \"\"\"\n        results_dict = loads(dumps(checkpoint_results))\n\n        # Check the size of the results dictionary: if it is too big,\n        # we will split it into smaller chunks.\n        results_dict_list = cls._generate_chunks(results_dict, dq_spec)\n\n        index = 0\n\n        failed_expectations: dict = {}\n        evaluated_expectations: dict = {}\n\n        # The processed chunk is removed from the list of results\n        # so the memory is freed as soon as possible.\n        while index < len(results_dict_list):\n            is_final_chunk = len(results_dict_list) == 1\n            data, failed_expectations, evaluated_expectations = cls._process_chunk(\n                dq_spec,\n                source_pk,\n                results_dict_list[index],\n                data,\n                failed_expectations,\n                evaluated_expectations,\n                is_final_chunk,\n            )\n            del results_dict_list[index]\n\n        return data\n\n    @classmethod\n    def _process_chunk(\n        cls,\n        dq_spec: DQSpec,\n        source_pk: list[str],\n        ele: dict,\n        data: DataFrame,\n        failed_expectations: dict,\n        evaluated_expectations: dict,\n        is_final_chunk: bool,\n    ) -> Tuple[DataFrame, dict, dict]:\n        \"\"\"Process a chunk of the results.\n\n        Args:\n            dq_spec: data quality specification.\n            source_pk: list of columns that are part of the primary key.\n            ele: dictionary with the results of the dq process.\n            data: input dataframe to run the dq process on.\n            failed_expectations: list of failed expectations.\n            
evaluated_expectations: list of evaluated expectations.\n            is_final_chunk: boolean indicating if this is the final chunk.\n\n        Returns:\n            A tuple with the processed data, failed expectations and evaluated\n            expectations.\n        \"\"\"\n        df = ExecEnv.SESSION.createDataFrame([json.dumps(ele)], schema=StringType())\n        schema = schema_of_json(lit(json.dumps(ele)))\n        df = (\n            df.withColumn(\"value\", from_json(\"value\", schema))\n            .select(\"value.*\")\n            .withColumn(\"spec_id\", lit(dq_spec.spec_id))\n            .withColumn(\"input_id\", lit(dq_spec.input_id))\n            .withColumn(\"run_name\", col(\"meta.run_id.run_name\"))\n            .withColumn(\"run_time\", col(\"meta.run_id.run_time\"))\n        )\n        exploded_df = (\n            cls._explode_results(df, dq_spec)\n            if dq_spec.result_sink_explode\n            else df.withColumn(\"validation_results\", to_json(col(\"results\"))).drop(\n                \"statistics\", \"meta\", \"suite_name\", \"results\", \"id\"\n            )\n        )\n\n        exploded_df = exploded_df.withColumn(\"source_primary_key\", lit(source_pk))\n\n        exploded_df = cls._cast_columns_to_string(exploded_df)\n\n        cls._write_to_location(dq_spec, exploded_df)\n\n        failed_expectations, evaluated_expectations = cls._log_or_fail(\n            ele, dq_spec, failed_expectations, evaluated_expectations, is_final_chunk\n        )\n        if (\n            dq_spec.tag_source_data\n            and dq_spec.result_sink_explode\n            and dq_spec.fail_on_error is not True\n        ):\n            data = Validator.tag_source_with_dq(source_pk, data, exploded_df)\n            return data, failed_expectations, evaluated_expectations\n        return data, failed_expectations, evaluated_expectations\n\n    @classmethod\n    def _cast_columns_to_string(cls, df: DataFrame) -> DataFrame:\n        \"\"\"Cast selected columns of the dataframe to string type.\n\n        Args:\n            df: The input dataframe.\n\n        Returns:\n            A new dataframe with selected columns cast to string type.\n        \"\"\"\n        for col_name in df.columns:\n            if col_name not in DQDefaults.DQ_COLUMNS_TO_KEEP_TYPES.value:\n                df = df.withColumn(col_name, df[col_name].cast(StringType()))\n        return df\n\n    @classmethod\n    def _generate_chunks(cls, results_dict: dict, dq_spec: DQSpec) -> list:\n        \"\"\"Split the results dictionary into smaller chunks.\n\n        This is needed to avoid memory issues when processing large datasets.\n        The size of the chunks is defined by the dq_spec.result_sink_chunk_size.\n\n        Args:\n            results_dict: The results dictionary to be split.\n            dq_spec: data quality specification.\n\n        Returns:\n            A list of dictionaries, where each dictionary is a chunk of the original\n            results dictionary.\n        \"\"\"\n        results_dict_list = []\n\n        split = cls._check_chunk_usage(results_dict, dq_spec)\n\n        if split:\n            # Here we are splitting the results into chunks per expectation\n            # and then we are splitting the unexpected_index_list into\n            # chunks of size dq_spec.result_sink_chunk_size.\n            results_dict_list = cls._split_into_chunks(results_dict, dq_spec)\n        else:\n            # If the results are not too big, we can process them all at once.\n            results_dict_list = 
[results_dict]\n\n        return results_dict_list\n\n    @classmethod\n    def _split_into_chunks(cls, results_dict: dict, dq_spec: DQSpec) -> list:\n        \"\"\"Split the results into smaller chunks.\n\n        This is needed to avoid memory issues when processing large datasets.\n        The size of the chunks is defined by the dq_spec.result_sink_chunk_size.\n\n        Args:\n            results: The results to be split.\n            dq_spec: data quality specification.\n\n        Returns:\n            A list of dictionaries, where each dictionary is a chunk of the original\n            results.\n        \"\"\"\n        results_dict_list = []\n\n        for ele in results_dict[\"results\"]:\n            base_result = deepcopy(results_dict)\n\n            if \"unexpected_index_list\" in ele[\"result\"].keys():\n                for key in ExecEnv.ENGINE_CONFIG.dq_result_sink_columns_to_delete:\n                    del ele[\"result\"][key]\n\n                unexpected_index_list = ele[\"result\"][\"unexpected_index_list\"]\n                unexpected_index_list_chunks = cls.split_into_chunks(\n                    unexpected_index_list, dq_spec.result_sink_chunk_size\n                )\n\n                del ele[\"result\"][\"unexpected_index_list\"]\n\n                for chunk in unexpected_index_list_chunks:\n                    ele[\"result\"][\"unexpected_index_list\"] = chunk\n                    base_result[\"results\"] = [ele]\n                    results_dict_list.append(deepcopy(base_result))\n            else:\n                base_result[\"results\"] = [ele]\n                results_dict_list.append(base_result)\n\n        return results_dict_list\n\n    @classmethod\n    def _write_to_location(\n        cls,\n        dq_spec: DQSpec,\n        df: DataFrame,\n        processed_keys: bool = False,\n    ) -> None:\n        \"\"\"Write dq results dataframe to a table or location.\n\n        It can be written:\n        - a raw output (having result_sink_explode set as False)\n        - an exploded output (having result_sink_explode set as True), which\n        is more prepared for analysis, with some columns exploded, flatten and\n        transformed. 
It can also be set result_sink_extra_columns with other\n        columns desired to have in the output table or location.\n        - processed keys when running the dq process with the dq_type set as\n        'prisma'.\n\n        Args:\n            dq_spec: data quality specification.\n            df: dataframe with dq results to write.\n            processed_keys: boolean indicating if the dataframe contains\n                the processed keys.\n        \"\"\"\n        if processed_keys:\n            table = None\n            location = dq_spec.processed_keys_location\n            options = {\"mergeSchema\": \"true\"}\n        else:\n            table = dq_spec.result_sink_db_table\n            location = dq_spec.result_sink_location\n            options = {\"mergeSchema\": \"true\"} if dq_spec.result_sink_explode else {}\n\n        if table or location:\n            WriterFactory.get_writer(\n                spec=OutputSpec(\n                    spec_id=\"dq_result_sink\",\n                    input_id=\"dq_result\",\n                    db_table=table,\n                    location=location,\n                    partitions=(\n                        dq_spec.result_sink_partitions\n                        if dq_spec.result_sink_partitions\n                        else []\n                    ),\n                    write_type=WriteType.APPEND.value,\n                    data_format=dq_spec.result_sink_format,\n                    options=(\n                        options\n                        if dq_spec.result_sink_options is None\n                        else {**dq_spec.result_sink_options, **options}\n                    ),\n                ),\n                df=df,\n                data=None,\n            ).write()\n\n    @staticmethod\n    def split_into_chunks(lst: list, chunk_size: int) -> list:\n        \"\"\"Split a list into chunks of a specified size.\n\n        Args:\n            lst: The list to be split.\n            chunk_size: Number of records in each chunk.\n\n        Returns:\n            A list of lists, where each inner list is a chunk of the original list.\n        \"\"\"\n        if chunk_size <= 0:\n            raise ValueError(\"Chunk size must be a positive integer.\")\n        chunk_list = []\n        for i in range(0, len(lst), chunk_size):\n            chunk_list.append(lst[i : i + chunk_size])\n        return chunk_list\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/exceptions.py",
    "content": "\"\"\"Package defining all the DQ custom exceptions.\"\"\"\n\n\nclass DQValidationsFailedException(Exception):\n    \"\"\"Exception for when the data quality validations fail.\"\"\"\n\n    pass\n\n\nclass DQCheckpointsResultsException(Exception):\n    \"\"\"Exception for when the checkpoint results parsing fail.\"\"\"\n\n    pass\n\n\nclass DQSpecMalformedException(Exception):\n    \"\"\"Exception for when the DQSpec is malformed.\"\"\"\n\n    pass\n\n\nclass DQDuplicateRuleIdException(Exception):\n    \"\"\"Exception for when a duplicated rule id is found.\"\"\"\n\n    pass\n"
  },
  {
    "path": "lakehouse_engine/dq_processors/validator.py",
    "content": "\"\"\"Module containing the definition of a data quality validator.\"\"\"\n\nfrom typing import Any, List\n\nfrom great_expectations.core.batch import RuntimeBatchRequest\nfrom great_expectations.data_context import EphemeralDataContext\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import (\n    col,\n    collect_set,\n    concat,\n    explode,\n    first,\n    lit,\n    struct,\n    when,\n)\n\nfrom lakehouse_engine.core.definitions import DQDefaults, DQFunctionSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Validator(object):\n    \"\"\"Class containing the data quality validator.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def get_dq_validator(\n        cls,\n        context: EphemeralDataContext,\n        batch_request: RuntimeBatchRequest,\n        expectation_suite_name: str,\n        dq_functions: List[DQFunctionSpec],\n        critical_functions: List[DQFunctionSpec],\n    ) -> Any:\n        \"\"\"Get a validator according to the specification.\n\n        We use getattr to dynamically execute any expectation available.\n        getattr(validator, function) is similar to validator.function(). With this\n        approach, we can execute any expectation supported.\n\n        Args:\n            context: the BaseDataContext containing the configurations for the data\n                source and store backend.\n            batch_request: run time batch request to be able to query underlying data.\n            expectation_suite_name: name of the expectation suite.\n            dq_functions: a list of DQFunctionSpec to consider in the expectation suite.\n            critical_functions: list of critical expectations in the expectation suite.\n\n        Returns:\n            The validator with the expectation suite stored.\n        \"\"\"\n        validator = context.get_validator(\n            batch_request=batch_request, expectation_suite_name=expectation_suite_name\n        )\n        if dq_functions:\n            for dq_function in dq_functions:\n                getattr(validator, dq_function.function)(\n                    **dq_function.args if dq_function.args else {}\n                )\n\n        if critical_functions:\n            for critical_function in critical_functions:\n                meta_args = cls._add_critical_function_tag(critical_function.args)\n\n                getattr(validator, critical_function.function)(**meta_args)\n\n        return validator.save_expectation_suite(discard_failed_expectations=False)\n\n    @classmethod\n    def tag_source_with_dq(\n        cls, source_pk: List[str], source_df: DataFrame, results_df: DataFrame\n    ) -> DataFrame:\n        \"\"\"Tags the source dataframe with a new column having the DQ results.\n\n        Args:\n            source_pk: the primary key of the source data.\n            source_df: the source dataframe to be tagged with DQ results.\n            results_df: dq results dataframe.\n\n        Returns: a dataframe tagged with the DQ results.\n        \"\"\"\n        run_success = results_df.select(\"success\").first()[0]\n        run_name = results_df.select(\"run_name\").first()[0]\n        raised_exceptions = (\n            True\n            if results_df.filter(\"exception_info.raised_exception == True\").count() > 0\n            else False\n        )\n\n        failures_df = (\n            results_df.filter(\n                \"expectation_success == False and 
size(unexpected_index_list) > 0\"\n            )\n            if \"unexpected_index_list\" in results_df.schema.simpleString()\n            else results_df.filter(\"expectation_success == False\")\n        )\n\n        if failures_df.isEmpty() is not True:\n\n            source_df = cls._get_row_tagged_fail_df(\n                failures_df, raised_exceptions, source_df, source_pk\n            )\n\n        return cls._join_complementary_data(\n            run_name, run_success, raised_exceptions, source_df\n        )\n\n    @classmethod\n    def _add_critical_function_tag(cls, args: dict) -> dict:\n        if \"meta\" in args.keys():\n            meta = args[\"meta\"]\n\n            if isinstance(meta[\"notes\"], str):\n                meta[\"notes\"] = meta[\"notes\"] + \" **Critical function**.\"\n            else:\n                meta[\"notes\"][\"content\"] = (\n                    meta[\"notes\"][\"content\"] + \" **Critical function**.\"\n                )\n\n            args[\"meta\"] = meta\n            return args\n\n        else:\n            args[\"meta\"] = {\n                \"notes\": {\n                    \"format\": \"markdown\",\n                    \"content\": \"**Critical function**.\",\n                }\n            }\n            return args\n\n    @staticmethod\n    def _get_row_tagged_fail_df(\n        failures_df: DataFrame,\n        raised_exceptions: bool,\n        source_df: DataFrame,\n        source_pk: List[str],\n    ) -> DataFrame:\n        \"\"\"Get the source_df DataFrame tagged with the row level failures.\n\n        Args:\n            failures_df: dataframe having all failed expectations from the DQ execution.\n            raised_exceptions: whether there was at least one expectation raising\n                exceptions (True) or not (False).\n            source_df: the source dataframe being tagged with DQ results.\n            source_pk: the primary key of the source data.\n\n        Returns: the source_df tagged with the row level failures.\n        \"\"\"\n        if \"unexpected_index_list\" in failures_df.schema.simpleString():\n\n            row_failures_df = (\n                failures_df.alias(\"a\")\n                .withColumn(\"exploded_list\", explode(col(\"unexpected_index_list\")))\n                .selectExpr(\"a.*\", \"exploded_list.*\")\n                .groupBy(*source_pk)\n                .agg(\n                    struct(\n                        first(col(\"run_name\")).alias(\"run_name\"),\n                        first(col(\"success\")).alias(\"run_success\"),\n                        lit(raised_exceptions).alias(\"raised_exceptions\"),\n                        first(col(\"expectation_success\")).alias(\"run_row_success\"),\n                        collect_set(\n                            struct(\n                                col(\"expectation_type\"),\n                                col(\"kwargs\"),\n                            )\n                        ).alias(\"dq_failure_details\"),\n                    ).alias(\"dq_validations\")\n                )\n            )\n\n            if all(item in row_failures_df.columns for item in source_pk):\n                join_cond = [\n                    col(f\"a.{key}\").eqNullSafe(col(f\"b.{key}\")) for key in source_pk\n                ]\n                columns = [\n                    col_name\n                    for col_name in source_df.columns\n                    if col_name != \"dq_validations\"\n                ]\n\n                # Since we are creating multiple rows 
per run, if the dq_validations\n                # column already exists, we need to add the new dq_validations to\n                # the existing dq_validations.\n                existing_validations = \"a.dq_validations\"\n                existing_validations_details = \"a.dq_validations.dq_failure_details\"\n                new_validations = \"b.dq_validations\"\n                new_validations_details = \"b.dq_validations.dq_failure_details\"\n\n                if \"dq_validations\" in source_df.columns:\n                    source_df = (\n                        source_df.alias(\"a\")\n                        .join(row_failures_df.alias(\"b\"), join_cond, \"left\")\n                        .select(\n                            *[f\"a.{col}\" for col in columns],\n                            when(\n                                col(new_validations).isNotNull()\n                                & col(existing_validations_details).isNotNull(),\n                                col(new_validations).withField(\n                                    \"dq_failure_details\",\n                                    concat(\n                                        col(existing_validations_details),\n                                        col(new_validations_details),\n                                    ),\n                                ),\n                            )\n                            .when(\n                                col(new_validations).isNotNull()\n                                & col(new_validations_details).isNotNull(),\n                                col(new_validations),\n                            )\n                            .otherwise(col(existing_validations))\n                            .alias(\"dq_validations\"),\n                        )\n                    )\n                else:\n                    source_df = (\n                        source_df.alias(\"a\")\n                        .join(row_failures_df.alias(\"b\"), join_cond, \"left\")\n                        .select(\"a.*\", new_validations)\n                    )\n\n        return source_df\n\n    @staticmethod\n    def _join_complementary_data(\n        run_name: str, run_success: bool, raised_exceptions: bool, source_df: DataFrame\n    ) -> DataFrame:\n        \"\"\"Join the source_df DataFrame with complementary data.\n\n        The source_df was already tagged/joined with the row level DQ failures, in case\n        there were any. However, there might be cases for which we don't have any\n        failure (everything succeeded) or cases for which only not row level failures\n        happened (e.g. 
table level expectations or column level aggregations), and, for\n        those we need to join the source_df with complementary data.\n\n        Args:\n            run_name: the name of the DQ execution in great expectations.\n            run_success: whether the general execution of the DQ was succeeded (True)\n                or not (False).\n            raised_exceptions: whether there was at least one expectation raising\n                exceptions (True) or not (False).\n            source_df: the source dataframe being tagged with DQ results.\n\n        Returns: the source_df tagged with complementary data.\n        \"\"\"\n        complementary_data = [\n            {\n                \"dq_validations\": {\n                    \"run_name\": run_name,\n                    \"run_success\": run_success,\n                    \"raised_exceptions\": raised_exceptions,\n                    \"run_row_success\": True,\n                }\n            }\n        ]\n\n        complementary_df = ExecEnv.SESSION.createDataFrame(\n            complementary_data, schema=DQDefaults.DQ_VALIDATIONS_SCHEMA.value\n        )\n\n        return (\n            source_df.crossJoin(\n                complementary_df.withColumnRenamed(\n                    \"dq_validations\", \"tmp_dq_validations\"\n                )\n            )\n            .withColumn(\n                \"dq_validations\",\n                (\n                    when(\n                        col(\"dq_validations\").isNotNull(), col(\"dq_validations\")\n                    ).otherwise(col(\"tmp_dq_validations\"))\n                    if \"dq_validations\" in source_df.columns\n                    else col(\"tmp_dq_validations\")\n                ),\n            )\n            .drop(\"tmp_dq_validations\")\n        )\n"
  },
  {
    "path": "lakehouse_engine/engine.py",
    "content": "\"\"\"Contract of the lakehouse engine with all the available functions to be executed.\"\"\"\n\nfrom typing import List, Optional, OrderedDict\n\nfrom lakehouse_engine.algorithms.data_loader import DataLoader\nfrom lakehouse_engine.algorithms.gab import GAB\nfrom lakehouse_engine.algorithms.reconciliator import Reconciliator\nfrom lakehouse_engine.algorithms.sensors.heartbeat import Heartbeat\nfrom lakehouse_engine.algorithms.sensors.sensor import Sensor, SensorStatus\nfrom lakehouse_engine.core.definitions import (\n    CollectEngineUsage,\n    SAPLogchain,\n    TerminatorSpec,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.file_manager import FileManagerFactory\nfrom lakehouse_engine.core.sensor_manager import SensorUpstreamManager\nfrom lakehouse_engine.core.table_manager import TableManager\nfrom lakehouse_engine.terminators.notifier_factory import NotifierFactory\nfrom lakehouse_engine.terminators.sensor_terminator import SensorTerminator\nfrom lakehouse_engine.utils.acon_utils import (\n    validate_and_resolve_acon,\n    validate_manager_list,\n)\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.engine_usage_stats import EngineUsageStats\n\n\ndef load_data(\n    acon_path: Optional[str] = None,\n    acon: Optional[dict] = None,\n    collect_engine_usage: str = CollectEngineUsage.PROD_ONLY.value,\n    spark_confs: dict = None,\n) -> Optional[OrderedDict]:\n    \"\"\"Load data using the DataLoader algorithm.\n\n    Args:\n        acon_path: path of the acon (algorithm configuration) file.\n        acon: acon provided directly through python code (e.g., notebooks or other\n            apps).\n        collect_engine_usage: Lakehouse usage statistics collection strategy.\n        spark_confs: optional dictionary with the spark confs to be used when collecting\n            the engine usage.\n    \"\"\"\n    try:\n        acon = ConfigUtils.get_acon(acon_path, acon)\n        ExecEnv.get_or_create(app_name=\"data_loader\", config=acon.get(\"exec_env\", None))\n        acon = validate_and_resolve_acon(acon, \"in_motion\")\n    finally:\n        EngineUsageStats.store_engine_usage(\n            acon, load_data.__name__, collect_engine_usage, spark_confs\n        )\n    return DataLoader(acon).execute()\n\n\ndef execute_reconciliation(\n    acon_path: Optional[str] = None,\n    acon: Optional[dict] = None,\n    collect_engine_usage: str = CollectEngineUsage.PROD_ONLY.value,\n    spark_confs: dict = None,\n) -> None:\n    \"\"\"Execute the Reconciliator algorithm.\n\n    Args:\n        acon_path: path of the acon (algorithm configuration) file.\n        acon: acon provided directly through python code (e.g., notebooks or other\n            apps).\n        collect_engine_usage: Lakehouse usage statistics collection strategy.\n        spark_confs: optional dictionary with the spark confs to be used when collecting\n            the engine usage.\n    \"\"\"\n    try:\n        acon = ConfigUtils.get_acon(acon_path, acon)\n        ExecEnv.get_or_create(\n            app_name=\"reconciliator\", config=acon.get(\"exec_env\", None)\n        )\n        acon = validate_and_resolve_acon(acon)\n    finally:\n        EngineUsageStats.store_engine_usage(\n            acon, execute_reconciliation.__name__, collect_engine_usage, spark_confs\n        )\n    Reconciliator(acon).execute()\n\n\ndef execute_dq_validation(\n    acon_path: Optional[str] = None,\n    acon: Optional[dict] = None,\n    
collect_engine_usage: str = CollectEngineUsage.PROD_ONLY.value,\n    spark_confs: dict = None,\n) -> None:\n    \"\"\"Execute the DQValidator algorithm.\n\n    Args:\n        acon_path: path of the acon (algorithm configuration) file.\n        acon: acon provided directly through python code (e.g., notebooks or other\n            apps).\n        collect_engine_usage: Lakehouse usage statistics collection strategy.\n        spark_confs: optional dictionary with the spark confs to be used when collecting\n            the engine usage.\n    \"\"\"\n    from lakehouse_engine.algorithms.dq_validator import DQValidator\n\n    try:\n        acon = ConfigUtils.get_acon(acon_path, acon)\n        ExecEnv.get_or_create(\n            app_name=\"dq_validator\", config=acon.get(\"exec_env\", None)\n        )\n        acon = validate_and_resolve_acon(acon, \"at_rest\")\n    finally:\n        EngineUsageStats.store_engine_usage(\n            acon, execute_dq_validation.__name__, collect_engine_usage, spark_confs\n        )\n    DQValidator(acon).execute()\n\n\ndef manage_table(\n    acon_path: Optional[str] = None,\n    acon: Optional[dict] = None,\n    collect_engine_usage: str = CollectEngineUsage.PROD_ONLY.value,\n    spark_confs: dict = None,\n) -> None:\n    \"\"\"Manipulate tables/views using Table Manager algorithm.\n\n    Args:\n        acon_path: path of the acon (algorithm configuration) file.\n        acon: acon provided directly through python code (e.g., notebooks\n            or other apps).\n        collect_engine_usage: Lakehouse usage statistics collection strategy.\n        spark_confs: optional dictionary with the spark confs to be used when collecting\n            the engine usage.\n    \"\"\"\n    acon = ConfigUtils.get_acon(acon_path, acon)\n    ExecEnv.get_or_create(app_name=\"manage_table\", config=acon.get(\"exec_env\", None))\n    EngineUsageStats.store_engine_usage(\n        acon, manage_table.__name__, collect_engine_usage, spark_confs\n    )\n    TableManager(acon).get_function()\n\n\ndef execute_manager(\n    acon: dict,\n    collect_engine_usage: str = CollectEngineUsage.PROD_ONLY.value,\n    spark_confs: dict = None,\n) -> None:\n    \"\"\"Execute the Lakehouse Engine Manager.\n\n    This function allows users to execute multiple managers in a single\n    call by providing a list of acons.\n\n    Args:\n        acon: list of acons to be executed by the manager.\n        collect_engine_usage: Lakehouse usage statistics collection strategy.\n        spark_confs: optional dictionary with the spark confs to be used when collecting\n            the engine usage.\n    \"\"\"\n    ExecEnv.get_or_create(app_name=\"lakehouse_engine_manager\")\n    acon_list = validate_manager_list(acon)\n    for acon in acon_list:\n        EngineUsageStats.store_engine_usage(\n            acon, execute_manager.__name__, collect_engine_usage, spark_confs\n        )\n        if acon[\"manager\"] == \"file\":\n            FileManagerFactory.execute_function(configs=acon)\n        elif acon[\"manager\"] == \"table\":\n            TableManager(acon).get_function()\n        else:\n            raise ValueError(f\"Manager {acon['manager']} not recognized.\")\n\n\ndef manage_files(\n    acon_path: Optional[str] = None,\n    acon: Optional[dict] = None,\n    collect_engine_usage: str = CollectEngineUsage.PROD_ONLY.value,\n    spark_confs: dict = None,\n) -> None:\n    \"\"\"Manipulate s3 files using File Manager algorithm.\n\n    Args:\n        acon_path: path of the acon (algorithm configuration) file.\n     
   acon: acon provided directly through python code (e.g., notebooks\n            or other apps).\n        collect_engine_usage: Lakehouse usage statistics collection strategy.\n        spark_confs: optional dictionary with the spark confs to be used when collecting\n            the engine usage.\n    \"\"\"\n    acon = ConfigUtils.get_acon(acon_path, acon)\n    ExecEnv.get_or_create(app_name=\"manage_files\", config=acon.get(\"exec_env\", None))\n    EngineUsageStats.store_engine_usage(\n        acon, manage_files.__name__, collect_engine_usage, spark_confs\n    )\n    FileManagerFactory.execute_function(configs=acon)\n\n\ndef execute_sensor(\n    acon_path: Optional[str] = None,\n    acon: Optional[dict] = None,\n    collect_engine_usage: str = CollectEngineUsage.PROD_ONLY.value,\n    spark_confs: dict = None,\n) -> bool:\n    \"\"\"Execute a sensor based on a Sensor Algorithm Configuration.\n\n    A sensor is useful to check if an upstream system has new data.\n\n    Args:\n        acon_path: path of the acon (algorithm configuration) file.\n        acon: acon provided directly through python code (e.g., notebooks\n            or other apps).\n        collect_engine_usage: Lakehouse usage statistics collection strategy.\n        spark_confs: optional dictionary with the spark confs to be used when collecting\n            the engine usage.\n    \"\"\"\n    acon = ConfigUtils.get_acon(acon_path, acon)\n    ExecEnv.get_or_create(app_name=\"execute_sensor\", config=acon.get(\"exec_env\", None))\n    EngineUsageStats.store_engine_usage(\n        acon, execute_sensor.__name__, collect_engine_usage, spark_confs\n    )\n    return Sensor(acon).execute()\n\n\ndef execute_sensor_heartbeat(\n    acon_path: Optional[str] = None,\n    acon: Optional[dict] = None,\n    collect_engine_usage: str = CollectEngineUsage.PROD_ONLY.value,\n    spark_confs: dict = None,\n) -> None:\n    \"\"\"Execute a sensor based on a Heartbeat Algorithm Configuration.\n\n    The heartbeat mechanism monitors whether an upstream system has new data.\n\n    The heartbeat job runs continuously within a defined data product or\n    according to a user-defined schedule.\n\n    This job operates based on the Control table, where source-related entries can be\n    fed by users using the Heartbeat Data Feeder job.\n\n    Each source (such as SAP, delta_table, Kafka, Local Manual Upload, etc.) 
can have\n    tasks added in parallel within the Heartbeat Job.\n\n    Based on source heartbeat ACON and control table entries,\n    Heartbeat will send a final sensor acon to the existing sensor modules,\n    which check if a new event is available for the control table record.\n\n    The sensor then returns the NEW_EVENT_AVAILABLE status to the Heartbeat modules,\n    which update the control table.\n\n    Following this, the related Databricks jobs are triggered through the\n    Databricks Job API, ensuring that all dependencies are met.\n\n    This process allows the Heartbeat sensor to manage and centralize the entire\n    workflow with minimal user intervention, enhancing the sensor features by\n    centralizing management and tracking through the control table.\n\n    Args:\n        acon_path: path of the acon (algorithm configuration) file.\n        acon: acon provided directly through python code (e.g., notebooks\n            or other apps).\n        collect_engine_usage: Lakehouse usage statistics collection strategy.\n        spark_confs: optional dictionary with the spark confs to be used when collecting\n            the engine usage.\n    \"\"\"\n    acon = ConfigUtils.get_acon(acon_path, acon)\n    ExecEnv.get_or_create(\n        app_name=\"execute_heartbeat\", config=acon.get(\"exec_env\", None)\n    )\n    EngineUsageStats.store_engine_usage(\n        acon, execute_sensor_heartbeat.__name__, collect_engine_usage, spark_confs\n    )\n    return Heartbeat(acon).execute()\n\n\ndef trigger_heartbeat_sensor_jobs(\n    acon: dict,\n) -> None:\n    \"\"\"Trigger the jobs via Databricks job API.\n\n    Args:\n        acon: Heartbeat ACON containing data product configs and options.\n    \"\"\"\n    ExecEnv.get_or_create(app_name=\"trigger_heartbeat_sensor_jobs\")\n    Heartbeat(acon).heartbeat_sensor_trigger_jobs()\n\n\ndef execute_heartbeat_sensor_data_feed(\n    heartbeat_sensor_data_feed_path: str,\n    heartbeat_sensor_control_table: str,\n) -> None:\n    \"\"\"Control table data feeder.\n\n    It reads the CSV file stored in the `data` folder and\n    performs UPSERT and DELETE operations on the control table.\n\n    Args:\n        heartbeat_sensor_data_feed_path: path where CSV file is stored.\n        heartbeat_sensor_control_table: CONTROL table of Heartbeat sensor.\n    \"\"\"\n    ExecEnv.get_or_create(app_name=\"execute_heartbeat_sensor_data_feed\")\n    Heartbeat.heartbeat_sensor_control_table_data_feed(\n        heartbeat_sensor_data_feed_path, heartbeat_sensor_control_table\n    )\n\n\ndef update_heartbeat_sensor_status(\n    heartbeat_sensor_control_table: str,\n    sensor_table: str,\n    job_id: str,\n) -> None:\n    \"\"\"UPDATE heartbeat sensor status.\n\n    Update heartbeat sensor control table with COMPLETE status and\n    job_end_timestamp for the triggered job.\n    Update sensor control table with PROCESSED_NEW_DATA status and\n    status_change_timestamp for the triggered job.\n\n    Args:\n        heartbeat_sensor_control_table: Heartbeat sensor control table name.\n        sensor_table: lakehouse engine sensor table name.\n        job_id: job_id of the running job. 
It refers to trigger_job_id in Control table.\n    \"\"\"\n    ExecEnv.get_or_create(app_name=\"update_heartbeat_sensor_status\")\n    Heartbeat.update_heartbeat_sensor_completion_status(\n        heartbeat_sensor_control_table, sensor_table, job_id\n    )\n\n\ndef update_sensor_status(\n    sensor_id: str,\n    control_db_table_name: str,\n    status: str = SensorStatus.PROCESSED_NEW_DATA.value,\n    assets: List[str] = None,\n) -> None:\n    \"\"\"Update internal sensor status.\n\n    Update the sensor status in the control table,\n    it should be used to tell the system\n    that the sensor has processed all new data that was previously identified,\n    hence updating the shifted sensor status.\n    Usually used to move from `SensorStatus.ACQUIRED_NEW_DATA` to\n    `SensorStatus.PROCESSED_NEW_DATA`,\n    but there might be scenarios - still to identify -\n    where we can update the sensor status from/to different statuses.\n\n    Args:\n        sensor_id: sensor id.\n        control_db_table_name: `db.table` to store sensor checkpoints.\n        status: status of the sensor.\n        assets: a list of assets that are considered as available to\n            consume downstream after this sensor has status\n            PROCESSED_NEW_DATA.\n    \"\"\"\n    ExecEnv.get_or_create(app_name=\"update_sensor_status\")\n    SensorTerminator.update_sensor_status(\n        sensor_id=sensor_id,\n        control_db_table_name=control_db_table_name,\n        status=status,\n        assets=assets,\n    )\n\n\ndef generate_sensor_query(\n    sensor_id: str,\n    filter_exp: str = None,\n    control_db_table_name: str = None,\n    upstream_key: str = None,\n    upstream_value: str = None,\n    upstream_table_name: str = None,\n) -> str:\n    \"\"\"Generates a preprocess query to be used in a sensor configuration.\n\n    Args:\n        sensor_id: sensor id.\n        filter_exp: expression to filter incoming new data.\n            You can use the placeholder ?default_upstream_key and\n            ?default_upstream_value, so that it can be replaced by the\n            respective values in the control_db_table_name for this specific\n            sensor_id.\n        control_db_table_name: `db.table` to retrieve the last status change\n            timestamp. 
This is only relevant for the jdbc sensor.\n        upstream_key: the key of custom sensor information to control how to\n            identify new data from the upstream (e.g., a time column in the\n            upstream).\n        upstream_value: the upstream value\n            to identify new data from the upstream (e.g., the value of a time\n            present in the upstream).\n        upstream_table_name: value for custom sensor\n            to query new data from the upstream.\n            If none is provided, we will set the default value,\n            our `sensor_new_data` view.\n\n    Returns:\n        The query string.\n    \"\"\"\n    ExecEnv.get_or_create(app_name=\"generate_sensor_preprocess_query\")\n    if filter_exp:\n        return SensorUpstreamManager.generate_filter_exp_query(\n            sensor_id=sensor_id,\n            filter_exp=filter_exp,\n            control_db_table_name=control_db_table_name,\n            upstream_key=upstream_key,\n            upstream_value=upstream_value,\n            upstream_table_name=upstream_table_name,\n        )\n    else:\n        return SensorUpstreamManager.generate_sensor_table_preprocess_query(\n            sensor_id=sensor_id\n        )\n\n\ndef generate_sensor_sap_logchain_query(\n    chain_id: str,\n    dbtable: str = SAPLogchain.DBTABLE.value,\n    status: str = SAPLogchain.GREEN_STATUS.value,\n    engine_table_name: str = SAPLogchain.ENGINE_TABLE.value,\n) -> str:\n    \"\"\"Generates a sensor query based on the SAP Logchain table.\n\n    Args:\n        chain_id: chain id to query the status on SAP.\n        dbtable: `db.table` to retrieve the data to\n            check if the sap chain is already finished.\n        status: status used to check if the sap chain has already\n            finished successfully.\n        engine_table_name: table name exposed with the SAP LOGCHAIN data.\n            This table will be used in the jdbc query.\n\n    Returns:\n        The query string.\n    \"\"\"\n    ExecEnv.get_or_create(app_name=\"generate_sensor_sap_logchain_query\")\n    return SensorUpstreamManager.generate_sensor_sap_logchain_query(\n        chain_id=chain_id,\n        dbtable=dbtable,\n        status=status,\n        engine_table_name=engine_table_name,\n    )\n\n\ndef send_notification(args: dict) -> None:\n    \"\"\"Send a notification using a notifier.\n\n    Args:\n        args: arguments for the notifier.\n    \"\"\"\n    notifier = NotifierFactory.get_notifier(\n        spec=TerminatorSpec(function=\"notify\", args=args)\n    )\n\n    notifier.create_notification()\n    notifier.send_notification()\n\n\ndef execute_gab(\n    acon_path: Optional[str] = None,\n    acon: Optional[dict] = None,\n    collect_engine_usage: str = CollectEngineUsage.PROD_ONLY.value,\n    spark_confs: dict = None,\n) -> None:\n    \"\"\"Execute the gold asset builder based on a GAB Algorithm Configuration.\n\n    GAB is useful to build your gold assets with predefined functions for recurrent\n    periods.\n\n    Args:\n        acon_path: path of the acon (algorithm configuration) file.\n        acon: acon provided directly through python code (e.g., notebooks\n            or other apps).\n        collect_engine_usage: Lakehouse usage statistics collection strategy.\n        spark_confs: optional dictionary with the spark confs to be used when collecting\n            the engine usage.\n    \"\"\"\n    acon = ConfigUtils.get_acon(acon_path, acon)\n    ExecEnv.get_or_create(app_name=\"execute_gab\", config=acon.get(\"exec_env\", None))\n    
EngineUsageStats.store_engine_usage(\n        acon, execute_gab.__name__, collect_engine_usage, spark_confs\n    )\n    GAB(acon).execute()\n"
  },
  {
    "path": "lakehouse_engine/io/__init__.py",
    "content": "\"\"\"Input and Output package responsible for the behaviour of reading and writing.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/io/exceptions.py",
    "content": "\"\"\"Package defining all the io custom exceptions.\"\"\"\n\n\nclass IncrementalFilterInputNotFoundException(Exception):\n    \"\"\"Exception for when the input of an incremental filter is not found.\n\n    This may occur when tables are being loaded in incremental way, taking the increment\n    definition out of a specific table, but the table still does not exist, mainly\n    because probably it was not loaded for the first time yet.\n    \"\"\"\n\n    pass\n\n\nclass WrongIOFormatException(Exception):\n    \"\"\"Exception for when a user provides a wrong I/O format.\"\"\"\n\n    pass\n\n\nclass NotSupportedException(RuntimeError):\n    \"\"\"Exception for when a user provides a not supported operation.\"\"\"\n\n    pass\n\n\nclass InputNotFoundException(Exception):\n    \"\"\"Exception for when a user does not provide a mandatory input.\"\"\"\n\n    pass\n\n\nclass EndpointNotFoundException(Exception):\n    \"\"\"Exception for when the endpoint is not found by the Graph API.\"\"\"\n\n    pass\n\n\nclass LocalPathNotFoundException(Exception):\n    \"\"\"Exception for when a local path is not found.\"\"\"\n\n    pass\n\n\nclass WriteToLocalException(Exception):\n    \"\"\"Exception for when an error occurs when trying to write to the local path.\"\"\"\n\n    pass\n\n\nclass SharePointAPIError(Exception):\n    \"\"\"Custom exception class to handle errors Sharepoint API requests.\"\"\"\n\n    pass\n\n\nclass InvalidSharepointPathException(Exception):\n    \"\"\"Raised when folder path conflicts with file name.\n\n    Happens if both `folder_relative_path` and `file_name` are set, but the folder path\n    looks like a file path (last segment has a dot).\n    \"\"\"\n\n    pass\n"
  },
  {
    "path": "lakehouse_engine/io/reader.py",
    "content": "\"\"\"Defines abstract reader behaviour.\"\"\"\n\nfrom abc import ABC, abstractmethod\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputSpec\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Reader(ABC):\n    \"\"\"Abstract Reader class.\"\"\"\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct Reader instances.\n\n        Args:\n            input_spec: input specification for reading data.\n        \"\"\"\n        self._logger = LoggingHandler(self.__class__.__name__).get_logger()\n        self._input_spec = input_spec\n\n    @abstractmethod\n    def read(self) -> DataFrame:\n        \"\"\"Abstract read method.\n\n        Returns:\n            A dataframe read according to the input specification.\n        \"\"\"\n        raise NotImplementedError\n"
  },
  {
    "path": "lakehouse_engine/io/reader_factory.py",
    "content": "\"\"\"Module for reader factory.\"\"\"\n\nfrom abc import ABC\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import FILE_INPUT_FORMATS, InputFormat, InputSpec\nfrom lakehouse_engine.io.readers.dataframe_reader import DataFrameReader\nfrom lakehouse_engine.io.readers.file_reader import FileReader\nfrom lakehouse_engine.io.readers.jdbc_reader import JDBCReader\nfrom lakehouse_engine.io.readers.kafka_reader import KafkaReader\nfrom lakehouse_engine.io.readers.query_reader import QueryReader\nfrom lakehouse_engine.io.readers.sap_b4_reader import SAPB4Reader\nfrom lakehouse_engine.io.readers.sap_bw_reader import SAPBWReader\nfrom lakehouse_engine.io.readers.sharepoint_reader import SharepointReader\nfrom lakehouse_engine.io.readers.table_reader import TableReader\n\n\nclass ReaderFactory(ABC):  # noqa: B024\n    \"\"\"Class for reader factory.\"\"\"\n\n    @classmethod\n    def get_data(cls, spec: InputSpec) -> DataFrame:\n        \"\"\"Get data according to the input specification following a factory pattern.\n\n        Args:\n            spec: input specification to get the data.\n\n        Returns:\n            A dataframe containing the data.\n        \"\"\"\n        if spec.db_table:\n            read_df = TableReader(input_spec=spec).read()\n        elif spec.data_format == InputFormat.JDBC.value:\n            read_df = JDBCReader(input_spec=spec).read()\n        elif spec.data_format in FILE_INPUT_FORMATS:\n            read_df = FileReader(input_spec=spec).read()\n        elif spec.data_format == InputFormat.KAFKA.value:\n            read_df = KafkaReader(input_spec=spec).read()\n        elif spec.data_format == InputFormat.SQL.value:\n            read_df = QueryReader(input_spec=spec).read()\n        elif spec.data_format == InputFormat.SAP_BW.value:\n            read_df = SAPBWReader(input_spec=spec).read()\n        elif spec.data_format == InputFormat.SAP_B4.value:\n            read_df = SAPB4Reader(input_spec=spec).read()\n        elif spec.data_format == InputFormat.DATAFRAME.value:\n            read_df = DataFrameReader(input_spec=spec).read()\n        elif spec.data_format == InputFormat.SFTP.value:\n            from lakehouse_engine.io.readers.sftp_reader import SFTPReader\n\n            read_df = SFTPReader(input_spec=spec).read()\n            return SFTPReader(input_spec=spec).read()\n        elif spec.data_format == InputFormat.SHAREPOINT.value:\n            return SharepointReader(input_spec=spec).read()\n        else:\n            raise NotImplementedError(\n                f\"The requested input spec format {spec.data_format} is not supported.\"\n            )\n\n        if spec.temp_view:\n            read_df.createOrReplaceTempView(spec.temp_view)\n\n        return read_df\n"
  },
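  {
    "path": "examples/reader_factory_sketch.py",
    "content": "\"\"\"Hypothetical sketch (not part of the engine): routing a spec through ReaderFactory.\n\nShows the dispatch performed by ReaderFactory.get_data: a db_table goes to\nTableReader, file formats go to FileReader, and so on, with temp_view registering\nthe result as a temporary view after a successful read. The InputSpec keyword\narguments below are assumptions inferred from the attributes the factory inspects;\nsee lakehouse_engine/core/definitions.py for the authoritative signature. Running\nthis requires an initialised ExecEnv.SESSION.\n\"\"\"\n\nfrom lakehouse_engine.core.definitions import InputSpec\nfrom lakehouse_engine.io.reader_factory import ReaderFactory\n\n# A batch CSV read: a data_format in FILE_INPUT_FORMATS sends the spec to FileReader.\nfile_spec = InputSpec(\n    spec_id=\"sales_raw\",  # assumed field name\n    read_type=\"batch\",\n    data_format=\"csv\",\n    location=\"s3://my-bucket/sales/\",  # hypothetical location\n    options={\"header\": \"true\"},\n    temp_view=\"sales_raw\",\n)\n\nsales_df = ReaderFactory.get_data(file_spec)\nsales_df.printSchema()\n"
  },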
  {
    "path": "lakehouse_engine/io/readers/__init__.py",
    "content": "\"\"\"Readers package to define reading behaviour.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/io/readers/dataframe_reader.py",
    "content": "\"\"\"Module to define behaviour to read from dataframes.\"\"\"\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputSpec\nfrom lakehouse_engine.io.reader import Reader\n\n\nclass DataFrameReader(Reader):\n    \"\"\"Class to read data from a dataframe.\"\"\"\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct DataFrameReader instances.\n\n        Args:\n            input_spec: input specification.\n        \"\"\"\n        super().__init__(input_spec)\n\n    def read(self) -> DataFrame:\n        \"\"\"Read data from a dataframe.\n\n        Returns:\n            A dataframe containing the data from a dataframe previously\n            computed.\n        \"\"\"\n        return self._input_spec.df_name\n"
  },
  {
    "path": "lakehouse_engine/io/readers/file_reader.py",
    "content": "\"\"\"Module to define behaviour to read from files.\"\"\"\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import FILE_INPUT_FORMATS, InputSpec, ReadType\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader import Reader\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\n\n\nclass FileReader(Reader):\n    \"\"\"Class to read from files.\"\"\"\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct FileReader instances.\n\n        Args:\n            input_spec: input specification.\n        \"\"\"\n        super().__init__(input_spec)\n\n    def read(self) -> DataFrame:\n        \"\"\"Read file data.\n\n        Returns:\n            A dataframe containing the data from the files.\n        \"\"\"\n        if (\n            self._input_spec.read_type == ReadType.BATCH.value\n            and self._input_spec.data_format in FILE_INPUT_FORMATS\n        ):\n            df = ExecEnv.SESSION.read.load(\n                path=self._input_spec.location,\n                format=self._input_spec.data_format,\n                schema=SchemaUtils.from_input_spec(self._input_spec),\n                **self._input_spec.options if self._input_spec.options else {},\n            )\n\n            if self._input_spec.with_filepath:\n                # _metadata contains hidden columns\n                df = df.selectExpr(\n                    \"*\", \"_metadata.file_path as lhe_extraction_filepath\"\n                )\n\n            return df\n        elif (\n            self._input_spec.read_type == ReadType.STREAMING.value\n            and self._input_spec.data_format in FILE_INPUT_FORMATS\n        ):\n            df = ExecEnv.SESSION.readStream.load(\n                path=self._input_spec.location,\n                format=self._input_spec.data_format,\n                schema=SchemaUtils.from_input_spec(self._input_spec),\n                **self._input_spec.options if self._input_spec.options else {},\n            )\n\n            if self._input_spec.with_filepath:\n                # _metadata contains hidden columns\n                df = df.selectExpr(\n                    \"*\", \"_metadata.file_path as lhe_extraction_filepath\"\n                )\n\n            return df\n        else:\n            raise NotImplementedError(\n                \"The requested read type and format combination is not supported.\"\n            )\n"
  },
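  {
    "path": "examples/file_reader_sketch.py",
    "content": "\"\"\"Hypothetical sketch (not part of the engine): the Spark calls behind FileReader.\n\nStandalone pyspark illustration of FileReader.read: a batch load and a streaming\nload with the same call shape, plus the _metadata.file_path projection used when\nwith_filepath is enabled. The /tmp path is a placeholder staged by the sketch\nitself so it can run locally.\n\"\"\"\n\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.master(\"local[1]\").appName(\"file_reader_sketch\").getOrCreate()\n\n# Stage a tiny CSV dataset so both branches have something to read.\nspark.createDataFrame([(1, \"a\"), (2, \"b\")], [\"id\", \"val\"]).write.mode(\"overwrite\").csv(\n    \"/tmp/file_reader_sketch\", header=True\n)\n\n# Batch branch: read.load with explicit path, format and reader options.\nbatch_df = spark.read.load(path=\"/tmp/file_reader_sketch\", format=\"csv\", header=True)\n\n# with_filepath=True adds the hidden _metadata.file_path column under a stable alias.\nbatch_df.selectExpr(\"*\", \"_metadata.file_path as lhe_extraction_filepath\").show(truncate=False)\n\n# Streaming branch: readStream.load needs an explicit schema for file sources.\nstream_df = spark.readStream.load(\n    path=\"/tmp/file_reader_sketch\", format=\"csv\", schema=batch_df.schema, header=True\n)\n"
  },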
  {
    "path": "lakehouse_engine/io/readers/jdbc_reader.py",
    "content": "\"\"\"Module to define behaviour to read from JDBC sources.\"\"\"\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputFormat, InputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader import Reader\nfrom lakehouse_engine.transformers.exceptions import WrongArgumentsException\nfrom lakehouse_engine.utils.extraction.jdbc_extraction_utils import (\n    JDBCExtraction,\n    JDBCExtractionUtils,\n)\n\n\nclass JDBCReader(Reader):\n    \"\"\"Class to read from JDBC source.\"\"\"\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct JDBCReader instances.\n\n        Args:\n            input_spec: input specification.\n        \"\"\"\n        super().__init__(input_spec)\n\n    def read(self) -> DataFrame:\n        \"\"\"Read data from JDBC source.\n\n        Returns:\n            A dataframe containing the data from the JDBC source.\n        \"\"\"\n        if (\n            self._input_spec.options is not None\n            and self._input_spec.options.get(\"predicates\", None) is not None\n        ):\n            raise WrongArgumentsException(\"Predicates can only be used with jdbc_args.\")\n\n        options = self._input_spec.options if self._input_spec.options else {}\n        if self._input_spec.calculate_upper_bound:\n            jdbc_util = JDBCExtractionUtils(\n                JDBCExtraction(\n                    user=options[\"user\"],\n                    password=options[\"password\"],\n                    url=options[\"url\"],\n                    dbtable=options[\"dbtable\"],\n                    extraction_type=options.get(\n                        \"extraction_type\", JDBCExtraction.extraction_type\n                    ),\n                    partition_column=options[\"partitionColumn\"],\n                    calc_upper_bound_schema=self._input_spec.calc_upper_bound_schema,\n                    default_upper_bound=options.get(\n                        \"default_upper_bound\", JDBCExtraction.default_upper_bound\n                    ),\n                )\n            )  # type: ignore\n            options[\"upperBound\"] = jdbc_util.get_spark_jdbc_optimal_upper_bound()\n\n        if self._input_spec.jdbc_args:\n            return ExecEnv.SESSION.read.options(**options).jdbc(\n                **self._input_spec.jdbc_args\n            )\n        else:\n            return (\n                ExecEnv.SESSION.read.format(InputFormat.JDBC.value)\n                .options(**options)\n                .load()\n            )\n"
  },
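  {
    "path": "examples/jdbc_reader_sketch.py",
    "content": "\"\"\"Hypothetical sketch (not part of the engine): the Spark JDBC reads behind JDBCReader.\n\nStandalone illustration of the two paths in JDBCReader.read: a plain options-based\nload with the jdbc format, and jdbc_args forwarded to DataFrameReader.jdbc (e.g.\nwith explicit predicates). The URL, table and credentials are placeholders; running\nthis requires a reachable database and its JDBC driver on the Spark classpath.\n\"\"\"\n\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.master(\"local[1]\").appName(\"jdbc_reader_sketch\").getOrCreate()\n\noptions = {\n    \"url\": \"jdbc:postgresql://dbhost:5432/shop\",  # placeholder\n    \"dbtable\": \"public.orders\",  # placeholder\n    \"user\": \"reader\",  # placeholder\n    \"password\": \"secret\",  # placeholder\n    # Parallel read: Spark splits dbtable into numPartitions ranges over partitionColumn.\n    # The engine's calculate_upper_bound option sizes the upperBound value automatically.\n    \"partitionColumn\": \"order_id\",\n    \"lowerBound\": \"1\",\n    \"upperBound\": \"1000000\",\n    \"numPartitions\": \"8\",\n}\n\n# Path 1: generic options-based load (used when no jdbc_args are given).\norders_df = spark.read.format(\"jdbc\").options(**options).load()\n\n# Path 2: jdbc_args forwarded to DataFrameReader.jdbc, here with explicit predicates\n# (the engine only allows predicates through jdbc_args).\norders_by_pred = spark.read.jdbc(\n    url=options[\"url\"],\n    table=options[\"dbtable\"],\n    predicates=[\"order_id < 500000\", \"order_id >= 500000\"],\n    properties={\"user\": options[\"user\"], \"password\": options[\"password\"]},\n)\n"
  },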
  {
    "path": "lakehouse_engine/io/readers/kafka_reader.py",
    "content": "\"\"\"Module to define behaviour to read from Kafka.\"\"\"\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputFormat, InputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader import Reader\n\n\nclass KafkaReader(Reader):\n    \"\"\"Class to read from Kafka.\"\"\"\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct KafkaReader instances.\n\n        Args:\n            input_spec: input specification.\n        \"\"\"\n        super().__init__(input_spec)\n\n    def read(self) -> DataFrame:\n        \"\"\"Read Kafka data.\n\n        Returns:\n            A dataframe containing the data from Kafka.\n        \"\"\"\n        df = ExecEnv.SESSION.readStream.load(\n            format=InputFormat.KAFKA.value,\n            **self._input_spec.options if self._input_spec.options else {},\n        )\n\n        return df\n"
  },
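  {
    "path": "examples/kafka_reader_sketch.py",
    "content": "\"\"\"Hypothetical sketch (not part of the engine): the streaming load behind KafkaReader.\n\nStandalone illustration of KafkaReader.read, which forwards the input spec options\nto readStream.load with the kafka format. The broker address and topic are\nplaceholders; running this requires a reachable Kafka cluster and the\nspark-sql-kafka connector package.\n\"\"\"\n\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.master(\"local[1]\").appName(\"kafka_reader_sketch\").getOrCreate()\n\nkafka_options = {\n    \"kafka.bootstrap.servers\": \"broker1:9092\",  # placeholder\n    \"subscribe\": \"sales_events\",  # placeholder topic\n    \"startingOffsets\": \"earliest\",\n}\n\nevents_df = spark.readStream.load(format=\"kafka\", **kafka_options)\n\n# Kafka exposes key/value as binary; casting to string is the usual first step.\ndecoded_df = events_df.selectExpr(\"CAST(key AS STRING)\", \"CAST(value AS STRING)\")\n"
  },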
  {
    "path": "lakehouse_engine/io/readers/query_reader.py",
    "content": "\"\"\"Module to define behaviour to read from a query.\"\"\"\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader import Reader\n\n\nclass QueryReader(Reader):\n    \"\"\"Class to read data from a query.\"\"\"\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct QueryReader instances.\n\n        Args:\n            input_spec: input specification.\n        \"\"\"\n        super().__init__(input_spec)\n\n    def read(self) -> DataFrame:\n        \"\"\"Read data from a query.\n\n        Returns:\n            A dataframe containing the data from the query.\n        \"\"\"\n        return ExecEnv.SESSION.sql(self._input_spec.query)\n"
  },
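  {
    "path": "examples/query_reader_sketch.py",
    "content": "\"\"\"Hypothetical sketch (not part of the engine): how temp_view and QueryReader combine.\n\nStandalone illustration of the sql input format: a previous read registered as a\ntemporary view (the temp_view handling in ReaderFactory.get_data) can be consumed\nby a later spec whose query is executed through ExecEnv.SESSION.sql, shown here\nwith a plain SparkSession.\n\"\"\"\n\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.master(\"local[1]\").appName(\"query_reader_sketch\").getOrCreate()\n\n# Equivalent of a prior spec read with temp_view=\"sales_raw\".\nspark.createDataFrame([(1, 10.0), (2, 5.5)], [\"id\", \"amount\"]).createOrReplaceTempView(\n    \"sales_raw\"\n)\n\n# Equivalent of QueryReader.read for a spec with data_format=\"sql\".\ntotals_df = spark.sql(\"SELECT COUNT(*) AS row_count, SUM(amount) AS total FROM sales_raw\")\ntotals_df.show()\n"
  },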
  {
    "path": "lakehouse_engine/io/readers/sap_b4_reader.py",
    "content": "\"\"\"Module to define behaviour to read from SAP B4 sources.\"\"\"\n\nfrom logging import Logger\nfrom typing import Tuple\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader import Reader\nfrom lakehouse_engine.utils.extraction.sap_b4_extraction_utils import (\n    ADSOTypes,\n    SAPB4Extraction,\n    SAPB4ExtractionUtils,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass SAPB4Reader(Reader):\n    \"\"\"Class to read from SAP B4 source.\"\"\"\n\n    _LOGGER: Logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct SAPB4Reader instances.\n\n        Args:\n            input_spec: input specification.\n        \"\"\"\n        super().__init__(input_spec)\n        self.jdbc_utils = self._get_jdbc_utils()\n\n    def read(self) -> DataFrame:\n        \"\"\"Read data from SAP B4 source.\n\n        Returns:\n            A dataframe containing the data from the SAP B4 source.\n        \"\"\"\n        options_args, jdbc_args = self._get_options()\n        return ExecEnv.SESSION.read.options(**options_args).jdbc(**jdbc_args)\n\n    def _get_jdbc_utils(self) -> SAPB4ExtractionUtils:\n        jdbc_extraction = SAPB4Extraction(\n            user=self._input_spec.options[\"user\"],\n            password=self._input_spec.options[\"password\"],\n            url=self._input_spec.options[\"url\"],\n            dbtable=self._input_spec.options[\"dbtable\"],\n            adso_type=self._input_spec.options[\"adso_type\"],\n            request_status_tbl=self._input_spec.options.get(\n                \"request_status_tbl\", SAPB4Extraction.request_status_tbl\n            ),\n            changelog_table=self._input_spec.options.get(\n                \"changelog_table\",\n                (\n                    self._input_spec.options[\"dbtable\"]\n                    if self._input_spec.options[\"adso_type\"] == ADSOTypes.AQ.value\n                    else self._input_spec.options[\"changelog_table\"]\n                ),\n            ),\n            data_target=SAPB4ExtractionUtils.get_data_target(self._input_spec.options),\n            act_req_join_condition=self._input_spec.options.get(\n                \"act_req_join_condition\", SAPB4Extraction.act_req_join_condition\n            ),\n            latest_timestamp_data_location=self._input_spec.options.get(\n                \"latest_timestamp_data_location\",\n                SAPB4Extraction.latest_timestamp_data_location,\n            ),\n            latest_timestamp_input_col=self._input_spec.options.get(\n                \"latest_timestamp_input_col\",\n                SAPB4Extraction.latest_timestamp_input_col,\n            ),\n            latest_timestamp_data_format=self._input_spec.options.get(\n                \"latest_timestamp_data_format\",\n                SAPB4Extraction.latest_timestamp_data_format,\n            ),\n            extraction_type=self._input_spec.options.get(\n                \"extraction_type\", SAPB4Extraction.extraction_type\n            ),\n            driver=self._input_spec.options.get(\"driver\", SAPB4Extraction.driver),\n            num_partitions=self._input_spec.options.get(\n                \"numPartitions\", SAPB4Extraction.num_partitions\n            ),\n            partition_column=self._input_spec.options.get(\n                \"partitionColumn\", 
SAPB4Extraction.partition_column\n            ),\n            lower_bound=self._input_spec.options.get(\n                \"lowerBound\", SAPB4Extraction.lower_bound\n            ),\n            upper_bound=self._input_spec.options.get(\n                \"upperBound\", SAPB4Extraction.upper_bound\n            ),\n            default_upper_bound=self._input_spec.options.get(\n                \"default_upper_bound\", SAPB4Extraction.default_upper_bound\n            ),\n            fetch_size=self._input_spec.options.get(\n                \"fetchSize\", SAPB4Extraction.fetch_size\n            ),\n            compress=self._input_spec.options.get(\"compress\", SAPB4Extraction.compress),\n            custom_schema=self._input_spec.options.get(\n                \"customSchema\", SAPB4Extraction.custom_schema\n            ),\n            extraction_timestamp=self._input_spec.options.get(\n                \"extraction_timestamp\",\n                SAPB4Extraction.extraction_timestamp,\n            ),\n            min_timestamp=self._input_spec.options.get(\n                \"min_timestamp\", SAPB4Extraction.min_timestamp\n            ),\n            max_timestamp=self._input_spec.options.get(\n                \"max_timestamp\", SAPB4Extraction.max_timestamp\n            ),\n            default_max_timestamp=self._input_spec.options.get(\n                \"default_max_timestamp\", SAPB4Extraction.default_max_timestamp\n            ),\n            default_min_timestamp=self._input_spec.options.get(\n                \"default_min_timestamp\", SAPB4Extraction.default_min_timestamp\n            ),\n            max_timestamp_custom_schema=self._input_spec.options.get(\n                \"max_timestamp_custom_schema\",\n                SAPB4Extraction.max_timestamp_custom_schema,\n            ),\n            generate_predicates=self._input_spec.generate_predicates,\n            predicates=self._input_spec.options.get(\n                \"predicates\", SAPB4Extraction.predicates\n            ),\n            predicates_add_null=self._input_spec.predicates_add_null,\n            extra_cols_req_status_tbl=self._input_spec.options.get(\n                \"extra_cols_req_status_tbl\", SAPB4Extraction.extra_cols_req_status_tbl\n            ),\n            calc_upper_bound_schema=self._input_spec.calc_upper_bound_schema,\n            include_changelog_tech_cols=self._input_spec.options.get(\n                \"include_changelog_tech_cols\",\n                (\n                    False\n                    if self._input_spec.options[\"adso_type\"] == ADSOTypes.AQ.value\n                    else True\n                ),\n            ),\n        )\n        return SAPB4ExtractionUtils(jdbc_extraction)\n\n    def _get_options(self) -> Tuple[dict, dict]:\n        \"\"\"Get Spark Options using JDBC utilities.\n\n        Returns:\n            A tuple dict containing the options args and\n            jdbc args to be passed to Spark.\n        \"\"\"\n        self._LOGGER.info(\n            f\"Initial options passed to the SAP B4 Reader: {self._input_spec.options}\"\n        )\n\n        options_args, jdbc_args = self.jdbc_utils.get_spark_jdbc_options()\n\n        if self._input_spec.generate_predicates or self._input_spec.options.get(\n            \"predicates\", None\n        ):\n            options_args.update(\n                self.jdbc_utils.get_additional_spark_options(\n                    self._input_spec,\n                    options_args,\n                    [\"partitionColumn\", \"numPartitions\", \"lowerBound\", 
\"upperBound\"],\n                )\n            )\n        else:\n            if self._input_spec.calculate_upper_bound:\n                options_args[\"upperBound\"] = (\n                    self.jdbc_utils.get_spark_jdbc_optimal_upper_bound()\n                )\n\n            options_args.update(\n                self.jdbc_utils.get_additional_spark_options(\n                    self._input_spec, options_args\n                )\n            )\n\n        self._LOGGER.info(\n            f\"Final options to fill SAP B4 Reader Options: {options_args}\"\n        )\n        self._LOGGER.info(f\"Final jdbc args to fill SAP B4 Reader JDBC: {jdbc_args}\")\n        return options_args, jdbc_args\n"
  },
  {
    "path": "lakehouse_engine/io/readers/sap_bw_reader.py",
    "content": "\"\"\"Module to define behaviour to read from SAP BW sources.\"\"\"\n\nfrom logging import Logger\nfrom typing import Tuple\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader import Reader\nfrom lakehouse_engine.utils.extraction.sap_bw_extraction_utils import (\n    SAPBWExtraction,\n    SAPBWExtractionUtils,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass SAPBWReader(Reader):\n    \"\"\"Class to read from SAP BW source.\"\"\"\n\n    _LOGGER: Logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct SAPBWReader instances.\n\n        Args:\n            input_spec: input specification.\n        \"\"\"\n        super().__init__(input_spec)\n        self.jdbc_utils = self._get_jdbc_utils()\n\n    def read(self) -> DataFrame:\n        \"\"\"Read data from SAP BW source.\n\n        Returns:\n            A dataframe containing the data from the SAP BW source.\n        \"\"\"\n        options_args, jdbc_args = self._get_options()\n        return ExecEnv.SESSION.read.options(**options_args).jdbc(**jdbc_args)\n\n    def _get_jdbc_utils(self) -> SAPBWExtractionUtils:\n        jdbc_extraction = SAPBWExtraction(\n            user=self._input_spec.options[\"user\"],\n            password=self._input_spec.options[\"password\"],\n            url=self._input_spec.options[\"url\"],\n            dbtable=self._input_spec.options[\"dbtable\"],\n            latest_timestamp_data_location=self._input_spec.options.get(\n                \"latest_timestamp_data_location\",\n                SAPBWExtraction.latest_timestamp_data_location,\n            ),\n            latest_timestamp_input_col=self._input_spec.options.get(\n                \"latest_timestamp_input_col\", SAPBWExtraction.latest_timestamp_input_col\n            ),\n            latest_timestamp_data_format=self._input_spec.options.get(\n                \"latest_timestamp_data_format\",\n                SAPBWExtraction.latest_timestamp_data_format,\n            ),\n            extraction_type=self._input_spec.options.get(\n                \"extraction_type\", SAPBWExtraction.extraction_type\n            ),\n            act_request_table=self._input_spec.options.get(\n                \"act_request_table\", SAPBWExtraction.act_request_table\n            ),\n            request_col_name=self._input_spec.options.get(\n                \"request_col_name\", SAPBWExtraction.request_col_name\n            ),\n            act_req_join_condition=self._input_spec.options.get(\n                \"act_req_join_condition\", SAPBWExtraction.act_req_join_condition\n            ),\n            driver=self._input_spec.options.get(\"driver\", SAPBWExtraction.driver),\n            changelog_table=self._input_spec.options.get(\n                \"changelog_table\", SAPBWExtraction.changelog_table\n            ),\n            num_partitions=self._input_spec.options.get(\n                \"numPartitions\", SAPBWExtraction.num_partitions\n            ),\n            partition_column=self._input_spec.options.get(\n                \"partitionColumn\", SAPBWExtraction.partition_column\n            ),\n            lower_bound=self._input_spec.options.get(\n                \"lowerBound\", SAPBWExtraction.lower_bound\n            ),\n            upper_bound=self._input_spec.options.get(\n                \"upperBound\", SAPBWExtraction.upper_bound\n 
           ),\n            default_upper_bound=self._input_spec.options.get(\n                \"default_upper_bound\", SAPBWExtraction.default_upper_bound\n            ),\n            fetch_size=self._input_spec.options.get(\n                \"fetchSize\", SAPBWExtraction.fetch_size\n            ),\n            compress=self._input_spec.options.get(\"compress\", SAPBWExtraction.compress),\n            custom_schema=self._input_spec.options.get(\n                \"customSchema\", SAPBWExtraction.custom_schema\n            ),\n            extraction_timestamp=self._input_spec.options.get(\n                \"extraction_timestamp\",\n                SAPBWExtraction.extraction_timestamp,\n            ),\n            odsobject=self._input_spec.options.get(\n                \"odsobject\",\n                SAPBWExtractionUtils.get_odsobject(self._input_spec.options),\n            ),\n            min_timestamp=self._input_spec.options.get(\n                \"min_timestamp\", SAPBWExtraction.min_timestamp\n            ),\n            max_timestamp=self._input_spec.options.get(\n                \"max_timestamp\", SAPBWExtraction.max_timestamp\n            ),\n            default_max_timestamp=self._input_spec.options.get(\n                \"default_max_timestamp\", SAPBWExtraction.default_max_timestamp\n            ),\n            default_min_timestamp=self._input_spec.options.get(\n                \"default_min_timestamp\", SAPBWExtraction.default_min_timestamp\n            ),\n            max_timestamp_custom_schema=self._input_spec.options.get(\n                \"max_timestamp_custom_schema\",\n                SAPBWExtraction.max_timestamp_custom_schema,\n            ),\n            include_changelog_tech_cols=self._input_spec.options.get(\n                \"include_changelog_tech_cols\",\n                SAPBWExtraction.include_changelog_tech_cols,\n            ),\n            generate_predicates=self._input_spec.generate_predicates,\n            predicates=self._input_spec.options.get(\n                \"predicates\", SAPBWExtraction.predicates\n            ),\n            predicates_add_null=self._input_spec.predicates_add_null,\n            extra_cols_act_request=self._input_spec.options.get(\n                \"extra_cols_act_request\", SAPBWExtraction.extra_cols_act_request\n            ),\n            get_timestamp_from_act_request=self._input_spec.options.get(\n                \"get_timestamp_from_act_request\",\n                SAPBWExtraction.get_timestamp_from_act_request,\n            ),\n            calc_upper_bound_schema=self._input_spec.calc_upper_bound_schema,\n            sap_bw_schema=self._input_spec.options.get(\n                \"sap_bw_schema\", SAPBWExtraction.sap_bw_schema\n            ),\n            ods_prefix=self._input_spec.options.get(\n                \"ods_prefix\", SAPBWExtraction.ods_prefix\n            ),\n            logsys=self._input_spec.options.get(\"logsys\", SAPBWExtraction.logsys),\n        )\n        return SAPBWExtractionUtils(jdbc_extraction)\n\n    def _get_options(self) -> Tuple[dict, dict]:\n        \"\"\"Get Spark Options using JDBC utilities.\n\n        Returns:\n            A tuple dict containing the options args and\n            jdbc args to be passed to Spark.\n        \"\"\"\n        self._LOGGER.info(\n            f\"Initial options passed to the SAP BW Reader: {self._input_spec.options}\"\n        )\n\n        options_args, jdbc_args = self.jdbc_utils.get_spark_jdbc_options()\n\n        if self._input_spec.generate_predicates or 
self._input_spec.options.get(\n            \"predicates\", None\n        ):\n            options_args.update(\n                self.jdbc_utils.get_additional_spark_options(\n                    self._input_spec,\n                    options_args,\n                    [\"partitionColumn\", \"numPartitions\", \"lowerBound\", \"upperBound\"],\n                )\n            )\n        else:\n            if self._input_spec.calculate_upper_bound:\n                options_args[\"upperBound\"] = (\n                    self.jdbc_utils.get_spark_jdbc_optimal_upper_bound()\n                )\n\n            options_args.update(\n                self.jdbc_utils.get_additional_spark_options(\n                    self._input_spec, options_args\n                )\n            )\n\n        self._LOGGER.info(\n            f\"Final options to fill SAP BW Reader Options: {options_args}\"\n        )\n        self._LOGGER.info(f\"Final jdbc args to fill SAP BW Reader JDBC: {jdbc_args}\")\n        return options_args, jdbc_args\n"
  },
  {
    "path": "lakehouse_engine/io/readers/sftp_reader.py",
    "content": "\"\"\"Module to define behaviour to read from SFTP.\"\"\"\n\nimport gzip\nfrom datetime import datetime\nfrom io import TextIOWrapper\nfrom logging import Logger\nfrom typing import List\nfrom zipfile import ZipFile\n\nimport pandas as pd\nfrom pandas import DataFrame as PandasDataFrame\nfrom pandas.errors import EmptyDataError\nfrom paramiko.sftp_file import SFTPFile\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputSpec, ReadType\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader import Reader\nfrom lakehouse_engine.utils.extraction.sftp_extraction_utils import SFTPExtractionUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass SFTPReader(Reader):\n    \"\"\"Class to read from SFTP.\"\"\"\n\n    _logger: Logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct SFTPReader instances.\n\n        Args:\n            input_spec: input specification.\n        \"\"\"\n        super().__init__(input_spec)\n\n    def read(self) -> DataFrame:\n        \"\"\"Read SFTP data.\n\n        Returns:\n            A dataframe containing the data from SFTP.\n        \"\"\"\n        if self._input_spec.read_type == ReadType.BATCH.value:\n            options_args = self._input_spec.options if self._input_spec.options else {}\n\n            sftp_files_format = SFTPExtractionUtils.validate_format(\n                self._input_spec.sftp_files_format.lower()\n            )\n\n            location = SFTPExtractionUtils.validate_location(self._input_spec.location)\n\n            sftp, transport = SFTPExtractionUtils.get_sftp_client(options_args)\n\n            files_list = SFTPExtractionUtils.get_files_list(\n                sftp, location, options_args\n            )\n\n            dfs: List[PandasDataFrame] = []\n            try:\n                for filename in files_list:\n                    with sftp.open(filename, \"r\") as sftp_file:\n                        try:\n                            pdf = self._read_files(\n                                filename,\n                                sftp_file,\n                                options_args.get(\"args\", {}),\n                                sftp_files_format,\n                            )\n                            if options_args.get(\"file_metadata\", None):\n                                pdf[\"filename\"] = filename\n                                pdf[\"modification_time\"] = datetime.fromtimestamp(\n                                    sftp.stat(filename).st_mtime\n                                )\n                            self._append_files(pdf, dfs)\n                        except EmptyDataError:\n                            self._logger.info(f\"{filename} - Empty or malformed file.\")\n                if dfs:\n                    df = ExecEnv.SESSION.createDataFrame(pd.concat(dfs))\n                else:\n                    raise ValueError(\n                        \"No files were found with the specified parameters.\"\n                    )\n            finally:\n                sftp.close()\n                transport.close()\n        else:\n            raise NotImplementedError(\n                \"The requested read type supports only BATCH mode.\"\n            )\n        return df\n\n    @classmethod\n    def _append_files(cls, pdf: PandasDataFrame, dfs: List) -> List:\n        \"\"\"Append to the list dataframes with data.\n\n        Args:\n          
  pdf: a Pandas dataframe containing data from files.\n            dfs: a list of Pandas dataframes.\n\n        Returns:\n            A list of not empty Pandas dataframes.\n        \"\"\"\n        if not pdf.empty:\n            dfs.append(pdf)\n        return dfs\n\n    @classmethod\n    def _read_files(\n        cls, filename: str, sftp_file: SFTPFile, option_args: dict, files_format: str\n    ) -> PandasDataFrame:\n        \"\"\"Open and decompress files to be extracted from SFTP.\n\n        For zip files, to avoid data type inferred issues\n        during the iteration, all data will be read as string.\n        Also, empty dataframes will NOT be considered to be processed.\n        For the not considered ones, the file names will be logged.\n\n        Args:\n            filename: the filename to be read.\n            sftp_file: SFTPFile object representing the open file.\n            option_args: options from the acon.\n            files_format: a string containing the file extension.\n\n        Returns:\n            A pandas dataframe with data from the file.\n        \"\"\"\n        reader = getattr(pd, f\"read_{files_format}\")\n\n        if filename.endswith(\".gz\"):\n            with gzip.GzipFile(fileobj=sftp_file, mode=\"rb\") as gz_file:\n                pdf = reader(\n                    TextIOWrapper(gz_file),  # type: ignore\n                    **option_args,\n                )\n        elif filename.endswith(\".zip\"):\n            with ZipFile(sftp_file, \"r\") as zf:  # type: ignore\n                dfs = [\n                    reader(TextIOWrapper(zf.open(f)), **option_args).fillna(\"\")\n                    for f in zf.namelist()\n                ]\n                if not pd.concat(dfs, ignore_index=True).empty:\n                    pdf = pd.concat(dfs, ignore_index=True).astype(str)\n                else:\n                    pdf = pd.DataFrame()\n                    cls._logger.info(f\"{filename} - Empty or malformed file.\")\n        else:\n            pdf = reader(\n                sftp_file,\n                **option_args,\n            )\n        return pdf\n"
  },
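  {
    "path": "examples/sftp_gzip_sketch.py",
    "content": "\"\"\"Hypothetical sketch (not part of the engine): decompression the way SFTPReader does it.\n\nStandalone illustration of the .gz branch in SFTPReader._read_files: the remote\nfile object is wrapped in GzipFile and TextIOWrapper, handed to the matching\npandas read_<format> function, and the non-empty pandas frame is converted into a\nSpark dataframe. An in-memory BytesIO stands in for the paramiko SFTPFile.\n\"\"\"\n\nimport gzip\nfrom io import BytesIO, TextIOWrapper\n\nimport pandas as pd\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder.master(\"local[1]\").appName(\"sftp_gzip_sketch\").getOrCreate()\n\n# Build a gzipped CSV in memory to stand in for a file opened over SFTP.\nbuffer = BytesIO()\nwith gzip.GzipFile(fileobj=buffer, mode=\"wb\") as gz:\n    gz.write(b\"id,val\\n1,a\\n2,b\\n\")\nbuffer.seek(0)\n\n# Same shape as the .gz branch: reader = getattr(pd, \"read_csv\").\nwith gzip.GzipFile(fileobj=buffer, mode=\"rb\") as gz_file:\n    pdf = pd.read_csv(TextIOWrapper(gz_file))\n\n# Non-empty pandas frames are collected and concatenated before createDataFrame.\ndf = spark.createDataFrame(pdf)\ndf.show()\n"
  },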
  {
    "path": "lakehouse_engine/io/readers/sharepoint_reader.py",
    "content": "\"\"\"Module to define the behaviour to read from Sharepoint.\"\"\"\n\nimport csv\nimport fnmatch\nimport time\nfrom functools import reduce\nfrom pathlib import Path\nfrom typing import Optional\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.types import StructType\n\nfrom lakehouse_engine.core.definitions import InputSpec, SharepointFile\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.exceptions import (\n    InvalidSharepointPathException,\n    NotSupportedException,\n)\nfrom lakehouse_engine.io.reader import Reader\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.sharepoint_utils import SharepointUtils\n\n_LOGGER = LoggingHandler(__name__).get_logger()\n\n\nclass SharepointReader(Reader):\n    \"\"\"Reader implementation for Sharepoint files.\"\"\"\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct SharepointReader instance.\n\n        Args:\n            input_spec: InputSpec with Sharepoint parameters.\n        \"\"\"\n        super().__init__(input_spec)\n        self.opts = self._input_spec.sharepoint_opts\n        self.sharepoint_utils = self._get_sharepoint_utils()\n\n        if self.opts.file_name and self.opts.folder_relative_path:\n            folder_name = Path(self.opts.folder_relative_path).name\n            if \".\" in folder_name:\n                raise InvalidSharepointPathException(\n                    f\"Invalid path setup: `folder_relative_path` \"\n                    f\"('{self.opts.folder_relative_path}') appears to include a file, \"\n                    f\"but `file_name` ('{self.opts.file_name}') was also provided. \"\n                    f\"Provide either a folder+file_name, or a full file path not both.\"\n                )\n            _LOGGER.warning(\n                \"Using `file_name` with a folder path. \"\n                \"This will read only one file. \"\n                \"To read all files in the folder, set `file_name` to None.\"\n            )\n            self.file_path = f\"{self.opts.folder_relative_path}/{self.opts.file_name}\"\n        elif (\n            self.opts.folder_relative_path\n            and \".\" in Path(self.opts.folder_relative_path).name\n        ):\n            self.file_path = self.opts.folder_relative_path  # full path with extension\n        else:\n            self.file_path = self.opts.folder_relative_path\n\n        if self.opts.file_name and self.opts.file_pattern:\n            _LOGGER.warning(\n                \"`file_name` is provided. 
`file_pattern` will be ignored and only the \"\n                \"specified file will be read.\"\n            )\n\n        self.pattern = self.opts.file_pattern  # may be None\n\n        # Compute archive base folder from final self.file_path\n        archive_base_folder = None\n        if self.file_path:\n            p = Path(self.file_path)\n            archive_base_folder = str(p.parent) if p.suffix else str(p)\n\n        # Set archive folders\n        self.success_folder = (\n            f\"{archive_base_folder}/{self.opts.archive_success_subfolder}\"\n            if (archive_base_folder and self.opts.archive_success_subfolder)\n            else None\n        )\n        self.error_folder = (\n            f\"{archive_base_folder}/{self.opts.archive_error_subfolder}\"\n            if (archive_base_folder and self.opts.archive_error_subfolder)\n            else None\n        )\n\n    def read(self) -> DataFrame:\n        \"\"\"Read a Sharepoint file using a format-specific reader.\n\n        This method delegates to a reader resolved by file extension or the\n        declared `file_type` (e.g., SharepointCsvReader or SharepointExcelReader).\n\n        Returns:\n            Spark DataFrame.\n\n        Raises:\n            InputNotFoundException: Missing required Sharepoint options.\n            NotSupportedException: Streaming requested or reader unsupported.\n        \"\"\"\n        self._input_spec.sharepoint_opts.validate_for_reader()\n\n        if self._input_spec.read_type == \"streaming\":\n            raise NotSupportedException(\n                \"Sharepoint reader doesn't support streaming input.\"\n            )\n\n        return SharepointReaderFactory.get_reader(self._input_spec).read()\n\n    def _get_sharepoint_utils(self) -> SharepointUtils:\n        \"\"\"Build a SharepointUtils instance from input specs.\n\n        Returns:\n            SharepointUtils.\n        \"\"\"\n        return SharepointUtils(\n            client_id=self._input_spec.sharepoint_opts.client_id,\n            tenant_id=self._input_spec.sharepoint_opts.tenant_id,\n            local_path=self._input_spec.sharepoint_opts.local_path,\n            api_version=self._input_spec.sharepoint_opts.api_version,\n            site_name=self._input_spec.sharepoint_opts.site_name,\n            drive_name=self._input_spec.sharepoint_opts.drive_name,\n            file_name=self._input_spec.sharepoint_opts.file_name,\n            folder_relative_path=self._input_spec.sharepoint_opts.folder_relative_path,\n            chunk_size=self._input_spec.sharepoint_opts.chunk_size,\n            local_options=self._input_spec.sharepoint_opts.local_options,\n            secret=self._input_spec.sharepoint_opts.secret,\n            conflict_behaviour=self._input_spec.sharepoint_opts.conflict_behaviour,\n            file_pattern=self._input_spec.sharepoint_opts.file_pattern,\n            file_type=self._input_spec.sharepoint_opts.file_type,\n        )\n\n\nclass SharepointCsvReader(SharepointReader):\n    \"\"\"Read CSV files from Sharepoint and return Spark DataFrame.\n\n    Supports reading a single file or combining multiple files in a folder.\n    Ensures schema consistency and archives processed files.\n    \"\"\"\n\n    def read(self, file_path: str = None, pattern: str = None) -> DataFrame:\n        \"\"\"Read CSV data from Sharepoint.\n\n        Args:\n            file_path: Full file or folder path (overrides options if provided).\n            pattern: Optional substring filter for folder mode.\n\n        Returns:\n            
Spark DataFrame.\n\n        Raises:\n            ValueError: Invalid/missing path or path not found.\n        \"\"\"\n        file_path = file_path or self.file_path\n        pattern = pattern or self.pattern\n\n        if not file_path:\n            raise ValueError(\n                \"\"\"`file_name` or `folder_relative_path` must be provided via\n                sharepoint_opts.\"\"\"\n            )\n\n        # Case 1: file_path includes a file (e.g., folder/file.csv or full path)\n        if \".\" in Path(file_path).name:\n            sp_file = self.sharepoint_utils.get_file_metadata(file_path)\n            _LOGGER.info(f\"Detected single-file read mode for '{file_path}'.\")\n            return self._load_and_archive_file(sp_file)\n\n        # Case 2: it's a folder — use optional pattern\n        if not self.sharepoint_utils.check_if_endpoint_exists(file_path):\n            raise ValueError(f\"Folder '{file_path}' does not exist in Sharepoint.\")\n\n        _LOGGER.info(\n            f\"Detected folder read mode for '{file_path}' \"\n            + (\n                f\"with pattern '{pattern}'.\"\n                if pattern\n                else \"with no pattern (all files).\"\n            )\n        )\n        return self.read_csv_folder(file_path, pattern)\n\n    def _load_and_archive_file(self, sp_file: SharepointFile) -> DataFrame:\n        \"\"\"Download a Sharepoint CSV, stage it locally, load with Spark, and archive it.\n\n        Handles:\n        - Writing the CSV to a temporary local path.\n        - Reading it as a Spark DataFrame.\n        - Archiving goes to the configured success/error subfolders when enabled\n        (defaults: \"done\"/\"error\").\n\n        Args:\n            sp_file: File metadata and content.\n\n        Returns:\n            Spark DataFrame.\n\n        Raises:\n            ValueError: Empty content.\n            Exception: Staging or read failure.\n        \"\"\"\n        if self.file_path:\n            base_folder = (\n                str(Path(self.file_path).parent)\n                if \".\" in Path(self.file_path).name\n                else str(Path(self.file_path))\n            )\n        else:\n            base_folder = sp_file._folder if getattr(sp_file, \"_folder\", None) else None\n\n        success_subfolder = self.opts.archive_success_subfolder or \"done\"\n        error_subfolder = self.opts.archive_error_subfolder or \"error\"\n\n        success_folder = f\"{base_folder}/{success_subfolder}\" if base_folder else None\n        error_folder = f\"{base_folder}/{error_subfolder}\" if base_folder else None\n\n        archive_target = error_folder  # default to error unless full read succeeds\n\n        try:\n            # IMPORTANT: empty check inside try so finally always runs\n            if not sp_file.content:\n                raise ValueError(\n                    f\"File '{getattr(sp_file, 'file_path', None) or self.file_path}' \"\n                    \"is empty or could not be downloaded.\"\n                )\n\n            with self.sharepoint_utils.staging_area() as tmp_dir_raw:\n                tmp_dir: Path = Path(tmp_dir_raw)\n\n                sp_file, df = self._load_csv_to_spark(sp_file, tmp_dir)\n                archive_target = success_folder  # only mark success after full read\n\n                _LOGGER.info(\n                    f\"Successfully read '{sp_file.file_path}' into Spark DataFrame.\"\n                )\n                df = df.cache()\n                df.count()  # Force materialization\n                return 
df\n\n        except Exception as e:\n            _LOGGER.error(f\"Error processing '{sp_file.file_name}': {e}\")\n            raise\n\n        finally:\n            self.sharepoint_utils.archive_sharepoint_file(\n                sp_file=sp_file,\n                to_path=archive_target,\n                move_enabled=self.opts.archive_enabled,\n            )\n\n    def _get_csv_files_in_folder(\n        self, folder_path: str, pattern: str = None\n    ) -> list[SharepointFile]:\n        \"\"\"List CSV files in a Sharepoint folder, optionally filtered by pattern.\n\n        Args:\n            folder_path: Sharepoint folder path.\n            pattern: Optional glob/substring pattern.\n\n        Returns:\n            List of SharepointFile.\n        \"\"\"\n        items = self.sharepoint_utils.list_items_in_path(folder_path)\n        files = []\n\n        if pattern:\n            _LOGGER.info(\n                f\"\"\"Filtering Sharepoint files in '{folder_path}' using glob-style\n                pattern: '{pattern}'.\n                Ensure your pattern uses wildcards (e.g., '*.csv', 'sales_*.csv').\n                \"\"\"\n            )\n\n        for item in items:\n            file = SharepointFile(\n                file_name=item[\"name\"],\n                time_created=item.get(\"createdDateTime\", \"\"),\n                time_modified=item.get(\"lastModifiedDateTime\", \"\"),\n                _folder=folder_path,\n            )\n\n            if not file.is_csv:\n                continue\n\n            if pattern:\n                if not fnmatch.fnmatch(file.file_name, pattern):\n                    continue\n\n            files.append(file)\n\n        return sorted(files, key=lambda f: f.file_name)\n\n    def _load_csv_to_spark(\n        self, sp_file: SharepointFile, tmp_dir: Path\n    ) -> tuple[SharepointFile, DataFrame]:\n        \"\"\"Load a staged CSV into Spark and return file + DataFrame.\n\n        Args:\n            sp_file: Sharepoint file metadata.\n            tmp_dir: Local staging directory.\n\n        Returns:\n            (SharepointFile, Spark DataFrame).\n\n        Raises:\n            ValueError: Empty or undownloadable file.\n        \"\"\"\n        sp_file = self.sharepoint_utils.get_file_metadata(sp_file.file_path)\n\n        local_file = self.sharepoint_utils.save_to_staging_area(sp_file)\n\n        spark_options = self.resolve_spark_csv_options(sp_file.content)\n\n        try:\n            _LOGGER.info(f\"Starting to read file: {sp_file.file_name}\")\n            start_time = time.time()\n            df = (\n                ExecEnv.SESSION.read.format(\"csv\")\n                .options(**spark_options)\n                .load(str(local_file))\n                .cache()\n            )\n            _LOGGER.info(\n                f\"\"\"Finished reading file: {sp_file.file_name} in\n                {round(time.time() - start_time, 2)} seconds\"\"\"\n            )\n            df.count()  # force materialization\n\n            return sp_file, df\n\n        except Exception as e:\n            _LOGGER.error(\n                f\"Failed to read local copy of Sharepoint file: {local_file}\",\n                exc_info=True,\n            )\n            raise ValueError(\n                f\"Failed to read Sharepoint file: '{sp_file.file_path}'.\"\n            ) from e\n\n    def read_csv_folder(self, folder_path: str, pattern: str = None) -> DataFrame:\n        \"\"\"Read and combine CSVs from a Sharepoint folder.\n\n        If a pattern is provided, only files whose names 
contain the pattern will be\n        read.\n        Only archives files to the configured success subfolder if the full read\n        and union succeeds.\n        Files causing schema mismatches or other read issues are moved to the\n        configured error subfolder (when enabled).\n\n        Args:\n            folder_path: Sharepoint folder path.\n            pattern: Optional substring filter for filenames.\n\n        Returns:\n            Combined Spark DataFrame.\n\n        Raises:\n            ValueError: No valid files or schema mismatches.\n        \"\"\"\n        files = self._get_csv_files_in_folder(folder_path, pattern)\n        if not files:\n            raise ValueError(f\"No CSV files found in folder: {folder_path}\")\n\n        valid_files, dfs = [], []\n        base_schema = None\n\n        with self.sharepoint_utils.staging_area() as tmp_dir_raw:\n            tmp_dir: Path = Path(tmp_dir_raw)\n\n            for file in files:\n                try:\n                    file_with_content, df = self._validate_and_read_file(\n                        file, tmp_dir, base_schema\n                    )\n                    base_schema = base_schema or df.schema\n                    dfs.append(df)\n                    valid_files.append(file_with_content)\n                except Exception as e:\n                    self._handle_file_error(file, folder_path, e)\n                    raise\n\n        if not dfs:\n            raise ValueError(\"No valid CSV files could be loaded from folder.\")\n\n        combined = reduce(lambda a, b: a.unionByName(b), dfs).cache()\n        combined.count()\n\n        for sp_file in valid_files:\n            self.sharepoint_utils.archive_sharepoint_file(\n                sp_file,\n                to_path=(\n                    f\"{folder_path}/{self.opts.archive_success_subfolder}\"\n                    if self.opts.archive_success_subfolder\n                    else None\n                ),\n                move_enabled=self.opts.archive_enabled,\n            )\n\n        return combined\n\n    def _validate_and_read_file(\n        self,\n        file: SharepointFile,\n        tmp_dir: Path,\n        base_schema: Optional[StructType],\n    ) -> tuple[SharepointFile, DataFrame]:\n        \"\"\"Validate schema and read CSV file into a Spark DataFrame.\n\n        Args:\n            file: Sharepoint file to read.\n            tmp_dir: Temporary staging directory.\n            base_schema: Schema to validate against.\n\n        Returns:\n            (validated SharepointFile, DataFrame).\n\n        Raises:\n            ValueError: Schema mismatch.\n        \"\"\"\n        file_with_content, df = self._load_csv_to_spark(file, tmp_dir)\n\n        if base_schema and df.schema != base_schema:\n            _LOGGER.error(\n                f\"\"\"Schema mismatch in '{file.file_name}'. 
Expected: {base_schema},\n                Found: {df.schema}\"\"\"\n            )\n            self.sharepoint_utils.archive_sharepoint_file(\n                sp_file=file_with_content,\n                to_path=self.error_folder,\n                move_enabled=self.opts.archive_enabled,\n            )\n            raise ValueError(f\"Schema mismatch in '{file.file_name}'\")\n\n        return file_with_content, df\n\n    def _handle_file_error(\n        self,\n        file: SharepointFile,\n        folder_path: str,\n        error: Exception,\n    ) -> None:\n        \"\"\"Handle file read or processing errors by logging and archiving.\n\n        Logs the error, prevents duplicate archiving, and moves the file\n        to the error subfolder when enabled. Falls back gracefully if\n        archiving fails.\n\n        Args:\n            file: Problematic SharepointFile.\n            folder_path: Folder path for fallback archiving.\n            error: Exception encountered.\n        \"\"\"\n        _LOGGER.error(f\"Error processing '{file.file_name}': {error}\")\n        if not getattr(file, \"_already_archived\", False):\n            file.skip_rename = True\n            try:\n                self.sharepoint_utils.archive_sharepoint_file(\n                    sp_file=file,\n                    to_path=self.error_folder,\n                    move_enabled=self.opts.archive_enabled,\n                )\n                file._already_archived = True\n            except Exception as archive_error:\n                _LOGGER.warning(f\"Secondary archiving failed: {archive_error}\")\n        else:\n            _LOGGER.info(\n                f\"Skipping second archive for '{file.file_name}' (already archived)\"\n            )\n\n    def detect_delimiter(\n        self,\n        file_content: bytes,\n        provided_delimiter: Optional[str] = None,\n        expected_columns: Optional[list] = None,\n    ) -> str:\n        \"\"\"Detect the appropriate delimiter for a CSV file.\n\n        If a delimiter is explicitly provided by the user, it will be used directly\n        (sniffing is bypassed).\n        Otherwise, attempts to auto-detect the delimiter using csv.Sniffer based on the\n        first line or expected columns.\n\n        Args:\n            file_content: Raw CSV bytes.\n            provided_delimiter: Explicit delimiter to use.\n            expected_columns: Optional expected header names.\n\n        Returns:\n            Final delimiter.\n\n        Raises:\n            ValueError: Unable to determine delimiter.\n        \"\"\"\n        if provided_delimiter:\n            _LOGGER.info(f\"User-specified delimiter '{provided_delimiter}' selected.\")\n            return provided_delimiter\n\n        try:\n            text = file_content.decode(\"utf-8\")\n            dialect = csv.Sniffer().sniff(text, delimiters=\";,|\\t\")\n            detected_delimiter = dialect.delimiter\n\n            _LOGGER.info(\n                f\"No user-specified delimiter. 
Auto-detected: '{detected_delimiter}'\"\n            )\n\n            first_line = text.splitlines()[0].strip()\n            actual_column_count = len(first_line.split(detected_delimiter))\n\n            if expected_columns:\n                expected_count = len(expected_columns)\n                if actual_column_count != expected_count:\n                    _LOGGER.warning(\n                        f\"\"\"Detected delimiter '{detected_delimiter}' resulted in\n                        {actual_column_count} columns,\n                        but {expected_count} were expected. Consider specifying\n                        the delimiter explicitly.\"\"\"\n                    )\n            elif actual_column_count <= 1:\n                _LOGGER.warning(\n                    f\"\"\"Detected delimiter '{detected_delimiter}' resulted in only\n                    {actual_column_count} column.\n                     Consider specifying the delimiter explicitly in\n                     'sharepoint_opts.local_options'.\"\"\"\n                )\n\n            return detected_delimiter\n\n        except Exception as e:\n            _LOGGER.warning(\n                f\"Failed to auto-detect delimiter. Defaulting to comma. Reason: {e}\"\n            )\n            return \",\"\n\n    def resolve_spark_csv_options(self, file_content: bytes) -> dict:\n        \"\"\"Resolve Spark CSV read options by validating or detecting delimiter.\n\n        Args:\n            file_content: Raw file bytes.\n\n        Returns:\n            Dict of Spark CSV options (includes delimiter).\n        \"\"\"\n        local_options = self._input_spec.sharepoint_opts.local_options or {}\n\n        if \"sep\" in local_options:\n            user_delimiter = local_options[\"sep\"]\n        elif \"delimiter\" in local_options:\n            user_delimiter = local_options[\"delimiter\"]\n        else:\n            user_delimiter = None\n\n        expected_columns = local_options.get(\"expected_columns\")\n\n        final_delimiter = self.detect_delimiter(\n            file_content=file_content,\n            provided_delimiter=user_delimiter,\n            expected_columns=expected_columns,\n        )\n\n        # Warn if expected column names do not match the header when using the selected\n        # delimiter\n        if expected_columns:\n            try:\n                header_line = file_content.decode(\"utf-8\").splitlines()[0].strip()\n                actual_columns = [c.strip() for c in header_line.split(final_delimiter)]\n\n                expected_normalized = [str(c).strip().lower() for c in expected_columns]\n                actual_normalized = [c.strip().lower() for c in actual_columns]\n\n                if actual_normalized != expected_normalized:\n                    _LOGGER.warning(\n                        \"Expected columns don't match CSV header using delimiter '%s'. \"\n                        \"Expected: %s vs. Actual: %s. The read will proceed; \"\n                        \"consider specifying the correct delimiter or \"\n                        \"updating expected_columns.\",\n                        final_delimiter,\n                        expected_columns,\n                        actual_columns,\n                    )\n            except Exception as e:\n                _LOGGER.warning(\n                    \"Failed to validate expected_columns against CSV header. \"\n                    \"The read will proceed. 
Reason: %s\",\n                    e,\n                )\n\n        # Safety fallback if detector returned nothing for some reason\n        final_delimiter = final_delimiter or \",\"\n\n        spark_options = dict(local_options)\n        spark_options[\"sep\"] = final_delimiter\n        # Remove \"delimiter\" to avoid ambiguity as spark uses \"sep\"\n        spark_options.pop(\"delimiter\", None)\n\n        return spark_options\n\n\nclass SharepointExcelReader(SharepointReader):\n    \"\"\"Read Excel files from Sharepoint (not yet implemented).\"\"\"\n\n    def read(self) -> DataFrame:\n        \"\"\"Read Excel files from Sharepoint.\n\n        This method is not yet implemented and currently raises an error.\n        Intended for future support of .xlsx file read from Sharepoint folders or files.\n\n        Raises:\n            NotImplementedError: Always, since Excel reading is not implemented.\n        \"\"\"\n        raise NotImplementedError(\"Excel reading is not yet implemented.\")\n\n\nclass SharepointReaderFactory:\n    \"\"\"Select the correct Sharepoint reader based on file type, file name, folder path.\n\n    Default to using the file path from SharepointUtils instance.\n    \"\"\"\n\n    @staticmethod\n    def get_reader(input_spec: InputSpec) -> SharepointReader:\n        \"\"\"Select the appropriate Sharepoint reader based on input specification.\n\n        Resolution order:\n        1. Use file extension from `file_name` if provided.\n        2. If `folder_relative_path` includes a file with extension, use that.\n        3. If neither applies, use `file_type`.\n\n        Args:\n            input_spec: InputSpec with Sharepoint options.\n\n        Returns:\n            Reader instance for the resolved file type.\n\n        Raises:\n            ValueError: If file format is unsupported or cannot be determined.\n        \"\"\"\n        opts = input_spec.sharepoint_opts\n\n        # 1. If reading a specific file, use file_name\n        if opts.file_name:\n            ext = Path(opts.file_name).suffix.lower()\n\n        # 2. If folder_relative_path includes extension, treat it as full path\n        elif opts.folder_relative_path and \".\" in Path(opts.folder_relative_path).name:\n            ext = Path(opts.folder_relative_path).suffix.lower()\n\n        # 3. Otherwise, rely on file_type\n        elif opts.file_type:\n            ext = f\".{opts.file_type.lower()}\"\n\n        else:\n            raise ValueError(\n                \"\"\"Cannot determine file format. Please provide `file_name`,\n                 a full file path in `folder_relative_path`, or explicitly set\n                 `file_type`.\"\"\"\n            )\n\n        readers = {\n            \".csv\": SharepointCsvReader,\n            \".xlsx\": SharepointExcelReader,\n        }\n        try:\n            _LOGGER.info(f\"Detected {ext} read mode.\")\n            return readers[ext](input_spec)\n        except KeyError:\n            raise ValueError(f\"Unsupported file format: {ext}\")\n"
  },
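  {
    "path": "examples/delimiter_sniffing_sketch.py",
    "content": "\"\"\"Hypothetical sketch (not part of the engine): delimiter sniffing as in SharepointCsvReader.\n\nStandalone, pure-Python illustration of the detect_delimiter logic: an explicitly\nprovided delimiter wins, otherwise csv.Sniffer guesses one from the raw bytes, and\na comma is the fallback when sniffing fails. The helper name below is illustrative\nonly.\n\"\"\"\n\nimport csv\nfrom typing import Optional\n\n\ndef sniff_delimiter(file_content: bytes, provided: Optional[str] = None) -> str:\n    \"\"\"Return an explicit delimiter, a sniffed one, or a comma as last resort.\"\"\"\n    if provided:\n        return provided\n    try:\n        text = file_content.decode(\"utf-8\")\n        return csv.Sniffer().sniff(text, delimiters=\";,|\").delimiter\n    except Exception:\n        return \",\"\n\n\nprint(sniff_delimiter(b\"id;val\\n1;a\\n\"))  # ';' sniffed from the sample\nprint(sniff_delimiter(b\"id,val\\n1,a\\n\", provided=\"|\"))  # explicit delimiter bypasses sniffing\n"
  },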
  {
    "path": "lakehouse_engine/io/readers/table_reader.py",
    "content": "\"\"\"Module to define behaviour to read from tables.\"\"\"\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputSpec, ReadType\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader import Reader\n\n\nclass TableReader(Reader):\n    \"\"\"Class to read data from a table.\"\"\"\n\n    def __init__(self, input_spec: InputSpec):\n        \"\"\"Construct TableReader instances.\n\n        Args:\n            input_spec: input specification.\n        \"\"\"\n        super().__init__(input_spec)\n\n    def read(self) -> DataFrame:\n        \"\"\"Read data from a table.\n\n        Returns:\n            A dataframe containing the data from the table.\n        \"\"\"\n        if self._input_spec.read_type == ReadType.BATCH.value:\n            return ExecEnv.SESSION.read.options(\n                **self._input_spec.options if self._input_spec.options else {}\n            ).table(self._input_spec.db_table)\n        elif self._input_spec.read_type == ReadType.STREAMING.value:\n            return ExecEnv.SESSION.readStream.options(\n                **self._input_spec.options if self._input_spec.options else {}\n            ).table(self._input_spec.db_table)\n        else:\n            self._logger.error(\"The requested read type is not supported.\")\n            raise NotImplementedError\n"
  },
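Editor's note: a minimal, hedged sketch of driving the TableReader above directly. It assumes the engine's ExecEnv/Spark session is already initialized and that InputSpec accepts the field names this module reads (read_type, db_table, options); the real constructor may require additional fields such as data_format.

```python
# Hedged sketch, not part of the repository: batch read of a metastore table
# via TableReader. Table name and options are hypothetical placeholders.
from lakehouse_engine.core.definitions import InputSpec, ReadType
from lakehouse_engine.io.readers.table_reader import TableReader

input_spec = InputSpec(
    spec_id="sales_bronze",           # hypothetical spec id
    read_type=ReadType.BATCH.value,   # or ReadType.STREAMING.value for readStream
    db_table="my_db.sales_bronze",    # hypothetical db.table
    options={"versionAsOf": "3"},     # passed straight to spark.read.options(...)
)

df = TableReader(input_spec).read()   # returns a Spark DataFrame
```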
  {
    "path": "lakehouse_engine/io/writer.py",
    "content": "\"\"\"Defines abstract writer behaviour.\"\"\"\n\nfrom abc import ABC, abstractmethod\nfrom typing import Any, Callable, Dict, List, Optional, OrderedDict\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import lit\n\nfrom lakehouse_engine.core.definitions import DQSpec, OutputSpec\nfrom lakehouse_engine.transformers.transformer_factory import TransformerFactory\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Writer(ABC):\n    \"\"\"Abstract Writer class.\"\"\"\n\n    def __init__(\n        self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict = None\n    ):\n        \"\"\"Construct Writer instances.\n\n        Args:\n            output_spec: output specification to write data.\n            df: dataframe to write.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        self._logger = LoggingHandler(self.__class__.__name__).get_logger()\n        self._output_spec = output_spec\n        self._df = df\n        self._data = data\n\n    @abstractmethod\n    def write(self) -> Optional[OrderedDict]:\n        \"\"\"Abstract write method.\"\"\"\n        raise NotImplementedError\n\n    @staticmethod\n    def write_transformed_micro_batch(**kwargs: Any) -> Callable:\n        \"\"\"Define how to write a streaming micro batch after transforming it.\n\n        This function must define an inner function that manipulates a streaming batch,\n        and then return that function. Look for concrete implementations of this\n        function for more clarity.\n\n        Args:\n            kwargs: any keyword arguments.\n\n        Returns:\n            A function to be executed in the foreachBatch spark write method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            logger = LoggingHandler(__name__).get_logger()\n            logger.warning(\"Skipping transform micro batch... 
nothing to do.\")\n\n        return inner\n\n    @classmethod\n    def get_transformed_micro_batch(\n        cls,\n        output_spec: OutputSpec,\n        batch_df: DataFrame,\n        batch_id: int,\n        data: OrderedDict,\n    ) -> DataFrame:\n        \"\"\"Get the result of the transformations applied to a micro batch dataframe.\n\n        Args:\n            output_spec: output specification associated with the writer.\n            batch_df: batch dataframe (given from streaming foreachBatch).\n            batch_id: if of the batch (given from streaming foreachBatch).\n            data: list of all dfs generated on previous steps before writer\n                to be available on micro batch transforms.\n\n        Returns:\n            The transformed dataframe.\n        \"\"\"\n        transformed_df = batch_df\n        if output_spec.with_batch_id:\n            transformed_df = transformed_df.withColumn(\"lhe_batch_id\", lit(batch_id))\n\n        for transformer in output_spec.streaming_micro_batch_transformers:\n            transformed_df = transformed_df.transform(\n                TransformerFactory.get_transformer(transformer, data)\n            )\n\n        return transformed_df\n\n    @classmethod\n    def get_streaming_trigger(cls, output_spec: OutputSpec) -> Dict:\n        \"\"\"Define which streaming trigger will be used.\n\n        Args:\n            output_spec: output specification.\n\n        Returns:\n            A dict containing streaming trigger.\n        \"\"\"\n        trigger: Dict[str, Any] = {}\n\n        if output_spec.streaming_available_now:\n            trigger[\"availableNow\"] = output_spec.streaming_available_now\n        elif output_spec.streaming_once:\n            trigger[\"once\"] = output_spec.streaming_once\n        elif output_spec.streaming_processing_time:\n            trigger[\"processingTime\"] = output_spec.streaming_processing_time\n        elif output_spec.streaming_continuous:\n            trigger[\"continuous\"] = output_spec.streaming_continuous\n        else:\n            raise NotImplementedError(\n                \"The requested output spec streaming trigger is not supported.\"\n            )\n\n        return trigger\n\n    @staticmethod\n    def run_micro_batch_dq_process(df: DataFrame, dq_spec: List[DQSpec]) -> DataFrame:\n        \"\"\"Run the data quality process in a streaming micro batch dataframe.\n\n        Iterates over the specs and performs the checks or analysis depending on the\n        data quality specification provided in the configuration.\n\n        Args:\n            df: the dataframe in which to run the dq process on.\n            dq_spec: data quality specification.\n\n        Returns: the validated dataframe.\n        \"\"\"\n        from lakehouse_engine.dq_processors.dq_factory import DQFactory\n\n        validated_df = df\n        for spec in dq_spec:\n            validated_df = DQFactory.run_dq_process(spec, df)\n\n        return validated_df\n"
  },
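Editor's note: get_streaming_trigger above maps OutputSpec flags one-to-one onto Spark's trigger keywords, which the concrete writers then splat into writeStream.trigger(**trigger). Below is a standalone Spark illustration of that mapping, not engine code; the rate source and console sink are only for demonstration.

```python
# Illustration of the trigger dict produced by Writer.get_streaming_trigger:
#   streaming_available_now   -> {"availableNow": True}
#   streaming_once            -> {"once": True}
#   streaming_processing_time -> {"processingTime": "30 seconds"}
#   streaming_continuous      -> {"continuous": "1 second"}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

trigger = {"availableNow": True}  # what a spec with streaming_available_now yields

query = stream_df.writeStream.trigger(**trigger).format("console").start()
query.awaitTermination()
```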
  {
    "path": "lakehouse_engine/io/writer_factory.py",
    "content": "\"\"\"Module for writer factory.\"\"\"\n\nfrom abc import ABC\nfrom typing import OrderedDict\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import (\n    FILE_OUTPUT_FORMATS,\n    OutputFormat,\n    OutputSpec,\n    WriteType,\n)\nfrom lakehouse_engine.io.writer import Writer\nfrom lakehouse_engine.io.writers.console_writer import ConsoleWriter\nfrom lakehouse_engine.io.writers.dataframe_writer import DataFrameWriter\nfrom lakehouse_engine.io.writers.delta_merge_writer import DeltaMergeWriter\nfrom lakehouse_engine.io.writers.file_writer import FileWriter\nfrom lakehouse_engine.io.writers.jdbc_writer import JDBCWriter\nfrom lakehouse_engine.io.writers.kafka_writer import KafkaWriter\nfrom lakehouse_engine.io.writers.rest_api_writer import RestApiWriter\nfrom lakehouse_engine.io.writers.sharepoint_writer import SharepointWriter\nfrom lakehouse_engine.io.writers.table_writer import TableWriter\n\n\nclass WriterFactory(ABC):  # noqa: B024\n    \"\"\"Class for writer factory.\"\"\"\n\n    AVAILABLE_WRITERS = {\n        OutputFormat.TABLE.value: TableWriter,\n        OutputFormat.DELTAFILES.value: DeltaMergeWriter,\n        OutputFormat.JDBC.value: JDBCWriter,\n        OutputFormat.FILE.value: FileWriter,\n        OutputFormat.KAFKA.value: KafkaWriter,\n        OutputFormat.CONSOLE.value: ConsoleWriter,\n        OutputFormat.DATAFRAME.value: DataFrameWriter,\n        OutputFormat.REST_API.value: RestApiWriter,\n        OutputFormat.SHAREPOINT.value: SharepointWriter,\n    }\n\n    @classmethod\n    def _get_writer_name(cls, spec: OutputSpec) -> str:\n        \"\"\"Get the writer name according to the output specification.\n\n        Args:\n            OutputSpec spec: output specification to write data.\n\n        Returns:\n            Writer: writer name that will be created to write the data.\n        \"\"\"\n        if spec.db_table and spec.write_type != WriteType.MERGE.value:\n            writer_name = OutputFormat.TABLE.value\n        elif (\n            spec.data_format == OutputFormat.DELTAFILES.value or spec.db_table\n        ) and spec.write_type == WriteType.MERGE.value:\n            writer_name = OutputFormat.DELTAFILES.value\n        elif spec.data_format in FILE_OUTPUT_FORMATS:\n            writer_name = OutputFormat.FILE.value\n        else:\n            writer_name = spec.data_format\n        return writer_name\n\n    @classmethod\n    def get_writer(cls, spec: OutputSpec, df: DataFrame, data: OrderedDict) -> Writer:\n        \"\"\"Get a writer according to the output specification using a factory pattern.\n\n        Args:\n            spec: output specification to write data.\n            df: dataframe to be written.\n            data: list of all dfs generated on previous steps before writer.\n\n        Returns:\n            Writer: writer that will write the data.\n        \"\"\"\n        writer_name = cls._get_writer_name(spec)\n        writer = cls.AVAILABLE_WRITERS.get(writer_name)\n\n        if writer:\n            return writer(output_spec=spec, df=df, data=data)  # type: ignore\n        else:\n            raise NotImplementedError(\n                f\"The requested output spec format {spec.data_format} is not supported.\"\n            )\n"
  },
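Editor's note: a hedged sketch of the resolution logic above. A spec with db_table and a non-merge write_type resolves to TableWriter; a merge write against delta files or a table resolves to DeltaMergeWriter; file formats go to FileWriter; anything else is looked up by data_format. The OutputSpec field names are taken from usage elsewhere in this package and may not be the full constructor signature.

```python
# Hedged sketch, not part of the repository: resolving a writer by spec.
from collections import OrderedDict

from pyspark.sql import SparkSession

from lakehouse_engine.core.definitions import OutputSpec, WriteType
from lakehouse_engine.io.writer_factory import WriterFactory

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

spec = OutputSpec(
    spec_id="sales_silver",            # hypothetical
    input_id="sales_bronze",           # hypothetical
    db_table="my_db.sales_silver",     # db_table + non-merge write -> TableWriter
    data_format="delta",               # assumed format string
    write_type=WriteType.APPEND.value,
)

writer = WriterFactory.get_writer(spec=spec, df=df, data=OrderedDict())
print(type(writer).__name__)           # expected: TableWriter
# writer.write() would additionally need the engine's ExecEnv to be initialized.
```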
  {
    "path": "lakehouse_engine/io/writers/__init__.py",
    "content": "\"\"\"Package containing the writers responsible for writing data.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/io/writers/console_writer.py",
    "content": "\"\"\"Module to define behaviour to write to console.\"\"\"\n\nfrom typing import Callable, OrderedDict\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import OutputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.writer import Writer\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass ConsoleWriter(Writer):\n    \"\"\"Class to write data to console.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict):\n        \"\"\"Construct ConsoleWriter instances.\n\n        Args:\n            output_spec: output specification\n            df: dataframe to be written.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        super().__init__(output_spec, df, data)\n\n    def write(self) -> None:\n        \"\"\"Write data to console.\"\"\"\n        self._output_spec.options = (\n            self._output_spec.options if self._output_spec.options else {}\n        )\n        if not self._df.isStreaming:\n            self._logger.info(\"Dataframe preview:\")\n            self._show_df(self._df, self._output_spec)\n        else:\n            self._logger.info(\"Stream Dataframe preview:\")\n            self._write_to_console_in_streaming_mode(\n                self._df, self._output_spec, self._data\n            )\n\n    @staticmethod\n    def _show_df(df: DataFrame, output_spec: OutputSpec) -> None:\n        \"\"\"Given a dataframe it applies Spark's show function to show it.\n\n        Args:\n            df: dataframe to be shown.\n            output_spec: output specification.\n        \"\"\"\n        df.show(\n            n=output_spec.options.get(\"limit\", 20),\n            truncate=output_spec.options.get(\"truncate\", True),\n            vertical=output_spec.options.get(\"vertical\", False),\n        )\n\n    @staticmethod\n    def _show_streaming_df(output_spec: OutputSpec) -> Callable:\n        \"\"\"Define how to show a streaming df.\n\n        Args:\n            output_spec: output specification.\n\n        Returns:\n            A function to show df in the foreachBatch spark write method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            ConsoleWriter._logger.info(f\"Showing DF for batch {batch_id}\")\n            ConsoleWriter._show_df(batch_df, output_spec)\n\n        return inner\n\n    @staticmethod\n    def _write_to_console_in_streaming_mode(\n        df: DataFrame, output_spec: OutputSpec, data: OrderedDict\n    ) -> None:\n        \"\"\"Write to console in streaming mode.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        df_writer = df.writeStream.trigger(**Writer.get_streaming_trigger(output_spec))\n\n        if (\n            output_spec.streaming_micro_batch_transformers\n            or output_spec.streaming_micro_batch_dq_processors\n        ):\n            stream_df = df_writer.foreachBatch(\n                ConsoleWriter._write_transformed_micro_batch(output_spec, data)\n            ).start()\n        else:\n            stream_df = df_writer.foreachBatch(\n                ConsoleWriter._show_streaming_df(output_spec)\n            ).start()\n\n        if 
output_spec.streaming_await_termination:\n            stream_df.awaitTermination(output_spec.streaming_await_termination_timeout)\n\n    @staticmethod\n    def _write_transformed_micro_batch(  # type: ignore\n        output_spec: OutputSpec, data: OrderedDict\n    ) -> Callable:\n        \"\"\"Define how to write a streaming micro batch after transforming it.\n\n        Args:\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n\n        Returns:\n            A function to be executed in the foreachBatch spark write method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            transformed_df = Writer.get_transformed_micro_batch(\n                output_spec, batch_df, batch_id, data\n            )\n\n            if output_spec.streaming_micro_batch_dq_processors:\n                transformed_df = Writer.run_micro_batch_dq_process(\n                    transformed_df, output_spec.streaming_micro_batch_dq_processors\n                )\n\n            ConsoleWriter._show_df(transformed_df, output_spec)\n\n        return inner\n"
  },
  {
    "path": "lakehouse_engine/io/writers/dataframe_writer.py",
    "content": "\"\"\"Module to define behaviour to write to dataframe.\"\"\"\n\nimport uuid\nfrom typing import Callable, Optional, OrderedDict\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.types import StructType\n\nfrom lakehouse_engine.core.definitions import OutputFormat, OutputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.exceptions import NotSupportedException\nfrom lakehouse_engine.io.writer import Writer\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.spark_utils import SparkUtils\n\n\nclass DataFrameWriter(Writer):\n    \"\"\"Class to write data to dataframe.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict):\n        \"\"\"Construct DataFrameWriter instances.\n\n        Args:\n            output_spec: output specification.\n            df: dataframe to be written.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        super().__init__(output_spec, df, data)\n        self.view_prefix = \"global_temp\" if not ExecEnv.IS_SERVERLESS else \"\"\n\n    def write(self) -> Optional[OrderedDict]:\n        \"\"\"Write data to dataframe.\"\"\"\n        self._output_spec.options = (\n            self._output_spec.options if self._output_spec.options else {}\n        )\n        written_dfs: OrderedDict = OrderedDict({})\n\n        if (\n            self._output_spec.streaming_processing_time\n            or self._output_spec.streaming_continuous\n        ):\n            raise NotSupportedException(\n                f\"DataFrame writer doesn't support \"\n                f\"processing time or continuous streaming \"\n                f\"for step ${self._output_spec.spec_id}.\"\n            )\n\n        if self._df.isStreaming:\n            output_df = self._write_to_dataframe_in_streaming_mode(\n                self._df, self._output_spec, self._data\n            )\n        else:\n            output_df = self._df\n\n        written_dfs[self._output_spec.spec_id] = output_df\n\n        return written_dfs\n\n    def _get_prefixed_view_name(self, stream_df_view_name: str) -> str:\n        \"\"\"Return the fully qualified view name with prefix if needed.\"\"\"\n        return \".\".join(filter(None, [self.view_prefix, stream_df_view_name]))\n\n    def _create_temp_view(self, df: DataFrame, stream_df_view_name: str) -> None:\n        \"\"\"Given a dataframe create a temp view to be available for consumption.\n\n        Args:\n            df: dataframe to be shown.\n            stream_df_view_name: stream df view name.\n        \"\"\"\n        prefixed_view_name = self._get_prefixed_view_name(stream_df_view_name)\n        if self._table_exists(stream_df_view_name):\n            self._logger.info(\"Temp view already exists\")\n            existing_data = ExecEnv.SESSION.table(f\"{prefixed_view_name}\")\n            df = existing_data.union(df)\n\n        SparkUtils.create_temp_view(df, stream_df_view_name)\n\n    def _write_streaming_df(self, stream_df_view_name: str) -> Callable:\n        \"\"\"Define how to create a df from streaming df.\n\n        Args:\n            stream_df_view_name: stream df view name.\n\n        Returns:\n            A function to show df in the foreachBatch spark write method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            
self._create_temp_view(batch_df, stream_df_view_name)\n\n        return inner\n\n    def _write_to_dataframe_in_streaming_mode(\n        self, df: DataFrame, output_spec: OutputSpec, data: OrderedDict\n    ) -> DataFrame:\n        \"\"\"Write to DataFrame in streaming mode.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        app_id = str(uuid.uuid4())\n        stream_df_view_name = f\"`{app_id}_{output_spec.spec_id}`\"\n        self._logger.info(\"Drop temp view if exists\")\n        prefixed_view_name = self._get_prefixed_view_name(stream_df_view_name)\n\n        if self._table_exists(stream_df_view_name):\n            # Cleaning Temp view to not maintain state and impact\n            # consecutive acon runs\n            ExecEnv.SESSION.sql(f\"DROP VIEW {prefixed_view_name}\")\n\n        df_writer = df.writeStream.trigger(**Writer.get_streaming_trigger(output_spec))\n\n        if (\n            output_spec.streaming_micro_batch_transformers\n            or output_spec.streaming_micro_batch_dq_processors\n        ):\n            stream_df = (\n                df_writer.options(**output_spec.options if output_spec.options else {})\n                .format(OutputFormat.NOOP.value)\n                .foreachBatch(\n                    self._write_transformed_micro_batch(\n                        output_spec, data, stream_df_view_name\n                    )\n                )\n                .start()\n            )\n        else:\n            stream_df = (\n                df_writer.options(**output_spec.options if output_spec.options else {})\n                .format(OutputFormat.NOOP.value)\n                .foreachBatch(self._write_streaming_df(stream_df_view_name))\n                .start()\n            )\n\n        if output_spec.streaming_await_termination:\n            stream_df.awaitTermination(output_spec.streaming_await_termination_timeout)\n\n        self._logger.info(\"Reading stream data as df if exists\")\n        if self._table_exists(stream_df_view_name):\n            stream_data_as_df = ExecEnv.SESSION.table(f\"{prefixed_view_name}\")\n        else:\n            self._logger.info(\n                f\"DataFrame writer couldn't find any data to return \"\n                f\"for streaming, check if you are using checkpoint \"\n                f\"for step {output_spec.spec_id}.\"\n            )\n            stream_data_as_df = ExecEnv.SESSION.createDataFrame(\n                data=[], schema=StructType([])\n            )\n\n        return stream_data_as_df\n\n    def _table_exists(self, table_name: str) -> bool:\n        \"\"\"Check if the table or view exists in the session catalog.\n\n        Args:\n            table_name: table/view name to check if exists in the session.\n        \"\"\"\n        if not ExecEnv.IS_SERVERLESS:\n            tables = ExecEnv.SESSION.sql(f\"SHOW TABLES IN {self.view_prefix}\")\n        else:\n            tables = ExecEnv.SESSION.sql(\"SHOW TABLES\")\n        return (\n            len(tables.filter(f\"tableName = '{table_name.strip('`')}'\").collect()) > 0\n        )\n\n    def _write_transformed_micro_batch(\n        self, output_spec: OutputSpec, data: OrderedDict, stream_as_df_view: str\n    ) -> Callable:\n        \"\"\"Define how to write a streaming micro batch after transforming it.\n\n        Args:\n            output_spec: output specification.\n            data: list of all dfs generated 
on previous steps before writer.\n            stream_as_df_view: stream df view name.\n\n        Returns:\n            A function to be executed in the foreachBatch spark write method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            transformed_df = Writer.get_transformed_micro_batch(\n                output_spec, batch_df, batch_id, data\n            )\n\n            if output_spec.streaming_micro_batch_dq_processors:\n                transformed_df = Writer.run_micro_batch_dq_process(\n                    transformed_df, output_spec.streaming_micro_batch_dq_processors\n                )\n\n            self._create_temp_view(transformed_df, stream_as_df_view)\n\n        return inner\n"
  },
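Editor's note: a hedged usage sketch for the writer above. For a batch input, write() simply returns an OrderedDict keyed by spec_id holding the same DataFrame, so downstream (e.g., notebook) code can pick it up. The OutputSpec fields and the "dataframe" format string are assumptions based on usage elsewhere in this package, and the engine's ExecEnv is expected to be initialized.

```python
# Hedged sketch, not part of the repository.
from collections import OrderedDict

from pyspark.sql import SparkSession

from lakehouse_engine.core.definitions import OutputSpec, WriteType
from lakehouse_engine.io.writers.dataframe_writer import DataFrameWriter

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

spec = OutputSpec(
    spec_id="result_df",           # hypothetical spec id
    input_id="some_transform",     # hypothetical upstream id
    data_format="dataframe",       # assumed OutputFormat.DATAFRAME value
    write_type=WriteType.APPEND.value,
)

written = DataFrameWriter(output_spec=spec, df=df, data=OrderedDict()).write()
result_df = written["result_df"]   # the same DataFrame for a batch input
```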
  {
    "path": "lakehouse_engine/io/writers/delta_merge_writer.py",
    "content": "\"\"\"Module to define the behaviour of delta merges.\"\"\"\n\nfrom typing import Callable, Optional, OrderedDict\n\nfrom delta.tables import DeltaMergeBuilder, DeltaTable\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import OutputFormat, OutputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.exceptions import WrongIOFormatException\nfrom lakehouse_engine.io.writer import Writer\n\n\nclass DeltaMergeWriter(Writer):\n    \"\"\"Class to merge data using delta lake.\"\"\"\n\n    def __init__(self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict):\n        \"\"\"Construct DeltaMergeWriter instances.\n\n        Args:\n            output_spec: output specification containing merge options and\n                relevant information.\n            df: the dataframe containing the new data to be merged.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        super().__init__(output_spec, df, data)\n\n    def write(self) -> None:\n        \"\"\"Merge new data with current data.\"\"\"\n        delta_table = self._get_delta_table(self._output_spec)\n        if self._df.isStreaming:\n            stream_df = (\n                self._df.writeStream.options(\n                    **self._output_spec.options if self._output_spec.options else {}\n                )\n                .foreachBatch(\n                    self._write_transformed_micro_batch(\n                        self._output_spec, self._data, delta_table\n                    )\n                )\n                .trigger(**Writer.get_streaming_trigger(self._output_spec))\n                .start()\n            )\n\n            if self._output_spec.streaming_await_termination:\n                stream_df.awaitTermination(\n                    self._output_spec.streaming_await_termination_timeout\n                )\n        else:\n            DeltaMergeWriter._merge(delta_table, self._output_spec, self._df)\n\n    @staticmethod\n    def _get_delta_table(output_spec: OutputSpec) -> DeltaTable:\n        \"\"\"Get the delta table given an output specification w/ table name or location.\n\n        Args:\n            output_spec: output specification.\n\n        Returns:\n            DeltaTable: the delta table instance.\n        \"\"\"\n        if output_spec.db_table:\n            delta_table = DeltaTable.forName(ExecEnv.SESSION, output_spec.db_table)\n        elif output_spec.data_format == OutputFormat.DELTAFILES.value:\n            delta_table = DeltaTable.forPath(ExecEnv.SESSION, output_spec.location)\n        else:\n            raise WrongIOFormatException(\n                f\"{output_spec.data_format} is not compatible with Delta Merge \"\n                f\"Writer.\"\n            )\n\n        return delta_table\n\n    @staticmethod\n    def _insert(\n        delta_merge: DeltaMergeBuilder,\n        insert_predicate: Optional[str],\n        insert_column_set: Optional[dict],\n    ) -> DeltaMergeBuilder:\n        \"\"\"Get the builder of merge data with insert predicate and column set.\n\n        Args:\n            delta_merge: builder of the merge data.\n            insert_predicate: condition of the insert.\n            insert_column_set: rules for setting the values of\n                columns that need to be inserted.\n\n        Returns:\n            DeltaMergeBuilder: builder of the merge data with insert.\n        \"\"\"\n        if insert_predicate:\n            if insert_column_set:\n                
delta_merge = delta_merge.whenNotMatchedInsert(\n                    condition=insert_predicate,\n                    values=insert_column_set,\n                )\n            else:\n                delta_merge = delta_merge.whenNotMatchedInsertAll(\n                    condition=insert_predicate\n                )\n        else:\n            if insert_column_set:\n                delta_merge = delta_merge.whenNotMatchedInsert(values=insert_column_set)\n            else:\n                delta_merge = delta_merge.whenNotMatchedInsertAll()\n\n        return delta_merge\n\n    @staticmethod\n    def _merge(delta_table: DeltaTable, output_spec: OutputSpec, df: DataFrame) -> None:\n        \"\"\"Perform a delta lake merge according to several merge options.\n\n        Args:\n            delta_table: delta table to which to merge data.\n            output_spec: output specification containing the merge options.\n            df: dataframe with the new data to be merged into the delta table.\n        \"\"\"\n        delta_merge = delta_table.alias(\"current\").merge(\n            df.alias(\"new\"), output_spec.merge_opts.merge_predicate\n        )\n\n        if not output_spec.merge_opts.insert_only:\n            if output_spec.merge_opts.delete_predicate:\n                delta_merge = delta_merge.whenMatchedDelete(\n                    output_spec.merge_opts.delete_predicate\n                )\n            delta_merge = DeltaMergeWriter._update(\n                delta_merge,\n                output_spec.merge_opts.update_predicate,\n                output_spec.merge_opts.update_column_set,\n            )\n\n        delta_merge = DeltaMergeWriter._insert(\n            delta_merge,\n            output_spec.merge_opts.insert_predicate,\n            output_spec.merge_opts.insert_column_set,\n        )\n\n        delta_merge.execute()\n\n    @staticmethod\n    def _update(\n        delta_merge: DeltaMergeBuilder,\n        update_predicate: Optional[str],\n        update_column_set: Optional[dict],\n    ) -> DeltaMergeBuilder:\n        \"\"\"Get the builder of merge data with update predicate and column set.\n\n        Args:\n            delta_merge: builder of the merge data.\n            update_predicate: condition of the update.\n            update_column_set: rules for setting the values of\n                columns that need to be updated.\n\n        Returns:\n            DeltaMergeBuilder: builder of the merge data with update.\n        \"\"\"\n        if update_predicate:\n            if update_column_set:\n                delta_merge = delta_merge.whenMatchedUpdate(\n                    condition=update_predicate,\n                    set=update_column_set,\n                )\n            else:\n                delta_merge = delta_merge.whenMatchedUpdateAll(\n                    condition=update_predicate\n                )\n        else:\n            if update_column_set:\n                delta_merge = delta_merge.whenMatchedUpdate(set=update_column_set)\n            else:\n                delta_merge = delta_merge.whenMatchedUpdateAll()\n\n        return delta_merge\n\n    @staticmethod\n    def _write_transformed_micro_batch(  # type: ignore\n        output_spec: OutputSpec,\n        data: OrderedDict,\n        delta_table: Optional[DeltaTable] = None,\n    ) -> Callable:\n        \"\"\"Perform the merge in streaming mode by specifying a transform function.\n\n        This function returns a function that will be invoked in the foreachBatch in\n        streaming mode, performing a delta 
lake merge while streaming the micro batches.\n\n        Args:\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n            delta_table: delta table into which the streaming data is merged.\n\n        Returns:\n            Function to call in the .foreachBatch streaming method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            transformed_df = Writer.get_transformed_micro_batch(\n                output_spec, batch_df, batch_id, data\n            )\n\n            if output_spec.streaming_micro_batch_dq_processors:\n                transformed_df = Writer.run_micro_batch_dq_process(\n                    transformed_df, output_spec.streaming_micro_batch_dq_processors\n                )\n\n            DeltaMergeWriter._merge(delta_table, output_spec, transformed_df)\n\n        return inner\n"
  },
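Editor's note: for reference, the merge that _merge assembles when only merge_predicate is set (no update/insert predicates or column sets) is equivalent to the plain Delta Lake builder below. Table names and the join predicate are hypothetical; a Delta-enabled Spark session is assumed.

```python
# Equivalent plain Delta Lake merge (illustration only).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()          # assumes Delta Lake is configured
target = DeltaTable.forName(spark, "my_db.sales")   # hypothetical target table
new_df = spark.table("my_db.sales_updates")         # hypothetical source of new data

(
    target.alias("current")
    .merge(new_df.alias("new"), "current.id = new.id")  # merge_opts.merge_predicate
    .whenMatchedUpdateAll()        # _update with no update predicate/column set
    .whenNotMatchedInsertAll()     # _insert with no insert predicate/column set
    .execute()
)
```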
  {
    "path": "lakehouse_engine/io/writers/file_writer.py",
    "content": "\"\"\"Module to define behaviour to write to files.\"\"\"\n\nfrom typing import Callable, OrderedDict\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import OutputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.writer import Writer\n\n\nclass FileWriter(Writer):\n    \"\"\"Class to write data to files.\"\"\"\n\n    def __init__(self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict):\n        \"\"\"Construct FileWriter instances.\n\n        Args:\n            output_spec: output specification\n            df: dataframe to be written.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        super().__init__(output_spec, df, data)\n\n    def write(self) -> None:\n        \"\"\"Write data to files.\"\"\"\n        if not self._df.isStreaming:\n            self._write_to_files_in_batch_mode(self._df, self._output_spec)\n        else:\n            self._write_to_files_in_streaming_mode(\n                self._df, self._output_spec, self._data\n            )\n\n    @staticmethod\n    def _write_to_files_in_batch_mode(df: DataFrame, output_spec: OutputSpec) -> None:\n        \"\"\"Write to files in batch mode.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n        \"\"\"\n        df.write.format(output_spec.data_format).partitionBy(\n            output_spec.partitions\n        ).options(**output_spec.options if output_spec.options else {}).mode(\n            output_spec.write_type\n        ).save(\n            output_spec.location\n        )\n\n    @staticmethod\n    def _write_to_files_in_streaming_mode(\n        df: DataFrame, output_spec: OutputSpec, data: OrderedDict\n    ) -> None:\n        \"\"\"Write to files in streaming mode.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        df_writer = df.writeStream.trigger(**Writer.get_streaming_trigger(output_spec))\n\n        if (\n            output_spec.streaming_micro_batch_transformers\n            or output_spec.streaming_micro_batch_dq_processors\n        ):\n            stream_df = (\n                df_writer.options(**output_spec.options if output_spec.options else {})\n                .foreachBatch(\n                    FileWriter._write_transformed_micro_batch(output_spec, data)\n                )\n                .start()\n            )\n        else:\n            stream_df = (\n                df_writer.format(output_spec.data_format)\n                .partitionBy(output_spec.partitions)\n                .options(**output_spec.options if output_spec.options else {})\n                .outputMode(output_spec.write_type)\n                .start(output_spec.location)\n            )\n\n        if output_spec.streaming_await_termination:\n            stream_df.awaitTermination(output_spec.streaming_await_termination_timeout)\n\n    @staticmethod\n    def _write_transformed_micro_batch(  # type: ignore\n        output_spec: OutputSpec, data: OrderedDict\n    ) -> Callable:\n        \"\"\"Define how to write a streaming micro batch after transforming it.\n\n        Args:\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n\n        Returns:\n            A function to be executed in the foreachBatch spark write method.\n        
\"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            transformed_df = Writer.get_transformed_micro_batch(\n                output_spec, batch_df, batch_id, data\n            )\n\n            if output_spec.streaming_micro_batch_dq_processors:\n                transformed_df = Writer.run_micro_batch_dq_process(\n                    transformed_df, output_spec.streaming_micro_batch_dq_processors\n                )\n\n            FileWriter._write_to_files_in_batch_mode(transformed_df, output_spec)\n\n        return inner\n"
  },
  {
    "path": "lakehouse_engine/io/writers/jdbc_writer.py",
    "content": "\"\"\"Module that defines the behaviour to write to JDBC targets.\"\"\"\n\nfrom typing import Callable, OrderedDict\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import OutputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.writer import Writer\n\n\nclass JDBCWriter(Writer):\n    \"\"\"Class to write to JDBC targets.\"\"\"\n\n    def __init__(self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict):\n        \"\"\"Construct JDBCWriter instances.\n\n        Args:\n            output_spec: output specification.\n            df: dataframe to be writen.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        super().__init__(output_spec, df, data)\n\n    def write(self) -> None:\n        \"\"\"Write data into JDBC target.\"\"\"\n        if not self._df.isStreaming:\n            self._write_to_jdbc_in_batch_mode(self._df, self._output_spec)\n        else:\n            stream_df = (\n                self._df.writeStream.trigger(\n                    **Writer.get_streaming_trigger(self._output_spec)\n                )\n                .options(\n                    **self._output_spec.options if self._output_spec.options else {}\n                )\n                .foreachBatch(\n                    self._write_transformed_micro_batch(self._output_spec, self._data)\n                )\n                .start()\n            )\n\n            if self._output_spec.streaming_await_termination:\n                stream_df.awaitTermination(\n                    self._output_spec.streaming_await_termination_timeout\n                )\n\n    @staticmethod\n    def _write_to_jdbc_in_batch_mode(df: DataFrame, output_spec: OutputSpec) -> None:\n        \"\"\"Write to jdbc in batch mode.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n        \"\"\"\n        df.write.format(output_spec.data_format).partitionBy(\n            output_spec.partitions\n        ).options(**output_spec.options if output_spec.options else {}).mode(\n            output_spec.write_type\n        ).save(\n            output_spec.location\n        )\n\n    @staticmethod\n    def _write_transformed_micro_batch(  # type: ignore\n        output_spec: OutputSpec, data: OrderedDict\n    ) -> Callable:\n        \"\"\"Define how to write a streaming micro batch after transforming it.\n\n        Args:\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n\n        Returns:\n            A function to be executed in the foreachBatch spark write method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            transformed_df = Writer.get_transformed_micro_batch(\n                output_spec, batch_df, batch_id, data\n            )\n\n            if output_spec.streaming_micro_batch_dq_processors:\n                transformed_df = Writer.run_micro_batch_dq_process(\n                    transformed_df, output_spec.streaming_micro_batch_dq_processors\n                )\n\n            JDBCWriter._write_to_jdbc_in_batch_mode(transformed_df, output_spec)\n\n        return inner\n"
  },
  {
    "path": "lakehouse_engine/io/writers/kafka_writer.py",
    "content": "\"\"\"Module that defines the behaviour to write to Kafka.\"\"\"\n\nfrom typing import Callable, OrderedDict\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import OutputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.writer import Writer\n\n\nclass KafkaWriter(Writer):\n    \"\"\"Class to write to a Kafka target.\"\"\"\n\n    def __init__(self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict):\n        \"\"\"Construct KafkaWriter instances.\n\n        Args:\n            output_spec: output specification.\n            df: dataframe to be written.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        super().__init__(output_spec, df, data)\n\n    def write(self) -> None:\n        \"\"\"Write data to Kafka.\"\"\"\n        if not self._df.isStreaming:\n            self._write_to_kafka_in_batch_mode(self._df, self._output_spec)\n        else:\n            self._write_to_kafka_in_streaming_mode(\n                self._df, self._output_spec, self._data\n            )\n\n    @staticmethod\n    def _write_to_kafka_in_batch_mode(df: DataFrame, output_spec: OutputSpec) -> None:\n        \"\"\"Write to Kafka in batch mode.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n        \"\"\"\n        df.write.format(output_spec.data_format).options(\n            **output_spec.options if output_spec.options else {}\n        ).mode(output_spec.write_type).save()\n\n    @staticmethod\n    def _write_to_kafka_in_streaming_mode(\n        df: DataFrame, output_spec: OutputSpec, data: OrderedDict\n    ) -> None:\n        \"\"\"Write to kafka in streaming mode.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        df_writer = df.writeStream.trigger(**Writer.get_streaming_trigger(output_spec))\n\n        if (\n            output_spec.streaming_micro_batch_transformers\n            or output_spec.streaming_micro_batch_dq_processors\n        ):\n            stream_df = (\n                df_writer.options(**output_spec.options if output_spec.options else {})\n                .foreachBatch(\n                    KafkaWriter._write_transformed_micro_batch(output_spec, data)\n                )\n                .start()\n            )\n        else:\n            stream_df = (\n                df_writer.format(output_spec.data_format)\n                .options(**output_spec.options if output_spec.options else {})\n                .start()\n            )\n\n        if output_spec.streaming_await_termination:\n            stream_df.awaitTermination(output_spec.streaming_await_termination_timeout)\n\n    @staticmethod\n    def _write_transformed_micro_batch(  # type: ignore\n        output_spec: OutputSpec, data: OrderedDict\n    ) -> Callable:\n        \"\"\"Define how to write a streaming micro batch after transforming it.\n\n        Args:\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n\n        Returns:\n            A function to be executed in the foreachBatch spark write method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            transformed_df = Writer.get_transformed_micro_batch(\n                
output_spec, batch_df, batch_id, data\n            )\n\n            if output_spec.streaming_micro_batch_dq_processors:\n                transformed_df = Writer.run_micro_batch_dq_process(\n                    transformed_df, output_spec.streaming_micro_batch_dq_processors\n                )\n\n            KafkaWriter._write_to_kafka_in_batch_mode(transformed_df, output_spec)\n\n        return inner\n"
  },
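Editor's note: the batch path above hands the DataFrame straight to Spark's Kafka sink, so the usual sink contract applies: a string/binary value column (and optionally key), plus kafka.bootstrap.servers and topic options. A hedged sketch with hypothetical broker and topic names (requires the spark-sql-kafka package):

```python
# Illustration of the DataFrame shape and options Spark's Kafka sink expects.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct, to_json

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

kafka_df = df.select(
    col("id").cast("string").alias("key"),
    to_json(struct("id", "value")).alias("value"),
)

(
    kafka_df.write.format("kafka")
    .options(
        **{
            "kafka.bootstrap.servers": "broker:9092",  # hypothetical broker
            "topic": "sales-events",                    # hypothetical topic
        }
    )
    .mode("append")
    .save()
)
```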
  {
    "path": "lakehouse_engine/io/writers/rest_api_writer.py",
    "content": "\"\"\"Module to define behaviour to write to REST APIs.\"\"\"\n\nimport json\nfrom typing import Any, Callable, OrderedDict\n\nfrom pyspark.sql import DataFrame, Row\n\nfrom lakehouse_engine.core.definitions import OutputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.writer import Writer\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.rest_api import (\n    RESTApiException,\n    RestMethods,\n    RestStatusCodes,\n    execute_api_request,\n)\n\n\nclass RestApiWriter(Writer):\n    \"\"\"Class to write data to a REST API.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict):\n        \"\"\"Construct RestApiWriter instances.\n\n        Args:\n            output_spec: output specification.\n            df: dataframe to be written.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        super().__init__(output_spec, df, data)\n\n    def write(self) -> None:\n        \"\"\"Write data to REST API.\"\"\"\n        if not self._df.isStreaming:\n            self._write_to_rest_api_in_batch_mode(self._df, self._output_spec)\n        else:\n            self._write_to_rest_api_in_streaming_mode(\n                self._df, self._output_spec, self._data\n            )\n\n    @staticmethod\n    def _get_func_to_send_payload_to_rest_api(output_spec: OutputSpec) -> Callable:\n        \"\"\"Define and return a function to send the payload to the REST api.\n\n        Args:\n            output_spec: Output Specification containing configurations to\n                communicate with the REST api. Within the output_spec, the user\n                can specify several options:\n                    - rest_api_header: http headers.\n                    - rest_api_basic_auth: basic http authentication details\n                        (e.g., {\"username\": \"x\", \"password\": \"y\"}).\n                    - rest_api_url: url of the api.\n                    - rest_api_method: REST method (e.g., POST or PUT).\n                    - rest_api_sleep_seconds: sleep seconds to avoid throttling.\n                    - rest_api_is_file_payload: if the payload to be sent to the\n                        api is in the format of a file using multipart encoding\n                        upload. if this is true, then the payload will always be\n                        sent using the \"files\" parameter in Python's requests\n                        library.\n                    - rest_api_file_payload_name: when rest_api_is_file_payload\n                        is true, this option can be used to define the file\n                        identifier in Python's requests library.\n                    - extra_json_payload: when rest_api_file_payload_name is False,\n                        can be used to provide additional JSON variables to add to\n                        the original payload. This is useful to complement\n                        the original payload with some extra input to better\n                        configure the final payload to send to the REST api. 
An\n                        example can be to add a constant configuration value to\n                        add to the payload data.\n\n        Returns:\n            Function to be called inside Spark dataframe.foreach.\n        \"\"\"\n        headers = output_spec.options.get(\"rest_api_header\", None)\n        basic_auth_dict = output_spec.options.get(\"rest_api_basic_auth\", None)\n        url = output_spec.options[\"rest_api_url\"]\n        method = output_spec.options.get(\"rest_api_method\", RestMethods.POST.value)\n        sleep_seconds = output_spec.options.get(\"rest_api_sleep_seconds\", 0)\n        is_file_payload = output_spec.options.get(\"rest_api_is_file_payload\", False)\n        file_payload_name = output_spec.options.get(\n            \"rest_api_file_payload_name\", \"file\"\n        )\n        extra_json_payload = output_spec.options.get(\n            \"rest_api_extra_json_payload\", None\n        )\n        success_status_codes = output_spec.options.get(\n            \"rest_api_success_status_codes\", RestStatusCodes.OK_STATUS_CODES.value\n        )\n\n        def send_payload_to_rest_api(row: Row) -> Any:\n            \"\"\"Send payload to the REST API.\n\n            The payload needs to be prepared as a JSON string column in a dataframe.\n            E.g., {\"a\": \"a value\", \"b\": \"b value\"}.\n\n            Args:\n                row: a row in a dataframe.\n            \"\"\"\n            if \"payload\" not in row:\n                raise ValueError(\"Input DataFrame must contain 'payload' column.\")\n\n            str_payload = row.payload\n\n            payload = None\n            if not is_file_payload:\n                payload = json.loads(str_payload)\n            else:\n                payload = {file_payload_name: str_payload}\n\n            if extra_json_payload:\n                payload.update(extra_json_payload)\n\n            RestApiWriter._logger.debug(f\"Original payload: {str_payload}\")\n            RestApiWriter._logger.debug(f\"Final payload: {payload}\")\n\n            response = execute_api_request(\n                method=method,\n                url=url,\n                headers=headers,\n                basic_auth_dict=basic_auth_dict,\n                json=payload if not is_file_payload else None,\n                files=payload if is_file_payload else None,\n                sleep_seconds=sleep_seconds,\n            )\n\n            RestApiWriter._logger.debug(\n                f\"Response: {response.status_code} - {response.text}\"\n            )\n\n            if response.status_code not in success_status_codes:\n                raise RESTApiException(\n                    f\"API response status code {response.status_code} is not in\"\n                    f\" {success_status_codes}. Got {response.text}\"\n                )\n\n        return send_payload_to_rest_api\n\n    @staticmethod\n    def _write_to_rest_api_in_batch_mode(\n        df: DataFrame, output_spec: OutputSpec\n    ) -> None:\n        \"\"\"Write to REST API in Spark batch mode.\n\n        This function uses the dataframe.foreach function to generate a payload\n        for each row of the dataframe and send it to the REST API endpoint.\n\n        Warning! Make sure your execution environment supports RDD api operations,\n        as there are environments where RDD operation may not be supported. 
Since\n        df.foreach() is a shorthand for df.rdd.foreach(), this can cause issues\n        in such environments.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n        \"\"\"\n        df.foreach(RestApiWriter._get_func_to_send_payload_to_rest_api(output_spec))\n\n    @staticmethod\n    def _write_to_rest_api_in_streaming_mode(\n        df: DataFrame, output_spec: OutputSpec, data: OrderedDict\n    ) -> None:\n        \"\"\"Write to REST API in streaming mode.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        df_writer = df.writeStream.trigger(**Writer.get_streaming_trigger(output_spec))\n\n        stream_df = (\n            df_writer.options(**output_spec.options if output_spec.options else {})\n            .foreachBatch(\n                RestApiWriter._write_transformed_micro_batch(output_spec, data)\n            )\n            .start()\n        )\n\n        if output_spec.streaming_await_termination:\n            stream_df.awaitTermination(output_spec.streaming_await_termination_timeout)\n\n    @staticmethod\n    def _write_transformed_micro_batch(  # type: ignore\n        output_spec: OutputSpec, data: OrderedDict\n    ) -> Callable:\n        \"\"\"Define how to write a streaming micro batch after transforming it.\n\n        Args:\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n\n        Returns:\n            A function to be executed in the foreachBatch spark write method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            transformed_df = Writer.get_transformed_micro_batch(\n                output_spec, batch_df, batch_id, data\n            )\n\n            if output_spec.streaming_micro_batch_dq_processors:\n                transformed_df = Writer.run_micro_batch_dq_process(\n                    transformed_df, output_spec.streaming_micro_batch_dq_processors\n                )\n\n            RestApiWriter._write_to_rest_api_in_batch_mode(transformed_df, output_spec)\n\n        return inner\n"
  },
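Editor's note: as the docstrings above state, the writer expects each row to carry a "payload" column holding a JSON string, and the endpoint details come from the rest_api_* options. A hedged sketch of preparing that input; the URL and columns are hypothetical, and the options dict would go into OutputSpec.options.

```python
# Hedged sketch, not part of the repository: shaping input for RestApiWriter.
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, to_json

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# One request body per row, e.g. {"id": 1, "value": "a"}.
payload_df = df.select(to_json(struct("id", "value")).alias("payload"))

options = {
    "rest_api_url": "https://example.com/api/items",         # hypothetical endpoint
    "rest_api_method": "POST",
    "rest_api_header": {"Content-Type": "application/json"},
    "rest_api_sleep_seconds": 1,
    "rest_api_success_status_codes": [200, 201],
}
# These options belong in OutputSpec.options for the REST API output format.
```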
  {
    "path": "lakehouse_engine/io/writers/sharepoint_writer.py",
    "content": "\"\"\"Module to define the behaviour to write to Sharepoint.\"\"\"\n\nimport os\nfrom typing import OrderedDict\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import OutputSpec\nfrom lakehouse_engine.io.exceptions import (\n    EndpointNotFoundException,\n    NotSupportedException,\n    WriteToLocalException,\n)\nfrom lakehouse_engine.io.writer import Writer\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.sharepoint_utils import SharepointUtils\n\n\nclass SharepointWriter(Writer):\n    \"\"\"Class to write data to Sharepoint.\n\n    This writer is designed specifically for uploading a single file\n    to Sharepoint. It first writes the data locally before uploading\n    it to the specified Sharepoint location. Since it handles only\n    a single file at a time, any logic for writing multiple files\n    must be implemented on the notebook-side.\n    \"\"\"\n\n    def __init__(self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict):\n        \"\"\"Construct FileWriter instances.\n\n        Args:\n            output_spec: output specification\n            df: dataframe to be written.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        super().__init__(output_spec, df, data)\n        self.sharepoint_utils = self._get_sharepoint_utils()\n        self._logger = LoggingHandler(__name__).get_logger()\n\n    def write(self) -> None:\n        \"\"\"Upload data to Sharepoint.\"\"\"\n        if self._df.isStreaming:\n            raise NotSupportedException(\"Sharepoint writer doesn't support streaming!\")\n\n        self._output_spec.sharepoint_opts.validate_for_writer()\n        if not self.sharepoint_utils.check_if_endpoint_exists(\n            folder_root_path=self._output_spec.sharepoint_opts.folder_relative_path\n        ):\n            raise EndpointNotFoundException(\"The provided endpoint does not exist!\")\n\n        self._write_to_sharepoint_in_batch_mode(self._df)\n\n    def _get_sharepoint_utils(self) -> SharepointUtils:\n        sharepoint_utils = SharepointUtils(\n            client_id=self._output_spec.sharepoint_opts.client_id,\n            tenant_id=self._output_spec.sharepoint_opts.tenant_id,\n            local_path=self._output_spec.sharepoint_opts.local_path,\n            api_version=self._output_spec.sharepoint_opts.api_version,\n            site_name=self._output_spec.sharepoint_opts.site_name,\n            drive_name=self._output_spec.sharepoint_opts.drive_name,\n            file_name=self._output_spec.sharepoint_opts.file_name,\n            folder_relative_path=self._output_spec.sharepoint_opts.folder_relative_path,\n            chunk_size=self._output_spec.sharepoint_opts.chunk_size,\n            local_options=self._output_spec.sharepoint_opts.local_options,\n            secret=self._output_spec.sharepoint_opts.secret,\n            conflict_behaviour=self._output_spec.sharepoint_opts.conflict_behaviour,\n        )\n\n        return sharepoint_utils\n\n    def _write_to_sharepoint_in_batch_mode(self, df: DataFrame) -> None:\n        \"\"\"Write to Sharepoint in batch mode.\n\n        This method first writes the provided DataFrame to a local file using the\n        SharePointUtils `write_to_local_path` method. 
If the local file is successfully\n        written, it then uploads the file to Sharepoint using the `write_to_sharepoint`\n        method, logging the process and outcome.\n\n        Args:\n            df: The DataFrame to write to a local file and subsequently\n                upload to Sharepoint.\n        \"\"\"\n        local_path = self._output_spec.sharepoint_opts.local_path\n        file_name = self._output_spec.sharepoint_opts.file_name\n\n        self._logger.info(f\"Starting to write the data to the local path: {local_path}\")\n\n        try:\n            self.sharepoint_utils.write_to_local_path(df)\n        except IOError as err:\n            self.sharepoint_utils.delete_local_path()\n            self._logger.info(f\"Deleted the local folder: {local_path}\")\n            raise WriteToLocalException(\n                f\"The data was not written on the local path: {local_path}\"\n            ) from err\n\n        self._logger.info(f\"The data was written to the local path: {local_path}\")\n        file_size = os.path.getsize(local_path)\n        self._logger.info(\n            f\"Uploading the {file_name} ({file_size} bytes) to Sharepoint.\"\n        )\n        self.sharepoint_utils.write_to_sharepoint()\n        self._logger.info(f\"The {file_name} was uploaded to Sharepoint with success!\")\n        self.sharepoint_utils.delete_local_path()\n        self._logger.info(f\"Deleted the local folder: {local_path}\")\n"
  },
  {
    "path": "lakehouse_engine/io/writers/table_writer.py",
    "content": "\"\"\"Module that defines the behaviour to write to tables.\"\"\"\n\nfrom typing import Any, Callable, OrderedDict\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import OutputFormat, OutputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.writer import Writer\n\n\nclass TableWriter(Writer):\n    \"\"\"Class to write to a table.\"\"\"\n\n    def __init__(self, output_spec: OutputSpec, df: DataFrame, data: OrderedDict):\n        \"\"\"Construct TableWriter instances.\n\n        Args:\n            output_spec: output specification.\n            df: dataframe to be written.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        super().__init__(output_spec, df, data)\n\n    def write(self) -> None:\n        \"\"\"Write data to a table.\n\n        After the write operation we repair the table (e.g., update partitions).\n        However, there's a caveat to this, which is the fact that this repair\n        operation is not reachable if we are running long-running streaming mode.\n        Therefore, we recommend not using the TableWriter with formats other than\n        delta lake for those scenarios (as delta lake does not need msck repair).\n        So, you can: 1) use delta lake format for the table; 2) use the FileWriter\n        and run the repair with a certain frequency in a separate task of your\n        pipeline.\n        \"\"\"\n        if not self._df.isStreaming:\n            self._write_to_table_in_batch_mode(self._df, self._output_spec)\n        else:\n            df_writer = self._df.writeStream.trigger(\n                **Writer.get_streaming_trigger(self._output_spec)\n            )\n\n            if (\n                self._output_spec.streaming_micro_batch_transformers\n                or self._output_spec.streaming_micro_batch_dq_processors\n            ):\n                stream_df = (\n                    df_writer.options(\n                        **self._output_spec.options if self._output_spec.options else {}\n                    )\n                    .foreachBatch(\n                        self._write_transformed_micro_batch(\n                            self._output_spec, self._data\n                        )\n                    )\n                    .start()\n                )\n\n                if self._output_spec.streaming_await_termination:\n                    stream_df.awaitTermination(\n                        self._output_spec.streaming_await_termination_timeout\n                    )\n            else:\n                self._write_to_table_in_streaming_mode(df_writer, self._output_spec)\n\n        if (\n            self._output_spec.data_format != OutputFormat.DELTAFILES.value\n            and self._output_spec.partitions\n        ):\n            ExecEnv.SESSION.sql(f\"MSCK REPAIR TABLE {self._output_spec.db_table}\")\n\n    @staticmethod\n    def _write_to_table_in_batch_mode(df: DataFrame, output_spec: OutputSpec) -> None:\n        \"\"\"Write to a metastore table in batch mode.\n\n        Args:\n            df: dataframe to write.\n            output_spec: output specification.\n        \"\"\"\n        df_writer = df.write.format(output_spec.data_format)\n\n        if output_spec.partitions:\n            df_writer = df_writer.partitionBy(output_spec.partitions)\n\n        if output_spec.location:\n            df_writer = df_writer.options(\n                path=output_spec.location,\n                **output_spec.options if 
output_spec.options else {},\n            )\n        else:\n            df_writer = df_writer.options(\n                **output_spec.options if output_spec.options else {}\n            )\n\n        df_writer.mode(output_spec.write_type).saveAsTable(output_spec.db_table)\n\n    @staticmethod\n    def _write_to_table_in_streaming_mode(\n        df_writer: Any, output_spec: OutputSpec\n    ) -> None:\n        \"\"\"Write to a metastore table in streaming mode.\n\n        Args:\n            df_writer: dataframe writer.\n            output_spec: output specification.\n        \"\"\"\n        df_writer = df_writer.outputMode(output_spec.write_type).format(\n            output_spec.data_format\n        )\n\n        if output_spec.partitions:\n            df_writer = df_writer.partitionBy(output_spec.partitions)\n\n        if output_spec.location:\n            df_writer = df_writer.options(\n                path=output_spec.location,\n                **output_spec.options if output_spec.options else {},\n            )\n        else:\n            df_writer = df_writer.options(\n                **output_spec.options if output_spec.options else {}\n            )\n\n        if output_spec.streaming_await_termination:\n            df_writer.toTable(output_spec.db_table).awaitTermination(\n                output_spec.streaming_await_termination_timeout\n            )\n        else:\n            df_writer.toTable(output_spec.db_table)\n\n    @staticmethod\n    def _write_transformed_micro_batch(  # type: ignore\n        output_spec: OutputSpec, data: OrderedDict\n    ) -> Callable:\n        \"\"\"Define how to write a streaming micro batch after transforming it.\n\n        Args:\n            output_spec: output specification.\n            data: list of all dfs generated on previous steps before writer.\n\n        Returns:\n            A function to be executed in the foreachBatch spark write method.\n        \"\"\"\n\n        def inner(batch_df: DataFrame, batch_id: int) -> None:\n            ExecEnv.get_for_each_batch_session(batch_df)\n            transformed_df = Writer.get_transformed_micro_batch(\n                output_spec, batch_df, batch_id, data\n            )\n\n            if output_spec.streaming_micro_batch_dq_processors:\n                transformed_df = Writer.run_micro_batch_dq_process(\n                    transformed_df, output_spec.streaming_micro_batch_dq_processors\n                )\n\n            TableWriter._write_to_table_in_batch_mode(transformed_df, output_spec)\n\n        return inner\n"
  },
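  {
    "path": "examples/hypothetical/table_writer_usage_sketch.py",
    "content": "\"\"\"Hypothetical usage sketch for TableWriter (illustration only, not part of the repository).\n\nMinimal batch example of the non-streaming code path in TableWriter.write(): the\ndataframe is saved with saveAsTable and, because the format is delta, no MSCK REPAIR\nis issued. The OutputSpec fields used below (spec_id, input_id, db_table, write_type,\ndata_format, partitions) mirror how OutputSpec is used elsewhere in the engine, but the\nexact required fields of the dataclass are an assumption.\n\"\"\"\n\nfrom collections import OrderedDict\n\nfrom lakehouse_engine.core.definitions import OutputFormat, OutputSpec, WriteType\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.writers.table_writer import TableWriter\n\nExecEnv.get_or_create(app_name=\"table_writer_usage_sketch\")\n\n# Small in-memory dataframe standing in for the output of previous algorithm steps.\ndf = ExecEnv.SESSION.createDataFrame(\n    [(\"2024-01-01\", 1), (\"2024-01-02\", 2)], [\"date\", \"amount\"]\n)\n\noutput_spec = OutputSpec(\n    spec_id=\"sales_bronze\",\n    input_id=\"sales_source\",\n    db_table=\"my_database.sales_bronze\",\n    write_type=WriteType.APPEND.value,\n    data_format=OutputFormat.DELTAFILES.value,\n    partitions=[\"date\"],\n)\n\n# Batch mode: df is not streaming, so _write_to_table_in_batch_mode is used.\nTableWriter(output_spec=output_spec, df=df, data=OrderedDict()).write()\n"
  },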
  {
    "path": "lakehouse_engine/terminators/__init__.py",
    "content": "\"\"\"Package to define algorithm terminators (e.g., vacuum, optimize, compute stats).\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/terminators/cdf_processor.py",
    "content": "\"\"\"Defines change data feed processor behaviour.\"\"\"\n\nfrom datetime import datetime, timedelta\nfrom typing import OrderedDict\n\nfrom delta.tables import DeltaTable\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col, date_format\n\nfrom lakehouse_engine.core.definitions import (\n    InputSpec,\n    OutputFormat,\n    OutputSpec,\n    ReadType,\n    TerminatorSpec,\n    WriteType,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.reader_factory import ReaderFactory\nfrom lakehouse_engine.io.writer_factory import WriterFactory\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass CDFProcessor(object):\n    \"\"\"Change data feed processor class.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def expose_cdf(cls, spec: TerminatorSpec) -> None:\n        \"\"\"Expose CDF to external location.\n\n        Args:\n            spec: terminator specification.\n        \"\"\"\n        cls._logger.info(\"Reading CDF from input table...\")\n\n        df_cdf = ReaderFactory.get_data(cls._get_table_cdf_input_specs(spec))\n        new_df_cdf = df_cdf.withColumn(\n            \"_commit_timestamp\",\n            date_format(col(\"_commit_timestamp\"), \"yyyyMMddHHmmss\"),\n        )\n\n        cls._logger.info(\"Writing CDF to external table...\")\n        cls._write_cdf_to_external(\n            spec,\n            new_df_cdf.repartition(\n                spec.args.get(\n                    \"materialized_cdf_num_partitions\", col(\"_commit_timestamp\")\n                )\n            ),\n        )\n\n        # used to delete old data on CDF table (don't remove parquet).\n        if spec.args.get(\"clean_cdf\", True):\n            cls._logger.info(\"Cleaning CDF table...\")\n            cls.delete_old_data(spec)\n\n        # used to delete old parquet files.\n        if spec.args.get(\"vacuum_cdf\", False):\n            cls._logger.info(\"Vacuuming CDF table...\")\n            cls.vacuum_cdf_data(spec)\n\n    @staticmethod\n    def _write_cdf_to_external(\n        spec: TerminatorSpec, df: DataFrame, data: OrderedDict = None\n    ) -> None:\n        \"\"\"Write cdf results dataframe.\n\n        Args:\n            spec: terminator specification.\n            df: dataframe with cdf results to write.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        WriterFactory.get_writer(\n            spec=OutputSpec(\n                spec_id=\"materialized_cdf\",\n                input_id=\"input_table\",\n                location=spec.args[\"materialized_cdf_location\"],\n                write_type=WriteType.APPEND.value,\n                data_format=spec.args.get(\"data_format\", OutputFormat.DELTAFILES.value),\n                options=spec.args[\"materialized_cdf_options\"],\n                partitions=[\"_commit_timestamp\"],\n            ),\n            df=df,\n            data=data,\n        ).write()\n\n    @staticmethod\n    def _get_table_cdf_input_specs(spec: TerminatorSpec) -> InputSpec:\n        \"\"\"Get the input specifications from a terminator spec.\n\n        Args:\n            spec: terminator specifications.\n\n        Returns:\n            List of input specifications.\n        \"\"\"\n        options = {\n            \"readChangeFeed\": \"true\",\n            **spec.args.get(\"db_table_options\", {}),\n        }\n\n        input_specs = InputSpec(\n            spec_id=\"input_table\",\n            
db_table=spec.args[\"db_table\"],\n            read_type=ReadType.STREAMING.value,\n            data_format=OutputFormat.DELTAFILES.value,\n            options=options,\n        )\n\n        return input_specs\n\n    @classmethod\n    def delete_old_data(cls, spec: TerminatorSpec) -> None:\n        \"\"\"Delete old data from cdf delta table.\n\n        Args:\n            spec: terminator specifications.\n        \"\"\"\n        today_datetime = datetime.today()\n        limit_date = today_datetime + timedelta(\n            days=spec.args.get(\"days_to_keep\", 30) * -1\n        )\n        limit_timestamp = limit_date.strftime(\"%Y%m%d%H%M%S\")\n\n        cdf_delta_table = DeltaTable.forPath(\n            ExecEnv.SESSION, spec.args[\"materialized_cdf_location\"]\n        )\n\n        cdf_delta_table.delete(col(\"_commit_timestamp\") < limit_timestamp)\n\n    @classmethod\n    def vacuum_cdf_data(cls, spec: TerminatorSpec) -> None:\n        \"\"\"Vacuum old data from cdf delta table.\n\n        Args:\n            spec: terminator specifications.\n        \"\"\"\n        cdf_delta_table = DeltaTable.forPath(\n            ExecEnv.SESSION, spec.args[\"materialized_cdf_location\"]\n        )\n\n        cdf_delta_table.vacuum(spec.args.get(\"vacuum_hours\", 168))\n"
  },
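  {
    "path": "examples/hypothetical/expose_cdf_usage_sketch.py",
    "content": "\"\"\"Hypothetical usage sketch for CDFProcessor.expose_cdf (illustration only, not part of the repository).\n\nBuilds a TerminatorSpec whose args cover the keys read by expose_cdf: the source delta\ntable with change data feed enabled, the external location where the feed is\nmaterialized, and the optional clean/vacuum toggles. All table names, paths and option\nvalues are placeholders.\n\"\"\"\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.terminators.cdf_processor import CDFProcessor\n\nExecEnv.get_or_create(app_name=\"expose_cdf_sketch\")\n\nspec = TerminatorSpec(\n    function=\"expose_cdf\",\n    args={\n        \"db_table\": \"my_database.orders\",\n        \"materialized_cdf_location\": \"s3://my-bucket/cdf/orders/\",\n        \"materialized_cdf_options\": {\n            \"checkpointLocation\": \"s3://my-bucket/cdf/orders/_checkpoint/\"\n        },\n        # keep only the last 30 days in the materialized CDF table...\n        \"clean_cdf\": True,\n        \"days_to_keep\": 30,\n        # ...and also vacuum the underlying parquet files (168h retention).\n        \"vacuum_cdf\": True,\n        \"vacuum_hours\": 168,\n    },\n)\n\nCDFProcessor.expose_cdf(spec)\n"
  },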
  {
    "path": "lakehouse_engine/terminators/dataset_optimizer.py",
    "content": "\"\"\"Module with dataset optimizer terminator.\"\"\"\n\nfrom typing import List, Optional\n\nfrom pyspark.sql.utils import AnalysisException, ParseException\n\nfrom lakehouse_engine.core.table_manager import TableManager\nfrom lakehouse_engine.transformers.exceptions import WrongArgumentsException\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass DatasetOptimizer(object):\n    \"\"\"Class with dataset optimizer terminator.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def optimize_dataset(\n        cls,\n        db_table: Optional[str] = None,\n        location: Optional[str] = None,\n        compute_table_stats: bool = True,\n        vacuum: bool = True,\n        vacuum_hours: int = 720,\n        optimize: bool = True,\n        optimize_where: Optional[str] = None,\n        optimize_zorder_col_list: Optional[List[str]] = None,\n        debug: bool = False,\n    ) -> None:\n        \"\"\"Optimize a dataset based on a set of pre-conceived optimizations.\n\n        Most of the time the dataset is a table, but it can be a file-based one only.\n\n        Args:\n            db_table: `database_name.table_name`.\n            location: dataset/table filesystem location.\n            compute_table_stats: to compute table statistics or not.\n            vacuum: (delta lake tables only) whether to vacuum the delta lake\n                table or not.\n            vacuum_hours: (delta lake tables only) number of hours to consider\n                in vacuum operation.\n            optimize: (delta lake tables only) whether to optimize the table or\n                not. Custom optimize parameters can be supplied through ExecEnv (Spark)\n                configs\n            optimize_where: expression to use in the optimize function.\n            optimize_zorder_col_list: (delta lake tables only) list of\n                columns to consider in the zorder optimization process. 
Custom optimize\n                parameters can be supplied through ExecEnv (Spark) configs.\n            debug: flag indicating if we are just debugging this for local\n                tests and therefore pass through all the exceptions to perform some\n                assertions in local tests.\n        \"\"\"\n        if optimize:\n            if debug:\n                try:\n                    cls._optimize(\n                        db_table, location, optimize_where, optimize_zorder_col_list\n                    )\n                except ParseException:\n                    pass\n            else:\n                cls._optimize(\n                    db_table, location, optimize_where, optimize_zorder_col_list\n                )\n\n        if vacuum:\n            cls._vacuum(db_table, location, vacuum_hours)\n\n        if compute_table_stats:\n            if debug:\n                try:\n                    cls._compute_table_stats(db_table)\n                except AnalysisException:\n                    pass\n            else:\n                cls._compute_table_stats(db_table)\n\n    @classmethod\n    def _compute_table_stats(cls, db_table: str) -> None:\n        \"\"\"Compute table statistics.\n\n        Args:\n            db_table: `<db>.<table>` string.\n        \"\"\"\n        if not db_table:\n            raise WrongArgumentsException(\"A table needs to be provided.\")\n\n        config = {\"function\": \"compute_table_statistics\", \"table_or_view\": db_table}\n        cls._logger.info(f\"Computing table statistics for {db_table}...\")\n        TableManager(config).compute_table_statistics()\n\n    @classmethod\n    def _vacuum(cls, db_table: str, location: str, hours: int) -> None:\n        \"\"\"Vacuum a delta table.\n\n        Args:\n            db_table: `<db>.<table>` string. Takes precedence over location.\n            location: location of the delta table.\n            hours: number of hours to consider in vacuum operation.\n        \"\"\"\n        if not db_table and not location:\n            raise WrongArgumentsException(\"A table or location need to be provided.\")\n\n        table_or_location = db_table if db_table else f\"delta.`{location}`\"\n\n        config = {\n            \"function\": \"compute_table_statistics\",\n            \"table_or_view\": table_or_location,\n            \"vacuum_hours\": hours,\n        }\n        cls._logger.info(f\"Vacuuming table {table_or_location}...\")\n        TableManager(config).vacuum()\n\n    @classmethod\n    def _optimize(\n        cls, db_table: str, location: str, where: str, zorder_cols: List[str]\n    ) -> None:\n        \"\"\"Optimize a delta table.\n\n        Args:\n            db_table: `<db>.<table>` string. 
Takes precedence over location.\n            location: location of the delta table.\n            where: expression to use in the optimize function.\n            zorder_cols: list of columns to consider in the zorder optimization process.\n        \"\"\"\n        if not db_table and not location:\n            raise WrongArgumentsException(\"A table or location needs to be provided.\")\n\n        table_or_location = db_table if db_table else f\"delta.`{location}`\"\n\n        config = {\n            \"function\": \"compute_table_statistics\",\n            \"table_or_view\": table_or_location,\n            \"optimize_where\": where,\n            \"optimize_zorder_col_list\": \",\".join(zorder_cols if zorder_cols else []),\n        }\n        cls._logger.info(f\"Optimizing table {table_or_location}...\")\n        TableManager(config).optimize()\n"
  },
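  {
    "path": "examples/hypothetical/optimize_dataset_usage_sketch.py",
    "content": "\"\"\"Hypothetical usage sketch for DatasetOptimizer.optimize_dataset (illustration only, not part of the repository).\n\nRuns the three maintenance steps on a delta table: OPTIMIZE restricted by a predicate\nand a zorder column, VACUUM with a 30-day retention, and table statistics computation.\nTable, column and predicate values are placeholders.\n\"\"\"\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.terminators.dataset_optimizer import DatasetOptimizer\n\nExecEnv.get_or_create(app_name=\"optimize_dataset_sketch\")\n\nDatasetOptimizer.optimize_dataset(\n    db_table=\"my_database.sales\",\n    compute_table_stats=True,\n    vacuum=True,\n    vacuum_hours=720,  # 30 days\n    optimize=True,\n    optimize_where=\"date >= '2024-01-01'\",\n    optimize_zorder_col_list=[\"customer_id\"],\n)\n"
  },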
  {
    "path": "lakehouse_engine/terminators/notifier.py",
    "content": "\"\"\"Module with notification terminator.\"\"\"\n\nfrom abc import ABC, abstractmethod\n\nfrom jinja2 import Template\n\nfrom lakehouse_engine.core.definitions import (\n    NotificationRuntimeParameters,\n    TerminatorSpec,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.terminators.notifiers.notification_templates import (\n    NotificationsTemplates,\n)\nfrom lakehouse_engine.utils.databricks_utils import DatabricksUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Notifier(ABC):\n    \"\"\"Abstract Notification class.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, notification_spec: TerminatorSpec):\n        \"\"\"Construct Notification instances.\n\n        Args:\n            notification_spec: notification specification.\n        \"\"\"\n        self.type = notification_spec.args.get(\"type\")\n        self.notification = notification_spec.args\n\n    @abstractmethod\n    def create_notification(self) -> None:\n        \"\"\"Abstract create notification method.\"\"\"\n        raise NotImplementedError\n\n    @abstractmethod\n    def send_notification(self) -> None:\n        \"\"\"Abstract send notification method.\"\"\"\n        raise NotImplementedError\n\n    def _render_notification_field(self, template_field: str) -> str:\n        \"\"\"Render the notification given args.\n\n        Args:\n            template_field: Message with templates to be replaced.\n\n        Returns:\n            Rendered field\n        \"\"\"\n        args = {}\n        field_template = Template(template_field)\n        if (\n            NotificationRuntimeParameters.DATABRICKS_JOB_NAME.value in template_field\n            or NotificationRuntimeParameters.DATABRICKS_WORKSPACE_ID.value\n            in template_field\n            or NotificationRuntimeParameters.JOB_EXCEPTION.value in template_field\n        ):\n            workspace_id, job_name = DatabricksUtils.get_databricks_job_information(\n                ExecEnv.SESSION\n            )\n            args[\"databricks_job_name\"] = job_name\n            args[\"databricks_workspace_id\"] = workspace_id\n            args[\"exception\"] = self.notification.get(\"exception\")\n\n        return field_template.render(args)\n\n    @staticmethod\n    def check_if_notification_is_failure_notification(\n        spec: TerminatorSpec,\n    ) -> bool:\n        \"\"\"Check if given notification is a failure notification.\n\n        Args:\n            spec: spec to validate if it is a failure notification.\n\n        Returns:\n            A boolean telling if the notification is a failure notification\n        \"\"\"\n        notification = spec.args\n        is_notification_failure_notification: bool = False\n\n        if \"template\" in notification.keys():\n            template: dict = NotificationsTemplates.EMAIL_NOTIFICATIONS_TEMPLATES.get(\n                notification[\"template\"], {}\n            )\n\n            if template:\n                is_notification_failure_notification = notification.get(\n                    \"on_failure\", True\n                )\n            else:\n                raise ValueError(f\"\"\"Template {notification[\"template\"]} not found.\"\"\")\n        else:\n            is_notification_failure_notification = notification.get(\"on_failure\", True)\n\n        return is_notification_failure_notification\n"
  },
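  {
    "path": "examples/hypothetical/custom_notifier_sketch.py",
    "content": "\"\"\"Hypothetical sketch of a custom Notifier subclass (illustration only, not part of the repository).\n\nShows the minimal surface a new notifier must implement: create_notification() prepares\nthe payload (optionally rendering Jinja placeholders through _render_notification_field)\nand send_notification() delivers it. This logging-based notifier is not part of the\nengine; a real one would also need to be registered in NotifierFactory.NOTIFIER_TYPES\nto be reachable from specs.\n\"\"\"\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.terminators.notifier import Notifier\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass LogNotifier(Notifier):\n    \"\"\"Toy notifier that only logs the rendered message.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, notification_spec: TerminatorSpec):\n        \"\"\"Construct LogNotifier instances.\"\"\"\n        super().__init__(notification_spec)\n        self._rendered_message = \"\"\n\n    def create_notification(self) -> None:\n        \"\"\"Render the configured message (including any Jinja placeholders).\"\"\"\n        self._rendered_message = self._render_notification_field(\n            self.notification.get(\"message\", \"\")\n        )\n\n    def send_notification(self) -> None:\n        \"\"\"Log the rendered message instead of sending it anywhere.\"\"\"\n        self._logger.info(f\"Notification: {self._rendered_message}\")\n"
  },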
  {
    "path": "lakehouse_engine/terminators/notifier_factory.py",
    "content": "\"\"\"Module for notifier factory.\"\"\"\n\nfrom lakehouse_engine.core.definitions import NotifierType, TerminatorSpec\nfrom lakehouse_engine.terminators.notifier import Notifier\nfrom lakehouse_engine.terminators.notifiers.email_notifier import EmailNotifier\nfrom lakehouse_engine.terminators.notifiers.exceptions import NotifierNotFoundException\n\n\nclass NotifierFactory(object):\n    \"\"\"Class for notification factory.\"\"\"\n\n    NOTIFIER_TYPES = {NotifierType.EMAIL.value: EmailNotifier}\n\n    @classmethod\n    def get_notifier(cls, spec: TerminatorSpec) -> Notifier:\n        \"\"\"Get a notifier according to the terminator specs using a factory.\n\n        Args:\n            spec: terminator specification.\n\n        Returns:\n            Notifier: notifier that will handle notifications.\n        \"\"\"\n        notifier_name = spec.args.get(\"type\")\n        notifier = cls.NOTIFIER_TYPES.get(notifier_name)\n\n        if notifier:\n            return notifier(notification_spec=spec)\n        else:\n            raise NotifierNotFoundException(\n                f\"The requested notification format {notifier_name} is not supported.\"\n            )\n\n    @staticmethod\n    def generate_failure_notification(spec: list, exception: Exception) -> None:\n        \"\"\"Check if it is necessary to send a failure notification and generate it.\n\n        Args:\n            spec: List of termination specs\n            exception: Exception that caused the failure.\n        \"\"\"\n        notification_specs = []\n\n        for terminator in spec:\n            if terminator.function == \"notify\":\n                notification_specs.append(terminator)\n\n        for notification in notification_specs:\n            failure_notification_spec = notification.args\n            generate_failure_notification = failure_notification_spec.get(\n                \"generate_failure_notification\", False\n            )\n\n            if generate_failure_notification or (\n                Notifier.check_if_notification_is_failure_notification(notification)\n            ):\n                failure_notification_spec[\"exception\"] = str(exception)\n\n                if generate_failure_notification:\n                    failure_notification_spec[\"template\"] = (\n                        f\"\"\"failure_notification_{failure_notification_spec[\"type\"]}\"\"\"\n                    )\n\n                failure_spec = TerminatorSpec(\n                    function=\"notification\", args=failure_notification_spec\n                )\n\n                notifier = NotifierFactory.get_notifier(failure_spec)\n                notifier.create_notification()\n                notifier.send_notification()\n"
  },
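  {
    "path": "examples/hypothetical/email_notification_usage_sketch.py",
    "content": "\"\"\"Hypothetical usage sketch for NotifierFactory (illustration only, not part of the repository).\n\nBuilds a plain SMTP email notification spec with the keys read by EmailNotifier\n(server, port, from, to, subject, message) and dispatches it through the factory.\nHosts, ports and addresses are placeholders.\n\"\"\"\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.terminators.notifier_factory import NotifierFactory\n\nExecEnv.get_or_create(app_name=\"notification_sketch\")\n\nspec = TerminatorSpec(\n    function=\"notify\",\n    args={\n        \"type\": \"email\",\n        \"server\": \"smtp.example.com\",\n        \"port\": 587,\n        \"from\": \"pipelines@example.com\",\n        \"to\": [\"data-team@example.com\"],\n        \"subject\": \"Daily load finished\",\n        \"message\": \"The daily load finished successfully.\",\n        # on_failure=False means TerminatorFactory also sends it on successful runs.\n        \"on_failure\": False,\n    },\n)\n\nnotifier = NotifierFactory.get_notifier(spec)\nnotifier.create_notification()\nnotifier.send_notification()\n"
  },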
  {
    "path": "lakehouse_engine/terminators/notifiers/__init__.py",
    "content": "\"\"\"Notifications module.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/terminators/notifiers/email_notifier.py",
    "content": "\"\"\"Module with email notifier.\"\"\"\n\nimport asyncio\nimport smtplib\nfrom email.mime.application import MIMEApplication\nfrom email.mime.multipart import MIMEMultipart\nfrom email.mime.text import MIMEText\nfrom posixpath import basename\nfrom typing import Any\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.terminators.notifier import Notifier\nfrom lakehouse_engine.terminators.notifiers.exceptions import (\n    NotifierConfigException,\n    NotifierTemplateNotFoundException,\n)\nfrom lakehouse_engine.terminators.notifiers.notification_templates import (\n    NotificationsTemplates,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass EmailNotifier(Notifier):\n    \"\"\"Base Notification class.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, notification_spec: TerminatorSpec):\n        \"\"\"Construct Email Notification instance.\n\n        Args:\n            notification_spec: notification specification.\n        \"\"\"\n        super().__init__(notification_spec)\n\n    def create_notification(self) -> None:\n        \"\"\"Creates the notification to be sent.\"\"\"\n        if \"template\" in self.notification.keys():\n            template: dict = NotificationsTemplates.EMAIL_NOTIFICATIONS_TEMPLATES.get(\n                self.notification[\"template\"], {}\n            )\n\n            if template:\n                self.notification[\"message\"] = self._render_notification_field(\n                    template[\"message\"]\n                )\n                self.notification[\"subject\"] = self._render_notification_field(\n                    template[\"subject\"]\n                )\n                self.notification[\"mimetype\"] = template[\"mimetype\"]\n\n            else:\n                raise NotifierTemplateNotFoundException(\n                    f\"\"\"Template {self.notification[\"template\"]} does not exist\"\"\"\n                )\n\n        elif \"message\" in self.notification.keys():\n            self.notification[\"message\"] = self._render_notification_field(\n                self.notification[\"message\"]\n            )\n            self.notification[\"subject\"] = self._render_notification_field(\n                self.notification[\"subject\"]\n            )\n        else:\n            raise NotifierConfigException(\"Malformed Notification Definition\")\n\n    def send_notification(self) -> None:\n        \"\"\"Sends the notification by using a series of methods.\"\"\"\n        self._validate_email_notification()\n\n        server = self.notification[\"server\"]\n        notification_office_email_servers = [\"smtp.office365.com\"]\n\n        if (\n            ExecEnv.ENGINE_CONFIG.notif_disallowed_email_servers is not None\n            and server in ExecEnv.ENGINE_CONFIG.notif_disallowed_email_servers\n        ):\n            raise NotifierConfigException(\n                f\"Trying to use disallowed smtp server: '{server}'.\\n\"\n                f\"Disallowed smtp servers: \"\n                f\"{str(ExecEnv.ENGINE_CONFIG.notif_disallowed_email_servers)}\"\n            )\n        elif server in notification_office_email_servers:\n            self._authenticate_and_send_office365()\n        else:\n            self._authenticate_and_send_simple_smtp()\n\n    def _authenticate_and_send_office365(self) -> None:\n        \"\"\"Authenticates and sends an email notification using Graph API.\"\"\"\n 
       from azure.identity.aio import ClientSecretCredential\n        from msgraph import GraphServiceClient\n\n        self._logger.info(\"Attempting authentication using Graph API.\")\n\n        request_body = self._create_graph_api_email_body()\n\n        self._logger.info(f\"Sending notification email with body: {request_body}\")\n\n        credential = ClientSecretCredential(\n            tenant_id=self.notification[\"tenant_id\"],\n            client_id=self.notification[\"user\"],\n            client_secret=self.notification[\"password\"],\n        )\n        client = GraphServiceClient(credentials=credential)\n\n        import nest_asyncio\n\n        nest_asyncio.apply()\n        asyncio.get_event_loop().run_until_complete(\n            client.users.by_user_id(self.notification[\"from\"]).send_mail.post(\n                body=request_body\n            )\n        )\n\n        self._logger.info(\"Notification email sent successfully.\")\n\n    def _authenticate_and_send_simple_smtp(self) -> None:\n        \"\"\"Authenticates and sends an email notification using simple authentication.\"\"\"\n        with smtplib.SMTP(\n            self.notification[\"server\"], self.notification[\"port\"]\n        ) as smtp:\n            try:\n                smtp.starttls()\n                smtp.login(\n                    self.notification.get(\"user\", \"\"),\n                    self.notification.get(\"password\", \"\"),\n                )\n            except smtplib.SMTPException as e:\n                self._logger.exception(\n                    f\"Exception while authenticating to smtp: {str(e)}\"\n                )\n                self._logger.exception(\n                    \"Attempting to send the notification without authentication\"\n                )\n\n            mesg = MIMEMultipart()\n            mesg[\"From\"] = self.notification[\"from\"]\n\n            to = self.notification.get(\"to\", [])\n            cc = self.notification.get(\"cc\", [])\n            bcc = self.notification.get(\"bcc\", [])\n\n            mesg[\"To\"] = \", \".join(to)\n            mesg[\"CC\"] = \", \".join(cc)\n            mesg[\"BCC\"] = \", \".join(bcc)\n            mesg[\"Subject\"] = self.notification[\"subject\"]\n            mesg[\"Importance\"] = self._get_importance(\n                self.notification.get(\"importance\", \"normal\")\n            )\n\n            match self.notification.get(\"mimetype\", \"plain\"):\n                case \"html\" | \"text/html\":\n                    mimetype = \"html\"\n                case \"text\" | \"text/plain\" | \"plain\" | \"text/text\":\n                    mimetype = \"text\"\n                case _:\n                    self._logger.warning(\n                        f\"\"\"Unknown mimetype '{self.notification[\"mimetype\"]}' \"\"\"\n                        f\"provided. 
Defaulting to 'plain'.\"\n                    )\n                    mimetype = \"text\"\n\n            body = MIMEText(self.notification[\"message\"], mimetype)\n            mesg.attach(body)\n\n            for f in self.notification.get(\"attachments\", []):\n                with open(f, \"rb\") as fil:\n                    part = MIMEApplication(fil.read(), Name=basename(f))\n                part[\"Content-Disposition\"] = 'attachment; filename=\"%s\"' % basename(f)\n                mesg.attach(part)\n\n            try:\n                smtp.sendmail(\n                    self.notification[\"from\"], to + cc + bcc, mesg.as_string()\n                )\n                self._logger.info(\"Email sent successfully.\")\n            except smtplib.SMTPException as e:\n                self._logger.exception(f\"Exception while sending email: {str(e)}\")\n\n    def _validate_email_notification(self) -> None:\n        \"\"\"Validates the email notification.\"\"\"\n        if not self.notification.get(\"from\"):\n            raise NotifierConfigException(\n                \"Email notification must contain 'from' field.\"\n            )\n        if not self.notification.get(\"server\"):\n            raise NotifierConfigException(\n                \"Email notification must contain 'server' field.\"\n            )\n        if not self.notification.get(\"port\"):\n            raise NotifierConfigException(\n                \"Email notification must contain 'port' field.\"\n            )\n        if (\n            not self.notification.get(\"to\")\n            and not self.notification.get(\"cc\")\n            and not self.notification.get(\"bcc\")\n        ):\n            raise NotifierConfigException(\n                \"No recipients provided. Please provide at least one recipient.\"\n            )\n\n    def _get_importance(self, importance: str) -> Any:\n        \"\"\"Get the importance of the email notification.\n\n        Args:\n            importance: Importance level of the email.\n\n        Returns:\n            Importance level for the email notification.\n        \"\"\"\n        from msgraph.generated.models.importance import Importance\n\n        match importance:\n            case \"critical\" | \"high\":\n                return Importance.High\n            case \"normal\":\n                return Importance.Normal\n            case \"low\":\n                return Importance.Low\n            case _:\n                self._logger.warning(\n                    f\"\"\"Unknown importance '{importance}' provided. 
\"\"\"\n                    f\"Defaulting to 'normal'.\"\n                )\n                return Importance.Normal\n\n    def _create_graph_api_email_body(self) -> Any:\n        \"\"\"Create the email body for the Graph API.\n\n        Returns:\n            Email body for the Graph API.\n        \"\"\"\n        from msgraph.generated.models.body_type import BodyType\n        from msgraph.generated.models.file_attachment import FileAttachment\n        from msgraph.generated.models.item_body import ItemBody\n        from msgraph.generated.models.message import Message\n        from msgraph.generated.users.item.send_mail.send_mail_post_request_body import (\n            SendMailPostRequestBody,\n        )\n\n        request_body = SendMailPostRequestBody()\n        message = Message()\n        message.subject = self.notification[\"subject\"]\n\n        message_body = ItemBody()\n\n        message_body.content = self.notification[\"message\"]\n        match self.notification.get(\"mimetype\", \"plain\"):\n            case \"html\" | \"text/html\":\n                message_body.content_type = BodyType.Html\n            case \"text\" | \"text/plain\" | \"plain\" | \"text/text\":\n                message_body.content_type = BodyType.Text\n            case _:\n                self._logger.warning(\n                    f\"\"\"Unknown mimetype '{self.notification[\"mimetype\"]}' \"\"\"\n                    f\"provided. Defaulting to 'text'.\"\n                )\n                message_body.content_type = BodyType.Text\n\n        message.body = message_body\n\n        attachments = []\n        for attachment_file in self.notification.get(\"attachments\", []):\n            attachment_name = attachment_file.split(\"/\")[-1]\n\n            with open(attachment_file, \"rb\") as f:\n                content = f.read()\n\n            attachment = FileAttachment()\n            attachment.name = attachment_name\n            attachment.size = len(content)\n            attachment.content_bytes = content\n\n            attachments.append(attachment)\n\n        message.attachments = attachments  # type: ignore\n\n        message.to_recipients = self._set_graph_api_recipients(\"to\")\n        message.cc_recipients = self._set_graph_api_recipients(\"cc\")\n        message.bcc_recipients = self._set_graph_api_recipients(\"bcc\")\n\n        message.importance = self._get_importance(\n            self.notification.get(\"importance\", \"normal\")\n        )\n\n        request_body.message = message\n        request_body.save_to_sent_items = False\n\n        return request_body\n\n    def _set_graph_api_recipients(self, recipient_type: str) -> list:\n        \"\"\"Set the recipients for the Graph API.\n\n        Args:\n            recipient_type: Type of recipient (to, cc or bcc).\n\n        Returns:\n            List of recipients for the Graph API.\n        \"\"\"\n        from msgraph.generated.models.email_address import EmailAddress\n        from msgraph.generated.models.recipient import Recipient\n\n        recipients = []\n        for email in self.notification.get(recipient_type, []):\n            recipient = Recipient()\n            recipient_address = EmailAddress()\n            recipient_address.address = email\n            recipient.email_address = recipient_address\n\n            recipients.append(recipient)\n        return recipients\n"
  },
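  {
    "path": "examples/hypothetical/graph_api_email_config_sketch.py",
    "content": "\"\"\"Hypothetical configuration sketch for EmailNotifier over Microsoft Graph (illustration only, not part of the repository).\n\nWhen the configured server is smtp.office365.com, EmailNotifier switches from plain\nSMTP to the Graph API path and authenticates with a ClientSecretCredential, so the\nnotification args additionally need tenant_id, user (client id) and password (client\nsecret). Every value below is a placeholder.\n\"\"\"\n\ngraph_email_notification_args = {\n    \"type\": \"email\",\n    \"server\": \"smtp.office365.com\",  # triggers _authenticate_and_send_office365\n    \"port\": 587,\n    \"tenant_id\": \"<azure-tenant-id>\",\n    \"user\": \"<azure-app-client-id>\",\n    \"password\": \"<azure-app-client-secret>\",\n    \"from\": \"pipelines@example.com\",\n    \"to\": [\"data-team@example.com\"],\n    \"subject\": \"Daily load finished\",\n    \"message\": \"<p>The daily load finished successfully.</p>\",\n    \"mimetype\": \"text/html\",\n    \"importance\": \"normal\",\n}\n"
  },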
  {
    "path": "lakehouse_engine/terminators/notifiers/exceptions.py",
    "content": "\"\"\"Package defining all the Notifier custom exceptions.\"\"\"\n\n\nclass NotifierNotFoundException(Exception):\n    \"\"\"Exception for when the notifier is not found.\"\"\"\n\n    pass\n\n\nclass NotifierConfigException(Exception):\n    \"\"\"Exception for when the notifier configuration is invalid.\"\"\"\n\n    pass\n\n\nclass NotifierTemplateNotFoundException(Exception):\n    \"\"\"Exception for when the notifier is not found.\"\"\"\n\n    pass\n\n\nclass NotifierTemplateConfigException(Exception):\n    \"\"\"Exception for when the notifier config is incorrect.\"\"\"\n\n    pass\n"
  },
  {
    "path": "lakehouse_engine/terminators/notifiers/notification_templates.py",
    "content": "\"\"\"Email notification templates.\"\"\"\n\n\nclass NotificationsTemplates(object):\n    \"\"\"Templates for notifications.\"\"\"\n\n    EMAIL_NOTIFICATIONS_TEMPLATES = {\n        \"failure_notification_email\": {\n            \"subject\": \"Service Failure\",\n            \"mimetype\": \"text/text\",\n            \"message\": \"\"\"\n            Job {{ databricks_job_name }} in workspace {{ databricks_workspace_id }} has\n            failed with the exception: {{ exception }}\"\"\",\n            \"on_failure\": True,\n        },\n    }\n"
  },
  {
    "path": "lakehouse_engine/terminators/sensor_terminator.py",
    "content": "\"\"\"Module with sensor terminator.\"\"\"\n\nfrom typing import List\n\nfrom lakehouse_engine.core.definitions import SensorSpec, SensorStatus\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.sensor_manager import SensorControlTableManager\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass SensorTerminator(object):\n    \"\"\"Sensor Terminator class.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def update_sensor_status(\n        cls,\n        sensor_id: str,\n        control_db_table_name: str,\n        status: str = SensorStatus.PROCESSED_NEW_DATA.value,\n        assets: List[str] = None,\n    ) -> None:\n        \"\"\"Update internal sensor status.\n\n        Update the sensor status in the control table, it should be used to tell the\n        system that the sensor has processed all new data that was previously\n        identified, hence updating the shifted sensor status.\n        Usually used to move from `SensorStatus.ACQUIRED_NEW_DATA` to\n        `SensorStatus.PROCESSED_NEW_DATA`, but there might be scenarios - still\n        to identify - where we can update the sensor status from/to different statuses.\n\n        Args:\n            sensor_id: sensor id.\n            control_db_table_name: `db.table` to store sensor checkpoints.\n            status: status of the sensor.\n            assets: a list of assets that are considered as available to\n                consume downstream after this sensor has status\n                PROCESSED_NEW_DATA.\n        \"\"\"\n        if status not in [s.value for s in SensorStatus]:\n            raise NotImplementedError(f\"Status {status} not accepted in sensor.\")\n\n        ExecEnv.get_or_create(app_name=\"update_sensor_status\")\n        SensorControlTableManager.update_sensor_status(\n            sensor_spec=SensorSpec(\n                sensor_id=sensor_id,\n                control_db_table_name=control_db_table_name,\n                assets=assets,\n                input_spec=None,\n                preprocess_query=None,\n                checkpoint_location=None,\n            ),\n            status=status,\n        )\n"
  },
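  {
    "path": "examples/hypothetical/update_sensor_status_usage_sketch.py",
    "content": "\"\"\"Hypothetical usage sketch for SensorTerminator.update_sensor_status (illustration only, not part of the repository).\n\nMarks a sensor as having processed the data it previously acquired, moving it from\nACQUIRED_NEW_DATA to PROCESSED_NEW_DATA in the control table and flagging the listed\nassets as available downstream. Sensor, table and asset names are placeholders.\n\"\"\"\n\nfrom lakehouse_engine.core.definitions import SensorStatus\nfrom lakehouse_engine.terminators.sensor_terminator import SensorTerminator\n\nSensorTerminator.update_sensor_status(\n    sensor_id=\"sales_orders_sensor\",\n    control_db_table_name=\"my_database.lakehouse_engine_sensors\",\n    status=SensorStatus.PROCESSED_NEW_DATA.value,\n    assets=[\"sales_orders_bronze\"],\n)\n"
  },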
  {
    "path": "lakehouse_engine/terminators/spark_terminator.py",
    "content": "\"\"\"Module with spark terminator.\"\"\"\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass SparkTerminator(object):\n    \"\"\"Spark Terminator class.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def terminate_spark(cls) -> None:\n        \"\"\"Terminate spark session.\"\"\"\n        cls._logger.info(\"Terminating spark session...\")\n        ExecEnv.SESSION.stop()\n"
  },
  {
    "path": "lakehouse_engine/terminators/terminator_factory.py",
    "content": "\"\"\"Module with the factory pattern to return terminators.\"\"\"\n\nfrom typing import Optional\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.terminators.notifier import Notifier\nfrom lakehouse_engine.terminators.notifier_factory import NotifierFactory\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass TerminatorFactory(object):\n    \"\"\"TerminatorFactory class following the factory pattern.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @staticmethod\n    def execute_terminator(\n        spec: TerminatorSpec, df: Optional[DataFrame] = None\n    ) -> None:\n        \"\"\"Execute a terminator following the factory pattern.\n\n        Args:\n            spec: terminator specification.\n            df: dataframe to be used in the terminator. Needed when a\n                terminator requires one dataframe as input.\n\n        Returns:\n            Transformer function to be executed in .transform() spark function.\n        \"\"\"\n        if spec.function == \"optimize_dataset\":\n            from lakehouse_engine.terminators.dataset_optimizer import DatasetOptimizer\n\n            DatasetOptimizer.optimize_dataset(**spec.args)\n        elif spec.function == \"terminate_spark\":\n            from lakehouse_engine.terminators.spark_terminator import SparkTerminator\n\n            SparkTerminator.terminate_spark()\n        elif spec.function == \"expose_cdf\":\n            from lakehouse_engine.terminators.cdf_processor import CDFProcessor\n\n            CDFProcessor.expose_cdf(spec)\n        elif spec.function == \"notify\":\n            if not Notifier.check_if_notification_is_failure_notification(spec):\n                notifier = NotifierFactory.get_notifier(spec)\n                notifier.create_notification()\n                notifier.send_notification()\n        else:\n            raise NotImplementedError(\n                f\"The requested terminator {spec.function} is not implemented.\"\n            )\n"
  },
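  {
    "path": "examples/hypothetical/terminator_factory_usage_sketch.py",
    "content": "\"\"\"Hypothetical usage sketch for TerminatorFactory.execute_terminator (illustration only, not part of the repository).\n\nDispatches terminator specs to their implementations by function name; here the spark\nsession created for the run is terminated at the end of a pipeline. The app name is a\nplaceholder.\n\"\"\"\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.terminators.terminator_factory import TerminatorFactory\n\nExecEnv.get_or_create(app_name=\"terminator_factory_sketch\")\n\n# Stops ExecEnv.SESSION through SparkTerminator.terminate_spark().\nTerminatorFactory.execute_terminator(\n    TerminatorSpec(function=\"terminate_spark\", args={})\n)\n"
  },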
  {
    "path": "lakehouse_engine/transformers/__init__.py",
    "content": "\"\"\"Package to define transformers available in the lakehouse engine.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/transformers/aggregators.py",
    "content": "\"\"\"Aggregators module.\"\"\"\n\nfrom typing import Callable\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col, max  # noqa: A004\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Aggregators(object):\n    \"\"\"Class containing all aggregation functions.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @staticmethod\n    def get_max_value(input_col: str, output_col: str = \"latest\") -> Callable:\n        \"\"\"Get the maximum value of a given column of a dataframe.\n\n        Args:\n            input_col: name of the input column.\n            output_col: name of the output column (defaults to \"latest\").\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='get_max_value')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.select(col(input_col)).agg(max(input_col).alias(output_col))\n\n        return inner\n"
  },
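  {
    "path": "examples/hypothetical/aggregators_usage_sketch.py",
    "content": "\"\"\"Hypothetical usage sketch for Aggregators.get_max_value (illustration only, not part of the repository).\n\nThe transformer returns a function meant for DataFrame.transform(); here it is used to\nget the latest load date from a small in-memory dataframe. Column names and values are\nplaceholders.\n\"\"\"\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.transformers.aggregators import Aggregators\n\nExecEnv.get_or_create(app_name=\"aggregators_sketch\")\n\ndf = ExecEnv.SESSION.createDataFrame(\n    [(\"2024-01-01\",), (\"2024-01-03\",), (\"2024-01-02\",)], [\"load_date\"]\n)\n\n# Single-row dataframe with the maximum load_date aliased as 'latest'.\nlatest_df = df.transform(Aggregators.get_max_value(\"load_date\", output_col=\"latest\"))\nlatest_df.show()\n"
  },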
  {
    "path": "lakehouse_engine/transformers/column_creators.py",
    "content": "\"\"\"Column creators transformers module.\"\"\"\n\nfrom typing import Any, Callable, Dict\n\nfrom pyspark.sql import DataFrame, Window\nfrom pyspark.sql.functions import col, lit, monotonically_increasing_id, row_number\nfrom pyspark.sql.types import IntegerType\n\nfrom lakehouse_engine.transformers.exceptions import (\n    UnsupportedStreamingTransformerException,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass ColumnCreators(object):\n    \"\"\"Class containing all functions that can create columns to add value.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def with_row_id(\n        cls,\n        output_col: str = \"lhe_row_id\",\n    ) -> Callable:\n        \"\"\"Create a sequential but not consecutive id.\n\n        Args:\n            output_col: optional name of the output column.\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='with_row_id')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            if not df.isStreaming:\n                return df.withColumn(output_col, monotonically_increasing_id())\n            else:\n                raise UnsupportedStreamingTransformerException(\n                    \"Transformer with_row_id is not supported in streaming mode.\"\n                )\n\n        return inner\n\n    @classmethod\n    def with_auto_increment_id(\n        cls, output_col: str = \"lhe_row_id\", rdd: bool = True\n    ) -> Callable:\n        \"\"\"Create a sequential and consecutive id.\n\n        Args:\n            output_col: optional name of the output column.\n            rdd: optional parameter to use spark rdd.\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='with_auto_increment_id')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            if not df.isStreaming:\n                if len(df.take(1)) == 0:\n                    # if df is empty we have to prevent the algorithm from failing\n                    return df.withColumn(output_col, lit(None).cast(IntegerType()))\n                elif rdd:\n                    return (\n                        df.rdd.zipWithIndex()\n                        .toDF()\n                        .select(col(\"_1.*\"), col(\"_2\").alias(output_col))\n                    )\n                else:\n                    w = Window.orderBy(monotonically_increasing_id())\n                    return df.withColumn(output_col, (row_number().over(w)) - 1)\n\n            else:\n                raise UnsupportedStreamingTransformerException(\n                    \"Transformer with_auto_increment_id is not supported in \"\n                    \"streaming mode.\"\n                )\n\n        return inner\n\n    @classmethod\n    def with_literals(\n        cls,\n        literals: Dict[str, Any],\n    ) -> Callable:\n        \"\"\"Create columns given a map of column names and literal values (constants).\n\n        Args:\n            Dict[str, Any] literals: map of column names and literal values (constants).\n\n        Returns:\n            Callable: A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='with_literals')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            df_with_literals = df\n            for name, value in literals.items():\n                df_with_literals = 
df_with_literals.withColumn(name, lit(value))\n            return df_with_literals\n\n        return inner\n"
  },
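  {
    "path": "examples/hypothetical/column_creators_usage_sketch.py",
    "content": "\"\"\"Hypothetical usage sketch for ColumnCreators (illustration only, not part of the repository).\n\nChains with_literals and with_auto_increment_id through DataFrame.transform() to add\nconstant columns and a consecutive 0-based id. Column names and literal values are\nplaceholders.\n\"\"\"\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.transformers.column_creators import ColumnCreators\n\nExecEnv.get_or_create(app_name=\"column_creators_sketch\")\n\ndf = ExecEnv.SESSION.createDataFrame([(\"a\",), (\"b\",), (\"c\",)], [\"value\"])\n\nenriched_df = df.transform(\n    ColumnCreators.with_literals({\"source_system\": \"erp\", \"is_active\": True})\n).transform(ColumnCreators.with_auto_increment_id(output_col=\"row_id\"))\n\nenriched_df.show()\n"
  },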
  {
    "path": "lakehouse_engine/transformers/column_reshapers.py",
    "content": "\"\"\"Module with column reshaping transformers.\"\"\"\n\nfrom collections import OrderedDict\nfrom typing import Any, Callable, Dict, List, Optional\n\nimport pyspark.sql.types as spark_types\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.avro.functions import from_avro\nfrom pyspark.sql.functions import (\n    col,\n    explode_outer,\n    expr,\n    from_json,\n    map_entries,\n    struct,\n    to_json,\n)\n\nfrom lakehouse_engine.transformers.exceptions import WrongArgumentsException\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\n\n\nclass ColumnReshapers(object):\n    \"\"\"Class containing column reshaping transformers.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def cast(cls, cols: Dict[str, str]) -> Callable:\n        \"\"\"Cast specific columns into the designated type.\n\n        Args:\n            cols: dict with columns and respective target types.\n                Target types need to have the exact name of spark types:\n                https://spark.apache.org/docs/latest/sql-ref-datatypes.html\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='cast')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            cast_df = df\n            for c, t in cols.items():\n                cast_df = cast_df.withColumn(c, col(c).cast(getattr(spark_types, t)()))\n\n            return cast_df\n\n        return inner\n\n    @classmethod\n    def column_selector(cls, cols: OrderedDict) -> Callable:\n        \"\"\"Select specific columns with specific output aliases.\n\n        Args:\n            cols: dict with columns to select and respective aliases.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='column_selector')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.select(*[col(c).alias(a) for c, a in cols.items()])\n\n        return inner\n\n    @classmethod\n    def flatten_schema(\n        cls,\n        max_level: int = None,\n        shorten_names: bool = False,\n        alias: bool = True,\n        num_chars: int = 7,\n        ignore_cols: List = None,\n    ) -> Callable:\n        \"\"\"Flatten the schema of the dataframe.\n\n        Args:\n            max_level: level until which you want to flatten the schema.\n                Default: None.\n            shorten_names: whether to shorten the names of the prefixes\n                of the fields being flattened or not. Default: False.\n            alias: whether to define alias for the columns being flattened\n                or not. Default: True.\n            num_chars: number of characters to consider when shortening\n                the names of the fields. 
Default: 7.\n            ignore_cols: columns which you don't want to flatten.\n                Default: None.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='flatten_schema')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.select(\n                SchemaUtils.schema_flattener(\n                    schema=df.schema,\n                    max_level=max_level,\n                    shorten_names=shorten_names,\n                    alias=alias,\n                    num_chars=num_chars,\n                    ignore_cols=ignore_cols,\n                )\n            )\n\n        return inner\n\n    @classmethod\n    def explode_columns(\n        cls,\n        explode_arrays: bool = False,\n        array_cols_to_explode: List[str] = None,\n        explode_maps: bool = False,\n        map_cols_to_explode: List[str] = None,\n    ) -> Callable:\n        \"\"\"Explode columns with types like ArrayType and MapType.\n\n        After it can be applied the flatten_schema transformation,\n        if we desired for example to explode the map (as we explode a StructType)\n        or to explode a StructType inside the array.\n        We recommend you to specify always the columns desired to explode\n        and not explode all columns.\n\n        Args:\n            explode_arrays: whether you want to explode array columns (True)\n                or not (False). Default: False.\n            array_cols_to_explode: array columns which you want to explode.\n                If you don't specify it will get all array columns and explode them.\n                Default: None.\n            explode_maps: whether you want to explode map columns (True)\n                or not (False). Default: False.\n            map_cols_to_explode: map columns which you want to explode.\n                If you don't specify it will get all map columns and explode them.\n                Default: None.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='explode_columns')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            if explode_arrays or (array_cols_to_explode is not None):\n                df = cls._explode_arrays(df, array_cols_to_explode)\n\n            if explode_maps or (map_cols_to_explode is not None):\n                df = cls._explode_maps(df, map_cols_to_explode)\n\n            return df\n\n        return inner\n\n    @classmethod\n    def _get_columns(\n        cls,\n        df: DataFrame,\n        data_type: Any,\n    ) -> List:\n        \"\"\"Get a list of columns from the dataframe of the data types specified.\n\n        Args:\n            df: input dataframe.\n            data_type: data type specified.\n\n        Returns:\n            List of columns with the datatype specified.\n        \"\"\"\n        cols = []\n        for field in df.schema.fields:\n            if isinstance(field.dataType, data_type):\n                cols.append(field.name)\n        return cols\n\n    @classmethod\n    def with_expressions(cls, cols_and_exprs: Dict[str, str]) -> Callable:\n        \"\"\"Execute Spark SQL expressions to create the specified columns.\n\n        This function uses the Spark expr function. 
[Check here](\n        https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.expr.html).\n\n        Args:\n            cols_and_exprs: dict with columns and respective expressions to compute\n                (Spark SQL expressions).\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='with_expressions')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            enriched_df = df\n            for c, e in cols_and_exprs.items():\n                enriched_df = enriched_df.withColumn(c, expr(e))\n\n            return enriched_df\n\n        return inner\n\n    @classmethod\n    def rename(cls, cols: Dict[str, str], escape_col_names: bool = True) -> Callable:\n        \"\"\"Rename specific columns into the designated name.\n\n        Args:\n            cols: dict with columns and respective target names.\n            escape_col_names: whether to escape column names (e.g. `/BIC/COL1`) or not.\n                If True it creates a column with the new name and drops the old one.\n                If False, uses the native withColumnRenamed Spark function.\n                Default: True.\n\n        Returns:\n            Function to be called in .transform() spark function.\n\n        {{get_example(method_name='rename')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            renamed_df = df\n            if escape_col_names:\n                for old_name, new_name in cols.items():\n                    renamed_df = renamed_df.withColumn(new_name, col(old_name))\n                    renamed_df = renamed_df.drop(old_name)\n            else:\n                for old_name, new_name in cols.items():\n                    renamed_df = renamed_df.withColumnRenamed(old_name, new_name)\n\n            return renamed_df\n\n        return inner\n\n    @classmethod\n    def from_avro(\n        cls,\n        schema: str = None,\n        key_col: str = \"key\",\n        value_col: str = \"value\",\n        options: dict = None,\n        expand_key: bool = False,\n        expand_value: bool = True,\n    ) -> Callable:\n        \"\"\"Select all attributes from avro.\n\n        Args:\n            schema: the schema string.\n            key_col: the name of the key column.\n            value_col: the name of the value column.\n            options: extra options (e.g., mode: \"PERMISSIVE\").\n            expand_key: whether you want to expand the content inside the key\n                column or not. Default: false.\n            expand_value: whether you want to expand the content inside the value\n                column or not. 
Default: true.\n\n        Returns:\n            Function to be called in .transform() spark function.\n\n        {{get_example(method_name='from_avro')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            cols_to_select = [\n                column for column in df.columns if column not in [key_col, value_col]\n            ]\n\n            return df.select(\n                *cols_to_select,\n                key_col,\n                from_avro(col(value_col), schema, options if options else None).alias(\n                    value_col\n                ),\n            ).select(\n                *cols_to_select,\n                f\"{key_col}.*\" if expand_key else key_col,\n                f\"{value_col}.*\" if expand_value else value_col,\n            )\n\n        return inner\n\n    @classmethod\n    def from_avro_with_registry(\n        cls,\n        schema_registry: str,\n        value_schema: str,\n        value_col: str = \"value\",\n        key_schema: str = None,\n        key_col: str = \"key\",\n        expand_key: bool = False,\n        expand_value: bool = True,\n        options: dict = None,\n    ) -> Callable:\n        \"\"\"Select all attributes from avro using a schema registry.\n\n        Args:\n            schema_registry: the url to the schema registry.\n            value_schema: the name of the value schema entry in the schema registry.\n            value_col: the name of the value column.\n            key_schema: the name of the key schema entry in the schema\n                registry. Default: None.\n            key_col: the name of the key column.\n            expand_key: whether you want to expand the content inside the key\n                column or not. Default: false.\n            expand_value: whether you want to expand the content inside the value\n                column or not. 
Default: true.\n            options: extra options (e.g., mode: \"PERMISSIVE\").\n\n        Returns:\n            Function to be called in .transform() spark function.\n\n        {{get_example(method_name='from_avro_with_registry')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            cols_to_select = [\n                column for column in df.columns if column not in [key_col, value_col]\n            ]\n\n            return df.select(  # type: ignore\n                *cols_to_select,\n                (\n                    from_avro(\n                        data=col(key_col),\n                        subject=key_schema,\n                        schemaRegistryAddress=schema_registry,  # type: ignore\n                        options=options if options else None,\n                    ).alias(key_col)\n                    if key_schema\n                    else key_col\n                ),\n                from_avro(\n                    data=col(value_col),\n                    subject=value_schema,\n                    schemaRegistryAddress=schema_registry,  # type: ignore\n                    options=options if options else None,\n                ).alias(value_col),\n            ).select(\n                *cols_to_select,\n                f\"{key_col}.*\" if expand_key else key_col,\n                f\"{value_col}.*\" if expand_value else value_col,\n            )\n\n        return inner\n\n    @classmethod\n    def from_json(\n        cls,\n        input_col: str,\n        schema_path: Optional[str] = None,\n        schema: Optional[dict] = None,\n        json_options: Optional[dict] = None,\n        drop_all_cols: bool = False,\n        disable_dbfs_retry: bool = False,\n    ) -> Callable:\n        \"\"\"Convert a json string into a json column (struct).\n\n        The new json column can be added to the existing columns (default) or it can\n        replace all the others, being the only one to output. 
The new column gets the\n        same name as the original one suffixed with '_json'.\n\n        Args:\n            input_col: dict with columns and respective target names.\n            schema_path: path to the StructType schema (spark schema).\n            schema: dict with the StructType schema (spark schema).\n            json_options: options to parse the json value.\n            drop_all_cols: whether to drop all the input columns or not.\n                Defaults to False.\n            disable_dbfs_retry: optional flag to disable file storage dbfs.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='from_json')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            if schema_path:\n                json_schema = SchemaUtils.from_file(schema_path, disable_dbfs_retry)\n            elif schema:\n                json_schema = SchemaUtils.from_dict(schema)\n            else:\n                raise WrongArgumentsException(\n                    \"A file or dict schema needs to be provided.\"\n                )\n\n            if drop_all_cols:\n                df_with_json = df.select(\n                    from_json(\n                        col(input_col).cast(\"string\").alias(f\"{input_col}_json\"),\n                        json_schema,\n                        json_options if json_options else None,\n                    ).alias(f\"{input_col}_json\")\n                )\n            else:\n                df_with_json = df.select(\n                    \"*\",\n                    from_json(\n                        col(input_col).cast(\"string\").alias(f\"{input_col}_json\"),\n                        json_schema,\n                        json_options if json_options else None,\n                    ).alias(f\"{input_col}_json\"),\n                )\n\n            return df_with_json\n\n        return inner\n\n    @classmethod\n    def to_json(\n        cls, in_cols: List[str], out_col: str, json_options: Optional[dict] = None\n    ) -> Callable:\n        \"\"\"Convert dataframe columns into a json value.\n\n        Args:\n            in_cols: name(s) of the input column(s).\n                Example values:\n                \"*\" - all\n                columns; \"my_col\" - one column named \"my_col\";\n                \"my_col1, my_col2\" - two columns.\n            out_col: name of the output column.\n            json_options: options to parse the json value.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='to_json')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.withColumn(\n                out_col,\n                to_json(struct(*in_cols), json_options if json_options else None),\n            )\n\n        return inner\n\n    @classmethod\n    def _explode_arrays(cls, df: DataFrame, cols_to_explode: List[str]) -> DataFrame:\n        \"\"\"Explode array columns from dataframe.\n\n        Args:\n            df: the dataframe to apply the explode operation.\n            cols_to_explode: list of array columns to perform explode.\n\n        Returns:\n            A dataframe with array columns exploded.\n        \"\"\"\n        if cols_to_explode is None:\n            cols_to_explode = cls._get_columns(df, spark_types.ArrayType)\n\n        for column in cols_to_explode:\n            df = df.withColumn(column, explode_outer(column))\n\n        return df\n\n    @classmethod\n  
  def _explode_maps(cls, df: DataFrame, cols_to_explode: List[str]) -> DataFrame:\n        \"\"\"Explode map columns from dataframe.\n\n        Args:\n            df: the dataframe to apply the explode operation.\n            cols_to_explode: list of map columns to perform explode.\n\n        Returns:\n            A dataframe with map columns exploded.\n        \"\"\"\n        if cols_to_explode is None:\n            cols_to_explode = cls._get_columns(df, spark_types.MapType)\n\n        for column in cols_to_explode:\n            df = df.withColumn(column, explode_outer(map_entries(col(column))))\n\n        return df\n"
  },
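A minimal usage sketch for the `to_json`/`from_json` reshapers documented above. The Spark session, the sample data, and the JSON-style schema dict are illustrative assumptions (the dict is assumed to follow the Spark `StructType` JSON layout that `SchemaUtils.from_dict` expects), not fixtures from this repository:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.column_reshapers import ColumnReshapers

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
df = spark.createDataFrame([(1, "key1", 10)], ["id", "key", "value"])  # hypothetical data

# Pack two columns into a JSON string column named 'payload'.
packed = df.transform(ColumnReshapers.to_json(in_cols=["key", "value"], out_col="payload"))

# Parse it back into a struct column 'payload_json', keeping the original columns.
unpacked = packed.transform(
    ColumnReshapers.from_json(
        input_col="payload",
        schema={
            "type": "struct",
            "fields": [
                {"name": "key", "type": "string", "nullable": True, "metadata": {}},
                {"name": "value", "type": "long", "nullable": True, "metadata": {}},
            ],
        },
    )
)
unpacked.show(truncate=False)
```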
  {
    "path": "lakehouse_engine/transformers/condensers.py",
    "content": "\"\"\"Condensers module.\"\"\"\n\nfrom typing import Callable, List, Optional\n\nfrom pyspark.sql import DataFrame, Window\nfrom pyspark.sql.functions import col, row_number\n\nfrom lakehouse_engine.transformers.exceptions import (\n    UnsupportedStreamingTransformerException,\n    WrongArgumentsException,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Condensers(object):\n    \"\"\"Class containing all the functions to condensate data for later merges.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def condense_record_mode_cdc(\n        cls,\n        business_key: List[str],\n        record_mode_col: str,\n        valid_record_modes: List[str],\n        ranking_key_desc: Optional[List[str]] = None,\n        ranking_key_asc: Optional[List[str]] = None,\n    ) -> Callable:\n        \"\"\"Condense Change Data Capture (CDC) based on record_mode strategy.\n\n        This CDC data is particularly seen in some CDC enabled systems. Other systems\n        may have different CDC strategies.\n\n        Args:\n            business_key: The business key (logical primary key) of the data.\n            ranking_key_desc: In this type of CDC condensation the data needs to be\n                in descending order in a certain way, using columns specified in this\n                parameter.\n            ranking_key_asc: In this type of CDC condensation the data needs to be\n                in ascending order in a certain way, using columns specified in\n                this parameter.\n            record_mode_col: Name of the record mode input_col.\n            valid_record_modes: Depending on the context, not all record modes may be\n                considered for condensation. 
Use this parameter to skip those.\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='condense_record_mode_cdc')}}\n        \"\"\"\n        if not ranking_key_desc and not ranking_key_asc:\n            raise WrongArgumentsException(\n                \"The condense_record_mode_cdc transformer requires data to be either\"\n                \"in descending or ascending order, but no arguments for ordering\"\n                \"were provided.\"\n            )\n\n        def inner(df: DataFrame) -> DataFrame:\n            if not df.isStreaming:\n                partition_window = Window.partitionBy(\n                    [col(c) for c in business_key]\n                ).orderBy(\n                    [\n                        col(c).desc()\n                        for c in (ranking_key_desc if ranking_key_desc else [])\n                    ]  # type: ignore\n                    + [\n                        col(c).asc()\n                        for c in (ranking_key_asc if ranking_key_asc else [])\n                    ]  # type: ignore\n                )\n\n                return (\n                    df.withColumn(\"ranking\", row_number().over(partition_window))\n                    .filter(\n                        col(record_mode_col).isNull()\n                        | col(record_mode_col).isin(valid_record_modes)\n                    )\n                    .filter(col(\"ranking\") == 1)\n                    .drop(\"ranking\")\n                )\n            else:\n                raise UnsupportedStreamingTransformerException(\n                    \"Transformer condense_record_mode_cdc is not supported in \"\n                    \"streaming mode.\"\n                )\n\n        return inner\n\n    @classmethod\n    def group_and_rank(\n        cls, group_key: List[str], ranking_key: List[str], descending: bool = True\n    ) -> Callable:\n        \"\"\"Condense data based on a simple group by + take latest mechanism.\n\n        Args:\n            group_key: list of column names to use in the group by.\n            ranking_key: the data needs to be in descending order using columns\n                specified in this parameter.\n            descending: if the ranking considers descending order or not. Defaults to\n                True.\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='group_and_rank')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            if not df.isStreaming:\n                partition_window = Window.partitionBy(\n                    [col(c) for c in group_key]\n                ).orderBy(\n                    [\n                        col(c).desc() if descending else col(c).asc()\n                        for c in (ranking_key if ranking_key else [])\n                    ]  # type: ignore\n                )\n\n                return (\n                    df.withColumn(\"ranking\", row_number().over(partition_window))\n                    .filter(col(\"ranking\") == 1)\n                    .drop(\"ranking\")\n                )\n            else:\n                raise UnsupportedStreamingTransformerException(\n                    \"Transformer group_and_rank is not supported in streaming mode.\"\n                )\n\n        return inner\n"
  },
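A minimal sketch of how the `condense_record_mode_cdc` transformer above could be applied with `.transform()`; the session, the sample CDC rows, and the record mode values are illustrative assumptions:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.condensers import Condensers

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration

# Hypothetical CDC extract: business key 1 changed twice, business key 2 was deleted.
df = spark.createDataFrame(
    [(1, "2024-01-01", ""), (1, "2024-01-02", "U"), (2, "2024-01-01", "D")],
    ["id", "extraction_ts", "record_mode"],
)

condensed = df.transform(
    Condensers.condense_record_mode_cdc(
        business_key=["id"],
        record_mode_col="record_mode",
        valid_record_modes=["", "U"],  # deletions ("D") are skipped in this example
        ranking_key_desc=["extraction_ts"],
    )
)
condensed.show()  # keeps only the latest valid record per business key
```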
  {
    "path": "lakehouse_engine/transformers/custom_transformers.py",
    "content": "\"\"\"Custom transformers module.\"\"\"\n\nfrom typing import Callable\n\nfrom pyspark.sql import DataFrame\n\n\nclass CustomTransformers(object):\n    \"\"\"Class representing a CustomTransformers.\"\"\"\n\n    @staticmethod\n    def custom_transformation(custom_transformer: Callable) -> Callable:\n        \"\"\"Execute a custom transformation provided by the user.\n\n        This transformer can be very useful whenever the user cannot use our provided\n        transformers, or they want to write complex logic in the transform step of the\n        algorithm.\n\n        .. warning:: Attention!\n            Please bear in mind that the custom_transformer function provided\n            as argument needs to receive a DataFrame and return a DataFrame,\n            because it is how Spark's .transform method is able to chain the\n            transformations.\n\n        Example:\n        ```python\n        def my_custom_logic(df: DataFrame) -> DataFrame:\n        ```\n\n        Args:\n            custom_transformer: custom transformer function. A python function with all\n                required pyspark logic provided by the user.\n\n        Returns:\n            Callable: the same function provided as parameter, in order to e called\n                later in the TransformerFactory.\n\n        {{get_example(method_name='custom_transformation')}}\n        \"\"\"\n        return custom_transformer\n\n    @staticmethod\n    def sql_transformation(sql: str) -> Callable:\n        \"\"\"Execute a SQL transformation provided by the user.\n\n        This transformer can be very useful whenever the user wants to perform\n        SQL-based transformations that are not natively supported by the\n        lakehouse engine transformers.\n\n        Args:\n            sql: the SQL query to be executed. This can read from any table or\n                view from the catalog, or any dataframe registered as a temp\n                view.\n\n        Returns:\n            Callable: A function to be called in .transform() spark function.\n\n        {{get_example(method_name='sql_transformation')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.sparkSession.sql(sql)\n\n        return inner\n"
  },
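A hedged sketch of both custom transformers above; `my_custom_logic`, the SQL text, and the commented DataFrames are assumptions for illustration only:

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, upper

from lakehouse_engine.transformers.custom_transformers import CustomTransformers


def my_custom_logic(df: DataFrame) -> DataFrame:
    # User-provided logic: it must receive and return a DataFrame.
    return df.withColumn("name_upper", upper(col("name")))


custom = CustomTransformers.custom_transformation(my_custom_logic)
sql = CustomTransformers.sql_transformation("SELECT * FROM my_temp_view WHERE amount > 0")

# transformed = some_df.transform(custom)  # 'some_df' is an assumed DataFrame with a 'name' column
# queried = some_df.transform(sql)         # 'my_temp_view' is an assumed temp view in the session
```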
  {
    "path": "lakehouse_engine/transformers/data_maskers.py",
    "content": "\"\"\"Module with data masking transformers.\"\"\"\n\nfrom typing import Callable, List\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import hash, sha2  # noqa: A004\n\nfrom lakehouse_engine.transformers.exceptions import WrongArgumentsException\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass DataMaskers(object):\n    \"\"\"Class containing data masking transformers.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def hash_masker(\n        cls,\n        cols: List[str],\n        approach: str = \"SHA\",\n        num_bits: int = 256,\n        suffix: str = \"_hash\",\n    ) -> Callable:\n        \"\"\"Mask specific columns using an hashing approach.\n\n        Args:\n            cols: list of column names to mask.\n            approach: hashing approach. Defaults to 'SHA'. There's \"MURMUR3\" as well.\n            num_bits: number of bits of the SHA approach. Only applies to SHA approach.\n            suffix: suffix to apply to new column name. Defaults to \"_hash\".\n                Note: you can pass an empty suffix to have the original column replaced.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='hash_masker')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            masked_df = df\n            for col in cols:\n                if approach == \"MURMUR3\":\n                    masked_df = masked_df.withColumn(col + suffix, hash(col))\n                elif approach == \"SHA\":\n                    masked_df = masked_df.withColumn(col + suffix, sha2(col, num_bits))\n                else:\n                    raise WrongArgumentsException(\"Hashing approach is not supported.\")\n\n            return masked_df\n\n        return inner\n\n    @classmethod\n    def column_dropper(cls, cols: List[str]) -> Callable:\n        \"\"\"Drop specific columns.\n\n        Args:\n            cols: list of column names to drop.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='column_dropper')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            drop_df = df\n            for col in cols:\n                drop_df = drop_df.drop(col)\n\n            return drop_df\n\n        return inner\n"
  },
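A minimal sketch combining the two maskers above: hash the sensitive column, then drop the clear-text original. The session and sample data are illustrative assumptions:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.data_maskers import DataMaskers

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
df = spark.createDataFrame([("Jane", "DE123456789")], ["name", "vat"])  # hypothetical data

masked = df.transform(DataMaskers.hash_masker(cols=["vat"]))  # adds 'vat_hash' (SHA-256 by default)
cleaned = masked.transform(DataMaskers.column_dropper(cols=["vat"]))  # removes the clear-text column
cleaned.show(truncate=False)
```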
  {
    "path": "lakehouse_engine/transformers/date_transformers.py",
    "content": "\"\"\"Module containing date transformers.\"\"\"\n\nfrom datetime import datetime\nfrom typing import Callable, List, Optional\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col, date_format, lit, to_date, to_timestamp\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass DateTransformers(object):\n    \"\"\"Class with set of transformers to transform dates in several forms.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @staticmethod\n    def add_current_date(output_col: str) -> Callable:\n        \"\"\"Add column with current date.\n\n        The current date comes from the driver as a constant, not from every executor.\n\n        Args:\n            output_col: name of the output column.\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='add_current_date')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.withColumn(output_col, lit(datetime.now()))\n\n        return inner\n\n    @staticmethod\n    def convert_to_date(\n        cols: List[str], source_format: Optional[str] = None\n    ) -> Callable:\n        \"\"\"Convert multiple string columns with a source format into dates.\n\n        Args:\n            cols: list of names of the string columns to convert.\n            source_format: dates source format (e.g., YYYY-MM-dd). [Check here](\n                https://docs.oracle.com/javase/10/docs/api/java/time/format/DateTimeFormatter.html).\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='convert_to_date')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            converted_df = df\n            for c in cols:\n                converted_df = converted_df.withColumn(\n                    c, to_date(col(c), source_format)\n                )\n\n            return converted_df\n\n        return inner\n\n    @staticmethod\n    def convert_to_timestamp(\n        cols: List[str], source_format: Optional[str] = None\n    ) -> Callable:\n        \"\"\"Convert multiple string columns with a source format into timestamps.\n\n        Args:\n            cols: list of names of the string columns to convert.\n            source_format: dates source format (e.g., MM-dd-yyyy HH:mm:ss.SSS).\n                [Check here](\n                https://docs.oracle.com/javase/10/docs/api/java/time/format/DateTimeFormatter.html).\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='convert_to_timestamp')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            converted_df = df\n            for c in cols:\n                converted_df = converted_df.withColumn(\n                    c, to_timestamp(col(c), source_format)\n                )\n\n            return converted_df\n\n        return inner\n\n    @staticmethod\n    def format_date(cols: List[str], target_format: Optional[str] = None) -> Callable:\n        \"\"\"Convert multiple date/timestamp columns into strings with the target format.\n\n        Args:\n            cols: list of names of the string columns to convert.\n            target_format: strings target format (e.g., YYYY-MM-dd). 
[Check here](\n                https://docs.oracle.com/javase/10/docs/api/java/time/format/DateTimeFormatter.html).\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='format_date')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            converted_df = df\n            for c in cols:\n                converted_df = converted_df.withColumn(\n                    c, date_format(col(c), target_format)\n                )\n\n            return converted_df\n\n        return inner\n\n    @staticmethod\n    def get_date_hierarchy(cols: List[str], formats: Optional[dict] = None) -> Callable:\n        \"\"\"Create day/month/week/quarter/year hierarchy for the provided date columns.\n\n        Uses Spark's extract function.\n\n        Args:\n            cols: list of names of the date columns to create the hierarchy.\n            formats: dict with the correspondence between the hierarchy and the format\n                to apply. [Check here](\n                https://docs.oracle.com/javase/10/docs/api/java/time/format/DateTimeFormatter.html).\n                Example: {\n                    \"year\": \"year\",\n                    \"month\": \"month\",\n                    \"day\": \"day\",\n                    \"week\": \"week\",\n                    \"quarter\": \"quarter\"\n                }\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='get_date_hierarchy')}}\n        \"\"\"\n        if not formats:\n            formats = {\n                \"year\": \"year\",\n                \"month\": \"month\",\n                \"day\": \"day\",\n                \"week\": \"week\",\n                \"quarter\": \"quarter\",\n            }\n\n        def inner(df: DataFrame) -> DataFrame:\n            transformer_df = df\n            for c in cols:\n                transformer_df = transformer_df.selectExpr(\n                    \"*\",\n                    f\"extract({formats['day']} from {c}) as {c}_day\",\n                    f\"extract({formats['month']} from {c}) as {c}_month\",\n                    f\"extract({formats['week']} from {c}) as {c}_week\",\n                    f\"extract({formats['quarter']} from {c}) as {c}_quarter\",\n                    f\"extract({formats['year']} from {c}) as {c}_year\",\n                )\n\n            return transformer_df\n\n        return inner\n"
  },
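A minimal chained sketch of the date transformers above (string to date, hierarchy columns, current load date). The session, column names, and date format are illustrative assumptions:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.date_transformers import DateTransformers

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
df = spark.createDataFrame([("01-31-2024",)], ["order_date"])  # hypothetical data

result = (
    df.transform(DateTransformers.convert_to_date(["order_date"], source_format="MM-dd-yyyy"))
    .transform(DateTransformers.get_date_hierarchy(["order_date"]))  # adds _day/_month/_week/_quarter/_year
    .transform(DateTransformers.add_current_date(output_col="load_date"))
)
result.show(truncate=False)
```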
  {
    "path": "lakehouse_engine/transformers/exceptions.py",
    "content": "\"\"\"Module for all the transformers exceptions.\"\"\"\n\n\nclass WrongArgumentsException(Exception):\n    \"\"\"Exception for when a user provides wrong arguments to a transformer.\"\"\"\n\n    pass\n\n\nclass UnsupportedStreamingTransformerException(Exception):\n    \"\"\"Exception for when a user requests a transformer not supported in streaming.\"\"\"\n\n    pass\n"
  },
  {
    "path": "lakehouse_engine/transformers/filters.py",
    "content": "\"\"\"Module containing the filters transformers.\"\"\"\n\nfrom typing import Any, Callable, List, Optional\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col\n\nfrom lakehouse_engine.transformers.watermarker import Watermarker\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Filters(object):\n    \"\"\"Class containing the filters transformers.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def incremental_filter(\n        cls,\n        input_col: str,\n        increment_value: Optional[Any] = None,\n        increment_df: Optional[DataFrame] = None,\n        increment_col: str = \"latest\",\n        greater_or_equal: bool = False,\n    ) -> Callable:\n        \"\"\"Incrementally Filter a certain dataframe given an increment logic.\n\n        This logic can either be an increment value or an increment dataframe from\n        which the get the latest value from. By default, the operator for the\n        filtering process is greater or equal to cover cases where we receive late\n        arriving data not cover in a previous load. You can change greater_or_equal\n        to false to use greater, when you trust the source will never output more data\n        with the increment after you have load the data (e.g., you will never load\n        data until the source is still dumping data, which may cause you to get an\n        incomplete picture of the last arrived data).\n\n        Args:\n            input_col: input column name\n            increment_value: value to which to filter the data, considering the\n                provided input_Col.\n            increment_df: a dataframe to get the increment value from.\n                you either specify this or the increment_value (this takes precedence).\n                This is a good approach to get the latest value from a given dataframe\n                that was read and apply that value as filter here. In this way you can\n                perform incremental loads based on the last value of a given dataframe\n                (e.g., table or file based). Can be used together with the\n                get_max_value transformer to accomplish these incremental based loads.\n                See our append load feature tests  to see how to provide an acon for\n                incremental loads, taking advantage of the scenario explained here.\n            increment_col: name of the column from which to get the increment\n                value from (when using increment_df approach). This assumes there's\n                only one row in the increment_df, reason why is a good idea to use\n                together with the get_max_value transformer. Defaults to \"latest\"\n                because that's the default output column name provided by the\n                get_max_value transformer.\n            greater_or_equal: if filtering should be done by also including the\n                increment value or not (useful for scenarios where you are performing\n                increment loads but still want to include data considering the increment\n                value, and not only values greater than that increment... 
examples may\n                include scenarios where you already loaded data including those values,\n                but the source produced more data containing those values).\n                Defaults to false.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='incremental_filter')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            if increment_df:\n                if greater_or_equal:\n                    return df.filter(  # type: ignore\n                        col(input_col) >= increment_df.collect()[0][increment_col]\n                    )\n                else:\n                    return df.filter(  # type: ignore\n                        col(input_col) > increment_df.collect()[0][increment_col]\n                    )\n            else:\n                if greater_or_equal:\n                    return df.filter(col(input_col) >= increment_value)  # type: ignore\n                else:\n                    return df.filter(col(input_col) > increment_value)  # type: ignore\n\n        return inner\n\n    @staticmethod\n    def expression_filter(exp: str) -> Callable:\n        \"\"\"Filter a dataframe based on an expression.\n\n        Args:\n            exp: filter expression.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='expression_filter')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.filter(exp)  # type: ignore\n\n        return inner\n\n    @staticmethod\n    def column_filter_exp(exp: List[str]) -> Callable:\n        \"\"\"Filter a dataframe's columns based on a list of SQL expressions.\n\n        Args:\n            exp: column filter expressions.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='column_filter_exp')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.selectExpr(*exp)  # type: ignore\n\n        return inner\n\n    @staticmethod\n    def drop_duplicate_rows(\n        cols: List[str] = None, watermarker: dict = None\n    ) -> Callable:\n        \"\"\"Drop duplicate rows using spark function dropDuplicates().\n\n        This transformer can be used with or without arguments.\n        The provided argument needs to be a list of columns.\n        For example: [“Name”,”VAT”] will drop duplicate records within\n        \"Name\" and \"VAT\" columns.\n        If the transformer is used without providing any columns list or providing\n        an empty list, such as [] the result will be the same as using\n        the distinct() pyspark function. 
If the watermark dict is present it will\n        ensure that the drop operation will apply to rows within the watermark timeline\n        window.\n\n\n        Args:\n            cols: column names.\n            watermarker: properties to apply watermarker to the transformer.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='drop_duplicate_rows')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            if watermarker:\n                df = Watermarker.with_watermark(\n                    watermarker[\"col\"], watermarker[\"watermarking_time\"]\n                )(df)\n            if not cols:\n                return df.dropDuplicates()\n            else:\n                return df.dropDuplicates(cols)\n\n        return inner\n"
  },
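A minimal sketch chaining three of the filters above: keep only rows newer than a previously loaded increment, drop duplicates, and project columns. The session, data, and increment value are illustrative assumptions:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.filters import Filters

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
df = spark.createDataFrame(
    [(1, "2024-01-01"), (2, "2024-01-02"), (2, "2024-01-02")], ["id", "load_date"]
)  # hypothetical data

newer = df.transform(
    Filters.incremental_filter(input_col="load_date", increment_value="2024-01-01")
)  # strictly greater than the increment by default
deduped = newer.transform(Filters.drop_duplicate_rows(cols=["id", "load_date"]))
only_ids = deduped.transform(Filters.column_filter_exp(["id"]))
only_ids.show()
```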
  {
    "path": "lakehouse_engine/transformers/joiners.py",
    "content": "\"\"\"Module with join transformers.\"\"\"\n\nimport uuid\nfrom typing import Callable, List, Optional\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.transformers.watermarker import Watermarker\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.spark_utils import SparkUtils\n\n\nclass Joiners(object):\n    \"\"\"Class containing join transformers.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def join(\n        cls,\n        join_with: DataFrame,\n        join_condition: str,\n        left_df_alias: str = \"a\",\n        right_df_alias: str = \"b\",\n        join_type: str = \"inner\",\n        broadcast_join: bool = True,\n        select_cols: Optional[List[str]] = None,\n        watermarker: Optional[dict] = None,\n    ) -> Callable:\n        \"\"\"Join two dataframes based on specified type and columns.\n\n        Some stream to stream joins are only possible if you apply Watermark, so this\n        method also provides a parameter to enable watermarking specification.\n\n        Args:\n            left_df_alias: alias of the first dataframe.\n            join_with: right dataframe.\n            right_df_alias: alias of the second dataframe.\n            join_condition: condition to join dataframes.\n            join_type: type of join. Defaults to inner.\n                Available values: inner, cross, outer, full, full outer,\n                left, left outer, right, right outer, semi,\n                left semi, anti, and left anti.\n            broadcast_join: whether to perform a broadcast join or not.\n            select_cols: list of columns to select at the end.\n            watermarker: properties to apply watermarking.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='join')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            # The goal here is to avoid problems on\n            # simultaneously running process,\n            # so an id is added as a prefix for the alias.\n            app_id = str(uuid.uuid4())\n            left = f\"`{app_id}_{left_df_alias}`\"\n            right = f\"`{app_id}_{right_df_alias}`\"\n            df_join_with = join_with\n            if watermarker:\n                left_df_watermarking = watermarker.get(left_df_alias, None)\n                right_df_watermarking = watermarker.get(right_df_alias, None)\n                if left_df_watermarking:\n                    df = Watermarker.with_watermark(\n                        left_df_watermarking[\"col\"],\n                        left_df_watermarking[\"watermarking_time\"],\n                    )(df)\n                if right_df_watermarking:\n                    df_join_with = Watermarker.with_watermark(\n                        right_df_watermarking[\"col\"],\n                        right_df_watermarking[\"watermarking_time\"],\n                    )(df_join_with)\n\n            l_prefix = SparkUtils.create_temp_view(df, left, return_prefix=True)\n            r_prefix = SparkUtils.create_temp_view(\n                df_join_with, right, return_prefix=True\n            )\n\n            query = f\"\"\"\n                SELECT {f\"/*+ BROADCAST({right_df_alias}) */\" if broadcast_join else \"\"}\n                {\", \".join(select_cols)}\n                FROM {l_prefix}{left} AS {left_df_alias}\n                {join_type.upper()}\n 
               JOIN {r_prefix}{right} AS {right_df_alias}\n                ON {join_condition}\n            \"\"\"  # nosec: B608\n\n            cls._logger.info(f\"Execution query: {query}\")\n\n            return ExecEnv.SESSION.sql(query)\n\n        return inner\n"
  },
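A hedged sketch of building the `join` transformer above. The DataFrames and condition are illustrative assumptions; executing the returned function relies on `ExecEnv.SESSION`, which the engine initialises when running an algorithm, so the call itself is left commented:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.joiners import Joiners

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
orders = spark.createDataFrame([(1, 100)], ["customer_id", "amount"])      # hypothetical data
customers = spark.createDataFrame([(1, "Jane")], ["customer_id", "name"])  # hypothetical data

join_fn = Joiners.join(
    join_with=customers,
    join_condition="a.customer_id = b.customer_id",
    select_cols=["a.customer_id", "a.amount", "b.name"],
)
# joined = orders.transform(join_fn)  # requires the engine's ExecEnv.SESSION to be set up
```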
  {
    "path": "lakehouse_engine/transformers/null_handlers.py",
    "content": "\"\"\"Module with null handlers transformers.\"\"\"\n\nfrom typing import Callable, List\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass NullHandlers(object):\n    \"\"\"Class containing null handler transformers.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def replace_nulls(\n        cls,\n        replace_on_nums: bool = True,\n        default_num_value: int = -999,\n        replace_on_strings: bool = True,\n        default_string_value: str = \"UNKNOWN\",\n        subset_cols: List[str] = None,\n    ) -> Callable:\n        \"\"\"Replace nulls in a dataframe.\n\n        Args:\n            replace_on_nums: if it is to replace nulls on numeric columns.\n                Applies to ints, longs and floats.\n            default_num_value: default integer value to use as replacement.\n            replace_on_strings: if it is to replace nulls on string columns.\n            default_string_value: default string value to use as replacement.\n            subset_cols: list of columns in which to replace nulls. If not\n                provided, all nulls in all columns will be replaced as specified.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='replace_nulls')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            if replace_on_nums:\n                df = df.na.fill(default_num_value, subset_cols)\n            if replace_on_strings:\n                df = df.na.fill(default_string_value, subset_cols)\n\n            return df\n\n        return inner\n"
  },
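A minimal sketch of `replace_nulls`; the session, schema, and replacement values are illustrative assumptions:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.null_handlers import NullHandlers

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
df = spark.createDataFrame(
    [(1, None), (None, "shoes")], "amount: int, category: string"
)  # hypothetical data

filled = df.transform(
    NullHandlers.replace_nulls(default_num_value=-1, default_string_value="N/A")
)
filled.show()  # null amounts become -1, null categories become 'N/A'
```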
  {
    "path": "lakehouse_engine/transformers/optimizers.py",
    "content": "\"\"\"Optimizers module.\"\"\"\n\nfrom typing import Callable\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.storagelevel import StorageLevel\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Optimizers(object):\n    \"\"\"Class containing all the functions that can provide optimizations.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def cache(cls) -> Callable:\n        \"\"\"Caches the current dataframe.\n\n        The default storage level used is MEMORY_AND_DISK.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='cache')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.cache()\n\n        return inner\n\n    @classmethod\n    def persist(cls, storage_level: str = None) -> Callable:\n        \"\"\"Caches the current dataframe with a specific StorageLevel.\n\n        Args:\n            storage_level: the type of StorageLevel, as default MEMORY_AND_DISK_DESER.\n                [More options here](\n                https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.StorageLevel.html).\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='persist')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            level = getattr(\n                StorageLevel, storage_level, StorageLevel.MEMORY_AND_DISK_DESER\n            )\n\n            return df.persist(level)\n\n        return inner\n\n    @classmethod\n    def unpersist(cls, blocking: bool = False) -> Callable:\n        \"\"\"Removes the dataframe from the disk and memory.\n\n        Args:\n            blocking: whether to block until all the data blocks are\n                removed from disk/memory or run asynchronously.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='unpersist')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.unpersist(blocking)\n\n        return inner\n"
  },
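A minimal sketch of `persist`/`unpersist` around repeated actions; the session and storage level are illustrative assumptions:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.optimizers import Optimizers

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
df = spark.range(1000)  # hypothetical data reused by several downstream actions

df = df.transform(Optimizers.persist(storage_level="MEMORY_ONLY"))
print(df.count(), df.count())  # both actions reuse the persisted data
df = df.transform(Optimizers.unpersist())
```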
  {
    "path": "lakehouse_engine/transformers/regex_transformers.py",
    "content": "\"\"\"Regex transformers module.\"\"\"\n\nfrom typing import Callable\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col, regexp_extract\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass RegexTransformers(object):\n    \"\"\"Class containing all regex functions.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @staticmethod\n    def with_regex_value(\n        input_col: str,\n        output_col: str,\n        regex: str,\n        drop_input_col: bool = False,\n        idx: int = 1,\n    ) -> Callable:\n        \"\"\"Get the result of applying a regex to an input column (via regexp_extract).\n\n        Args:\n            input_col: name of the input column.\n            output_col: name of the output column.\n            regex: regular expression.\n            drop_input_col: whether to drop input_col or not.\n            idx: index to return.\n\n        Returns:\n            A function to be executed in the .transform() spark function.\n\n        {{get_example(method_name='with_regex_value')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            df = df.withColumn(output_col, regexp_extract(col(input_col), regex, idx))\n\n            if drop_input_col:\n                df = df.drop(input_col)\n\n            return df\n\n        return inner\n"
  },
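A minimal sketch of `with_regex_value`; the session, data, and regular expression are illustrative assumptions:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.regex_transformers import RegexTransformers

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
df = spark.createDataFrame([("order-12345",)], ["reference"])  # hypothetical data

with_id = df.transform(
    RegexTransformers.with_regex_value(
        input_col="reference", output_col="order_id", regex=r"order-(\d+)", idx=1
    )
)
with_id.show()  # order_id -> '12345'
```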
  {
    "path": "lakehouse_engine/transformers/repartitioners.py",
    "content": "\"\"\"Module with repartitioners transformers.\"\"\"\n\nfrom typing import Callable, List, Optional\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.transformers.exceptions import WrongArgumentsException\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Repartitioners(object):\n    \"\"\"Class containing repartitioners transformers.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def coalesce(cls, num_partitions: int) -> Callable:\n        \"\"\"Coalesce a dataframe into n partitions.\n\n        Args:\n            num_partitions: num of partitions to coalesce.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='coalesce')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.coalesce(num_partitions)\n\n        return inner\n\n    @classmethod\n    def repartition(\n        cls, num_partitions: Optional[int] = None, cols: Optional[List[str]] = None\n    ) -> Callable:\n        \"\"\"Repartition a dataframe into n partitions.\n\n        If num_partitions is provided repartitioning happens based on the provided\n        number, otherwise it happens based on the values of the provided cols (columns).\n\n        Args:\n            num_partitions: num of partitions to repartition.\n            cols: list of columns to use for repartitioning.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='repartition')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            if cols:\n                return df.repartition(num_partitions, *cols)\n            elif num_partitions:\n                return df.repartition(num_partitions)\n            else:\n                raise WrongArgumentsException(\n                    \"num_partitions or cols should be specified\"\n                )\n\n        return inner\n"
  },
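A minimal sketch of `coalesce` and `repartition`; the session and partition counts are illustrative assumptions:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.repartitioners import Repartitioners

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
df = spark.range(100)  # hypothetical data

fewer = df.transform(Repartitioners.coalesce(num_partitions=1))
more = df.transform(Repartitioners.repartition(num_partitions=8, cols=["id"]))
print(fewer.rdd.getNumPartitions(), more.rdd.getNumPartitions())
```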
  {
    "path": "lakehouse_engine/transformers/transformer_factory.py",
    "content": "\"\"\"Module with the factory pattern to return transformers.\"\"\"\n\nfrom typing import Callable, OrderedDict\n\nfrom lakehouse_engine.core.definitions import TransformerSpec\nfrom lakehouse_engine.transformers.aggregators import Aggregators\nfrom lakehouse_engine.transformers.column_creators import ColumnCreators\nfrom lakehouse_engine.transformers.column_reshapers import ColumnReshapers\nfrom lakehouse_engine.transformers.condensers import Condensers\nfrom lakehouse_engine.transformers.custom_transformers import CustomTransformers\nfrom lakehouse_engine.transformers.data_maskers import DataMaskers\nfrom lakehouse_engine.transformers.date_transformers import DateTransformers\nfrom lakehouse_engine.transformers.filters import Filters\nfrom lakehouse_engine.transformers.joiners import Joiners\nfrom lakehouse_engine.transformers.null_handlers import NullHandlers\nfrom lakehouse_engine.transformers.optimizers import Optimizers\nfrom lakehouse_engine.transformers.regex_transformers import RegexTransformers\nfrom lakehouse_engine.transformers.repartitioners import Repartitioners\nfrom lakehouse_engine.transformers.unions import Unions\nfrom lakehouse_engine.transformers.watermarker import Watermarker\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass TransformerFactory(object):\n    \"\"\"TransformerFactory class following the factory pattern.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    UNSUPPORTED_STREAMING_TRANSFORMERS = [\n        \"condense_record_mode_cdc\",\n        \"group_and_rank\",\n        \"with_auto_increment_id\",\n        \"with_row_id\",\n    ]\n\n    AVAILABLE_TRANSFORMERS = {\n        \"add_current_date\": DateTransformers.add_current_date,\n        \"cache\": Optimizers.cache,\n        \"cast\": ColumnReshapers.cast,\n        \"coalesce\": Repartitioners.coalesce,\n        \"column_dropper\": DataMaskers.column_dropper,\n        \"column_filter_exp\": Filters.column_filter_exp,\n        \"column_selector\": ColumnReshapers.column_selector,\n        \"condense_record_mode_cdc\": Condensers.condense_record_mode_cdc,\n        \"convert_to_date\": DateTransformers.convert_to_date,\n        \"convert_to_timestamp\": DateTransformers.convert_to_timestamp,\n        \"custom_transformation\": CustomTransformers.custom_transformation,\n        \"drop_duplicate_rows\": Filters.drop_duplicate_rows,\n        \"expression_filter\": Filters.expression_filter,\n        \"format_date\": DateTransformers.format_date,\n        \"flatten_schema\": ColumnReshapers.flatten_schema,\n        \"explode_columns\": ColumnReshapers.explode_columns,\n        \"from_avro\": ColumnReshapers.from_avro,\n        \"from_avro_with_registry\": ColumnReshapers.from_avro_with_registry,\n        \"from_json\": ColumnReshapers.from_json,\n        \"get_date_hierarchy\": DateTransformers.get_date_hierarchy,\n        \"get_max_value\": Aggregators.get_max_value,\n        \"group_and_rank\": Condensers.group_and_rank,\n        \"hash_masker\": DataMaskers.hash_masker,\n        \"incremental_filter\": Filters.incremental_filter,\n        \"join\": Joiners.join,\n        \"persist\": Optimizers.persist,\n        \"rename\": ColumnReshapers.rename,\n        \"repartition\": Repartitioners.repartition,\n        \"replace_nulls\": NullHandlers.replace_nulls,\n        \"sql_transformation\": CustomTransformers.sql_transformation,\n        \"to_json\": ColumnReshapers.to_json,\n        \"union\": Unions.union,\n        \"union_by_name\": 
Unions.union_by_name,\n        \"with_watermark\": Watermarker.with_watermark,\n        \"unpersist\": Optimizers.unpersist,\n        \"with_auto_increment_id\": ColumnCreators.with_auto_increment_id,\n        \"with_expressions\": ColumnReshapers.with_expressions,\n        \"with_literals\": ColumnCreators.with_literals,\n        \"with_regex_value\": RegexTransformers.with_regex_value,\n        \"with_row_id\": ColumnCreators.with_row_id,\n    }\n\n    @staticmethod\n    def get_transformer(spec: TransformerSpec, data: OrderedDict = None) -> Callable:\n        \"\"\"Get a transformer following the factory pattern.\n\n        Args:\n            spec: transformer specification (individual transformation... not to be\n                confused with list of all transformations).\n            data: ordered dict of dataframes to be transformed. Needed when a\n                transformer requires more than one dataframe as input.\n\n        Returns:\n            Transformer function to be executed in .transform() spark function.\n\n        {{get_example(method_name='get_transformer')}}\n        \"\"\"\n        if spec.function == \"incremental_filter\":\n            # incremental_filter optionally expects a DataFrame as input, so find it.\n            args_copy = TransformerFactory._get_spec_args_copy(spec.args)\n            if \"increment_df\" in args_copy:\n                args_copy[\"increment_df\"] = data[args_copy[\"increment_df\"]]\n            return TransformerFactory.AVAILABLE_TRANSFORMERS[  # type: ignore\n                spec.function\n            ](**args_copy)\n        elif spec.function == \"join\":\n            # get the dataframe given the input_id in the input specs of the acon.\n            args_copy = TransformerFactory._get_spec_args_copy(spec.args)\n            args_copy[\"join_with\"] = data[args_copy[\"join_with\"]]\n            return TransformerFactory.AVAILABLE_TRANSFORMERS[  # type: ignore\n                spec.function\n            ](**args_copy)\n        elif spec.function == \"union\" or spec.function == \"union_by_name\":\n            # get the list of dataframes given the input_id in the input specs\n            # of the acon.\n            args_copy = TransformerFactory._get_spec_args_copy(spec.args)\n            args_copy[\"union_with\"] = []\n            for union_with_spec_id in spec.args[\"union_with\"]:\n                args_copy[\"union_with\"].append(data[union_with_spec_id])\n            return TransformerFactory.AVAILABLE_TRANSFORMERS[  # type: ignore\n                spec.function\n            ](**args_copy)\n        elif spec.function in TransformerFactory.AVAILABLE_TRANSFORMERS:\n            return TransformerFactory.AVAILABLE_TRANSFORMERS[  # type: ignore\n                spec.function\n            ](**spec.args)\n        else:\n            raise NotImplementedError(\n                f\"The requested transformer {spec.function} is not implemented.\"\n            )\n\n    @staticmethod\n    def _get_spec_args_copy(spec_args: dict) -> dict:\n        \"\"\"Returns a shallow copy of `spec_args` to ensure immutability.\n\n        Args:\n            spec_args (dict): A dictionary containing the arguments of a\n            TransformerSpec.\n\n        Returns:\n            dict: A shallow copy of `spec_args`, preventing modifications to the\n            original dictionary. This is important in Spark, especially when\n            retries of failed attempts occur. 
For example, if during the first\n            run the `join_with` argument (initially a string) is replaced with a\n            DataFrame (as done in the `get_transformer` function), then on a retry,\n            depending on how Spark handles state, the `join_with` argument may no\n            longer be a string but a DataFrame, leading to a key error.\n        \"\"\"\n        return dict(spec_args)\n"
  },
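A hedged sketch of asking the factory for a transformer. It assumes `TransformerSpec` can be constructed with the `function` and `args` fields that `get_transformer` reads above; the spec values and the commented DataFrame are illustrative only:

```python
from collections import OrderedDict

from lakehouse_engine.core.definitions import TransformerSpec
from lakehouse_engine.transformers.transformer_factory import TransformerFactory

# Hypothetical spec: repartition into 4 partitions (no extra input DataFrames needed).
spec = TransformerSpec(function="repartition", args={"num_partitions": 4})

transformer = TransformerFactory.get_transformer(spec, data=OrderedDict())
# result = some_df.transform(transformer)  # 'some_df' is an assumed DataFrame
```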
  {
    "path": "lakehouse_engine/transformers/unions.py",
    "content": "\"\"\"Module with union transformers.\"\"\"\n\nfrom functools import reduce\nfrom typing import Callable, List\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Unions(object):\n    \"\"\"Class containing union transformers.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def union(\n        cls,\n        union_with: List[DataFrame],\n        deduplication: bool = True,\n    ) -> Callable:\n        \"\"\"Union dataframes, resolving columns by position (not by name).\n\n        Args:\n            union_with: list of dataframes to union.\n            deduplication: whether to perform deduplication of elements or not.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='union')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            union_df = reduce(lambda x, y: x.union(y), [df] + union_with)\n\n            return union_df.distinct() if deduplication else union_df\n\n        return inner\n\n    @classmethod\n    def union_by_name(\n        cls,\n        union_with: List[DataFrame],\n        deduplication: bool = True,\n        allow_missing_columns: bool = True,\n    ) -> Callable:\n        \"\"\"Union dataframes, resolving columns by name (not by position).\n\n        Args:\n            union_with: list of dataframes to union.\n            deduplication: whether to perform deduplication of elements or not.\n            allow_missing_columns: allow the union of DataFrames with different\n                schemas.\n\n        Returns:\n            A function to be called in .transform() spark function.\n\n        {{get_example(method_name='union_by_name')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            union_df = reduce(\n                lambda x, y: x.unionByName(\n                    y, allowMissingColumns=allow_missing_columns\n                ),\n                [df] + union_with,\n            )\n\n            return union_df.distinct() if deduplication else union_df\n\n        return inner\n"
  },
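A minimal sketch of `union_by_name`, where column order differs between the inputs; the session and data are illustrative assumptions:

```python
from pyspark.sql import SparkSession

from lakehouse_engine.transformers.unions import Unions

spark = SparkSession.builder.getOrCreate()  # assumed local session for illustration
df_a = spark.createDataFrame([(1, "a")], ["id", "value"])
df_b = spark.createDataFrame([("b", 2)], ["value", "id"])  # same columns, different order

combined = df_a.transform(Unions.union_by_name(union_with=[df_b]))
combined.show()  # columns resolved by name; duplicates removed because deduplication=True
```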
  {
    "path": "lakehouse_engine/transformers/watermarker.py",
    "content": "\"\"\"Watermarker module.\"\"\"\n\nfrom typing import Callable\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass Watermarker(object):\n    \"\"\"Class containing all watermarker transformers.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @staticmethod\n    def with_watermark(watermarker_column: str, watermarker_time: str) -> Callable:\n        \"\"\"Get the dataframe with watermarker defined.\n\n        Args:\n            watermarker_column: name of the input column to be considered for\n                the watermarking. Note: it must be a timestamp.\n            watermarker_time: time window to define the watermark value.\n\n        Returns:\n            A function to be executed on other transformers.\n\n        {{get_example(method_name='with_watermark')}}\n        \"\"\"\n\n        def inner(df: DataFrame) -> DataFrame:\n            return df.withWatermark(watermarker_column, watermarker_time)\n\n        return inner\n"
  },
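A hedged sketch of `with_watermark`; `events_df` and the column/window values are assumptions, and the watermark only has an effect on streaming DataFrames:

```python
from lakehouse_engine.transformers.watermarker import Watermarker

# 'events_df' is an assumed streaming DataFrame with a timestamp column 'event_time'.
watermark_fn = Watermarker.with_watermark("event_time", "10 minutes")
# late_tolerant_df = events_df.transform(watermark_fn)
```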
  {
    "path": "lakehouse_engine/utils/__init__.py",
    "content": "\"\"\"Utilities package.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/utils/acon_utils.py",
    "content": "\"\"\"Module to perform validations and resolve the acon.\"\"\"\n\nfrom lakehouse_engine.core.definitions import (\n    FILE_MANAGER_OPERATIONS,\n    TABLE_MANAGER_OPERATIONS,\n    DQType,\n    InputFormat,\n    OutputFormat,\n)\nfrom lakehouse_engine.io.exceptions import WrongIOFormatException\nfrom lakehouse_engine.utils.dq_utils import PrismaUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n_LOGGER = LoggingHandler(__name__).get_logger()\n\n\ndef validate_manager_list(acon: dict) -> list:\n    \"\"\"Function to validate an acon with a list of operations.\n\n    Args:\n        acon: Acon to be validated.\n    \"\"\"\n    error_list: list[str] = []\n    operations: list[dict] = acon.get(\"operations\", [])\n\n    if not operations:\n        raise RuntimeError(\"No operations found in the acon.\")\n\n    for operation in operations:\n        validate_managers(operation, error_list)\n    if error_list:\n        error_list_str = \"\\n\" + \"\\n\".join(error_list)\n        raise RuntimeError(f\"Errors found during validation:{error_list_str}\")\n\n    return operations\n\n\ndef validate_and_resolve_acon(acon: dict, execution_point: str = \"\") -> dict:\n    \"\"\"Function to validate and resolve the acon.\n\n    Args:\n        acon: Acon to be validated and resolved.\n        execution_point: Execution point to resolve the dq functions.\n\n    Returns:\n        Acon after validation and resolution.\n    \"\"\"\n    # Performing validations\n    validate_readers(acon)\n    validate_writers(acon)\n    validate_managers(acon)\n\n    # Resolving the acon\n    if execution_point:\n        acon = resolve_dq_functions(acon, execution_point)\n\n    _LOGGER.info(f\"Read Algorithm Configuration: {str(acon)}\")\n\n    return acon\n\n\ndef validate_readers(acon: dict) -> None:\n    \"\"\"Function to validate the readers in the acon.\n\n    Args:\n        acon: Acon to be validated.\n\n    Raises:\n        RuntimeError: If the input format is not supported.\n    \"\"\"\n    if \"input_specs\" in acon.keys() or \"input_spec\" in acon.keys():\n        for spec in acon.get(\"input_specs\", []) or [acon.get(\"input_spec\", {})]:\n            if (\n                not InputFormat.exists(spec.get(\"data_format\"))\n                and \"db_table\" not in spec.keys()\n            ):\n                raise WrongIOFormatException(\n                    f\"Input format not supported: {spec.get('data_format')}\"\n                )\n\n\ndef validate_writers(acon: dict) -> None:\n    \"\"\"Function to validate the writers in the acon.\n\n    Args:\n        acon: Acon to be validated.\n\n    Raises:\n        RuntimeError: If the output format is not supported.\n    \"\"\"\n    if \"output_specs\" in acon.keys() or \"output_spec\" in acon.keys():\n        for spec in acon.get(\"output_specs\", []) or [acon.get(\"output_spec\", {})]:\n            if not OutputFormat.exists(spec.get(\"data_format\")):\n                raise WrongIOFormatException(\n                    f\"Output format not supported: {spec.get('data_format')}\"\n                )\n\n\ndef validate_managers(acon: dict, error_list: list = None) -> None:\n    \"\"\"Function to validate the managers in the acon.\n\n    Args:\n        acon: Acon to be validated.\n        error_list: List to collect errors.\n    \"\"\"\n    manager_type = acon.get(\"manager\")\n    temp_error_list = []\n    if not manager_type:\n        return\n\n    function_name = acon.get(\"function\")\n    if not function_name:\n        error = 
\"Missing 'function' parameter for manager\"\n        temp_error_list.append(error)\n\n    if manager_type == \"file\":\n        operations_dict = FILE_MANAGER_OPERATIONS\n    elif manager_type == \"table\":\n        operations_dict = TABLE_MANAGER_OPERATIONS\n    else:\n        error = f\"Manager type not supported: {manager_type}\"\n        temp_error_list.append(error)\n\n    if function_name not in operations_dict:\n        error = f\"Function '{function_name}' not supported for {manager_type} manager\"\n        temp_error_list.append(error)\n    else:\n        expected_params = operations_dict[function_name]\n\n        missing_mandatory = validate_mandatory_parameters(acon, expected_params)\n        if missing_mandatory:\n            error = (\n                f\"Missing mandatory parameters for {manager_type} \"\n                f\"manager function {function_name}: {missing_mandatory}\"\n            )\n            temp_error_list.append(error)\n\n        type_errors = validate_parameter_types(acon, expected_params)\n\n        if type_errors:\n            error = (\n                f\"Type validation errors for {manager_type} \"\n                f\"manager function {function_name}: {type_errors}\"\n            )\n            temp_error_list.append(error)\n\n    if error_list is not None:\n        error_list.extend(temp_error_list)\n    else:\n        if temp_error_list:\n            error_list_str = \"\\n\".join(temp_error_list)\n            raise RuntimeError(error_list_str)\n\n\ndef validate_mandatory_parameters(acon: dict, expected_params: dict) -> list:\n    \"\"\"Function to validate mandatory parameters in the acon.\n\n    Args:\n        acon: Acon to be validated.\n        expected_params: Expected parameters with their mandatory status.\n\n    Returns:\n        List of missing mandatory parameters.\n    \"\"\"\n    missing_mandatory = []\n    for param_name, param_info in expected_params.items():\n        if param_info[\"mandatory\"] and param_name not in acon:\n            missing_mandatory.append(param_name)\n\n    return missing_mandatory\n\n\ndef validate_parameter_types(acon: dict, expected_params: dict) -> list:\n    \"\"\"Function to validate parameter types in the acon.\n\n    Args:\n        acon: Acon to be validated.\n        expected_params: Expected parameters with their types.\n\n    Returns:\n        List of type validation errors.\n    \"\"\"\n    type_errors = []\n    for param_name, param_value in acon.items():\n        if param_name in expected_params:\n            expected_type = expected_params[param_name][\"type\"]\n            param_type_name = type(param_value).__name__\n\n            expected_python_type = {\n                \"str\": str,\n                \"bool\": bool,\n                \"int\": int,\n                \"list\": list,\n            }.get(expected_type)\n\n            if expected_python_type and not isinstance(\n                param_value, expected_python_type\n            ):\n                type_errors.append(\n                    f\"Parameter '{param_name}' expected {expected_type}, \"\n                    f\"got {param_type_name}\"\n                )\n\n    return type_errors\n\n\ndef resolve_dq_functions(acon: dict, execution_point: str) -> dict:\n    \"\"\"Function to resolve the dq functions in the acon.\n\n    Args:\n        acon: Acon to resolve the dq functions.\n        execution_point: Execution point of the dq_functions.\n\n    Returns:\n        Acon after resolving the dq functions.\n    \"\"\"\n    if 
acon.get(\"dq_spec\"):\n        if acon.get(\"dq_spec\").get(\"dq_type\") == DQType.PRISMA.value:\n            acon[\"dq_spec\"] = PrismaUtils.build_prisma_dq_spec(\n                spec=acon.get(\"dq_spec\"), execution_point=execution_point\n            )\n    elif acon.get(\"dq_specs\"):\n        resolved_dq_specs = []\n        for spec in acon.get(\"dq_specs\", []):\n            if spec.get(\"dq_type\") == DQType.PRISMA.value:\n                resolved_dq_specs.append(\n                    PrismaUtils.build_prisma_dq_spec(\n                        spec=spec, execution_point=execution_point\n                    )\n                )\n            else:\n                resolved_dq_specs.append(spec)\n        acon[\"dq_specs\"] = resolved_dq_specs\n    return acon\n"
  },
  {
    "path": "lakehouse_engine/utils/configs/__init__.py",
    "content": "\"\"\"Config utilities package.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/utils/configs/config_utils.py",
    "content": "\"\"\"Module to read configurations.\"\"\"\n\nfrom importlib.metadata import PackageNotFoundError, version\nfrom typing import Any, Optional\n\nimport yaml\nfrom importlib_resources import as_file, files\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.storage.file_storage_functions import FileStorageFunctions\n\n\nclass ConfigUtils(object):\n    \"\"\"Config utilities class.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n    SENSITIVE_INFO = [\n        \"kafka.ssl.keystore.password\",\n        \"kafka.ssl.truststore.password\",\n        \"password\",\n        \"secret\",\n        \"credential\",\n        \"credentials\",\n        \"pass\",\n        \"key\",\n    ]\n\n    @classmethod\n    def get_acon(\n        cls,\n        acon_path: Optional[str] = None,\n        acon: Optional[dict] = None,\n        disable_dbfs_retry: bool = False,\n    ) -> dict:\n        \"\"\"Get acon based on a filesystem path or on a dict.\n\n        Args:\n            acon_path: path of the acon (algorithm configuration) file.\n            acon: acon provided directly through python code (e.g., notebooks\n                or other apps).\n            disable_dbfs_retry: optional flag to disable file storage dbfs.\n\n        Returns:\n            Dict representation of an acon.\n        \"\"\"\n        acon = (\n            acon if acon else ConfigUtils.read_json_acon(acon_path, disable_dbfs_retry)\n        )\n        return acon\n\n    @staticmethod\n    def get_config(package: str = \"lakehouse_engine.configs\") -> Any:\n        \"\"\"Get the lakehouse engine configuration file.\n\n        Args:\n            package: package where the engine default configurations can be found.\n\n        Returns:\n            Configuration dictionary\n        \"\"\"\n        config_path = files(package) / \"engine.yaml\"\n        with as_file(config_path) as config_file:\n            with open(config_file, \"r\") as config:\n                config = yaml.safe_load(config)\n        return config\n\n    @staticmethod\n    def get_config_from_file(config_file_path: str) -> Any:\n        \"\"\"Get the lakehouse engine configurations using a file path.\n\n         Args:\n            config_file_path: a string with a path for a yaml file\n            with custom configurations.\n\n        Returns:\n            Configuration dictionary\n        \"\"\"\n        with open(config_file_path, \"r\") as config:\n            config = yaml.safe_load(config)\n        return config\n\n    @classmethod\n    def get_engine_version(cls) -> str:\n        \"\"\"Get Lakehouse Engine version from the installed packages.\n\n        Returns:\n            String of engine version.\n        \"\"\"\n        try:\n            _version = version(\"lakehouse-engine\")\n        except PackageNotFoundError:\n            cls._LOGGER.info(\"Could not identify Lakehouse Engine version.\")\n            _version = \"\"\n        return str(_version)\n\n    @staticmethod\n    def read_json_acon(path: str, disable_dbfs_retry: bool = False) -> Any:\n        \"\"\"Read an acon (algorithm configuration) file.\n\n        Args:\n            path: path to the acon file.\n            disable_dbfs_retry: optional flag to disable file storage dbfs.\n\n        Returns:\n            The acon file content as a dict.\n        \"\"\"\n        return FileStorageFunctions.read_json(path, disable_dbfs_retry)\n\n    @staticmethod\n    def read_sql(path: str, disable_dbfs_retry: bool = False) -> Any:\n       
 \"\"\"Read a DDL file in Spark SQL format from a cloud object storage system.\n\n        Args:\n            path: path to the SQL file.\n            disable_dbfs_retry: optional flag to disable file storage dbfs.\n\n        Returns:\n            Content of the SQL file.\n        \"\"\"\n        return FileStorageFunctions.read_sql(path, disable_dbfs_retry)\n\n    @classmethod\n    def remove_sensitive_info(cls, dict_to_replace: dict | list) -> dict | list:\n        \"\"\"Remove sensitive info from a dictionary.\n\n        Args:\n            dict_to_replace: dict where we want to remove sensitive info.\n\n        Returns:\n            dict without sensitive information.\n        \"\"\"\n        if isinstance(dict_to_replace, list):\n            return [cls.remove_sensitive_info(k) for k in dict_to_replace]\n        elif isinstance(dict_to_replace, dict):\n            return {\n                k: \"******\" if k in cls.SENSITIVE_INFO else cls.remove_sensitive_info(v)\n                for k, v in dict_to_replace.items()\n            }\n        else:\n            return dict_to_replace\n"
  },
  {
    "path": "lakehouse_engine/utils/databricks_utils.py",
    "content": "\"\"\"Utilities for databricks operations.\"\"\"\n\nimport ast\nimport json\nimport os\nimport re\nfrom typing import Any, Tuple\n\nfrom pyspark.sql import SparkSession\n\nfrom lakehouse_engine.core.definitions import EngineStats\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass DatabricksUtils(object):\n    \"\"\"Databricks utilities class.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    @staticmethod\n    def is_serverless_workload() -> bool:\n        \"\"\"Check if the current databricks workload is serverless.\n\n        Returns:\n            True if the current databricks workload is serverless, False otherwise.\n        \"\"\"\n        if os.getenv(\"IS_SERVERLESS\", \"false\").lower() == \"true\":\n            return True\n        else:\n            return False\n\n    @staticmethod\n    def get_db_utils(spark: SparkSession) -> Any:\n        \"\"\"Get db utils on databricks.\n\n        Args:\n            spark: spark session.\n\n        Returns:\n            Dbutils from databricks.\n        \"\"\"\n        try:\n            from pyspark.dbutils import DBUtils\n\n            if \"dbutils\" not in locals():\n                dbutils = DBUtils(spark)\n            else:\n                dbutils = locals().get(\"dbutils\")\n        except ImportError:\n            import IPython\n\n            dbutils = IPython.get_ipython().user_ns[\"dbutils\"]\n        return dbutils\n\n    @staticmethod\n    def get_databricks_job_information(spark: SparkSession) -> Tuple[str, str]:\n        \"\"\"Get notebook context from running acon.\n\n        Args:\n            spark: spark session.\n\n        Returns:\n            Dict containing databricks notebook context.\n        \"\"\"\n        dbutils = DatabricksUtils.get_db_utils(spark)\n        notebook_context = json.loads(\n            (\n                dbutils.notebook.entry_point.getDbutils()\n                .notebook()\n                .getContext()\n                .safeToJson()\n            )\n        )\n\n        return notebook_context[\"attributes\"].get(\"orgId\"), notebook_context[\n            \"attributes\"\n        ].get(\"jobName\")\n\n    @staticmethod\n    def _get_dp_name(job_name: str) -> str:\n        \"\"\"Extract the dp_name from a Databricks job name.\n\n        The job name is expected to have a suffix separated by '-', and the dp_name is\n        the part before the last '-'. Only '_' is used in the rest of the job name.\n        E.g. 
'sadp-template-my_awesome_job'\n\n        Args:\n            job_name: The Databricks job name string.\n\n        Returns:\n            The extracted dp_name.\n        \"\"\"\n        return job_name.rsplit(\"-\", 1)[0] if job_name and \"-\" in job_name else job_name\n\n    @staticmethod\n    def get_spark_conf_values(usage_stats: dict, spark_confs: dict) -> None:\n        \"\"\"Get information from spark session configurations.\n\n        Args:\n            usage_stats: usage_stats dictionary file.\n            spark_confs: optional dictionary with the spark tags to be used when\n                collecting the engine usage.\n        \"\"\"\n        from lakehouse_engine.core.exec_env import ExecEnv\n\n        spark_confs = (\n            EngineStats.DEF_SPARK_CONFS\n            if spark_confs is None\n            else EngineStats.DEF_SPARK_CONFS | spark_confs\n        )\n\n        for spark_conf_key, spark_conf_value in spark_confs.items():\n            # whenever the spark_conf_value has #, it means it is an array, so we need\n            # to split it and adequately process it\n            if \"#\" in spark_conf_value:\n                array_key = spark_conf_value.split(\"#\")\n                array_values = ast.literal_eval(\n                    ExecEnv.SESSION.conf.get(array_key[0], \"[]\")\n                )\n                final_value = [\n                    key_val[\"value\"]\n                    for key_val in array_values\n                    if key_val[\"key\"] == array_key[1]\n                ]\n                usage_stats[spark_conf_key] = (\n                    final_value[0] if len(final_value) > 0 else \"\"\n                )\n            else:\n                usage_stats[spark_conf_key] = ExecEnv.SESSION.conf.get(\n                    spark_conf_value, \"\"\n                )\n\n        run_id_extracted = re.search(\"run-([1-9]\\\\w+)\", usage_stats.get(\"run_id\", \"\"))\n        usage_stats[\"run_id\"] = run_id_extracted.group(1) if run_id_extracted else \"\"\n\n    @classmethod\n    def get_usage_context_for_serverless(cls, usage_stats: dict) -> None:\n        \"\"\"Get information from the execution environment for serverless scenarios.\n\n        Since in serverless environments we might not have access to all the spark\n        confs we want to collect, we will try to get that information from the\n        execution environment when possible.\n\n        Args:\n            usage_stats: usage_stats dictionary file.\n        \"\"\"\n        try:\n            from dbruntime.databricks_repl_context import get_context\n\n            from lakehouse_engine.core.exec_env import ExecEnv\n\n            context = get_context()\n            for key, attr in EngineStats.DEF_DATABRICKS_CONTEXT_KEYS.items():\n                if key == \"dp_name\":\n                    usage_stats[key] = DatabricksUtils._get_dp_name(\n                        getattr(context, attr, None)\n                    )\n                elif key == \"environment\":\n                    usage_stats[key] = ExecEnv.get_environment()\n                else:\n                    usage_stats[key] = getattr(context, attr, None)\n        except Exception as ex:\n            cls._LOGGER.error(f\"Error getting Serverless Usage Context: {ex}\")\n"
  },
  {
    "path": "lakehouse_engine/utils/dq_utils.py",
    "content": "\"\"\"Module containing utils for DQ processing.\"\"\"\n\nfrom json import loads\n\nfrom pyspark.sql.functions import col, from_json, schema_of_json, struct\n\nfrom lakehouse_engine.core.definitions import DQSpec, DQTableBaseParameters, DQType\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.dq_processors.exceptions import DQSpecMalformedException\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n_LOGGER = LoggingHandler(__name__).get_logger()\n\n\nclass DQUtils:\n    \"\"\"Utils related to the data quality process.\"\"\"\n\n    @staticmethod\n    def import_dq_rules_from_table(\n        spec: dict,\n        execution_point: str,\n        base_expectation_arguments: list,\n        extra_meta_arguments: list,\n    ) -> dict:\n        \"\"\"Import dq rules from a table.\n\n        Args:\n            spec: data quality specification.\n            execution_point: if the execution is in_motion or at_rest.\n            base_expectation_arguments: base arguments for dq functions.\n            extra_meta_arguments: extra meta arguments for dq functions.\n\n        Returns:\n            The dictionary containing the dq spec with dq functions defined.\n        \"\"\"\n        dq_db_table = spec[\"dq_db_table\"]\n        dq_functions = []\n\n        if spec.get(\"dq_table_table_filter\"):\n            dq_table_table_filter = spec[\"dq_table_table_filter\"]\n        else:\n            raise DQSpecMalformedException(\n                \"When importing rules from a table \"\n                \"dq_table_table_filter must be defined.\"\n            )\n\n        extra_filters_query = (\n            f\"\"\" and {spec[\"dq_table_extra_filters\"]}\"\"\"\n            if spec.get(\"dq_table_extra_filters\")\n            else \"\"\n        )\n\n        fields = base_expectation_arguments + extra_meta_arguments\n\n        dq_functions_query = f\"\"\"\n            SELECT {\", \".join(fields)}\n            FROM {dq_db_table}\n            WHERE\n            execution_point='{execution_point}' and table = '{dq_table_table_filter}'\n            {extra_filters_query}\"\"\"  # nosec: B608\n\n        raw_dq_functions = ExecEnv.SESSION.sql(dq_functions_query)\n\n        arguments = raw_dq_functions.select(\"arguments\").collect()\n        parsed_arguments = [loads(argument.arguments) for argument in arguments]\n        combined_dict: dict = {}\n\n        for argument in parsed_arguments:\n            combined_dict = {**combined_dict, **argument}\n\n        dq_function_arguments_schema = schema_of_json(str(combined_dict))\n\n        processed_dq_functions = (\n            raw_dq_functions.withColumn(\n                \"json_data\", from_json(col(\"arguments\"), dq_function_arguments_schema)\n            )\n            .withColumn(\n                \"parsed_arguments\",\n                struct(\n                    col(\"json_data.*\"),\n                    struct(extra_meta_arguments).alias(\"meta\"),\n                ),\n            )\n            .drop(col(\"json_data\"))\n        )\n\n        unique_dq_functions = processed_dq_functions.drop_duplicates(\n            [\"dq_tech_function\", \"arguments\"]\n        )\n\n        duplicated_rows = processed_dq_functions.subtract(unique_dq_functions)\n\n        if duplicated_rows.count() > 0:\n            _LOGGER.warning(\"Found Duplicates Rows:\")\n            duplicated_rows.show(truncate=False)\n\n        processed_dq_functions_list = unique_dq_functions.collect()\n        for processed_dq_function in 
processed_dq_functions_list:\n            dq_functions.append(\n                {\n                    \"function\": f\"{processed_dq_function.dq_tech_function}\",\n                    \"args\": {\n                        k: v\n                        for k, v in processed_dq_function.parsed_arguments.asDict(\n                            recursive=True\n                        ).items()\n                        if v is not None\n                    },\n                }\n            )\n\n        spec[\"dq_functions\"] = dq_functions\n\n        return spec\n\n    @staticmethod\n    def validate_dq_functions(\n        spec: dict, execution_point: str = \"\", extra_meta_arguments: list = None\n    ) -> None:\n        \"\"\"Function to validate the dq functions defined in the dq_spec.\n\n        This function validates that the defined dq_functions contain all\n        the fields defined in the extra_meta_arguments parameter.\n\n        Args:\n            spec: data quality specification.\n            execution_point: if the execution is in_motion or at_rest.\n            extra_meta_arguments: extra meta arguments for dq functions.\n\n        Raises:\n            DQSpecMalformedException: If the dq spec is malformed.\n        \"\"\"\n        dq_functions = spec[\"dq_functions\"]\n        if not extra_meta_arguments:\n            _LOGGER.info(\n                \"No extra meta parameters defined. \"\n                \"Skipping validation of imported dq rule.\"\n            )\n            return\n\n        for dq_function in dq_functions:\n            if not dq_function.get(\"args\").get(\"meta\", None):\n                raise DQSpecMalformedException(\n                    \"The dq function must have a meta field containing all \"\n                    f\"the fields defined: {extra_meta_arguments}.\"\n                )\n            else:\n\n                meta = dq_function[\"args\"][\"meta\"]\n                given_keys = meta.keys()\n                missing_keys = sorted(set(extra_meta_arguments) - set(given_keys))\n                if missing_keys:\n                    raise DQSpecMalformedException(\n                        \"The dq function meta field must contain all the \"\n                        f\"fields defined: {extra_meta_arguments}.\\n\"\n                        f\"Found fields: {list(given_keys)}.\\n\"\n                        f\"Diff: {list(missing_keys)}\"\n                    )\n                if execution_point and meta[\"execution_point\"] != execution_point:\n                    raise DQSpecMalformedException(\n                        \"The dq function execution point must be the same as \"\n                        \"the execution point of the dq spec.\"\n                    )\n\n\nclass PrismaUtils:\n    \"\"\"Prisma related utils.\"\"\"\n\n    @staticmethod\n    def build_prisma_dq_spec(spec: dict, execution_point: str) -> dict:\n        \"\"\"Fetch dq functions from given table.\n\n        Args:\n            spec: data quality specification.\n            execution_point: if the execution is in_motion or at_rest.\n\n        Returns:\n            The dictionary containing the dq spec with dq functions defined.\n        \"\"\"\n        if spec.get(\"dq_db_table\"):\n            spec = DQUtils.import_dq_rules_from_table(\n                spec,\n                execution_point,\n                DQTableBaseParameters.PRISMA_BASE_PARAMETERS.value,\n                ExecEnv.ENGINE_CONFIG.dq_functions_column_list,\n            )\n        elif spec.get(\"dq_functions\"):\n            
DQUtils.validate_dq_functions(\n                spec,\n                execution_point,\n                ExecEnv.ENGINE_CONFIG.dq_functions_column_list,\n            )\n        else:\n            raise DQSpecMalformedException(\n                \"When using PRISMA either dq_db_table or \"\n                \"dq_functions needs to be defined.\"\n            )\n\n        dq_bucket = (\n            ExecEnv.ENGINE_CONFIG.dq_bucket\n            if ExecEnv.get_environment() == \"prod\"\n            else ExecEnv.ENGINE_CONFIG.dq_dev_bucket\n        )\n\n        spec[\"critical_functions\"] = []\n        spec[\"execution_point\"] = execution_point\n        spec[\"result_sink_db_table\"] = None\n        spec[\"result_sink_explode\"] = True\n        spec[\"fail_on_error\"] = spec.get(\"fail_on_error\", False)\n        spec[\"max_percentage_failure\"] = spec.get(\"max_percentage_failure\", 1)\n\n        if not spec.get(\"result_sink_extra_columns\", None):\n            spec[\"result_sink_extra_columns\"] = [\n                \"validation_results.expectation_config.meta\",\n            ]\n        else:\n            spec[\"result_sink_extra_columns\"] = [\n                \"validation_results.expectation_config.meta\",\n            ] + spec[\"result_sink_extra_columns\"]\n        if not spec.get(\"data_product_name\", None):\n            raise DQSpecMalformedException(\n                \"When using PRISMA DQ data_product_name must be defined.\"\n            )\n        spec[\"result_sink_location\"] = (\n            f\"{dq_bucket}/{spec['data_product_name']}/result_sink/\"\n        )\n        spec[\"processed_keys_location\"] = (\n            f\"{dq_bucket}/{spec['data_product_name']}/dq_processed_keys/\"\n        )\n        if not spec.get(\"tbl_to_derive_pk\", None) and not spec.get(\n            \"unexpected_rows_pk\", None\n        ):\n            raise DQSpecMalformedException(\n                \"When using PRISMA DQ either \"\n                \"tbl_to_derive_pk or unexpected_rows_pk need to be defined.\"\n            )\n        return spec\n\n    @staticmethod\n    def validate_rule_id_duplication(\n        specs: list[DQSpec],\n    ) -> dict[str, str]:\n        \"\"\"Verify uniqueness of the dq_rule_id.\n\n        Args:\n            specs: a list of DQSpec to be validated\n\n        Returns:\n             A dictionary with the spec_id as key and\n             rule_id as value for any duplicates.\n        \"\"\"\n        error_dict = {}\n\n        for spec in specs:\n            dq_db_table = spec.dq_db_table\n            dq_functions = spec.dq_functions\n            spec_id = spec.spec_id\n\n            if spec.dq_type == DQType.PRISMA.value and dq_db_table:\n                dq_rule_id_query = f\"\"\"\n                    SELECT dq_rule_id, COUNT(*) AS count\n                    FROM {dq_db_table}\n                    GROUP BY dq_rule_id\n                    HAVING COUNT(*) > 1;\n                    \"\"\"  # nosec: B608\n\n                duplicate_rule_id_table = ExecEnv.SESSION.sql(dq_rule_id_query)\n\n                if not duplicate_rule_id_table.isEmpty():\n                    rows = duplicate_rule_id_table.collect()\n                    df_str = \"; \".join([str(row) for row in rows])\n                    error_dict[f\"dq_spec_id: {spec_id}\"] = df_str\n\n            elif spec.dq_type == DQType.PRISMA.value and dq_functions:\n                dq_rules_id_list = []\n                for dq_function in dq_functions:\n                    
dq_rules_id_list.append(dq_function.args[\"meta\"][\"dq_rule_id\"])\n\n                if len(dq_rules_id_list) != len(set(dq_rules_id_list)):\n                    error_dict[f\"dq_spec_id: {spec_id}\"] = \"; \".join(\n                        [str(dq_rule_id) for dq_rule_id in dq_rules_id_list]\n                    )\n\n        return error_dict\n"
  },
  {
    "path": "lakehouse_engine/utils/engine_usage_stats.py",
    "content": "\"\"\"Utilities for recording the engine activity.\"\"\"\n\nimport json\nfrom datetime import datetime\nfrom urllib.parse import urlparse\n\nfrom lakehouse_engine.core.definitions import CollectEngineUsage\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.databricks_utils import DatabricksUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.storage.file_storage_functions import FileStorageFunctions\n\n\nclass EngineUsageStats(object):\n    \"\"\"Engine Usage utilities class.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def store_engine_usage(\n        cls,\n        acon: dict,\n        func_name: str,\n        collect_engine_usage: str = None,\n        spark_confs: dict = None,\n    ) -> None:\n        \"\"\"Collects and store Lakehouse Engine usage statistics.\n\n        These statistics include the acon and other relevant information, such as\n        the lakehouse engine version and the functions/algorithms being used.\n\n        Args:\n            acon: acon dictionary file.\n            func_name: function name that called this log acon.\n            collect_engine_usage: Lakehouse usage statistics collection strategy.\n            spark_confs: optional dictionary with the spark confs to be used when\n                collecting the engine usage.\n        \"\"\"\n        if not cls._should_collect_usage(collect_engine_usage):\n            return\n        try:\n            start_timestamp = datetime.now()\n            timestamp_str = start_timestamp.strftime(\"%Y%m%d%H%M%S\")\n            usage_stats = cls._prepare_usage_stats(acon, spark_confs)\n            engine_usage_path = cls._select_usage_path(\n                usage_stats, collect_engine_usage\n            )\n            if engine_usage_path is None:\n                return\n\n            cls._add_metadata_to_stats(usage_stats, func_name, start_timestamp)\n            log_file_name = f\"eng_usage_{func_name}_{timestamp_str}.json\"\n            usage_stats_str = json.dumps(usage_stats, default=str)\n            url = urlparse(\n                f\"{engine_usage_path}/{usage_stats['dp_name']}/\"\n                f\"{start_timestamp.year}/{start_timestamp.month}/\"\n                f\"{log_file_name}\",\n                allow_fragments=False,\n            )\n            try:\n                FileStorageFunctions.write_payload(\n                    engine_usage_path, url, usage_stats_str\n                )\n                cls._LOGGER.info(\"Storing Lakehouse Engine usage statistics\")\n            except FileNotFoundError as e:\n                cls._LOGGER.error(f\"Could not write engine stats into file: {e}.\")\n        except Exception as e:\n            cls._LOGGER.error(\n                \"Failed while collecting the lakehouse engine stats: \"\n                f\"Unexpected {e=}, {type(e)=}.\"\n            )\n\n    @classmethod\n    def _should_collect_usage(cls, collect_engine_usage: str) -> bool:\n        return (\n            collect_engine_usage\n            in [CollectEngineUsage.ENABLED.value, CollectEngineUsage.PROD_ONLY.value]\n            or ExecEnv.ENGINE_CONFIG.collect_engine_usage\n            in CollectEngineUsage.ENABLED.value\n        )\n\n    @classmethod\n    def _prepare_usage_stats(cls, acon: dict, spark_confs: dict) -> dict:\n        usage_stats = {\"acon\": ConfigUtils.remove_sensitive_info(acon)}\n        
if not ExecEnv.IS_SERVERLESS:\n            DatabricksUtils.get_spark_conf_values(usage_stats, spark_confs)\n        else:\n            DatabricksUtils.get_usage_context_for_serverless(usage_stats)\n        return usage_stats\n\n    @classmethod\n    def _select_usage_path(\n        cls, usage_stats: dict, collect_engine_usage: str\n    ) -> str | None:\n        if usage_stats.get(\"environment\") == \"prod\":\n            return ExecEnv.ENGINE_CONFIG.engine_usage_path\n        elif collect_engine_usage != CollectEngineUsage.PROD_ONLY.value:\n            return ExecEnv.ENGINE_CONFIG.engine_dev_usage_path\n        return None\n\n    @classmethod\n    def _add_metadata_to_stats(\n        cls, usage_stats: dict, func_name: str, start_timestamp: datetime\n    ) -> None:\n        usage_stats[\"function\"] = func_name\n        usage_stats[\"engine_version\"] = ConfigUtils.get_engine_version()\n        usage_stats[\"start_timestamp\"] = start_timestamp\n        usage_stats[\"year\"] = start_timestamp.year\n        usage_stats[\"month\"] = start_timestamp.month\n"
  },
  {
    "path": "lakehouse_engine/utils/expectations_utils.py",
    "content": "\"\"\"Utilities to be used by custom expectations.\"\"\"\n\nfrom typing import Any, Dict\n\n\ndef validate_result(\n    expectation_configuration: Any,\n    metrics: dict,\n) -> None:\n    \"\"\"Validates that the unexpected_index_list in the tests is corretly defined.\n\n    Additionally, it validates the expectation using the GE _validate method.\n\n    Args:\n        expectation_configuration: Expectation configuration.\n        metrics: Test result metrics.\n        runtime_configuration: Configuration used when running the expectation.\n        execution_engine: Execution engine used in the expectation.\n        base_expectation: Base expectation to validate.\n    \"\"\"\n    example_unexpected_index_list = _get_example_unexpected_index_list(\n        expectation_configuration\n    )\n\n    test_unexpected_index_list = _get_test_unexpected_index_list(\n        expectation_configuration.map_metric, metrics\n    )\n    if example_unexpected_index_list:\n        if example_unexpected_index_list != test_unexpected_index_list:\n            raise AssertionError(\n                f\"Example unexpected_index_list: {example_unexpected_index_list}\\n\"\n                f\"Test unexpected_index_list: {test_unexpected_index_list}\"\n            )\n\n\ndef _get_example_unexpected_index_list(expectation_configuration: Any) -> list:\n    \"\"\"Retrieves the unexpected index list defined from the example used on the test.\n\n    This needs to be done manually because GE allows us to get either the complete\n    output of the test or the complete configuration used on the test.\n    To get around this limitation this function is used to fetch the example used\n    in the test directly from the expectation itself.\n\n    Args:\n        expectation_configuration: Expectation configuration.\n\n    Returns:\n        List of unexpected indexes defined in the example used.\n    \"\"\"\n    filtered_example: dict = {\"out\": {\"unexpected_index_list\": []}}\n\n    for example in expectation_configuration.examples:\n        for test in example[\"tests\"]:  # type: ignore\n            example_result_format = []\n            if \"result_format\" in expectation_configuration.result_format:\n                example_result_format = expectation_configuration.result_format\n\n            if test[\"in\"][\"result_format\"] == example_result_format:\n                filtered_example = test\n\n    example_unexpected_index_list = []\n    if \"unexpected_index_list\" in filtered_example[\"out\"]:\n        example_unexpected_index_list = filtered_example[\"out\"][\"unexpected_index_list\"]\n\n    return example_unexpected_index_list\n\n\ndef _get_test_unexpected_index_list(metric_name: str, metrics: Dict) -> list:\n    \"\"\"Retrieves the unexpected index list from the test case that has been run.\n\n    Args:\n        metric_name: Name of the metric to retrieve the unexpected index list.\n        metrics: Metric values resulting from the test.\n\n    Returns:\n        List of unexpected indexes retrieved form the test.\n    \"\"\"\n    test_unexpected_index_list = []\n    if f\"{metric_name}.unexpected_index_list\" in metrics:\n        if metrics[f\"{metric_name}.unexpected_index_list\"]:\n            test_unexpected_index_list = metrics[f\"{metric_name}.unexpected_index_list\"]\n        else:\n            test_unexpected_index_list = []\n\n    return test_unexpected_index_list\n"
  },
  {
    "path": "lakehouse_engine/utils/extraction/__init__.py",
    "content": "\"\"\"Extraction utilities package.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/utils/extraction/jdbc_extraction_utils.py",
    "content": "\"\"\"Utilities module for JDBC extraction processes.\"\"\"\n\nfrom abc import abstractmethod\nfrom dataclasses import dataclass\nfrom datetime import datetime, timezone\nfrom enum import Enum\nfrom logging import Logger\nfrom typing import Any, Dict, List, Optional, Tuple\n\nfrom lakehouse_engine.core.definitions import InputFormat, InputSpec, ReadType\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass JDBCExtractionType(Enum):\n    \"\"\"Standardize the types of extractions we can have from a JDBC source.\"\"\"\n\n    INIT = \"init\"\n    DELTA = \"delta\"\n\n\n@dataclass\nclass JDBCExtraction(object):\n    \"\"\"Configurations available for an Extraction from a JDBC source.\n\n    These configurations cover:\n\n    - user: username to connect to JDBC source.\n    - password: password to connect to JDBC source (always use secrets,\n        don't use text passwords in your code).\n    - url: url to connect to JDBC source.\n    - dbtable: `database.table` to extract data from.\n    - calc_upper_bound_schema: custom schema used for the upper bound calculation.\n    - changelog_table: table of type changelog from which to extract data,\n        when the extraction type is delta.\n    - partition_column: column used to split the extraction.\n    - latest_timestamp_data_location: data location (e.g., s3) containing the data\n        to get the latest timestamp already loaded into bronze.\n    - latest_timestamp_data_format: the format of the dataset in\n        latest_timestamp_data_location. Default: delta.\n    - extraction_type: type of extraction (delta or init). Default: \"delta\".\n    - driver: JDBC driver name. Default: \"com.sap.db.jdbc.Driver\".\n    - num_partitions: number of Spark partitions to split the extraction.\n    - lower_bound: lower bound to decide the partition stride.\n    - upper_bound: upper bound to decide the partition stride. If\n        calculate_upper_bound is True, then upperBound will be\n        derived by our upper bound optimizer, using the partition column.\n    - default_upper_bound: the value to use as default upper bound in case\n        the result of the upper bound calculation is None. Default: \"1\".\n    - fetch_size: how many rows to fetch per round trip. Default: \"100000\".\n    - compress: enable network compression. Default: True.\n    - custom_schema: specify custom_schema for particular columns of the\n        returned dataframe in the init/delta extraction of the source table.\n    - min_timestamp: min timestamp to consider to filter the changelog data.\n        Default: None and automatically derived from the location provided.\n        In case this one is provided it has precedence and the calculation\n        is not done.\n    - max_timestamp: max timestamp to consider to filter the changelog data.\n        Default: None and automatically derived from the table having information\n        about the extraction requests, their timestamps and their status.\n        In case this one is provided it has precedence and the calculation\n        is not done.\n    - generate_predicates: whether to generate predicates automatically or not.\n        Default: False.\n    - predicates: list containing all values to partition (if generate_predicates\n        is used, the manual values provided are ignored). Default: None.\n    - predicates_add_null: whether to consider null on predicates list.\n        Default: True.\n    - extraction_timestamp: the timestamp of the extraction. 
Default: current time\n        following the format \"%Y%m%d%H%M%S\".\n    - max_timestamp_custom_schema: custom schema used on the max_timestamp derivation\n        from the table holding the extraction requests information.\n    \"\"\"\n\n    user: str\n    password: str\n    url: str\n    dbtable: str\n    calc_upper_bound_schema: Optional[str] = None\n    changelog_table: Optional[str] = None\n    partition_column: Optional[str] = None\n    latest_timestamp_data_location: Optional[str] = None\n    latest_timestamp_data_format: str = InputFormat.DELTAFILES.value\n    extraction_type: str = JDBCExtractionType.DELTA.value\n    driver: str = \"com.sap.db.jdbc.Driver\"\n    num_partitions: Optional[int] = None\n    lower_bound: Optional[int | float | str] = None\n    upper_bound: Optional[int | float | str] = None\n    default_upper_bound: str = \"1\"\n    fetch_size: str = \"100000\"\n    compress: bool = True\n    custom_schema: Optional[str] = None\n    min_timestamp: Optional[str] = None\n    max_timestamp: Optional[str] = None\n    generate_predicates: bool = False\n    predicates: Optional[List] = None\n    predicates_add_null: bool = True\n    extraction_timestamp: str = datetime.now(timezone.utc).strftime(\"%Y%m%d%H%M%S\")\n    max_timestamp_custom_schema: Optional[str] = None\n\n\nclass JDBCExtractionUtils(object):\n    \"\"\"Utils for managing data extraction from particularly relevant JDBC sources.\"\"\"\n\n    def __init__(self, jdbc_extraction: Any):\n        \"\"\"Construct JDBCExtractionUtils.\n\n        Args:\n            jdbc_extraction: JDBC Extraction configurations. Can be of type:\n                JDBCExtraction, SAPB4Extraction or SAPBWExtraction.\n        \"\"\"\n        self._LOGGER: Logger = LoggingHandler(__name__).get_logger()\n        self._JDBC_EXTRACTION = jdbc_extraction\n\n    @staticmethod\n    def get_additional_spark_options(\n        input_spec: InputSpec, options: dict, ignore_options: List = None\n    ) -> dict:\n        \"\"\"Helper to get additional Spark Options initially passed.\n\n        If people provide additional Spark options, not covered by the util function\n        arguments (get_spark_jdbc_options), we need to consider them.\n        Thus, we update the options retrieved by the utils, by checking if there is\n        any Spark option initially provided that is not yet considered in the retrieved\n        options or function arguments and if the value for the key is not None.\n        If these conditions are filled, we add the options and return the complete dict.\n\n        Args:\n            input_spec: the input specification.\n            options: dict with Spark options.\n            ignore_options: list of options to be ignored by the process.\n                Spark read has two different approaches to parallelize\n                reading process, one of them is using upper/lower bound,\n                another one is using predicates, those process can't be\n                executed at the same time, you must choose one of them.\n                By choosing predicates you can't pass lower and upper bound,\n                also can't pass number of partitions and partition column\n                otherwise spark will interpret the execution partitioned by\n                upper and lower bound and will expect to fill all variables.\n                To avoid fill all predicates hardcoded at the acon, there is\n                a feature that automatically generates all predicates for init\n                or delta load based on input 
partition column, but at the end\n                of the process, partition column can't be passed to the options,\n                because we are choosing predicates execution, that is why to\n                generate predicates we need to pass some options to ignore.\n\n        Returns:\n             a dict with all the options passed as argument, plus the options that\n             were initially provided, but were not used in the util\n             (get_spark_jdbc_options).\n        \"\"\"\n        func_args = JDBCExtractionUtils.get_spark_jdbc_options.__code__.co_varnames\n\n        if ignore_options is None:\n            ignore_options = []\n        ignore_options = ignore_options + list(options.keys()) + list(func_args)\n\n        return {\n            key: value\n            for key, value in input_spec.options.items()\n            if key not in ignore_options and value is not None\n        }\n\n    def get_predicates(self, predicates_query: str) -> List:\n        \"\"\"Get the predicates list, based on a predicates query.\n\n        Args:\n            predicates_query: query to use as the basis to get the distinct values for\n                a specified column, based on which predicates are generated.\n\n        Returns:\n            List containing the predicates to use to split the extraction from\n            JDBC sources.\n        \"\"\"\n        jdbc_args = {\n            \"url\": self._JDBC_EXTRACTION.url,\n            \"table\": predicates_query,\n            \"properties\": {\n                \"user\": self._JDBC_EXTRACTION.user,\n                \"password\": self._JDBC_EXTRACTION.password,\n                \"driver\": self._JDBC_EXTRACTION.driver,\n            },\n        }\n        from lakehouse_engine.io.reader_factory import ReaderFactory\n\n        predicates_df = ReaderFactory.get_data(\n            InputSpec(\n                spec_id=\"get_predicates\",\n                data_format=InputFormat.JDBC.value,\n                read_type=ReadType.BATCH.value,\n                jdbc_args=jdbc_args,\n            )\n        )\n\n        predicates_list = [\n            f\"{self._JDBC_EXTRACTION.partition_column}='{row[0]}'\"\n            for row in predicates_df.collect()\n        ]\n\n        if self._JDBC_EXTRACTION.predicates_add_null:\n            predicates_list.append(f\"{self._JDBC_EXTRACTION.partition_column} IS NULL\")\n        self._LOGGER.info(\n            f\"The following predicate list was generated: {predicates_list}\"\n        )\n\n        return predicates_list\n\n    def get_spark_jdbc_options(self) -> Tuple[dict, dict]:\n        \"\"\"Get the Spark options to extract data from a JDBC source.\n\n        Returns:\n            The Spark jdbc args dictionary, including the query to submit\n            and also options args dictionary.\n        \"\"\"\n        options_args: Dict[str, Any] = {\n            \"fetchSize\": self._JDBC_EXTRACTION.fetch_size,\n            \"compress\": self._JDBC_EXTRACTION.compress,\n        }\n\n        jdbc_args = {\n            \"url\": self._JDBC_EXTRACTION.url,\n            \"properties\": {\n                \"user\": self._JDBC_EXTRACTION.user,\n                \"password\": self._JDBC_EXTRACTION.password,\n                \"driver\": self._JDBC_EXTRACTION.driver,\n            },\n        }\n\n        if self._JDBC_EXTRACTION.extraction_type == JDBCExtractionType.DELTA.value:\n            jdbc_args[\"table\"], predicates_query = self._get_delta_query()\n        else:\n            jdbc_args[\"table\"], predicates_query = 
self._get_init_query()\n\n        if self._JDBC_EXTRACTION.custom_schema:\n            options_args[\"customSchema\"] = self._JDBC_EXTRACTION.custom_schema\n\n        if self._JDBC_EXTRACTION.generate_predicates:\n            jdbc_args[\"predicates\"] = self.get_predicates(predicates_query)\n        else:\n            if self._JDBC_EXTRACTION.predicates:\n                jdbc_args[\"predicates\"] = self._JDBC_EXTRACTION.predicates\n            else:\n                options_args = self._get_extraction_partition_opts(\n                    options_args,\n                )\n\n        return options_args, jdbc_args\n\n    def get_spark_jdbc_optimal_upper_bound(self) -> Any:\n        \"\"\"Get an optimal upperBound to properly split a Spark JDBC extraction.\n\n        Returns:\n             Either an int, date or timestamp to serve as upperBound Spark JDBC option.\n        \"\"\"\n        options = {}\n        if self._JDBC_EXTRACTION.calc_upper_bound_schema:\n            options[\"customSchema\"] = self._JDBC_EXTRACTION.calc_upper_bound_schema\n\n        table = (\n            self._JDBC_EXTRACTION.dbtable\n            if self._JDBC_EXTRACTION.extraction_type == JDBCExtractionType.INIT.value\n            else self._JDBC_EXTRACTION.changelog_table\n        )\n        jdbc_args = {\n            \"url\": self._JDBC_EXTRACTION.url,\n            \"table\": f\"(SELECT COALESCE(MAX({self._JDBC_EXTRACTION.partition_column}), \"\n            f\"{self._JDBC_EXTRACTION.default_upper_bound}) \"\n            f\"upper_bound FROM {table})\",  # nosec: B608\n            \"properties\": {\n                \"user\": self._JDBC_EXTRACTION.user,\n                \"password\": self._JDBC_EXTRACTION.password,\n                \"driver\": self._JDBC_EXTRACTION.driver,\n            },\n        }\n\n        from lakehouse_engine.io.reader_factory import ReaderFactory\n\n        upper_bound_df = ReaderFactory.get_data(\n            InputSpec(\n                spec_id=\"get_optimal_upper_bound\",\n                data_format=InputFormat.JDBC.value,\n                read_type=ReadType.BATCH.value,\n                jdbc_args=jdbc_args,\n                options=options,\n            )\n        )\n        upper_bound = upper_bound_df.first()[0]\n\n        if upper_bound is not None:\n            self._LOGGER.info(\n                f\"Upper Bound '{upper_bound}' derived from \"\n                f\"'{self._JDBC_EXTRACTION.dbtable}' using the column \"\n                f\"'{self._JDBC_EXTRACTION.partition_column}'\"\n            )\n            return upper_bound\n        else:\n            raise AttributeError(\n                f\"Not able to calculate upper bound from \"\n                f\"'{self._JDBC_EXTRACTION.dbtable}' using \"\n                f\"the column '{self._JDBC_EXTRACTION.partition_column}'\"\n            )\n\n    def _get_extraction_partition_opts(\n        self,\n        options_args: dict,\n    ) -> dict:\n        \"\"\"Get an options dict with custom extraction partition options.\n\n        Args:\n            options_args: spark jdbc reader options.\n        \"\"\"\n        if self._JDBC_EXTRACTION.num_partitions:\n            options_args[\"numPartitions\"] = self._JDBC_EXTRACTION.num_partitions\n        if self._JDBC_EXTRACTION.upper_bound:\n            options_args[\"upperBound\"] = self._JDBC_EXTRACTION.upper_bound\n        if self._JDBC_EXTRACTION.lower_bound:\n            options_args[\"lowerBound\"] = self._JDBC_EXTRACTION.lower_bound\n        if self._JDBC_EXTRACTION.partition_column:\n            
options_args[\"partitionColumn\"] = self._JDBC_EXTRACTION.partition_column\n\n        return options_args\n\n    def _get_max_timestamp(self, max_timestamp_query: str) -> str:\n        \"\"\"Get the max timestamp, based on the provided query.\n\n        Args:\n            max_timestamp_query: the query used to derive the max timestamp.\n\n        Returns:\n            A string having the max timestamp.\n        \"\"\"\n        jdbc_args = {\n            \"url\": self._JDBC_EXTRACTION.url,\n            \"table\": max_timestamp_query,\n            \"properties\": {\n                \"user\": self._JDBC_EXTRACTION.user,\n                \"password\": self._JDBC_EXTRACTION.password,\n                \"driver\": self._JDBC_EXTRACTION.driver,\n            },\n        }\n        from lakehouse_engine.io.reader_factory import ReaderFactory\n\n        max_timestamp_df = ReaderFactory.get_data(\n            InputSpec(\n                spec_id=\"get_max_timestamp\",\n                data_format=InputFormat.JDBC.value,\n                read_type=ReadType.BATCH.value,\n                jdbc_args=jdbc_args,\n                options={\n                    \"customSchema\": self._JDBC_EXTRACTION.max_timestamp_custom_schema\n                },\n            )\n        )\n        max_timestamp = max_timestamp_df.first()[0]\n        self._LOGGER.info(\n            f\"Max timestamp {max_timestamp} derived from query: {max_timestamp_query}\"\n        )\n\n        return str(max_timestamp)\n\n    @abstractmethod\n    def _get_delta_query(self) -> Tuple[str, str]:\n        \"\"\"Get a query to extract delta (partially) from a source.\"\"\"\n        pass\n\n    @abstractmethod\n    def _get_init_query(self) -> Tuple[str, str]:\n        \"\"\"Get a query to extract init (fully) from a source.\"\"\"\n        pass\n"
  },
  {
    "path": "lakehouse_engine/utils/extraction/sap_b4_extraction_utils.py",
    "content": "\"\"\"Utilities module for SAP B4 extraction processes.\"\"\"\n\nimport re\nfrom dataclasses import dataclass\nfrom enum import Enum\nfrom logging import Logger\nfrom typing import Any, Optional, Tuple\n\nfrom lakehouse_engine.core.definitions import InputSpec, ReadType\nfrom lakehouse_engine.transformers.aggregators import Aggregators\nfrom lakehouse_engine.utils.extraction.jdbc_extraction_utils import (\n    JDBCExtraction,\n    JDBCExtractionUtils,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass ADSOTypes(Enum):\n    \"\"\"Standardise the types of ADSOs we can have for Extractions from SAP B4.\"\"\"\n\n    AQ = \"AQ\"\n    CL = \"CL\"\n    SUPPORTED_TYPES = [AQ, CL]\n\n\n@dataclass\nclass SAPB4Extraction(JDBCExtraction):\n    \"\"\"Configurations available for an Extraction from SAP B4.\n\n    It inherits from JDBCExtraction configurations, so it can use\n    and/or overwrite those configurations.\n\n    These configurations cover:\n\n    - latest_timestamp_input_col: the column containing the request timestamps\n        in the dataset in latest_timestamp_data_location. Default: REQTSN.\n    - request_status_tbl: the name of the SAP B4 table having information\n        about the extraction requests. Composed of database.table.\n        Default: SAPHANADB.RSPMREQUEST.\n    - request_col_name: name of the column having the request timestamp to join\n        with the request status table. Default: REQUEST_TSN.\n    - data_target: the data target to extract from. User in the join operation with\n        the request status table.\n    - act_req_join_condition: the join condition into activation table\n        can be changed using this property.\n        Default: 'tbl.reqtsn = req.request_col_name'.\n    - include_changelog_tech_cols: whether to include the technical columns\n        (usually coming from the changelog) table or not.\n    - extra_cols_req_status_tbl: columns to be added from request status table.\n        It needs to contain the prefix \"req.\". E.g. \"req.col1 as column_one,\n        req.col2 as column_two\".\n    - request_status_tbl_filter: filter to use for filtering the request status table,\n        influencing the calculation of the max timestamps and the delta extractions.\n    - adso_type: the type of ADSO that you are extracting from. 
Can be \"AQ\" or \"CL\".\n    - max_timestamp_custom_schema: the custom schema to apply on the calculation of\n        the max timestamp to consider for the delta extractions.\n        Default: timestamp DECIMAL(23,0).\n    - default_max_timestamp: the timestamp to use as default, when it is not possible\n        to derive one.\n    - default_min_timestamp: the timestamp to use as default, when it is not possible\n        to derive one.\n    - custom_schema: specify custom_schema for particular columns of the\n        returned dataframe in the init/delta extraction of the source table.\n    \"\"\"\n\n    latest_timestamp_input_col: str = \"REQTSN\"\n    request_status_tbl: str = \"SAPHANADB.RSPMREQUEST\"\n    request_col_name: str = \"REQUEST_TSN\"\n    data_target: Optional[str] = None\n    act_req_join_condition: Optional[str] = None\n    include_changelog_tech_cols: Optional[bool] = None\n    extra_cols_req_status_tbl: Optional[str] = None\n    request_status_tbl_filter: Optional[str] = None\n    adso_type: Optional[str] = None\n    max_timestamp_custom_schema: str = \"timestamp DECIMAL(23,0)\"\n    default_max_timestamp: str = \"1970000000000000000000\"\n    default_min_timestamp: str = \"1970000000000000000000\"\n    custom_schema: str = \"REQTSN DECIMAL(23,0)\"\n\n\nclass SAPB4ExtractionUtils(JDBCExtractionUtils):\n    \"\"\"Utils for managing data extraction from SAP B4.\"\"\"\n\n    def __init__(self, sap_b4_extraction: SAPB4Extraction):\n        \"\"\"Construct SAPB4ExtractionUtils.\n\n        Args:\n            sap_b4_extraction: SAP B4 Extraction configurations.\n        \"\"\"\n        self._LOGGER: Logger = LoggingHandler(__name__).get_logger()\n        self._B4_EXTRACTION = sap_b4_extraction\n        self._B4_EXTRACTION.request_status_tbl_filter = (\n            self._get_req_status_tbl_filter()\n        )\n        self._MAX_TIMESTAMP_QUERY = f\"\"\" --# nosec\n                (SELECT COALESCE(MAX({self._B4_EXTRACTION.request_col_name}),\n                    {self._B4_EXTRACTION.default_max_timestamp}) as timestamp\n                FROM {self._B4_EXTRACTION.request_status_tbl}\n                WHERE {self._B4_EXTRACTION.request_status_tbl_filter})\n            \"\"\"  # nosec: B608\n        super().__init__(sap_b4_extraction)\n\n    @staticmethod\n    def get_data_target(input_spec_opt: dict) -> str:\n        \"\"\"Get the data_target from the data_target option or derive it.\n\n        By definition data_target is the same for the table and changelog table and\n        is the same string ignoring everything before / and the first and last\n        character after /. E.g. for a dbtable /BIC/abtable12, the data_target\n        would be btable1.\n\n        Args:\n            input_spec_opt: options from the input_spec.\n\n        Returns:\n            A string with the data_target.\n        \"\"\"\n        exclude_chars = \"\"\"[\"'\\\\\\\\]\"\"\"\n        data_target: str = input_spec_opt.get(\n            \"data_target\",\n            re.sub(exclude_chars, \"\", input_spec_opt[\"dbtable\"]).split(\"/\")[-1][1:-1],\n        )\n\n        return data_target\n\n    def _get_init_query(self) -> Tuple[str, str]:\n        \"\"\"Get a query to do an init load based on a ADSO on a SAP B4 system.\n\n        Returns:\n            A query to submit to SAP B4 for the initial data extraction. 
The query\n            is enclosed in parentheses so that Spark treats it as a table and supports\n            it in the dbtable option.\n        \"\"\"\n        extraction_query = self._get_init_extraction_query()\n\n        predicates_query = f\"\"\"\n        (SELECT DISTINCT({self._B4_EXTRACTION.partition_column})\n        FROM {self._B4_EXTRACTION.dbtable} t)\n        \"\"\"  # nosec: B608\n\n        return extraction_query, predicates_query\n\n    def _get_init_extraction_query(self) -> str:\n        \"\"\"Get the init extraction query based on current timestamp.\n\n        Returns:\n            A query to submit to SAP B4 for the initial data extraction.\n        \"\"\"\n        changelog_tech_cols = (\n            f\"\"\"{self._B4_EXTRACTION.extraction_timestamp}000000000 AS reqtsn,\n                '0' AS datapakid,\n                0 AS record,\"\"\"\n            if self._B4_EXTRACTION.include_changelog_tech_cols\n            else \"\"\n        )\n\n        extraction_query = f\"\"\"\n                (SELECT t.*, {changelog_tech_cols}\n                    CAST({self._B4_EXTRACTION.extraction_timestamp}\n                        AS DECIMAL(15,0)) AS extraction_start_timestamp\n                FROM {self._B4_EXTRACTION.dbtable} t\n                )\"\"\"  # nosec: B608\n\n        return extraction_query\n\n    def _get_delta_query(self) -> Tuple[str, str]:\n        \"\"\"Get a delta query for an SAP B4 ADSO.\n\n        An SAP B4 ADSO requires a join with a special type of table often called\n        requests status table (RSPMREQUEST), in which B4 tracks down the timestamps,\n        status and metrics associated with the several data loads that were performed\n        into B4. Depending on the type of ADSO (AQ or CL) the join condition and also\n        the ADSO/table to consider to extract from will be different.\n        For AQ types, there is only the active table, from which we extract both inits\n        and deltas and this is also the table used to join with RSPMREQUEST to derive\n        the next portion of the data to extract.\n        For the CL types, we have an active table/adso from which we extract the init\n        and one changelog table from which we extract the delta portions of data.\n        Depending, if it is an init or delta one table or the other is also used to join\n        with RSPMREQUEST.\n\n        The logic on this function basically ensures that we are reading from the source\n        table considering the data that has arrived between the maximum timestamp that\n        is available in our target destination and the max timestamp of the extractions\n        performed and registered in the RSPMREQUEST table, which follow the filtering\n         criteria.\n\n        Returns:\n            A query to submit to SAP B4 for the delta data extraction. 
The query\n            is enclosed in parentheses so that Spark treats it as a table and supports\n            it in the dbtable option.\n        \"\"\"\n        if not self._B4_EXTRACTION.min_timestamp:\n            from lakehouse_engine.io.reader_factory import ReaderFactory\n\n            latest_timestamp_data_df = ReaderFactory.get_data(\n                InputSpec(\n                    spec_id=\"data_with_latest_timestamp\",\n                    data_format=self._B4_EXTRACTION.latest_timestamp_data_format,\n                    read_type=ReadType.BATCH.value,\n                    location=self._B4_EXTRACTION.latest_timestamp_data_location,\n                )\n            )\n            min_timestamp = latest_timestamp_data_df.transform(\n                Aggregators.get_max_value(\n                    self._B4_EXTRACTION.latest_timestamp_input_col\n                )\n            ).first()[0]\n        else:\n            min_timestamp = self._B4_EXTRACTION.min_timestamp\n\n        min_timestamp = (\n            min_timestamp\n            if min_timestamp\n            else self._B4_EXTRACTION.default_min_timestamp\n        )\n\n        max_timestamp = (\n            self._B4_EXTRACTION.max_timestamp\n            if self._B4_EXTRACTION.max_timestamp\n            else self._get_max_timestamp(self._MAX_TIMESTAMP_QUERY)\n        )\n\n        if self._B4_EXTRACTION.act_req_join_condition:\n            join_condition = f\"{self._B4_EXTRACTION.act_req_join_condition}\"\n        else:\n            join_condition = f\"tbl.reqtsn = req.{self._B4_EXTRACTION.request_col_name}\"\n\n        base_query = f\"\"\" --# nosec\n        FROM {self._B4_EXTRACTION.changelog_table} AS tbl\n        JOIN {self._B4_EXTRACTION.request_status_tbl} AS req\n            ON {join_condition}\n        WHERE {self._B4_EXTRACTION.request_status_tbl_filter}\n            AND req.{self._B4_EXTRACTION.request_col_name} > {min_timestamp}\n            AND req.{self._B4_EXTRACTION.request_col_name} <= {max_timestamp})\n        \"\"\"\n\n        main_cols = f\"\"\"\n            (SELECT tbl.*,\n                CAST({self._B4_EXTRACTION.extraction_timestamp} AS DECIMAL(15,0))\n                    AS extraction_start_timestamp\n            \"\"\"\n\n        # We join the main columns considered for the extraction with\n        # extra_cols_act_request that people might want to use, filtering to only\n        # add the comma and join the strings, in case extra_cols_act_request is\n        # not None or empty.\n        extraction_query_cols = \",\".join(\n            filter(None, [main_cols, self._B4_EXTRACTION.extra_cols_req_status_tbl])\n        )\n\n        extraction_query = extraction_query_cols + base_query\n\n        predicates_query = f\"\"\"\n        (SELECT DISTINCT({self._B4_EXTRACTION.partition_column})\n        {base_query}\n        \"\"\"\n\n        return extraction_query, predicates_query\n\n    def _get_req_status_tbl_filter(self) -> Any:\n        if self._B4_EXTRACTION.request_status_tbl_filter:\n            return self._B4_EXTRACTION.request_status_tbl_filter\n        else:\n            if self._B4_EXTRACTION.adso_type == ADSOTypes.AQ.value:\n                return f\"\"\"\n                    STORAGE = 'AQ' AND REQUEST_IS_IN_PROCESS = 'N' AND\n                    LAST_OPERATION_TYPE IN ('C', 'U') AND REQUEST_STATUS IN ('GG', 'GR')\n                    AND UPPER(DATATARGET) = UPPER('{self._B4_EXTRACTION.data_target}')\n                \"\"\"\n            elif self._B4_EXTRACTION.adso_type == ADSOTypes.CL.value:\n         
       return f\"\"\"\n                    STORAGE = 'AT' AND REQUEST_IS_IN_PROCESS = 'N' AND\n                    LAST_OPERATION_TYPE IN ('C', 'U') AND REQUEST_STATUS IN ('GG')\n                    AND UPPER(DATATARGET) = UPPER('{self._B4_EXTRACTION.data_target}')\n                \"\"\"\n            else:\n                raise NotImplementedError(\n                    f\"The requested ADSO Type is not fully implemented and/or tested.\"\n                    f\"Supported ADSO Types: {ADSOTypes.SUPPORTED_TYPES}\"\n                )\n"
  },
  {
    "path": "lakehouse_engine/utils/extraction/sap_bw_extraction_utils.py",
    "content": "\"\"\"Utilities module for SAP BW extraction processes.\"\"\"\n\nfrom dataclasses import dataclass\nfrom logging import Logger\nfrom typing import Optional, Tuple\n\nfrom lakehouse_engine.core.definitions import InputFormat, InputSpec, ReadType\nfrom lakehouse_engine.transformers.aggregators import Aggregators\nfrom lakehouse_engine.utils.extraction.jdbc_extraction_utils import (\n    JDBCExtraction,\n    JDBCExtractionType,\n    JDBCExtractionUtils,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\n@dataclass\nclass SAPBWExtraction(JDBCExtraction):\n    \"\"\"Configurations available for an Extraction from SAP BW.\n\n    It inherits from SAPBWExtraction configurations, so it can use\n    and/or overwrite those configurations.\n\n    These configurations cover:\n\n    - latest_timestamp_input_col: the column containing the actrequest timestamp\n        in the dataset in latest_timestamp_data_location. Default:\n        \"actrequest_timestamp\".\n    - act_request_table: the name of the SAP BW activation requests table.\n        Composed of database.table. Default: SAPPHA.RSODSACTREQ.\n    - request_col_name: name of the column having the request to join\n        with the activation request table. Default: actrequest.\n    - act_req_join_condition: the join condition into activation table\n        can be changed using this property.\n        Default: 'changelog_tbl.request = act_req.request_col_name'.\n    - odsobject: name of BW Object, used for joining with the activation request\n        table to get the max actrequest_timestamp to consider while filtering\n        the changelog table.\n    - include_changelog_tech_cols: whether to include the technical columns\n        (usually coming from the changelog) table or not. Default: True.\n    - extra_cols_act_request: list of columns to be added from act request table.\n        It needs to contain the prefix \"act_req.\". E.g. \"act_req.col1\n        as column_one, act_req.col2 as column_two\".\n    - get_timestamp_from_act_request: whether to get init timestamp\n        from act request table or assume current/given timestamp.\n    - sap_bw_schema: sap bw schema. Default: SAPPHA.\n    - max_timestamp_custom_schema: the custom schema to apply on the calculation of\n        the max timestamp to consider for the delta extractions.\n        Default: timestamp DECIMAL(23,0).\n    - default_max_timestamp: the timestamp to use as default, when it is not possible\n        to derive one.\n    - default_min_timestamp: the timestamp to use as default, when it is not possible\n        to derive one.\n    - ods_prefix: the prefix to use when looking for the changelog table in SAP BW.\n         Default: \"8\".\n     - logsys: the BW source & receiver system ID to use to get the tsprefix\n        (prefix for transfer structures) which is used while deriving the changelog\n        table. 
Default: None & generated based on the schema.\n    \"\"\"\n\n    latest_timestamp_input_col: str = \"actrequest_timestamp\"\n    request_col_name: str = \"actrequest\"\n    act_req_join_condition: Optional[str] = None\n    odsobject: Optional[str] = None\n    include_changelog_tech_cols: bool = True\n    extra_cols_act_request: Optional[str] = None\n    get_timestamp_from_act_request: bool = False\n    sap_bw_schema: str = \"SAPPHA\"\n    act_request_table: str = f\"{sap_bw_schema}.RSODSACTREQ\"\n    max_timestamp_custom_schema: str = \"timestamp DECIMAL(15,0)\"\n    default_max_timestamp: str = \"197000000000000\"\n    default_min_timestamp: str = \"197000000000000\"\n    ods_prefix: str = \"8\"\n    logsys: Optional[str] = None\n    custom_schema: Optional[str] = \"REQUEST VARCHAR(30), DATAPAKID VARCHAR(6)\"\n\n\nclass SAPBWExtractionUtils(JDBCExtractionUtils):\n    \"\"\"Utils for managing data extraction from particularly relevant JDBC sources.\"\"\"\n\n    def __init__(self, sap_bw_extraction: SAPBWExtraction):\n        \"\"\"Construct SAPBWExtractionUtils.\n\n        Args:\n            sap_bw_extraction: SAP BW Extraction configurations.\n        \"\"\"\n        self._LOGGER: Logger = LoggingHandler(__name__).get_logger()\n        self._BW_EXTRACTION = sap_bw_extraction\n        self._BW_EXTRACTION.changelog_table = self.get_changelog_table()\n        self._MAX_TIMESTAMP_QUERY = f\"\"\" --# nosec\n                (SELECT COALESCE(MAX(timestamp),\n                    {self._BW_EXTRACTION.default_max_timestamp}) as timestamp\n                FROM {self._BW_EXTRACTION.act_request_table}\n                WHERE odsobject = '{self._BW_EXTRACTION.odsobject}'\n                 AND operation = 'A' AND status = '0')\n            \"\"\"  # nosec: B608\n        super().__init__(sap_bw_extraction)\n\n    def get_changelog_table(self) -> str:\n        \"\"\"Get the changelog table, given an odsobject.\n\n        Returns:\n             String to use as changelog_table.\n        \"\"\"\n        if (\n            self._BW_EXTRACTION.odsobject is not None\n            and self._BW_EXTRACTION.changelog_table is None\n            and self._BW_EXTRACTION.extraction_type != JDBCExtractionType.INIT.value\n        ):\n            logsys_cond = self.get_logsys_cond()\n            prefix = self._BW_EXTRACTION.ods_prefix\n            odsobject = self._BW_EXTRACTION.odsobject\n\n            if self._BW_EXTRACTION.sap_bw_schema:\n                system_table = f\"{self._BW_EXTRACTION.sap_bw_schema}.RSTSODS\"\n                pref_table = f\"{self._BW_EXTRACTION.sap_bw_schema}.RSBASIDOC\"\n            else:\n                system_table = \"RSTSODS\"\n                pref_table = \"RSBASIDOC\"\n\n            query = f\"\"\"\n                    (SELECT ODSNAME_TECH\n                    FROM {system_table} o\n                    JOIN {pref_table} p ON {logsys_cond}\n                    AND o.ODSNAME = '{prefix}{odsobject}_' || p.tsprefix\n                    AND USERAPP = 'CHANGELOG' AND VERSION = '000')\n                \"\"\"  # nosec: B608\n            self._LOGGER.info(\n                f\"Deriving changelog_table using the following query: {query}\"\n            )\n            jdbc_args = {\n                \"url\": self._BW_EXTRACTION.url,\n                \"table\": query,\n                \"properties\": {\n                    \"user\": self._BW_EXTRACTION.user,\n                    \"password\": self._BW_EXTRACTION.password,\n                    \"driver\": self._BW_EXTRACTION.driver,\n                
},\n            }\n            from lakehouse_engine.io.reader_factory import ReaderFactory\n\n            changelog_df = ReaderFactory.get_data(\n                InputSpec(\n                    spec_id=\"changelog_table\",\n                    data_format=InputFormat.JDBC.value,\n                    read_type=ReadType.BATCH.value,\n                    jdbc_args=jdbc_args,\n                )\n            )\n            changelog_tbl_nbr = changelog_df.count()\n            if changelog_tbl_nbr > 1:\n                raise ValueError(\n                    f\"More than one changelog table found for {odsobject}.\"\n                    f\"Aborting. {changelog_df.show()}\"\n                )\n            if changelog_tbl_nbr == 0:\n                raise ValueError(f\"No changelog table found for {odsobject}. Aborting.\")\n\n            changelog_table = (\n                f'{self._BW_EXTRACTION.sap_bw_schema}.\"{changelog_df.first()[0]}\"'\n                if self._BW_EXTRACTION.sap_bw_schema\n                else str(changelog_df.first()[0])\n            )\n        else:\n            changelog_table = (\n                self._BW_EXTRACTION.changelog_table\n                if self._BW_EXTRACTION.changelog_table\n                else f\"{self._BW_EXTRACTION.dbtable}_cl\"\n            )\n        self._LOGGER.info(f\"The changelog table derived is: '{changelog_table}'\")\n\n        return changelog_table\n\n    @staticmethod\n    def get_odsobject(input_spec_opt: dict) -> str:\n        \"\"\"Get the odsobject based on the provided options.\n\n        With the table name we may also get the db name, so we need to split.\n        Moreover, there might be the need for people to specify odsobject if\n        it is different from the dbtable.\n\n        Args:\n            input_spec_opt: options from the input_spec.\n\n        Returns:\n            A string with the odsobject.\n        \"\"\"\n        return str(\n            input_spec_opt[\"dbtable\"].split(\".\")[1]\n            if len(input_spec_opt[\"dbtable\"].split(\".\")) > 1\n            else input_spec_opt[\"dbtable\"]\n        )\n\n    def get_logsys_cond(self) -> str:\n        \"\"\"Get logsys condition to join & get the tsprefix for the changelog derivation.\n\n        Usually the condition on the else is enough.\n\n        Returns:\n            The logsys condition.\n        \"\"\"\n        if self._BW_EXTRACTION.logsys:\n            logsys = self._BW_EXTRACTION.logsys\n            return f\"p.slogsys = '{logsys}' AND p.rlogsys = '{logsys}'\"\n        else:\n            return \"p.slogsys = p.rlogsys\"\n\n    def _get_init_query(self) -> Tuple[str, str]:\n        \"\"\"Get a query to do an init load based on a DSO on a SAP BW system.\n\n        Returns:\n            A query to submit to SAP BW for the initial data extraction. The query\n            is enclosed in parentheses so that Spark treats it as a table and supports\n            it in the dbtable option.\n        \"\"\"\n        if self._BW_EXTRACTION.get_timestamp_from_act_request:\n            # check if we are dealing with a DSO of type Write Optimised\n            if self._BW_EXTRACTION.dbtable == self._BW_EXTRACTION.changelog_table:\n                extraction_query = self._get_init_extraction_query_act_req_timestamp()\n            else:\n                raise AttributeError(\n                    \"Not able to get the extraction query. 
The option \"\n                    \"'get_timestamp_from_act_request' is only \"\n                    \"available/useful for DSOs of type Write Optimised.\"\n                )\n        else:\n            extraction_query = self._get_init_extraction_query()\n\n        predicates_query = f\"\"\"\n        (SELECT DISTINCT({self._BW_EXTRACTION.partition_column})\n        FROM {self._BW_EXTRACTION.dbtable} t)\n        \"\"\"  # nosec: B608\n\n        return extraction_query, predicates_query\n\n    def _get_init_extraction_query(self) -> str:\n        \"\"\"Get extraction query based on given/current timestamp.\n\n        Returns:\n            A query to submit to SAP BW for the initial data extraction.\n        \"\"\"\n        changelog_tech_cols = (\n            f\"\"\"'0' AS request,\n                CAST({self._BW_EXTRACTION.extraction_timestamp} AS DECIMAL(15, 0))\n                 AS actrequest_timestamp,\n                '0' AS datapakid,\n                0 AS partno,\n                0 AS record,\"\"\"\n            if self._BW_EXTRACTION.include_changelog_tech_cols\n            else f\"CAST({self._BW_EXTRACTION.extraction_timestamp} \"\n            f\"AS DECIMAL(15, 0))\"\n            f\" AS actrequest_timestamp,\"\n        )\n\n        extraction_query = f\"\"\"\n                (SELECT t.*,\n                    {changelog_tech_cols}\n                    CAST({self._BW_EXTRACTION.extraction_timestamp}\n                        AS DECIMAL(15, 0)) AS extraction_start_timestamp\n                FROM {self._BW_EXTRACTION.dbtable} t\n                )\"\"\"  # nosec: B608\n\n        return extraction_query\n\n    def _get_init_extraction_query_act_req_timestamp(self) -> str:\n        \"\"\"Get extraction query assuming the init timestamp from act_request table.\n\n        Returns:\n            A query to submit to SAP BW for the initial data extraction from\n            write optimised DSOs, receiving the actrequest_timestamp from\n            the activation requests table.\n        \"\"\"\n        extraction_query = f\"\"\"\n            (SELECT t.*,\n                act_req.timestamp as actrequest_timestamp,\n                CAST({self._BW_EXTRACTION.extraction_timestamp} AS DECIMAL(15, 0))\n                 AS extraction_start_timestamp\n            FROM {self._BW_EXTRACTION.dbtable} t\n            JOIN {self._BW_EXTRACTION.act_request_table} AS act_req ON\n                t.request = act_req.{self._BW_EXTRACTION.request_col_name}\n            WHERE act_req.odsobject = '{self._BW_EXTRACTION.odsobject}'\n                AND operation = 'A' AND status = '0'\n            )\"\"\"  # nosec: B608\n\n        return extraction_query\n\n    def _get_delta_query(self) -> Tuple[str, str]:\n        \"\"\"Get a delta query for an SAP BW DSO.\n\n        An SAP BW DSO requires a join with a special type of table often called\n        activation requests table, in which BW tracks down the timestamps associated\n        with the several data loads that were performed into BW. Because the changelog\n        table only contains the active request id, and that cannot be sorted by the\n        downstream consumers to figure out the latest change, we need to join the\n        changelog table with this special table to get the activation requests\n        timestamps to then use them to figure out the latest changes in the delta load\n        logic afterwards.\n\n        Additionally, we also need to know which was the latest timestamp already loaded\n        into the lakehouse bronze layer. 
The latest timestamp should always be available\n        in the bronze dataset itself or in a dataset that tracks down all the actrequest\n        timestamps that were already loaded. So we get the max value out of the\n        respective actrequest timestamp column in that dataset.\n\n        Returns:\n            A query to submit to SAP BW for the delta data extraction. The query\n            is enclosed in parentheses so that Spark treats it as a table and supports\n            it in the dbtable option.\n        \"\"\"\n        if not self._BW_EXTRACTION.min_timestamp:\n            from lakehouse_engine.io.reader_factory import ReaderFactory\n\n            latest_timestamp_data_df = ReaderFactory.get_data(\n                InputSpec(\n                    spec_id=\"data_with_latest_timestamp\",\n                    data_format=self._BW_EXTRACTION.latest_timestamp_data_format,\n                    read_type=ReadType.BATCH.value,\n                    location=self._BW_EXTRACTION.latest_timestamp_data_location,\n                )\n            )\n            min_timestamp = latest_timestamp_data_df.transform(\n                Aggregators.get_max_value(\n                    self._BW_EXTRACTION.latest_timestamp_input_col\n                )\n            ).first()[0]\n        else:\n            min_timestamp = self._BW_EXTRACTION.min_timestamp\n\n        max_timestamp = (\n            self._BW_EXTRACTION.max_timestamp\n            if self._BW_EXTRACTION.max_timestamp\n            else self._get_max_timestamp(self._MAX_TIMESTAMP_QUERY)\n        )\n\n        if self._BW_EXTRACTION.act_req_join_condition:\n            join_condition = f\"{self._BW_EXTRACTION.act_req_join_condition}\"\n        else:\n            join_condition = (\n                f\"changelog_tbl.request = \"\n                f\"act_req.{self._BW_EXTRACTION.request_col_name}\"\n            )\n\n        base_query = f\"\"\" --# nosec\n        FROM {self._BW_EXTRACTION.changelog_table} AS changelog_tbl\n        JOIN {self._BW_EXTRACTION.act_request_table} AS act_req\n            ON {join_condition}\n        WHERE act_req.odsobject = '{self._BW_EXTRACTION.odsobject}'\n            AND act_req.timestamp > {min_timestamp}\n            AND act_req.timestamp <= {max_timestamp}\n            AND operation = 'A' AND status = '0')\n        \"\"\"\n\n        main_cols = f\"\"\"\n            (SELECT changelog_tbl.*,\n                act_req.TIMESTAMP AS actrequest_timestamp,\n                CAST({self._BW_EXTRACTION.extraction_timestamp} AS DECIMAL(15,0))\n                    AS extraction_start_timestamp\n            \"\"\"\n        # We join the main columns considered for the extraction with\n        # extra_cols_act_request that people might want to use, filtering to only\n        # add the comma and join the strings, in case extra_cols_act_request is\n        # not None or empty.\n        extraction_query_cols = \",\".join(\n            filter(None, [main_cols, self._BW_EXTRACTION.extra_cols_act_request])\n        )\n\n        extraction_query = extraction_query_cols + base_query\n\n        predicates_query = f\"\"\"\n        (SELECT DISTINCT({self._BW_EXTRACTION.partition_column})\n        {base_query}\n        \"\"\"\n\n        return extraction_query, predicates_query\n"
  },
  {
    "path": "lakehouse_engine/utils/extraction/sftp_extraction_utils.py",
    "content": "\"\"\"Utilities module for SFTP extraction processes.\"\"\"\n\nimport stat\nfrom base64 import decodebytes\nfrom datetime import datetime\nfrom enum import Enum\nfrom logging import Logger\nfrom stat import S_ISREG\nfrom typing import Any, List, Set, Tuple\n\nimport paramiko as p\nfrom paramiko import Ed25519Key, PKey, RSAKey, Transport\nfrom paramiko.sftp_client import SFTPAttributes, SFTPClient  # type: ignore\n\nfrom lakehouse_engine.transformers.exceptions import WrongArgumentsException\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass SFTPInputFormat(Enum):\n    \"\"\"Formats of algorithm input.\"\"\"\n\n    CSV = \"csv\"\n    FWF = \"fwf\"\n    JSON = \"json\"\n    XML = \"xml\"\n\n\nclass SFTPExtractionFilter(Enum):\n    \"\"\"Standardize the types of filters we can have from a SFTP source.\"\"\"\n\n    file_name_contains = \"file_name_contains\"\n    LATEST_FILE = \"latest_file\"\n    EARLIEST_FILE = \"earliest_file\"\n    GREATER_THAN = \"date_time_gt\"\n    LOWER_THAN = \"date_time_lt\"\n\n\nclass SFTPExtractionUtils(object):\n    \"\"\"Utils for managing data extraction from particularly relevant SFTP sources.\"\"\"\n\n    _logger: Logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def get_files_list(\n        cls, sftp: SFTPClient, remote_path: str, options_args: dict\n    ) -> Set[str]:\n        \"\"\"Get a list of files to be extracted from SFTP.\n\n        The arguments (options_args) to list files are:\n\n        - date_time_gt(str):\n            Filter the files greater than the string datetime\n            formatted as \"YYYY-MM-DD\" or \"YYYY-MM-DD HH:MM:SS\".\n        - date_time_lt(str):\n            Filter the files lower than the string datetime\n            formatted as \"YYYY-MM-DD\" or \"YYYY-MM-DD HH:MM:SS\".\n        - earliest_file(bool):\n            Filter the earliest dated file in the directory.\n        - file_name_contains(str):\n            Filter files when match the pattern.\n        - latest_file(bool):\n            Filter the most recent dated file in the directory.\n        - sub_dir(bool):\n            When true, the engine will search files into subdirectories\n            of the remote_path.\n            It will consider one level below the remote_path.\n            When sub_dir is used with latest_file/earliest_file argument,\n            the engine will retrieve the latest_file/earliest_file\n            for each subdirectory.\n\n        Args:\n            sftp: the SFTP client object.\n            remote_path: path of files to be filtered.\n            options_args: options from the acon.\n\n        Returns:\n            A list containing the file names to be passed to Spark.\n        \"\"\"\n        all_items, folder_path = cls._get_folder_items(remote_path, sftp, options_args)\n\n        filtered_files: Set[str] = set()\n\n        try:\n            for item, folder in zip(all_items, folder_path):\n                file_contains = cls._file_has_pattern(item, options_args)\n                file_in_interval = cls._file_in_date_interval(item, options_args)\n                if file_contains and file_in_interval:\n                    filtered_files.add(folder + item.filename)\n\n            if (\n                SFTPExtractionFilter.EARLIEST_FILE.value in options_args.keys()\n                or SFTPExtractionFilter.LATEST_FILE.value in options_args.keys()\n            ):\n                filtered_files = cls._get_earliest_latest_file(\n                    sftp, options_args, 
filtered_files, folder_path\n                )\n\n        except Exception as e:\n            cls._logger.error(f\"SFTP list_files EXCEPTION: - {e}\")\n        return filtered_files\n\n    @classmethod\n    def get_sftp_client(\n        cls,\n        options_args: dict,\n    ) -> Tuple[SFTPClient, Transport]:\n        \"\"\"Get the SFTP client.\n\n        The SFTP client is used to open an SFTP session across an open\n        SSH Transport and perform remote file operations.\n\n        Args:\n            options_args: dictionary containing SFTP connection parameters.\n                The Paramiko arguments expected to connect are:\n\n                - \"hostname\": the server to connect to.\n                - \"port\": the server port to connect to.\n                - \"username\": the username to authenticate as.\n                - \"password\": used for password authentication.\n                - \"pkey\": optional - an optional public key to use for\n                    authentication.\n                - \"passphrase\" – optional - options used for decrypting private\n                    keys.\n                - \"key_filename\" – optional - the filename, or list of filenames,\n                    of optional private key(s) and/or certs to try for\n                    authentication.\n                - \"timeout\" – an optional timeout (in seconds) for the TCP connect.\n                - \"allow_agent\" – optional - set to False to disable\n                    connecting to the SSH agent.\n                - \"look_for_keys\" – optional - set to False to disable searching\n                    for discoverable private key files in ~/.ssh/.\n                - \"compress\" – optional - set to True to turn on compression.\n                - \"sock\" - optional - an open socket or socket-like object\n                    to use for communication to the target host.\n                - \"gss_auth\" – optional - True if you want to use GSS-API\n                    authentication.\n                - \"gss_kex\" – optional - Perform GSS-API Key Exchange and\n                    user authentication.\n                - \"gss_deleg_creds\" – optional - Delegate GSS-API client\n                    credentials or not.\n                - \"gss_host\" – optional - The targets name in the kerberos database.\n                - \"gss_trust_dns\" – optional - Indicates whether or\n                    not the DNS is trusted to securely canonicalize the name of the\n                    host being connected to (default True).\n                - \"banner_timeout\" – an optional timeout (in seconds)\n                    to wait for the SSH banner to be presented.\n                - \"auth_timeout\" – an optional timeout (in seconds)\n                    to wait for an authentication response.\n                - \"disabled_algorithms\" – an optional dict passed directly to\n                    Transport and its keyword argument of the same name.\n                - \"transport_factory\" – an optional callable which is handed a\n                    subset of the constructor arguments (primarily those related\n                    to the socket, GSS functionality, and algorithm selection)\n                    and generates a Transport instance to be used by this client.\n                    Defaults to Transport.__init__.\n\n                The parameter to specify the private key is expected to be in\n                RSA format. 
Attempting a connection with a blank host key is\n                not allowed unless the argument \"add_auto_policy\" is explicitly\n                set to True.\n\n        Returns:\n            sftp -> a new SFTPClient session object.\n            transport -> the Transport for this connection.\n        \"\"\"\n        ssh_client = p.SSHClient()\n        try:\n            if not options_args.get(\"pkey\") and not options_args.get(\"add_auto_policy\"):\n                raise WrongArgumentsException(\n                    \"Get SFTP Client: No host key (pkey) was provided and the \"\n                    + \"add_auto_policy property is false.\"\n                )\n\n            if options_args.get(\"pkey\") and not options_args.get(\"key_type\"):\n                raise WrongArgumentsException(\n                    \"Get SFTP Client: The key_type must be provided when \"\n                    + \"the host key (pkey) is provided.\"\n                )\n\n            if options_args.get(\"pkey\", None) and options_args.get(\"key_type\", None):\n                key = cls._get_host_keys(\n                    options_args.get(\"pkey\", None), options_args.get(\"key_type\", None)\n                )\n                ssh_client.get_host_keys().add(\n                    hostname=f\"[{options_args.get('hostname')}]:\"\n                    + f\"{options_args.get('port')}\",\n                    keytype=\"ssh-rsa\",\n                    key=key,\n                )\n            elif options_args.get(\"add_auto_policy\", None):\n                ssh_client.load_system_host_keys()\n                ssh_client.set_missing_host_key_policy(p.WarningPolicy())  # nosec: B507\n            else:\n                ssh_client.load_system_host_keys()\n                ssh_client.set_missing_host_key_policy(p.RejectPolicy())\n\n            ssh_client.connect(\n                hostname=options_args.get(\"hostname\"),\n                port=options_args.get(\"port\", 22),\n                username=options_args.get(\"username\", None),\n                password=options_args.get(\"password\", None),\n                key_filename=options_args.get(\"key_filename\", None),\n                timeout=options_args.get(\"timeout\", None),\n                allow_agent=options_args.get(\"allow_agent\", True),\n                look_for_keys=options_args.get(\"look_for_keys\", True),\n                compress=options_args.get(\"compress\", False),\n                sock=options_args.get(\"sock\", None),\n                gss_auth=options_args.get(\"gss_auth\", False),\n                gss_kex=options_args.get(\"gss_kex\", False),\n                gss_deleg_creds=options_args.get(\"gss_deleg_creds\", False),\n                gss_host=options_args.get(\"gss_host\", False),\n                banner_timeout=options_args.get(\"banner_timeout\", None),\n                auth_timeout=options_args.get(\"auth_timeout\", None),\n                gss_trust_dns=options_args.get(\"gss_trust_dns\", None),\n                passphrase=options_args.get(\"passphrase\", None),\n                disabled_algorithms=options_args.get(\"disabled_algorithms\", None),\n                transport_factory=options_args.get(\"transport_factory\", None),\n            )\n\n            sftp = ssh_client.open_sftp()\n            transport = ssh_client.get_transport()\n        except ConnectionError as e:\n            cls._logger.error(e)\n            raise\n        return sftp, transport\n\n    @classmethod\n    def validate_format(cls, files_format: str) -> str:\n        
\"\"\"Validate the file extension based on the format definitions.\n\n        Args:\n            files_format: a string containing the file extension.\n\n        Returns:\n            The string validated and formatted.\n        \"\"\"\n        formats_allowed = [\n            SFTPInputFormat.CSV.value,\n            SFTPInputFormat.FWF.value,\n            SFTPInputFormat.JSON.value,\n            SFTPInputFormat.XML.value,\n        ]\n\n        if files_format not in formats_allowed:\n            raise WrongArgumentsException(\n                f\"The formats allowed for SFTP are {formats_allowed}.\"\n            )\n\n        return files_format\n\n    @classmethod\n    def validate_location(cls, location: str) -> str:\n        \"\"\"Validate the location. Add \"/\" in the case it does not exist.\n\n        Args:\n            location: file path.\n\n        Returns:\n            The location validated.\n        \"\"\"\n        return location if location.rfind(\"/\") == len(location) - 1 else location + \"/\"\n\n    @classmethod\n    def _file_has_pattern(cls, item: SFTPAttributes, options_args: dict) -> bool:\n        \"\"\"Check if a file follows the pattern used for filtering.\n\n        Args:\n            item: item available in SFTP directory.\n            options_args: options from the acon.\n\n        Returns:\n            A boolean telling whether the file contains a pattern or not.\n        \"\"\"\n        file_to_consider = True\n\n        if SFTPExtractionFilter.file_name_contains.value in options_args.keys():\n            if not (\n                options_args.get(SFTPExtractionFilter.file_name_contains.value)\n                in item.filename\n                and (S_ISREG(item.st_mode) or cls._is_compressed(item.filename))\n            ):\n                file_to_consider = False\n\n        return file_to_consider\n\n    @classmethod\n    def _file_in_date_interval(\n        cls,\n        item: SFTPAttributes,\n        options_args: dict,\n    ) -> bool:\n        \"\"\"Check if the file is in the expected date interval.\n\n        The logic is applied based on the arguments greater_than and lower_than.\n        i.e:\n\n        - if greater_than and lower_than have values,\n        then it performs a between.\n        - if only lower_than has values,\n        then only values lower than the input value will be retrieved.\n        - if only greater_than has values,\n        then only values greater than the input value will be retrieved.\n\n        Args:\n            item: item available in SFTP directory.\n            options_args: options from the acon.\n\n        Returns:\n            A boolean telling whether the file is in the expected date interval or not.\n        \"\"\"\n        file_to_consider = True\n\n        if (\n            SFTPExtractionFilter.LOWER_THAN.value in options_args.keys()\n            or SFTPExtractionFilter.GREATER_THAN.value in options_args.keys()\n            and (S_ISREG(item.st_mode) or cls._is_compressed(item.filename))\n        ):\n            lower_than = options_args.get(\n                SFTPExtractionFilter.LOWER_THAN.value, \"9999-12-31\"\n            )\n            greater_than = options_args.get(\n                SFTPExtractionFilter.GREATER_THAN.value, \"1900-01-01\"\n            )\n\n            file_date = datetime.fromtimestamp(item.st_mtime)\n\n            if not (\n                (\n                    lower_than == greater_than\n                    and cls._validate_date(greater_than)\n                    <= file_date\n              
      <= cls._validate_date(lower_than)\n                )\n                or (\n                    cls._validate_date(greater_than)\n                    < file_date\n                    < cls._validate_date(lower_than)\n                )\n            ):\n                file_to_consider = False\n\n        return file_to_consider\n\n    @classmethod\n    def _get_earliest_latest_file(\n        cls,\n        sftp: SFTPClient,\n        options_args: dict,\n        list_filter_files: Set[str],\n        folder_path: List,\n    ) -> Set[str]:\n        \"\"\"Get the earliest or latest file of a directory.\n\n        Args:\n            sftp: the SFTP client object.\n            options_args: options from the acon.\n            list_filter_files: set of file names to filter from.\n            folder_path: the location of files.\n\n        Returns:\n            A set containing the earliest/latest file name.\n        \"\"\"\n        list_earl_lat_files: Set[str] = set()\n\n        for folder in folder_path:\n            file_date = 0\n            file_name = \"\"\n            all_items, _ = cls._get_folder_items(f\"{folder}\", sftp, options_args)\n            for item in all_items:\n                if (\n                    folder + item.filename in list_filter_files\n                    and (S_ISREG(item.st_mode) or cls._is_compressed(item.filename))\n                    and (\n                        options_args.get(\"earliest_file\")\n                        and (file_date == 0 or item.st_mtime < file_date)\n                    )\n                    or (\n                        options_args.get(\"latest_file\")\n                        and (file_date == 0 or item.st_mtime > file_date)\n                    )\n                ):\n                    file_date = item.st_mtime\n                    file_name = folder + item.filename\n            list_earl_lat_files.add(file_name)\n\n        return list_earl_lat_files\n\n    @classmethod\n    def _get_folder_items(\n        cls, remote_path: str, sftp: SFTPClient, options_args: dict\n    ) -> Tuple:\n        \"\"\"Get the files and the directory to be processed.\n\n        Args:\n            remote_path: root folder path.\n            sftp: a SFTPClient session object.\n            options_args: options from the acon.\n\n        Returns:\n            A tuple with a list of items (file object) and a list of directories.\n        \"\"\"\n        sub_dir = options_args.get(\"sub_dir\", False)\n        all_items: List[SFTPAttributes] = sftp.listdir_attr(remote_path)\n        items: List[SFTPAttributes] = []\n        folders: List = []\n\n        for item in all_items:\n            is_dir = stat.S_ISDIR(item.st_mode)\n            if is_dir and sub_dir and not item.filename.endswith((\".gz\", \".zip\")):\n                dirs = sftp.listdir_attr(f\"{remote_path}{item.filename}\")\n                for file in dirs:\n                    items.append(file)\n                    folders.append(f\"{remote_path}{item.filename}/\")\n            else:\n                items.append(item)\n                folders.append(remote_path)\n\n        return items, folders\n\n    @classmethod\n    def _get_host_keys(cls, pkey: str, key_type: str) -> PKey:\n        \"\"\"Get the pkey that will be added to the server.\n\n        Args:\n            pkey: a string with a host key value.\n            key_type: the type of key (rsa or ed25519).\n\n        Returns:\n            A PKey that will be used to authenticate the connection.\n        \"\"\"\n        key: RSAKey | 
Ed25519Key = None\n        if pkey and key_type.lower() == \"rsa\":\n            b_pkey = bytes(pkey, \"UTF-8\")\n            key = p.RSAKey(data=decodebytes(b_pkey))\n        elif pkey and key_type.lower() == \"ed25519\":\n            b_pkey = bytes(pkey, \"UTF-8\")\n            key = p.Ed25519Key(data=decodebytes(b_pkey))\n\n        return key\n\n    @classmethod\n    def _is_compressed(cls, filename: str) -> Any:\n        \"\"\"Validate if it is a compressed file.\n\n        Args:\n            filename: name of the file to be validated.\n\n        Returns:\n            A boolean with the result.\n        \"\"\"\n        return filename.endswith((\".gz\", \".zip\"))\n\n    @classmethod\n    def _validate_date(cls, date_text: str) -> datetime:\n        \"\"\"Validate the input date format.\n\n        Args:\n            date_text: a string with the date or datetime value.\n            The expected formats are:\n                YYYY-MM-DD and YYYY-MM-DD HH:MM:SS\n\n        Returns:\n            The datetime validated and formatted.\n        \"\"\"\n        for fmt in (\"%Y-%m-%d\", \"%Y-%m-%d %H:%M:%S\"):\n            try:\n                if date_text is not None:\n                    return datetime.strptime(date_text, fmt)\n            except ValueError:\n                pass\n        raise ValueError(\n            \"Incorrect data format, should be YYYY-MM-DD or YYYY-MM-DD HH:MM:SS.\"\n        )\n"
  },
  {
    "path": "lakehouse_engine/utils/file_utils.py",
    "content": "\"\"\"Utilities for file name based operations.\"\"\"\n\nimport re\nfrom os import listdir\nfrom typing import List\n\n\ndef get_file_names_without_file_type(\n    path: str, file_type: str, exclude_regex: str\n) -> list:\n    \"\"\"Function to retrieve list of file names in a folder.\n\n    This function filters by file type and removes the extension of the file name\n    it returns.\n\n    Args:\n        path: path to the folder to list files\n        file_type: type of the file to include in list\n        exclude_regex: regex of file names to exclude\n\n    Returns:\n        A list of file names without file type.\n    \"\"\"\n    file_list: List[str] = []\n\n    for file in listdir(path):\n        if not re.search(exclude_regex, file) and file.endswith(file_type):\n            file_list.append(file.split(\".\")[0])\n\n    return file_list\n\n\ndef get_directory_path(path: str) -> str:\n    \"\"\"Add '/' to the end of the path of a directory.\n\n    Args:\n        path: directory to be processed\n\n    Returns:\n        Directory path stripped and with '/' at the end.\n    \"\"\"\n    path = path.strip()\n    return path if path[-1] == \"/\" else path + \"/\"\n"
  },
  {
    "path": "lakehouse_engine/utils/gab_utils.py",
    "content": "\"\"\"Module to define GAB Utility classes.\"\"\"\n\nimport ast\nimport calendar\nimport json\nfrom datetime import datetime\nfrom typing import Optional\n\nimport pendulum\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col, lit, struct, to_json\n\nfrom lakehouse_engine.core.definitions import GABCadence, GABDefaults\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass GABUtils(object):\n    \"\"\"Class containing utility functions for GAB.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    def logger(\n        self,\n        run_start_time: datetime,\n        run_end_time: datetime,\n        start: str,\n        end: str,\n        query_id: str,\n        query_label: str,\n        cadence: str,\n        stage_file_path: str,\n        query: str,\n        status: str,\n        error_message: Exception | str,\n        target_database: str,\n    ) -> None:\n        \"\"\"Store the execution of each stage in the log events table.\n\n        Args:\n            run_start_time: execution start time.\n            run_end_time: execution end time.\n            start: use case start date.\n            end: use case end date.\n            query_id: gab configuration table use case identifier.\n            query_label: gab configuration table use case name.\n            cadence: cadence to process.\n            stage_file_path: stage file path.\n            query: query to execute.\n            status: status of the query execution.\n            error_message: error message if present.\n            target_database: target database to write.\n        \"\"\"\n        ins = \"\"\"\n        INSERT INTO {database}.gab_log_events\n        VALUES (\n            '{run_start_time}',\n            '{run_end_time}',\n            '{start}',\n            '{end}',\n            {query_id},\n            '{query_label}',\n            '{cadence}',\n            '{stage_file_path}',\n            '{query}',\n            '{status}',\n            '{error_message}'\n        )\"\"\".format(  # nosec: B608\n            database=target_database,\n            run_start_time=run_start_time,\n            run_end_time=run_end_time,\n            start=start,\n            end=end,\n            query_id=query_id,\n            query_label=query_label,\n            cadence=cadence,\n            stage_file_path=stage_file_path,\n            query=self._escape_quote(query),\n            status=status,\n            error_message=(\n                self._escape_quote(str(error_message))\n                if status == \"Failed\"\n                else error_message\n            ),\n        )\n\n        ExecEnv.SESSION.sql(ins)\n\n    @classmethod\n    def _escape_quote(cls, to_escape: str) -> str:\n        \"\"\"Escape quote on string.\n\n        Args:\n            to_escape: string to escape.\n        \"\"\"\n        return to_escape.replace(\"'\", r\"\\'\").replace('\"', r\"\\\"\")\n\n    @classmethod\n    def get_json_column_as_dict(\n        cls, lookup_query_builder: DataFrame, query_id: str, query_column: str\n    ) -> dict:  # type: ignore\n        \"\"\"Get JSON column as dictionary.\n\n        Args:\n            lookup_query_builder: gab configuration data.\n            query_id: gab configuration table use case identifier.\n            query_column: column to get as json.\n        \"\"\"\n        column_df = lookup_query_builder.filter(\n            col(\"query_id\") == lit(query_id)\n        
).select(col(query_column))\n\n        column_df_json = column_df.select(\n            to_json(struct([column_df[x] for x in column_df.columns]))\n        ).collect()[0][0]\n\n        json_column = json.loads(column_df_json)\n\n        for mapping in json_column.values():\n            column_as_json = ast.literal_eval(mapping)\n\n        return column_as_json  # type: ignore\n\n    @classmethod\n    def extract_columns_from_mapping(\n        cls,\n        columns: dict,\n        is_dimension: bool,\n        extract_column_without_alias: bool = False,\n        table_alias: Optional[str] = None,\n        is_extracted_value_as_name: bool = True,\n    ) -> tuple[list[str], list[str]] | list[str]:\n        \"\"\"Extract and transform columns to SQL select statement.\n\n        Args:\n            columns: data to extract the columns.\n            is_dimension: flag identifying if is a dimension or a metric.\n            extract_column_without_alias: flag to inform if it's to extract columns\n                without aliases.\n            table_alias: name or alias from the source table.\n            is_extracted_value_as_name: identify if the extracted value is the\n                column name.\n        \"\"\"\n        column_with_alias = (\n            \"\".join([table_alias, \".\", \"{} as {}\"]) if table_alias else \"{} as {}\"\n        )\n        column_without_alias = (\n            \"\".join([table_alias, \".\", \"{}\"]) if table_alias else \"{}\"\n        )\n\n        extracted_columns_with_alias = []\n        extracted_columns_without_alias = []\n        for column_name, column_value in columns.items():\n            if extract_column_without_alias:\n                extracted_column_without_alias = column_without_alias.format(\n                    cls._get_column_format_without_alias(\n                        is_dimension,\n                        column_name,\n                        column_value,\n                        is_extracted_value_as_name,\n                    )\n                )\n                extracted_columns_without_alias.append(extracted_column_without_alias)\n\n            extracted_column_with_alias = column_with_alias.format(\n                *cls._extract_column_with_alias(\n                    is_dimension,\n                    column_name,\n                    column_value,\n                    is_extracted_value_as_name,\n                )\n            )\n            extracted_columns_with_alias.append(extracted_column_with_alias)\n\n        return (\n            (extracted_columns_with_alias, extracted_columns_without_alias)\n            if extract_column_without_alias\n            else extracted_columns_with_alias\n        )\n\n    @classmethod\n    def _extract_column_with_alias(\n        cls,\n        is_dimension: bool,\n        column_name: str,\n        column_value: str | dict,\n        is_extracted_value_as_name: bool = True,\n    ) -> tuple[str, str]:\n        \"\"\"Extract column name with alias.\n\n        Args:\n            is_dimension: flag indicating if the column is a dimension.\n            column_name: name of the column.\n            column_value: value of the column.\n            is_extracted_value_as_name: flag indicating if the name of the column is the\n                extracted value.\n        \"\"\"\n        extracted_value = (\n            column_value\n            if is_dimension\n            else (column_value[\"metric_name\"])  # type: ignore\n        )\n\n        return (\n            (extracted_value, column_name)  # type: ignore\n  
          if is_extracted_value_as_name\n            else (column_name, extracted_value)\n        )\n\n    @classmethod\n    def _get_column_format_without_alias(\n        cls,\n        is_dimension: bool,\n        column_name: str,\n        column_value: str | dict,\n        is_extracted_value_as_name: bool = True,\n    ) -> str:\n        \"\"\"Extract column name without alias.\n\n        Args:\n            is_dimension: flag indicating if the column is a dimension.\n            column_name: name of the column.\n            column_value: value of the column.\n            is_extracted_value_as_name: flag indicating if the name of the column is the\n                extracted value.\n        \"\"\"\n        extracted_value: str = (\n            column_value\n            if is_dimension\n            else (column_value[\"metric_name\"])  # type: ignore\n        )\n\n        return extracted_value if is_extracted_value_as_name else column_name\n\n    @classmethod\n    def get_cadence_configuration_at_end_date(cls, end_date: datetime) -> dict:\n        \"\"\"A dictionary that corresponds to the conclusion of a cadence.\n\n        Any end date inputted by the user we check this end date is actually end of\n            a cadence (YEAR, QUARTER, MONTH, WEEK).\n        If the user input is 2024-03-31 this is a month end and a quarter end that\n            means any use cases configured as month or quarter need to be calculated.\n\n        Args:\n            end_date: base end date.\n        \"\"\"\n        init_end_date_dict = {}\n\n        expected_end_cadence_date = pendulum.datetime(\n            int(end_date.strftime(\"%Y\")),\n            int(end_date.strftime(\"%m\")),\n            int(end_date.strftime(\"%d\")),\n        ).replace(tzinfo=None)\n\n        # Validating YEAR cadence\n        if end_date == expected_end_cadence_date.last_of(\"year\"):\n            init_end_date_dict[\"YEAR\"] = \"N\"\n\n        # Validating QUARTER cadence\n        if end_date == expected_end_cadence_date.last_of(\"quarter\"):\n            init_end_date_dict[\"QUARTER\"] = \"N\"\n\n        # Validating MONTH cadence\n        if end_date == datetime(\n            int(end_date.strftime(\"%Y\")),\n            int(end_date.strftime(\"%m\")),\n            calendar.monthrange(\n                int(end_date.strftime(\"%Y\")), int(end_date.strftime(\"%m\"))\n            )[1],\n        ):\n            init_end_date_dict[\"MONTH\"] = \"N\"\n\n        # Validating WEEK cadence\n        if end_date == expected_end_cadence_date.end_of(\"week\").replace(\n            hour=0, minute=0, second=0, microsecond=0\n        ):\n            init_end_date_dict[\"WEEK\"] = \"N\"\n\n        init_end_date_dict[\"DAY\"] = \"N\"\n\n        return init_end_date_dict\n\n    def get_reconciliation_cadences(\n        self,\n        cadence: str,\n        selected_reconciliation_window: dict,\n        cadence_configuration_at_end_date: dict,\n        rerun_flag: str,\n    ) -> dict:\n        \"\"\"Get reconciliation cadences based on the use case configuration.\n\n        Args:\n            cadence: cadence to process.\n            selected_reconciliation_window: configured use case reconciliation window.\n            cadence_configuration_at_end_date: cadences to execute at the end date.\n            rerun_flag: flag indicating if it's a rerun or a normal run.\n        \"\"\"\n        configured_cadences = self._get_configured_cadences_by_snapshot(\n            cadence, selected_reconciliation_window, cadence_configuration_at_end_date\n        
)\n\n        return self._get_cadences_to_execute(\n            configured_cadences, cadence, cadence_configuration_at_end_date, rerun_flag\n        )\n\n    @classmethod\n    def _get_cadences_to_execute(\n        cls,\n        configured_cadences: dict,\n        cadence: str,\n        cadence_configuration_at_end_date: dict,\n        rerun_flag: str,\n    ) -> dict:\n        \"\"\"Get cadences to execute.\n\n        Args:\n            cadence: cadence to process.\n            configured_cadences: configured use case reconciliation window.\n            cadence_configuration_at_end_date: cadences to execute at the end date.\n            rerun_flag: flag indicating if it's a rerun or a normal run.\n        \"\"\"\n        cadences_to_execute = {}\n        cad_order = GABCadence.get_ordered_cadences()\n\n        for snapshot_cadence, snapshot_flag in configured_cadences.items():\n            if (\n                (cad_order[cadence] > cad_order[snapshot_cadence])\n                and (rerun_flag == \"Y\")\n            ) or snapshot_cadence in cadence_configuration_at_end_date:\n                cadences_to_execute[snapshot_cadence] = snapshot_flag\n            elif snapshot_cadence not in cadence_configuration_at_end_date:\n                continue\n\n        return cls._sort_cadences_to_execute(cadences_to_execute, cad_order)\n\n    @classmethod\n    def _sort_cadences_to_execute(\n        cls, cadences_to_execute: dict, cad_order: dict\n    ) -> dict:\n        \"\"\"Sort the cadences to execute.\n\n        Args:\n            cadences_to_execute: cadences to execute.\n            cad_order: all cadences with order.\n        \"\"\"\n        # ordering it because when grouping cadences with snapshot and without snapshot\n        # can impact the cadence ordering.\n        sorted_cadences_to_execute: dict = dict(\n            sorted(\n                cadences_to_execute.items(),\n                key=lambda item: cad_order.get(item[0]),  # type: ignore\n            )\n        )\n        # ordering cadences to execute it from bigger (YEAR) to smaller (DAY)\n        cadences_to_execute_items = []\n\n        for cadence_name, cadence_value in sorted_cadences_to_execute.items():\n            cadences_to_execute_items.append((cadence_name, cadence_value))\n\n        cadences_sorted_by_bigger_cadence_to_execute: dict = dict(\n            reversed(cadences_to_execute_items)\n        )\n\n        return cadences_sorted_by_bigger_cadence_to_execute\n\n    @classmethod\n    def _get_configured_cadences_by_snapshot(\n        cls,\n        cadence: str,\n        selected_reconciliation_window: dict,\n        cadence_configuration_at_end_date: dict,\n    ) -> dict:\n        \"\"\"Get configured cadences to execute.\n\n        Args:\n            cadence: selected cadence.\n            selected_reconciliation_window: configured use case reconciliation window.\n            cadence_configuration_at_end_date: cadences to execute at the end date.\n\n        Returns:\n            Each cadence with the corresponding information if it's to execute with\n                snapshot or not.\n        \"\"\"\n        cadences_by_snapshot = {}\n\n        (\n            no_snapshot_cadences,\n            snapshot_cadences,\n        ) = cls._generate_reconciliation_by_snapshot(\n            cadence, selected_reconciliation_window\n        )\n\n        for snapshot_cadence, snapshot_flag in no_snapshot_cadences.items():\n            if snapshot_cadence in cadence_configuration_at_end_date:\n                
cadences_by_snapshot[snapshot_cadence] = snapshot_flag\n\n                cls._LOGGER.info(f\"{snapshot_cadence} is present in {cadence} cadence\")\n                break\n\n        cadences_by_snapshot.update(snapshot_cadences)\n\n        if (not cadences_by_snapshot) and (\n            cadence in cadence_configuration_at_end_date\n        ):\n            cadences_by_snapshot[cadence] = \"N\"\n\n        return cadences_by_snapshot\n\n    @classmethod\n    def _generate_reconciliation_by_snapshot(\n        cls, cadence: str, selected_reconciliation_window: dict\n    ) -> tuple[dict, dict]:\n        \"\"\"Generate reconciliation by snapshot.\n\n        Args:\n            cadence: cadence to process.\n            selected_reconciliation_window: configured use case reconciliation window.\n        \"\"\"\n        cadence_snapshot_configuration = {cadence: \"N\"}\n        for cadence in GABCadence.get_cadences():\n            cls._add_cadence_snapshot_to_cadence_snapshot_config(\n                cadence, selected_reconciliation_window, cadence_snapshot_configuration\n            )\n        cadence_snapshot_configuration = dict(\n            sorted(\n                cadence_snapshot_configuration.items(),\n                key=(\n                    lambda item: GABCadence.get_ordered_cadences().get(  # type: ignore\n                        item[0]\n                    )\n                ),\n            )\n        )\n\n        cadence_snapshot_configuration = dict(\n            reversed(list(cadence_snapshot_configuration.items()))\n        )\n\n        cadences_without_snapshot = {\n            key: value\n            for key, value in cadence_snapshot_configuration.items()\n            if value == \"N\"\n        }\n\n        cadences_with_snapshot = {\n            key: value\n            for key, value in cadence_snapshot_configuration.items()\n            if value == \"Y\"\n        }\n\n        return cadences_with_snapshot, cadences_without_snapshot\n\n    @classmethod\n    def _add_cadence_snapshot_to_cadence_snapshot_config(\n        cls,\n        cadence: str,\n        selected_reconciliation_window: dict,\n        cadence_snapshot_configuration: dict,\n    ) -> None:\n        \"\"\"Add the selected reconciliation to cadence snapshot configuration.\n\n        Args:\n            cadence: selected cadence.\n            selected_reconciliation_window:  configured use case reconciliation window.\n            cadence_snapshot_configuration: cadence snapshot configuration dictionary\n                who will be updated with the new value.\n        \"\"\"\n        if cadence in selected_reconciliation_window:\n            cadence_snapshot_configuration[cadence] = selected_reconciliation_window[\n                cadence\n            ][\"snapshot\"]\n\n    @classmethod\n    def format_datetime_to_default(cls, date_to_format: datetime) -> str:\n        \"\"\"Format datetime to GAB default format.\n\n        Args:\n            date_to_format: date to format.\n        \"\"\"\n        return datetime.date(date_to_format).strftime(GABDefaults.DATE_FORMAT.value)\n\n\nclass GABPartitionUtils(object):\n    \"\"\"Class to extract a partition based in a date period.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def get_years(cls, start_date: str, end_date: str) -> list[str]:\n        \"\"\"Return a list of distinct years from the input parameters.\n\n        Args:\n            start_date: start of the period.\n            end_date: end of the period.\n        \"\"\"\n  
      year = []\n        if start_date > end_date:\n            raise ValueError(\n                \"Input Error: Invalid start_date and end_date. \"\n                \"Start_date is greater than end_date\"\n            )\n\n        for i in range(int(start_date[0:4]), int(end_date[0:4]) + 1):\n            year.append(str(i))\n\n        return year\n\n    @classmethod\n    def get_partition_condition(cls, start_date: str, end_date: str) -> str:\n        \"\"\"Return year,month and day partition statement from the input parameters.\n\n        Args:\n            start_date: start of the period.\n            end_date: end of the period.\n        \"\"\"\n        years = cls.get_years(start_date, end_date)\n        if len(years) > 1:\n            partition_condition = cls._get_multiple_years_partition(\n                start_date, end_date, years\n            )\n        else:\n            partition_condition = cls._get_single_year_partition(start_date, end_date)\n        return partition_condition\n\n    @classmethod\n    def _get_multiple_years_partition(\n        cls, start_date: str, end_date: str, years: list[str]\n    ) -> str:\n        \"\"\"Return partition when executing multiple years (>1).\n\n        Args:\n            start_date: start of the period.\n            end_date: end of the period.\n            years: list of years.\n        \"\"\"\n        start_date_month = cls._extract_date_part_from_date(\"MONTH\", start_date)\n        start_date_day = cls._extract_date_part_from_date(\"DAY\", start_date)\n\n        end_date_month = cls._extract_date_part_from_date(\"MONTH\", end_date)\n        end_date_day = cls._extract_date_part_from_date(\"DAY\", end_date)\n\n        year_statement = \"(year = {0} and (\".format(years[0]) + \"{})\"\n        if start_date_month != \"12\":\n            start_date_partition = year_statement.format(\n                \"(month = {0} and day between {1} and 31)\".format(\n                    start_date_month, start_date_day\n                )\n                + \" or (month between {0} and 12)\".format(int(start_date_month) + 1)\n            )\n        else:\n            start_date_partition = year_statement.format(\n                \"month = {0} and day between {1} and 31\".format(\n                    start_date_month, start_date_day\n                )\n            )\n\n        period_years_partition = \"\"\n\n        if len(years) == 3:\n            period_years_partition = \") or (year = {0}\".format(years[1])\n        elif len(years) > 3:\n            period_years_partition = \") or (year between {0} and {1})\".format(\n                years[1], years[-2]\n            )\n\n        if end_date_month != \"01\":\n            end_date_partition = (\n                \") or (year = {0} and ((month between 01 and {1})\".format(\n                    years[-1], int(end_date_month) - 1\n                )\n                + \" or (month = {0} and day between 1 and {1})))\".format(\n                    end_date_month, end_date_day\n                )\n            )\n        else:\n            end_date_partition = (\n                \") or (year = {0} and month = 1 and day between 01 and {1})\".format(\n                    years[-1], end_date_day\n                )\n            )\n        partition_condition = (\n            start_date_partition + period_years_partition + end_date_partition\n        )\n\n        return partition_condition\n\n    @classmethod\n    def _get_single_year_partition(cls, start_date: str, end_date: str) -> str:\n        \"\"\"Return 
partition when executing a single year.\n\n        Args:\n            start_date: start of the period.\n            end_date: end of the period.\n        \"\"\"\n        start_date_year = cls._extract_date_part_from_date(\"YEAR\", start_date)\n        start_date_month = cls._extract_date_part_from_date(\"MONTH\", start_date)\n        start_date_day = cls._extract_date_part_from_date(\"DAY\", start_date)\n\n        end_date_year = cls._extract_date_part_from_date(\"YEAR\", end_date)\n        end_date_month = cls._extract_date_part_from_date(\"MONTH\", end_date)\n        end_date_day = cls._extract_date_part_from_date(\"DAY\", end_date)\n\n        if start_date_month != end_date_month:\n            months = []\n            for i in range(int(start_date_month), int(end_date_month) + 1):\n                months.append(i)\n\n            start_date_partition = (\n                \"year = {0} and ((month={1} and day between {2} and 31)\".format(\n                    start_date_year, months[0], start_date_day\n                )\n            )\n            period_years_partition = \"\"\n            if len(months) == 2:\n                period_years_partition = start_date_partition\n            elif len(months) == 3:\n                period_years_partition = (\n                    start_date_partition + \" or (month = {0})\".format(months[1])\n                )\n            elif len(months) > 3:\n                period_years_partition = (\n                    start_date_partition\n                    + \" or (month between {0} and {1})\".format(months[1], months[-2])\n                )\n            partition_condition = (\n                period_years_partition\n                + \" or (month = {0} and day between 1 and {1}))\".format(\n                    end_date_month, end_date_day\n                )\n            )\n        else:\n            partition_condition = (\n                \"year = {0} and month = {1} and day between {2} and {3}\".format(\n                    end_date_year, end_date_month, start_date_day, end_date_day\n                )\n            )\n\n        return partition_condition\n\n    @classmethod\n    def _extract_date_part_from_date(cls, part: str, date: str) -> str:\n        \"\"\"Extract date part from string date.\n\n        Args:\n            part: date part (possible values: DAY, MONTH, YEAR)\n            date: string date.\n        \"\"\"\n        if \"DAY\" == part.upper():\n            return date[8:10]\n        elif \"MONTH\" == part.upper():\n            return date[5:7]\n        else:\n            return date[0:4]\n"
  },
  {
    "path": "lakehouse_engine/utils/logging_handler.py",
    "content": "\"\"\"Module to configure project logging.\"\"\"\n\nimport logging\nimport re\n\nFORMATTER = logging.Formatter(\"%(asctime)s — %(name)s — %(levelname)s — %(message)s\")\nSENSITIVE_KEYS_REG = [\n    {  # Enclosed in ''.\n        # Stops replacing when it finds comma and space, space or end of line.\n        \"regex\": r\"'(kafka\\.ssl\\.keystore\\.password|kafka\\.ssl\\.truststore\\.password\"\n        r\"|password|secret|credential|credentials|pass|key)'[ ]*:\"\n        r\"[ ]*'.*?(, | |}|$)\",\n        \"replace\": \"'masked_cred': '******', \",\n    },\n    {  # Enclosed in \"\".\n        # Stops replacing when it finds comma and space, space or end of line.\n        \"regex\": r'\"(kafka\\.ssl\\.keystore\\.password|kafka\\.ssl\\.truststore\\.password'\n        r'|password|secret|credential|credentials|pass|key)\"[ ]*:'\n        r'[ ]*\".*?(, | |}|$)',\n        \"replace\": '\"masked_cred\": \"******\", ',\n    },\n    {  # Not enclosed in '' or \"\".\n        # Stops replacing when it finds comma and space, space or end of line.\n        \"regex\": r\"(kafka\\.ssl\\.keystore\\.password|kafka\\.ssl\\.truststore\\.password\"\n        r\"|password|secret|credential|credentials|pass|key)[ ]*:\"\n        r\"[ ]*.*?(, | |}|$)\",\n        \"replace\": \"masked_cred: ******, \",\n    },\n]\n\n\nclass FilterSensitiveData(logging.Filter):\n    \"\"\"Logging filter to hide sensitive data from being shown in the logs.\"\"\"\n\n    def filter(self, record: logging.LogRecord) -> bool:  # noqa: A003\n        \"\"\"Hide sensitive information from being shown in the logs.\n\n        Based on the configured regex and replace strings, the content of the log\n        records is replaced and then all the records are allowed to be logged\n        (return True).\n\n        Args:\n            record: the LogRecord event being logged.\n\n        Returns:\n            The transformed record to be logged.\n        \"\"\"\n        for key_reg in SENSITIVE_KEYS_REG:\n            record.msg = re.sub(key_reg[\"regex\"], key_reg[\"replace\"], str(record.msg))\n        return True\n\n\nclass LoggingHandler(object):\n    \"\"\"Handle the logging of the lakehouse engine project.\"\"\"\n\n    def __init__(self, class_name: str):\n        \"\"\"Construct a LoggingHandler instance.\n\n        Args:\n            class_name: name of the class to be indicated in the logs.\n        \"\"\"\n        self._logger: logging.Logger = logging.getLogger(class_name)\n        self._logger.setLevel(logging.DEBUG)\n        self._logger.addFilter(FilterSensitiveData())\n        lsh = logging.StreamHandler()\n        lsh.setLevel(logging.DEBUG)\n        lsh.setFormatter(FORMATTER)\n        if not self._logger.hasHandlers():\n            # avoid keep adding handlers and therefore duplicate messages\n            self._logger.addHandler(lsh)\n\n    def get_logger(self) -> logging.Logger:\n        \"\"\"Get the _logger instance variable.\n\n        Returns:\n            logging.Logger: the logger object.\n        \"\"\"\n        return self._logger\n"
  },
  {
    "path": "lakehouse_engine/utils/rest_api.py",
    "content": "\"\"\"Module to handle REST API operations.\"\"\"\n\nimport time\nfrom enum import Enum\n\nimport requests\nfrom requests.adapters import HTTPAdapter\nfrom urllib3.util.retry import Retry\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\nLOG = LoggingHandler(__name__).get_logger()\nDEFAULT_CONTENT_TYPE = \"application/json\"\n\n\nclass RestMethods(Enum):\n    \"\"\"Methods for REST API calls.\"\"\"\n\n    POST = \"POST\"\n    PUT = \"PUT\"\n    ALLOWED_METHODS = [\"POST\", \"PUT\"]\n\n\nclass RestStatusCodes(Enum):\n    \"\"\"REST Status Code.\"\"\"\n\n    RETRY_STATUS_CODES = [429, 500, 502, 503, 504]\n    OK_STATUS_CODES = [200]\n\n\nclass RESTApiException(requests.RequestException):\n    \"\"\"Class representing any possible REST API Exception.\"\"\"\n\n    def __init__(self, message: str) -> None:\n        \"\"\"Construct RESTApiException instances.\n\n        Args:\n            message: message to display on exception event.\n        \"\"\"\n        super().__init__(message)\n\n\ndef get_basic_auth(username: str, password: str) -> requests.auth.HTTPBasicAuth:\n    \"\"\"Get the basic authentication object to authenticate REST requests.\n\n    Args:\n        username: username.\n        password: password.\n\n    Returns:\n        requests.auth.HTTPBasicAuth: the HTTPBasicAuth object.\n    \"\"\"\n    return requests.auth.HTTPBasicAuth(username, password)\n\n\ndef get_configured_session(\n    sleep_seconds: float = 0.2,\n    total_retries: int = 5,\n    backoff_factor: int = 2,\n    retry_status_codes: list = None,\n    allowed_methods: list = None,\n    protocol: str = \"https://\",\n) -> requests.Session:\n    \"\"\"Get a configured requests Session with exponential backoff.\n\n    Args:\n        sleep_seconds: seconds to sleep before each request to avoid rate limits.\n        total_retries: number of times to retry.\n        backoff_factor: factor for the exponential backoff.\n        retry_status_codes: list of status code that triggers a retry.\n        allowed_methods: http methods that are allowed for retry.\n        protocol: http:// or https://.\n\n    Returns\n        requests.Session: the configured session.\n    \"\"\"\n    retry_status_codes = (\n        retry_status_codes\n        if retry_status_codes\n        else RestStatusCodes.RETRY_STATUS_CODES.value\n    )\n    allowed_methods = (\n        allowed_methods if allowed_methods else RestMethods.ALLOWED_METHODS.value\n    )\n    time.sleep(sleep_seconds)\n    session = requests.Session()\n    retries = Retry(\n        total=total_retries,\n        backoff_factor=backoff_factor,\n        status_forcelist=retry_status_codes,\n        allowed_methods=allowed_methods,\n    )\n    session.mount(protocol, HTTPAdapter(max_retries=retries))\n    return session\n\n\ndef execute_api_request(\n    method: str,\n    url: str,\n    headers: dict = None,\n    basic_auth_dict: dict = None,\n    json: dict = None,\n    files: dict = None,\n    sleep_seconds: float = 0.2,\n) -> requests.Response:\n    \"\"\"Execute a REST API request.\n\n    Args:\n        method: REST method (e.g., POST or PUT).\n        url: url of the api.\n        headers: request headers.\n        basic_auth_dict: basic http authentication details\n            (e.g., {\"username\": \"x\", \"password\": \"y\"}).\n        json: json payload to send in the request.\n        files: files payload to send in the request.\n        sleep_seconds: for how many seconds to sleep to avoid error 429.\n\n    Returns:\n        response from 
the HTTP request.\n    \"\"\"\n    basic_auth: requests.auth.HTTPBasicAuth = None\n    if basic_auth_dict:\n        basic_auth = get_basic_auth(\n            basic_auth_dict[\"username\"], basic_auth_dict[\"password\"]\n        )\n\n    return get_configured_session(sleep_seconds=sleep_seconds).request(\n        method=method,\n        url=url,\n        headers=headers,\n        auth=basic_auth,\n        json=json,\n        files=files,\n    )\n"
  },
  {
    "path": "lakehouse_engine/utils/schema_utils.py",
    "content": "\"\"\"Utilities to facilitate dataframe schema management.\"\"\"\n\nfrom logging import Logger\nfrom typing import Any, List, Optional\n\nfrom pyspark.sql.functions import col\nfrom pyspark.sql.types import StructType\n\nfrom lakehouse_engine.core.definitions import InputSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.storage.file_storage_functions import FileStorageFunctions\n\n\nclass SchemaUtils(object):\n    \"\"\"Schema utils that help retrieve and manage schemas of dataframes.\"\"\"\n\n    _logger: Logger = LoggingHandler(__name__).get_logger()\n\n    @staticmethod\n    def from_file(file_path: str, disable_dbfs_retry: bool = False) -> StructType:\n        \"\"\"Get a spark schema from a file (spark StructType json file) in a file system.\n\n        Args:\n            file_path: path of the file in a file system. [Check here](\n                https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html).\n            disable_dbfs_retry: optional flag to disable file storage dbfs.\n\n        Returns:\n            Spark schema struct type.\n        \"\"\"\n        return StructType.fromJson(\n            FileStorageFunctions.read_json(file_path, disable_dbfs_retry)\n        )\n\n    @staticmethod\n    def from_file_to_dict(file_path: str, disable_dbfs_retry: bool = False) -> Any:\n        \"\"\"Get a dict with the spark schema from a file in a file system.\n\n        Args:\n            file_path: path of the file in a file system. [Check here](\n                https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html).\n            disable_dbfs_retry: optional flag to disable file storage dbfs.\n\n        Returns:\n             Spark schema in a dict.\n        \"\"\"\n        return FileStorageFunctions.read_json(file_path, disable_dbfs_retry)\n\n    @staticmethod\n    def from_dict(struct_type: dict) -> StructType:\n        \"\"\"Get a spark schema from a dict.\n\n        Args:\n            struct_type: dict containing a spark schema structure. [Check here](\n                https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html).\n\n        Returns:\n             Spark schema struct type.\n        \"\"\"\n        return StructType.fromJson(struct_type)\n\n    @staticmethod\n    def from_table_schema(table: str) -> StructType:\n        \"\"\"Get a spark schema from a table.\n\n        Args:\n            table: table name from which to inherit the schema.\n\n        Returns:\n            Spark schema struct type.\n        \"\"\"\n        return ExecEnv.SESSION.read.table(table).schema\n\n    @classmethod\n    def from_input_spec(cls, input_spec: InputSpec) -> Optional[StructType]:\n        \"\"\"Get a spark schema from an input specification.\n\n        This covers scenarios where the schema is provided as part of the input\n        specification of the algorithm. 
Schema can come from the table specified in the\n        input specification (enforce_schema_from_table) or by the dict with the spark\n        schema provided there also.\n\n        Args:\n            input_spec: input specification.\n\n        Returns:\n            spark schema struct type.\n        \"\"\"\n        if input_spec.enforce_schema_from_table:\n            cls._logger.info(\n                f\"Reading schema from table: {input_spec.enforce_schema_from_table}\"\n            )\n            return SchemaUtils.from_table_schema(input_spec.enforce_schema_from_table)\n        elif input_spec.schema_path:\n            cls._logger.info(f\"Reading schema from file: {input_spec.schema_path}\")\n            return SchemaUtils.from_file(\n                input_spec.schema_path, input_spec.disable_dbfs_retry\n            )\n        elif input_spec.schema:\n            cls._logger.info(\n                f\"Reading schema from configuration file: {input_spec.schema}\"\n            )\n            return SchemaUtils.from_dict(input_spec.schema)\n        else:\n            cls._logger.info(\"No schema was provided... skipping enforce schema\")\n            return None\n\n    @staticmethod\n    def _get_prefix_alias(num_chars: int, prefix: str, shorten_names: bool) -> str:\n        \"\"\"Get prefix alias for a field.\"\"\"\n        return (\n            f\"\"\"{'_'.join(\n                [item[:num_chars] for item in prefix.split('.')]\n            )}_\"\"\"\n            if shorten_names\n            else f\"{prefix}_\".replace(\".\", \"_\")\n        )\n\n    @staticmethod\n    def schema_flattener(\n        schema: StructType,\n        prefix: str = None,\n        level: int = 1,\n        max_level: int = None,\n        shorten_names: bool = False,\n        alias: bool = True,\n        num_chars: int = 7,\n        ignore_cols: List = None,\n    ) -> List:\n        \"\"\"Recursive method to flatten the schema of the dataframe.\n\n        Args:\n            schema: schema to be flattened.\n            prefix: prefix of the struct to get the value for. Only relevant\n                for being used in the internal recursive logic.\n            level: level of the depth in the schema being flattened. Only relevant\n                for being used in the internal recursive logic.\n            max_level: level until which you want to flatten the schema. Default: None.\n            shorten_names: whether to shorten the names of the prefixes of the fields\n                being flattened or not. Default: False.\n            alias: whether to define alias for the columns being flattened or\n                not. Default: True.\n            num_chars: number of characters to consider when shortening the names of\n                the fields. Default: 7.\n            ignore_cols: columns which you don't want to flatten. 
Default: None.\n\n        Returns:\n            A list with the flattened columns of the dataframe.\n        \"\"\"\n        cols = []\n        ignore_cols = ignore_cols if ignore_cols else []\n        for field in schema.fields:\n            name = prefix + \".\" + field.name if prefix else field.name\n            field_type = field.dataType\n\n            if (\n                isinstance(field_type, StructType)\n                and name not in ignore_cols\n                and (max_level is None or level <= max_level)\n            ):\n                cols += SchemaUtils.schema_flattener(\n                    schema=field_type,\n                    prefix=name,\n                    level=level + 1,\n                    max_level=max_level,\n                    shorten_names=shorten_names,\n                    alias=alias,\n                    num_chars=num_chars,\n                    ignore_cols=ignore_cols,\n                )\n            else:\n                if alias and prefix:\n                    prefix_alias = SchemaUtils._get_prefix_alias(\n                        num_chars, prefix, shorten_names\n                    )\n                    cols.append(col(name).alias(f\"{prefix_alias}{field.name}\"))\n                else:\n                    cols.append(col(name))\n        return cols\n"
  },
  {
    "path": "lakehouse_engine/utils/sharepoint_utils.py",
    "content": "\"\"\"Utilities for sharepoint API operations.\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nimport shutil\nfrom contextlib import contextmanager\nfrom datetime import datetime\nfrom pathlib import Path\nfrom typing import Any, Dict, Generator, List, cast\n\nimport requests\nfrom pyspark.sql import DataFrame\nfrom requests import RequestException\nfrom tenacity import (\n    retry,\n    retry_if_exception_type,\n    stop_after_attempt,\n    wait_exponential,\n)\n\nfrom lakehouse_engine.core.definitions import SharepointFile\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.exceptions import SharePointAPIError\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n_logger = LoggingHandler(__name__).get_logger()\n\n\nclass SharepointUtils(object):\n    \"\"\"Class with methods to connect and extract data from Sharepoint.\"\"\"\n\n    def __init__(\n        self,\n        client_id: str,\n        tenant_id: str,\n        local_path: str,\n        api_version: str,\n        site_name: str,\n        drive_name: str,\n        file_name: str,\n        secret: str,\n        folder_relative_path: str = None,\n        chunk_size: int = 5 * 1024 * 1024,  # 5 MB\n        local_options: dict = None,\n        conflict_behaviour: str = \"replace\",\n        file_pattern: str = None,\n        file_type: str = None,\n    ):\n        \"\"\"Instantiate objects of the SharepointUtils class.\n\n        Args:\n            client_id: application (client) ID of your Azure AD app.\n            tenant_id: tenant ID (directory ID) from Azure AD for authentication.\n            local_path: local directory path (Volume) where the files are temporarily\n            stored.\n            api_version: Graph API version to use.\n            site_name: name of the Sharepoint site where the files are stored.\n            drive_name: name of the document library or drive in Sharepoint.\n            file_name: name of the file to be stored in sharepoint.\n            secret: client secret for authentication.\n            folder_relative_path: optional; relative path within the\n            drive(drive_name) where the file will be stored.\n            chunk_size: Optional; size of file chunks to be uploaded/downloaded\n            in bytes (default is 5 MB).\n            local_options: Optional; additional options for customizing write\n            action to local path.\n            conflict_behaviour: Optional; defines how conflicts in file uploads are\n            handled('replace', 'fail', etc.).\n            file_pattern: Optional; pattern to match files in Sharepoint (e.g.,\n            'data_*').\n            file_type: Optional; type of the file to be stored in Sharepoint (e.g.,\n            'csv').\n\n        Returns:\n            A SharepointUtils object.\n        \"\"\"\n        self.client_id = client_id\n        self.tenant_id = tenant_id\n        self.local_path = local_path\n        self.api_version = api_version\n        self.site_name = site_name\n        self.drive_name = drive_name\n        self.file_name = file_name\n        self.secret = secret\n        self.folder_relative_path = folder_relative_path\n        self.chunk_size = chunk_size\n        self.local_options = local_options\n        self.conflict_behaviour = conflict_behaviour\n        self.site_id = None\n        self.drive_id = None\n        self.token = None\n        self.file_pattern = file_pattern\n        self.file_type = file_type\n\n        self._create_app()\n\n    def 
_get_token(self) -> None:\n        \"\"\"Fetch and store a valid access token for Sharepoint API.\"\"\"\n        try:\n            self.token = self.app.acquire_token_for_client(\n                scopes=[f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/.default\"]\n            )\n        except Exception as err:\n            _logger.error(f\"Token acquisition error: {err}\")\n\n    def _create_app(self) -> None:\n        \"\"\"Create an MSAL (Microsoft Authentication Library) instance.\n\n        This is used to handle authentication and authorization with Azure AD.\n        \"\"\"\n        import msal\n\n        self.app = msal.ConfidentialClientApplication(\n            client_id=self.client_id,\n            authority=f\"{ExecEnv.ENGINE_CONFIG.sharepoint_authority}/{self.tenant_id}\",\n            client_credential=self.secret,\n        )\n\n        self._get_token()\n\n    @retry(\n        stop=stop_after_attempt(5),\n        wait=wait_exponential(multiplier=30, min=30, max=150),\n        retry=retry_if_exception_type(\n            (RequestException, SharePointAPIError)\n        ),  # Retry on these exceptions\n    )\n    def _make_request(\n        self,\n        endpoint: str,\n        method: str = \"GET\",\n        headers: dict = None,\n        json_options: dict = None,\n        data: object = None,\n        stream: bool = False,\n    ) -> requests.Response:\n        \"\"\"Execute API requests to Microsoft Graph API.\n\n        !!! note\n            If you try to upload large files sequentially,you may encounter\n            a 503 \"serviceNotAvailable\" error. To mitigate this, consider using\n            coalesce in the Acon transform specification. However, be aware that\n            increasing the number of partitions also increases the likelihood of\n            server throttling\n\n        Args:\n            endpoint: The API endpoint to call.\n            headers: A dictionary containing the necessary headers.\n            json_options: Optional; JSON data to include in the request body.\n            method: The HTTP method to use ('GET', 'POST', 'PUT', etc.).\n            data: Optional; additional data (e.g., file content) on request body.\n\n        Returns:\n            A Response object from the request library.\n\n        Raises:\n            SharePointAPIError: If there is an issue with the Sharepoint\n            API request.\n        \"\"\"\n        self._get_token()\n\n        # Required to avoid cicd issue\n        if not self.token or \"access_token\" not in self.token:\n            raise SharePointAPIError(\"Authentication token is missing or invalid.\")\n\n        try:\n            if \"access_token\" in self.token:\n                response = requests.request(\n                    method=method,\n                    url=endpoint,\n                    headers=(\n                        headers\n                        if headers\n                        else {\"Authorization\": \"Bearer \" + self.token[\"access_token\"]}\n                    ),\n                    json=json_options,\n                    data=data,\n                    stream=stream,\n                )\n                return response\n        except RequestException as error:\n            raise SharePointAPIError(f\"{error}\")\n\n    def _parse_json(self, response: requests.Response, context: str) -> Dict[str, Any]:\n        \"\"\"Parse JSON response and raise on errors.\n\n        Args:\n            response: HTTP response object.\n            context: Operation context for error 
logging.\n\n        Returns:\n            Parsed JSON as a dictionary.\n\n        Raises:\n            HTTPError: If the request fails.\n            ValueError: If the response is not valid JSON.\n        \"\"\"\n        try:\n            response.raise_for_status()\n        except requests.HTTPError as e:\n            _logger.error(\n                \"HTTP error while %s: %s | body: %s\", context, e, response.text[:200]\n            )\n            raise\n        try:\n            data = response.json()\n            if not isinstance(data, dict):\n                raise ValueError(f\"Expected dict JSON while {context}\")\n            return data\n        except (requests.JSONDecodeError, ValueError):\n            _logger.error(\n                \"Non-JSON or wrong type while %s. Body preview: %s\",\n                context,\n                response.text[:200],\n            )\n            raise\n\n    def _get_site_id(self) -> str:\n        \"\"\"Get site ID from site name, with caching.\n\n        Returns:\n            Site ID as a string.\n\n        Raises:\n            SharepointAPIError: If the request fails.\n            RuntimeError: For unexpected errors or missing site ID.\n        \"\"\"\n        if self.site_id is not None:\n            return self.site_id\n\n        endpoint = (\n            f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}\"\n            f\"/sites/{ExecEnv.ENGINE_CONFIG.sharepoint_company_domain}:/\"\n            f\"sites/{self.site_name}\"\n        )\n        try:\n            response = self._make_request(endpoint=endpoint)\n            response_data = self._parse_json(\n                response, f\"getting site id for site '{self.site_name}'\"\n            )\n\n            self.site_id = response_data.get(\"id\")\n\n            if not self.site_id:\n                raise ValueError(\n                    f\"Site ID not found for site '{self.site_name}' in the API \"\n                    f\"response: {response_data}\"\n                )\n\n            return self.site_id\n\n        except RequestException as error:\n            raise SharePointAPIError(f\"{error}\")\n        except Exception as e:\n            raise RuntimeError(\n                f\"Unexpected error while reading site ID for site '{self.site_name}':\"\n                f\"{e}\"\n            )\n\n    def _get_drive_id(self) -> str:\n        \"\"\"Get drive ID from site ID and drive name, with caching.\n\n        Returns:\n            Drive ID as a string.\n\n        Raises:\n            SharepointAPIError: If the request fails.\n            ValueError: If no drive is found.\n        \"\"\"\n        if self.drive_id is not None:\n            return str(self.drive_id)\n\n        site_id = self._get_site_id()\n\n        endpoint = (\n            f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/\"\n            f\"{self.api_version}/sites/{site_id}/drives\"\n        )\n\n        try:\n            response = self._make_request(endpoint=endpoint)\n            response_data = self._parse_json(response, \"listing drives for site\")\n\n            drives = response_data.get(\"value\", [])\n            if not drives:\n                raise ValueError(f\"No drives found for site '{self.site_id}'.\")\n\n            for drive in drives:\n                if self.drive_name.strip().lower() == drive[\"name\"].strip().lower():\n                    drive_id = drive[\"id\"]\n                    self.drive_id = drive_id\n                    return str(drive_id)\n\n            raise ValueError(\n     
           f\"Drive '{self.drive_name}' could not be found in site '{site_id}'.\"\n            )\n\n        except RequestException as error:\n            raise SharePointAPIError(f\"Request error: {error}\")\n\n    def check_if_endpoint_exists(\n        self, folder_root_path: str = None, raise_error: bool = True\n    ) -> bool:\n        \"\"\"Check if a Sharepoint drive or folder exists.\n\n        Args:\n            folder_root_path: Optional folder path to check.\n            raise_error: Raise error if the folder doesn't exist.\n\n        Returns:\n            True if the endpoint exists, False otherwise.\n\n        Raises:\n            SharepointAPIError: If the endpoint doesn't exist and raise_error is True.\n        \"\"\"\n        try:\n            site_id = self._get_site_id()\n            drive_id = self._get_drive_id()\n\n            if not folder_root_path:\n                return True\n\n            endpoint = (\n                f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/\"\n                f\"{self.api_version}/sites/{site_id}/drives/{drive_id}\"\n                f\"/root:/{folder_root_path}\"\n            )\n\n            response = self._make_request(endpoint=endpoint)\n            response.raise_for_status()\n            return True\n\n        except requests.HTTPError as error:\n            if error.response.status_code == 404:\n                _logger.warning(f\"Sharepoint path doesn't exist: {folder_root_path}\")\n                if raise_error:\n                    raise SharePointAPIError(\n                        f\"Path '{folder_root_path}' doesn't exist!\"\n                    )\n                return False\n            raise\n\n    def check_if_local_path_exists(self, local_path: str) -> None:\n        \"\"\"Verify that a local path exists.\n\n        Args:\n            local_path: Local folder where files are temporarily stored.\n\n        Raises:\n            SharePointAPIError: If the path cannot be read.\n        \"\"\"\n        try:\n            os.listdir(local_path)\n        except IOError as error:\n            raise SharePointAPIError(f\"{error}\")\n\n    def save_to_staging_area(self, sp_file: SharepointFile) -> str:\n        \"\"\"Save a Sharepoint file locally (direct write or streaming).\n\n        If the file is under the threshold and already loaded in memory, write its\n        content directly.\n        Otherwise, download the file via streaming to avoid memory overload.\n\n        Args:\n            sp_file: File metadata and content.\n\n        Returns:\n            Local file path.\n\n        Raises:\n            SharePointAPIError: On download or write failure.\n        \"\"\"\n        try:\n            if sp_file.content and sp_file.content_size < (500 * 1024 * 1024):\n                _logger.info(\n                    f\"Writing '{sp_file.file_name}' via direct write (under 500MB).\"\n                )\n                return self.write_bytes_to_local_file(sp_file)\n\n            _logger.info(\n                f\"Writing '{sp_file.file_name}' via streaming (500MB+ or content not\"\n                f\" loaded).\"\n            )\n            return self.download_file_streaming(sp_file)\n\n        except Exception as e:\n            raise SharePointAPIError(f\"Failed to write '{sp_file.file_name}': {e}\")\n\n    def download_file_streaming(self, sp_file: SharepointFile) -> str:\n        \"\"\"Download a large file from Sharepoint in chunks to a local path.\n\n        Uses the configured chunk size to avoid memory overload with large 
files.\n\n        Args:\n            sp_file: File with remote path and name.\n\n        Returns:\n            Local file path.\n\n        Raises:\n            SharePointAPIError: If the download fails.\n        \"\"\"\n        try:\n            site_id = self._get_site_id()\n            drive_id = self._get_drive_id()\n            url = (\n                f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/\"\n                f\"sites/{site_id}/drives/{drive_id}/root:/{sp_file.file_path}:/content\"\n            )\n\n            local_file_path = Path(self.local_path) / sp_file.file_name\n            local_file_path.parent.mkdir(parents=True, exist_ok=True)\n\n            with self._make_request(endpoint=url, stream=True) as response:\n                response.raise_for_status()\n                with open(local_file_path, \"wb\") as file:\n                    for chunk in response.iter_content(chunk_size=self.chunk_size):\n                        if chunk:\n                            file.write(chunk)\n\n            return str(local_file_path)\n\n        except requests.RequestException as error:\n            raise SharePointAPIError(f\"Failed to stream download: {error}\")\n\n    def write_bytes_to_local_file(self, sp_file: SharepointFile) -> str:\n        \"\"\"Write Sharepoint file content (bytes) to a local path.\n\n        Args:\n            sp_file: File with content and metadata.\n\n        Returns:\n            Local file path.\n\n        Raises:\n            ValueError: If content is missing.\n            RuntimeError: If writing to disk fails.\n        \"\"\"\n        if not sp_file.content:\n            raise ValueError(\n                f\"Cannot write file '{sp_file.file_name}': Content is empty.\"\n            )\n\n        try:\n            # Local base path (e.g., Unity Volumes, DBFS, or other mounted storage)\n            local_base_path = Path(self.local_path)\n            local_base_path.mkdir(parents=True, exist_ok=True)\n            file_path = local_base_path / sp_file.file_name\n            file_path.write_bytes(sp_file.content)\n            return str(file_path)\n        except Exception as e:\n            raise RuntimeError(\n                f\"Failed to write file '{sp_file.file_name}' to Unity Volume: {e}\"\n            )\n\n    def write_to_local_path(self, df: DataFrame) -> None:\n        \"\"\"Write a Spark DataFrame to a local path (Volume) in CSV format.\n\n        This method writes the provided Spark DataFrame to a specified local directory,\n        saving it in CSV format. 
The method renames the output file from its default\n        \"part-*\" naming convention to a specified file name.\n        The dictionary local_options enables the customisation of the write action.\n        The customizable options can be found here:\n        https://spark.apache.org/docs/3.5.1/sql-data-sources-csv.html.\n\n        Args:\n            df: The Spark DataFrame to write to the local file system.\n\n        Returns:\n            None.\n\n        Raises:\n            IOError: If there is an issue during the file writing process.\n        \"\"\"\n        try:\n            df.coalesce(1).write.mode(\"overwrite\").save(\n                path=self.local_path,\n                format=\"csv\",\n                **self.local_options if self.local_options else {},\n            )\n            self._rename_local_file(self.local_path, self.file_name)\n        except IOError as error:\n            raise SharePointAPIError(f\"{error}\")\n\n    def _rename_local_file(self, local_path: str, file_name: str) -> None:\n        \"\"\"Rename a local file that starts with 'part-' to the desired file name.\n\n        Args:\n            local_path: The directory where the file is located.\n            file_name: The new file name for the local file.\n        \"\"\"\n        files_in_dir = os.listdir(local_path)\n\n        part_file = [f for f in files_in_dir if f.startswith(\"part-\")][0]\n\n        try:\n            os.rename(\n                os.path.join(local_path, part_file), os.path.join(local_path, file_name)\n            )\n        except IOError as error:\n            raise SharePointAPIError(f\"{error}\")\n\n    def write_to_sharepoint(self) -> None:\n        \"\"\"Upload a local file to Sharepoint in chunks using the Microsoft Graph API.\n\n        This method creates an upload session and uploads a local CSV file to a\n        Sharepoint document library.\n        The file is divided into chunks (based on the `chunk_size` specified)\n        to handle large file uploads and send sequentially using the upload URL\n        returned from the Graph API.\n\n        The method uses instance attributes such as `api_domain`, `api_version`,\n        `site_name`, `drive_name`, `folder_relative_path`, and `file_name` to\n        construct the necessary API calls and upload the file to the specified\n        location in Sharepoint.\n\n        Returns:\n            None.\n\n        Raises:\n            APIError: If an error occurs during any stage of the upload\n            (e.g., failure to create upload session,issues during chunk upload).\n        \"\"\"\n        drive_id = self._get_drive_id()\n\n        if self.folder_relative_path:\n            endpoint = (\n                f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}\"\n                f\"/{self.api_version}/drives/{drive_id}/items/root:\"\n                f\"/{self.folder_relative_path}/{self.file_name}.csv:\"\n                f\"/createUploadSession\"\n            )\n        else:\n            endpoint = (\n                f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}\"\n                f\"/{self.api_version}/drives/{drive_id}/items/root:\"\n                f\"/{self.file_name}.csv:/createUploadSession\"\n            )\n\n        response = self._make_request(method=\"POST\", endpoint=endpoint)\n        response.raise_for_status()\n        upload_session = response.json()\n        upload_url = upload_session[\"uploadUrl\"]\n\n        upload_file = str(Path(self.local_path) / self.file_name)\n        stat = os.stat(upload_file)\n       
 size = stat.st_size\n\n        with open(upload_file, \"rb\") as data:\n            start = 0\n            while start < size:\n                chunk = data.read(self.chunk_size)\n                bytes_read = len(chunk)\n                upload_range = f\"bytes {start}-{start + bytes_read - 1}/{size}\"\n                headers = {\n                    \"Content-Length\": str(bytes_read),\n                    \"Content-Range\": upload_range,\n                }\n                response = self._make_request(\n                    method=\"PUT\", endpoint=upload_url, headers=headers, data=chunk\n                )\n                response.raise_for_status()\n                start += bytes_read\n\n    def delete_local_path(self) -> None:\n        \"\"\"Delete and recreate the local path used for temporary storage.\n\n        Raises:\n            SharePointAPIError: If deletion or recreation fails.\n        \"\"\"\n        try:\n            local_path = Path(self.local_path)\n            if local_path.exists():\n                shutil.rmtree(local_path)\n            local_path.mkdir(parents=True, exist_ok=True)\n        except Exception as e:\n            raise SharePointAPIError(f\"Failed to clear or recreate local path: {e}\")\n\n    @contextmanager\n    def staging_area(self) -> Generator[str, None, None]:\n        \"\"\"Provide a clean local staging folder for Sharepoint files.\n\n        Yield the local path after ensuring it's empty. Cleans up after use.\n\n        Yield:\n            Path to the staging folder as a string.\n        \"\"\"\n        self.delete_local_path()\n        try:\n            yield self.local_path\n        finally:\n            try:\n                self.delete_local_path()\n            except Exception as e:\n                _logger.warning(f\"Failed to clean up local path: {e}\")\n\n    def list_items_in_path(self, path: str) -> list[Any]:\n        \"\"\"List items (files/folders) at a Sharepoint path.\n\n        Args:\n            path: Relative folder or file path.\n\n        Returns:\n            List of items; files include @microsoft.graph.downloadUrl.\n\n        Raises:\n            ValueError: If the path is invalid or not found.\n        \"\"\"\n        site_id = self._get_site_id()\n        drive_id = self._get_drive_id()\n\n        path = path.strip(\"/\")\n        if not path:\n            resp = self._make_request(\n                f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/\"\n                f\"sites/{site_id}/drives/{drive_id}/root/children\"\n            )\n            data = self._parse_json(resp, \"listing root children\")\n            return cast(List[dict[str, Any]], data.get(\"value\", []))\n\n        path_parts = path.split(\"/\")\n\n        # start from root children\n        resp = self._make_request(\n            f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/sites/\"\n            f\"{site_id}/drives/{drive_id}/root/children\"\n        )\n        data = self._parse_json(resp, \"listing root children\")\n        items = cast(List[dict[str, Any]], data.get(\"value\", []))\n\n        for component in path_parts:\n            current_item = next(\n                (item for item in items if item.get(\"name\") == component), None\n            )\n\n            if not current_item:\n                raise ValueError(f\"Path component '{component}' not found in '{path}'.\")\n\n            if \"folder\" in current_item:\n                # descend into folder\n                resp = self._make_request(\n   
                 f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/\"\n                    f\"sites/{site_id}/drives/{drive_id}/items/\"\n                    f\"{current_item['id']}/children\"\n                )\n                data = self._parse_json(resp, f\"listing children for '{component}'\")\n                items = cast(List[dict[str, Any]], data.get(\"value\", []))\n            else:\n                # it's a file; ensure we have downloadUrl\n                if \"@microsoft.graph.downloadUrl\" not in current_item:\n                    resp = self._make_request(\n                        f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/\"\n                        f\"{self.api_version}/sites/{site_id}/drives/{drive_id}/\"\n                        f\"items/{current_item['id']}\"\n                    )\n                    current_item = self._parse_json(\n                        resp, f\"fetching file metadata for item id {current_item['id']}\"\n                    )\n                return [current_item]\n\n        return items\n\n    def get_file_metadata(self, file_path: str) -> SharepointFile:\n        \"\"\"Fetch file metadata and content from Sharepoint.\n\n        Args:\n            file_path: Full Sharepoint path (e.g., 'folder/file.csv').\n\n        Returns:\n            SharepointFile with metadata and bytes content.\n\n        Raises:\n            ValueError: If required metadata is missing or path is invalid.\n            requests.HTTPError: On HTTP errors during retrieval.\n        \"\"\"\n        site_id = self._get_site_id()\n        drive_id = self._get_drive_id()\n\n        file_metadata_url = (\n            f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/\"\n            f\"{self.api_version}/sites/{site_id}/drives/{drive_id}/root:/{file_path}\"\n        )\n\n        # Get metadata\n        metadata_response = self._make_request(endpoint=file_metadata_url, method=\"GET\")\n        metadata = self._parse_json(\n            metadata_response,\n            f\"fetching metadata for '{file_path}'\",\n        )\n\n        file_name = metadata.get(\"name\")\n        time_created = metadata.get(\"createdDateTime\", \"\")\n        time_modified = metadata.get(\"lastModifiedDateTime\", \"\")\n        download_url = metadata.get(\"@microsoft.graph.downloadUrl\")\n\n        if not file_name or not download_url:\n            raise ValueError(\n                f\"Missing required metadata for '{file_path}': \"\n                f\"name={file_name!r}, \"\n                f\"downloadUrl={'present' if download_url else 'absent'}\"\n            )\n\n        # Download file content (bytes)\n        content_response = self._make_request(endpoint=download_url, method=\"GET\")\n        content_response.raise_for_status()\n        file_content = content_response.content\n\n        if \"/\" not in file_path:\n            raise ValueError(\n                f\"Invalid file path: '{file_path}'. 
Expected a folder/file structure.\"\n            )\n        folder = file_path.rsplit(\"/\", 1)[0]\n\n        return SharepointFile(\n            file_name=file_name,\n            time_created=time_created,\n            time_modified=time_modified,\n            content=file_content,\n            _folder=folder,\n        )\n\n    def archive_sharepoint_file(\n        self, sp_file: SharepointFile, to_path: str | None, *, move_enabled: bool = True\n    ) -> None:\n        \"\"\"Rename (timestamp) and optionally move a Sharepoint file.\n\n        Args:\n            sp_file: File to archive.\n            to_path: Destination folder (if moving).\n            move_enabled: Whether to move after rename.\n\n        Raises:\n            SharePointAPIError: If the request fails.\n        \"\"\"\n        # If already archived (renamed+moved before), don't repeat\n        if getattr(sp_file, \"_already_archived\", False) and move_enabled and to_path:\n            _logger.info(\n                \"Skipping archive: file already archived -> %s\", sp_file.file_name\n            )\n            return\n\n        try:\n            if not getattr(sp_file, \"skip_rename\", False):\n                new_file_name = self._rename_sharepoint_file(sp_file)\n                sp_file.file_name = new_file_name\n                sp_file.skip_rename = True\n\n            if not move_enabled or not to_path:\n                _logger.info(\n                    \"\"\"Archiving disabled or no target folder;\n                     Renamed only and left in place: '%s'.\"\"\",\n                    sp_file.file_path,\n                )\n                return\n\n            self._move_file_in_sharepoint(sp_file, to_path)\n            sp_file._already_archived = True\n            _logger.info(\"Archived '%s' to '%s'.\", sp_file.file_name, to_path)\n\n        except requests.RequestException as e:\n            _logger.error(\n                \"Request failed while archiving '%s': %s\", sp_file.file_name, e\n            )\n            raise SharePointAPIError(f\"Request failed: {e}\")\n\n    def _rename_sharepoint_file(self, sp_file: SharepointFile) -> str:\n        \"\"\"Prefix file name with a timestamp (skip if already renamed).\n\n        Args:\n            sp_file: File to rename.\n\n        Returns:\n            New file name.\n\n        Raises:\n            SharePointAPIError: If the rename request fails.\n        \"\"\"\n        try:\n            if getattr(sp_file, \"skip_rename\", False):\n                _logger.info(\n                    f\"Skipping rename for already-prefixed file: {sp_file.file_name}\"\n                )\n                return sp_file.file_name\n\n            _logger.info(f\"Renaming file at '{sp_file.file_path}'.\")\n\n            site_id = self._get_site_id()\n            drive_id = self._get_drive_id()\n            current_date_formatted = datetime.now().strftime(\"%Y%m%d%H%M%S\")\n            new_file_name = f\"{current_date_formatted}_{sp_file.file_name}\"\n\n            url_get_file = (\n                f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/\"\n                f\"sites/{site_id}/drives/{drive_id}/root:/{sp_file.file_path}\"\n            )\n            resp = self._make_request(endpoint=url_get_file, method=\"GET\")\n            file_info = self._parse_json(\n                resp, f\"fetching file info at '{sp_file.file_path}'\"\n            )\n            file_id = file_info.get(\"id\")\n            if not file_id:\n                raise ValueError(\n                   
 f\"File '{sp_file.file_name}' not found in '{sp_file.file_path}'.\"\n                )\n\n            url_rename_file = (\n                f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/\"\n                f\"sites/{site_id}/drives/{drive_id}/items/{file_id}\"\n            )\n            rename_payload = {\"name\": new_file_name}\n            rename_resp = self._make_request(\n                endpoint=url_rename_file, method=\"PATCH\", json_options=rename_payload\n            )\n            rename_resp.raise_for_status()\n\n            _logger.info(f\"File '{sp_file.file_name}' renamed to '{new_file_name}'.\")\n            sp_file.file_name = new_file_name\n            return new_file_name\n\n        except requests.RequestException as e:\n            _logger.error(\n                f\"Request failed while renaming file '{sp_file.file_name}': {e}\"\n            )\n            raise SharePointAPIError(f\"Request failed: {e}\")\n\n    def _move_file_in_sharepoint(self, sp_file: SharepointFile, to_path: str) -> None:\n        \"\"\"Move a file to another folder in Sharepoint.\n\n        Args:\n            sp_file: File to move.\n            to_path: Destination path.\n\n        Raises:\n            ValueError: If the file ID cannot be resolved.\n            SharePointAPIError: If the move request fails.\n        \"\"\"\n        try:\n            _logger.info(\n                f\"Moving file '{sp_file.file_name}' from '{sp_file.file_path}' to \"\n                f\"'{to_path}'.\"\n            )\n\n            site_id = self._get_site_id()\n            drive_id = self._get_drive_id()\n\n            if not self.check_if_endpoint_exists(\n                folder_root_path=to_path, raise_error=False\n            ):\n                self._create_folder_in_sharepoint(to_path)\n                # Create the folder if it doesn't exist; raise_error = false so it\n                # doesn't throw error\n                _logger.info(f\"Created archive folder: {to_path}\")\n\n            url_get_file = (\n                f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/\"\n                f\"sites/{site_id}/drives/{drive_id}/root:/{sp_file.file_path}\"\n            )\n\n            response = self._make_request(endpoint=url_get_file, method=\"GET\")\n            file_info = self._parse_json(\n                response,\n                f\"getting file id for move '{sp_file.file_path}'\",\n            )\n\n            file_id = file_info.get(\"id\")\n\n            if not file_id:\n                raise ValueError(\n                    f\"File '{sp_file.file_name}' not found in '{sp_file.file_path}'.\"\n                )\n\n            url_move_file = (\n                f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/\"\n                f\"sites/{site_id}/drives/{drive_id}/items/{file_id}\"\n            )\n\n            new_parent_reference = {\n                \"parentReference\": {\"path\": f\"/drive/root:/{to_path}\"},\n                \"name\": sp_file.file_name,\n            }\n\n            response = self._make_request(\n                endpoint=url_move_file,\n                method=\"PATCH\",\n                json_options=new_parent_reference,\n            )\n            response.raise_for_status()\n\n            _logger.info(\n                f\"File '{sp_file.file_name}' successfully moved to '{to_path}'.\"\n            )\n\n        except requests.RequestException as e:\n            _logger.error(\n                f\"Request failed while 
moving file '{sp_file.file_name}': {e}\"\n            )\n            raise SharePointAPIError(f\"Request failed: {e}\")\n\n    def _create_folder_in_sharepoint(self, folder_path: str) -> None:\n        \"\"\"Create the final folder in a Sharepoint path.\n\n        Args:\n            folder_path: Full folder path to create.\n\n        Raises:\n            SharePointAPIError: If folder creation fails.\n        \"\"\"\n        try:\n            site_id = self._get_site_id()\n            drive_id = self._get_drive_id()\n\n            parent_path, folder_name = (\n                folder_path.rsplit(\"/\", 1) if \"/\" in folder_path else (\"\", folder_path)\n            )\n            parent_path = parent_path.strip(\"/\")  # Clean path just in case\n\n            _logger.info(\n                f\"Creating folder '{folder_name}' inside '{parent_path or 'root'}'\"\n            )\n\n            if parent_path:\n                endpoint = (\n                    f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/\"\n                    f\"sites/{site_id}/drives/{drive_id}/root:/{parent_path}:/children\"\n                )\n            else:\n                endpoint = (\n                    f\"{ExecEnv.ENGINE_CONFIG.sharepoint_api_domain}/{self.api_version}/\"\n                    f\"sites/{site_id}/drives/{drive_id}/root/children\"\n                )\n\n            folder_metadata = {\"name\": folder_name, \"folder\": {}}\n\n            response = self._make_request(\n                endpoint=endpoint, method=\"POST\", json_options=folder_metadata\n            )\n            response.raise_for_status()\n\n            _logger.info(f\"Folder '{folder_path}' created successfully.\")\n\n        except requests.RequestException as e:\n            _logger.error(f\"Failed to create folder '{folder_path}': {e}\")\n            raise SharePointAPIError(f\"Error creating folder '{folder_path}': {e}\")\n"
  },
  {
    "path": "lakehouse_engine/utils/spark_utils.py",
    "content": "\"\"\"Utilities to facilitate spark dataframe management.\"\"\"\n\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\n\n\nclass SparkUtils(object):\n    \"\"\"Spark utils that help retrieve and manage dataframes.\"\"\"\n\n    @staticmethod\n    def create_temp_view(\n        df: DataFrame, view_name: str, return_prefix: bool = False\n    ) -> None | str:\n        \"\"\"Create a temporary view from a dataframe.\n\n        If the execution environment is serverless, it creates a temporary view,\n        otherwise it creates a global temporary view.\n        Serverless environments don't support global temporary views, so we need to\n        create a temporary view in that case, but it still gets accessible from other\n        queries in the same session.\n        In non-serverless environments, we create a global temporary view to make\n        sure it is accessible from other sessions as well.\n\n        Args:\n            df: dataframe to create the view from.\n            view_name: name of the view to create.\n            return_prefix: whether to return the prefix to use in queries\n            for this view or not.\n\n        Returns:\n            None or the prefix to use in queries for this view, depending on the\n            value of return_prefix.\n        \"\"\"\n        if ExecEnv.IS_SERVERLESS:\n            df.createOrReplaceTempView(view_name)\n            prefix = \"\"\n        else:\n            df.createOrReplaceGlobalTempView(view_name)\n            prefix = \"global_temp.\"\n        if return_prefix:\n            return prefix\n        return None\n"
  },
  {
    "path": "lakehouse_engine/utils/sql_parser_utils.py",
    "content": "\"\"\"Module to parse sql files.\"\"\"\n\nfrom lakehouse_engine.core.definitions import SQLParser\n\n\nclass SQLParserUtils(object):\n    \"\"\"Parser utilities class.\"\"\"\n\n    def split_sql_commands(\n        self,\n        sql_commands: str,\n        delimiter: str,\n        advanced_parser: bool,\n    ) -> list[str]:\n        \"\"\"Read the sql commands of a file to choose how to split them.\n\n        Args:\n            sql_commands: commands to be split.\n            delimiter: delimiter to split the sql commands.\n            advanced_parser: boolean to define if we need to use a complex split.\n\n        Returns:\n            List with the sql commands.\n        \"\"\"\n        if advanced_parser:\n            self.sql_commands: str = sql_commands\n            self.delimiter: str = delimiter\n            self.separated_sql_commands: list[str] = []\n            self.split_index: int = 0\n            return self._split_sql_commands()\n        else:\n            return sql_commands.split(delimiter)\n\n    def _split_sql_commands(self) -> list[str]:\n        \"\"\"Read the sql commands of a file to split them based on a delimiter.\n\n        Returns:\n            List with the sql commands.\n        \"\"\"\n        single_quotes: int = 0\n        double_quotes: int = 0\n        one_line_comment: int = 0\n        multiple_line_comment: int = 0\n\n        for index, char in enumerate(self.sql_commands):\n            if char == SQLParser.SINGLE_QUOTES.value and self._character_validation(\n                value=[double_quotes, one_line_comment, multiple_line_comment]\n            ):\n                single_quotes = self._update_value(\n                    value=single_quotes,\n                    condition=self._character_validation(\n                        value=self._get_substring(first_char=index - 1, last_char=index)\n                    ),\n                    operation=\"+-\",\n                )\n            elif char == SQLParser.DOUBLE_QUOTES.value and self._character_validation(\n                value=[single_quotes, one_line_comment, multiple_line_comment]\n            ):\n                double_quotes = self._update_value(\n                    value=double_quotes,\n                    condition=self._character_validation(\n                        value=self._get_substring(first_char=index - 1, last_char=index)\n                    ),\n                    operation=\"+-\",\n                )\n            elif char == SQLParser.SINGLE_TRACE.value and self._character_validation(\n                value=[double_quotes, single_quotes, multiple_line_comment]\n            ):\n                one_line_comment = self._update_value(\n                    value=one_line_comment,\n                    condition=(\n                        self._get_substring(first_char=index, last_char=index + 2)\n                        == SQLParser.DOUBLE_TRACES.value\n                    ),\n                    operation=\"+\",\n                )\n            elif (\n                char == SQLParser.SLASH.value or char == SQLParser.STAR.value\n            ) and self._character_validation(\n                value=[double_quotes, single_quotes, one_line_comment]\n            ):\n                multiple_line_comment = self._update_value(\n                    value=multiple_line_comment,\n                    condition=self._get_substring(first_char=index, last_char=index + 2)\n                    in SQLParser.MULTIPLE_LINE_COMMENT.value,\n                    operation=\"+-\",\n              
  )\n\n            one_line_comment = self._update_value(\n                value=one_line_comment,\n                condition=char == SQLParser.PARAGRAPH.value,\n                operation=\"-\",\n            )\n\n            self._validate_command_is_closed(\n                index=index,\n                dependencies=self._character_validation(\n                    value=[\n                        single_quotes,\n                        double_quotes,\n                        one_line_comment,\n                        multiple_line_comment,\n                    ]\n                ),\n            )\n\n        return self.separated_sql_commands\n\n    def _get_substring(self, first_char: int = None, last_char: int = None) -> str:\n        \"\"\"Get the substring based on the indexes passed as arguments.\n\n        Args:\n            first_char: represents the first index of the string.\n            last_char: represents the last index of the string.\n\n        Returns:\n            The substring based on the indexes passed as arguments.\n        \"\"\"\n        return self.sql_commands[first_char:last_char]\n\n    def _validate_command_is_closed(self, index: int, dependencies: int) -> None:\n        \"\"\"Validate based on the delimiter if we have the closing of a sql command.\n\n        Args:\n            index: index of the character in a string.\n            dependencies: represents an int to validate if we are outside of quotes,...\n        \"\"\"\n        if (\n            self._get_substring(first_char=index, last_char=index + len(self.delimiter))\n            == self.delimiter\n            and dependencies\n        ):\n            self._add_new_command(\n                sql_command=self._get_substring(\n                    first_char=self.split_index, last_char=index\n                )\n            )\n            self.split_index = index + len(self.delimiter)\n\n        if self._get_substring(\n            first_char=index, last_char=index + len(self.delimiter)\n        ) != self.delimiter and index + len(self.delimiter) == len(self.sql_commands):\n            self._add_new_command(\n                sql_command=self._get_substring(\n                    first_char=self.split_index, last_char=len(self.sql_commands)\n                )\n            )\n\n    def _character_validation(self, value: str | list) -> bool:\n        \"\"\"Validate if character is the opening/closing/inside of a comment.\n\n        Args:\n            value: represent the value associated to different validated\n            types or a character to be analyzed.\n\n        Returns:\n            Boolean that indicates if character found is the opening\n            or closing of a comment, is inside of quotes, comments,...\n        \"\"\"\n        if value.__class__.__name__ == \"list\":\n            return sum(value) == 0\n        else:\n            return value != SQLParser.BACKSLASH.value\n\n    def _add_new_command(self, sql_command: str) -> None:\n        \"\"\"Add a newly found command to list of sql commands to execute.\n\n        Args:\n            sql_command: command to be added to list.\n        \"\"\"\n        self.separated_sql_commands.append(str(sql_command))\n\n    def _update_value(self, value: int, operation: str, condition: bool = False) -> int:\n        \"\"\"Update value associated to different types of comments or quotes.\n\n        Args:\n            value: value to be updated\n            operation: operation that we want to perform on the value.\n            condition: validate if we have a 
condition associated to the value.\n\n        Returns:\n            An integer that represents the updated value.\n        \"\"\"\n        if condition and operation == \"+-\":\n            value = value + 1 if value == 0 else value - 1\n        elif condition and operation == \"+\":\n            value = value + 1 if value == 0 else value\n        elif condition and operation == \"-\":\n            value = value - 1 if value == 1 else value\n\n        return value\n"
  },
  {
    "path": "lakehouse_engine/utils/storage/__init__.py",
    "content": "\"\"\"Utilities to interact with storage systems.\"\"\"\n"
  },
  {
    "path": "lakehouse_engine/utils/storage/dbfs_storage.py",
    "content": "\"\"\"Module to represent a DBFS file storage system.\"\"\"\n\nfrom typing import Any\nfrom urllib.parse import ParseResult, urlunparse\n\nfrom lakehouse_engine.utils.databricks_utils import DatabricksUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.storage.file_storage import FileStorage\n\n\nclass DBFSStorage(FileStorage):\n    \"\"\"Class to represent a DBFS file storage system.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n    _MAX_INT = 2147483647\n\n    @classmethod\n    def get_file_payload(cls, url: ParseResult) -> Any:\n        \"\"\"Get the content of a file.\n\n        Args:\n            url: url of the file.\n\n        Returns:\n            File payload/content.\n        \"\"\"\n        from lakehouse_engine.core.exec_env import ExecEnv\n\n        str_url = urlunparse(url)\n        cls._LOGGER.info(f\"Trying with dbfs_storage: Reading from file: {str_url}\")\n        return DatabricksUtils.get_db_utils(ExecEnv.SESSION).fs.head(\n            str_url, cls._MAX_INT\n        )\n\n    @classmethod\n    def write_payload_to_file(cls, url: ParseResult, content: str) -> None:\n        \"\"\"Write payload into a file.\n\n        Args:\n            url: url of the file.\n            content: content to write into the file.\n        \"\"\"\n        from lakehouse_engine.core.exec_env import ExecEnv\n\n        str_url = urlunparse(url)\n        cls._LOGGER.info(f\"Trying with dbfs_storage: Writing into file: {str_url}\")\n        DatabricksUtils.get_db_utils(ExecEnv.SESSION).fs.put(str_url, content, True)\n"
  },
  {
    "path": "lakehouse_engine/utils/storage/file_storage.py",
    "content": "\"\"\"Module for abstract representation of a storage system holding files.\"\"\"\n\nfrom abc import ABC, abstractmethod\nfrom typing import Any\nfrom urllib.parse import ParseResult\n\n\nclass FileStorage(ABC):\n    \"\"\"Abstract file storage class.\"\"\"\n\n    @classmethod\n    @abstractmethod\n    def get_file_payload(cls, url: ParseResult) -> Any:\n        \"\"\"Get the payload of a file.\n\n        Args:\n            url: url of the file.\n\n        Returns:\n            File payload/content.\n        \"\"\"\n        pass\n\n    @classmethod\n    @abstractmethod\n    def write_payload_to_file(cls, url: ParseResult, content: str) -> None:\n        \"\"\"Write payload into a file.\n\n        Args:\n            url: url of the file.\n            content: content to write into the file.\n        \"\"\"\n        pass\n"
  },
  {
    "path": "lakehouse_engine/utils/storage/file_storage_functions.py",
    "content": "\"\"\"Module for common file storage functions.\"\"\"\n\nimport json\nfrom abc import ABC\nfrom typing import Any\nfrom urllib.parse import ParseResult, urlparse\n\nimport boto3\n\nfrom lakehouse_engine.utils.storage.dbfs_storage import DBFSStorage\nfrom lakehouse_engine.utils.storage.local_fs_storage import LocalFSStorage\nfrom lakehouse_engine.utils.storage.s3_storage import S3Storage\n\n\nclass FileStorageFunctions(ABC):  # noqa: B024\n    \"\"\"Class for common file storage functions.\"\"\"\n\n    @classmethod\n    def read_json(cls, path: str, disable_dbfs_retry: bool = False) -> Any:\n        \"\"\"Read a json file.\n\n        The file should be in a supported file system (e.g., s3, dbfs or\n        local filesystem).\n\n        Args:\n            path: path to the json file.\n            disable_dbfs_retry: optional flag to disable file storage dbfs.\n\n        Returns:\n            Dict with json file content.\n        \"\"\"\n        url = urlparse(path, allow_fragments=False)\n        if disable_dbfs_retry:\n            return json.load(S3Storage.get_file_payload(url))\n        elif url.scheme == \"s3\" and cls.is_boto3_configured():\n            try:\n                return json.load(S3Storage.get_file_payload(url))\n            except Exception:\n                return json.loads(DBFSStorage.get_file_payload(url))\n        elif url.scheme == \"file\":\n            return json.load(LocalFSStorage.get_file_payload(url))\n        elif url.scheme in [\"dbfs\", \"s3\"]:\n            return json.loads(DBFSStorage.get_file_payload(url))\n        else:\n            raise NotImplementedError(\n                f\"File storage protocol not implemented for {path}.\"\n            )\n\n    @classmethod\n    def read_sql(cls, path: str, disable_dbfs_retry: bool = False) -> Any:\n        \"\"\"Read a sql file.\n\n        The file should be in a supported file system (e.g., s3, dbfs or local\n        filesystem).\n\n        Args:\n            path: path to the sql file.\n            disable_dbfs_retry: optional flag to disable file storage dbfs.\n\n        Returns:\n            Content of the SQL file.\n        \"\"\"\n        url = urlparse(path, allow_fragments=False)\n        if disable_dbfs_retry:\n            return S3Storage.get_file_payload(url).read().decode(\"utf-8\")\n        elif url.scheme == \"s3\" and cls.is_boto3_configured():\n            try:\n                return S3Storage.get_file_payload(url).read().decode(\"utf-8\")\n            except Exception:\n                return DBFSStorage.get_file_payload(url)\n        elif url.scheme == \"file\":\n            return LocalFSStorage.get_file_payload(url).read()\n        elif url.scheme in [\"dbfs\", \"s3\"]:\n            return DBFSStorage.get_file_payload(url)\n        else:\n            raise NotImplementedError(\n                f\"Object storage protocol not implemented for {path}.\"\n            )\n\n    @classmethod\n    def write_payload(\n        cls, path: str, url: ParseResult, content: str, disable_dbfs_retry: bool = False\n    ) -> None:\n        \"\"\"Write payload into a file.\n\n        The file should be in a supported file system (e.g., s3, dbfs or local\n        filesystem).\n\n        Args:\n            path: path to validate the file type.\n            url: url of the file.\n            content: content to write into the file.\n            disable_dbfs_retry: optional flag to disable file storage dbfs.\n        \"\"\"\n        if disable_dbfs_retry:\n            
S3Storage.write_payload_to_file(url, content)\n        elif path.startswith(\"s3://\") and cls.is_boto3_configured():\n            try:\n                S3Storage.write_payload_to_file(url, content)\n            except Exception:\n                DBFSStorage.write_payload_to_file(url, content)\n        elif path.startswith((\"s3://\", \"dbfs:/\")):\n            DBFSStorage.write_payload_to_file(url, content)\n        else:\n            LocalFSStorage.write_payload_to_file(url, content)\n\n    @staticmethod\n    def is_boto3_configured() -> bool:\n        \"\"\"Check if boto3 is able to locate credentials and properly configured.\n\n        If boto3 is not properly configured, we might want to try a different reader.\n        \"\"\"\n        try:\n            boto3.client(\"sts\").get_caller_identity()\n            return True\n        except Exception:\n            return False\n"
  },
  {
    "path": "lakehouse_engine/utils/storage/local_fs_storage.py",
    "content": "\"\"\"Module to represent a local file storage system.\"\"\"\n\nimport os\nfrom typing import TextIO\nfrom urllib.parse import ParseResult\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.storage.file_storage import FileStorage\n\n\nclass LocalFSStorage(FileStorage):\n    \"\"\"Class to represent a local file storage system.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def get_file_payload(cls, url: ParseResult) -> TextIO:\n        \"\"\"Get the payload of a file.\n\n        Args:\n            url: url of the file.\n\n        Returns:\n            file payload/content.\n        \"\"\"\n        cls._LOGGER.info(f\"Reading from file: {url.scheme}:{url.netloc}/{url.path}\")\n        return open(f\"{url.netloc}/{url.path}\", \"r\")\n\n    @classmethod\n    def write_payload_to_file(cls, url: ParseResult, content: str) -> None:\n        \"\"\"Write payload into a file.\n\n        Args:\n            url: url of the file.\n            content: content to write into the file.\n        \"\"\"\n        cls._LOGGER.info(f\"Writing into file: {url.scheme}:{url.netloc}/{url.path}\")\n        os.makedirs(os.path.dirname(f\"{url.netloc}/{url.path}\"), exist_ok=True)\n        with open(f\"{url.netloc}/{url.path}\", \"w\") as file:\n            file.write(content)\n"
  },
  {
    "path": "lakehouse_engine/utils/storage/s3_storage.py",
    "content": "\"\"\"Module to represent a s3 file storage system.\"\"\"\n\nfrom typing import Any\nfrom urllib.parse import ParseResult\n\nimport boto3\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.storage.file_storage import FileStorage\n\n\nclass S3Storage(FileStorage):\n    \"\"\"Class to represent a s3 file storage system.\"\"\"\n\n    _LOGGER = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def get_file_payload(cls, url: ParseResult) -> Any:\n        \"\"\"Get the payload of a config file.\n\n        Args:\n            url: url of the file.\n\n        Returns:\n            File payload/content.\n        \"\"\"\n        s3 = boto3.resource(\"s3\")\n        obj = s3.Object(url.netloc, url.path.lstrip(\"/\"))\n        cls._LOGGER.info(\n            f\"Trying with s3_storage: \"\n            f\"Reading from file: {url.scheme}://{url.netloc}{url.path}\"\n        )\n        return obj.get()[\"Body\"]\n\n    @classmethod\n    def write_payload_to_file(cls, url: ParseResult, content: str) -> None:\n        \"\"\"Write payload into a file.\n\n        Args:\n            url: url of the file.\n            content: content to write into the file.\n        \"\"\"\n        s3 = boto3.resource(\"s3\")\n        obj = s3.Object(url.netloc, url.path.lstrip(\"/\"))\n        cls._LOGGER.info(\n            f\"Trying with s3_storage: \"\n            f\"Writing into file: {url.scheme}://{url.netloc}{url.path}\"\n        )\n        obj.put(Body=content)\n"
  },
  {
    "path": "lakehouse_engine_usage/__init__.py",
    "content": "\"\"\"\n# How to use the Lakehouse Engine?\nLakehouse engine usage examples for all the algorithms and other core functionalities.\n\n- [Data Loader](lakehouse_engine_usage/data_loader.html)\n- [Data Quality](lakehouse_engine_usage/data_quality.html)\n- [Reconciliator](lakehouse_engine_usage/reconciliator.html)\n- [Sensors - Sensor & Heartbeat Sensor](lakehouse_engine_usage/sensors.html)\n- [GAB](lakehouse_engine_usage/gab.html)\n\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/__init__.py",
    "content": "\"\"\"\n.. include::data_loader.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/append_load_from_jdbc_with_permissive_mode/__init__.py",
    "content": "\"\"\"\n.. include::append_load_from_jdbc_with_permissive_mode.md\n\"\"\""
  },
  {
    "path": "lakehouse_engine_usage/data_loader/append_load_from_jdbc_with_permissive_mode/append_load_from_jdbc_with_permissive_mode.md",
    "content": "# Append Load from JDBC with PERMISSIVE mode (default)\n\nThis scenario is an append load from a JDBC source (e.g., SAP BW, Oracle Database, SQL Server Database...).\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"jdbc\",\n      \"jdbc_args\": {\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/in/feature/append_load/jdbc_permissive/tests.db\",\n        \"table\": \"jdbc_permissive\",\n        \"properties\": {\n          \"driver\": \"org.sqlite.JDBC\"\n        }\n      },\n      \"options\": {\n        \"numPartitions\": 1\n      }\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"db_table\": \"test_db.jdbc_permissive_table\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_date\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"appended_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"date\",\n            \"increment_df\": \"max_sales_bronze_date\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"appended_sales\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.jdbc_permissive_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/append_load/jdbc_permissive/data\"\n    }\n  ]\n}\n\nload_data(acon=acon)\n```\n\n##### Relevant notes\n\n- The **ReadMode** is **PERMISSIVE** in this scenario, which **is the default in Spark**, hence we **don't need to specify it**. Permissive means don't enforce any schema on the input data. \n- From a JDBC source the ReadType needs to be \"batch\" always as \"streaming\" is not available for a JDBC source.\n- In this scenario we do an append load by getting the max date (transformer_spec [\"get_max_value\"](../../../reference/packages/transformers/aggregators.md#packages.transformers.aggregators.Aggregators.get_max_value)) on bronze and use that date to filter the source to only get data with a date greater than that max date on bronze (transformer_spec [\"incremental_filter\"](../../../reference/packages/transformers/filters.md#packages.transformers.filters.Filters.incremental_filter)). **That is the standard way we do incremental batch loads in the lakehouse engine.** For streaming incremental loads we rely on Spark Streaming checkpoint feature [(check a streaming append load ACON example)](../streaming_append_load_with_terminator/streaming_append_load_with_terminator.md)."
  },
  {
    "path": "lakehouse_engine_usage/data_loader/append_load_with_failfast/__init__.py",
    "content": "\"\"\"\n.. include::append_load_with_failfast.md\n\"\"\""
  },
  {
    "path": "lakehouse_engine_usage/data_loader/append_load_with_failfast/append_load_with_failfast.md",
    "content": "# Append Load with FAILFAST\n\nThis scenario is an append load enforcing the schema (using the schema of the target table to enforce the schema of the source, i.e., the schema of the source needs to exactly match the schema of the target table) and FAILFASTING if the schema of the input data does not match the one we specified.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"enforce_schema_from_table\": \"test_db.failfast_table\",\n      \"options\": {\n        \"header\": True,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/append_load/failfast/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"db_table\": \"test_db.failfast_table\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_date\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"appended_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"date\",\n            \"increment_df\": \"max_sales_bronze_date\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"appended_sales\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.failfast_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/append_load/failfast/data\"\n    }\n  ]\n}\n\nload_data(acon=acon)\n```\n##### Relevant notes\n\n- The **ReadMode** is **FAILFAST** in this scenario, i.e., fail the algorithm if the schema of the input data does not match the one we specified via schema_path, read_schema_from_table or schema Input_specs variables.\n- In this scenario we do an append load by getting the max date (transformer_spec [\"get_max_value\"](../../../reference/packages/transformers/aggregators.md#packages.transformers.aggregators.Aggregators.get_max_value)) on bronze and use that date to filter the source to only get data with a date greater than that max date on bronze (transformer_spec [\"incremental_filter\"](../../../reference/packages/transformers/filters.md#packages.transformers.filters.Filters.incremental_filter)). **That is the standard way we do incremental batch loads in the lakehouse engine.** For streaming incremental loads we rely on Spark Streaming checkpoint feature [(check a streaming append load ACON example)](../streaming_append_load_with_terminator/streaming_append_load_with_terminator.md)."
  },
  {
    "path": "lakehouse_engine_usage/data_loader/batch_delta_load_init_delta_backfill_with_merge/__init__.py",
    "content": "\"\"\"\n.. include::batch_delta_load_init_delta_backfill_with_merge.md\n\"\"\""
  },
  {
    "path": "lakehouse_engine_usage/data_loader/batch_delta_load_init_delta_backfill_with_merge/batch_delta_load_init_delta_backfill_with_merge.md",
    "content": "# Batch Delta Load Init, Delta and Backfill with Merge\n\nThis scenario illustrates the process of implementing a delta load algorithm by first using an ACON to perform an initial load, then another one to perform the regular deltas that will be triggered on a recurrent basis, and finally an ACON for backfilling specific parcels if ever needed.\n\n## Init Load\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": True,\n        \"delimiter\": \"|\",\n        \"inferSchema\": True\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/backfill/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ]\n}\n\nload_data(acon=acon)\n```\n\n##### Relevant Notes\n\n- We can see that even though this is an init load we still have chosen to condense the records through our [\"condense_record_mode_cdc\"](../../../reference/packages/transformers/condensers.md#packages.transformers.condensers.Condensers.condense_record_mode_cdc) transformer. This is a condensation step capable of handling SAP BW style changelogs based on actrequest_timestamps, datapakid, record_mode, etc...\n- In the init load we actually did a merge in this case because we wanted to test locally if a merge with an empty target table works, but you don't have to do it, as an init load usually can be just a full load. 
If a merge of init data with an empty table has any performance implications when compared to a regular insert remains to be tested, but we don't have any reason to recommend a merge over an insert for an init load, and as said, this was done solely for local testing purposes, you can just use `write_type: \"overwrite\"`\n\n## Delta Load\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": True,\n        \"delimiter\": \"|\",\n        \"inferSchema\": True\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/backfill/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ]\n}\n\nload_data(acon=acon)\n```\n\n##### Relevant Notes\n\n- The merge predicate and the insert, delete or update predicates should reflect the reality of your data, and it's up to each data product to figure out which predicates better match their reality:\n    - The merge predicate usually involves making sure that the \"primary key\" for your data matches.\n    !!! 
note \"**Performance Tip!!!**\"\n        Ideally, in order to get a performance boost in your merges, you should also place a filter in your merge predicate (e.g., certain technical or business date in the target table >= x days ago), based on the assumption that the rows in that specified interval will never change in the future. This can drastically decrease the merge times of big tables.\n\n    - The insert, delete and update predicates will always depend on the structure of your changelog, and also how you expect your updates to arrive (e.g., in certain data products you know that you will never get out of order data or late arriving data, while in other you can never ensure that). These predicates should reflect that in order to prevent you from doing unwanted changes to the target delta lake table.\n        - For example, in this scenario, we delete rows that have the R, D or X record_mode values, because we know that if after condensing the rows that is the latest status of that row from the changelog, they should be deleted, and we never insert rows with those status (**note**: we use this guardrail in the insert to prevent out of order changes, which is likely not the case in SAP BW).\n        - Because the `insert_predicate` is fully optional, in your scenario you may not require that.\n    - In this scenario, we don't pass an `update_predicate` in the ACON, because both `insert_predicate` and update_predicate are fully optional, i.e., if you don't pass them the algorithm will update any data that matches the `merge_predicate` and insert any data that does not match it. The predicates in these cases just make sure the algorithm does not insert or update any data that you don't want, as in the late arriving changes scenario where a deleted row may arrive first from the changelog then the update row, and to prevent your target table to have inconsistent data for a certain period of time (it will eventually get consistent when you receive the latest correct status from the changelog though) you can have this guardrail in the insert or update predicates. Again, for most sources this will not happen but sources like Kafka for example cannot 100% ensure order, for example.\n    - In order to understand how we can cover different scenarios (e.g., late arriving changes, out of order changes, etc.), please go [here](../streaming_delta_with_late_arriving_and_out_of_order_events/streaming_delta_with_late_arriving_and_out_of_order_events.md).\n- The order of the predicates in the ACON does not matter, is the logic in the lakehouse engine [DeltaMergeWriter's \"_merge\" function](../../../reference/packages/io/writers/delta_merge_writer.md#packages.io.writers.delta_merge_writer.DeltaMergeWriter.__init__) that matters.\n- Notice the \"<=>\" operator? 
In Spark SQL that's the null safe equal.\n\n## Backfilling\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": True,\n        \"delimiter\": \"|\",\n        \"inferSchema\": True\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/backfill/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_value\": \"20180110120052t\",\n            \"greater_or_equal\": True\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ]\n}\n\nload_data(acon=acon)\n```\n\n##### Relevant Notes\n\n- The backfilling process depicted here is fairly similar to the init load, but it is relevant to highlight  by using a static value (that can be modified accordingly to the backfilling needs) in the [incremental_filter](../../../reference/packages/transformers/filters.md#packages.transformers.filters.Filters.incremental_filter) function.\n- Other relevant functions for backfilling may include the [expression_filter](../../../reference/packages/transformers/filters.md#packages.transformers.filters.Filters.expression_filter) function, where you can use a custom SQL filter to filter the input data."
  },
  {
    "path": "lakehouse_engine_usage/data_loader/custom_transformer/__init__.py",
    "content": "\"\"\"\n.. include::custom_transformer.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/custom_transformer/custom_transformer.md",
    "content": "# Custom Transformer\n\nThere may appear a scenario where the data product dev team faces the need to perform complex data transformations that are either not yet available in the lakehouse engine or the logic is just too complex to chain in an ACON file. In the context of the lakehouse, the only layers that usually can impose that complexity is silver+ and gold. This page targets exactly those cases.\n\nBelow you'll find a notebook where you can pass your own PySpark or Spark SQL logic into the ACON, by dynamically injecting a python function into the ACON dictionary. The lakehouse engine will take care of executing those transformations in the transformation step of the data loader algorithm. Please read the notebook's comments carefully to understand how it works, or simply open it in your notebook environment, which will make the notebook's code and comments more readable.\n\n!!! warning \"Force Streaming Micro Batch Processing.\"\n    When you use streaming mode, with a custom transformer, it’s highly advisable that you set the `force_streaming_microbatch_processing` flag to `True` in the transform specification, as explained above!\n\n## What is a custom transformer in the Lakehouse Engine and how you can use it to write your own pyspark logic?\n\nWe highly promote the Lakehouse Engine for creating Data Products aligned with the data source (bronze/silver layer), pumping data into silver so our Data Scientists and Analysts can leverage the value of the data in silver, as close as it comes from the source.\nThe low-code and configuration-driven nature of the lakehouse engine makes it a compelling framework to use in such cases, where the transformations that are done from bronze to silver are not that many, as we want to keep the data close to the source.\n\nHowever, when it comes to Data Products enriched in some way or for insights (silver+, gold), they are typically heavy\non transformations (they are the T of the overall ELT process), so the nature of the lakehouse engine may would have\nget into the way of adequately building it. Considering this, and considering our user base that prefers an ACON-based\napproach and all the nice off-the-shelf features of the lakehouse engine, we have developed a feature that\nallows us to **pass custom transformers where you put your entire pyspark logic and can pass it as an argument\nin the ACON** (the configuration file that configures every lakehouse engine algorithm).\n\n!!! note \"Motivation\"\n    Doing that, you let the ACON guide your read, data quality, write and terminate processes, and you just focus on transforming data :)\n\n## Custom transformation Function\n\nThe function below is the one that encapsulates all your defined pyspark logic and sends it as a python function to the lakehouse engine. This function will then be invoked internally in the lakehouse engine via a df.transform() function. If you are interested in checking the internals of the lakehouse engine, our codebase is openly available here: https://github.com/adidas/lakehouse-engine\n\n!!! warning \"Attention!!!\"\n    For this process to work, your function defined below needs to receive a DataFrame and return a DataFrame. 
Attempting any other method signature (e.g., defining more parameters) will not work, unless you use something like [python partials](https://docs.python.org/3/library/functools.html#functools.partial), for example.\n\n```python\ndef get_new_data(df: DataFrame) -> DataFrame:\n    \"\"\"Get the new data from the lakehouse engine reader and prepare it.\"\"\"\n    return (\n        df.withColumn(\"amount\", when(col(\"_change_type\") == \"delete\", lit(0)).otherwise(col(\"amount\")))\n        .select(\"article_id\", \"order_date\", \"amount\")\n        .groupBy(\"article_id\", \"order_date\")\n        .agg(sum(\"amount\").alias(\"amount\"))\n    )\n\n\ndef get_joined_data(new_data_df: DataFrame, current_data_df: DataFrame) -> DataFrame:\n    \"\"\"Join the new data with the current data already existing in the target dataset.\"\"\"\n    return (\n        new_data_df.alias(\"new_data\")\n        .join(\n            current_data_df.alias(\"current_data\"),\n            [\n                new_data_df.article_id == current_data_df.article_id,\n                new_data_df.order_date == current_data_df.order_date,\n            ],\n            \"left_outer\",\n        )\n        .withColumn(\n            \"current_amount\", when(col(\"current_data.amount\").isNull(), lit(0)).otherwise(\"current_data.amount\")\n        )\n        .withColumn(\"final_amount\", col(\"current_amount\") + col(\"new_data.amount\"))\n        .select(col(\"new_data.article_id\"), col(\"new_data.order_date\"), col(\"final_amount\").alias(\"amount\"))\n    )\n\n\ndef calculate_kpi(df: DataFrame) -> DataFrame:\n    \"\"\"Calculate KPI through a custom transformer that will be provided in the ACON.\n \n    Args:\n        df: DataFrame passed as input.\n \n    Returns:\n        DataFrame: the transformed DataFrame.\n    \"\"\"\n    new_data_df = get_new_data(df)\n\n    # we prefer if you use 'ExecEnv.SESSION' instead of 'spark', because is the internal object the\n    # lakehouse engine uses to refer to the spark session. But if you use 'spark' should also be fine.\n    current_data_df = ExecEnv.SESSION.table(\n        \"my_database.my_table\"\n    )\n\n    transformed_df = get_joined_data(new_data_df, current_data_df)\n\n    return transformed_df\n```\n\n### Don't like pyspark API? Write SQL\n\nYou don't have to comply to the pyspark API if you prefer SQL. Inside the function above (or any of\nthe auxiliary functions you decide to develop) you can write something like:\n\n````python\ndef calculate_kpi(df: DataFrame) -> DataFrame:\n    df.createOrReplaceTempView(\"new_data\")\n\n    # we prefer if you use 'ExecEnv.SESSION' instead of 'spark', because is the internal object the\n    # lakehouse engine uses to refer to the spark session. But if you use 'spark' should also be fine.\n    ExecEnv.SESSION.sql(\n        \"\"\"\n          CREATE OR REPLACE TEMP VIEW my_kpi AS\n          SELECT ... FROM new_data ...\n        \"\"\"\n    )\n\n    return ExecEnv.SESSION.table(\"my_kpi\")\n````\n\n## Just your regular ACON\n\nIf you notice the ACON below, everything is the same as you would do in a Data Product, but the `transform_specs` section of the ACON has a difference, which is a function called `\"custom_transformation\"` where we supply as argument the function defined above with the pyspark code.\n\n!!! 
warning \"Attention!!!\"\n    Do not pass the function as calculate_kpi(), but as calculate_kpi, otherwise you are telling python to invoke the function right away, as opposed to pass it as argument to be invoked later by the lakehouse engine.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sales\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"my_database.dummy_sales\",\n            \"options\": {\"readChangeFeed\": \"true\"},\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"transformed_sales_kpi\",\n            \"input_id\": \"sales\",\n            # because we are using streaming, this allows us to make sure that\n            # all the computation in our custom transformer gets pushed to\n            # Spark's foreachBatch method in a stream, which allows us to\n            # run all Spark functions in a micro batch DataFrame, as there\n            # are some Spark functions that are not supported in streaming.\n            \"force_streaming_foreach_batch_processing\": True,\n            \"transformers\": [\n                {\n                    \"function\": \"custom_transformation\",\n                    \"args\": {\"custom_transformer\": calculate_kpi},\n                },\n            ],\n        }\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"my_table_quality\",\n            \"input_id\": \"transformed_sales_kpi\",\n            \"dq_type\": \"validator\",\n            \"bucket\": \"my_dq_bucket\",\n            \"expectations_store_prefix\": \"dq/expectations/\",\n            \"validations_store_prefix\": \"dq/validations/\",\n            \"checkpoint_store_prefix\": \"dq/checkpoints/\",\n            \"tbl_to_derive_pk\": \"my_table\",\n            \"dq_functions\": [\n                {\"function\": \"expect_column_values_to_not_be_null\", \"args\": {\"column\": \"article_id\"}},\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sales_kpi\",\n            \"input_id\": \"transformed_sales_kpi\",\n            \"write_type\": \"merge\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"my_database.my_table\",\n            \"options\": {\n                \"checkpointLocation\": \"s3://my_data_product_bucket/gold/my_table\",\n            },\n            \"merge_opts\": {\n                \"merge_predicate\": \"new.article_id = current.article_id AND new.order_date = current.order_date\"\n            },\n        }\n    ],\n}\n\nload_data(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/custom_transformer_sql/__init__.py",
    "content": "\"\"\"\n.. include::custom_transformer_sql.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/custom_transformer_sql/custom_transformer_sql.md",
    "content": "# SQL Custom Transformer\nThe SQL Custom Transformer executes a SQL transformation provided by the user.This transformer can be very useful whenever the user wants to perform SQL-based transformations that are not natively supported by the lakehouse engine transformers.\n\nThe transformer receives the SQL query to be executed. This can read from any table or view from the catalog, or any dataframe registered as a temp view.\n\n> To register a dataframe as a temp view you can use the \"temp_view\" config in the input_specs, as shown below.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sales_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"options\": {\"mode\": \"FAILFAST\", \"header\": True, \"delimiter\": \"|\"},\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/\"\n            \"data_loader_custom_transformer/sql_transformation/\"\n            \"source_schema.json\",\n            \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n            \"data_loader_custom_transformer/sql_transformation/data\",\n            \"temp_view\": \"sales_sql\",\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"calculated_kpi\",\n            \"input_id\": \"sales_source\",\n            \"transformers\": [\n                {\n                    \"function\": \"sql_transformation\",\n                    \"args\": {\"sql\": SQL},\n                }\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sales_bronze\",\n            \"input_id\": \"calculated_kpi\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"location\": \"file:///app/tests/lakehouse/out/feature/\"\n            \"data_loader_custom_transformer/sql_transformation/data\",\n        }\n    ],\n}\n\nload_data(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/data_loader.md",
    "content": "# Data Loader\n\n## How to configure a DataLoader algorithm in the lakehouse-engine by using an ACON file?\n\nAn algorithm (e.g., data load) in the lakehouse-engine is configured using an ACON. The lakehouse-engine is a\nconfiguration-driven framework, so people don't have to write code to execute a Spark algorithm. In contrast, the\nalgorithm is written in pyspark and accepts configurations through a JSON file (an ACON - algorithm configuration). The\nACON is the configuration providing the behaviour of a lakehouse engine algorithm. [You can check the algorithm code, and\nhow it interprets the ACON here](../../reference/packages/algorithms/algorithm.md).\nIn this page we will go through the structure of an ACON file and what are the most suitable ACON files for common data\nengineering scenarios.\nCheck the underneath pages to find several **ACON examples** that cover many data extraction, transformation and loading scenarios.\n\n## Overview of the Structure of the ACON file for DataLoads\n\nAn ACON-based algorithm needs several specifications to work properly, but some of them might be optional. The available\nspecifications are:\n\n- **Input specifications (input_specs)**: specify how to read data. This is a **mandatory** keyword.\n- **Transform specifications (transform_specs)**: specify how to transform data.\n- **Data quality specifications (dq_specs)**: specify how to execute the data quality process.\n- **Output specifications (output_specs)**: specify how to write data to the target. This is a **mandatory** keyword.\n- **Terminate specifications (terminate_specs)**: specify what to do after writing into the target (e.g., optimising target table, vacuum, compute stats, expose change data feed to external location, etc.).\n- **Execution environment (exec_env)**: custom Spark session configurations to be provided for your algorithm (configurations can also be provided from your job/cluster configuration, which we highly advise you to do instead of passing performance related configs here for example).\n\nBelow is an example of a complete ACON file that reads from a s3 folder with CSVs and incrementally loads that data (using a merge) into a delta lake table.\n\n!!! note \"What is the **spec_id**?\"\n    **spec_id** is one of the main concepts to ensure you can chain the steps of the algorithm, so, for example, you can specify the transformations (in transform_specs) of a DataFrame that was read in the input_specs. 
Check ACON below to see how the spec_id of the input_specs is used as input_id in one transform specification.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n  \"input_specs\": [\n    {\n      \"spec_id\": \"orders_bronze\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"s3://my-data-product-bucket/artefacts/metadata/bronze/schemas/orders.json\",\n      \"with_filepath\": True,\n      \"options\": {\n        \"badRecordsPath\": \"s3://my-data-product-bucket/badrecords/order_events_with_dq/\",\n        \"header\": False,\n        \"delimiter\": \"\\u005E\",\n        \"dateFormat\": \"yyyyMMdd\"\n      },\n      \"location\": \"s3://my-data-product-bucket/bronze/orders/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"orders_bronze_with_extraction_date\",\n      \"input_id\": \"orders_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_row_id\"\n        },\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": True,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"check_orders_bronze_with_extraction_date\",\n      \"input_id\": \"orders_bronze_with_extraction_date\",\n      \"dq_type\": \"validator\",\n      \"result_sink_db_table\": \"my_database.my_table_dq_checks\",\n      \"fail_on_error\": False,\n      \"dq_functions\": [\n        {\n          \"dq_function\": \"expect_column_values_to_not_be_null\",\n          \"args\": {\n            \"column\": \"omnihub_locale_code\"\n          }\n        },\n        {\n          \"dq_function\": \"expect_column_unique_value_count_to_be_between\",\n          \"args\": {\n            \"column\": \"product_division\",\n            \"min_value\": 10,\n            \"max_value\": 100\n          }\n        },\n        {\n          \"dq_function\": \"expect_column_max_to_be_between\",\n          \"args\": {\n            \"column\": \"so_net_value\",\n            \"min_value\": 10,\n            \"max_value\": 1000\n          }\n        },\n        {\n          \"dq_function\": \"expect_column_value_lengths_to_be_between\",\n          \"args\": {\n            \"column\": \"omnihub_locale_code\",\n            \"min_value\": 1,\n            \"max_value\": 10\n          }\n        },\n        {\n          \"dq_function\": \"expect_column_mean_to_be_between\",\n          \"args\": {\n            \"column\": \"coupon_code\",\n            \"min_value\": 15,\n            \"max_value\": 20\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"orders_silver\",\n      \"input_id\": \"check_orders_bronze_with_extraction_date\",\n      \"data_format\": \"delta\",\n      \"write_type\": \"merge\",\n      \"partitions\": [\n        \"order_date_header\"\n      ],\n      \"merge_opts\": {\n        \"merge_predicate\": \"\"\"\n            new.sales_order_header = current.sales_order_header\n            and new.sales_order_schedule = current.sales_order_schedule\n            and new.sales_order_item=current.sales_order_item\n            and new.epoch_status=current.epoch_status\n            and new.changed_on=current.changed_on\n            and new.extraction_date=current.extraction_date\n            and new.lhe_batch_id=current.lhe_batch_id\n        
    and new.lhe_row_id=current.lhe_row_id\n        \"\"\",\n        \"insert_only\": True\n      },\n      \"db_table\": \"my_database.my_table_with_dq\",\n      \"location\": \"s3://my-data-product-bucket/silver/order_events_with_dq/\",\n      \"with_batch_id\": True,\n      \"options\": {\n        \"checkpointLocation\": \"s3://my-data-product-bucket/checkpoints/order_events_with_dq/\"\n      }\n    }\n  ],\n  \"terminate_specs\": [\n    {\n      \"function\": \"optimize_dataset\",\n      \"args\": {\n        \"db_table\": \"my_database.my_table_with_dq\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": True\n  }\n}\n\nload_data(acon=acon)\n```\n\n## Input Specifications\n\nYou specify how to read the data by providing a list of Input Specifications. Usually there's just one element in that\nlist, as, in the lakehouse, you are generally focused on reading data from one layer (e.g., source, bronze, silver,\ngold) and put it on the next layer. However, there may be scenarios where you would like to combine two datasets (e.g.,\njoins or incremental filtering on one dataset based on the values of another\none), therefore you can use one or more elements.\n[More information about InputSpecs](../../reference/packages/core/definitions.md#packages.core.definitions.InputSpec).\n\n##### Relevant notes\n\n- A spec id is fundamental, so you can use the input data later on in any step of the algorithm (transform, write, dq process, terminate).\n- You don't have to specify `db_table` and `location` at the same time. Depending on the data_format sometimes you read from a table (e.g., jdbc or deltalake table) sometimes you read from a location (e.g., files like deltalake, parquet, json, avro... or kafka topic).\n\n## Transform Specifications\n\nIn the lakehouse engine, you transform data by providing a transform specification, which contains a list of transform functions (transformers). So the transform specification acts upon on input, and it can execute multiple lakehouse engine transformation functions (transformers) upon that input.\n\nIf you look into the example above we ask the lakehouse engine to execute two functions on the `orders_bronze` input\ndata: `with_row_id` and `with_regex_value`. Those functions can of course receive arguments. You can see a list of all\navailable transformation functions (transformers) here `lakehouse_engine.transformers`. Then, you just invoke them in\nyour ACON as demonstrated above, following exactly the same function name and parameters name as described in the code\ndocumentation. \n[More information about TransformSpec](../../reference/packages/core/definitions.md#packages.core.definitions.TransformSpec).\n\n##### Relevant notes\n\n- This stage is fully optional, you can omit it from the ACON.\n- There is one relevant option `force_streaming_foreach_batch_processing` that can be used to force the transform to be\n  executed in the foreachBatch function to ensure non-supported streaming operations can be properly executed. You don't\n  have to worry about this if you are using regular lakehouse engine transformers. 
But if you are providing your custom\n  logic in pyspark code via our lakehouse engine\n  custom_transformation (`lakehouse_engine.transformers.custom_transformers`) then sometimes your logic may contain\n  Spark functions that are not compatible with Spark Streaming, and therefore this flag can enable all of your\n  computation to be streaming-compatible by pushing down all the logic into the foreachBatch() function.\n\n## Data Quality Specifications\n\nOne of the most relevant features of the lakehouse engine is that you can have data quality guardrails that prevent you\nfrom loading bad data into your target layer (e.g., bronze, silver or gold). The lakehouse engine data quality process\nincludes one main feature at the moment:\n\n- **Validator**: The capability to perform data quality checks on that data (e.g., is the max value of a column bigger\n  than x?) and even tag your data with the results of the DQ checks.\n\nThe output of the data quality process can be written into a [**Result Sink**](../data_quality/result_sink/result_sink.md) target (e.g. table or files) and is integrated with a [Data Docs website](../data_quality/data_quality.md#3-data-docs-website), which can be a company-wide available website for people to check the quality of their data and share with others.\n\nTo achieve all of this functionality the lakehouse engine uses [Great Expectations](https://greatexpectations.io/) internally. To hide the Great Expectations internals from our user base and provide friendlier abstractions using the ACON, we have developed the concept of DQSpec that can contain many DQFunctionSpec objects, which is very similar to the relationship between the TransformSpec and TransformerSpec, which means you can have multiple Great Expectations functions executed inside a single data quality specification (as in the ACON above).\n\n!!! note\n    The names of the functions and args are a 1 to 1 match of [Great Expectations API](https://greatexpectations.io/expectations/).\n\n[More information about DQSpec](../../reference/packages/core/definitions.md#packages.core.definitions.DQSpec).\n\n##### Relevant notes\n\n- You can write the outputs of the DQ process to a sink through the result_sink* parameters of the\n  DQSpec. `result_sink_options` takes any Spark options for a DataFrame writer, which means you can specify the options\n  according to your sink format (e.g., delta, parquet, json, etc.). We usually recommend using `\"delta\"` as format.\n- You can use the results of the DQ checks to tag the data that you are validating. When configured, these details will\n  appear as a new column (like any other), as part of the tables of your Data Product.\n- To be able to make an analysis with the data of `result_sink*`, we have available an approach in which you\n  set `result_sink_explode` as true (which is the default) and then you have some columns expanded. Those are:\n    - General columns: Those are columns that have the basic information regarding `dq_specs` and will have always values\n      and does not depend on the expectation types chosen.\n        -\n      Columns: `checkpoint_config`, `run_name`, `run_time`, `run_results`, `success`, `validation_result_identifier`, `spec_id`, `input_id`, `validation_results`, `run_time_year`, `run_time_month`, `run_time_day`.\n    - Statistics columns: Those are columns that have information about the runs of expectations, being those values for\n      the run and not for each expectation. 
Those columns come from `run_results.validation_result.statistics.*`.\n        - Columns: `evaluated_expectations`, `success_percent`, `successful_expectations`, `unsuccessful_expectations`.\n    - Expectations columns: Those are columns that have information about the expectation executed.\n        - Columns: `expectation_type`, `batch_id`, `expectation_success`, `exception_info`. Those columns are exploded\n          from `run_results.validation_result.results`\n          inside `expectation_config.expectation_type`, `expectation_config.kwargs.batch_id`, `success as expectation_success`,\n          and `exception_info`. Moreover, we also include `unexpected_index_list`, `observed_value` and `kwargs`.\n    - Arguments of Expectations columns: Those are columns that will depend on the expectation_type selected. Those\n      columns are exploded from `run_results.validation_result.results` inside `expectation_config.kwargs.*`.\n        - We can have for\n          example: `column`, `column_A`, `column_B`, `max_value`, `min_value`, `value`, `value_pairs_set`, `value_set`,\n          and others.\n    - More columns desired? Those can be added, using `result_sink_extra_columns` in which you can select columns\n      like `<name>` and/or explode columns like `<name>.*`.\n- Use the parameter `\"source\"` to identify the data used for an easier analysis.\n- By default, Great Expectation will also provide a site presenting the history of the DQ validations that you have performed on your data.\n- You can make an analysis of all your expectations and create a dashboard aggregating all that information.\n- This stage is fully optional, you can omit it from the ACON.\n\n## Output Specifications\n\nThe output_specs section of an ACON is relatively similar to the input_specs section, but of course focusing on how to write the results of the algorithm, instead of specifying the input for the algorithm, hence the name output_specs (output specifications). [More information about OutputSpec](../../reference/packages/core/definitions.md#packages.core.definitions.OutputSpec).\n\n##### Relevant notes\n\n- Respect the supported write types and output formats.\n- One of the most relevant options to specify in the options parameter is the `checkpoint_location` when in streaming\n  read mode, because that location will be responsible for storing which data you already read and transformed from the\n  source, **when the source is a Spark Streaming compatible source (e.g., Kafka or S3 files)**.\n\n## Terminate Specifications\n\nThe terminate_specs section of the ACON is responsible for some \"wrapping up\" activities like optimising a table,\nvacuuming old files in a delta table, etc. 
With time the list of available terminators will likely increase (e.g.,\nreconciliation processes), but for now we have the [following terminators](../../reference/packages/terminators/index.md).\nThis stage is fully optional, you can omit it from the ACON.\nThe most relevant now in the context of the lakehouse initiative are the following:\n\n- [dataset_optimizer](../../reference/packages/terminators/dataset_optimizer.md)\n- [cdf_processor](../../reference/packages/terminators/cdf_processor.md)\n- [sensor_terminator](../../reference/packages/terminators/sensor_terminator.md)\n- [notifier_terminator](../../reference/packages/terminators/notifiers/email_notifier.md)\n\n[More information about TerminatorSpec](../../reference/packages/core/definitions.md#packages.core.definitions.TerminatorSpec).\n\n## Execution Environment\n\nIn the exec_env section of the ACON you can pass any Spark Session configuration that you want to define for the\nexecution of your algorithm. This is basically just a JSON structure that takes in any Spark Session property, so no\ncustom lakehouse engine logic. This stage is fully optional, you can omit it from the ACON.\n\n!!! note\n    Please be aware that Spark Session configurations that are not allowed to be changed when the Spark cluster is already\n    running need to be passed in the configuration of the job/cluster that runs this algorithm, not here in this section.\n    This section only accepts Spark Session configs that can be changed in runtime. Whenever you introduce an option make\n    sure that it takes effect during runtime, as to the best of our knowledge there's no list of allowed Spark properties\n    to be changed after the cluster is already running. Moreover, typically Spark algorithms fail if you try to modify a\n    config that can only be set up before the cluster is running.\n"
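To make the runtime note above concrete, the snippet below (plain PySpark, not a lakehouse-engine API) shows what the `exec_env` entries translate to: ordinary Spark Session configurations that can still be changed at runtime, as opposed to static properties that must be set on the job/cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Runtime-changeable Spark Session config: this is the kind of property exec_env accepts.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
print(spark.conf.get("spark.databricks.delta.schema.autoMerge.enabled"))  # "true"

# A static property (e.g. "spark.executor.memory") would typically raise an error here,
# matching the note above: such configs belong in the job/cluster configuration instead.
```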
  },
  {
    "path": "lakehouse_engine_usage/data_loader/extract_from_sap_b4_adso/__init__.py",
    "content": "\"\"\"\n.. include::extract_from_sap_b4_adso.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/extract_from_sap_b4_adso/extract_from_sap_b4_adso.md",
    "content": "# Extract from SAP B4 ADSOs\n\nA custom sap_b4 reader and a few utils are offered in the lakehouse-engine framework so that consumption of data from\nSAP B4 DSOs can be easily created. The framework abstracts all the logic behind the init/delta extractions\n(AQ vs CL, active table, changelog table, requests status table, how to identify the next delta timestamp...),\nonly requiring a few parameters that are explained and exemplified in the\n[template](#extraction-from-sap-b4-adsos-template) scenarios that we have created.\n\n\n!!! note\n    This custom reader is very similar and uses most features from the sap_bw reader, so if you were using specific filters/parameters with the sap_bw reader, there is a high chance you can keep using it in a very similar way with the sap_b4 reader. The main concepts are applied to both readers, as the strategies on how to parallelize the extractions, for example.\n\nHow can I find a good candidate column for [partitioning the extraction from S4Hana?](../extract_from_sap_bw_dso/extract_from_sap_bw_dso.md#how-can-we-decide-the-partitionColumn)\n\n!!! danger \"**Parallelization Limitations**\"\n    There are no limits imposed by the Lakehouse-Engine framework, but you need to consider that there might be differences imposed by the source.\n\n    E.g. Each User might be restricted on utilisation of about 100GB memory at a time from the source.\n\n    Parallel extractions ***can bring a jdbc source down*** if a lot of stress is put on the system. Be careful choosing the number of partitions. Spark is a distributed system and can lead to many connections.\n\n!!! danger \n    **In case you want to perform further filtering in the REQTSN field, please be aware that it is not being pushed down to SAP B4 by default (meaning it will have bad performance).** \n    In that case, you will need to use customSchema option while reading, so that you are able to enable filter push down for those.\n\n\nYou can check the code documentation of the reader below:\n\n[**SAP B4 Reader**](../../../reference/packages/io/readers/sap_b4_reader.md)\n\n[**JDBC Extractions arguments**](../../../reference/packages/utils/extraction/jdbc_extraction_utils.md#packages.utils.extraction.jdbc_extraction_utils.JDBCExtraction.__init__)\n\n[**SAP B4 Extractions arguments**](../../../reference/packages/utils/extraction/sap_b4_extraction_utils.md#packages.utils.extraction.sap_b4_extraction_utils.SAPB4Extraction.__init__)\n\n!!! note \n    For extractions using the SAP B4 reader, you can use the arguments listed in the SAP B4 arguments, but also the ones listed in the JDBC extractions, as those are inherited as well.\n\n\n## Extraction from SAP B4 ADSOs Template\nThis template covers the following scenarios of extractions from the SAP B4Hana ADSOs:\n\n- 1 - The Simplest Scenario (Not parallel - Not Recommended)\n- 2 - Parallel extraction\n  - 2.1 - Simplest Scenario\n  - 2.2 - Provide upperBound (Recommended)\n  - 2.3 - Automatic upperBound (Recommended)\n  - 2.4 - Provide predicates (Recommended)\n  - 2.5 - Generate predicates (Recommended)\n\n!!! note\n    The template will cover two ADSO Types:\n\n    - **AQ**: ADSO which is of append type and for which a single ADSO/tables holds all the information, like an\n    event table. For this type, the same ADSO is used for reading data both for the inits and deltas. 
Usually, these\n    ADSOs end with the digit \"6\".\n    - **CL**: ADSO which is split into two ADSOs, one holding the change log events, the other having the active\n    data (current version of the truth for a particular source). For this type, the ADSO having the active data\n    is used for the first extraction (init) and the change log ADSO is used for the subsequent extractions (deltas).\n    Usually, these ADSOs are split into active table ending with the digit \"2\" and changelog table ending with digit \"3\".\n\nFor each of these ADSO types, the lakehouse-engine abstracts the logic to get the delta extractions. This logic\nbasically consists of joining the `db_table` (for `AQ`) or the `changelog_table` (for `CL`) with the table\nhaving the requests status (`my_database.requests_status_table`).\nOne of the fields used for this joining is the `data_target`, which has a relationship with the ADSO\n(`db_table`/`changelog_table`), being basically the same identifier without considering parts of it.\n\nBased on the previous insights, the queries that the lakehouse-engine generates under the hood translate to\n(this is a simplified version, for more details please refer to the lakehouse-engine code documentation):\n\n**AQ Init Extraction:**\n`SELECT t.*, CAST({self._SAP_B4_EXTRACTION.extraction_timestamp} AS DECIMAL(15,0)) AS extraction_start_timestamp\nFROM my_database.my_table t`\n\n**AQ Delta Extraction:**\n`SELECT tbl.*, CAST({self._B4_EXTRACTION.extraction_timestamp} AS DECIMAL(15,0)) AS extraction_start_timestamp\nFROM my_database.my_table AS tbl\nJOIN my_database.requests_status_table AS req\nWHERE STORAGE = 'AQ' AND REQUEST_IS_IN_PROCESS = 'N' AND LAST_OPERATION_TYPE IN ('C', 'U')\nAND REQUEST_STATUS IN ('GG', 'GR') AND UPPER(DATATARGET) = UPPER('my_identifier')\nAND req.REQUEST_TSN > max_timestamp_in_bronze AND req.REQUEST_TSN <= max_timestamp_in_requests_status_table`\n\n**CL Init Extraction:**\n`SELECT t.*,\n    {self._SAP_B4_EXTRACTION.extraction_timestamp}000000000 AS reqtsn,\n    '0' AS datapakid,\n    0 AS record,\n    CAST({self._SAP_B4_EXTRACTION.extraction_timestamp} AS DECIMAL(15,0)) AS extraction_start_timestamp\nFROM my_database.my_table_2 t`\n\n**CL Delta Extraction:**\n`SELECT tbl.*,\nCAST({self._SAP_B4_EXTRACTION.extraction_timestamp} AS DECIMAL(15,0)) AS extraction_start_timestamp`\nFROM my_database.my_table_3 AS tbl\nJOIN my_database.requests_status_table AS req\nWHERE STORAGE = 'AT' AND REQUEST_IS_IN_PROCESS = 'N' AND LAST_OPERATION_TYPE IN ('C', 'U')\nAND REQUEST_STATUS IN ('GG') AND UPPER(DATATARGET) = UPPER('my_data_target')\nAND req.REQUEST_TSN > max_timestamp_in_bronze AND req.REQUEST_TSN <= max_timestamp_in_requests_status_table`\n\n!!! note \"Introductory Notes\"\n    If you want to have a better understanding about JDBC Spark optimizations, here you have a few useful links:\n    \n    - https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html\n    - https://docs.databricks.com/en/connect/external-systems/jdbc.html\n    - https://bit.ly/3x2eCEm\n    - https://newbedev.com/how-to-optimize-partitioning-when-migrating-data-from-jdbc-source\n\n### 1 - The Simplest Scenario (Not parallel - Not Recommended)\nThis scenario is the simplest one, not taking any advantage of Spark JDBC optimisation techniques\nand using a single connection to retrieve all the data from the source. 
It should only be used in case the ADSO\nyou want to extract from SAP B4Hana is a small one, with no big requirements in terms of performance to fulfill.\nWhen extracting from the source ADSO, there are two options:\n\n- **Delta Init** - full extraction of the source ADSO. You should use it in the first time you extract from the\nADSO or any time you want to re-extract completely. Similar to a so-called full load.\n- **Delta** - extracts the portion of the data that is new or has changed in the source, since the last\nextraction (using the `max_timestamp` value in the location of the data already extracted\n`latest_timestamp_data_location`).\n\nBelow example is composed of two cells.\n\n- The first cell is only responsible to define the variables `extraction_type` and `write_type`,\ndepending on the extraction type: **Delta Init** (`load_type = \"init\"`) or a **Delta** (`load_type = \"delta\"`).\nThe variables in this cell will also be referenced by other acons/examples in this notebook, similar to what\nyou would do in your pipelines/jobs, defining this centrally and then re-using it.\n- The second cell is where the acon to be used is defined (which uses the two variables `extraction_type` and\n`write_type` defined) and the `load_data` algorithm is executed to perform the extraction.\n\n!!! note\n    There may be cases where you might want to always extract fully from the source ADSO. In these cases,\n    you only need to use a Delta Init every time, meaning you would use `\"extraction_type\": \"init\"` and\n    `\"write_type\": \"overwrite\"` as it is shown below. The explanation about what it is a Delta Init/Delta is\n    applicable for all the scenarios presented in this notebook.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_b4\",\n            \"options\": {\n                \"url\": \"my_sap_b4_url\",\n                \"user\": \"my_user\",\n                \"password\": \"my_b4_hana_pwd\",\n                \"dbtable\": \"my_database.my_table\",\n                \"extraction_type\": extraction_type,\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier/\",\n                \"adso_type\": \"AQ\",\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n### 2 - Parallel extraction\nIn this section, 5 possible scenarios for parallel extractions from SAP B4Hana ADSOs are presented.\n\n#### 2.1 - Parallel Extraction, Simplest Scenario\nThis scenario provides the simplest example you can have for a parallel extraction from SAP B4Hana, only using\nthe property `numPartitions`. 
The goal of the scenario is to cover the case in which people do not have\nmuch knowledge around how to optimize the extraction from JDBC sources or cannot identify a column that can\nbe used to split the extraction in several tasks. This scenario can also be used if the use case does not\nhave big performance requirements/concerns, meaning you do not feel the need to optimize the performance of\nthe extraction to its maximum potential.\n\nOn the example below, `\"numPartitions\": 10` is specified, meaning that Spark will open 10 parallel connections\nto the source ADSO and automatically decide how to parallelize the extraction upon that requirement. This is the\nonly change compared to the example provided in the scenario 1.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_b4\",\n            \"options\": {\n                \"url\": \"my_sap_b4_url\",\n                \"user\": \"my_user\",\n                \"password\": \"my_sap_b4_pwd\",\n                \"dbtable\": \"my_database.my_table\",\n                \"extraction_type\": extraction_type,\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier_par_simple/\",\n                \"adso_type\": \"AQ\",\n                \"numPartitions\": 10,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/my_identifier_par_simple/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.2 - Parallel Extraction, Provide upper_bound (Recommended)\nThis scenario performs the extraction from the SAP B4 ADSO in parallel, but is more concerned with trying to\noptimize and have more control (compared to 2.1 example) on how the extraction is split and performed,\nusing the following options:\n\n- `numPartitions` - number of Spark partitions to split the extraction.\n- `partitionColumn` - column used to split the extraction. It must be a numeric, date, or timestamp.\nIt should be a column that is able to split the extraction evenly in several tasks. An auto-increment\ncolumn is usually a very good candidate.\n- `lowerBound` - lower bound to decide the partition stride.\n- `upperBound` - upper bound to decide the partition stride.\n\nThis is an adequate example for you to follow if you have/know a column in the ADSO that is good to be used as\nthe `partitionColumn`. 
If you compare with the previous example, you'll notice that now `numPartitions` and\nthree additional options are provided to fine tune the extraction (`partitionColumn`, `lowerBound`,\n`upperBound`).\n\nWhen these 4 properties are used, Spark will use them to build several queries to split the extraction.\n\n**Example:** for `\"numPartitions\": 10`, `\"partitionColumn\": \"record\"`, `\"lowerBound: 1\"`, `\"upperBound: 100\"`,\nSpark will generate 10 queries like this:\n\n- `SELECT * FROM dummy_table WHERE RECORD < 10 OR RECORD IS NULL`\n- `SELECT * FROM dummy_table WHERE RECORD >= 10 AND RECORD < 20`\n- `SELECT * FROM dummy_table WHERE RECORD >= 20 AND RECORD < 30`\n- ...\n- `SELECT * FROM dummy_table WHERE RECORD >= 100`\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_b4\",\n            \"options\": {\n                \"url\": \"my_sap_b4_url\",\n                \"user\": \"my_user\",\n                \"password\": \"my_b4_hana_pwd\",\n                \"dbtable\": \"my_database.my_table\",\n                \"extraction_type\": extraction_type,\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier_par_prov_upper/\",\n                \"adso_type\": \"AQ\",\n                \"partitionColumn\": \"RECORD\",\n                \"numPartitions\": 10,\n                \"lowerBound\": 1,\n                \"upperBound\": 1000000,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/my_identifier_par_prov_upper/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.3 - Parallel Extraction, Automatic upper_bound (Recommended)\nThis scenario is very similar to 2.2, the only difference being that **`upperBound`\nis not provided**. Instead, the property `calculate_upper_bound` equals to true is used to benefit\nfrom the automatic calculation of the `upperBound` (derived from the `partitionColumn`) offered by the\nlakehouse-engine framework, which is useful, as in most of the cases you will probably not be aware of\nthe max value for the column. 
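Conceptually, this automatic calculation boils down to one extra aggregation query against the source before the partitioned read starts. The sketch below illustrates that idea with plain Spark JDBC (hypothetical names and connection options, not the engine's internal code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One small aggregation query pushed to the source to find the max value of the partition column.
max_query = "(SELECT MAX(RECORD) AS UPPER_BOUND FROM my_database.my_table) AS max_q"

upper_bound = (
    spark.read.format("jdbc")
    .option("url", "my_sap_b4_url")              # assumption: same connection details as in the ACONs
    .option("driver", "com.sap.db.jdbc.Driver")  # assumption: SAP HANA JDBC driver
    .option("user", "my_user")
    .option("password", "my_b4_hana_pwd")
    .option("dbtable", max_query)
    .load()
    .first()[0]
)
# upper_bound would then feed the partitioned extraction, as if you had provided it yourself.
```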
The only thing you need to consider is that if you use this automatic\ncalculation of the upperBound you will be doing an initial query to the SAP B4 ADSO to retrieve the max\nvalue for the `partitionColumn`, before doing the actual query to perform the extraction.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_b4\",\n            \"calculate_upper_bound\": True,\n            \"options\": {\n                \"url\": \"my_sap_b4_url\",\n                \"user\": \"my_user\",\n                \"password\": \"my_b4_hana_pwd\",\n                \"dbtable\": \"my_database.my_table\",\n                \"extraction_type\": extraction_type,\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier_par_calc_upper/\",\n                \"adso_type\": \"AQ\",\n                \"partitionColumn\": \"RECORD\",\n                \"numPartitions\": 10,\n                \"lowerBound\": 1,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/my_identifier_par_calc_upper/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.4 - Parallel Extraction, Provide Predicates (Recommended)\nThis scenario performs the extraction from SAP B4 ADSO in parallel, useful in contexts in which there is no\nnumeric, date or timestamp column to parallelize the extraction (e.g. when extracting from ADSO of Type `CL`,\nthe active table does not have the `RECORD` column, which is usually a good option for scenarios 2.2 and 2.3):\n\n- `partitionColumn` - column used to split the extraction. 
It can be of any type.\n\nThis is an adequate example for you to follow if you have/know a column in the ADSO that is good to be used as\nthe `partitionColumn`, specially if these columns are not complying with the scenario 2.2 or 2.3.\n\n**When this property is used all predicates need to be provided to Spark, otherwise it will leave data behind.**\n\nBelow the lakehouse function to generate predicate list automatically is presented.\n\nThis function needs to be used carefully, specially on predicates_query and predicates_add_null variables.\n\n**predicates_query:** At the sample below the whole table is being considered (`select distinct(x) from table`),\nbut it is possible to filter predicates list here, specially if you are applying filter on transformations spec,\nand you know entire table won't be necessary, so you can change it to something like this: `select distinct(x)\nfrom table where x > y`.\n\n**predicates_add_null:** You can decide if you want to consider null on predicates list or not, by default\nthis property is `True`.\n\n**Example:** for `\"partition_column\": \"CALMONTH\"`\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\n# import the lakehouse_engine ExecEnv class, so that you can use the functions it offers\n# import the lakehouse_engine extraction utils, so that you can use the JDBCExtractionUtils offered functions\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.extraction.jdbc_extraction_utils import (\n    JDBCExtraction,\n    JDBCExtractionUtils,\n)\n\nExecEnv.get_or_create()\n\npartition_column = \"CALMONTH\"\ndbtable = \"my_database.my_table_3\"\n\npredicates_query = f\"\"\"(SELECT DISTINCT({partition_column}) FROM {dbtable})\"\"\"\nuser = \"my_user\"\npassword = \"my_b4_hana_pwd\"\nurl = \"my_sap_b4_url\"\npredicates_add_null = True\n\njdbc_util = JDBCExtractionUtils(\n    JDBCExtraction(\n        user=user,\n        password=password,\n        url=url,\n        predicates_add_null=predicates_add_null,\n        partition_column=partition_column,\n        dbtable=dbtable,\n    )\n)\n\npredicates = jdbc_util.get_predicates(predicates_query)\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_2_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_b4\",\n            \"options\": {\n                \"url\": \"my_sap_b4_url\",\n                \"user\": \"my_user\",\n                \"password\": \"my_b4_hana_pwd\",\n                \"driver\": \"com.sap.db.jdbc.Driver\",\n                \"dbtable\": \"my_database.my_table_2\",\n                \"changelog_table\": \"my_database.my_table_3\",\n                \"extraction_type\": extraction_type,\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier_2_prov_predicates/\",\n                \"adso_type\": \"CL\",\n                \"predicates\": predicates,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_2_bronze\",\n            \"input_id\": \"my_identifier_2_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/my_identifier_2_prov_predicates/\",\n        }\n    ],\n    \"exec_env\": {\n        
\"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.5 - Parallel Extraction, Generate Predicates\nThis scenario is very similar to the scenario 2.4, with the only difference that it automatically\ngenerates the predicates (`\"generate_predicates\": True`).\n\nThis is an adequate example for you to follow if you have/know a column in the ADSO that is good to be used as\nthe `partitionColumn`, specially if these columns are not complying with the scenarios 2.2 and 2.3 (otherwise \nthose would probably be recommended).\n\nWhen this property is used, the lakehouse engine will generate the predicates to be used to extract data from\nthe source. What the lakehouse engine does is to check for the init/delta portion of the data,\nwhat are the distinct values of the `partitionColumn` serving that data. Then, these values will be used by\nSpark to generate several queries to extract from the source in a parallel fashion.\nEach distinct value of the `partitionColumn` will be a query, meaning that you will not have control over the\nnumber of partitions used for the extraction. For example, if you face a scenario in which you\nare using a `partitionColumn` `LOAD_DATE` and for today's delta, all the data (let's suppose 2 million rows) is\nserved by a single `LOAD_DATE = 20200101`, that would mean Spark would use a single partition\nto extract everything. In this extreme case you would probably need to change your `partitionColumn`. **Note:**\nthese extreme cases are harder to happen when you use the strategy of the scenarios 2.2/2.3.\n\n**Example:** for `\"partitionColumn\": \"record\"`\n\nGenerate predicates:\n- `SELECT DISTINCT(RECORD) as RECORD FROM dummy_table`\n- `1`\n- `2`\n- `3`\n- ...\n- `100`\n- Predicates List: ['RECORD=1','RECORD=2','RECORD=3',...,'RECORD=100']\n\nSpark will generate 100 queries like this:\n\n- `SELECT * FROM dummy_table WHERE RECORD = 1`\n- `SELECT * FROM dummy_table WHERE RECORD = 2`\n- `SELECT * FROM dummy_table WHERE RECORD = 3`\n- ...\n- `SELECT * FROM dummy_table WHERE RECORD = 100`\n\nGenerate predicates will also consider null by default:\n\n- `SELECT * FROM dummy_table WHERE RECORD IS NULL`\n\nTo disable this behaviour the following variable value should be changed to false: `\"predicates_add_null\": False`\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_2_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_b4\",\n            \"generate_predicates\": True,\n            \"options\": {\n                \"url\": \"my_sap_b4_url\",\n                \"user\": \"my_user\",\n                \"password\": \"my_b4_hana_pwd\",\n                \"driver\": \"com.sap.db.jdbc.Driver\",\n                \"dbtable\": \"my_database.my_table_2\",\n                \"changelog_table\": \"my_database.my_table_3\",\n                \"extraction_type\": extraction_type,\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier_2_gen_predicates/\",\n                \"adso_type\": \"CL\",\n                \"partitionColumn\": \"CALMONTH\",\n            },\n  
      }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_2_bronze\",\n            \"input_id\": \"my_identifier_2_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/my_identifier_2_gen_predicates/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```"
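To make the generate-predicates behaviour from scenario 2.5 more tangible, here is a rough sketch of the assumed idea of turning the distinct values of the `partitionColumn` into a predicates list (illustrative only, not the engine's actual implementation):

```python
def build_predicates(partition_column: str, distinct_values: list, add_null: bool = True) -> list:
    """Turn each distinct value of the partition column into one predicate (one Spark partition/query)."""
    predicates = [f"{partition_column}='{value}'" for value in distinct_values]
    if add_null:
        # Mirrors the default "predicates_add_null": True behaviour described above.
        predicates.append(f"{partition_column} IS NULL")
    return predicates


# e.g. build_predicates("CALMONTH", ["202101", "202102"]) ->
# ["CALMONTH='202101'", "CALMONTH='202102'", "CALMONTH IS NULL"]
```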
  },
  {
    "path": "lakehouse_engine_usage/data_loader/extract_from_sap_bw_dso/__init__.py",
    "content": "\"\"\"\n.. include::extract_from_sap_bw_dso.md\n\"\"\""
  },
  {
    "path": "lakehouse_engine_usage/data_loader/extract_from_sap_bw_dso/extract_from_sap_bw_dso.md",
    "content": "# Extract from SAP BW DSOs\n\n!!! danger \"**Parallelization Limitations**\"\n    Parallel extractions **can bring a jdbc source down** if a lot of stress is put on the system. Be careful choosing the number of partitions. Spark is a distributed system and can lead to many connections.\n\nA custom sap_bw reader and a few utils are offered in the lakehouse-engine framework so that consumption of data from \nSAP BW DSOs can be easily created. The framework abstracts all the logic behind the init/delta extractions \n(active table, changelog table, activation requests table, how to identify the next delta timestamp...), \nonly requiring a few parameters that are explained and exemplified in the \n[template](#extraction-from-sap-bw-template) scenarios that we have created.\n\nThis page also provides you a section to help you figure out a good candidate for [partitioning the extraction from SAP BW](#how-can-we-decide-the-partitionColumn).\n\nYou can check the code documentation of the reader below:\n\n[**SAP BW Reader**](../../../reference/packages/io/readers/sap_bw_reader.md)\n\n[**JDBC Extractions arguments**](../../../reference/packages/utils/extraction/jdbc_extraction_utils.md#packages.utils.extraction.jdbc_extraction_utils.JDBCExtraction.__init__)\n\n[**SAP BW Extractions arguments**](../../../reference/packages/utils/extraction/sap_bw_extraction_utils.md#packages.utils.extraction.sap_bw_extraction_utils.SAPBWExtraction.__init__)\n\n!!! note\n    For extractions using the SAP BW reader, you can use the arguments listed in the SAP BW arguments, but also \n    the ones listed in the JDBC extractions, as those are inherited as well. \n\n\n## Extraction from SAP-BW template\n\nThis template covers the following scenarios of extractions from the SAP BW DSOs:\n\n- 1 - The Simplest Scenario (Not parallel - Not Recommended)\n- 2 - Parallel extraction\n  - 2.1 - Simplest Scenario\n  - 2.2 - Provide upperBound (Recommended)\n  - 2.3 - Automatic upperBound (Recommended)\n  - 2.4 - Backfilling\n  - 2.5 - Provide predicates (Recommended)\n  - 2.6 - Generate predicates (Recommended)\n- 3 - Extraction from Write Optimized DSO\n  - 3.1 - Get initial actrequest_timestamp from Activation Requests Table\n\n!!! note \"Introductory Notes\"\n    If you want to have a better understanding about JDBC Spark optimizations, \n    here you have a few useful links:\n    - https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html\n    - https://docs.databricks.com/en/connect/external-systems/jdbc.html\n    - https://bit.ly/3x2eCEm\n    - https://newbedev.com/how-to-optimize-partitioning-when-migrating-data-from-jdbc-source\n\n### 1 - The Simplest Scenario (Not parallel - Not Recommended)\nThis scenario is the simplest one, not taking any advantage of Spark JDBC optimisation techniques \nand using a single connection to retrieve all the data from the source. It should only be used in case the DSO \nyou want to extract from SAP BW is a small one, with no big requirements in terms of performance to fulfill.\n\nWhen extracting from the source DSO, there are two options:\n\n- **Delta Init** - full extraction of the source DSO. You should use it in the first time you extract from the \nDSO or any time you want to re-extract completely. 
Similar to a so-called full load.\n- **Delta** - extracts the portion of the data that is new or has changed in the source, since the last\nextraction (using the max `actrequest_timestamp` value in the location of the data already extracted,\nby default).\n\nThe example below is composed of two cells.\n\n- The first cell is only responsible for defining the variables `extraction_type` and `write_type`,\ndepending on the extraction type: **Delta Init** (`LOAD_TYPE = INIT`) or **Delta** (`LOAD_TYPE = DELTA`).\nThe variables in this cell will also be referenced by other acons/examples in this notebook, similar to what\nyou would do in your pipelines/jobs, defining this centrally and then re-using it.\n- The second cell is where the acon to be used is defined (which uses the two variables `extraction_type` and\n`write_type` defined above) and the `load_data` algorithm is executed to perform the extraction.\n\n!!! note\n    There may be cases where you might want to always extract fully from the source DSO. In these cases,\n    you only need to use a Delta Init every time, meaning you would use `\"extraction_type\": \"init\"` and\n    `\"write_type\": \"overwrite\"` as it is shown below. The explanation about what a Delta Init/Delta is\n    applies to all the scenarios presented in this notebook.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            # You should use this custom reader to benefit from the lakehouse-engine utils for extractions from SAP BW\n            \"data_format\": \"sap_bw\",\n            \"options\": {\n                \"user\": \"my_user\",\n                \"password\": \"my_hana_pwd\",\n                \"url\": \"my_sap_bw_url\",\n                \"dbtable\": \"my_database.my_table\",\n                \"odsobject\": \"my_ods_object\",\n                \"changelog_table\": \"my_database.my_changelog_table\",\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier/\",\n                \"extraction_type\": extraction_type,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"actrequest_timestamp\"],\n            \"location\": \"s3://my_path/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n### 2 - Parallel extraction\nIn this section, 6 possible scenarios for parallel extractions from SAP BW DSOs are presented.\n\n#### 2.1 - Parallel Extraction, Simplest Scenario\nThis scenario provides the simplest example you can have for a parallel extraction from SAP BW, only using\nthe property `numPartitions`. The goal of the scenario is to cover the case in which people do not have\nmuch knowledge around how to optimize the extraction from JDBC sources or cannot identify a column that can\nbe used to split the extraction in several tasks. 
This scenario can also be used if the use case does not\nhave big performance requirements/concerns, meaning you do not feel the need to optimize the performance of\nthe extraction to its maximum potential. \nOn the example below, `\"numPartitions\": 10` is specified, meaning that Spark will open 10 parallel connections\nto the source DSO and automatically decide how to parallelize the extraction upon that requirement. This is the\nonly change compared to the example provided in the example 1.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_bw\",\n            \"options\": {\n                \"user\": \"my_user\",\n                \"password\": \"my_hana_pwd\",\n                \"url\": \"my_sap_bw_url\",\n                \"dbtable\": \"my_database.my_table\",\n                \"odsobject\": \"my_ods_object\",\n                \"changelog_table\": \"my_database.my_changelog_table\",\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier/\",\n                \"extraction_type\": extraction_type,\n                \"numPartitions\": 10,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"actrequest_timestamp\"],\n            \"location\": \"s3://my_path/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.2 - Parallel Extraction, Provide upper_bound (Recommended)\nThis scenario performs the extraction from the SAP BW DSO in parallel, but is more concerned with trying to\noptimize and have more control (compared to 2.1 example) on how the extraction is split and performed, using\nthe following options:\n\n- `numPartitions` - number of Spark partitions to split the extraction.\n- `partitionColumn` - column used to split the extraction. It must be a numeric, date, or timestamp.\nIt should be a column that is able to split the extraction evenly in several tasks. An auto-increment\ncolumn is usually a very good candidate.\n- `lowerBound` - lower bound to decide the partition stride.\n- `upperBound` - upper bound to decide the partition stride. It can either be **provided (as it is done in\nthis example)** or derived automatically by our upperBound optimizer (example 2.3).\n\nThis is an adequate example for you to follow if you have/know a column in the DSO that is good to be used as\nthe `partitionColumn`. 
If you compare with the previous example, you'll notice that now `numPartitions` and\nthree additional options are provided to fine tune the extraction (`partitionColumn`, `lowerBound`,\n`upperBound`).\n\nWhen these 4 properties are used, Spark will use them to build several queries to split the extraction.\n\n**Example:** for `\"numPartitions\": 10`, `\"partitionColumn\": \"record\"`, `\"lowerBound: 1\"`, `\"upperBound: 100\"`,\nSpark will generate 10 queries like this:\n\n- `SELECT * FROM dummy_table WHERE RECORD < 10 OR RECORD IS NULL`\n- `SELECT * FROM dummy_table WHERE RECORD >= 10 AND RECORD < 20`\n- `SELECT * FROM dummy_table WHERE RECORD >= 20 AND RECORD < 30`\n- ...\n- `SELECT * FROM dummy_table WHERE RECORD >= 100`\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_bw\",\n            \"options\": {\n                \"user\": \"my_user\",\n                \"password\": \"my_hana_pwd\",\n                \"url\": \"my_sap_bw_url\",\n                \"dbtable\": \"my_database.my_table\",\n                \"odsobject\": \"my_ods_object\",\n                \"changelog_table\": \"my_database.my_changelog_table\",\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier/\",\n                \"extraction_type\": extraction_type,\n                \"numPartitions\": 3,\n                \"partitionColumn\": \"my_partition_col\",\n                \"lowerBound\": 1,\n                \"upperBound\": 42,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"actrequest_timestamp\"],\n            \"location\": \"s3://my_path/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.3 - Parallel Extraction, Automatic upper_bound (Recommended)\nThis scenario is very similar to 2.2, the only difference being that **upper_bound\nis not provided**. Instead, the property `calculate_upper_bound` equals to true is used to benefit\nfrom the automatic calculation of the upperBound (derived from the `partitionColumn`) offered by the\nlakehouse-engine framework, which is useful, as in most of the cases you will probably not be aware of\nthe max value for the column. 
The only thing you need to consider is that if you use this automatic\ncalculation of the upperBound you will be doing an initial query to the SAP BW DSO to retrieve the max\nvalue for the `partitionColumn`, before doing the actual query to perform the extraction.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_bw\",\n            \"calculate_upper_bound\": True,\n            \"options\": {\n                \"user\": \"my_user\",\n                \"password\": \"my_hana_pwd\",\n                \"url\": \"my_sap_bw_url\",\n                \"dbtable\": \"my_database.my_table\",\n                \"odsobject\": \"my_ods_object\",\n                \"changelog_table\": \"my_database.my_changelog_table\",\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier/\",\n                \"extraction_type\": extraction_type,\n                \"numPartitions\": 10,\n                \"partitionColumn\": \"my_partition_col\",\n                \"lowerBound\": 1,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"actrequest_timestamp\"],\n            \"location\": \"s3://my_path/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.4 - Parallel Extraction, Backfilling\nThis scenario covers the case in which you might want to backfill the data extracted from a SAP BW DSO and\nmade available in the bronze layer. By default, the delta extraction considers the max value of the column\n`actrequest_timestamp` on the data already extracted. However, there might be cases in which you want\nto extract a delta from a particular timestamp onwards or for a particular interval of time. For this, you\ncan use the properties `min_timestamp` and `max_timestamp`.\n\nBelow, a very similar example to the previous one is provided, the only differences being that\nthe properties `\"min_timestamp\": \"20210910000000\"` and `\"max_timestamp\": \"20210913235959\"` are now provided,\nmeaning it will extract the data from the changelog table, using a filter\n`actrequest_timestamp > \"20210910000000\" AND actrequest_timestamp <= \"20210913235959\"`, regardless of whether some of the data is already\navailable in the destination or not. Moreover, note that the property `latest_timestamp_data_location`\ndoes not need to be provided, as the timestamps to be considered are directly provided (if both\nthe timestamps and the `latest_timestamp_data_location` are provided, the latter parameter will have no effect).\nAdditionally, `\"extraction_type\": \"delta\"` and `\"write_type\": \"append\"` are forced, instead of using the\nvariables as in the other examples, because the backfilling scenario only makes sense for delta extractions.\n\n!!! 
note\n    Note: be aware that the backfilling example being shown has no mechanism to enforce that\n    you don't generate duplicated data in bronze. For your scenarios, you can either use this example and solve\n    any duplication in the silver layer or extract the delta with a merge strategy while writing to bronze,\n    instead of appending.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_bw\",\n            \"calculate_upper_bound\": True,\n            \"options\": {\n                \"user\": \"my_user\",\n                \"password\": \"my_hana_pwd\",\n                \"url\": \"my_sap_bw_url\",\n                \"dbtable\": \"my_database.my_table\",\n                \"odsobject\": \"my_ods_object\",\n                \"changelog_table\": \"my_database.my_changelog_table\",\n                \"extraction_type\": \"delta\",\n                \"numPartitions\": 10,\n                \"partitionColumn\": \"my_partition_col\",\n                \"lowerBound\": 1,\n                \"min_timestamp\": \"20210910000000\",\n                \"max_timestamp\": \"20210913235959\",\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"actrequest_timestamp\"],\n            \"location\": \"s3://my_path/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.5 - Parallel Extraction, Provide Predicates (Recommended)\nThis scenario performs the extraction from SAP BW DSO in parallel, useful in contexts in which there is no\nnumeric, date or timestamp column to parallelize the extraction:\n\n- `partitionColumn` - column used to split the extraction. It can be of any type. 
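For context, the sketch below shows the plain Spark JDBC mechanism such predicates feed into, where each predicate becomes one partition, i.e. one query sent to the source (hypothetical values and connection options, not the engine's code; the full lakehouse-engine example follows further below):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each entry becomes one Spark partition, i.e. one query against the source.
predicates = ["LOAD_DATE='20200101'", "LOAD_DATE='20200102'", "LOAD_DATE IS NULL"]

df = spark.read.jdbc(
    url="my_sap_bw_url",                      # assumption: same connection details as the ACON below
    table="my_database.my_table",
    predicates=predicates,
    properties={
        "user": "my_user",
        "password": "my_hana_pwd",
        "driver": "com.sap.db.jdbc.Driver",   # assumption: SAP HANA JDBC driver
    },
)
```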
\n\nThis is an adequate example for you to follow if you have/know a column in the DSO that is good to be used as\nthe `partitionColumn`, specially if these columns are not complying with the scenarios 2.2 and 2.3 (otherwise\nthose would probably be recommended).\n\n**When this property is used all predicates need to be provided to Spark, otherwise it will leave data behind.**\n\nBelow the lakehouse function to generate predicate list automatically is presented.\n\nThis function needs to be used carefully, specially on predicates_query and predicates_add_null variables.\n\n**predicates_query:** At the sample below the whole table is being considered (`select distinct(x) from table`),\nbut it is possible to filter predicates list here,\nspecially if you are applying filter on transformations spec, and you know entire table won't be necessary, so\nyou can change it to something like this: `select distinct(x) from table where x > y`.\n\n**predicates_add_null:** You can decide if you want to consider null on predicates list or not, by default this\nproperty is True.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\n# import the lakehouse_engine ExecEnv class, so that you can use the functions it offers\n# import the lakehouse_engine extraction utils, so that you can use the JDBCExtractionUtils offered functions\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.extraction.jdbc_extraction_utils import (\n    JDBCExtraction,\n    JDBCExtractionUtils,\n)\n\nExecEnv.get_or_create()\n\npartition_column = \"my_partition_column\"\ndbtable = \"my_database.my_table\"\n\npredicates_query = f\"\"\"(SELECT DISTINCT({partition_column}) FROM {dbtable})\"\"\"\ncolumn_for_predicates = partition_column\nuser = \"my_user\"\npassword = \"my_hana_pwd\"\nurl = \"my_bw_url\"\npredicates_add_null = True\n\njdbc_util = JDBCExtractionUtils(\n    JDBCExtraction(\n        user=user,\n        password=password,\n        url=url,\n        dbtable=dbtable,\n        partition_column=partition_column,\n    )\n)\n\npredicates = jdbc_util.get_predicates(predicates_query)\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_bw\",\n            \"options\": {\n                \"user\": \"my_user\",\n                \"password\": \"my_hana_pwd\",\n                \"url\": \"my_sap_bw_url\",\n                \"dbtable\": \"my_database.my_table\",\n                \"odsobject\": \"my_ods_object\",\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier/\",\n                \"extraction_type\": extraction_type,\n                \"predicates\": predicates,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"actrequest_timestamp\"],\n            \"location\": \"s3://my_path/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": 
True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.6 - Parallel Extraction, Generate Predicates (Recommended)\nThis scenario performs the extraction from SAP BW DSO in parallel, which is useful when there is no\nnumeric, date or timestamp column to parallelize the extraction:\n\n- `partitionColumn` - column used to split the extraction. It can be of any type.\n\nThis is an adequate example to follow if you have/know a column in the DSO that is suitable to be used as\nthe `partitionColumn`, especially if that column does not comply with the requirements of scenarios 2.2 and 2.3\n(otherwise those would probably be the recommended ones).\n\nWhen this property is used, the lakehouse engine will generate the predicates to be used to extract data from\nthe source: it checks which distinct values of the `partitionColumn` serve the init/delta portion of the data,\nand these values are then used by Spark to generate several queries to extract from the source in a parallel fashion.\nEach distinct value of the `partitionColumn` will be a query, meaning that you will not have control over the\nnumber of partitions used for the extraction. For example, if you are using `LOAD_DATE` as the `partitionColumn`\nand, for today's delta, all the data (let's suppose 2 million rows) is served by a single `LOAD_DATE = 20200101`,\nSpark would use a single partition to extract everything. In this extreme case you would probably need to change\nyour `partitionColumn`. **Note:** such extreme cases are less likely to happen when you use the strategy of scenarios 2.2/2.3.\n\n**Example:** for `\"partitionColumn\": \"record\"`\nGenerate predicates:\n\n- `SELECT DISTINCT(RECORD) as RECORD FROM dummy_table`\n- `1`\n- `2`\n- `3`\n- ...\n- `100`\n- Predicates List: ['RECORD=1','RECORD=2','RECORD=3',...,'RECORD=100']\n\nSpark will generate 100 queries like this:\n\n- `SELECT * FROM dummy_table WHERE RECORD = 1`\n- `SELECT * FROM dummy_table WHERE RECORD = 2`\n- `SELECT * FROM dummy_table WHERE RECORD = 3`\n- ...\n- `SELECT * FROM dummy_table WHERE RECORD = 100`\n\nGenerate predicates will also consider null by default:\n- `SELECT * FROM dummy_table WHERE RECORD IS NULL`\n\nTo disable this behaviour, set `\"predicates_add_null\": False`.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_bw\",\n            \"generate_predicates\": True,\n            \"options\": {\n                \"user\": \"my_user\",\n                \"password\": \"my_hana_pwd\",\n                \"url\": \"my_sap_bw_url\",\n                \"dbtable\": \"my_database.my_table\",\n                \"odsobject\": \"my_ods_object\",\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier/\",\n                \"extraction_type\": extraction_type,\n                \"partitionColumn\": \"my_partition_col\",\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"actrequest_timestamp\"],\n            \"location\": \"s3://my_path/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n### 3 - Extraction from Write Optimized DSOs\nThis scenario is based on the best practices of scenario 2.2, but it is ready to extract data from\nWrite Optimized DSOs, which have the changelog embedded in the active table, instead of having a separate\nchangelog table. For this reason, you need to set the `changelog_table` parameter to the same value as\nthe `dbtable` parameter.\nMoreover, these tables usually already include the changelog technical columns\n(like `RECORD` and `DATAPAKID`, for example) that the framework adds by default. Thus, you need to specify\n`\"include_changelog_tech_cols\": False` to change this behaviour.\nFinally, you also need to specify the name of the column in the table that can be used to join with the\nactivation requests table to get the timestamp of the several requests/deltas,\nwhich defaults to `\"actrequest\"` and is set here via `\"request_col_name\": \"request\"`.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOAD_TYPE = \"INIT\" or \"DELTA\"\n\nif LOAD_TYPE == \"INIT\":\n    extraction_type = \"init\"\n    write_type = \"overwrite\"\nelse:\n    extraction_type = \"delta\"\n    write_type = \"append\"\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_bw\",\n            \"options\": {\n                \"user\": \"my_user\",\n                \"password\": \"my_hana_pwd\",\n                \"url\": \"my_sap_bw_url\",\n                \"dbtable\": \"my_database.my_table\",\n                \"changelog_table\": \"my_database.my_table\",\n                \"odsobject\": \"my_ods_object\",\n                \"request_col_name\": \"request\",\n                \"include_changelog_tech_cols\": False,\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier/\",\n                \"extraction_type\": extraction_type,\n                \"numPartitions\": 2,\n                \"partitionColumn\": \"RECORD\",\n                \"lowerBound\": 1,\n                \"upperBound\": 50000,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": write_type,\n            \"data_format\": \"delta\",\n            \"partitions\": [\"actrequest_timestamp\"],\n            \"location\": \"s3://my_path/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 3.1 - Extraction from Write Optimized DSOs, Get ACTREQUEST_TIMESTAMP from Activation Requests Table\nBy default, the act_request_timestamp is hardcoded in the init extraction (it either assumes a given\nextraction_timestamp or the current timestamp), which may cause problems when merging changes in silver\nfor write optimised DSOs. Therefore, an option was added to retrieve this timestamp from the\nact_req_table instead.\n\nThis scenario performs the data extraction from Write Optimized DSOs, forcing the actrequest_timestamp to\nassume the value from the activation requests table (timestamp column).\n\nThis feature is only available for Write Optimized DSOs and, to use it, you need to specify\n`\"get_timestamp_from_act_request\": True`, as in the example below.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sap_bw\",\n            \"options\": {\n                \"user\": \"my_user\",\n                \"password\": \"my_hana_pwd\",\n                \"url\": \"my_sap_bw_url\",\n                \"dbtable\": \"my_database.my_table\",\n                \"changelog_table\": \"my_database.my_table\",\n                \"odsobject\": \"my_ods_object\",\n                \"request_col_name\": \"request\",\n                \"include_changelog_tech_cols\": False,\n                \"latest_timestamp_data_location\": \"s3://my_path/my_identifier_ACTREQUEST_TIMESTAMP/\",\n                \"extraction_type\": \"init\",\n                \"numPartitions\": 2,\n                \"partitionColumn\": \"RECORD\",\n                \"lowerBound\": 1,\n                \"upperBound\": 50000,\n                \"get_timestamp_from_act_request\": True,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"actrequest_timestamp\"],\n            \"location\": \"s3://my_path/my_identifier_ACTREQUEST_TIMESTAMP\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n## How can we decide the partitionColumn?\n\n**Compatible partitionColumn for upperBound/lowerBound Spark options:**\n\nIt needs to be **int, date or timestamp** → https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html\n\n**If you don't have any column in those formats to partition on, you can use predicates to partition the table** → https://docs.databricks.com/en/connect/external-systems/jdbc.html#manage-parallelism\n\nOne of the most important parameters to optimise the extraction is the **partitionColumn**, as you can see in the template. Thus, this section helps you figure out whether a column is a good candidate or not.\n\nBasically, the partition column needs to be a column that is able to adequately split the processing, which means we can use it to \"create\" different queries with intervals/filters, so that the Spark tasks process similar amounts of rows/volume. Usually a good candidate is an integer auto-increment technical column.\n\n!!! note\n    Although RECORD is usually a good candidate, it is typically available on the changelog table only, meaning that you would need to use a different strategy for the init. In case you don't have good candidates for partitionColumn, you can use the sample acon provided in the **scenario 2.1** in the template above. It might make sense to use **scenario 2.1** for the init and then **scenario 2.2 or 2.3** for the subsequent deltas.\n\n**When there is no good int, date or timestamp candidate for partitionColumn:**\n\nIn this case you can opt for **scenario 2.6 - Generate Predicates**, which supports any kind of column to be defined as the **partitionColumn**.\n\nHowever, you should still analyse whether the column you are thinking about is a good candidate or not. In this scenario, Spark will create one query per distinct value of the **partitionColumn**, so you can perform some analysis upfront, as shown in the sketch below.
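\n\nThe following is a minimal sketch of such an analysis, assuming a Spark session is already available as `spark` (as in a Databricks notebook) and that the source data is reachable by Spark (for example, through a bronze copy or a temporary view); the table and column names are purely illustrative. A roughly even distribution of rows per distinct value (and a reasonable number of distinct values) indicates a good `partitionColumn` candidate for generate predicates.\n\n```python\nfrom pyspark.sql import functions as F\n\n# Illustrative names: replace with your own table and candidate column.\ncandidate_col = \"my_partition_col\"\ndf = spark.table(\"my_database.my_table\")\n\n# Number of rows served by each distinct value of the candidate column.\ndistribution = (\n    df.groupBy(candidate_col)\n    .agg(F.count(F.lit(1)).alias(\"rows_per_value\"))\n    .orderBy(F.col(\"rows_per_value\").desc())\n)\ndistribution.show(20, truncate=False)\n```"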
  },
  {
    "path": "lakehouse_engine_usage/data_loader/extract_from_sftp/__init__.py",
    "content": "\"\"\"\n.. include::extract_from_sftp.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/extract_from_sftp/extract_from_sftp.md",
    "content": "# Extract from SFTP\n\nSecure File Transfer Protocol (SFTP) is a file protocol for transferring files over the web.\n\nThis feature is available in the Lakehouse Engine with the purpose of having a mechanism to read data directly from SFTP directories without moving those files manually/physically to a S3 bucket.\n\nThe engine uses Pandas to read the files and converts them into a Spark dataframe, which makes the available resources of an Acon usable, such as `dq_specs`, `output_specs`, `terminator_specs` and `transform_specs`.\n\nFurthermore, this feature provides several filters on the directories that makes easier to control the extractions.\n\n\n#### **Introductory Notes**:\n\nThere are important parameters that must be added to **input specs** in order to make the SFTP extraction work properly:\n\n\n!!! note \"**Read type**\"\n    The engine supports only **BATCH** mode for this feature.\n\n\n**sftp_files_format** - File format that will be used to read data from SFTP. **The engine supports: CSV, FWF, JSON and XML**.\n\n**location** - The SFTP directory to be extracted. If it is necessary to filter a specific file, it can be made using the `file_name_contains` option.\n\n**options** - Arguments used to set the Paramiko SSH client connection (hostname, username, password, port...), set the filter to retrieve files and set the file parameters (separators, headers, cols...). For more information about the file parameters, please go to the Pandas link in the useful links section.\n\nThe options allowed are:\n\n| Property type                 | Detail                   | Example                                                                | Comment                                                                                                                                                                                                                                                                                               |\n|-------------------------------|--------------------------|------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| Connection                    | add_auto_policy(str)     | true of false                                                          | Indicates to allow an SFTP connection using no host key. When a connection attempt is being made using no host key, then the engine will throw an exception if the auto_add_policy property is false. The purpose of this flag is to make the user conscientiously choose a lesser secure connection. |\n| Connection                    | key_type (str)           | \"Ed25519\" or \"RSA\"                                                     | Indicates the key type to be used for the connection (SSH, Ed25519).                                                                                                                                                                                                                                  |\n| Connection                    | key_filename (str)       | \"/path/to/private_key/private_key.ppk\"                                 | The filename, or list of filenames, of optional private(keys), and/or certs to try for authentication. 
It must be used with a pkey in order to add a policy. If a pkey is not provided, then use `add_auto_policy`.                                                                                   |\n| Connection                    | pkey (str)               | \"AAAAC3MidD1lVBI1NTE5AAAAIKssLqd6hjahPi9FBH4GPDqMqwxOMsfxTgowqDCQAeX+\" | Value to use for the host key when connecting to the remote SFTP server.                                                                                                                                                                                                                              |\n| Filter                        | date_time_gt (str)       | \"1900-01-01\" or \"1900-01-01 08:59:59\"                                  | Filter the files greater than the string datetime formatted as \"YYYY-MM-DD\" or \"YYYY-MM-DD HH:MM:SS\"                                                                                                                                                                                                  |\n| Filter                        | date_time_lt (str)       | \"3999-12-31\" or \"3999-12-31 20:59:59\"                                  | Filter the files lower than the string datetime formatted as \"YYYY-MM-DD\" or \"YYYY-MM-DD HH:MM:SS\"                                                                                                                                                                                                    |\n| Filter                        | earliest_file (bool)     | true or false                                                          | Filter the earliest dated file in the directory.                                                                                                                                                                                                                                                      |\n| Filter                        | file_name_contains (str) | \"part_of_filename\"                                                     | Filter files when match the pattern.                                                                                                                                                                                                                                                                  |\n| Filter                        | latest_file (bool)       | true or false                                                          | Filter the most recent dated file in the directory.                                                                                                                                                                                                                                                   |\n| Read data from subdirectories | sub_dir (bool)           | true or false                                                          | The engine will search files into subdirectories of the **location**. It will consider one level below the root location given.<br>When `sub_dir` is used with **latest_file/earliest_file** argument, the engine will retrieve the latest/earliest file for each subdirectory.                       |\n| Add metadata info             | file_metadata (bool)     | true or false                                                          | When this option is set as True, the dataframe retrieves the **filename with location** and the **modification_time** from the original files in sftp. 
It attaches these two columns adding the information to respective records.                                                                    |\n\n**Useful Info & Links**:\n1. [Paramiko SSH Client](https://docs.paramiko.org/en/latest/api/client.html)\n2. [Pandas documentation](https://pandas.pydata.org/docs/reference/io.html)\n\n\n## Scenario 1\nThe scenario below shows the extraction of a CSV file using most part of the available filter options. Also, as an example, the column \"created_on\" is created in the transform_specs in order to store the processing date for every record. As the result, it will have in the output table the original file date (provided by the option `file_metadata`) and the processing date from the engine.\n\nFor an incremental load approach, it is advised to use the \"modification_time\" column created by the option `file_metadata`. Since it has the original file date of modification, this date can be used in the logic to control what is new and has been changed recently.\n\n!!! note\n    Below scenario uses **\"add_auto_policy\": true**, which is **not recommended**.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n  \"input_specs\": [\n      {\n          \"spec_id\": \"sftp_source\",\n          \"read_type\": \"batch\",\n          \"data_format\": \"sftp\",\n          \"sftp_files_format\": \"csv\",\n          \"location\": \"my_sftp_data_path\",\n          \"options\": {\n              \"hostname\": \"my_sftp_hostname\",\n              \"username\": \"my_sftp_username\",\n              \"password\": \"my_sftp_password\",\n              \"port\": \"my_port\",\n              \"add_auto_policy\": True,\n              \"file_name_contains\": \"test_pattern\",\n              \"args\": {\"sep\": \"|\"},\n              \"latest_file\": True,\n              \"file_metadata\": True\n          }\n      },\n  ],\n  \"transform_specs\": [\n      {\n          \"spec_id\": \"sftp_transformations\",\n          \"input_id\": \"sftp_source\",\n          \"transformers\": [\n              {\n                  \"function\": \"with_literals\",\n                  \"args\": {\"literals\": {\"created_on\": datetime.now()}},\n              },\n          ],\n      },\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sftp_bronze\",\n      \"input_id\": \"sftp_transformations\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"s3://my_path/dummy_table\"\n    }\n  ]\n}\n\nload_data(acon=acon)\n```\n\n## Scenario 2\nThe following scenario shows the extraction of a JSON file using an RSA pkey authentication instead of auto_add_policy. The engine supports Ed25519Key and RSA for pkeys.\n\nFor the pkey file location, it is important to have the file in a location accessible by the cluster. This can be achieved either by mounting the location or with volumes.\n\n!!! 
note\n    This scenario uses a more secure authentication, thus it is the recommended option, instead of the previous scenario.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n  \"input_specs\": [\n      {\n          \"spec_id\": \"sftp_source\",\n          \"read_type\": \"batch\",\n          \"data_format\": \"sftp\",\n          \"sftp_files_format\": \"json\",\n          \"location\": \"my_sftp_data_path\",\n          \"options\": {\n              \"hostname\": \"my_sftp_hostname\",\n              \"username\": \"my_sftp_username\",\n              \"password\": \"my_sftp_password\",\n              \"port\": \"my_port\",\n              \"key_type\": \"RSA\",\n              \"key_filename\": \"dbfs_mount_location/my_file_key.ppk\",\n              \"pkey\": \"my_key\",\n              \"latest_file\": True,\n              \"file_metadata\": True,\n              \"args\": {\"lines\": True, \"orient\": \"columns\"},\n          },\n      },\n  ],\n  \"transform_specs\": [\n      {\n          \"spec_id\": \"sftp_transformations\",\n          \"input_id\": \"sftp_source\",\n          \"transformers\": [\n              {\n                  \"function\": \"with_literals\",\n                  \"args\": {\"literals\": {\"lh_created_on\": datetime.now()}},\n              },\n          ],\n      },\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sftp_bronze\",\n      \"input_id\": \"sftp_transformations\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"s3://my_path/dummy_table\"\n    }\n  ]\n}\n\nload_data(acon=acon)\n```"
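\n\n## Incremental load sketch using `file_metadata`\n\nAs mentioned in Scenario 1, the \"modification_time\" column created by the `file_metadata` option can be used to drive an incremental load. The ACON below is a minimal sketch of that idea, reusing the `get_max_value` and `incremental_filter` transformers shown in other loaders of this documentation. It assumes a bronze Delta table already exists at the target location; all spec ids, paths, connection options and column names are illustrative and need to be adapted to your use case.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sftp_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sftp\",\n            \"sftp_files_format\": \"csv\",\n            \"location\": \"my_sftp_data_path\",\n            \"options\": {\n                \"hostname\": \"my_sftp_hostname\",\n                \"username\": \"my_sftp_username\",\n                \"password\": \"my_sftp_password\",\n                \"port\": \"my_port\",\n                \"key_type\": \"RSA\",\n                \"key_filename\": \"dbfs_mount_location/my_file_key.ppk\",\n                \"pkey\": \"my_key\",\n                \"args\": {\"sep\": \"|\"},\n                \"file_metadata\": True,\n            },\n        },\n        {\n            # Existing bronze table, read only to compute the latest ingested modification_time.\n            \"spec_id\": \"sftp_bronze_current\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_path/dummy_table\",\n        },\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"max_modification_time\",\n            \"input_id\": \"sftp_bronze_current\",\n            \"transformers\": [{\"function\": \"get_max_value\", \"args\": {\"input_col\": \"modification_time\"}}],\n        },\n        {\n            \"spec_id\": \"new_files_only\",\n            \"input_id\": \"sftp_source\",\n            \"transformers\": [\n                {\n                    \"function\": \"incremental_filter\",\n                    \"args\": {\"input_col\": \"modification_time\", \"increment_df\": \"max_modification_time\"},\n                }\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sftp_bronze\",\n            \"input_id\": \"new_files_only\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_path/dummy_table\",\n        }\n    ],\n}\n\nload_data(acon=acon)\n```"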
  },
  {
    "path": "lakehouse_engine_usage/data_loader/extract_using_jdbc_connection/__init__.py",
    "content": "\"\"\"\n.. include::extract_using_jdbc_connection.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/extract_using_jdbc_connection/extract_using_jdbc_connection.md",
    "content": "# Extract using JDBC connection\n\n!!! danger \"**SAP Extraction**\"\n\n    SAP is only used as an example to demonstrate how we can use a JDBC connection to extract data.\n\n    **If you are looking to extract data from SAP, please use our sap_b4 or sap_bw reader.**\n\n    You can find the **sap_b4 reader** documentation: [Extract from SAP B4 ADSOs](../../data_loader/extract_from_sap_b4_adso/extract_from_sap_b4_adso.md) and the **sap_bw reader** documentarion: [Extract from SAP BW DSOs](../../data_loader/extract_from_sap_bw_dso/extract_from_sap_bw_dso.md)\n\n!!! danger \"**Parallel Extraction**\"\n    Parallel extractions **can bring a jdbc source down** if a lot of stress is put on the system. Be careful choosing the number of partitions. Spark is a distributed system and can lead to many connections.\n\n## Introduction\n\nMany databases allow a JDBC connection to extract data. Our engine has one reader where you can configure all the necessary definitions to connect to a database using JDBC.\n\nIn the next section you will find several examples about how to do it.\n\n## The Simplest Scenario using sqlite \n!!! warning \"Not parallel\"\n    Recommended for smaller datasets only, or when stressing the source system is a high concern\n\nThis scenario is the simplest one we can have, not taking any advantage of Spark JDBC optimisation techniques and using a single connection to retrieve all the data from the source.\n\nHere we use a sqlite database where any connection is allowed. Due to that, we do not specify any username or password.\n\nSame as spark, we provide two different ways to run jdbc reader.\n\n1 - We can use the **jdbc() function**, passing inside all the arguments needed for Spark to work, and we can even combine this with additional options passed through .options().\n\n2 - Other way is using **.format(\"jdbc\")** and pass all necessary arguments through .options(). It's important to say by choosing jdbc() we can also add options() to the execution.\n\n\n**You can find and run the following code in our local test for the engine.**\n\n### jdbc() function\n\nAs we can see in the next cell, all the arguments necessary to establish the jdbc connection are passed inside the `jdbc_args` object. Here we find the url, the table, and the driver. Besides that, we can add options, such as the partition number. The partition number will impact in the queries' parallelism.\n\nThe below code is an example in how to use jdbc() function in our ACON.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/jdbc_reader/jdbc_function/correct_arguments/batch_init.json!}\n```\n\nThis is same as using the following code in pyspark:\n\n```python\nspark.read.jdbc(\n  url=\"jdbc:sqlite:/app/tests/lakehouse/in/feature/jdbc_reader/jdbc_function/correct_arguments/tests.db\",\n  table=\"jdbc_function\",\n  properties={\"driver\":\"org.sqlite.JDBC\"})\n  .option(\"numPartitions\", 1)\n```\n\n### .format(\"jdbc\")\n\nIn this example we do not use the `jdbc_args` object. 
All the jdbc connection parameters are inside the dictionary with the object options.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/jdbc_reader/jdbc_format/correct_arguments/batch_init.json!}\n```\n\nThis is same as using the following code in pyspark:\n\n```python\nspark.read.format(\"jdbc\")\n    .option(\"url\", \"jdbc:sqlite:/app/tests/lakehouse/in/feature/jdbc_reader/jdbc_format/correct_arguments/tests.db\")\n    .option(\"driver\", \"org.sqlite.JDBC\")\n    .option(\"dbtable\", \"jdbc_format\")\n    .option(\"numPartitions\", 1)\n```\n\n## Template with more complete and runnable examples\nIn this template we will use a **SAP as example** for a more complete and runnable example.\nThese definitions can be used in several databases that allow JDBC connection.\n\nThe following scenarios of extractions are covered:\n\n- 1 - The Simplest Scenario (Not parallel -  Recommended for smaller datasets only,\nor when stressing the source system is a high concern)\n- 2 - Parallel extraction\n  - 2.1 - Simplest Scenario \n  - 2.2 - Provide upperBound (Recommended)\n  - 2.3 - Provide predicates (Recommended)\n\n!!! note \"Disclaimer\"\n    This template only uses **SAP as demonstration example for JDBC connection.**\n    **This isn't a SAP template!!!**\n    **If you are looking to extract data from SAP, please use our sap_b4 reader or the sap_bw reader.**\n\nThe JDBC connection has 2 main sections to be filled, the **jdbc_args** and **options**:\n\n- jdbc_args - Here you need to fill everything related to jdbc connection itself, like table/query, url, user,\n..., password.\n- options - This section is more flexible, and you can provide additional options like \"fetchSize\", \"batchSize\",\n\"numPartitions\", ..., upper and \"lowerBound\".\n\nIf you want to know more regarding jdbc spark options you can follow the link below:\n\n- https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html\n\nIf you want to have a better understanding about JDBC Spark optimizations, you can find them in the following:\n\n- https://docs.databricks.com/en/connect/external-systems/jdbc.html\n- https://stackoverflow.com/questions/41085238/what-is-the-meaning-of-partitioncolumn-lowerbound-upperbound-numpartitions-pa\n- https://newbedev.com/how-to-optimize-partitioning-when-migrating-data-from-jdbc-source\n\n### 1 - The Simplest Scenario (Not parallel - Recommended for smaller datasets, or for not stressing the source)\nThis scenario is the simplest one we can have, not taking any advantage of Spark JDBC optimisation techniques\nand using a single connection to retrieve all the data from the source. It should only be used in case the data\nyou want to extract from is a small one, with no big requirements in terms of performance to fulfill.\n\nWhen extracting from the source, we can have two options:\n\n- **Delta Init** - full extraction of the source. You should use it in the first time you extract from the\nsource or any time you want to re-extract completely. Similar to a so-called full load.\n- **Delta** - extracts the portion of the data that is new or has changed in the source, since the last\nextraction (for that, the logic at the transformation step needs to be applied). 
On the examples below,\nthe logic using REQTSN column is applied, which means that the maximum value on bronze is filtered\nand its value is used to filter incoming data from the data source.\n\n##### Init - Load data into the Bronze Bucket\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"jdbc\",\n            \"jdbc_args\": {\n                \"url\": \"my_sap_b4_url\",\n                \"table\": \"my_database.my_table\",\n                \"properties\": {\n                    \"user\": \"my_user\",\n                    \"password\": \"my_b4_hana_pwd\",\n                    \"driver\": \"com.sap.db.jdbc.Driver\",\n                },\n            },\n            \"options\": {\n                \"fetchSize\": 100000,\n                \"compress\": True,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/jdbc_template/no_parallel/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n \nload_data(acon=acon)\n```\n\n##### Delta - Load data into the Bronze Bucket\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"jdbc\",\n            \"jdbc_args\": {\n                \"url\": \"my_jdbc_url\",\n                \"table\": \"my_database.my_table\",\n                \"properties\": {\n                    \"user\": \"my_user\",\n                    \"password\": \"my_b4_hana_pwd\",\n                    \"driver\": \"com.sap.db.jdbc.Driver\",\n                },\n            },\n            \"options\": {\n                \"fetchSize\": 100000,\n                \"compress\": True,\n            },\n        },\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_path/jdbc_template/no_parallel/my_identifier/\",\n        },\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"max_my_identifier_bronze_date\",\n            \"input_id\": \"my_identifier_bronze\",\n            \"transformers\": [{\"function\": \"get_max_value\", \"args\": {\"input_col\": \"REQTSN\"}}],\n        },\n        {\n            \"spec_id\": \"appended_my_identifier\",\n            \"input_id\": \"my_identifier_source\",\n            \"transformers\": [\n                {\n                    \"function\": \"incremental_filter\",\n                    \"args\": {\"input_col\": \"REQTSN\", \"increment_df\": \"max_my_identifier_bronze_date\"},\n                }\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"appended_my_identifier\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            
\"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/jdbc_template/no_parallel/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n \nload_data(acon=acon)\n```\n\n### 2 - Parallel extraction\nOn this section we present 3 possible scenarios for parallel extractions from JDBC sources.\n\n!!! note \"Disclaimer for parallel extraction\"\n    Parallel extractions can bring a jdbc source down if a lot of stress\n    is put on the system. **Be careful when choosing the number of partitions. \n    Spark is a distributed system and can lead to many connections.**\n\n#### 2.1 - Parallel Extraction, Simplest Scenario\nThis scenario provides the simplest example you can have for a parallel extraction from JDBC sources, only using\nthe property `numPartitions`. The goal of the scenario is to cover the case in which people do not have\nmuch experience around how to optimize the extraction from JDBC sources or cannot identify a column that can\nbe used to split the extraction in several tasks. This scenario can also be used if the use case does not\nhave big performance requirements/concerns, meaning you do not feel the need to optimize the performance of\nthe extraction to its maximum potential.\n\nOn the example bellow, `\"numPartitions\": 10` is specified, meaning that Spark will open 10 parallel connections\nto the source and automatically decide how to parallelize the extraction upon that requirement. This is the\nonly change compared to the example provided in the scenario 1.\n\n##### Delta Init - Load data into the Bronze Bucket\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"jdbc\",\n            \"jdbc_args\": {\n                \"url\": \"my_sap_b4_url\",\n                \"table\": \"my_database.my_table\",\n                \"properties\": {\n                    \"user\": \"my_user\",\n                    \"password\": \"my_b4_hana_pwd\",\n                    \"driver\": \"com.sap.db.jdbc.Driver\",\n                },\n            },\n            \"options\": {\n                \"fetchSize\": 100000,\n                \"compress\": True,\n                \"numPartitions\": 10,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/jdbc_template/parallel_1/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n##### Delta - Load data into the Bronze Bucket\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"jdbc\",\n            \"jdbc_args\": {\n                \"url\": \"my_sap_b4_url\",\n       
         \"table\": \"my_database.my_table\",\n                \"properties\": {\n                    \"user\": \"my_user\",\n                    \"password\": \"my_b4_hana_pwd\",\n                    \"driver\": \"com.sap.db.jdbc.Driver\",\n                },\n            },\n            \"options\": {\n                \"fetchSize\": 100000,\n                \"compress\": True,\n                \"numPartitions\": 10,\n            },\n        },\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_path/jdbc_template/parallel_1/my_identifier/\",\n        },\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"max_my_identifier_bronze_date\",\n            \"input_id\": \"my_identifier_bronze\",\n            \"transformers\": [{\"function\": \"get_max_value\", \"args\": {\"input_col\": \"REQTSN\"}}],\n        },\n        {\n            \"spec_id\": \"appended_my_identifier\",\n            \"input_id\": \"my_identifier_source\",\n            \"transformers\": [\n                {\n                    \"function\": \"incremental_filter\",\n                    \"args\": {\"input_col\": \"REQTSN\", \"increment_df\": \"max_my_identifier_bronze_date\"},\n                }\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"appended_my_identifier\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"REQTSN\"],\n            \"location\": \"s3://my_path/jdbc_template/parallel_1/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n \nload_data(acon=acon)\n```\n\n#### 2.2 - Parallel Extraction, Provide upper_bound (Recommended)\nThis scenario performs the extraction from the JDBC source in parallel, but has more concerns trying to\noptimize and have more control (compared to 2.1 example) on how the extraction is split and performed,\nusing the following options:\n\n- `numPartitions` - number of Spark partitions to split the extraction.\n- `partitionColumn` - column used to split the extraction. It must be a numeric, date, or timestamp.\nIt should be a column that is able to split the extraction evenly in several tasks. An auto-increment\ncolumn is usually a very good candidate.\n- `lowerBound` - lower bound to decide the partition stride.\n- `upperBound` - upper bound to decide the partition stride.\n\nThis is an adequate example to be followed if there is a column in the data source that is good to\nbe used as the `partitionColumn`. 
Comparing with the previous example,\nthe `numPartitions` and three additional options to fine tune the extraction (`partitionColumn`, `lowerBound`,\n`upperBound`) are provided.\n\nWhen these 4 properties are used, Spark will use them to build several queries to split the extraction.\n**Example:** for `\"numPartitions\": 10`, `\"partitionColumn\": \"record\"`, `\"lowerBound: 1\"`, `\"upperBound: 100\"`,\nSpark will generate 10 queries like:\n\n- `SELECT * FROM dummy_table WHERE RECORD < 10 OR RECORD IS NULL`\n- `SELECT * FROM dummy_table WHERE RECORD >= 10 AND RECORD < 20`\n- `SELECT * FROM dummy_table WHERE RECORD >= 20 AND RECORD < 30`\n- ...\n- `SELECT * FROM dummy_table WHERE RECORD >= 100`\n\n \n##### Init - Load data into the Bronze Bucket\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"jdbc\",\n            \"jdbc_args\": {\n                \"url\": \"my_sap_b4_url\",\n                \"table\": \"my_database.my_table\",\n                \"properties\": {\n                    \"user\": \"my_user\",\n                    \"password\": \"my_b4_hana_pwd\",\n                    \"driver\": \"com.sap.db.jdbc.Driver\",\n                },\n            },\n            \"options\": {\n                \"partitionColumn\": \"RECORD\",\n                \"numPartitions\": 10,\n                \"lowerBound\": 1,\n                \"upperBound\": 2000,\n                \"fetchSize\": 100000,\n                \"compress\": True,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"RECORD\"],\n            \"location\": \"s3://my_path/jdbc_template/parallel_2/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n \nload_data(acon=acon)\n```\n\n##### Delta - Load data into the Bronze Bucket\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"jdbc\",\n            \"jdbc_args\": {\n                \"url\": \"my_sap_b4_url\",\n                \"table\": \"my_database.my_table\",\n                \"properties\": {\n                    \"user\": \"my_user\",\n                    \"password\": \"my_b4_hana_pwd\",\n                    \"driver\": \"com.sap.db.jdbc.Driver\",\n                },\n            },\n            \"options\": {\n                \"partitionColumn\": \"RECORD\",\n                \"numPartitions\": 10,\n                \"lowerBound\": 1,\n                \"upperBound\": 2000,\n                \"fetchSize\": 100000,\n                \"compress\": True,\n            },\n        },\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_path/jdbc_template/parallel_2/my_identifier/\",\n        },\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": 
\"max_my_identifier_bronze_date\",\n            \"input_id\": \"my_identifier_bronze\",\n            \"transformers\": [{\"function\": \"get_max_value\", \"args\": {\"input_col\": \"RECORD\"}}],\n        },\n        {\n            \"spec_id\": \"appended_my_identifier\",\n            \"input_id\": \"my_identifier_source\",\n            \"transformers\": [\n                {\n                    \"function\": \"incremental_filter\",\n                    \"args\": {\"input_col\": \"RECORD\", \"increment_df\": \"max_my_identifier_bronze_date\"},\n                }\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"appended_my_identifier\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"RECORD\"],\n            \"location\": \"s3://my_path/jdbc_template/parallel_2/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=acon)\n```\n\n#### 2.3 - Parallel Extraction with Predicates (Recommended)\nThis scenario performs the extraction from JDBC source in parallel, useful in contexts where there aren't\nnumeric, date or timestamp columns to parallelize the extraction:\n\n- `partitionColumn` - column used to split the extraction (can be of any type).\n\n- This is an adequate example to be followed if there is a column in the data source that is good to be\nused as the `partitionColumn`, specially if these columns are not complying with the scenario 2.2.\n\n**When this property is used, all predicates to Spark need to be provided, otherwise it will leave data behind.**\n\nBellow, a lakehouse function to generate predicate list automatically, is presented.\n\n**By using this function one needs to be careful specially on predicates_query and predicates_add_null variables.**\n\n**predicates_query:** At the sample below the whole table (`select distinct(x) from table`) is being considered,\nbut it is possible to filter using predicates list here, specially if you are applying filter on\ntransformations spec, and you know entire table won't be necessary, so you can change it to something like this:\n`select distinct(x) from table where x > y`.\n\n**predicates_add_null:** One can consider if null on predicates list or not. 
By default, this property is True.\n**Example:** for `\"partitionColumn\": \"record\"`\n\n##### Init - Load data into the Bronze Bucket\n```python\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.extraction.jdbc_extraction_utils import (\n    JDBCExtraction,\n    JDBCExtractionUtils,\n)\nExecEnv.get_or_create()\n\npartitionColumn = \"my_partition_col\"\ndbtable = \"my_database.my_table\"\n \npredicates_query = f\"\"\"(SELECT DISTINCT({partitionColumn}) FROM {dbtable})\"\"\"\ncolumn_for_predicates = partitionColumn\nuser = \"my_user\"\npassword = \"my_b4_hana_pwd\"\nurl = \"my_sap_b4_url\"\ndriver = \"com.sap.db.jdbc.Driver\"\npredicates_add_null = True\n \njdbc_util = JDBCExtractionUtils(\n    JDBCExtraction(\n        user=user,\n        password=password,\n        url=url,\n        predicates_add_null=predicates_add_null,\n        partition_column=partitionColumn,\n        dbtable=dbtable,\n    )\n)\n \npredicates = jdbc_util.get_predicates(predicates_query)\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"jdbc\",\n            \"jdbc_args\": {\n                \"url\": \"my_sap_b4_url\",\n                \"table\": \"my_database.my_table\",\n                \"predicates\": predicates,\n                \"properties\": {\n                    \"user\": \"my_user\",\n                    \"password\": \"my_b4_hana_pwd\",\n                    \"driver\": \"com.sap.db.jdbc.Driver\",\n                },\n            },\n            \"options\": {\n                \"fetchSize\": 100000,\n                \"compress\": True,\n            },\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"my_identifier_source\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"RECORD\"],\n            \"location\": \"s3://my_path/jdbc_template/parallel_3/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n \nload_data(acon=acon)\n```\n\n##### Delta - Load data into the Bronze Bucket\n```python\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.extraction.jdbc_extraction_utils import (\n    JDBCExtraction,\n    JDBCExtractionUtils,\n)\nExecEnv.get_or_create()\n\npartitionColumn = \"my_partition_col\"\ndbtable = \"my_database.my_table\"\n\npredicates_query = f\"\"\"(SELECT DISTINCT({partitionColumn}) FROM {dbtable})\"\"\"\ncolumn_for_predicates = partitionColumn\nuser = \"my_user\"\npassword = \"my_b4_hana_pwd\"\nurl = \"my_sap_b4_url\"\ndriver = \"com.sap.db.jdbc.Driver\"\npredicates_add_null = True\n\njdbc_util = JDBCExtractionUtils(\n    JDBCExtraction(\n        user=user,\n        password=password,\n        url=url,\n        predicates_add_null=predicates_add_null,\n        partition_column=partitionColumn,\n        dbtable=dbtable,\n    )\n)\n\npredicates = jdbc_util.get_predicates(predicates_query)\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"my_identifier_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"jdbc\",\n     
       \"jdbc_args\": {\n                \"url\": \"my_sap_b4_url\",\n                \"table\": \"my_database.my_table\",\n                \"predicates\": predicates,\n                \"properties\": {\n                    \"user\": \"my_user\",\n                    \"password\": \"my_b4_hana_pwd\",\n                    \"driver\": \"com.sap.db.jdbc.Driver\",\n                },\n            },\n            \"options\": {\n                \"fetchSize\": 100000,\n                \"compress\": True,\n            },\n        },\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_path/jdbc_template/parallel_3/my_identifier/\",\n        },\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"max_my_identifier_bronze_date\",\n            \"input_id\": \"my_identifier_bronze\",\n            \"transformers\": [{\"function\": \"get_max_value\", \"args\": {\"input_col\": \"RECORD\"}}],\n        },\n        {\n            \"spec_id\": \"appended_my_identifier\",\n            \"input_id\": \"my_identifier_source\",\n            \"transformers\": [\n                {\n                    \"function\": \"incremental_filter\",\n                    \"args\": {\"input_col\": \"RECORD\", \"increment_df\": \"max_my_identifier_bronze_date\"},\n                }\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"my_identifier_bronze\",\n            \"input_id\": \"appended_my_identifier\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"RECORD\"],\n            \"location\": \"s3://my_path/jdbc_template/parallel_3/my_identifier/\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n \nload_data(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/filtered_full_load/__init__.py",
    "content": "\"\"\"\n.. include::filtered_full_load.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/filtered_full_load/filtered_full_load.md",
    "content": "# Filtered Full Load\n\nThis scenario is very similar to the [full load](../full_load/full_load.md), but it filters the data coming from the source, instead of doing a complete full load.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/full_load/with_filter/batch.json!}\n```\n\n##### Relevant notes:\n\n* As seen in the ACON, the filtering capabilities are provided by a transformer called `expression_filter`, where you can provide a custom Spark SQL filter."
  },
  {
    "path": "lakehouse_engine_usage/data_loader/filtered_full_load_with_selective_replace/__init__.py",
    "content": "\"\"\"\n.. include::filtered_full_load_with_selective_replace.md\n\"\"\""
  },
  {
    "path": "lakehouse_engine_usage/data_loader/filtered_full_load_with_selective_replace/filtered_full_load_with_selective_replace.md",
    "content": "# Filtered Full Load with Selective Replace\n\nThis scenario is very similar to the [Filtered Full Load](../filtered_full_load/filtered_full_load.md), but we only replace a subset of the partitions, leaving the other ones untouched, so we don't replace the entire table. This capability is very useful for backfilling scenarios.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/full_load/with_filter_partition_overwrite/batch.json!}\n```\n\n##### Relevant notes:\n\n* The key option for this scenario in the ACON is the `replaceWhere`, which we use to only overwrite a specific period of time, that realistically can match a subset of all the partitions of the table. Therefore, this capability is very useful for backfilling scenarios."
  },
  {
    "path": "lakehouse_engine_usage/data_loader/flatten_schema_and_explode_columns/__init__.py",
    "content": "\"\"\"\n.. include::flatten_schema_and_explode_columns.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/flatten_schema_and_explode_columns/flatten_schema_and_explode_columns.md",
    "content": "# Flatten Schema and Explode Columns\n\nRelated with schema, we can make two kind of operations:\n\n* **Flatten Schema**: transformer named \"flatten_schema\" used to flatten the schema of dataframe.\n    * Parameters to be defined:\n        * max_level: 2 => this sets the level until you want to flatten the schema.\n        * shorten_names: True => this flag is when you want to shorten the name of the prefixes of the fields.\n        * alias: True => this flag is used when you want to define a prefix for the column to be flattened.\n        * num_chars: 7 => this sets the number of characters to consider when shortening the names of the fields.\n        * ignore_cols: True => this list value should be set to specify the columns you don't want to flatten.\n\n\n* **Explode Columns**: transformer named \"explode_columns\" used to explode columns with types ArrayType and MapType. \n    * Parameters to be defined:\n        * explode_arrays: True => this flag should be set to true to explode all array columns present in the dataframe.\n        * array_cols_to_explode: [\"sample_col\"] => this list value should be set when to specify the array columns desired to explode.\n        * explode_maps: True => this flag should be set to true to explode all map columns present in the dataframe.\n        * map_cols_to_explode: [\"map_col\"] => this list value should be set when to specify the map columns desired to explode.\n    * Recommendation: use array_cols_to_explode and map_cols_to_explode to specify the columns desired to explode and do not do it for all of them.\n\n\nThe below scenario of **flatten_schema** is transforming one or more columns and dividing the content nested in more columns, as desired. We defined the number of levels we want to flatten in the schema, regarding the nested values. In this case, we are just setting `max_level` of `2`.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/transformations/column_reshapers/flatten_schema/batch.json!}\n```\n\nThe scenario of **explode_arrays** is transforming the arrays columns in one or more rows, depending on the number of elements, so, it replicates the row for each array value. In this case we are using explode to all array columns, using `explode_arrays` as `true`.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/transformations/column_reshapers/explode_arrays/batch.json!}\n```\n\nThe scenario of **flatten_and_explode_arrays_and_maps** is using `flatten_schema` and `explode_columns` to have the desired output. In this case, the desired output is to flatten all schema and explode maps and arrays, even having an array inside a struct. Steps:\n\n    1. In this case, we have an array column inside a struct column, so first we need to use the `flatten_schema` transformer to extract the columns inside that struct;\n    2. Then, we are able to explode all the array columns desired and map columns, using `explode_columns` transformer.\n    3. 
To be able to have the map column in 2 columns, we use again the `flatten_schema` transformer.\n\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/batch.json!}\n```\n"
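\nFor illustration, a `transform_specs` entry combining both transformers could look like the following minimal sketch. The argument names follow the parameter list above; the spec ids and column names are illustrative and need to be adapted to your ACON.\n\n```python\ntransform_specs = [\n    {\n        \"spec_id\": \"flattened_and_exploded\",\n        \"input_id\": \"my_source\",\n        \"transformers\": [\n            # Flatten nested structs up to two levels, shortening the field name prefixes.\n            {\n                \"function\": \"flatten_schema\",\n                \"args\": {\"max_level\": 2, \"shorten_names\": True, \"num_chars\": 7},\n            },\n            # Explode only the chosen array and map columns.\n            {\n                \"function\": \"explode_columns\",\n                \"args\": {\n                    \"array_cols_to_explode\": [\"sample_col\"],\n                    \"map_cols_to_explode\": [\"map_col\"],\n                },\n            },\n        ],\n    },\n]\n```\n"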
  },
  {
    "path": "lakehouse_engine_usage/data_loader/full_load/__init__.py",
    "content": "\"\"\"\n.. include::full_load.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/full_load/full_load.md",
    "content": "# Full Load\n\nThis scenario reads CSV data from a path and writes in full to another path with delta lake files.\n\n##### Relevant notes\n\n- This ACON infers the schema automatically through the option `inferSchema` (we use it for local tests only). This is usually not a best practice using CSV files, and you should provide a schema through the InputSpec variables `schema_path`, `read_schema_from_table` or `schema`.\n- The `transform_specs` in this case are purely optional, and we basically use the repartition transformer to create one partition per combination of date and customer. This does not mean you have to use this in your algorithm.\n- A full load is also adequate for an init load (initial load).\n\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/full_load/full_overwrite/batch.json!}\n```"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/read_from_dataframe/__init__.py",
    "content": "\"\"\"\n.. include::read_from_dataframe.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/read_from_dataframe/read_from_dataframe.md",
    "content": "# Read from Dataframe\n\n!!! danger\n    Don't use this feature if the Lakehouse Engine already has a supported data format for your use case, as in that case it is preferred to use the dedicated data formats which are more extensively tested and predictable. Check the supported data formats [here](../../../reference/packages/core/definitions.md#packages.core.definitions.InputFormat).\n\nReading from a Spark DataFrame is very simple using our framework. You just need to define the input_specs as follows: \n\n```python\n{\n    \"input_spec\": {\n        \"spec_id\": \"my_df\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"dataframe\",\n        \"df_name\": df,\n    }\n}\n```\n\n!!! note \"**Why is it relevant?**\"\n    With this capability of reading a dataframe you can deal with sources that do not yet officially have a reader (e.g., REST api, XML files, etc.)."
  },
  {
    "path": "lakehouse_engine_usage/data_loader/read_from_sharepoint/__init__.py",
    "content": "\"\"\"\n.. include::read_from_sharepoint.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/read_from_sharepoint/read_from_sharepoint.md",
    "content": "# Read from Sharepoint\n\nThere may be scenarios where data products must ingest curated datasets that business teams maintain directly in Sharepoint, for example exports from external systems or manually maintained reference files.\n\nThe `SharepointReader` is a specialized reader module designed to load one or more files from a Sharepoint document library into the lakehouse. It abstracts away the complexity of accessing Sharepoint by:\n\n* Resolving the configured Sharepoint site, drive, and document path.\n* Downloading the target file or all files matching a configured pattern into a temporary local location.\n* Reading the downloaded file(s) into a Spark DataFrame using the configured format and options.\n* Optionally combining multiple files into a single DataFrame (for example, unioning all matching CSV files in a folder) and optionally archiving processed files back to Sharepoint (success and error folders).\n\n!!! note\n    📘 Tip: This reader integrates seamlessly into the lakehouse engine’s input step and can be triggered as part of the ACON-based pipeline, just like any other reader module.\n\n!!! warning\n    When reading from text-based formats such as CSV, complex data types (arrays, maps, structs) are not preserved in the source file. If your downstream tables expect these types, you must reconstruct them from string columns after ingestion (for example using `from_json` or explicit casts).\n\n\n### Usage Scenarios\n\nThe examples below show how to read data from Sharepoint, ranging from simple single-file reads to more advanced multi-file and large-file scenarios.\n\n1. [Configuration parameters](#1-configuration-parameters)\n2. [**Simple:** Read one file from Sharepoint](#2-simple-read-one-file-from-sharepoint)\n    1. [Minimal configuration](#i-minimal-configuration)\n    2. [With optional configurations](#ii-with-optional-configurations)\n3. [**Complex:** Read multiple files from Sharepoint](#3-complex-read-multiple-files-from-sharepoint)\n    1. [Read multiple files (standard size)](#i-read-multiple-files-standard-size)\n    2. [Read multiple large files with `chunk_size` and CSV options](#ii-read-multiple-large-files-with-chunk_size-and-csv-options)\n4. [Delimiter handling](#4-delimiter-handling)\n5. [Orchestrating multiple Sharepoint reads (loop pattern)](#5-orchestrating-multiple-sharepoint-reads-loop-pattern)\n\n\n## 1. 
Configuration parameters\n\n### The mandatory configuration parameters are:\n\n   - **client_id** (str): azure client ID application, available at the\n     Azure Portal -> Azure Active Directory.\n   - **tenant_id** (str): tenant ID associated with the Sharepoint site, available at the\n     Azure Portal -> Azure Active Directory.\n   - **site_name** (str): name of the Sharepoint site where the document library resides.\n     Sharepoint URL naming convention is: **https://your_company_name.Sharepoint.com/sites/site_name**\n   - **drive_name** (str): name of the document library where the file will be uploaded.\n     Sharepoint URL naming convention is: **https://your_company_name.Sharepoint.com/sites/site_name/drive_name**\n   - **file_name** (str): name of the file to be read from Sharepoint when\n     performing a **single-file** read.\n     - In multi-file scenarios, `file_pattern` is typically used instead\n       (see examples below).\n   - **secret** (str): client secret for authentication, available at the\n     Azure Portal -> Azure Active Directory.\n   - **local_path** (str): temporary local storage path (Volume) where files are\n     downloaded before being read.\n     - Ensure the **path ends with \"/\"**.\n     - The **specified sub-folder may be deleted during processing** (for example when\n       cleaning up temporary files); it does not perform a recursive delete on parent\n       directories.\n     - **Avoid using a critical sub-folder.**\n   - **api_version** (str): version of the Graph Sharepoint API to be used for operations.\n     This defaults to \"v1.0\".\n\n> 🔐 Authentication details (`client_id`, `secret`, etc.) should be handled\n> securely via lakehouse configuration or secret management tools, rather than\n> hard-coded in notebooks.\n\n### The optional parameters are:\n\n   - **folder_relative_path** (Optional[str]): relative folder path within the\n     document library where the file(s) are located (for example,\n     `\"incoming/daily_exports\"`).\n   - **chunk_size** (Optional[int]): size (in bytes) of the file chunks used when\n     downloading and archiving files.\n     **Default is `5 * 1024 * 1024` (5 MB).**\n     Useful when working with large files to avoid memory pressure.\n   - **local_options** (Optional[dict]): additional options for customizing the\n     **Spark read** from the temporary local file(s) (for example CSV options such as\n     `header`, `delimiter`, `encoding`, etc.). See the Spark CSV options link below.\n   - **conflict_behaviour** (Optional[str]): behavior to adopt when archiving files\n     and a file with the same name already exists in the target location\n     (for example, `\"replace\"`, `\"fail\"`).\n   - **file_pattern** (Optional[str]): pattern to match **multiple files** in\n     Sharepoint (for example, `\"export_*.csv\"`).\n     Used by the multi-file reader flow to download and union all matching files.\n   - **file_type** (Optional[str]): type of the files to be read from Sharepoint\n     (for example, `\"csv\"`). The reader uses this to decide which Spark data source\n     to use when reading from `local_path`.\n\n\n!!! 
note\n    For more details about the Sharepoint framework, refer to Microsoft's official documentation:\n\n    > 📖[ Microsoft Graph API - Sharepoint](https://learn.microsoft.com/en-us/graph/api/resources/sharepoint?view=graph-rest-1.0)\n\n    > 🛠️ [Graph Explorer Tool](https://developer.microsoft.com/en-us/graph/graph-explorer) -  this tool helps you explore available Sharepoint Graph API functionalities.\n\n    > 📑 [Spark CSV options](https://spark.apache.org/docs/3.5.3/sql-data-sources-csv.html)\n\n## 2. Simple: Read one file from Sharepoint\n\nThis section demonstrates both minimal configuration and extended configurations\nwhen using the Sharepoint Reader.\n\n### i. Minimal Configuration\n\nThis approach uses only the mandatory parameters needed to connect to Sharepoint\nand read a single CSV file into the lakehouse.\n\n**Note:** In this minimal configuration:\n\n- The file is read from the configured `drive_name` (optionally under `folder_relative_path`).\n- No explicit archiving or custom CSV options are configured; those are covered in later sections.\n\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"csv_read\",\n            \"data_format\": \"sharepoint\",\n            \"read_type\": \"batch\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"dummy_client_id\",\n                \"tenant_id\": \"dummy_tenant_id\",\n                \"secret\": \"dummy_secret\",\n                \"site_name\": \"dummy_site_name\",\n                \"drive_name\": \"dummy_drive_name\",\n                \"local_path\": \"/Volumes/my_volume/sharepoint_tmp/\",  # must end with \"/\"\n                \"folder_relative_path\": \"dummy_folder\",              # optional\n                \"file_name\": \"dummy_sales.csv\",\n                \"file_type\": \"csv\",\n            },\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_output\",\n            \"input_id\": \"csv_read\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"dummy_sales\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"s3://my_data_product_bucket/silver/dummy_sales/\"\n        },\n    ],\n}\n\nload_data(acon=acon)\n```\n\n### ii. 
With optional configurations\n\nFor more control over the read process, additional parameters can be specified on\ntop of the minimal configuration:\n\n> **archive_enabled (Optional):** Enables archiving of the processed file in\n> Sharepoint.\n>\n> * If `True`, the reader moves the file out of the input folder after the read.\n> * Successful reads go to the *success* subfolder; failures go to the *error*\n>   subfolder.\n\n> **archive_success_subfolder (Optional):** Name of the subfolder used to store\n> successfully processed files (default is `\"done\"`).\n> The folder is created under the same `folder_relative_path` and `drive_name`.\n\n> **archive_error_subfolder (Optional):** Name of the subfolder used to store\n> files that failed to be processed (default is `\"error\"`).\n\n> **local_options (Optional):** Additional options passed to Spark when reading\n> the downloaded CSV file(s) from `local_path` (for example `header`, `delimiter`,\n> `encoding`, etc.).\n> These options can be used in both **single-file** and **multi-file** read modes.\n\n>\n> * For available options, refer to:\n>   [Apache Spark CSV Options](https://spark.apache.org/docs/3.5.4/sql-data-sources-csv.html).\n\n> **chunk_size (Optional):** Size (in bytes) of the chunks used when\n> downloading files.\n>\n> * Default: `5 * 1024 * 1024` (5 MB).\n> * Smaller chunks are safer for very large files or memory-constrained clusters.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\n# Optional CSV options for the local read\nLOCAL_OPTIONS = {\n    \"header\": \"true\",\n    \"delimiter\": \";\",\n}\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"csv_read\",\n            \"data_format\": \"sharepoint\",\n            \"read_type\": \"batch\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"dummy_client_id\",\n                \"tenant_id\": \"dummy_tenant_id\",\n                \"secret\": \"dummy_secret\",\n                \"site_name\": \"dummy_site_name\",\n                \"drive_name\": \"dummy_drive_name\",\n                \"local_path\": \"/Volumes/my_volume/sharepoint_tmp/\",\n                \"folder_relative_path\": \"dummy_simple\",\n                \"file_name\": \"dummy_sales.csv\",\n                \"file_type\": \"csv\",\n                \"archive_enabled\": True,\n                \"archive_success_subfolder\": \"successful\",\n                \"archive_error_subfolder\": \"with_error\",\n                \"local_options\": LOCAL_OPTIONS,\n                \"chunk_size\": 5 * 1024 * 1024,\n            },\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_output\",\n            \"input_id\": \"csv_read\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"dummy_sales\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"s3://my_data_product_bucket/silver/dummy_sales/\"\n        },\n    ],\n}\n\nload_data(acon=acon)\n```\n\n## 3. Complex: Read multiple files from Sharepoint\n\nIn many cases, data in Sharepoint is split across multiple files within a folder or\nexported periodically.\nThe `SharepointReader` can automatically locate and read all matching files based\non a configured pattern, merging them into a single DataFrame.\n\n### i. 
Read multiple files (standard size)\n\nUse `file_pattern` to match and load multiple files within the same folder.\nThe reader downloads all matching files into the temporary local folder and\nperforms a union of their contents before returning the DataFrame.\n\n⚠️ **Schema consistency check:**\nAll matched files must share the same schema.\nIf a file with a different schema is encountered, the reader stops the ingestion,\nmoves that file to the configured *error archive* folder, and logs the event.\n\n> **file_pattern (Optional):** Glob-style pattern for matching files, such as `\"export_*.csv\"`.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"csv_read_multi\",\n            \"data_format\": \"sharepoint\",\n            \"read_type\": \"batch\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"dummy_client_id\",\n                \"tenant_id\": \"dummy_tenant_id\",\n                \"secret\": \"dummy_secret\",\n                \"site_name\": \"dummy_site_name\",\n                \"drive_name\": \"dummy_drive_name\",\n                \"local_path\": \"/Volumes/my_volume/sharepoint_tmp/\",\n                \"folder_relative_path\": \"dummy_sales/daily_exports\",\n                \"file_pattern\": \"export_*.csv\",\n                \"file_type\": \"csv\",\n            },\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_output\",\n            \"input_id\": \"csv_read_multi\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"dummy_sales_daily_exports\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"s3://my_data_product_bucket/silver/dummy_sales/\"\n        },\n    ],\n}\n\nload_data(acon=acon)\n```\n\n### ii. Read multiple large files with `chunk_size` and CSV options\n\nWhen reading multiple large CSV files, the reader can:\n\n- Download each file in chunks (to avoid memory pressure).\n- Apply custom CSV read options (delimiter, header, encoding, etc.) before unioning the data.\n\n> **chunk_size (Optional):**\n> Size (in bytes) of the chunks used when downloading and archiving files.\n> Default is `5 * 1024 * 1024` (5 MB). 
Increase this for very large files to reduce the number of download operations.\n\n> **local_options (Optional):**\n> Spark CSV options used when reading the downloaded files from `local_path`\n> (for example `header`, `delimiter`, `encoding`, `quote`, etc.).\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nLOCAL_OPTIONS = {\n    \"header\": \"true\",\n    \"delimiter\": \";\",\n    \"encoding\": \"utf-8\",\n}\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"csv_read_multi_large\",\n            \"data_format\": \"sharepoint\",\n            \"read_type\": \"batch\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"dummy_client_id\",\n                \"tenant_id\": \"dummy_tenant_id\",\n                \"secret\": \"dummy_secret\",\n                \"site_name\": \"dummy_site_name\",\n                \"drive_name\": \"dummy_drive_name\",\n                \"local_path\": \"/Volumes/my_volume/sharepoint_tmp/\",\n                \"folder_relative_path\": \"dummy_sales/big_daily_exports/\",\n                \"file_pattern\": \"big_export_*.csv\",\n                \"file_type\": \"csv\",\n                \"chunk_size\": 50 * 1024 * 1024,  # 50 MB per chunk\n                \"local_options\": LOCAL_OPTIONS,\n            },\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_output\",\n            \"input_id\": \"csv_read_multi_large\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"dummy_sales_daily_exports\",\n            \"write_type\": \"overwrite\",\n        },\n    ],\n}\n\nload_data(acon=acon)\n```\n\n## 4. Delimiter handling\n\nWhen reading CSV files (single-file or multi-file), the Sharepoint Reader:\n\n- Uses `sep` or `delimiter` from `local_options` as-is if provided\n  (no auto-detection in this case).\n- If no delimiter is provided, it:\n    - Tries to auto-detect one from `; , | \\t` using `csv.Sniffer`.\n    - Optionally compares the resulting column count with `expected_columns`\n      (if set) and logs a warning if they do not match.\n    - Falls back to comma (`,`) if detection fails.\n\nInternally, the final delimiter is always passed to Spark as `sep`\n(`delimiter` is mapped to `sep` and then removed).\n\n> 💡 Tip: You can use `local_options` (including `sep` / `delimiter`) in both\n> single-file and multi-file read modes. When in doubt, set `sep` explicitly.\n\n\n## 5. 
Orchestrating multiple Sharepoint reads (loop pattern)\n\nIf you need to read from multiple independent Sharepoint locations\n(different folders, drives, or file patterns), you can orchestrate a loop in your\nnotebook and call `load_data` once per configuration.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nsharepoint_sources = [\n    {\"folder_relative_path\": \"dummy_sales/big_daily_exports\", \"file_pattern\": \"big_export_*.csv\"},\n    {\"folder_relative_path\": \"dummy_sales/daily_exports\", \"file_pattern\": \"export_*.csv\"},\n]\n\nfor src in sharepoint_sources:\n    acon = {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"csv_read\",\n                \"data_format\": \"sharepoint\",\n                \"read_type\": \"batch\",\n                \"sharepoint_opts\": {\n                    \"client_id\": \"...\",\n                    \"tenant_id\": \"...\",\n                    \"secret\": \"...\",\n                    \"site_name\": \"...\",\n                    \"drive_name\": \"...\",\n                    \"local_path\": \"/Volumes/my_volume/sharepoint_tmp/\",\n                    \"folder_relative_path\": src[\"folder_relative_path\"],\n                    \"file_pattern\": src[\"file_pattern\"],\n                    \"file_type\": \"csv\",\n                },\n            },\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"output\",\n                \"input_id\": \"csv_read\",\n                \"data_format\": \"delta\",\n                \"db_table\": \"dummy_sales_daily_exports\",\n                \"write_type\": \"append\",\n            },\n        ],\n    }\n\n    load_data(acon=acon)\n```\n\n!!! warning \"Caution: excessive parallelism\"\n    - Running too many Sharepoint reads in parallel can trigger MS Graph API\n    throttling (for example 429 or 503 responses).\n    - Prefer a controlled level of parallelism when orchestrating multiple\n    pipelines or loops that read from Sharepoint.\n    - Monitor logs and retries to ensure stable performance, especially when\n    working with large files or many files at once.\n\nThe Lakehouse Engine framework uses retry logic with backoff to mitigate\nthrottling, but it cannot fully replace sensible limits on concurrency.\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/streaming_append_load_with_malformed/__init__.py",
    "content": "\"\"\"\n.. include::streaming_append_load_with_malformed.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/streaming_append_load_with_malformed/streaming_append_load_with_malformed.md",
    "content": "# Streaming Append Load with DROPMALFORMED\n\nThis scenario illustrates an append load done via streaming instead of batch, providing an efficient way of picking up new files from an S3 folder, instead of relying on the incremental filtering from the source needed from a batch based process (see append loads in batch from a JDBC source to understand the differences between streaming and batch append loads). However, not all sources (e.g., JDBC) allow streaming.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/append_load/streaming_dropmalformed/streaming.json!}\n```\n\n##### Relevant notes:\n\n* In this scenario, we use DROPMALFORMED read mode, which drops rows that do not comply with the provided schema;\n* In this scenario, the schema is provided through the `input_spec` \"schema\" variable. This removes the need of a separate JSON Spark schema file, which may be more convenient in certain cases.\n* As can be seen, we use the `output_spec` Spark option `checkpointLocation` to specify where to save the checkpoints indicating what we have already consumed from the input data. This allows fault-tolerance if the streaming job fails, but more importantly, it allows us to run a streaming job using [AvailableNow](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) and the next job automatically picks up the stream state since the last checkpoint, allowing us to do efficient append loads without having to manually specify incremental filters as we do for batch append loads."
  },
  {
    "path": "lakehouse_engine_usage/data_loader/streaming_append_load_with_terminator/__init__.py",
    "content": "\"\"\"\n.. include::streaming_append_load_with_terminator.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/streaming_append_load_with_terminator/streaming_append_load_with_terminator.md",
    "content": "# Streaming Append Load with Optimize Dataset Terminator\n\nThis scenario includes a terminator which optimizes a dataset (table), being able of vacuuming the table, optimising it with z-order or not, computing table statistics and more. You can find more details on the Terminator [here](../../../reference/packages/terminators/dataset_optimizer.md).\n\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/append_load/streaming_with_terminators/streaming.json!}\n```"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/streaming_delta_load_with_group_and_rank_condensation/__init__.py",
    "content": "\"\"\"\n.. include::streaming_delta_load_with_group_and_rank_condensation.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/streaming_delta_load_with_group_and_rank_condensation/streaming_delta_load_with_group_and_rank_condensation.md",
    "content": "# Streaming Delta Load with Group and Rank Condensation\n\nThis scenario is useful for when we want to do delta loads based on changelogs that need to be first condensed based on a group by and then a rank only, instead of the record mode logic in the record mode based change data capture.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/streaming_delta.json!}\n```\n\n##### Relevant notes:\n* This type of delta load with this type of condensation is useful when the source changelog can be condensed based on dates, instead of technical fields like `datapakid`, `record`, `record_mode`, etc., as we see in SAP BW DSOs.An example of such system is Omnihub Tibco orders and deliveries files."
  },
  {
    "path": "lakehouse_engine_usage/data_loader/streaming_delta_with_late_arriving_and_out_of_order_events/__init__.py",
    "content": "\"\"\"\n.. include::streaming_delta_with_late_arriving_and_out_of_order_events.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/streaming_delta_with_late_arriving_and_out_of_order_events/streaming_delta_with_late_arriving_and_out_of_order_events.md",
    "content": "# Streaming Delta Load with Late Arriving and Out of Order Events (with and without watermarking)\n\n## How to Deal with Late Arriving Data without using Watermark\n\nThis scenario covers a delta load in streaming mode that is able to deal with late arriving and out of order events.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\nExample of ACON configuration:\n```json\n{!../../../../tests/resources/feature/delta_load/record_mode_cdc/late_arriving_changes/streaming_delta.json!}\n```\n\n##### Relevant notes:\n\n* First question we can impose is: Do we need such complicated update predicate to handle late arriving and out of order events? Simple answer is no. Because we expect that the latest event (e.g., latest status of a record in the source) will eventually arrive, and therefore the target delta lake table will eventually be consistent. However, when will that happen? Do we want to have our target table inconsistent until the next update comes along? This of course is only true when your source cannot ensure the order of the changes and cannot avoid late arriving changes (e.g., some changes that should have come in this changelog extraction, will only arrive in the next changelog extraction). From previous experiences, this is not the case with SAP BW, for example (as SAP BW is ACID compliant, and it will extract data from an SAP source and only have the updated changelog available when the extraction goes through, so theoretically we should not be able to extract data from the SAP BW changelog while SAP BW is still extracting data). \n* However, when the source cannot fully ensure ordering (e.g., Kafka) and we want to make sure we don't load temporarily inconsistent data into the target table, we can pay extra special attention, as we do here, to our update and insert predicates, that will enable us to only insert or update data if the new event meets the respective predicates:\n    * In this scenario, we will only update if the `update_predicate` is true, and that long predicate we have here ensures that the change that we are receiving is likely the latest one;\n    * In this scenario, we will only insert the record if the record is not marked for deletion (this can happen if the new event is a record that is marked for deletion, but the record was not in the target table (late arriving changes where the delete came before the insert), and therefore, without the `insert_predicate`, the algorithm would still try to insert the row, even if the `record_mode` indicates that that row is for deletion. By using the `insert_predicate` above we avoid that to happen. However, even in such scenario, to prevent the algorithm to insert the data that comes later (which is old, as we said, the delete came before the insert and was actually the latest status), we would even need a more complex predicate based on your data's nature. Therefore, please read the disclaimer below.\n!!! note \"**Disclaimer**!\" The scenario illustrated in this page is purely fictional, designed for the Lakehouse Engine local tests specifically. Your data source changelogs may be different and the scenario and predicates discussed here may not make sense to you. 
Consequently, the data product team should reason about the adequate merge predicate and insert, update and delete predicates, that better reflect how they want to handle the delta loads for their data.\n* We use spark.sql.streaming.schemaInference in our local tests only. We don't encourage you to use it in your data product.\n\n\n!!! note \"**Documentation**\"\n    [Feature Deep Dive: Watermarking in Apache Spark Structured Streaming - The Databricks Blog](https://www.databricks.com/blog/feature-deep-dive-watermarking-apache-spark-structured-streaming)\n    \n    [Structured Streaming Programming Guide - Spark 3.4.0 Documentation](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)\n\n## How to Deal with Late Arriving Data using Watermark\n\nWhen building real-time pipelines, one of the realities that teams have to work with is that distributed data ingestion is inherently unordered. Additionally, in the context of stateful streaming operations, teams need to be able to properly track event time progress in the stream of data they are ingesting for the proper calculation of time-window aggregations and other stateful operations. While working with real-time streaming data there will be delays between event time and processing time due to how data is ingested and whether the overall application experiences issues like downtime. Due to these potential variable delays, the engine that you use to process this data needs to have some mechanism to decide when to close the aggregate windows and produce the aggregate result.\n\nImagine a scenario where we will need to perform stateful aggregations on the streaming data to understand and identify problems in the machines. **This is where we need to leverage Structured Streaming and Watermarking to produce the necessary stateful aggregations.**\n\n##### Approach 1 - Use a pre-defined fixed window (Bad)\n\n<img src=\"../../../assets/img/fixed_window.png?raw=true\" style=\"max-width: 800px; height: auto; \"/>\n\nCredits: [Image source](https://www.databricks.com/blog/feature-deep-dive-watermarking-apache-spark-structured-streaming)\n\nTo explain this visually let’s take a scenario where we are receiving data at various times from around 10:50 AM → 11:20 AM. We are creating 10-minute tumbling windows that calculate the average of the temperature and pressure readings that came in during the windowed period.\n\nIn this first picture, we have the tumbling windows trigger at 11:00 AM, 11:10 AM and 11:20 AM leading to the result tables shown at the respective times. When the second batch of data comes around 11:10 AM with data that has an event time of 10:53 AM this gets incorporated into the temperature and pressure averages calculated for the 11:00 AM → 11:10 AM window that closes at 11:10 AM, which does not give the correct result.\n\n##### Approach 2 - Watermark\n\nWe can define a **watermark** that will allow Spark to understand when to close the aggregate window and produce the correct aggregate result. In Structured Streaming applications, we can ensure that all relevant data for the aggregations we want to calculate is collected by using a feature called **watermarking**. 
In the most basic sense, by defining a **watermark** Spark Structured Streaming then knows when it has ingested all data up to some time, **T**, (based on a set lateness expectation) so that it can close and produce windowed aggregates up to timestamp **T**.\n\n<img src=\"../../../assets/img/watermarking.png?raw=true\" style=\"max-width: 800px; height: auto; \"/>\n\nCredits: [Image source](https://www.databricks.com/blog/feature-deep-dive-watermarking-apache-spark-structured-streaming)\n\nUnlike the first scenario where Spark will emit the windowed aggregation for the previous ten minutes every ten minutes (i.e. emit the 11:00 AM →11:10 AM window at 11:10 AM), Spark now waits to close and output the windowed aggregation once **the max event time seen minus the specified watermark is greater than the upper bound of the window**.\n\nIn other words, Spark needed to wait until it saw data points where the latest event time seen minus 10 minutes was greater than 11:00 AM to emit the 10:50 AM → 11:00 AM aggregate window. At 11:00 AM, it does not see this, so it only initialises the aggregate calculation in Spark’s internal state store. At 11:10 AM, this condition is still not met, but we have a new data point for 10:53 AM so the internal state gets updated, just **not emitted**. Then finally by 11:20 AM Spark has seen a data point with an event time of 11:15 AM and since 11:15 AM minus 10 minutes is 11:05 AM which is later than 11:00 AM the 10:50 AM → 11:00 AM window can be emitted to the result table.\n\nThis produces the correct result by properly incorporating the data based on the expected lateness defined by the watermark. Once the results are emitted the corresponding state is removed from the state store.\n\n###### Watermarking and Different Output Modes\n\nIt is important to understand how state, late-arriving records, and the different output modes could lead to different behaviours of your application running on Spark. The main takeaway here is that in both append and update modes, once the watermark indicates that all data is received for an aggregate time window, the engine can trim the window state. In append mode the aggregate is produced only at the closing of the time window plus the watermark delay while in update mode it is produced on every update to the window.\n\nLastly, by increasing your watermark delay window you will cause the pipeline to wait longer for data and potentially drop less data – higher precision, but also higher latency to produce the aggregates. On the flip side, smaller watermark delay leads to lower precision but also lower latency to produce the aggregates.\n\nWatermarks can only be used when you are running your streaming application in **append** or **update** output modes. There is a third output mode, complete mode, in which the entire result table is written to storage. This mode cannot be used because it requires all aggregate data to be preserved, and hence cannot use watermarking to drop intermediate state.\n\n###### Joins With Watermark\n\nThere are three types of stream-stream joins that can be implemented in Structured Streaming: **inner, outer, and semi joins**. The main problem with doing joins in streaming applications is that you may have an incomplete picture of one side of the join. 
Giving Spark an understanding of when there are no future matches to expect is similar to the earlier problem with aggregations, where Spark needed to understand when there were no new rows to incorporate into the calculation for the aggregation before emitting it.\n\nTo allow Spark to handle this, we can leverage a combination of watermarks and event-time constraints within the join condition of the stream-stream join. This combination allows Spark to filter out late records and trim the state for the join operation through a time range condition on the join.\n\nSpark has a policy for handling multiple watermark definitions. Spark maintains **one global watermark** that is based on the slowest stream to ensure the highest amount of safety when it comes to not missing data.\n\nWe can change this behaviour by changing *spark.sql.streaming.multipleWatermarkPolicy* to max; however, this means that data from the slower stream will be dropped.\n\n###### State Store Performance Considerations\n\nAs of Spark 3.2, Spark offers a RocksDB state store provider.\n\nIf you have stateful operations in your streaming query (for example, streaming aggregation, streaming dropDuplicates, stream-stream joins, mapGroupsWithState, or flatMapGroupsWithState) and you want to maintain millions of keys in the state, then you may face issues related to large JVM garbage collection (GC) pauses causing high variations in the micro-batch processing times. This occurs because, in the implementation of HDFSBackedStateStore, the state data is maintained in the JVM memory of the executors, and a large number of state objects puts memory pressure on the JVM, causing high GC pauses.\n\nIn such cases, you can choose to use a more optimized state management solution based on RocksDB. Rather than keeping the state in the JVM memory, this solution uses RocksDB to efficiently manage the state in the native memory and the local disk. Furthermore, any changes to this state are automatically saved by Structured Streaming to the checkpoint location you have provided, thus providing full fault-tolerance guarantees (the same as default state management).\n\nTo enable the new built-in state store implementation, set `spark.sql.streaming.stateStore.providerClass` to `org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider`.\n\nFor more details, please visit the Spark documentation: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#rocksdb-state-store-implementation\n\nYou can enable this in your acons by specifying it as part of the exec_env properties like below:\n\n```json\n\"exec_env\": {\n    \"spark.sql.streaming.stateStore.providerClass\":\"org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider\"\n}\n```
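\n\nAs a plain Structured Streaming illustration of the watermarking concept discussed above (independent of the Lakehouse Engine, with hypothetical paths and column names), a 10-minute tumbling window aggregation with a 10-minute watermark could be expressed as:\n\n```python\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import avg, col, window\n\nspark = SparkSession.builder.getOrCreate()\n\n# streaming source with event_time, temperature and pressure columns (hypothetical)\nreadings_df = spark.readStream.format(\"delta\").load(\"s3://my_data_product_bucket/bronze/sensor_readings/\")\n\nwindowed_averages = (\n    readings_df\n    # tolerate events arriving up to 10 minutes late before closing a window\n    .withWatermark(\"event_time\", \"10 minutes\")\n    .groupBy(window(col(\"event_time\"), \"10 minutes\"))\n    .agg(avg(\"temperature\").alias(\"avg_temperature\"), avg(\"pressure\").alias(\"avg_pressure\"))\n)\n```"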
  },
  {
    "path": "lakehouse_engine_usage/data_loader/write_and_read_dataframe/__init__.py",
    "content": "\"\"\"\n.. include::write_and_read_dataframe.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/write_and_read_dataframe/write_and_read_dataframe.md",
    "content": "# Write and Read Dataframe\n\nDataFrame writer can give us some advantages by returning a dictionary containing the `spec_id` and the computed dataframe.\nIn these examples we will cover the following scenarios of using the output `dataframe` format:\n\n1. [**Write to dataframe**: Consuming the output spec as DataFrame;](#1-write-to-dataframe-consuming-the-output-spec-as-dataframe)\n2. [**Write all dataframes**: Consuming all DataFrames generated per specs;](#2-write-all-dataframes-consuming-all-dataframes-generated-per-specs)\n3. [**Read from and Write to dataframe**: Making use of the DataFrame output spec to compose silver data.](#3-read-from-and-write-to-dataframe-making-use-of-the-dataframe-output-spec-to-compose-silver-data)\n\n#### Main advantages of using this output writer:\n\n- **Debugging purposes**: as we can access any dataframe used in any part of our ACON\n  we can observe what is happening with the computation and identify what might be wrong\n  or can be improved.\n- **Flexibility**: in case we have some very specific need not covered yet by the lakehouse\n  engine capabilities, example: return the Dataframe for further processing like using a machine\n  learning model/prediction.\n- **Simplify ACONs**: instead developing a single complex ACON, using the Dataframe writer,\n  we can compose our ACON from the output of another ACON. This allows us to identify\n  and split the notebook logic across ACONs.\n\nIf you want/need, you can add as many dataframes as you want in the output spec\nreferencing the spec_id you want to add.\n\n!!! warning\n    **This is not intended to replace the other capabilities offered by the lakehouse-engine** and in case **other feature can cover your use case**, you should **use it instead of using the Dataframe writer**, as they are much **more extensively tested on different type of operations**.\n  \n  *Additionally, please always introspect if the problem that you are trying to resolve and for which no lakehouse-engine feature is available, could be a common problem and thus deserve a common solution and feature.*\n  \n  Moreover, **Dataframe writer is not supported for the streaming trigger\n  types `processing time` and `continuous`.**\n\n## 1. Write to dataframe: Consuming the output spec as DataFrame\n\n### Silver Dummy Sales Write to DataFrame\n\nIn this example we will cover the Dummy Sales write to a result containing the output DataFrame.\n\n- An ACON is used to read from bronze, apply silver transformations and write to a dictionary\n  containing the output spec as key and the dataframe as value through the following steps:\n    - 1 - Definition of how to read data (input data location, read type and data format);\n    - 2 - Transformation of data (rename relevant columns);\n    - 3 - Write the data to dict containing the dataframe;\n\n!!! 
note\n    If you are trying to retrieve more than once the same data using checkpoint it will return an empty dataframe with empty schema as we don't have new data to read.\n\n\n```python\nfrom lakehouse_engine.engine import load_data\n\ncols_to_rename = {\"item\": \"ordered_item\", \"date\": \"order_date\", \"article\": \"article_id\"}\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_bronze\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_sales\",\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_transform\",\n            \"input_id\": \"dummy_sales_bronze\",\n            \"transformers\": [\n                {\n                    \"function\": \"rename\",\n                    \"args\": {\n                        \"cols\": cols_to_rename,\n                    },\n                },\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_silver\",\n            \"input_id\": \"dummy_sales_transform\",\n            \"data_format\": \"dataframe\",\n            \"options\": {\n                \"checkpointLocation\": \"s3://my_data_product_bucket/checkpoints/bronze/dummy_sales\",\n            },\n        }\n    ],\n}\n```\n\n### Run the Load and Return the Dictionary with the DataFrames by OutputSpec\n\nThis exploratory test will return a dictionary with the output spec and the dataframe\nthat will be stored after transformations.\n\n```python\noutput = load_data(acon=acon)\ndisplay(output.keys())\ndisplay(output.get(\"dummy_sales_silver\"))\n```\n\n## 2. Write all dataframes: Consuming all DataFrames generated per specs\n\n### Silver Dummy Sales Write to DataFrame\n\nIn this example we will cover the Dummy Sales write to a result containing the specs and related DataFrame.\n\n- An ACON is used to read from bronze, apply silver transformations and write to a dictionary\n  containing the spec id as key and the DataFrames as value through the following steps:\n    - Definition of how to read data (input data location, read type and data format);\n    - Transformation of data (rename relevant columns);\n    - Write the data to a dictionary containing all the spec ids and DataFrames computed per step;\n\n```python\nfrom lakehouse_engine.engine import load_data\n\ncols_to_rename = {\"item\": \"ordered_item\", \"date\": \"order_date\", \"article\": \"article_id\"}\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_bronze\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_sales\",\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_transform\",\n            \"input_id\": \"dummy_sales_bronze\",\n            \"transformers\": [\n                {\n                    \"function\": \"rename\",\n                    \"args\": {\n                        \"cols\": cols_to_rename,\n                    },\n                },\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sales_bronze\",\n            \"input_id\": \"dummy_sales_bronze\",\n            \"data_format\": \"dataframe\",\n        },\n        {\n            \"spec_id\": \"sales_silver\",\n            \"input_id\": \"dummy_sales_transform\",\n            \"data_format\": 
\"dataframe\",\n        },\n    ],\n}\n```\n\n### Run the Load and Return the Dictionary with the related DataFrames by Spec\n\nThis exploratory test will return a dictionary with all specs and the related dataframe.\nYou can access the DataFrame you need by `output.get(<spec_id>)` for future developments and tests.\n\n```python\noutput = load_data(acon=acon)\ndisplay(output.keys())\ndisplay(output.get(\"sales_bronze\"))\ndisplay(output.get(\"sales_silver\"))\n```\n\n## 3. Read from and Write to dataframe: Making use of the DataFrame output spec to compose silver data\n\n### Silver Load Dummy Deliveries\n\nIn this example we will cover the Dummy Deliveries table read and incremental load to silver composing the silver data to write using the DataFrame output spec:\n\n- First ACON is used to get the latest data from bronze, in this step we are using more than one output because we will need the bronze data with the latest data in the next step.\n- Second ACON is used to consume the bronze data and the latest data to perform silver transformation, in this ACON we are using as **input the two dataframes computed by the first ACON.**\n- Third ACON is used to write the silver computed data from the previous ACON to the target.\n\n!!! note\n    This example is not a recommendation on how to deal with incremental loads, the ACON was split in 3 for demo purposes.\n\nConsume bronze data, generate the latest data and return a dictionary with bronze and transformed dataframes:\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_bronze\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_sales\",\n        },\n        {\n            \"spec_id\": \"dummy_deliveries_silver_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"my_database.dummy_deliveries\",\n        },\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_table_max_value\",\n            \"input_id\": \"dummy_deliveries_silver_source\",\n            \"transformers\": [\n                {\n                    \"function\": \"get_max_value\",\n                    \"args\": {\"input_col\": \"delivery_date\", \"output_col\": \"latest\"},\n                },\n                {\n                    \"function\": \"with_expressions\",\n                    \"args\": {\n                        \"cols_and_exprs\": {\"latest\": \"CASE WHEN latest IS NULL THEN 0 ELSE latest END\"},\n                    },\n                },\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"deliveries_bronze\",\n            \"input_id\": \"dummy_deliveries_bronze\",\n            \"data_format\": \"dataframe\",\n        },\n        {\n            \"spec_id\": \"dummy_deliveries_transformed\",\n            \"input_id\": \"dummy_deliveries_table_max_value\",\n            \"data_format\": \"dataframe\",\n        },\n    ],\n}\n\ndummy_deliveries_transformed = load_data(acon=acon)\n\ndummy_deliveries_transformed_df = dummy_deliveries_transformed.get(\"dummy_deliveries_transformed\")\ndummy_deliveries_bronze_df = dummy_deliveries_transformed.get(\"deliveries_bronze\")\n```\n\nConsume previous dataframes generated by the first ACON (bronze and latest bronze data) to generate the silver data. 
In this acon we are only using **just one output** because we only need the dataframe from the output for the next step.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\ncols_to_rename = {\"delivery_note_header\": \"delivery_note\", \"article\": \"article_id\"}\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_bronze\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"dataframe\",\n            \"df_name\": dummy_deliveries_bronze_df,\n        },\n        {\n            \"spec_id\": \"dummy_deliveries_table_max_value\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"dataframe\",\n            \"df_name\": dummy_deliveries_transformed_df,\n        },\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_transform\",\n            \"input_id\": \"dummy_deliveries_bronze\",\n            \"transformers\": [\n                {\n                    \"function\": \"rename\",\n                    \"args\": {\n                        \"cols\": cols_to_rename,\n                    },\n                },\n                {\n                    \"function\": \"incremental_filter\",\n                    \"args\": {\n                        \"input_col\": \"delivery_date\",\n                        \"increment_df\": \"dummy_deliveries_table_max_value\",\n                        \"increment_col\": \"latest\",\n                        \"greater_or_equal\": False,\n                    },\n                },\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_silver\",\n            \"input_id\": \"dummy_deliveries_transform\",\n            \"data_format\": \"dataframe\",\n        }\n    ],\n}\n\ndummy_deliveries_silver = load_data(acon=acon)\ndummy_deliveries_silver_df = dummy_deliveries_silver.get(\"dummy_deliveries_silver\")\n```\n\nWrite the silver data generated by previous ACON into the target\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nwrite_silver_acon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_silver\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"dataframe\",\n            \"df_name\": dummy_deliveries_silver_df,\n        },\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_quality\",\n            \"input_id\": \"dummy_deliveries_silver\",\n            \"dq_type\": \"validator\",\n            \"bucket\": \"my_data_product_bucket\",\n            \"expectations_store_prefix\": \"dq/expectations/\",\n            \"validations_store_prefix\": \"dq/validations/\",\n            \"checkpoint_store_prefix\": \"dq/checkpoints/\",\n            \"result_sink_db_table\": \"my_database.dummy_deliveries_dq\",\n            \"result_sink_location\": \"my_data_product_bucket/dq/dummy_deliveries\",\n            \"fail_on_error\": False,\n            \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n            \"dq_functions\": [\n                {\n                    \"function\": \"expect_column_values_to_not_be_null\",\n                    \"args\": {\"column\": \"delivery_note\"},\n                },\n                {\n                    \"function\": \"expect_table_row_count_to_be_between\",\n                    \"args\": {\"min_value\": 19},\n                },\n                {\n                    \"function\": \"expect_column_max_to_be_between\",\n                    \"args\": 
{\"column\": \"delivery_item\", \"min_value\": 2},\n                },\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_silver\",\n            \"input_id\": \"dummy_deliveries_quality\",\n            \"write_type\": \"append\",\n            \"location\": \"s3://my_data_product_bucket/silver/dummy_deliveries_df_writer\",\n            \"data_format\": \"delta\",\n        }\n    ],\n    \"exec_env\": {\n        \"spark.databricks.delta.schema.autoMerge.enabled\": True,\n        \"spark.databricks.delta.optimizeWrite.enabled\": True,\n        \"spark.databricks.delta.autoCompact.enabled\": True,\n    },\n}\n\nload_data(acon=write_silver_acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/write_to_console/__init__.py",
    "content": "\"\"\"\n.. include::write_to_console.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/write_to_console/write_to_console.md",
    "content": "# Write to Console\n\nConsole writer is an interesting feature to debug / validate what have been done on lakehouse engine. Before moving forward and store data somewhere, it is possible to show / print the final dataframe to the console, which means it is possible to transform the data as many times as you want and display the final result to validate if it is as expected.\n\n## Silver Dummy Sales Write to Console Example\n\nIn this template we will cover the Dummy Sales write to console. An ACON is used to read from bronze, apply silver transformations and write on console through the following steps:\n\n1. Definition of how to read data (input data location, read type and data format);\n2. Transformation of data (rename relevant columns);\n3. Definition of how to print to console (limit, truncate, vertical options);\n\nFor this, the ACON specs are :\n\n- **input_specs** (MANDATORY): specify how to read data;\n- **transform specs** (OPTIONAL): specify how to transform data;\n- **output_specs** (MANDATORY): specify how to write data to the target.\n\n!!! note\n    Writer to console **is a wrapper for spark.show() function**, if you want to know more about the function itself or the available options, [please check the spark documentation here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.show.html).\n\n```python\nfrom lakehouse_engine.engine import load_data\n\ncols_to_rename = {\"item\": \"ordered_item\", \"date\": \"order_date\", \"article\": \"article_id\"}\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_bronze\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_sales\",\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_transform\",\n            \"input_id\": \"dummy_sales_bronze\",\n            \"transformers\": [\n                {\n                    \"function\": \"rename\",\n                    \"args\": {\n                        \"cols\": cols_to_rename,\n                    },\n                },\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_silver\",\n            \"input_id\": \"dummy_sales_transform\",\n            \"data_format\": \"console\",\n            \"options\": {\"limit\": 8, \"truncate\": False, \"vertical\": False},\n        }\n    ],\n}\n```\n\nAnd then, **Run the Load and Exit the Notebook**: This exploratory test will write to the console, which means the final\ndataframe will be displayed.\n\n```python\nload_data(acon=acon)\n```\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/write_to_rest_api/__init__.py",
    "content": "\"\"\"\n.. include::write_to_rest_api.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/write_to_rest_api/write_to_rest_api.md",
    "content": "# Write to REST API\n\nREST API writer is an interesting feature to send data from Spark to a REST API within the data pipeline context. It uses the Python requests library to execute the REST calls.\n\nIt is possible to configure a few aspects of the writer, like if the payload should be sent via JSON body or via file, or configure additional JSON body parameters to add to the payload generated via Spark.\n\nIn the current implementation of the writer, each row will generate a request to the API, so it is important that you prepare your dataframe accordingly (check example below).\n\n## Silver Dummy Sales Write to REST API Example\n\nIn this template we will cover the Dummy Sales write to a REST API. An ACON is used to read from bronze, apply silver transformations to prepare the REST api payload and write to the API through the following steps:\n\n1. Definition of how to read data (input data location, read type and data format);\n2. Transformation of the data so that we form a payload column per each row.\n    **Important Note:** In the current implementation of the writer, each row will generate a request to the API, so `create_payload` is a lakehouse engine custom transformer function that creates a JSON string with the **payload** to be sent to the API. The column name should be exactly **\"payload\"**, so that the lakehouse engine further processes that column accordingly, in order to correctly write the data to the REST API.\n3. Definition of how to write to a REST api (url, authentication, payload format configuration, ...);\n\nFor this, the ACON specs are :\n\n- **input_specs** (MANDATORY): specify how to read data;\n- **transform specs** (MANDATORY): specify how to transform data to prepare the payload;\n- **output_specs** (MANDATORY): specify how to write data to the target.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\ndef create_payload(df: DataFrame) -> DataFrame:\n    payload_df = payload_df.withColumn(\n        \"payload\",\n        lit('{\"just a dummy key\": \"just a dummy value\"}')\n    )\n\n    return payload_df\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_bronze\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_sales\",\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"dummy_sales_transform\",\n            \"input_id\": \"dummy_sales_bronze\",\n            \"transformers\": [\n                {\n                    \"function\": \"custom_transformation\",\n                    \"args\": {\n                        \"custom_transformer\": create_payload,\n                    },\n                }\n            ],\n        },\n    ],\n    \"output_specs\": [\n        { \n            \"spec_id\": \"data_to_send_to_api\",\n            \"input_id\": \"dummy_sales_transform\",\n            \"data_format\": \"rest_api\",\n            \"options\": {\n                \"rest_api_url\": \"https://foo.bar.com\",\n                \"rest_api_method\": \"post\",\n                \"rest_api_basic_auth\": {\n                    \"username\": \"...\",\n                    \"password\": \"...\",\n                },\n                \"rest_api_is_file_payload\": False, # True if payload is to be sent via JSON file instead of JSON body (application/json)\n                \"rest_api_file_payload_name\": \"custom_file\", # this is the name of the file to be sent in cases where 
the payload uses file uploads rather than JSON body.\n                \"rest_api_extra_json_payload\": {\"x\": \"y\"}\n            }\n        }\n    ],\n}\n\nload_data(acon=acon)\n```\n"
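\nThe `create_payload` function above sends a fixed dummy payload. In practice the payload is usually built from the row's own columns. A minimal sketch of such a transformer is shown below (the columns `salesorder` and `amount` are hypothetical, just for illustration):\n\n```python\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import struct, to_json\n\n\ndef create_payload(df: DataFrame) -> DataFrame:\n    # Serialize the selected columns of each row into a JSON string in the mandatory \"payload\" column.\n    return df.withColumn(\"payload\", to_json(struct(\"salesorder\", \"amount\")))\n```\n"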
  },
  {
    "path": "lakehouse_engine_usage/data_loader/write_to_sharepoint/__init__.py",
    "content": "\"\"\"\n.. include::write_to_sharepoint.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_loader/write_to_sharepoint/write_to_sharepoint.md",
    "content": "# Write to Sharepoint\n\nThere may be scenarios where data products must deliver curated datasets to external platforms like Sharepoint,\noften to serve business users or reporting tools outside the lakehouse environment.\n\nThe SharePointWriter is a specialized writer module designed to export a single file from the lakehouse to a Sharepoint document library.\nIt handles the complexities of the export by:\n\n* Writing the dataset to a temporary local file.\n* Uploading that file to the configured Sharepoint location using authenticated APIs.\n* Since it is scoped to handle only a single file per execution, any logic for splitting or generating multiple files must be implemented within your notebook prior to invoking the writer.\n\n!!! note\n    📘 Tip: This writer integrates seamlessly into the lakehouse engine's output step and can be triggered as part of the ACON-based pipeline, just like any other writer module.\n\n!!! warning\n    **CSV files do not support complex data types such as array, map, or struct.**\n    If these fields exist in the dataset, they must be converted to string (e.g., via to_json(), cast, or similar) before using the Sharepoint Writer, as **these types will cause the export to fail.**\n\n### Usage Scenarios\n\nThe examples below show how to write data to Sharepoint, ranging from simple single-DataFrame writes to more complex multi-DataFrame workflows.\n\n1. [Configuration parameters](#1-configuration-parameters)\n2. [**Simple:** Write one Dataframe to Sharepoint](#2-simple-write-one-dataframe-to-sharepoint)\n    1. [Minimal configuration](#i-minimal-configuration)\n    2. [With optional configurations](#ii-with-optional-configurations)\n3. [**Complex:** Write multiple Dataframes to Sharepoint](#3-complex-write-multiple-dataframes-to-sharepoint)\n    1. [Example: Partitioning function](#i-example-partitioning-function)\n    2. [Example: Detect Unsupported Column Types](#ii-detect-unsupported-columns-types)\n    2. [Without parallelism (sequential processing)](#iii-without-parallelism-sequential-processing)\n    3. [With parallelism (optimized for efficiency)](#iv-complex---with-parallelism-optimized-for-efficiency)\n\n## 1. 
### The mandatory configuration parameters are:\n\n   - **client_id** (str): Azure client ID of the application, available at the\n     Azure Portal -> Azure Active Directory.\n   - **tenant_id** (str): tenant ID associated with the Sharepoint site, available at the\n     Azure Portal -> Azure Active Directory.\n   - **site_name** (str): name of the Sharepoint site where the document library resides.\n     Sharepoint URL naming convention is: **https://your_company_name.sharepoint.com/sites/site_name**\n   - **drive_name** (str): name of the document library where the file will be uploaded.\n     Sharepoint URL naming convention is: **https://your_company_name.sharepoint.com/sites/site_name/drive_name**\n   - **file_name** (str): name of the file to be written to the local path and uploaded to Sharepoint.\n   - **secret** (str): client secret for authentication, available at the\n     Azure Portal -> Azure Active Directory.\n   - **local_path** (str): temporary local storage path for the file before uploading.\n     - Ensure the **path ends with \"/\"**.\n     - Note: The **specified sub-folder is deleted during the process**; it does not perform a recursive\n     delete on parent directories.\n     - **Avoid using a critical sub-folder.**\n   - **api_version** (str): version of the Graph Sharepoint API to be used for operations.\n     This defaults to \"v1.0\".\n\n### The optional parameters are:\n\n   - **folder_relative_path** (Optional[str]): relative folder path within the document\n       library to upload the file.\n   - **chunk_size** (Optional[int]): Optional; size (in bytes) of the file chunks for\n       uploading to Sharepoint. **Default is 100 MB.**\n   - **local_options** (Optional[dict]): Optional; additional options for customizing the\n       write-to-CSV action to the local path. You can check the available options\n       below.\n   - **conflict_behaviour** (Optional[str]): Optional; behavior to adopt in case\n       of a conflict (e.g., 'replace', 'fail').\n\n!!! note\n    For more details about the Sharepoint framework, refer to Microsoft's official documentation:\n\n    > 📖[ Microsoft Graph API - Sharepoint](https://learn.microsoft.com/en-us/graph/api/resources/sharepoint?view=graph-rest-1.0)\n\n    > 🛠️ [Graph Explorer Tool](https://developer.microsoft.com/en-us/graph/graph-explorer) -  this tool helps you explore available Sharepoint Graph API functionalities.\n\n    > 📑 [Spark CSV options](https://spark.apache.org/docs/3.5.3/sql-data-sources-csv.html)\n\n## 2. Simple: Write one Dataframe to Sharepoint\n\nThis section demonstrates both minimal configuration and extended configurations\nwhen using the Sharepoint Writer.\n\n### i. Minimal Configuration\n\nThis approach uses only the mandatory parameters, making it the quickest way to write a DataFrame to Sharepoint.\n\n**Note:** With the minimal configuration, not even the header is written to the file.\n
Furthermore, the file is\nwritten on the Sharepoint Drive root folder.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"dummy_sales\",\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_output\",\n            \"input_id\": \"dummy_input\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"dummy_client_id\",\n                \"tenant_id\": \"dummy_tenant_id\",\n                \"secret\": \"dummy_secret\",\n                \"site_name\": \"dummy_site_name\",\n                \"drive_name\": \"dummy_drive_name\",\n                \"local_path\": \"s3://my_data_product_bucket/silver/dummy_sales/\",  # this path must end with an \"/\"\n                \"file_name\": \"dummy_sales\",\n            },\n        },\n    ],\n}\n\nload_data(acon=acon)\n```\n\n### ii. With Optional Configurations\n\nFor more control over the upload process, additional parameters can be specified:\n\n>**folder_relative_path (Optional):** Defines the subfolder inside the Sharepoint drive\nwhere the file should be stored.\n>\n> ‼️ **Important:** The drive within the site acts as the root.\n>\n> **Example:**\n>\n>   * Site Name: \"dummy_sharepoint\"\n>   * Drive Name: \"dummy_drive\"\n>   * Folder Path: \"dummy/test/\"\n>   * File Name: \"test.csv\"\n>   * Final Destination: \"dummy_sharepoint/dummy_drive/dummy/test/test.csv\"\n\n> **chunk_size (Optional):** Defines the file chunk size (in bytes) for uploading.\n>\n> * Default: 100 MB (Recommended unless handling large files).\n> * Larger chunk sizes can improve performance but may increase memory usage.\n\n> **local_options (Optional):** Additional options for writing the DataFrame to a CSV file before upload.\n>\n> * For available options, refer to: [Apache Spark CSV Options](https://spark.apache.org/docs/3.5.4/sql-data-sources-csv.html).\n\n> **conflict_behaviour (Optional):** Determines the action taken if a file with the same name already exists.\n>\n> * Possible values: \"replace\", \"fail\", \"rename\", etc.\n> * Refer to Microsoft’s documentation: [Drive Item Conflict Behavior](https://learn.microsoft.com/en-us/dynamics365/business-central/application/system-application/enum/system.integration.graph.graph-conflictbehavior).\n\n```python\nfrom lakehouse_engine.engine import load_data\n\n# Set the optional parameters\nLOCAL_OPTIONS = {\"mode\": \"overwrite\", \"header\": \"true\"}\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"dummy_sales\",\n        },\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"dummy_transform\",\n            \"input_id\": \"dummy_input\",\n            \"transformers\": [\n                {\n                    \"function\": \"add_current_date\",\n                    \"args\": {\"output_col\": \"extraction_timestamp\"},\n                },  # Add a new column with the current date if needed\n                {\n                    \"function\": \"expression_filter\",\n                    \"args\": {\"exp\": \"customer = 'customer 1'\"},\n                },  # Filter the data if needed\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n    
        \"spec_id\": \"dummy_output\",\n            \"input_id\": \"dummy_transform\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"dummy_client_id\",\n                \"tenant_id\": \"dummy_tenant_id\",\n                \"secret\": \"dummy_secret\",\n                \"site_name\": \"dummy_site_name\",\n                \"drive_name\": \"dummy_drive_name\",\n                \"local_path\": \"s3://my_data_product_bucket/silver/dummy_sales/\",  # this path must end with an \"/\"\n                \"file_name\": \"dummy_sales\",\n                \"folder_relative_path\": \"dummy_simple\",  # writes file in the folder ./dummy_simple\n                \"local_options\": LOCAL_OPTIONS,\n                \"chunk_size\": 300 * 1024 * 1024,  # 300 MB\n            },\n        },\n    ],\n}\n\nload_data(acon=acon)\n```\n\n## 3. Complex: Write multiple Dataframes to Sharepoint\n\nThis scenario illustrates how to write multiple files to Sharepoint within a loop.\nSome use cases may require uploading files categorized by season, customer type, product category, etc.,\ndepending on the business needs.\n\nPartitioning the data ensures better organization and optimized file management in Sharepoint.\n\n!!!warning\n    ‼️ **Caution: Excessive Parallelism!**\n\n    * Too many simultaneous uploads can trigger Graph API throttling, leading to 503 (Service Unavailable) errors.\n    * Use a controlled level of parallelism (limit concurrent uploads) **if necessary**.\n        * [Coalesce](https://spark.apache.org/docs/3.5.3/sql-performance-tuning.html#coalesce-hints-for-sql-queries) allows you to control Spark's parallelism.\n    * **As the size of the files increases so does this concern,** so it’s important to test and monitor upload\n    processes to avoid service disruptions and ensure smooth performance.\n\n**Neverthless, a stress test with over 50 partition files with > 4GB each** was performed and parallelism\nissues were not detected.\nThe Lakehouse Engine Framework uses a **exponential backoff retry logic to avoid throttling** issues.\n\n### i. Example: Partitioning function\n\nThis function is a mere example on how to fetch the distinct of a column from a given table.\\\nIt is not part of the lakehouse_engine framework.\n\n```python\ndef get_partitions(\npartition: str, bucket: Optional[str] = None, table: Optional[str] = None, filter_expression: Optional[str] = None\n) -> List[Dict[str, str]]:\n\"\"\"Fetch distinct values from a given partition column in a table or bucket.\n\n    Parameters\n    ----------\n    partition : str\n        The name of the partition column.\n    bucket : Optional[str], default=None\n        The path to the S3 bucket (if applicable).\n    table : Optional[str], default=None\n        The name of the table (if applicable).\n    filter_expression : Optional[str], default=None\n        A filter condition to apply.\n\n    Returns\n    -------\n    List[Dict[str, str]]\n        A list of dictionaries with unique partition values.\n    \"\"\"\n    if not bucket and not table:\n        raise ValueError(\"Either 'bucket' or 'table' must be provided\")\n\n    df = spark.read.format(\"delta\").load(bucket) if bucket else spark.table(table)\n\n    partitions = df.select(partition).distinct()\n\n    if filter_expression:\n        partitions = partitions.filter(filter_expression)\n\n    return [{partition: row[partition]} for row in partitions.collect()]\n```\n\n### ii. 
\n### ii. Detect unsupported column types\n\nThis function exemplifies how to detect unsupported .csv column types.\nIt is not part of the lakehouse_engine framework.\n\n```python\nfrom typing import Dict\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.types import ArrayType, MapType, StructType\n\n\ndef detect_array_or_struct_fields(df: DataFrame) -> Dict[str, str]:\n    \"\"\"Detect fields in a DataFrame that are arrays, structs, or maps.\n\n    Args:\n        df (DataFrame): The input DataFrame.\n\n    Returns:\n        Dict[str, str]: A dictionary mapping the names of array, struct or map fields\n            to the target type (\"StringType\") they should be cast to.\n    \"\"\"\n    field_types = {}\n    type_mapping = {ArrayType: \"StringType\", StructType: \"StringType\", MapType: \"StringType\"}\n\n    for field in df.schema.fields:\n        for data_type, type_name in type_mapping.items():\n            if isinstance(field.dataType, data_type):\n                field_types[field.name] = type_name\n                break\n    return field_types\n```\n\n### iii. Without parallelism (sequential processing)\n\n```python\nfrom lakehouse_engine.engine import load_data\n\n# Set the optional parameters\nLOCAL_OPTIONS = {\"mode\": \"overwrite\", \"header\": \"true\"}\n\n# Set the partition column\nPARTITION = \"customer\"\n\n# Fetch distinct values from the partition column\npartitions = get_partitions(partition=PARTITION, table=\"dummy_sales\")\n\n# Sort the distinct values to ensure the correct order of the files\n# Note:\n#   - If an error occurs during the process, by sorting beforehand, you guarantee the correct order of the files.\n#   - It may come in handy if you want to restart the process (starting on a given file).\npartitions.sort(key=lambda x: x[\"customer\"])\n\nfor partition in partitions:\n    acon = {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"dummy_input\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"delta\",\n                \"db_table\": \"dummy_sales\",\n            },\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"dummy_transform\",\n                \"input_id\": \"dummy_input\",\n                \"transformers\": [\n                    {\"function\": \"add_current_date\", \"args\": {\"output_col\": \"extraction_timestamp\"}},\n                    {\"function\": \"expression_filter\", \"args\": {\"exp\": f\"customer = '{partition['customer']}'\"}},\n                    {\n                        \"function\": \"coalesce\",\n                        \"args\": {\"num_partitions\": 1},\n                    },  # Enforce that only 1 file is written - eliminating the parallelism\n                ],\n            },\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"dummy_output\",\n                \"input_id\": \"dummy_transform\",\n                \"data_format\": \"sharepoint\",\n                \"sharepoint_opts\": {\n                    \"client_id\": \"dummy_client_id\",\n                    \"tenant_id\": \"dummy_tenant_id\",\n                    \"secret\": \"dummy_secret\",\n                    \"site_name\": \"dummy_site_name\",\n                    \"drive_name\": \"dummy_drive_name\",\n                    \"local_path\": \"s3://my_data_product_bucket/silver/dummy_sales/\",  # this path must end with an \"/\"\n                    \"folder_relative_path\": \"dummy_complex/wo_parallelism\",\n                    \"file_name\": f\"dummy_sales_{partition['customer']}\",\n                    \"local_options\": LOCAL_OPTIONS,\n                    \"chunk_size\": 200 * 1024 * 1024,  # 200 MB\n
                },\n            },\n        ],\n    }\n\n    load_data(acon=acon)\n```\n\n### iv. Complex - With parallelism (optimized for efficiency)\n\n```python\nfrom lakehouse_engine.engine import load_data\n\n# Set the optional parameters\nLOCAL_OPTIONS = {\"mode\": \"overwrite\", \"header\": \"true\"}\n\n# Set the partition column\nPARTITION = \"customer\"\n\n# Fetch distinct values from the partition column\npartitions = get_partitions(partition=PARTITION, table=\"dummy_sales\")\n\n# Detect array, struct or map fields which cannot be written to .csv files\ncolumns_to_cast = detect_array_or_struct_fields(spark.table(\"dummy_sales\"))\n\n# Sort the distinct values to ensure the correct order of the files\n# Note:\n#   - If an error occurs during the process, by sorting beforehand, you guarantee the correct order of the files.\n#   - It may come in handy if you want to restart the process (starting on a given file).\npartitions.sort(key=lambda x: x[\"customer\"])\n\nfor partition in partitions:\n    acon = {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"dummy_input\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"delta\",\n                \"db_table\": \"dummy_sales\",\n            },\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"dummy_transform\",\n                \"input_id\": \"dummy_input\",\n                \"transformers\": [\n                    {\"function\": \"add_current_date\", \"args\": {\"output_col\": \"extraction_timestamp\"}},\n                    {\"function\": \"expression_filter\", \"args\": {\"exp\": f\"customer = '{partition['customer']}'\"}},\n                    # Coalesce removed, guaranteeing maximum parallelism\n                    {\"function\": \"cast\", \"args\": {\"cols\": columns_to_cast}}, # Cast unsupported column types\n                ],\n            },\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"dummy_output\",\n                \"input_id\": \"dummy_transform\",\n                \"data_format\": \"sharepoint\",\n                \"sharepoint_opts\": {\n                    \"client_id\": \"dummy_client_id\",\n                    \"tenant_id\": \"dummy_tenant_id\",\n                    \"secret\": \"dummy_secret\",\n                    \"site_name\": \"dummy_site_name\",\n                    \"drive_name\": \"dummy_drive_name\",\n                    \"local_path\": \"s3://my_data_product_bucket/silver/dummy_sales/\",  # this path must end with an \"/\"\n                    \"folder_relative_path\": \"dummy_complex/with_parallelism\",\n                    \"file_name\": f\"dummy_sales_{partition['customer']}\",\n                    \"local_options\": LOCAL_OPTIONS,\n                    \"chunk_size\": 200 * 1024 * 1024,  # 200 MB\n                },\n            },\n        ],\n    }\n\n    load_data(acon=acon)\n```\n\n### Relevant Notes\n\n- Multi-file export is not supported. For such use cases, loop through files manually and invoke SharePointWriter per file.\n- Authentication details should be handled securely via lakehouse configuration or secret management tools.
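\n\nAs a minimal sketch of the second note (assuming a Databricks environment and a hypothetical secret scope and key), the client secret can be fetched at runtime instead of being hardcoded in the ACON:\n\n```python\n# Hypothetical Databricks secret scope and key names.\nsharepoint_secret = dbutils.secrets.get(scope=\"my_secret_scope\", key=\"sharepoint_client_secret\")\n\n# Reference the fetched value in the ACON instead of a hardcoded string.\nacon[\"output_specs\"][0][\"sharepoint_opts\"][\"secret\"] = sharepoint_secret\n```\n"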
  },
  {
    "path": "lakehouse_engine_usage/data_quality/__init__.py",
    "content": "\"\"\"\n.. include::data_quality.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/custom_expectations/__init__.py",
    "content": "\"\"\"\n.. include::custom_expectations.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/custom_expectations/custom_expectations.md",
    "content": "# Custom Expectations\n\n## Defining Custom Expectations\n\nCustom expectations are defined in python and need to follow a structure to correctly integrate with Great Expectations.\n\nFollow the [documentation of GX on Creating Custom Expectations](https://docs.greatexpectations.io/docs/oss/guides/expectations/custom_expectations_lp/) \nand find information about [the existing types of expectations](https://docs.greatexpectations.io/docs/conceptual_guides/expectation_classes). \n\nHere is an example of custom expectation.\nAs for other cases, the acon configuration should be executed with `load_data` using:\n```python\nfrom lakehouse_engine.engine import load_data\nacon = {...}\nload_data(acon=acon)\n```\n\nExample of ACON configuration:\n\n```python \n{!../../../../lakehouse_engine/dq_processors/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b.py!}\n```\n\n### Naming Conventions\nYour expectation's name **should** start with expect.\n\nThe name of the file **must** be the name of the expectation written in snake case. Ex: `expect_column_length_match_input_length`\n\nThe name of the class **must** be the name of the expectation written in camel case. Ex: `ExpectColumnLengthMatchInputLength`\n\n### File Structure\nThe file contains two main sections:\n\n- the definition of the metric that we are tracking (where we define the logic of the expectation);\n- the definition of the expectation\n\n### Metric Definition\nIn this section we define the logic of the expectation. This needs to follow a certain structure:\n\n#### Code Structure\n1) The class you define needs to extend one of the Metric Providers defined by Great Expectations that corresponds \nto your expectation's type. More info on the [metric providers](https://docs.greatexpectations.io/docs/conceptual_guides/metricproviders). \n\n2) You need to define the name of your metric. This name **must** be unique and **must** follow the following structure: \ntype of expectation.name of metric. 
Ex.: `column_pair_values.a_smaller_or_equal_than_b`\n**Types of expectations:**  `column_values`, `multicolumn_values`, `column_pair_values`, `table_rows`, `table_columns`.\n\n3) Any [GX default parameters](#parameters) that are necessary to calculate your metric **must** be defined as \"condition_domain_keys\".\n\n4) Any [additional parameters](#parameters) that are necessary to calculate your metric **must** be defined as \"condition_value_keys\".\n\n5) The logic of your expectation **must** be defined for the SparkDFExecutionEngine in order to be run on the Lakehouse.\n\n```python\n1) class ColumnMapMetric(ColumnMapMetricProvider):\n    \"\"\"Asserts that a column matches a pattern.\"\"\"\n \n    2) condition_metric_name = \"column_pair_values.a_smaller_or_equal_than_b\"\n    3) condition_domain_keys = (\n        \"batch_id\",\n        \"table\",\n        \"column_A\",\n        \"column_B\",\n        \"ignore_row_if\",\n    )\n    4) condition_value_keys = (\"margin\",)\n     \n    5) @column_pair_condition_partial(engine=SparkDFExecutionEngine)\n    def _spark(\n        self: ColumnPairMapMetricProvider,\n        column_A: Any,\n        column_B: Any,\n        margin: Any,\n        **kwargs: dict,\n    ) -> Any:\n        \"\"\"Implementation of the expectation's logic.\n \n        Args:\n            column_A: Value of the row of column_A.\n            column_B: Value of the row of column_B.\n            margin: margin value to be added to column_b.\n            kwargs: dict with additional parameters.\n \n        Returns:\n            If the condition is met.\n        \"\"\"\n        if margin is None:\n            approx = 0\n        elif not isinstance(margin, (int, float, complex)):\n            raise TypeError(\n                f\"margin must be one of int, float, complex.\"\n                f\" Found: {margin} as {type(margin)}\"\n            )\n        else:\n            approx = margin  # type: ignore\n \n        return column_A <= column_B + approx  # type: ignore\n```\n\n### Expectation Definition\nIn this section we define the expectation. This needs to follow a certain structure:\n\n#### Code Structure\n1) The class you define needs to extend one of the Expectations defined by Great Expectations that corresponds to your expectation's type. \n\n2) You must define an \"examples\" object where you define at least one success and one failure of your expectation to \ndemonstrate its logic. The result format must be set to complete, and you must set the [unexpected_index_name](#result-format) variable.\n\n!!! note\n    For any examples where you will have unexpected results you must define  unexpected_index_list in your \"out\" element.\n    This will be validated during the testing phase.\n\n3) The metric **must** be the same you defined in the metric definition.\n\n4) You **must** define all [additional parameters](#parameters) that the user has to/should provide to the expectation. \n\n5) You **should** define any default values for your expectations parameters. \n\n6) You must **define** the `_validate` method like shown in the example. You **must** call the `validate_result` function \ninside your validate method, this process adds a validation to the unexpected index list in the examples.\n\n!!! note\n    If your custom expectation requires any extra validations, or you require additional fields to be returned on \n    the final dataframe, you can add them in this function. 
\n    The validate_result method has two optional parameters (`partial_success` and `partial_result) that can be used to \n    pass the result of additional validations and add more information to the result key of the returned dict respectively.\n\n```python\n1) class ExpectColumnPairAToBeSmallerOrEqualThanB(ColumnPairMapExpectation):\n    \"\"\"Expect values in column A to be lower or equal than column B.\n \n    Args:\n        column_A: The first column name.\n        column_B: The second column name.\n        margin: additional approximation to column B value.\n \n    Keyword Args:\n        allow_cross_type_comparisons: If True, allow\n            comparisons between types (e.g. integer and string).\n            Otherwise, attempting such comparisons will raise an exception.\n        ignore_row_if: \"both_values_are_missing\",\n            \"either_value_is_missing\", \"neither\" (default).\n        result_format: Which output mode to use:\n            `BOOLEAN_ONLY`, `BASIC` (default), `COMPLETE`, or `SUMMARY`.\n        include_config: If True (default), then include the expectation config\n            as part of the result object.\n        catch_exceptions: If True, then catch exceptions and\n            include them as part of the result object. Default: False.\n        meta: A JSON-serializable dictionary (nesting allowed)\n            that will be included in the output without modification.\n \n    Returns:\n        An ExpectationSuiteValidationResult.\n    \"\"\"\n    2) examples = [\n        {\n            \"dataset_name\": \"Test Dataset\",\n            \"data\": {\n                \"a\": [11, 22, 50],\n                \"b\": [10, 21, 100],\n                \"c\": [9, 21, 30],\n            },\n            \"schemas\": {\n                \"spark\": {\"a\": \"IntegerType\", \"b\": \"IntegerType\", \"c\": \"IntegerType\"}\n            },\n            \"tests\": [\n                {\n                    \"title\": \"negative_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_A\": \"a\",\n                        \"column_B\": \"c\",\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"c\"],\n                            \"include_unexpected_rows\": True,\n                        },\n                    },\n                    \"out\": {\n                        \"success\": False,\n                        \"unexpected_index_list\": [\n                            {\"c\": 9, \"a\": 11},\n                            {\"c\": 21, \"a\": 22},\n                            {\"c\": 30, \"a\": 50},\n                        ],\n                    },\n                },\n                {\n                    \"title\": \"positive_test\",\n                    \"exact_match_out\": False,\n                    \"include_in_gallery\": True,\n                    \"in\": {\n                        \"column_A\": \"a\",\n                        \"column_B\": \"b\",\n                        \"margin\": 1,\n                        \"result_format\": {\n                            \"result_format\": \"COMPLETE\",\n                            \"unexpected_index_column_names\": [\"a\"],\n                        },\n                    },\n                    \"out\": {\"success\": True},\n                },\n            ],\n        },\n    ]\n      \n    3) 
map_metric = \"column_values.pattern_match\"\n    4) success_keys = (\n        \"validation_regex\",\n        \"mostly\",\n    )\n    5) default_kwarg_values = {\n        \"ignore_row_if\": \"never\",\n        \"result_format\": \"BASIC\",\n        \"include_config\": True,\n        \"catch_exceptions\": False,\n        \"mostly\": 1,\n    }\n \n    6) def _validate(\n        self,\n        configuration: ExpectationConfiguration,\n        metrics: Dict,\n        runtime_configuration: Optional[dict] = None,\n        execution_engine: Optional[ExecutionEngine] = None,\n    ) -> dict:\n        \"\"\"Custom implementation of the GX _validate method.\n \n        This method is used on the tests to validate both the result\n        of the tests themselves and if the unexpected index list\n        is correctly generated.\n        The GX test logic does not do this validation, and thus\n        we need to make it manually.\n \n        Args:\n            configuration: Configuration used in the test.\n            metrics: Test result metrics.\n            runtime_configuration: Configuration used when running the expectation.\n            execution_engine: Execution Engine where the expectation was run.\n \n        Returns:\n            Dictionary with the result of the validation.\n        \"\"\"\n        return validate_result(self, configuration, metrics)\n```\n\n### Printing the Expectation Diagnostics\nYour expectations **must** include the ability to call the Great Expectations diagnostic function in order to be validated.\n\nIn order to do this code **must** be present.\n\n```python\n\"\"\"Mandatory block of code. If it is removed the expectation will not be available.\"\"\"\nif __name__ == \"__main__\":\n    # test the custom expectation with the function `print_diagnostic_checklist()`\n    ExpectColumnPairAToBeSmallerOrEqualThanB().print_diagnostic_checklist()\n```\n\n## Creation Process\n\n1) Create a branch from lakehouse engine.\n\n2) Create a custom expectation with your specific logic:\n\n   1. All new expectations must be placed inside folder `/lakehouse_engine/dq_processors/custom_expectations`.\n   2. The name of the expectation must be added to the file `/lakehouse_engine/core/definitions.py`, to the variable: `CUSTOM_EXPECTATION_LIST`.\n   3. All new expectations must be tested on `/tests/feature/custom_expectations/test_custom_expectations.py`.\n   In order to create a new test for your custom expectation it is necessary to:\n   \n   - Copy one of the expectation folders in `tests/resources/feature/custom_expectations` renaming it to your custom expectation.\n   - Make any necessary changes on the data/schema file present.\n   - On `/tests/feature/custom_expectations/test_custom_expectations.py` add a scenario to test your expectation, all expectations \n   must be tested on batch and streaming. The test is implemented to generate an acon based on each scenario data. \n   - Test your developments to check that everything is working as intended.\n\n3) When the development is completed, create a pull request with your changes.\n\n4) Your expectation will be available with the next release of the lakehouse engine that happens after you pull request is approved. \nThis means that you need to upgrade your version of the lakehouse engine in order to use it.\n\n## Usage\nCustom Expectations are available to use like any other expectations provided by Great Expectations.\n\n## Parameters\nDepending on the type of expectation you are defining some parameters are expected by default. 
\nEx: A ColumnMapExpectation has a default \"column\" parameter.\n\n### Mostly\n[Mostly](https://docs.greatexpectations.io/docs/reference/learn/expectations/standard_arguments/#mostly) is a standard \nparameter for a subset of expectations that is used to define a threshold for the failure of an expectation. \nEx: A mostly value of 0.7 makes it so that the expectation only fails if more than 30% of the records have \na negative result (i.e., fewer than 70% of the records pass).\n\n## Result Format\nGreat Expectations has several different types of [result formats](https://docs.greatexpectations.io/docs/reference/learn/expectations/result_format/) \nfor the expectations results. The lakehouse engine requires the result format to be set to \"COMPLETE\" in order to tag \nthe lines where the expectations failed.\n\n### `unexpected_index_column_names`\nInside this key you must define what columns are used as an index inside your data. If this is set and the result \nformat is set to \"COMPLETE\", a list with the indexes of the lines that failed the validation will be returned by \nGreat Expectations.\nThis information is used by the Lakehouse Engine to tag the lines in error after the fact. The additional tests \ninside the `_validate` method verify that the custom expectation is tagging these lines correctly.\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/data_quality.md",
    "content": "# Data Quality\n\nThe Data Quality framework is based on [Great Expectations (GX)](https://greatexpectations.io/) and other custom-made \ndevelopments, providing a very light abstraction on top of the GX open source framework and the Spark framework.\n\n## How to use Data Quality?\n\n### Data Loader\nYou can define data quality rules inside the DataLoader algorithm that you use to load data.\n\n!!! note\n    The DataLoader algorithm allows you to store the results of the data quality checks inside your custom location\n    using the **result_sink** options (e.g., a delta table on your data product). Using result sink unlocks the \n    capability to store DQ results having history over all the DQ executions, which can be used for debugging, \n    to create **DQ dashboards** on top of the data, and much more.\n\n**Examples**:\nIn these examples, dummy sales local data is used to cover a few example usages of the DQ Framework\n(based on Great Expectations).\n\nThe main difference between the sample acons is on the usage of `dq_specs`.\n\n- 1 - [Minimal Example applying DQ with the Required Parameters](minimal_example/minimal_example.md)\n- 2 - [Configure Result Sink](result_sink/result_sink.md)\n- 3 - [Validations Failing](validations_failing/validations_failing.md)\n- 4 - [Row Tagging](row_tagging/row_tagging.md)\n\n**Disclaimer:** even though the `\"dq_type\": \"validator\"` is still supported (as presented on this template),\nour recommendation is to use `\"dq_type\": \"prisma\"`, which offers many more features end to end (from DQ Rules\ncreation, execution until results analysis) and a configurable central observability\nwith standard offering of Dashboarding on top. The DQ Type validator and the result_sink is still\nsupported for very specific use cases that might still exist and for which it might make sense to keep using\nthis approach. In case of doubt between the offerings, please feel free to reach us.\n\n### Data Quality Validator\n\nThe DQValidator algorithm focuses on validating data (e.g., spark DataFrames, Files or Tables).\nIn contrast to the `dq_specs` inside the DataLoader algorithm, the DQValidator focuses on **validating data at rest \n(post-mortem)** instead of validating data in-transit (before it is loaded to the destination).\n\n!!! note\n    The DQValidator algorithm allows you to store the results of the data quality checks inside your custom location\n    using the **result_sink** options (e.g., a delta table on your data product). Using result sink unlocks the\n    capability to store DQ results having history over all the DQ executions, which can be used for debugging,\n    to create **DQ dashboards** on top of the data, and much more.\n\n[Here you can find more information regarding DQValidator and examples](data_quality_validator/data_quality_validator.md).\n\n\n### Reconciliator\n\nSimilarly to the [Data Quality Validator](#data-quality-validator) algorithm, the Reconciliator algorithm focuses on \nvalidating data at rest (post-mortem). In contrast to the DQValidator algorithm, the Reconciliator always compares a \ntruth dataset (e.g., spark DataFrames, Files or Tables) with the current dataset (e.g., spark DataFrames, Files or \nTables), instead of executing DQ rules defined by the teams. \n[Here you can find more information regarding reconciliator and examples](../reconciliator/reconciliator.md).\n\n!!! 
note\n    Reconciliator does not use Great Expectations; therefore, Data Docs, Result Sink and other native methods are not available.\n\n### Custom Expectations\n\nIf your data has a data quality check that cannot be done with the expectations provided by Great Expectations, you \ncan create a custom expectation to make this verification.\n\n!!! note\n    Before creating a custom expectation, check if there is an expectation already created to address your needs, \n    both in Great Expectations and the Lakehouse Engine.\n    Any Custom Expectation that is too specific (using hardcoded table/column names) will be rejected.\n    **Expectations should be generic by definition.**\n\n[Here you can find more information regarding custom expectations and examples](custom_expectations/custom_expectations.md).\n\n### Row Tagging\nThe row tagging strategy allows users to tag the rows that failed, making it easier to identify the problems \nin the validations. [Here you can find all the details and examples](row_tagging/row_tagging.md).\n\n### Prisma\nPrisma is part of the Lakehouse Engine DQ Framework, and it allows users to read DQ functions dynamically from a table instead of writing them explicitly in the ACONs.\n[Here you can find more information regarding Prisma](prisma/prisma.md).\n\n## How to check the results of the Data Quality Process?\n\n### 1. Table/location analysis\nThe possibility to configure a **Result Sink** allows you to store the history of executions of the DQ process. \nYou can query the table or the location to search through data and analyse history.\n\n### 2. Power BI Dashboard \nWith the information expanded, interactive analysis can be built on top of the history of the DQ process.\nA dashboard can be created with the results that we have in `dq_specs`. To be able to have this information, you \nneed to use the arguments `result_sink_db_table` and/or `result_sink_location`.\n\nThrough having a dashboard, the runs and expectations can be analysed, filtered by year, month, source and \nrun name, and you will have information about the number of runs, some statistics, status of expectations and more. \nAnalysis such as biggest failures per expectation type, biggest failures by columns, biggest failures per source, \nand others can be made, using the information in the `result_sink_db_table`/`result_sink_location`.\n\n!!! note\n    The recommendation is to use the same result sink table/location for all your dq_specs and \n    in the dashboard you will get a preview of the status of all of them.\n\n<img src=\"../assets/img/dq_dashboard.png?raw=true\" style=\"max-width: 800px; height: auto; \"/>\n
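\nOutside the dashboard, the result sink can also be inspected directly from a notebook. A minimal sketch, assuming a hypothetical result sink table name (use the value you configured in `result_sink_db_table`):\n\n```python\n# Hypothetical result sink table name - adjust it to your configured result_sink_db_table.\ndq_results = spark.table(\"my_database.dq_result_sink\")\n\ndq_results.show(truncate=False)\n```\n"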
  },
  {
    "path": "lakehouse_engine_usage/data_quality/data_quality_validator/__init__.py",
    "content": "\"\"\"\n.. include::data_quality_validator.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/data_quality_validator/data_quality_validator.md",
    "content": "# Data Quality Validator\n\nDQValidator algorithm allows DQ Validations isolated from the data load (only read and apply data quality validations).\nWith this algorithm you have the capacity to apply the Lakehouse-Engine Data Quality Process,\nusing [Great Expectations](https://greatexpectations.io/expectations/) functions directly into a specific dataset also\nmaking use of all the [InputSpecs](../../../reference/packages/core/definitions.md#packages.core.definitions.InputSpec) available in the engine.\n\nValidating the Data Quality, using this algorithm, is a matter of defining the data you want to read and the validations you want to do to your data, detailing the great expectations functions you want to apply on the data to assess its quality.\n\n!!! warning\n    **This algorithm also gives the possibility to restore a previous version of a delta table or delta files in case the DQ\n    process raises any exception. Please use it carefully!!** You may lose important commits and data. Moreover, this will\n    highly depend on the frequency that you run your Data Quality validations. If you run your data loads daily and Data\n    Quality validations weekly, and you define the restore_prev_version to true, this means that the table will be restored\n    to the previous version, but the error could have happened 4 or 5 versions before.\n\n## When to use?\n\n- **Post-Load validation**: check quality of data already loaded to a table/location\n- **Pre-Load validation**: check quality of the data you want to load (check DQ by reading a set of files in a specific\n  location...)\n- **Validation of a DataFrame computed in the notebook itself** (e.g. check data quality after joining or filtering\n  datasets, using the computed DataFrame as input for the validation)\n\nThis algorithm also gives teams some freedom to:\n\n- **Schedule isolated DQ Validations to run periodically**, with the frequency they need;\n- Define a DQ Validation process **as an end-to-end test** of the respective data product.\n\n## How to use?\n\nAll of these configurations are passed via the ACON to instantiate\na [DQValidatorSpec object](../../../reference/packages/core/definitions.md#packages.core.definitions.DQValidatorSpec). The DQValidator algorithm uses an\nACON to configure its execution. 
In [DQValidatorSpec](../../../reference/packages/core/definitions.md#packages.core.definitions.DQValidatorSpec) you can\nfind the meaning of each ACON property.\n\nHere is an example of ACON configuration:\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_spec\": {\n        \"spec_id\": \"sales_source\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"table\",\n        \"db_table\": \"my_database.my_table\"\n    },\n    \"dq_spec\": {\n        \"spec_id\": \"dq_sales\",\n        \"input_id\": \"sales_source\",\n        \"dq_type\": \"validator\",\n        \"store_backend\": \"file_system\",\n        \"local_fs_root_dir\": \"/app/tests/lakehouse/in/feature/dq_validator/dq\",\n        \"result_sink_db_table\": \"my_database.dq_validator\",\n        \"result_sink_format\": \"json\",\n        \"fail_on_error\": False,\n        \"dq_functions\": [\n            {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"article\"}},\n            {\n                \"function\": \"expect_table_row_count_to_be_between\",\n                \"args\": {\"min_value\": 3, \"max_value\": 11},\n            },\n        ],\n    },\n    \"restore_prev_version\": True,\n}\n\nload_data(acon=acon)\n```\n\nOn this page you will also find the following examples of usage:\n\n1. Dataframe as input & Success on the DQ Validation\n2. Table as input & Failure on DQ Validation & Restore previous version\n3. Files as input & Failure on DQ Validation & Fail_on_error disabled\n4. Files as input & Failure on DQ Validation & Critical functions defined\n5. Files as input & Failure on DQ Validation & Max failure percentage defined\n\n\n### Example 1 : Dataframe as input & Success on the DQ Validation\n\nThis example focuses on using a dataframe, computed in this notebook, directly in the input spec. 
First, a new\nDataFrame is generated as a result of the join of data from two tables (dummy_deliveries and dummy_pd_article) and\nsome DQ Validations are applied on top of this dataframe.\n\n```python\nfrom lakehouse_engine.engine import execute_dq_validation\n\ninput_df = spark.sql(\"\"\"\n        SELECT a.*, b.article_category, b.article_color\n        FROM my_database.dummy_deliveries a\n        JOIN my_database.dummy_pd_article b\n            ON a.article_id = b.article_id\n        \"\"\"\n)\n\nacon = {\n    \"input_spec\": {\n        \"spec_id\": \"deliveries_article_input\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"dataframe\",\n        \"df_name\": input_df,\n    },\n    \"dq_spec\": {\n        \"spec_id\": \"deliveries_article_dq\",\n        \"input_id\": \"deliveries_article_input\",\n        \"dq_type\": \"validator\",\n        \"bucket\": \"my_data_product_bucket\",\n        \"result_sink_db_table\": \"my_database.dq_validator_deliveries\",\n        \"result_sink_location\": \"my_dq_path/dq_validator/dq_validator_deliveries/\",\n        \"expectations_store_prefix\": \"dq/dq_validator/expectations/\",\n        \"validations_store_prefix\": \"dq/dq_validator/validations/\",\n        \"checkpoint_store_prefix\": \"dq/dq_validator/checkpoints/\",\n        \"unexpected_rows_pk\": [\"salesorder\", \"delivery_item\", \"article_id\"],\n        \"dq_functions\": [{\"function\": \"expect_column_values_to_not_be_null\", \"args\": {\"column\": \"delivery_date\"}}],\n    },\n    \"restore_prev_version\": False,\n}\n\nexecute_dq_validation(acon=acon)\n```\n\n\n### Example 2: Table as input & Failure on DQ Validation & Restore previous version\n\nIn this example we are using a table as input to validate the data that was loaded. Here, we are forcing the DQ Validations to fail in order to show the possibility of restoring the table to the previous version.\n\n!!! warning\n    **Be careful when using the feature of restoring a previous version of a delta table or delta files.** You may\n    lose important commits and data. Moreover, this will highly depend on the frequency that you run your Data Quality\n    validations. If you run your data loads daily and Data Quality validations weekly, and you define the\n    restore_prev_version to true, this means that the table will be restored to the previous version, but the error\n    could have happened 4 or 5 versions before (because loads are daily, validations are weekly).\n\nSteps followed in this example to show how the restore_prev_version feature works.\n\n1. **Insert rows into the dummy_deliveries table** to adjust the total numbers of rows and **make the DQ process fail**.\n2. **Use the \"DESCRIBE HISTORY\" statement to check the number of versions available on the table** and check the version\n   number resulting from the insertion to the table.\n3. **Execute the DQ Validation**, using the configured acon (based on reading the dummy_deliveries table and setting the \n`restore_prev_version` to `true`). Checking the logs of the process, you can see that the data did not pass all the \nexpectations defined and that the table version restore process was triggered.\n4. 
**Re-run a \"DESCRIBE HISTORY\" statement to check that the previous version of the table was restored** and thus, the row inserted in the beginning of the process is no longer present in the table.\n\n```python\nfrom lakehouse_engine.engine import execute_dq_validation\n\n# Force failure of data quality by adding new row\nspark.sql(\"\"\"INSERT INTO my_database.dummy_deliveries VALUES (7, 1, 20180601, 71, \"article1\", \"delivered\")\"\"\")\n\n\n# Check history of the table\nspark.sql(\"\"\"DESCRIBE HISTORY my_database.dummy_deliveries\"\"\")\n\nacon = {\n    \"input_spec\": {\n        \"spec_id\": \"deliveries_input\",\n        \"read_type\": \"batch\",\n        \"db_table\": \"my_database.dummy_deliveries\",\n    },\n    \"dq_spec\": {\n        \"spec_id\": \"dq_deliveries\",\n        \"input_id\": \"deliveries_input\",\n        \"dq_type\": \"validator\",\n        \"bucket\": \"my_data_product_bucket\",\n        \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n        \"dq_functions\": [\n            {\"function\": \"expect_column_values_to_not_be_null\", \"args\": {\"column\": \"delivery_date\"}},\n            {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 19}},\n        ],\n    },\n    \"restore_prev_version\": True,\n}\n\nexecute_dq_validation(acon=acon)\n \n# Check that the previous version of the table was restored\nspark.sql(\"\"\"DESCRIBE HISTORY my_database.dummy_deliveries\"\"\")\n```\n\n\n### Example 3: Files as input & Failure on DQ Validation & Fail_on_error disabled\n\nIn this example we are using a location as input to validate the files in a specific folder.\nHere, we are forcing the DQ Validations to fail, however disabling the \"fail_on_error\" configuration,\nso the algorithm warns about the expectations that failed but the process/the execution of the algorithm doesn't fail.\n\n```python\nfrom lakehouse_engine.engine import execute_dq_validation\n\nacon = {\n    \"input_spec\": {\n        \"spec_id\": \"deliveries_input\",\n        \"data_format\": \"delta\",\n        \"read_type\": \"streaming\",\n        \"location\": \"s3://my_data_product_bucket/silver/dummy_deliveries/\",\n    },\n    \"dq_spec\": {\n        \"spec_id\": \"dq_deliveries\",\n        \"input_id\": \"deliveries_input\",\n        \"dq_type\": \"validator\",\n        \"bucket\": \"my_data_product_bucket\",\n        \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n        \"fail_on_error\": False,\n        \"dq_functions\": [\n            {\"function\": \"expect_column_values_to_not_be_null\", \"args\": {\"column\": \"delivery_date\"}},\n            {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 17}},\n        ],\n    },\n    \"restore_prev_version\": False,\n}\n\nexecute_dq_validation(acon=acon)\n```\n\n\n### Example 4: Files as input & Failure on DQ Validation & Critical functions defined\n\nIn this example we are using a location as input to validate the files in a specific folder.\nHere, we are forcing the DQ Validations to fail by using the critical functions feature, which will throw an error\nif any of the functions fails.\n\n```python\nfrom lakehouse_engine.engine import execute_dq_validation\n\nacon = {\n    \"input_spec\": {\n        \"spec_id\": \"deliveries_input\",\n        \"data_format\": \"delta\",\n        \"read_type\": \"streaming\",\n        \"location\": \"s3://my_data_product_bucket/silver/dummy_deliveries/\",\n    },\n    \"dq_spec\": {\n        
\"spec_id\": \"dq_deliveries\",\n        \"input_id\": \"deliveries_input\",\n        \"dq_type\": \"validator\",\n        \"bucket\": \"my_data_product_bucket\",\n        \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n        \"fail_on_error\": True,\n        \"dq_functions\": [\n            {\"function\": \"expect_column_values_to_not_be_null\", \"args\": {\"column\": \"delivery_date\"}},\n        ],\n        \"critical_functions\": [\n            {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 17}},\n        ],\n    },\n    \"restore_prev_version\": False,\n}\n\nexecute_dq_validation(acon=acon)\n```\n\n\n### Example 5: Files as input & Failure on DQ Validation & Max failure percentage defined\n\nIn this example we are using a location as input to validate the files in a specific folder.\nHere, we are forcing the DQ Validations to fail by using the max_percentage_failure,\nwhich will throw an error if the percentage of failures surpasses the defined maximum threshold.\n\n```python\nfrom lakehouse_engine.engine import execute_dq_validation\n\nacon = {\n    \"input_spec\": {\n        \"spec_id\": \"deliveries_input\",\n        \"data_format\": \"delta\",\n        \"read_type\": \"streaming\",\n        \"location\": \"s3://my_data_product_bucket/silver/dummy_deliveries/\",\n    },\n    \"dq_spec\": {\n        \"spec_id\": \"dq_deliveries\",\n        \"input_id\": \"deliveries_input\",\n        \"dq_type\": \"validator\",\n        \"bucket\": \"my_data_product_bucket\",\n        \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n        \"fail_on_error\": True,\n        \"dq_functions\": [\n            {\"function\": \"expect_column_values_to_not_be_null\", \"args\": {\"column\": \"delivery_date\"}},\n            {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 17}},\n        ],\n        \"max_percentage_failure\": 0.2,\n    },\n    \"restore_prev_version\": False,\n}\n\nexecute_dq_validation(acon=acon)\n```\n\n\n## Limitations\n\nUnlike DataLoader, this new DQValidator algorithm only allows, for now, one input_spec (instead of a list of input_specs) and one dq_spec (instead of a list of dq_specs). There are plans and efforts already initiated to make this available in the input_specs and one dq_spec (instead of a list of dq_specs). However, you can prepare a Dataframe which joins more than a source, and use it as input, in case you need to assess the Data Quality from different sources at the same time. Alternatively, you can also show interest on any enhancement on this feature, as well as contributing yourself.\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/minimal_example/__init__.py",
    "content": "\"\"\"\n.. include::minimal_example.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/minimal_example/minimal_example.md",
    "content": "# Minimal Example\n\nThis scenario illustrates the minimal configuration that you can have to use `dq_specs`, in which\nit uses required parameters: `spec_id, input_id, dq_type, bucket, dq_functions`.\n\nRegarding the dq_functions, it uses 3 functions (retrieved from the expectations supported by GX), which check:\n\n- **expect_column_to_exist** - if a column exist in the data;\n- **expect_table_row_count_to_be_between** - if the row count of the data is between the defined interval;\n- **expect_table_column_count_to_be_between** - if the number of columns in the data is bellow the max value defined.\n\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"header\": True,\n                \"delimiter\": \"|\",\n                \"inferSchema\": True,\n            },\n            \"location\": \"s3://my_data_product_bucket/dummy_deliveries/\",\n        }\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"dq_validator\",\n            \"input_id\": \"dummy_deliveries_source\",\n            \"dq_type\": \"validator\",\n            \"bucket\": \"my_data_product_bucket\",\n            \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n            \"dq_functions\": [\n                {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"salesorder\"}},\n                {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 25}},\n                {\"function\": \"expect_table_column_count_to_be_between\", \"args\": {\"max_value\": 7}},\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_bronze\",\n            \"input_id\": \"dq_validator\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/\",\n        }\n    ],\n}\n\nload_data(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/prisma/__init__.py",
    "content": "\"\"\"\n.. include::prisma.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/prisma/prisma.md",
    "content": "# Prisma\n\nPrisma is part of the Lakehouse Engine DQ Framework, and it allows users to read DQ functions dynamically from a table instead of writing them explicitly in the Acons.\n\n\n## How to use Prisma?\n- Use the Lakehouse Engine version: 1.22.0 or later;\n- Use DBR 13.3 or later.  If you are not using Databricks, ensure a similar environment with Spark 3.4.1 and Delta 2.4.0.\n- Create the DQ Checks in a table in your Data Product:\n  - Each data quality check conducted in Prisma will be hosted within the bucket defined in the engine config file (lakehouse_engine/configs/engine.yaml). Consequently, the result sink location will receive the results of their assessments at the granularity of each \"run\", capturing all records generated during every operation. The DQ Checks table is located in the demanding data product and can have any name (i.e: data_quality_checks).\n  - The idea is for it to be a central bucket for all DPs to ensure easier and better observability and unlock offering of easier insights over the Data Quality of the Lakehouse.\n  \nBelow you find a DDL example with the expected schema and description for the fields:\n```sql\nDROP TABLE IF EXISTS my_database.data_quality_checks;\nCREATE EXTERNAL TABLE my_database.data_quality_checks (\n  dq_rule_id STRING COMMENT 'DQ Rule ID.',\n  dq_tech_function STRING COMMENT 'Great Expectations function type to apply according to the DQ rules type. Example: expect_column_to_exist.',\n  execution_point STRING COMMENT 'In motion/At rest.',\n  schema STRING COMMENT 'The database schema on which the check is to be applied.',\n  table STRING COMMENT 'The table on which the check is to be applied.',\n  column STRING COMMENT 'The column (either on Lakehouse or in other accessible source systems, such as FDP or SAP BW) on which the check is to be applied.',\n  filters STRING COMMENT 'General filters to the data set (where part of the statement). Note: this is purely descriptive at this point as there is no automated action/filtering of the Lakehouse Engine or PRISMA upon it.',\n  arguments STRING COMMENT 'Additional arguments to run the Great Expectation Function in the same order as they appear in the function. 
Example: {\"column\": \"amount\", \"min_value\": 0}.',\n  dimension STRING COMMENT 'Data Quality dimension.'\n)\nUSING DELTA\nLOCATION 's3://my-data-product-bucket/inbound/data_quality_checks'\nCOMMENT 'Table with dummy data mapping DQ Checks.'\nTBLPROPERTIES(\n  'lakehouse.primary_key'='dq_rule_id',\n  'delta.enableChangeDataFeed'='true'\n)\n```\n**Data sample:**\n\n| dq_rule_id | dq_tech_function                          | execution_point | schema             | table       | column       | filters | arguments                                         | dimension    |\n|------------|:------------------------------------------|:----------------|:-------------------|:------------|:-------------|:--------|:--------------------------------------------------|--------------|\n| 1          | expect_column_values_to_not_be_null       | at_rest         | my_database_schema | dummy_sales | ordered_item |         | {\"column\": \"ordered_item\"}                        | Completeness |\n| 2          | expect_column_min_to_be_between           | in_motion       | my_database_schema | dummy_sales | ordered_item |         | {\"column\": \"amount\", \"min_value\": 0}              | Completeness |\n| 3          | expect_column_values_to_not_be_in_set     | in_motion       | my_database_schema | dummy_sales | ordered_item |         | {\"column\": \"amount\", \"value_set\": [1,2,3]}        | Completeness |\n| 4          | expect_column_pair_a_to_be_not_equal_to_b | at_rest         | my_database_schema | dummy_sales | ordered_item |         | {\"column_A\": \"amount\",\"column_B\": \"ordered_item\"} | Completeness |\n| 5          | expect_table_row_count_to_be_between      | at_rest         | my_database_schema | dummy_sales | ordered_item |         | {\"min_value\": 1, \"max_value\": 10}                 | Completeness |\n\n**Table definition:**\n\n| Column Name      | Definition                                                                                                                                                                                                                                                                        |\n|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| dq_rule_id       | The identifier of a data quality rule.                                                                                                                                                                                                                                            |\n| dq_tech_function | Type of Great Expectations function to apply according to the DQ rules type. See the values here: [Gallery of Expectations and Packages](https://greatexpectations.io/legacy/v1/expectations/?filterType=Backend+support&viewType=Summary&showFilters=true&subFilterValues=spark) |\n| execution_point  | The way how validations will be performed on top the the data set. List of values: at_rest, in_motion.                                                                                                                                                                            |\n| schema           | The schema on which the check is to be applied.                                                                                                                                         
                                                                                          |\n| table            | The table on which the check is to be applied.                                                                                                                                                                                                                                    |\n| column           | The column on which the check is to be applied.                                                                                                                                                                                                                                   |\n| filters          | General filters to the data set (where part of the statement). **Note**: this is purely descriptive at this point as there is no automated action/filtering of the Lakehouse Engine or PRISMA upon it.                                                                            |\n| arguments        | Additional arguments to run the Great Expectation Function in the same order as they appear in the function.                                                                                                                                                                      |\n| dimension        | Categorisation of a DQ rule related to one of the dimensions. List of values: Completeness, Uniqueness, Timeliness, Validity, Consistency, Accuracy. **Note**: these values are purely descriptive.                                                                               |\n\n**Execution behaviour** - The value of the **execution_point** column determines the type of Acon execution:\n\n  - **For records at_rest**, they will only be processed when the Lakehouse engine is called by the execute_dq_validation() function.\n\n  - **For records in_motion**, they will only be processed when the Lakehouse engine is called by load_data() function.\n\n## What are the main changes on my ACON if I already implemented DQ?\n\nThe following configurations represent the minimum requirements to make Prisma DQ work.\n\n- **dq_type:** \"prisma\" - the value must be set in order for the engine process the DQ with Prisma;\n- **store_backend:** \"file_system\" or \"s3\" - which store backend to use;\n  - **bucket** - the bucket name to consider for the store_backend (store DQ artefacts). **Note**: only applicable and mandatory for store_backend s3.\n  - **local_fs_root_dir:** path of the root directory. **Notes**: only applicable for store_backend file_system;\n- **dq_db_table:** the DQ Check table that is located in the demanding data product;\n- **dq_table_table_filter:** name of the table which rules are to be applied in the validations. The table name must match with the values inserted in the column \"table\" from dq_db_table;\n- **data_product_name:** the name of the data product;\n- **tbl_to_derive_pk or unexpected_rows_pk:**\n  - tbl_to_derive_pk - automatically derive the primary keys from a given database table. **Note**: the primary keys are derived from the **lakehouse.primary_key** property of a table.   
\n  - unexpected_rows_pk - the list of columns composing the primary key of the source data to identify the rows failing the DQ validations.\n\n**DQ Prisma Acon example**\n```python\n\"dq_specs\": [\n  {\n    \"spec_id\": \"dq_validator_in_motion\",\n    \"input_id\": \"dummy_sales_transform\",\n    \"dq_type\": \"prisma\",\n    \"store_backend\": \"file_system\",\n    \"local_fs_root_dir\": \"/my-data-product/artefacts/dq\",\n    \"dq_db_table\": DQ_DB_TABLE,\n    \"dq_table_table_filter\": \"dummy_sales\",\n    \"data_product_name\": DATA_PRODUCT_NAME,\n    \"tbl_to_derive_pk\": DB_TABLE,\n  }\n],\n```\n\n!!! note\n    Available extra parameters to use in the DQ Specs for Prisma:\n    \n    - **data_docs_local_fs** - the path for data docs. The parameter is useful in case you want your DQ Results to be reflected on the automatic Data Docs site;\n    - **data_docs_prefix** - prefix where to store data_docs' data. This parameter must be used together with `data_docs_local_fs`;\n    - **dq_table_extra_filters** - extra filters to be used when deriving DQ functions. This is an SQL expression to be applied to `dq_db_table` which means that the statements must use one of the available columns in the table. For example: dq_rule_id in ('rule1','rule2');\n    - **data_docs_bucket** - the bucket name for data docs only. When defined, it will supersede bucket parameter. **Note:** only applicable for store_backend s3;\n    - **expectations_store_prefix** - prefix where to store expectations' data. **Note:** only applicable for store_backend s3;\n    - **validations_store_prefix** - prefix where to store validations' data. **Note:** only applicable for store_backend s3;\n    - **checkpoint_store_prefix** - prefix where to store checkpoints' data. **Note:** only applicable for store_backend s3;\n\n## End2End Example\nBelow you can also find an End2End and detailed example of loading data into the DQ Checks table and then using PRISMA both with load_data() and execute_dq_validation().\n\n??? example \"**1 - Load the DQ Checks Table**\"\n    This example shows how to insert data into the data_quality_checks table using an Acon with a csv file as a source.\n    The location provided is just an example of a place to store the csv. It is also important that the source file contains the **data_quality_checks** schema.\n    ```python\n    acon = {\n      \"input_specs\": [\n        {\n        \"spec_id\": \"read_dq_checks\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"csv\",\n        \"options\": {\"header\": True, \"delimiter\": \";\"},\n        \"location\": \"s3://my-data-product/local_data/data_quality_checks/\",\n        }\n      ],\n      \"output_specs\": [\n        {\n        \"spec_id\": \"write_dq_checks\",\n        \"input_id\": \"read_dq_checks\",\n        \"write_type\": \"overwrite\",\n        \"data_format\": \"delta\",\n        \"location\": \"s3://my-data-product-bucket/inbound/data_quality_checks\",\n        }\n      ],\n    }\n    \n    load_data(acon=acon)\n    ```\n\n??? 
example \"**2 - PRISMA - IN MOTION (load_data)**\"\n    ```python\n    cols_to_rename = {\"item\": \"ordered_item\", \"date\": \"order_date\", \"article\": \"article_id\"}   \n    acon = {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"dummy_sales_bronze\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"delta\",\n                \"location\": \"s3://my-data-product-bucket/bronze/dummy_sales\",\n            }\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"dummy_sales_transform\",\n                \"input_id\": \"dummy_sales_bronze\",\n                \"transformers\": [\n                    {\n                        \"function\": \"rename\",\n                        \"args\": {\n                            \"cols\": cols_to_rename,\n                        },\n                    },\n                ],\n            }\n        ],\n        \"dq_specs\": [\n            {\n                \"spec_id\": \"dq_validator_in_motion\",\n                \"input_id\": \"dummy_sales_transform\",\n                \"dq_type\": \"prisma\",\n                \"store_backend\": \"file_system\",\n                \"local_fs_root_dir\": \"/my-data-product/artefacts/dq\",\n                \"dq_db_table\": DQ_DB_TABLE,\n                \"dq_table_table_filter\": \"dummy_sales\",\n                \"dq_table_extra_filters\": \"1 = 1\",\n                \"data_docs_local_fs\": \"my-data-product/my-data-product-dq-site\",\n                \"data_docs_prefix\": \"{}/my-data-product-bucket/data_docs/site/\".format(DQ_PREFIX),\n                \"data_product_name\": DATA_PRODUCT_NAME,\n                \"tbl_to_derive_pk\": DB_TABLE,\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"dummy_sales_silver\",\n                \"input_id\": \"dq_validator_in_motion\",\n                \"write_type\": \"overwrite\",\n                \"data_format\": \"delta\",\n                \"location\": \"s3://my-data-product-bucket/silver/dummy_sales_dq_template_in_motion\",\n            }\n        ],\n    }\n    \n    load_data(acon=acon)\n    ```\n\n??? example \"**3 - PRISMA - AT REST (exec_dq_validation)**\"\n    ```python\n    acon = {\n        \"input_spec\": {\n            \"spec_id\": \"dummy_sales_source\",\n            \"read_type\": \"batch\",\n            \"db_table\": DB_TABLE,\n        },\n        \"dq_spec\": {\n            \"spec_id\": \"dq_validator_at_rest\",\n            \"input_id\": \"sales_input\",\n            \"dq_type\": \"prisma\",\n            \"store_backend\": \"file_system\",\n            \"local_fs_root_dir\": \"/my-data-product/artefacts/dq\",\n            \"dq_db_table\": DQ_DB_TABLE,\n            \"dq_table_table_filter\": \"dummy_sales\",\n            \"data_docs_local_fs\": \"my-data-product/my-data-product-dq-site\",\n            \"data_docs_prefix\": \"{}/my-data-product-bucket/data_docs/site/\".format(DQ_PREFIX),\n            \"data_product_name\": DATA_PRODUCT_NAME,\n            \"tbl_to_derive_pk\": DB_TABLE,\n        },\n    }\n    \n    execute_dq_validation(acon=acon)\n    ```\n\n## Troubleshooting/Common issues\nThis section provides a summary of common issues and resolutions.\n\n??? 
warning \"**Error type: filter does not get rules from DQ Checks table.**\"\n    <img src=\"../../../assets/prisma/img/dq_checks_table_w_no_rules.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n    **Solution**: make sure the records in your DQ Checks table are well-defined. In the Acon, ensure that you have the dq_table_table_filter with the correct table name.\n\n??? warning \"**Error type: missing expectation.**\"\n    <img src=\"../../../assets/prisma/img/missing_expectation.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n    **Solution**: make sure that you are using a valid expectation. See the valid ones on: [Gallery of Expectations and Packages](https://greatexpectations.io/legacy/v1/expectations/?filterType=Backend+support&viewType=Summary&showFilters=true&subFilterValues=spark)\n\n??? warning \"**Error type: missing expectation parameters.**\"\n    <img src=\"../../../assets/prisma/img/missing_expectation_parameters.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n    **Solution**: make sure that your \"arguments\" column in the DQ CHECKS table has all necessary parameters for the expectation. For example, the expectation [expect_column_values_to_not_be_null](https://greatexpectations.io/legacy/v1/expectations/expect_column_values_to_not_be_null?filterType=Backend%20support&gotoPage=1&showFilters=true&viewType=Summary&subFilterValues=spark) needs one argument (column (str): The column name).\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/result_sink/__init__.py",
    "content": "\"\"\"\n.. include::result_sink.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/result_sink/result_sink.md",
    "content": "# Result Sink\n\nThese scenarios store the results of the dq_specs into a result sink. For that, both scenarios include parameters defining\nthe specific table and location (`result_sink_db_table` and `result_sink_location`) where the results\nare expected to be stored. With this configuration, people can, later on, check the history of the DQ\nexecutions using the configured table/location, as shown bellow. You can configure saving the output of the\nresults in the result sink following two approaches:\n\n- [**Denormalized/exploded Data Model (recommended)**](#1-result-sink-exploded-recommended) - the results are stored in a detailed format in which\npeople are able to analyse them by Data Quality Run, by expectation_type and by keyword arguments.\n\n| ...                         | source     | column     | max_value | min_value | expectation_type                        | expectation_success | observed_value | run_time_year | ... |\n|-----------------------------|------------|------------|-----------|-----------|-----------------------------------------|---------------------|----------------|---------------|-----|\n| all columns from raw + more | deliveries | salesorder | null      | null      | expect_column_to_exist                  | TRUE                | null           | 2023          | ... |\n| all columns from raw + more | deliveries | null       | null      | null      | expect_table_row_count_to_be_between    | TRUE                | 23             | 2023          | ... |\n| all columns from raw + more | deliveries | null       | null      | null      | expect_table_column_count_to_be_between | TRUE                | 6              | 2023          | ... |\n\n- [**Raw Format Data Model (not recommended)**](#2-raw-result-sink) - the results are stored in the raw format that Great\nExpectations outputs. This is not recommended as the data will be highly nested and in a\nstring format (to prevent problems with schema changes), which makes analysis and the creation of a dashboard on top way \nharder.\n\n| checkpoint_config    | run_name                   | run_time                         | run_results                   | success                | validation_result_identifier | spec_id | input_id |\n|----------------------|----------------------------|----------------------------------|-------------------------------|------------------------|------------------------------|---------|----------|\n| entire configuration | 20230323-...-dq_validation | 2023-03-23T15:11:32.225354+00:00 | results of the 3 expectations | true/false for the run | identifier                   | spec_id | input_id |\n\n!!! note\n    - More configurations can be applied in the result sink, as the file format and partitions.\n    - It is recommended to:\n\n        - Use the same result sink table/location for all dq_specs across different data loads, from different \n        sources, in the same Data Product.\n        - Use the parameter `source` (only available with `\"result_sink_explode\": True`), in the dq_specs, as\n        used in both scenarios, with the name of the data source, to be easier to distinguish sources in the\n        analysis. If not specified, the `input_id` of the dq_spec will be considered as the `source`.\n        - These recommendations will enable more rich analysis/dashboard at Data Product level, considering\n        all the different sources and data loads that the Data Product is having.\n\n## 1. 
\n\n## 1. Result Sink Exploded (Recommended)\n\nThis scenario stores DQ Results (results produced by the execution of the dq_specs) in the Result Sink,\nin a detailed format, in which people are able to analyse them by Data Quality Run, by expectation_type and\nby keyword arguments. This is the recommended approach since it makes the analysis on top of the result\nsink much easier and faster.\n\nTo achieve the exploded data model, this scenario introduces the parameter `result_sink_explode`, which\nis a flag to determine if the output table/location should have the columns exploded (as `True`) or\nnot (as `False`). **Default:** `True`, but it is still provided explicitly in this scenario for demo purposes.\nThe table/location will include a schema which contains general columns, statistic columns, arguments of\nexpectations, and others, thus part of the schema will always be populated, while the other part depends on\nthe expectations chosen.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"header\": True,\n                \"delimiter\": \"|\",\n                \"inferSchema\": True,\n            },\n            \"location\": \"s3://my_data_product_bucket/dummy_deliveries/\",\n        }\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"dq_validator\",\n            \"input_id\": \"dummy_deliveries_source\",\n            \"dq_type\": \"validator\",\n            \"bucket\": \"my_data_product_bucket\",\n            \"result_sink_db_table\": \"my_database.dq_result_sink\",\n            \"result_sink_location\": \"my_dq_path/dq_result_sink/\",\n            \"result_sink_explode\": True,\n            \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n            \"source\": \"deliveries_success\",\n            \"dq_functions\": [\n                {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"salesorder\"}},\n                {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 25}},\n                {\"function\": \"expect_table_column_count_to_be_between\", \"args\": {\"max_value\": 7}},\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_bronze\",\n            \"input_id\": \"dq_validator\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/\",\n        }\n    ],\n}\n\nload_data(acon=acon)\n```\n\nTo check the history of the DQ results, you can run commands like:\n\n- the table: `display(spark.table(\"my_database.dq_result_sink\"))`\n- the location: `display(spark.read.format(\"delta\").load(\"my_dq_path/dq_result_sink/\"))`\n\n## 2. Raw Result Sink\nThis scenario is very similar to the previous one, but it changes the parameter `result_sink_explode` to `False` so that\nit produces a raw result sink output containing only one row representing the full run of `dq_specs` (no
Being a raw output, **it is not a\nrecommended approach**, as it will be more complicated to analyse and make queries on top of it.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"header\": True,\n                \"delimiter\": \"|\",\n                \"inferSchema\": True,\n            },\n            \"location\": \"s3://my_data_product_bucket/dummy_deliveries/\",\n        }\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"dq_validator\",\n            \"input_id\": \"dummy_deliveries_source\",\n            \"dq_type\": \"validator\",\n            \"bucket\": \"my_data_product_bucket\",\n            \"result_sink_db_table\": \"my_database.dq_result_sink_raw\",\n            \"result_sink_location\": \"my_dq_path/dq_result_sink_raw/\",\n            \"result_sink_explode\": False,\n            \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n            \"source\": \"deliveries_success_raw\",\n            \"dq_functions\": [\n                {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"salesorder\"}},\n                {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 25}},\n                {\"function\": \"expect_table_column_count_to_be_between\", \"args\": {\"max_value\": 7}},\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_bronze\",\n            \"input_id\": \"dq_validator\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/\",\n        }\n    ],\n}\n\nload_data(acon=acon)\n```\n\nTo check the history of the DQ results, you can run commands like:\n\n- the table: `display(spark.table(\"my_database.dq_result_sink_raw\"))`\n- the location: `display(spark.read.format(\"delta\").load(\"my_dq_path/dq_result_sink_raw/\"))`\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/row_tagging/__init__.py",
    "content": "\"\"\"\n.. include::row_tagging.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/row_tagging/row_tagging.md",
    "content": "# Row Tagging\nData quality is essential for any organisation that relies on data to make informed decisions. \nHigh-quality data provides accurate, reliable, and timely information that enables organisations to identify\nopportunities, mitigate risks, and optimize their operations. In contrast, low-quality data can lead to incorrect\nconclusions, faulty decisions, and wasted resources.\n\nThere are several common issues that can compromise data quality, such as:\n\n- data entry errors; \n- data duplication; \n- incomplete / inconsistent data; \n- changes where data is collected (e.g. sources); \n- faulty data processing, such as inaccurate data cleansing or transformations.\n\nTherefore, implementing data quality controls, such as data validation rules, and regularly monitoring data for \naccuracy and completeness is key for any organisation.\n\nOne of these controls that can be applied is the **DQ Row Tagging Strategy** so that you not only apply validations on \nyour data to ensure Data Quality, but you also tag your data with the results of the Data Quality validations \nproviding advantages like:\n\n- Transparency for downstream and upstream consumers; \n- Data Observability and Reliability; \n- More trust over the data; \n- Anomaly Detection; \n- Easier and faster discovery of Data Quality problems, and, consequently faster resolution; \n- Makes it easier to deal with integrations with other systems and migrations (you can have validations capturing that a column was changed or simply disappeared);\n\n!!! note\n    When using the DQ Row Tagging approach data availability will take precedence over Data Quality, meaning \n    that all the data will be introduced into the final target (e.g. table or location) no matter what Data Quality\n    issues it is having.\n\nDifferent Types of Expectations:\n\n- Table Level \n- Column Aggregated Level \n- Query Level \n- Column Values (**row level**)\n- Column Pair Value (**row level**)\n- Multicolumn Values (**row level**)\n\nThe expectations highlighted as **row level** will be the ones enabling to Tag failures on specific rows and adding \nthe details about each failure (they affect the field **run_row_result** inside **dq_validations**). The expectations \nwith other levels (not row level) influence the overall result of the Data Quality execution, but won't be used to tag\nspecific rows (they affect the field **run_success** only, so you can even have situations for which you get \n**run_success False** and **run_row_success True** for all rows).\n\n## How does the Strategy work?\n\nThe strategy relies mostly on the 6 below arguments.\n\n!!! note\n    When you specify `\"tag_source_data\": True` the arguments **fail_on_error**, **gx_result_format** and \n    **result_sink_explode** are set to the expected values. \n\n- **unexpected_rows_pk** - the list columns composing the primary key of the source data to use to identify the rows \nfailing the DQ validations. \n- **tbl_to_derive_pk** - `db.table` to automatically derive the unexpected_rows_pk from. \n- **gx_result_format** - great expectations result format. Default: `COMPLETE`. \n- **tag_source_data** - flag to enable the tagging strategy in the source data, adding the information of \nthe DQ results in a column `dq_validations`. This column makes it possible to identify if the DQ run was\nsucceeded in general and, if not, it unlocks the insights to know what specific rows have made the DQ validations\nfail and why. Default: `False`.\n\n!!! 
note\n    It only works if result_sink_explode is `True`, result_format is `COMPLETE` and \n    fail_on_error is `False. \n\n- **fail_on_error** - whether to fail the algorithm if the validations of your data in the DQ process failed. \n- **result_sink_explode** - flag to determine if the output table/location should have the columns exploded (as `True`)\nor not (as `False`). Default: `True`.\n\n!!! note\n    It is mandatory to provide one of the arguments (**unexpected_rows_pk** or **tbl_to_derive_pk**) when using \n    **tag_source_data** as **True**. \n    When **tag_source_data** is **False**, this is not mandatory, but **still recommended**. \n\n<img src=\"../../../assets/img/row_tagging.png?raw=true\" style=\"max-width: 800px; height: auto; \"/>\n\n!!! note\n    The tagging strategy only works when `tag_source_data` is `True`, which automatically\n    assigns the expected values for the parameters `result_sink_explode` (True), `fail_on_error` (False)\n    and `gx_result_format` (\"COMPLETE\").\n\n!!! note\n    For the DQ Row Tagging to work, in addition to configuring the aforementioned arguments in the dq_specs, \n    you will also need to add the **dq_validations** field into your table (your DDL statements, **recommended**) or \n    enable schema evolution.\n\n!!! note\n    Kwargs field is a string, because it can assume different schemas for different expectations and runs. \n    It is useful to provide the complete picture of the **row level failure** and to allow filtering/joining with \n    the result sink table, when there is one. Some examples of kwargs bellow:\n\n    - `{\"column\": \"country\", \"min_value\": 1, \"max_value\": 2, \"batch_id\": \"o723491yyr507ho4nf3\"}` → example for \n    expectations starting with `expect_column_values` (they always make use of \"column\", the other arguments vary). \n    - `{\"column_A: \"country\", \"column_B\": \"city\", \"batch_id\": \"o723491yyr507ho4nf3\"}` → example for expectations \n    starting with `expect_column_pair` (they make use of \"column_A\" and \"column_B\", the other arguments vary). 
\n    - `{\"column_list\": [\"col1\", \"col2\", \"col3\"], \"batch_id\": \"o723491yyr507ho4nf3\"}` → example for expectations \n    starting with `expect_multicolumn` (they make use of \"column_list\", the other arguments vary).\n    `batch_id` is common to all expectations, and it is an identifier for the batch of data being validated by\n    Great Expectations.\n\n### Example\n\nThis scenario uses the row tagging strategy which allow users to tag the rows that failed to be easier to\nidentify the problems in the validations.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"header\": True,\n                \"delimiter\": \"|\",\n                \"inferSchema\": True,\n            },\n            \"location\": \"s3://my_data_product_bucket/dummy_deliveries/\",\n        }\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"dq_validator\",\n            \"input_id\": \"dummy_deliveries_source\",\n            \"dq_type\": \"validator\",\n            \"bucket\": \"my_data_product_bucket\",\n            \"result_sink_db_table\": \"my_database.dq_result_sink\",\n            \"result_sink_location\": \"my_dq_path/dq_result_sink/\",\n            \"tag_source_data\": True,\n            \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n            \"source\": \"deliveries_tag\",\n            \"dq_functions\": [\n                {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"salesorder\"}},\n                {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 25}},\n                {\n                    \"function\": \"expect_column_values_to_be_in_set\",\n                    \"args\": {\"column\": \"salesorder\", \"value_set\": [\"37\"]},\n                },\n                {\n                    \"function\": \"expect_column_pair_a_to_be_smaller_or_equal_than_b\",\n                    \"args\": {\"column_A\": \"salesorder\", \"column_B\": \"delivery_item\"},\n                },\n                {\n                    \"function\": \"expect_multicolumn_sum_to_equal\",\n                    \"args\": {\"column_list\": [\"salesorder\", \"delivery_item\"], \"sum_total\": 100},\n                },\n            ],\n            \"critical_functions\": [\n                {\"function\": \"expect_table_column_count_to_be_between\", \"args\": {\"max_value\": 6}},\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_bronze\",\n            \"input_id\": \"dq_validator\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/\",\n        }\n    ],\n}\n\nload_data(acon=acon)\n```\n\nRunning bellow cell shows the new column created, named `dq_validations` with information about DQ validations.\n`display(spark.read.format(\"delta\").load(\"s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/\"))`\n\n## Performance and Limitations Trade-offs\n\nWhen using the DQ Row Tagging Strategy, by default we are using Great Expectations Result Format \"Complete\" with \nUnexpected Index Column Names (a primary key for the failures), meaning that for each failure, we are getting all \nthe distinct values for 
\n\n## Performance and Limitations Trade-offs\n\nWhen using the DQ Row Tagging Strategy, by default we are using Great Expectations Result Format \"Complete\" with \nUnexpected Index Column Names (a primary key for the failures), meaning that for each failure, we are getting all \nthe distinct values for the primary key. After getting all the failures, we are applying some needed transformations \nand joining them with the source data, so that it can be tagged by filling the \"dq_validations\" column.\n\nHence, this can definitely be a heavy and time-consuming operation on your data loads. To reduce this disadvantage \nyou can cache the dataframe by passing `\"cache_df\": True` in your DQ Specs. In addition to this, always have in \nmind that each expectation (dq_function) you add to your DQ Specs adds more time to your \ndata loads, so always balance performance vs the amount of validations that you need.\n\nMoreover, Great Expectations is currently relying on the driver node to capture the results of the execution and \nreturn/store them. Thus, in case you have huge amounts of rows failing (let's say 500k or more) Great Expectations \nmight raise exceptions.\n\nIn these situations, the data load will still happen and the data will still be tagged with the Data Quality \nvalidations information, however you won't have the complete picture of the failures, so the raised_exceptions \nfield is filled as True, so that you can easily notice it and debug it.\n\nMost of the time, if you have such an amount of rows failing, it will probably mean that something went wrong \nand you want to fix it as soon as possible (you do not really care about tagging specific rows, because you will \nnot want your consumers to be consuming a million defective rows). However, if you still want to try to make it \npass, you can try to increase your driver resources and play with some Spark configurations like:\n\n- `spark.driver.maxResultSize`\n- `spark.task.maxFailures`\n\nFor debugging purposes, you can also use a different [Great Expectations Result Format](\nhttps://docs.greatexpectations.io/docs/reference/expectations/result_format/) like \"SUMMARY\" (adding in your DQ Spec\n`\"gx_result_format\": \"SUMMARY\"`), so that you get only a partial list of the failures, avoiding surpassing the driver\ncapacity. \n\n!!! note\n    When using a Result Format different from the default (\"COMPLETE\"), the flag \"tag_source_data\" will be \n    overwritten to `False`, as the results of the tagging wouldn't be complete, which could lead to erroneous \n    conclusions from stakeholders (but you can always get the details about the result of the DQ execution in\n    the `result_sink_location` or `result_sink_db_table` that you have configured).\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/validations_failing/__init__.py",
    "content": "\"\"\"\n.. include::validations_failing.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/data_quality/validations_failing/validations_failing.md",
    "content": "# Validations Failing\n\nThe scenarios presented on this page are similar, but their goal is to show what happens when a DQ expectation fails the validations.\nThe logs generated by the execution of the code will contain information regarding which expectation(s) have failed and why.\n\n## 1. Fail on Error\nIn this scenario is specified below two parameters:\n\n- `\"fail_on_error\": False` - this parameter is what controls what happens if a DQ expectation fails. In case\nthis is set to `true` (default), your job will fail/be aborted and an exception will be raised.\nIn case this is set to `false, a log message will be printed about the error (as shown in this\nscenario) and the result status will also be available in result sink (if configured) and in the\n[data docs great expectation site](../data_quality.html#3-data-docs-website). On this scenario it is set to `false` \nto avoid failing the execution of the notebook.\n- the `max_value` of the function `expect_table_column_count_to_be_between` is defined with specific value so that\nthis expectation fails the validations.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"header\": True,\n                \"delimiter\": \"|\",\n                \"inferSchema\": True,\n            },\n            \"location\": \"s3://my_data_product_bucket/dummy_deliveries/\",\n        }\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"dq_validator\",\n            \"input_id\": \"dummy_deliveries_source\",\n            \"dq_type\": \"validator\",\n            \"bucket\": \"my_data_product_bucket\",\n            \"result_sink_db_table\": \"my_database.dq_result_sink\",\n            \"result_sink_location\": \"my_dq_path/dq_result_sink/\",\n            \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n            \"source\": \"deliveries_fail\",\n            \"fail_on_error\": False,\n            \"dq_functions\": [\n                {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"salesorder\"}},\n                {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 20}},\n                {\"function\": \"expect_table_column_count_to_be_between\", \"args\": {\"max_value\": 5}},\n                {\"function\": \"expect_column_values_to_be_null\", \"args\": {\"column\": \"article\"}},\n                {\"function\": \"expect_column_values_to_be_unique\", \"args\": {\"column\": \"status\"}},\n                {\n                    \"function\": \"expect_column_min_to_be_between\",\n                    \"args\": {\"column\": \"delivery_item\", \"min_value\": 1, \"max_value\": 15},\n                },\n                {\n                    \"function\": \"expect_column_max_to_be_between\",\n                    \"args\": {\"column\": \"delivery_item\", \"min_value\": 15, \"max_value\": 30},\n                },\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_bronze\",\n            \"input_id\": \"dq_validator\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/\",\n        }\n    ],\n}\n\nload_data(acon=acon)\n```\n\nIf you run bellow 
command, you would be able to see the `success` column has the value `false`\nfor the last execution.\n`display(spark.table(RENDER_UTILS.render_content(\"my_database.dq_result_sink\")))`\n\n## 2. Critical Functions\nIn this scenario, alternative parameters to `fail_on_error` are used:\n\n- `critical_functions` - this parameter defaults to `None` if not defined.\nIt controls what DQ functions are considered a priority and as such, it stops the validation\nand throws an execution error whenever a function defined as critical doesn't pass the test.\nIf any other function that is not defined in this parameter fails, an error message is printed in the logs.\nThis parameter has priority over `fail_on_error`.\nIn this specific example, after defining the `expect_table_column_count_to_be_between` as critical,\nit is made sure that the execution is stopped whenever the conditions for the function are not met.\n\nAdditionally, it can also be defined additional parameters like:\n\n- `max_percentage_failure` - this parameter defaults to `None` if not defined.\nIt controls what percentage of the total functions can fail without stopping the execution of the validation.\nIf the threshold is surpassed the execution stops and a failure error is thrown.\nThis parameter has priority over `fail_on_error` and `critical_functions`.\n\nYou can also pair `critical_functions` with `max_percentage_failure` by defining something like\na 0.6 max percentage of failure and also defining some critical function.\nIn this case even if the threshold is respected, the list defined on `critical_functions` still is checked.\n\n```python\nfrom lakehouse_engine.engine import load_data\n\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"header\": True,\n                \"delimiter\": \"|\",\n                \"inferSchema\": True,\n            },\n            \"location\": \"s3://my_data_product_bucket/dummy_deliveries/\",\n        }\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"dq_validator\",\n            \"input_id\": \"dummy_deliveries_source\",\n            \"dq_type\": \"validator\",\n            \"bucket\": \"my_data_product_bucket\",\n            \"result_sink_db_table\": \"my_database.dq_result_sink\",\n            \"result_sink_location\": \"my_dq_path/dq_result_sink/\",\n            \"source\": \"deliveries_critical\",\n            \"tbl_to_derive_pk\": \"my_database.dummy_deliveries\",\n            \"dq_functions\": [\n                {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"salesorder\"}},\n                {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 25}},\n            ],\n            \"critical_functions\": [\n                {\"function\": \"expect_table_column_count_to_be_between\", \"args\": {\"max_value\": 5}},\n            ],\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"dummy_deliveries_bronze\",\n            \"input_id\": \"dq_validator\",\n            \"write_type\": \"overwrite\",\n            \"data_format\": \"delta\",\n            \"location\": \"s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/\",\n        }\n    ],\n}\n\nload_data(acon=acon)\n```"
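\n\nAs an illustration of pairing both parameters, the `dq_specs` entry could look like the following (a sketch only; values are illustrative and follow the examples above):\n\n```python\n\"dq_specs\": [\n    {\n        \"spec_id\": \"dq_validator\",\n        \"input_id\": \"dummy_deliveries_source\",\n        \"dq_type\": \"validator\",\n        \"bucket\": \"my_data_product_bucket\",\n        # Tolerate failures in up to 60% of the dq_functions...\n        \"max_percentage_failure\": 0.6,\n        \"dq_functions\": [\n            {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"salesorder\"}},\n            {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 15, \"max_value\": 25}},\n        ],\n        # ...but always stop the execution if this critical expectation fails.\n        \"critical_functions\": [\n            {\"function\": \"expect_table_column_count_to_be_between\", \"args\": {\"max_value\": 5}},\n        ],\n    }\n],\n```"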
  },
  {
    "path": "lakehouse_engine_usage/gab/__init__.py",
    "content": "\"\"\"\n.. include::gab.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/gab/gab.md",
    "content": "# GAB - Gold Asset Builder\n\nGAB stands for Gold Asset Builder and, technically, it is a SQL-first transformation workflow that allows teams to quickly and collaboratively deploy aggregate tables on top of base fact tables, which can then be used for empowering analytics over different perspectives on dashboards or exploratory queries.\n\nGAB provides the following benefits:\n\n- **Efficiency and speed**: It reduces the efforts and time to production for new aggregate tables (gold layer assets).\n- **Simple operation**: It simplifies the cluster decision by having just 3 cluster types (small, medium, large), there's no need to create a separated pipeline for each case. These cluster types are tied to the concept of workload priority in GAB (more on that later).\n- **Low-code first:** Focus on low-code aggregation configuration with capabilities to also orchestrate complex SQL.\n\n!!! warning\n    Before deciding whether your use case can be supported by GAB or not, read the instructions in the sections below carefully. If there is any doubt about certain metrics which might deviate from the realm of GAB, reach out to us before starting your development and we will support you. GAB may not be a one size fit for all your requirements, so use GAB only if it satisfies your requirements.\n\n<img src=\"../../assets/gab/img/gab_overview.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n## Main Advantages over Self-Orchestrated SQL\n\n- More flexibility to define any type of complex sql queries.\n- Only need to touch SQL, GAB takes care of all its orchestration.\n- Quick production rollout, adaptability and maintainability, without the need to define any complex aggregation orchestration, rerun logic, monitoring, etc.\n- Inner-sourcing model really works, as a data analyst can work on a SQL template and hand it over to the data engineering team, which can then adapt that SQL template and take it to production quickly after the data validation.\n- As shown in the image below, it's possible to generate different perspectives (dimensions - D1, D2, D3...) of different metrics (M1, M2, M3) for a specific use case:\n    1. **Grouping Set (dimensions D1, D2)** - Compute the same metrics at a higher grain from the finest grain.\n    2. **Grouping Set (dimensions D1, D2, D3)** - Compute the same metrics at the finest grain.\n    3. 
**Grouping Set (dimensions D1)** - Compute the same metrics at a higher grain.\n\n    | D1                | D2     | D3     | M1     | M2     | M3     |\n    | :------           | :-----:| :-----:| :-----:| :-----:| :-----:|\n    | value 1           | value 2| NULL   | 22     | 45     | 54     |\n    | value 1           | value 2| value 3| 89     | 12     | 47     |\n    | value 1           | NULL   | NULL   | 45     | 57     | 12     |\n\n## When to use GAB?\n\n- When an aggregate result, constructed using SQL, is to be created for different levels of detail (AKA different grains) supporting analytics on dashboards or exploratory queries with some specific dimensions and metrics.\n- When metrics and dimensions are bound to configured *DAY, WEEK, MONTH, QUARTER, YEAR* cadences and you are not calculating the whole universe of data in your SQL query (e.g., you're looking back or forward on a specific time interval).\n\n## When not to use GAB?\n\n- When metrics and dimensions are not bound to *DAY, WEEK, MONTH, QUARTER, YEAR* cadences.\n- When your result is not an aggregated result, i.e., the resulting table is at the transaction grain.\n- If your start and end dates for the time interval include dates into the future.\n  - !!! warning\n        This is for now a current limitation in the GAB engine codebase (`if new_end_date >= current_date: new_end_date = current_date`) that would require further testing to ensure it can be relaxed.\n- If your metrics are not calculated incrementally, you should consider the tradeoff of using GAB vs just writing a very simple \"full load\" SQL code that computes the all universe of data all the time. \n  - !!! note\n        However, if the computation is not very intensive, the orchestration/automation that comes with GAB out of the box can actually provide you value. Moreover, even if the metrics are not computed incrementally, you can collect all the automation benefits from GAB and use a time filter in your SQL statements in GAB. You can take that into consideration for your use case.\n\n## GAB Concepts and Features\n\n### Cadence\n\nIn which time grain you want the data to be aggregated: DAILY, WEEKLY, MONTHLY, QUARTERLY, YEARLY. The internal dynamics with the CADENCE concept in GAB heavily rely on an automatically generated dimension calendar for GAB's internal usage.\n\n```python\n{'DAY':{},'WEEK':{},'MONTH':{},'YEAR':{}}\n```\n\n### Dimensions & Metrics\n\n#### Dimensions\n\nIt's just a regular dimension according to the OLAP concept. It will be used to aggregate the metrics, example: `product_category`. Usually it is directly mapped from the source tables without any transformation.\n\n#### Metrics\n\nAggregated value at the dimension level. As part of the dimensions, GAB has an automatically generated calendar dimension at different grains (more on that below).\n\nThere are some options to compute a metric:\n\n- **Using SQL to directly** query and aggregate a source table column. Example: `sum(product_amount)`\n- Compute it in the same cadence, but in **CADENCE - 1 time window**. Example: In a `MONTHLY` cadence it will compute for the previous month.\n- Compute it in the same cadence, but using **last year's reference value**.  Example: In a `QUARTERLY` cadence it will compute it in the same quarter but from the previous year.\n- Compute it in the same cadence, but with a **custom window function**. 
Example: In a `QUARTER` cadence computing the last 2 quarters.\n- Compute it **using any SQL function**, using any of the available columns, deriving a metric from another, etc. Example: compute a metric by multiplying it by 0.56 for the last 6 months of data.\n\n!!! note\n    Each computation derives a [new column on the output view](step_by_step/step_by_step.md#use-case-configuration-using-the-query_builder_helper).\n\n### Extended Window Calculator, Reconciliation & Snapshotting\n\n#### Extended Window Calculator\n\nThis feature aims to calculate the extended window of any cadence despite the user providing custom dates which are not the exact start and end dates of a cadence.\n\nFor example, if the user wants to calculate the `MONTH` cadence but gives a date range of `2023-01-10` to `2023-01-29`, which is not exactly the start and/or end of the month, the computation window will be extended/adjusted to `2023-01-01`-`2023-01-31`, i.e., including the complete month. This ensures that GAB automatically handles any user error to efficiently integrate the complete data of the selected cadence.\n\n#### Reconciliation\n\nThe concept of Extended Window Calculator is intertwined with the concept of Reconciliation. These enable the user to compute the data aggregated by the specified cadence, but leveraging 1) *\"cadence to date\"* calculations; or 2) reconciliation of the data to take late events into account.\n\n##### \"*Cadence to Date*\" Calculations\n\nFor example, there can be a use case where the cadence is `WEEKLY`, but we want the aggregated data with a `DAILY` frequency, so configuring the reconciliation window to be `DAILY` will compute the data on a `WEEK TO DATE` basis. In a case where the first day of the week is Monday, on Monday it will have the data just for Monday; on Tuesday the computation of Monday + Tuesday; on Wednesday the results for Monday + Tuesday + Wednesday; and so on, until the end of the week. That example would be configured as follows:\n\n```python\n{'WEEK': {'recon_window': {'DAY'}}}\n```\n\n##### Reconcile the Data to Account for Late Events\n\nAnother example can be if we consider WEEK cadence with reconciliation MONTH and QUARTER enabled (`{'WEEK': {'recon_window': ['MONTH', 'QUARTER']}}`). What this means is, at the start of a new month or a quarter, all the weeks that still belong to that month or that quarter are recalculated to consider the late events. For example, `2023-01-01` is the start of a month, quarter and a year. In this example, since month and quarter are given, and quarter is the higher grain among the two, all the weeks in Q4/22 (using the extended window explained above) are recalculated, i.e. instead of `2022-10-01` to `2022-12-31`, the extended window to consider in the current GAB execution is `2022-09-26` to `2023-01-01`. This is true because the first day of Q1/23 was on a Sunday of the last week of Q4/22, and once we execute GAB on 01/01/2023, we are reconciling all the weeks of Q4/22, hence weekly cadence with quarterly reconciliation.\n\nYou can find in the image below other illustrative examples of how the extended window and the reconciliation concept work together. In the first example, GAB will always extend the processing window and reconcile the results for all the weeks (yellow color) involved in that month (green color). In the second example, GAB will always extend the processing window and reconcile the results for all the months (yellow color) involved in the year (note that green color is quarter, not year, but since year is a higher grain than quarter GAB extends the window and reconciles the results for all the months involved in the year, not only the quarter).\n\n<img src=\"../../assets/gab/img/gab_extended_window_calculator.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n### Snapshot\n\nIt creates a snapshot of the data on a specified cadence. For example: in a case where we have a `MONTHLY` cadence and snapshot enabled on a `DAILY` basis, we are going to compute the aggregates for each day in the month:\n\n```python\n{'MONTH': {'recon_window': {'DAY': {'snapshot': 'Y'}}}}\n```\n\nThis is possible with the template column `{{ to_date }}`, which will tell us the end date of the snapshot.\n\nIn the version without snapshot, there will be one record for the *MONTH* cadence, but when we enable the above configuration the number of entries for the *MONTH* cadence will be the same as the number of days in the month.\n\nThis means there will be a separate entry for each day of the month, which enables comparing the data to the previous year on the same day from the start of the month.\n\n!!! note\n    The snapshot feature will always write the snapshot entry for the given period (start date and end date), meaning if you have runs that overlap each other but for a different period (e.g., same start date but different end date) it will not rewrite past snapshot entries.\n\nThe above configuration is just an example, and the snapshot can be enabled on any combination of cadences:\n\n```python\n{'QUARTER': {'recon_window': {'WEEK': {'snapshot': 'Y'}}}}\n{'YEAR': {'recon_window': {'MONTH': {'snapshot': 'Y'}}}}\n{'MONTH': {'recon_window': {'WEEK': {'snapshot': 'Y'}}}}\n```
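\n\nAs an illustration, these snippets can be combined into a single cadence configuration for a use case needing, for example, plain daily aggregates, weekly aggregates reconciled monthly, and monthly aggregates with daily snapshots (a sketch only, following the dictionary format shown on this page):\n\n```python\n{\n    'DAY': {},\n    'WEEK': {'recon_window': {'MONTH': {}}},\n    'MONTH': {'recon_window': {'DAY': {'snapshot': 'Y'}}},\n}\n```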
\n\n## Next Steps\n\nIf you are interested in using GAB, you can check our [step-by-step documentation](step_by_step/step_by_step.md) that aims to help with the use case configuration and make it easier to use GAB.\n\n## FAQ\n\n### Can we ensure past snapshots are not changed?\n\nWhen we use the snapshots feature, taking monthly cadence with daily reconciliation as an example, the number of entries for the *MONTH* cadence will be the same as the number of days in the month, because every day GAB will generate a snapshot of that month, providing a cumulative picture of the month throughout the several days. In this way, snapshots are immutable.\n\nThere may be cases where the date that you want to control the snapshots with is different from the cadence date in GAB, and in this case you will have to inject custom snapshot gathering logic in your GAB SQL templates and potentially play around with GAB's filter date to achieve what you want, because as of now, GAB relies on the cadence date to control the snapshot logic.\n\n### How exactly does `lookback_window` work?\n\nSometimes, `lookback_days` in the [GAB execution notebook](../../assets/gab/notebooks/gab.py) and `lookback_window` get confused. `lookback_window` is only used when you define derived metrics that use window functions (check the [step-by-step documentation](step_by_step/step_by_step.md)), and it is used to configure the window. On the other hand, `lookback_days` is only part of the [GAB execution notebook](../../assets/gab/notebooks/gab.py) and modifies the provided `start_date` so that it considers `lookback_days` before that.\n\n### Can I use GAB with cadence dates in the future?\n\nAs mentioned in the [\"When not to use GAB?\"](#when-not-to-use-gab) section, this is currently not supported.\n\n### What is the purpose of the `rerun` flag?\n\nIf you run GAB for the same start date and end date as it was run before, without the *rerun* flag, GAB will ignore the execution based on the `gab_events_log` table. The *rerun* flag ensures we can force such re-execution.\n\n### Does my data product need to use a star schema (fact table and dimension tables) to use GAB?\n\nNo, GAB can be used regardless of the underlying data model, as you should prepare your data with templated SQL (that can be as simple or as complex as your use case) before feeding it to the GAB execution engine.\n"
  },
  {
    "path": "lakehouse_engine_usage/gab/step_by_step/__init__.py",
    "content": "\"\"\"\n.. include::step_by_step.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/gab/step_by_step/step_by_step.md",
    "content": "# GAB Step-by-Step\n\n!!! note\n    Requirements: Lakehouse engine: 1.20.0+\n\n## 1. Setup Data Product based on Templated Files\n\n- Copy GAB assets from the templated files to your data product:\n  - GAB Tables:\n    - [Calendar table - dim_calendar](../../../assets/gab/metadata/tables/dim_calendar.sql)\n    - [Use case configuration table - lkp_query_builder](../../../assets/gab/metadata/tables/lkp_query_builder.sql)\n    - [Unified data table - gab_use_case_results](../../../assets/gab/metadata/tables/gab_use_case_results.sql)\n    - [GAB log events table - gab_log_events](../../../assets/gab/metadata/tables/gab_log_events.sql)\n  - GAB Notebooks:\n    - [Feed Calendar table - gab_dim_calendar](../../../assets/gab/notebooks/gab_dim_calendar.py)\n    - [Use case creation - query_builder_helper](../../../assets/gab/notebooks/query_builder_helper.py)\n    - [GAB execution - gab](../../../assets/gab/notebooks/gab.py)\n    - [GAB job manager - gab_job_manager](../../../assets/gab/notebooks/gab_job_manager.py)\n\n## 2. Set up the Use Case\n\n### 2.1. Create the SQL Template Files\n\nStart by writing the SQL code for your use case. Here's an example where you will find several available placeholders (more on that below):\n\n```sql\nSELECT\n    {% if replace_offset_value == 0 %} {{ project_date_column }} {% else %} ({{ project_date_column }} + interval '{{offset_value}}' hour) {% endif %} AS order_date,  # date aggregation: computed cadence start date\n    {{ to_date }} AS to_date,  # date aggregation: last day of the cadence or of the snapshot if enabled\n    b.category_name,\n    COUNT(a.article_id) qty_articles,\n    SUM(amount) total_amount\nFROM\n    {{ database }}.dummy_sales_kpi a  # source database\n    {{ joins }}  # calendar table join: used to compute the cadence start and end date\nLEFT JOIN\n    article_categories b ON a.article_id = b.article_id\nWHERE\n    {{ partition_filter }}  # filter: partition filter\nAND\n    TO_DATE({{ filter_date_column }}, 'yyyyMMdd') >= (\n        '{{ start_date }}' + interval '{{ offset_value }}' hour\n    )  # filter by date column configured in the use case for this file and timezone shift\nAND\n    TO_DATE({{ filter_date_column }}, 'yyyyMMdd') < (\n        '{{ end_date }}' + interval '{{ offset_value }}' hour\n    )  # filter by date column configured in the use case for this file and timezone shift\nGROUP BY 1,2,3\n```\n\n#### Available SQL Template Placeholders\n\nYou can use placeholders in your SQL queries to have them replaced at runtime by the GAB engine. There are several available placeholders that will be listed in this section.\n\n!!! warning\n    The placeholder value will always be [injected as per the configurations of the use cases](#use-case-configuration-using-the-query_builder_helper) in the [lkp_query_builder table](../../../assets/gab/metadata/tables/lkp_query_builder.sql).\n\n##### Reference Dates\n\n- *Start and End Dates*:\n    - `{{ start_date }}` and `{{ end_date }}` are the dates that control the time window of the current GAB execution. These can be used to execute GAB on a certain schedule and have it incrementally compute the aggregated metrics. These dates are fundamental to control GAB executions and will be provided as arguments in the GAB notebook.\n\n      - !!! warning\n            Currently only past and present dates are supported. 
Future dates are not supported.\n\n- *Project Date*:\n    - `{{ project_date_column }}` is the reference date used to compute the cadences and the extended window (together with `{{ start_date }}` and `{{ end_date }}`).\n\n        ```python\n        {% if replace_offset_value == 0 %} {{ project_date_column }}\n        {% else %} ({{ project_date_column }} + interval '{{offset_value}}' hour)\n        {% endif %}\n        ```\n\n        - !!! note\n              The `replace_offset_value` flag instructs GAB to either use the `{{ project_date_column }}` directly or shift it to the specified timezone according to the `offset_value` provided in the configured use case.\n\n- *To Date*:\n    - `{{ to_date }}` is the last date of the cadence if snapshots are disabled or, if snapshots are enabled, the snapshot end date.\n\n##### Filter Placeholders\n\n- `{{ partition_filter }}` is the expression to filter the data according to a date partitioning scheme (year/month/day). It replaces the placeholder with a filter like `year = **** and month = ** and day = **`:\n    - !!! warning\n          If your table does not have the year, month and day columns, you should not add this placeholder.\n- `{{ filter_date_column }}` and `{{offset_value}}` can be used to filter the data processed by your use case to the specified time range:\n  \n    ```python\n    {{ filter_date_column }} >= ('{{ start_date }}' + interval '{{offset_value}}' hour) AND {{ filter_date_column }} < ('{{ end_date}}' + interval '{{offset_value}}' hour)\n    ```\n\n##### Source Database\n\nWhere the data comes from: `{{ database }}`.\n\n##### Dim Calendar join\n\nRepresented by the `{{ joins }}` placeholder.\n\n!!! warning\n    It is mandatory! It can be added after any of the table names in the `from` statement. The framework renders these `joins` with an internal calendar join and populates the `to_date` and the `project_date_column` as per the configured cadences.\n\n#### Combining Multiple SQL Template Files for a Use Case\n\nFor each use case, you can have just one SQL file or multiple SQL files that depend on each other and need to be executed in a specific order.\n\n##### If there's just one SQL file for the use case\n\nThe file name should start with 1_. Example: 1_xxxx.sql.\n\n##### When the use case has several SQL files\n\nThe different files represent different intermediate stages/temp tables in the GAB execution of the use case. Create the SQL files according to the sequence order (as shown in the image below), plus a final combined script, for example:\n\n<img src=\"../../../assets/gab/img/gab_sample_templated_query.png\" alt=\"image\" width=\"auto\" height=\"auto\">\n\n!!! note\n    We suggest using the folder **metadata/gab** as the SQL use case folder, but this is a parametrized property that you can override with the property [gab_base_path in the GAB notebook](../../../assets/gab/notebooks/gab.py). This property is used in the [GAB Job Manager](../../../assets/gab/notebooks/gab_job_manager.py) as well.\n\n### 2.2. Configure the Use Case using the Query Builder Helper Notebook\n\nGAB pulls the information/configuration needed to execute the process from **`lkp_query_builder`**. To help you with this task you can use the [query_builder_helper notebook](../../../assets/gab/notebooks/query_builder_helper.py). In this section, we will go step-by-step through the notebook instructions to configure a use case.
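\n\nIf you want to inspect what is already configured for a use case, you can also query the table directly. Here is a minimal example, assuming the `example_database` database and the use case name used throughout this guide:\n\n```sql\nSELECT * FROM example_database.lkp_query_builder WHERE QUERY_LABEL = 'f_agg_dummy_sales_kpi';\n```\n\n#### 2.2.1. 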
General Configuration\n\n<img src=\"../../../assets/gab/img/gab_use_case.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n| Variable | Default Value | Description |\n|---|---|---|\n| **Complexity** | Low | Defines the complexity of your use case.<br />You should mainly consider the volume of the data or the complexity of the SQL potentially generating a high load.<br />Possible values: **Low**, **Medium** and **High**. These values are used in GAB's orchestration, i.e., by the [GAB job manager - gab_job_manager](../../../assets/gab/notebooks/gab_job_manager.py), which uses them to define the job cluster size/type based on the complexity of the query. |\n| **Database Name** | example_database | Refers to the name of the development environment database where the **lkp_query_builder** table resides.<br />This parameter is used at the end of the notebook to insert data into the **lkp_query_builder** table. |\n| **How many dimensions** | 1 | Number of dimension columns expected in the use case.<br />**Note: Do not consider the `project_date_column` or metrics**, as they have their own parameters. |\n| **How many views** | 1 | Defines how many output views to generate for the use case. It's possible to have as many as the use case needs.<br />All views will have the same structure (dimensions and metrics); the only difference you can specify between the views is the `view filter`.<br />**Default value is 1.**<br />**Note**: This configuration has a direct impact on the `3. Configure View Name and Filters` configuration. |\n| **Is Active** | Y | Flag to make the use case active or not.<br />**Default value is Y**. |\n| **Market** | GLOBAL | Used in the **gab_job_manager** to execute the use cases for each **market**. If your business does not have the concept of Market, you can leave the `GLOBAL` default. |\n| **SQL File Names** | 1_article_category.sql,<br />2_f_agg_dummy_sales_kpi.sql | Names of the SQL files used in the use case, according to what you have configured in ***step 2.1***.<br />You can combine different layers of dependencies between them as shown in the example above, where the **2_combined.sql** file depends on the **1_product_category.sql** file.<br />The file name should follow the pattern x_file_name (where x is an integer digit) and the names should be separated by a comma (e.g.: 1_first_query.sql, 2_second_query.sql). |\n| **Snapshot End Date** | to_date | This parameter is used in the template; by default its value must be ***to_date***.<br />You can change it if you have managed this in your SQL files.<br />The values stored in this column depend on the use case behavior:<br /><ul><li>If snapshots are enabled, it will contain the snapshot end date.</li><li>If no snapshot is enabled, it will contain the last date of the cadence.</li></ul>The snapshot behavior is set in the reconciliation steps (more on that later). |\n| **Timezone Offset** | 0 | The timezone offset that you want to apply to the date columns (`project_date_column` or `filter_date_column`).<br />It should be a number to decrement from or add to the date (e.g., -8 or 8).<br />**The default value is 0**, which means that, by default, no timezone transformation will be applied to the date. |\n| **Use Case Name** | f_agg_dummy_sales_kpi | Name of the use case.<br />The suggestion is to use lowercase alphanumeric characters and underscores. |\n| **Use Case Reference Date** | order_date | Reference date of the use case, i.e., the `project_date_column`.<br />The parameter should be the column name and the selected column should have the date/datetime format. |\n| **Week Start** | MONDAY | The start of the business week of the use case.<br />Possible values: **SUNDAY** or **MONDAY**.\n\n#### 2.2.2. Configure Dimension Names\n\n<img src=\"../../../assets/gab/img/gab_dimensions.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n#### 2.2.3. Configure View Name and Filters\n\nThis will be the name of the output view at the end of the process. Filters can be applied at this step, if needed.\n\n<img src=\"../../../assets/gab/img/gab_view_name_and_filters.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n| Variable | Default Value | Description |\n|---|---|---|\n| **View Filter** |  | A SQL *WHERE* clause expression based on the dimensions defined in the previous step.<br />**Example**: if you have set the country as `D1`, the filter here could be **D1 = \"Germany\"**. The syntax allowed here is the same as the syntax of the *WHERE* clause in SQL. |\n| **View Name** | vw_f_agg_dummy_sales_kpi | Name of the view to query the resulting aggregated data. This will contain the results produced by GAB for the configured use case.\n\n#### 2.2.4. Configure the Cadence, Reconciliation and Snapshot\n\nThis step is where we define the cadence displayed in the view.\n\n<img src=\"../../../assets/gab/img/gab_recon.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n| Variable | Default Value | Description |\n|---|---|---|\n| **Reconciliation Cadence** | YEAR | Compute the data aggregated by the specified cadence, optionally defined with reconciliation and snapshotting.<br />[Check more about it here](../gab.md#reconciliation). |\n\n#### 2.2.5. 
Configure Metrics\n\nThe first question to ask regarding metrics is how many metrics you have in your SQL use case query. In our template we have two metrics (`qty_articles` and `total_amount`).\n\n<img src=\"../../../assets/gab/img/gab_query_metrics.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n<img src=\"../../../assets/gab/img/gab_metrics.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\nNext, we will define whether we want GAB to create secondary calculations for us based on the metric name.\n\n!!! warning\n    Metrics should follow the same order as defined in the SQL use case query.\n\n<img src=\"../../../assets/gab/img/gab_metrics_configuration.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n| Variable | Description |\n|---|---|\n| [**Calculated Metric**](../gab.md#Metrics) | It's possible to derive 4 new columns (secondary calculations) based on each metric.<br />Those new columns will be based on cadences like ***last_cadence***, ***last_year_cadence*** and ***window function***.<br />Moreover, you can create a derived column, which is a custom SQL statement that you can write by selecting the ***derived_metric*** option. |\n| **Metric Name** | Name of the base metric. It should have the same name as in the SQL use case query in the SQL template files defined previously. |\n\nAfter that, you configure the secondary calculations themselves.\n\n<img src=\"../../../assets/gab/img/gab_metrics_calculations.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n| Variable | Description |\n|---|---|\n| **derived_metric.Formula** | Formula to calculate the metric, referring to any of the previously configured metrics by their **Metric Name**.<br />**Example**: `total_amount*0.56` |\n| **derived_metric.Label** | Name of the metric generated by ***derived_metric***. |\n| **last_cadence.Label** | Name of the metric generated by ***last_cadence***. |\n| **last_cadence.Window** | Cadence lookback window. In this example it means a lookback to the previous year (as the use case is on a **YEARLY** cadence). |\n| **window_function.Agg Func** | SQL function to calculate the metric.<br />Possible values: ***sum***, ***avg***, ***max***, ***min***, ***count*** |\n| **window_function.Label** | Name of the metric generated by ***window_function***. |\n| **window_function.Window Interval** | Window interval to use in the metric generation. \n\n#### 2.2.6. Configure Stages\n\nStages are related to each SQL file in the use case.\n\n<img src=\"../../../assets/gab/img/gab_stages_configuration.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n| Variable | Description |\n|---|---|\n| **Filter Date Column** | It will be used to filter the data of your use case.<br />This information replaces the `{{ filter_date_column }}` placeholder in the GAB template. |\n| **Project Date Column** | It will be used as the reference date for the given query.<br />This information replaces the `{{ project_date_column }}` placeholder in the GAB template. |\n| **Repartition Type** | Type of repartitioning of the data of the query.<br />Possible values: ***Key*** and ***Number***.<br />When you use Key, it expects column names separated by a comma.<br />When you use Number, it expects an integer with the number of partitions the user wants. |\n| **Repartition Value** | This parameter only has effect when used with the **Repartition Type** parameter.<br />It sets the value for the repartition type selected in the parameter above. |\n| **Storage Level** | Defines the Spark persistence storage level you want (e.g. ***Memory Only***, ***Memory and Disk***, etc.). |\n| **Table Alias** | The alias of the SQL file that will be executed. This name can be used to consume the output of a SQL stage (corresponding to a SQL file) in the next stage (the next SQL file).
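\n\nAs an illustration of the **Table Alias** parameter, the SQL template from ***step 2.1*** consumes the output of the first stage (***1_article_category.sql***, configured with the alias `article_categories`) in the second stage simply by referencing that alias as if it were a table:\n\n```sql\n-- Excerpt from the second SQL file of the example use case:\n-- the output of stage 1 is referenced through its table alias.\nLEFT JOIN\n    article_categories b ON a.article_id = b.article_id\n```\n\n#### 2.2.7. 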
Build and Execute the SQL Commands to populate the lkp_query_builder Table\n\n<img src=\"../../../assets/gab/img/gab_build_insert_sql_instruction.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n<img src=\"../../../assets/gab/img/gab_insert_use_case.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\nAfter configuring the use case, it would generate a SQL command to create it on the `lkp_query_builder`:\n\n```sql\nDELETE FROM example_database.lkp_query_builder WHERE QUERY_LABEL = 'f_agg_dummy_sales_kpi';\nINSERT INTO example_database.lkp_query_builder VALUES (\n  1,\n  'f_agg_dummy_sales_kpi',\n  'GLOBAL',\n  \"\"\"{\n    'vw_f_agg_dummy_sales_kpi': {\n      'dimensions': {\n        'from_date': 'order_date',\n        'to_date': 'to_date',\n        'd1': 'category_name'\n      },\n      'metric': {\n        'm1': {\n          'metric_name': 'qty_articles',\n          'calculated_metric': {},\n          'derived_metric': {}\n        },\n        'm2': {\n          'metric_name': 'total_amount',\n            'calculated_metric': {\n              'last_cadence': [\n                {\n                  'label': 'total_amount_last_year',\n                  'window': '1'\n                }\n              ],\n              'window_function': [\n                {\n                  'label': 'avg_total_amount_last_2_years',\n                  'window': [2, 1],\n                  'agg_func': 'avg'\n                }\n              ]\n            },\n            'derived_metric': [\n              {\n                'label': 'discounted_total_amount',\n                'formula': 'total_amount*0.56'\n              }\n            ]\n          }\n        },\n      'filter': {}\n    }\n  }\"\"\",\n  \"\"\"{\n    '1': {\n        'file_path': 'f_agg_dummy_sales_kpi/1_article_category.sql',\n        'table_alias': 'article_categories',\n        'storage_level': 'MEMORY_ONLY',\n        'project_date_column': '',\n        'filter_date_column': '',\n        'repartition': {}\n    },\n    '2': {\n        'file_path': 'f_agg_dummy_sales_kpi/2_f_agg_dummy_sales_kpi.sql',\n        'table_alias': 'dummy_sales_kpi',\n        'storage_level': 'MEMORY_ONLY',\n        'project_date_column': 'order_date',\n        'filter_date_column': 'order_date',\n        'repartition': {}\n    }\n  }\"\"\",\n  \"\"\"{'YEAR': {}}\"\"\",\n  '0',\n  'MONDAY',\n  'Y',\n  'Low',\n  current_timestamp()\n)\n```\n\n## 3. Use case execution\n\nAfter the initial setup and adding your use case to the ***lkp_query_builder*** you can schedule the [gab_job_manager](../../../assets/gab/notebooks/gab_job_manager.py) to manage the use case execution in any schedule you want.\n\nYou can repeat these steps for each use case you have.\n\n## 4. Consuming the data\n\nThe data is available in the view you specified as output from the use case in ***step 2***, so you can normally consume the view as you would consume any other data asset (e.g., Report, Dashboard, ML model, Data Pipeline).\n"
  },
  {
    "path": "lakehouse_engine_usage/lakehouse_engine_usage.md",
    "content": "# How to use the Lakehouse Engine?\n\nLakehouse engine usage examples for all the algorithms and other core functionalities.\n\n- [Data Loader](data_loader/data_loader.md)\n- [Data Quality](data_quality/data_quality.md)\n- [Reconciliator](reconciliator/reconciliator.md)\n- [Sensors](sensors/sensors.md)\n- [GAB](gab/gab.md)"
  },
  {
    "path": "lakehouse_engine_usage/managerhelper/managerhelper.md",
    "content": "# Table and File Manager Operations Generator\n\nGenerate JSON configurations for TableManager and FileManager operations with an interactive form.\n\n<link rel=\"stylesheet\" href=\"https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css\">\n<link rel=\"stylesheet\" href=\"../managerhelper/styles-mkdocs.css\">\n<link rel=\"stylesheet\" href=\"../managerhelper/operations-styles-mkdocs.css\">\n\n<div markdown=\"0\" class=\"managerhelper-wrapper\">\n\n    <!-- Navigation Tabs -->\n    <nav class=\"tabs\">\n        <button class=\"tab-button active\" data-tab=\"table-manager\">\n            <i class=\"fas fa-table\"></i>\n            Table Manager\n        </button>\n        <button class=\"tab-button\" data-tab=\"file-manager\">\n            <i class=\"fas fa-folder\"></i>\n            File Manager\n        </button>\n    </nav>\n\n    <!-- Operations Container -->\n    <main class=\"operations-container\">\n        <!-- Table Manager Tab -->\n        <div id=\"table-manager\" class=\"tab-content active\">\n            <div class=\"section\">\n                <h3><i class=\"fas fa-table\"></i> Table Manager Operations</h3>\n                \n                <div class=\"operation-selector\">\n                    <label for=\"table-operation-select\">Select Operation:</label>\n                    <select id=\"table-operation-select\" class=\"form-control\">\n                        <option value=\"\">Choose an operation...</option>\n                        <option value=\"compute_table_statistics\">Compute Table Statistics</option>\n                        <option value=\"create_table\">Create Table</option>\n                        <option value=\"create_tables\">Create Multiple Tables</option>\n                        <option value=\"create_view\">Create View</option>\n                        <option value=\"drop_table\">Drop Table</option>\n                        <option value=\"drop_view\">Drop View</option>\n                        <option value=\"execute_sql\">Execute SQL</option>\n                        <option value=\"truncate\">Truncate Table</option>\n                        <option value=\"vacuum\">Vacuum Table</option>\n                        <option value=\"describe\">Describe Table</option>\n                        <option value=\"optimize\">Optimize Table</option>\n                        <option value=\"show_tbl_properties\">Show Table Properties</option>\n                        <option value=\"get_tbl_pk\">Get Table Primary Key</option>\n                        <option value=\"repair_table\">Repair Table</option>\n                        <option value=\"delete_where\">Delete Where</option>\n                    </select>\n                </div>\n\n                <!-- Dynamic form fields will be inserted here -->\n                <div id=\"table-dynamic-fields\" class=\"dynamic-fields\">\n                    <div class=\"no-operation-selected\">\n                        <i class=\"fas fa-arrow-up\"></i>\n                        <p>Select an operation above to see its configuration options</p>\n                    </div>\n                </div>\n            </div>\n        </div>\n\n        <!-- File Manager Tab -->\n        <div id=\"file-manager\" class=\"tab-content\">\n            <div class=\"section\">\n                <h2><i class=\"fas fa-folder\"></i> File Manager Operations</h2>\n                \n                <div class=\"operation-selector\">\n                    <label for=\"file-operation-select\">Select 
Operation:</label>\n                    <select id=\"file-operation-select\" class=\"form-control\">\n                        <option value=\"\">Choose an operation...</option>\n                        <option value=\"delete_objects\">Delete Objects</option>\n                        <option value=\"copy_objects\">Copy Objects</option>\n                        <option value=\"move_objects\">Move Objects</option>\n                        <option value=\"request_restore\">Request Restore (S3)</option>\n                        <option value=\"check_restore_status\">Check Restore Status (S3)</option>\n                        <option value=\"request_restore_to_destination_and_wait\">Request Restore and Copy (S3)</option>\n                    </select>\n                </div>\n\n                <!-- Dynamic form fields will be inserted here -->\n                <div id=\"file-dynamic-fields\" class=\"dynamic-fields\">\n                    <div class=\"no-operation-selected\">\n                        <i class=\"fas fa-arrow-up\"></i>\n                        <p>Select an operation above to see its configuration options</p>\n                    </div>\n                </div>\n            </div>\n        </div>\n    </main>\n\n    <!-- Operations List -->\n    <div class=\"operations-list-container\">\n        <div class=\"operations-header\">\n            <h4><i class=\"fas fa-list\"></i> Operations Queue</h4>\n            <div class=\"operations-actions\">\n                <button id=\"add-operation\" class=\"btn btn-primary\" disabled>\n                    <i class=\"fas fa-plus\"></i>\n                    Add Operation\n                </button>\n                <button id=\"clear-operations\" class=\"btn btn-outline\">\n                    <i class=\"fas fa-trash\"></i>\n                    Clear All\n                </button>\n            </div>\n        </div>\n        \n        <div id=\"operations-list\" class=\"operations-list\">\n            <div class=\"empty-operations\">\n                <i class=\"fas fa-clipboard-list\"></i>\n                <p>No operations added yet. 
Configure and add operations to build your JSON.</p>\n            </div>\n        </div>\n    </div>\n\n    <!-- Actions -->\n    <div class=\"actions\">\n        <button id=\"generate-json\" class=\"btn btn-primary\" disabled>\n            <i class=\"fas fa-code\"></i>\n            Generate JSON\n        </button>\n        <button id=\"copy-json\" class=\"btn btn-secondary\" disabled>\n            <i class=\"fas fa-copy\"></i>\n            Copy to Clipboard\n        </button>\n        <button id=\"download-json\" class=\"btn btn-secondary\" disabled>\n            <i class=\"fas fa-download\"></i>\n            Download JSON\n        </button>\n    </div>\n\n    <!-- JSON Output -->\n    <div class=\"output-container\">\n        <div class=\"output-header\">\n            <h4><i class=\"fas fa-file-code\"></i> Generated JSON Configuration</h4>\n            <div class=\"output-actions\">\n                <button id=\"format-json\" class=\"btn btn-sm\">\n                    <i class=\"fas fa-indent\"></i>\n                    Format\n                </button>\n                <button id=\"validate-json\" class=\"btn btn-sm\">\n                    <i class=\"fas fa-check\"></i>\n                    Validate\n                </button>\n            </div>\n        </div>\n        <pre id=\"json-output\" class=\"json-output\"></pre>\n        <div id=\"validation-result\" class=\"validation-result\"></div>\n    </div>\n\n    <!-- Loading Spinner -->\n    <div id=\"loading\" class=\"loading\" style=\"display: none;\">\n        <div class=\"spinner\"></div>\n        <p>Generating configuration...</p>\n    </div>\n\n    <!-- Success Toast -->\n    <div id=\"toast\" class=\"toast\"></div>\n</div>\n\n<script src=\"../managerhelper/operations-script.js\"></script>\n"
  },
  {
    "path": "lakehouse_engine_usage/managerhelper/operations-script.js",
    "content": "// ============================================================================\n// LAKEHOUSE ENGINE OPERATIONS GENERATOR - MAIN JAVASCRIPT\n// ============================================================================\n// This script manages the interactive UI for generating JSON configurations\n// for Lakehouse Engine table and file manager operations.\n// ============================================================================\n\n// ============================================================================\n// DOM ELEMENT REFERENCES\n// ============================================================================\n// Cache frequently accessed DOM elements for better performance\n\n/** Tab navigation buttons for switching between table and file managers */\nconst tabButtons = document.querySelectorAll('.tab-button');\n\n/** Tab content containers for table and file manager sections */\nconst tabContents = document.querySelectorAll('.tab-content');\n\n/** Dropdown select for choosing table manager operations */\nconst tableOperationSelect = document.getElementById('table-operation-select');\n\n/** Dropdown select for choosing file manager operations */\nconst fileOperationSelect = document.getElementById('file-operation-select');\n\n/** Container for dynamically generated table operation parameter fields */\nconst tableDynamicFields = document.getElementById('table-dynamic-fields');\n\n/** Container for dynamically generated file operation parameter fields */\nconst fileDynamicFields = document.getElementById('file-dynamic-fields');\n\n/** Button to add the currently configured operation to the list */\nconst addOperationBtn = document.getElementById('add-operation');\n\n/** Button to clear all operations from the list */\nconst clearOperationsBtn = document.getElementById('clear-operations');\n\n/** Container displaying the list of added operations */\nconst operationsList = document.getElementById('operations-list');\n\n/** Button to generate JSON configuration from operations list */\nconst generateBtn = document.getElementById('generate-json');\n\n/** Button to copy generated JSON to clipboard */\nconst copyBtn = document.getElementById('copy-json');\n\n/** Button to download generated JSON as a file */\nconst downloadBtn = document.getElementById('download-json');\n\n/** Button to format the displayed JSON */\nconst formatBtn = document.getElementById('format-json');\n\n/** Button to validate the generated JSON configuration */\nconst validateBtn = document.getElementById('validate-json');\n\n/** Pre-formatted text area displaying the generated JSON output */\nconst jsonOutput = document.getElementById('json-output');\n\n/** Element displaying validation results and messages */\nconst validationResult = document.getElementById('validation-result');\n\n/** Loading spinner overlay element */\nconst loading = document.getElementById('loading');\n\n/** Toast notification element for user feedback */\nconst toast = document.getElementById('toast');\n\n// ============================================================================\n// APPLICATION STATE\n// ============================================================================\n// Global state variables that track the application's current status\n\n/** Current active tab ('table-manager' or 'file-manager') */\nlet currentTab = 'table-manager';\n\n/** Array of operation objects added by the user */\nlet operations = [];\n\n/** Generated JSON configuration object */\nlet generatedConfig = null;\n\n// 
============================================================================\n// OPERATION DEFINITIONS - TABLE MANAGER\n// ============================================================================\n// Defines all available table manager operations with their parameters,\n// validation rules, and UI presentation details\n\n/**\n * Table Manager Operations Configuration\n * Each operation includes:\n * - name: Display name for the UI\n * - icon: FontAwesome icon class\n * - fields: Array of field definitions with type, validation, and help text\n */\nconst TABLE_OPERATIONS = {\n    'compute_table_statistics': {\n        name: 'Compute Table Statistics',\n        icon: 'fas fa-chart-bar',\n        fields: [\n            { name: 'table_or_view', label: 'Table or View Name', type: 'text', required: true, help: 'Name of the table or view to compute statistics for' }\n        ]\n    },\n    'create_table': {\n        name: 'Create Table',\n        icon: 'fas fa-plus-square',\n        fields: [\n            { name: 'path', label: 'SQL File Path', type: 'text', required: true, help: 'Path to the SQL file containing the CREATE TABLE statement' },\n            { name: 'disable_dbfs_retry', label: 'Disable DBFS Retry', type: 'select', options: ['True', 'False'], default: 'False', help: 'Whether to disable DBFS retry mechanism' },\n            { name: 'delimiter', label: 'SQL Delimiter', type: 'text', default: ';', help: 'Delimiter to separate SQL commands' },\n            { name: 'advanced_parser', label: 'Advanced Parser', type: 'select', options: ['True', 'False'], default: 'False', help: 'Use advanced SQL parser' }\n        ]\n    },\n    'create_tables': {\n        name: 'Create Multiple Tables',\n        icon: 'fas fa-layer-group',\n        fields: [\n            { name: 'path', label: 'SQL File Paths', type: 'textarea', required: true, help: 'Comma-separated paths to SQL files containing CREATE TABLE statements' },\n            { name: 'disable_dbfs_retry', label: 'Disable DBFS Retry', type: 'select', options: ['True', 'False'], default: 'False', help: 'Whether to disable DBFS retry mechanism' },\n            { name: 'delimiter', label: 'SQL Delimiter', type: 'text', default: ';', help: 'Delimiter to separate SQL commands' },\n            { name: 'advanced_parser', label: 'Advanced Parser', type: 'select', options: ['True', 'False'], default: 'False', help: 'Use advanced SQL parser' }\n        ]\n    },\n    'create_view': {\n        name: 'Create View',\n        icon: 'fas fa-eye',\n        fields: [\n            { name: 'path', label: 'SQL File Path', type: 'text', required: true, help: 'Path to the SQL file containing the CREATE VIEW statement' },\n            { name: 'disable_dbfs_retry', label: 'Disable DBFS Retry', type: 'select', options: ['True', 'False'], default: 'False', help: 'Whether to disable DBFS retry mechanism' },\n            { name: 'delimiter', label: 'SQL Delimiter', type: 'text', default: ';', help: 'Delimiter to separate SQL commands' },\n            { name: 'advanced_parser', label: 'Advanced Parser', type: 'select', options: ['True', 'False'], default: 'False', help: 'Use advanced SQL parser' }\n        ]\n    },\n    'drop_table': {\n        name: 'Drop Table',\n        icon: 'fas fa-trash-alt',\n        fields: [\n            { name: 'table_or_view', label: 'Table Name', type: 'text', required: true, help: 'Name of the table to drop' }\n        ]\n    },\n    'drop_view': {\n        name: 'Drop View',\n        icon: 'fas fa-eye-slash',\n        fields: [\n          
  { name: 'table_or_view', label: 'View Name', type: 'text', required: true, help: 'Name of the view to drop' }\n        ]\n    },\n    'execute_sql': {\n        name: 'Execute SQL',\n        icon: 'fas fa-code',\n        fields: [\n            { name: 'sql', label: 'SQL Commands', type: 'textarea', required: true, help: 'SQL commands to execute (separated by delimiter)' },\n            { name: 'delimiter', label: 'SQL Delimiter', type: 'text', default: ';', help: 'Delimiter to separate SQL commands' },\n            { name: 'advanced_parser', label: 'Advanced Parser', type: 'select', options: ['True', 'False'], default: 'False', help: 'Use advanced SQL parser' }\n        ]\n    },\n    'truncate': {\n        name: 'Truncate Table',\n        icon: 'fas fa-cut',\n        fields: [\n            { name: 'table_or_view', label: 'Table Name', type: 'text', required: true, help: 'Name of the table to truncate' }\n        ]\n    },\n    'vacuum': {\n        name: 'Vacuum Table',\n        icon: 'fas fa-broom',\n        fields: [\n            { name: 'table_or_view', label: 'Table Name', type: 'text', help: 'Name of the table to vacuum (leave empty to use path)' },\n            { name: 'path', label: 'Table Path', type: 'text', help: 'Path to the Delta table location (use if table_or_view is empty)' },\n            { name: 'vacuum_hours', label: 'Retention Hours', type: 'number', default: '168', help: 'Number of hours to retain old versions (default: 168 hours = 7 days)' }\n        ]\n    },\n    'describe': {\n        name: 'Describe Table',\n        icon: 'fas fa-info-circle',\n        fields: [\n            { name: 'table_or_view', label: 'Table or View Name', type: 'text', required: true, help: 'Name of the table or view to describe' }\n        ]\n    },\n    'optimize': {\n        name: 'Optimize Table',\n        icon: 'fas fa-tachometer-alt',\n        fields: [\n            { name: 'table_or_view', label: 'Table Name', type: 'text', help: 'Name of the table to optimize (leave empty to use path)' },\n            { name: 'path', label: 'Table Path', type: 'text', help: 'Path to the Delta table location (use if table_or_view is empty)' },\n            { name: 'where_clause', label: 'Where Clause', type: 'text', help: 'Optional WHERE clause to limit optimization scope' },\n            { name: 'optimize_zorder_col_list', label: 'Z-Order Columns', type: 'text', help: 'Comma-separated list of columns for Z-ORDER optimization' }\n        ]\n    },\n    'show_tbl_properties': {\n        name: 'Show Table Properties',\n        icon: 'fas fa-cogs',\n        fields: [\n            { name: 'table_or_view', label: 'Table or View Name', type: 'text', required: true, help: 'Name of the table or view to show properties for' }\n        ]\n    },\n    'get_tbl_pk': {\n        name: 'Get Table Primary Key',\n        icon: 'fas fa-key',\n        fields: [\n            { name: 'table_or_view', label: 'Table Name', type: 'text', required: true, help: 'Name of the table to get primary key from' }\n        ]\n    },\n    'repair_table': {\n        name: 'Repair Table',\n        icon: 'fas fa-wrench',\n        fields: [\n            { name: 'table_or_view', label: 'Table Name', type: 'text', required: true, help: 'Name of the table to repair' },\n            { name: 'sync_metadata', label: 'Sync Metadata', type: 'select', options: ['True', 'False'], default: 'False', help: 'Whether to sync metadata during repair' }\n        ]\n    },\n    'delete_where': {\n        name: 'Delete Where',\n        icon: 'fas 
fa-eraser',\n        fields: [\n            { name: 'table_or_view', label: 'Table Name', type: 'text', required: true, help: 'Name of the table to delete from' },\n            { name: 'where_clause', label: 'Where Clause', type: 'text', required: true, help: 'WHERE condition for deletion (without WHERE keyword)' }\n        ]\n    }\n};\n\n// ============================================================================\n// OPERATION DEFINITIONS - FILE MANAGER\n// ============================================================================\n// Defines all available file manager operations for S3 and DBFS file systems\n\n/**\n * File Manager Operations Configuration\n * Supports operations for:\n * - S3: delete, copy, move, restore from Glacier\n * - DBFS: delete, copy, move\n */\nconst FILE_OPERATIONS = {\n    'delete_objects': {\n        name: 'Delete Objects',\n        icon: 'fas fa-trash',\n        fields: [\n            { name: 'bucket', label: 'Bucket Name', type: 'text', help: 'S3 bucket name (leave empty for DBFS paths)' },\n            { name: 'object_paths', label: 'Object Paths', type: 'textarea', required: true, help: 'Comma-separated list of object paths to delete' },\n            { name: 'dry_run', label: 'Dry Run', type: 'select', options: ['True', 'False'], default: 'False', help: 'Preview what would be deleted without actually deleting' }\n        ]\n    },\n    'copy_objects': {\n        name: 'Copy Objects',\n        icon: 'fas fa-copy',\n        fields: [\n            { name: 'bucket', label: 'Source Bucket', type: 'text', help: 'Source S3 bucket name (leave empty for DBFS paths)' },\n            { name: 'source_object', label: 'Source Object Path', type: 'text', required: true, help: 'Path of the source object or directory' },\n            { name: 'destination_bucket', label: 'Destination Bucket', type: 'text', help: 'Destination S3 bucket name (leave empty for DBFS paths)' },\n            { name: 'destination_object', label: 'Destination Object Path', type: 'text', required: true, help: 'Path of the destination object or directory' },\n            { name: 'dry_run', label: 'Dry Run', type: 'select', options: ['True', 'False'], default: 'False', help: 'Preview what would be copied without actually copying' }\n        ]\n    },\n    'move_objects': {\n        name: 'Move Objects',\n        icon: 'fas fa-arrows-alt',\n        fields: [\n            { name: 'bucket', label: 'Source Bucket', type: 'text', help: 'Source S3 bucket name (leave empty for DBFS paths)' },\n            { name: 'source_object', label: 'Source Object Path', type: 'text', required: true, help: 'Path of the source object or directory' },\n            { name: 'destination_bucket', label: 'Destination Bucket', type: 'text', help: 'Destination S3 bucket name (leave empty for DBFS paths)' },\n            { name: 'destination_object', label: 'Destination Object Path', type: 'text', required: true, help: 'Path of the destination object or directory' },\n            { name: 'dry_run', label: 'Dry Run', type: 'select', options: ['True', 'False'], default: 'False', help: 'Preview what would be moved without actually moving' }\n        ]\n    },\n    'request_restore': {\n        name: 'Request Restore (S3)',\n        icon: 'fas fa-undo',\n        fields: [\n            { name: 'bucket', label: 'S3 Bucket', type: 'text', required: true, help: 'S3 bucket containing archived objects' },\n            { name: 'source_object', label: 'Source Object Path', type: 'text', required: true, help: 'Path of the archived 
object to restore' },\n            { name: 'restore_expiration', label: 'Restore Expiration (days)', type: 'number', required: true, default: '7', help: 'Number of days to keep restored objects available' },\n            { name: 'retrieval_tier', label: 'Retrieval Tier', type: 'select', options: ['Expedited', 'Standard', 'Bulk'], default: 'Standard', help: 'Speed and cost tier for restoration' },\n            { name: 'dry_run', label: 'Dry Run', type: 'select', options: ['True', 'False'], default: 'False', help: 'Preview what would be restored without actually restoring' }\n        ]\n    },\n    'check_restore_status': {\n        name: 'Check Restore Status (S3)',\n        icon: 'fas fa-search',\n        fields: [\n            { name: 'bucket', label: 'S3 Bucket', type: 'text', required: true, help: 'S3 bucket containing archived objects' },\n            { name: 'source_object', label: 'Source Object Path', type: 'text', required: true, help: 'Path of the object to check restore status' }\n        ]\n    },\n    'request_restore_to_destination_and_wait': {\n        name: 'Request Restore and Copy (S3)',\n        icon: 'fas fa-sync-alt',\n        fields: [\n            { name: 'bucket', label: 'Source S3 Bucket', type: 'text', required: true, help: 'S3 bucket containing archived objects' },\n            { name: 'source_object', label: 'Source Object Path', type: 'text', required: true, help: 'Path of the archived object to restore' },\n            { name: 'destination_bucket', label: 'Destination S3 Bucket', type: 'text', required: true, help: 'Destination S3 bucket for restored objects' },\n            { name: 'destination_object', label: 'Destination Object Path', type: 'text', required: true, help: 'Path of the destination for restored objects' },\n            { name: 'restore_expiration', label: 'Restore Expiration (days)', type: 'number', required: true, default: '7', help: 'Number of days to keep restored objects available' },\n            { name: 'retrieval_tier', label: 'Retrieval Tier', type: 'select', options: ['Expedited'], default: 'Expedited', help: 'Only Expedited tier supported for this operation' },\n            { name: 'dry_run', label: 'Dry Run', type: 'select', options: ['True', 'False'], default: 'False', help: 'Preview what would be restored without actually restoring' }\n        ]\n    }\n};\n\n// ============================================================================\n// INITIALIZATION\n// ============================================================================\n// Set up the application when the DOM is fully loaded\n\n/**\n * Initialize the application on page load\n * Sets up tabs, event listeners, and loads any saved state\n */\ndocument.addEventListener('DOMContentLoaded', function() {\n    initializeTabs();\n    initializeEventListeners();\n    loadFromLocalStorage();\n});\n\n// ============================================================================\n// TAB MANAGEMENT\n// ============================================================================\n\n/**\n * Initialize tab navigation functionality\n * Sets up click handlers for switching between table and file manager tabs\n */\nfunction initializeTabs() {\n    tabButtons.forEach(button => {\n        button.addEventListener('click', () => {\n            const tabId = button.getAttribute('data-tab');\n            switchTab(tabId);\n        });\n    });\n}\n\n/**\n * Switch to a different tab\n * @param {string} tabId - The ID of the tab to activate ('table-manager' or 'file-manager')\n */\nfunction 
switchTab(tabId) {\n    // Update button active states\n    tabButtons.forEach(btn => btn.classList.remove('active'));\n    document.querySelector(`[data-tab=\"${tabId}\"]`).classList.add('active');\n    \n    // Update content visibility\n    tabContents.forEach(content => content.classList.remove('active'));\n    document.getElementById(tabId).classList.add('active');\n    \n    // Update application state\n    currentTab = tabId;\n    updateAddButtonState();\n}\n\n// ============================================================================\n// EVENT LISTENERS SETUP\n// ============================================================================\n\n/**\n * Initialize all event listeners for interactive elements\n * Connects UI actions to their handler functions\n */\nfunction initializeEventListeners() {\n    // Operation selection change handlers\n    tableOperationSelect.addEventListener('change', handleTableOperationChange);\n    fileOperationSelect.addEventListener('change', handleFileOperationChange);\n    \n    // Button click handlers\n    addOperationBtn.addEventListener('click', addCurrentOperation);\n    clearOperationsBtn.addEventListener('click', clearAllOperations);\n    generateBtn.addEventListener('click', generateJSON);\n    copyBtn.addEventListener('click', copyToClipboard);\n    downloadBtn.addEventListener('click', downloadJSON);\n    formatBtn.addEventListener('click', formatJSON);\n    validateBtn.addEventListener('click', validateJSON);\n}\n\n// ============================================================================\n// DYNAMIC FIELD GENERATION\n// ============================================================================\n\n/**\n * Handle table operation selection change\n * Renders the appropriate parameter fields for the selected table operation\n */\nfunction handleTableOperationChange() {\n    const operation = tableOperationSelect.value;\n    if (operation && TABLE_OPERATIONS[operation]) {\n        renderDynamicFields(tableDynamicFields, TABLE_OPERATIONS[operation], 'table');\n        updateAddButtonState();\n    } else {\n        showNoOperationSelected(tableDynamicFields);\n        updateAddButtonState();\n    }\n}\n\n/**\n * Handle file operation selection change\n * Renders the appropriate parameter fields for the selected file operation\n */\nfunction handleFileOperationChange() {\n    const operation = fileOperationSelect.value;\n    if (operation && FILE_OPERATIONS[operation]) {\n        renderDynamicFields(fileDynamicFields, FILE_OPERATIONS[operation], 'file');\n        updateAddButtonState();\n    } else {\n        showNoOperationSelected(fileDynamicFields);\n        updateAddButtonState();\n    }\n}\n\n/**\n * Display a message when no operation is selected\n * @param {HTMLElement} container - The container to display the message in\n */\nfunction showNoOperationSelected(container) {\n    container.innerHTML = `\n        <div class=\"no-operation-selected\">\n            <i class=\"fas fa-arrow-up\"></i>\n            <p>Select an operation above to see its configuration options</p>\n        </div>\n    `;\n}\n\n/**\n * Render dynamic parameter fields for the selected operation\n * @param {HTMLElement} container - The container to render fields into\n * @param {Object} operationDef - The operation definition with field specifications\n * @param {string} type - The operation type ('table' or 'file')\n */\nfunction renderDynamicFields(container, operationDef, type) {\n    const html = `\n        <div class=\"field-group\">\n            <h4>\n  
              <i class=\"${operationDef.icon}\"></i>\n                ${operationDef.name} Configuration\n            </h4>\n            ${operationDef.fields.map(field => renderField(field, type)).join('')}\n        </div>\n    `;\n    container.innerHTML = html;\n    \n    // Attach validation event listeners to all input fields\n    container.querySelectorAll('input, select, textarea').forEach(input => {\n        input.addEventListener('blur', () => validateField(input));\n        input.addEventListener('input', () => clearFieldValidation(input));\n    });\n}\n\n/**\n * Render a single input field based on its definition\n * @param {Object} field - Field definition with name, type, label, etc.\n * @param {string} type - The operation type for generating unique field IDs\n * @returns {string} HTML string for the field\n */\nfunction renderField(field, type) {\n    const fieldId = `${type}-${field.name}`;\n    const required = field.required ? 'required' : '';\n    const requiredMarker = field.required ? '<span class=\"field-required\">*</span>' : '';\n    \n    let inputHtml = '';\n    \n    // Generate appropriate input HTML based on field type\n    switch (field.type) {\n        case 'text':\n        case 'number':\n            inputHtml = `<input type=\"${field.type}\" id=\"${fieldId}\" name=\"${field.name}\" ${required} ${field.default ? `value=\"${field.default}\"` : ''}>`;\n            break;\n        case 'textarea':\n            inputHtml = `<textarea id=\"${fieldId}\" name=\"${field.name}\" rows=\"3\" ${required}>${field.default || ''}</textarea>`;\n            break;\n        case 'select':\n            const options = field.options.map(option => \n                `<option value=\"${option}\" ${field.default === option ? 'selected' : ''}>${option}</option>`\n            ).join('');\n            inputHtml = `<select id=\"${fieldId}\" name=\"${field.name}\" ${required}>${options}</select>`;\n            break;\n    }\n    \n    return `\n        <div class=\"field-row\">\n            <div class=\"field-item\">\n                <label for=\"${fieldId}\">\n                    ${field.label} ${requiredMarker}\n                </label>\n                ${inputHtml}\n                <div class=\"field-help\">${field.help}</div>\n                <div class=\"validation-message\" id=\"${fieldId}-validation\"></div>\n            </div>\n        </div>\n    `;\n}\n\n// ============================================================================\n// FIELD VALIDATION\n// ============================================================================\n\n/**\n * Validate a single input field\n * @param {HTMLInputElement} input - The input element to validate\n * @returns {boolean} True if field is valid, false otherwise\n */\nfunction validateField(input) {\n    const validationDiv = document.getElementById(`${input.id}-validation`);\n    const isRequired = input.hasAttribute('required');\n    const value = input.value.trim();\n    \n    // Clear previous validation state\n    input.classList.remove('valid', 'invalid');\n    validationDiv.textContent = '';\n    validationDiv.className = 'validation-message';\n    \n    // Check if required field is empty\n    if (isRequired && !value) {\n        input.classList.add('invalid');\n        validationDiv.textContent = 'This field is required';\n        validationDiv.classList.add('error');\n        return false;\n    }\n    \n    // Type-specific validation for number fields\n    if (value && input.type === 'number') {\n        const numValue = 
parseFloat(value);\n        if (isNaN(numValue) || numValue < 0) {\n            input.classList.add('invalid');\n            validationDiv.textContent = 'Please enter a valid positive number';\n            validationDiv.classList.add('error');\n            return false;\n        }\n    }\n    \n    // Mark field as valid if it has a value\n    if (value) {\n        input.classList.add('valid');\n        validationDiv.textContent = '✓ Valid';\n        validationDiv.classList.add('success');\n    }\n    \n    return true;\n}\n\n/**\n * Clear validation state from an input field\n * @param {HTMLInputElement} input - The input element to clear validation from\n */\nfunction clearFieldValidation(input) {\n    input.classList.remove('valid', 'invalid');\n    const validationDiv = document.getElementById(`${input.id}-validation`);\n    if (validationDiv) {\n        validationDiv.textContent = '';\n        validationDiv.className = 'validation-message';\n    }\n}\n\n// ============================================================================\n// OPERATION MANAGEMENT\n// ============================================================================\n\n/**\n * Update the enabled/disabled state of the Add Operation button\n * Button is only enabled when an operation is selected\n */\nfunction updateAddButtonState() {\n    const currentSelect = currentTab === 'table-manager' ? tableOperationSelect : fileOperationSelect;\n    const hasSelection = currentSelect.value !== '';\n    addOperationBtn.disabled = !hasSelection;\n}\n\n/**\n * Add the currently configured operation to the operations list\n * Validates all fields before adding\n */\nfunction addCurrentOperation() {\n    const currentSelect = currentTab === 'table-manager' ? tableOperationSelect : fileOperationSelect;\n    const operationKey = currentSelect.value;\n    \n    if (!operationKey) return;\n    \n    const operationDef = currentTab === 'table-manager' ? \n        TABLE_OPERATIONS[operationKey] : FILE_OPERATIONS[operationKey];\n    \n    // Collect and validate field values\n    const config = { function: operationKey };\n    const container = currentTab === 'table-manager' ? tableDynamicFields : fileDynamicFields;\n    let isValid = true;\n    \n    container.querySelectorAll('input, select, textarea').forEach(input => {\n        if (!validateField(input)) {\n            isValid = false;\n        }\n        \n        const value = input.value.trim();\n        if (value) {\n            // Handle different field types and convert values appropriately\n            if (input.name === 'object_paths' && value.includes(',')) {\n                config[input.name] = value.split(',').map(s => s.trim());\n            } else if (input.type === 'number') {\n                config[input.name] = parseInt(value, 10);\n            } else if (value === 'True') {\n                config[input.name] = true;\n            } else if (value === 'False') {\n                config[input.name] = false;\n            } else {\n                config[input.name] = value;\n            }\n        }\n    });\n    \n    // Abort if validation failed\n    if (!isValid) {\n        showToast('Please fix validation errors before adding the operation', 'error');\n        return;\n    }\n    \n    // Create and add operation object\n    const operation = {\n        id: Date.now(),\n        type: currentTab === 'table-manager' ? 'table' : 'file',\n        manager: currentTab === 'table-manager' ? 
'table' : 'file',\n        functionName: operationKey,\n        displayName: operationDef.name,\n        icon: operationDef.icon,\n        config: config\n    };\n    \n    operations.push(operation);\n    renderOperationsList();\n    updateGenerateButtonState();\n    saveToLocalStorage();\n    \n    showToast(`${operationDef.name} operation added successfully!`, 'success');\n}\n\n/**\n * Remove an operation from the operations list\n * @param {number} id - The unique ID of the operation to remove\n */\nfunction removeOperation(id) {\n    operations = operations.filter(op => op.id !== id);\n    renderOperationsList();\n    updateGenerateButtonState();\n    saveToLocalStorage();\n    showToast('Operation removed', 'success');\n}\n\n/**\n * Clear all operations from the list after confirmation\n */\nfunction clearAllOperations() {\n    if (operations.length === 0) return;\n    \n    if (confirm('Are you sure you want to remove all operations?')) {\n        operations = [];\n        renderOperationsList();\n        updateGenerateButtonState();\n        saveToLocalStorage();\n        showToast('All operations cleared', 'success');\n    }\n}\n\n/**\n * Render the list of added operations in the UI\n * Shows empty state if no operations exist\n */\nfunction renderOperationsList() {\n    if (operations.length === 0) {\n        operationsList.innerHTML = `\n            <div class=\"empty-operations\">\n                <i class=\"fas fa-clipboard-list\"></i>\n                <p>No operations added yet. Configure and add operations to build your JSON.</p>\n            </div>\n        `;\n        return;\n    }\n    \n    const html = operations.map(operation => `\n        <div class=\"operation-item ${operation.id === operations[operations.length - 1]?.id ? 'new' : ''}\">\n            <div class=\"operation-info\">\n                <div class=\"operation-title\">\n                    <span class=\"operation-badge badge-${operation.type}\">\n                        ${operation.type}\n                    </span>\n                    <i class=\"${operation.icon}\"></i>\n                    ${operation.displayName}\n                </div>\n                <div class=\"operation-details\">\n                    Function: <strong>${operation.functionName}</strong> |\n                    Parameters: ${Object.keys(operation.config).filter(k => k !== 'function').length}\n                </div>\n            </div>\n            <div class=\"operation-actions\">\n                <button class=\"btn btn-sm btn-remove\" onclick=\"removeOperation(${operation.id})\">\n                    <i class=\"fas fa-times\"></i>\n                    Remove\n                </button>\n            </div>\n        </div>\n    `).join('');\n    \n    operationsList.innerHTML = html;\n}\n\n/**\n * Update the enabled/disabled state of the Generate JSON button\n * Button is only enabled when at least one operation exists\n */\nfunction updateGenerateButtonState() {\n    generateBtn.disabled = operations.length === 0;\n}\n\n// ============================================================================\n// JSON GENERATION AND OUTPUT\n// ============================================================================\n\n/**\n * Generate JSON configuration from the operations list\n * Creates the final configuration object in Lakehouse Engine format\n */\nfunction generateJSON() {\n    if (operations.length === 0) {\n        showToast('No operations to generate. 
Please add at least one operation.', 'error');\n        return;\n    }\n    \n    showLoading();\n    \n    // Use setTimeout to show loading animation\n    setTimeout(() => {\n        try {\n            const config = {\n                operations: operations.map(op => ({\n                    manager: op.manager,\n                    ...op.config\n                }))\n            };\n            \n            generatedConfig = config;\n            displayJSON(config);\n            enableActionButtons();\n            showToast('JSON configuration generated successfully!', 'success');\n            \n        } catch (error) {\n            console.error('Generation error:', error);\n            showToast('Error generating JSON: ' + error.message, 'error');\n        } finally {\n            hideLoading();\n        }\n    }, 500);\n}\n\n/**\n * Display formatted JSON in the output area\n * @param {Object} config - The configuration object to display\n */\nfunction displayJSON(config) {\n    const formattedJSON = JSON.stringify(config, null, 2);\n    jsonOutput.textContent = formattedJSON;\n    highlightJSON();\n}\n\n/**\n * Apply syntax highlighting to the displayed JSON\n * Colors different JSON elements (keys, strings, numbers, booleans)\n */\nfunction highlightJSON() {\n    const content = jsonOutput.textContent;\n    const highlighted = content\n        .replace(/\"([^\"]+)\":/g, '<span class=\"json-key\">\"$1\":</span>')\n        .replace(/: \"([^\"]+)\"/g, ': <span class=\"json-string\">\"$1\"</span>')\n        .replace(/: (\\d+)/g, ': <span class=\"json-number\">$1</span>')\n        .replace(/: (true|false)/g, ': <span class=\"json-boolean\">$1</span>')\n        .replace(/: null/g, ': <span class=\"json-null\">null</span>');\n    \n    jsonOutput.innerHTML = highlighted;\n}\n\n/**\n * Format the generated JSON with proper indentation\n * Re-formats and re-highlights the JSON output\n */\nfunction formatJSON() {\n    if (!generatedConfig) {\n        showToast('No JSON to format. Generate configuration first.', 'error');\n        return;\n    }\n    \n    try {\n        const formatted = JSON.stringify(generatedConfig, null, 2);\n        jsonOutput.textContent = formatted;\n        highlightJSON();\n        showToast('JSON formatted successfully!', 'success');\n    } catch (error) {\n        showToast('Error formatting JSON: ' + error.message, 'error');\n    }\n}\n\n/**\n * Validate the generated JSON configuration\n * Checks for required fields and proper structure\n */\nfunction validateJSON() {\n    if (!generatedConfig) {\n        showValidationResult(false, 'No JSON to validate. Generate configuration first.');\n        return;\n    }\n    \n    try {\n        const config = generatedConfig;\n        const errors = [];\n        \n        // Check for operations array\n        if (!config.operations || !Array.isArray(config.operations)) {\n            errors.push('Missing or invalid operations array');\n        } else {\n            // Validate each operation\n            config.operations.forEach((op, index) => {\n                if (!op.manager) {\n                    errors.push(`Operation ${index + 1}: Missing manager field`);\n                }\n                if (!op.function) {\n                    errors.push(`Operation ${index + 1}: Missing function field`);\n                }\n            });\n        }\n        \n        // Display validation results\n        if (errors.length === 0) {\n            showValidationResult(true, `JSON configuration is valid! 
Contains ${config.operations.length} operation(s).`);\n        } else {\n            showValidationResult(false, 'Validation errors: ' + errors.join(', '));\n        }\n    } catch (error) {\n        showValidationResult(false, 'Validation error: ' + error.message);\n    }\n}\n\n/**\n * Display validation results to the user\n * @param {boolean} isValid - Whether the validation passed\n * @param {string} message - The validation message to display\n */\nfunction showValidationResult(isValid, message) {\n    validationResult.className = `validation-result ${isValid ? 'valid' : 'invalid'}`;\n    validationResult.textContent = isValid ? '✅ ' + message : '❌ ' + message;\n}\n\n/**\n * Copy the generated JSON to the clipboard\n * Uses modern Clipboard API with fallback for older browsers\n */\nasync function copyToClipboard() {\n    if (!generatedConfig) {\n        showToast('No JSON to copy. Generate configuration first.', 'error');\n        return;\n    }\n    \n    try {\n        const jsonString = JSON.stringify(generatedConfig, null, 2);\n        await navigator.clipboard.writeText(jsonString);\n        showToast('JSON copied to clipboard!', 'success');\n    } catch (error) {\n        // Fallback for older browsers\n        const textArea = document.createElement('textarea');\n        textArea.value = JSON.stringify(generatedConfig, null, 2);\n        document.body.appendChild(textArea);\n        textArea.select();\n        document.execCommand('copy');\n        document.body.removeChild(textArea);\n        showToast('JSON copied to clipboard!', 'success');\n    }\n}\n\n/**\n * Download the generated JSON as a file\n * Creates a timestamped filename and triggers browser download\n */\nfunction downloadJSON() {\n    if (!generatedConfig) {\n        showToast('No JSON to download. 
Generate configuration first.', 'error');\n        return;\n    }\n    \n    const jsonString = JSON.stringify(generatedConfig, null, 2);\n    const blob = new Blob([jsonString], { type: 'application/json' });\n    const url = URL.createObjectURL(blob);\n    \n    // Generate filename with timestamp\n    const timestamp = new Date().toISOString().replace(/[:.]/g, '-');\n    const filename = `lakehouse-operations-${timestamp}.json`;\n    \n    // Trigger download\n    const a = document.createElement('a');\n    a.href = url;\n    a.download = filename;\n    document.body.appendChild(a);\n    a.click();\n    document.body.removeChild(a);\n    URL.revokeObjectURL(url);\n    \n    showToast(`Configuration downloaded as ${filename}`, 'success');\n}\n\n// ============================================================================\n// UI HELPER FUNCTIONS\n// ============================================================================\n\n/**\n * Enable the JSON action buttons (copy, download)\n * Called after JSON is successfully generated\n */\nfunction enableActionButtons() {\n    copyBtn.disabled = false;\n    downloadBtn.disabled = false;\n}\n\n/**\n * Show the loading spinner overlay\n */\nfunction showLoading() {\n    loading.style.display = 'flex';\n}\n\n/**\n * Hide the loading spinner overlay\n */\nfunction hideLoading() {\n    loading.style.display = 'none';\n}\n\n/**\n * Display a toast notification message\n * @param {string} message - The message to display\n * @param {string} type - The toast type ('success' or 'error')\n */\nfunction showToast(message, type = 'success') {\n    toast.textContent = message;\n    toast.className = `toast ${type}`;\n    toast.classList.add('show');\n    \n    // Auto-hide after 3 seconds\n    setTimeout(() => {\n        toast.classList.remove('show');\n    }, 3000);\n}\n\n// ============================================================================\n// LOCAL STORAGE PERSISTENCE\n// ============================================================================\n\n/**\n * Save the current operations and state to localStorage\n * Allows users to resume work after page reload\n */\nfunction saveToLocalStorage() {\n    const data = {\n        operations: operations,\n        currentTab: currentTab,\n        timestamp: Date.now()\n    };\n    localStorage.setItem('lakehouse-operations-generator', JSON.stringify(data));\n}\n\n/**\n * Load previously saved operations and state from localStorage\n * Only loads data saved within the last 24 hours\n */\nfunction loadFromLocalStorage() {\n    try {\n        const saved = localStorage.getItem('lakehouse-operations-generator');\n        if (saved) {\n            const data = JSON.parse(saved);\n            \n            // Only load if saved within last 24 hours\n            if (Date.now() - data.timestamp < 24 * 60 * 60 * 1000) {\n                operations = data.operations || [];\n                renderOperationsList();\n                updateGenerateButtonState();\n                \n                if (data.currentTab) {\n                    switchTab(data.currentTab);\n                }\n            }\n        }\n    } catch (error) {\n        console.warn('Could not load saved data:', error);\n    }\n}\n\n// ============================================================================\n// KEYBOARD SHORTCUTS\n// ============================================================================\n\n/**\n * Handle keyboard shortcuts for common actions\n * - Ctrl/Cmd + G: Generate JSON\n * - Ctrl/Cmd + A: Add operation 
(when operation selector focused)\n * - Ctrl + Delete: Clear all operations\n */\ndocument.addEventListener('keydown', function(event) {\n    // Ctrl+G or Cmd+G - Generate JSON\n    if ((event.ctrlKey || event.metaKey) && event.key === 'g') {\n        event.preventDefault();\n        generateJSON();\n    }\n    \n    // Ctrl+A or Cmd+A when focused on operation selector - Add operation\n    if ((event.ctrlKey || event.metaKey) && event.key === 'a' && \n        (event.target === tableOperationSelect || event.target === fileOperationSelect)) {\n        event.preventDefault();\n        addCurrentOperation();\n    }\n    \n    // Ctrl + Delete - Clear operations\n    if (event.key === 'Delete' && event.ctrlKey && operations.length > 0) {\n        event.preventDefault();\n        clearAllOperations();\n    }\n});\n\n// ============================================================================\n// FINAL INITIALIZATION\n// ============================================================================\n\n/**\n * Initialize button states when page loads\n * Ensures all buttons are in the correct enabled/disabled state\n */\ndocument.addEventListener('DOMContentLoaded', function() {\n    updateAddButtonState();\n    updateGenerateButtonState();\n});\n"
  },
  {
    "path": "lakehouse_engine_usage/managerhelper/operations-styles-mkdocs.css",
    "content": "/* Import base styles */\n/* Operations-specific styles for MkDocs */\n\n.managerhelper-wrapper .operation-selector {\n    background: #e3f2fd;\n    padding: 1.5rem;\n    border-radius: 4px;\n    margin-bottom: 2rem;\n    border-left: 4px solid #2196f3;\n}\n\n.managerhelper-wrapper .operation-selector label {\n    display: block;\n    margin-bottom: 0.5rem;\n    font-weight: 500;\n    color: #1565c0;\n}\n\n.managerhelper-wrapper .operation-selector select {\n    width: 100%;\n    padding: 12px 15px;\n    border: 1px solid #90caf9;\n    border-radius: 4px;\n    font-size: 0.875rem;\n    background: white;\n    color: #1565c0;\n    transition: all 0.3s ease;\n}\n\n.managerhelper-wrapper .operation-selector select:focus {\n    border-color: #2196f3;\n    outline: none;\n    box-shadow: 0 0 0 2px rgba(33, 150, 243, 0.2);\n}\n\n/* Operations List */\n.managerhelper-wrapper .operations-list-container {\n    background: #fafafa;\n    border-radius: 4px;\n    margin: 0 2rem 2rem;\n    overflow: hidden;\n    box-shadow: 0 1px 3px rgba(0, 0, 0, 0.12);\n    border: 1px solid #e0e0e0;\n}\n\n.managerhelper-wrapper .operations-header {\n    display: flex;\n    justify-content: space-between;\n    align-items: center;\n    padding: 1.5rem;\n    background: #2196f3;\n    color: white;\n}\n\n.managerhelper-wrapper .operations-header h3 {\n    margin: 0;\n    display: flex;\n    align-items: center;\n    gap: 10px;\n}\n\n.managerhelper-wrapper .operations-actions {\n    display: flex;\n    gap: 10px;\n}\n\n.managerhelper-wrapper .operations-list {\n    padding: 1rem;\n    max-height: 400px;\n    overflow-y: auto;\n}\n\n.managerhelper-wrapper .empty-operations {\n    text-align: center;\n    padding: 2rem;\n    color: rgba(0, 0, 0, 0.54);\n}\n\n.managerhelper-wrapper .empty-operations i {\n    font-size: 2rem;\n    margin-bottom: 1rem;\n    opacity: 0.5;\n}\n\n.managerhelper-wrapper .operation-item {\n    background: white;\n    border: 1px solid #e0e0e0;\n    border-radius: 4px;\n    padding: 1rem;\n    margin-bottom: 0.5rem;\n    display: flex;\n    justify-content: space-between;\n    align-items: flex-start;\n    transition: all 0.3s ease;\n}\n\n.managerhelper-wrapper .operation-item:hover {\n    box-shadow: 0 2px 4px rgba(0, 0, 0, 0.15);\n    transform: translateY(-1px);\n}\n\n.managerhelper-wrapper .operation-info {\n    flex: 1;\n}\n\n.managerhelper-wrapper .operation-title {\n    font-weight: 500;\n    color: rgba(0, 0, 0, 0.87);\n    margin-bottom: 0.5rem;\n    display: flex;\n    align-items: center;\n    gap: 8px;\n}\n\n.managerhelper-wrapper .operation-title i {\n    color: #2196f3;\n}\n\n.managerhelper-wrapper .operation-details {\n    font-size: 0.8rem;\n    color: rgba(0, 0, 0, 0.54);\n}\n\n.managerhelper-wrapper .operation-actions {\n    display: flex;\n    gap: 8px;\n    margin-left: 1rem;\n}\n\n.managerhelper-wrapper .btn-edit {\n    background: #ffd54f;\n    color: rgba(0, 0, 0, 0.87);\n    border: none;\n}\n\n.managerhelper-wrapper .btn-edit:hover {\n    background: #ffca28;\n}\n\n.managerhelper-wrapper .btn-remove {\n    background: #f44336;\n    color: white;\n    border: none;\n}\n\n.managerhelper-wrapper .btn-remove:hover {\n    background: #d32f2f;\n}\n\n/* Field Groups */\n.managerhelper-wrapper .field-group {\n    background: #fafafa;\n    padding: 1.5rem;\n    border-radius: 4px;\n    margin-bottom: 1.5rem;\n    border: 1px solid #e0e0e0;\n    border-left: 3px solid #4caf50;\n}\n\n.managerhelper-wrapper .field-group h4 {\n    color: rgba(0, 0, 0, 0.87);\n    
margin-bottom: 1rem;\n    display: flex;\n    align-items: center;\n    gap: 8px;\n    font-weight: 500;\n}\n\n.managerhelper-wrapper .field-group h4 i {\n    color: #4caf50;\n}\n\n.managerhelper-wrapper .field-row {\n    display: grid;\n    grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));\n    gap: 1rem;\n    margin-bottom: 1rem;\n}\n\n.managerhelper-wrapper .field-item {\n    display: flex;\n    flex-direction: column;\n}\n\n.managerhelper-wrapper .field-item label {\n    margin-bottom: 0.5rem;\n    font-weight: 500;\n    color: rgba(0, 0, 0, 0.87);\n}\n\n.managerhelper-wrapper .field-item input,\n.managerhelper-wrapper .field-item select,\n.managerhelper-wrapper .field-item textarea {\n    padding: 10px 12px;\n    border: 1px solid #bdbdbd;\n    border-radius: 4px;\n    font-size: 0.875rem;\n    transition: all 0.3s ease;\n}\n\n.managerhelper-wrapper .field-item input:focus,\n.managerhelper-wrapper .field-item select:focus,\n.managerhelper-wrapper .field-item textarea:focus {\n    outline: none;\n    border-color: #2196f3;\n    box-shadow: 0 0 0 2px rgba(33, 150, 243, 0.2);\n}\n\n.managerhelper-wrapper .field-help {\n    font-size: 0.8rem;\n    color: rgba(0, 0, 0, 0.54);\n    margin-top: 0.25rem;\n}\n\n.managerhelper-wrapper .field-required {\n    color: #f44336;\n}\n\n/* Form validation */\n.managerhelper-wrapper .form-control.invalid {\n    border-color: #f44336;\n    background-color: #ffebee;\n}\n\n.managerhelper-wrapper .form-control.valid {\n    border-color: #4caf50;\n    background-color: #f1f8e9;\n}\n\n.managerhelper-wrapper .validation-message {\n    font-size: 0.8rem;\n    margin-top: 0.25rem;\n}\n\n.managerhelper-wrapper .validation-message.error {\n    color: #f44336;\n}\n\n.managerhelper-wrapper .validation-message.success {\n    color: #4caf50;\n}\n\n/* Operation Type Badges */\n.managerhelper-wrapper .operation-badge {\n    display: inline-block;\n    padding: 0.25rem 0.5rem;\n    font-size: 0.75rem;\n    font-weight: 500;\n    border-radius: 0.25rem;\n    text-transform: uppercase;\n    margin-right: 0.5rem;\n}\n\n.managerhelper-wrapper .badge-table {\n    background-color: #e3f2fd;\n    color: #1565c0;\n}\n\n.managerhelper-wrapper .badge-file {\n    background-color: #fff3e0;\n    color: #ef6c00;\n}\n\n/* Responsive Design */\n@media (max-width: 768px) {\n    .managerhelper-wrapper .field-row {\n        grid-template-columns: 1fr;\n    }\n    \n    .managerhelper-wrapper .operations-header {\n        flex-direction: column;\n        gap: 1rem;\n        align-items: stretch;\n    }\n    \n    .managerhelper-wrapper .operations-actions {\n        justify-content: center;\n    }\n    \n    .managerhelper-wrapper .operation-item {\n        flex-direction: column;\n        gap: 1rem;\n    }\n    \n    .managerhelper-wrapper .operation-actions {\n        margin-left: 0;\n        justify-content: flex-end;\n    }\n    \n    .managerhelper-wrapper .operations-list-container {\n        margin: 0 1rem 1rem;\n    }\n}\n"
  },
  {
    "path": "lakehouse_engine_usage/managerhelper/styles-mkdocs.css",
    "content": "/* MkDocs-scoped styles for Manager Helper with Material Design theme */\n.managerhelper-wrapper {\n    font-family: 'Roboto', 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;\n    line-height: 1.6;\n    color: #333;\n    margin: 0 -24px;\n    padding: 0;\n}\n\n.managerhelper-wrapper * {\n    box-sizing: border-box;\n}\n\n/* Header */\n.managerhelper-wrapper .header {\n    text-align: center;\n    margin-bottom: 0;\n    padding: 2rem 1rem;\n    background: #2196f3;\n    color: white;\n    box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);\n}\n\n.managerhelper-wrapper .logo {\n    display: flex;\n    align-items: center;\n    justify-content: center;\n    gap: 15px;\n    margin-bottom: 10px;\n}\n\n.managerhelper-wrapper .logo i {\n    font-size: 2.5rem;\n}\n\n.managerhelper-wrapper .header h1 {\n    font-size: 2rem;\n    font-weight: 500;\n    margin: 0;\n}\n\n.managerhelper-wrapper .subtitle {\n    font-size: 1rem;\n    opacity: 0.9;\n    margin-top: 10px;\n}\n\n/* Navigation Tabs */\n.managerhelper-wrapper .tabs {\n    display: flex;\n    gap: 0;\n    margin-bottom: 0;\n    border-bottom: 2px solid #e0e0e0;\n    overflow-x: auto;\n    background: #fafafa;\n    padding: 0 2rem;\n}\n\n.managerhelper-wrapper .tab-button {\n    display: flex;\n    align-items: center;\n    gap: 8px;\n    padding: 14px 24px;\n    border: none;\n    background: transparent;\n    cursor: pointer;\n    font-size: 0.875rem;\n    color: rgba(0, 0, 0, 0.6);\n    border-bottom: 2px solid transparent;\n    transition: all 0.3s ease;\n    white-space: nowrap;\n    font-weight: 500;\n}\n\n.managerhelper-wrapper .tab-button:hover {\n    color: rgba(0, 0, 0, 0.87);\n    background: rgba(33, 150, 243, 0.08);\n}\n\n.managerhelper-wrapper .tab-button.active {\n    color: #2196f3;\n    border-bottom-color: #ffd54f;\n    background: white;\n}\n\n.managerhelper-wrapper .tab-button i {\n    font-size: 1rem;\n}\n\n/* Form Container */\n.managerhelper-wrapper .operations-container {\n    padding: 2rem;\n}\n\n.managerhelper-wrapper .tab-content {\n    display: none;\n    animation: fadeIn 0.3s ease-in;\n}\n\n.managerhelper-wrapper .tab-content.active {\n    display: block;\n}\n\n@keyframes fadeIn {\n    from { opacity: 0; transform: translateY(10px); }\n    to { opacity: 1; transform: translateY(0); }\n}\n\n.managerhelper-wrapper .section {\n    background: #fafafa;\n    padding: 2rem;\n    border-radius: 4px;\n    box-shadow: 0 1px 3px rgba(0, 0, 0, 0.12);\n    border: 1px solid #e0e0e0;\n}\n\n.managerhelper-wrapper .section h2 {\n    display: flex;\n    align-items: center;\n    gap: 10px;\n    color: rgba(0, 0, 0, 0.87);\n    margin-bottom: 1.5rem;\n    font-size: 1.5rem;\n    font-weight: 500;\n}\n\n.managerhelper-wrapper .section h2 i {\n    color: #2196f3;\n}\n\n/* Form Groups */\n.managerhelper-wrapper .form-group {\n    margin-bottom: 1.5rem;\n}\n\n.managerhelper-wrapper .form-group label {\n    display: block;\n    margin-bottom: 0.5rem;\n    font-weight: 500;\n    color: rgba(0, 0, 0, 0.87);\n}\n\n.managerhelper-wrapper .form-control {\n    width: 100%;\n    padding: 12px 15px;\n    border: 1px solid #bdbdbd;\n    border-radius: 4px;\n    font-size: 0.875rem;\n    transition: all 0.3s ease;\n    background: white;\n}\n\n.managerhelper-wrapper .form-control:focus {\n    outline: none;\n    border-color: #2196f3;\n    box-shadow: 0 0 0 2px rgba(33, 150, 243, 0.2);\n}\n\n.managerhelper-wrapper textarea.form-control {\n    resize: vertical;\n    font-family: 'Roboto Mono', 'Fira Code', monospace;\n    line-height: 
1.5;\n}\n\n.managerhelper-wrapper .help-text {\n    display: block;\n    margin-top: 0.25rem;\n    color: #6c757d;\n    font-size: 0.8rem;\n}\n\n/* Dynamic Fields */\n.managerhelper-wrapper .dynamic-fields {\n    min-height: 200px;\n}\n\n.managerhelper-wrapper .no-operation-selected {\n    text-align: center;\n    padding: 3rem 2rem;\n    color: #757575;\n}\n\n.managerhelper-wrapper .no-operation-selected i {\n    font-size: 3rem;\n    margin-bottom: 1rem;\n    opacity: 0.5;\n}\n\n/* Actions */\n.managerhelper-wrapper .actions {\n    display: flex;\n    gap: 15px;\n    margin-bottom: 2rem;\n    flex-wrap: wrap;\n    justify-content: center;\n    padding: 1rem 2rem;\n}\n\n.managerhelper-wrapper .btn {\n    display: inline-flex;\n    align-items: center;\n    gap: 8px;\n    padding: 12px 20px;\n    border: none;\n    border-radius: 4px;\n    font-size: 0.875rem;\n    cursor: pointer;\n    transition: all 0.3s ease;\n    text-decoration: none;\n    font-weight: 500;\n}\n\n.managerhelper-wrapper .btn:disabled {\n    opacity: 0.6;\n    cursor: not-allowed;\n}\n\n.managerhelper-wrapper .btn-primary {\n    background: #2196f3;\n    color: white;\n}\n\n.managerhelper-wrapper .btn-primary:hover:not(:disabled) {\n    background: #1976d2;\n    transform: translateY(-1px);\n    box-shadow: 0 2px 8px rgba(33, 150, 243, 0.4);\n}\n\n.managerhelper-wrapper .btn-secondary {\n    background: #757575;\n    color: white;\n}\n\n.managerhelper-wrapper .btn-secondary:hover:not(:disabled) {\n    background: #616161;\n    transform: translateY(-1px);\n}\n\n.managerhelper-wrapper .btn-outline {\n    background: transparent;\n    border: 2px solid #f44336;\n    color: #f44336;\n}\n\n.managerhelper-wrapper .btn-outline:hover {\n    background: #f44336;\n    color: white;\n}\n\n.managerhelper-wrapper .btn-sm {\n    padding: 6px 12px;\n    font-size: 0.8rem;\n}\n\n/* Output Container */\n.managerhelper-wrapper .output-container {\n    background: #263238;\n    border-radius: 4px;\n    overflow: hidden;\n    box-shadow: 0 2px 4px rgba(0, 0, 0, 0.2);\n    margin: 0 2rem 2rem;\n}\n\n.managerhelper-wrapper .output-header {\n    display: flex;\n    justify-content: space-between;\n    align-items: center;\n    padding: 1rem 1.5rem;\n    background: #37474f;\n    color: white;\n    border-bottom: 1px solid #455a64;\n}\n\n.managerhelper-wrapper .output-header h3 {\n    display: flex;\n    align-items: center;\n    gap: 10px;\n    margin: 0;\n}\n\n.managerhelper-wrapper .output-actions {\n    display: flex;\n    gap: 10px;\n}\n\n.managerhelper-wrapper .json-output {\n    background: #263238;\n    color: #eceff1;\n    padding: 1.5rem;\n    margin: 0;\n    font-family: 'Roboto Mono', 'Fira Code', 'Courier New', monospace;\n    font-size: 0.8rem;\n    line-height: 1.5;\n    overflow-x: auto;\n    min-height: 200px;\n    white-space: pre-wrap;\n}\n\n.managerhelper-wrapper .json-output:empty::before {\n    content: 'Generated JSON configuration will appear here...';\n    color: #90a4ae;\n    font-style: italic;\n}\n\n/* Validation Result */\n.managerhelper-wrapper .validation-result {\n    padding: 1rem 1.5rem;\n    font-weight: 500;\n    display: none;\n}\n\n.managerhelper-wrapper .validation-result.valid {\n    background: #1b5e20;\n    color: #81c784;\n    display: block;\n}\n\n.managerhelper-wrapper .validation-result.invalid {\n    background: #b71c1c;\n    color: #ef5350;\n    display: block;\n}\n\n/* Loading Spinner */\n.managerhelper-wrapper .loading {\n    position: fixed;\n    top: 0;\n    left: 0;\n    width: 100%;\n    
height: 100%;\n    background: rgba(0, 0, 0, 0.8);\n    display: flex;\n    flex-direction: column;\n    justify-content: center;\n    align-items: center;\n    z-index: 2000;\n    color: white;\n}\n\n.managerhelper-wrapper .spinner {\n    width: 50px;\n    height: 50px;\n    border: 5px solid rgba(255, 255, 255, 0.3);\n    border-top: 5px solid #ffd54f;\n    border-radius: 50%;\n    animation: spin 1s linear infinite;\n    margin-bottom: 1rem;\n}\n\n@keyframes spin {\n    0% { transform: rotate(0deg); }\n    100% { transform: rotate(360deg); }\n}\n\n/* Toast Notification */\n.managerhelper-wrapper .toast {\n    position: fixed;\n    top: 80px;\n    right: 20px;\n    background: #4caf50;\n    color: white;\n    padding: 1rem 1.5rem;\n    border-radius: 4px;\n    box-shadow: 0 2px 8px rgba(0, 0, 0, 0.3);\n    transform: translateX(400px);\n    transition: transform 0.3s ease;\n    z-index: 2001;\n}\n\n.managerhelper-wrapper .toast.show {\n    transform: translateX(0);\n}\n\n.managerhelper-wrapper .toast.error {\n    background: #f44336;\n}\n\n/* Responsive Design */\n@media (max-width: 768px) {\n    .managerhelper-wrapper {\n        margin: 0 -16px;\n    }\n    \n    .managerhelper-wrapper .header h1 {\n        font-size: 1.5rem;\n    }\n    \n    .managerhelper-wrapper .tabs {\n        padding: 0 1rem;\n    }\n    \n    .managerhelper-wrapper .operations-container {\n        padding: 1rem;\n    }\n    \n    .managerhelper-wrapper .section {\n        padding: 1rem;\n    }\n    \n    .managerhelper-wrapper .actions {\n        flex-direction: column;\n        padding: 1rem;\n    }\n    \n    .managerhelper-wrapper .output-container {\n        margin: 0 1rem 1rem;\n    }\n}\n"
  },
  {
    "path": "lakehouse_engine_usage/reconciliator/__init__.py",
    "content": "\"\"\"\n.. include::reconciliator.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/reconciliator/reconciliator.md",
    "content": "# Reconciliator\n\nChecking if data reconciles, using this algorithm, is a matter of reading the **truth** data and the **current** data.\nYou can use any input specification compatible with the lakehouse engine to read **truth** or **current** data. On top\nof that, you can pass a `truth_preprocess_query` and a `current_preprocess_query` so you can preprocess the data before\nit goes into the actual reconciliation process. The reconciliation process is focused on joining **truth**\nwith `current` by all provided columns except the ones passed as `metrics`.\n\nIn the table below, we present how a simple reconciliation would look like:\n\n| current_country | current_count | truth_country | truth_count | absolute_diff | perc_diff | yellow | red | recon_type |\n|-----------------|---------------|---------------|-------------|---------------|-----------|--------|-----|------------|\n| Sweden          | 123           | Sweden        | 120         | 3             | 0.025     | 0.1    | 0.2 | percentage |\n| Germany         | 2946          | Sweden        | 2946        | 0             | 0         | 0.1    | 0.2 | percentage |\n| France          | 2901          | France        | 2901        | 0             | 0         | 0.1    | 0.2 | percentage |\n| Belgium         | 426           | Belgium       | 425         | 1             | 0.002     | 0.1    | 0.2 | percentage |\n\nThe Reconciliator algorithm uses an ACON to configure its execution. You can find the meaning of each ACON property\nin [ReconciliatorSpec object](../../reference/packages/core/definitions.md#packages.core.definitions.ReconciliatorSpec).\n\nBelow there is an example of usage of reconciliator.\n```python\nfrom lakehouse_engine.engine import execute_reconciliation\n\ntruth_query = \"\"\"\n  SELECT\n    shipping_city,\n    sum(sales_order_qty) as qty,\n    order_date_header\n  FROM (\n    SELECT\n      ROW_NUMBER() OVER (\n        PARTITION BY sales_order_header, sales_order_schedule, sales_order_item, shipping_city\n        ORDER BY changed_on desc\n      ) as rank1,\n      sales_order_header,\n      sales_order_item,\n      sales_order_qty,\n      order_date_header,\n      shipping_city\n    FROM truth -- truth is a locally accessible temp view created by the lakehouse engine\n    WHERE order_date_header = '2021-10-01'\n  ) a\nWHERE a.rank1 = 1\nGROUP BY a.shipping_city, a.order_date_header\n\"\"\"\n\ncurrent_query = \"\"\"\n  SELECT\n    shipping_city,\n    sum(sales_order_qty) as qty,\n    order_date_header\n  FROM (\n    SELECT\n      ROW_NUMBER() OVER (\n        PARTITION BY sales_order_header, sales_order_schedule, sales_order_item, shipping_city\n        ORDER BY changed_on desc\n      ) as rank1,\n      sales_order_header,\n      sales_order_item,\n      sales_order_qty,\n      order_date_header,\n      shipping_city\n    FROM current -- current is a locally accessible temp view created by the lakehouse engine\n    WHERE order_date_header = '2021-10-01'\n  ) a\nWHERE a.rank1 = 1\nGROUP BY a.shipping_city, a.order_date_header\n\"\"\"\n\nacon = {\n    \"metrics\": [{\"metric\": \"qty\", \"type\": \"percentage\", \"aggregation\": \"avg\", \"yellow\": 0.05, \"red\": 0.1}],\n    \"truth_input_spec\": {\n        \"spec_id\": \"truth\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"csv\",\n        \"schema_path\": \"s3://my_data_product_bucket/artefacts/metadata/schemas/bronze/orders.json\",\n        \"options\": {\n            \"delimiter\": \"^\",\n            \"dateFormat\": \"yyyyMMdd\",\n       
 },\n        \"location\": \"s3://my_data_product_bucket/bronze/orders\",\n    },\n    \"truth_preprocess_query\": truth_query,\n    \"current_input_spec\": {\n        \"spec_id\": \"current\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"delta\",\n        \"db_table\": \"my_database.orders\",\n    },\n    \"current_preprocess_query\": current_query,\n}\n\nexecute_reconciliation(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensor/__init__.py",
    "content": "\"\"\"\n.. include::sensor.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensor/delta_table/__init__.py",
    "content": "\"\"\"\n.. include::delta_table.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensor/delta_table/delta_table.md",
    "content": "# Sensor from Delta Table\n\nThis shows how to create a **Sensor to detect new data from a Delta Table**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the\n  sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom query should be created with the source table as `sensor_new_data`.\n    If you want to view some examples of usage you can visit the [delta upstream sensor table](../delta_upstream_sensor_table/delta_upstream_sensor_table.md) or the [jdbc sensor](../jdbc_table/jdbc_table.md).\n\n- **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n- **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\nthere is no new data detected from upstream.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios\n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and **SUGGESTED** behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nData will be consumed from a delta table in streaming mode,\nso if there is any new data it will give condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"delta\",\n        \"db_table\": \"upstream_database.source_delta_table\",\n        \"options\": {\n            \"readChangeFeed\": \"true\", # to read changes in upstream table\n        },\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it \nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensor/delta_upstream_sensor_table/__init__.py",
    "content": "\"\"\"\n.. include::delta_upstream_sensor_table.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensor/delta_upstream_sensor_table/delta_upstream_sensor_table.md",
    "content": "# Sensor from other Sensor Delta Table\n\nThis shows how to create a **Sensor to detect new data from another Sensor Delta Table**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the\n  sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom query should be created with the source table as `sensor_new_data`.\n\n- **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n- **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\nthere is no new data detected from upstream.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios\n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nIt makes use of `generate_sensor_query` to generate the `preprocess_query`,\ndifferent from [delta_table](../delta_table/delta_table.md).\n\nData from other sensor delta table, in streaming mode, will be consumed. If there is any new data it will trigger \nthe condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor, generate_sensor_query\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"delta\",\n        \"db_table\": \"upstream_database.lakehouse_engine_sensors\",\n        \"options\": {\n            \"readChangeFeed\": \"true\",\n        },\n    },\n    \"preprocess_query\": generate_sensor_query(\"UPSTREAM_SENSOR_ID\"),\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensor/file/__init__.py",
    "content": "\"\"\"\n.. include::file.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensor/file/file.md",
    "content": "# Sensor from Files\n\nThis shows how to create a **Sensor to detect new data from a File Location**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom query should be created with the source table as `sensor_new_data`.\n\n- **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n- **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\nthere is no new data detected from upstream.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios \n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nUsing these sensors and consuming the data in streaming mode, if any new file is added to the file location, \nit will automatically trigger the proceeding task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"csv\",  # You can use any of the data formats supported by the lakehouse engine, e.g: \"avro|json|parquet|csv|delta|cloudfiles\"\n        \"location\": \"s3://my_data_product_bucket/path\",\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\nacquired_data = execute_sensor(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensor/jdbc_table/__init__.py",
    "content": "\"\"\"\n.. include::jdbc_table.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensor/jdbc_table/jdbc_table.md",
    "content": "# Sensor from JDBC\n\nThis shows how to create a **Sensor to detect new data from a JDBC table**.\n\n## Configuration required to have a Sensor\n\n- **jdbc_args**: Arguments of the JDBC upstream.\n- **generate_sensor_query**: Generates a Sensor query to consume data from the upstream, this function can be used on `preprocess_query` ACON option.\n    - **sensor_id**: The unique identifier for the Sensor.\n    - **filter_exp**: Expression to filter incoming new data.\n      A placeholder `?upstream_key` and `?upstream_value` can be used, example: `?upstream_key > ?upstream_value` so that it can be replaced by the respective values from the sensor `control_db_table_name` for this specific sensor_id.\n    - **control_db_table_name**: Sensor control table name.\n    - **upstream_key**: the key of custom sensor information to control how to identify new data from the upstream (e.g., a time column in the upstream).\n    - **upstream_value**: the **first** upstream value to identify new data from the upstream (e.g., the value of a time present in the upstream). ***Note:*** This parameter will have effect just in the first run to detect if the upstream have new data. If it's empty the default value applied is `-2147483647`.\n    - **upstream_table_name**: Table name to consume the upstream value. If it's empty the default value applied is `sensor_new_data`.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios \n\nThis covers the following scenarios of using the Sensor:\n\n1. [Generic JDBC template with `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [Generic JDBC template with `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nData from JDBC, in batch mode, will be consumed. If there is new data based in the preprocess query from the source table, it will trigger the condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor, generate_sensor_query\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"jdbc\",\n        \"jdbc_args\": {\n            \"url\": \"JDBC_URL\",\n            \"table\": \"JDBC_DB_TABLE\",\n            \"properties\": {\n                \"user\": \"JDBC_USERNAME\",\n                \"password\": \"JDBC_PWD\",\n                \"driver\": \"JDBC_DRIVER\",\n            },\n        },\n        \"options\": {\n            \"compress\": True,\n        },\n    },\n    \"preprocess_query\": generate_sensor_query(\n        sensor_id=\"MY_SENSOR_ID\",\n        filter_exp=\"?upstream_key > '?upstream_value'\",\n        control_db_table_name=\"my_database.lakehouse_engine_sensors\",\n        upstream_key=\"UPSTREAM_COLUMN_TO_IDENTIFY_NEW_DATA\",\n    ),\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. 
This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensor/kafka/__init__.py",
    "content": "\"\"\"\n.. include::kafka.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensor/kafka/kafka.md",
    "content": "# Sensor from Kafka\n\nThis shows how to create a **Sensor to detect new data from Kafka**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the\n  sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom query should be created with the source table as `sensor_new_data`.\n\n- **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n- **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\nthere is no new data detected from upstream.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios\n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nData from Kafka, in streaming mode, will be consumed, so if there is any new data in the kafka topic it will give condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"kafka\",\n        \"options\": {\n            \"kafka.bootstrap.servers\": \"KAFKA_SERVER\",\n            \"subscribe\": \"KAFKA_TOPIC\",\n            \"startingOffsets\": \"earliest\",\n            \"kafka.security.protocol\": \"SSL\",\n            \"kafka.ssl.truststore.location\": \"TRUSTSTORE_LOCATION\",\n            \"kafka.ssl.truststore.password\": \"TRUSTSTORE_PWD\",\n            \"kafka.ssl.keystore.location\": \"KEYSTORE_LOCATION\",\n            \"kafka.ssl.keystore.password\": \"KEYSTORE_PWD\",\n        },\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensor/sap_bw_b4/__init__.py",
    "content": "\"\"\"\n.. include::sap_bw_b4.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensor/sap_bw_b4/sap_bw_b4.md",
    "content": "# Sensor from SAP\n\nThis shows how to create a **Sensor to detect new data from a SAP LOGCHAIN table**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the\n  sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom\n    query should be created with the source table as `sensor_new_data`.\n\n    - **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n    - **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\n    there is no new data detected from upstream.\n\nSpecific configuration required to have a Sensor consuming a SAP BW/B4 upstream.\nThe Lakehouse Engine provides two utility functions to make easier to consume SAP as upstream:\n`generate_sensor_sap_logchain_query` and `generate_sensor_query`.\n\n- **generate_sensor_sap_logchain_query**: This function aims\n  to create a temporary table with timestamp from the SAP LOGCHAIN table, which is a process control table.\n  \n    !!! note\n        this temporary table only lives during runtime, and it is related with the\n        sap process control table but has no relationship or effect on the sensor control table.\n    \n        - **chain_id**: SAP Chain ID process.\n        - **dbtable**: SAP LOGCHAIN db table name, default: `my_database.RSPCLOGCHAIN`.\n        - **status**: SAP Chain Status of your process, default: `G`.\n        - **engine_table_name**: Name of the temporary table created from the upstream data, \n        default: `sensor_new_data`.\n        This temporary table will be used as source in the `query` option.\n\n  - **generate_sensor_query**: Generates a Sensor query to consume data from the temporary table created in the `prepareQuery`.\n      - **sensor_id**: The unique identifier for the Sensor.\n      - **filter_exp**: Expression to filter incoming new data.\n        A placeholder `?upstream_key` and `?upstream_value` can be used, example: `?upstream_key > ?upstream_value`\n        so that it can be replaced by the respective values from the sensor `control_db_table_name`\n        for this specific sensor_id.\n      - **control_db_table_name**: Sensor control table name.\n      - **upstream_key**: the key of custom sensor information to control how to identify\n        new data from the upstream (e.g., a time column in the upstream).\n      - **upstream_value**: the **first** upstream value to identify new data from the\n        upstream (e.g., the value of a time present in the upstream).\n        .. note:: This parameter will have effect just in the first run to detect if the upstream have new data. If it's empty the default value applied is `-2147483647`.\n      - **upstream_table_name**: Table name to consume the upstream value.\n        If it's empty the default value applied is `sensor_new_data`.\n        .. 
note:: In case of using the `generate_sensor_sap_logchain_query` the default value for the temp table is `sensor_new_data`, so if passing a different value in the `engine_table_name` this parameter should have the same value.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios\n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nData from SAP, in streaming mode, will be consumed, so if there is any new data in the kafka topic it will give condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor, generate_sensor_query, generate_sensor_sap_logchain_query\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"jdbc\",\n        \"options\": {\n            \"compress\": True,\n            \"driver\": \"JDBC_DRIVER\",\n            \"url\": \"JDBC_URL\",\n            \"user\": \"JDBC_USERNAME\",\n            \"password\": \"JDBC_PWD\",\n            \"prepareQuery\": generate_sensor_sap_logchain_query(chain_id=\"CHAIN_ID\", dbtable=\"JDBC_DB_TABLE\"),\n            \"query\": generate_sensor_query(\n                sensor_id=\"MY_SENSOR_ID\",\n                filter_exp=\"?upstream_key > '?upstream_value'\",\n                control_db_table_name=\"my_database.lakehouse_engine_sensors\",\n                upstream_key=\"UPSTREAM_COLUMN_TO_IDENTIFY_NEW_DATA\",\n            ),\n        },\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```\n\n"
  },
  {
    "path": "lakehouse_engine_usage/sensor/sensor.md",
    "content": "# Sensor\n\n## What is it?\n\nThe lakehouse engine sensors are an abstraction to otherwise complex spark code that can be executed in very small\nsingle-node clusters to check if an upstream system or data product contains new data since the last execution of our\njob. With this feature, we can trigger a job to run in more frequent intervals and if the upstream does not contain new\ndata, then the rest of the job exits without creating bigger clusters to execute more intensive data ETL (Extraction,\nTransformation, and Loading).\n\n## How do Sensor-based jobs work?\n\n<img src=\"../../assets/img/sensor_os.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\nWith the sensors capability, data products in the lakehouse can sense if another data product or an upstream system (source\nsystem) have new data since the last successful job. We accomplish this through the approach illustrated above, which\ncan be interpreted as follows:\n\n1. A Data Product can check if Kafka, JDBC or any other Lakehouse Engine Sensors supported sources, contains new data using the respective sensors;\n2. The Sensor task may run in a very tiny single-node cluster to ensure cost\n   efficiency ([check sensor cost efficiency](#are-sensor-based-jobs-cost-efficient));\n3. If the sensor has recognised that there is new data in the upstream, then you can start a different ETL Job Cluster\n   to process all the ETL tasks (data processing tasks).\n4. In the same way, a different Data Product can sense if an upstream Data Product has new data by using 1 of 2 options:\n    1. **(Preferred)** Sense the upstream Data Product sensor control delta table;\n    2. Sense the upstream Data Product data files in s3 (files sensor) or any of their delta tables (delta table\n       sensor);\n\n## The Structure and Relevance of the Data Product’s Sensors Control Table\n\nThe concept of a lakehouse engine sensor is based on a special delta table stored inside the data product that chooses\nto opt in for a sensor-based job. That table is used to control the status of the various sensors implemented by that\ndata product. 
You can refer to the below table to understand the sensor delta table structure:\n\n| Column Name                 | Type           | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |\n|-----------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| **sensor_id**               | STRING         | A unique identifier of the sensor in a specific job. This unique identifier is really important because it is used by the engine to identify if there is new data in the upstream.<br />Each sensor in each job should have a different sensor_id.<br />If you attempt to create 2 sensors with the same sensor_id, the engine will fail.                                                                                                                                                                                                                                                              |\n| **assets**                  | ARRAY\\<STRING> | A list of assets (e.g., tables or dataset folder) that are considered as available to consume downstream after the sensor has status *PROCESSED_NEW_DATA*.                                                                                                                                                                                                                                                                                                                                                                                                                                             |\n| **status**                  | STRING         | Status of the sensor. Can either be:<br /><ul><li>*ACQUIRED_NEW_DATA* – when the sensor in a job has recognised that there is new data from the upstream but, the job where the sensor is, was still not successfully executed.</li><li>*PROCESSED_NEW_DATA* - when the job where the sensor is located has processed all the tasks in that job.</li></ul>                                                                                                                                                                                                                                             |\n| **status_change_timestamp** | STRING         | Timestamp when the status has changed for the last time.                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                 |\n| **checkpoint_location**     | STRING         | Base location of the Spark streaming checkpoint location, when applicable (i.e., when the type of sensor uses Spark streaming checkpoints to identify if the upstream has new data). E.g. Spark streaming checkpoints are used for Kafka, Delta and File sensors.                                                                                                                                                                                                                                                                                                                                      |\n| **upstream_key**            | STRING         | Upstream key (e.g., used to store an attribute name from the upstream so that new data can be detected automatically).<br />This is useful for sensors that do not rely on Spark streaming checkpoints, like the JDBC sensor, as it stores the name of a field in the JDBC upstream that contains the values that will allow us to identify new data (e.g., a timestamp in the upstream that tells us when the record was loaded into the database).                                                                                                                                                   |\n| **upstream_value**          | STRING         | Upstream value (e.g., used to store the max attribute value from the upstream so that new data can be detected automatically). This is the value for upstream_key. <br />This is useful for sensors that do not rely on Spark streaming checkpoints, like the JDBC sensor, as it stores the value of a field in the JDBC upstream that contains the maximum value that was processed by the sensor, and therefore useful for recognizing that there is new data in the upstream (e.g., the value of a timestamp attribute in the upstream that tells us when the record was loaded into the database). |\n\n!!! note\n    To make use of the sensors you will need to add this table to your data product.\n\n## How is it different from scheduled jobs?\n\nSensor-based jobs are still scheduled, but they can be scheduled with higher frequency, as they are more cost-efficient\nthan ramping up a multi-node cluster supposed to do heavy ETL, only to figure out that the upstream does not have new\ndata.\n\n## Are sensor-based jobs cost-efficient?\n\nFor the same schedule (e.g., 4 times a day), sensor-based jobs are more cost-efficient than scheduling a regular job, because with sensor-based jobs you can start a **very tiny single-node cluster**, and only if there is new data in the upstream the bigger ETL cluster is spin up. For this reason, they are considered more cost-efficient.\nMoreover, if you have very hard SLAs to comply with, you can also play with alternative architectures where you can have several sensors in a continuous (always running) cluster, which then keeps triggering the respective data processing jobs, whenever there is new data.\n\n\n## Sensor Steps\n\n1. Create your sensor task for the upstream source. 
\n    - [Delta Table](delta_table/delta_table.md)\n    - [Delta Upstream Sensor Table](delta_upstream_sensor_table/delta_upstream_sensor_table.md)\n    - [File](file/file.md)\n    - [JDBC](jdbc_table/jdbc_table.md)\n    - [Kafka](kafka/kafka.md)\n    - [SAP BW/B4](sap_bw_b4/sap_bw_b4.md)\n2. Setup/Execute your ETL task based on the Sensor Condition, as shown in the sketch below\n3. Update the Sensor Control table status with the [Update Sensor Status](update_sensor_status/update_sensor_status.md)
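\n\nAs an illustrative sketch only (the ACON below is abbreviated and the source-specific `input_spec` is omitted, so adapt it to your upstream using the pages listed above), these three steps typically map to tasks of the same job:\n\n```python\nfrom lakehouse_engine.engine import execute_sensor, update_sensor_status\n\nsensor_acon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    # \"input_spec\": {...},  # depends on the upstream source (see the pages listed above)\n    \"fail_on_empty_result\": False,\n}\n\n# 1. Sensor task: check if the upstream has new data.\nif execute_sensor(acon=sensor_acon):\n    # 2. ETL task(s): process the new data with your own logic.\n    ...\n    # 3. Update the sensor control table after successfully processing the data.\n    update_sensor_status(\n        sensor_id=\"MY_SENSOR_ID\",\n        control_db_table_name=\"my_database.lakehouse_engine_sensors\",\n        status=\"PROCESSED_NEW_DATA\",\n        assets=[\"MY_SENSOR_ASSETS\"],\n    )\n```"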
  },
  {
    "path": "lakehouse_engine_usage/sensor/update_sensor_status/__init__.py",
    "content": "\"\"\"\n.. include::update_sensor_status.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensor/update_sensor_status/update_sensor_status.md",
    "content": "# Update Sensor control delta table after processing the data\n\nThis shows how to **update the status of your Sensor after processing the new data**.\n\nHere is an example on how to update the status of your sensor in the Sensors Control Table:\n```python\nfrom lakehouse_engine.engine import update_sensor_status\n\nupdate_sensor_status(\n    sensor_id=\"MY_SENSOR_ID\",\n    control_db_table_name=\"my_database.lakehouse_engine_sensors\",\n    status=\"PROCESSED_NEW_DATA\",\n    assets=[\"MY_SENSOR_ASSETS\"]\n)\n```\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec)."
  },
  {
    "path": "lakehouse_engine_usage/sensors/__init__.py",
    "content": "\"\"\"\n.. include::sensors.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/__init__.py",
    "content": "\"\"\"\n.. include::heartbeat.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/delta_table/__init__.py",
    "content": "\"\"\"\n.. include::delta_table.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/delta_table/delta_table.md",
    "content": "# Heartbeat Sensor for Delta Table\n\nThis shows how to create a Heartbeat Sensor Orchestrator to detect new data from a \nDelta Table and trigger Databricks Workflows related to them.\n\n## Configuration required to create an orchestration task for the delta table source\n\n- **sensor_source**: Set to `delta_table` in the Heartbeat Control Table to identify this as a Delta table source.\n- **data_format**: Set to `delta` to specify the data format for reading Delta tables.\n- **heartbeat_sensor_db_table**: Database table name for the Heartbeat sensor control table (e.g., `my_database.heartbeat_sensor`).\n- **lakehouse_engine_sensor_db_table**: Database table name for the lakehouse engine sensors (e.g., `my_database.lakehouse_engine_sensors`).\n- **options**: Configuration options for Delta table reading:\n    - `readChangeFeed`: Set to `\"true\"` to enable change data feed reading.\n- **base_checkpoint_location**: `S3` path for storing checkpoint data (required if `sensor_read_type` is `streaming`).\n- **domain**: Databricks workflows domain for job triggering.\n- **token**: Databricks workflows token for authentication.\n\n### Delta Table Data Feed CSV Configuration Entry\n\nTo check how the entry for a Delta table source should look in the Heartbeat Control Table, [check it here](../heartbeat.md#heartbeat-sensor-control-table-reference-records).\n\n## Code sample of listener and trigger\n\n```python\nfrom lakehouse_engine.engine import (\n    execute_sensor_heartbeat,\n    trigger_heartbeat_sensor_jobs,\n)\n\n# Create an ACON dictionary for all delta table source entries.\n# This ACON dictionary is useful for passing parameters to heartbeat sensors.\n\nheartbeat_sensor_config_acon = {\n    \"sensor_source\": \"delta_table\",\n    \"data_format\": \"delta\",\n    \"heartbeat_sensor_db_table\": \"my_database.heartbeat_sensor\",\n    \"lakehouse_engine_sensor_db_table\": \"my_database.lakehouse_engine_sensors\",\n    \"options\": {\n        \"readChangeFeed\": \"true\",\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"domain\": \"DATABRICKS_WORKFLOWS_DOMAIN\",\n    \"token\": \"DATABRICKS_WORKFLOWS_TOKEN\",\n}\n\n# Execute Heartbeat sensor and trigger jobs which have acquired new data. \nexecute_sensor_heartbeat(acon=heartbeat_sensor_config_acon)\ntrigger_heartbeat_sensor_jobs(heartbeat_sensor_config_acon)\n```\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/heartbeat.md",
    "content": "# Heartbeat Sensor\n\n## What is it?\n\nThe Heartbeat Sensor is a robust, configurable system designed to continuously monitor \nupstream systems for new data. It enhances the existing sensor infrastructure by addressing \nkey limitations and providing significant improvements:\n\n**Previous Sensor Architecture Limitations:**\n\n- Required individual sensor configurations for each data source.\n- Limited scalability when monitoring multiple upstream systems.\n- Manual job triggering and dependency management.\n- No centralized control or monitoring of sensor status.\n- Difficult to manage complex multi-source dependencies.\n\n**Heartbeat Sensor Enhancements:**\n\n- **Centralized Management**: Single control table to manage all sensor sources and their dependencies.\n- **Automated Job Orchestration**: Automatically triggers downstream Databricks jobs when new data is detected.\n- **Multi-Source Support**: Handles diverse source types (SAP, Kafka, Delta Tables, Manual Uploads, Trigger Files) in one unified system.\n- **Dependency Management**: Built-in hard/soft dependency validation before triggering jobs.\n- **Scalable Architecture**: Efficiently processes multiple sensors in parallel.\n- **Status Tracking**: Comprehensive lifecycle tracking from detection to job completion.\n\nThis provides a centralized, efficient, and automated mechanism to detect and trigger \ndownstream workflows with minimal user intervention.\n\n## How Does the Heartbeat Sensor Work?\n\n<img src=\"../../../assets/img/heartbeat_sensor_os.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\nThe Heartbeat Sensor operates on a pull-based approach using a single-node cluster that continuously monitors upstream systems. Here's how the system works:\n\n### Core Architecture Components\n\n**1. [Centralized Control Table](#control-table-schema)**\n\n- Tracks and manages all data sources and their configurations.\n- Dynamically populated by the [Heartbeat Data Feeder Job](heartbeat_sensor_data_feed/heartbeat_sensor_data_feed.md).\n- Provides structured monitoring across various upstream systems.\n\n**2. Persistent Heartbeat Job**\n\n- Runs continuously or on a user-defined schedule.\n- Supports both real-time and batch-style data monitoring.\n- Efficiently processes multiple sensors in parallel.\n\n**3. Sensor Integration Framework**\n\n- Leverages existing sensor mechanisms for event detection.\n- Creates appropriate Sensor ACONs based on source types.\n- Returns `NEW_EVENT_AVAILABLE` status when new data is detected.\n\n**4. Automated Job Orchestration**\n\n- Triggers Databricks jobs via Job Run API when conditions are met.\n- Validates dependencies before job execution.\n- Maintains comprehensive audit trail of all operations.\n\n### Operational Flow\n\n1. **Continuous Monitoring**: The heartbeat cluster continuously polls configured sensor sources.\n2. **Event Detection**: Checks each source for `NEW_EVENT_AVAILABLE` status.\n3. **Dependency Validation**: Evaluates hard/soft dependencies before triggering jobs.\n4. **Automatic Triggering**: Launches Databricks jobs when all conditions are satisfied.\n5. **Status Management**: Updates control table throughout the entire lifecycle.\n\n!!! warning \"Pull-Based Architecture\"\n    The system is designed for a \"pull\" approach, same as the Sensor solution.\n    Downstream data product sensor clusters actively check for new events from the\n    upstream. Upstream sensor clusters do not require write permissions to the downstream\n    data product system. 
The downstream system only needs read access to the upstream.\n\n### Control Table Schema\n\nThe Heartbeat Sensor Control Table is the central component that manages all sensor sources and their configurations. Below is the complete schema with detailed descriptions:\n\n| Column name | Data Type | Description | Produced/Maintained by |\n|---|---|---|---|\n| **sensor_source** | STRING | Upstream source system<ul><li>`sap_b4` - SAP 4HANA</li><li>`sap_bw` - SAP BW</li><li>`delta_table`</li><li>`lmu_delta_table` - Lakehouse Manual Upload</li><li>`kafka`</li><li>`trigger_file`</li></ul> | User/Developer |\n| **sensor_id** | STRING | Unique Upstream id or upstream reference.<ul><li>**sap_bw** or **sap_b4** source:<br />SAP Chain Id, example: `SAP_CHAIN_ID_SAP_TABLE`</li><li>**delta_table** source: Delta table name along with database name, examples: `my_database_1.my_table`; `my_database_2.my_table_2`</li><li>**lmu_delta_table** source: Lakehouse Manual Upload Delta table name along with database name, examples: `my_database.my_lmu_table`</li><li>**kafka** source: Kafka Topic name starting with <data_product_name:> prefix and then the topic name, example: `data_product_name: my_product.my.topic`.</li><li>**trigger_file** source: Asset name/folder name under which trigger file will be kept, example: `my_trigger`</li></ul> | User/Developer |\n| **sensor_read_type** | STRING | Sensor read type used to fetch new events - can be `batch` or `streaming`. | User/Developer |\n| **asset_description** | STRING | Description of the upstream source (it can be the upstream name). | User/Developer |\n| **upstream_key** | STRING | Upstream key (an attribute name from the upstream so that new data can be detected automatically), example: `load_date`.<br />This is useful for sensors that do not rely on Spark streaming checkpoints, like the JDBC sensor, as it stores the name of a field in the JDBC upstream that contains the values that will allow us to identify new data (e.g., a timestamp in the upstream that tells us when the record was loaded into the database).<br />**Note**: This attribute will be used in the `preprocess_query`, example: `SELECT * FROM sensor_new_data WHERE ?upstream_key >= current_date() - 7` will be rendered to `SELECT * FROM sensor_new_data WHERE load_date >= current_date() - 7` | User/Developer |\n| **preprocess_query** | STRING | Query to filter data returned by the upstream. **Note**: This parameter is only needed when the upstream data has to be filtered; in this case, a custom query should be created with the source table as `sensor_new_data`.<br />Example: `SELECT * FROM sensor_new_data WHERE load_date >= current_date() - 7` | User/Developer |\n| **latest_event_fetched_timestamp** | TIMESTAMP | Timestamp of the latest event fetched from the upstream source. It is updated every time a NEW EVENT is available. | lakehouse-engine |\n| **trigger_job_id** | STRING | Databricks Job Id of the downstream application. Based on this, the job gets triggered by Heartbeat once a new event is available. | User/Developer |\n| **trigger_job_name** | STRING | Databricks Job Name. | User/Developer |\n| **status** | STRING | Status of the orchestration.<br/><ul><li>`NEW_EVENT_AVAILABLE` - once a new event is found.</li><li>`IN_PROGRESS` - when the job gets triggered.</li><li>`COMPLETED` - once the job completed successfully.</li></ul> | lakehouse-engine |\n| **status_change_timestamp** | STRING | String containing the datetime when the status last changed. | lakehouse-engine |\n| **job_start_timestamp** | TIMESTAMP | Start timestamp of the downstream job. It is updated as soon as the job enters the `IN_PROGRESS` status. | lakehouse-engine |\n| **job_end_timestamp** | TIMESTAMP | End timestamp of the downstream job. It is updated as soon as the job enters the `COMPLETED` status. | lakehouse-engine |\n| **job_state** | STRING | Current state of the job in the Control Table: `PAUSED` or `UNPAUSED`. If `PAUSED`, the Sensor will **not look** for NEW EVENTS or trigger the dependent job. | User/Developer |\n| **dependency_flag** | STRING | <ul><li>TRUE - hard dependency</li><li>FALSE - soft dependency</li></ul>For a HARD dependency, all dependent jobs need to complete successfully. Jobs marked FALSE (SOFT) are ignored in the dependency check. Default: must be TRUE in case of no dependency. | User/Developer |\n\n### Control Table Reference Records\n\nThe following table shows **example records** that demonstrate how different types of sensor sources are configured in the Heartbeat Sensor Control Table. These are **sample entries** that illustrate the structure and typical values for each column across various sensor source types (Kafka, Lakehouse Manual Upload Delta Table, SAP B4, Delta Table, and Trigger File).\n\n**Purpose of these examples:**\n\n- Show real-world configuration patterns for different sensor sources.\n- Demonstrate how different statuses (`NEW_EVENT_AVAILABLE`, `IN_PROGRESS`, `null`) appear in the table.\n- Illustrate the relationship between sensor sources and their corresponding Databricks jobs.\n- Provide reference values for fields like `sensor_id`, `trigger_job_id`, and status timestamps.\n\n!!! note\n    These are illustrative examples - your actual table will contain records specific to your data sources and job configurations.\n\n| sensor_source | sensor_id | sensor_read_type | asset_description | upstream_key | preprocess_query | latest_event_fetched_timestamp | trigger_job_id | trigger_job_name | status | status_change_timestamp | job_start_timestamp | job_end_timestamp | job_state | dependency_flag |\n|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n| kafka | my_product: my.topic | streaming | My product Kafka Topic | null | null | 2025-04-23T21:40:23.768Z | 111111111 | my-product-kafka_consumer_job | IN_PROGRESS | 2025-04-23T21:40:36.88Z | 2025-04-23T21:40:36.88Z | null | UNPAUSED | TRUE |\n| lmu_delta_table | my_database.my_lmu_table | batch | My Lakehouse Manual Upload Delta Table | date | null | 2025-04-23T21:46:07.495Z | 222222222 | my-product-lmu_table_consumer_job | IN_PROGRESS | 2025-04-23T21:46:19.4Z | 2025-04-23T21:46:19.4Z | null | UNPAUSED | TRUE |\n| sap_b4 | SAP_BW_CHAIN_ID_SAP_TABLE | batch | My SAP BW Chain Process | LOAD_DATE | null | 2025-04-23T21:35:10.643Z | 333333333 | my-product-sap_bw_consumer_job | IN_PROGRESS | 2025-04-23T21:35:29.248Z | 2025-04-23T21:35:29.248Z | null | UNPAUSED | TRUE |\n| delta_table | my_database_1.my_table | streaming | My Delta Table from My Database 1 | null | null | 2025-04-23T22:11:56.384Z | 444444444 | my-product-delta_and_sap_b4_consumer_job | NEW_EVENT_AVAILABLE | 2025-04-23T22:11:56.384Z | null | null | UNPAUSED | TRUE |\n| sap_b4 | SAP_4HANA_CHAIN_ID_SAP_TABLE | batch | My SAP 4HANA Chain Process | LOAD_DATE | null | null | 444444444 | my-product-delta_and_sap_b4_consumer_job | null | null | null | null | UNPAUSED | TRUE |\n| trigger_file | my_trigger | streaming | My Trigger File | null | null | 2025-04-23T22:07:28.668Z | 555555555 | my-product-trigger_file_consumer_job | IN_PROGRESS | 2025-04-23T22:07:39.865Z | 2025-04-23T22:07:39.865Z | null | UNPAUSED | TRUE |
\n\n\n## How to Implement the Heartbeat Sensor\n\nThis step-by-step guide helps you set up, configure, and operate the Heartbeat Sensor system, from initial setup to ongoing monitoring and troubleshooting.\n\n### Phase 1: Initial Setup and Configuration\n\n#### Step 1: Define Your Data Source Configurations\n\nCreate a CSV file containing your data source configurations with the following required columns:\n\n- `sensor_source`: Type of [sensor source](#control-table-schema).\n- `sensor_id`: Unique upstream identifier or reference.\n- `sensor_read_type`: How to read the sensor (batch or streaming).\n- `asset_description`: Description of the upstream source.\n- `upstream_key`: Attribute name for detecting new data automatically.\n- `preprocess_query`: Optional query to filter upstream data.\n- `trigger_job_id`: Databricks Job ID to trigger when new data is available.\n- `trigger_job_name`: Databricks Job Name.\n- `job_state`: Job control state (`UNPAUSED` or `PAUSED`).\n- `dependency_flag`: Dependency type (`TRUE` for hard, `FALSE` for soft).\n\n**Example CSV Configuration:**\n```csv\nsensor_source,sensor_id,sensor_read_type,asset_description,upstream_key,preprocess_query,trigger_job_id,trigger_job_name,job_state,dependency_flag\nkafka,\"my_product: my.topic\",streaming,\"My product Kafka Topic\",,,\"111111111\",\"my-product-kafka_consumer_job\",UNPAUSED,TRUE\ndelta_table,\"my_database_1.my_table\",streaming,\"My Delta Table from My Database 1\",,,\"444444444\",\"my-product-delta_and_sap_b4_consumer_job\",UNPAUSED,TRUE\nsap_b4,\"SAP_4HANA_CHAIN_ID_SAP_TABLE\",batch,\"My SAP 4HANA Chain Process\",LOAD_DATE,,\"444444444\",\"my-product-delta_and_sap_b4_consumer_job\",UNPAUSED,TRUE\n```\n\n#### Step 2: Populate the Heartbeat Control Table\n\nUse the [Heartbeat Sensor Control Table Data Feeder](heartbeat_sensor_data_feed/heartbeat_sensor_data_feed.md) to:\n\n- Read your CSV configuration file.\n- Validate the configuration entries.\n- Ingest the data into the Heartbeat Control Table.\n- Establish the foundation for monitoring and orchestration.\n\n### Phase 2: Heartbeat Sensor Operation Workflow\n\n#### Step 3: Continuous Monitoring and Event Detection\n\nThe Heartbeat sensor cluster (running on a single node) performs the following operations:\n\n**3.1 Control Table Scanning**\n\n- Scans the Heartbeat Control Table for eligible records.\n- Filters records based on:\n    - Supported sensor sources: `Delta Table`, `Kafka`, `SAP BW/4HANA`, `Lakehouse Manual Upload`, `Trigger file`.\n    - Job state: `job_state = 'UNPAUSED'`.\n    - Status conditions: `status IS NULL` or `status = 'COMPLETED'`.\n\n!!! important \"Orchestration job recommendation\"\n    We recommend running multiple tasks for each sensor source type in the same Heartbeat\n    Sensor Orchestrator and only creating specific source-related jobs when it is really needed,\n    for example: real-time processing jobs or some complex jobs that need to be triggered as\n    soon as the trigger condition is satisfied (all hard dependencies have `NEW_EVENT_AVAILABLE`).\n\n!!! note \"First-Time Execution\"\n    For new sensor sources and IDs, the initial `status` will be `NULL`.\n    
This ensures that failed or paused jobs are not automatically triggered.\n\n**3.2 Source-Specific Event Detection**\n\nFor each eligible record, the Heartbeat system:\n\n- Creates the appropriate Sensor ACON (configuration) based on the `sensor_source` type.\n- Passes the configuration to the respective Sensor Algorithm.\n- The sensor algorithm checks for `NEW_EVENT_AVAILABLE` status for the specific `sensor_id`.\n\n**Supported Source Types and Their Configuration:**\n\n- **[Delta Table Sources](delta_table/delta_table.md)**: Monitor delta tables for new data.\n- **[Kafka Sources](kafka/kafka.md)**: Monitor Kafka topics for new messages.\n- **[Manual Table Sources](manual_table/manual_table.md)**: Monitor manually uploaded delta tables.\n- **[SAP BW/B4 Sources](sap_bw_b4/sap_bw_b4.md)**: Monitor SAP systems for new process chains.\n- **[Trigger File Sources](trigger_file/trigger_file.md)**: Monitor file systems for trigger files.\n\n#### Step 4: Event Processing and Status Updates\n\n**4.1 New Event Detection**\n\nWhen a sensor detects new data:\n\n- Updates the traditional sensor table (`lakehouse_engine_sensor`) with detection details.\n- Returns `NEW_EVENT_AVAILABLE` status to the Heartbeat module.\n\n**4.2 Heartbeat Control Table Updates**\n\nThe Heartbeat system updates the control table with:\n\n- `status` → `NEW_EVENT_AVAILABLE`.\n- `status_change_timestamp` → current timestamp.\n- `latest_event_fetched_timestamp` → timestamp when event detection started.\n\n#### Step 5: Dependency Validation and Job Triggering\n\n**5.1 Dependency Evaluation Process**\n\nBefore triggering any jobs, the system evaluates dependencies:\n\n1. **Filter Eligible Records**: Select records with `status = 'NEW_EVENT_AVAILABLE'`.\n2. **Group by Job ID**: Group records by `trigger_job_id` to identify job dependencies.\n3. **Evaluate Dependency Flags**:\n    - **TRUE (Hard Dependency)**: Job must have `NEW_EVENT_AVAILABLE` status.\n    - **FALSE (Soft Dependency)**: Job status is optional and doesn't block triggering.\n4. **Aggregate and Validate**: Ensure all hard dependencies are satisfied before triggering.\n\n**5.2 Triggering Logic Examples**\n\nConsider Job 3 that depends on Job 1 and Job 2:\n\n- **Scenario A**: Job 1 (HARD) + Job 2 (HARD) → Both must have `NEW_EVENT_AVAILABLE`.\n- **Scenario B**: Job 1 (HARD) + Job 2 (SOFT) → Only Job 1 needs `NEW_EVENT_AVAILABLE`.\n\n<img src=\"../../../assets/img/heartbeat_dependency_flag.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\n**5.3 Job Triggering via Databricks API**\n\nFor jobs that pass dependency validation:\n\n- Trigger the corresponding `trigger_job_id` via Databricks Job Run API.\n- Immediately update the control table:\n    - `status` → `IN_PROGRESS`.\n    - `job_start_timestamp` → current timestamp.\n    - `status_change_timestamp` → current timestamp.\n\n### Phase 3: Job Execution and Completion\n\n#### Step 6: Databricks Job Execution\n\nEach triggered Databricks job must include:\n\n- Your primary ETL/processing tasks.\n- **Final Task**: [Update Heartbeat Sensor Status](update_heartbeat_sensor_status/update_heartbeat_sensor_status.md) task.\n\n#### Step 7: Job Completion Handling\n\nUpon successful job completion, the update status task:\n\n- Sets `status` → `COMPLETED`.\n- Updates `status_change_timestamp` → current timestamp.\n- Sets `job_end_timestamp` → job completion timestamp.\n\n### Phase 4: Error Handling and Recovery\n\n#### Step 8: Job Failure Recovery Process\n\nIf a Databricks job fails, follow this recovery process:\n\n1. 
**Identify the Issue**: Analyze job logs and error messages.\n2. **Fix the Problem**: Address the underlying cause of the failure.\n3. **Manual Recovery**: Execute at least one successful manual run of the job.\n4. **Automatic Resumption**: Heartbeat will resume monitoring and triggering after successful completion.\n\n!!! warning \"Important Recovery Note\"\n    The Heartbeat sensor will **not** resume checking failed jobs for new events until at least one successful completion occurs. This prevents repeated triggering of failing jobs.\n\n#### Step 9: Monitoring and Maintenance\n\n**9.1 Regular Monitoring Tasks**\n\n- Monitor the Heartbeat Control Table for job statuses.\n- Check for jobs stuck in `IN_PROGRESS` status.\n- Verify dependency relationships are working correctly.\n- Review `latest_event_fetched_timestamp` for regular updates.\n\n**9.2 Control and Management**\n\n- **Pause Jobs**: Set `job_state` to `PAUSED` to temporarily stop monitoring.\n- **Resume Jobs**: Set `job_state` to `UNPAUSED` to resume monitoring.\n- **Modify Dependencies**: Update `dependency_flag` to change dependency relationships.\n\n### Phase 5: Advanced Configuration and Optimization\n\n#### Step 10: Advanced Configuration Options\n\n**10.1 Preprocess Queries**\n\nUse `preprocess_query` to filter upstream data:\n```sql\n-- Example: Filter only recent records\nSELECT * FROM sensor_new_data WHERE load_date >= current_date() - 7\n```\n\n**10.2 Parallel Processing**\n\nThe Heartbeat sensor automatically handles parallel processing of multiple sources, improving efficiency and scalability.\n\n**10.3 Pull-Based Architecture Benefits**\n\n- Downstream systems only need read access to upstream systems.\n- No write permissions are required from upstream to downstream.\n- Improved security and access control.\n\n### Troubleshooting Common Issues\n\n| Issue | Symptoms | Solution |\n|---|---|---|\n| Jobs not triggering | Status remains `NEW_EVENT_AVAILABLE` | Check dependency flags and ensure all hard dependencies are met. |\n| Jobs stuck in `IN_PROGRESS` | No completion status updates | Verify that jobs include the update status task as the final step. |\n| Failed job recovery | Jobs not resuming after fixes | Manually run the job successfully at least once. |\n| Missing events | `latest_event_fetched_timestamp` not updating | Check sensor source connectivity and configuration. |\n\nThis workflow ensures reliable, automated data pipeline orchestration with robust error handling and dependency management.\n\n!!! note\n    Also have a look at the [Sensor documentation](../sensors.md) to have a better understanding of the underlying sensor mechanisms that power the Heartbeat Sensor system.\n
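\nTo make the dependency evaluation described in Step 5 more concrete, here is a small illustrative sketch (not the engine's actual implementation) of the triggering rule applied to the control table records that share the same `trigger_job_id`:\n\n```python\ndef can_trigger(records_for_job: list) -> bool:\n    \"\"\"Illustrative only: a job is triggerable when every hard dependency\n    (dependency_flag == \"TRUE\") has status NEW_EVENT_AVAILABLE; soft\n    dependencies (dependency_flag == \"FALSE\") never block triggering.\"\"\"\n    hard_dependencies = [r for r in records_for_job if r[\"dependency_flag\"] == \"TRUE\"]\n    return bool(hard_dependencies) and all(\n        r[\"status\"] == \"NEW_EVENT_AVAILABLE\" for r in hard_dependencies\n    )\n\n\n# Scenario B from Step 5.2: one hard dependency with a new event and one soft dependency without.\nrecords = [\n    {\"sensor_id\": \"upstream_1\", \"dependency_flag\": \"TRUE\", \"status\": \"NEW_EVENT_AVAILABLE\"},\n    {\"sensor_id\": \"upstream_2\", \"dependency_flag\": \"FALSE\", \"status\": None},\n]\nprint(can_trigger(records))  # True -> the job would be triggered\n```\n"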
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/heartbeat_sensor_data_feed/__init__.py",
    "content": "\"\"\"\n.. include::heartbeat_sensor_data_feed.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/heartbeat_sensor_data_feed/heartbeat_sensor_data_feed.md",
    "content": "# Heartbeat Sensor Control Table Data Feeder\n\n## What is it?\n\nIt's a foundational component of the Heartbeat Sensor architecture. The primary purpose\nis to populate and maintain the Control Table, which drives the entire heartbeat \nmonitoring process. The Data Feeder Job is responsible for creating and updating entries\nin the Control Table. Each entry in the control table represents a sensor_source (e.g., \nSAP, Kafka, Delta) for a unique combination of `sensor_id` and `trigger_job_id`.\n\n## Configuration required to execute heartbeat sensor data feed\n\n- **heartbeat_sensor_data_feed_path**: S3 path to the CSV file containing the heartbeat sensor control table data (e.g., `\"s3://my_data_product_bucket/local_data/heartbeat_sensor/heartbeat_sensor_control_table_data.csv\"`).\n- **heartbeat_sensor_control_table**: Database table name for the [Heartbeat sensor control table](../heartbeat.md#control-table-schema) (e.g., `\"my_database.heartbeat_sensor\"`).\n\n## How it works\n\n1. A Heartbeat Sensor data feed job in each data product needs to be created to facilitate any\naddition, update and deletion of entries.\n2. Entries need to be added in CSV file format [as shown in Heartbeat Sensor Control table\nMetadata description section for more](../heartbeat.md#the-structure-and-relevance-of-the-data-products-heartbeat-sensor-control-table).\nOther fields in the control table will be filled automatically at different stages of \nthe sensor process.\n3. After adding/updating/deleting any entries in CSV, the Data feeder job needs to run again\nto reflect the changes in the table.\n\n## Code sample\n\n```python\nfrom lakehouse_engine.engine import execute_heartbeat_sensor_data_feed\n\nexecute_heartbeat_sensor_data_feed(\n    heartbeat_sensor_data_feed_path=\"s3://my_data_product_bucket/local_data/heartbeat_sensor/heartbeat_sensor_control_table_data.csv\" ,\n    heartbeat_sensor_control_table=\"my_database.heartbeat_sensor\"\n)\n```\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/kafka/__init__.py",
    "content": "\"\"\"\n.. include::kafka.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/kafka/kafka.md",
    "content": "# Heartbeat Sensor for Kafka\n\nThis shows how to create a Heartbeat Sensor Orchestrator to detect new data from \nKafka and trigger Databricks Workflows related to them.\n\n## Configuration required to create an orchestration task for the kafka source\n\n- **sensor_source**: Set to `kafka` in the Heartbeat Control Table to identify this as a Kafka source.\n- **data_format**: Set to `kafka` to specify the data format for reading Kafka streams.\n- **heartbeat_sensor_db_table**: Database table name for the Heartbeat sensor control table (e.g., `my_database.heartbeat_sensor`).\n- **lakehouse_engine_sensor_db_table**: Database table name for the lakehouse engine sensors (e.g., `my_database.lakehouse_engine_sensors`).\n- **options**: Configuration options for Kafka reading:\n    - `readChangeFeed`: Set to `\"true\"` to enable change data feed reading.\n- **kafka_configs**: Kafka connection and security configurations:\n    - `kafka_bootstrap_servers_list`: Kafka server endpoints.\n    - `kafka_ssl_truststore_location`: Path to SSL truststore.\n    - `truststore_pwd_secret_key`: Secret key for truststore password.\n    - `kafka_ssl_keystore_location`: Path to SSL keystore.\n    - `keystore_pwd_secret_key`: Secret key for keystore password.\n- **kafka_secret_scope**: Databricks secret scope for Kafka credentials.\n- **base_checkpoint_location**: S3 path for storing checkpoint data (required if `sensor_read_type` is `streaming`).\n- **domain**: Databricks workflows domain for job triggering.\n- **token**: Databricks workflows token for authentication.\n\n### Kafka Data Feed CSV Configuration Entry\n\nTo check how the entry for a Kafka source should look in the Heartbeat Control Table, [check it here](../heartbeat.md#heartbeat-sensor-control-table-reference-records).\n\n**Additional Requirements for Kafka**:\n\nThe `sensor_id` follows a specific naming convention because you can have multiple data \nproducts using the same configuration file with different Kafka configuration values:\n\n- The value for the `sensor_id` will be the Kafka Topic name starting with \n`<product_name:>` or any other prefix, example: `my_product: my.topic`.\n- How it works? 
→ Heartbeat receives a dictionary containing all Kafka configurations by\nproduct, which is passed as `kafka_configs` in the ACON.\nThen it segregates the config based on the `sensor_id` value present in the heartbeat\ncontrol table.\nHeartbeat splits the `sensor_id` on the colon (:): the first part is\nconsidered the product name (in our case, `my_product`) and the second part of\nthe split string is the Kafka topic name (in our case, `my.topic`).\nFinally, **it makes use of the product-related kafka config from the `kafka_configs`**.\n\n## Code sample of listener and trigger\n\n```python\nfrom lakehouse_engine.engine import (\n    execute_sensor_heartbeat,\n    trigger_heartbeat_sensor_jobs,\n)\n\n# Kafka configurations for the product. We strongly recommend reading these values from an external configuration file.\nkafka_configs = {\n  \"my_product\": {\n    \"kafka_bootstrap_servers_list\": \"KAFKA_SERVER\",\n    \"kafka_ssl_truststore_location\": \"TRUSTSTORE_LOCATION\",\n    \"truststore_pwd_secret_key\": \"TRUSTSTORE_PWD\",\n    \"kafka_ssl_keystore_location\": \"KEYSTORE_LOCATION\",\n    \"keystore_pwd_secret_key\": \"KEYSTORE_PWD\"\n  }\n}\n\n# Create an ACON dictionary for all kafka source entries.\n# This ACON dictionary is useful for passing parameters to heartbeat sensors.\n\nheartbeat_sensor_config_acon = {\n    \"sensor_source\": \"kafka\",\n    \"data_format\": \"kafka\",\n    \"heartbeat_sensor_db_table\": \"my_database.heartbeat_sensor\",\n    \"lakehouse_engine_sensor_db_table\": \"my_database.lakehouse_engine_sensors\",\n    \"options\": {\n        \"readChangeFeed\": \"true\",\n    },\n    \"kafka_configs\": kafka_configs,\n    \"kafka_secret_scope\": \"DB_SECRET_SCOPE\",\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"domain\": \"DATABRICKS_WORKFLOWS_DOMAIN\",\n    \"token\": \"DATABRICKS_WORKFLOWS_TOKEN\",\n}\n\n# Execute Heartbeat sensor and trigger jobs which have acquired new data. \nexecute_sensor_heartbeat(acon=heartbeat_sensor_config_acon)\ntrigger_heartbeat_sensor_jobs(heartbeat_sensor_config_acon)\n```\n
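\nAs a small illustration of that recommendation (the file path and JSON structure below are assumptions, not something the engine prescribes), the per-product configuration could be kept in a JSON file and loaded before building the ACON:\n\n```python\nimport json\n\n# Hypothetical JSON file with one entry per product, mirroring the kafka_configs structure above.\nwith open(\"/Workspace/conf/kafka_configs.json\", \"r\") as config_file:\n    kafka_configs = json.load(config_file)\n```\n"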
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/manual_table/__init__.py",
    "content": "\"\"\"\n.. include::manual_table.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/manual_table/manual_table.md",
    "content": "# Heartbeat Sensor for Manual Table\n\nThis shows how to create a Heartbeat Sensor Orchestrator to detect new data from a\nManual Table and trigger Databricks Workflows related to them.\n\n**Manual Tables (Lakehouse Manual Upload)** are different from regular Delta tables because:\n\n- **Data Upload Pattern**: Instead of continuous streaming or scheduled batch loads, data is manually uploaded by users at irregular intervals.\n- **Detection Challenge**: Unlike regular Delta tables with change data feeds or append operations, manual tables are typically overwritten completely, making it harder to detect new data using standard mechanisms.\n- **Custom Detection Logic**: Requires a special `upstream_key` (usually a timestamp column) to track when the table was last updated, since the table structure and most content may remain the same between uploads.\n- **Sensor Source Type**: Uses `lmu_delta_table` instead of `delta_table` to indicate this special handling requirement.\n\n## Configuration required to create an orchestration task for the manual table source\n\n- **sensor_source**: Set to `lmu_delta_table` in the Heartbeat Control Table to identify this as a Lakehouse Manual Upload Delta table source.\n- **data_format**: Set to `delta` to specify the data format for reading Delta tables.\n- **heartbeat_sensor_db_table**: Database table name for the Heartbeat sensor control table (e.g., `my_database.heartbeat_sensor`).\n- **lakehouse_engine_sensor_db_table**: Database table name for the lakehouse engine sensors (e.g., `my_database.lakehouse_engine_sensors`).\n- **domain**: Databricks workflows domain for job triggering.\n- **token**: Databricks workflows token for authentication.\n\n### Manual Tables Data Feed CSV Configuration Entry\n\nTo check how the entry for a manual table source should look in the Heartbeat Control Table, [check it here](../heartbeat.md#heartbeat-sensor-control-table-reference-records).\n\n**Additional Requirements for Manual Tables**:\n\n- **sensor_id**: Needs to be filled with the Lakehouse Manual Upload Delta table name along with database, e.g., `my_database.my_manual_table`.\n- **upstream_key**: Must specify the table date/timestamp column (typically named `date`) which indicates when the Lakehouse Manual Upload table was last overwritten. This is crucial for detecting new manual uploads.\n\n**Setup Requirements**:\n\n- A column named **`date`** must be added to your Lakehouse Manual Upload source Delta table.\n- This column should contain a timestamp value in **YYYYMMDDHHMMSS** format.\n- The value should be updated to `current_timestamp()` whenever new data is uploaded.\n- This timestamp serves as the \"fingerprint\" that the sensor uses to detect new uploads.\n\n!!! note\n    **`date` (or any other name, but with the same purpose, need to be defined on \n    `upstream_key` CSV configuration entry) column requirement**: Since manual tables \n    are typically overwritten entirely during each upload, standard Delta table change\n    detection mechanisms won't work. 
The Heartbeat sensor needs a reliable way to\n    determine if new data has been uploaded since the last check.\n\n## Code sample of listener and trigger\n\n```python\nfrom lakehouse_engine.engine import (\n    execute_sensor_heartbeat,\n    trigger_heartbeat_sensor_jobs,\n)\n\n# Create an ACON dictionary for all manual table source entries.\n# This ACON dictionary is useful for passing parameters to heartbeat sensors.\n\nheartbeat_sensor_config_acon = {\n    \"sensor_source\": \"lmu_delta_table\",\n    \"data_format\": \"delta\",\n    \"heartbeat_sensor_db_table\": \"my_database.heartbeat_sensor\",\n    \"lakehouse_engine_sensor_db_table\": \"my_database.lakehouse_engine_sensors\",\n    \"domain\": \"DATABRICKS_WORKFLOWS_DOMAIN\",\n    \"token\": \"DATABRICKS_WORKFLOWS_TOKEN\",\n}\n\n# Execute Heartbeat sensor and trigger jobs which have acquired new data. \nexecute_sensor_heartbeat(acon=heartbeat_sensor_config_acon)\ntrigger_heartbeat_sensor_jobs(heartbeat_sensor_config_acon)\n```\n"
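\nTo make the Setup Requirements above more concrete, here is a small illustrative sketch of the upload side (assumptions: `uploaded_df` is the DataFrame you are about to publish and the target table name matches your control table entry; this is not part of the Heartbeat API):\n\n```python\nfrom pyspark.sql.functions import current_timestamp, date_format\n\n# Add/refresh the `date` fingerprint column (YYYYMMDDHHMMSS) before overwriting the manual table.\ndf_with_fingerprint = uploaded_df.withColumn(\n    \"date\", date_format(current_timestamp(), \"yyyyMMddHHmmss\")\n)\n\ndf_with_fingerprint.write.format(\"delta\").mode(\"overwrite\").saveAsTable(\n    \"my_database.my_lmu_table\"\n)\n```\n"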
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/sap_bw_b4/__init__.py",
    "content": "\"\"\"\n.. include::sap_bw_b4.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/sap_bw_b4/sap_bw_b4.md",
    "content": "# Heartbeat Sensor for SAP BW/B4\n\nThis shows how to create a Heartbeat Sensor Orchestrator to detect new data from \nSAP BW/B4 and trigger Databricks Workflows related to them.\n\n## Configuration required to create an orchestration task for the SAP BW/B4 source\n\n- **sensor_source**: Set to `sap_b4` or `sap_bw` in the Heartbeat Control Table to identify this as a SAP source.\n- **data_format**: Set to `jdbc` to specify the data format for reading from SAP via JDBC connection.\n- **heartbeat_sensor_db_table**: Database table name for the Heartbeat sensor control table (e.g., `my_database.heartbeat_sensor`).\n- **lakehouse_engine_sensor_db_table**: Database table name for the lakehouse engine sensors (e.g., `my_database.lakehouse_engine_sensors`).\n- **options**: JDBC connection configuration:\n    - `compress`: Set to `true` to enable compression.\n    - `driver`: JDBC driver class name.\n    - `url`: JDBC connection URL.\n    - `user`: JDBC username for authentication.\n    - `password`: JDBC password for authentication.\n- **jdbc_db_table**: SAP logchain table name to query for process chain status.\n- **domain**: Databricks workflows domain for job triggering.\n- **token**: Databricks workflows token for authentication.\n\n### SAP BW/B4 Data Feed CSV Configuration Entry\n\nTo check how the entry for a SAP BW/B4 source should look in the Heartbeat Control Table, [check it here](../heartbeat.md#heartbeat-sensor-control-table-reference-records).\n\n**Additional Requirements for SAP BW/4HANA**:\n\n- The `sensor_id` needs to be filled with the Process Chain Name of the SAP object.\n- `sensor_read_type` needs to be `batch` for SAP.\n\n## Code sample of listener and trigger\n\n```python\nfrom lakehouse_engine.engine import (\n    execute_sensor_heartbeat,\n    trigger_heartbeat_sensor_jobs,\n)\n\n# Create an ACON dictionary for all SAP BW/B4 source entries.\n# This ACON dictionary is useful for passing parameters to heartbeat sensors.\n\nheartbeat_sensor_config_acon = {\n    \"sensor_source\": \"sap_b4|sap_bw\",  # use sadp_b4 or sap_bw, depending on the source you are reading from\n    \"data_format\": \"jdbc\",\n    \"heartbeat_sensor_db_table\": \"my_database.heartbeat_sensor\",\n    \"lakehouse_engine_sensor_db_table\": \"my_database.lakehouse_engine_sensors\",\n    \"options\": {\n        \"compress\": True,\n        \"driver\": \"JDBC_DRIVER\",\n        \"url\": \"JDBC_URL\",\n        \"user\": \"JDBC_USERNAME\",\n        \"password\": \"JDBC_PSWD\",\n    },\n    \"jdbc_db_table\": \"SAP_LOGCHAIN_TABLE\",\n    \"domain\": \"DATABRICKS_WORKFLOWS_DOMAIN\",\n    \"token\": \"DATABRICKS_WORKFLOWS_TOKEN\",\n}\n\n# Execute Heartbeat sensor and trigger jobs which have acquired new data. \nexecute_sensor_heartbeat(acon=heartbeat_sensor_config_acon)\ntrigger_heartbeat_sensor_jobs(heartbeat_sensor_config_acon)\n```\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/trigger_file/__init__.py",
    "content": "\"\"\"\n.. include::trigger_file.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/trigger_file/trigger_file.md",
    "content": "# Heartbeat Sensor for Trigger Files\n\nThis shows how to create a Heartbeat Sensor Orchestrator to detect new data from \nTrigger Files and trigger Databricks Workflows related to them.\n\n## Generating the trigger file\n\nIt's needed to create a task in the upstream pipeline to generate a trigger file,\nindicating that the upstream source has completed and the dependent job can be triggered.\nThe `sensor_id` used to generate the file must match the `sensor_id` specified in the\nheartbeat control table. Check here the [code example](#creation-of-the-trigger-file-following-the-sensorid-standard-code-example) of how to generate the\ntrigger file.\n\n#### Creation of the trigger file following the `sensor_id` standard code example:\n```pyhon\nimport datetime\n\nsensor_id = \"my_trigger\"\nfile_root_path = \"s3://my_data_product_bucket/triggers\"\n\nfile_name = f\"{sensor_id}_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.txt\"\nfile_path = \"/\".join([file_root_path, sensor_id, file_name])\n\n### Write Trigger File to S3 location using dbutils\noutput = dbutils.fs.put(file_path, \"Success\")\n```\n\n## Configuration required to create an orchestration task for the trigger file source\n\n- **sensor_source**: Set to `trigger_file` in the Heartbeat Control Table to identify this as a trigger file source.\n- **data_format**: Set to `cloudfiles` to enable Spark Auto Loader functionality for monitoring trigger files. This format allows the system to automatically detect when new trigger files are available at the specified location and trigger the [corresponding `trigger_job_id`](../heartbeat.md#control-table-schema).\n- **heartbeat_sensor_db_table**: Database table name for the Heartbeat sensor control table (e.g., `my_database.heartbeat_sensor`).\n- **lakehouse_engine_sensor_db_table**: Database table name for the lakehouse engine sensors (e.g., `my_database.lakehouse_engine_sensors`).\n- **options**: Cloud files configuration:\n    - `cloudFiles.format`: Set to `\"csv\"` to specify the file format.\n- **schema_dict**: Schema definition for the trigger files:\n    - Defines the structure with fields like `file_name` (string) and `file_modification_time` (timestamp).\n- **base_checkpoint_location**: S3 path for storing checkpoint data (required if `sensor_read_type` is `streaming`).\n- **base_trigger_file_location**: S3 path where trigger files are located.\n- **domain**: Databricks workflows domain for job triggering.\n- **token**: Databricks workflows token for authentication.\n\n### Trigger File Data Feed CSV Configuration Entry\n\nTo check how the entry for a trigger file source should look in the Heartbeat Control Table, [check it here](../heartbeat.md#heartbeat-sensor-control-table-reference-records).\n\n**Additional Requirements for Trigger File**:\n\n- The `sensor_id` will match the name used to create the trigger file. 
For example, if\nthe trigger file is named `my_trigger_YYYYMMDDHHMMSS.txt`, then the sensor_id will be\n`my_trigger`.\n\n## Code sample of listener and trigger\n\n```python\nfrom lakehouse_engine.engine import (\n    execute_sensor_heartbeat,\n    trigger_heartbeat_sensor_jobs,\n)\n\n# Create an ACON dictionary for all trigger file source entries.\n# This ACON dictionary is useful for passing parameters to heartbeat sensors.\n\nheartbeat_sensor_config_acon = {\n    \"sensor_source\": \"trigger_file\",\n    \"data_format\": \"cloudfiles\",\n    \"heartbeat_sensor_db_table\": \"my_database.heartbeat_sensor\",\n    \"lakehouse_engine_sensor_db_table\": \"my_database.lakehouse_engine_sensors\",\n    \"options\": {\n        \"cloudFiles.format\": \"csv\",\n    },\n    \"schema_dict\": {\n        \"type\": \"struct\",\n        \"fields\": [\n            {\n                \"name\": \"file_name\",\n                \"type\": \"string\",\n            },\n            {\n                \"name\": \"file_modification_time\",\n                \"type\": \"timestamp\",\n            },\n        ],\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"base_trigger_file_location\": \"s3://my_data_product_bucket/triggers\",\n    \"domain\": \"DATABRICKS_WORKFLOWS_DOMAIN\",\n    \"token\": \"DATABRICKS_WORKFLOWS_TOKEN\",\n}\n\n# Execute Heartbeat sensor and trigger jobs which have acquired new data. \nexecute_sensor_heartbeat(acon=heartbeat_sensor_config_acon)\ntrigger_heartbeat_sensor_jobs(heartbeat_sensor_config_acon)\n```\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/update_heartbeat_sensor_status/__init__.py",
    "content": "\"\"\"\n.. include::update_heartbeat_sensor_status.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/heartbeat/update_heartbeat_sensor_status/update_heartbeat_sensor_status.md",
    "content": "# Update Heartbeat Sensor control delta table after processing the data\n\nThis shows how to update the status of your Heartbeat Sensor after executing the pipeline.\n\nThe `update_heartbeat_sensor_status` function is **critical for the Heartbeat Sensor lifecycle** because:\n\n- **Completes the monitoring cycle**: When a Heartbeat sensor triggers a job, it sets the status to `IN_PROGRESS`. Without this update, the sensor would never know the job completed successfully.\n- **Enables continuous monitoring**: Only after a job is marked as `COMPLETED` will the Heartbeat sensor resume monitoring that source for new events.\n- **Prevents stuck jobs**: Without proper status updates, failed jobs remain in `IN_PROGRESS` status indefinitely, blocking future job triggers.\n- **Supports recovery process**: This is essential for the [Job Failure Recovery Process](../heartbeat.md#heartbeat-sensor-workflow-explanation) described in the main Heartbeat documentation, where at least one successful run must be completed before the sensor resumes monitoring.\n\n!!! note\n    **When to use**: This function must be called as the **final task** in every Databricks job that is orchestrated by the Heartbeat Sensor to properly update the `status` to `COMPLETED` and record the job completion timestamp.\n\n## Configuration required to update heartbeat sensor status\n\n- **job_id**: The unique identifier of the Databricks job that was triggered by the Heartbeat sensor (e.g., `\"MY_JOB_ID\"`).\n- **heartbeat_sensor_control_table**: Database table name for the Heartbeat sensor control table (e.g., `\"my_database.heartbeat_sensor\"`).\n- **sensor_table**: Database table name for the lakehouse engine sensors table (e.g., `\"my_database.lakehouse_engine_sensors\"`).\n\n## Code sample\n\nCode sample on how to update the status of your sensor in the Heartbeat Sensors Control Table:\n```python\nfrom lakehouse_engine.engine import update_heartbeat_sensor_status\n\nupdate_heartbeat_sensor_status(\n    job_id=\"MY_JOB_ID\",\n    heartbeat_sensor_control_table=\"my_database.heartbeat_sensor\",\n    sensor_table=\"my_database.lakehouse_engine_sensors\",\n)\n```\n\nIf you want to know more please visit the definition of the class [here](../../../../reference/packages/core/definitions.md#packages.core.definitions.HeartbeatConfigSpec)."
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/__init__.py",
    "content": "\"\"\"\n.. include::sensor.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/delta_table/__init__.py",
    "content": "\"\"\"\n.. include::delta_table.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/delta_table/delta_table.md",
    "content": "# Sensor from Delta Table\n\nThis shows how to create a **Sensor to detect new data from a Delta Table**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the\n  sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom query should be created with the source table as `sensor_new_data`.\n    If you want to view some examples of usage you can visit the [delta upstream sensor table](../delta_upstream_sensor_table/delta_upstream_sensor_table.md) or the [jdbc sensor](../jdbc_table/jdbc_table.md).\n\n- **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n- **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\nthere is no new data detected from upstream.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios\n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and **SUGGESTED** behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nData will be consumed from a delta table in streaming mode,\nso if there is any new data it will give condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"delta\",\n        \"db_table\": \"upstream_database.source_delta_table\",\n        \"options\": {\n            \"readChangeFeed\": \"true\", # to read changes in upstream table\n        },\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it \nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/delta_upstream_sensor_table/__init__.py",
    "content": "\"\"\"\n.. include::delta_upstream_sensor_table.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/delta_upstream_sensor_table/delta_upstream_sensor_table.md",
    "content": "# Sensor from other Sensor Delta Table\n\nThis shows how to create a **Sensor to detect new data from another Sensor Delta Table**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the\n  sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom query should be created with the source table as `sensor_new_data`.\n\n- **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n- **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\nthere is no new data detected from upstream.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios\n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nIt makes use of `generate_sensor_query` to generate the `preprocess_query`,\ndifferent from [delta_table](../delta_table/delta_table.md).\n\nData from other sensor delta table, in streaming mode, will be consumed. If there is any new data it will trigger \nthe condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor, generate_sensor_query\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"delta\",\n        \"db_table\": \"upstream_database.lakehouse_engine_sensors\",\n        \"options\": {\n            \"readChangeFeed\": \"true\",\n        },\n    },\n    \"preprocess_query\": generate_sensor_query(\"UPSTREAM_SENSOR_ID\"),\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/file/__init__.py",
    "content": "\"\"\"\n.. include::file.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/file/file.md",
    "content": "# Sensor from Files\n\nThis shows how to create a **Sensor to detect new data from a File Location**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom query should be created with the source table as `sensor_new_data`.\n\n- **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n- **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\nthere is no new data detected from upstream.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios \n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nUsing these sensors and consuming the data in streaming mode, if any new file is added to the file location, \nit will automatically trigger the proceeding task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"csv\",  # You can use any of the data formats supported by the lakehouse engine, e.g: \"avro|json|parquet|csv|delta|cloudfiles\"\n        \"location\": \"s3://my_data_product_bucket/path\",\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\nacquired_data = execute_sensor(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/jdbc_table/__init__.py",
    "content": "\"\"\"\n.. include::jdbc_table.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/jdbc_table/jdbc_table.md",
    "content": "# Sensor from JDBC\n\nThis shows how to create a **Sensor to detect new data from a JDBC table**.\n\n## Configuration required to have a Sensor\n\n- **jdbc_args**: Arguments of the JDBC upstream.\n- **generate_sensor_query**: Generates a Sensor query to consume data from the upstream, this function can be used on `preprocess_query` ACON option.\n    - **sensor_id**: The unique identifier for the Sensor.\n    - **filter_exp**: Expression to filter incoming new data.\n      A placeholder `?upstream_key` and `?upstream_value` can be used, example: `?upstream_key > ?upstream_value` so that it can be replaced by the respective values from the sensor `control_db_table_name` for this specific sensor_id.\n    - **control_db_table_name**: Sensor control table name.\n    - **upstream_key**: the key of custom sensor information to control how to identify new data from the upstream (e.g., a time column in the upstream).\n    - **upstream_value**: the **first** upstream value to identify new data from the upstream (e.g., the value of a time present in the upstream). ***Note:*** This parameter will have effect just in the first run to detect if the upstream have new data. If it's empty the default value applied is `-2147483647`.\n    - **upstream_table_name**: Table name to consume the upstream value. If it's empty the default value applied is `sensor_new_data`.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios \n\nThis covers the following scenarios of using the Sensor:\n\n1. [Generic JDBC template with `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [Generic JDBC template with `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nData from JDBC, in batch mode, will be consumed. If there is new data based in the preprocess query from the source table, it will trigger the condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor, generate_sensor_query\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"jdbc\",\n        \"jdbc_args\": {\n            \"url\": \"JDBC_URL\",\n            \"table\": \"JDBC_DB_TABLE\",\n            \"properties\": {\n                \"user\": \"JDBC_USERNAME\",\n                \"password\": \"JDBC_PWD\",\n                \"driver\": \"JDBC_DRIVER\",\n            },\n        },\n        \"options\": {\n            \"compress\": True,\n        },\n    },\n    \"preprocess_query\": generate_sensor_query(\n        sensor_id=\"MY_SENSOR_ID\",\n        filter_exp=\"?upstream_key > '?upstream_value'\",\n        control_db_table_name=\"my_database.lakehouse_engine_sensors\",\n        upstream_key=\"UPSTREAM_COLUMN_TO_IDENTIFY_NEW_DATA\",\n    ),\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. 
This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```"
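As an illustration of the placeholder mechanism described above (the column name `load_datetime` is just an example), `generate_sensor_query` can also be called on its own to inspect the query it produces for the `preprocess_query` option:

```python
from lakehouse_engine.engine import generate_sensor_query

# Example: the upstream exposes a "load_datetime" column that identifies new records.
preprocess_query = generate_sensor_query(
    sensor_id="MY_SENSOR_ID",
    filter_exp="?upstream_key > '?upstream_value'",
    control_db_table_name="my_database.lakehouse_engine_sensors",
    upstream_key="load_datetime",
)

# The ?upstream_key/?upstream_value placeholders are resolved with the values stored in the
# control table for this sensor_id (or the -2147483647 default on the first run), so the
# upstream data exposed as `sensor_new_data` is filtered down to records newer than the
# last value already processed.
print(preprocess_query)
```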
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/kafka/__init__.py",
    "content": "\"\"\"\n.. include::kafka.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/kafka/kafka.md",
    "content": "# Sensor from Kafka\n\nThis shows how to create a **Sensor to detect new data from Kafka**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the\n  sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom query should be created with the source table as `sensor_new_data`.\n\n- **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n- **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\nthere is no new data detected from upstream.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios\n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nData from Kafka, in streaming mode, will be consumed, so if there is any new data in the kafka topic it will give condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"kafka\",\n        \"options\": {\n            \"kafka.bootstrap.servers\": \"KAFKA_SERVER\",\n            \"subscribe\": \"KAFKA_TOPIC\",\n            \"startingOffsets\": \"earliest\",\n            \"kafka.security.protocol\": \"SSL\",\n            \"kafka.ssl.truststore.location\": \"TRUSTSTORE_LOCATION\",\n            \"kafka.ssl.truststore.password\": \"TRUSTSTORE_PWD\",\n            \"kafka.ssl.keystore.location\": \"KEYSTORE_LOCATION\",\n            \"kafka.ssl.keystore.password\": \"KEYSTORE_PWD\",\n        },\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/sap_bw_b4/__init__.py",
    "content": "\"\"\"\n.. include::sap_bw_b4.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/sap_bw_b4/sap_bw_b4.md",
    "content": "# Sensor from SAP\n\nThis shows how to create a **Sensor to detect new data from a SAP LOGCHAIN table**.\n\n## Configuration required to have a Sensor\n\n- **sensor_id**: A unique identifier of the sensor in a specific job.\n- **assets**: List of assets considered for the sensor, which are considered as available once the\n  sensor detects new data and status is `ACQUIRED_NEW_DATA`.\n- **control_db_table_name**: Name of the sensor control table.\n- **input_spec**: Input spec with the upstream source.\n- **preprocess_query**: Query to filter data returned by the upstream.\n\n\n!!! note\n    This parameter is only needed when the upstream data have to be filtered, in this case a custom\n    query should be created with the source table as `sensor_new_data`.\n\n    - **base_checkpoint_location**: Spark streaming checkpoints to identify if the upstream has new data.\n    - **fail_on_empty_result**: Flag representing if it should raise `NoNewDataException` when\n    there is no new data detected from upstream.\n\nSpecific configuration required to have a Sensor consuming a SAP BW/B4 upstream.\nThe Lakehouse Engine provides two utility functions to make easier to consume SAP as upstream:\n`generate_sensor_sap_logchain_query` and `generate_sensor_query`.\n\n- **generate_sensor_sap_logchain_query**: This function aims\n  to create a temporary table with timestamp from the SAP LOGCHAIN table, which is a process control table.\n  \n    !!! note\n        this temporary table only lives during runtime, and it is related with the\n        sap process control table but has no relationship or effect on the sensor control table.\n    \n        - **chain_id**: SAP Chain ID process.\n        - **dbtable**: SAP LOGCHAIN db table name, default: `my_database.RSPCLOGCHAIN`.\n        - **status**: SAP Chain Status of your process, default: `G`.\n        - **engine_table_name**: Name of the temporary table created from the upstream data, \n        default: `sensor_new_data`.\n        This temporary table will be used as source in the `query` option.\n\n  - **generate_sensor_query**: Generates a Sensor query to consume data from the temporary table created in the `prepareQuery`.\n      - **sensor_id**: The unique identifier for the Sensor.\n      - **filter_exp**: Expression to filter incoming new data.\n        A placeholder `?upstream_key` and `?upstream_value` can be used, example: `?upstream_key > ?upstream_value`\n        so that it can be replaced by the respective values from the sensor `control_db_table_name`\n        for this specific sensor_id.\n      - **control_db_table_name**: Sensor control table name.\n      - **upstream_key**: the key of custom sensor information to control how to identify\n        new data from the upstream (e.g., a time column in the upstream).\n      - **upstream_value**: the **first** upstream value to identify new data from the\n        upstream (e.g., the value of a time present in the upstream).\n        .. note:: This parameter will have effect just in the first run to detect if the upstream have new data. If it's empty the default value applied is `-2147483647`.\n      - **upstream_table_name**: Table name to consume the upstream value.\n        If it's empty the default value applied is `sensor_new_data`.\n        .. 
note:: In case of using the `generate_sensor_sap_logchain_query` the default value for the temp table is `sensor_new_data`, so if passing a different value in the `engine_table_name` this parameter should have the same value.\n\nIf you want to know more please visit the definition of the class [here](../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec).\n\n## Scenarios\n\nThis covers the following scenarios of using the Sensor:\n\n1. [The `fail_on_empty_result=True` (the default and SUGGESTED behaviour).](#fail_on_empty_result-as-true-default-and-suggested)\n2. [The `fail_on_empty_result=False`.](#fail_on_empty_result-as-false)\n\nData from SAP will be consumed in batch mode, so if there is any new data in the SAP LOGCHAIN table it will give condition to proceed to the next task.\n\n### `fail_on_empty_result` as True (default and SUGGESTED)\n\n```python\nfrom lakehouse_engine.engine import execute_sensor, generate_sensor_query, generate_sensor_sap_logchain_query\n\nacon = {\n    \"sensor_id\": \"MY_SENSOR_ID\",\n    \"assets\": [\"MY_SENSOR_ASSETS\"],\n    \"control_db_table_name\": \"my_database.lakehouse_engine_sensors\",\n    \"input_spec\": {\n        \"spec_id\": \"sensor_upstream\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"jdbc\",\n        \"options\": {\n            \"compress\": True,\n            \"driver\": \"JDBC_DRIVER\",\n            \"url\": \"JDBC_URL\",\n            \"user\": \"JDBC_USERNAME\",\n            \"password\": \"JDBC_PWD\",\n            \"prepareQuery\": generate_sensor_sap_logchain_query(chain_id=\"CHAIN_ID\", dbtable=\"JDBC_DB_TABLE\"),\n            \"query\": generate_sensor_query(\n                sensor_id=\"MY_SENSOR_ID\",\n                filter_exp=\"?upstream_key > '?upstream_value'\",\n                control_db_table_name=\"my_database.lakehouse_engine_sensors\",\n                upstream_key=\"UPSTREAM_COLUMN_TO_IDENTIFY_NEW_DATA\",\n            ),\n        },\n    },\n    \"base_checkpoint_location\": \"s3://my_data_product_bucket/checkpoints\",\n    \"fail_on_empty_result\": True,\n}\n\nexecute_sensor(acon=acon)\n```\n\n### `fail_on_empty_result` as False\n\nUsing `fail_on_empty_result=False`, in which the `execute_sensor` function returns a `boolean` representing if it\nhas acquired new data. This value can be used to execute or not the next steps.\n\n```python\nfrom lakehouse_engine.engine import execute_sensor\n\nacon = {\n    [...],\n    \"fail_on_empty_result\": False\n}\n\nacquired_data = execute_sensor(acon=acon)\n```\n\n"
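To illustrate the note above about `engine_table_name` and `upstream_table_name`: if you deviate from the default `sensor_new_data` temporary table name, both parameters must carry the same value. A minimal sketch, using a hypothetical table name:

```python
from lakehouse_engine.engine import (
    generate_sensor_query,
    generate_sensor_sap_logchain_query,
)

# Hypothetical non-default name for the temporary table built from the SAP LOGCHAIN data.
temp_table_name = "my_sap_chain_data"

prepare_query = generate_sensor_sap_logchain_query(
    chain_id="CHAIN_ID",
    dbtable="JDBC_DB_TABLE",
    engine_table_name=temp_table_name,
)

# upstream_table_name must match the engine_table_name used above, otherwise the generated
# query would reference a temporary table that does not exist during the sensor run.
sensor_query = generate_sensor_query(
    sensor_id="MY_SENSOR_ID",
    filter_exp="?upstream_key > '?upstream_value'",
    control_db_table_name="my_database.lakehouse_engine_sensors",
    upstream_key="UPSTREAM_COLUMN_TO_IDENTIFY_NEW_DATA",
    upstream_table_name=temp_table_name,
)

# prepare_query and sensor_query can then be used as the "prepareQuery" and "query"
# options of the JDBC input_spec, as in the ACON above.
```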
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/sensor.md",
    "content": "# Sensor\n\n## What is it?\n\nThe lakehouse engine sensors are an abstraction to otherwise complex spark code that can be executed in very small\nsingle-node clusters to check if an upstream system or data product contains new data since the last execution of our\njob. With this feature, we can trigger a job to run in more frequent intervals and if the upstream does not contain new\ndata, then the rest of the job exits without creating bigger clusters to execute more intensive data ETL (Extraction,\nTransformation, and Loading).\n\n## How do Sensor-based jobs work?\n\n<img src=\"../../../assets/img/sensor_os.png\" alt=\"image\" width=\"1000px\" height=\"auto\">\n\nWith the sensors capability, data products in the lakehouse can sense if another data product or an upstream system (source\nsystem) have new data since the last successful job. We accomplish this through the approach illustrated above, which\ncan be interpreted as follows:\n\n1. A Data Product can check if Kafka, JDBC or any other Lakehouse Engine Sensors supported sources, contains new data using the respective sensors;\n2. The Sensor task may run in a very tiny single-node cluster to ensure cost\n   efficiency ([check sensor cost efficiency](#are-sensor-based-jobs-cost-efficient));\n3. If the sensor has recognised that there is new data in the upstream, then you can start a different ETL Job Cluster\n   to process all the ETL tasks (data processing tasks).\n4. In the same way, a different Data Product can sense if an upstream Data Product has new data by using 1 of 2 options:\n    1. **(Preferred)** Sense the upstream Data Product sensor control delta table;\n    2. Sense the upstream Data Product data files in s3 (files sensor) or any of their delta tables (delta table\n       sensor);\n\n## The Structure and Relevance of the Sensors Control Table\n\nThe concept of the lakehouse-engine sensor is based on a special delta table stored inside the data product that chooses to opt in for a sensor-based job. That table is used to control the status of the various sensors implemented by that data product. 
You can refer to the below table to understand the sensor delta table structure:\n\n| Column Name                 | Type          | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |\n|-----------------------------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| **sensor_id**               | STRING        | A unique identifier of the sensor in a specific job. This unique identifier is really important because it is used by the engine to identify if there is new data in the upstream.<br />Each sensor in each job should have a different sensor_id.<br />If you attempt to create 2 sensors with the same sensor_id, the engine will fail.                                                                                                                                                                                                                                                              |\n| **assets**                  | ARRAY<STRING> | A list of assets (e.g., tables or dataset folder) that are considered as available to consume downstream after the sensor has status *PROCESSED_NEW_DATA*.                                                                                                                                                                                                                                                                                                                                                                                                                                             |\n| **status**                  | STRING        | Status of the sensor. Can either be:<br /><ul><li>*ACQUIRED_NEW_DATA* – when the sensor in a job has recognised that there is new data from the upstream but, the job where the sensor is, was still not successfully executed.</li><li>*PROCESSED_NEW_DATA* - when the job where the sensor is located has processed all the tasks in that job.</li></ul>                                                                                                                                                                                                                                             |\n| **status_change_timestamp** | STRING        | Timestamp when the status has changed for the last time.                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                          |\n| **checkpoint_location**     | STRING        | Base location of the Spark streaming checkpoint location, when applicable (i.e., when the type of sensor uses Spark streaming checkpoints to identify if the upstream has new data). E.g. Spark streaming checkpoints are used for Kafka, Delta and File sensors.                                                                                                                                                                                                                                                                                                                                      |\n| **upstream_key**            | STRING        | Upstream key (e.g., used to store an attribute name from the upstream so that new data can be detected automatically).<br />This is useful for sensors that do not rely on Spark streaming checkpoints, like the JDBC sensor, as it stores the name of a field in the JDBC upstream that contains the values that will allow us to identify new data (e.g., a timestamp in the upstream that tells us when the record was loaded into the database).                                                                                                                                                   |\n| **upstream_value**          | STRING        | Upstream value (e.g., used to store the max attribute value from the upstream so that new data can be detected automatically). This is the value for upstream_key. <br />This is useful for sensors that do not rely on Spark streaming checkpoints, like the JDBC sensor, as it stores the value of a field in the JDBC upstream that contains the maximum value that was processed by the sensor, and therefore useful for recognizing that there is new data in the upstream (e.g., the value of a timestamp attribute in the upstream that tells us when the record was loaded into the database). |\n\n!!! note \"Control Table Requirements\"\n** Sensors**: You need to add this control table to your data product to use sensors.\n\n    **Heartbeat Sensor**: Uses Sensor control table and a different heartbeat control table structure. For Heartbeat Sensor implementation, refer to the [Heartbeat Sensor Control Table structure](heartbeat/heartbeat.md#heartbeat-sensor-control-table-reference-records).\n\n## How is it Different from Scheduled Jobs?\n\nBoth sensor-based jobs and Heartbeat Sensor jobs are still scheduled, but they can be scheduled with higher frequency because they are more cost-efficient than spinning up multi-node clusters for heavy ETL operations, only to discover that the upstream doesn't have new data.\nEach job includes a sensor task that checks for new data before proceeding with ETL tasks. If no new data is found, the job exits early without consuming additional resources.\n\n## Are Sensor-based Jobs Cost-Efficient?\n\nYes, for the same schedule (e.g., 4 times a day), sensor-based jobs are significantly more cost-efficient than scheduling regular jobs because:\n\n1. **Minimal Resource Usage**: Sensor tasks run on very small single-node clusters\n2. **Conditional Processing**: Larger ETL clusters are only spun up when new data is available\n3. 
**Early Exit Strategy**: Jobs exit early if no new data is detected, saving compute costs\n4. **Optimized Scheduling**: You can schedule sensor checks more frequently without proportional cost increases\n\nFor demanding SLAs, you can implement alternative architectures with continuous (always-running) sensor clusters that trigger respective data processing jobs whenever new data becomes available.\n\n## Sensor Steps\n\n1. Create your sensor task for the upstream source. Examples of available sources:\n    - [Delta Table](delta_table/delta_table.md)\n    - [Delta Upstream Sensor Table](delta_upstream_sensor_table/delta_upstream_sensor_table.md)\n    - [File](file/file.md)\n    - [JDBC](jdbc_table/jdbc_table.md)\n    - [Kafka](kafka/kafka.md)\n    - [SAP BW/B4](sap_bw_b4/sap_bw_b4.md)\n2. Set up/execute your ETL task based on the Sensor condition (see the sketch below)\n3. Update the Sensor Control table status with the [Update Sensor Status](update_sensor_status/update_sensor_status.md)\n"
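A minimal sketch of how these three steps can be wired together in one job (the ACONs and identifiers are placeholders; in practice the sensor, ETL, and status update usually run as separate tasks of the same Databricks job):

```python
from lakehouse_engine.engine import execute_sensor, load_data, update_sensor_status

# 1. Sensor task: check the upstream for new data
#    (see the source-specific pages above for complete input_spec examples).
sensor_acon = {
    "sensor_id": "MY_SENSOR_ID",
    "assets": ["MY_SENSOR_ASSETS"],
    "control_db_table_name": "my_database.lakehouse_engine_sensors",
    "input_spec": {...},  # upstream source definition (delta, file, jdbc, kafka, ...)
    "base_checkpoint_location": "s3://my_data_product_bucket/checkpoints",
    "fail_on_empty_result": False,
}

if execute_sensor(acon=sensor_acon):
    # 2. ETL task: only runs when the sensor acquired new data.
    load_data(acon={...})  # your data loading ACON

    # 3. Update the Sensor Control table so the assets become available downstream.
    update_sensor_status(
        sensor_id="MY_SENSOR_ID",
        control_db_table_name="my_database.lakehouse_engine_sensors",
        status="PROCESSED_NEW_DATA",
        assets=["MY_SENSOR_ASSETS"],
    )
```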
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/update_sensor_status/__init__.py",
    "content": "\"\"\"\n.. include::update_sensor_status.md\n\"\"\"\n"
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensor/update_sensor_status/update_sensor_status.md",
    "content": "# Update Sensor control delta table after processing the data\n\nThis shows how to **update the status of your Sensor after processing the new data**.\n\nHere is an example on how to update the status of your sensor in the Sensors Control Table:\n```python\nfrom lakehouse_engine.engine import update_sensor_status\n\nupdate_sensor_status(\n    sensor_id=\"MY_SENSOR_ID\",\n    control_db_table_name=\"my_database.lakehouse_engine_sensors\",\n    status=\"PROCESSED_NEW_DATA\",\n    assets=[\"MY_SENSOR_ASSETS\"]\n)\n```\n\nIf you want to know more please visit the definition of the class [here](../../../../reference/packages/core/definitions.md#packages.core.definitions.SensorSpec)."
  },
  {
    "path": "lakehouse_engine_usage/sensors/sensors.md",
    "content": "# Sensors\n\n## What is it?\n\nThe lakehouse engine provides two complementary sensor solutions for monitoring upstream systems and detecting new data:\n\n### 1. Sensor\n\nTraditional lakehouse engine sensors are abstractions that simplify complex Spark code, allowing you to check if an upstream system or data product contains new data since the last job execution. These sensors run in very small single-node clusters to ensure cost efficiency. If the upstream contains new data, the sensor triggers the rest of the job; otherwise, the job exits without spinning up larger clusters for intensive ETL operations.\n\n**Key Characteristics:**\n\n- Individual sensor configuration for each data source within jobs.\n- Manual job execution after sensor detection or the need of adding a Sensor task in the beginning of the pipeline.\n- Single-source monitoring capability per task, coupling it directly with the source and not with a source type.\n- Built-in cost optimization through minimal cluster usage.\n\n### 2. Heartbeat Sensor\n\nThe Heartbeat Sensor is a robust, centralized orchestration system that enhances the Sensor infrastructure. It provides automated event detection, efficient multiple sources parallelism detection, and seamless integration with downstream workflows. Unlike Sensors that require individual configuration for each data source, the Heartbeat Sensor manages multiple sources through a single control table and automatically triggers Databricks jobs when new data is detected.\n\n**Key Characteristics:**\n\n- Centralized control table for managing all Sensor sources\n- Automatic Databricks job triggering via Job Run API\n- Multi-source support with dependency management\n- Built-in hard/soft dependency validation\n- Comprehensive status tracking and lifecycle management\n\n## When to Use Each Solution\n\n| Aspect                    | Sensor                                                   | Heartbeat Sensor                                           |\n|---------------------------|----------------------------------------------------------|------------------------------------------------------------|\n| **Use Case**              | Simple, single-source monitoring within individual jobs. | Complex, multi-source orchestration with job dependencies. |\n| **Configuration**         | Individual sensor setup per job.                         | Centralized control table configuration.                   |\n| **Job Triggering**        | Manual job execution after sensor detection.             | Automatic Databricks job triggering via Job API.           |\n| **Dependency Management** | Not supported.                                           | Built-in hard/soft dependency validation.                  |\n| **Scalability**           | Limited to individual sensors.                           | Highly scalable with parallel source type processing.      |\n| **Management Overhead**   | Higher (individual configurations).                      | Lower (centralized management).                            |\n| **Best For**              | Single data product monitoring.                          | Enterprise-level orchestration.                            
|\n\n### Decision Guide\n\n**Choose Sensors when:**\n\n- You need simple monitoring for a single data source.\n- Your workflow involves manual job execution, or you are willing to update your pipeline to include a Sensor task at the beginning.\n- You have straightforward ETL pipelines without complex dependencies.\n- You prefer embedded sensor logic within individual jobs.\n\n**Choose Heartbeat Sensor when:**\n\n- You need to orchestrate multiple data sources and dependencies.\n- You want automated job triggering without manual intervention.\n- You require centralized monitoring and status management.\n- You need to handle complex multi-source workflows at enterprise scale.\n\nBoth solutions can coexist in the same environment, allowing you to choose the appropriate sensor type based on specific use case requirements.\n\n## How do Sensor-based Jobs Work?\n\nWith sensors, data products in the lakehouse can detect if another data product or upstream system contains new data since the last successful job execution. The workflow is as follows:\n\n1. **Data Detection**: A data product checks if Kafka, JDBC, or any other supported Sensor source contains new data using the respective sensors.\n2. **Cost-Efficient Execution**: The Sensor task runs in a very small single-node cluster to ensure cost efficiency.\n3. **Conditional Processing**: If the Sensor detects new data in the upstream, you can start a different ETL job cluster to process all ETL tasks (data processing tasks).\n4. **Cross-Product Sensing**: Different data products can sense if upstream data products have new data using:\n    - **(Preferred)** Sensing the upstream data product's Sensor control delta table.\n    - Sensing the upstream data product's data files in S3 (files sensor) or delta tables (delta table sensor).\n\nFor detailed information about Sensor implementation, configuration, and usage, see the [Sensor documentation](sensor/sensor.md).\n\n## How do Heartbeat Sensor Jobs Work?\n\nThe Heartbeat Sensor approach uses a centralized sensor cluster running on a single node that continuously checks for new events from different sensor sources mentioned in the Heartbeat sensor control table. When a new event is available from a sensor source, it automatically triggers the corresponding job via the Databricks Job Run API using a pull-based approach.\n\n**Workflow Process:**\n\n1. **Continuous Monitoring**: The heartbeat cluster continuously polls various sensor sources.\n2. **Event Detection**: Checks for `NEW_EVENT_AVAILABLE` status from configured sources.\n3. **Dependency Validation**: Validates hard/soft dependencies before job triggering.\n4. **Automatic Triggering**: Automatically triggers dependent Databricks jobs.\n5. 
**Status Management**: Updates job status throughout the lifecycle.\n\n**Key Advantages:**\n\n- **Centralized Control**: Single control table manages all sensor sources and dependencies.\n- **Automated Orchestration**: No manual intervention required for job triggering.\n- **Multi-Source Support**: Handles diverse source types (SAP, Kafka, Delta Tables, Manual Uploads, Trigger Files) in one unified system.\n- **Dependency Management**: Built-in validation prevents premature job execution.\n- **Status Tracking**: Comprehensive lifecycle tracking from detection to job completion.\n\nFor detailed information about Heartbeat Sensor implementation, configuration, and usage, see the [Heartbeat Sensor documentation](heartbeat/heartbeat.md).\n"
  },
  {
    "path": "pyproject.toml",
    "content": "[build-system]\nrequires = [\n    \"setuptools==74.*\"\n]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"lakehouse-engine\"\nrequires-python = \">=3.12\"\nreadme = \"README.md\"\nlicense = {file = \"LICENSE.txt\"}\nversion = \"2.0.0\"\nauthors = [{name = \"Adidas Lakehouse Foundations Team\", email = \"software.engineering@adidas.com\"}]\ndescription = \"A configuration-driven Spark framework serving as the engine for several lakehouse algorithms and data flows.\"\nkeywords = [\"framework\", \"big-data\", \"spark\", \"databricks\", \"data-quality\", \"data-engineering\", \"great-expectations\",\n    \"lakehouse\", \"delta-lake\", \"configuration-driver\"]\nclassifiers = [\n    \"Development Status :: 5 - Production/Stable\",\n    \"Programming Language :: Python :: 3\",\n    \"Intended Audience :: Developers\",\n    \"Intended Audience :: Science/Research\",\n    \"Intended Audience :: Other Audience\",\n    \"Operating System :: OS Independent\",\n    \"Topic :: Scientific/Engineering\",\n    \"Topic :: Software Development\",\n    \"License :: OSI Approved :: Apache Software License\"\n]\ndynamic = [\"dependencies\", \"optional-dependencies\"]\n\n[project.urls]\nRepository = \"https://github.com/adidas/lakehouse-engine\"\nDocumentation = \"https://adidas.github.io/lakehouse-engine-docs/index.html\"\nIssues = \"https://github.com/adidas/lakehouse-engine/issues\"\nReleases = \"https://github.com/adidas/lakehouse-engine/releases\"\n\n[tool.setuptools.dynamic]\ndependencies = { file = [\"cicd/requirements.lock\"] }\noptional-dependencies.os = { file = [\"cicd/requirements_os.lock\"] }\noptional-dependencies.azure = { file = [\"cicd/requirements_azure.lock\"] }\noptional-dependencies.dq = { file = [\"cicd/requirements_dq.lock\"] }\noptional-dependencies.sftp = { file = [\"cicd/requirements_sftp.lock\"] }\noptional-dependencies.sharepoint = { file = [\"cicd/requirements_sharepoint.lock\"] }\n\n[tool.setuptools.packages.find]\nexclude = [\"tests*\", \"lakehouse_engine_usage*\"]\nnamespaces = false\n\n[tool.setuptools.package-data]\nlakehouse_engine = [\"configs/engine.yaml\"]\n\n[tool.isort]\nprofile = \"black\"\n\n[tool.mypy]\nwarn_return_any = true\nwarn_unused_configs = true\nignore_missing_imports = false\nstrict_optional = false\ndisallow_untyped_defs = true\n\n[[tool.mypy.overrides]]\nmodule = [\n    \"delta.*\",\n    \"pyspark.*\",\n    \"py4j.*\",\n    \"great_expectations.*\",\n    \"pandas.*\",\n    \"IPython.*\",\n    \"nest_asyncio.*\",\n    \"msgraph.*\",\n    \"importlib.*\",\n    \"yaml.*\",\n    \"ruamel.*\",\n    \"msal.*\",\n    \"dbruntime.databricks_repl_context.*\"\n]\nignore_missing_imports = true\n\n[tool.pytest.ini_options]\ntestpaths = [\n    \"tests\"\n]\nfilterwarnings = [\n    # coming from GX and also on their pyproject ignores\n    \"ignore: Jupyter is migrating its paths to use standard platformdirs:DeprecationWarning\", #1 warning\n    # We are defining result_format at the Checkpoint level (which is the right one), but GX is wrongly\n    # triggering the warning, because it is also considering the defaults of the expectations for triggering the warning.\n    # Only place where we are not defining at Checkpoint level is for custom expectation local test, as we don't\n    # need checkpoint for the test.\n    \"ignore:`result_format` configured at the Validator-level will not be persisted:UserWarning\", # 12 warnings\n    \"ignore:`result_format` configured at the Expectation-level will not be persisted:UserWarning\", # 12 
warnings\n    \"ignore: jsonschema.RefResolver is deprecated as of v4.18.0:DeprecationWarning\", #1985 warnings come from this one\n    \"ignore: The default dtype for empty Series will be 'object' instead of 'float64' in a future version.:DeprecationWarning\",\n    \"ignore: The default dtype for empty Series will be:FutureWarning\",\n    # Warning about host keys on local ftp tests with paramiko\n    \"ignore: Unknown ssh-rsa host key for : UserWarning\",\n    # GX library is using fields.Number from marshmallow, which is deprecated and will be removed in Marshmallow 4.0\n    \"ignore: `Number` field should not be instantiated. Use `Integer`, `Float`, or `Decimal` instead.:DeprecationWarning\"\n]"
  },
  {
    "path": "samples/cricket_dq_tutorial.py",
    "content": "# This sample tutorial is based on the dataset available here: https://www.kaggle.com/datasets/vikramrn/icc-mens-cricket-odi-world-cup-wc-2023-bowling.\n# The goal of the tutorial is to demonstrate how you can use the Lakehouse Engine to load data into a target location while assessing its data quality.\n\n# You can install the Lakehouse Engine framework with below command just like any other python library,\n# or you can also install it as a cluster-scoped library\npip install lakehouse-engine\n\n# The ACON (algorithm configuration) is the way how you can interact with the Lakehouse Engine.\n# Note: don't forget to change locations, buckets and databases to match your environment.\nacon = {\n    \"input_specs\": [\n        {\n            \"spec_id\": \"cricket_world_cup_bronze\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"header\": True,\n                \"delimiter\": \",\",\n            },\n            \"location\": \"s3://your_bucket_file_location/icc_wc_23_bowl.csv\",\n        }\n    ],\n    \"dq_specs\": [\n        {\n            \"spec_id\": \"cricket_world_cup_data_quality\",\n            \"input_id\": \"cricket_world_cup_bronze\",\n            \"dq_type\": \"validator\",\n            \"store_backend\": \"s3\",\n            \"bucket\": \"your_bucket\",\n            \"result_sink_location\": \"s3://your_bucket/dq_result_sink/gx_blog/\",\n            \"result_sink_db_table\": \"your_database.gx_blog_result_sink\",\n            \"tag_source_data\": True,\n            \"unexpected_rows_pk\": [\"player\", \"match_id\"],\n            \"fail_on_error\": False,\n            \"critical_functions\": [\n                {\n                    \"function\": \"expect_column_values_to_be_in_set\",\n                    \"args\": {\n                        \"column\": \"team\",\n                        \"value_set\": [\n                            \"Sri Lanka\", \"Netherlands\", \"Australia\", \"England\", \"Bangladesh\",\n                            \"New Zealand\", \"India\", \"Afghanistan\", \"South Africa\", \"Pakistan\",\n                        ],\n                    },\n                },\n                {\n                    \"function\": \"expect_column_values_to_be_in_set\",\n                    \"args\": {\n                        \"column\": \"opponent\",\n                        \"value_set\": [\n                            \"Sri Lanka\", \"Netherlands\", \"Australia\", \"England\", \"Bangladesh\",\n                            \"New Zealand\", \"India\", \"Afghanistan\", \"South Africa\", \"Pakistan\",\n                        ],\n                    },\n                },\n            ],\n            \"dq_functions\": [\n                {\n                    \"function\": \"expect_column_values_to_not_be_null\",\n                    \"args\": {\"column\": \"player\"},\n                },\n                {\n                    \"function\": \"expect_column_values_to_be_between\",\n                    \"args\": {\"column\": \"match_id\", \"min_value\": 0, \"max_value\": 47},\n                },\n                {\n                    \"function\": \"expect_column_values_to_be_in_set\",\n                    \"args\": {\"column\": \"maidens\", \"value_set\": [0, 1]},\n                },\n            ],\n        },\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"cricket_world_cup_silver\",\n            \"input_id\": \"cricket_world_cup_data_quality\",\n            
\"write_type\": \"overwrite\",\n            \"db_table\": \"your_database.gx_blog_cricket\",\n            \"location\": \"s3://your_bucket/rest_of_path/gx_blog_cricket/\",\n            \"data_format\": \"delta\",\n        }\n    ],\n}\n\n# You need to import the Load Data algorithm from the Lakehouse Engine, so that you can perform Data Loads.\nfrom lakehouse_engine.engine import load_data\n\n# Finally, you just need to run the Load Data algorithm with the ACON that you have just defined.\nload_data(acon=acon)\n"
  },
  {
    "path": "samples/tpch_load_and_analysis_tutorial.py",
    "content": "# Databricks notebook source\n# MAGIC %md\n# MAGIC ### How to use the Lakehouse Engine to load and analyse Data\n# MAGIC This sample is composed of two main sections and goals:\n# MAGIC 1. **Data Load (integrate data into the Lakehouse)**\n# MAGIC     - load 2 data sources\n# MAGIC     - join both sources and enhance the dataset with more information\n# MAGIC     - write the output into a target table\n# MAGIC 2. **Data Analysis (analyse the data ingested in the previous step)**\n# MAGIC     - read the ingested data\n# MAGIC     - assess the quality of that data\n# MAGIC     - output this data as a DataFrame to enable further processing\n# MAGIC     - analyse the data with sample Databricks Notebook Dashboards\n# MAGIC\n# MAGIC The base dataset used, on this sample, is the TPCH Dataset from Databricks Datasets (https://docs.databricks.com/en/discover/databricks-datasets.html).\n# MAGIC Moreover, Databricks Notebook Dashboards are also used. This is why this example consists of a Databricks python Notebook, instead of simple raw python.\n\n# COMMAND ----------\n\n# You can install the Lakehouse Engine framework with below command just like any other python library,\n# or you can also install it as a cluster-scoped library\n%pip install lakehouse-engine\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC #### 1. Data Load\n# MAGIC On this section an example is provided in order to accomplish the following:\n# MAGIC - loading `orders` and `customers` TPCH data\n# MAGIC - add current date, join both data sources and identify Super VIPs\n# MAGIC - write data into the final table\n# MAGIC\n# MAGIC **Note:** as it can be seen in the following code, the Lakehouse Engine cannot offer transformers for everything one might want to do on the data, as there may be very specific use cases. This is why the Lakehouse Engine provides full flexibility with Custom Transformations (`custom_transformation`), which can be used to pass any custom function, as the `is_a_super_vip` function used on this example. 
\n\n# COMMAND ----------\n\nfrom pyspark.sql.functions import col\nfrom pyspark.sql import DataFrame\n\ndef is_a_super_vip(df: DataFrame) -> DataFrame:\n    \"\"\"Example of custom transformation.\n    \n    It checks if the totalprice for a particular order is within the \n    10% higher and if the order priority is URGENT.\n    If both criterias are met, the customer is considered a super vip.\n\n    Args:\n        df: DataFrame passed as input.\n\n    Returns:\n        DataFrame: the transformed DataFrame.\n    \"\"\"\n    percentile_90 = df.approxQuantile(\"o_totalprice\", [0.9], 0)[0]\n    df = df.withColumn(\n            \"is_a_super_vip\", \n            (col(\"o_totalprice\") >= percentile_90) & \n            (col(\"o_orderpriority\") == \"1-URGENT\")\n        )\n    return df\n\n# COMMAND ----------\n\nacon = {\n        \"input_specs\": [\n            # Batch (streaming is also supported) read tpch orders delta files from Databricks datasets location\n            {\n                \"spec_id\": \"tpch_orders\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"delta\",\n                \"location\": \"/databricks-datasets/tpch/delta-001/orders\",\n            },\n            # Batch read tpch customers from a samples delta table in Databricks\n            {\n                \"spec_id\": \"tpch_customer\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"delta\",\n                \"db_table\": \"samples.tpch.customer\",\n            }\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"tpch_orders_transformed\",\n                \"input_id\": \"tpch_orders\",\n                \"transformers\": [\n                    # Add current date to easily track when a particular row was added\n                    {\n                        \"function\": \"add_current_date\",\n                        \"args\": {\n                            \"output_col\": \"lak_load_date\"\n                        }\n                    },\n                    # Join orders with customers to get the customer name.\n                    # Having customer name in the table will make analysis easier\n                    {\n                        \"function\": \"join\",\n                        \"args\": {\n                            \"join_with\": \"tpch_customer\",\n                            \"join_type\": \"left outer\",\n                            \"join_condition\": \"a.o_custkey = b.c_custkey\",\n                            \"select_cols\": [\"a.*\", \"b.c_name as customer_name\"]\n                        }\n                    },\n                    # Custom transformation to assess if a customer should be considered Super VIP.\n                    {\n                        \"function\": \"custom_transformation\",\n                        \"args\": {\"custom_transformer\": is_a_super_vip},\n                    }\n                ],\n            },\n        ],\n        \"output_specs\": [\n            # Overwrite data into an external table on top of the specified location, using delta data format.\n            # Note: other write types are supported, such as append and merge, but overwrite is used for simplicity on this demo.\n            {\n                \"spec_id\": \"tpch_orders_output\",\n                \"input_id\": \"tpch_orders_transformed\",\n                \"write_type\": \"overwrite\",\n                \"db_table\": \"your_database.tpch_orders\",\n                \"location\": 
\"s3://your_s3_bucket/silver/tpch_orders/\",\n                \"data_format\": \"delta\",\n            }\n        ],\n    }\n\nfrom lakehouse_engine.engine import load_data\n\ntpch_df = load_data(acon=acon)\n\n# COMMAND ----------\n\n# As soon as the algorithm is finished, the dataframe output of the framework can be directly checked in order to analyse the data that have been just produced\ndisplay(tpch_df[\"tpch_orders_output\"])\n\n# COMMAND ----------\n\n# MAGIC %md\n# MAGIC #### 2. Data Analysis\n# MAGIC On this section an example is provided in order to accomplish the following:\n# MAGIC - reading the data loaded on the previous step, using a SQL query\n# MAGIC - assess the quality of the data, by applying Data Quality functions/expectations\n# MAGIC - output the data as a DataFrame for further processing\n# MAGIC - analyse the data with sample Databricks Notebook Dashboards\n\n# COMMAND ----------\n\nacon = {\n        \"input_specs\": [\n            # Batch read a custom SQL query from the table we have just inserted data into\n            {\n                \"spec_id\": \"tpch_orders\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"sql\",\n                \"query\": \"\"\"\n                    SELECT o_orderkey, customer_name, o_totalprice, is_a_super_vip\n                    FROM your_database.tpch_orders\n                \"\"\",\n            },\n        ],\n        \"dq_specs\": [\n            # Assess the quality of data, by ensuring that the specified 3 columns have no nulls.\n            {\n                \"spec_id\": \"tpch_orders_dq\",\n                \"input_id\": \"tpch_orders\",\n                \"dq_type\": \"validator\",\n                \"bucket\": \"your_s3_bucket\",\n                \"dq_functions\": [\n                    {\"function\": \"expect_column_values_to_not_be_null\", \"args\": {\"column\": \"o_orderkey\"}},\n                    {\"function\": \"expect_column_values_to_not_be_null\", \"args\": {\"column\": \"customer_name\"}},\n                    {\"function\": \"expect_column_values_to_not_be_null\", \"args\": {\"column\": \"o_totalprice\"}}\n                    ]\n            },\n        ],\n        \"output_specs\": [\n            # As the data is being analysed, there is no need to write it into any table or location.\n            # Thus, the data output is just a Dataframe that can be used for further debug or processing.\n            {\n                \"spec_id\": \"validated_tpch_orders\",\n                \"input_id\": \"tpch_orders_dq\",\n                \"data_format\": \"dataframe\",\n            }\n        ],\n    }\n\nfrom lakehouse_engine.engine import load_data\n\nvalidated_tpch_df = load_data(acon=acon)\n\n# COMMAND ----------\n\n# Create a Temporary View to make it easier to interact with the Data using SQL\nvalidated_tpch_df[\"validated_tpch_orders\"].createOrReplaceTempView(\"tpch_order_analysis\")\n\n# COMMAND ----------\n\n# MAGIC %sql\n# MAGIC -- the data that came from the previous load_data algorithm execution can now be queried\n# MAGIC -- to analyse the customers and orders classified as SUPER VIP\n# MAGIC SELECT customer_name, o_totalprice, is_a_super_vip\n# MAGIC FROM tpch_order_analysis\n# MAGIC GROUP BY customer_name, o_totalprice, is_a_super_vip\n# MAGIC ORDER BY o_totalprice desc\n\n# COMMAND ----------\n\n# MAGIC %sql\n# MAGIC SELECT customer_name, o_totalprice\n# MAGIC FROM tpch_order_analysis\n# MAGIC WHERE is_a_super_vip is True\n# MAGIC GROUP BY customer_name, o_totalprice\n# MAGIC 
ORDER BY o_totalprice desc\n# MAGIC LIMIT 10\n\n# COMMAND ----------\n\n\n"
  },
  {
    "path": "tests/__init__.py",
    "content": "\"\"\"Tests package.\"\"\"\n"
  },
  {
    "path": "tests/configs/__init__.py",
    "content": "\"\"\"This module has the engine test configurations.\"\"\"\n"
  },
  {
    "path": "tests/configs/engine.yaml",
    "content": "dq_bucket: /app/tests/lakehouse/out/feature\ndq_dev_bucket: /app/tests/lakehouse/out/feature\nnotif_disallowed_email_servers:\n  - smtp.test.com\nengine_usage_path: file:///app/tests/lakehouse/logs/lakehouse-engine-logs\nengine_dev_usage_path: file:///app/tests/lakehouse/logs/lakehouse-engine-logs\ncollect_engine_usage: disabled\ndq_functions_column_list:\n  - dq_rule_id\n  - execution_point\n  - filters\n  - schema\n  - table\n  - column\n  - dimension\ndq_result_sink_columns_to_delete:\n  - partial_unexpected_list\n  - partial_unexpected_counts\n  - partial_unexpected_index_list\n  - unexpected_list\nsharepoint_authority: https://login.microsoftonline.com\nsharepoint_api_domain: https://graph.microsoft.com\nsharepoint_company_domain: company_name.sharepoint.com\nprod_catalog: sample_catalog"
  },
  {
    "path": "tests/conftest.py",
    "content": "\"\"\"Module to configure the test environment.\"\"\"\n\nfrom typing import Any, Generator\nfrom unittest.mock import patch\n\nimport pytest\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom tests.utils.exec_env_helpers import ExecEnvHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nRESOURCES = \"/app/tests/resources/\"\nFEATURE_RESOURCES = RESOURCES + \"feature\"\nUNIT_RESOURCES = RESOURCES + \"unit\"\nLAKEHOUSE = \"/app/tests/lakehouse/\"\nLAKEHOUSE_FEATURE_IN = LAKEHOUSE + \"in/feature\"\nLAKEHOUSE_FEATURE_CONTROL = LAKEHOUSE + \"control/feature\"\nLAKEHOUSE_FEATURE_OUT = LAKEHOUSE + \"out/feature\"\nLAKEHOUSE_FEATURE_LOGS = LAKEHOUSE + \"logs/lakehouse-engine-logs\"\n\n\n@pytest.fixture(scope=\"session\", autouse=True)\ndef patch_databricks_utils_job_info() -> Generator:\n    \"\"\"Patch DatabricksUtils.get_databricks_job_information to return local values.\"\"\"\n    with patch(\n        \"lakehouse_engine.utils.databricks_utils.\"\n        \"DatabricksUtils.get_databricks_job_information\",\n        return_value=(\"local\", \"local\"),\n    ):\n        yield\n\n\ndef pytest_addoption(parser: Any) -> Any:\n    \"\"\"Setting extra options for pytest command.\"\"\"\n    parser.addoption(\n        \"--spark_driver_memory\",\n        action=\"store\",\n        help=\"memory limit for the spark driver (default 2g)\",\n    )\n\n\n@pytest.fixture(scope=\"session\", autouse=True)\ndef spark_driver_memory(request: Any) -> Any:\n    \"\"\"Fetching the value of spark_driver_memory parameter.\"\"\"\n    return request.config.getoption(name=\"--spark_driver_memory\")\n\n\n@pytest.fixture(scope=\"session\", autouse=True)\ndef prepare_exec_env(spark_driver_memory: str) -> None:\n    \"\"\"Prepare the execution environment before any test is executed.\"\"\"\n    # remove previous test lakehouse data\n    LocalStorage.clean_folder(LAKEHOUSE)\n    ExecEnv.set_default_engine_config(\"tests.configs\")\n    ExecEnvHelpers.prepare_exec_env(spark_driver_memory)\n    ExecEnv.SESSION.sql(f\"CREATE DATABASE IF NOT EXISTS test_db LOCATION '{LAKEHOUSE}'\")\n\n\n@pytest.fixture(autouse=True)\ndef before_each_test() -> Generator:\n    \"\"\"Reset default spark session configs.\"\"\"\n    yield\n    ExecEnvHelpers.reset_default_spark_session_configs()\n\n\n@pytest.fixture(scope=\"session\", autouse=True)\ndef test_session_closure(request: Any) -> None:\n    \"\"\"Finalizing resources.\"\"\"\n\n    def finalizer() -> None:\n        \"\"\"Close spark session.\"\"\"\n        ExecEnv.SESSION.stop()\n\n    request.addfinalizer(finalizer)\n"
  },
  {
    "path": "tests/feature/__init__.py",
    "content": "\"\"\"Feature tests focusing on algorithm execution with different acon functionalities.\"\"\"\n"
  },
  {
    "path": "tests/feature/custom_expectations/__init__.py",
    "content": "\"\"\"Tests related to the custom expectation's implementation.\"\"\"\n"
  },
  {
    "path": "tests/feature/custom_expectations/test_custom_expectations.py",
    "content": "\"\"\"Test custom expectation validations.\"\"\"\n\nfrom json import loads\nfrom typing import Any, Tuple\n\nimport pytest\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import execute_dq_validation\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"custom_expectations\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_NAME}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"expectation_name\": \"expect_column_pair_a_to_be_smaller_or_equal_than_b\",\n            \"arguments\": {\n                \"column_A\": \"salesorder\",\n                \"column_B\": \"amount\",\n                \"margin\": 9.78,\n            },\n            \"read_type\": \"batch\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_column_pair_a_to_be_smaller_or_equal_than_b\",\n            \"arguments\": {\"column_A\": \"salesorder\", \"column_B\": \"amount\"},\n            \"read_type\": \"streaming\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_multicolumn_column_a_must_equal_b_or_c\",\n            \"arguments\": {\n                \"column_list\": [\"item\", \"itemcode\", \"amount\"],\n            },\n            \"read_type\": \"batch\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_multicolumn_column_a_must_equal_b_or_c\",\n            \"arguments\": {\n                \"column_list\": [\"item\", \"itemcode\", \"amount\"],\n            },\n            \"read_type\": \"streaming\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_queried_column_agg_value_to_be\",\n            \"arguments\": {\n                \"template_dict\": {\n                    \"column\": \"amount\",\n                    \"group_column_list\": \"year, month, day\",\n                    \"agg_type\": \"max\",\n                    \"condition\": \"lesser\",\n                    \"max_value\": 10000,\n                },\n            },\n            \"read_type\": \"batch\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_queried_column_agg_value_to_be\",\n            \"arguments\": {\n                \"template_dict\": {\n                    \"column\": \"amount\",\n                    \"group_column_list\": \"year,month,day\",\n                    \"agg_type\": \"count\",\n                    \"condition\": \"greater\",\n                    \"min_value\": 0,\n                },\n            },\n            \"read_type\": 
\"streaming\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_column_values_to_be_date_not_older_than\",\n            \"arguments\": {\n                \"column\": \"date\",\n                \"timeframe\": {\"years\": 100},\n            },\n            \"read_type\": \"streaming\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_column_values_to_be_date_not_older_than\",\n            \"arguments\": {\n                \"column\": \"date\",\n                \"timeframe\": {\"years\": 100},\n            },\n            \"read_type\": \"batch\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b\",  # noqa: E501\n            \"arguments\": {\"column_A\": \"EDATU\", \"column_B\": \"ERDAT\"},\n            \"read_type\": \"streaming\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b\",  # noqa: E501\n            \"arguments\": {\"column_A\": \"MBDAT\", \"column_B\": \"ERDATA\"},\n            \"read_type\": \"batch\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_column_pair_a_to_be_not_equal_to_b\",\n            \"arguments\": {\n                \"column_A\": \"group_article\",\n                \"column_B\": \"article_number\",\n            },\n            \"read_type\": \"streaming\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_column_pair_a_to_be_not_equal_to_b\",\n            \"arguments\": {\n                \"column_A\": \"group_article\",\n                \"column_B\": \"article_number\",\n            },\n            \"read_type\": \"batch\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_column_values_to_not_be_null_or_empty_string\",\n            \"arguments\": {\n                \"column\": \"number\",\n            },\n            \"read_type\": \"streaming\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n        {\n            \"expectation_name\": \"expect_column_values_to_not_be_null_or_empty_string\",\n            \"arguments\": {\n                \"column\": \"number\",\n            },\n            \"read_type\": \"batch\",\n            \"input_type\": \"dataframe_reader\",\n            \"custom_expectation_result\": \"success\",\n        },\n    ],\n)\ndef test_custom_expectation(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test the implementation of the custom expectations.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    _clean_folders(scenario[\"expectation_name\"])\n\n    input_spec = {\n        \"spec_id\": \"sales_source\",\n        \"read_type\": 
scenario[\"read_type\"],\n        \"data_format\": \"dataframe\",\n        \"df_name\": _generate_dataframe(\n            scenario[\"read_type\"], scenario[\"expectation_name\"]\n        ),\n    }\n\n    acon = _generate_acon(input_spec, scenario, \"validator\")\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario['expectation_name']}/data/control/*\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario['expectation_name']}/data/\",\n    )\n\n    execute_dq_validation(acon=acon)\n\n    dq_result_df, dq_control_df = _get_result_and_control_dfs(\n        \"test_db.sales_order\",\n        f'dq_control_{scenario[\"custom_expectation_result\"]}',\n        True,\n        scenario[\"expectation_name\"],\n    )\n\n    assert not DataframeHelpers.has_diff(\n        dq_result_df.select(\"spec_id\", \"input_id\", \"success\"),\n        dq_control_df.fillna(\"\").select(\"spec_id\", \"input_id\", \"success\"),\n    )\n\n    for key in dq_result_df.collect():\n        for result in loads(key.validation_results):\n            assert {\n                \"success\",\n                \"expectation_config\",\n            }.issubset(result.keys())\n\n\ndef _clean_folders(expectation_name: str) -> None:\n    \"\"\"Clean test folders and tables.\"\"\"\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_IN}/{expectation_name}/data\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}/{expectation_name}/data\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}/{expectation_name}/checkpoint\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}/{expectation_name}/dq\")\n    ExecEnv.SESSION.sql(\"DROP TABLE IF EXISTS test_db.dq_sales\")\n    ExecEnv.SESSION.sql(\"DROP TABLE IF EXISTS test_db.sales_order\")\n\n\ndef _generate_acon(\n    input_spec: dict,\n    scenario: dict,\n    dq_type: str,\n) -> dict:\n    \"\"\"Generate acon according to test scenario.\n\n    Args:\n        input_spec: input specification.\n        scenario: the scenario being tested.\n        dq_type: the type of data quality process.\n\n    Returns: a dict corresponding to the generated acon.\n    \"\"\"\n    dq_spec_add_options = {\n        \"result_sink_db_table\": \"test_db.sales_order\",\n        \"result_sink_format\": \"json\",\n        \"result_sink_explode\": False,\n        \"dq_functions\": [\n            {\n                \"function\": scenario[\"expectation_name\"],\n                \"args\": scenario[\"arguments\"],\n            }\n        ],\n    }\n\n    return {\n        \"input_spec\": input_spec,\n        \"dq_spec\": {\n            \"spec_id\": \"dq_sales\",\n            \"input_id\": \"sales_source\",\n            \"dq_type\": dq_type,\n            \"store_backend\": \"file_system\",\n            \"local_fs_root_dir\": f\"{TEST_LAKEHOUSE_OUT}/{scenario['expectation_name']}/dq\",  # noqa: E501\n            **dq_spec_add_options,\n        },\n        \"restore_prev_version\": scenario.get(\"restore_prev_version\", False),\n    }\n\n\ndef _generate_dataframe(load_type: str, expectation_name: str) -> DataFrame:\n    \"\"\"Generate test dataframe.\n\n    Args:\n        load_type: batch or streaming.\n        expectation_name: name of the expectation to test\n\n    Returns: the generated dataframe.\n    \"\"\"\n    if load_type == \"batch\":\n        input_df = (\n            ExecEnv.SESSION.read.format(\"csv\")\n            .option(\"header\", True)\n            .option(\"delimiter\", \"|\")\n            .schema(\n                SchemaUtils.from_file(\n                    
f\"file://{TEST_RESOURCES}/{expectation_name}/dq_sales_schema.json\"\n                )\n            )\n            .load(f\"{TEST_RESOURCES}/{expectation_name}/data/source/part-01.csv\")\n        )\n    else:\n        input_df = (\n            ExecEnv.SESSION.readStream.format(\"csv\")\n            .option(\"header\", True)\n            .option(\"delimiter\", \"|\")\n            .schema(\n                SchemaUtils.from_file(\n                    f\"file://{TEST_RESOURCES}/{expectation_name}/dq_sales_schema.json\"\n                )\n            )\n            .load(f\"{TEST_RESOURCES}/{expectation_name}/data/source/*\")\n        )\n\n    return input_df\n\n\ndef _get_result_and_control_dfs(\n    table: str, file_name: str, infer_schema: bool, expectation_name: str\n) -> Tuple[DataFrame, DataFrame]:\n    \"\"\"Helper to get the result and control dataframes.\n\n    Args:\n        table: the table to read from.\n        file_name: the file name to read from.\n        infer_schema: whether to infer the schema or not.\n        expectation_name: expectation name.\n\n    Returns: the result and control dataframes.\n    \"\"\"\n    dq_result_df = DataframeHelpers.read_from_table(table)\n\n    dq_control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{expectation_name}/data/{file_name}.csv\",\n        file_format=\"csv\",\n        options={\"header\": True, \"delimiter\": \"|\", \"inferSchema\": infer_schema},\n    )\n\n    return dq_result_df, dq_control_df\n"
  },
  {
    "path": "tests/feature/custom_expectations/test_expectation_validity.py",
    "content": "\"\"\"Module with the validation code for the custom expectations.\"\"\"\n\nimport copy\nimport importlib\nimport re\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import DQDefaults\n\n\"\"\"This value '✔' is used to filter the output from the GX diagnostics\"\"\"\nCHECKMARK = \"\\u2714\"\nDIAGNOSTICS_VALIDATIONS = [\n    \" ✔ Has a docstring, including a one-line short description\",\n    \" ✔ Has at least one positive and negative example case, and all test cases pass\",\n    \" ✔ Has core logic and passes tests on at least one Execution Engine\",\n    \"    ✔ All [0-9]+ tests for spark are passing\",\n    \" ✔ Has core logic that passes tests for all applicable Execution Engines and SQL\"\n    \" dialects\",\n    \"    ✔ All [0-9]+ tests for spark are passing\",\n]\n\nMETRIC_NAME_TYPES = [\n    \"column_values\",\n    \"multicolumn_values\",\n    \"column_pair_values\",\n    \"table_rows\",\n    \"table_columns\",\n]\n\nMAP_METRICS = []\n\n\n@pytest.mark.parametrize(\"expectation\", DQDefaults.CUSTOM_EXPECTATION_LIST.value)\ndef test_expectation_validity(expectation: str) -> None:\n    \"\"\"Validates the custom expectations defined in the project.\n\n    Based on the diagnostics of the custom expectations this test validates if all the\n    best practices are being followed.\n    \"\"\"\n    result, metric_name = _run_diagnostics(expectation)\n\n    _process_diagnostics_output(result)\n\n    if metric_name:\n        assert _validate_metric_name_structure(metric_name), (\n            f\"Metric name {metric_name} has the incorrect format. \"\n            f\"Should be 'metric type'.'metric_name'\"\n        )\n\n        MAP_METRICS.append(metric_name)\n\n        assert len(MAP_METRICS) == len(\n            set(MAP_METRICS)\n        ), f\"Metric names repeated: {MAP_METRICS}\"\n\n\ndef _run_diagnostics(expectation_name: str) -> tuple:\n    \"\"\"Runs the diagnostics of the custom expectation.\n\n    This function both runs the Great Expectations Diagnostics and\n    retrieves the diagnostics checklist and the metric name defined.\n\n    Args:\n        expectation_name: name of the expectation file.\n\n    Returns:\n        The output of the diagnostics command and the expectation's metric name.\n    \"\"\"\n    segments = expectation_name.split(\".\")[0].split(\"_\")\n    expectation_class_name = \"\".join(ele.title() for ele in segments[0:])\n\n    module = importlib.import_module(\n        f\"lakehouse_engine.dq_processors.custom_expectations.{expectation_name}\"\n    )\n    expectation_class = getattr(module, expectation_class_name)\n    expectation = expectation_class()\n\n    metric_name = \"\"\n\n    if \"map_metric\" in dir(expectation):\n        metric_name = expectation.map_metric\n\n    return expectation.run_diagnostics().generate_checklist(), metric_name\n\n\ndef _process_diagnostics_output(diagnostics_output: str) -> None:\n    \"\"\"Processes the output from the expectation diagnostics.\n\n    Args:\n        diagnostics_output: the output from the diagnostics command.\n    \"\"\"\n    validations = copy.deepcopy(DIAGNOSTICS_VALIDATIONS)\n    for line in str(diagnostics_output).split(\"\\n\"):\n        if CHECKMARK in line:\n            for validation in validations:\n                if re.match(validation, line):\n                    validations.remove(validation)\n                    break\n\n    assert not validations, f\"Validations not met: {validations}\"\n\n\ndef _validate_metric_name_structure(metric_name: str) -> int:\n    \"\"\"Validates 
the structure of the custom expectation's metric name.\n\n    The metric name must have two parts separated by a '.',\n    and the first part must be the type of the expectation.\n\n    Args:\n        metric_name: custom expectation's metric name.\n\n    Returns:\n        Whether the custom expectation's metric name is valid.\n    \"\"\"\n    parts = metric_name.split(\".\")\n\n    if len(parts) != 2:\n        return False\n\n    if parts[0] not in METRIC_NAME_TYPES:\n        return False\n\n    return True\n"
  },
  {
    "path": "tests/feature/data_loader_custom_transformer/__init__.py",
    "content": "\"\"\"Feature tests focusing on data loader algorithm execution with custom transformer.\"\"\"\n"
  },
  {
    "path": "tests/feature/data_loader_custom_transformer/test_data_loader_custom_transformer_calculate_kpi.py",
    "content": "\"\"\"Tests for the DataLoader algorithm with custom transformations.\"\"\"\n\nimport pytest\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import InputFormat\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"data_loader_custom_transformer\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\ndef yet_another_kpi_calculator(df: DataFrame) -> DataFrame:\n    \"\"\"An example custom transformer that will be provided in the ACON.\n\n    Args:\n        df: DataFrame passed as input.\n\n    Returns:\n        DataFrame: the transformed DataFrame.\n    \"\"\"\n    session = ExecEnv.SESSION\n    df.createOrReplaceTempView(\"sales\")\n    kpi_df = session.sql(\n        \"\"\"\n            SELECT date, SUM(amount) AS amount\n            FROM sales\n            GROUP BY date\n        \"\"\"\n    )\n    return kpi_df\n\n\ndef get_test_acon() -> dict:\n    \"\"\"Creates a test ACON with the desired logic for the algorithm.\n\n    Returns:\n        dict: the ACON for the algorithm configuration.\n    \"\"\"\n    return {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"sales_source\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"csv\",\n                \"options\": {\"mode\": \"FAILFAST\", \"header\": True, \"delimiter\": \"|\"},\n                \"schema_path\": \"file:///app/tests/lakehouse/in/feature/\"\n                \"data_loader_custom_transformer/calculate_kpi/\"\n                \"source_schema.json\",\n                \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n                \"data_loader_custom_transformer/calculate_kpi/data\",\n            }\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"calculated_kpi\",\n                \"input_id\": \"sales_source\",\n                \"transformers\": [\n                    {\n                        \"function\": \"custom_transformation\",\n                        \"args\": {\"custom_transformer\": yet_another_kpi_calculator},\n                    }\n                ],\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"sales_bronze\",\n                \"input_id\": \"calculated_kpi\",\n                \"write_type\": \"overwrite\",\n                \"data_format\": \"delta\",\n                \"location\": \"file:///app/tests/lakehouse/out/feature/\"\n                \"data_loader_custom_transformer/calculate_kpi/data\",\n            }\n        ],\n    }\n\n\n@pytest.mark.parametrize(\"scenario\", [\"calculate_kpi\"])\ndef test_calculate_kpi_and_merge(scenario: str) -> None:\n    \"\"\"Test full load with a custom transformation function.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/*_schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/\",\n    )\n\n    
LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n\n    load_data(acon=get_test_acon())\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/*.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario}/control_schema.json\"\n        ),\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/data_loader_custom_transformer/test_data_loader_custom_transformer_delta_load.py",
    "content": "\"\"\"Tests for the DataLoader algorithm with custom transformations.\"\"\"\n\nimport pytest\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col\n\nfrom lakehouse_engine.core.definitions import InputFormat\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import load_data\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"data_loader_custom_transformer\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\ndef multiply_by_100(df: DataFrame) -> DataFrame:\n    \"\"\"An example custom transformer that will be provided in the ACON.\n\n    Args:\n        df: DataFrame passed as input.\n\n    Returns:\n        DataFrame: the transformed DataFrame.\n    \"\"\"\n    multiplied_df = df.withColumn(\"amount\", col(\"amount\") * 100)\n    return multiplied_df\n\n\ndef get_test_acon() -> dict:\n    \"\"\"Creates a test ACON with the desired logic for the algorithm.\n\n    Returns:\n        dict: the ACON for the algorithm configuration.\n    \"\"\"\n    return {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"sales_source\",\n                \"read_type\": \"streaming\",\n                \"data_format\": \"csv\",\n                \"options\": {\"header\": True, \"delimiter\": \"|\"},\n                \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n                \"data_loader_custom_transformer/delta_load/data\",\n            }\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"transformed_sales_source\",\n                \"input_id\": \"sales_source\",\n                \"transformers\": [\n                    {\n                        \"function\": \"custom_transformation\",\n                        \"args\": {\"custom_transformer\": multiply_by_100},\n                    },\n                    {\n                        \"function\": \"condense_record_mode_cdc\",\n                        \"args\": {\n                            \"business_key\": [\"salesorder\", \"item\"],\n                            \"ranking_key_desc\": [\n                                \"actrequest_timestamp\",\n                                \"datapakid\",\n                                \"partno\",\n                                \"record\",\n                            ],\n                            \"record_mode_col\": \"recordmode\",\n                            \"valid_record_modes\": [\"\", \"N\", \"R\", \"D\", \"X\"],\n                        },\n                    },\n                ],\n            }\n        ],\n        \"dq_specs\": [\n            {\n                \"spec_id\": \"checked_transformed_sales_source\",\n                \"input_id\": \"transformed_sales_source\",\n                \"dq_type\": \"validator\",\n                \"store_backend\": \"file_system\",\n                \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/\"\n                \"data_loader_custom_transformer/dq\",\n                \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n                \"dq_functions\": [\n                
    {\n                        \"function\": \"expect_column_values_to_not_be_null\",\n                        \"args\": {\"column\": \"article\"},\n                    }\n                ],\n            },\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"sales_bronze\",\n                \"input_id\": \"checked_transformed_sales_source\",\n                \"write_type\": \"merge\",\n                \"data_format\": \"delta\",\n                \"location\": \"file:///app/tests/lakehouse/out/feature/\"\n                \"data_loader_custom_transformer/delta_load/data\",\n                \"options\": {\n                    \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/\"\n                    \"data_loader_custom_transformer/delta_load/checkpoint\"\n                },\n                \"merge_opts\": {\n                    \"merge_predicate\": \"current.salesorder = new.salesorder \"\n                    \"and current.item = new.item \"\n                    \"and current.date <=> new.date\",\n                    \"update_predicate\": \"new.actrequest_timestamp > \"\n                    \"current.actrequest_timestamp or ( \"\n                    \"new.actrequest_timestamp = \"\n                    \"current.actrequest_timestamp and \"\n                    \"new.datapakid > current.datapakid) or ( \"\n                    \"new.actrequest_timestamp = \"\n                    \"current.actrequest_timestamp and \"\n                    \"new.datapakid = current.datapakid and \"\n                    \"new.partno > current.partno) or ( \"\n                    \"new.actrequest_timestamp = \"\n                    \"current.actrequest_timestamp and \"\n                    \"new.datapakid = current.datapakid and \"\n                    \"new.partno = current.partno and new.record \"\n                    \">= current.record)\",\n                    \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n                    \"insert_predicate\": \"new.recordmode is null or new.recordmode \"\n                    \"not in ('R','D','X')\",\n                },\n            }\n        ],\n        \"exec_env\": {\"spark.sql.streaming.schemaInference\": True},\n    }\n\n\n@pytest.mark.parametrize(\"scenario\", [\"delta_load\"])\ndef test_delta_load(scenario: str) -> None:\n    \"\"\"Test full load with a custom transformation function.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    _create_table(\n        f\"{scenario}\",\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    load_data(acon=get_test_acon())\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-03.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    load_data(acon=get_test_acon())\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-02.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    load_data(acon=get_test_acon())\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-04.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    load_data(acon=get_test_acon())\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    result_df = 
DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\ndef _create_table(table_name: str, location: str) -> None:\n    \"\"\"Create test table.\n\n    Args:\n        table_name: name of the table.\n        location: location of the table.\n    \"\"\"\n    ExecEnv.SESSION.sql(\n        f\"\"\"\n        CREATE TABLE IF NOT EXISTS test_db.{table_name} (\n            actrequest_timestamp string,\n            request string,\n            datapakid int,\n            partno int,\n            record int,\n            salesorder int,\n            item int,\n            recordmode string,\n            date int,\n            customer string,\n            article string,\n            amount int\n        )\n        USING delta\n        LOCATION '{location}'\n        \"\"\"\n    )\n"
  },
  {
    "path": "tests/feature/data_loader_custom_transformer/test_data_loader_custom_transformer_sql_transformation.py",
    "content": "\"\"\"Tests for the DataLoader algorithm with custom transformations.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import InputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"data_loader_custom_transformer\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\nSQL = \"\"\"\n    SELECT date, SUM(amount) AS amount\n    FROM sales_sql\n    GROUP BY date\n\"\"\"\n\n\ndef get_test_acon() -> dict:\n    \"\"\"Creates a test ACON with the desired logic for the algorithm.\n\n    Returns:\n        dict: the ACON for the algorithm configuration.\n    \"\"\"\n    return {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"sales_source\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"csv\",\n                \"options\": {\"mode\": \"FAILFAST\", \"header\": True, \"delimiter\": \"|\"},\n                \"schema_path\": \"file:///app/tests/lakehouse/in/feature/\"\n                \"data_loader_custom_transformer/sql_transformation/\"\n                \"source_schema.json\",\n                \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n                \"data_loader_custom_transformer/sql_transformation/data\",\n                \"temp_view\": \"sales_sql\",\n            }\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"calculated_kpi\",\n                \"input_id\": \"sales_source\",\n                \"transformers\": [\n                    {\n                        \"function\": \"sql_transformation\",\n                        \"args\": {\"sql\": SQL},\n                    }\n                ],\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"sales_bronze\",\n                \"input_id\": \"calculated_kpi\",\n                \"write_type\": \"overwrite\",\n                \"data_format\": \"delta\",\n                \"location\": \"file:///app/tests/lakehouse/out/feature/\"\n                \"data_loader_custom_transformer/sql_transformation/data\",\n            }\n        ],\n    }\n\n\n@pytest.mark.parametrize(\"scenario\", [\"sql_transformation\"])\ndef test_sql_transformation_and_merge(scenario: str) -> None:\n    \"\"\"Test full load with a custom sql transformation function.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/*_schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n\n    load_data(acon=get_test_acon())\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/*.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        
file_format=InputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario}/control_schema.json\"\n        ),\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/delta_load/__init__.py",
    "content": "\"\"\"Delta load feature tests.\"\"\"\n"
  },
  {
    "path": "tests/feature/delta_load/test_delta_load_group_and_rank.py",
    "content": "\"\"\"Test delta loads with group and rank.\"\"\"\n\nfrom typing import List\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import InputFormat\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"delta_load/group_and_rank\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\"with_duplicates_in_same_file\", \"batch\"],\n        [\"with_duplicates_in_same_file\", \"streaming\"],\n        [\"fail_with_duplicates_in_same_file\", \"batch\"],\n        [\"fail_with_duplicates_in_same_file\", \"streaming\"],\n    ],\n)\ndef test_delta_load_group_and_rank(scenario: List[str]) -> None:\n    \"\"\"Test delta loads in batch mode.\n\n    Args:\n        scenario: scenario to test.\n            with_duplicates_in_same_file - This test includes duplicated rows in the\n            same file produced by the source (e.g., an order is cancelled and created\n            within the same file).\n            fail_with_duplicates_in_same_file - purposely checks if the delta load fails\n            (result has a diff compared to the control data), because sales order 7 item\n            1 as cancelled status before created in the second source data file.\n    \"\"\"\n    _create_table(scenario)\n\n    execute_loads(scenario, 1)\n\n    if scenario[1] == \"streaming\":\n        # simulate a scenario where the same data is loaded twice in streaming mode\n        execute_loads(scenario, 2)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/control/{scenario[1]}.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/{scenario[1]}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/{scenario[1]}/data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/{scenario[1]}/data/{scenario[1]}.csv\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}/\"\n            f\"control_{scenario[1]}_schema.json\"\n        ),\n    )\n\n    if scenario[0] == \"fail_with_duplicates_in_same_file\":\n        # sales order 7 item 1 in second file has event cancelled before created\n        assert DataframeHelpers.has_diff(result_df, control_df)\n    else:\n        assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\ndef execute_loads(scenario: List[str], iteration: int) -> None:\n    \"\"\"Execute the data loads.\n\n    Args:\n        scenario: scenario to test.\n        iteration: number indicating the iteration in the testing process.\n            This is useful because in this test we want to repeat the same loading\n            process twice, to simulate a scenario where the same data is loaded twice.\n    \"\"\"\n    
LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/WE_SO_SCL_202108111400000000.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}\"\n        f\"/data/WE_SO_SCL_202108111400000000.csv{iteration}\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}/\",\n    )\n    load_data(\n        f\"file://{TEST_RESOURCES}/{scenario[0]}/\"\n        f\"{scenario[1] + ('_init' if scenario[1] == 'batch' else '_delta')}.json\"\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/WE_SO_SCL_202108111500000000.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}\"\n        f\"/data/WE_SO_SCL_202108111500000000.csv{iteration}\",\n    )\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/{scenario[0]}/{scenario[1]}_delta.json\"\n    )\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/WE_SO_SCL_202108111600000000.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}\"\n        f\"/data/WE_SO_SCL_202108111600000000.csv{iteration}\",\n    )\n    load_data(acon=acon)\n\n\ndef _create_table(scenario: List[str]) -> None:\n    \"\"\"Create test table.\n\n    Args:\n        scenario: scenario being tested.\n    \"\"\"\n    ExecEnv.SESSION.sql(\n        f\"\"\"\n        CREATE TABLE IF NOT EXISTS test_db.{scenario[0]}_{scenario[1]} (\n            salesorder int,\n            item int,\n            event string,\n            changed_on int,\n            date int,\n            customer string,\n            article string,\n            amount int,\n            {\"extraction_date string,\"\n             if scenario[1] == \"streaming\" else \"lhe_row_id int,\"}\n            {\"lhe_batch_id int,\" if scenario[1] == \"streaming\" else \"\"}\n            {\"lhe_row_id int\"\n             if scenario[1] == \"streaming\" else \"extraction_date string\"}\n        )\n        USING delta\n        LOCATION '{TEST_LAKEHOUSE_OUT}/{scenario[0]}/{scenario[1]}/data'\n        \"\"\"\n    )\n"
  },
  {
    "path": "tests/feature/delta_load/test_delta_load_merge_options.py",
    "content": "\"\"\"Test delta loads with different merge options.\"\"\"\n\nfrom typing import List\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import InputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"delta_load/merge_options\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\"update_column_set\", \"insert_column_set\", \"update_all\"],\n)\ndef test_delta_load_merge_options(scenario: List[str]) -> None:\n    \"\"\"Test upsert for specific columns in batch mode.\n\n    Args:\n        scenario: scenario to test.\n            update_column_set - This test uses whenMatchedUpdate option. It allows to\n                update a matched table row based on the rules defined in\n                update_column_set, instead of updating all the columns of the matched\n                table row with the values of the corresponding columns in the source\n                 row.\n            insert_column_set - This test uses whenNotMatchedInsert option. It allows to\n                insert a new row to the target table based on the rules defined in\n                insert_column_set, instead of inserting a new target Delta table row\n                by assigning the target columns to the values of the corresponding\n                columns in the source row.\n            update_all - This test uses whenMatchedUpdateAll option. 
It allows to\n                update a matched table updating all the columns with the values\n                of the corresponding columns in the source row.\n    \"\"\"\n    execute_loads(scenario)\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/batch.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/batch.csv\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario}/control_batch_schema.json\"\n        ),\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\ndef execute_loads(scenario: List[str]) -> None:\n    \"\"\"Execute the data loads.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/WE_SO_SCL_202108111400000000.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/WE_SO_SCL_202108111400000000.csv\",\n    )\n\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch_init.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/WE_SO_SCL_202108111500000000.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/WE_SO_SCL_202108111500000000.csv\",\n    )\n\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch_delta.json\")\n    load_data(acon=acon)\n"
  },
  {
    "path": "tests/feature/delta_load/test_delta_load_record_mode_cdc.py",
    "content": "\"\"\"Test delta loads with record mode based cdc.\"\"\"\n\nfrom typing import List\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import InputFormat\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"delta_load/record_mode_cdc\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\"with_deletes_additional_columns\", \"csv\"],\n        [\"with_duplicates\", \"csv\"],\n        [\"with_upserts_only_removed_columns\", \"json\"],\n    ],\n)\ndef test_batch_delta_load(scenario: List[str]) -> None:\n    \"\"\"Test delta loads in batch mode.\n\n    Args:\n        scenario: scenario to test (name and file format).\n    \"\"\"\n    _create_table(f\"{scenario[0]}\", f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/data\")\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/part-01.{scenario[1]}\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/data/\",\n    )\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/{scenario[0]}/batch_init.json\"\n    )\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/part-0[2,3,4].{scenario[1]}\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/data/\",\n    )\n\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/{scenario[0]}/batch_delta.json\"\n    )\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/data\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\"late_arriving_changes\", \"batch\"],\n        [\"out_of_order_changes\", \"batch\"],\n        [\"late_arriving_changes\", \"streaming\"],\n        [\"out_of_order_changes\", \"streaming\"],\n    ],\n)\ndef test_file_by_file(scenario: str) -> None:\n    \"\"\"Test delta loads in batch mode.\n\n    Args:\n        scenario: scenario to test.\n            late_arriving_changes - This test checks if if changes arrive late (certain\n            changes on part-02 are incomplete and only arrive in part-03), the data\n            stays consistent.\n            out_of_order_changes - This test checks if by loading the data out of order\n            (part-03 is loaded before part-02) the delta table stays consistent.\n    \"\"\"\n    _create_table(\n        f\"{scenario[0]}_{scenario[1]}\",\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/{scenario[1]}/data\",\n    )\n\n    LocalStorage.copy_file(\n        
f\"{TEST_RESOURCES}/{scenario[0]}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}/data/\",\n    )\n    load_data(\n        f\"file://{TEST_RESOURCES}/{scenario[0]}/\"\n        f\"{scenario[1] + ('_init' if scenario[1] == 'batch' else '_delta')}.json\"\n    )\n\n    if scenario[0] == \"out_of_order_changes\":\n        second_file = \"part-03\"\n        third_file = \"part-02\"\n    else:\n        second_file = \"part-02\"\n        third_file = \"part-03\"\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/{second_file}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}/data/\",\n    )\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/{scenario[0]}/{scenario[1]}_delta.json\"\n    )\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/{third_file}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}/data/\",\n    )\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/part-04.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}/data/\",\n    )\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/{scenario[1]}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/{scenario[1]}/data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/{scenario[1]}/data\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\n@pytest.mark.parametrize(\"scenario\", [\"backfill\"])\ndef test_backfill(scenario: str) -> None:\n    \"\"\"Test backfill process of a delta load based table.\n\n    Args:\n        scenario: scenario to test.\n            This test performs a regular delta load and, after that, backfills from the\n            source where we simulate that all data contained in part-2, part-3 and\n            part-04 has changed to be amount * 10.\n    \"\"\"\n    _create_table(f\"{scenario}\", f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\")\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch_init.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-0[2,3,4].csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch_delta.json\")\n    load_data(acon=acon)\n\n    LocalStorage.delete_file(f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/part-0[2,3,4].csv\")\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-05.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/{scenario}/batch_backfill.json\"\n    )\n    load_data(acon=acon)\n\n    LocalStorage.delete_file(f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/part-01.csv\")\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    result_df = 
DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\n@pytest.mark.parametrize(\"scenario\", [\"direct_silver_load\"])\ndef test_direct_silver_load(scenario: str) -> None:\n    \"\"\"Test a delta load based process that loads to bronze and silver in the same run.\n\n    We get data from the source, load it to bronze and then into silver, without needing\n    to run two separate algorithms.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    _create_table(f\"{scenario}_bronze\", f\"{TEST_LAKEHOUSE_OUT}/{scenario}/bronze/data\")\n    _create_table(f\"{scenario}_silver\", f\"{TEST_LAKEHOUSE_OUT}/{scenario}/silver/data\")\n\n    scenario = \"direct_silver_load\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch_init.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-0[2,3,4].csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch_delta.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/bronze/data/\",\n    )\n\n    bronze_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/bronze/data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n    control_bronze_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/bronze/data\", file_format=\"csv\"\n    )\n\n    assert not DataframeHelpers.has_diff(bronze_df, control_bronze_df)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/part-02.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/silver/data/\",\n    )\n\n    silver_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/silver/data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n    control_silver_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/silver/data\", file_format=\"csv\"\n    )\n\n    assert not DataframeHelpers.has_diff(silver_df, control_silver_df)\n\n\ndef _create_table(table_name: str, location: str) -> None:\n    \"\"\"Create test table.\n\n    Args:\n        table_name: name of the table.\n        location: location of the table.\n    \"\"\"\n    ExecEnv.SESSION.sql(\n        f\"\"\"\n        CREATE TABLE IF NOT EXISTS test_db.{table_name} (\n            extraction_timestamp string,\n            actrequest_timestamp string,\n            request string,\n            datapakid int,\n            partno int,\n            record int,\n            salesorder int,\n            item int,\n            recordmode string,\n            date int,\n            customer string,\n            article string,\n            amount int\n        )\n        USING delta\n        LOCATION '{location}'\n        \"\"\"\n    )\n"
  },
  {
    "path": "tests/feature/test_append_load.py",
    "content": "\"\"\"Test append loads.\"\"\"\n\nfrom typing import Any\n\nimport pytest\nfrom py4j.protocol import Py4JJavaError\n\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"append_load\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_NAME}\"\n\n\n@pytest.mark.parametrize(\"scenario\", [\"jdbc_permissive\"])\ndef test_permissive_jdbc_append_load(scenario: str) -> None:\n    \"\"\"Test append loads from jdbc source with permissive read mode.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    _append_data_into_source(scenario)\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch_init.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-02.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    _append_data_into_source(scenario)\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-03.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    _append_data_into_source(scenario)\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_table(f\"test_db.{scenario}_table\")\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\n@pytest.mark.parametrize(\"scenario\", [\"failfast\"])\ndef test_failfast_append_load(scenario: str) -> None:\n    \"\"\"Test append loads with failfast read mode.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch_init.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-0[2,3].csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n\n    with pytest.raises(Py4JJavaError) as e:\n        # should raise malformed records due to failfast, as amount column was\n        # renamed to amount2 and there is one more column in the pat-03.csv file.\n        acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch.json\")\n        load_data(acon=acon)\n\n    assert \"Malformed CSV record\" in str(e.value)\n\n\n@pytest.mark.parametrize(\"scenario\", [\"streaming_dropmalformed\"])\ndef test_streaming_dropmalformed(scenario: str) -> None:\n    \"\"\"Test append loads, in streaming mode, with 
dropmalformed read mode.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/streaming.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-02.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-03.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_table(f\"test_db.{scenario}_table\")\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\",\n        schema=ConfigUtils.read_json_acon(\n            f\"file://{TEST_RESOURCES}/{scenario}/streaming.json\"\n        )[\"input_specs\"][0][\"schema\"],\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\n@pytest.mark.parametrize(\"scenario\", [\"streaming_with_terminators\"])\ndef test_streaming_with_terminators(scenario: str, caplog: Any) -> None:\n    \"\"\"Test append loads, in streaming mode, with terminator functions.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/streaming.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_table(f\"test_db.{scenario}_table\")\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\",\n        schema=ConfigUtils.read_json_acon(\n            f\"file://{TEST_RESOURCES}/{scenario}/streaming.json\"\n        )[\"input_specs\"][0][\"schema\"],\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n    assert (\n        \"sql command: OPTIMIZE test_db.streaming_with_terminators_table\" in caplog.text\n    )\n    assert \"Vacuuming table: test_db.streaming_with_terminators_table\" in caplog.text\n    assert (\n        \"sql command: ANALYZE TABLE test_db.streaming_with_terminators_table \"\n        \"COMPUTE STATISTICS\" in caplog.text\n    )\n\n\ndef _append_data_into_source(scenario: str) -> None:\n    \"\"\"Append data into jdbc sql lite table used as source for append load tests.\n\n    Args:\n        scenario: scenario being tested.\n    \"\"\"\n    source_df = DataframeHelpers.read_from_file(f\"{TEST_LAKEHOUSE_IN}/{scenario}/data\")\n    DataframeHelpers.write_into_jdbc_table(\n        source_df, f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/{scenario}/tests.db\", f\"{scenario}\"\n    )\n"
  },
  {
    "path": "tests/feature/test_data_quality.py",
    "content": "\"\"\"Test data quality process in different types of data loads.\"\"\"\n\nfrom json import loads\nfrom typing import Any\n\nimport pytest\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import array_sort, col, regexp_replace, transform\nfrom pyspark.sql.types import IntegerType, StringType, StructField, StructType\n\nfrom lakehouse_engine.core.definitions import (\n    DQExecutionPoint,\n    DQFunctionSpec,\n    DQSpec,\n    DQType,\n)\nfrom lakehouse_engine.dq_processors.dq_factory import DQFactory\nfrom lakehouse_engine.dq_processors.exceptions import DQValidationsFailedException\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.dq_utils import PrismaUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.dq_rules_table_utils import _create_dq_functions_source_table\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"data_quality\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"delta_with_duplicates\",\n            \"read_type\": \"streaming\",\n            \"results_exploded\": True,\n            \"tag_source_data\": False,\n        },\n        {\n            \"name\": \"delta_with_duplicates_tag\",\n            \"read_type\": \"streaming\",\n            \"results_exploded\": True,\n            \"tag_source_data\": True,\n        },\n        {\n            \"name\": \"delta_with_dupl_tag_gen_fail\",\n            \"read_type\": \"streaming\",\n            \"results_exploded\": True,\n            \"tag_source_data\": True,\n        },\n        {\n            \"name\": \"no_transformers\",\n            \"read_type\": \"streaming\",\n            \"results_exploded\": False,\n            \"tag_source_data\": False,\n        },\n        {\n            \"name\": \"full_overwrite\",\n            \"read_type\": \"batch\",\n            \"results_exploded\": True,\n            \"tag_source_data\": False,\n        },\n        {\n            \"name\": \"full_overwrite_tag\",\n            \"read_type\": \"batch\",\n            \"results_exploded\": True,\n            \"tag_source_data\": True,\n        },\n    ],\n)\ndef test_load_with_dq_validator(scenario: dict) -> None:\n    \"\"\"Test the data quality validator process as part of the load_data algorithm.\n\n    Description of the test scenarios:\n        - delta_with_duplicates - test the DQ process for a streaming\n        init and delta load with duplicates and merge strategy scenario.\n        It's generated a DQ result_sink where some columns are exploded to make easier\n        the analysis.\n        - delta_with_duplicates_tag - similar to delta_with_duplicates but using DQ Row\n        Tagging. The scenarios with tagging, test not only the loads and the result\n        DQ sink, but also the resulting data to assert the \"dq_validations\" column\n        that gets added into the source data used. 
This scenario covers different\n        kinds of expectations (table, column aggregated, column, multi-column,\n        column pair) with successes and failures.\n        - delta_with_dupl_tag_gen_fail - similar to delta_with_duplicates_tag, but\n        tests DQ success on init and then only general failures (not row level).\n        - no_transformers - test the DQ process for a streaming init and delta\n        without transformers or micro batch transformers. It's generated a DQ\n        result_sink in a raw format.\n        - full_overwrite - test the DQ process for a batch full overwrite scenario.\n        It's generated a DQ result_sink where some columns are exploded to make easier\n        the analysis, in which includes some extra columns set by\n        the user to be included (using parameter result_sink_extra_columns).\n        - full_overwrite_tag - similar to full_overwrite but using DQ Row\n        Tagging. This scenario covers different kinds of expectations, all succeeded.\n\n    Args:\n        scenario: scenario to test.\n            name - name of the scenario.\n            read_type - type of read, namely batch or streaming.\n            results_exploded - flag to generate a DQ result_sink in a raw format\n                (False) or an exploded format easier for analysis (True).\n            tag_source_data - whether the test scenario tests tagging the source\n                data with the DQ results or not.\n    \"\"\"\n    test_name = \"load_with_dq_validator\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{test_name}/{scenario['name']}/data/\",\n    )\n    load_data(\n        f\"file://{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"{scenario['read_type']}_init.json\"\n    )\n\n    if \"full_overwrite\" in scenario[\"name\"]:\n        LocalStorage.clean_folder(\n            f\"{TEST_LAKEHOUSE_IN}/{test_name}/{scenario['name']}/data\",\n        )\n\n    result_sink_df = DataframeHelpers.read_from_table(\n        f\"test_db.validator_{scenario['name']}\"\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"data/source/part-0[2,3,4].csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{test_name}/{scenario['name']}/data/\",\n    )\n    load_data(\n        f\"file://{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"{scenario['read_type']}_new.json\"\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"data/control/data_validator.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/validator/data/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/data/control/sales.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/data/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"data/control/*_schema.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/validator/\",\n    )\n\n    result_sink_df = DataframeHelpers.read_from_table(\n        f\"test_db.validator_{scenario['name']}\"\n    )\n\n    control_sink_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/validator/data/\",\n        file_format=\"json\",\n        schema=SchemaUtils.from_file_to_dict(\n            
f\"file://{TEST_LAKEHOUSE_CONTROL}/{test_name}/\"\n            f\"{scenario['name']}/validator/data_validator_schema.json\"\n        ),\n    )\n\n    # drop columns for which the values vary from run to run (ex: depending on date)\n    cols_to_drop = [\n        \"checkpoint_config\",\n        \"run_name\",\n        \"run_time\",\n        \"run_results\",\n        \"validation_results\",\n        \"validation_result_identifier\",\n        \"exception_info\",\n        \"batch_id\",\n        \"run_time_year\",\n        \"run_time_month\",\n        \"run_time_day\",\n        \"kwargs\",\n        \"processed_keys\",\n    ]\n\n    assert (\n        result_sink_df.columns\n        == control_sink_df.select(*result_sink_df.columns).columns\n    )\n\n    assert not DataframeHelpers.has_diff(\n        result_sink_df.drop(*cols_to_drop),\n        control_sink_df.drop(*cols_to_drop),\n    )\n\n    if scenario[\"tag_source_data\"]:\n        result_data_df = _prepare_validation_df(\n            DataframeHelpers.read_from_file(\n                f\"{TEST_LAKEHOUSE_OUT}/{test_name}/{scenario['name']}/data\",\n                file_format=\"delta\",\n            )\n        )\n\n        control_data_df = _prepare_validation_df(\n            DataframeHelpers.read_from_file(\n                f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/data/\",\n                file_format=\"json\",\n                schema=SchemaUtils.from_file_to_dict(\n                    f\"file://{TEST_LAKEHOUSE_CONTROL}/{test_name}/\"\n                    f\"{scenario['name']}/validator/sales_schema.json\"\n                ),\n            )\n        )\n\n        assert not DataframeHelpers.has_diff(result_data_df, control_data_df)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"delta_with_duplicates_tag\",\n            \"read_type\": \"streaming\",\n            \"results_exploded\": True,\n        },\n        {\n            \"name\": \"delta_with_dupl_tag_gen_fail\",\n            \"read_type\": \"streaming\",\n            \"results_exploded\": True,\n        },\n        {\n            \"name\": \"full_overwrite_tag\",\n            \"read_type\": \"batch\",\n            \"results_exploded\": True,\n        },\n    ],\n)\ndef test_load_with_dq_validator_table(scenario: dict) -> None:\n    \"\"\"Test the data quality validator process as part of the load_data algorithm.\n\n    Description of the test scenarios:\n        - delta_with_duplicates_tag - test the DQ process for a streaming\n        init and delta load with duplicates and merge strategy scenario.\n        It's generated a DQ result_sink where some columns are exploded to make easier\n        the analysis using DQ Row Tagging. 
The scenarios with tagging, test\n        not only the loads and the result DQ sink, but also the resulting data to\n        assert the \"dq_validations\" column that gets added into the source data used.\n        This scenario covers different kinds of expectations (table, column aggregated,\n        column, multi-column, column pair) with successes and failures.\n        - delta_with_dupl_tag_gen_fail - similar to delta_with_duplicates_tag, but\n        tests DQ success on init and then only general failures (not row level).\n        - full_overwrite_tag - test the DQ process for a batch full overwrite scenario.\n        It's generated a DQ result_sink where some columns are exploded to make easier\n        the analysis, in which includes some extra columns set by\n        the user to be included (using parameter result_sink_extra_columns).\n        This scenario covers different kinds of expectations, all succeeded.\n\n    Args:\n        scenario: scenario to test.\n            name - name of the scenario.\n            read_type - type of read, namely batch or streaming.\n            results_exploded - flag to generate a DQ result_sink in a raw format\n                (False) or an exploded format easier for analysis (True).\n            tag_source_data - whether the test scenario tests tagging the source\n                data with the DQ results or not.\n    \"\"\"\n    test_name = \"load_with_dq_table\"\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{test_name}/{scenario['name']}/data/\",\n    )\n    _create_dq_functions_source_table(\n        test_resources_path=TEST_RESOURCES,\n        lakehouse_in_path=TEST_LAKEHOUSE_IN,\n        lakehouse_out_path=TEST_LAKEHOUSE_OUT,\n        test_name=f\"{test_name}/{scenario['name']}\",\n        scenario=scenario[\"name\"],\n        table_name=f\"test_db.dq_functions_source_{test_name}_{scenario['name']}_init\",\n    )\n    load_data(\n        f\"file://{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"{scenario['read_type']}_init.json\"\n    )\n\n    if \"full_overwrite\" in scenario[\"name\"]:\n        LocalStorage.clean_folder(\n            f\"{TEST_LAKEHOUSE_IN}/{test_name}/{scenario['name']}/data\",\n        )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"data/source/part-0[2,3,4].csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{test_name}/{scenario['name']}/data/\",\n    )\n    _create_dq_functions_source_table(\n        test_resources_path=TEST_RESOURCES,\n        lakehouse_in_path=TEST_LAKEHOUSE_IN,\n        lakehouse_out_path=TEST_LAKEHOUSE_OUT,\n        test_name=f\"{test_name}/{scenario['name']}\",\n        scenario=scenario[\"name\"],\n        table_name=f\"test_db.dq_functions_source_{test_name}_{scenario['name']}_new\",\n    )\n    load_data(\n        f\"file://{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"{scenario['read_type']}_new.json\"\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"data/control/data_validator.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/validator/data/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/data/control/sales.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/data/\",\n    )\n\n    LocalStorage.copy_file(\n        
f\"{TEST_RESOURCES}/{test_name}/{scenario['name']}/\"\n        f\"data/control/*_schema.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/validator/\",\n    )\n\n    result_sink_df = DataframeHelpers.read_from_file(\n        location=f\"{LAKEHOUSE_FEATURE_OUT}/{scenario['name']}/result_sink/\",\n        file_format=\"delta\",\n    )\n\n    control_sink_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/validator/data/\",\n        file_format=\"json\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_CONTROL}/{test_name}/\"\n            f\"{scenario['name']}/validator/data_validator_schema.json\"\n        ),\n    )\n\n    # drop columns for which the values vary from run to run (ex: depending on date)\n    cols_to_drop = [\n        \"checkpoint_config\",\n        \"run_name\",\n        \"run_time\",\n        \"run_results\",\n        \"validation_results\",\n        \"validation_result_identifier\",\n        \"exception_info\",\n        \"batch_id\",\n        \"run_time_year\",\n        \"run_time_month\",\n        \"run_time_day\",\n        \"kwargs\",\n        \"meta\",\n    ]\n\n    assert (\n        result_sink_df.columns\n        == control_sink_df.select(*result_sink_df.columns).columns\n    )\n\n    assert not DataframeHelpers.has_diff(\n        result_sink_df.drop(*cols_to_drop),\n        control_sink_df.drop(*cols_to_drop),\n    )\n\n    result_data_df = _prepare_validation_df(\n        DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_OUT}/{test_name}/{scenario['name']}/data\",\n            file_format=\"delta\",\n        )\n    )\n\n    control_data_df = _prepare_validation_df(\n        DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_CONTROL}/{test_name}/{scenario['name']}/data/\",\n            file_format=\"json\",\n            schema=SchemaUtils.from_file_to_dict(\n                f\"file://{TEST_LAKEHOUSE_CONTROL}/{test_name}/\"\n                f\"{scenario['name']}/validator/sales_schema.json\"\n            ),\n        )\n    )\n\n    assert not DataframeHelpers.has_diff(result_data_df, control_data_df)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"spec_id\": \"dq_success\",\n            \"dq_type\": \"validator\",\n            \"dq_functions\": [\n                DQFunctionSpec(\"expect_column_to_exist\", {\"column\": \"article\"}),\n                DQFunctionSpec(\n                    \"expect_table_row_count_to_be_between\",\n                    {\"min_value\": 0, \"max_value\": 50},\n                ),\n            ],\n            \"fail_on_error\": True,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"spec_id\": \"dq_failure\",\n            \"dq_type\": \"validator\",\n            \"dq_functions\": [\n                DQFunctionSpec(\"expect_column_to_exist\", {\"column\": \"article\"}),\n                DQFunctionSpec(\n                    \"expect_table_row_count_to_be_between\",\n                    {\"min_value\": 0, \"max_value\": 1},\n                ),\n            ],\n            \"fail_on_error\": True,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"spec_id\": \"dq_failure_error_disabled\",\n            \"dq_type\": \"validator\",\n            \"dq_functions\": [\n                
DQFunctionSpec(\"expect_column_to_exist\", {\"column\": \"article\"}),\n                DQFunctionSpec(\n                    \"expect_table_row_count_to_be_between\",\n                    {\"min_value\": 0, \"max_value\": 1},\n                ),\n            ],\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"spec_id\": \"dq_failure_critical_functions\",\n            \"dq_type\": \"validator\",\n            \"dq_functions\": [\n                DQFunctionSpec(\"expect_column_to_exist\", {\"column\": \"article\"}),\n            ],\n            \"fail_on_error\": False,\n            \"critical_functions\": [\n                DQFunctionSpec(\n                    \"expect_table_row_count_to_be_between\",\n                    {\n                        \"min_value\": 0,\n                        \"max_value\": 1,\n                    },\n                ),\n            ],\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"spec_id\": \"dq_failure_max_percentage\",\n            \"dq_type\": \"validator\",\n            \"dq_functions\": [\n                DQFunctionSpec(\"expect_column_to_exist\", {\"column\": \"article\"}),\n            ],\n            \"fail_on_error\": False,\n            \"critical_functions\": [\n                DQFunctionSpec(\n                    \"expect_table_row_count_to_be_between\",\n                    {\n                        \"min_value\": 0,\n                        \"max_value\": 1,\n                    },\n                ),\n            ],\n            \"max_percentage_failure\": 0.2,\n        },\n        {\n            \"spec_id\": \"dq_success\",\n            \"dq_type\": \"prisma\",\n            \"dq_db_table\": \"test_db.dq_functions_source_dq_success\",\n            \"dq_table_table_filter\": \"dummy_sales\",\n            \"data_product_name\": \"dq_success\",\n            \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n        },\n        {\n            \"spec_id\": \"dq_failure_error_disabled\",\n            \"dq_type\": \"prisma\",\n            \"fail_on_error\": False,\n            \"dq_db_table\": None,\n            \"dq_functions\": [\n                {\n                    \"function\": \"expect_table_row_count_to_be_between\",\n                    \"args\": {\n                        \"min_value\": 0,\n                        \"max_value\": 1,\n                        \"meta\": {\n                            \"dq_rule_id\": \"rule_1\",\n                            \"execution_point\": \"in_motion\",\n                            \"schema\": \"test_db\",\n                            \"table\": \"dummy_sales\",\n                            \"column\": \"\",\n                            \"dimension\": \"\",\n                            \"filters\": \"\",\n                        },\n                    },\n                },\n                {\n                    \"function\": \"expect_table_column_count_to_be_between\",\n                    \"args\": {\n                        \"min_value\": 0,\n                        \"max_value\": 50,\n                        \"meta\": {\n                            \"dq_rule_id\": \"rule_2\",\n                            \"execution_point\": \"in_motion\",\n                            \"schema\": \"test_db\",\n                            \"table\": \"dummy_sales\",\n                            \"column\": \"\",\n                         
   \"dimension\": \"\",\n                            \"filters\": \"\",\n                        },\n                    },\n                },\n            ],\n            \"critical_functions\": [],\n            \"data_product_name\": \"dq_failure_error_disabled\",\n            \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n            \"max_percentage_failure\": None,\n        },\n    ],\n)\ndef test_validator_dq_spec(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test the data quality process using DQSpec.\n\n    Data Quality Functions tested using validator:\n    - dq_success: it tests two expectations and both are succeeded.\n    - dq_failure: it tests two expectations and one of them fails, raising an exception\n    in the DQ process.\n    - dq_failure_error_disabled: it tests one expectation and it fails, but no exception\n    is raised, because the fail_on_error is set to false.\n    - dq_failure_critical_functions: it tests two expectations where one fails, since\n    the one that fails is part of the \"critical_functions\" an exception is raised.\n    - dq_failure_max_percentage: it tests two expectations where one fails, since the\n    \"max_percentage_failure\" variable is not respected, an exception is thrown.\n    - dq_success: it tests two expectations defined using prisma and both succeed.\n    - dq_failure_error_disabled: it tests one expectation defined in prisma, by\n    manually defining the functions in the acon, and it fails, but no exception\n    is raised, because the fail_on_error is set to false.\n\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/validator/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario['dq_type']}/{scenario['spec_id']}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/validator/data/control/data_validator.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario['dq_type']}/{scenario['spec_id']}/data/\",\n    )\n    input_data = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_IN}/{scenario['dq_type']}/{scenario['spec_id']}/data\",\n        file_format=\"csv\",\n        options={\"header\": True, \"delimiter\": \"|\", \"inferSchema\": True},\n    )\n    location = TEST_LAKEHOUSE_OUT.replace(\"file://\", \"\")\n\n    if scenario[\"dq_type\"] == DQType.PRISMA.value:\n        if scenario[\"dq_db_table\"]:\n            _create_dq_functions_source_table(\n                test_resources_path=TEST_RESOURCES,\n                lakehouse_in_path=TEST_LAKEHOUSE_IN,\n                lakehouse_out_path=TEST_LAKEHOUSE_OUT,\n                test_name=\"validator\",\n                scenario=scenario[\"spec_id\"],\n                table_name=scenario[\"dq_db_table\"],\n            )\n            dq_functions = PrismaUtils.build_prisma_dq_spec(\n                scenario,\n                DQExecutionPoint.AT_REST.value,\n            )[\"dq_functions\"]\n        else:\n            dq_functions = scenario[\"dq_functions\"]\n\n        dq_spec = DQSpec(\n            spec_id=scenario[\"spec_id\"],\n            input_id=\"sales_orders\",\n            dq_type=scenario[\"dq_type\"],\n            dq_db_table=scenario[\"dq_db_table\"],\n            store_backend=\"file_system\",\n            local_fs_root_dir=f\"{location}/{scenario['dq_type']}/\"\n            f\"{scenario['spec_id']}/\",\n            result_sink_format=\"json\",\n            result_sink_explode=False,\n            
processed_keys_location=f\"{TEST_LAKEHOUSE_OUT}/{scenario['dq_type']}/\"\n            f\"{scenario['spec_id']}/processed_keys\",\n            dq_functions=[\n                DQFunctionSpec(\n                    function=dq_function[\"function\"], args=dq_function[\"args\"]\n                )\n                for dq_function in dq_functions\n            ],\n            unexpected_rows_pk=scenario[\"unexpected_rows_pk\"],\n            result_sink_location=f\"{TEST_LAKEHOUSE_OUT}/{scenario['dq_type']}/\"\n            f\"{scenario['spec_id']}/data\",\n            fail_on_error=scenario[\"fail_on_error\"],\n            max_percentage_failure=scenario[\"max_percentage_failure\"],\n        )\n    else:\n        dq_spec = DQSpec(\n            spec_id=scenario[\"spec_id\"],\n            input_id=\"sales_orders\",\n            dq_type=scenario[\"dq_type\"],\n            store_backend=\"file_system\",\n            local_fs_root_dir=f\"{location}/{scenario['dq_type']}/\"\n            f\"{scenario['spec_id']}/\",\n            result_sink_format=\"json\",\n            result_sink_explode=False,\n            unexpected_rows_pk=[\n                \"salesorder\",\n                \"item\",\n                \"date\",\n                \"customer\",\n            ],\n            dq_functions=scenario[\"dq_functions\"],\n            result_sink_location=f\"{TEST_LAKEHOUSE_OUT}/{scenario['dq_type']}/\"\n            f\"{scenario['spec_id']}/data\",\n            fail_on_error=scenario[\"fail_on_error\"],\n            critical_functions=scenario[\"critical_functions\"],\n            max_percentage_failure=scenario[\"max_percentage_failure\"],\n        )\n\n    if scenario[\"spec_id\"] == \"dq_failure\":\n        with pytest.raises(DQValidationsFailedException) as ex:\n            DQFactory.run_dq_process(dq_spec, input_data)\n        assert \"Data Quality Validations Failed!\" in str(ex.value)\n    elif scenario[\"spec_id\"] == \"dq_failure_critical_functions\":\n        if scenario[\"dq_type\"] != DQType.PRISMA.value:\n            with pytest.raises(DQValidationsFailedException) as ex:\n                DQFactory.run_dq_process(dq_spec, input_data)\n            assert (\n                \"Data Quality Validations Failed, the following critical expectations \"\n                \"failed: ['expect_table_row_count_to_be_between'].\" in str(ex.value)\n            )\n        else:\n            DQFactory.run_dq_process(dq_spec, input_data)\n    elif scenario[\"spec_id\"] == \"dq_failure_max_percentage\":\n        with pytest.raises(DQValidationsFailedException) as ex:\n            DQFactory.run_dq_process(dq_spec, input_data)\n        assert \"Max error threshold is being surpassed!\" in str(ex.value)\n    else:\n        DQFactory.run_dq_process(dq_spec, input_data)\n\n        result_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_OUT}/{scenario['dq_type']}/\"\n            f\"{scenario['spec_id']}/data\",\n            file_format=\"json\",\n        )\n\n        if scenario[\"spec_id\"] == \"dq_failure_error_disabled\":\n            assert (\n                \"1 out of 2 Data Quality Expectation(s) have failed! 
\"\n                \"Failed Expectations\" in caplog.text\n            )\n\n        control_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_CONTROL}/{scenario['dq_type']}/\"\n            f\"{scenario['spec_id']}/data\",\n            file_format=\"csv\",\n            options={\"header\": True, \"delimiter\": \"|\", \"inferSchema\": True},\n        ).fillna(\"\")\n\n        assert not DataframeHelpers.has_diff(\n            result_df.filter(result_df[\"spec_id\"] == scenario[\"spec_id\"]).select(\n                \"spec_id\", \"input_id\", \"success\"\n            ),\n            control_df.filter(control_df[\"spec_id\"] == scenario[\"spec_id\"]).select(\n                \"spec_id\", \"input_id\", \"success\"\n            ),\n        )\n\n        assert result_df.columns == control_df.select(*result_df.columns).columns\n\n        _test_result_structure(result_df)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"result\": \"success\",\n            \"tag_source_data\": False,\n            \"num_chunks\": 2,\n            \"num_rows\": 10,\n            \"dq_functions\": [\n                {\n                    \"function\": \"expect_column_value_lengths_to_be_between\",\n                    \"args\": {\n                        \"column\": \"id\",\n                        \"min_value\": 0,\n                        \"max_value\": 5,\n                        \"meta\": {\n                            \"dq_rule_id\": \"rule_2\",\n                            \"execution_point\": \"in_motion\",\n                            \"schema\": \"test_db\",\n                            \"table\": \"dummy_data\",\n                            \"column\": \"\",\n                            \"dimension\": \"\",\n                            \"filters\": \"\",\n                        },\n                    },\n                },\n                {\n                    \"function\": \"expect_column_value_lengths_to_be_between\",\n                    \"args\": {\n                        \"column\": \"static_column\",\n                        \"min_value\": 0,\n                        \"max_value\": 5,\n                        \"meta\": {\n                            \"dq_rule_id\": \"rule_3\",\n                            \"execution_point\": \"in_motion\",\n                            \"schema\": \"test_db\",\n                            \"table\": \"dummy_data\",\n                            \"column\": \"\",\n                            \"dimension\": \"\",\n                            \"filters\": \"\",\n                        },\n                    },\n                },\n            ],\n        },\n        {\n            \"result\": \"failure\",\n            \"tag_source_data\": False,\n            \"num_chunks\": 20,\n            \"num_rows\": 15,\n            \"dq_functions\": [\n                {\n                    \"function\": \"expect_column_value_lengths_to_be_between\",\n                    \"args\": {\n                        \"column\": \"id\",\n                        \"min_value\": 0,\n                        \"max_value\": 1,\n                        \"meta\": {\n                            \"dq_rule_id\": \"rule_2\",\n                            \"execution_point\": \"in_motion\",\n                            \"schema\": \"test_db\",\n                            \"table\": \"dummy_data\",\n                            \"column\": \"\",\n                            \"dimension\": \"\",\n                            \"filters\": \"\",\n       
                 },\n                    },\n                },\n                {\n                    \"function\": \"expect_column_value_lengths_to_be_between\",\n                    \"args\": {\n                        \"column\": \"static_column\",\n                        \"min_value\": 0,\n                        \"max_value\": 1,\n                        \"meta\": {\n                            \"dq_rule_id\": \"rule_3\",\n                            \"execution_point\": \"in_motion\",\n                            \"schema\": \"test_db\",\n                            \"table\": \"dummy_data\",\n                            \"column\": \"\",\n                            \"dimension\": \"\",\n                            \"filters\": \"\",\n                        },\n                    },\n                },\n            ],\n        },\n        {\n            \"result\": \"success\",\n            \"tag_source_data\": True,\n            \"num_chunks\": 6,\n            \"num_rows\": 15,\n            \"dq_functions\": [\n                {\n                    \"function\": \"expect_column_value_lengths_to_be_between\",\n                    \"args\": {\n                        \"column\": \"id\",\n                        \"min_value\": 0,\n                        \"max_value\": 1,\n                        \"meta\": {\n                            \"dq_rule_id\": \"rule_2\",\n                            \"execution_point\": \"in_motion\",\n                            \"schema\": \"test_db\",\n                            \"table\": \"dummy_data\",\n                            \"column\": \"\",\n                            \"dimension\": \"\",\n                            \"filters\": \"\",\n                        },\n                    },\n                },\n                {\n                    \"function\": \"expect_column_value_lengths_to_be_between\",\n                    \"args\": {\n                        \"column\": \"static_column\",\n                        \"min_value\": 0,\n                        \"max_value\": 20,\n                        \"meta\": {\n                            \"dq_rule_id\": \"rule_3\",\n                            \"execution_point\": \"in_motion\",\n                            \"schema\": \"test_db\",\n                            \"table\": \"dummy_data\",\n                            \"column\": \"\",\n                            \"dimension\": \"\",\n                            \"filters\": \"\",\n                        },\n                    },\n                },\n            ],\n        },\n    ],\n)\ndef test_chunked_result_sink(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test the chunked result sink for data quality validation.\n\n    Scenario 0: test two expectations and both are successful.\n    Scenario 1: test two expectations, both with errors\n    Scenario 2: test two expectations, one with error and one without and\n        the tagging functionality when multiple chunks exist.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    LocalStorage.clean_folder(f\"{LAKEHOUSE_FEATURE_OUT}/test_dp/\")\n    schema = StructType(\n        [\n            StructField(\"id\", IntegerType(), False),\n            StructField(\"static_column\", StringType(), False),\n        ]\n    )\n\n    data = []\n    for x in range(0, scenario[\"num_rows\"]):\n        data.append((x, True))\n\n    df = DataframeHelpers.create_dataframe(data=data, schema=schema)\n\n    acon = {\n        \"input_specs\": [\n           
 {\n                \"spec_id\": \"test_in\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"dataframe\",\n                \"df_name\": df,\n            },\n        ],\n        \"dq_specs\": [\n            {\n                \"spec_id\": \"test_dq\",\n                \"input_id\": \"test_in\",\n                \"dq_type\": DQType.PRISMA.value,\n                \"store_backend\": \"file_system\",\n                \"local_fs_root_dir\": f\"{TEST_LAKEHOUSE_OUT}/chunked_result_sink/\",\n                \"result_sink_format\": \"json\",\n                \"data_product_name\": \"test_dp\",\n                \"unexpected_rows_pk\": [\"id\", \"static_column\"],\n                \"result_sink_chunk_size\": 1,\n                \"dq_functions\": scenario[\"dq_functions\"],\n                \"tag_source_data\": scenario[\"tag_source_data\"],\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"test_out\",\n                \"input_id\": \"test_dq\",\n                \"data_format\": \"dataframe\",\n                \"write_type\": \"overwrite\",\n            }\n        ],\n    }\n\n    result_df = load_data(acon=acon)[\"test_out\"]\n\n    result_sink = DataframeHelpers.read_from_file(\n        location=f\"{LAKEHOUSE_FEATURE_OUT}/test_dp/result_sink/\", file_format=\"json\"\n    )\n    assert result_sink.count() == scenario[\"num_chunks\"]\n    processed_keys = DataframeHelpers.read_from_file(\n        location=f\"{LAKEHOUSE_FEATURE_OUT}/test_dp/dq_processed_keys/\",\n        file_format=\"json\",\n    )\n    assert processed_keys.count() == scenario[\"num_rows\"]\n\n    if scenario[\"result\"] == \"failure\":\n        assert (\n            \"2 out of 2 Data Quality Expectation(s) have failed! 
Failed Expectations\"\n            in caplog.text\n        )\n\n    if scenario[\"tag_source_data\"]:\n        final_df = result_df.groupBy(\"dq_validations\").count()\n\n        assert final_df.count() == 2\n        for ele in final_df.collect():\n            if ele.dq_validations.dq_failure_details:\n                assert ele[\"count\"] == 5\n            else:\n                assert ele[\"count\"] == 10\n\n\ndef _test_result_structure(df: DataFrame) -> None:\n    \"\"\"Test if a dataframe has the expected keys in its structure.\n\n    Tests the validity of a dataframe, by checking if some keys are part of the\n    base structure of that dataframe.\n\n    Args:\n        df: dataframe to test.\n    \"\"\"\n    for key in df.collect():\n        for result in loads(key.validation_results):\n            assert {\n                \"success\",\n                \"expectation_config\",\n            }.issubset(result.keys())\n\n\ndef _prepare_validation_df(df: DataFrame) -> DataFrame:\n    \"\"\"Given a DataFrame apply necessary transformations to prepare it for validations.\n\n    It performs necessary transformations like removing the date from the run_name and\n    removing the batch_id from the dq_failure_details.\n\n    Args:\n        df: dataframe to transform.\n\n    Returns: the transformed dataframe\n    \"\"\"\n    return df.withColumn(\n        \"dq_validations\",\n        col(\"dq_validations\")\n        .withField(\n            \"run_name\", regexp_replace(col(\"dq_validations.run_name\"), \"[0-9]\", \"\")\n        )\n        .withField(\n            \"dq_failure_details\",\n            array_sort(\n                transform(\n                    \"dq_validations.dq_failure_details\",\n                    lambda x: x.withField(\n                        \"kwargs\",\n                        regexp_replace(\n                            x.kwargs,\n                            '\"batch_id\":.*?,',\n                            \"\",\n                        ),\n                    ),\n                ),\n            ),\n        ),\n    )\n"
  },
  {
    "path": "tests/feature/test_dq_validator.py",
    "content": "\"\"\"Test data quality validator.\"\"\"\n\nfrom json import loads\nfrom typing import Any, Dict, List, Tuple, Union\n\nimport py4j\nimport pytest\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.utils import StreamingQueryException\n\nfrom lakehouse_engine.core.definitions import DQType\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.dq_processors.exceptions import (\n    DQDuplicateRuleIdException,\n    DQValidationsFailedException,\n)\nfrom lakehouse_engine.engine import execute_dq_validation, load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.dq_rules_table_utils import _create_dq_functions_source_table\nfrom tests.utils.local_storage import LocalStorage\n\n_LOGGER = LoggingHandler(__name__).get_logger()\n\nTEST_NAME = \"dq_validator\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_NAME}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"spec_id\": \"spec_without_duplicate\",\n            \"name\": \"table_batch_dq_rule\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"batch\",\n            \"input_type\": \"file_reader\",\n            \"dq_table_table_filter\": \"dummy_sales\",\n            \"dq_validator_result\": \"success\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_table_rule_id_success\",\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"spec_id\": \"spec_with_duplicate\",\n            \"name\": \"table_batch_dq_rule\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"batch\",\n            \"input_type\": \"file_reader\",\n            \"dq_table_table_filter\": \"dummy_sales\",\n            \"dq_validator_result\": \"failed\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_table_rule_id_failure\",\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"spec_id\": \"streaming_spec_without_duplicate\",\n            \"name\": \"table_streaming_dq_rule\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"file_reader\",\n            \"dq_table_table_filter\": \"dummy_sales\",\n            \"dq_validator_result\": \"success\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_table_rule_id_success\",\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"spec_id\": \"streaming_spec_with_duplicate\",\n            \"name\": \"table_streaming_dq_rule\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"file_reader\",\n          
  \"dq_table_table_filter\": \"dummy_sales\",\n            \"dq_validator_result\": \"failed\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_table_rule_id_failure\",\n            \"max_percentage_failure\": None,\n        },\n    ],\n)\ndef test_dq_rule_id_uniqueness(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test the function to detect duplicate dq_rule_id.\n\n    Dq_rule_id scenarios:\n    - scenario 1: using the file reader in batch to test if the dq_db_table\n    has duplicated dq_rule_id. This scenario do not have duplicates.\n    - scenario 2: Using the file reader in batch mode to check for duplicate\n    dq_rule_id values in the dq_db_table. In this scenario, duplicates are found\n    in rule_3 and rule_4.\n    - scenario 3: using the file reader in streaming to test if the dq_db_table\n    has duplicated dq_rule_id. This scenario do not have duplicates.\n    - scenario 4: using the file reader in streaming mode to check for duplicate\n    dq_rule_id values in the dq_db_table. In this scenario, duplicates are found\n    in rule_3 and rule_5.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    _clean_folders()\n\n    _create_table(\"dq_sales\")\n\n    _execute_load(scenario[\"read_type\"])\n\n    input_spec = {\n        \"spec_id\": \"sales_source\",\n        \"data_format\": \"delta\",\n        \"read_type\": scenario[\"read_type\"],\n        \"location\": f\"{TEST_LAKEHOUSE_OUT}/data/\",\n    }\n\n    _create_dq_functions_source_table(\n        test_resources_path=TEST_RESOURCES,\n        lakehouse_in_path=TEST_LAKEHOUSE_IN,\n        lakehouse_out_path=TEST_LAKEHOUSE_OUT,\n        test_name=scenario[\"name\"],\n        scenario=scenario[\"read_type\"],\n        table_name=scenario[\"dq_db_table\"],\n    )\n\n    acon = _generate_acon(\n        input_spec, scenario, scenario.get(\"dq_type\", DQType.VALIDATOR.value)\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/*\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    if (scenario[\"dq_validator_result\"] == \"failed\") and (\"batch\" in scenario[\"name\"]):\n        with pytest.raises(DQDuplicateRuleIdException) as error:\n            execute_dq_validation(acon=acon)\n        assert \"rule_3\" and \"rule_4\" in error.value.args[0]\n        _LOGGER.critical(error.value.args[0])\n    elif (scenario[\"dq_validator_result\"] == \"failed\") and (\n        \"streaming\" in scenario[\"name\"]\n    ):\n        with pytest.raises(DQDuplicateRuleIdException) as error:\n            execute_dq_validation(acon=acon)\n        assert \"rule_3\" and \"rule_5\" in error.value.args[0]\n        _LOGGER.critical(error.value.args[0])\n    else:\n        execute_dq_validation(acon=acon)\n        assert \"A duplicate dq_rule_id was found!!!\" not in caplog.text\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"batch_dataframe_success\",\n            \"read_type\": \"batch\",\n            \"input_type\": \"dataframe_reader\",\n            \"dq_validator_result\": \"success\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": True,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"streaming_dataframe_failure\",\n            \"read_type\": \"streaming\",\n            \"input_type\": 
\"dataframe_reader\",\n            \"dq_validator_result\": \"failure\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": True,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"streaming_failure_disabled\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"table_reader\",\n            \"dq_validator_result\": \"failure_disabled\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"batch_failure\",\n            \"read_type\": \"batch\",\n            \"input_type\": \"table_reader\",\n            \"dq_validator_result\": \"failure\",\n            \"restore_prev_version\": True,\n            \"fail_on_error\": True,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"streaming_failure\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"file_reader\",\n            \"dq_validator_result\": \"failure\",\n            \"restore_prev_version\": True,\n            \"fail_on_error\": True,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"streaming_failure_critical\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"file_reader\",\n            \"dq_validator_result\": \"failure\",\n            \"restore_prev_version\": True,\n            \"fail_on_error\": True,\n            \"critical_functions\": [\n                {\n                    \"function\": \"expect_table_row_count_to_be_between\",\n                    \"args\": {\"min_value\": 3, \"max_value\": 11},\n                }\n            ],\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"streaming_failure_critical_notes\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"file_reader\",\n            \"dq_validator_result\": \"failure\",\n            \"restore_prev_version\": True,\n            \"fail_on_error\": True,\n            \"critical_functions\": [\n                {\n                    \"function\": \"expect_table_row_count_to_be_between\",\n                    \"args\": {\n                        \"min_value\": 3,\n                        \"max_value\": 11,\n                        \"meta\": {\"notes\": \"Test notes\"},\n                    },\n                }\n            ],\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"streaming_failure_critical_markdown\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"file_reader\",\n            \"dq_validator_result\": \"failure\",\n            \"restore_prev_version\": True,\n            \"fail_on_error\": True,\n            \"critical_functions\": [\n                {\n                    \"function\": \"expect_table_row_count_to_be_between\",\n                    \"args\": {\n                        \"min_value\": 3,\n                        \"max_value\": 11,\n                        \"meta\": {\n                            \"notes\": {\"format\": \"markdown\", \"content\": \"**Test Notes**\"}\n                        },\n                    },\n                }\n            ],\n            
\"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"streaming_failure_percentage\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"file_reader\",\n            \"dq_validator_result\": \"failure\",\n            \"restore_prev_version\": True,\n            \"fail_on_error\": True,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": 0.2,\n        },\n        {\n            \"name\": \"table_batch_success\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"batch\",\n            \"input_type\": \"file_reader\",\n            \"dq_validator_result\": \"success_explode\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_functions_source_table_success\",\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"table_batch_failure_disabled\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"batch\",\n            \"input_type\": \"file_reader\",\n            \"dq_validator_result\": \"success_explode_disabled\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_functions_source_table_failure\",\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"table_streaming_success\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"file_reader\",\n            \"dq_validator_result\": \"success_explode\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_functions_source_table_success\",\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"table_streaming_failure_disabled\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"file_reader\",\n            \"dq_validator_result\": \"success_explode_disabled\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_functions_source_table_failure\",\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"table_batch_dataframe_success\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"batch\",\n            \"input_type\": \"dataframe_reader\",\n            \"dq_validator_result\": \"success_explode\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_functions_source_table_success\",\n            \"max_percentage_failure\": None,\n        },\n        {\n            \"name\": \"table_batch_dataframe_failure_disabled\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"dataframe_reader\",\n            \"dq_validator_result\": \"success_explode_disabled\",\n            \"restore_prev_version\": False,\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"dq_db_table\": \"test_db.dq_functions_source_table_failure\",\n            \"max_percentage_failure\": 
None,\n        },\n    ],\n)\ndef test_dq_validator(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test the Data Quality Validator algorithm with DQ Type Validator.\n\n    Data Quality Validator scenarios:\n    - scenario 1: test DQ Validator having a generated dataframe as input\n    that passes all the expectations defined.\n    - scenario 2: test DQ Validator, reading a generated dataframe as\n    stream that fails one of the expectations defined.\n    - scenario 3: test DQ Validator, reading as streaming a delta table,\n    failing one of the expectations but not failing the complete DQ process\n    as fail_on_error is disabled.\n    - scenario 4: test DQ Validator, reading a delta table (batch),\n    that fails one of the expectations defined and a previous version of the\n    delta table is restored.\n    - scenario 5: test DQ Validator, reading as streaming a set of files in a\n    specific location, that fail one of the expectations defined and a\n    previous version of the delta table is restored.\n    - scenario 6: test DQ Validator, reading as streaming a set of files in a\n    specific location, that fails one of the expectations that is defined as\n    critical.\n    - scenario 7: test DQ Validator, reading as streaming a set of files in a\n    specific location, that fails one of the expectations that is defined as\n    critical and notes in default format.\n    - scenario 8: test DQ Validator, reading as streaming a set of files in a\n    specific location, that fails one of the expectations that is defined as\n    critical and notes with markdown.\n    - scenario 9: test DQ Validator, reading as streaming a set of files in a\n    specific location, that fails the whole expectation suite because the\n    maximum percentage threshold is surpassed.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    _clean_folders()\n\n    if \"dataframe\" in scenario[\"input_type\"]:\n        input_spec = {\n            \"spec_id\": \"sales_source\",\n            \"read_type\": scenario[\"read_type\"],\n            \"data_format\": \"dataframe\",\n            \"df_name\": _generate_dataframe(scenario[\"read_type\"]),\n        }\n    else:\n        _create_table(\"dq_sales\")\n\n        _execute_load(scenario[\"read_type\"])\n\n        if \"table\" in scenario[\"input_type\"]:\n            input_spec = {\n                \"spec_id\": \"sales_source\",\n                \"read_type\": scenario[\"read_type\"],\n                \"db_table\": \"test_db.dq_sales\",\n            }\n        else:\n            input_spec = {\n                \"spec_id\": \"sales_source\",\n                \"data_format\": \"delta\",\n                \"read_type\": scenario[\"read_type\"],\n                \"location\": f\"{TEST_LAKEHOUSE_OUT}/data/\",\n            }\n\n    if \"dq_db_table\" in scenario.keys():\n        _create_dq_functions_source_table(\n            test_resources_path=TEST_RESOURCES,\n            lakehouse_in_path=TEST_LAKEHOUSE_IN,\n            lakehouse_out_path=TEST_LAKEHOUSE_OUT,\n            test_name=scenario[\"name\"],\n            scenario=scenario[\"read_type\"],\n            table_name=scenario[\"dq_db_table\"],\n        )\n\n    acon = _generate_acon(\n        input_spec, scenario, scenario.get(\"dq_type\", DQType.VALIDATOR.value)\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/*\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    if scenario[\"dq_validator_result\"] == \"failure\":\n        with 
pytest.raises(\n            (DQValidationsFailedException, StreamingQueryException),\n            match=\".*Data Quality Validations Failed!.*\",\n        ):\n            execute_dq_validation(acon=acon)\n    else:\n        execute_dq_validation(acon=acon)\n\n    if scenario[\"restore_prev_version\"] is True:\n        data_result_df, data_control_df = _get_result_and_control_dfs(\n            \"test_db.dq_sales\", \"data_restore_control\", False\n        )\n\n        assert not DataframeHelpers.has_diff(data_result_df, data_control_df)\n        assert \"Data Quality Expectation(s) have failed!\" in caplog.text\n\n    if scenario[\"dq_validator_result\"] == \"failure_disabled\":\n        assert (\n            \"1 out of 3 Data Quality Expectation(s) have failed! \"\n            \"Failed Expectations\" in caplog.text\n        )\n\n    dq_result_df, dq_control_df = _get_result_and_control_dfs(\n        result=f\"{LAKEHOUSE_FEATURE_OUT}/{scenario['name']}/result_sink/\",\n        control=f'dq_control_{scenario[\"dq_validator_result\"]}',\n        infer_schema=True,\n        result_is_table=False,\n    )\n\n    assert not DataframeHelpers.has_diff(\n        dq_result_df.select(\"spec_id\", \"input_id\", \"success\"),\n        dq_control_df.fillna(\"\").select(\"spec_id\", \"input_id\", \"success\"),\n    )\n\n    for key in dq_result_df.collect():\n        validation_results = loads(key.validation_results)\n        result = (\n            validation_results[0]\n            if isinstance(validation_results, list)\n            else validation_results\n        )\n        assert {\n            \"success\",\n            \"expectation_config\",\n        }.issubset(result.keys())\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"streaming_dataframe_two_runs\",\n            \"dq_type\": \"prisma\",\n            \"read_type\": \"streaming\",\n            \"input_type\": \"dataframe_reader\",\n            \"dq_validator_result\": \"success_explode\",\n            \"dq_db_table_first_run\": \"test_db.dq_functions_streaming_dataframe_two_runs_first_run\",  # noqa: E501\n            \"dq_db_table_second_run\": \"test_db.dq_functions_streaming_dataframe_two_runs_second_run\",  # noqa: E501\n            \"fail_on_error\": False,\n            \"critical_functions\": None,\n            \"max_percentage_failure\": None,\n            \"restore_prev_version\": False,\n        },\n    ],\n)\ndef test_dq_validator_two_runs(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test the integrity of the result sink after two runs.\n\n    This test performs two runs of the Data Quality Validator with the same\n    scenario but different dq functions source tables. 
The goal is to ensure\n    that the result sink does not have void types and that it is able to\n    be read without issues.\n    This is a regression test for the case when the Data Quality Validator\n    was writing a column with void types to the result sink, which caused\n    issues when reading the result sink.\n\n    Data Quality Validator scenarios:\n    - scenario 1: test result sink structure by having two runs writing to the same\n    result sink without creating an issue with void types.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    _clean_folders()\n\n    input_spec = {\n        \"spec_id\": \"sales_source\",\n        \"read_type\": scenario[\"read_type\"],\n        \"data_format\": \"dataframe\",\n        \"df_name\": _generate_dataframe(scenario[\"read_type\"]),\n    }\n\n    _create_dq_functions_source_table(\n        test_resources_path=TEST_RESOURCES,\n        lakehouse_in_path=TEST_LAKEHOUSE_IN,\n        lakehouse_out_path=TEST_LAKEHOUSE_OUT,\n        test_name=scenario[\"name\"],\n        scenario=scenario[\"read_type\"],\n        table_name=scenario[\"dq_db_table_first_run\"],\n    )\n\n    _create_dq_functions_source_table(\n        test_resources_path=TEST_RESOURCES,\n        lakehouse_in_path=TEST_LAKEHOUSE_IN,\n        lakehouse_out_path=TEST_LAKEHOUSE_OUT,\n        test_name=scenario[\"name\"],\n        scenario=scenario[\"read_type\"],\n        table_name=scenario[\"dq_db_table_second_run\"],\n    )\n\n    scenario[\"dq_db_table\"] = scenario[\"dq_db_table_first_run\"]\n\n    first_acon = _generate_acon(\n        input_spec, scenario, scenario.get(\"dq_type\", DQType.PRISMA.value)\n    )\n\n    scenario[\"dq_db_table\"] = scenario[\"dq_db_table_second_run\"]\n\n    second_acon = _generate_acon(\n        input_spec, scenario, scenario.get(\"dq_type\", DQType.PRISMA.value)\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/*\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    execute_dq_validation(acon=first_acon)\n\n    execute_dq_validation(acon=second_acon)\n\n    result_sink_path = f\"{LAKEHOUSE_FEATURE_OUT}/{scenario['name']}/result_sink/\"\n    df = ExecEnv.SESSION.sql(\n        f\"\"\"select * from delta.`{result_sink_path}`\"\"\"  # nosec B608\n    )\n\n    try:\n        df.show()\n    except py4j.protocol.Py4JJavaError:\n        pytest.fail(\"Failed to write to result sink due to void type in the dataframe.\")\n\n\ndef _clean_folders() -> None:\n    \"\"\"Clean test folders and tables.\"\"\"\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_IN}/data\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}/data\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}/checkpoint\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}/dq\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}/profiling\")\n    ExecEnv.SESSION.sql(\"DROP TABLE IF EXISTS test_db.dq_sales\")\n    ExecEnv.SESSION.sql(\"DROP TABLE IF EXISTS test_db.dq_validator\")\n\n\ndef _create_table(table_name: str) -> None:\n    \"\"\"Create test table.\n\n    Args:\n        table_name: name of the test table.\n    \"\"\"\n    ExecEnv.SESSION.sql(\n        f\"\"\"\n        CREATE TABLE IF NOT EXISTS test_db.{table_name} (\n            salesorder string,\n            item string,\n            date string,\n            customer string,\n            article string,\n            amount string\n        )\n        USING delta\n        LOCATION '{TEST_LAKEHOUSE_OUT}/data'\n        TBLPROPERTIES(\n        
  'lakehouse.primary_key'='salesorder, `item`, date ,`customer`',\n          'delta.enableChangeDataFeed'='false'\n        )\n        \"\"\"\n    )\n\n\ndef _execute_load(load_type: str) -> None:\n    \"\"\"Helper function to reuse for loading the data for the scenario tests.\n\n    Args:\n        load_type: batch or streaming.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{load_type}.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/part-02.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n\n    load_data(acon=acon)\n\n\ndef _generate_acon(\n    input_spec: dict,\n    scenario: dict,\n    dq_type: str,\n) -> dict:\n    \"\"\"Generate acon according to test scenario.\n\n    Args:\n        input_spec: input specification.\n        scenario: the scenario being tested.\n        dq_type: the type of data quality process.\n\n    Returns:\n        A dict corresponding to the generated acon.\n    \"\"\"\n    if \"dataframe\" in scenario[\"input_type\"]:\n        unexpected_rows_pk: Dict[str, Union[str, List[str]]] = {\n            \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"]\n        }\n    else:\n        unexpected_rows_pk = {\"tbl_to_derive_pk\": \"test_db.dq_sales\"}\n\n    if dq_type == DQType.VALIDATOR.value or dq_type == DQType.PRISMA.value:\n        dq_spec_add_options = {\n            \"result_sink_location\": f\"{LAKEHOUSE_FEATURE_OUT}/\"\n            f\"{scenario['name']}/result_sink/\",\n            \"dq_db_table\": scenario.get(\"dq_db_table\"),\n            \"dq_table_table_filter\": \"dummy_sales\",\n            \"result_sink_format\": \"delta\",\n            \"fail_on_error\": scenario[\"fail_on_error\"],\n            \"critical_functions\": scenario[\"critical_functions\"],\n            \"max_percentage_failure\": scenario[\"max_percentage_failure\"],\n            \"result_sink_explode\": False,\n            \"data_product_name\": scenario[\"name\"],\n            \"dq_functions\": [\n                {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"article\"}},\n                {\n                    \"function\": \"expect_table_row_count_to_be_between\",\n                    \"args\": {\"min_value\": 3, \"max_value\": 11},\n                },\n                {\n                    \"function\": \"expect_column_pair_a_to_be_smaller_or_equal_than_b\",\n                    \"args\": {\"column_A\": \"salesorder\", \"column_B\": \"amount\"},\n                },\n            ],\n        }\n        dq_spec_add_options.update(unexpected_rows_pk)\n\n    return {\n        \"input_spec\": input_spec,\n        \"dq_spec\": {\n            \"spec_id\": \"dq_sales\",\n            \"input_id\": \"sales_source\",\n            \"dq_type\": dq_type,\n            \"store_backend\": \"file_system\",\n            \"local_fs_root_dir\": f\"{TEST_LAKEHOUSE_OUT}/dq\",\n            **dq_spec_add_options,\n        },\n        \"restore_prev_version\": scenario.get(\"restore_prev_version\", False),\n    }\n\n\ndef _generate_dataframe(load_type: str) -> DataFrame:\n    \"\"\"Generate test dataframe.\n\n    Args:\n        load_type: batch or streaming.\n\n    Returns: the generated dataframe.\n    \"\"\"\n    if load_type == \"batch\":\n        input_df = (\n            ExecEnv.SESSION.read.format(\"csv\")\n            .schema(\n              
  SchemaUtils.from_file(f\"file://{TEST_RESOURCES}/dq_sales_schema.json\")\n            )\n            .load(f\"{TEST_RESOURCES}/data/source/part-01.csv\")\n        )\n    else:\n        input_df = (\n            ExecEnv.SESSION.readStream.format(\"csv\")\n            .schema(\n                SchemaUtils.from_file(f\"file://{TEST_RESOURCES}/dq_sales_schema.json\")\n            )\n            .load(f\"{TEST_RESOURCES}/data/source/*\")\n        )\n\n    return input_df\n\n\ndef _get_result_and_control_dfs(\n    result: str, control: str, infer_schema: bool, result_is_table: bool = True\n) -> Tuple[DataFrame, DataFrame]:\n    \"\"\"Helper to get the result and control dataframes.\n\n    Args:\n        result: the table to read from.\n        control: the file name to read from.\n        infer_schema: whether to infer the schema or not.\n        result_is_table: whether the result is a table or a file.\n\n    Returns: the result and control dataframes.\n    \"\"\"\n    if result_is_table:\n        dq_result_df = DataframeHelpers.read_from_table(result)\n    else:\n        dq_result_df = DataframeHelpers.read_from_file(\n            location=result,\n            file_format=\"delta\",\n        )\n\n    dq_control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/{control}.csv\",\n        file_format=\"csv\",\n        options={\"header\": True, \"delimiter\": \"|\", \"inferSchema\": infer_schema},\n    )\n\n    return dq_result_df, dq_control_df\n"
  },
  {
    "path": "tests/feature/test_engine_usage_stats.py",
    "content": "\"\"\"Tests for the log lakehouse engine function.\"\"\"\n\nimport os\nimport re\nfrom datetime import datetime\n\nimport pytest\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import lit\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import execute_dq_validation, load_data, manage_table\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_LOGS,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"engine_usage_stats\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_NAME}\"\nTIMESTAMP = datetime.now()\nYEAR = TIMESTAMP.year\nMONTH = TIMESTAMP.month\n\n\ndef custom_transformation(df: DataFrame) -> DataFrame:\n    \"\"\"A sample custom transformation to use in the ACON.\n\n    Args:\n        df: DataFrame passed as input.\n\n    Returns:\n        DataFrame: the transformed DataFrame.\n    \"\"\"\n    return df.withColumn(\"new_column\", lit(\"literal\"))\n\n\ndef _get_test_acon(scenario_name: str) -> dict:\n    \"\"\"Creates a test ACON with the desired logic for the test.\n\n    Args:\n        scenario_name: name of the test scenario running.\n\n    Returns:\n        dict: the ACON for the algorithm configuration.\n    \"\"\"\n    df = ExecEnv.SESSION.read.options(\n        header=\"True\", inferSchema=\"True\", delimiter=\"|\"\n    ).csv(f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/data/\")\n    input_spec: dict = {\n        \"spec_id\": \"sales_source\",\n        \"read_type\": \"batch\",\n    }\n    transformers = [\n        {\n            \"function\": \"rename\",\n            \"args\": {\"cols\": {\"salesorder\": \"salesorder1\"}},\n        }\n    ]\n    if \"simple_acon\" not in scenario_name:\n        transformers.append(\n            {\n                \"function\": \"custom_transformation\",\n                \"args\": {\"custom_transformer\": custom_transformation},\n            }\n        )\n        input_spec = {**input_spec, \"data_format\": \"dataframe\", \"df_name\": df}\n    else:\n        input_spec = {\n            **input_spec,\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"mode\": \"FAILFAST\",\n                \"header\": True,\n                \"delimiter\": \"|\",\n                \"password\": \"dummy_password\",\n            },\n            \"location\": f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/data/\",\n        }\n\n    return {\n        \"input_specs\": [input_spec],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"renamed_kpi\",\n                \"input_id\": \"sales_source\",\n                \"transformers\": transformers,\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"sales_bronze\",\n                \"input_id\": \"renamed_kpi\",\n                \"write_type\": \"overwrite\",\n                \"data_format\": \"delta\",\n                \"location\": f\"{TEST_LAKEHOUSE_OUT}/{scenario_name}/data/\",\n            }\n        ],\n        \"exec_env\": {\"dp_name\": scenario_name},\n    }\n\n\n@pytest.mark.parametrize(\"scenario\", [\"load_simple_acon\", \"load_custom_transf_and_df\"])\ndef 
test_load_data(scenario: str) -> None:\n    \"\"\"Test Data Loader with different scenarios.\n\n    Scenarios:\n        engine_usage_stats:\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    load_data(\n        acon=_get_test_acon(scenario),\n        spark_confs={\"dp_name\": \"dp_name\"},\n        collect_engine_usage=\"enabled\",\n    )\n\n    _prepare_and_compare_dfs(scenario)\n\n\n@pytest.mark.parametrize(\"scenario\", [\"table_manager\"])\ndef test_table_manager(scenario: str) -> None:\n    \"\"\"Test Table Manager with different scenarios.\n\n    Scenarios:\n        table_manager: table_manager logging behaviour\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    acon = {\n        \"function\": \"execute_sql\",\n        \"sql\": \"select 1\",\n        \"exec_env\": {\"dp_name\": scenario},\n    }\n\n    manage_table(\n        acon=acon, spark_confs={\"dp_name\": \"dp_name\"}, collect_engine_usage=\"enabled\"\n    )\n\n    _prepare_and_compare_dfs(scenario)\n\n\n@pytest.mark.parametrize(\"scenario\", [\"dq_validator\"])\ndef test_dq_validator(scenario: str) -> None:\n    \"\"\"Test DQ Validator with different scenarios.\n\n    Scenarios:\n        dq_validator: dq_validator logging behaviour\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    acon = {\n        \"input_spec\": {\n            \"spec_id\": \"sales_source\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"options\": {\"mode\": \"FAILFAST\", \"header\": True, \"delimiter\": \"|\"},\n            \"location\": f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n        },\n        \"dq_spec\": {\n            \"spec_id\": \"dq_sales\",\n            \"input_id\": \"sales_source\",\n            \"dq_type\": \"validator\",\n            \"store_backend\": \"file_system\",\n            \"local_fs_root_dir\": f\"{TEST_LAKEHOUSE_OUT}/dq\",\n            \"result_sink_db_table\": \"test_db.dq_validator\",\n            \"result_sink_format\": \"json\",\n            \"result_sink_explode\": False,\n            \"dq_functions\": [\n                {\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"article\"}},\n                {\n                    \"function\": \"expect_table_row_count_to_be_between\",\n                    \"args\": {\"min_value\": 3, \"max_value\": 11},\n                },\n                {\n                    \"function\": \"expect_column_pair_a_to_be_smaller_or_equal_than_b\",\n                    \"args\": {\"column_A\": \"salesorder\", \"column_B\": \"amount\"},\n                },\n            ],\n        },\n        \"exec_env\": {\"dp_name\": scenario},\n    }\n\n    execute_dq_validation(\n        acon=acon, spark_confs={\"dp_name\": \"dp_name\"}, 
collect_engine_usage=\"enabled\"\n    )\n\n    _prepare_and_compare_dfs(scenario)\n\n\ndef _prepare_and_compare_dfs(scenario: str) -> None:\n    \"\"\"Prepare DF and compare test and control dataframes.\n\n    Args:\n        scenario: Scenario to load dataframes to compare.\n    \"\"\"\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\",\n        \"json\",\n        options={\"inferSchema\": True},\n    )\n\n    log_folder_path = f\"{LAKEHOUSE_FEATURE_LOGS}/{scenario}/{YEAR}/{MONTH}/\"\n    log_file_path = os.listdir(log_folder_path)[-1]\n\n    eng_usage_df = DataframeHelpers.read_from_file(\n        f\"{log_folder_path}{log_file_path}\", \"json\"\n    )\n\n    assert eng_usage_df.columns == control_df.columns\n    assert (\n        eng_usage_df.select(\"start_timestamp\").first()[0]\n        >= control_df.select(\"start_timestamp\").first()[0]\n    )\n\n    assert _prepare_df_comparison(eng_usage_df) == _prepare_df_comparison(control_df)\n\n\ndef _prepare_df_comparison(df: DataFrame) -> str:\n    \"\"\"Prepare DF to be comparable by dropping columns and converting it to string.\n\n    Args:\n        df: DataFrame to be prepared.\n\n    Returns: a string representation of the Dataframe, ready to be compared.\n    \"\"\"\n    cols_to_ignore = [\"start_timestamp\", \"engine_version\"]\n    str_df = str(df.drop(*cols_to_ignore).first()[0])\n    str_df = re.sub(\"'<function \", \"\", str_df)\n    return re.sub(\" at.*'\", \"\", str_df)\n"
  },
  {
    "path": "tests/feature/test_extract_from_sap_b4.py",
    "content": "\"\"\"Test extractions from SAP B4.\"\"\"\n\nfrom datetime import datetime, timezone\n\nimport pytest\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"extract_from_sap_b4\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\nLOGGER = LoggingHandler(__name__).get_logger()\nDB_TABLE = \"dummy_table\"\n\"\"\"Scenario - Description:\n    no_part_col_no_lower_and_upper_bound_extra_cols - no strategy to split the\n        extraction. Moreover, test adding single extra column from the activation\n        requests table.\n    int_part_col_provide_upper_bound_&_min_timestamp - partition column of type int,\n        manually provided upper_bound to parallelize the extraction. Moreover, it\n        provides the min_timestamp to use to get the data from the changelog in the\n        delta extraction after the init, which mimics the possible situation, in\n        which people might need to provide a specific timestamp for backfilling,\n        instead of deriving it from an existing location.\n    int_part_col_generate_predicates_multi_extra_cols - partition column of type int to\n        automatically generate predicates and parallelize the extraction. Moreover, test\n        adding multiple extra columns from the activation requests table.\n    str_part_col_generate_predicates - partition column of type str to\n        automatically generate predicates and parallelize the extraction.\n    str_part_col_predicates_list - partition column of type str,\n        manually provided predicates list to parallelize the extraction.\n    date_part_col_calculate_upper_bound - partition column of type date to automatically\n        calculate the upper_bound and parallelize the extraction.\n    timestamp_part_col_calculate_upper_bound - partition column of type timestamp to\n        automatically calculate the upper_bound and parallelize the extraction from.\n    default_calc_upper_bound - empty partition of type int to force the default on\n        the upper bound calculation.\n    no_part_col_join_condition - no strategy to split the extraction. 
Test to\n        validate custom join condition on activation table.\n\"\"\"\nTEST_SCENARIOS = [\n    {\n        \"scenario_name\": \"no_part_col_no_lower_and_upper_bound_extra_cols\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": None,\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_req_status_tbl\": \"req.records_read\",\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"int_part_col_provide_upper_bound_&_min_timestamp\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": \"upper_bound int\",\n        \"part_col\": \"item\",\n        \"lower_bound\": 1,\n        \"upper_bound\": 3,\n        \"min_timestamp\": \"20210713151010000000000\",\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_req_status_tbl\": None,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"int_part_col_generate_predicates_multi_extra_cols\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": \"item\",\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": True,\n        \"predicates_list\": None,\n        \"extra_cols_req_status_tbl\": \"req.records_read, req.records_updated\",\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"str_part_col_generate_predicates\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": '\"/bic/article\"',\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": True,\n        \"predicates_list\": None,\n        \"extra_cols_req_status_tbl\": None,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"str_part_col_predicates_list\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": None,\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": [\n            \"\\\"/bic/article\\\"='article1'\",\n            \"\\\"/bic/article\\\"='article2'\",\n            \"\\\"/bic/article\\\"='article3'\",\n            \"\\\"/bic/article\\\"='article4'\",\n            \"\\\"/bic/article\\\"='article5'\",\n            \"\\\"/bic/article\\\"='article6'\",\n            \"\\\"/bic/article\\\"='article7'\",\n            \"\\\"/bic/article\\\"='article33'\",\n            \"\\\"/bic/article\\\"='article60'\",\n            '\"/bic/article\" IS NULL',\n        ],\n        \"extra_cols_req_status_tbl\": None,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"date_part_col_calculate_upper_bound\",\n        \"calculate_upper_bound\": True,\n        \"calculate_upper_bound_schema\": \"upper_bound date\",\n        \"part_col\": \"date\",\n        \"lower_bound\": \"2000-01-01\",\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_req_status_tbl\": None,\n        
\"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"timestamp_part_col_calculate_upper_bound\",\n        \"calculate_upper_bound\": True,\n        \"calculate_upper_bound_schema\": \"upper_bound timestamp\",\n        \"part_col\": \"time\",\n        \"lower_bound\": \"2000-01-01 01:01:01.000\",\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_req_status_tbl\": None,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"no_part_col_join_condition\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": None,\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_req_status_tbl\": None,\n        \"act_req_join_condition\": \"tbl.reqtsn = req.request_tsn \"\n        \"AND tbl.reqtsn = req.last_process_tsn\",\n    },\n]\n\n\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS)\ndef test_extract_aq_dso(scenario: dict) -> None:\n    \"\"\"Test the extraction from SAP B4 AQ DSO.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    extra_params = {\n        \"changelog_table\": DB_TABLE,\n        \"test_name\": \"extract_aq_dso\",\n        \"adso_type\": \"AQ\",\n    }\n\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n    _prepare_files(scenario[\"scenario_name\"], extra_params)\n    _load_test_table(\"rspmrequest\", scenario[\"scenario_name\"], extra_params)\n\n    _execute_and_validate(scenario, extra_params)\n\n\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS)\ndef test_extract_cl_dso(scenario: dict) -> None:\n    \"\"\"Test the extraction from SAP B4 CL DSO.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    extra_params = {\n        \"changelog_table\": f\"{DB_TABLE}_cl\",\n        \"test_name\": \"extract_cl_dso\",\n        \"adso_type\": \"CL\",\n    }\n\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n    _prepare_files(scenario[\"scenario_name\"], extra_params)\n    _load_test_table(\"rspmrequest\", scenario[\"scenario_name\"], extra_params)\n\n    _execute_and_validate(scenario, extra_params)\n\n\ndef _execute_and_validate(scenario: dict, extra_params: dict) -> None:\n    \"\"\"Helper function to reuse for triggering the load data and validation of results.\n\n    Args:\n        scenario: scenario being tested.\n        extra_params: extra params for the scenario being tested.\n    \"\"\"\n    _execute_load(scenario=scenario, extraction_type=\"init\", extra_params=extra_params)\n\n    _execute_load(\n        scenario=scenario,\n        extraction_type=\"delta\",\n        iteration=1,\n        extra_params=extra_params,\n    )\n\n    _execute_load(\n        scenario=scenario,\n        extraction_type=\"delta\",\n        iteration=2,\n        extra_params=extra_params,\n    )\n\n    _validate(\n        scenario[\"scenario_name\"],\n        extra_params,\n        scenario[\"min_timestamp\"] is not None,\n    )\n\n\ndef _execute_load(\n    scenario: dict,\n    extra_params: dict,\n    extraction_type: str,\n    iteration: int = None,\n) -> None:\n    \"\"\"Helper function to reuse for loading the data for the scenario tests.\n\n    Args:\n        scenario: scenario being tested.\n        extra_params: extra params for the scenario being 
tested.\n        extraction_type: type of extraction (delta or init).\n        iteration: number of the iteration, in case it is to test a delta.\n    \"\"\"\n    write_type = \"overwrite\" if extraction_type == \"init\" else \"append\"\n\n    _load_test_table(\n        extra_params[\"changelog_table\"] if extraction_type != \"init\" else DB_TABLE,\n        scenario[\"scenario_name\"],\n        extra_params,\n        iteration,\n    )\n\n    # if it is an init, we need to provide an extraction_timestamp, otherwise the\n    # current time would be used and data would be filtered accordingly.\n    acon = _get_test_acon(\n        extraction_timestamp=(\n            \"20210713151010\"\n            if extraction_type == \"init\"\n            else datetime.now(timezone.utc).strftime(\"%Y%m%d%H%M%S\")\n        ),\n        extraction_type=extraction_type,\n        write_type=write_type,\n        scenario=scenario,\n        extra_params=extra_params,\n    )\n\n    load_data(acon=acon)\n\n\ndef _get_test_acon(\n    extraction_type: str,\n    write_type: str,\n    scenario: dict,\n    extra_params: dict,\n    extraction_timestamp: str = None,\n) -> dict:\n    \"\"\"Creates a test ACON with the desired logic for the algorithm.\n\n    Args:\n        extraction_type: type of extraction (delta or init).\n        write_type: the spark write type to be used.\n        scenario: the scenario being tested.\n        extra_params: extra params for the scenario being tested.\n        extraction_timestamp: timestamp of the extraction. For local tests\n            we specify it in the init, otherwise would be calculated and\n            tests would fail.\n\n    Returns:\n        dict: the ACON for the algorithm configuration.\n    \"\"\"\n    return {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"sales_source\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"sap_b4\",\n                \"calculate_upper_bound\": scenario[\"calculate_upper_bound\"],\n                \"calc_upper_bound_schema\": scenario[\"calculate_upper_bound_schema\"],\n                \"generate_predicates\": scenario[\"generate_predicates\"],\n                \"options\": {\n                    \"driver\": \"org.sqlite.JDBC\",\n                    \"user\": \"dummy_user\",\n                    \"password\": \"dummy_pwd\",\n                    \"url\": f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/\"\n                    f\"{scenario['scenario_name']}/{extra_params['test_name']}/tests.db\",\n                    \"dbtable\": DB_TABLE,\n                    \"data_target\": \"dummy_table\",\n                    \"act_req_join_condition\": scenario[\"act_req_join_condition\"],\n                    \"changelog_table\": extra_params[\"changelog_table\"],\n                    \"customSchema\": \"reqtsn DECIMAL(23,0), datapakid STRING, \"\n                    \"record INTEGER, extraction_start_timestamp DECIMAL(15,0)\",\n                    \"request_status_tbl\": \"rspmrequest\",\n                    \"extra_cols_req_status_tbl\": scenario[\"extra_cols_req_status_tbl\"],\n                    \"latest_timestamp_data_location\": f\"file:///{TEST_LAKEHOUSE_OUT}/\"\n                    f\"{scenario['scenario_name']}/{extra_params['test_name']}/data\",\n                    \"extraction_type\": extraction_type,\n                    \"numPartitions\": 2,\n                    \"partitionColumn\": scenario[\"part_col\"],\n                    \"lowerBound\": scenario[\"lower_bound\"],\n                    
\"upperBound\": scenario[\"upper_bound\"],\n                    \"default_upper_bound\": scenario.get(\"default_upper_bound\", \"Null\"),\n                    \"extraction_timestamp\": extraction_timestamp,\n                    \"min_timestamp\": scenario[\"min_timestamp\"],\n                    \"predicates\": scenario[\"predicates_list\"],\n                    \"adso_type\": extra_params[\"adso_type\"],\n                },\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"sales_bronze\",\n                \"input_id\": \"sales_source\",\n                \"write_type\": write_type,\n                \"data_format\": \"delta\",\n                \"partitions\": [\"reqtsn\"],\n                \"location\": f\"file:///{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}/\"\n                f\"{extra_params['test_name']}/data\",\n            }\n        ],\n        \"exec_env\": {\n            \"spark.databricks.delta.schema.autoMerge.enabled\": (\n                True if scenario[\"extra_cols_req_status_tbl\"] else False\n            )\n        },\n    }\n\n\ndef _prepare_files(scenario: str, extra_params: dict) -> None:\n    \"\"\"Copy all the files needed for the tests.\n\n    Args:\n         scenario: scenario being tested.\n         extra_params: extra params for the scenario being tested.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{extra_params['test_name']}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/{extra_params['test_name']}/source/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{extra_params['test_name']}/*.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/{extra_params['test_name']}/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{extra_params['test_name']}/data/control/*_schema.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/{extra_params['test_name']}/\",\n    )\n\n    if scenario == \"no_part_col_join_condition\":\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/\"\n            f\"{extra_params['test_name']}/data/control/\"\n            f\"dummy_table_join_condition.csv\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/{extra_params['test_name']}/data/\",\n        )\n    else:\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/{extra_params['test_name']}/data/control/\"\n            f\"dummy_table.csv\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/{extra_params['test_name']}/data/\",\n        )\n\n\ndef _load_test_table(\n    db_table: str, scenario: str, extra_params: dict, iteration: int = None\n) -> DataFrame:\n    \"\"\"Load the JDBC tables for the tests and return a Dataframe with the content.\n\n    Args:\n        db_table: table being loaded.\n        scenario: scenario being tested.\n        extra_params: extra params for the scenario being tested.\n        iteration: number of the iteration, in case it is to test a delta.\n\n    Returns:\n        A Dataframe with the content of the JDBC table loaded.\n    \"\"\"\n    file_name = f\"{db_table}_{iteration}\" if iteration else db_table\n\n    source_df = DataframeHelpers.read_from_file(\n        location=f\"{TEST_LAKEHOUSE_IN}/{scenario}/{extra_params['test_name']}/\"\n        f\"source/{file_name}.csv\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario}/{extra_params['test_name']}/\"\n            f\"{db_table}_schema.json\"\n        ),\n        options={\"header\": True, \"delimiter\": 
\"|\", \"dateFormat\": \"yyyyMMdd\"},\n    )\n\n    DataframeHelpers.write_into_jdbc_table(\n        source_df,\n        f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/{scenario}/\"\n        f\"{extra_params['test_name']}/tests.db\",\n        db_table,\n    )\n\n    return DataframeHelpers.read_from_jdbc(\n        f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/{scenario}/\"\n        f\"{extra_params['test_name']}/tests.db\",\n        db_table,\n    )\n\n\ndef _validate(scenario: str, extra_params: dict, min_timestamp: bool) -> None:\n    \"\"\"Perform the validation part of the local tests.\n\n    Args:\n        scenario: the scenario being tested.\n        extra_params: extra params for the scenario being tested.\n        min_timestamp: whether the min_timestamp is provided or not.\n    \"\"\"\n    control_df = DataframeHelpers.read_from_file(\n        location=f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/{extra_params['test_name']}/\"\n        f\"data\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_CONTROL}/{scenario}/\"\n            f\"{extra_params['test_name']}/dummy_table_schema.json\"\n        ),\n        options={\"header\": True, \"delimiter\": \"|\", \"dateFormat\": \"yyyyMMdd\"},\n    )\n\n    control_df_columns = control_df.columns\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/{extra_params['test_name']}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    ).select(control_df_columns)\n\n    if min_timestamp:\n        # when we fill the min_timestamp, it means it can either skip or\n        # re-extract things, depending on the timestamp provided. In our scenario\n        # is expected to re-extract, causing duplicates, thus if we remove the\n        # duplicates we expect to match the non-duplicated control dataframe\n        result_df = result_df.drop_duplicates()\n\n    assert not DataframeHelpers.has_diff(control_df, result_df)\n"
  },
  {
    "path": "tests/feature/test_extract_from_sap_bw.py",
    "content": "\"\"\"Test extractions from SAP BW.\"\"\"\n\nimport re\nfrom datetime import datetime, timezone\n\nimport pytest\nfrom _pytest.logging import LogCaptureFixture\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.core.definitions import OutputFormat, WriteType\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.extraction.sap_bw_extraction_utils import (\n    SAPBWExtraction,\n    SAPBWExtractionUtils,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"extract_from_sap_bw\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\nLOGGER = LoggingHandler(__name__).get_logger()\nDB_TABLE = \"dummy_table\"\n\"\"\"Scenario - Description:\n    no_part_col_no_lower_and_upper_bound_extra_cols - no strategy to split the\n        extraction. Moreover, test adding single extra column from the activation\n        requests table.\n    int_part_col_provide_upper_bound_&_min_timestamp - partition column of type int,\n        manually provided upper_bound to parallelize the extraction. Moreover, it\n        provides the min_timestamp to use to get the data from the changelog in the\n        delta extraction after the init, which mimics the possible situation, in\n        which people might need to provide a specific timestamp for backfilling,\n        instead of deriving it from an existing location.\n    int_part_col_generate_predicates_multi_extra_cols - partition column of type int to\n        automatically generate predicates and parallelize the extraction. Moreover, test\n        adding multiple extra columns from the activation requests table.\n    str_part_col_generate_predicates - partition column of type str to\n        automatically generate predicates and parallelize the extraction.\n    str_part_col_predicates_list - partition column of type str,\n        manually provided predicates list to parallelize the extraction.\n    date_part_col_calculate_upper_bound - partition column of type date to automatically\n        calculate the upper_bound and parallelize the extraction.\n    timestamp_part_col_calculate_upper_bound - partition column of type timestamp to\n        automatically calculate the upper_bound and parallelize the extraction from.\n    init_timestamp_from_actrequest - get the init timestamp from act_request\n        table instead of assuming a given timestamp.\n    fail_calc_upper_bound - empty partition of type date to force failure on\n        the upper bound calculation.\n    no_part_col_join_condition - no strategy to split the extraction. 
Test to\n        validate custom join condition on activation table.\n\"\"\"\nTEST_SCENARIOS = [\n    {\n        \"scenario_name\": \"no_part_col_no_lower_and_upper_bound_extra_cols\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": None,\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_act_request\": \"act_req.request as activation_request\",\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"int_part_col_provide_upper_bound_&_min_timestamp\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": \"upper_bound int\",\n        \"part_col\": \"item\",\n        \"lower_bound\": 1,\n        \"upper_bound\": 3,\n        \"min_timestamp\": \"20211004151010\",\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_act_request\": None,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"int_part_col_generate_predicates_multi_extra_cols\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": \"item\",\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": True,\n        \"predicates_list\": None,\n        \"extra_cols_act_request\": \"act_req.request as actrequest_request, status\",\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"str_part_col_generate_predicates\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": '\"/bic/article\"',\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": True,\n        \"predicates_list\": None,\n        \"extra_cols_act_request\": None,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"str_part_col_predicates_list\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": None,\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": [\n            \"\\\"/bic/article\\\"='article1'\",\n            \"\\\"/bic/article\\\"='article2'\",\n            \"\\\"/bic/article\\\"='article3'\",\n            \"\\\"/bic/article\\\"='article4'\",\n            \"\\\"/bic/article\\\"='article5'\",\n            \"\\\"/bic/article\\\"='article6'\",\n            \"\\\"/bic/article\\\"='article7'\",\n            \"\\\"/bic/article\\\"='article33'\",\n            \"\\\"/bic/article\\\"='article60'\",\n            '\"/bic/article\" IS NULL',\n        ],\n        \"extra_cols_act_request\": None,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"date_part_col_calculate_upper_bound\",\n        \"calculate_upper_bound\": True,\n        \"calculate_upper_bound_schema\": \"upper_bound date\",\n        \"part_col\": \"date\",\n        \"lower_bound\": \"2000-01-01\",\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_act_request\": None,\n        
\"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"timestamp_part_col_calculate_upper_bound\",\n        \"calculate_upper_bound\": True,\n        \"calculate_upper_bound_schema\": \"upper_bound timestamp\",\n        \"part_col\": \"time\",\n        \"lower_bound\": \"2000-01-01 01:01:01.000\",\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_act_request\": None,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"init_timestamp_from_actrequest\",\n        \"calculate_upper_bound\": True,\n        \"calculate_upper_bound_schema\": \"upper_bound timestamp\",\n        \"part_col\": \"time\",\n        \"lower_bound\": \"2000-01-01 01:01:01.000\",\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_act_request\": None,\n        \"get_timestamp_from_act_request\": True,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"fail_calc_upper_bound\",\n        \"calculate_upper_bound\": True,\n        \"calculate_upper_bound_schema\": \"upper_bound date\",\n        \"part_col\": \"order_date\",\n        \"lower_bound\": \"2000-01-01\",\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_act_request\": None,\n        \"act_req_join_condition\": None,\n    },\n    {\n        \"scenario_name\": \"no_part_col_join_condition\",\n        \"calculate_upper_bound\": False,\n        \"calculate_upper_bound_schema\": None,\n        \"part_col\": None,\n        \"lower_bound\": None,\n        \"upper_bound\": None,\n        \"min_timestamp\": None,\n        \"generate_predicates\": False,\n        \"predicates_list\": None,\n        \"extra_cols_act_request\": None,\n        \"act_req_join_condition\": \"changelog_tbl.request = act_req.actrequest \"\n        \"AND changelog_tbl.request = act_req.request\",\n    },\n]\n\n\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS)\ndef test_extract_dso(scenario: dict, caplog: LogCaptureFixture) -> None:\n    \"\"\"Test the extraction from SAP BW DSO.\n\n    Args:\n        scenario: scenario to test.\n        caplog: fixture to capture console logs.\n    \"\"\"\n    extra_params = {\n        \"request_col_name\": \"actrequest\",\n        \"changelog_table\": f\"{DB_TABLE}_cl\",\n        \"test_name\": \"extract_dso\",\n        \"include_changelog_tech_cols\": True,\n    }\n\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n    _prepare_files(scenario[\"scenario_name\"], extra_params)\n    _load_test_table(\"rsodsactreq\", scenario[\"scenario_name\"], extra_params)\n\n    _execute_and_validate(\"extract_dso\", scenario, extra_params, caplog)\n\n\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS)\ndef test_extract_write_optimised_dso(scenario: dict, caplog: LogCaptureFixture) -> None:\n    \"\"\"Test the extraction from SAP BW Write Optimised DSO.\n\n    Args:\n        scenario: scenario to test.\n        caplog: fixture to capture console logs.\n    \"\"\"\n    extra_params = {\n        \"request_col_name\": \"request\",\n        \"changelog_table\": DB_TABLE,\n        \"test_name\": \"extract_write_optimised_dso\",\n        \"include_changelog_tech_cols\": False,\n    }\n\n    LOGGER.info(f\"Starting Scenario 
{scenario['scenario_name']}\")\n    _prepare_files(scenario[\"scenario_name\"], extra_params)\n    _load_test_table(\"rsodsactreq\", scenario[\"scenario_name\"], extra_params)\n\n    _execute_and_validate(\"extract_wodso\", scenario, extra_params, caplog)\n\n\ndef _execute_and_validate(\n    test_name: str, scenario: dict, extra_params: dict, caplog: LogCaptureFixture\n) -> None:\n    \"\"\"Helper function to reuse for trigger loading data and validation of results.\n\n    Args:\n        test_name: test being executed (for dso or wodso).\n        scenario: scenario being tested.\n        extra_params: extra params for the scenario being tested.\n        caplog: fixture to capture console logs.\n    \"\"\"\n    if scenario[\"scenario_name\"] == \"fail_calc_upper_bound\":\n        with pytest.raises(AttributeError, match=\"Not able to calculate upper bound\"):\n            _execute_load(\n                scenario=scenario, extraction_type=\"init\", extra_params=extra_params\n            )\n    elif test_name == \"extract_dso\" and \"from_actrequest\" in scenario[\"scenario_name\"]:\n        with pytest.raises(\n            AttributeError, match=\"Not able to get the extraction query\"\n        ):\n            _execute_load(\n                scenario=scenario, extraction_type=\"init\", extra_params=extra_params\n            )\n    else:\n        _execute_load(\n            scenario=scenario, extraction_type=\"init\", extra_params=extra_params\n        )\n\n        changelog_table = extra_params[\"changelog_table\"]\n        assert f\"The changelog table derived is: '{changelog_table}'\" in caplog.text\n\n        _execute_load(\n            scenario=scenario,\n            extraction_type=\"delta\",\n            iteration=1,\n            extra_params=extra_params,\n        )\n\n        _execute_load(\n            scenario=scenario,\n            extraction_type=\"delta\",\n            iteration=2,\n            extra_params=extra_params,\n        )\n\n        _validate(\n            scenario[\"scenario_name\"],\n            extra_params,\n            scenario[\"min_timestamp\"] is not None,\n        )\n\n\ndef _execute_load(\n    scenario: dict,\n    extra_params: dict,\n    extraction_type: str,\n    iteration: int = None,\n) -> None:\n    \"\"\"Helper function to reuse for loading the data for the scenario tests.\n\n    Args:\n        scenario: scenario being tested.\n        extra_params: extra params for the scenario being tested.\n        extraction_type: type of extraction (delta or init).\n        iteration: number of the iteration, in case it is to test a delta.\n    \"\"\"\n    write_type = \"overwrite\" if extraction_type == \"init\" else \"append\"\n\n    _load_test_table(\n        DB_TABLE if extraction_type == \"init\" else extra_params[\"changelog_table\"],\n        scenario[\"scenario_name\"],\n        extra_params,\n        iteration,\n    )\n\n    # if it is an init, we need to provide an extraction_timestamp, otherwise the\n    # current time would be used and data would be filtered accordingly.\n    acon = _get_test_acon(\n        extraction_timestamp=(\n            \"20211004151010\"\n            if extraction_type == \"init\"\n            else datetime.now(timezone.utc).strftime(\"%Y%m%d%H%M%S\")\n        ),\n        extraction_type=extraction_type,\n        write_type=write_type,\n        scenario=scenario,\n        extra_params=extra_params,\n    )\n\n    load_data(acon=acon)\n\n\ndef _get_test_acon(\n    extraction_type: str,\n    write_type: str,\n    scenario: dict,\n    
extra_params: dict,\n    extraction_timestamp: str = None,\n) -> dict:\n    \"\"\"Creates a test ACON with the desired logic for the algorithm.\n\n    Args:\n        extraction_type: type of extraction (delta or init).\n        write_type: the spark write type to be used.\n        scenario: the scenario being tested.\n        extra_params: extra params for the scenario being tested.\n        extraction_timestamp: timestamp of the extraction. For local tests\n            we specify it in the init, otherwise would be calculated and\n            tests would fail.\n\n    Returns:\n        dict: the ACON for the algorithm configuration.\n    \"\"\"\n    return {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"sales_source\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"sap_bw\",\n                \"calculate_upper_bound\": scenario[\"calculate_upper_bound\"],\n                \"calc_upper_bound_schema\": scenario[\"calculate_upper_bound_schema\"],\n                \"generate_predicates\": scenario[\"generate_predicates\"],\n                \"options\": {\n                    \"driver\": \"org.sqlite.JDBC\",\n                    \"user\": \"dummy_user\",\n                    \"password\": \"dummy_pwd\",\n                    \"url\": f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/\"\n                    f\"{scenario['scenario_name']}/{extra_params['test_name']}/tests.db\",\n                    \"dbtable\": DB_TABLE,\n                    \"changelog_table\": (\n                        extra_params[\"changelog_table\"]\n                        if \"changelog_table\" in extra_params.keys()\n                        else None\n                    ),\n                    \"customSchema\": \"actrequest_timestamp DECIMAL(15,0), \"\n                    \"datapakid STRING, request STRING, \"\n                    \"partno INTEGER, record INTEGER, \"\n                    \"extraction_start_timestamp DECIMAL(15,0)\",\n                    \"act_request_table\": \"rsodsactreq\",\n                    \"extra_cols_act_request\": scenario[\"extra_cols_act_request\"],\n                    \"latest_timestamp_data_location\": f\"file:///{TEST_LAKEHOUSE_OUT}/\"\n                    f\"{scenario['scenario_name']}/{extra_params['test_name']}/data\",\n                    \"extraction_type\": extraction_type,\n                    \"numPartitions\": 2,\n                    \"partitionColumn\": scenario[\"part_col\"],\n                    \"lowerBound\": scenario[\"lower_bound\"],\n                    \"upperBound\": scenario[\"upper_bound\"],\n                    \"default_upper_bound\": \"Null\",\n                    \"extraction_timestamp\": extraction_timestamp,\n                    \"min_timestamp\": scenario[\"min_timestamp\"],\n                    \"request_col_name\": extra_params[\"request_col_name\"],\n                    \"act_req_join_condition\": scenario[\"act_req_join_condition\"],\n                    \"include_changelog_tech_cols\": extra_params[\n                        \"include_changelog_tech_cols\"\n                    ],\n                    \"predicates\": scenario[\"predicates_list\"],\n                    \"get_timestamp_from_act_request\": scenario.get(\n                        \"get_timestamp_from_act_request\", False\n                    ),\n                },\n            }\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"filtered_sales\",\n                \"input_id\": \"sales_source\",\n                
\"transformers\": [\n                    {\n                        \"function\": \"expression_filter\",\n                        \"args\": {\"exp\": \"`/bic/article` like 'article%'\"},\n                    }\n                ],\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"sales_bronze\",\n                \"input_id\": \"sales_source\",\n                \"write_type\": write_type,\n                \"data_format\": \"delta\",\n                \"partitions\": [\"actrequest_timestamp\"],\n                \"location\": f\"file:///{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}/\"\n                f\"{extra_params['test_name']}/data\",\n            }\n        ],\n        \"exec_env\": {\n            \"spark.databricks.delta.schema.autoMerge.enabled\": (\n                True if scenario[\"extra_cols_act_request\"] else False\n            )\n        },\n    }\n\n\ndef _prepare_files(scenario: str, extra_params: dict) -> None:\n    \"\"\"Copy all the files needed for the tests.\n\n    Args:\n         scenario: scenario being tested.\n         extra_params: extra params for the scenario being tested.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{extra_params['test_name']}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/{extra_params['test_name']}/source/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{extra_params['test_name']}/*.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/{extra_params['test_name']}/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{extra_params['test_name']}/data/control/*_schema.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/{extra_params['test_name']}/\",\n    )\n\n    if (\n        \"optimised_dso\" in extra_params[\"test_name\"]\n        and scenario == \"init_timestamp_from_actrequest\"\n    ):\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/\"\n            f\"{extra_params['test_name']}/data/control/\"\n            f\"dummy_table_actreq_timestamp.csv\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/{extra_params['test_name']}/data/\",\n        )\n    elif scenario == \"no_part_col_join_condition\":\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/\"\n            f\"{extra_params['test_name']}/data/control/\"\n            f\"dummy_table_join_condition.csv\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/{extra_params['test_name']}/data/\",\n        )\n    else:\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/{extra_params['test_name']}/data/control/\"\n            f\"dummy_table.csv\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/{extra_params['test_name']}/data/\",\n        )\n\n\ndef _load_test_table(\n    db_table: str, scenario: str, extra_params: dict, iteration: int = None\n) -> DataFrame:\n    \"\"\"Load the JDBC tables for the tests and return a Dataframe with the content.\n\n    Args:\n        db_table: table being loaded.\n        scenario: scenario being tested.\n        extra_params: extra params for the scenario being tested.\n        iteration: number of the iteration, in case it is to test a delta.\n\n    Returns:\n        A Dataframe with the content of the JDBC table loaded.\n    \"\"\"\n    file_name = f\"{db_table}_{iteration}\" if iteration else db_table\n\n    source_df = DataframeHelpers.read_from_file(\n        location=f\"{TEST_LAKEHOUSE_IN}/{scenario}/{extra_params['test_name']}/\"\n        f\"source/{file_name}.csv\",\n     
   schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario}/{extra_params['test_name']}/\"\n            f\"{db_table}_schema.json\"\n        ),\n        options={\"header\": True, \"delimiter\": \"|\", \"dateFormat\": \"yyyyMMdd\"},\n    )\n\n    DataframeHelpers.write_into_jdbc_table(\n        source_df,\n        f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/{scenario}/\"\n        f\"{extra_params['test_name']}/tests.db\",\n        db_table,\n    )\n\n    return DataframeHelpers.read_from_jdbc(\n        f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/{scenario}/\"\n        f\"{extra_params['test_name']}/tests.db\",\n        db_table,\n    )\n\n\ndef _validate(scenario: str, extra_params: dict, min_timestamp: bool) -> None:\n    \"\"\"Perform the validation part of the local tests.\n\n    Args:\n        scenario: the scenario being tested.\n        extra_params: extra params for the scenario being tested.\n        min_timestamp: whether the min_timestamp is provided or not.\n    \"\"\"\n    control_df = DataframeHelpers.read_from_file(\n        location=f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/{extra_params['test_name']}/\"\n        f\"data\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_CONTROL}/{scenario}/\"\n            f\"{extra_params['test_name']}/dummy_table_schema.json\"\n        ),\n        options={\"header\": True, \"delimiter\": \"|\", \"dateFormat\": \"yyyyMMdd\"},\n    )\n    control_df_columns = control_df.columns\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/{extra_params['test_name']}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    ).select(control_df_columns)\n\n    if min_timestamp:\n        # when we fill the min_timestamp, it means it can either skip or\n        # re-extract things, depending on the timestamp provided. 
In our scenario\n        # it is expected to re-extract, causing duplicates, thus if we remove the\n        # duplicates we expect to match the non-duplicated control dataframe\n        result_df = result_df.drop_duplicates()\n\n    assert not DataframeHelpers.has_diff(control_df, result_df)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"derive_changelog_table_name\",\n            \"odsobject\": \"testtable\",\n            \"logsys\": \"DHACLNT003\",\n        },\n        {\n            \"name\": \"derive_changelog_table_name\",\n            \"odsobject\": \"test_table\",\n        },\n    ],\n)\ndef test_changelog_table_name_derivation(scenario: dict) -> None:\n    \"\"\"Test the changelog table name derivation.\n\n    Args:\n        scenario: scenario to be tested.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"\"\"{TEST_RESOURCES}/{scenario[\"name\"]}/data/source/*.csv\"\"\",\n        f\"\"\"{TEST_LAKEHOUSE_IN}/{scenario[\"name\"]}/source/\"\"\",\n    )\n    LocalStorage.copy_file(\n        f\"\"\"{TEST_RESOURCES}/{scenario[\"name\"]}/*.json\"\"\",\n        f\"\"\"{TEST_LAKEHOUSE_IN}/{scenario[\"name\"]}/\"\"\",\n    )\n\n    for table in [\"RSTSODS\", \"RSBASIDOC\"]:\n        source_df = DataframeHelpers.read_from_file(\n            location=f\"\"\"{TEST_LAKEHOUSE_IN}/{scenario[\"name\"]}/\"\"\"\n            f\"\"\"source/{table}.csv\"\"\",\n            schema=SchemaUtils.from_file_to_dict(\n                f\"\"\"file://{TEST_LAKEHOUSE_IN}/{scenario[\"name\"]}/\"\"\"\n                f\"\"\"{table}_schema.json\"\"\"\n            ),\n            options={\"header\": True, \"delimiter\": \"|\"},\n        )\n\n        DataframeHelpers.write_into_jdbc_table(\n            source_df,\n            f\"\"\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/{scenario[\"name\"]}/tests.db\"\"\",\n            table,\n            write_type=WriteType.OVERWRITE.value,\n        )\n\n    extraction_utils = SAPBWExtractionUtils(\n        SAPBWExtraction(  # nosec B106\n            sap_bw_schema=\"\",\n            odsobject=scenario[\"odsobject\"],\n            dbtable=\"dummy_table\",\n            driver=\"org.sqlite.JDBC\",\n            user=\"dummy_user\",\n            password=\"dummy_pwd\",\n            url=f\"\"\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/{scenario[\"name\"]}/tests.db\"\"\",\n            **(\n                {\"logsys\": scenario[\"logsys\"]}\n                if \"logsys\" in scenario and scenario[\"logsys\"] is not None\n                else {}\n            ),\n        )\n    )\n\n    assert re.match(\n        f\"\"\"{scenario[\"odsobject\"]}_OA\"\"\",\n        extraction_utils.get_changelog_table(),\n    )\n"
  },
  {
    "path": "tests/feature/test_file_manager.py",
    "content": "\"\"\"Test file manager.\"\"\"\nimport logging\nfrom typing import Any\n\nimport boto3\nimport pytest\nfrom moto import mock_s3  # type: ignore\n\nfrom lakehouse_engine.engine import manage_files\nfrom tests.conftest import FEATURE_RESOURCES\n\nTEST_PATH = \"file_manager\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\n\n\n@mock_s3\ndef test_file_manager(caplog: Any) -> None:\n    \"\"\"Test functions from file manager.\n\n    Args:\n        caplog: captured log.\n    \"\"\"\n    s3_res = boto3.resource(\"s3\", region_name=\"us-east-1\")\n    s3_cli = boto3.client(\"s3\", region_name=\"us-east-1\")\n\n    s3_res.create_bucket(Bucket=\"test_bucket\")\n    s3_res.create_bucket(Bucket=\"destination_bucket\")\n\n    with caplog.at_level(logging.INFO):\n        # Creating test files/folders in S3\n        # 2000 files are created to test the pagination is being correctly performed\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_single_file.json\", Body=\"\")\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_directory/\", Body=\"\")\n        for x in range(0, 2000):\n            s3_cli.put_object(\n                Bucket=\"test_bucket\",\n                Key=f\"test_directory/test_recursive_file{x}.json\",\n                Body=\"\",\n            )\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_directory_test/\", Body=\"\")\n        for x in range(0, 2000):\n            s3_cli.put_object(\n                Bucket=\"test_bucket\",\n                Key=f\"test_directory_test/test_recursive_file{x}.json\",\n                Body=\"\",\n            )\n\n        _test_file_manager_copy(caplog, s3_cli)\n        _test_file_manager_delete(caplog, s3_cli)\n\n\ndef _test_file_manager_copy(caplog: Any, s3_cli: Any) -> None:\n    \"\"\"Testing file manager copy operations.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n    \"\"\"\n    manage_files(\n        f\"file://{TEST_RESOURCES}/copy_object/acon_copy_single_object_dry_run.json\"\n    )\n    assert \"{'test_single_file.json': ['test_single_file.json']}\" in caplog.text\n\n    manage_files(\n        f\"file://{TEST_RESOURCES}/copy_object/acon_copy_directory_dry_run.json\"\n    )\n    for x in range(0, 2000):\n        assert f\"test_directory/test_recursive_file{x}.json\" in caplog.text\n\n    manage_files(f\"file://{TEST_RESOURCES}/copy_object/acon_copy_single_object.json\")\n    assert \"'KeyCount': 1\" in str(s3_cli.list_objects_v2(Bucket=\"destination_bucket\"))\n\n    manage_files(f\"file://{TEST_RESOURCES}/copy_object/acon_copy_directory.json\")\n\n    assert \"'KeyCount': 2002\" in str(\n        s3_cli.list_objects_v2(Bucket=\"destination_bucket\", MaxKeys=100000)\n    )\n\n\ndef _test_file_manager_delete(caplog: Any, s3_cli: Any) -> None:\n    \"\"\"Testing file manager delete operations.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n    \"\"\"\n    manage_files(\n        f\"file://{TEST_RESOURCES}/delete_objects/acon_delete_objects_dry_run.json\"\n    )\n    assert (\n        \"{'test_single_file.json': ['test_single_file.json'], \"\n        \"'test_directory/': ['test_directory/'\" in caplog.text\n    )\n    for x in range(0, 2000):\n        assert f\"test_directory/test_recursive_file{x}.json\" in caplog.text\n\n    manage_files(f\"file://{TEST_RESOURCES}/delete_objects/acon_delete_objects.json\")\n\n    assert \"'KeyCount': 2001\" in str(\n        s3_cli.list_objects_v2(Bucket=\"test_bucket\", 
MaxKeys=100000)\n    )\n\n\n@mock_s3\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"glacier\", \"storage_class\": \"GLACIER\"},\n        {\"scenario_name\": \"glacier_ir\", \"storage_class\": \"GLACIER_IR\"},\n        {\"scenario_name\": \"deep_archive\", \"storage_class\": \"DEEP_ARCHIVE\"},\n    ],\n)\ndef test_file_manager_restore_archive(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test restore functions from file manager.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    s3_res = boto3.resource(\"s3\", region_name=\"us-east-1\")\n    s3_cli = boto3.client(\"s3\", region_name=\"us-east-1\")\n\n    s3_res.create_bucket(Bucket=\"test_bucket\")\n    s3_res.create_bucket(Bucket=\"destination_bucket\")\n\n    with caplog.at_level(logging.INFO):\n        s3_cli.put_object(\n            Bucket=\"test_bucket\",\n            Key=\"test_single_file.json\",\n            Body=\"\",\n            StorageClass=scenario.get(\"storage_class\"),\n        )\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_directory\", Body=\"\")\n        for x in range(0, 3):\n            s3_cli.put_object(\n                Bucket=\"test_bucket\",\n                Key=f\"test_directory/test_recursive_file{x}.json\",\n                Body=\"\",\n                StorageClass=scenario.get(\"storage_class\"),\n            )\n\n        _test_file_manager_restore_request(caplog, s3_cli, s3_res)\n        _test_file_manager_restore_check(caplog, s3_cli, s3_res)\n\n\ndef _test_file_manager_restore_check(caplog: Any, s3_cli: Any, s3_res: Any) -> None:\n    \"\"\"Testing file manager restore check.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n        s3_res: s3 resource interface.\n    \"\"\"\n    test_bucket = s3_res.Bucket(\"test_bucket\")\n    expected_restored_objects = 4\n    restored_objects = 0\n\n    manage_files(\n        f\"file://{TEST_RESOURCES}/check_restore_status/\"\n        \"acon_check_restore_status_directory.json\"\n    )\n    for x in range(0, 3):\n        assert (\n            f\"Checking restore status for: test_directory/test_recursive_file{x}.json\"\n            in caplog.text\n        )\n\n    for bucket_object in test_bucket.objects.all():\n        obj = s3_res.Object(bucket_object.bucket_name, bucket_object.key)\n        if obj.restore is not None and 'ongoing-request=\"false\"' in obj.restore:\n            restored_objects += 1\n\n    assert \"'KeyCount': 5\" in str(\n        s3_cli.list_objects_v2(Bucket=\"test_bucket\", MaxKeys=100000)\n    )\n    assert expected_restored_objects == restored_objects\n\n\ndef _test_file_manager_restore_request(caplog: Any, s3_cli: Any, s3_res: Any) -> None:\n    \"\"\"Testing file manager restore request.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n        s3_res: s3 resource interface.\n    \"\"\"\n    test_bucket = s3_res.Bucket(\"test_bucket\")\n    expected_restored_objects = 4\n    restored_objects = 0\n\n    manage_files(\n        f\"file://{TEST_RESOURCES}/request_restore/\"\n        \"acon_request_restore_single_object.json\"\n    )\n    manage_files(\n        f\"file://{TEST_RESOURCES}/request_restore/\"\n        \"acon_request_restore_directory.json\"\n    )\n\n    for bucket_object in test_bucket.objects.all():\n        obj = s3_res.Object(bucket_object.bucket_name, bucket_object.key)\n        if obj.restore is not None and 'ongoing-request=\"false\"' in obj.restore:\n            
restored_objects += 1\n\n    assert \"'KeyCount': 5\" in str(\n        s3_cli.list_objects_v2(Bucket=\"test_bucket\", MaxKeys=100000)\n    )\n    assert expected_restored_objects == restored_objects\n\n\n@mock_s3\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"glacier\", \"storage_class\": \"GLACIER\"},\n        {\"scenario_name\": \"glacier_ir\", \"storage_class\": \"GLACIER_IR\"},\n        {\"scenario_name\": \"deep_archive\", \"storage_class\": \"DEEP_ARCHIVE\"},\n    ],\n)\ndef test_file_manager_restore_sync(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test restore functions from file manager.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    s3_res = boto3.resource(\"s3\", region_name=\"us-east-1\")\n    s3_cli = boto3.client(\"s3\", region_name=\"us-east-1\")\n\n    s3_res.create_bucket(Bucket=\"test_bucket\")\n    s3_res.create_bucket(Bucket=\"destination_bucket\")\n\n    with caplog.at_level(logging.INFO):\n        s3_cli.put_object(\n            Bucket=\"test_bucket\",\n            Key=\"test_single_file.json\",\n            Body=\"\",\n            StorageClass=scenario.get(\"storage_class\"),\n        )\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_directory/\", Body=\"\")\n        for x in range(0, 3):\n            s3_cli.put_object(\n                Bucket=\"test_bucket\",\n                Key=f\"test_directory/test_recursive_file{x}.json\",\n                Body=\"\",\n                StorageClass=scenario.get(\"storage_class\"),\n            )\n\n        _test_file_manager_restore_sync(caplog, s3_cli, s3_res)\n        _test_file_manager_restore_sync_retrieval_tier_exception(caplog)\n\n\ndef _test_file_manager_restore_sync(caplog: Any, s3_cli: Any, s3_res: Any) -> None:\n    \"\"\"Testing file manager restore file sync.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n        s3_res: s3 resource interface.\n    \"\"\"\n    test_bucket = s3_res.Bucket(\"test_bucket\")\n    expected_single_restored_objects = 1\n    restored_objects = 0\n\n    manage_files(\n        f\"file://{TEST_RESOURCES}/request_restore_to_destination_and_wait/\"\n        \"acon_request_restore_to_destination_and_wait_single_object.json\"\n    )\n\n    for bucket_object in test_bucket.objects.all():\n        obj = s3_res.Object(bucket_object.bucket_name, bucket_object.key)\n        if obj.restore is not None and 'ongoing-request=\"false\"' in obj.restore:\n            restored_objects += 1\n\n    assert \"'KeyCount': 1\" in str(\n        s3_cli.list_objects_v2(Bucket=\"destination_bucket\", MaxKeys=100000)\n    )\n    assert expected_single_restored_objects == restored_objects\n\n    restored_objects = 0\n    expected_restored_objects = 4\n\n    manage_files(\n        f\"file://{TEST_RESOURCES}/request_restore_to_destination_and_wait/\"\n        \"acon_request_restore_to_destination_and_wait_directory.json\"\n    )\n\n    for bucket_object in test_bucket.objects.all():\n        obj = s3_res.Object(bucket_object.bucket_name, bucket_object.key)\n        if obj.restore is not None and 'ongoing-request=\"false\"' in obj.restore:\n            restored_objects += 1\n\n    assert \"'KeyCount': 5\" in str(\n        s3_cli.list_objects_v2(Bucket=\"destination_bucket\", MaxKeys=100000)\n    )\n    assert expected_restored_objects == restored_objects\n\n\ndef _test_file_manager_restore_sync_retrieval_tier_exception(caplog: Any) -> None:\n    \"\"\"Testing file manager restore sync 
operation when raising exception.\n\n    Args:\n        caplog: captured log.\n    \"\"\"\n    with pytest.raises(ValueError) as exception:\n        manage_files(\n            f\"file://{TEST_RESOURCES}/request_restore_to_destination_and_wait/\"\n            \"acon_request_restore_to_destination_and_wait_single\"\n            \"_object_raise_error.json\"\n        )\n\n    assert (\n        \"Retrieval Tier Bulk not allowed on this operation! \"\n        \"This kind of restore should be used just with `Expedited` retrieval tier \"\n        \"to save cluster costs.\" in str(exception.value)\n    )\n"
  },
  {
    "path": "tests/feature/test_file_manager_dbfs.py",
    "content": "\"\"\"Test file manager for dbfs.\"\"\"\n\nimport logging\nimport os\nimport shutil\nfrom dataclasses import dataclass\nfrom pathlib import Path\nfrom typing import Any, Iterator\nfrom unittest.mock import patch\n\nimport pytest\n\nfrom lakehouse_engine.engine import manage_files\nfrom lakehouse_engine.utils.databricks_utils import DatabricksUtils\nfrom tests.conftest import FEATURE_RESOURCES\n\nTEST_PATH = \"file_manager_dbfs\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_DBFS = \"tests/lakehouse/dbfs\"\n\n\n@dataclass\nclass FileInfoFixture:\n    \"\"\"This class mocks the DBUtils FileInfo object.\"\"\"\n\n    path: str\n    name: str\n    size: int\n\n    def isDir(self) -> bool:\n        \"\"\"Construct to check if the path is a directory.\n\n        Returns:\n            A bool as true is it is a directory.\n        \"\"\"\n        return os.path.isdir(self.path)\n\n    def isFile(self) -> bool:\n        \"\"\"Construct to check if the path is a file.\n\n        Returns:\n            A bool as true is it is a file.\n        \"\"\"\n        return os.path.isfile(self.path)\n\n\nclass DBUtilsFixture:\n    \"\"\"This class is used for mocking the behaviour of DBUtils inside tests.\"\"\"\n\n    def __init__(self) -> None:\n        \"\"\"Construct to mock DBUtils filesystem operations.\"\"\"\n        self.fs = self\n\n    @staticmethod\n    def cp(src: str, dest: str, recurse: bool = False) -> None:\n        \"\"\"This mocks the behavior of dbutils when copy files or directories.\n\n        Args:\n            src: string with the path to copy from.\n            dest: string with the path to copy to.\n            recurse: bool to recursively move files or directories.\n        \"\"\"\n        if os.path.isfile(src):\n            shutil.copy(src, dest)\n        elif recurse:\n            shutil.copytree(src, dest)\n        else:\n            shutil.copy(src, dest)\n\n    @staticmethod\n    def ls(path: str) -> list:\n        \"\"\"This mocks the behavior of dbutils when reading a directory or files inside.\n\n        Args:\n            path: string with the path to read the directory or files inside.\n        \"\"\"\n        paths = Path(path).glob(\"*\")\n        objects = [\n            FileInfoFixture(str(p.absolute()), p.name, p.stat().st_size) for p in paths\n        ]\n        return objects\n\n    @staticmethod\n    def mkdirs(path: str) -> None:\n        \"\"\"This mocks the behavior of dbutils when creating a directory.\n\n        Args:\n            path: string with the path to create the directory.\n        \"\"\"\n        Path(path).mkdir(parents=True, exist_ok=True)\n\n    @staticmethod\n    def mv(src: str, dest: str, recurse: bool = False) -> None:\n        \"\"\"This mocks the behavior of dbutils when moving files or directories.\n\n        Args:\n            src: string with the path to move from.\n            dest: string with the path to move to.\n            recurse: bool to recursively move files or directories.\n        \"\"\"\n        if os.path.isfile(src):\n            shutil.move(src, dest, copy_function=shutil.copy)\n        elif recurse:\n            shutil.move(src, dest, copy_function=shutil.copytree)\n        else:\n            shutil.move(src, dest, copy_function=shutil.copy)\n\n    @staticmethod\n    def put(path: str, content: str, overwrite: bool = False) -> None:\n        \"\"\"This mocks the behavior of dbutils when inserting in files.\n\n        Args:\n            path: string with the path to insert content.\n   
         content: string with the content to insert in the file.\n            overwrite: bool to overwrite file with the content.\n        \"\"\"\n        file = Path(path)\n\n        if file.exists() and not overwrite:\n            raise FileExistsError(\"File already exists\")\n\n        file.write_text(content, encoding=\"utf-8\")\n\n    @staticmethod\n    def rm(path: str, recurse: bool = False) -> None:\n        \"\"\"This mocks the behavior of dbutils when removing files or directories.\n\n        Args:\n            path: string with the path to remove.\n            recurse: bool to recursively remove files or directories.\n        \"\"\"\n        if os.path.isfile(path):\n            os.remove(path)\n        elif recurse:\n            shutil.rmtree(path)\n        else:\n            os.remove(path)\n\n\n@pytest.fixture(scope=\"session\", autouse=True)\ndef dbutils_fixture() -> Iterator[None]:\n    \"\"\"This fixture patches the `get_db_utils` function.\"\"\"\n    with patch.object(DatabricksUtils, \"get_db_utils\", lambda _: DBUtilsFixture()):\n        yield\n\n\n@patch(\n    \"lakehouse_engine.utils.storage.file_storage_functions.\"\n    \"FileStorageFunctions.is_boto3_configured\",\n    return_value=False,\n)\ndef test_file_manager_dbfs(_patch: Any, caplog: Any) -> None:\n    \"\"\"Test functions from file manager.\n\n    Args:\n        caplog: captured log.\n    \"\"\"\n    dbutils = DBUtilsFixture()\n\n    with caplog.at_level(logging.INFO):\n        # Creating test files/folders in dbfs\n        dbutils.fs.mkdirs(path=TEST_LAKEHOUSE_DBFS)\n        dbutils.fs.put(path=f\"{TEST_LAKEHOUSE_DBFS}/test_single_file.json\", content=\"\")\n        dbutils.fs.mkdirs(path=f\"{TEST_LAKEHOUSE_DBFS}/test_directory/\")\n        for x in range(0, 2000):\n            dbutils.fs.put(\n                path=f\"{TEST_LAKEHOUSE_DBFS}/test_directory/\"\n                f\"test_recursive_file{x}.json\",\n                content=\"\",\n            )\n        dbutils.fs.mkdirs(path=f\"{TEST_LAKEHOUSE_DBFS}/test_directory_test/\")\n        for x in range(0, 2000):\n            dbutils.fs.put(\n                path=f\"{TEST_LAKEHOUSE_DBFS}/test_directory_test/\"\n                f\"test_recursive_file{x}.json\",\n                content=\"\",\n            )\n\n        _test_file_manager_dbfs_copy(caplog, dbutils)\n        _test_file_manager_dbfs_delete(caplog, dbutils)\n        _test_file_manager_dbfs_move(caplog, dbutils)\n\n\ndef _list_objects(path: str, objects_list: list, dbutils: Any) -> list:\n    list_objects = dbutils.fs.ls(path)\n\n    for file_or_directory in list_objects:\n        if file_or_directory.isDir():\n            _list_objects(file_or_directory.path, objects_list, dbutils)\n        else:\n            objects_list.append(file_or_directory.path)\n    return objects_list\n\n\ndef _test_file_manager_dbfs_copy(caplog: Any, dbutils: Any) -> None:\n    \"\"\"Testing file manager copy operations.\n\n    Args:\n        caplog: captured log.\n        dbutils: Dbutils from databricks.\n    \"\"\"\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/copy_objects/\"\n        f\"acon_copy_directory_dry_run.json\"\n    )\n    for x in range(0, 2000):\n        assert (\n            f\"/app/tests/lakehouse/dbfs/test_directory/test_recursive_file{x}.json\"\n            in caplog.text\n        )\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/copy_objects/acon_copy_directory.json\"\n    )\n\n    assert len(dbutils.fs.ls(\"tests/lakehouse/dbfs/test_directory\")) == len(\n    
    dbutils.fs.ls(\"tests/lakehouse/dbfs/destination_directory\")\n    )\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/copy_objects/acon_copy_single_object.json\"\n    )\n\n    assert \"tests/lakehouse/dbfs/test_single_file.json\" in str(\n        dbutils.fs.ls(\"tests/lakehouse/dbfs/\")\n    )\n\n\ndef _test_file_manager_dbfs_delete(caplog: Any, dbutils: Any) -> None:\n    \"\"\"Testing file manager delete operations.\n\n    Args:\n        caplog: captured log.\n        dbutils: Dbutils from databricks.\n    \"\"\"\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/delete_objects/\"\n        f\"acon_delete_objects_dry_run.json\"\n    )\n    assert (\n        \"{'tests/lakehouse/dbfs/test_directory': \"\n        \"['/app/tests/lakehouse/dbfs/test_directory/\" in caplog.text\n    )\n    for x in range(0, 2000):\n        assert (\n            f\"/app/tests/lakehouse/dbfs/test_directory/\"\n            f\"test_recursive_file{x}.json\" in caplog.text\n        )\n    for x in range(0, 2000):\n        assert (\n            f\"/app/tests/lakehouse/dbfs/destination_directory/\"\n            f\"test_recursive_file{x}.json\" in caplog.text\n        )\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/delete_objects/acon_delete_objects.json\"\n    )\n    assert len(dbutils.fs.ls(\"tests/lakehouse/dbfs/destination_directory\")) == 0\n\n\ndef _test_file_manager_dbfs_move(caplog: Any, dbutils: Any) -> None:\n    \"\"\"Testing file manager move operations.\n\n    Args:\n        caplog: captured log.\n        dbutils: Dbutils from databricks.\n    \"\"\"\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/move_objects/acon_move_objects_dry_run.json\"\n    )\n    assert (\n        \"{'tests/lakehouse/dbfs/test_directory': \"\n        \"['/app/tests/lakehouse/dbfs/test_directory/\" in caplog.text\n    )\n    for x in range(0, 2000):\n        assert (\n            f\"/app/tests/lakehouse/dbfs/test_directory/\"\n            f\"test_recursive_file{x}.json\" in caplog.text\n        )\n    for x in range(0, 2000):\n        assert (\n            f\"/app/tests/lakehouse/dbfs/destination_directory/\"\n            f\"test_recursive_file{x}.json\" in caplog.text\n        )\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/move_objects/acon_move_objects.json\"\n    )\n    assert len(dbutils.fs.ls(\"tests/lakehouse/dbfs/test_directory\")) == 0\n    assert len(dbutils.fs.ls(\"tests/lakehouse/dbfs/test_mv_directory\")) == 2000\n"
  },
  {
    "path": "tests/feature/test_file_manager_s3.py",
    "content": "\"\"\"Test file manager for s3.\"\"\"\n\nimport logging\nfrom typing import Any\n\nimport boto3\nimport pytest\nfrom moto import mock_s3, mock_sts  # type: ignore\n\nfrom lakehouse_engine.engine import manage_files\nfrom tests.conftest import FEATURE_RESOURCES\n\nTEST_PATH = \"file_manager_s3\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\n\n\n@mock_sts\ndef test_get_caller_identity_with_default_credentials() -> None:\n    \"\"\"Test get_caller_identity of sts client.\"\"\"\n    boto3.client(\"sts\", region_name=\"us-east-1\").get_caller_identity()\n\n\n@mock_s3\ndef test_file_manager_s3(caplog: Any) -> None:\n    \"\"\"Test functions from file manager.\n\n    Args:\n        caplog: captured log.\n    \"\"\"\n    s3_res = boto3.resource(\"s3\", region_name=\"us-east-1\")\n    s3_cli = boto3.client(\"s3\", region_name=\"us-east-1\")\n    test_get_caller_identity_with_default_credentials()\n\n    s3_res.create_bucket(Bucket=\"test_bucket\")\n    s3_res.create_bucket(Bucket=\"destination_bucket\")\n\n    with caplog.at_level(logging.INFO):\n        # Creating test files/folders in S3\n        # 2000 files are created to test the pagination is being correctly performed\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_single_file.json\", Body=\"\")\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_directory/\", Body=\"\")\n        for x in range(0, 2000):\n            s3_cli.put_object(\n                Bucket=\"test_bucket\",\n                Key=f\"test_directory/test_recursive_file{x}.json\",\n                Body=\"\",\n            )\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_directory_test/\", Body=\"\")\n        for x in range(0, 2000):\n            s3_cli.put_object(\n                Bucket=\"test_bucket\",\n                Key=f\"test_directory_test/test_recursive_file{x}.json\",\n                Body=\"\",\n            )\n\n        _test_file_manager_s3_copy(caplog, s3_cli)\n        _test_file_manager_s3_delete(caplog, s3_cli)\n\n\ndef _test_file_manager_s3_copy(caplog: Any, s3_cli: Any) -> None:\n    \"\"\"Testing file manager copy operations.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n    \"\"\"\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/copy_objects/\"\n        f\"acon_copy_single_object_dry_run.json\"\n    )\n    assert \"{'test_single_file.json': ['test_single_file.json']}\" in caplog.text\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/copy_objects/\"\n        f\"acon_copy_directory_dry_run.json\"\n    )\n    for x in range(0, 2000):\n        assert f\"test_directory/test_recursive_file{x}.json\" in caplog.text\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/copy_objects/acon_copy_single_object.json\"\n    )\n    assert \"'KeyCount': 1\" in str(s3_cli.list_objects_v2(Bucket=\"destination_bucket\"))\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/copy_objects/acon_copy_directory.json\"\n    )\n    assert \"'KeyCount': 2002\" in str(\n        s3_cli.list_objects_v2(Bucket=\"destination_bucket\", MaxKeys=100000)\n    )\n\n\ndef _test_file_manager_s3_delete(caplog: Any, s3_cli: Any) -> None:\n    \"\"\"Testing file manager delete operations.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n    \"\"\"\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/delete_objects/\"\n        f\"acon_delete_objects_dry_run.json\"\n    )\n    assert (\n        
\"{'test_single_file.json': ['test_single_file.json'], \"\n        \"'test_directory/': ['test_directory/'\" in caplog.text\n    )\n    for x in range(0, 2000):\n        assert f\"test_directory/test_recursive_file{x}.json\" in caplog.text\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/delete_objects/acon_delete_objects.json\"\n    )\n    assert \"'KeyCount': 2001\" in str(\n        s3_cli.list_objects_v2(Bucket=\"test_bucket\", MaxKeys=100000)\n    )\n\n\n@mock_s3\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"glacier\", \"storage_class\": \"GLACIER\"},\n        {\"scenario_name\": \"glacier_ir\", \"storage_class\": \"GLACIER_IR\"},\n        {\"scenario_name\": \"deep_archive\", \"storage_class\": \"DEEP_ARCHIVE\"},\n    ],\n)\ndef test_file_manager_s3_restore_archive(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test restore functions from file manager.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    s3_res = boto3.resource(\"s3\", region_name=\"us-east-1\")\n    s3_cli = boto3.client(\"s3\", region_name=\"us-east-1\")\n    test_get_caller_identity_with_default_credentials()\n\n    s3_res.create_bucket(Bucket=\"test_bucket\")\n    s3_res.create_bucket(Bucket=\"destination_bucket\")\n\n    with caplog.at_level(logging.INFO):\n        s3_cli.put_object(\n            Bucket=\"test_bucket\",\n            Key=\"test_single_file.json\",\n            Body=\"\",\n            StorageClass=scenario.get(\"storage_class\"),\n        )\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_directory\", Body=\"\")\n        for x in range(0, 3):\n            s3_cli.put_object(\n                Bucket=\"test_bucket\",\n                Key=f\"test_directory/test_recursive_file{x}.json\",\n                Body=\"\",\n                StorageClass=scenario.get(\"storage_class\"),\n            )\n\n        _test_file_manager_s3_restore_request(caplog, s3_cli, s3_res)\n        _test_file_manager_s3_restore_check(caplog, s3_cli, s3_res)\n\n\ndef _test_file_manager_s3_restore_check(caplog: Any, s3_cli: Any, s3_res: Any) -> None:\n    \"\"\"Testing file manager restore check.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n        s3_res: s3 resource interface.\n    \"\"\"\n    test_bucket = s3_res.Bucket(\"test_bucket\")\n    expected_restored_objects = 4\n    restored_objects = 0\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/check_restore_status/\"\n        f\"acon_check_restore_status_directory.json\"\n    )\n    for x in range(0, 3):\n        assert (\n            f\"Checking restore status for: test_directory/test_recursive_file{x}.json\"\n            in caplog.text\n        )\n\n    for bucket_object in test_bucket.objects.all():\n        obj = s3_res.Object(bucket_object.bucket_name, bucket_object.key)\n        if obj.restore is not None and 'ongoing-request=\"false\"' in obj.restore:\n            restored_objects += 1\n\n    assert \"'KeyCount': 5\" in str(\n        s3_cli.list_objects_v2(Bucket=\"test_bucket\", MaxKeys=100000)\n    )\n    assert expected_restored_objects == restored_objects\n\n\ndef _test_file_manager_s3_restore_request(\n    caplog: Any, s3_cli: Any, s3_res: Any\n) -> None:\n    \"\"\"Testing file manager restore request.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n        s3_res: s3 resource interface.\n    \"\"\"\n    test_bucket = s3_res.Bucket(\"test_bucket\")\n    
expected_restored_objects = 4\n    restored_objects = 0\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/request_restore/\"\n        f\"acon_request_restore_single_object.json\"\n    )\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/request_restore/\"\n        f\"acon_request_restore_directory.json\"\n    )\n\n    for bucket_object in test_bucket.objects.all():\n        obj = s3_res.Object(bucket_object.bucket_name, bucket_object.key)\n        if obj.restore is not None and 'ongoing-request=\"false\"' in obj.restore:\n            restored_objects += 1\n\n    assert \"'KeyCount': 5\" in str(\n        s3_cli.list_objects_v2(Bucket=\"test_bucket\", MaxKeys=100000)\n    )\n    assert expected_restored_objects == restored_objects\n\n\n@mock_s3\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"glacier\", \"storage_class\": \"GLACIER\"},\n        {\"scenario_name\": \"glacier_ir\", \"storage_class\": \"GLACIER_IR\"},\n        {\"scenario_name\": \"deep_archive\", \"storage_class\": \"DEEP_ARCHIVE\"},\n    ],\n)\ndef test_file_manager_s3_restore_sync(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test restore functions from file manager.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n    \"\"\"\n    s3_res = boto3.resource(\"s3\", region_name=\"us-east-1\")\n    s3_cli = boto3.client(\"s3\", region_name=\"us-east-1\")\n    test_get_caller_identity_with_default_credentials()\n\n    s3_res.create_bucket(Bucket=\"test_bucket\")\n    s3_res.create_bucket(Bucket=\"destination_bucket\")\n\n    with caplog.at_level(logging.INFO):\n        s3_cli.put_object(\n            Bucket=\"test_bucket\",\n            Key=\"test_single_file.json\",\n            Body=\"\",\n            StorageClass=scenario.get(\"storage_class\"),\n        )\n        s3_cli.put_object(Bucket=\"test_bucket\", Key=\"test_directory/\", Body=\"\")\n        for x in range(0, 3):\n            s3_cli.put_object(\n                Bucket=\"test_bucket\",\n                Key=f\"test_directory/test_recursive_file{x}.json\",\n                Body=\"\",\n                StorageClass=scenario.get(\"storage_class\"),\n            )\n\n        _test_file_manager_s3_restore_sync(caplog, s3_cli, s3_res)\n        _test_file_manager_s3_restore_sync_retrieval_tier_exception(caplog)\n\n\ndef _test_file_manager_s3_restore_sync(caplog: Any, s3_cli: Any, s3_res: Any) -> None:\n    \"\"\"Testing file manager restore file sync.\n\n    Args:\n        caplog: captured log.\n        s3_cli: s3 client interface.\n        s3_res: s3 resource interface.\n    \"\"\"\n    test_bucket = s3_res.Bucket(\"test_bucket\")\n    expected_single_restored_objects = 1\n    restored_objects = 0\n\n    manage_files(\n        acon_path=f\"file://{TEST_RESOURCES}/request_restore_to_destination_and_wait/\"\n        f\"acon_request_restore_to_destination_and_wait_single_object.json\"\n    )\n\n    for bucket_object in test_bucket.objects.all():\n        obj = s3_res.Object(bucket_object.bucket_name, bucket_object.key)\n        if obj.restore is not None and 'ongoing-request=\"false\"' in obj.restore:\n            restored_objects += 1\n\n    assert \"'KeyCount': 1\" in str(\n        s3_cli.list_objects_v2(Bucket=\"destination_bucket\", MaxKeys=100000)\n    )\n    assert expected_single_restored_objects == restored_objects\n\n    restored_objects = 0\n    expected_restored_objects = 4\n\n    manage_files(\n        
acon_path=f\"file://{TEST_RESOURCES}/request_restore_to_destination_and_wait/\"\n        f\"acon_request_restore_to_destination_and_wait_directory.json\"\n    )\n\n    for bucket_object in test_bucket.objects.all():\n        obj = s3_res.Object(bucket_object.bucket_name, bucket_object.key)\n        if obj.restore is not None and 'ongoing-request=\"false\"' in obj.restore:\n            restored_objects += 1\n\n    assert \"'KeyCount': 5\" in str(\n        s3_cli.list_objects_v2(Bucket=\"destination_bucket\", MaxKeys=100000)\n    )\n    assert expected_restored_objects == restored_objects\n\n\ndef _test_file_manager_s3_restore_sync_retrieval_tier_exception(caplog: Any) -> None:\n    \"\"\"Testing file manager restore sync operation when raising exception.\n\n    Args:\n        caplog: captured log.\n    \"\"\"\n    with pytest.raises(ValueError) as exception:\n        manage_files(\n            acon_path=f\"file://{TEST_RESOURCES}/request_restore_to_destination_\"\n            f\"and_wait/acon_request_restore_to_destination_and_wait_\"\n            f\"single_object_raise_error.json\"\n        )\n\n    assert (\n        \"Retrieval Tier Bulk not allowed on this operation! \"\n        \"This kind of restore should be used just with `Expedited` retrieval tier \"\n        \"to save cluster costs.\" in str(exception.value)\n    )\n"
  },
  {
    "path": "tests/feature/test_full_load.py",
    "content": "\"\"\"Test full loads.\"\"\"\n\nfrom typing import List\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import InputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"full_load\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\"with_filter\", InputFormat.PARQUET.value],\n        [\"with_filter_partition_overwrite\", InputFormat.DELTAFILES.value],\n        [\"full_overwrite\", InputFormat.DELTAFILES.value],\n    ],\n)\ndef test_batch_full_load(scenario: List[str]) -> None:\n    \"\"\"Test full loads in batch mode.\n\n    Args:\n        scenario: scenario to test.\n             with_filter - loads in full but applies a filter to the source.\n             with_filter_partition_overwrite - loads in full but only overwrites\n             partitions that are contained in the data being loaded, keeping\n             untouched partitions in the target table, therefore not doing a\n             complete overwrite.\n             full_overwrite - loads in full and overwrites target table.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/data/\",\n    )\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/{scenario[0]}/batch_init.json\"\n    )\n    load_data(acon=acon)\n\n    LocalStorage.clean_folder(\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/data\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/source/part-02.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/data/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario[0]}/batch.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/data\",\n        file_format=scenario[1],\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/data\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/test_gab.py",
    "content": "\"\"\"Module with integration tests for gab feature.\"\"\"\n\nfrom typing import Any, Optional\n\nimport pendulum\nimport pytest\nfrom _pytest.fixtures import SubRequest\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col, to_date\nfrom pyspark.sql.types import Row\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import execute_gab, load_data\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"gab\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_NAME}\"\n_LOGGER = LoggingHandler(__name__).get_logger()\n_CALENDAR_MIN_DATE = pendulum.from_format(\"2016-01-01\", \"YYYY-MM-DD\")\n_CALENDAR_MAX_DATE = pendulum.from_format(\"2023-01-01\", \"YYYY-MM-DD\")\n_SETUP_DELTA_TABLES = {\n    \"dim_calendar\": \"calendar\",\n    \"lkp_query_builder\": \"lkp_query_builder\",\n    \"gab_use_case_results\": \"gab_use_case_results\",\n    \"gab_log_events\": \"gab_log_events\",\n}\n_USE_CASE_TABLES = [\"order_events\", \"dummy_sales_kpi\"]\n\n\ndef _create_gab_tables() -> None:\n    \"\"\"Create necessary tables to use GAB.\"\"\"\n    for table_name, table_column_file in _SETUP_DELTA_TABLES.items():\n        DataframeHelpers.create_delta_table(\n            cols=SchemaUtils.from_file_to_dict(\n                f\"file:///{TEST_RESOURCES}/setup/column_list/{table_column_file}.json\"\n            ),\n            table=table_name,\n        )\n\n\ndef _generate_calendar_test_dates() -> list:\n    \"\"\"Generate calendar date between the test period.\"\"\"\n    calendar_dates: list[Row] = []\n    calendar_date = _CALENDAR_MIN_DATE\n\n    for _ in range(1, _CALENDAR_MIN_DATE.diff(_CALENDAR_MAX_DATE).in_days()):\n        calendar_date = calendar_date.add(days=1)\n        calendar_dates.append(Row(value=calendar_date.strftime(\"%Y-%m-%d\")))\n\n    return calendar_dates\n\n\ndef _transform_dates_list_to_dataframe(dates: list) -> DataFrame:\n    \"\"\"Create calendar dates DataFrame from a list of dates.\n\n    Args:\n        dates: list of dates to create the calendar DataFrame.\n    \"\"\"\n    calendar_dates = ExecEnv.SESSION.createDataFrame(dates)\n    calendar_dates = calendar_dates.withColumn(\n        \"calendar_date\", to_date(col(\"value\"), \"yyyy-MM-dd\")\n    ).drop(calendar_dates.value)\n\n    return calendar_dates\n\n\ndef _feed_dim_calendar(df: DataFrame) -> DataFrame:\n    \"\"\"Feed dim calendar table.\"\"\"\n    df.createOrReplaceTempView(\"dates_completed\")\n\n    df_cal = ExecEnv.SESSION.sql(\n        \"\"\"\n        WITH monday_calendar AS (\n            SELECT\n                 calendar_date,\n                WEEKOFYEAR(calendar_date) AS weeknum_mon,\n                DATE_FORMAT(calendar_date, 'E') AS day_en,\n                MIN(calendar_date) OVER (PARTITION BY CONCAT(DATE_PART(\n                    'YEAROFWEEK', calendar_date\n                ),\n                WEEKOFYEAR(calendar_date)) ORDER BY calendar_date) AS weekstart_mon\n            FROM dates_completed\n            ORDER BY\n   
             calendar_date\n        ),\n        monday_calendar_plus_week_num_sunday AS (\n            SELECT\n                monday_calendar.*,\n                LEAD(weeknum_mon) OVER(ORDER BY calendar_date) AS weeknum_sun\n            FROM monday_calendar\n        ),\n        calendar_complementary_values AS (\n            SELECT\n                calendar_date,\n                weeknum_mon,\n                day_en,\n                weekstart_mon,\n                weekstart_mon+6 AS weekend_mon,\n                LEAD(weekstart_mon-1) OVER(ORDER BY calendar_date) AS weekstart_sun,\n                DATE(DATE_TRUNC('MONTH', calendar_date)) AS month_start,\n                DATE(DATE_TRUNC('QUARTER', calendar_date)) AS quarter_start,\n                DATE(DATE_TRUNC('YEAR', calendar_date)) AS year_start\n            FROM monday_calendar_plus_week_num_sunday\n        )\n        SELECT\n            calendar_date,\n            day_en,\n            weeknum_mon,\n            weekstart_mon,\n            weekend_mon,\n            weekstart_sun,\n            weekstart_sun+6 AS weekend_sun,\n            month_start,\n            add_months(month_start, 1)-1 AS month_end,\n            quarter_start,\n            ADD_MONTHS(quarter_start, 3)-1 AS quarter_end,\n            year_start,\n            ADD_MONTHS(year_start, 12)-1 AS year_end\n        FROM calendar_complementary_values\n        \"\"\"\n    )\n\n    return df_cal\n\n\ndef _feed_table_with_test_data(\n    table_name: str,\n    source_dataframe: Optional[DataFrame] = None,\n    transformer_specs: list = None,\n    input_id_to_write: str = \"data_to_load\",\n) -> None:\n    \"\"\"Feed table with test data.\n\n    Args:\n        table_name: name of the table to feed.\n        source_dataframe: dataframe to feed the table, present when load_type is\n            dataframe.\n        transformer_specs: acon transformations.\n        input_id_to_write: input id used in the write step.\n    \"\"\"\n    input_spec: dict[str, Any]\n    if source_dataframe:\n        input_spec = {\n            \"spec_id\": \"data_to_load\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"dataframe\",\n            \"df_name\": source_dataframe,\n        }\n    else:\n        input_spec = {\n            \"spec_id\": \"data_to_load\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"schema_path\": f\"file:///{TEST_RESOURCES}/setup/schema/{table_name}.json\",\n            \"options\": {\n                \"header\": True,\n                \"delimiter\": \"|\",\n                \"mode\": \"FAILFAST\",\n                \"nullValue\": \"null\",\n            },\n            \"location\": f\"file:///{TEST_RESOURCES}/setup/data/{table_name}.csv\",\n        }\n\n    acon = {\n        \"input_specs\": [input_spec],\n        \"transform_specs\": transformer_specs if transformer_specs else [],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"loaded_table\",\n                \"input_id\": input_id_to_write,\n                \"write_type\": \"overwrite\",\n                \"data_format\": \"delta\",\n                \"db_table\": f\"test_db.{table_name}\",\n            },\n        ],\n    }\n    load_data(acon=acon)\n\n\ndef _create_and_load_source_data_for_use_case(source_table: str) -> None:\n    \"\"\"Create and load source for use case.\n\n    Args:\n        source_table: source table to create/feed the data.\n    \"\"\"\n    DataframeHelpers.create_delta_table(\n        
cols=SchemaUtils.from_file_to_dict(\n            f\"file:///{TEST_RESOURCES}/setup/column_list/{source_table}.json\"\n        ),\n        table=source_table,\n    )\n\n    _feed_table_with_test_data(table_name=source_table)\n\n\ndef _import_use_case_sql(use_case_name: str) -> None:\n    \"\"\"Import use case SQL stage files.\n\n    Args:\n        use_case_name: name of the use case.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/usecases/{use_case_name}/*.sql\",\n        f\"{TEST_LAKEHOUSE_IN}/usecases_sql/{use_case_name}/\",\n    )\n\n\ndef _setup_use_case(use_case_name: str) -> None:\n    \"\"\"Set up the use case.\n\n    Args:\n        use_case_name: name of the use case.\n    \"\"\"\n    _create_and_load_source_data_for_use_case(use_case_name)\n    _import_use_case_sql(use_case_name)\n\n\n@pytest.fixture(scope=\"session\", autouse=True)\ndef _gab_setup() -> None:\n    \"\"\"Execute the GAB setup.\n\n    Create and load gab config tables.\n    \"\"\"\n    _LOGGER.info(\"Creating gab config tables...\")\n\n    _create_gab_tables()\n    _feed_table_with_test_data(table_name=\"lkp_query_builder\")\n\n    calendar_dates = _generate_calendar_test_dates()\n    calendar_dates_df = _transform_dates_list_to_dataframe(calendar_dates)\n    _feed_table_with_test_data(\n        table_name=\"dim_calendar\",\n        source_dataframe=calendar_dates_df,\n        input_id_to_write=\"transformed_data\",\n        transformer_specs=[\n            {\n                \"spec_id\": \"transformed_data\",\n                \"input_id\": \"data_to_load\",\n                \"transformers\": [\n                    {\n                        \"function\": \"custom_transformation\",\n                        \"args\": {\"custom_transformer\": _feed_dim_calendar},\n                    }\n                ],\n            }\n        ],\n    )\n\n    _LOGGER.info(\"Created with success...\")\n\n\n@pytest.fixture(scope=\"session\", autouse=True, params=[_USE_CASE_TABLES])\ndef _run_setup_use_case(request: SubRequest) -> None:\n    \"\"\"Create and load use case gab tables.\n\n    Args:\n        request: fixture request, giving access to the `params`.\n    \"\"\"\n    _LOGGER.info(\"Creating use case config tables...\")\n    for use_case in request.param:\n        _setup_use_case(use_case)\n\n    _LOGGER.info(\"Created with success...\")\n\n\n@pytest.mark.usefixtures(\"_gab_setup\", \"_run_setup_use_case\")\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"use_case_name\": \"order_events\",\n            \"gold_assets\": [\"vw_orders_all\", \"vw_orders_filtered\"],\n            \"gold_asset_schema\": \"vw_orders\",\n            \"use_case_stages\": \"order_events\",\n        },\n        {\n            \"use_case_name\": \"order_events_snapshot\",\n            \"gold_assets\": [\"vw_orders_all_snapshot\", \"vw_orders_filtered_snapshot\"],\n            \"gold_asset_schema\": \"vw_orders\",\n            \"use_case_stages\": \"order_events\",\n        },\n        {\n            \"use_case_name\": \"order_events_nam\",\n            \"gold_assets\": [\n                \"vw_nam_orders_all_snapshot\",\n                \"vw_nam_orders_filtered_snapshot\",\n            ],\n            \"gold_asset_schema\": \"vw_orders\",\n            \"use_case_stages\": \"order_events\",\n        },\n        {\n            \"use_case_name\": \"order_events_negative_timezone_offset\",\n            \"gold_assets\": [\n                \"vw_negative_offset_orders_all\",\n                
\"vw_negative_offset_orders_filtered\",\n            ],\n            \"gold_asset_schema\": \"vw_orders\",\n            \"use_case_stages\": \"order_events\",\n        },\n        {\n            \"use_case_name\": \"dummy_sales_kpi\",\n            \"gold_assets\": [\"vw_dummy_sales_kpi\"],\n            \"gold_asset_schema\": \"vw_dummy_sales_kpi\",\n            \"use_case_stages\": \"dummy_sales_kpi\",\n        },\n        {\n            \"use_case_name\": \"skip_use_case_by_empty_reconciliation\",\n            \"query_label\": \"order_events_empty_reconciliation_window\",\n            \"use_case_stages\": \"order_events\",\n        },\n        {\n            \"use_case_name\": \"skip_use_case_by_empty_requested_cadence\",\n            \"query_label\": \"order_events_negative_timezone_offset\",\n            \"use_case_stages\": \"order_events\",\n        },\n        {\n            \"use_case_name\": \"skip_use_case_by_not_configured_cadence\",\n            \"query_label\": \"order_events_negative_timezone_offset\",\n            \"use_case_stages\": \"order_events\",\n        },\n        {\n            \"use_case_name\": \"skip_use_case_by_unexisting_cadence\",\n            \"query_label\": \"order_events_unexisting_cadence\",\n            \"use_case_stages\": \"order_events\",\n        },\n    ],\n)\ndef test_gold_asset_builder(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test the feature of using gab to generate gold assets.\n\n    Args:\n        scenario: scenario to test.\n        caplog: captured log.\n\n    Scenarios:\n        order_events: tests gab features:\n            - Cadence\n            - Recon Window\n            - Metrics\n            - Extended Window Calculator\n            Also test the generation of two different views for the same asset.\n        order_events_snapshot: tests gab features:\n            - Cadence\n            - Recon Window\n            - Metrics\n            - Extended Window Calculator\n            - Snapshot\n            Also test the generation of two different views for the same asset.\n        order_events_nam: tests gab features:\n            - Cadence\n            - Recon Window\n            - Metrics\n            - Extended Window Calculator\n            - Snapshot\n            Also test the generation of two different views for the same asset and the\n                use case `query_type` equals to `NAM`.\n        order_events_negative_timezone_offset: tests gab features:\n            - Cadence\n            - Recon Window\n            - Metrics\n            - Extended Window Calculator\n            - Offset\n            - Snapshot\n            Also test the generation of two different views for the same asset.\n       dummy_sales_kpi: tests almost all gab features:\n            - Cadence\n            - Recon Window\n            - Metrics\n            - Extended Window Calculator\n            Also test multiple stages for the asset creation.\n\n    \"\"\"\n    use_case_name = scenario[\"use_case_name\"]\n    execute_gab(\n        f\"file://{TEST_RESOURCES}/usecases/{scenario['use_case_stages']}/scenario/\"\n        f\"{use_case_name}.json\"\n    )\n\n    if not use_case_name.startswith(\"skip\"):\n        for expected_gold_asset in scenario[\"gold_assets\"]:\n            result_df = ExecEnv.SESSION.sql(\n                f\"SELECT * FROM test_db.{expected_gold_asset}\"  # nosec\n            )\n            control_df = DataframeHelpers.read_from_file(\n                f\"{TEST_RESOURCES}/control/data/{expected_gold_asset}.csv\",\n              
  schema=SchemaUtils.from_file_to_dict(\n                    f\"file:///{TEST_RESOURCES}/control/schema/\"\n                    f\"{scenario['gold_asset_schema']}.json\"\n                ),\n            )\n\n            assert not DataframeHelpers.has_diff(result_df, control_df)\n    else:\n        assert (\n            f\"Skipping use case {scenario['query_label']}. No cadence processed \"\n            \"for the use case.\" in caplog.text\n        )\n"
  },
  {
    "path": "tests/feature/test_heartbeat.py",
    "content": "\"\"\"Module with integration tests for heartbeat feature.\"\"\"\n\nimport datetime\nfrom unittest.mock import MagicMock, patch\n\nimport pytest\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import lit\nfrom pyspark.sql.types import TimestampType\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import (\n    execute_heartbeat_sensor_data_feed,\n    execute_sensor_heartbeat,\n    trigger_heartbeat_sensor_jobs,\n    update_heartbeat_sensor_status,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"heartbeat\"\nFEATURE_TEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\n_LOGGER = LoggingHandler(__name__).get_logger()\n\n\ndef _create_heartbeat_table(scenario_name: str, tables: dict) -> None:\n    \"\"\"Create the necessary tables required for using Heartbeat.\n\n    Args:\n        scenario_name (str): The name of the scenario.\n        tables (dict): Table names.\n    \"\"\"\n    for _, table_name in tables.items():\n        DataframeHelpers.create_delta_table(\n            cols=SchemaUtils.from_file_to_dict(\n                f\"file:///{FEATURE_TEST_RESOURCES}/setup/\"\n                f\"{scenario_name}/column_list/{table_name}.json\"\n            ),\n            table=table_name,\n        )\n\n\ndef _test_heartbeat_sensor_data_feed(\n    heartbeat_data_file_path: str,\n    heartbeat_control_table_name: str,\n    ctrl_heartbeat_df: DataFrame,\n) -> None:\n    \"\"\"Test the function that populates the heartbeat control table.\n\n    Args:\n        heartbeat_data_file_path (str): Path to the CSV file used\n            to populate the control table.\n        heartbeat_control_table_name (str): Name of the target\n            control table.\n        ctrl_heartbeat_df (DataFrame): Reference DataFrame\n            used to validate the table contents.\n    \"\"\"\n    _LOGGER.info(\"Testing execute_heartbeat_sensor_data_feed function\")\n\n    execute_heartbeat_sensor_data_feed(\n        heartbeat_data_file_path, heartbeat_control_table_name\n    )\n    heartbeat_df = ExecEnv.SESSION.table(f\"{heartbeat_control_table_name}\")\n\n    assert not DataframeHelpers.has_diff(heartbeat_df, ctrl_heartbeat_df)\n\n\n@patch(\n    \"lakehouse_engine.algorithms.sensors.heartbeat.Heartbeat._execute_batch_of_sensor\",\n    MagicMock(\n        return_value={\n            \"sensor_id\": \"dummy_delta_table\",\n            \"trigger_job_id\": \"1927384615203749\",\n        }\n    ),\n)\n@patch(\"lakehouse_engine.algorithms.sensors.heartbeat.current_timestamp\")\ndef _test_execute_sensor_heartbeat(\n    mocked_timestamp: MagicMock,\n    acon: dict,\n    heartbeat_control_table_name: str,\n    ctrl_heartbeat_df: DataFrame,\n    results: dict,\n) -> None:\n    \"\"\"Test the execution of the sensor heartbeat process.\n\n    This test mocks the internal `_execute_batch_of_sensor` method\n    to simulate the heartbeat execution, then validates the\n    resulting state in the heartbeat control table after\n    the execution of the execute_sensor_heartbeat function.\n\n    Args:\n        
mocked_timestamp (MagicMock): A static timestamp for testing.\n        acon (dict): Acon used to trigger the heartbeat execution.\n        heartbeat_control_table_name (str): Name of the control table to validate.\n        ctrl_heartbeat_df (DataFrame): Reference DataFrame\n            for asserting table contents.\n        results (dict): Reference values to compare.\n    \"\"\"\n    mocked_timestamp.return_value = lit(\n        datetime.datetime.strptime(\"2025/08/14 23:00\", \"%Y/%m/%d %H:%M\")\n    ).cast(TimestampType())\n\n    execute_sensor_heartbeat(acon=acon)\n    heartbeat_result = ExecEnv.SESSION.table(f\"{heartbeat_control_table_name}\")\n\n    assert (\n        heartbeat_result.filter(\"status = 'NEW_EVENT_AVAILABLE'\").count()\n        == results[\"new_events_available_count\"]\n    )\n    assert not DataframeHelpers.has_diff(ctrl_heartbeat_df, heartbeat_result)\n\n\n@patch(\"lakehouse_engine.algorithms.sensors.heartbeat.current_timestamp\")\n@patch(\n    \"lakehouse_engine.core.sensor_manager.datetime\",\n)\ndef _test_update_heartbeat_sensor_status(\n    mocked_timestamp_sensor: MagicMock,\n    mocked_timestamp_heartbeat: MagicMock,\n    heartbeat_control_table_name: str,\n    sensor_table_name: str,\n    job_id: str,\n    ctrl_heartbeat_df: DataFrame,\n    ctrl_sensor_df: DataFrame,\n) -> None:\n    \"\"\"Test the update of sensor and heartbeat control table statuses.\n\n    This test validates that the `update_heartbeat_sensor_status`\n    function correctly updates timestamps and status fields in\n    both the sensor and heartbeat control tables. It also\n    compares the updated tables against expected control DataFrames.\n\n    Args:\n        mocked_timestamp_sensor (MagicMock): A static timestamp for testing\n            sensor table.\n        mocked_timestamp_heartbeat (MagicMock): A static timestamp for testing\n            heartbeat table.\n        heartbeat_control_table_name (str): Name of the heartbeat control\n            table to validate.\n        sensor_table_name (str): Name of the sensor table to validate.\n        job_id (str): Job identifier used in the update process.\n        ctrl_heartbeat_df (DataFrame): Expected state\n            of the updated heartbeat control table.\n        ctrl_sensor_df (DataFrame): Expected state\n            of the updated sensor table.\n    \"\"\"\n    mocked_timestamp_sensor.now.return_value = datetime.datetime(\n        2025, 8, 14, 23, 00, 00, 00000\n    )\n    mocked_timestamp_heartbeat.return_value = lit(\n        datetime.datetime.strptime(\"2025/08/14 23:00\", \"%Y/%m/%d %H:%M\")\n    ).cast(TimestampType())\n\n    update_heartbeat_sensor_status(\n        heartbeat_control_table_name, sensor_table_name, job_id\n    )\n\n    heartbeat_data = ExecEnv.SESSION.table(f\"{heartbeat_control_table_name}\")\n    sensor_data = ExecEnv.SESSION.table(f\"{sensor_table_name}\")\n\n    _LOGGER.info(\"Comparing heartbeat and sensor tables with control tables\")\n    assert not DataframeHelpers.has_diff(ctrl_sensor_df, sensor_data)\n\n    assert not DataframeHelpers.has_diff(ctrl_heartbeat_df, heartbeat_data)\n\n\n@patch(\n    \"lakehouse_engine.core.sensor_manager.SensorJobRunManager.run_job\",\n    MagicMock(return_value=(\"run_id\", None)),\n)\n@patch(\"lakehouse_engine.algorithms.sensors.heartbeat.current_timestamp\")\ndef _trigger_heartbeat_sensor_jobs(\n    mocked_timestamp_heartbeat: MagicMock,\n    acon: dict,\n    heartbeat_control_table_name: str,\n    heartbeat_control_table_updated: DataFrame,\n) -> None:\n    \"\"\"Test 
the triggering of sensor heartbeat jobs.\n\n    This test mocks the `run_job` method to simulate job execution,\n    triggers the heartbeat sensor jobs, and verifies that the\n    heartbeat control table reflects the expected changes.\n\n    Args:\n        mocked_timestamp_heartbeat (MagicMock): A static timestamp for testing\n            heartbeat table.\n        acon (dict): Acon used to trigger the sensor jobs.\n        heartbeat_control_table_name (str): Name of the heartbeat control\n            table to validate.\n        heartbeat_control_table_updated (DataFrame): Expected state\n            of the control table after job execution.\n    \"\"\"\n    mocked_timestamp_heartbeat.return_value = lit(\n        datetime.datetime.strptime(\"2025/08/14 23:00\", \"%Y/%m/%d %H:%M\")\n    ).cast(TimestampType())\n\n    trigger_heartbeat_sensor_jobs(acon)\n\n    heartbeat_table_job_run = ExecEnv.SESSION.table(f\"{heartbeat_control_table_name}\")\n    assert not DataframeHelpers.has_diff(\n        heartbeat_table_job_run, heartbeat_control_table_updated\n    )\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"use_case_name\": \"default\",\n            \"control_files\": {\n                \"ctrl_heart_tbl_heartb_feed_fname\": \"ctr_heart_tbl_heartb_feed.csv\",\n                \"ctrl_heart_tbl_exe_sns_hb_fname\": \"ctrl_heart_tbl_exec_sensor.csv\",\n                \"ctrl_heart_tbl_updated_fname\": \"ctrl_heart_tbl_updated.csv\",\n                \"ctrl_heart_tbl_trigger_job_fname\": \"ctrl_heart_tbl_trigger_job.csv\",\n                \"ctrl_sensor_tbl_upd_status_fname\": \"ctrl_sensor_tbl_upd_status.json\",\n                \"ctrl_heart_tbl_schema_fname\": \"ctrl_heart_tbl_schema.json\",\n            },\n            \"tables\": {\n                \"heartbeat_sensor_control_table\": \"heartbeat_sensor_control_table\",\n                \"sensor_table\": \"sensor_table\",\n            },\n            \"setup\": {\n                \"setup_heartbeat_data\": \"setup_heartbeat_data.csv\",\n                \"setup_sensor_data\": \"setup_sensor_data.json\",\n                \"schema_sensor_df\": \"schema_sensor_df.json\",\n            },\n            \"execute_sensor_heartbeat_results\": {\"new_events_available_count\": 1},\n            \"job_id\": \"1927384615203749\",\n            \"trigger_heartbeat_sensor_jobs_records\": {\n                \"heartbeat\": \"\"\"\n                    (\"delta_table\",\"dummy_order\",\"batch\",\n                    \"dummy_heartbeat_asset\",NULL,NULL,NULL,\n                    \"1015557820139870\",\"data-product_job_name_orders\",\"NEW_EVENT_AVAILABLE\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"true\")\"\"\",\n                \"sensors\": \"\"\"\n                    (\"dummy_order\",\n                    array(\"dummy_heartbeat_asset\"),\"ACQUIRED_NEW_DATA\",\n                    NULL,NULL,\"LOAD_DATE\",\"10155578201985\")\"\"\",\n            },\n        },\n        {\n            \"use_case_name\": \"heartbeat_paused_sensor_new_record\",\n            \"control_files\": {\n                \"ctrl_heart_tbl_heartb_feed_fname\": \"ctr_heart_tbl_heartb_feed.csv\",\n                \"ctrl_heart_tbl_exe_sns_hb_fname\": \"ctrl_heart_tbl_exec_sensor.csv\",\n                \"ctrl_heart_tbl_updated_fname\": \"ctrl_heart_tbl_updated.csv\",\n                \"ctrl_heart_tbl_trigger_job_fname\": \"ctrl_heart_tbl_trigger_job.csv\",\n                \"ctrl_sensor_tbl_upd_status_fname\": \"ctrl_sensor_tbl_upd_status.json\",\n                
\"ctrl_heart_tbl_schema_fname\": \"ctrl_heart_tbl_schema.json\",\n            },\n            \"tables\": {\n                \"heartbeat_sensor_control_table\": \"heartbeat_sensor_control_table\",\n                \"sensor_table\": \"sensor_table\",\n            },\n            \"setup\": {\n                \"setup_heartbeat_data\": \"setup_heartbeat_data.csv\",\n                \"setup_sensor_data\": \"setup_sensor_data.json\",\n                \"schema_sensor_df\": \"schema_sensor_df.json\",\n            },\n            \"execute_sensor_heartbeat_results\": {\"new_events_available_count\": 0},\n            \"job_id\": \"2604918372561094\",\n            \"trigger_heartbeat_sensor_jobs_records\": {\n                \"heartbeat\": \"\"\"\n                    (\"delta_table\",\"dummy_order\",\"batch\",\n                    \"dummy_heartbeat_asset\",NULL,NULL,NULL,\n                    \"1015557820139870\",\"data-product_job_name_orders\",\"IN PROGRESS\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"true\")\"\"\",\n                \"sensors\": \"\"\"\n                    (\"dummy_order\",\n                    array(\"dummy_heartbeat_asset\"),\"ACQUIRED_NEW_DATA\",\n                    NULL,NULL,\"LOAD_DATE\",\"10155578201985\")\"\"\",\n            },\n        },\n    ],\n)\ndef test_heartbeat(scenario: dict) -> None:\n    \"\"\"Test the heartbeat feature.\n\n    Tests the heartbeat feature by validating the four core\n    functions invoked by the heartbeat algorithm.\n\n    Args:\n        scenario: The test scenario to execute.\n\n    Scenarios:\n        Default: A basic scenario that tests the four main steps of\n        the Heartbeat algorithm:\n            1. `execute_heartbeat_sensor_data_feed`: Loads a CSV file\n                into an empty Heartbeat control table.\n            2. `execute_sensor_heartbeat`: Simulates a Databricks job run.\n                The return value is patched to avoid actual API calls.\n            3. `update_heartbeat_sensor_status`: Updates values in the Heartbeat\n                and Sensor tables.\n            4. `trigger_heartbeat_sensor_jobs`: Triggers Databricks jobs.\n            This function is also patched to prevent real job execution.\n        Heartbeat_paused_sensor_new_record: Different state records that will\n        have different behaviour.\n            1. A record wih job_state = 'PAUSED' and  sensor_source = 'delta_table'\n                is inserted into the `heartbeat` table.\n                - Expected Behavior: No updates or changes throughout the test.\n            2. A record wih job_state = 'Null' and  sensor_source = 'sap_bw' is\n                inserted into heartbeat control table and sensor table.\n               - Expected Behavior: Record is updated during the process to\n                    reflect activity.\n            3. 
A record with job_state = 'COMPLETED' and sensor_source = 'kafka'\n                is inserted into heartbeat control table.\n               - Expected Behavior:\n                 - The record is updated during the process.\n                 - A corresponding entry is created in the `sensor` table.\n    \"\"\"\n    scenario_name = scenario[\"use_case_name\"]\n    _LOGGER.info(f\"Setting up Test - {scenario_name}.\")\n\n    tables = scenario[\"tables\"]\n    control_files = scenario[\"control_files\"]\n\n    heartbeat_control_table_name = f\"test_db.{tables['heartbeat_sensor_control_table']}\"\n    sensor_table_name = f\"test_db.{tables['sensor_table']}\"\n\n    acon = {\n        \"heartbeat_sensor_db_table\": heartbeat_control_table_name,\n        \"lakehouse_engine_sensor_db_table\": sensor_table_name,\n        \"data_format\": \"delta\",\n        \"sensor_source\": \"delta_table\",\n        \"token\": \"my-token\",\n        \"domain\": \"my-adidas-domain.cloud.databricks.com\",\n    }\n\n    _create_heartbeat_table(scenario_name, tables)\n\n    LocalStorage.copy_dir(\n        f\"{FEATURE_TEST_RESOURCES}/setup/{scenario_name}/data/\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/data/\",\n    )\n\n    LocalStorage.copy_dir(\n        f\"{FEATURE_TEST_RESOURCES}/control/{scenario_name}/data/\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario_name}/data/\",\n    )\n\n    setup_heartbeat_data_file_path = (\n        f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/data/\"\n        f\"{scenario['setup']['setup_heartbeat_data']}\"\n    )\n\n    ctrl_heart_tbl_heartb_feed_fname = control_files[\"ctrl_heart_tbl_heartb_feed_fname\"]\n    ctrl_heart_tbl_heartb_feed_file_path = (\n        f\"{TEST_LAKEHOUSE_CONTROL}/\"\n        f\"{scenario_name}/data/{ctrl_heart_tbl_heartb_feed_fname}\"\n    )\n\n    ctrl_heart_tbl_schema_file_name = control_files[\"ctrl_heart_tbl_schema_fname\"]\n    ctrl_heart_tbl_schema_file_path = (\n        f\"file:///{FEATURE_TEST_RESOURCES}/control/\"\n        f\"{scenario_name}/schema/{ctrl_heart_tbl_schema_file_name}\"\n    )\n\n    ctrl_heartbeat_df = DataframeHelpers.read_from_file(\n        ctrl_heart_tbl_heartb_feed_file_path,\n        schema=SchemaUtils.from_file_to_dict(ctrl_heart_tbl_schema_file_path),\n    )\n\n    _test_heartbeat_sensor_data_feed(\n        setup_heartbeat_data_file_path, heartbeat_control_table_name, ctrl_heartbeat_df\n    )\n\n    _LOGGER.info(\"Testing execute_sensor_heartbeat function\")\n\n    ctrl_heart_tbl_exe_sns_file_name = control_files[\"ctrl_heart_tbl_exe_sns_hb_fname\"]\n    ctrl_heart_tbl_exe_sns_file_path = (\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario_name}/\"\n        f\"data/{ctrl_heart_tbl_exe_sns_file_name}\"\n    )\n    ctrl_heart_tbl_exe_sns_df = DataframeHelpers.read_from_file(\n        ctrl_heart_tbl_exe_sns_file_path,\n        schema=SchemaUtils.from_file_to_dict(ctrl_heart_tbl_schema_file_path),\n    )\n\n    execute_sensor_results = scenario[\"execute_sensor_heartbeat_results\"]\n\n    _test_execute_sensor_heartbeat(\n        acon=acon,\n        heartbeat_control_table_name=heartbeat_control_table_name,\n        ctrl_heartbeat_df=ctrl_heart_tbl_exe_sns_df,\n        results=execute_sensor_results,\n    )\n\n    _LOGGER.info(\"Testing update_heartbeat_sensor_status function\")\n\n    sensor_df_schema = (\n        f\"file:///{FEATURE_TEST_RESOURCES}/setup/\"\n        f\"{scenario_name}/schema/{scenario['setup']['schema_sensor_df']}\"\n    )\n\n    ctrl_heart_table_upd = (\n        
f\"{FEATURE_TEST_RESOURCES}/control/{scenario_name}/\"\n        f\"data/{scenario['control_files']['ctrl_heart_tbl_updated_fname']}\"\n    )\n\n    setup_sensor_file_name = scenario[\"setup\"][\"setup_sensor_data\"]\n    sensor_table_data_path = (\n        f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/data/{setup_sensor_file_name}\"\n    )\n\n    ctrl_sensor_tbl_upd_status_fname = control_files[\"ctrl_sensor_tbl_upd_status_fname\"]\n    ctrl_sensor_upd_path = (\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario_name}/\"\n        f\"data/{ctrl_sensor_tbl_upd_status_fname}\"\n    )\n\n    sensors_data = DataframeHelpers.read_from_file(\n        sensor_table_data_path,\n        file_format=\"json\",\n        schema=SchemaUtils.from_file_to_dict(sensor_df_schema),\n    )\n\n    ctrl_sensor_upd_sensor_status_df = DataframeHelpers.read_from_file(\n        ctrl_sensor_upd_path,\n        file_format=\"json\",\n        schema=SchemaUtils.from_file_to_dict(sensor_df_schema),\n    )\n\n    ctrl_heart_tbl_df_upd_sns_status = DataframeHelpers.read_from_file(\n        ctrl_heart_table_upd,\n        schema=SchemaUtils.from_file_to_dict(ctrl_heart_tbl_schema_file_path),\n    )\n\n    sensors_data.write.format(\"delta\").mode(\"overwrite\").saveAsTable(sensor_table_name)\n\n    job_id = scenario[\"job_id\"]\n\n    _test_update_heartbeat_sensor_status(\n        heartbeat_control_table_name=heartbeat_control_table_name,\n        sensor_table_name=sensor_table_name,\n        job_id=job_id,\n        ctrl_heartbeat_df=ctrl_heart_tbl_df_upd_sns_status,\n        ctrl_sensor_df=ctrl_sensor_upd_sensor_status_df,\n    )\n\n    _LOGGER.info(\"Testing trigger_heartbeat_sensor_jobs function\")\n\n    _LOGGER.info(f\"acon: {acon}\")\n\n    _LOGGER.info(\"Preparing heartbeat and sensor table\")\n    records_to_insert = scenario[\"trigger_heartbeat_sensor_jobs_records\"]\n\n    ExecEnv.SESSION.sql(\n        f\"\"\"INSERT INTO {heartbeat_control_table_name}\n            VALUES {records_to_insert[\"heartbeat\"]}\"\"\"  # nosec\n    )\n    ExecEnv.SESSION.sql(\n        f\"\"\"INSERT INTO {sensor_table_name}\n        VALUES {records_to_insert[\"sensors\"]}\"\"\"  # nosec\n    )\n\n    ctrl_heart_tbl_trig_job_fname = control_files[\"ctrl_heart_tbl_trigger_job_fname\"]\n    ctrl_heart_tbl_trig_job_path = (\n        f\"file:///{FEATURE_TEST_RESOURCES}/control/\"\n        f\"{scenario_name}/data/{ctrl_heart_tbl_trig_job_fname}\"\n    )\n\n    ctrl_heartbeat_update_df = DataframeHelpers.read_from_file(\n        ctrl_heart_tbl_trig_job_path,\n        schema=SchemaUtils.from_file_to_dict(ctrl_heart_tbl_schema_file_path),\n    )\n\n    _trigger_heartbeat_sensor_jobs(\n        acon=acon,\n        heartbeat_control_table_name=heartbeat_control_table_name,\n        heartbeat_control_table_updated=ctrl_heartbeat_update_df,\n    )\n\n    for _, table_name in tables.items():\n        LocalStorage.clean_folder(f\"{LAKEHOUSE}{table_name}\")\n        ExecEnv.SESSION.sql(f\"\"\"DROP TABLE IF EXISTS test_db.{table_name}\"\"\")  # nosec\n"
  },
  {
    "path": "tests/feature/test_jdbc_reader.py",
    "content": "\"\"\"Test jdbc reader.\"\"\"\n\nfrom typing import List\n\nimport pytest\nfrom pyspark.sql.utils import IllegalArgumentException\n\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.transformers.exceptions import WrongArgumentsException\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"jdbc_reader\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_NAME}\"\n\"\"\"Same as spark, we provide two different ways to run jdbc reader.\n    We can use the jdbc() function, passing inside all the arguments\n        needed for Spark to work and we can even combine this with additional\n        options passed trough .options().\n    Other way is using .format(\"jdbc\") and pass all necessary arguments\n        through .options().\n    It's important to say by choosing jdbc() we can also add options() to the execution.\n\nJDBC Function Scenario - Description:\n    correct_arguments - we are providing jdbc_args and options by passing arguments\n        in a correct way.\n    wrong_arguments - we are providing jdbc_args and options, but wrong arguments are\n        filled to validate if spark reports the error messages properly.\nJDBC Format Scenario - Description:\n    correct_arguments - we are providing options to .format(jdbc) by passing arguments\n        in a correct way.\n    wrong_arguments - we are providing options to .format(jdbc), but wrong arguments are\n        filled to validate if spark reports the error messages properly.\n    predicates - predicates on spark read works on jdbc() function only, but if you\n        mistake and pass to .format(jdbc) as a option, spark won't show any error, so\n        we decided to add a validation and raise the error, this scenario validates it.\n\"\"\"\nTEST_SCENARIOS = [\n    [\"jdbc_function\", \"correct_arguments\"],\n    [\"jdbc_function\", \"wrong_arguments\"],\n    [\"jdbc_format\", \"correct_arguments\"],\n    [\"jdbc_format\", \"wrong_arguments\"],\n    [\"jdbc_format\", \"predicates\"],\n]\n\n\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS)\ndef test_jdbc_reader(scenario: List[str]) -> None:\n    \"\"\"Test loads from jdbc source.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    if scenario[0] == \"jdbc_format\" and scenario[1] == \"wrong_arguments\":\n        with pytest.raises(IllegalArgumentException, match=\"Option.*is required.\"):\n            load_data(\n                f\"file://{TEST_RESOURCES}/{scenario[0]}/{scenario[1]}/batch_init.json\"\n            )\n\n    elif scenario[0] == \"jdbc_format\" and scenario[1] == \"predicates\":\n        with pytest.raises(\n            WrongArgumentsException, match=\"Predicates can only be used with jdbc_args.\"\n        ):\n            load_data(\n                f\"file://{TEST_RESOURCES}/{scenario[0]}/{scenario[1]}/batch_init.json\"\n            )\n\n    elif scenario[0] == \"jdbc_function\" and scenario[1] == \"wrong_arguments\":\n        with pytest.raises(\n            TypeError, match=r\"jdbc\\(\\) got an unexpected keyword argument.*\"\n        ):\n            load_data(\n                
f\"file://{TEST_RESOURCES}/{scenario[0]}/{scenario[1]}/batch_init.json\"\n            )\n    else:\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/{scenario[0]}/{scenario[1]}/data/source/part-01.csv\",\n            f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}/data/\",\n        )\n\n        source_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}/data\"\n        )\n        DataframeHelpers.write_into_jdbc_table(\n            source_df,\n            f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/{scenario[0]}/{scenario[1]}/tests.db\",\n            f\"{scenario[0]}\",\n        )\n\n        load_data(\n            f\"file://{TEST_RESOURCES}/{scenario[0]}/{scenario[1]}/batch_init.json\"\n        )\n\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/{scenario[0]}/{scenario[1]}/data/control/part-01.csv\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/{scenario[1]}/data/\",\n        )\n\n        result_df = DataframeHelpers.read_from_table(f\"test_db.{scenario[0]}_table\")\n        control_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/{scenario[1]}/data\"\n        )\n\n        assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/test_materialize_cdf.py",
    "content": "\"\"\"Test materialize cdf to external location.\"\"\"\n\nfrom typing import Any\n\nimport pytest\nfrom delta.tables import DeltaTable\n\nfrom lakehouse_engine.core.definitions import InputFormat\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import load_data, manage_table\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"materialize_cdf\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\"scenario\", [\"streaming_with_cdf\"])\ndef test_streaming_with_cdf(scenario: str, caplog: Any) -> None:\n    \"\"\"Test materialize cdf function.\n\n    Args:\n        scenario: scenario name.\n        caplog: captured log.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/table/streaming_with_cdf.sql\",\n        f\"{TEST_LAKEHOUSE_IN}/data/table/\",\n    )\n    manage_table(f\"file://{TEST_RESOURCES}/acon_create_table.json\")\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/streaming_without_clean_cdf.json\"\n    )\n    load_data(acon=acon)\n\n    assert \"Writing CDF to external table...\" in caplog.text\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/part-01_cdf.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/control_schema.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/control_schema.json\",\n    )\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_CONTROL}/{scenario}/control_schema.json\"\n        ),\n    )\n\n    result_df_delta = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/cdf_data\",\n        file_format=InputFormat.DELTAFILES.value,\n    ).drop(\"_commit_timestamp\")\n\n    # once we are writing the cdf as delta, it can also be read as parquet.\n    # because the _commit_timestamp field is a partition field (comes from the folder),\n    # not from the parquet file, we need to enforce a schema where _commit_timestamp is\n    # a string, not an int (as automatically inferred from the folder by spark).\n    result_df_parquet = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/cdf_data\",\n        file_format=InputFormat.PARQUET.value,\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_CONTROL}/{scenario}/control_schema.json\"\n        ),\n    ).drop(\"_commit_timestamp\")\n\n    assert not DataframeHelpers.has_diff(result_df_delta, control_df)\n    assert not DataframeHelpers.has_diff(result_df_parquet, control_df)\n\n    # to be able to execute vacuum on expose cdf terminator spec it is\n    # necessary to update _commit_timestamp to an 
old value, for that we\n    # are enforcing the timestamp with the following delta commands.\n    delta_table = DeltaTable.forPath(\n        ExecEnv.SESSION,\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/cdf_data\",\n    )\n    delta_table.update(set={\"_commit_timestamp\": \"'20211105132711'\"})\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/part-02.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/streaming_with_clean_and_vacuum.json\"\n    )\n    load_data(acon=acon)\n\n    assert \"Writing CDF to external table...\" in caplog.text\n    assert \"Cleaning CDF table...\" in caplog.text\n    assert \"Vacuuming CDF table...\" in caplog.text\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/cdf_data\",\n        file_format=InputFormat.DELTAFILES.value,\n    )\n\n    assert result_df.count() == 6\n"
  },
  {
    "path": "tests/feature/test_notification.py",
    "content": "\"\"\"Mail notifications tests.\"\"\"\n\nimport re\nimport typing\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.engine import send_notification\nfrom lakehouse_engine.terminators.notifiers.email_notifier import EmailNotifier\nfrom lakehouse_engine.terminators.notifiers.exceptions import (\n    NotifierConfigException,\n    NotifierTemplateConfigException,\n    NotifierTemplateNotFoundException,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom tests.conftest import FEATURE_RESOURCES\nfrom tests.utils.smtp_server import SMTPServer\n\nLOGGER = LoggingHandler(__name__).get_logger()\nTEST_ATTACHEMENTS_PATH = FEATURE_RESOURCES + \"/notification/\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"Email Notification Template\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"template\": \"failure_notification_email\",\n                    \"from\": \"test-email@email.com\",\n                    \"cc\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                    \"mimetype\": \"text/text\",\n                    \"exception\": \"test-exception\",\n                },\n            ),\n            \"expected\": \"\"\"\n            Job local in workspace local has\n            failed with the exception: test-exception\"\"\",\n        },\n        {\n            \"name\": \"Email Notification Free Form\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"from\": \"test-email@email.com\",\n                    \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                    \"mimetype\": \"text/text\",\n                    \"subject\": \"Test Email\",\n                    \"message\": \"Test message for the email.\",\n                    \"attachments\": [\n                        f\"{TEST_ATTACHEMENTS_PATH}test_attachement.txt\",\n                        f\"{TEST_ATTACHEMENTS_PATH}test_image.png\",\n                    ],\n                },\n            ),\n            \"expected\": \"Test message for the email.\",\n            \"expected_attachments\": [\"test_attachement.txt\", \"test_image.png\"],\n        },\n        {\n            \"name\": \"Email Notification Free Form\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"from\": \"test-email@email.com\",\n                    \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                    \"mimetype\": \"text/html\",\n                    \"subject\": \"Test Email\",\n                    \"message\": \"\"\"<html><body>Test message.</body></html>\"\"\",\n                },\n            ),\n            \"expected\": \"<html><body>Test message.</body></html>\",\n        },\n        {\n            \"name\": \"Error: non-existent template\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n          
          \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"template\": \"missing_template\",\n                },\n            ),\n            \"expected\": \"Template missing_template does not exist\",\n        },\n        {\n            \"name\": \"Error: malformed definition\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"from\": \"test-email@email.com\",\n                    \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                },\n            ),\n            \"expected\": \"Malformed Notification Definition\",\n        },\n        {\n            \"name\": \"Error: Using disallowed smtp server\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"smtp.test.com\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"from\": \"test-email@email.com\",\n                    \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                    \"mimetype\": \"text/text\",\n                    \"subject\": \"Test Email\",\n                    \"message\": \"Test message for the email.\",\n                },\n            ),\n            \"expected\": \"Trying to use disallowed smtp server: \"\n            \"'smtp.test.com'.\\n\"\n            \"Disallowed smtp servers: ['smtp.test.com']\",\n        },\n    ],\n)\ndef test_email_notification(scenario: dict) -> None:\n    \"\"\"Testing send email notification with template.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    spec: TerminatorSpec = scenario[\"spec\"]\n    name = scenario[\"name\"]\n    expected_output = scenario[\"expected\"]\n\n    notification_type = spec.args[\"type\"]\n\n    LOGGER.info(f\"Executing notification test: {name}\")\n\n    if notification_type == \"email\":\n        port = spec.args[\"port\"]\n        server = spec.args[\"server\"]\n\n        email_notifier = EmailNotifier(spec)\n\n        if \"Error: \" in name:\n            with pytest.raises(\n                (\n                    NotifierTemplateNotFoundException,\n                    NotifierConfigException,\n                    NotifierTemplateConfigException,\n                )\n            ) as e:\n                email_notifier.create_notification()\n                email_notifier.send_notification()\n            assert expected_output in str(e.value)\n        else:\n            smtp_server = SMTPServer(server, port)\n            smtp_server.start()\n\n            email_notifier.create_notification()\n            email_notifier.send_notification()\n            (\n                email_from,\n                email_to,\n                email_cc,\n                email_bcc,\n                mimetype,\n                subject,\n                message,\n                attachments,\n            ) = _parse_email_output(smtp_server.get_last_message().as_string())\n\n            assert email_from == spec.args[\"from\"]\n            if \"to\" in spec.args:\n                assert email_to == spec.args[\"to\"]\n            if \"cc\" in spec.args:\n                assert email_cc == spec.args[\"cc\"]\n            if \"bcc\" in spec.args:\n                assert email_bcc == 
spec.args[\"bcc\"]\n            assert mimetype == spec.args[\"mimetype\"]\n            assert subject == spec.args[\"subject\"]\n            assert message == expected_output\n            assert attachments == scenario.get(\"expected_attachments\", [])\n\n            smtp_server.stop()\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"Email Notification Template\",\n            \"args\": {\n                \"server\": \"localhost\",\n                \"port\": \"1025\",\n                \"type\": \"email\",\n                \"template\": \"failure_notification_email\",\n                \"from\": \"test-email@email.com\",\n                \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                \"cc\": [\"test-email3@email.com\", \"test-email4@email.com\"],\n                \"exception\": \"test-exception\",\n            },\n            \"expected\": \"\"\"\n            Job local in workspace local has\n            failed with the exception: test-exception\"\"\",\n        },\n        {\n            \"name\": \"Email Notification Free Form\",\n            \"args\": {\n                \"server\": \"localhost\",\n                \"port\": \"1025\",\n                \"type\": \"email\",\n                \"from\": \"test-email@email.com\",\n                \"bcc\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                \"mimetype\": \"text/text\",\n                \"subject\": \"Test Email\",\n                \"message\": \"Test message for the email.\",\n                \"attachments\": [\n                    f\"{TEST_ATTACHEMENTS_PATH}test_attachement.txt\",\n                    f\"{TEST_ATTACHEMENTS_PATH}test_image.png\",\n                ],\n            },\n            \"expected\": \"Test message for the email.\",\n            \"expected_attachments\": [\"test_attachement.txt\", \"test_image.png\"],\n        },\n        {\n            \"name\": \"Error: non-existent template\",\n            \"args\": {\n                \"server\": \"localhost\",\n                \"port\": \"1025\",\n                \"type\": \"email\",\n                \"template\": \"missing_template\",\n            },\n            \"expected\": \"Template missing_template does not exist\",\n        },\n        {\n            \"name\": \"Error: Malformed Notification Definition\",\n            \"args\": {\n                \"server\": \"localhost\",\n                \"port\": \"1025\",\n                \"type\": \"email\",\n                \"from\": \"test-email@email.com\",\n                \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n            },\n            \"expected\": \"Malformed Notification Definition\",\n        },\n        {\n            \"name\": \"Error: Using disallowed smtp server\",\n            \"args\": {\n                \"server\": \"smtp.test.com\",\n                \"port\": \"1025\",\n                \"type\": \"email\",\n                \"from\": \"test-email@email.com\",\n                \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                \"mimetype\": \"plain\",\n                \"subject\": \"Test Email\",\n                \"message\": \"Test message for the email.\",\n            },\n            \"expected\": \"Trying to use disallowed smtp server: \"\n            \"'smtp.test.com'.\\n\"\n            \"Disallowed smtp servers: ['smtp.test.com']\",\n        },\n    ],\n)\ndef test_email_notification_facade(scenario: dict) -> None:\n    \"\"\"Testing send email 
notification with template.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    args = scenario[\"args\"]\n    name = scenario[\"name\"]\n    expected_output = scenario[\"expected\"]\n\n    notification_type = args[\"type\"]\n\n    LOGGER.info(f\"Executing notification test: {name}\")\n\n    if notification_type == \"email\":\n        port = args[\"port\"]\n        server = args[\"server\"]\n\n        if \"Error: \" in name:\n            with pytest.raises(\n                (\n                    NotifierTemplateNotFoundException,\n                    NotifierConfigException,\n                    NotifierTemplateConfigException,\n                )\n            ) as e:\n                send_notification(args=args)\n            assert expected_output in str(e.value)\n        else:\n            smtp_server = SMTPServer(server, port)\n            smtp_server.start()\n\n            send_notification(args=args)\n            (\n                email_from,\n                email_to,\n                email_cc,\n                email_bcc,\n                mimetype,\n                subject,\n                message,\n                attachments,\n            ) = _parse_email_output(smtp_server.get_last_message().as_string())\n\n            assert email_from == args[\"from\"]\n            if \"to\" in args:\n                assert email_to == args[\"to\"]\n            if \"cc\" in args:\n                assert email_cc == args[\"cc\"]\n            if \"bcc\" in args:\n                assert email_bcc == args[\"bcc\"]\n            assert mimetype == args[\"mimetype\"]\n            assert subject == args[\"subject\"]\n            assert message == expected_output\n            assert attachments == scenario.get(\"expected_attachments\", [])\n\n            smtp_server.stop()\n\n\ndef _parse_email_output(\n    mail_content: str,\n) -> typing.Tuple[str, list, list, list, str, str, str, list]:\n    \"\"\"Parse the mail that was received in the debug smtp server.\n\n    Args:\n        mail_content: The raw mail content.\n\n    Returns:\n        A tuple with the email from, to, cc, bcc, mimetype, subject,\n        message and attachments.\n    \"\"\"\n    email_from = re.search(\"(?<=From: ).*\", mail_content).group()\n    email_to = re.search(\"(?<=To: ).*\", mail_content).group().split(\", \")\n    email_cc = re.search(\"(?<=CC: ).*\", mail_content).group().split(\", \")\n    email_bcc = re.search(\"(?<=BCC: ).*\", mail_content).group().split(\", \")\n    mimetype = re.search(\"(?<=Content-Type: ).*(?=; charset)\", mail_content).group()\n    subject = re.search(\"(?<=Subject: ).*\", mail_content).group()\n    message = re.search(\"(?<=bit\\n).*?(?=--=)\", mail_content, re.S).group()[1:-1]\n    attachments = re.findall(\"\"\"(?<=filename=\").*(?=\")\"\"\", mail_content)\n\n    return (\n        email_from,\n        email_to,\n        email_cc,\n        email_bcc,\n        mimetype,\n        subject,\n        message,\n        attachments,\n    )\n"
  },
  {
    "path": "tests/feature/test_reconciliation.py",
    "content": "\"\"\"Test reconciliation.\"\"\"\n\nfrom typing import Any, List, Union\n\nimport pytest\n\nfrom lakehouse_engine.algorithms.exceptions import ReconciliationFailedException\nfrom lakehouse_engine.algorithms.reconciliator import ReconciliationType\nfrom lakehouse_engine.engine import execute_reconciliation\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"reconciliation\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\nACON_WITH_QUERIES = {\n    \"metrics\": [\n        {\n            \"metric\": \"net_sales\",\n            \"type\": \"absolute\",\n            \"aggregation\": \"sum\",\n            \"yellow\": 0.05,\n            \"red\": 0.1,\n        },\n        {\n            \"metric\": \"net_sales\",\n            \"type\": \"percentage\",\n            \"aggregation\": \"avg\",\n            \"yellow\": 0.04,\n            \"red\": 0.08,\n        },\n    ],\n    \"truth_input_spec\": {\n        \"spec_id\": \"truth\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"json\",\n        \"options\": {\"multiline\": \"true\"},\n        \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n        \"reconciliation/data/truth.json\",\n    },\n    \"truth_preprocess_query\": \"\"\"\n        SELECT country, sum(net_sales) as net_sales\n        FROM truth\n        GROUP BY country\n    \"\"\",\n    \"truth_preprocess_query_args\": [\n        {\n            \"function\": \"persist\",\n            \"args\": {\"storage_level\": \"MEMORY_AND_DISK_DESER\"},\n        }\n    ],\n    \"current_input_spec\": {\n        \"spec_id\": \"current_results\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"json\",\n        \"options\": {\"multiline\": \"true\"},\n        \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n        \"reconciliation/data/current.json\",\n    },\n    \"current_preprocess_query\": \"\"\"\n        SELECT country, sum(net_sales) as net_sales\n        FROM current\n        GROUP BY country\n    \"\"\",\n    \"current_preprocess_query_args\": [\n        {\n            \"function\": \"persist\",\n            \"args\": {\"storage_level\": \"MEMORY_AND_DISK\"},\n        }\n    ],\n}\n\nACON_WITHOUT_QUERIES = {\n    \"metrics\": [\n        {\n            \"metric\": \"net_sales\",\n            \"type\": \"absolute\",\n            \"aggregation\": \"sum\",\n            \"yellow\": 0.01,\n            \"red\": 0.05,\n        },\n        {\n            \"metric\": \"net_sales\",\n            \"type\": \"absolute\",\n            \"aggregation\": \"avg\",\n            \"yellow\": 0.04,\n            \"red\": 0.08,\n        },\n    ],\n    \"truth_input_spec\": {\n        \"spec_id\": \"truth\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"json\",\n        \"options\": {\"multiline\": \"true\"},\n        \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n        \"reconciliation/data/truth.json\",\n    },\n    \"truth_preprocess_query_args\": [{\"function\": \"cache\"}],\n    \"current_input_spec\": {\n        \"spec_id\": \"current_results\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"json\",\n        \"options\": {\"multiline\": 
\"true\"},\n        \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n        \"reconciliation/data/current.json\",\n    },\n    \"current_preprocess_query_args\": [],  # turn cache off as it is a default\n}\n\nACON_WITH_QUERIES_EMPTY_DF_TRUE_CHECK = {\n    \"metrics\": [\n        {\n            \"metric\": \"net_sales\",\n            \"type\": \"absolute\",\n            \"aggregation\": \"sum\",\n            \"yellow\": 0.05,\n            \"red\": 0.1,\n        },\n        {\n            \"metric\": \"net_sales\",\n            \"type\": \"percentage\",\n            \"aggregation\": \"avg\",\n            \"yellow\": 0.04,\n            \"red\": 0.08,\n        },\n    ],\n    \"truth_input_spec\": {\n        \"spec_id\": \"truth\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"json\",\n        \"options\": {\"multiline\": \"true\"},\n        \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n        \"reconciliation/data/truth.json\",\n    },\n    \"truth_preprocess_query\": \"\"\"\n        SELECT country, sum(net_sales) as net_sales\n        FROM truth where 1 = 0\n        group by country\n    \"\"\",\n    \"truth_preprocess_query_args\": [\n        {\n            \"function\": \"persist\",\n            \"args\": {\"storage_level\": \"MEMORY_AND_DISK_DESER\"},\n        }\n    ],\n    \"current_input_spec\": {\n        \"spec_id\": \"current_results\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"json\",\n        \"options\": {\"multiline\": \"true\"},\n        \"location\": \"file:///app/tests/lakehouse\"\n        \"/in/feature/reconciliation/data/current.json\",\n    },\n    \"current_preprocess_query\": \"\"\"\n        SELECT country, sum(net_sales) as net_sales\n        FROM current\n        WHERE 1 = 0\n        group by country\n    \"\"\",\n    \"current_preprocess_query_args\": [\n        {\n            \"function\": \"persist\",\n            \"args\": {\"storage_level\": \"MEMORY_AND_DISK\"},\n        }\n    ],\n    \"ignore_empty_df\": True,\n}\n\nACON_WITH_QUERIES_EMPTY_DF_FALSE_CHECK = {\n    \"metrics\": [\n        {\n            \"metric\": \"net_sales\",\n            \"type\": \"absolute\",\n            \"aggregation\": \"sum\",\n            \"yellow\": 0.05,\n            \"red\": 0.1,\n        },\n        {\n            \"metric\": \"net_sales\",\n            \"type\": \"percentage\",\n            \"aggregation\": \"avg\",\n            \"yellow\": 0.04,\n            \"red\": 0.08,\n        },\n    ],\n    \"truth_input_spec\": {\n        \"spec_id\": \"truth\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"json\",\n        \"options\": {\"multiline\": \"true\"},\n        \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n        \"reconciliation/data/truth.json\",\n    },\n    \"truth_preprocess_query\": \"\"\"\n        SELECT country, sum(net_sales) as net_sales\n        FROM truth where 1 = 0\n        group by country\n    \"\"\",\n    \"truth_preprocess_query_args\": [\n        {\n            \"function\": \"persist\",\n            \"args\": {\"storage_level\": \"MEMORY_AND_DISK_DESER\"},\n        }\n    ],\n    \"current_input_spec\": {\n        \"spec_id\": \"current_results\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"json\",\n        \"options\": {\"multiline\": \"true\"},\n        \"location\": \"file:///app/tests/lakehouse/in/feature/\"\n        \"reconciliation/data/current.json\",\n    },\n    \"current_preprocess_query\": \"\"\"\n        SELECT country, sum(net_sales) 
as net_sales\n        FROM current\n        WHERE 1 = 0\n        group by country\n    \"\"\",\n    \"current_preprocess_query_args\": [\n        {\n            \"function\": \"persist\",\n            \"args\": {\"storage_level\": \"MEMORY_AND_DISK\"},\n        }\n    ],\n    \"ignore_empty_df\": False,\n}\n\n\nACONS = {\n    \"with_queries_pct\": ACON_WITH_QUERIES,\n    \"with_files_abs\": ACON_WITHOUT_QUERIES,\n    \"failed_reconciliation_pct\": ACON_WITH_QUERIES,\n    \"empty_truth\": ACON_WITHOUT_QUERIES,\n    \"different_rows\": ACON_WITHOUT_QUERIES,\n    \"empty_df_true_check\": ACON_WITH_QUERIES_EMPTY_DF_TRUE_CHECK,\n    \"empty_df_false_check\": ACON_WITH_QUERIES_EMPTY_DF_FALSE_CHECK,\n}\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\n            \"with_queries_pct\",\n            \"current.json\",\n            \"truth.json\",\n            None,\n            \"The Reconciliation process has succeeded.\",\n        ],\n        [\n            \"with_files_abs\",\n            \"current.json\",\n            \"truth.json\",\n            None,\n            \"The Reconciliation process has succeeded.\",\n        ],\n        [\n            \"failed_reconciliation_pct\",\n            \"current_fail.json\",\n            \"truth.json\",\n            \"Reconciliation result: {'net_sales_absolute_diff_sum': 100.0, \"\n            \"'net_sales_percentage_diff_avg': 0.0625}\",\n            \"The Reconciliation process has failed with status: red.\",\n        ],\n        [\n            \"empty_truth\",\n            \"current.json\",\n            \"truth_empty.json\",\n            None,\n            \"The reconciliation has failed because either the truth dataset or the \"\n            \"current results dataset was empty.\",\n        ],\n        [\n            \"different_rows\",\n            \"current_different_rows.json\",\n            \"truth_different_rows.json\",\n            \"Reconciliation result: {'net_sales_absolute_diff_sum': 500.0, \"\n            \"'net_sales_absolute_diff_avg': 100.0}\",\n            \"The Reconciliation process has failed with status: red.\",\n        ],\n        [\n            \"empty_df_true_check\",\n            \"current.json\",\n            \"truth.json\",\n            None,\n            \"The Reconciliation process has succeeded.\",\n        ],\n        [\n            \"empty_df_false_check\",\n            \"current.json\",\n            \"truth.json\",\n            None,\n            \"The reconciliation has failed because either the truth dataset or the \"\n            \"current results dataset was empty.\",\n        ],\n    ],\n)\ndef test_reconciliation(scenario: str, caplog: Any) -> None:\n    \"\"\"Test reconciliation.\n\n    Args:\n        scenario: scenario to test.\n             with_queries - uses queries to get the truth data and the current data.\n                Reconciliation type is percentage.\n             with_files - uses files for the truth data and query for the current data.\n                Reconciliation type is absolute.\n             failed_reconciliation - same as 'with_queries' but with a failed\n                reconciliation. 
Reconciliation type is percentage.\n             empty_truth - scenario in which the truth data is empty.\n             different_rows - the truth dataset and current results dataset have\n                different rows, therefore reconciliation should fail.\n        caplog: captured log.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/*.json\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n\n    acon = ACONS[scenario[0]]\n    acon[\"current_input_spec\"][  # type: ignore\n        \"location\"\n    ] = f\"file:///app/tests/lakehouse/in/feature/reconciliation/data/{scenario[1]}\"\n    acon[\"truth_input_spec\"][  # type: ignore\n        \"location\"\n    ] = f\"file:///app/tests/lakehouse/in/feature/reconciliation/data/{scenario[2]}\"\n\n    if scenario[0] in [\n        \"failed_reconciliation_pct\",\n        \"empty_truth\",\n        \"different_rows\",\n        \"empty_df_false_check\",\n    ]:\n        with pytest.raises(ReconciliationFailedException) as e:\n            execute_reconciliation(acon=acon)  # type: ignore\n        if scenario[3]:\n            assert scenario[3] in caplog.text\n        assert str(e.value) == scenario[4]\n    else:\n        execute_reconciliation(acon=acon)  # type: ignore\n        assert scenario[4] in caplog.text\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\n            \"pass\",\n            ReconciliationType.PCT.value,\n            0.05,\n            0.1,\n            \"current_nulls_and_zeros\",\n            \"truth_nulls_and_zeros\",\n            \"Reconciliation result: {'net_sales_percentage_diff_sum': 0.0, \"\n            \"'net_sales_percentage_diff_avg': 0.0}\",\n            \"The Reconciliation process has succeeded.\",\n        ],\n        [\n            \"fail_if_threshold_zero\",\n            ReconciliationType.PCT.value,\n            0,\n            0,\n            \"current_nulls_and_zeros_fail\",\n            \"truth_nulls_and_zeros_fail\",\n            \"Reconciliation result: {'net_sales_percentage_diff_sum': 1.0, \"\n            \"'net_sales_percentage_diff_avg': 0.3333333333333333}\",\n            \"The Reconciliation process has failed with status: red.\",\n        ],\n        [\n            \"fail_null_is_not_zero\",\n            ReconciliationType.PCT.value,\n            0.05,\n            0.1,\n            \"current_nulls_and_zeros_fail\",\n            \"truth_nulls_and_zeros_fail\",\n            \"Reconciliation result: {'net_sales_percentage_diff_sum': 1.0, \"\n            \"'net_sales_percentage_diff_avg': 0.3333333333333333}\",\n            \"The Reconciliation process has failed with status: red.\",\n        ],\n    ],\n)\ndef test_nulls_and_zero_values_and_threshold(\n    scenario: List[Union[str, float]], caplog: Any\n) -> None:\n    \"\"\"Test truth and current datasets with nulls and zeros.\n\n    Args:\n        scenario: scenario to test.\n            pass - reconciliation should pass even if there are 0s and nulls in the\n                truth and current datasets.\n            fail_if_threshold_zero - reconciliation should fail if users pass 0 as\n                threshold as of course 0 indicates there's no difference. 
When 0 is the\n                threshold, any difference at all makes the reconciliation fail.\n             fail_null_is_not_zero - reconciliation should fail when the first record\n                of the current data has a 0 and the corresponding row of the truth\n                data has a null, because the recon algorithm treats that as a\n                percentage difference of 1; the reconciliation should report those\n                differences properly instead of assuming that 0 is equal to null.\n        caplog: captured log.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/*.json\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n\n    acon = ACON_WITHOUT_QUERIES\n    acon[\"current_input_spec\"][\"location\"] = (  # type: ignore\n        f\"file:///app/tests/\"\n        f\"lakehouse/in/feature/reconciliation/data/{scenario[4]}.json\"\n    )\n    acon[\"truth_input_spec\"][\"location\"] = (  # type: ignore\n        f\"file:///app/tests/\"\n        f\"lakehouse/in/feature/reconciliation/data/{scenario[5]}.json\"\n    )\n    acon[\"metrics\"][0][\"type\"] = scenario[1]  # type: ignore\n    acon[\"metrics\"][0][\"yellow\"] = scenario[2]  # type: ignore\n    acon[\"metrics\"][0][\"red\"] = scenario[3]  # type: ignore\n    acon[\"metrics\"][1][\"type\"] = scenario[1]  # type: ignore\n    acon[\"metrics\"][1][\"yellow\"] = scenario[2]  # type: ignore\n    acon[\"metrics\"][1][\"red\"] = scenario[3]  # type: ignore\n\n    if scenario[0] in [\"fail_null_is_not_zero\", \"fail_if_threshold_zero\"]:\n        with pytest.raises(ReconciliationFailedException) as e:\n            execute_reconciliation(acon=acon)\n        assert scenario[6] in caplog.text\n        assert str(e.value) == scenario[7]\n    else:\n        execute_reconciliation(acon=acon)\n        assert scenario[6] in caplog.text\n"
  },
  {
    "path": "tests/feature/test_schema_evolution.py",
    "content": "\"\"\"Test schema evolution on delta loads.\"\"\"\n\nfrom typing import Generator\n\nimport pytest\nfrom pyspark.sql.utils import AnalysisException\n\nfrom lakehouse_engine.core.definitions import InputFormat\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"schema_evolution\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.fixture(autouse=True)\ndef prepare_tests() -> Generator:\n    \"\"\"Run setup and cleanup steps before/after each test scenario.\"\"\"\n    # Test setup\n    yield\n    # Test cleanup\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_IN}\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}\")\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\n            \"auto_merge_enabled_add_column\",\n            \"part-02\",\n            \"batch_delta_enabled\",\n            \"control_schema_add_column\",\n        ],\n        [\n            \"auto_merge_disabled_add_column\",\n            \"part-02\",\n            \"batch_delta_disabled\",\n            \"control_schema_add_column\",\n        ],\n        [\n            \"auto_merge_enabled_remove_column\",\n            \"part-03\",\n            \"batch_delta_enabled\",\n            \"control_schema\",\n        ],\n        [\n            \"auto_merge_disabled_remove_column\",\n            \"part-03\",\n            \"batch_delta_disabled\",\n            \"control_schema\",\n            \"customer\",\n        ],\n        [\n            \"auto_merge_enabled_cast_column\",\n            \"part-04\",\n            \"batch_delta_enabled\",\n            \"control_schema\",\n        ],\n        [\n            \"auto_merge_disabled_cast_column\",\n            \"part-04\",\n            \"batch_delta_disabled\",\n            \"control_schema\",\n        ],\n        [\n            \"auto_merge_enabled_rename_column_file\",\n            \"part-05\",\n            \"batch_delta_enabled\",\n            \"control_schema_rename\",\n        ],\n        [\n            \"auto_merge_disabled_rename_column_file\",\n            \"part-05\",\n            \"batch_delta_disabled\",\n            \"control_schema_rename\",\n            \"request\",\n        ],\n        [\n            \"auto_merge_enabled_rename_column_transform\",\n            \"part-06\",\n            \"batch_delta_enabled\",\n            \"control_schema\",\n        ],\n        [\n            \"auto_merge_disabled_rename_column_transform\",\n            \"part-06\",\n            \"batch_delta_disabled_rename\",\n            \"control_schema\",\n            \"ARTICLE\",\n        ],\n    ],\n)\ndef test_schema_evolution_delta_load(scenario: str) -> None:\n    \"\"\"Test schema evolution on delta loads.\n\n    Args:\n        scenario: scenario to test.\n        auto_merge_enabled_add_column - it performs the merge successfully and\n        the new column is added to the schema (older rows assume null 
value\n        for this column)\n        auto_merge_disabled_add_column - it performs the merge successfully\n        but the new column is ignored (is not added to the final schema).\n        auto_merge_enabled_remove_column - it performs the merge successfully,\n        the column is not removed from the final schema and the new rows assume\n        the value null for this column.\n        auto_merge_disabled_remove_column - purposely checks that the delta\n        load fails when a column is removed.\n        auto_merge_enabled_cast_column - it performs the merge successfully\n        but the column type does not change automatically in the final schema.\n        auto_merge_disabled_cast_column - it performs the merge successfully\n        but the column type does not change automatically in the final schema.\n        auto_merge_enabled_rename_column_file - it performs the merge\n        successfully but assumes the renamed column as a new column (the\n        column is renamed in the source schema only).\n        auto_merge_disabled_rename_column_file - purposely checks that the\n        delta load fails when a column is renamed (the column is renamed in\n        the source schema only).\n        auto_merge_enabled_rename_column_transform - it performs the merge\n        successfully but ignores the renaming transformation specified in\n        the acon.\n        auto_merge_disabled_rename_column_transform - checks the behavior\n        of the delta load when a column is renamed to lowercase,\n        based on a transformation specified in the acon, without spark\n        case-sensitive property.\n    Scenario Properties:\n        [scenario name, input file, acon file, control schema file,\n        error message excerpt (optional)]\n    \"\"\"\n    _create_table(\"schema_evolution_delta_load\", \"delta_load\")\n\n    # initial load\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/delta_load/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/delta_load/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/delta_load/schema/source/source_part-01_schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/delta_load/\",\n    )\n    load_data(\n        f\"file://{TEST_RESOURCES}/delta_load/batch_init_\"\n        f\"{'enabled' if 'enabled' in scenario[0] else 'disabled'}.json\"\n    )\n\n    initial_schema = DataframeHelpers.read_from_table(\n        \"test_db.schema_evolution_delta_load\"\n    ).schema\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/delta_load/data/source/{scenario[1]}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/delta_load/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/delta_load/schema/source/source_{scenario[1]}_schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/delta_load/source_delta_schema.json\",\n    )\n\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/delta_load/{scenario[2]}.json\"\n    )\n\n    # tests with schema auto merge enabled\n    if (\n        \"enabled\" in scenario[0]\n        or scenario[0] == \"auto_merge_disabled_rename_column_transform\"\n    ):\n        load_data(acon=acon)\n\n        result_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_OUT}/delta_load/data\",\n            file_format=InputFormat.DELTAFILES.value,\n        )\n        schema_after_merge = DataframeHelpers.read_from_table(\n            \"test_db.schema_evolution_delta_load\"\n        ).schema\n\n        LocalStorage.copy_file(\n            
f\"{TEST_RESOURCES}/delta_load/data/control/{scenario[1]}.csv\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/delta_load/data/\",\n        )\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/delta_load/schema/control/{scenario[3]}.json\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/delta_load/\",\n        )\n        control_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_CONTROL}/delta_load/data/{scenario[1]}.csv\",\n            schema=SchemaUtils.from_file_to_dict(\n                f\"file://{TEST_LAKEHOUSE_CONTROL}/delta_load/{scenario[3]}.json\"\n            ),\n        )\n\n        # for the cast and rename tests, based on the transformations\n        # specified in the acon file, the schema changes are ignored\n        if scenario[0] == \"auto_merge_enabled_cast_column\" or scenario[0] == (\n            \"auto_merge_enabled_rename_column_transform\"\n        ):\n            assert initial_schema == schema_after_merge\n        else:\n            assert not DataframeHelpers.has_diff(result_df, control_df)\n\n    # tests with schema auto merge disabled\n    elif \"disabled\" in scenario[0]:\n        # for \"add column\" and \"cast column\" tests the merge runs successfully\n        # but the schema changes are ignored\n        if \"add\" in scenario[0] or \"cast\" in scenario[0]:\n            load_data(acon=acon)\n\n            result_df = DataframeHelpers.read_from_file(\n                f\"{TEST_LAKEHOUSE_OUT}/delta_load/data\",\n                file_format=InputFormat.DELTAFILES.value,\n            )\n\n            if scenario[0] == \"auto_merge_disabled_add_column\":\n                assert \"new_column\" not in result_df.columns\n            else:\n                assert not isinstance(result_df[\"code\"], str)\n        # for the removing column tests, the merge throws an error\n        else:\n            with pytest.raises(\n                AnalysisException,\n                match=f\".*Cannot resolve {scenario[4]} in UPDATE clause given.*\",\n            ):\n                load_data(acon=acon)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\n            \"auto_merge_enabled_add_column\",\n            \"part-02\",\n            \"batch_append_enabled\",\n            \"control_schema_add_column\",\n        ],\n        [\n            \"auto_merge_disabled_add_column\",\n            \"part-02\",\n            \"batch_append_disabled\",\n            \"control_schema_add_column\",\n            \"A schema mismatch detected when writing to the Delta table\",\n        ],\n        [\n            \"auto_merge_enabled_remove_column\",\n            \"part-03\",\n            \"batch_append_enabled\",\n            \"control_schema\",\n        ],\n        [\n            \"auto_merge_disabled_remove_column\",\n            \"part-03\",\n            \"batch_append_disabled\",\n            \"control_schema\",\n        ],\n        [\n            \"auto_merge_enabled_cast_column\",\n            \"part-04\",\n            \"batch_append_enabled_cast\",\n            \"control_schema\",\n            \"Failed to merge fields\",\n        ],\n        [\n            \"auto_merge_disabled_cast_column\",\n            \"part-04\",\n            \"batch_append_disabled\",\n            \"control_schema\",\n        ],\n        [\n            \"auto_merge_enabled_rename_column_file\",\n            \"part-05\",\n            \"batch_append_enabled\",\n            \"control_schema_rename\",\n        ],\n        [\n            \"auto_merge_disabled_rename_column_file\",\n  
          \"part-05\",\n            \"batch_append_disabled\",\n            \"control_schema_rename\",\n            \"A schema mismatch detected\",\n        ],\n        [\n            \"auto_merge_enabled_rename_column_transform\",\n            \"part-06\",\n            \"batch_append_enabled\",\n            \"control_schema\",\n        ],\n        [\n            \"auto_merge_disabled_rename_column_transform\",\n            \"part-06\",\n            \"batch_append_disabled\",\n            \"control_schema\",\n        ],\n    ],\n)\ndef test_schema_evolution_append_load(scenario: str) -> None:\n    \"\"\"Test schema evolution on append loads.\n\n    Args:\n        scenario: scenario to test.\n        auto_merge_enabled_add_column - it performs the append load successfully\n        and the new column is added to the schema (older rows assume null value\n        for this column)\n        auto_merge_disabled_add_column - purposely checks that the append load\n        fails when a new column is added.\n        auto_merge_enabled_remove_column - it performs the append load\n        successfully, the column is not removed from the final schema and the\n        new rows assume the value null for this column.\n        auto_merge_disabled_remove_column - it performs the append load\n        successfully, the column is not removed from the final schema and the\n        new rows assume the value null for this column.\n        auto_merge_enabled_cast_column - purposely checks that the append load\n        fails when a cast transformation is added to the acon file.\n        auto_merge_disabled_cast_column - purposely checks that the append load\n        fails when a cast transformation is added to the acon file.\n        auto_merge_enabled_rename_column_file - purposely checks that the\n        append load fails when a column is renamed (the column is renamed in\n        the source schema only).\n        auto_merge_disabled_rename_column_file - purposely checks that the\n        append load fails when a column is renamed (the column is renamed in\n        the source schema only).\n        auto_merge_enabled_rename_column_transform - it performs the append load\n        successfully but ignores the renaming transformation specified in\n        the acon.\n        auto_merge_disabled_rename_column_transform - it performs the append load\n        successfully but ignores the renaming transformation specified in\n        the acon.\n    Scenario Properties:\n        [scenario name, input file, acon file, control schema file,\n        error message excerpt (optional)]\n    \"\"\"\n    _create_table(\"schema_evolution_append_load\", \"append_load\")\n\n    # initial load\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/append_load/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/append_load/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/append_load/schema/source/source_part-01_schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/append_load/\",\n    )\n    load_data(\n        f\"file://{TEST_RESOURCES}/append_load/batch_init_\"\n        f\"{'enabled' if 'enabled' in scenario[0] else 'disabled'}.json\"\n    )\n\n    initial_schema = DataframeHelpers.read_from_table(\n        \"test_db.schema_evolution_append_load\"\n    ).schema\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/append_load/data/source/{scenario[1]}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/append_load/data/\",\n    )\n    LocalStorage.copy_file(\n        
f\"{TEST_RESOURCES}/append_load/schema/source/source_{scenario[1]}_schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/append_load/source_append_schema.json\",\n    )\n\n    # tests with schema auto merge enabled\n    if \"enabled\" in scenario[0]:\n        # for the cast column test, the append throws an error\n        acon = ConfigUtils.get_acon(\n            f\"file://{TEST_RESOURCES}/append_load/{scenario[2]}.json\"\n        )\n        if \"cast\" in scenario[0]:\n            with pytest.raises(AnalysisException, match=f\".*{scenario[4]}*\"):\n                load_data(acon=acon)\n        else:\n            load_data(acon=acon)\n\n            result_df = DataframeHelpers.read_from_file(\n                f\"{TEST_LAKEHOUSE_OUT}/append_load/data\",\n                file_format=InputFormat.DELTAFILES.value,\n            )\n            schema_after_append = DataframeHelpers.read_from_table(\n                \"test_db.schema_evolution_append_load\"\n            ).schema\n\n            LocalStorage.copy_file(\n                f\"{TEST_RESOURCES}/append_load/data/control/{scenario[1]}.csv\",\n                f\"{TEST_LAKEHOUSE_CONTROL}/append_load/data/\",\n            )\n            LocalStorage.copy_file(\n                f\"{TEST_RESOURCES}/append_load/schema/control/{scenario[3]}.json\",\n                f\"{TEST_LAKEHOUSE_CONTROL}/append_load/\",\n            )\n            control_df = DataframeHelpers.read_from_file(\n                f\"{TEST_LAKEHOUSE_CONTROL}/append_load/data/{scenario[1]}.csv\",\n                schema=SchemaUtils.from_file_to_dict(\n                    f\"file://{TEST_LAKEHOUSE_CONTROL}/append_load/{scenario[3]}.json\"\n                ),\n            )\n\n            # for rename test, based on the transformation specified in the\n            # acon file, the schema change is ignored\n            if scenario[0] == \"auto_merge_enabled_rename_column_transform\":\n                assert initial_schema == schema_after_append\n            else:\n                assert not DataframeHelpers.has_diff(result_df, control_df)\n\n    # tests with schema auto merge disabled\n    elif \"disabled\" in scenario[0]:\n        # for the renaming or adding column tests, the append throws an error\n        acon = ConfigUtils.get_acon(\n            f\"file://{TEST_RESOURCES}/append_load/{scenario[2]}.json\"\n        )\n        if \"rename_column_file\" in scenario[0] or \"add\" in scenario[0]:\n            with pytest.raises(AnalysisException, match=f\".*{scenario[4]}*\"):\n                load_data(acon=acon)\n        else:\n            load_data(acon=acon)\n\n            result_df = DataframeHelpers.read_from_file(\n                f\"{TEST_LAKEHOUSE_OUT}/append_load/data\",\n                file_format=InputFormat.DELTAFILES.value,\n            )\n\n            schema_after_append = DataframeHelpers.read_from_table(\n                \"test_db.schema_evolution_append_load\"\n            ).schema\n\n            assert initial_schema == schema_after_append\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\n            \"auto_merge_enabled\",\n            \"part-02\",\n            \"batch_merge_enabled\",\n            \"control_schema_merge_enabled\",\n        ],\n        [\n            \"auto_merge_disabled\",\n            \"part-02\",\n            \"batch_merge_disabled\",\n            \"\",\n            \"Failed to merge\",\n        ],\n        [\n            \"overwrite_schema\",\n            \"part-02\",\n            \"batch_overwrite\",\n            
\"control_schema_overwrite\",\n        ],\n    ],\n)\ndef test_schema_evolution_full_load(scenario: str) -> None:\n    \"\"\"Test schema evolution on full loads.\n\n    Args:\n        scenario: scenario to test.\n        auto_merge_enabled - overwrites the data in the table but does not\n        overwrite the schema (assumes the new column, keeps the removed\n        column, ignores renaming and cast transformations)\n        auto_merge_disabled - throws a mismatch schema error.\n        overwrite_schema - overwrites the data and the schema of the table.\n    Scenario Properties:\n        [scenario name, input file, acon file, control schema file,\n        error message excerpt (optional)]\n    \"\"\"\n    _create_table(\"schema_evolution_full_load\", \"full_load\")\n\n    # initial load\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/full_load/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/full_load/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/full_load/schema/source/source_part-01_schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/full_load/source_schema.json\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/full_load/batch_init.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/full_load/data/source/{scenario[1]}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/full_load/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/full_load/schema/source/source_{scenario[1]}_schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/full_load/source_schema.json\",\n    )\n\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/full_load/{scenario[2]}.json\")\n    if scenario[0] == \"auto_merge_disabled\":\n        with pytest.raises(AnalysisException, match=f\".*{scenario[4]}*\"):\n            load_data(acon=acon)\n    else:\n        load_data(acon=acon)\n\n        final_schema = SchemaUtils.from_table_schema(\n            \"test_db.schema_evolution_full_load\"\n        )\n\n        result_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_OUT}/full_load/data\",\n            file_format=InputFormat.DELTAFILES.value,\n        )\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/full_load/schema/control/{scenario[3]}.json\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/full_load/\",\n        )\n\n        control_schema = SchemaUtils.from_file(\n            f\"file://{TEST_LAKEHOUSE_CONTROL}/full_load/{scenario[3]}.json\"\n        )\n\n        assert final_schema == control_schema\n        # with the rename transformation specified in acon, both the original\n        # and the renamed field (ARTICLE and article) are not considered in\n        # the final schema\n        assert (\"article\", \"ARTICLE\") not in result_df.columns\n\n\ndef _create_table(table_name: str, location: str) -> None:\n    \"\"\"Create test table.\"\"\"\n    ExecEnv.SESSION.sql(f\"DROP TABLE IF EXISTS test_db.{table_name}\")\n    ExecEnv.SESSION.sql(\n        f\"\"\"\n        CREATE TABLE IF NOT EXISTS test_db.{table_name} (\n            actrequest_timestamp string,\n            request string,\n            datapakid int,\n            partno int,\n            record int,\n            salesorder int,\n            item int,\n            recordmode string,\n            date int,\n            customer string,\n            ARTICLE string,\n            amount int,\n            code int\n        )\n        USING delta\n        LOCATION '{TEST_LAKEHOUSE_OUT}/{location}/data'\n        
\"\"\"\n    )\n"
  },
  {
    "path": "tests/feature/test_sensors.py",
    "content": "\"\"\"Module with integration tests for sensors feature.\"\"\"\n\nimport json\nimport os\nfrom datetime import datetime\n\nimport pytest\nfrom pyspark.sql.types import StringType, StructField, StructType\n\nfrom lakehouse_engine.algorithms.exceptions import (\n    NoNewDataException,\n    SensorAlreadyExistsException,\n)\nfrom lakehouse_engine.core.definitions import SensorSpec, SensorStatus\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.sensor_manager import SensorControlTableManager\nfrom lakehouse_engine.engine import (\n    execute_sensor,\n    generate_sensor_query,\n    update_sensor_status,\n)\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"sensors\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_NAME}\"\n\n_TEST_SENSOR_DELTA_TABLE_BASE_SCHEMA = {\n    \"sensor_id\": \"string\",\n    \"assets\": \"array<string>\",\n    \"status\": \"string\",\n    \"status_change_timestamp\": \"timestamp\",\n    \"checkpoint_location\": \"string\",\n}\n\n_TEST_SENSOR_DELTA_TABLE_SCHEMA = {\n    **_TEST_SENSOR_DELTA_TABLE_BASE_SCHEMA,\n    **{\n        \"upstream_key\": \"string\",\n        \"upstream_value\": \"string\",\n    },\n}\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        \"1st_run\",\n        \"has_new_data\",\n        \"has_data_from_previous_execution\",\n        \"upstream_acquired_new_data_but_not_processed\",\n        \"no_new_data\",\n    ],\n)\ndef test_table_sensor(scenario: list) -> None:\n    \"\"\"Test the feature of using a sensor to read from a delta table.\n\n    This specific test focuses on a delta table that is in itself the delta\n    table where sensor information is stored. 
This is useful for data products\n    consuming other data products sensor information to trigger their pipelines.\n\n    Scenarios:\n        1st_run: initial setup.\n        has_new_data: the first time the sensor detects new data from the\n            upstream.\n        has_data_from_previous_execution: the sensor does not detect new data\n            from the upstream, but it had data detected from a previous\n            execution of the pipeline for which the completion of the processing\n            of all the data was not acknowledged (e.g., the pipeline failed\n            before completing all the tasks).\n        upstream_acquired_new_data_but_not_processed: tests the scenario where\n            the upstream sensor has acquired new data, but because it's still\n            not in processed state, the downstream sensoring this table cannot\n            consider there's new data available from the upstream (e.g.,\n            a data product pipeline has identified new data from the source,\n            but the pipeline failed, so the downstream data product pipeline's\n            sensor cannot consider there's new data from the upstream).\n        no_new_data: there's no new data from the upstream.\n    \"\"\"\n    upstream_table = \"test_table_sensor_upstream\"\n    sensor_id = \"sensor_id_1\"\n    control_db_table_name = \"test_db.test_table_sensor\"\n    checkpoint_location = f\"{TEST_LAKEHOUSE_IN}/test_table_sensor/\"\n\n    if scenario == \"1st_run\":\n        DataframeHelpers.create_delta_table(\n            _TEST_SENSOR_DELTA_TABLE_SCHEMA,\n            table=\"test_table_sensor\",\n        )\n        DataframeHelpers.create_delta_table(\n            _TEST_SENSOR_DELTA_TABLE_SCHEMA,\n            table=upstream_table,\n            enable_cdf=True,\n        )\n\n    if scenario == \"has_new_data\":\n        _insert_data_into_upstream_table(upstream_table)\n    elif scenario == \"upstream_acquired_new_data_but_not_processed\":\n        _insert_data_into_upstream_table(\n            upstream_table,\n            values=(\n                f\"('sensor_id_upstream_1', array('dummy_upstream_asset_1'), \"\n                f\"'{SensorStatus.ACQUIRED_NEW_DATA.value}', \"\n                f\"'2023-05-30 23:29:49.079522', null, null, null)\"\n            ),\n        )\n\n    acon = {\n        \"sensor_id\": sensor_id,\n        \"assets\": [\"dummy_asset_1\"],\n        \"control_db_table_name\": control_db_table_name,\n        \"input_spec\": {\n            \"spec_id\": \"sensor_upstream\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"delta\",\n            \"db_table\": f\"test_db.{upstream_table}\",\n            \"options\": {\n                \"readChangeFeed\": \"true\",\n            },\n        },\n        \"preprocess_query\": generate_sensor_query(\"sensor_id_upstream_1\"),\n        \"base_checkpoint_location\": checkpoint_location,\n        \"fail_on_empty_result\": True,\n    }\n\n    if scenario in [\"has_new_data\", \"has_data_from_previous_execution\"]:\n        has_new_data = execute_sensor(acon=acon)\n        sensor_table_data = SensorControlTableManager.read_sensor_table_data(\n            sensor_id=sensor_id, control_db_table_name=control_db_table_name\n        )\n        assert sensor_table_data.status == SensorStatus.ACQUIRED_NEW_DATA.value\n        assert has_new_data\n\n        if scenario == \"has_data_from_previous_execution\":\n            # this is the final scenario where we should have data from upstream.\n            # therefore, we 
checkpoint to indicate that sensor has processed\n            # all the new data.\n            update_sensor_status(\n                sensor_id,\n                control_db_table_name,\n            )\n\n            sensor_table_data = SensorControlTableManager.read_sensor_table_data(\n                sensor_id=sensor_id, control_db_table_name=control_db_table_name\n            )\n\n            assert sensor_table_data.status == SensorStatus.PROCESSED_NEW_DATA.value\n    else:\n        with pytest.raises(NoNewDataException) as exception:\n            execute_sensor(acon=acon)\n\n        assert f\"No data was acquired by {sensor_id} sensor.\" == str(exception.value)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"raise_exception_as_sensor_already_exists_by_sensor_id\",\n            \"sensor_id\": \"sensor_id_2\",\n            \"assets\": [\"dummy_asset_1\"],\n        },\n        {\n            \"scenario_name\": \"raise_exception_as_sensor_already_exists_by_assets\",\n            \"sensor_id\": \"sensor_id_1\",\n            \"assets\": [\"dummy_asset_2\"],\n        },\n    ],\n)\ndef test_if_sensor_already_exists(scenario: dict) -> None:\n    \"\"\"Test if the sensor already exists.\n\n    This specific test focuses on the ways to identify if a sensor\n    already exists.\n\n    Scenarios:\n        raise_exception_as_sensor_already_exists_by_sensor_id: raises\n            exception if you try to create a sensor with a\n            different sensor id but same asset.\n        raise_exception_as_sensor_already_exists_by_assets: raises\n            exception if you try to create a sensor with\n            different assets but same sensor_id.\n    \"\"\"\n    sensor_id = \"sensor_id_1\"\n    assets = [\"dummy_asset_1\"]\n\n    control_db_table_name = \"test_db.test_table_sensor\"\n    upstream_table = \"test_table_sensor_upstream\"\n    checkpoint_location = f\"{TEST_LAKEHOUSE_IN}/test_table_sensor/\"\n\n    LocalStorage.clean_folder(checkpoint_location)\n    ExecEnv.SESSION.sql(f\"DROP TABLE IF EXISTS {control_db_table_name}\")\n    ExecEnv.SESSION.sql(f\"DROP TABLE IF EXISTS test_db.{upstream_table}\")\n\n    DataframeHelpers.create_delta_table(\n        _TEST_SENSOR_DELTA_TABLE_SCHEMA,\n        table=\"test_table_sensor\",\n    )\n    DataframeHelpers.create_delta_table(\n        _TEST_SENSOR_DELTA_TABLE_SCHEMA,\n        table=upstream_table,\n        enable_cdf=True,\n    )\n\n    _insert_data_into_upstream_table(upstream_table)\n\n    acon = {\n        \"sensor_id\": sensor_id,\n        \"assets\": assets,\n        \"control_db_table_name\": control_db_table_name,\n        \"input_spec\": {\n            \"spec_id\": \"sensor_upstream\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"delta\",\n            \"db_table\": f\"test_db.{upstream_table}\",\n            \"options\": {\n                \"readChangeFeed\": \"true\",\n            },\n        },\n        \"preprocess_query\": generate_sensor_query(\"sensor_id_upstream_1\"),\n        \"base_checkpoint_location\": checkpoint_location,\n        \"fail_on_empty_result\": True,\n    }\n\n    execute_sensor(acon=acon)\n\n    with pytest.raises(SensorAlreadyExistsException) as exception:\n        acon[\"sensor_id\"] = scenario[\"sensor_id\"]\n        acon[\"assets\"] = scenario[\"assets\"]\n        execute_sensor(acon=acon)\n\n    assert \"There's already a sensor registered with same id or assets!\" == str(\n        exception.value\n    
)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        \"1st_run\",\n        \"2nd_run_with_new_data\",\n        \"3rd_run_without_new_data\",\n        \"4th_run_with_new_data\",\n    ],\n)\ndef test_jdbc_sensor(scenario: str) -> None:\n    \"\"\"Test the feature of sensoring new data from a jdbc upstream.\n\n    Scenario:\n        1st_run - initial setup.\n        2nd_run_with_new_data - jdbc upstream has new data.\n        3rd_run_without_new_data - jdbc upstream does not have new data.\n        4th_run_with_new_data - jdbc upstream has new data again.\n    \"\"\"\n    upstream_jdbc_table = \"test_jdbc_sensor_upstream\"\n    sensor_id = \"sensor_id_1\"\n    sensor_table = \"test_jdbc_sensor\"\n    control_db_table_name = f\"test_db.{sensor_table}\"\n    os.makedirs(f\"{TEST_LAKEHOUSE_IN}/{upstream_jdbc_table}\", exist_ok=True)\n\n    if scenario == \"1st_run\":\n        DataframeHelpers.create_delta_table(\n            _TEST_SENSOR_DELTA_TABLE_SCHEMA,\n            table=sensor_table,\n        )\n        _insert_into_jdbc_table(init=True)\n    elif scenario == \"2nd_run_with_new_data\":\n        _insert_into_jdbc_table(time=datetime.now())\n    elif scenario == \"4th_run_with_new_data\":\n        _insert_into_jdbc_table(time=datetime.now())\n\n    acon = {\n        \"sensor_id\": sensor_id,\n        \"assets\": [\"dummy_asset_1\"],\n        \"control_db_table_name\": control_db_table_name,\n        \"input_spec\": {\n            \"spec_id\": \"sensor_upstream\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"jdbc\",\n            \"jdbc_args\": {\n                \"url\": f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/\"\n                f\"{upstream_jdbc_table}/tests.db\",\n                \"table\": upstream_jdbc_table,\n                \"properties\": {\"driver\": \"org.sqlite.JDBC\"},\n            },\n        },\n        \"preprocess_query\": generate_sensor_query(\n            sensor_id=sensor_id,\n            filter_exp=\"?upstream_key > '?upstream_value'\",\n            control_db_table_name=control_db_table_name,\n            upstream_key=\"dummy_time\",\n        ),\n        \"fail_on_empty_result\": True,\n    }\n\n    if scenario in [\"2nd_run_with_new_data\", \"4th_run_with_new_data\"]:\n        has_new_data = execute_sensor(acon=acon)\n        sensor_table_data = SensorControlTableManager.read_sensor_table_data(\n            sensor_id=sensor_id, control_db_table_name=control_db_table_name\n        )\n\n        assert sensor_table_data.status == SensorStatus.ACQUIRED_NEW_DATA.value\n\n        update_sensor_status(\n            sensor_id,\n            control_db_table_name,\n        )\n\n        sensor_table_data = SensorControlTableManager.read_sensor_table_data(\n            sensor_id=sensor_id, control_db_table_name=control_db_table_name\n        )\n\n        assert sensor_table_data.status == SensorStatus.PROCESSED_NEW_DATA.value\n        assert has_new_data\n    else:\n        with pytest.raises(NoNewDataException) as exception:\n            execute_sensor(acon=acon)\n\n        assert f\"No data was acquired by {sensor_id} sensor.\" == str(exception.value)\n\n\ndef test_files_sensor() -> None:\n    \"\"\"Test the feature of sensoring a filesystem location (e.g., s3).\"\"\"\n    sensor_id = \"sensor_id_1\"\n    sensor_table = \"test_files_sensor\"\n    control_db_table_name = f\"test_db.{sensor_table}\"\n    checkpoint_location = f\"{TEST_LAKEHOUSE_IN}/test_files_sensor/\"\n    files_location = f\"{TEST_LAKEHOUSE_IN}/test_files_sensor/files/\"\n\n    
DataframeHelpers.create_delta_table(\n        _TEST_SENSOR_DELTA_TABLE_SCHEMA,\n        table=sensor_table,\n    )\n\n    schema = _insert_files_sensor_test_data(files_location)\n\n    acon = {\n        \"sensor_id\": sensor_id,\n        \"assets\": [\"dummy_asset_1\"],\n        \"control_db_table_name\": control_db_table_name,\n        \"input_spec\": {\n            \"spec_id\": \"sensor_upstream\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"csv\",\n            \"location\": files_location,\n            \"schema\": json.loads(schema.json()),\n        },\n        \"base_checkpoint_location\": checkpoint_location,\n        \"fail_on_empty_result\": False,\n    }\n\n    has_new_data = execute_sensor(acon=acon)\n\n    assert has_new_data\n\n\ndef test_update_sensor_status() -> None:\n    \"\"\"Test sensor update status logic.\"\"\"\n    sensor_id = \"sensor_id_1\"\n    sensor_table = \"test_checkpoint_sensor\"\n    control_db_table_name = f\"test_db.{sensor_table}\"\n    status = SensorStatus.ACQUIRED_NEW_DATA.value\n    checkpoint_location = \"s3://dummy-bucket/sensors/sensor_id_1\"\n\n    DataframeHelpers.create_delta_table(\n        _TEST_SENSOR_DELTA_TABLE_BASE_SCHEMA,\n        table=\"test_checkpoint_sensor\",\n    )\n\n    SensorControlTableManager.update_sensor_status(\n        sensor_spec=SensorSpec(\n            sensor_id=sensor_id,\n            assets=[\"asset_1\"],\n            control_db_table_name=control_db_table_name,\n            checkpoint_location=checkpoint_location,\n            preprocess_query=None,\n            input_spec=None,\n        ),\n        status=status,\n    )\n\n    row = SensorControlTableManager.read_sensor_table_data(\n        sensor_id=sensor_id, control_db_table_name=control_db_table_name\n    )\n\n    assert (\n        row.sensor_id == sensor_id\n        and row.status == SensorStatus.ACQUIRED_NEW_DATA.value\n        and row.checkpoint_location == \"s3://dummy-bucket/sensors/sensor_id_1\"\n    )\n\n\ndef _insert_data_into_upstream_table(\n    table: str,\n    db: str = \"test_db\",\n    values: str = None,\n) -> None:\n    \"\"\"Insert data into upstream table for testing sensoring based on tables.\n\n    Args:\n        table: table name.\n        db: database name.\n        values: string with the values operator for inserting data through SQL\n            DML statement.\n    \"\"\"\n    if not values:\n        values = (\n            f\"('sensor_id_upstream_1', array('dummy_upstream_asset_1'), \"\n            f\"'{SensorStatus.PROCESSED_NEW_DATA.value}', \"\n            f\"'2023-05-30 23:28:49.079522', null, null, null),\"\n            f\"('sensor_id_upstream_2', array('dummy_upstream_asset_2'), \"\n            f\"'{SensorStatus.PROCESSED_NEW_DATA.value}', \"\n            f\"'2023-05-30 23:28:49.089522', null, null, null)\"\n        )\n\n    ExecEnv.SESSION.sql(f\"INSERT INTO {db}.{table} VALUES {values}\")  # nosec: B608\n\n\ndef _insert_files_sensor_test_data(files_location: str) -> StructType:\n    \"\"\"Insert test data for files sensor test.\n\n    Args:\n        files_location: location to insert the data.\n\n    Returns:\n        A dummy struct type.\n    \"\"\"\n    schema = StructType([StructField(\"dummy_field\", StringType(), True)])\n\n    df = ExecEnv.SESSION.createDataFrame(\n        [\n            [\"a\"],\n            [\"b\"],\n        ],\n        schema,\n    )\n\n    df.write.format(\"csv\").save(files_location)\n\n    return schema\n\n\ndef _insert_into_jdbc_table(\n    init: bool = False,\n    time: 
datetime = None,\n) -> None:\n    \"\"\"Insert data into the jdbc table for tests.\n\n    Args:\n        init: whether to initialize the table with empty data or not.\n        time: value to use for the dummy_time field, so that time-based filters\n            can be applied to the table to check whether new data is\n            available from upstream.\n    \"\"\"\n    schema = StructType(\n        [\n            StructField(\"dummy_field\", StringType(), True),\n            StructField(\"dummy_time\", StringType(), True),\n        ]\n    )\n\n    if init:\n        df = ExecEnv.SESSION.createDataFrame(\n            [],\n            schema,\n        )\n    else:\n        df = ExecEnv.SESSION.createDataFrame(\n            [\n                [\"a\", str(time)],\n                [\"b\", str(time)],\n            ],\n            schema,\n        )\n\n    DataframeHelpers.write_into_jdbc_table(\n        df,\n        f\"jdbc:sqlite:{TEST_LAKEHOUSE_IN}/test_jdbc_sensor_upstream/tests.db\",\n        \"test_jdbc_sensor_upstream\",\n    )\n"
  },
  {
    "path": "tests/feature/test_sftp_reader.py",
    "content": "\"\"\"Test SFTP reader.\n\nNote: there is a limitation with the SFTP server/client which serves all files with\nthe same access and modified time, so we use the biggest dates to cover those\nscenarios. Moreover, we also cover scenarios were no files are expected to be found,\ndue to the date filters.\n\"\"\"\n\nimport gzip\nimport io\nimport os\nfrom copy import deepcopy\nfrom io import TextIOWrapper\nfrom typing import Generator\nfrom zipfile import ZipFile\n\nimport pandas as pd\nimport pytest\nfrom paramiko import Transport\nfrom paramiko.sftp_client import SFTPClient\nfrom pytest_sftpserver.consts import (  # type: ignore\n    SERVER_KEY_PRIVATE,\n    SERVER_KEY_PUBLIC,\n)\nfrom pytest_sftpserver.sftp.server import SFTPServer  # type: ignore\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom tests.conftest import FEATURE_RESOURCES, LAKEHOUSE_FEATURE_OUT\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"sftp_reader\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\nLOCAL_PATH = f\"{TEST_RESOURCES}/data/\"\nLOGGER = LoggingHandler(__name__).get_logger()\nFILES = os.listdir(LOCAL_PATH)\n\n\n@pytest.fixture(scope=\"module\")\ndef sftp_client(sftpserver: SFTPServer) -> Generator:\n    \"\"\"Create the sftp client to perform the tests.\n\n    Args:\n        sftpserver: a local SFTP-Server provided by the plugin pytest-sftpserver.\n    \"\"\"\n    conn_cred = {\"username\": \"a\", \"password\": \"b\"}\n    transport = Transport((sftpserver.host, sftpserver.port))\n    transport.connect(\n        hostkey=None,\n        **conn_cred,\n        pkey=None,\n        gss_host=None,\n        gss_auth=False,\n        gss_kex=False,\n        gss_deleg_creds=True,\n        gss_trust_dns=True,\n    )\n    client = SFTPClient.from_transport(transport)\n    yield client\n    client.close()\n    transport.close()\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"sftp_csv\",\n            \"test_name\": \"between_dates\",\n            \"sftp_files_format\": \"csv\",\n            \"file_name\": \"file\",\n            \"file_extension\": \".csv\",\n        },\n        {\n            \"scenario_name\": \"sftp_csv\",\n            \"test_name\": \"between_dates_fail\",\n            \"sftp_files_format\": \"csv\",\n            \"file_name\": \"file\",\n            \"file_extension\": \".csv\",\n        },\n    ],\n)\ndef test_sftp_reader_csv(\n    sftp_client: SFTPClient,\n    sftpserver: SFTPServer,\n    scenario: dict,\n    remote_location: dict,\n) -> None:\n    \"\"\"Test loads from sftp source - csv type.\n\n    This tests covers a connection using keys and tests a scenario between dates.\n\n    Args:\n        sftp_client: sftp client used to perform tests.\n        sftpserver: a local SFTP-Server created by pytest_sftpserver.\n        scenario: scenario being tested.\n        remote_location: serve files on remote location.\n    \"\"\"\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n    with sftpserver.serve_content(deepcopy(remote_location)):\n        rename_remote_files(sftp_client)\n\n        option_params = {\n            \"hostname\": sftpserver.host,\n            \"username\": \"dummy_user\",\n            \"password\": \"dummy_password\",\n          
  \"port\": sftpserver.port,\n            \"key_type\": \"RSA\",\n            \"pkey\": LocalStorage.read_file(SERVER_KEY_PUBLIC).split()[1],\n            \"key_filename\": SERVER_KEY_PRIVATE,\n            \"date_time_gt\": \"2022-01-01\",\n            \"date_time_lt\": (\n                \"9999-12-31\" if \"fail\" not in scenario[\"test_name\"] else \"2021-01-01\"\n            ),\n            \"file_name_contains\": f\"e{scenario['file_extension']}\",\n            \"args\": {\"sep\": \"|\"},\n        }\n\n        acon = _get_test_acon(scenario, option_params)\n\n        if \"fail\" not in scenario[\"test_name\"]:\n            _execute_and_validate(acon, scenario)\n        else:\n            with pytest.raises(\n                ValueError, match=\"No files were found with the specified parameters.\"\n            ):\n                _execute_and_validate(acon, scenario)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"sftp_fwf\",\n            \"test_name\": \"earliest_file\",\n            \"sftp_files_format\": \"fwf\",\n            \"file_name\": \"file5\",\n            \"file_extension\": \".txt\",\n        }\n    ],\n)\ndef test_sftp_reader_fwf(\n    sftp_client: SFTPClient,\n    sftpserver: SFTPServer,\n    scenario: dict,\n    remote_location: dict,\n) -> None:\n    \"\"\"Test loads from sftp source - fwf type.\n\n    This test covers a connection using add auto policy and tests\n    earliest file and additional args.\n\n    Args:\n        sftp_client: sftp client used to perform tests.\n        sftpserver: a local SFTP-Server created by pytest_sftpserver.\n        scenario: scenario being tested.\n        remote_location: serve files on remote location.\n    \"\"\"\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n    with sftpserver.serve_content(deepcopy(remote_location)):\n        rename_remote_files(sftp_client)\n\n        option_params = {\n            \"hostname\": sftpserver.host,\n            \"username\": \"dummy_user\",\n            \"password\": \"dummy_password\",\n            \"port\": sftpserver.port,\n            \"add_auto_policy\": True,\n            \"earliest_file\": True,\n            \"file_name_contains\": scenario[\"file_extension\"],\n            \"args\": {\"index_col\": False, \"names\": [\"value\"]},\n        }\n\n        acon = _get_test_acon(scenario, option_params)\n\n        _execute_and_validate(acon, scenario)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"sftp_gz_file\",\n            \"test_name\": \"compressed_gz_file\",\n            \"sftp_files_format\": \"csv\",\n            \"file_name\": \"file6.compress\",\n            \"file_extension\": \".gz\",\n        },\n    ],\n)\ndef test_sftp_reader_gz_file(\n    sftp_client: SFTPClient,\n    sftpserver: SFTPServer,\n    scenario: dict,\n    remote_location: dict,\n) -> None:\n    \"\"\"Test loads from sftp source - compressed gz type.\n\n    This tests covers a connection using keys and tests a scenario of\n    extracting a compressed gz file.\n\n    Args:\n        sftp_client: sftp client used to perform tests.\n        sftpserver: a local SFTP-Server created by pytest_sftpserver.\n        scenario: scenario being tested.\n        remote_location: serve files on remote location.\n    \"\"\"\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n    with sftpserver.serve_content(deepcopy(remote_location)):\n        rename_remote_files(sftp_client)\n\n        
option_params = {\n            \"hostname\": sftpserver.host,\n            \"username\": \"dummy_user\",\n            \"password\": \"dummy_password\",\n            \"port\": sftpserver.port,\n            \"key_type\": \"RSA\",\n            \"pkey\": LocalStorage.read_file(SERVER_KEY_PUBLIC).split()[1],\n            \"file_name_contains\": \"file6\",\n            \"args\": {\"sep\": \"|\"},\n        }\n\n        acon = _get_test_acon(scenario, option_params)\n\n        _execute_and_validate(acon, scenario)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"sftp_json\",\n            \"test_name\": \"greater_than\",\n            \"sftp_files_format\": \"json\",\n            \"file_name\": \"file3\",\n            \"file_extension\": \".json\",\n        }\n    ],\n)\ndef test_sftp_reader_json(\n    sftp_client: SFTPClient,\n    sftpserver: SFTPServer,\n    scenario: dict,\n    remote_location: dict,\n) -> None:\n    \"\"\"Test loads from sftp source - json type.\n\n    This tests covers a connection with add auto policy and tests date time\n    greater than specified date and additional args.\n\n    Args:\n        sftp_client: sftp client used to perform tests.\n        sftpserver: a local SFTP-Server created by pytest_sftpserver.\n        scenario: scenario being tested.\n        remote_location: serve files on remote location.\n    \"\"\"\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n    with sftpserver.serve_content(deepcopy(remote_location)):\n        rename_remote_files(sftp_client)\n\n        option_params = {\n            \"hostname\": sftpserver.host,\n            \"username\": \"dummy_user\",\n            \"password\": \"dummy_password\",\n            \"port\": sftpserver.port,\n            \"add_auto_policy\": True,\n            \"date_time_gt\": \"2022-01-01\",\n            \"file_name_contains\": scenario[\"file_extension\"],\n            \"args\": {\"lines\": True, \"orient\": \"columns\"},\n        }\n\n        acon = _get_test_acon(scenario, option_params)\n\n        _execute_and_validate(acon, scenario)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"sftp_mult_files\",\n            \"test_name\": \"file_name_contains\",\n            \"sftp_files_format\": \"csv\",\n            \"file_name\": \"*\",\n            \"file_extension\": \".csv\",\n        }\n    ],\n)\ndef test_sftp_reader_mult_files(\n    sftp_client: SFTPClient,\n    sftpserver: SFTPServer,\n    scenario: dict,\n    remote_location: dict,\n) -> None:\n    \"\"\"Test loads from sftp source - multiple files.\n\n    This test covers a connection with add auto policy and tests file\n    contains with additional args.\n\n    Args:\n        sftp_client: sftp client used to perform tests.\n        sftpserver: a local SFTP-Server created by pytest_sftpserver.\n        scenario: scenario being tested.\n        remote_location: serve files on remote location.\n    \"\"\"\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n\n    with sftpserver.serve_content(deepcopy(remote_location)):\n        rename_remote_files(sftp_client)\n\n        option_params = {\n            \"hostname\": sftpserver.host,\n            \"username\": \"dummy_user\",\n            \"password\": \"dummy_password\",\n            \"port\": sftpserver.port,\n            \"add_auto_policy\": True,\n            \"file_name_contains\": scenario[\"file_extension\"],\n            \"args\": {\"sep\": \"|\"},\n        
}\n\n        acon = _get_test_acon(scenario, option_params)\n\n        _execute_and_validate(acon, scenario)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"sftp_xml\",\n            \"test_name\": \"lower_than\",\n            \"sftp_files_format\": \"xml\",\n            \"file_name\": \"file4\",\n            \"file_extension\": \".xml\",\n        },\n        {\n            \"scenario_name\": \"sftp_xml\",\n            \"test_name\": \"lower_than_fails\",\n            \"sftp_files_format\": \"xml\",\n            \"file_name\": \"file4\",\n            \"file_extension\": \".xml\",\n        },\n    ],\n)\ndef test_sftp_reader_xml(\n    sftp_client: SFTPClient,\n    sftpserver: SFTPServer,\n    scenario: dict,\n    remote_location: dict,\n) -> None:\n    \"\"\"Test loads from sftp source - xml type.\n\n    This test covers a connection with add auto policy and date time\n    lower than specified date.\n\n    Args:\n        sftp_client: sftp client used to perform tests.\n        sftpserver: a local SFTP-Server created by pytest_sftpserver.\n        scenario: scenario being tested.\n        remote_location: serve files on remote location.\n    \"\"\"\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n    with sftpserver.serve_content(deepcopy(remote_location)):\n        rename_remote_files(sftp_client)\n\n        option_params = {\n            \"hostname\": sftpserver.host,\n            \"username\": \"dummy_user\",\n            \"password\": \"dummy_password\",\n            \"port\": sftpserver.port,\n            \"add_auto_policy\": True,\n            \"date_time_lt\": (\n                \"9999-12-31\" if \"fail\" not in scenario[\"test_name\"] else \"2022-01-01\"\n            ),\n            \"file_name_contains\": scenario[\"file_extension\"],\n        }\n\n        acon = _get_test_acon(scenario, option_params)\n\n        if \"fail\" not in scenario[\"test_name\"]:\n            _execute_and_validate(acon, scenario)\n        else:\n            with pytest.raises(\n                ValueError, match=\"No files were found with the specified parameters.\"\n            ):\n                _execute_and_validate(acon, scenario)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"sftp_zip_file\",\n            \"test_name\": \"compressed_zip_file\",\n            \"sftp_files_format\": \"csv\",\n            \"file_name\": \"file7\",\n            \"file_extension\": \".zip\",\n        },\n    ],\n)\ndef test_sftp_reader_zip_file(\n    sftp_client: SFTPClient,\n    sftpserver: SFTPServer,\n    scenario: dict,\n    remote_location: dict,\n) -> None:\n    \"\"\"Test loads from sftp source - compressed zip type.\n\n    This tests covers a connection using keys and tests a scenario of\n    extracting a compressed zip file.\n\n    Args:\n        sftp_client: sftp client used to perform tests.\n        sftpserver: a local SFTP-Server created by pytest_sftpserver.\n        scenario: scenario being tested.\n        remote_location: serve files on remote location.\n    \"\"\"\n    LOGGER.info(f\"Starting Scenario {scenario['scenario_name']}\")\n    with sftpserver.serve_content(deepcopy(remote_location)):\n        rename_remote_files(sftp_client)\n\n        option_params = {\n            \"hostname\": sftpserver.host,\n            \"username\": \"dummy_user\",\n            \"password\": \"dummy_password\",\n            \"port\": sftpserver.port,\n            \"key_type\": \"RSA\",\n          
  \"pkey\": LocalStorage.read_file(SERVER_KEY_PUBLIC).split()[1],\n            \"sub_dir\": True,\n            \"file_name_contains\": \"file7\",\n            \"args\": {\"sep\": \"|\"},\n        }\n\n        acon = _get_test_acon(scenario, option_params)\n\n        _execute_and_validate(acon, scenario)\n\n\ndef test_sftp_server_available(sftpserver: SFTPServer) -> None:\n    \"\"\"Test availability of sftp server.\n\n    Args:\n        sftpserver: a local SFTP-Server created by pytest_sftpserver.\n    \"\"\"\n    assert isinstance(sftpserver, SFTPServer)\n    assert sftpserver.is_alive()\n    assert str(sftpserver.port) in sftpserver.url\n\n\ndef _execute_and_validate(\n    acon: dict,\n    scenario: dict,\n) -> None:\n    \"\"\"Execute the load and compare data of result and control.\n\n    Args:\n        acon: acon dict to be tested.\n        scenario: scenario to be tested.\n    \"\"\"\n    load_data(acon=acon)\n    result = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}/{scenario['test_name']}/data\"\n    )\n\n    if scenario[\"scenario_name\"] == \"sftp_fwf\":\n        control = (\n            ExecEnv.SESSION.read.format(\"text\")\n            .option(\"lineSep\", \"\\n\")\n            .load(\n                f\"{TEST_RESOURCES}/data/{scenario['file_name']}\"\n                f\"{scenario['file_extension']}\"\n            )\n        )\n    elif scenario[\"scenario_name\"] == \"sftp_json\":\n        control = DataframeHelpers.read_from_file(\n            f\"{TEST_RESOURCES}/data/{scenario['file_name']}\"\n            f\"{scenario['file_extension']}\",\n            file_format=\"json\",\n        )\n    elif scenario[\"scenario_name\"] == \"sftp_xml\":\n        control = (\n            ExecEnv.SESSION.read.format(\"xml\")\n            .option(\"rowTag\", \"row\")\n            .load(\n                f\"{TEST_RESOURCES}/data/{scenario['file_name']}\"\n                f\"{scenario['file_extension']}\"\n            )\n        )\n    elif scenario[\"scenario_name\"] == \"sftp_zip_file\":\n        with ZipFile(\n            f\"{TEST_RESOURCES}/data/{scenario['file_name']}\"\n            f\"{scenario['file_extension']}\",\n            \"r\",\n        ) as zf:\n            file = pd.read_csv(TextIOWrapper(zf.open(zf.namelist()[0])), sep=\"|\")\n        control = ExecEnv.SESSION.createDataFrame(file)\n    else:\n        control = DataframeHelpers.read_from_file(\n            f\"{TEST_RESOURCES}/data/{scenario['file_name']}\"\n            f\"{scenario['file_extension']}\"\n        )\n\n    assert not DataframeHelpers.has_diff(result, control)\n\n\ndef _get_test_acon(\n    scenario: dict,\n    option_params: dict,\n) -> dict:\n    \"\"\"Creates a test ACON with the desired logic for the algorithm.\n\n    Args:\n        scenario: the scenario being tested.\n        option_params: option params for the scenario being tested.\n\n    Returns:\n        dict: the ACON for the algorithm configuration.\n    \"\"\"\n    return {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"sftp_source\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"sftp\",\n                \"sftp_files_format\": scenario[\"sftp_files_format\"],\n                \"location\": \"remote_location\",\n                \"options\": option_params,\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"sftp_bronze\",\n                \"input_id\": \"sftp_source\",\n                
\"write_type\": \"overwrite\",\n                \"data_format\": \"csv\",\n                \"options\": {\"header\": True, \"delimiter\": \"|\", \"inferSchema\": True},\n                \"location\": f\"file:///{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}/\"\n                f\"{scenario['test_name']}/data\",\n            }\n        ],\n    }\n\n\n@pytest.fixture(scope=\"module\")\ndef remote_location() -> dict:\n    \"\"\"Get files to serve on a remote sftp location.\n\n    For creating compressed file in the remote location,\n    it is necessary to read, decompress, cast it to bytes\n    and then send it to the location.\n    For regular files, only file read is necessary.\n\n    Returns:\n        A dict with the files for the remote location configured.\n    \"\"\"\n    remote_location: dict = {\"remote_location\": {}}\n\n    for file in FILES:\n        if file.endswith(\".gz\"):\n            file_name = file.rsplit(\".\", 1)[0]\n            with gzip.GzipFile(f\"{LOCAL_PATH}{file}\", \"rb\") as compressed_file:\n                file_data_string = compressed_file.read().decode()\n                file_bytes = gzip.compress(file_data_string.encode(\"utf-8\"))\n            remote_location[\"remote_location\"][f\"{file_name}\"] = file_bytes\n        elif file.endswith(\".zip\"):\n            file_name = file.rsplit(\".\", 1)[0]\n            with ZipFile(f\"{LOCAL_PATH}{file}\", \"r\") as f:\n                with f.open(f\"{file_name}.csv\") as zfile:\n                    data = zfile.read().decode()\n\n            bytesfile = io.BytesIO()\n            with ZipFile(bytesfile, mode=\"w\") as zf:\n                zf.writestr(f\"{file_name}.csv\", data)\n                zf.close()\n                file_bytes = bytesfile.getvalue()\n            remote_location[\"remote_location\"].update({\"sub_dir\": {}})\n            remote_location[\"remote_location\"][\"sub_dir\"][f\"{file_name}\"] = file_bytes\n        else:\n            file_name = file.split(\".\")[0]\n            remote_location[\"remote_location\"][f\"{file_name}\"] = LocalStorage.read_file(\n                f\"{LOCAL_PATH}{file}\"\n            )\n    return remote_location\n\n\ndef rename_remote_files(sftp_client: SFTPClient) -> None:\n    \"\"\"Rename files served remotely in SFTP.\"\"\"\n    for file in FILES:\n        file_name = file.rsplit(\".\", 1)[0]\n        try:\n            sftp_client.rename(\n                f\"/remote_location/{file_name}\",\n                f\"/remote_location/{file}\",\n            )\n        except IOError:\n            pass\n        try:\n            sftp_client.rename(\n                f\"/remote_location/sub_dir/{file_name}\",\n                f\"/remote_location/sub_dir/{file}\",\n            )\n        except IOError:\n            pass\n"
  },
  {
    "path": "tests/feature/test_sharepoint_reader.py",
    "content": "\"\"\"Test Sharepoint reader.\"\"\"\n\nimport json\nfrom pathlib import Path\nfrom typing import Any, Dict, List, Set\nfrom unittest.mock import Mock, patch\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import SharepointFile\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import load_data\nfrom tests.conftest import FEATURE_RESOURCES\nfrom tests.utils.local_storage import LocalStorage\nfrom tests.utils.mocks import MockRESTResponse\n\nTEST_NAME = \"sharepoint\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\n\nTEST_SCENARIOS_READER_SUCCESS: List[List[str]] = [\n    [\"reader\", \"read_single_csv_success\"],\n    [\"reader\", \"read_single_csv_full_path_success\"],\n    [\"reader\", \"read_folder_csv_success\"],\n    [\"reader\", \"read_folder_csv_pattern_success\"],\n    [\"reader\", \"read_single_csv_archive_enabled_success\"],\n    [\"reader\", \"read_folder_csv_archive_enabled_success\"],\n    [\"reader\", \"read_single_csv_archive_default_enabled_success\"],\n    [\"reader\", \"read_single_csv_archive_success_subfolder_override_success\"],\n    [\"reader\", \"read_folder_csv_archive_success_subfolder_override_success\"],\n]\n\nTEST_SCENARIOS_READER_FAILURES: List[List[str]] = [\n    [\n        \"reader\",\n        \"read_folder_csv_one_file_schema_mismatch_should_archive_error\",\n        r\"Schema mismatch\",\n    ],\n    [\"reader\", \"read_single_csv_empty_file_should_archive_error\", r\"is empty\"],\n    [\n        \"reader\",\n        \"read_folder_csv_no_csv_files_should_fail\",\n        r\"No CSV files found in folder: sp_test\",\n    ],\n    [\n        \"reader\",\n        \"read_folder_csv_pattern_matches_no_files_should_fail\",\n        r\"No CSV files found in folder: sp_test\",\n    ],\n    [\n        \"reader\",\n        \"read_folder_csv_one_file_schema_mismatch_\"\n        \"custom_error_subfolder_should_archive_error\",\n        r\"Schema mismatch\",\n    ],\n    [\n        \"reader\",\n        \"read_single_csv_download_error_should_archive_error\",\n        r\"Download failed\",\n    ],\n    [\n        \"reader\",\n        \"read_single_csv_spark_load_fails_should_archive_error\",\n        r\"Failed to read Sharepoint file\",\n    ],\n]\n\n\nTEST_SCENARIOS_READER_EXCEPTIONS: List[List[str]] = [\n    [\n        \"reader\",\n        \"read_single_csv_full_path_with_file_name_should_fail\",\n        \"When `folder_relative_path` points to a file, `file_name` must be None.\",\n    ],\n    [\n        \"reader\",\n        \"read_folder_path_does_not_exist_should_fail\",\n        \"Folder 'missing_folder' does not exist in Sharepoint.\",\n    ],\n    [\n        \"reader\",\n        \"read_file_name_and_file_pattern_conflict_should_fail\",\n        \"Conflicting options: provide either `file_name` or `file_pattern`\",\n    ],\n    [\n        \"reader\",\n        \"read_file_name_unsupported_extension_should_fail\",\n        \"`file_name` must end with one of\",\n    ],\n    [\n        \"reader\",\n        \"read_folder_relative_path_looks_like_file_unsupported_extension_should_fail\",\n        \"`folder_relative_path` appears to be a file path but does not end with one of\",\n    ],\n    [\n        \"reader\",\n        \"read_unsupported_file_type_should_fail\",\n        \"`file_type` must be one of\",\n    ],\n    [\n        \"reader\",\n        \"read_single_csv_full_path_with_file_pattern_should_fail\",\n        \"When `folder_relative_path` points to a file, `file_pattern` must be None.\",\n    ],\n    
[\n        \"reader\",\n        \"read_single_csv_full_path_with_file_type_should_fail\",\n        \"When `folder_relative_path` points to a file, `file_type` must be None\",\n    ],\n]\n\n# Helper functions\n\n\ndef _read_bytes(path_value: str) -> bytes:\n    \"\"\"Read a test file as bytes.\"\"\"\n    return Path(path_value).read_bytes()\n\n\ndef _get_output_path_by_scenario() -> Dict[str, str]:\n    \"\"\"Return the delta output location for each success scenario.\"\"\"\n    return {\n        \"read_single_csv_success\": (\n            \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta/\"\n        ),\n        \"read_single_csv_full_path_success\": (\n            \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_full_path/\"\n        ),\n        \"read_folder_csv_success\": (\n            \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder/\"\n        ),\n        \"read_folder_csv_pattern_success\": (\n            \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder_pattern/\"\n        ),\n        \"read_single_csv_archive_enabled_success\": (\n            \"/app/tests/lakehouse/out/feature/sharepoint/\"\n            \"reader/delta_single_archive_enabled/\"\n        ),\n        \"read_folder_csv_archive_enabled_success\": (\n            \"/app/tests/lakehouse/out/feature/sharepoint/\"\n            \"reader/delta_folder_archive_enabled/\"\n        ),\n        \"read_single_csv_archive_default_enabled_success\": (\n            \"/app/tests/lakehouse/out/feature/sharepoint/\"\n            \"reader/delta_single_archive_default_enabled/\"\n        ),\n        \"read_single_csv_archive_success_subfolder_override_success\": (\n            \"/app/tests/lakehouse/out/feature/sharepoint/\"\n            \"reader/delta_single_archive_success_subfolder_override/\"\n        ),\n        \"read_folder_csv_archive_success_subfolder_override_success\": (\n            \"/app/tests/lakehouse/out/feature/sharepoint/\"\n            \"reader/delta_folder_archive_success_subfolder_override/\"\n        ),\n    }\n\n\ndef _setup_sharepoint_reader_mocks_for_success(\n    scenario_name: str,\n    mock_list_items_in_path: Mock,\n    mock_get_file_metadata: Mock,\n) -> None:\n    \"\"\"Configure SharePoint mocks used by Sharepoint reader success scenarios.\n\n    Args:\n        scenario_name: Test scenario identifier.\n        mock_list_items_in_path: Mock for SharepointUtils.list_items_in_path.\n        mock_get_file_metadata: Mock for SharepointUtils.get_file_metadata.\n    \"\"\"\n    is_folder_read_scenario = scenario_name.startswith(\"read_folder_\")\n\n    if is_folder_read_scenario:\n        mock_list_items_in_path.return_value = [\n            {\"name\": \"sample_1.csv\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n            {\"name\": \"sample_2.csv\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n            {\"name\": \"other.csv\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n            {\"name\": \"ignore.txt\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n        ]\n\n        file_bytes_by_path: Dict[str, bytes] = {\n            \"sp_test/sample_1.csv\": _read_bytes(\n                f\"{TEST_RESOURCES}/reader/data/sample_1.csv\"\n            ),\n            \"sp_test/sample_2.csv\": _read_bytes(\n                f\"{TEST_RESOURCES}/reader/data/sample_2.csv\"\n            ),\n            \"sp_test/other.csv\": _read_bytes(f\"{TEST_RESOURCES}/reader/data/other.csv\"),\n        }\n\n        def 
get_file_metadata_side_effect_for_folder(file_path: str) -> SharepointFile:\n            \"\"\"Side effect function for `get_file_metadata` mock in folder scenarios.\"\"\"\n            return SharepointFile(\n                file_name=file_path.split(\"/\")[-1],\n                time_created=\"\",\n                time_modified=\"\",\n                content=file_bytes_by_path[file_path],\n                _folder=file_path.rsplit(\"/\", 1)[0],\n            )\n\n        mock_get_file_metadata.side_effect = get_file_metadata_side_effect_for_folder\n        return\n\n    content = _read_bytes(f\"{TEST_RESOURCES}/reader/data/sample_1.csv\")\n\n    def get_file_metadata_side_effect_for_single_file(file_path: str) -> SharepointFile:\n        \"\"\"Side effect function for `get_file_metadata` mock in single file scenarios.\n\n        Args:\n            file_path: The path of the file for which metadata is being requested.\n\n        Returns:\n            A SharepointFile object with the content set to the bytes read from the\n            test file.\n        \"\"\"\n        folder = file_path.rsplit(\"/\", 1)[0] if \"/\" in file_path else \"sp_test\"\n        return SharepointFile(\n            file_name=file_path.split(\"/\")[-1],\n            time_created=\"\",\n            time_modified=\"\",\n            content=content,\n            _folder=folder,\n        )\n\n    mock_get_file_metadata.side_effect = get_file_metadata_side_effect_for_single_file\n\n\ndef _assert_archive_calls_for_success(\n    scenario_name: str,\n    mock_archive_sharepoint_file: Mock,\n) -> None:\n    \"\"\"Assert archive behavior for Sharepoint reader success scenarios.\n\n    Args:\n        scenario_name: Test scenario identifier.\n        mock_archive_sharepoint_file: Mock for SharepointUtils.archive_sharepoint_file.\n    \"\"\"\n    is_folder_read_scenario = scenario_name.startswith(\"read_folder_\")\n\n    folder_expected_calls_by_scenario: Dict[str, int] = {\n        \"read_folder_csv_success\": 3,\n        \"read_folder_csv_pattern_success\": 2,\n        \"read_folder_csv_archive_enabled_success\": 3,\n        \"read_folder_csv_archive_success_subfolder_override_success\": 3,\n    }\n\n    folder_archive_enabled_scenarios: Set[str] = {\n        \"read_folder_csv_archive_enabled_success\",\n        \"read_folder_csv_archive_success_subfolder_override_success\",\n    }\n\n    single_file_archive_enabled_scenarios: Set[str] = {\n        \"read_single_csv_archive_enabled_success\",\n        \"read_single_csv_archive_default_enabled_success\",\n        \"read_single_csv_archive_success_subfolder_override_success\",\n    }\n\n    success_subfolder_by_scenario: Dict[str, str] = {\n        \"read_single_csv_archive_success_subfolder_override_success\": \"processed\",\n        \"read_folder_csv_archive_success_subfolder_override_success\": \"processed\",\n    }\n\n    expected_success_subfolder = success_subfolder_by_scenario.get(\n        scenario_name, \"done\"\n    )\n\n    if is_folder_read_scenario:\n        expected_calls = folder_expected_calls_by_scenario[scenario_name]\n        assert mock_archive_sharepoint_file.call_count == expected_calls\n\n        expected_move_enabled = scenario_name in folder_archive_enabled_scenarios\n        for call in mock_archive_sharepoint_file.call_args_list:\n            assert call.kwargs[\"move_enabled\"] is expected_move_enabled\n            if expected_move_enabled:\n                to_path = call.kwargs[\"to_path\"]\n                assert to_path is not None\n                
assert to_path.endswith(f\"/{expected_success_subfolder}\")\n        return\n\n    mock_archive_sharepoint_file.assert_called_once()\n    expected_move_enabled = scenario_name in single_file_archive_enabled_scenarios\n    assert (\n        mock_archive_sharepoint_file.call_args.kwargs[\"move_enabled\"]\n        is expected_move_enabled\n    )\n\n    if expected_move_enabled:\n        to_path = mock_archive_sharepoint_file.call_args.kwargs[\"to_path\"]\n        assert to_path is not None\n        assert to_path.endswith(f\"/{expected_success_subfolder}\")\n\n\ndef _assert_sharepoint_reader_success_output(\n    scenario_name: str,\n    output_path: str,\n) -> None:\n    \"\"\"Assert the delta output produced by Sharepoint reader success scenarios.\n\n    Args:\n        scenario_name: Test scenario identifier.\n        output_path: Delta output location for the scenario.\n    \"\"\"\n    data_frame = ExecEnv.SESSION.read.format(\"delta\").load(output_path)\n    assert data_frame.columns == [\"col_a\", \"col_b\"]\n\n    if scenario_name in {\n        \"read_folder_csv_success\",\n        \"read_folder_csv_archive_enabled_success\",\n        \"read_folder_csv_archive_success_subfolder_override_success\",\n    }:\n        assert data_frame.count() == 3\n        rows = [row.asDict() for row in data_frame.orderBy(\"col_a\").collect()]\n        assert rows == [\n            {\"col_a\": 1, \"col_b\": 2},\n            {\"col_a\": 3, \"col_b\": 4},\n            {\"col_a\": 999, \"col_b\": 999},\n        ]\n    elif scenario_name == \"read_folder_csv_pattern_success\":\n        assert data_frame.count() == 2\n        rows = [row.asDict() for row in data_frame.orderBy(\"col_a\").collect()]\n        assert rows == [\n            {\"col_a\": 1, \"col_b\": 2},\n            {\"col_a\": 3, \"col_b\": 4},\n        ]\n\n\n@patch(\n    \"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.archive_sharepoint_file\"\n)\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.get_file_metadata\")\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.list_items_in_path\")\n@patch(\n    \"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.check_if_endpoint_exists\",\n    return_value=True,\n)\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._create_app\")\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._get_token\")\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._make_request\")\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS_READER_SUCCESS)\ndef test_sharepoint_reader_success(\n    mock_make_request: Any,\n    mock_get_token: Any,\n    mock_create_app: Any,\n    mock_check_if_endpoint_exists: Any,\n    mock_list_items_in_path: Any,\n    mock_get_file_metadata: Any,\n    mock_archive_sharepoint_file: Any,\n    scenario: List[str],\n) -> None:\n    \"\"\"Test Sharepoint reader happy paths (single file, full path, folder).\"\"\"\n    scenario_name = scenario[1]\n\n    output_path_by_scenario = _get_output_path_by_scenario()\n\n    mock_archive_sharepoint_file.return_value = None\n    mock_make_request.return_value = None\n\n    _setup_sharepoint_reader_mocks_for_success(\n        scenario_name=scenario_name,\n        mock_list_items_in_path=mock_list_items_in_path,\n        mock_get_file_metadata=mock_get_file_metadata,\n    )\n\n    output_path = output_path_by_scenario[scenario_name]\n    LocalStorage.clean_folder(output_path)\n\n    load_data(f\"file://{TEST_RESOURCES}/{scenario[0]}/acons/{scenario_name}.json\")\n\n    
_assert_archive_calls_for_success(\n        scenario_name=scenario_name,\n        mock_archive_sharepoint_file=mock_archive_sharepoint_file,\n    )\n\n    _assert_sharepoint_reader_success_output(\n        scenario_name=scenario_name,\n        output_path=output_path,\n    )\n\n\n@patch(\n    \"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.archive_sharepoint_file\"\n)\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.get_file_metadata\")\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.list_items_in_path\")\n@patch(\n    \"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.check_if_endpoint_exists\",\n    return_value=True,\n)\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._create_app\")\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._get_token\")\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._make_request\")\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS_READER_FAILURES)\ndef test_sharepoint_reader_failures(\n    mock_make_request: Any,\n    mock_get_token: Any,\n    mock_create_app: Any,\n    mock_check_if_endpoint_exists: Any,\n    mock_list_items_in_path: Any,\n    mock_get_file_metadata: Any,\n    mock_archive_sharepoint_file: Any,\n    scenario: List[str],\n    tmp_path: Path,\n) -> None:\n    \"\"\"Test Sharepoint reader runtime failure scenarios.\n\n    This test covers failures that happen during file processing (for example schema\n    mismatches, empty files, or folder contents that result in non readable CSVs).\n    These are different from `test_sharepoint_reader_exceptions`, which validates\n    fail-fast configuration errors (invalid option combinations, unsupported file\n    types) that should raise before any file processing starts.\n    For runtime failures where archiving is enabled, the reader should move the\n    problematic file(s) to the configured error subfolder (default: \"error\").\n    The assertions at the end verify:\n    - the job failed with the expected error message\n    - archiving was invoked with `move_enabled=True`\n    - the archive target folder matches the expected error subfolder\n    - the archived file is one of the files involved in the scenario\n    \"\"\"\n    scenario_name = scenario[1]\n    expected_error_regex = scenario[2]\n\n    mock_archive_sharepoint_file.return_value = None\n    mock_make_request.return_value = None\n\n    should_assert_no_archive_calls = False\n    expected_error_subfolder = \"error\"\n    allowed_file_names: Set[str] = set()\n\n    should_patch_spark_load = False\n\n    # Scenario-specific mocking + expectations (no load_data here)\n    if \"schema_mismatch\" in scenario_name:\n        expected_error_subfolder = (\n            \"failed\" if \"custom_error_subfolder\" in scenario_name else \"error\"\n        )\n        allowed_file_names = {\"sample_1.csv\", \"bad_schema.csv\"}\n\n        mock_list_items_in_path.return_value = [\n            {\"name\": \"sample_1.csv\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n            {\n                \"name\": \"bad_schema.csv\",\n                \"createdDateTime\": \"\",\n                \"lastModifiedDateTime\": \"\",\n            },\n        ]\n\n        file_bytes_by_path: Dict[str, bytes] = {\n            \"sp_test/sample_1.csv\": _read_bytes(\n                f\"{TEST_RESOURCES}/reader/data/sample_1.csv\"\n            ),\n            \"sp_test/bad_schema.csv\": _read_bytes(\n                
f\"{TEST_RESOURCES}/reader/data/bad_schema.csv\"\n            ),\n        }\n\n        def get_file_metadata_side_effect(file_path: str) -> SharepointFile:\n            return SharepointFile(\n                file_name=file_path.split(\"/\")[-1],\n                time_created=\"\",\n                time_modified=\"\",\n                content=file_bytes_by_path[file_path],\n                _folder=file_path.rsplit(\"/\", 1)[0],\n            )\n\n        mock_get_file_metadata.side_effect = get_file_metadata_side_effect\n\n    elif scenario_name == \"read_single_csv_empty_file_should_archive_error\":\n        allowed_file_names = {\"empty.csv\"}\n\n        def get_file_metadata_side_effect(file_path: str) -> SharepointFile:\n            return SharepointFile(\n                file_name=\"empty.csv\",\n                time_created=\"\",\n                time_modified=\"\",\n                content=b\"\",\n                _folder=\"sp_test\",\n            )\n\n        mock_get_file_metadata.side_effect = get_file_metadata_side_effect\n\n    elif scenario_name == \"read_folder_csv_no_csv_files_should_fail\":\n        should_assert_no_archive_calls = True\n        mock_list_items_in_path.return_value = [\n            {\"name\": \"ignore.txt\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n            {\"name\": \"readme.md\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n        ]\n\n    elif scenario_name == \"read_folder_csv_pattern_matches_no_files_should_fail\":\n        should_assert_no_archive_calls = True\n        mock_list_items_in_path.return_value = [\n            {\"name\": \"sample_1.csv\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n            {\"name\": \"sample_2.csv\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n            {\"name\": \"other.csv\", \"createdDateTime\": \"\", \"lastModifiedDateTime\": \"\"},\n        ]\n\n    elif scenario_name == \"read_single_csv_download_error_should_archive_error\":\n        allowed_file_names = {\"sample_1.csv\"}\n\n        first_sharepoint_file = SharepointFile(\n            file_name=\"sample_1.csv\",\n            time_created=\"\",\n            time_modified=\"\",\n            content=b\"not-empty\",\n            _folder=\"sp_test\",\n        )\n\n        mock_get_file_metadata.side_effect = [\n            first_sharepoint_file,\n            ValueError(\"Download failed\"),\n        ]\n    elif scenario_name == \"read_single_csv_spark_load_fails_should_archive_error\":\n        should_patch_spark_load = True\n        allowed_file_names = {\"sample_1.csv\"}\n\n        sp_file_first = SharepointFile(\n            file_name=\"sample_1.csv\",\n            time_created=\"\",\n            time_modified=\"\",\n            content=b\"col_a,col_b\\n1,2\\n\",\n            _folder=\"sp_test\",\n        )\n\n        sp_file_second = SharepointFile(\n            file_name=\"sample_1.csv\",\n            time_created=\"\",\n            time_modified=\"\",\n            content=b\"col_a,col_b\\n1,2\\n\",\n            _folder=\"sp_test\",\n        )\n\n        mock_get_file_metadata.side_effect = [sp_file_first, sp_file_second]\n\n    else:\n        raise ValueError(f\"Unhandled failure scenario: {scenario_name}\")\n\n    # Execute + assert error (exactly once per scenario)\n    acon_path = f\"file://{TEST_RESOURCES}/{scenario[0]}/acons/{scenario_name}.json\"\n\n    if should_patch_spark_load:\n        fake_local_file: Path = tmp_path / \"fake.csv\"\n        
fake_local_file.write_text(\"dummy\")\n        with (\n            patch(\n                \"lakehouse_engine.utils.sharepoint_utils.\"\n                \"SharepointUtils.save_to_staging_area\",\n                return_value=str(fake_local_file),\n            ),\n            patch(\n                \"pyspark.sql.readwriter.DataFrameReader.load\",\n                side_effect=Exception(\"Spark load failed\"),\n            ),\n        ):\n            with pytest.raises(ValueError, match=expected_error_regex):\n                load_data(acon_path)\n    else:\n        with pytest.raises(ValueError, match=expected_error_regex):\n            load_data(acon_path)\n\n    # For scenarios that fail before reading any CSV file (folder contains no CSVs, or\n    # the pattern filters everything out), there is no concrete CSV file to archive.\n    # We assert no archive attempts are made.\n    if should_assert_no_archive_calls:\n        assert mock_archive_sharepoint_file.call_count == 0\n        assert mock_get_file_metadata.call_count == 0\n        return\n\n    # For processing-time failures, the reader should attempt to archive the failing\n    # file(s) into the configured error subfolder (default: \"error\").\n    # We assert at least one archive call targeted that error folder with move enabled,\n    # and that the archived file belongs to this scenario.\n    error_calls = [\n        c\n        for c in mock_archive_sharepoint_file.call_args_list\n        if (c.kwargs.get(\"to_path\") or \"\").endswith(f\"/{expected_error_subfolder}\")\n    ]\n    assert len(error_calls) >= 1\n\n    for c in error_calls:\n        assert c.kwargs[\"move_enabled\"] is True\n        sp_file = c.kwargs.get(\"sp_file\")\n        assert sp_file is not None\n        assert sp_file.file_name in allowed_file_names\n\n\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS_READER_EXCEPTIONS)\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._create_app\")\n@patch(\n    \"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._get_token\",\n    return_value=\"fake-token\",\n)\n@patch(\n    \"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._make_request\",\n    side_effect=[\n        # site id\n        MockRESTResponse(\n            status_code=200,\n            json_data=json.loads(\n                open(f\"{TEST_RESOURCES}/reader/mocks/get_site_id.json\").read()\n            ),\n        ),\n        # drive id\n        MockRESTResponse(\n            status_code=200,\n            json_data=json.loads(\n                open(f\"{TEST_RESOURCES}/reader/mocks/get_drive_id.json\").read()\n            ),\n        ),\n    ],\n)\n@patch(\n    \"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.check_if_endpoint_exists\",\n    return_value=True,\n)\ndef test_sharepoint_reader_exceptions(\n    mock_check_if_endpoint_exists: Any,\n    mock_make_request: Any,\n    mock_get_token: Any,\n    mock_create_app: Any,\n    scenario: List[str],\n) -> None:\n    \"\"\"Test Sharepoint reader invalid configs that must fail fast.\"\"\"\n    scenario_name = scenario[1]\n\n    if scenario_name == \"read_folder_path_does_not_exist_should_fail\":\n        mock_check_if_endpoint_exists.return_value = False\n\n    with pytest.raises(ValueError, match=scenario[2]):\n        load_data(f\"file://{TEST_RESOURCES}/{scenario[0]}/acons/{scenario_name}.json\")\n"
  },
  {
    "path": "tests/feature/test_sharepoint_writer.py",
    "content": "\"\"\"Test Sharepoint utils.\"\"\"\n\nimport json\nfrom typing import Any, List\nfrom unittest.mock import MagicMock, patch\n\nimport pytest\n\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.io.exceptions import (\n    EndpointNotFoundException,\n    InputNotFoundException,\n    NotSupportedException,\n)\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.local_storage import LocalStorage\nfrom tests.utils.mocks import MockRESTResponse\n\n\"\"\"\nTests for Sharepoint-related utilities and functionality.\n\nThis test suite validates the behavior of the Sharepoint writer, ensuring\nthat it handles various scenarios correctly. The tests cover validation of\nmandatory inputs, unsupported operations, endpoint existence checks, and\nsuccessful writing to Sharepoint.\n\nScenarios tested:\n- Attempting to use streaming with the Sharepoint writer raises a\n  `NotSupportedException`.\n- Missing mandatory options (`site_name`, `drive_name`, `local_path`) raises\n  an `InputNotFoundException`.\n- Providing an invalid endpoint raises an `EndpointNotFoundException`.\n- Successful writing to Sharepoint and associated log validation.\n\nMocks:\n- `SharepointWriter._get_sharepoint_utils` is patched to simulate the behavior\n  of the Sharepoint utilities without making actual external calls.\n- Mock REST responses simulate Sharepoint API interactions for success cases.\n\nDependencies:\n- Uses pytest for parameterized testing of different scenarios.\n- Relies on a local storage utility for preparing test data and file operations.\n\"\"\"\n\nTEST_NAME = \"sharepoint\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\n\nTEST_SCENARIOS_EXCEPTIONS = [\n    [\n        \"streaming_exception\",\n        \"Sharepoint writer doesn't support streaming!\",\n    ],\n    [\n        \"drive_exception\",\n        \"Please provide all mandatory Sharepoint options. \\n\"\n        \"Expected: site_name, drive_name and local_path. \"\n        \"Value should not be None.\\n\"\n        \"Provided: site_name=mock_site, \\n\"\n        \"drive_name=, \\n\"\n        \"local_path=mock_path\",\n    ],\n    [\n        \"site_exception\",\n        \"Please provide all mandatory Sharepoint options. \\n\"\n        \"Expected: site_name, drive_name and local_path. \"\n        \"Value should not be None.\\n\"\n        \"Provided: site_name=, \\n\"\n        \"drive_name=mock_drive, \\n\"\n        \"local_path=mock_path\",\n    ],\n    [\n        \"local_path_exception\",\n        \"Please provide all mandatory Sharepoint options. \\n\"\n        \"Expected: site_name, drive_name and local_path. 
\"\n        \"Value should not be None.\\n\"\n        \"Provided: site_name=mock_site, \\n\"\n        \"drive_name=mock_drive, \\n\"\n        \"local_path=\",\n    ],\n    [\"endpoint_exception\", \"The provided endpoint does not exist!\"],\n]\n\nTEST_SCENARIOS_WRITER = [\n    [\n        \"writer\",\n        \"write_to_local_success\",\n        f\"Deleted the local folder: {TEST_LAKEHOUSE_OUT}/writer/data\",\n    ],\n]\n\n\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS_EXCEPTIONS)\n@patch(\n    \"lakehouse_engine.io.writers.sharepoint_writer.SharepointWriter._get_sharepoint_utils\"  # noqa\n)\ndef test_sharepoint_writer_exceptions(\n    mock_get_sharepoint_utils: MagicMock, scenario: List[str]\n) -> None:\n    \"\"\"Test writing to Sharepoint from csv source.\n\n    Args:\n        scenario: scenario to test.\n        mock_get_sharepoint_utils: patch sharepoint_utils.\n    \"\"\"\n    mock_sharepoint_utils = MagicMock()\n    mock_get_sharepoint_utils.return_value = mock_sharepoint_utils\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/file_source.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n\n    if scenario[1] == \"streaming_exception\":\n        with pytest.raises(NotSupportedException, match=scenario[2]):\n            load_data(\n                f\"file://{TEST_RESOURCES}/{scenario[0]}/acons/streaming_exception.json\"\n            )\n    elif scenario[1] == \"site_exception\":\n        with pytest.raises(InputNotFoundException, match=scenario[2]):\n            load_data(\n                f\"file://{TEST_RESOURCES}/{scenario[0]}/acons/site_exception.json\"\n            )\n    elif scenario[1] == \"drive_exception\":\n        with pytest.raises(InputNotFoundException, match=scenario[2]):\n            load_data(\n                f\"file://{TEST_RESOURCES}/{scenario[0]}/acons/drive_exception.json\"\n            )\n    elif scenario[1] == \"local_path_exception\":\n        with pytest.raises(InputNotFoundException, match=scenario[2]):\n            load_data(\n                f\"file://{TEST_RESOURCES}/{scenario[0]}/acons/local_path_exception.json\"\n            )\n    elif scenario[1] == \"endpoint_exception\":\n        mock_sharepoint_utils.check_if_endpoint_exists.return_value = False\n        with pytest.raises(EndpointNotFoundException, match=scenario[2]):\n            load_data(\n                f\"file://{TEST_RESOURCES}/{scenario[0]}/acons/endpoint_exception.json\"\n            )\n\n\n@pytest.mark.parametrize(\"scenario\", TEST_SCENARIOS_WRITER)\n@patch(\n    \"lakehouse_engine.utils.sharepoint_utils.SharepointUtils.check_if_endpoint_exists\",\n    return_value=True,  # noqa\n)\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._create_app\")  # noqa\n@patch(\"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._get_token\")  # noqa\n@patch(\n    \"lakehouse_engine.utils.sharepoint_utils.SharepointUtils._make_request\",\n    side_effect=[\n        MockRESTResponse(\n            status_code=200,\n            json_data=json.loads(\n                open(f\"{TEST_RESOURCES}/writer/mocks/get_site_id.json\").read()\n            ),\n        ),\n        MockRESTResponse(\n            status_code=200,\n            json_data=json.loads(\n                open(f\"{TEST_RESOURCES}/writer/mocks/get_drive_id.json\").read()\n            ),\n        ),\n        MockRESTResponse(\n            status_code=200,\n            json_data=json.loads(\n                open(f\"{TEST_RESOURCES}/writer/mocks/create_upload_session.json\").read()\n  
          ),\n        ),\n        MockRESTResponse(status_code=200),  # final upload to sharepoint\n    ],\n)  # noqa\ndef test_sharepoint_writer(\n    _: Any, __: Any, ___: Any, _make_requests: Any, scenario: List[str], caplog: Any\n) -> None:\n    \"\"\"Test writing to Sharepoint from csv source.\n\n    Args:\n        scenario: scenario to test.\n        caplog: fetch logs.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario[0]}/data/file_source.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n\n    if scenario[0] == \"writer\" and scenario[1] == \"write_to_local_success\":\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/{scenario[0]}/data/file_source.csv\",\n            f\"{TEST_LAKEHOUSE_IN}/data/\",\n        )\n\n        load_data(\n            f\"file://{TEST_RESOURCES}/{scenario[0]}/acons/write_to_local_success.json\"\n        )\n\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/{scenario[0]}/data/file_source.csv\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/{scenario[0]}/data/\",\n        )\n\n        assert scenario[2] in caplog.text\n"
  },
  {
    "path": "tests/feature/test_table_manager.py",
    "content": "\"\"\"Test table manager.\"\"\"\n\nimport logging\nfrom typing import Any\n\nimport pytest\nfrom pyspark.sql.utils import AnalysisException\n\nfrom lakehouse_engine.engine import manage_table\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"table_manager\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenarios\",\n    [\n        {\n            \"table_and_view_name\": [\"SimpleSplitScenario\"],\n            \"locations_name\": [\"simple_split_scenario\"],\n            \"create_tbl_sql\": \"test_table_simple_split_scenario.sql\",\n            \"create_tbl_json\": \"acon_create_table_simple_split_scenario\",\n            \"execute_sql_json\": \"acon_execute_sql_simple_split_scenario\",\n            \"create_vw_sql\": \"test_view_simple_split_scenario\",\n            \"create_vw_json\": \"acon_create_view_simple_split_scenario\",\n            \"describe_tbl_json\": \"acon_describe_simple_split_scenario\",\n            \"vacuum_tbl_json\": \"acon_vacuum_table_simple_split_scenario\",\n            \"vacuum_loc_json\": \"acon_vacuum_location_simple_split_scenario\",\n            \"optimize_tbl_json_\": \"optimize_table_simple_split_scenario\",\n            \"optimize_loc_json\": \"optimize_location_simple_split_scenario\",\n            \"compute_statistics_tbl_json\": [\"table_stats_simple_split_scenario\"],\n            \"show_tbl_prop_json\": \"show_tbl_properties_simple_split_scenario\",\n            \"tbl_primary_keys_json\": \"get_tbl_pk_simple_split_scenario\",\n            \"drop_vw_json\": \"acon_drop_view_simple_split_scenario\",\n            \"delete_json\": \"acon_delete_where_table_simple_split_scenario\",\n            \"drop_tbl_json\": \"acon_drop_table_simple_split_scenario\",\n        },\n        {\n            \"table_and_view_name\": [\n                \"ComplexDefaultScenario1\",\n                \"ComplexDefaultScenario2\",\n            ],\n            \"locations_name\": [\n                \"complex_default_scenario1\",\n                \"complex_default_scenario2\",\n            ],\n            \"create_tbl_sql\": \"test_table_complex_default_scenario.sql\",\n            \"create_tbl_json\": \"acon_create_table_complex_default_scenario\",\n            \"execute_sql_json\": \"acon_execute_sql_complex_default_scenario\",\n            \"create_vw_sql\": \"test_view_complex_default_scenario\",\n            \"create_vw_json\": \"acon_create_view_complex_default_scenario\",\n            \"compute_statistics_tbl_json\": [\n                \"table_stats_complex_default_scenario1\",\n                \"table_stats_complex_default_scenario2\",\n            ],\n        },\n        {\n            \"table_and_view_name\": [\n                \"ComplexDifferentDelimiterScenario1\",\n                \"ComplexDifferentDelimiterScenario2\",\n            ],\n            \"locations_name\": [\n                \"complex_different_delimiter_scenario1\",\n                \"complex_different_delimiter_scenario2\",\n            ],\n            \"create_tbl_sql\": \"test_table_complex_different_delimiter_scenario.sql\",\n            \"create_tbl_json\": \"acon_create_table_complex_different_delimiter_scenario\",\n            \"execute_sql_json\": 
\"acon_execute_sql_complex_different_delimiter_scenario\",\n            \"create_vw_sql\": \"test_view_complex_different_delimiter_scenario\",\n            \"create_vw_json\": \"acon_create_view_complex_different_delimiter_scenario\",\n            \"compute_statistics_tbl_json\": [\n                \"table_stats_complex_different_delimiter_scenario1\",\n                \"table_stats_complex_different_delimiter_scenario2\",\n            ],\n        },\n    ],\n)\ndef test_table_manager(scenarios: dict, caplog: Any) -> None:\n    \"\"\"Test functions from table manager.\n\n    Args:\n        scenarios: scenarios to test.\n        caplog: captured log.\n    \"\"\"\n    with caplog.at_level(logging.INFO):\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/create/table/{scenarios['create_tbl_sql']}\",\n            f\"{TEST_LAKEHOUSE_IN}/create/table/\",\n        )\n\n        manage_table(\n            f\"file://{TEST_RESOURCES}/create/{scenarios['create_tbl_json']}.json\"\n        )\n        assert \"create_table successfully executed!\" in caplog.text\n\n        manage_table(\n            f\"file://{TEST_RESOURCES}/execute_sql/{scenarios['execute_sql_json']}.json\"\n        )\n        assert \"sql successfully executed!\" in caplog.text\n\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/create/view/{scenarios['create_vw_sql']}.sql\",\n            f\"{TEST_LAKEHOUSE_IN}/create/view/\",\n        )\n\n        manage_table(\n            f\"file://{TEST_RESOURCES}/create/{scenarios['create_vw_json']}.json\"\n        )\n        assert \"create_view successfully executed!\" in caplog.text\n\n        if scenarios.get(\"describe_tbl_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/describe/\"\n                f\"{scenarios['describe_tbl_json']}.json\"\n            )\n            assert (\n                \"DataFrame[col_name: string, data_type: string, comment: string]\"\n                in caplog.text\n            )\n\n        if scenarios.get(\"vacuum_tbl_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/vacuum/{scenarios['vacuum_tbl_json']}.json\"\n            )\n            assert (\n                \"Vacuuming table: test_db.DummyTableBronzeSimpleSplitScenario\"\n                in caplog.text\n            )\n\n        if scenarios.get(\"vacuum_loc_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/vacuum/{scenarios['vacuum_loc_json']}.json\"\n            )\n            assert (\n                \"Vacuuming location: file:///app/tests/lakehouse/out/feature/\"\n                \"table_manager/dummy_table_bronze/data_simple_split_scenario\"\n                in caplog.text\n            )\n\n        if scenarios.get(\"optimize_tbl_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/optimize/\"\n                f\"{scenarios['optimize_tbl_json']}.json\"\n            )\n            assert (\n                \"sql command: OPTIMIZE test_db.DummyTableBronzeSimpleSplitScenario \"\n                \"WHERE year >= 2021 and month >= 09 and day > 01 ZORDER BY (col1,col2)\"\n                in caplog.text\n            )\n\n        if scenarios.get(\"optimize_loc_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/optimize/\"\n                f\"{scenarios['optimize_loc_json']}.json\"\n            )\n            assert (\n                f\"sql command: OPTIMIZE 
delta.`file://{TEST_LAKEHOUSE_OUT}/\"\n                \"dummy_table_bronze/data_simple_split_scenario` WHERE year >= 2021 \"\n                \"and month >= 09 and day > 01 ZORDER BY (col1,col2)\" in caplog.text\n            )\n\n        with pytest.raises(\n            AnalysisException, match=\".*ANALYZE TABLE is not supported for v2 tables.*\"\n        ):\n            # compute table stats is still not supported in current OS delta lake.\n            if scenarios.get(\"compute_statistics_tbl_json\") is not None:\n                for (\n                    compute_statistics_table_index,\n                    compute_statistics_table_json_file,\n                ) in enumerate(scenarios[\"compute_statistics_tbl_json\"]):\n                    manage_table(\n                        f\"file://{TEST_RESOURCES}/compute_table_statistics/\"\n                        f\"{compute_statistics_table_json_file}.json\"\n                    )\n                    scenario_name = scenarios[\"table_and_view_name\"][\n                        compute_statistics_table_index\n                    ]\n                    assert (\n                        \"sql command: ANALYZE TABLE test_db.DummyTable\"\n                        f\"Bronze{scenario_name} COMPUTE STATISTICS\" in caplog.text\n                    )\n\n        if scenarios.get(\"show_tbl_prop_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/show_tbl_properties/\"\n                f\"{scenarios['show_tbl_prop_json']}.json\"\n            )\n            assert (\n                \"sql command: SHOW TBLPROPERTIES test_db.DummyTable\"\n                \"BronzeSimpleSplitScenario\" in caplog.text\n            )\n\n        if scenarios.get(\"tbl_primary_keys_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/get_tbl_pk/\"\n                f\"{scenarios['tbl_primary_keys_json']}.json\"\n            )\n            assert \"['id', 'col1']\" in caplog.text\n\n        if scenarios.get(\"drop_vw_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/drop/{scenarios['drop_vw_json']}.json\"\n            )\n            assert \"View successfully dropped!\" in caplog.text\n\n        if scenarios.get(\"delete_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/delete/{scenarios['delete_json']}.json\"\n            )\n            assert (\n                \"sql command: DELETE FROM test_db.DummyTable\"\n                \"BronzeSimpleSplitScenario WHERE year=2021\"\n                in caplog.text  # nosec: B608\n            )\n\n        if scenarios.get(\"drop_tbl_json\") is not None:\n            manage_table(\n                f\"file://{TEST_RESOURCES}/drop/{scenarios['drop_tbl_json']}.json\"\n            )\n            assert \"Table successfully dropped!\" in caplog.text\n"
  },
  {
    "path": "tests/feature/test_writers.py",
    "content": "\"\"\"Test engine writers.\n\nDelta merge tests writers weren't added because it is always batch,\nmicro batch or normal batch, but always batch. Also, we have another\ntest like delta_load that uses delta_merge_writer.\nKafka writer weren't added also, because we cannot\nsimulate kafka on local tests. All other writers were covered.\n\"\"\"\n\nimport logging\nimport os\nimport random\nimport string\nfrom collections import namedtuple\nfrom typing import Any, Optional, OrderedDict\nfrom unittest.mock import patch\n\nimport pytest\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.types import StructType\n\nfrom lakehouse_engine.core.definitions import OutputFormat, OutputSpec\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.io.exceptions import NotSupportedException\nfrom lakehouse_engine.io.writers.dataframe_writer import DataFrameWriter\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"writers\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_NAME}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_NAME}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_NAME}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_NAME}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"write_batch_files\"},\n        {\"scenario_name\": \"write_streaming_files\"},\n        {\n            \"scenario_name\": \"write_streaming_foreachBatch_files\",\n        },\n    ],\n)\ndef test_write_to_files(scenario: dict) -> None:\n    \"\"\"Test file writer.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    _prepare_files()\n\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/acons/{scenario['scenario_name']}.json\"\n    )\n    load_data(acon=acon)\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/writers_control.csv\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"write_batch_rest_api\"},\n        {\"scenario_name\": \"write_streaming_rest_api\"},\n    ],\n)\ndef test_write_to_rest_api(scenario: dict) -> None:\n    \"\"\"Test rest api writer.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    _prepare_files()\n\n    RestResponse = namedtuple(\"RestResponse\", \"status_code text\")\n\n    with patch(\n        \"lakehouse_engine.io.writers.rest_api_writer.execute_api_request\",\n        return_value=RestResponse(status_code=200, text=\"ok\"),\n    ):\n        acon = ConfigUtils.get_acon(\n            f\"file://{TEST_RESOURCES}/acons/{scenario['scenario_name']}.json\"\n        )\n        load_data(acon=acon)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"write_batch_jdbc\"},\n        {\"scenario_name\": \"write_streaming_foreachBatch_jdbc\"},\n    ],\n)\ndef test_write_to_jdbc(scenario: dict) -> None:\n    \"\"\"Test jdbc writer.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    
_prepare_files()\n\n    os.mkdir(f\"{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}/\")\n\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/acons/{scenario['scenario_name']}.json\"\n    )\n    load_data(acon=acon)\n\n    result_df = DataframeHelpers.read_from_jdbc(\n        f\"jdbc:sqlite:{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}/test.db\",\n        f\"{scenario['scenario_name']}\",\n    )\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/writers_control.csv\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"write_batch_table\"},\n        {\"scenario_name\": \"write_streaming_table\"},\n        {\"scenario_name\": \"write_streaming_foreachBatch_table\"},\n    ],\n)\ndef test_write_to_table(scenario: dict) -> None:\n    \"\"\"Test table writer.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    _prepare_files()\n\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/acons/{scenario['scenario_name']}.json\"\n    )\n    load_data(acon=acon)\n\n    result_df = DataframeHelpers.read_from_table(f\"test_db.{scenario['scenario_name']}\")\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/writers_control.csv\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"write_batch_console\"},\n        {\"scenario_name\": \"write_streaming_console\"},\n        {\"scenario_name\": \"write_streaming_foreachBatch_console\"},\n    ],\n)\ndef test_write_to_console(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test console writer.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    _prepare_files()\n\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/acons/{scenario['scenario_name']}.json\"\n    )\n    load_data(acon=acon)\n\n    captured = capsys.readouterr()\n\n    logging.info(captured.out)\n\n    assert \"20140601|customer1|article3|\" in captured.out\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"write_batch_dataframe\"},\n        {\"scenario_name\": \"write_streaming_dataframe\"},\n        {\"scenario_name\": \"write_streaming_foreachBatch_dataframe\"},\n    ],\n)\ndef test_write_to_dataframe(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test dataframe writer returning the output by OutputSpec.\n\n    Description of the test scenarios:\n        - write_batch_dataframe - test writing a DataFrame from two batch sources,\n        uniting both sources.\n        A DataFrame containing the data from both sources is generated.\n        - write_streaming_dataframe - similar to write_batch_dataframe but inputting\n        data from a stream.\n        - write_streaming_foreachBatch_dataframe - similar to write_batch_dataframe but\n        mixing batch and streaming,\n        so the first source comes from batch and the second from a stream.\n        This test has the responsibility to\n        execute the writer using the micro batch strategy.\n\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_IN}/source\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}\")\n    _prepare_files()\n\n    result = 
load_data(\n        f\"file://{TEST_RESOURCES}/acons/{scenario['scenario_name']}.json\"\n    )\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/writers_control.csv\"\n    )\n    expected_keys = [\"sales\"]\n\n    assert not DataframeHelpers.has_diff(result.get(\"sales\"), control_df)\n    assert len(result.keys()) == len(expected_keys)\n    assert all(\n        subject == expected for subject, expected in zip(result.keys(), expected_keys)\n    )\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"write_streaming_df_with_checkpoint\",\n            \"control\": \"streaming_dataframe\",\n        },\n        {\n            \"scenario_name\": \"write_streaming_foreachBatch_df_with_checkpoint\",\n            \"control\": \"streaming_dataframe_foreachBatch\",\n        },\n    ],\n)\ndef test_write_to_dataframe_checkpoints(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test dataframe writer using checkpoint for the next run.\n\n    In this test our InputSpecs have the option `maxFilesPerTrigger`,\n    this option forces our stream to read a maximum files per iteration,\n    this property also needs to have a checkpoint location\n    because spark internally needs to control the state of reading the\n    files.\n\n    Description of the test scenarios:\n        - write_streaming_dataframe - test if the checkpoint is working\n         as expected when writing the data\n         from stream to DataFrame.\n         We have two different input files for each source\n         we expect to read just the first\n         in the first execution and the second in the next one.\n         - write_streaming_foreachBatch_dataframe - test if the\n         checkpoint is working as expected when writing\n         the data from stream and batch using\n         the micro batch strategy to DataFrame.\n         As we have two different input files for each source\n         we expect to read just the first file\n         in the first execution and the second in the\n         next one for the stream with checkpoint source.\n         On the batch source we expect to read the first\n         file in the first run and both files in the second run.\n\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_IN}/source\")\n    LocalStorage.clean_folder(f\"{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}\")\n\n    for iteration in range(1, 2):\n        _prepare_files(iteration)\n        result = load_data(\n            f\"file://{TEST_RESOURCES}/acons/{scenario['scenario_name']}.json\"\n        )\n\n        control_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_CONTROL}/data/\"\n            f\"writers_control_{scenario['control']}_{iteration}.csv\"\n        )\n        expected_keys = [\"sales\"]\n\n        assert not DataframeHelpers.has_diff(result.get(\"sales\"), control_df)\n        assert len(result.keys()) == len(expected_keys)\n        assert all(\n            subject == expected\n            for subject, expected in zip(result.keys(), expected_keys)\n        )\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"write_streaming_multiple_dfs\"},\n    ],\n)\ndef test_multiple_write_to_dataframe(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test dataframe writer chaining ACON calls.\n\n    This test have the objective to demonstrate how you can use\n    the output from 
an ACON as input to another ACON,\n    showing the flexibility that this writer unlocks.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    _prepare_files()\n\n    multiple_df_result = load_data(\n        f\"file://{TEST_RESOURCES}/acons/{scenario['scenario_name']}.json\"\n    )\n\n    generated_acon = _generate_acon_from_source(multiple_df_result)\n\n    result = load_data(acon=generated_acon)\n    result_keys = list(multiple_df_result.keys()) + list(result.keys())\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/writers_control.csv\"\n    )\n    expected_keys = [\"sales_historical\", \"sales_new\", \"sales\"]\n\n    assert not DataframeHelpers.has_diff(result.get(\"sales\"), control_df)\n    assert len(result_keys) == len(expected_keys)\n    assert all(\n        subject == expected for subject, expected in zip(result_keys, expected_keys)\n    )\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"write_streaming_processing_time_dataframe\",\n            \"streaming_processing_time\": \"2 seconds\",\n        },\n        {\n            \"scenario_name\": \"write_streaming_continuous_dataframe\",\n            \"streaming_continuous\": \"2 seconds\",\n        },\n    ],\n)\ndef test_write_to_dataframe_exception(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test expected exception for dataframe writer on stream cases.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n\n    def dataframe_writer(\n        df: DataFrame = None,\n        data: OrderedDict = None,\n        streaming_processing_time: Optional[str] = None,\n        streaming_continuous: Optional[str] = None,\n    ) -> DataFrameWriter:\n        \"\"\"Create DataFrame Writer.\n\n        Args:\n            df: dataframe containing the data to append.\n            data: list of all dfs generated on previous steps before writer.\n            streaming_processing_time: if streaming query is to be kept alive,\n                this indicates the processing time of each micro batch.\n            streaming_continuous: set a trigger that runs\n                a continuous query with a given\n                checkpoint interval.\n        \"\"\"\n        if not df:\n            df = DataframeHelpers.create_empty_dataframe(StructType([]))\n\n        spec = OutputSpec(\n            spec_id=random.choice(string.ascii_letters),  # nosec\n            input_id=random.choice(string.ascii_letters),  # nosec\n            write_type=None,\n            data_format=OutputFormat.DATAFRAME.value,\n            streaming_processing_time=streaming_processing_time,\n            streaming_continuous=streaming_continuous,\n        )\n\n        return DataFrameWriter(output_spec=spec, df=df.coalesce(1), data=data)\n\n    with pytest.raises(NotSupportedException) as exception:\n        dataframe_writer(\n            streaming_processing_time=scenario.get(\"streaming_processing_time\"),\n            streaming_continuous=scenario.get(\"streaming_continuous\"),\n        ).write()\n\n    assert (\n        \"DataFrame writer doesn't support processing time or continuous streaming\"\n        in str(exception.value)\n    )\n\n\ndef _generate_acon_from_source(source: OrderedDict) -> dict:\n    \"\"\"Create an ACON from a dictionary source containing the resulting dataframes.\n\n    Args:\n        source: Dictionary containing source computed dataframes.\n    \"\"\"\n    
return {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"sales_historical\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"dataframe\",\n                \"df_name\": source.get(\"sales_historical\"),\n            },\n            {\n                \"spec_id\": \"sales_new\",\n                \"read_type\": \"batch\",\n                \"data_format\": \"dataframe\",\n                \"df_name\": source.get(\"sales_new\"),\n            },\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"union_dataframes\",\n                \"input_id\": \"sales_historical\",\n                \"transformers\": [\n                    {\"function\": \"union\", \"args\": {\"union_with\": [\"sales_new\"]}}\n                ],\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"sales\",\n                \"input_id\": \"union_dataframes\",\n                \"data_format\": \"dataframe\",\n            }\n        ],\n    }\n\n\ndef _prepare_files(iteration: int = 0) -> None:\n    file_suffix = \"*\" if iteration == 0 else f\"{iteration}\"\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/source/sales_historical_{file_suffix}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/source/sales_historical/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/source/sales_new_{file_suffix}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/source/sales_new/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/schema/*.json\",\n        f\"{TEST_LAKEHOUSE_IN}/schema/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/control/*.*\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n"
  },
  {
    "path": "tests/feature/transformations/__init__.py",
    "content": "\"\"\"Transformations feature tests.\"\"\"\n"
  },
  {
    "path": "tests/feature/transformations/test_chain_transformations.py",
    "content": "\"\"\"Test chain transformer.\"\"\"\n\nfrom typing import Any\n\nimport pytest\nfrom pyspark.sql.utils import StreamingQueryException\n\nfrom lakehouse_engine.core.definitions import InputFormat, OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/chain_transformations\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"batch\"},\n        {\"scenario_name\": \"streaming\"},\n        {\"scenario_name\": \"streaming_batch\"},\n        {\"scenario_name\": \"write_streaming_struct_data\"},\n        {\"scenario_name\": \"write_streaming_struct_data_fail\"},\n    ],\n)\ndef test_chain_transformations(scenario: dict, caplog: Any) -> None:\n    \"\"\"Test chain transformation.\n\n    Args:\n        scenario: scenario to test.\n            batch - scenario where we are using batch dataframes;\n            streaming - scenario where we are using streaming dataframes;\n            streaming_batch - scenario where we are using batch and streaming\n                dataframes;\n            write_streaming_struct_data - scenario where are we making transformations\n                in first place, use this result to apply other transform and write\n                in micro batch;\n            write_streaming_struct_data_fail - scenario where we are trying to use a\n                result from micro batch transformation into another transform, this\n                one should fail because we cannot have dependency from micro batch.\n        caplog: captured log.\n    \"\"\"\n    _prepare_files()\n\n    acon = ConfigUtils.get_acon(\n        f\"file://{TEST_RESOURCES}/acons/{scenario['scenario_name']}.json\"\n    )\n\n    if scenario[\"scenario_name\"] == \"write_streaming_struct_data_fail\":\n        with pytest.raises(\n            StreamingQueryException,\n            match=\".*An exception was raised by the Python Proxy.*\",\n        ):\n            load_data(acon=acon)\n\n        assert (\n            \"A column, variable, or function parameter with name `sample_json_field1` \"\n            \"cannot be resolved.\" in caplog.text\n        )\n    else:\n        load_data(acon=acon)\n\n        result_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}/data\",\n            file_format=OutputFormat.DELTAFILES.value,\n        )\n\n        if scenario[\"scenario_name\"] == \"write_streaming_struct_data\":\n            control_df = DataframeHelpers.read_from_file(\n                f\"{TEST_LAKEHOUSE_CONTROL}/data/struct_data.json\",\n                file_format=InputFormat.JSON.value,\n                options={\"multiLine\": \"true\"},\n            ).select(\n                \"salesorder\",\n                \"item\",\n                \"article\",\n                \"sample_json_field1\",\n                \"sample_json_field4\",\n                \"item_amount_json\",\n            )\n        else:\n         
   control_df = DataframeHelpers.read_from_file(\n                f\"{TEST_LAKEHOUSE_CONTROL}/data/chain_control.csv\"\n            )\n\n        assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\ndef _prepare_files() -> None:\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/source/sales_historical.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/source/sales_historical/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/source/sales_new.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/source/sales_new/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/source/customers.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/source/customers/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/source/struct_data.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/source/struct_data/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/schema/*.json\",\n        f\"{TEST_LAKEHOUSE_IN}/schema/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/control/*.*\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n"
  },
  {
    "path": "tests/feature/transformations/test_column_creators.py",
    "content": "\"\"\"Test Column Creator Transformers.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import InputFormat, OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/column_creators\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\"streaming\", \"batch\"],\n)\ndef test_column_creators(scenario: str) -> None:\n    \"\"\"Test column creators.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/*.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data\",\n        file_format=InputFormat.JSON.value,\n        options={\"multiLine\": \"true\"},\n    ).select(\n        \"salesorder\",\n        \"item\",\n        \"date\",\n        \"customer\",\n        \"article\",\n        \"amount\",\n        \"dummy_string\",\n        \"dummy_int\",\n        \"dummy_double\",\n        \"dummy_boolean\",\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/transformations/test_column_reshapers.py",
    "content": "\"\"\"Test Column Reshaping Transformers.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/column_reshapers\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"type\": \"batch\", \"scenario_name\": \"flatten_schema\"},\n        {\"type\": \"streaming\", \"scenario_name\": \"flatten_schema\"},\n        {\"type\": \"batch\", \"scenario_name\": \"explode_arrays\"},\n        {\"type\": \"streaming\", \"scenario_name\": \"explode_arrays\"},\n        {\"type\": \"batch\", \"scenario_name\": \"flatten_and_explode_arrays_and_maps\"},\n        {\"type\": \"streaming\", \"scenario_name\": \"flatten_and_explode_arrays_and_maps\"},\n    ],\n)\ndef test_column_reshapers(scenario: dict) -> None:\n    \"\"\"Test column reshaping transformers.\n\n    Args:\n        scenario: scenario to test.\n            flatten_schema: This test flattens the struct.\n            explode_arrays: This test explode the array columns specified.\n            flatten_and_explode_arrays_and_maps: This test flattens the struct\n                and explode the array  and map columns specified.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario['scenario_name']}/data/source/*.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario['scenario_name']}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario['scenario_name']}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario['scenario_name']}/\",\n    )\n\n    load_data(\n        f\"file://{TEST_RESOURCES}/{scenario['scenario_name']}/{scenario['type']}.json\"\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario['scenario_name']}/data/control/*.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario['scenario_name']}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario['scenario_name']}/{scenario['type']}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario['scenario_name']}/data/\"\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/transformations/test_data_maskers.py",
    "content": "\"\"\"Test Data Masking Transformers.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/data_maskers\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\"drop_columns\", \"hash_masking\"],\n)\ndef test_data_maskers(scenario: str) -> None:\n    \"\"\"Test data masking transformers.\n\n    Args:\n        scenario: scenario to test.\n            drop_columns - scenario where we mask data by dropping columns;\n            hash_masking - scenario where we mask data by hashing columns.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/*.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/{scenario}.csv\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario}_control_schema.json\"\n        ),\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/transformations/test_date_transformers.py",
    "content": "\"\"\"Test Date Transformers.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/date_transformers\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\"streaming\"],\n)\ndef test_date_transformers(scenario: str) -> None:\n    \"\"\"Test date transformers.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/*.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    ).drop(\"curr_date\")\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/control_schema.json\"\n        ),\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/transformations/test_drop_duplicate_rows.py",
    "content": "\"\"\"Test drop_duplicate_rows function.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import InputFormat, OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/drop_duplicate_rows\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\"batch\"],\n        [\"streaming\"],\n    ],\n)\ndef test_drop_duplicate_rows(scenario: str) -> None:\n    \"\"\"Tests drop duplicate rows transformer available in the ACON transform_specs.\n\n    Args:\n        scenario: scenario to test.\n            batch - test the transformer utilization in batch mode.\n                The transformer is tested 3 times: 1) without providing arguments;\n                2) providing an empty list ([]); and 3) providing a list with\n                columns names ([\"order_number\",\"item_number\"]). This happens\n                using 3 different dataframes saved in different locations\n                specified in the ACON. In the 2 first times, the transformer\n                should have the same behaviour has using the pyspark\n                function distinct().\n            streaming - the same as batch but using streaming.\n\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/{scenario[0]}_*.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario[0]}.json\")\n    load_data(acon=acon)\n\n    control_drop_duplicates = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/{scenario[0]}_drop_duplicates.json\",\n        file_format=InputFormat.JSON.value,\n        options={\"multiLine\": \"true\"},\n    )\n\n    control_distinct = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/{scenario[0]}_distinct.json\",\n        file_format=InputFormat.JSON.value,\n        options={\"multiLine\": \"true\"},\n    )\n\n    df_transform_columns = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/columns/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n    assert not DataframeHelpers.has_diff(df_transform_columns, control_drop_duplicates)\n\n    df_transform_no_args = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/orders_duplicate_no_args/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n    assert not DataframeHelpers.has_diff(df_transform_no_args, control_distinct)\n\n    df_transform_empty = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/orders_duplicate_empty/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n    assert not 
DataframeHelpers.has_diff(df_transform_empty, control_distinct)\n"
  },
  {
    "path": "tests/feature/transformations/test_joiners.py",
    "content": "\"\"\"Test Join Transformers.\"\"\"\n\nfrom typing import List\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/joiners\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\"streaming\", \"control_scenario_1_and_2\"],\n        [\"streaming_without_broadcast\", \"control_scenario_1_and_2\"],\n        [\"streaming_without_column_rename\", \"control_scenario_3\"],\n        [\"streaming_foreachBatch\", \"control_scenario_1_and_2\"],\n        [\"batch\", \"control_scenario_1_and_2\"],\n    ],\n)\ndef test_joiners(scenario: List[str]) -> None:\n    \"\"\"Test join transformers.\n\n    Args:\n        scenario: scenario to test.\n            streaming - join streaming scenario.\n            streaming_without_broadcast - same as streaming scenario but without\n            broadcast join. Note: also differs by partitioning by customer and date,\n            not only date.\n            streaming_without_column_rename - same as streaming scenario but without\n            renaming name column to customer_name.\n            streaming_foreachBatch - join streaming scenario in foreachBatch mode.\n            batch - join batch scenario.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/customer-part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/customers/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/sales-part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/sales/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/\",\n    )\n\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario[0]}.json\")\n\n    if scenario[0] != \"batch\":\n        load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/sales-part-02.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/sales/\",\n    )\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/*.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/{scenario[1]}.csv\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario[1]}_schema.json\"\n        ),\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/transformations/test_multiple_transformations.py",
    "content": "\"\"\"Test multiple transformations and output specs on the same ACON.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import InputFormat, OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/multiple_transform\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\"batch\"],\n)\ndef test_multiple_transformations(scenario: str) -> None:\n    \"\"\"Tests multiple transformations available in the ACON transform_specs.\\\n    Transformations are saved in different locations, according to the output_specs.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/*.json\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    result_transform_df1 = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/orders_customer_cols/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n    result_transform_df2 = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/orders_kpi_cols/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data\",\n        file_format=InputFormat.JSON.value,\n        options={\"multiLine\": \"true\"},\n    )\n\n    assert not DataframeHelpers.has_diff(\n        result_transform_df1, control_df.select(\"date\", \"country\", \"customer_number\")\n    )\n    assert not DataframeHelpers.has_diff(\n        result_transform_df2, control_df.select(\"date\", \"city\", \"amount\")\n    )\n"
  },
  {
    "path": "tests/feature/transformations/test_null_handlers.py",
    "content": "\"\"\"Test Null Handler Transformers.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/null_handlers\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\"replace_nulls\", \"replace_nulls_col_subset\"],\n)\ndef test_replace_nulls(scenario: str) -> None:\n    \"\"\"Test date transformers.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/control/*.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    ).drop(\"curr_date\")\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/{scenario}.csv\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/control_schema.json\"\n        ),\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/transformations/test_optimizers.py",
    "content": "\"\"\"Test Optimizer transformers.\"\"\"\n\nimport pytest\nfrom pyspark.sql.dataframe import DataFrame\n\nfrom lakehouse_engine.engine import load_data\nfrom tests.conftest import FEATURE_RESOURCES, LAKEHOUSE_FEATURE_IN\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/optimizers\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\n\n\ndef is_df_cached(df: DataFrame) -> DataFrame:\n    \"\"\"Check if the dataframe is cached.\n\n    Args:\n        df: DataFrame passed as input.\n\n    Returns:\n        DataFrame: same as the input DataFrame.\n    \"\"\"\n    if not df.is_cached:\n        raise Exception\n\n    return df\n\n\ndef is_df_not_cached(df: DataFrame) -> DataFrame:\n    \"\"\"Check if the dataframe is not cached.\n\n    Args:\n        df: DataFrame passed as input.\n\n    Returns:\n        DataFrame: same as the input DataFrame.\n    \"\"\"\n    if df.is_cached:\n        raise Exception\n\n    return df\n\n\n@pytest.mark.parametrize(\"scenario\", [\"batch\", \"streaming\"])\ndef test_optimizer(scenario: str) -> None:\n    \"\"\"Test the optimizer transformer both in batch and streaming.\"\"\"\n    acon = _get_test_acon(scenario)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/\",\n    )\n\n    load_data(acon=acon)\n\n\ndef _get_test_acon(read_type: str) -> dict:\n    \"\"\"Creates a test ACON with the desired logic for the algorithm.\n\n    Args:\n        read_type: the read type (streaming or batch).\n\n    Returns:\n        dict: the ACON for the algorithm configuration.\n    \"\"\"\n    acon = {\n        \"input_specs\": [\n            {\n                \"spec_id\": \"sales_source\",\n                \"read_type\": read_type,\n                \"data_format\": \"csv\",\n                \"options\": {\"header\": True, \"delimiter\": \"|\", \"inferSchema\": True},\n                \"location\": f\"file:///{TEST_LAKEHOUSE_IN}/data/\",\n            }\n        ],\n        \"transform_specs\": [\n            {\n                \"spec_id\": \"transformed_sales_source\",\n                \"input_id\": \"sales_source\",\n                \"transformers\": [\n                    {\n                        \"function\": \"persist\",\n                        \"args\": {\"storage_level\": \"MEMORY_AND_DISK\"},\n                    },\n                    {\n                        \"function\": \"custom_transformation\",\n                        \"args\": {\"custom_transformer\": is_df_cached},\n                    },\n                    {\n                        \"function\": \"unpersist\",\n                    },\n                    {\n                        \"function\": \"custom_transformation\",\n                        \"args\": {\"custom_transformer\": is_df_not_cached},\n                    },\n                    {\n                        \"function\": \"cache\",\n                    },\n                    {\n                        \"function\": \"custom_transformation\",\n                        \"args\": {\"custom_transformer\": is_df_cached},\n                    },\n                ],\n            }\n        ],\n        \"output_specs\": [\n            {\n                \"spec_id\": \"sales_bronze\",\n                \"input_id\": \"transformed_sales_source\",\n                \"data_format\": \"console\",\n            }\n        ],\n    }\n\n    if read_type == \"streaming\":\n     
   acon[\"transform_specs\"][0][  # type: ignore\n            \"force_streaming_foreach_batch_processing\"\n        ] = True\n        acon[\"exec_env\"] = {\"spark.sql.streaming.schemaInference\": True}\n\n    return acon\n"
  },
  {
    "path": "tests/feature/transformations/test_regex_transformers.py",
    "content": "\"\"\"Test Regex Transformers.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/regex_transformers\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\"with_regex_value\"],\n)\ndef test_regex_transformers(scenario: str) -> None:\n    \"\"\"Test regex transformers.\n\n    Args:\n        scenario: scenario to test.\n            with_regex_value - test with_regex_value feature in the regex transformers.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/source/*.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/data/\",\n    )\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario}/\",\n    )\n    acon = ConfigUtils.get_acon(f\"file://{TEST_RESOURCES}/{scenario}/batch.json\")\n    load_data(acon=acon)\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario}/data/control/part-01.csv\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data/\",\n    )\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/{scenario}/data\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario}/control_schema.json\"\n        ),\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n"
  },
  {
    "path": "tests/feature/transformations/test_unions.py",
    "content": "\"\"\"Test Union Transformers.\"\"\"\n\nfrom typing import List\n\nimport pytest\nfrom pyspark.sql.utils import AnalysisException\n\nfrom lakehouse_engine.core.definitions import OutputFormat\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/unions\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        [\"batch\", \"union\", \"control_sales\"],\n        [\"batch\", \"union_diff_schema\", \"\"],\n        [\"batch\", \"unionByName\", \"control_sales\"],\n        [\"batch\", \"unionByName_diff_schema\", \"control_sales_shipment\"],\n        [\"batch\", \"unionByName_diff_schema_error\", \"\"],\n        [\"streaming\", \"union\", \"control_sales_streaming\"],\n        [\"streaming\", \"unionByName_diff_schema\", \"control_sales_shipment_streaming\"],\n        [\"streaming\", \"union_foreachBatch\", \"control_sales_streaming_foreachBatch\"],\n        [\n            \"streaming\",\n            \"unionByName_diff_schema_foreachBatch\",\n            \"control_sales_shipment_streaming_foreachBatch\",\n        ],\n    ],\n)\ndef test_unions(scenario: List[str]) -> None:\n    \"\"\"Test union transformers.\n\n    Args:\n        scenario: scenario to test.\n            batch_union - union batch scenario, using union function based on\n            columns' position.\n            batch_union_diff_schema - same as batch_union scenario but tries\n            to union data with different schema, throwing an exception.\n            batch_unionByName - union batch scenario, using unionByName\n            function based on columns' names.\n            batch_unionByName_diff_schema - same as batch_unionByName\n            scenario but allows the union of datasets with different schemas\n            enabling the allowMissingColumns param.\n            batch_unionByName_diff_schema_error - same as\n            batch_unionByName_diff_schema but disabling the allowMissingColumns\n            param and therefore, throwing an exception.\n            streaming_union - union streaming scenario, using union function\n            based on columns' position.\n            streaming_unionByName_diff_schema - union streaming scenario,\n            using unionByName function based on columns' names and allowing the\n            union of datasets with different schemas.\n            streaming_union_foreachBatch - union streaming scenario, using union\n            function based on columns' position in foreachBatch mode.\n            streaming_unionByName_diff_schema_foreachBatch - union streaming scenario,\n            using unionByName function based on columns' names and allowing the\n            union of datasets with different schemas in foreachBatch mode.\n\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/\",\n    )\n\n    copy_data_files(1)\n\n    acon = ConfigUtils.get_acon(\n        
f\"file://{TEST_RESOURCES}/{scenario[0]}_{scenario[1]}.json\"\n    )\n\n    if \"union_diff_schema\" in scenario[1] or \"error\" in scenario[1]:\n        with pytest.raises(\n            AnalysisException,\n            match=\".*UNION can only be performed on inputs with the same number.*\",\n        ):\n            load_data(acon=acon)\n\n    else:\n        if scenario[0] != \"batch\":\n            load_data(acon=acon)\n\n            copy_data_files(2)\n\n        load_data(acon=acon)\n\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/data/control/*.csv\",\n            f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n        )\n\n        result_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_OUT}/{scenario[0]}_{scenario[1]}/data\",\n            file_format=OutputFormat.DELTAFILES.value,\n        )\n        control_df = DataframeHelpers.read_from_file(\n            f\"{TEST_LAKEHOUSE_CONTROL}/data/{scenario[2]}.csv\"\n        )\n\n        assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\ndef copy_data_files(iteration: int) -> None:\n    \"\"\"Copies the data files to the tests input location.\n\n    Args:\n        iteration: number indicating the file to load.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/sales-historical-part-0{iteration}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/sales/sales_historical/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/sales-new-part-0{iteration}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/sales/sales_new/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/data/source/sales-shipment-part-0{iteration}.csv\",\n        f\"{TEST_LAKEHOUSE_IN}/data/sales/sales_shipment/\",\n    )\n"
  },
  {
    "path": "tests/feature/transformations/test_watermarker.py",
    "content": "\"\"\"Test Watermarker Transformers.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import OutputFormat\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import load_data\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import (\n    FEATURE_RESOURCES,\n    LAKEHOUSE_FEATURE_CONTROL,\n    LAKEHOUSE_FEATURE_IN,\n    LAKEHOUSE_FEATURE_OUT,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_PATH = \"transformations/watermarker\"\nTEST_RESOURCES = f\"{FEATURE_RESOURCES}/{TEST_PATH}\"\nTEST_LAKEHOUSE_IN = f\"{LAKEHOUSE_FEATURE_IN}/{TEST_PATH}\"\nTEST_LAKEHOUSE_CONTROL = f\"{LAKEHOUSE_FEATURE_CONTROL}/{TEST_PATH}\"\nTEST_LAKEHOUSE_OUT = f\"{LAKEHOUSE_FEATURE_OUT}/{TEST_PATH}\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"streaming_drop_duplicates\", \"loads\": 2},\n        {\"scenario_name\": \"streaming_drop_duplicates_overall_watermark\", \"loads\": 2},\n    ],\n)\ndef test_drop_duplicates_with_watermark(scenario: dict) -> None:\n    \"\"\"Test deduplication applying watermarking.\n\n    For both test scenarios if there is late data coming out of the\n    watermark time, this data won't be integrated. It won't be in the\n    target destination (and so it is also not in the control data).\n\n    Args:\n        scenario: scenario to test.\n            streaming_drop_duplicates - apply drop duplicates over a streaming\n             dataframe.\n            streaming_drop_duplicates_overall_watermark - apply drop duplicates over\n             a streaming dataframe defined as an independent transformation.\n             It also uses the Group and rank transformation which ignores the watermark\n             because that transformation is applied over a foreach batch operation.\n    \"\"\"\n    scenario_name = scenario[\"scenario_name\"]\n    loads = scenario[\"loads\"]\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario_name}/data/control/*\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario_name}/*schema.json\",\n        f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/\",\n    )\n\n    for load in range(1, loads + 1):\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/{scenario_name}/data/source/part-0{str(load)}.csv\",\n            f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/data/\",\n        )\n        acon = ConfigUtils.get_acon(\n            f\"file://{TEST_RESOURCES}/{scenario_name}/{scenario_name}.json\"\n        )\n        load_data(acon=acon)\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario_name}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/{scenario_name}.csv\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario_name}/source_schema.json\"\n        ),\n    )\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\"scenario_name\": \"streaming_inner_join\", \"loads\": 2},\n        {\"scenario_name\": \"streaming_right_outer_join\", \"loads\": 2},\n        {\"scenario_name\": \"streaming_left_outer_join\", \"loads\": 5},\n    ],\n)\ndef 
test_joins_with_watermark(scenario: dict) -> None:\n    \"\"\"Test join operations applying watermarking.\n\n    Args:\n        scenario: scenario to test.\n            streaming_inner_join - apply inner join over 2 streaming dataframes.\n            streaming_right_outer_join - apply right outer join over 2 streaming\n             dataframes.\n            streaming_left_outer_join - apply left outer join over 2 streaming\n             dataframes.\n    \"\"\"\n    scenario_name = scenario[\"scenario_name\"]\n    loads = scenario[\"loads\"]\n    if scenario_name == \"streaming_right_outer_join\":\n        _drop_and_create_table(\n            \"streaming_outer_join\", f\"{TEST_LAKEHOUSE_OUT}/{scenario_name}/data\"\n        )\n\n    for load in range(1, loads + 1):\n        file_prefix = f\"part-0{str(load)}.csv\"\n        if load >= 1 and not scenario_name == \"streaming_inner_join\":\n            LocalStorage.copy_file(\n                f\"{TEST_RESOURCES}/{scenario_name}/data/source/customer-{file_prefix}\",\n                f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/data/customers/\",\n            )\n        elif load == 1 and scenario_name == \"streaming_inner_join\":\n            LocalStorage.copy_file(\n                f\"{TEST_RESOURCES}/{scenario_name}/data/source/customer-part-01.csv\",\n                f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/data/customers/\",\n            )\n\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/{scenario_name}/data/source/sales-{file_prefix}\",\n            f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/data/sales/\",\n        )\n        LocalStorage.copy_file(\n            f\"{TEST_RESOURCES}/{scenario_name}/*schema.json\",\n            f\"{TEST_LAKEHOUSE_IN}/{scenario_name}/\",\n        )\n\n        acon = ConfigUtils.get_acon(\n            f\"file://{TEST_RESOURCES}/{scenario_name}/{scenario_name}.json\"\n        )\n        load_data(acon=acon)\n\n    result_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_OUT}/{scenario_name}/data\",\n        file_format=OutputFormat.DELTAFILES.value,\n    )\n\n    LocalStorage.copy_file(\n        f\"{TEST_RESOURCES}/{scenario_name}/data/control/*\",\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/\",\n    )\n\n    control_df = DataframeHelpers.read_from_file(\n        f\"{TEST_LAKEHOUSE_CONTROL}/data/{scenario_name}.csv\",\n        schema=SchemaUtils.from_file_to_dict(\n            f\"file://{TEST_LAKEHOUSE_IN}/{scenario_name}/\"\n            f\"{scenario_name}_control_schema.json\"\n        ),\n    )\n\n    assert not DataframeHelpers.has_diff(result_df, control_df)\n\n\ndef _drop_and_create_table(table_name: str, location: str) -> None:\n    \"\"\"Create test table.\n\n    Args:\n        table_name: name of the table.\n        location: location of the table.\n    \"\"\"\n    ExecEnv.SESSION.sql(f\"DROP TABLE IF EXISTS test_db.{table_name}\")\n    ExecEnv.SESSION.sql(\n        f\"\"\"\n        CREATE TABLE IF NOT EXISTS test_db.{table_name} (\n            salesorder int,\n            item int,\n            date timestamp,\n            customer string,\n            article string,\n            amount int,\n            customer_name string\n        )\n        USING delta\n        LOCATION '{location}'\n        \"\"\"\n    )\n"
  },
  {
    "path": "tests/resources/feature/append_load/failfast/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"enforce_schema_from_table\": \"test_db.failfast_table\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/append_load/failfast/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"db_table\": \"test_db.failfast_table\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_date\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"appended_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"date\",\n            \"increment_df\": \"max_sales_bronze_date\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"appended_sales\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.failfast_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/append_load/failfast/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/append_load/failfast/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/append_load/failfast/data\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"db_table\": \"test_db.failfast_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/append_load/failfast/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/append_load/failfast/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/append_load/failfast/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000"
  },
  {
    "path": "tests/resources/feature/append_load/failfast/data/source/part-03.csv",
    "content": "salesorder|item|date|customer|article|amount2|onemorecolumn\n5|1|20170510|customer4|article6|15000|NA\n5|2|20170510|customer4|article3|10000|NA\n5|3|20170510|customer4|article5|8000|NA\n6|1|20170601|customer2|article4|10000|NA\n6|2|20170601|customer2|article1|5000|NA\n6|3|20170601|customer2|article2|9000|NA"
  },
  {
    "path": "tests/resources/feature/append_load/jdbc_permissive/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"jdbc\",\n      \"jdbc_args\": {\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/in/feature/append_load/jdbc_permissive/tests.db\",\n        \"table\": \"jdbc_permissive\",\n        \"properties\": {\n          \"driver\": \"org.sqlite.JDBC\"\n        }\n      },\n      \"options\": {\n        \"numPartitions\": 1\n      }\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"db_table\": \"test_db.jdbc_permissive_table\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_date\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"appended_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"date\",\n            \"increment_df\": \"max_sales_bronze_date\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"appended_sales\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.jdbc_permissive_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/append_load/jdbc_permissive/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/append_load/jdbc_permissive/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"jdbc\",\n      \"jdbc_args\": {\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/in/feature/append_load/jdbc_permissive/tests.db\",\n        \"table\": \"jdbc_permissive\",\n        \"properties\": {\n          \"driver\": \"org.sqlite.JDBC\"\n        }\n      },\n      \"options\": {\n        \"numPartitions\": 1\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"db_table\": \"test_db.jdbc_permissive_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/append_load/jdbc_permissive/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/append_load/jdbc_permissive/data/control/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000\n5|1|20170510|customer4|article6|15000\n5|2|20170510|customer4|article3|10000\n5|3|20170510|customer4|article5|8000\n6|1|20170601|customer2|article4|10000\n6|2|20170601|customer2|article1|5000\n6|3|20170601|customer2|article2|9000"
  },
  {
    "path": "tests/resources/feature/append_load/jdbc_permissive/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/append_load/jdbc_permissive/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000"
  },
  {
    "path": "tests/resources/feature/append_load/jdbc_permissive/data/source/part-03.csv",
    "content": "salesorder|item|date|customer|article|amount\n5|1|20170510|customer4|article6|15000\n5|2|20170510|customer4|article3|10000\n5|3|20170510|customer4|article5|8000\n6|1|20170601|customer2|article4|10000\n6|2|20170601|customer2|article1|5000\n6|3|20170601|customer2|article2|9000"
  },
  {
    "path": "tests/resources/feature/append_load/streaming_dropmalformed/data/control/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000"
  },
  {
    "path": "tests/resources/feature/append_load/streaming_dropmalformed/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/append_load/streaming_dropmalformed/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000"
  },
  {
    "path": "tests/resources/feature/append_load/streaming_dropmalformed/data/source/part-03.csv",
    "content": "salesorder|item|date|customer|article|amount2|onemorecolumn\n5|1|20170510|customer4|article6|15000|NA\n5|2|20170510|customer4|article3|10000|NA\n5|3|20170510|customer4|article5|8000|NA\n6|1|20170601|customer2|article4|10000|NA\n6|2|20170601|customer2|article1|5000|NA\n6|3|20170601|customer2|article2|9000|NA"
  },
  {
    "path": "tests/resources/feature/append_load/streaming_dropmalformed/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"DROPMALFORMED\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/append_load/streaming_dropmalformed/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.streaming_dropmalformed_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/append_load/streaming_dropmalformed/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/append_load/streaming_dropmalformed/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/append_load/streaming_with_terminators/data/control/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000"
  },
  {
    "path": "tests/resources/feature/append_load/streaming_with_terminators/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000"
  },
  {
    "path": "tests/resources/feature/append_load/streaming_with_terminators/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"DROPMALFORMED\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/append_load/streaming_with_terminators/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.streaming_with_terminators_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/append_load/streaming_with_terminators/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/append_load/streaming_with_terminators/data\"\n    }\n  ],\n  \"terminate_specs\": [\n    {\n      \"function\": \"optimize_dataset\",\n      \"args\": {\n        \"db_table\": \"test_db.streaming_with_terminators_table\",\n        \"debug\": true\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"group_article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article_number\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.dq_sales\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/data/control/dq_control_success.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/data/source/part-01.csv",
    "content": "salesorder|item|date|group_article|article_number|amount\n1|1|20160601|IE4089|IE4019|1000\n1|2|20160601|IE4088|IE4018|2000\n1|3|20160601|IE4087|IE4017|500\n2|1|20170215|IE4086|IE4016|100\n2|2|20170215|IE4085|IE4015|500\n2|3|20170215|IE4084|IE4014|300\n3|1|20170215|IE4083|IE4013|2000\n3|2|20170215|IE4082|IE4012|1200\n3|3|20170215|IE4081|IE4011|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/data/source/part-02.csv",
    "content": "salesorder|item|date|group_article|article_number|amount\n1|1|20160601|IE4099|IE4039|1000\n1|2|20160601|IE4098|IE4038|2000\n1|3|20160601|IE4097|IE4037|500\n2|1|20170215|IE4096|IE4036|100\n2|2|20170215|IE4095|IE4035|500\n2|3|20170215|IE4094|IE4034|300\n3|1|20170215|IE4093|IE4033|2000\n3|2|20170215|IE4092|IE4032|1200\n3|3|20170215|IE4091|IE4031|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/dq_sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"group_article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article_number\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"group_article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article_number\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_pair_a_to_be_not_equal_to_b/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.dq_sales\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/data/control/dq_control_success.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/dq_sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_pair_a_to_be_smaller_or_equal_than_b/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"VBELN\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"EDATU\",\n            \"type\": \"date\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"MBDAT\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"ERDAT\",\n            \"type\": \"date\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"ERDATA\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"BPDAT\",\n            \"type\": \"date\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.dq_sales\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/data/control/dq_control_success.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/data/source/part-01.csv",
    "content": "VBELN|EDATU|MBDAT|ERDAT|ERDATA|BPDAT\n2001|2029-01-12|2023-11-21|2022-08-07|2022-08-07|2023-09-04\n2002|2029-01-12|2020-01-01|2020-01-04|2019-08-07|2023-10-14\n2003|2019-01-12|2023-03-21|2009-01-14|2012-08-07|2024-12-24"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/data/source/part-02.csv",
    "content": "VBELN|EDATU|MBDAT|ERDAT|ERDATA|BPDAT\n2004|2029-01-12|2022-04-21|2010-05-04|2020-08-07|2024-11-04\n2005|2013-01-12|2022-05-21|2013-01-11|2022-05-21|2024-09-12"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/dq_sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"VBELN\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"EDATU\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"MBDAT\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ERDAT\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ERDATA\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"BPDAT\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"VBELN\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"EDATU\",\n            \"type\": \"date\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"MBDAT\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"ERDAT\",\n            \"type\": \"date\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"ERDATA\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"BPDAT\",\n            \"type\": \"date\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_pair_date_a_to_be_greater_than_or_equal_to_date_b/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.dq_sales\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/data/control/dq_control_success.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|2016-06-01T12:00:00|customer1|article1|1000\n1|2|2016-06-01T12:00:00|customer1|article2|2000\n1|3|2016-06-01T12:00:00|customer1|article3|500\n2|1|2017-02-15T12:00:00|customer2|article4|100\n2|2|2017-02-15T12:00:00|customer2|article6|500\n2|3|2017-02-15T12:00:00|customer2|article1|300\n3|1|2017-02-15T12:00:00|customer1|article5|2000\n3|2|2017-02-15T12:00:00|customer1|article2|1200\n3|3|2017-02-15T12:00:00|customer1|article4|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n4|1|2017-04-30T12:00:00|customer3|article3|800\n4|2|2017-04-30T12:00:00|customer3|article7|700\n4|3|2017-04-30T12:00:00|customer3|article1|300\n4|4|2017-04-30T12:00:00|customer3|article2|500\n5|1|2017-05-10T12:00:00|customer4|article6|1500\n5|2|2017-05-10T12:00:00|customer4|article3|1000\n5|3|2017-05-10T12:00:00|customer4|article5|800\n6|1|2017-06-01T12:00:00|customer2|article4|1000\n6|2|2017-06-01T12:00:00|customer2|article1|500\n6|3|2017-06-01T12:00:00|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/dq_sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_values_to_be_date_not_older_than/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"number\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.dq_sales\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/data/control/dq_control_success.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/data/source/part-01.csv",
    "content": "salesorder|item|number|customer|article|amount\n1|1|4061622965678|customer1|article1|1000\n1|2|4061622965678|customer1|article2|2000\n1|3|4061622965678|customer1|article3|500\n2|1|4061622965678|customer2|article4|100\n2|2|4061622965678|customer2|article6|500\n2|3|4061622965678|customer2|article1|300\n3|1|4061622965678|customer1|article5|2000\n3|2|4061622965678|customer1|article2|1200\n3|3|4061622965678|customer1|article4|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n4|1|4061622965678|customer3|article3|800\n4|2|4061622965678|customer3|article7|700\n4|3|4061622965678|customer3|article1|300\n4|4|4061622965678|customer3|article2|500\n5|1|4061622965678|customer4|article6|1500\n5|2|4061622965678|customer4|article3|1000\n5|3|4061622965678|customer4|article5|800\n6|1|4061622965678|customer2|article4|1000\n6|2|4061622965678|customer2|article1|500\n6|3|4061622965678|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/dq_sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"number\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"number\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_column_values_to_not_be_null_or_empty_string/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"itemcode\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.dq_sales\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/data/control/dq_control_success.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/data/source/part-01.csv",
    "content": "salesorder|item|itemcode|date|customer|article|amount\n1|1|1|20160601|customer1|article1|1000\n1|2|2|20160601|customer1|article2|2000\n1|3|3|20160601|customer1|article3|500\n2|1|1|20170215|customer2|article4|100\n2|2|2|20170215|customer2|article6|500\n2|3|3|20170215|customer2|article1|300\n3|1|1|20170215|customer1|article5|2000\n3|2|2|20170215|customer1|article2|1200\n3|3|3|20170215|customer1|article4|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/data/source/part-02.csv",
    "content": "salesorder|item|itemcode|date|customer|article|amount\n4|1|1|20170430|customer3|article3|800\n4|2|2|20170430|customer3|article7|700\n4|3|3|20170430|customer3|article1|300\n4|4|4|20170430|customer3|article2|500\n5|1|1|20170510|customer4|article6|1500\n5|2|2|20170510|customer4|article3|1000\n5|3|3|20170510|customer4|article5|800\n6|1|1|20170601|customer2|article4|1000\n6|2|2|20170601|customer2|article1|500\n6|3|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/dq_sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"itemcode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"itemcode\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_multicolumn_column_a_must_equal_b_or_c/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_queried_column_agg_value_to_be/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_queried_column_agg_value_to_be/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"itemcode\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"year\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"month\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"day\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.dq_sales\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_queried_column_agg_value_to_be/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_queried_column_agg_value_to_be/data/control/dq_control_success.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_queried_column_agg_value_to_be/data/source/part-01.csv",
    "content": "salesorder|item|itemcode|year|month|day|customer|article|amount\n1|1|1|2016|06|01|customer1|article1|1000\n1|2|2|2016|06|01|customer1|article2|2000\n1|3|3|2016|06|01|customer1|article3|500\n2|1|1|2017|02|15|customer2|article4|100\n2|2|2|2017|02|15|customer2|article6|500\n2|3|3|2017|02|15|customer2|article1|300\n3|1|1|2015|10|09|customer1|article5|2000\n3|2|2|2015|10|09|customer1|article2|1200\n3|3|3|2015|10|09|customer1|article4|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_queried_column_agg_value_to_be/data/source/part-02.csv",
    "content": "salesorder|item|itemcode|year|month|day|customer|article|amount\n4|1|1|2020|04|30|customer3|article3|800\n4|2|2|2020|04|30|customer3|article7|700\n4|3|3|2021|11|31|customer3|article1|300\n4|4|4|2021|11|31|customer3|article2|500\n5|1|1|2022|01|01|customer4|article6|1500\n5|2|2|2022|01|01|customer4|article3|1000\n5|3|3|2022|01|01|customer4|article5|800\n6|1|1|2010|06|29|customer2|article4|1000\n6|2|2|2010|06|29|customer2|article1|500\n6|3|3|2010|06|29|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_queried_column_agg_value_to_be/dq_sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"itemcode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"year\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"month\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"day\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/custom_expectations/expect_queried_column_agg_value_to_be/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/custom_expectations/expect_queried_column_agg_value_to_be/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"itemcode\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"year\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"month\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"day\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_queried_column_agg_value_to_be/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/custom_expectations/expect_queried_column_agg_value_to_be/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/calculate_kpi/control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"long\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/calculate_kpi/data/control/part-01.csv",
    "content": "date|amount\n20160601|3500"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/calculate_kpi/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/calculate_kpi/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/delta_load/data/control/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|15000\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|20000\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|5000\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|1000\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|5000\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|3000\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|20000\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|7000\n20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|4000\n20180110120052t|request1|3|1|1|4|4||20170430|customer3|article2|7000\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|15000\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|10000\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|8000\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|10000\n00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|5000\n00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|9000\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|12000"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/delta_load/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n00000000000000t|0|0|0|0|1|1|N|20160601|customer1|article1|100\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2|N|20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/delta_load/data/source/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/delta_load/data/source/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|2|1|14|4|4||20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/delta_load/data/source/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|3|1|1|4|4||20170430|customer3|article2|70\n20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/sql_transformation/control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"long\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/sql_transformation/data/control/part-01.csv",
    "content": "date|amount\n20160601|3500"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/sql_transformation/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/data_loader_custom_transformer/sql_transformation/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/build_data_docs/with_data_docs_local_fs/20240410-080323-dq_success-sales_orders-checkpoint/20240410T080323.289170Z/7ba399ea28cc40bf8c79213a440aeb91.json",
    "content": "{\n  \"evaluation_parameters\": {},\n  \"meta\": {\n    \"active_batch_definition\": {\n      \"batch_identifiers\": {\n        \"input_id\": \"sales_orders\",\n        \"spec_id\": \"dq_success\",\n        \"timestamp\": \"20240410080151\"\n      },\n      \"data_asset_name\": \"dq_success-sales_orders\",\n      \"data_connector_name\": \"dq_success-sales_orders-data_connector\",\n      \"datasource_name\": \"dq_success-sales_orders-datasource\"\n    },\n    \"batch_markers\": {\n      \"ge_load_time\": \"20240410T080323.295280Z\"\n    },\n    \"batch_spec\": {\n      \"batch_data\": \"SparkDataFrame\",\n      \"data_asset_name\": \"dq_success-sales_orders\"\n    },\n    \"checkpoint_id\": null,\n    \"checkpoint_name\": \"dq_success-sales_orders-checkpoint\",\n    \"expectation_suite_name\": \"dq_success-sales_orders-validator\",\n    \"great_expectations_version\": \"0.18.8\",\n    \"run_id\": {\n      \"run_name\": \"20240410-080323-dq_success-sales_orders-checkpoint\",\n      \"run_time\": \"2024-04-10T08:03:23.289170+00:00\"\n    },\n    \"validation_id\": null,\n    \"validation_time\": \"20240410T080323.296161Z\"\n  },\n  \"results\": [\n    {\n      \"exception_info\": {\n        \"exception_message\": null,\n        \"exception_traceback\": null,\n        \"raised_exception\": false\n      },\n      \"expectation_config\": {\n        \"expectation_type\": \"expect_column_to_exist\",\n        \"kwargs\": {\n          \"batch_id\": \"7ba399ea28cc40bf8c79213a440aeb91\",\n          \"column\": \"article\"\n        },\n        \"meta\": {}\n      },\n      \"meta\": {},\n      \"result\": {},\n      \"success\": true\n    },\n    {\n      \"exception_info\": {\n        \"exception_message\": null,\n        \"exception_traceback\": null,\n        \"raised_exception\": false\n      },\n      \"expectation_config\": {\n        \"expectation_type\": \"expect_table_row_count_to_be_between\",\n        \"kwargs\": {\n          \"batch_id\": \"7ba399ea28cc40bf8c79213a440aeb91\",\n          \"max_value\": 50,\n          \"min_value\": 0\n        },\n        \"meta\": {}\n      },\n      \"meta\": {},\n      \"result\": {\n        \"observed_value\": 34\n      },\n      \"success\": true\n    }\n  ],\n  \"statistics\": {\n    \"evaluated_expectations\": 2,\n    \"success_percent\": 100.0,\n    \"successful_expectations\": 2,\n    \"unsuccessful_expectations\": 0\n  },\n  \"success\": true\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/build_data_docs/without_data_docs_local_fs/20240409-143548-dq_validator-sales_source-checkpoint/20240409T143548.454043Z/f0d7bd293d22bcfd3c1fec5a7d566638.json",
    "content": "{\n  \"evaluation_parameters\": {},\n  \"meta\": {\n    \"active_batch_definition\": {\n      \"batch_identifiers\": {\n        \"input_id\": \"sales_source\",\n        \"spec_id\": \"dq_validator\",\n        \"timestamp\": \"20240409143443\"\n      },\n      \"data_asset_name\": \"dq_validator-sales_source\",\n      \"data_connector_name\": \"dq_validator-sales_source-data_connector\",\n      \"datasource_name\": \"dq_validator-sales_source-datasource\"\n    },\n    \"batch_markers\": {\n      \"ge_load_time\": \"20240409T143548.465215Z\"\n    },\n    \"batch_spec\": {\n      \"batch_data\": \"SparkDataFrame\",\n      \"data_asset_name\": \"dq_validator-sales_source\"\n    },\n    \"checkpoint_id\": null,\n    \"checkpoint_name\": \"dq_validator-sales_source-checkpoint\",\n    \"expectation_suite_name\": \"dq_validator-sales_source-validator\",\n    \"great_expectations_version\": \"0.18.8\",\n    \"run_id\": {\n      \"run_name\": \"20240409-143548-dq_validator-sales_source-checkpoint\",\n      \"run_time\": \"2024-04-09T14:35:48.454043+00:00\"\n    },\n    \"validation_id\": null,\n    \"validation_time\": \"20240409T143548.466032Z\"\n  },\n  \"results\": [\n    {\n      \"exception_info\": {\n        \"exception_message\": null,\n        \"exception_traceback\": null,\n        \"raised_exception\": false\n      },\n      \"expectation_config\": {\n        \"expectation_type\": \"expect_table_row_count_to_be_between\",\n        \"kwargs\": {\n          \"batch_id\": \"f0d7bd293d22bcfd3c1fec5a7d566638\",\n          \"max_value\": 34,\n          \"min_value\": 34\n        },\n        \"meta\": {}\n      },\n      \"meta\": {},\n      \"result\": {\n        \"observed_value\": 34\n      },\n      \"success\": true\n    },\n    {\n      \"exception_info\": {\n        \"exception_message\": null,\n        \"exception_traceback\": null,\n        \"raised_exception\": false\n      },\n      \"expectation_config\": {\n        \"expectation_type\": \"expect_table_column_count_to_be_between\",\n        \"kwargs\": {\n          \"batch_id\": \"f0d7bd293d22bcfd3c1fec5a7d566638\",\n          \"max_value\": 12,\n          \"min_value\": 12\n        },\n        \"meta\": {}\n      },\n      \"meta\": {},\n      \"result\": {\n        \"observed_value\": 12\n      },\n      \"success\": true\n    }\n  ],\n  \"statistics\": {\n    \"evaluated_expectations\": 2,\n    \"success_percent\": 100.0,\n    \"successful_expectations\": 2,\n    \"unsuccessful_expectations\": 0\n  },\n  \"success\": true\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/control/data_validator.json",
    "content": "{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":19,\"min_value\":19,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":19,\"column\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, null, customer1||1, 2, 20160601, customer1||1, 3, 20160601, customer1||2, 1, 20170215, customer2||2, 2, 20170215, customer2||2, 3, 20170215, customer2||3, 1, 20170215, customer1||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||5, 1, 20170510, customer4||5, 2, 20170510, customer4||5, 3, 20170510, customer4||6, 1, 20170601, customer2||6, 2, 20170601, customer2||6, 3, 20170601, customer2\"}\n{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":12,\"min_value\":12,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":12,\"column\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, null, customer1||1, 2, 20160601, customer1||1, 3, 20160601, customer1||2, 1, 20170215, customer2||2, 2, 20170215, customer2||2, 3, 20170215, customer2||3, 1, 20170215, customer1||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||5, 1, 20170510, customer4||5, 2, 20170510, customer4||5, 3, 20170510, customer4||6, 1, 20170601, customer2||6, 2, 20170601, customer2||6, 3, 20170601, 
customer2\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":9,\"min_value\":9,\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":9,\"column\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, 20160601, customer1||2, 2, 20170215, customer2||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||7, 1, 20180110, customer5\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":12,\"min_value\":12,\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":12,\"column\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, 20160601, customer1||2, 2, 20170215, customer2||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||7, 1, 20180110, customer5\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":\"fake_column\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, 20160601, customer1||2, 2, 20170215, customer2||3, 2, 20170215, customer1||3, 3, 20170215, 
customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||7, 1, 20180110, customer5\"}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/control/data_validator_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"checkpoint_config\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_name\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_results\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_result_identifier\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"spec_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"input_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"batch_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"max_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"min_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"sum_total\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unexpected_index_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"evaluated_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success_percent\",\n      \"nullable\": true,\n      \"type\": \"double\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"successful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unsuccessful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_type\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"exception_info\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"exception_message\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"exception_traceback\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": 
\"raised_exception\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"meta\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"column\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_check_type\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_rule_id\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"execution_point\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"filters\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"schema\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"table\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"observed_value\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_year\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_month\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_day\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"kwargs\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source_primary_key\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"processed_keys\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/control/sales.json",
    "content": "{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"1\",\"recordmode\":\"N\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article2\",\"amount\":\"200\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article3\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"2\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"10\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"2\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"30\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"3\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article5\",\"amount\":\"200\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article6\",\"amount\":\"150\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article3\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"3\",\"recordmo
de\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article5\",\"amount\":\"80\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"90\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"3\",\"salesorder\":\"1\",\"item\":\"1\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"150\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"5\",\"salesorder\":\"2\",\"item\":\"2\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"2\",\"partno\":\"1\",\"record\":\"2\",\"salesorder\":\"4\",\"item\":\"4\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article2\",\"amount\":\"70\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"1\",\"salesorder\":\"7\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20180110\",\"customer\":\"customer5\",\"article\":\"article2\",\"amount\":\"120\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110130103t\",\"request\":\"request2\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"4\",\"salesorder\":\"4\",\"item\":\"1\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article3\",\"amount\":\"70\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,
\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110130103t\",\"request\":\"request2\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"6\",\"salesorder\":\"4\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article1\",\"amount\":\"40\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/control/sales_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"actrequest_timestamp\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"request\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"datapakid\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"partno\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"record\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"salesorder\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"item\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"recordmode\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"date\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"customer\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"article\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"amount\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"dq_validations\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"run_name\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exceptions\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_row_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_failure_details\",\n            \"nullable\": true,\n            \"type\": {\n              \"containsNull\": true,\n              \"elementType\": {\n                \"fields\": [\n                  {\n                    \"metadata\": {},\n                    \"name\": \"expectation_type\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  },\n                  {\n                    \"metadata\": {},\n                    \"name\": \"kwargs\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  }\n                ],\n                \"type\": \"struct\"\n              },\n              \"type\": \"array\"\n            }\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/dq_functions/test_db.dq_functions_source_load_with_dq_table_delta_with_dupl_tag_gen_fail_init.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_table_row_count_to_be_between|in_motion|test_db|dummy_sales||{\"min_value\": 19, \"max_value\": 19}\nrule_2|expect_table_column_count_to_be_between|in_motion|test_db|dummy_sales||{\"min_value\": 12, \"max_value\": 12}\nrule_3|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/dq_functions/test_db.dq_functions_source_load_with_dq_table_delta_with_dupl_tag_gen_fail_new.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_table_row_count_to_be_between|in_motion|test_db|dummy_sales||{\"min_value\": 9, \"max_value\": 9}\nrule_2|expect_table_column_count_to_be_between|in_motion|test_db|dummy_sales||{\"min_value\": 12, \"max_value\": 12}\nrule_3|expect_column_to_exist|in_motion|test_db|dummy_sales|fake_column|{\"column\": \"fake_column\"}\nrule_4|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/source/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/source/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data/source/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/streaming_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"prisma\",\n      \"dq_db_table\": \"test_db.dq_functions_source_load_with_dq_table_delta_with_dupl_tag_gen_fail_init\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/dq\",\n      \"result_sink_format\": \"delta\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"dq_table_table_filter\": \"dummy_sales\",\n      \"tag_source_data\": true,\n      \"source\": \"condensed_sales\",\n      \"data_product_name\": \"delta_with_dupl_tag_gen_fail\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/streaming_new.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"prisma\",\n      \"dq_db_table\": \"test_db.dq_functions_source_load_with_dq_table_delta_with_dupl_tag_gen_fail_new\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/dq\",\n      \"result_sink_format\": \"delta\",\n      \"dq_table_table_filter\": \"dummy_sales\",\n      \"tag_source_data\": true,\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"source\": \"condensed_sales\",\n      \"data_product_name\": \"delta_with_dupl_tag_gen_fail\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\"\n      },\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_dupl_tag_gen_fail/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/control/data_validator.json",
    "content": "{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"19.0\",\"min_value\":\"19.0\",\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":19,\"column\":null,\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\":\"1, 1, null, customer1||1, 2, 20160601, customer1||1, 3, 20160601, customer1||2, 1, 20170215, customer2||2, 2, 20170215, customer2||2, 3, 20170215, customer2||3, 1, 20170215, customer1||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||5, 1, 20170510, customer4||5, 2, 20170510, customer4||5, 3, 20170510, customer4||6, 1, 20170601, customer2||6, 2, 20170601, customer2||6, 3, 20170601, customer2\"}\n{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"12.0\",\"min_value\":\"12.0\",\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":12,\"column\":null,\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\":\"1, 1, null, customer1||1, 2, 20160601, customer1||1, 3, 20160601, customer1||2, 1, 20170215, customer2||2, 2, 20170215, customer2||2, 3, 20170215, customer2||3, 1, 20170215, customer1||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||5, 1, 20170510, customer4||5, 2, 20170510, customer4||5, 3, 20170510, customer4||6, 1, 20170601, customer2||6, 2, 20170601, customer2||6, 3, 20170601, 
customer2\"}\n{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":null,\"column_A\":null,\"column_B\":null,\"column_list\":\"[salesorder, request]\",\"sum_total\":\"5.0\", \"unexpected_index_list\":[{\"run_success\":false,\"customer\":\"customer1\",\"date\":null,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"1\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20160601,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"1\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20160601,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"1\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170215,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"2\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20170215,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"3\"},{\"run_success\":false,\"customer\":\"customer3\",\"date\":20170430,\"item\":\"4\",\"request\":\"0\",\"salesorder\":\"4\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170601,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"6\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170215,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"2\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170215,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"2\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20170215,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"3\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20170215,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"3\"},{\"run_success\":false,\"customer\":\"customer3\",\"date\":20170430,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"4\"},{\"run_success\":false,\"customer\":\"customer3\",\"date\":20170430,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"4\"},{\"run_success\":false,\"customer\":\"customer3\",\"date\":20170430,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"4\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170601,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"6\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170601,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"6\"}],\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\":\"1, 1, null, customer1||1, 2, 20160601, customer1||1, 3, 20160601, customer1||2, 1, 20170215, customer2||2, 2, 20170215, customer2||2, 3, 20170215, customer2||3, 1, 20170215, customer1||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||5, 1, 20170510, customer4||5, 2, 20170510, customer4||5, 3, 20170510, customer4||6, 1, 20170601, customer2||6, 
2, 20170601, customer2||6, 3, 20170601, customer2\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"9.0\",\"min_value\":\"9.0\",\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":9,\"column\":null,\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, 20160601, customer1||2, 2, 20170215, customer2||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||7, 1, 20180110, customer5\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"12.0\",\"min_value\":\"12.0\",\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":12,\"column\":null,\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, 20160601, customer1||2, 2, 20170215, customer2||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||7, 1, 20180110, customer5\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_column_values_to_be_in_set\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":\"salesorder\",\"column_A\":null,\"column_B\":null, 
\"unexpected_index_list\":[{\"run_success\":false,\"customer\":\"customer5\",\"date\":\"20180110\",\"item\":\"1\",\"salesorder\":\"7\"}],\"value_set\":\"[1, 2, 3, 4, 5]\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, 20160601, customer1||2, 2, 20170215, customer2||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||7, 1, 20180110, customer5\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"3.0\",\"min_value\":\"3.0\",\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":\"amount\",\"column_A\":null,\"column_B\":null, \"unexpected_index_list\":[{\"run_success\":false,\"amount\":\"70\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"4\",\"salesorder\":\"4\"},{\"run_success\":false,\"amount\":\"50\",\"customer\":\"customer2\",\"date\":20170215,\"item\":2,\"salesorder\":2},{\"run_success\":false,\"amount\":\"70\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"1\",\"salesorder\":\"4\"},{\"run_success\":false,\"amount\":\"80\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"2\",\"salesorder\":\"4\"},{\"run_success\":false,\"amount\":\"40\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"3\",\"salesorder\":\"4\"}],\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, 20160601, customer1||2, 2, 20170215, customer2||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||7, 1, 20180110, customer5\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":\"fake_column\",\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": 
\"1, 1, 20160601, customer1||2, 2, 20170215, customer2||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||7, 1, 20180110, customer5\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_column_pair_values_to_be_equal\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":null,\"column_A\":\"datapakid\",\"column_B\":\"partno\", \"unexpected_index_list\":[{\"run_success\":false,\"datapakid\":\"2\",\"salesorder\":\"4\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"4\",\"partno\":\"1\"}],\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\": \"1, 1, 20160601, customer1||2, 2, 20170215, customer2||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||7, 1, 20180110, customer5\"}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/control/data_validator_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"checkpoint_config\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_name\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_result_identifier\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"spec_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"input_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_results\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"batch_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column_list\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"max_value\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"min_value\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"sum_total\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"evaluated_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success_percent\",\n      \"nullable\": true,\n      \"type\": \"double\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"successful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unsuccessful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_type\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"exception_info\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"exception_message\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"exception_traceback\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exception\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unexpected_index_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        
\"elementType\": {\n          \"fields\": [\n            {\n              \"metadata\": {},\n              \"name\": \"customer\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"date\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"item\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"request\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"salesorder\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"run_success\",\n              \"nullable\": true,\n              \"type\": \"boolean\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"amount\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"datapakid\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"partno\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            }\n          ],\n          \"type\": \"struct\"\n        },\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"meta\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"column\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_check_type\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_rule_id\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"execution_point\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"filters\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"schema\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"table\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"observed_value\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_year\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_month\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_day\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n   
 },\n    {\n      \"metadata\": {},\n      \"name\": \"kwargs\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column_A\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column_B\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"value_set\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source_primary_key\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"processed_keys\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/control/sales.json",
    "content": "{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"1\",\"recordmode\":\"N\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article2\",\"amount\":\"200\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article3\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"2\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"10\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"2\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"30\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"3\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article5\",\"amount\":\"200\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"colum
n_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article6\",\"amount\":\"150\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article3\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article5\",\"amount\":\"80\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"90\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"3\",\"salesorder\":\"1\",\"item\":\"1\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"150\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":fal
se,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"5\",\"salesorder\":\"2\",\"item\":\"2\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"amount\\\",\\\"max_value\\\":3.0,\\\"min_value\\\":3.0}\"}]}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"2\",\"partno\":\"1\",\"record\":\"2\",\"salesorder\":\"4\",\"item\":\"4\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article2\",\"amount\":\"70\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_pair_values_to_be_equal\",\"kwargs\":\"{\\\"column_A\\\":\\\"datapakid\\\",\\\"column_B\\\":\\\"partno\\\"}\"},{\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"amount\\\",\\\"max_value\\\":3.0,\\\"min_value\\\":3.0}\"}]}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"1\",\"salesorder\":\"7\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20180110\",\"customer\":\"customer5\",\"article\":\"article2\",\"amount\":\"120\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_values_to_be_in_set\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"salesorder\\\",\\\"value_set\\\":[1,2,3,4,5]}\"}]}}\n{\"actrequest_timestamp\":\"20180110130103t\",\"request\":\"request2\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"4\",\"salesorder\":\"4\",\"item\":\"1\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article3\",\"amount\":\"70\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"amount\\\",\\\"max_value\\\":3.0,\\\"min_value\\\":3.0}\"}]}}\n{\"actrequest_timestamp\":\"20180110130103t\",\"request\":\"request2\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"6\",\"salesorder\":\"4\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article1\",\"amount\":\"40\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"amount\\\",\\\"max_value\\\":3.0,\\\"min_value\\\":3.0}\"}]}}\n"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/control/sales_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"actrequest_timestamp\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"request\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"datapakid\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"partno\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"record\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"salesorder\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"item\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"recordmode\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"date\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"customer\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"article\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"amount\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"dq_validations\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"run_name\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exceptions\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_row_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_failure_details\",\n            \"nullable\": true,\n            \"type\": {\n              \"containsNull\": true,\n              \"elementType\": {\n                \"fields\": [\n                  {\n                    \"metadata\": {},\n                    \"name\": \"expectation_type\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  },\n                  {\n                    \"metadata\": {},\n                    \"name\": \"kwargs\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  }\n                ],\n                \"type\": \"struct\"\n              },\n              \"type\": \"array\"\n            }\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/dq_functions/test_db.dq_functions_source_load_with_dq_table_delta_with_duplicates_tag_init.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_table_row_count_to_be_between|in_motion|test_db|dummy_sales||{\"min_value\": 19, \"max_value\": 19}\nrule_2|expect_table_column_count_to_be_between|in_motion|test_db|dummy_sales||{\"min_value\": 12, \"max_value\": 12}\nrule_3|expect_multicolumn_sum_to_equal|in_motion|test_db|dummy_sales|salesorder,request|{\"column_list\": [\"salesorder\", \"request\"], \"sum_total\": 5}\nrule_4|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}\nrule_5|expect_multicolumn_sum_to_equal|in_motion|test_db|dummy_sales|salesorder,request|{\"column_list\": [\"salesorder\", \"request\"], \"sum_total\": 5}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/dq_functions/test_db.dq_functions_source_load_with_dq_table_delta_with_duplicates_tag_new.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_table_row_count_to_be_between|in_motion|test_db|dummy_sales||{\"min_value\": 9, \"max_value\": 9}\nrule_2|expect_table_column_count_to_be_between|in_motion|test_db|dummy_sales||{\"min_value\": 12, \"max_value\": 12}\nrule_3|expect_column_values_to_be_in_set|in_motion|test_db|dummy_sales|salesorder|{\"column\": \"salesorder\", \"value_set\": [1, 2, 3, 4, 5]}\nrule_4|expect_column_value_lengths_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 3, \"max_value\": 3}\nrule_5|expect_column_to_exist|in_motion|test_db|dummy_sales|fake_column|{\"column\": \"fake_column\"}\nrule_6|expect_column_pair_values_to_be_equal|in_motion|test_db|dummy_sales|datapakid, partno|{\"column_A\": \"datapakid\", \"column_B\": \"partno\"}\nrule_7|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/source/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/source/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data/source/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/streaming_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"prisma\",\n      \"dq_db_table\": \"test_db.dq_functions_source_load_with_dq_table_delta_with_duplicates_tag_init\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/dq\",\n      \"result_sink_format\": \"delta\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"dq_table_table_filter\": \"dummy_sales\",\n      \"tag_source_data\": true,\n      \"source\": \"condensed_sales\",\n      \"data_product_name\": \"delta_with_duplicates_tag\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/streaming_new.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"prisma\",\n      \"dq_db_table\": \"test_db.dq_functions_source_load_with_dq_table_delta_with_duplicates_tag_new\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/dq\",\n      \"result_sink_format\": \"delta\",\n      \"tag_source_data\": true,\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"dq_table_table_filter\": \"dummy_sales\",\n      \"source\": \"condensed_sales\",\n      \"data_product_name\": \"delta_with_duplicates_tag\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\"\n      },\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/delta_with_duplicates_tag/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_table/full_overwrite_tag/data\"\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"sales_source\",\n      \"dq_type\": \"prisma\",\n      \"dq_db_table\": \"test_db.dq_functions_source_load_with_dq_table_full_overwrite_tag_init\",\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/full_overwrite_tag/dq\",\n      \"result_sink_format\": \"delta\",\n      \"result_sink_extra_columns\": [\"validation_results.result.*\"],\n      \"source\": \"sales\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"dq_table_table_filter\": \"dummy_sales\",\n      \"tag_source_data\": true,\n      \"data_product_name\": \"full_overwrite_tag\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/full_overwrite_tag/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/batch_new.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_table/full_overwrite_tag/data\"\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"sales_source\",\n      \"dq_type\": \"prisma\",\n      \"dq_db_table\": \"test_db.dq_functions_source_load_with_dq_table_full_overwrite_tag_init\",\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/full_overwrite_tag/dq\",\n      \"result_sink_format\": \"delta\",\n      \"result_sink_extra_columns\": [\"validation_results.result.*\"],\n      \"source\": \"sales\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"dq_table_table_filter\": \"dummy_sales\",\n      \"tag_source_data\": true,\n      \"data_product_name\": \"full_overwrite_tag\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_table/full_overwrite_tag/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.sources.partitionColumnTypeInference.enabled\": false\n  }\n}\n"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/data/control/data_validator.json",
    "content": "{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_1\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_1\",\"source\":\"sales\",\"batch_id\":\"batch_id_1\",\"column\":\"article\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_1\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\":\"1, 1, 20160601, customer1||1, 2, 20160601, customer1||1, 3, 20160601, customer1||2, 1, 20170215, customer2||2, 2, 20170215, customer2||2, 3, 20170215, customer2||3, 1, 20170215, customer1||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||5, 1, 20170510, customer4||5, 2, 20170510, customer4||5, 3, 20170510, customer4||6, 1, 20170601, customer2||6, 2, 20170601, customer2||6, 3, 20170601, customer2\"}\n{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_2\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_2\",\"source\":\"sales\",\"batch_id\":\"batch_id_2\",\"max_value\":50,\"min_value\":3,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_2\",\"observed_value\":19,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\":\"1, 1, 20160601, customer1||1, 2, 20160601, customer1||1, 3, 20160601, customer1||2, 1, 20170215, customer2||2, 2, 20170215, customer2||2, 3, 20170215, customer2||3, 1, 20170215, customer1||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||5, 1, 20170510, customer4||5, 2, 20170510, customer4||5, 3, 20170510, customer4||6, 1, 20170601, customer2||6, 2, 20170601, customer2||6, 3, 20170601, 
customer2\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_1\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_1\",\"source\":\"sales\",\"batch_id\":\"batch_id_1\",\"column\":\"article\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_1\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\":\"1, 1, 20160601, customer1||1, 2, 20160601, customer1||1, 3, 20160601, customer1||2, 1, 20170215, customer2||2, 2, 20170215, customer2||2, 3, 20170215, customer2||3, 1, 20170215, customer1||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||5, 1, 20170510, customer4||5, 2, 20170510, customer4||5, 3, 20170510, customer4||6, 1, 20170601, customer2||6, 2, 20170601, customer2||6, 3, 20170601, customer2\"}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_2\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_2\",\"source\":\"sales\",\"batch_id\":\"batch_id_2\",\"max_value\":50,\"min_value\":3,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_2\",\"observed_value\":19,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"], \"processed_keys\":\"1, 1, 20160601, customer1||1, 2, 20160601, customer1||1, 3, 20160601, customer1||2, 1, 20170215, customer2||2, 2, 20170215, customer2||2, 3, 20170215, customer2||3, 1, 20170215, customer1||3, 2, 20170215, customer1||3, 3, 20170215, customer1||4, 1, 20170430, customer3||4, 2, 20170430, customer3||4, 3, 20170430, customer3||4, 4, 20170430, customer3||5, 1, 20170510, customer4||5, 2, 20170510, customer4||5, 3, 20170510, customer4||6, 1, 20170601, customer2||6, 2, 20170601, customer2||6, 3, 20170601, customer2\"}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/data/control/data_validator_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"checkpoint_config\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_name\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_results\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_result_identifier\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"spec_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"input_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"batch_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"max_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"min_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"evaluated_expectations\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success_percent\",\n      \"nullable\": true,\n      \"type\": \"double\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"successful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unsuccessful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_type\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"exception_info\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"exception_message\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"exception_traceback\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exception\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"meta\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"column\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n           
 \"metadata\": {},\n            \"name\": \"dq_check_type\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_rule_id\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"execution_point\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"filters\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"schema\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"table\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"observed_value\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_year\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_month\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_day\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"kwargs\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source_primary_key\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unexpected_index_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"processed_keys\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/data/control/sales.json",
    "content": "{\"salesorder\":\"1\",\"item\":\"1\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"10000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"1\",\"item\":\"2\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article2\",\"amount\":\"20000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"1\",\"item\":\"3\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article3\",\"amount\":\"5000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"2\",\"item\":\"1\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"1000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"2\",\"item\":\"2\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article6\",\"amount\":\"5000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"2\",\"item\":\"3\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"3000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"3\",\"item\":\"1\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article5\",\"amount\":\"20000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"3\",\"item\":\"2\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article2\",\"amount\":\"12000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"3\",\"item\":\"3\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article4\",\"amount\":\"9000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"4\",\"item\":\"1\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article3\",\"amount\":\"8000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"4\",\"item\":\"2\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article7\",\"amount\":\"7000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"4\",\"item\":\"3\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article1\",\"amount\":\"3000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"4\",\"item\":\"4\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article2\",\"am
ount\":\"5000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"5\",\"item\":\"1\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article6\",\"amount\":\"15000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"5\",\"item\":\"2\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article3\",\"amount\":\"10000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"5\",\"item\":\"3\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article5\",\"amount\":\"8000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"6\",\"item\":\"1\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"10000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"6\",\"item\":\"2\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"5000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"6\",\"item\":\"3\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"9000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/data/control/sales_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"salesorder\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"item\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"date\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"customer\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"article\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"amount\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"dq_validations\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"run_name\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exceptions\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_row_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_failure_details\",\n            \"nullable\": true,\n            \"type\": {\n              \"containsNull\": true,\n              \"elementType\": {\n                \"fields\": [\n                  {\n                    \"metadata\": {},\n                    \"name\": \"expectation_type\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  },\n                  {\n                    \"metadata\": {},\n                    \"name\": \"kwargs\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  }\n                ],\n                \"type\": \"struct\"\n              },\n              \"type\": \"array\"\n            }\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/data/dq_functions/test_db.dq_functions_source_load_with_dq_table_full_overwrite_tag_init.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_to_exist|in_motion|test_db|dummy_sales|article|{\"column\": \"article\"}\nrule_2|expect_table_row_count_to_be_between|in_motion|test_db|dummy_sales||{\"min_value\": 3, \"max_value\": 50}\nrule_3|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/data/dq_functions/test_db.dq_functions_source_load_with_dq_table_full_overwrite_tag_new.csv",
    "content": "dq_rule_id|dq_check_type|dq_tech_function|execution_point|schema|table|column|filters|arguments\nrule_1|COLUMN EXISTS|expect_column_to_exist|in_motion|test_db|dummy_sales|article||{\"column\": \"article\"}\nrule_2|ROW COUNT|expect_table_row_count_to_be_between|in_motion|test_db|dummy_sales|||{\"min_value\": 3, \"max_value\": 50}\nrule_3|TABLE STRUCTURE|expect_wrong_expectation|at_rest|test_db|no_table|amount||{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_table/full_overwrite_tag/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|10000\n1|2|20160601|customer1|article2|20000\n1|3|20160601|customer1|article3|5000\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000\n5|1|20170510|customer4|article6|15000\n5|2|20170510|customer4|article3|10000\n5|3|20170510|customer4|article5|8000\n6|1|20170601|customer2|article4|10000\n6|2|20170601|customer2|article1|5000\n6|3|20170601|customer2|article2|9000"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data/control/data_validator.json",
    "content": "{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":19,\"min_value\":19,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":19,\"column\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":12,\"min_value\":12,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":12,\"column\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":9,\"min_value\":9,\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":9,\"column\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", 
\"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":12,\"min_value\":12,\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":12,\"column\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":\"fake_column\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data/control/data_validator_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"checkpoint_config\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_name\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_results\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_result_identifier\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"spec_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"input_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"batch_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"max_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"min_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"sum_total\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unexpected_index_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"evaluated_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success_percent\",\n      \"nullable\": true,\n      \"type\": \"double\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"successful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unsuccessful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_type\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"exception_info\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"exception_message\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"exception_traceback\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": 
\"raised_exception\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"meta\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"column\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_check_type\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_rule_id\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"execution_point\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"filters\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"schema\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"table\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"observed_value\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_year\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_month\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_day\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"kwargs\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source_primary_key\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"processed_keys\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data/control/sales.json",
    "content": "{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"1\",\"recordmode\":\"N\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article2\",\"amount\":\"200\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article3\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"2\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"10\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"2\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"30\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"3\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article5\",\"amount\":\"200\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article6\",\"amount\":\"150\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article3\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"3\",\"recordmo
de\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article5\",\"amount\":\"80\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"90\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"3\",\"salesorder\":\"1\",\"item\":\"1\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"150\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"5\",\"salesorder\":\"2\",\"item\":\"2\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"2\",\"partno\":\"1\",\"record\":\"2\",\"salesorder\":\"4\",\"item\":\"4\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article2\",\"amount\":\"70\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"1\",\"salesorder\":\"7\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20180110\",\"customer\":\"customer5\",\"article\":\"article2\",\"amount\":\"120\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110130103t\",\"request\":\"request2\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"4\",\"salesorder\":\"4\",\"item\":\"1\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article3\",\"amount\":\"70\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,
\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110130103t\",\"request\":\"request2\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"6\",\"salesorder\":\"4\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article1\",\"amount\":\"40\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data/control/sales_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"actrequest_timestamp\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"request\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"datapakid\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"partno\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"record\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"salesorder\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"item\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"recordmode\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"date\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"customer\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"article\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"amount\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"dq_validations\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"run_name\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exceptions\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_row_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_failure_details\",\n            \"nullable\": true,\n            \"type\": {\n              \"containsNull\": true,\n              \"elementType\": {\n                \"fields\": [\n                  {\n                    \"metadata\": {},\n                    \"name\": \"expectation_type\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  },\n                  {\n                    \"metadata\": {},\n                    \"name\": \"kwargs\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  }\n                ],\n                \"type\": \"struct\"\n              },\n              \"type\": \"array\"\n            }\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data/source/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data/source/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data/source/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/streaming_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"validator\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/dq\",\n      \"result_sink_db_table\": \"test_db.validator_delta_with_dupl_tag_gen_fail\",\n      \"result_sink_format\": \"delta\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"tag_source_data\": true,\n      \"source\": \"condensed_sales\",\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 19,\n            \"max_value\": 19\n          }\n        },\n        {\n          \"function\": \"expect_table_column_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 12,\n            \"max_value\": 12\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/streaming_new.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"validator\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/dq\",\n      \"result_sink_db_table\": \"test_db.validator_delta_with_dupl_tag_gen_fail\",\n      \"result_sink_format\": \"delta\",\n      \"tag_source_data\": true,\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"source\": \"condensed_sales\",\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 9,\n            \"max_value\": 9\n          }\n        },\n        {\n          \"function\": \"expect_table_column_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 12,\n            \"max_value\": 12\n          }\n        },\n        {\n          \"function\": \"expect_column_to_exist\",\n          \"args\": {\n            \"column\": \"fake_column\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\"\n      },\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_dupl_tag_gen_fail/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data/control/data_validator.json",
    "content": "{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"19.0\",\"min_value\":\"19.0\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":\"19\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"12.0\",\"min_value\":\"12.0\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":\"12\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"9.0\",\"min_value\":\"9.0\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":\"9\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", 
\"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"12.0\",\"min_value\":\"12.0\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":\"12\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data/control/data_validator_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"checkpoint_config\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_name\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_results\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unexpected_index_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_result_identifier\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"spec_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"input_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"batch_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"max_value\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"min_value\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"evaluated_expectations\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success_percent\",\n      \"nullable\": true,\n      \"type\": \"double\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"successful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unsuccessful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_type\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"exception_info\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"exception_message\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"exception_traceback\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exception\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"meta\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": 
\"column\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_check_type\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_rule_id\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"execution_point\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"filters\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"schema\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"table\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"observed_value\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_year\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_month\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_day\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"kwargs\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source_primary_key\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"processed_keys\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data/source/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data/source/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data/source/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates/streaming_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"validator\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates/dq\",\n      \"result_sink_db_table\": \"test_db.validator_delta_with_duplicates\",\n      \"result_sink_format\": \"json\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"source\": \"condensed_sales\",\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 19,\n            \"max_value\": 19\n          }\n        },\n        {\n          \"function\": \"expect_table_column_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 12,\n            \"max_value\": 12\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates/streaming_new.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"validator\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates/dq\",\n      \"result_sink_db_table\": \"test_db.validator_delta_with_duplicates\",\n      \"result_sink_format\": \"json\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"source\": \"condensed_sales\",\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 9,\n            \"max_value\": 9\n          }\n        },\n        {\n          \"function\": \"expect_table_column_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 12,\n            \"max_value\": 12\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\"\n      },\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data/control/data_validator.json",
    "content": "{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"19.0\",\"min_value\":\"19.0\",\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":\"19\",\"column\":null,\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"12.0\",\"min_value\":\"12.0\",\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":\"12\",\"column\":null,\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":3,\"success_percent\":66.66666666666666,\"successful_expectations\":2,\"unsuccessful_expectations\":1,\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":null,\"column_A\":null,\"column_B\":null,\"column_list\":\"[salesorder, request]\",\"sum_total\":\"5.0\", 
\"unexpected_index_list\":[{\"run_success\":false,\"customer\":\"customer1\",\"date\":null,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"1\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20160601,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"1\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20160601,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"1\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170215,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"2\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20170215,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"3\"},{\"run_success\":false,\"customer\":\"customer3\",\"date\":20170430,\"item\":\"4\",\"request\":\"0\",\"salesorder\":\"4\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170601,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"6\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170215,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"2\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170215,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"2\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20170215,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"3\"},{\"run_success\":false,\"customer\":\"customer1\",\"date\":20170215,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"3\"},{\"run_success\":false,\"customer\":\"customer3\",\"date\":20170430,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"4\"},{\"run_success\":false,\"customer\":\"customer3\",\"date\":20170430,\"item\":\"2\",\"request\":\"0\",\"salesorder\":\"4\"},{\"run_success\":false,\"customer\":\"customer3\",\"date\":20170430,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"4\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170601,\"item\":\"1\",\"request\":\"0\",\"salesorder\":\"6\"},{\"run_success\":false,\"customer\":\"customer2\",\"date\":20170601,\"item\":\"3\",\"request\":\"0\",\"salesorder\":\"6\"}],\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"9.0\",\"min_value\":\"9.0\",\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":\"9\",\"column\":null,\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", 
\"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"12.0\",\"min_value\":\"12.0\",\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_table_column_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":\"12\",\"column\":null,\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_column_values_to_be_in_set\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":\"salesorder\",\"column_A\":null,\"column_B\":null, \"unexpected_index_list\":[{\"run_success\":false,\"customer\":\"customer5\",\"date\":\"20180110\",\"item\":\"1\",\"salesorder\":\"7\"}],\"value_set\":\"[1, 2, 3, 4, 5]\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":\"3.0\",\"min_value\":\"3.0\",\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":\"amount\",\"column_A\":null,\"column_B\":null, 
\"unexpected_index_list\":[{\"run_success\":false,\"amount\":\"70\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"4\",\"salesorder\":\"4\"},{\"run_success\":false,\"amount\":\"50\",\"customer\":\"customer2\",\"date\":20170215,\"item\":2,\"salesorder\":2},{\"run_success\":false,\"amount\":\"70\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"1\",\"salesorder\":\"4\"},{\"run_success\":false,\"amount\":\"80\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"2\",\"salesorder\":\"4\"},{\"run_success\":false,\"amount\":\"40\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"3\",\"salesorder\":\"4\"}],\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":\"fake_column\",\"column_A\":null,\"column_B\":null,\"unexpected_index_list\":null,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T10:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations\",\"success\":false,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"condensed_sales\",\"validation_results\":\"validation_results\",\"source\":\"condensed_sales\",\"batch_id\":\"batch_id\",\"max_value\":null,\"min_value\":null,\"evaluated_expectations\":6,\"success_percent\":33.33333333333333,\"successful_expectations\":2,\"unsuccessful_expectations\":4,\"expectation_type\":\"expect_column_pair_values_to_be_equal\",\"expectation_success\":false,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info\",\"observed_value\":null,\"column\":null,\"column_A\":\"datapakid\",\"column_B\":\"partno\", \"unexpected_index_list\":[{\"run_success\":false,\"datapakid\":\"2\",\"salesorder\":\"4\",\"customer\":\"customer3\",\"date\":20170430,\"item\":\"4\",\"partno\":\"1\"}],\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data/control/data_validator_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"checkpoint_config\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_name\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_results\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_result_identifier\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"spec_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"input_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"batch_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column_list\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"max_value\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"min_value\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"sum_total\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"evaluated_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success_percent\",\n      \"nullable\": true,\n      \"type\": \"double\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"successful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unsuccessful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_type\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"exception_info\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"exception_message\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"exception_traceback\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exception\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unexpected_index_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        
\"elementType\": {\n          \"fields\": [\n            {\n              \"metadata\": {},\n              \"name\": \"customer\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"date\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"item\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"request\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"salesorder\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"run_success\",\n              \"nullable\": true,\n              \"type\": \"boolean\"\n            },\n             {\n              \"metadata\": {},\n              \"name\": \"amount\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"datapakid\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            },\n            {\n              \"metadata\": {},\n              \"name\": \"partno\",\n              \"nullable\": true,\n              \"type\": \"string\"\n            }\n          ],\n          \"type\": \"struct\"\n        },\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"meta\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"column\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_check_type\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_rule_id\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"execution_point\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"filters\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"schema\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"table\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"observed_value\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_year\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_month\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_day\",\n      \"nullable\": true,\n      \"type\": 
\"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"kwargs\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column_A\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column_B\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"value_set\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source_primary_key\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"processed_keys\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data/control/sales.json",
    "content": "{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"1\",\"recordmode\":\"N\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article2\",\"amount\":\"200\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"1\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article3\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"2\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"10\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"2\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"30\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"3\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article5\",\"amount\":\"200\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"colum
n_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article6\",\"amount\":\"150\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article3\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"5\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article5\",\"amount\":\"80\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"100\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"2\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"00000000000000t\",\"request\":\"0\",\"datapakid\":\"0\",\"partno\":\"0\",\"record\":\"0\",\"salesorder\":\"6\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"90\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_multicolumn_sum_to_equal\",\"kwargs\":\"{\\\"column_list\\\":[\\\"salesorder\\\",\\\"request\\\"],\\\"sum_total\\\":5.0}\"}]}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"3\",\"salesorder\":\"1\",\"item\":\"1\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"150\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":fal
se,\"run_row_success\":true}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"5\",\"salesorder\":\"2\",\"item\":\"2\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"50\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"amount\\\",\\\"max_value\\\":3.0,\\\"min_value\\\":3.0}\"}]}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"2\",\"partno\":\"1\",\"record\":\"2\",\"salesorder\":\"4\",\"item\":\"4\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article2\",\"amount\":\"70\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_pair_values_to_be_equal\",\"kwargs\":\"{\\\"column_A\\\":\\\"datapakid\\\",\\\"column_B\\\":\\\"partno\\\"}\"},{\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"amount\\\",\\\"max_value\\\":3.0,\\\"min_value\\\":3.0}\"}]}}\n{\"actrequest_timestamp\":\"20180110120052t\",\"request\":\"request1\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"1\",\"salesorder\":\"7\",\"item\":\"1\",\"recordmode\":\"N\",\"date\":\"20180110\",\"customer\":\"customer5\",\"article\":\"article2\",\"amount\":\"120\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_values_to_be_in_set\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"salesorder\\\",\\\"value_set\\\":[1,2,3,4,5]}\"}]}}\n{\"actrequest_timestamp\":\"20180110130103t\",\"request\":\"request2\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"4\",\"salesorder\":\"4\",\"item\":\"1\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article3\",\"amount\":\"70\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"amount\\\",\\\"max_value\\\":3.0,\\\"min_value\\\":3.0}\"}]}}\n{\"actrequest_timestamp\":\"20180110130103t\",\"request\":\"request2\",\"datapakid\":\"1\",\"partno\":\"1\",\"record\":\"6\",\"salesorder\":\"4\",\"item\":\"3\",\"recordmode\":\"N\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article1\",\"amount\":\"40\",\"dq_validations\":{\"run_name\":\"--dq_validator-condensed_sales--checkpoint\",\"run_success\":false,\"raised_exceptions\":false,\"run_row_success\":false,\"dq_failure_details\":[{\"expectation_type\":\"expect_column_value_lengths_to_be_between\",\"kwargs\":\"{\\\"batch_id\\\":\\\"f254637fcd94414aae931f85b2d20d02\\\",\\\"column\\\":\\\"amount\\\",\\\"max_value\\\":3.0,\\\"min_value\\\":3.0}\"}]}}\n"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data/control/sales_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"actrequest_timestamp\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"request\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"datapakid\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"partno\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"record\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"salesorder\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"item\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"recordmode\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"date\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"customer\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"article\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"amount\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"dq_validations\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"run_name\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exceptions\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_row_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_failure_details\",\n            \"nullable\": true,\n            \"type\": {\n              \"containsNull\": true,\n              \"elementType\": {\n                \"fields\": [\n                  {\n                    \"metadata\": {},\n                    \"name\": \"expectation_type\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  },\n                  {\n                    \"metadata\": {},\n                    \"name\": \"kwargs\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  }\n                ],\n                \"type\": \"struct\"\n              },\n              \"type\": \"array\"\n            }\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data/source/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data/source/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data/source/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/streaming_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"validator\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/dq\",\n      \"result_sink_db_table\": \"test_db.validator_delta_with_duplicates_tag\",\n      \"result_sink_format\": \"delta\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"tag_source_data\": true,\n      \"source\": \"condensed_sales\",\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 19,\n            \"max_value\": 19\n          }\n        },\n        {\n          \"function\": \"expect_table_column_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 12,\n            \"max_value\": 12\n          }\n        },\n        {\n          \"function\": \"expect_multicolumn_sum_to_equal\",\n          \"args\":{\n            \"column_list\": [\"salesorder\", \"request\"],\n            \"sum_total\": 5\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/streaming_new.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"condensed_sales\",\n      \"dq_type\": \"validator\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/dq\",\n      \"result_sink_db_table\": \"test_db.validator_delta_with_duplicates_tag\",\n      \"result_sink_format\": \"delta\",\n      \"tag_source_data\": true,\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"source\": \"condensed_sales\",\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 9,\n            \"max_value\": 9\n          }\n        },\n        {\n          \"function\": \"expect_table_column_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 12,\n            \"max_value\": 12\n          }\n        },\n        {\n          \"function\": \"expect_column_values_to_be_in_set\",\n          \"args\": {\n            \"column\": \"salesorder\",\n            \"value_set\": [1, 2, 3, 4, 5]\n          }\n        },\n        {\n          \"function\": \"expect_column_value_lengths_to_be_between\",\n          \"args\": {\n            \"column\": \"amount\",\n            \"min_value\": 3,\n            \"max_value\": 3\n          }\n        },\n        {\n          \"function\": \"expect_column_to_exist\",\n          \"args\": {\n            \"column\": \"fake_column\"\n          }\n        },\n        {\n          \"function\": \"expect_column_pair_values_to_be_equal\",\n          \"args\": {\n            \"column_A\": \"datapakid\",\n            \"column_B\": \"partno\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\"\n      },\n      
\"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/delta_with_duplicates_tag/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/full_overwrite/data\"\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"sales_source\",\n      \"dq_type\": \"validator\",\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/full_overwrite/dq\",\n      \"result_sink_db_table\": \"test_db.validator_full_overwrite\",\n      \"result_sink_extra_columns\": [\"validation_results.result.*\"],\n      \"source\": \"sales\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"tag_source_data\": true,\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_column_to_exist\",\n          \"args\": {\n            \"column\": \"article\"\n          }\n        },\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\": {\n            \"min_value\": 3,\n            \"max_value\": 50\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/full_overwrite/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite/batch_new.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/full_overwrite/data\"\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"sales_source\",\n      \"dq_type\": \"validator\",\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/full_overwrite/dq\",\n      \"result_sink_db_table\": \"test_db.validator_full_overwrite\",\n      \"result_sink_extra_columns\": [\"validation_results.result.*\"],\n      \"source\": \"sales\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"tag_source_data\": true,\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_column_to_exist\",\n          \"args\": {\n            \"column\": \"article\"\n          }\n        },\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\": {\n            \"min_value\": 3,\n            \"max_value\": 50\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/full_overwrite/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.sources.partitionColumnTypeInference.enabled\": false\n  }\n}\n"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite/data/control/data_validator.json",
    "content": "{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_1\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_1\",\"source\":\"sales\",\"batch_id\":\"batch_id_1\",\"column\":\"article\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_1\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_2\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_2\",\"source\":\"sales\",\"batch_id\":\"batch_id_2\",\"max_value\":50,\"min_value\":3,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_2\",\"observed_value\":19,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_1\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_1\",\"source\":\"sales\",\"batch_id\":\"batch_id_1\",\"column\":\"article\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_1\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", 
\"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_2\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_2\",\"source\":\"sales\",\"batch_id\":\"batch_id_2\",\"max_value\":50,\"min_value\":3,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_2\",\"observed_value\":19,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite/data/control/data_validator_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"checkpoint_config\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_name\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_results\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"spec_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"input_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"batch_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"max_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"min_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"evaluated_expectations\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success_percent\",\n      \"nullable\": true,\n      \"type\": \"double\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"successful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unsuccessful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unexpected_index_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_type\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"exception_info\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"exception_message\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"exception_traceback\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exception\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"meta\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"column\",\n            
\"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_check_type\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_rule_id\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"execution_point\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"filters\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"schema\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"table\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"observed_value\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_year\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_month\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_day\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"kwargs\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source_primary_key\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"processed_keys\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|10000\n1|2|20160601|customer1|article2|20000\n1|3|20160601|customer1|article3|5000\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000\n5|1|20170510|customer4|article6|15000\n5|2|20170510|customer4|article3|10000\n5|3|20170510|customer4|article5|8000\n6|1|20170601|customer2|article4|10000\n6|2|20170601|customer2|article1|5000\n6|3|20170601|customer2|article2|9000"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite_tag/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data\"\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"sales_source\",\n      \"dq_type\": \"validator\",\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/full_overwrite_tag/dq\",\n      \"result_sink_db_table\": \"test_db.validator_full_overwrite_tag\",\n      \"result_sink_extra_columns\": [\"validation_results.result.*\"],\n      \"source\": \"sales\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"tag_source_data\": true,\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_column_to_exist\",\n          \"args\": {\n            \"column\": \"article\"\n          }\n        },\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\": {\n            \"min_value\": 3,\n            \"max_value\": 50\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite_tag/batch_new.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data\"\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"sales_source\",\n      \"dq_type\": \"validator\",\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/full_overwrite_tag/dq\",\n      \"result_sink_db_table\": \"test_db.validator_full_overwrite_tag\",\n      \"result_sink_extra_columns\": [\"validation_results.result.*\"],\n      \"source\": \"sales\",\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"tag_source_data\": true,\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_column_to_exist\",\n          \"args\": {\n            \"column\": \"article\"\n          }\n        },\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\": {\n            \"min_value\": 3,\n            \"max_value\": 50\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.sources.partitionColumnTypeInference.enabled\": false\n  }\n}\n"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data/control/data_validator.json",
    "content": "{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_1\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_1\",\"source\":\"sales\",\"batch_id\":\"batch_id_1\",\"column\":\"article\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_1\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20221228-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_2\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_2\",\"source\":\"sales\",\"batch_id\":\"batch_id_2\",\"max_value\":50,\"min_value\":3,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_2\",\"observed_value\":19,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_1\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_1\",\"source\":\"sales\",\"batch_id\":\"batch_id_1\",\"column\":\"article\",\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_column_to_exist\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_1\",\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", 
\"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20221229-104013-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-12-29T21:40:13.053632+00:00\",\"run_results\":\"run_results_for_all_expectations_2\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"validation_results\":\"validation_results_2\",\"source\":\"sales\",\"batch_id\":\"batch_id_2\",\"max_value\":50,\"min_value\":3,\"evaluated_expectations\":2,\"success_percent\":100.0,\"successful_expectations\":2,\"unsuccessful_expectations\":0,\"expectation_type\":\"expect_table_row_count_to_be_between\",\"expectation_success\":true,\"kwargs\":\"kwargs\",\"exception_info\":\"exception_info_2\",\"observed_value\":19,\"run_time_year\":2022,\"run_time_month\":12,\"run_time_day\":29,\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data/control/data_validator_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"checkpoint_config\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_name\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_result_identifier\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"spec_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"input_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_results\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"batch_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"column\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"max_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"min_value\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"evaluated_expectations\",\n      \"nullable\": true,\n      \"type\": \"float\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success_percent\",\n      \"nullable\": true,\n      \"type\": \"double\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"successful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unsuccessful_expectations\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_type\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"expectation_success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"exception_info\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"exception_message\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"exception_traceback\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exception\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"meta\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"column\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n           
 \"metadata\": {},\n            \"name\": \"dq_check_type\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_rule_id\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"execution_point\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"filters\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"schema\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"table\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"observed_value\",\n      \"nullable\": true,\n      \"type\": \"long\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_year\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_month\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time_day\",\n      \"nullable\": true,\n      \"type\": \"integer\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"kwargs\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source_primary_key\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unexpected_index_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"processed_keys\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data/control/sales.json",
    "content": "{\"salesorder\":\"1\",\"item\":\"1\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":\"10000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"1\",\"item\":\"2\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article2\",\"amount\":\"20000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"1\",\"item\":\"3\",\"date\":\"20160601\",\"customer\":\"customer1\",\"article\":\"article3\",\"amount\":\"5000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"2\",\"item\":\"1\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"1000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"2\",\"item\":\"2\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article6\",\"amount\":\"5000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"2\",\"item\":\"3\",\"date\":\"20170215\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"3000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"3\",\"item\":\"1\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article5\",\"amount\":\"20000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"3\",\"item\":\"2\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article2\",\"amount\":\"12000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"3\",\"item\":\"3\",\"date\":\"20170215\",\"customer\":\"customer1\",\"article\":\"article4\",\"amount\":\"9000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"4\",\"item\":\"1\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article3\",\"amount\":\"8000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"4\",\"item\":\"2\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article7\",\"amount\":\"7000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"4\",\"item\":\"3\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article1\",\"amount\":\"3000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"4\",\"item\":\"4\",\"date\":\"20170430\",\"customer\":\"customer3\",\"article\":\"article2\",\"am
ount\":\"5000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"5\",\"item\":\"1\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article6\",\"amount\":\"15000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"5\",\"item\":\"2\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article3\",\"amount\":\"10000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"5\",\"item\":\"3\",\"date\":\"20170510\",\"customer\":\"customer4\",\"article\":\"article5\",\"amount\":\"8000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"6\",\"item\":\"1\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":\"10000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"6\",\"item\":\"2\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article1\",\"amount\":\"5000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n{\"salesorder\":\"6\",\"item\":\"3\",\"date\":\"20170601\",\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":\"9000\",\"dq_validations\":{\"run_name\":\"--dq_validator-sales_source--checkpoint\",\"run_success\":true,\"raised_exceptions\":false,\"run_row_success\":true}}\n"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data/control/sales_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"salesorder\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"item\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"date\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"customer\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"article\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"amount\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"dq_validations\",\n      \"nullable\": true,\n      \"type\": {\n        \"fields\": [\n          {\n            \"metadata\": {},\n            \"name\": \"run_name\",\n            \"nullable\": true,\n            \"type\": \"string\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"raised_exceptions\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"run_row_success\",\n            \"nullable\": true,\n            \"type\": \"boolean\"\n          },\n          {\n            \"metadata\": {},\n            \"name\": \"dq_failure_details\",\n            \"nullable\": true,\n            \"type\": {\n              \"containsNull\": true,\n              \"elementType\": {\n                \"fields\": [\n                  {\n                    \"metadata\": {},\n                    \"name\": \"expectation_type\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  },\n                  {\n                    \"metadata\": {},\n                    \"name\": \"kwargs\",\n                    \"nullable\": true,\n                    \"type\": \"string\"\n                  }\n                ],\n                \"type\": \"struct\"\n              },\n              \"type\": \"array\"\n            }\n          }\n        ],\n        \"type\": \"struct\"\n      }\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/full_overwrite_tag/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|10000\n1|2|20160601|customer1|article2|20000\n1|3|20160601|customer1|article3|5000\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000\n5|1|20170510|customer4|article6|15000\n5|2|20170510|customer4|article3|10000\n5|3|20170510|customer4|article5|8000\n6|1|20170601|customer2|article4|10000\n6|2|20170601|customer2|article1|5000\n6|3|20170601|customer2|article2|9000"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/no_transformers/data/control/data_validator.json",
    "content": "{\"checkpoint_config\":\"checkpoint_config_init\",\"run_name\":\"20220611-211348-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-06-11T21:13:48.505870+00:00\",\"run_results\":\"run_results_for_all_expectations_1\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}\n{\"checkpoint_config\":\"checkpoint_config\",\"run_name\":\"20220612-211348-dq_validator-sales_source-checkpoint\",\"run_time\":\"2022-06-12T21:13:48.505870+00:00\",\"run_results\":\"run_results_for_all_expectations_2\",\"success\":true,\"validation_result_identifier\":\"validation_result_identifier\",\"spec_id\":\"dq_validator\",\"input_id\":\"sales_source\",\"source_primary_key\": [\"salesorder\", \"item\", \"date\", \"customer\"]}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/no_transformers/data/control/data_validator_schema.json",
    "content": "{\n  \"fields\": [\n    {\n      \"metadata\": {},\n      \"name\": \"checkpoint_config\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_name\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"run_time\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"validation_results\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"success\",\n      \"nullable\": true,\n      \"type\": \"boolean\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"spec_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"input_id\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"source_primary_key\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"unexpected_index_list\",\n      \"nullable\": true,\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"metadata\": {},\n      \"name\": \"processed_keys\",\n      \"nullable\": true,\n      \"type\": \"string\"\n    }\n  ],\n  \"type\": \"struct\"\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/no_transformers/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/no_transformers/data/source/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/no_transformers/data/source/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/no_transformers/data/source/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/no_transformers/streaming_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/no_transformers/data\"\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"sales_source\",\n      \"dq_type\": \"validator\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/no_transformers/dq\",\n      \"result_sink_db_table\": \"test_db.validator_no_transformers\",\n      \"result_sink_format\": \"json\",\n      \"result_sink_explode\": false,\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 34,\n            \"max_value\": 34\n          }\n        },\n        {\n          \"function\": \"expect_table_column_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 12,\n            \"max_value\": 12\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.test_no_transformers\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/no_transformers/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/no_transformers/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/load_with_dq_validator/no_transformers/streaming_new.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/data_quality/load_with_dq_validator/no_transformers/data\"\n    }\n  ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"sales_source\",\n      \"dq_type\": \"validator\",\n      \"cache_df\": true,\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/no_transformers/dq\",\n      \"result_sink_db_table\": \"test_db.validator_no_transformers\",\n      \"result_sink_format\": \"json\",\n      \"result_sink_explode\": false,\n      \"unexpected_rows_pk\": [\"salesorder\", \"item\", \"date\", \"customer\"],\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 26,\n            \"max_value\": 26\n          }\n        },\n        {\n          \"function\": \"expect_table_column_count_to_be_between\",\n          \"args\":{\n            \"min_value\": 12,\n            \"max_value\": 12\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"dq_validator\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.test_no_transformers\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/no_transformers/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/data_quality/load_with_dq_validator/no_transformers/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/data_quality/validator/data/control/data_validator.csv",
    "content": "checkpoint_config|run_name|run_time|validation_results|success|validation_result_identifier|spec_id|input_id|source_primary_key|processed_keys\ncheckpoint_config|20220612-221423-validator-sales_orders-checkpoint|2022-06-12T22:14:23.625852+00:00|validation_results_for_all_expectations|true|validation_result_identifier|dq_success|sales_orders|[\"salesorder\", \"item\", \"date\", \"customer\"]|\ncheckpoint_config2|20220613-221423-validator-sales_orders-checkpoint2|2022-06-12T22:14:23.625852+00:00|validation_results_for_all_expectations2|false|validation_result_identifier|dq_failure_error_disabled|sales_orders|[\"salesorder\", \"item\", \"date\", \"customer\"]|"
  },
  {
    "path": "tests/resources/feature/data_quality/validator/data/dq_functions/test_db.dq_functions_source_dq_failure.csv",
    "content": "dq_rule_id|dq_check_type|dq_tech_function|execution_point|schema|table|column|filters|arguments\nrule_1|COLUMN EXISTS|expect_column_to_exist|at_rest|test_db|dummy_sales|article||{\"column\": \"article\"}\nrule_2|ROW COUNT|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|||{\"min_value\": 0, \"max_value\": 1}"
  },
  {
    "path": "tests/resources/feature/data_quality/validator/data/dq_functions/test_db.dq_functions_source_dq_failure_error_disabled.csv",
    "content": "dq_rule_id|dq_check_type|dq_tech_function|execution_point|schema|table|column|filters|arguments\nrule_1|ROW COUNT|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|||{\"min_value\": 0, \"max_value\": 1}"
  },
  {
    "path": "tests/resources/feature/data_quality/validator/data/dq_functions/test_db.dq_functions_source_dq_failure_max_percentage.csv",
    "content": "dq_rule_id|dq_check_type|dq_tech_function|execution_point|schema|table|column|filters|arguments\nrule_1|COLUMN EXISTS|expect_column_to_exist|at_rest|test_db|dummy_sales|article||{\"column\": \"article\"}\nrule_2|ROW COUNT|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|||{\"min_value\": 0, \"max_value\": 1}"
  },
  {
    "path": "tests/resources/feature/data_quality/validator/data/dq_functions/test_db.dq_functions_source_dq_success.csv",
    "content": "dq_rule_id|dq_check_type|dq_tech_function|execution_point|schema|table|column|filters|arguments\nrule_1|COLUMN EXISTS|expect_column_to_exist|at_rest|test_db|dummy_sales|article||{\"column\": \"article\"}\nrule_2|ROW COUNT|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|||{\"min_value\": 0, \"max_value\": 50}"
  },
  {
    "path": "tests/resources/feature/data_quality/validator/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch/data\"\n    },\n    {\n      \"spec_id\": \"sales_silver\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_silver_timestamp\",\n      \"input_id\": \"sales_silver\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"extraction_date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"extraction_date\",\n            \"increment_df\": \"max_sales_silver_timestamp\"\n          }\n        },\n        {\n          \"function\": \"with_auto_increment_id\"\n        },\n        {\n          \"function\": \"group_and_rank\",\n          \"args\": {\n            \"group_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key\": [\n              \"extraction_date\",\n              \"changed_on\",\n              \"lhe_row_id\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_silver\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item\",\n        \"update_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on\",\n        \"delete_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on and new.event = 'deleted'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_auto_increment_id\"\n        },\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        },\n        {\n          \"function\": \"group_and_rank\",\n          \"args\": {\n            \"group_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key\": [\n              \"extraction_date\",\n              \"changed_on\",\n              \"lhe_row_id\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_silver\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/batch/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item\",\n        \"update_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on\",\n        \"delete_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on and new.event = 'deleted'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/control_batch_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"event\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"changed_on\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"lhe_row_id\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"extraction_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/control_streaming_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"event\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"changed_on\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"extraction_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"lhe_batch_id\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"lhe_row_id\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/data/control/batch.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount|lhe_row_id|extraction_date\n1|1|shipped|20200811|20160601|customer1|article1|150|2|202108111500000000\n1|2|created|20200811|20160601|customer1|article2|200|1|202108111400000000\n1|3|created|20200811|20160601|customer1|article3|50|2|202108111400000000\n2|1|created|20200811|20170215|customer2|article4|10|3|202108111400000000\n2|2|shipped|20200811|20170215|customer2|article2|50|0|202108111600000000\n2|3|created|20200811|20170215|customer2|article1|30|5|202108111400000000\n3|1|created|20200811|20170215|customer1|article5|200|6|202108111400000000\n3|2|released|20200811|20170215|customer1|article2|120|4|202108111500000000\n3|3|released|20200811|20170215|customer1|article4|90|5|202108111500000000\n4|1|cancelled|20200811|20170430|customer3|article3|100|1|202108111600000000\n4|2|released|20200811|20170430|customer3|article7|80|2|202108111600000000\n4|4|released|20200811|20170430|customer3|article2|60|4|202108111600000000\n5|1|created|20200811|20170510|customer4|article6|150|13|202108111400000000\n5|2|created|20200811|20170510|customer4|article3|100|14|202108111400000000\n5|3|created|20200811|20170510|customer4|article5|80|15|202108111400000000\n6|1|created|20200811|20170601|customer2|article4|100|16|202108111400000000\n6|2|created|20200811|20170601|customer2|article1|50|17|202108111400000000\n6|3|created|20200811|20170601|customer2|article2|90|18|202108111400000000\n7|1|cancelled|20200811|20180110|customer5|article2|120|0|202108111500000000"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/data/control/streaming.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount|extraction_date|lhe_batch_id|lhe_row_id\n1|1|shipped|20200811|20160601|customer1|article1|150|202108111500000000|4|2\n1|2|created|20200811|20160601|customer1|article2|200|202108111400000000|3|1\n1|3|created|20200811|20160601|customer1|article3|50|202108111400000000|3|2\n2|1|created|20200811|20170215|customer2|article4|10|202108111400000000|3|3\n2|2|shipped|20200811|20170215|customer2|article2|50|202108111600000000|5|0\n2|3|created|20200811|20170215|customer2|article1|30|202108111400000000|3|5\n3|1|created|20200811|20170215|customer1|article5|200|202108111400000000|3|6\n3|2|released|20200811|20170215|customer1|article2|120|202108111500000000|4|4\n3|3|released|20200811|20170215|customer1|article4|90|202108111500000000|4|5\n4|1|cancelled|20200811|20170430|customer3|article3|100|202108111600000000|5|1\n4|2|released|20200811|20170430|customer3|article7|80|202108111600000000|5|2\n4|4|released|20200811|20170430|customer3|article2|60|202108111600000000|5|4\n5|1|created|20200811|20170510|customer4|article6|150|202108111400000000|3|13\n5|2|created|20200811|20170510|customer4|article3|100|202108111400000000|3|14\n5|3|created|20200811|20170510|customer4|article5|80|202108111400000000|3|15\n6|1|created|20200811|20170601|customer2|article4|100|202108111400000000|3|16\n6|2|created|20200811|20170601|customer2|article1|50|202108111400000000|3|17\n6|3|created|20200811|20170601|customer2|article2|90|202108111400000000|3|18\n7|1|cancelled|20200811|20180110|customer5|article2|120|202108111500000000|4|0"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/data/source/WE_SO_SCL_202108111400000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n1|1|created|20200811|20160601|customer1|article1|100\n1|2|created|20200811|20160601|customer1|article2|200\n1|3|created|20200811|20160601|customer1|article3|50\n2|1|created|20200811|20170215|customer2|article4|10\n2|2|created|20200811|20170215|customer2|article6|50\n2|3|created|20200811|20170215|customer2|article1|30\n3|1|created|20200811|20170215|customer1|article5|200\n3|2|created|20200811|20170215|customer1|article2|120\n3|3|created|20200811|20170215|customer1|article4|90\n4|1|created|20200811|20170430|customer3|article3|80\n4|2|created|20200811|20170430|customer3|article7|70\n4|3|created|20200811|20170430|customer3|article1|30\n4|4|created|20200811|20170430|customer3|article2|50\n5|1|created|20200811|20170510|customer4|article6|150\n5|2|created|20200811|20170510|customer4|article3|100\n5|3|created|20200811|20170510|customer4|article5|80\n6|1|created|20200811|20170601|customer2|article4|100\n6|2|created|20200811|20170601|customer2|article1|50\n6|3|created|20200811|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/data/source/WE_SO_SCL_202108111500000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n7|1|cancelled|20200811|20180110|customer5|article2|120\n7|1|created|20200811|20180110|customer5|article2|120\n1|1|shipped|20200811|20160601|customer1|article1|150\n2|2|released|20200811|20170215|customer2|article2|50\n3|2|released|20200811|20170215|customer1|article2|120\n3|3|released|20200811|20170215|customer1|article4|90"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/data/source/WE_SO_SCL_202108111600000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n2|2|shipped|20200811|20170215|customer2|article2|50\n4|1|cancelled|20200811|20170430|customer3|article3|100\n4|2|released|20200811|20170430|customer3|article7|80\n4|3|deleted|20200811|20170430|customer3|article1|30\n4|4|released|20200811|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"event\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"changed_on\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/streaming_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/streaming/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/streaming/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_bronze_with_extraction_date\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        },\n        {\n          \"function\": \"with_auto_increment_id\"\n        },\n        {\n          \"function\": \"group_and_rank\",\n          \"args\": {\n            \"group_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key\": [\n              \"extraction_date\",\n              \"changed_on\",\n              \"lhe_row_id\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_silver\",\n      \"input_id\": \"sales_bronze_with_extraction_date\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/streaming/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/fail_with_duplicates_in_same_file/streaming/checkpoint\"\n      },\n      \"with_batch_id\": true,\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item\",\n        \"update_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on\",\n        \"delete_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on and new.event = 'deleted'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/with_duplicates_in_same_file/batch/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/with_duplicates_in_same_file/batch/data\"\n    },\n    {\n      \"spec_id\": \"sales_silver\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/with_duplicates_in_same_file/batch/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_silver_timestamp\",\n      \"input_id\": \"sales_silver\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"extraction_date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"extraction_date\",\n            \"increment_df\": \"max_sales_silver_timestamp\"\n          }\n        },\n        {\n          \"function\": \"with_auto_increment_id\"\n        },\n        {\n          \"function\": \"group_and_rank\",\n          \"args\": {\n            \"group_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key\": [\n              \"extraction_date\",\n              \"changed_on\",\n              \"lhe_row_id\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_silver\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/with_duplicates_in_same_file/batch/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item\",\n        \"update_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on\",\n        \"delete_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on and new.event = 'deleted'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/with_duplicates_in_same_file/batch/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/with_duplicates_in_same_file/batch/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_auto_increment_id\"\n        },\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        },\n        {\n          \"function\": \"group_and_rank\",\n          \"args\": {\n            \"group_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key\": [\n              \"extraction_date\",\n              \"changed_on\",\n              \"lhe_row_id\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_silver\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/with_duplicates_in_same_file/batch/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item\",\n        \"update_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on\",\n        \"delete_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on and new.event = 'deleted'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/control_batch_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"event\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"changed_on\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"lhe_row_id\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"extraction_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/control_streaming_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"event\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"changed_on\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"extraction_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"lhe_batch_id\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"lhe_row_id\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/data/control/batch.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount|lhe_row_id|extraction_date\n1|1|shipped|20200811|20160601|customer1|article1|150|2|202108111500000000\n1|2|created|20200811|20160601|customer1|article2|200|1|202108111400000000\n1|3|created|20200811|20160601|customer1|article3|50|2|202108111400000000\n2|1|created|20200811|20170215|customer2|article4|10|3|202108111400000000\n2|2|shipped|20200811|20170215|customer2|article2|50|0|202108111600000000\n2|3|created|20200811|20170215|customer2|article1|30|5|202108111400000000\n3|1|created|20200811|20170215|customer1|article5|200|6|202108111400000000\n3|2|released|20200811|20170215|customer1|article2|120|4|202108111500000000\n3|3|released|20200811|20170215|customer1|article4|90|5|202108111500000000\n4|1|cancelled|20200811|20170430|customer3|article3|100|1|202108111600000000\n4|2|released|20200811|20170430|customer3|article7|80|2|202108111600000000\n4|4|released|20200811|20170430|customer3|article2|60|4|202108111600000000\n5|1|created|20200811|20170510|customer4|article6|150|13|202108111400000000\n5|2|created|20200811|20170510|customer4|article3|100|14|202108111400000000\n5|3|created|20200811|20170510|customer4|article5|80|15|202108111400000000\n6|1|created|20200811|20170601|customer2|article4|100|16|202108111400000000\n6|2|created|20200811|20170601|customer2|article1|50|17|202108111400000000\n6|3|created|20200811|20170601|customer2|article2|90|18|202108111400000000\n7|1|cancelled|20200811|20180110|customer5|article2|120|1|202108111500000000"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/data/control/streaming.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount|extraction_date|lhe_batch_id|lhe_row_id\n1|1|shipped|20200811|20160601|customer1|article1|150|202108111500000000|4|2\n1|2|created|20200811|20160601|customer1|article2|200|202108111400000000|3|1\n1|3|created|20200811|20160601|customer1|article3|50|202108111400000000|3|2\n2|1|created|20200811|20170215|customer2|article4|10|202108111400000000|3|3\n2|2|shipped|20200811|20170215|customer2|article2|50|202108111600000000|5|0\n2|3|created|20200811|20170215|customer2|article1|30|202108111400000000|3|5\n3|1|created|20200811|20170215|customer1|article5|200|202108111400000000|3|6\n3|2|released|20200811|20170215|customer1|article2|120|202108111500000000|4|4\n3|3|released|20200811|20170215|customer1|article4|90|202108111500000000|4|5\n4|1|cancelled|20200811|20170430|customer3|article3|100|202108111600000000|5|1\n4|2|released|20200811|20170430|customer3|article7|80|202108111600000000|5|2\n4|4|released|20200811|20170430|customer3|article2|60|202108111600000000|5|4\n5|1|created|20200811|20170510|customer4|article6|150|202108111400000000|3|13\n5|2|created|20200811|20170510|customer4|article3|100|202108111400000000|3|14\n5|3|created|20200811|20170510|customer4|article5|80|202108111400000000|3|15\n6|1|created|20200811|20170601|customer2|article4|100|202108111400000000|3|16\n6|2|created|20200811|20170601|customer2|article1|50|202108111400000000|3|17\n6|3|created|20200811|20170601|customer2|article2|90|202108111400000000|3|18\n7|1|cancelled|20200811|20180110|customer5|article2|120|202108111500000000|4|1"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/data/source/WE_SO_SCL_202108111400000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n1|1|created|20200811|20160601|customer1|article1|100\n1|2|created|20200811|20160601|customer1|article2|200\n1|3|created|20200811|20160601|customer1|article3|50\n2|1|created|20200811|20170215|customer2|article4|10\n2|2|created|20200811|20170215|customer2|article6|50\n2|3|created|20200811|20170215|customer2|article1|30\n3|1|created|20200811|20170215|customer1|article5|200\n3|2|created|20200811|20170215|customer1|article2|120\n3|3|created|20200811|20170215|customer1|article4|90\n4|1|created|20200811|20170430|customer3|article3|80\n4|2|created|20200811|20170430|customer3|article7|70\n4|3|created|20200811|20170430|customer3|article1|30\n4|4|created|20200811|20170430|customer3|article2|50\n5|1|created|20200811|20170510|customer4|article6|150\n5|2|created|20200811|20170510|customer4|article3|100\n5|3|created|20200811|20170510|customer4|article5|80\n6|1|created|20200811|20170601|customer2|article4|100\n6|2|created|20200811|20170601|customer2|article1|50\n6|3|created|20200811|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/data/source/WE_SO_SCL_202108111500000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n7|1|created|20200811|20180110|customer5|article2|120\n7|1|cancelled|20200811|20180110|customer5|article2|120\n1|1|shipped|20200811|20160601|customer1|article1|150\n2|2|released|20200811|20170215|customer2|article2|50\n3|2|released|20200811|20170215|customer1|article2|120\n3|3|released|20200811|20170215|customer1|article4|90"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/data/source/WE_SO_SCL_202108111600000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n2|2|shipped|20200811|20170215|customer2|article2|50\n4|1|cancelled|20200811|20170430|customer3|article3|100\n4|2|released|20200811|20170430|customer3|article7|80\n4|3|deleted|20200811|20170430|customer3|article1|30\n4|4|released|20200811|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"event\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"changed_on\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/group_and_rank/with_duplicates_in_same_file/streaming_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/with_duplicates_in_same_file/streaming/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/group_and_rank/with_duplicates_in_same_file/streaming/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_bronze_with_extraction_date\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        },\n        {\n          \"function\": \"with_auto_increment_id\"\n        },\n        {\n          \"function\": \"group_and_rank\",\n          \"args\": {\n            \"group_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key\": [\n              \"extraction_date\",\n              \"changed_on\",\n              \"lhe_row_id\"\n            ]\n          }\n        },\n        {\n          \"function\": \"repartition\",\n          \"args\": {\n            \"num_partitions\": 1\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_silver\",\n      \"input_id\": \"sales_bronze_with_extraction_date\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/with_duplicates_in_same_file/streaming/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/delta_load/group_and_rank/with_duplicates_in_same_file/streaming/checkpoint\"\n      },\n      \"with_batch_id\": true,\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item\",\n        \"update_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on\",\n        \"delete_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on and new.event = 'deleted'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/control_batch_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"event\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"changed_on\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"lhe_row_id\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"extraction_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/insert_column_set/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"example_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/insert_column_set/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/insert_column_set/data\"\n    },\n    {\n      \"spec_id\": \"example_silver\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/merge_options/insert_column_set/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_silver_timestamp\",\n      \"input_id\": \"example_silver\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"extraction_date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"example_transform\",\n      \"input_id\": \"example_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"extraction_date\",\n            \"increment_df\": \"max_sales_silver_timestamp\"\n          }\n        },\n        {\n          \"function\": \"with_auto_increment_id\"\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"example_output\",\n      \"input_id\": \"example_transform\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/merge_options/insert_column_set/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item\",\n        \"insert_predicate\": \"new.event in ('shipped','cancelled')\",\n        \"insert_column_set\": {\"salesorder\": \"new.salesorder\", \"item\": \"new.item\", \"event\": \"new.event\",\"changed_on\": \"new.changed_on\",\n          \"amount\": \"new.amount + 101\", \"lhe_row_id\": \"new.lhe_row_id\", \"extraction_date\": \"new.extraction_date\"},\n        \"delete_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on and new.event = 'deleted'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/insert_column_set/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"example_input\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/insert_column_set/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/insert_column_set/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"example_transform\",\n      \"input_id\": \"example_input\",\n      \"transformers\": [\n        {\n          \"function\": \"with_auto_increment_id\"\n        },\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"example_bronze\",\n      \"input_id\": \"example_transform\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/merge_options/insert_column_set/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/insert_column_set/data/control/batch.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount|lhe_row_id|extraction_date\n1|1|shipped|20200811|20160601|customer1|article1|150|2|202108111500000000\n1|3|created|20200811|20160601|customer1|article3|50|2|202108111400000000\n2|1|created|20200811|20170215|customer2|article4|10|3|202108111400000000\n3|1|created|20200811|20170215|customer1|article5|200|6|202108111400000000\n4|2|created|20200811|20170430|customer3|article7|70|10|202108111400000000\n4|3|created|20200811|20170430|customer3|article1|30|11|202108111400000000\n5|1|created|20200811|20170510|customer4|article6|150|13|202108111400000000\n5|2|created|20200811|20170510|customer4|article3|100|14|202108111400000000\n5|3|created|20200811|20170510|customer4|article5|80|15|202108111400000000\n6|1|created|20200811|20170601|customer2|article4|100|16|202108111400000000\n6|2|created|20200811|20170601|customer2|article1|50|17|202108111400000000\n7|1|cancelled|20200811||||221|1|202108111500000000\n1|2|created|20200811|20160601|customer1|article2|200|1|202108111400000000\n2|2|released|20200811|20170215|customer2|article2|50|3|202108111500000000\n2|3|created|20200811|20170215|customer2|article1|30|5|202108111400000000\n3|2|released|20200811|20170215|customer1|article2|120|4|202108111500000000\n3|3|released|20200811|20170215|customer1|article4|90|5|202108111500000000\n4|1|created|20200811|20170430|customer3|article3|80|9|202108111400000000\n4|4|created|20200811|20170430|customer3|article2|50|12|202108111400000000\n6|3|created|20200811|20170601|customer2|article2|90|18|202108111400000000"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/insert_column_set/data/source/WE_SO_SCL_202108111400000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n1|1|created|20200811|20160601|customer1|article1|100\n1|2|created|20200811|20160601|customer1|article2|200\n1|3|created|20200811|20160601|customer1|article3|50\n2|1|created|20200811|20170215|customer2|article4|10\n2|2|created|20200811|20170215|customer2|article6|50\n2|3|created|20200811|20170215|customer2|article1|30\n3|1|created|20200811|20170215|customer1|article5|200\n3|2|created|20200811|20170215|customer1|article2|120\n3|3|created|20200811|20170215|customer1|article4|90\n4|1|created|20200811|20170430|customer3|article3|80\n4|2|created|20200811|20170430|customer3|article7|70\n4|3|created|20200811|20170430|customer3|article1|30\n4|4|created|20200811|20170430|customer3|article2|50\n5|1|created|20200811|20170510|customer4|article6|150\n5|2|created|20200811|20170510|customer4|article3|100\n5|3|created|20200811|20170510|customer4|article5|80\n6|1|created|20200811|20170601|customer2|article4|100\n6|2|created|20200811|20170601|customer2|article1|50\n6|3|created|20200811|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/insert_column_set/data/source/WE_SO_SCL_202108111500000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n7|1|created|20200811|20180110|customer5|article2|120\n7|1|cancelled|20200811|20180110|customer5|article2|120\n1|1|shipped|20200811|20160601|customer1|article1|150\n2|2|released|20200811|20170215|customer2|article2|50\n3|2|released|20200811|20170215|customer1|article2|120\n3|3|released|20200811|20170215|customer1|article4|90"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"event\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"changed_on\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_all/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"example_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/update_all/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/update_all/data\"\n    },\n    {\n      \"spec_id\": \"example_silver\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/merge_options/update_all/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_silver_timestamp\",\n      \"input_id\": \"example_silver\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"extraction_date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"example_transform\",\n      \"input_id\": \"example_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"extraction_date\",\n            \"increment_df\": \"max_sales_silver_timestamp\"\n          }\n        },\n        {\n          \"function\": \"with_auto_increment_id\"\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"example_output\",\n      \"input_id\": \"example_transform\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/merge_options/update_all/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item\",\n        \"update_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on\",\n        \"delete_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on and new.event = 'deleted'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_all/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"example_input\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/update_all/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/update_all/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"example_transform\",\n      \"input_id\": \"example_input\",\n      \"transformers\": [\n        {\n          \"function\": \"with_auto_increment_id\"\n        },\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"example_bronze\",\n      \"input_id\": \"example_transform\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/merge_options/update_all/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_all/data/control/batch.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount|lhe_row_id|extraction_date\n1|1|shipped|20200811|20160601|customer1|article1|150|2|202108111500000000\n1|3|created|20200811|20160601|customer1|article3|50|2|202108111400000000\n2|1|created|20200811|20170215|customer2|article4|10|3|202108111400000000\n3|1|created|20200811|20170215|customer1|article5|200|6|202108111400000000\n4|2|created|20200811|20170430|customer3|article7|70|10|202108111400000000\n4|3|created|20200811|20170430|customer3|article1|30|11|202108111400000000\n5|1|created|20200811|20170510|customer4|article6|150|13|202108111400000000\n5|2|created|20200811|20170510|customer4|article3|100|14|202108111400000000\n5|3|created|20200811|20170510|customer4|article5|80|15|202108111400000000\n6|1|created|20200811|20170601|customer2|article4|100|16|202108111400000000\n6|2|created|20200811|20170601|customer2|article1|50|17|202108111400000000\n7|1|created|20200811|20180110|customer5|article2|120|0|202108111500000000\n7|1|cancelled|20200811|20180110|customer5|article2|120|1|202108111500000000\n1|2|created|20200811|20160601|customer1|article2|200|1|202108111400000000\n2|2|released|20200811|20170215|customer2|article2|50|3|202108111500000000\n2|3|created|20200811|20170215|customer2|article1|30|5|202108111400000000\n3|2|released|20200811|20170215|customer1|article2|120|4|202108111500000000\n3|3|released|20200811|20170215|customer1|article4|90|5|202108111500000000\n4|1|created|20200811|20170430|customer3|article3|80|9|202108111400000000\n4|4|created|20200811|20170430|customer3|article2|50|12|202108111400000000\n6|3|created|20200811|20170601|customer2|article2|90|18|202108111400000000"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_all/data/source/WE_SO_SCL_202108111400000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n1|1|created|20200811|20160601|customer1|article1|100\n1|2|created|20200811|20160601|customer1|article2|200\n1|3|created|20200811|20160601|customer1|article3|50\n2|1|created|20200811|20170215|customer2|article4|10\n2|2|created|20200811|20170215|customer2|article6|50\n2|3|created|20200811|20170215|customer2|article1|30\n3|1|created|20200811|20170215|customer1|article5|200\n3|2|created|20200811|20170215|customer1|article2|120\n3|3|created|20200811|20170215|customer1|article4|90\n4|1|created|20200811|20170430|customer3|article3|80\n4|2|created|20200811|20170430|customer3|article7|70\n4|3|created|20200811|20170430|customer3|article1|30\n4|4|created|20200811|20170430|customer3|article2|50\n5|1|created|20200811|20170510|customer4|article6|150\n5|2|created|20200811|20170510|customer4|article3|100\n5|3|created|20200811|20170510|customer4|article5|80\n6|1|created|20200811|20170601|customer2|article4|100\n6|2|created|20200811|20170601|customer2|article1|50\n6|3|created|20200811|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_all/data/source/WE_SO_SCL_202108111500000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n7|1|created|20200811|20180110|customer5|article2|120\n7|1|cancelled|20200811|20180110|customer5|article2|120\n1|1|shipped|20200811|20160601|customer1|article1|150\n2|2|released|20200811|20170215|customer2|article2|50\n3|2|released|20200811|20170215|customer1|article2|120\n3|3|released|20200811|20170215|customer1|article4|90"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_column_set/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"example_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/update_column_set/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/update_column_set/data\"\n    },\n    {\n      \"spec_id\": \"example_silver\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/merge_options/update_column_set/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_silver_timestamp\",\n      \"input_id\": \"example_silver\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"extraction_date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"example_transform\",\n      \"input_id\": \"example_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"extraction_date\",\n            \"increment_df\": \"max_sales_silver_timestamp\"\n          }\n        },\n        {\n          \"function\": \"with_auto_increment_id\"\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"example_output\",\n      \"input_id\": \"example_transform\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/merge_options/update_column_set/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item\",\n        \"update_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on\",\n        \"update_column_set\": {\"event\": \"current.event\", \"lhe_row_id\": \"new.lhe_row_id + 100\" },\n        \"delete_predicate\": \"new.extraction_date >= current.extraction_date and new.changed_on >= current.changed_on and new.event = 'deleted'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_column_set/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"example_input\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/update_column_set/source_schema.json\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/merge_options/update_column_set/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"example_transform\",\n      \"input_id\": \"example_input\",\n      \"transformers\": [\n        {\n          \"function\": \"with_auto_increment_id\"\n        },\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"example_bronze\",\n      \"input_id\": \"example_transform\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/merge_options/update_column_set/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_column_set/data/control/batch.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount|lhe_row_id|extraction_date\n1|2|created|20200811|20160601|customer1|article2|200|1|202108111400000000\n2|2|created|20200811|20170215|customer2|article6|50|103|202108111400000000\n2|3|created|20200811|20170215|customer2|article1|30|5|202108111400000000\n3|2|created|20200811|20170215|customer1|article2|120|104|202108111400000000\n3|3|created|20200811|20170215|customer1|article4|90|105|202108111400000000\n4|1|created|20200811|20170430|customer3|article3|80|9|202108111400000000\n4|4|created|20200811|20170430|customer3|article2|50|12|202108111400000000\n6|3|created|20200811|20170601|customer2|article2|90|18|202108111400000000\n1|1|created|20200811|20160601|customer1|article1|100|102|202108111400000000\n1|3|created|20200811|20160601|customer1|article3|50|2|202108111400000000\n2|1|created|20200811|20170215|customer2|article4|10|3|202108111400000000\n3|1|created|20200811|20170215|customer1|article5|200|6|202108111400000000\n4|2|created|20200811|20170430|customer3|article7|70|10|202108111400000000\n4|3|created|20200811|20170430|customer3|article1|30|11|202108111400000000\n5|1|created|20200811|20170510|customer4|article6|150|13|202108111400000000\n5|2|created|20200811|20170510|customer4|article3|100|14|202108111400000000\n5|3|created|20200811|20170510|customer4|article5|80|15|202108111400000000\n6|1|created|20200811|20170601|customer2|article4|100|16|202108111400000000\n6|2|created|20200811|20170601|customer2|article1|50|17|202108111400000000\n7|1|created|20200811|20180110|customer5|article2|120|0|202108111500000000\n7|1|cancelled|20200811|20180110|customer5|article2|120|1|202108111500000000"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_column_set/data/source/WE_SO_SCL_202108111400000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n1|1|created|20200811|20160601|customer1|article1|100\n1|2|created|20200811|20160601|customer1|article2|200\n1|3|created|20200811|20160601|customer1|article3|50\n2|1|created|20200811|20170215|customer2|article4|10\n2|2|created|20200811|20170215|customer2|article6|50\n2|3|created|20200811|20170215|customer2|article1|30\n3|1|created|20200811|20170215|customer1|article5|200\n3|2|created|20200811|20170215|customer1|article2|120\n3|3|created|20200811|20170215|customer1|article4|90\n4|1|created|20200811|20170430|customer3|article3|80\n4|2|created|20200811|20170430|customer3|article7|70\n4|3|created|20200811|20170430|customer3|article1|30\n4|4|created|20200811|20170430|customer3|article2|50\n5|1|created|20200811|20170510|customer4|article6|150\n5|2|created|20200811|20170510|customer4|article3|100\n5|3|created|20200811|20170510|customer4|article5|80\n6|1|created|20200811|20170601|customer2|article4|100\n6|2|created|20200811|20170601|customer2|article1|50\n6|3|created|20200811|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/merge_options/update_column_set/data/source/WE_SO_SCL_202108111500000000.csv",
    "content": "salesorder|item|event|changed_on|date|customer|article|amount\n7|1|created|20200811|20180110|customer5|article2|120\n7|1|cancelled|20200811|20180110|customer5|article2|120\n1|1|shipped|20200811|20160601|customer1|article1|150\n2|2|released|20200811|20170215|customer2|article2|50\n3|2|released|20200811|20170215|customer1|article2|120\n3|3|released|20200811|20170215|customer1|article4|90"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/backfill/batch_backfill.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/backfill/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_value\": \"20180110120052t\",\n            \"greater_or_equal\": true\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/backfill/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/backfill/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/backfill/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/backfill/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/backfill/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/backfill/data/control/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|1500\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|2000\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|500\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|100\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|500\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|300\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|2000\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|700\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|400\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|700\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|1500\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|1000\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|800\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|1000\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|500\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|900\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|1200"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/backfill/data/source/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|00000000000000t|0|0|0|0|1|1|N|20160601|customer1|article1|1000\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|2000\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|500\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|2|2|N|20170215|customer2|article6|500\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|300\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|2000\n20211227175200t|00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|1200\n20211227175200t|00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|900\n20211227175200t|00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|800\n20211227175200t|00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|700\n20211227175200t|00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|300\n20211227175200t|00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|500\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|1500\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|1000\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|800\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|1000\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|500\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/backfill/data/source/part-02.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20211227175200t|20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20211227175200t|20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20211227175200t|20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/backfill/data/source/part-03.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20211227175200t|20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20211227175200t|20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20211227175200t|20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20211227175200t|20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20211227175200t|20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20211227175200t|20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/backfill/data/source/part-04.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20211227175200t|20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/backfill/data/source/part-05.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|1200\n20211227175200t|20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|1000\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|1500\n20211227175200t|20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|500\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|500\n20211227175200t|20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|1200\n20211227175200t|20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-900\n20211227175200t|20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|800\n20211227175200t|20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|1000\n20211227175200t|20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|700\n20211227175200t|20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|800\n20211227175200t|20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|300\n20211227175200t|20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|500\n20211227175200t|20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|600\n20211227175200t|20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|600\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|700\n20211227175200t|20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|1000\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|700\n20211227175200t|20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|800\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|400"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/direct_silver_load/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/direct_silver_load/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/direct_silver_load/bronze/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\",\n            \"greater_or_equal\": true\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/direct_silver_load/bronze/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.actrequest_timestamp = new.actrequest_timestamp and current.datapakid = new.datapakid and current.partno = new.partno and current.record = new.record and current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    },\n    {\n      \"spec_id\": \"sales_silver\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/direct_silver_load/silver/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/direct_silver_load/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/direct_silver_load/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"partno\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/direct_silver_load/bronze/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.actrequest_timestamp = new.actrequest_timestamp and current.datapakid = new.datapakid and current.partno = new.partno and current.record = new.record and current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    },\n    {\n      \"spec_id\": \"sales_silver\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/direct_silver_load/silver/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/direct_silver_load/data/control/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|00000000000000t|0|0|0|0|1|1|N|20160601|customer1|article1|100\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|00000000000000t|0|0|0|0|2|2|N|20170215|customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n20211227175200t|00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n20211227175200t|00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n20211227175200t|00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n20211227175200t|00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n20211227175200t|00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20211227175200t|20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20211227175200t|20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20211227175200t|20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80\n20211227175200t|20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20211227175200t|20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20211227175200t|20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20211227175200t|20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20211227175200t|20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20211227175200t|20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20211227175200t|20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20211227175200t|20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/direct_silver_load/data/control/part-02.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/direct_silver_load/data/source/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|00000000000000t|0|0|0|0|1|1|N|20160601|customer1|article1|100\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|00000000000000t|0|0|0|0|2|2|N|20170215|customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n20211227175200t|00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n20211227175200t|00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n20211227175200t|00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n20211227175200t|00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n20211227175200t|00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/direct_silver_load/data/source/part-02.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20211227175200t|20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20211227175200t|20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20211227175200t|20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/direct_silver_load/data/source/part-03.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20211227175200t|20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20211227175200t|20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20211227175200t|20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20211227175200t|20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20211227175200t|20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20211227175200t|20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/direct_silver_load/data/source/part-04.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20211227175200t|20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/late_arriving_changes/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/late_arriving_changes/batch/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/late_arriving_changes/batch/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\",\n            \"greater_or_equal\": true\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/late_arriving_changes/batch/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"update_predicate\": \"new.extraction_timestamp > current.extraction_timestamp or new.actrequest_timestamp > current.actrequest_timestamp or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid > current.datapakid) or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid = current.datapakid and new.partno > current.partno) or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid = current.datapakid and new.partno = current.partno and new.record >= current.record)\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/late_arriving_changes/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/late_arriving_changes/batch/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/late_arriving_changes/batch/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/late_arriving_changes/data/control/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/late_arriving_changes/data/source/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|00000000000000t|0|0|0|0|1|1|N|20160601|customer1|article1|100\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|00000000000000t|0|0|0|0|2|2|N|20170215|customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n20211227175200t|00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n20211227175200t|00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n20211227175200t|00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n20211227175200t|00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n20211227175200t|00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/late_arriving_changes/data/source/part-02.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20211227175200t|20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20211227175200t|20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20211227175200t|20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/late_arriving_changes/data/source/part-03.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20211227175200t|20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20211227175200t|20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20211227175200t|20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20211227175200t|20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20211227175200t|20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20211227175200t|20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/late_arriving_changes/data/source/part-04.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20211227175200t|20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/late_arriving_changes/streaming_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/late_arriving_changes/streaming/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"transformed_sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"transformed_sales_source\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/late_arriving_changes/streaming/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/late_arriving_changes/streaming/checkpoint\"\n      },\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"update_predicate\": \"new.extraction_timestamp > current.extraction_timestamp or new.actrequest_timestamp > current.actrequest_timestamp or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid > current.datapakid) or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid = current.datapakid and new.partno > current.partno) or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid = current.datapakid and new.partno = current.partno and new.record >= current.record)\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/out_of_order_changes/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/out_of_order_changes/batch/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/out_of_order_changes/batch/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\",\n            \"greater_or_equal\": true\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/out_of_order_changes/batch/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"update_predicate\": \"new.extraction_timestamp > current.extraction_timestamp or new.actrequest_timestamp > current.actrequest_timestamp or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid > current.datapakid) or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid = current.datapakid and new.partno > current.partno) or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid = current.datapakid and new.partno = current.partno and new.record >= current.record)\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/out_of_order_changes/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/out_of_order_changes/batch/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/out_of_order_changes/batch/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/out_of_order_changes/data/control/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20211227175200t|20180110120052t|request1|3|1|1|4|4||20170430|customer3|article2|70\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/out_of_order_changes/data/source/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|00000000000000t|0|0|0|0|1|1|N|20160601|customer1|article1|100\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|00000000000000t|0|0|0|0|2|2|N|20170215|customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n20211227175200t|00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n20211227175200t|00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n20211227175200t|00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n20211227175200t|00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n20211227175200t|00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/out_of_order_changes/data/source/part-02.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20211227175200t|20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20211227175200t|20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20211227175200t|20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/out_of_order_changes/data/source/part-03.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20211227175200t|20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20211227175200t|20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20211227175200t|20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20211227175200t|20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20211227175200t|20180110120052t|request1|2|1|14|4|4||20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/out_of_order_changes/data/source/part-04.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|3|1|1|4|4||20170430|customer3|article2|70\n20211227175200t|20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/out_of_order_changes/streaming_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/out_of_order_changes/streaming/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"transformed_sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"transformed_sales_source\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/out_of_order_changes/streaming/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/out_of_order_changes/streaming/checkpoint\"\n      },\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"update_predicate\": \"new.extraction_timestamp > current.extraction_timestamp or new.actrequest_timestamp > current.actrequest_timestamp or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid > current.datapakid) or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid = current.datapakid and new.partno > current.partno) or ( new.actrequest_timestamp = current.actrequest_timestamp and new.datapakid = current.datapakid and new.partno = current.partno and new.record >= current.record)\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"ranking_key_asc\": [\n              \"recordmode\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data/control/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data/source/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|00000000000000t|0|0|0|0|1|1|N||customer1|article1|100\n20211227175200t|00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n20211227175200t|00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n20211227175200t|00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n20211227175200t|00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n20211227175200t|00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n20211227175200t|00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data/source/part-02.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount|discount|uninteresting_column\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|0.0|10.0\n20211227175200t|20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100|10.0|\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|10.0|\n20211227175200t|20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50|10.0|\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|10.0|\n20211227175200t|20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120|10.0|\n20211227175200t|20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90|10.0|\n20211227175200t|20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|10.0|"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data/source/part-03.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount|discount|uninteresting_column\n20211227175200t|20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100|10.0|\n20211227175200t|20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70|10.0|\n20211227175200t|20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80|10.0|\n20211227175200t|20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30|10.0|\n20211227175200t|20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50|10.0|\n20211227175200t|20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60|10.0|\n20211227175200t|20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60|10.0|"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_deletes_additional_columns/data/source/part-04.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount|discount|uninteresting_column\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70|10.0|\n20211227175200t|20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100|10.0|\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70|10.0|\n20211227175200t|20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80|10.0|\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40|10.0|"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_duplicates/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/with_duplicates/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/with_duplicates/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/with_duplicates/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_duplicates/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/with_duplicates/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"ranking_key_asc\": [\n              \"recordmode\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/with_duplicates/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_duplicates/data/control/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_duplicates/data/source/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|00000000000000t|0|0|0|0|1|1|N||customer1|article1|100\n20211227175200t|00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n20211227175200t|00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n20211227175200t|00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n20211227175200t|00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n20211227175200t|00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n20211227175200t|00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|2|N||customer2|article6|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n20211227175200t|00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n20211227175200t|00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80\n20211227175200t|00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70\n20211227175200t|00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30\n20211227175200t|00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_duplicates/data/source/part-02.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120\n20211227175200t|20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120\n20211227175200t|20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90\n20211227175200t|20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80\n20211227175200t|20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_duplicates/data/source/part-03.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|9|4|1||20170430|customer3|article3|100\n20211227175200t|20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20211227175200t|20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20211227175200t|20180110120052t|request1|1|1|10|4|2|X|20170430|customer3|article7|70\n20211227175200t|20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20211227175200t|20180110120052t|request1|1|1|12|4|3|D|20170430|customer3|article1|30\n20211227175200t|20180110120052t|request1|1|1|13|4|4|X|20170430|customer3|article2|50\n20211227175200t|20180110120052t|request1|1|1|14|4|4||20170430|customer3|article2|60\n20211227175200t|20180110120052t|request1|2|1|1|4|4|X|20170430|customer3|article2|60"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_duplicates/data/source/part-04.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3|article2|70\n20211227175200t|20180110130103t|request2|1|1|3|4|1|X|20170430|customer3|article3|100\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80\n20211227175200t|20180110130103t|request2|1|1|6|4|3|N|20170430|customer3|article1|40\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3|article3|70\n20211227175200t|20180110130103t|request2|1|1|5|4|2|D|20170430|customer3|article7|80"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/batch_delta.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\",\n        \"insert_predicate\": \"new.recordmode is null or new.recordmode not in ('R','D','X')\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"extraction_timestamp\",\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"ranking_key_asc\": [\n              \"recordmode\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data/control/part-01.csv",
    "content": "extraction_timestamp|actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount\n20211227175200t|20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150\n20211227175200t|00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200\n20211227175200t|00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50\n20211227175200t|00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10\n20211227175200t|20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50\n20211227175200t|00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30\n20211227175200t|00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200\n20211227175200t|00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120\n20211227175200t|00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90\n20211227175200t|20180110130103t|request2|1|1|4|4|1||20170430|customer3||70\n20211227175200t|20180110120052t|request1|1|1|11|4|2||20170430|customer3|article7|80\n20211227175200t|20180110130103t|request2|1|1|6|4|3||20170430|customer3||40\n20211227175200t|20180110120052t|request1|2|1|2|4|4||20170430|customer3||70\n20211227175200t|00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150\n20211227175200t|00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100\n20211227175200t|00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80\n20211227175200t|00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100\n20211227175200t|00000000000000t|0|0|0|0|6|2|N|20170601|customer2|article1|50\n20211227175200t|00000000000000t|0|0|0|0|6|3|N|20170601|customer2|article2|90\n20211227175200t|20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data/source/part-01.json",
    "content": "{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 1, \"item\": 1, \"recordmode\": \"N\", \"date\": \"20160601\", \"customer\": \"customer1\", \"article\": \"article1\", \"amount\": 100 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 1, \"item\": 2, \"recordmode\": \"N\", \"date\": \"20160601\", \"customer\": \"customer1\", \"article\": \"article2\", \"amount\": 200 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 1, \"item\": 3, \"recordmode\": \"N\", \"date\": \"20160601\", \"customer\": \"customer1\", \"article\": \"article3\", \"amount\": 50 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 2, \"item\": 1, \"recordmode\": \"N\", \"date\": \"20170215\", \"customer\": \"customer2\", \"article\": \"article4\", \"amount\": 10 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 2, \"item\": 2, \"recordmode\": \"N\", \"date\": \"20170215\", \"customer\": \"customer2\", \"article\": \"article6\", \"amount\": 50 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 2, \"item\": 3, \"recordmode\": \"N\", \"date\": \"20170215\", \"customer\": \"customer2\", \"article\": \"article1\", \"amount\": 30 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 3, \"item\": 1, \"recordmode\": \"N\", \"date\": \"20170215\", \"customer\": \"customer1\", \"article\": \"article5\", \"amount\": 200 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 3, \"item\": 2, \"recordmode\": \"N\", \"date\": \"20170215\", \"customer\": \"customer1\", \"article\": \"article2\", \"amount\": 120 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 3, \"item\": 3, \"recordmode\": \"N\", \"date\": \"20170215\", \"customer\": \"customer1\", \"article\": \"article4\", \"amount\": 90 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 4, \"item\": 1, \"recordmode\": \"N\", \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article3\", \"amount\": 80 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 4, \"item\": 2, \"recordmode\": \"N\", \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article7\", \"amount\": 70 }\n{ \"extraction_timestamp\": 
\"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 4, \"item\": 3, \"recordmode\": \"N\", \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article1\", \"amount\": 30 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 4, \"item\": 4, \"recordmode\": \"N\", \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article2\", \"amount\": 50 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 5, \"item\": 1, \"recordmode\": \"N\", \"date\": \"20170510\", \"customer\": \"customer4\", \"article\": \"article6\", \"amount\": 150 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 5, \"item\": 2, \"recordmode\": \"N\", \"date\": \"20170510\", \"customer\": \"customer4\", \"article\": \"article3\", \"amount\": 100 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 5, \"item\": 3, \"recordmode\": \"N\", \"date\": \"20170510\", \"customer\": \"customer4\", \"article\": \"article5\", \"amount\": 80 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 6, \"item\": 1, \"recordmode\": \"N\", \"date\": \"20170601\", \"customer\": \"customer2\", \"article\": \"article4\", \"amount\": 100 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 6, \"item\": 2, \"recordmode\": \"N\", \"date\": \"20170601\", \"customer\": \"customer2\", \"article\": \"article1\", \"amount\": 50 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"00000000000000t\", \"request\": \"0\", \"datapakid\": 0, \"partno\": 0, \"record\": 0, \"salesorder\": 6, \"item\": 3, \"recordmode\": \"N\", \"date\": \"20170601\", \"customer\": \"customer2\", \"article\": \"article2\", \"amount\": 90 }"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data/source/part-02.json",
    "content": "{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 1, \"salesorder\": 7, \"item\": 1, \"recordmode\": \"N\", \"date\": \"20180110\", \"customer\": \"customer5\", \"article\": \"article2\", \"amount\": 120 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 2, \"salesorder\": 1, \"item\": 1, \"recordmode\": \"X\", \"date\": \"20160601\", \"customer\": \"customer1\", \"article\": \"article1\", \"amount\": 100 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 3, \"salesorder\": 1, \"item\": 1, \"recordmode\": null, \"date\": \"20160601\", \"customer\": \"customer1\", \"article\": \"article1\", \"amount\": 150 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 4, \"salesorder\": 2, \"item\": 2, \"recordmode\": \"X\", \"date\": \"20170215\", \"customer\": \"customer2\", \"article\": \"article6\", \"amount\": 50 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 5, \"salesorder\": 2, \"item\": 2, \"recordmode\": null, \"date\": \"20170215\", \"customer\": \"customer2\", \"article\": \"article2\", \"amount\": 50 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 8, \"salesorder\": 4, \"item\": 1, \"recordmode\": \"X\", \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article3\", \"amount\": 80 }"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data/source/part-03.json",
    "content": "{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 9, \"salesorder\": 4, \"item\": 1, \"recordmode\": null, \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article3\", \"amount\": 100 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 10, \"salesorder\": 4, \"item\": 2, \"recordmode\": \"X\", \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article7\", \"amount\": 70 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 11, \"salesorder\": 4, \"item\": 2, \"recordmode\": null, \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article7\", \"amount\": 80 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 13, \"salesorder\": 4, \"item\": 4, \"recordmode\": \"X\", \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article2\", \"amount\": 50 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 1, \"partno\": 1, \"record\": 14, \"salesorder\": 4, \"item\": 4, \"recordmode\": null, \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article2\", \"amount\": 60 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 2, \"partno\": 1, \"record\": 1, \"salesorder\": 4, \"item\": 4, \"recordmode\": \"X\", \"date\": \"20170430\", \"customer\": \"customer3\", \"article\": \"article2\", \"amount\": 60 }"
  },
  {
    "path": "tests/resources/feature/delta_load/record_mode_cdc/with_upserts_only_removed_columns/data/source/part-04.json",
    "content": "{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110120052t\", \"request\": \"request1\", \"datapakid\": 2, \"partno\": 1, \"record\": 2, \"salesorder\": 4, \"item\": 4, \"recordmode\": null, \"date\": \"20170430\", \"customer\": \"customer3\", \"amount\": 70 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110130103t\", \"request\": \"request2\", \"datapakid\": 1, \"partno\": 1, \"record\": 3, \"salesorder\": 4, \"item\": 1, \"recordmode\": \"X\", \"date\": \"20170430\", \"customer\": \"customer3\", \"amount\": 100 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110130103t\", \"request\": \"request2\", \"datapakid\": 1, \"partno\": 1, \"record\": 4, \"salesorder\": 4, \"item\": 1, \"recordmode\": null, \"date\": \"20170430\", \"customer\": \"customer3\", \"amount\": 70 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110130103t\", \"request\": \"request2\", \"datapakid\": 1, \"partno\": 1, \"record\": 5, \"salesorder\": 4, \"item\": 3, \"recordmode\": \"X\", \"date\": \"20170430\", \"customer\": \"customer3\", \"amount\": 30 }\n{ \"extraction_timestamp\": \"20211227175200t\", \"actrequest_timestamp\": \"20180110130103t\", \"request\": \"request2\", \"datapakid\": 1, \"partno\": 1, \"record\": 6, \"salesorder\": 4, \"item\": 3, \"recordmode\": null, \"date\": \"20170430\", \"customer\": \"customer3\", \"amount\": 40 }"
  },
  {
    "path": "tests/resources/feature/dq_validator/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/dq_validator/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"db_table\": \"test_db.dq_sales\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/dq_validator/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/control/data_restore_control.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/control/dq_control_failure.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|false|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/control/dq_control_failure_disabled.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|false|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/control/dq_control_success.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/control/dq_control_success_explode.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|true|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/control/dq_control_success_explode_disabled.csv",
    "content": "checkpoint_config|run_id|run_results|success|spec_id|input_id\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|false|dq_sales|sales_source\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|false|dq_sales|sales_source\ncheckpoint configs|{20220729-143444-dq_sales-sales_source-checkpoint, 2022-07-29T14:34:44.852796+00:00}|run_results_for_all_expectations|false|dq_sales|sales_source"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/dq_functions/test_db.dq_functions_source_table_failure.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|amount|{\"min_value\": 3, \"max_value\": 11}\nrule_6|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/dq_functions/test_db.dq_functions_source_table_success.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900"
  },
  {
    "path": "tests/resources/feature/dq_validator/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/dq_validator/dq_sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/dq_validator/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/dq_validator/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/dq_validator/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/dq_validator/checkpoint\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.schemaInference\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/dq_validator/streaming_dataframe_two_runs/data/dq_functions/test_db.dq_functions_streaming_dataframe_two_runs_first_run.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}"
  },
  {
    "path": "tests/resources/feature/dq_validator/streaming_dataframe_two_runs/data/dq_functions/test_db.dq_functions_streaming_dataframe_two_runs_second_run.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\n"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_dataframe_failure_disabled/data/dq_functions/test_db.dq_functions_source_table_failure.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|amount|{\"min_value\": 3, \"max_value\": 11}\nrule_6|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_dataframe_failure_disabled/data/dq_functions/test_db.dq_functions_source_table_success.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_dataframe_success/data/dq_functions/test_db.dq_functions_source_table_failure.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|amount|{\"min_value\": 3, \"max_value\": 11}\nrule_6|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_dataframe_success/data/dq_functions/test_db.dq_functions_source_table_success.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_dq_rule/data/dq_functions/test_db.dq_table_rule_id_failure.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_3|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"article\", \"min_value\": 3}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_4|expect_wrong_expectation|at_rest|test_db|dummy_invoice|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"article\", \"column_B\": \"amount\"}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_dq_rule/data/dq_functions/test_db.dq_table_rule_id_success.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_failure_disabled/data/dq_functions/test_db.dq_functions_source_table_failure.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|amount|{\"min_value\": 3, \"max_value\": 11}\nrule_6|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_failure_disabled/data/dq_functions/test_db.dq_functions_source_table_success.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_success/data/dq_functions/test_db.dq_functions_source_table_failure.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|amount|{\"min_value\": 3, \"max_value\": 11}\nrule_6|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_batch_success/data/dq_functions/test_db.dq_functions_source_table_success.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_streaming_dq_rule/data/dq_functions/test_db.dq_table_rule_id_failure.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\", \"min_value\": 0}\nrule_3|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"article\", \"min_value\": 3}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|dummy_invoice|amount|{\"min_value\": 3, \"max_value\": 11}\nrule_5|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"article\", \"column_B\": \"amount\"}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_streaming_dq_rule/data/dq_functions/test_db.dq_table_rule_id_success.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\", \"min_value\": 0}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_streaming_failure_disabled/data/dq_functions/test_db.dq_functions_source_table_failure.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|amount|{\"min_value\": 3, \"max_value\": 11}\nrule_6|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_streaming_failure_disabled/data/dq_functions/test_db.dq_functions_source_table_success.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_streaming_success/data/dq_functions/test_db.dq_functions_source_table_failure.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_table_row_count_to_be_between|at_rest|test_db|dummy_sales|amount|{\"min_value\": 3, \"max_value\": 11}\nrule_6|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/dq_validator/table_streaming_success/data/dq_functions/test_db.dq_functions_source_table_success.csv",
    "content": "dq_rule_id|dq_tech_function|execution_point|schema|table|column|arguments\nrule_1|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_2|expect_column_min_to_be_between|in_motion|test_db|dummy_sales|amount|{\"column\": \"amount\", \"min_value\": 0}\nrule_3|expect_column_to_exist|at_rest|test_db|dummy_sales|amount|{\"column\": \"article\"}\nrule_4|expect_column_pair_a_to_be_smaller_or_equal_than_b|at_rest|test_db|dummy_sales|amount|{\"column_A\": \"salesorder\", \"column_B\": \"amount\"}\nrule_5|expect_wrong_expectation|at_rest|test_db|no_table|amount|{\"min_value\": 3, \"max_value\": 11}"
  },
  {
    "path": "tests/resources/feature/engine_usage_stats/dq_validator/data/control.json",
    "content": "{\"acon\": {\"input_spec\": {\"spec_id\": \"sales_source\", \"read_type\": \"batch\", \"data_format\": \"csv\", \"options\": {\"mode\": \"FAILFAST\", \"header\": true, \"delimiter\": \"|\"}, \"location\": \"/app/tests/lakehouse/in/feature/engine_usage_stats/dq_validator/data/\"}, \"dq_spec\": {\"spec_id\": \"dq_sales\", \"input_id\": \"sales_source\", \"dq_type\": \"validator\", \"store_backend\": \"file_system\", \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/engine_usage_stats/dq\", \"result_sink_db_table\": \"test_db.dq_validator\", \"result_sink_format\": \"json\", \"result_sink_explode\": false, \"dq_functions\": [{\"function\": \"expect_column_to_exist\", \"args\": {\"column\": \"article\"}}, {\"function\": \"expect_table_row_count_to_be_between\", \"args\": {\"min_value\": 3, \"max_value\": 11}}, {\"function\": \"expect_column_pair_a_to_be_smaller_or_equal_than_b\", \"args\": {\"column_A\": \"salesorder\", \"column_B\": \"amount\"}}]}, \"exec_env\": {\"dp_name\": \"dq_validator\"}}, \"dp_name\": \"dq_validator\", \"environment\": \"\", \"workspace_id\": \"\", \"job_id\": \"\", \"job_name\": \"\", \"run_id\": \"\", \"function\": \"execute_dq_validation\", \"engine_version\": \"1.17.0\", \"start_timestamp\": \"2024-01-03 15:05:58.808058\", \"year\": 2024, \"month\": 1}"
  },
  {
    "path": "tests/resources/feature/engine_usage_stats/dq_validator/data/source.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/engine_usage_stats/load_custom_transf_and_df/data/control.json",
    "content": "{\"acon\": {\"input_specs\": [{\"spec_id\": \"sales_source\", \"read_type\": \"batch\", \"data_format\": \"dataframe\", \"df_name\": \"DataFrame[salesorder: int, item: int, date: int, customer: string, article: string, amount: int]\"}], \"transform_specs\": [{\"spec_id\": \"renamed_kpi\", \"input_id\": \"sales_source\", \"transformers\": [{\"function\": \"rename\", \"args\": {\"cols\": {\"salesorder\": \"salesorder1\"}}}, {\"function\": \"custom_transformation\", \"args\": {\"custom_transformer\": \"<function custom_transformation at 0xffff92554d30>\"}}]}], \"output_specs\": [{\"spec_id\": \"sales_bronze\", \"input_id\": \"renamed_kpi\", \"write_type\": \"overwrite\", \"data_format\": \"delta\", \"location\": \"/app/tests/lakehouse/out/feature/engine_usage_stats/load_custom_transf_and_df/data/\"}], \"exec_env\": {\"dp_name\": \"load_custom_transf_and_df\"}}, \"dp_name\": \"load_custom_transf_and_df\", \"environment\": \"\", \"workspace_id\": \"\", \"job_id\": \"\", \"job_name\": \"\", \"run_id\": \"\", \"function\": \"load_data\", \"engine_version\": \"1.17.0\", \"start_timestamp\": \"2023-12-29 18:24:55.282039\", \"year\": 2023, \"month\": 12}"
  },
  {
    "path": "tests/resources/feature/engine_usage_stats/load_custom_transf_and_df/data/source.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/engine_usage_stats/load_simple_acon/data/control.json",
    "content": "{\"acon\": {\"input_specs\": [{\"spec_id\": \"sales_source\", \"read_type\": \"batch\", \"data_format\": \"csv\", \"options\": {\"mode\": \"FAILFAST\", \"header\": true, \"delimiter\": \"|\", \"password\": \"******\"}, \"location\": \"/app/tests/lakehouse/in/feature/engine_usage_stats/load_simple_acon/data/\"}], \"transform_specs\": [{\"spec_id\": \"renamed_kpi\", \"input_id\": \"sales_source\", \"transformers\": [{\"function\": \"rename\", \"args\": {\"cols\": {\"salesorder\": \"salesorder1\"}}}]}], \"output_specs\": [{\"spec_id\": \"sales_bronze\", \"input_id\": \"renamed_kpi\", \"write_type\": \"overwrite\", \"data_format\": \"delta\", \"location\": \"/app/tests/lakehouse/out/feature/engine_usage_stats/load_simple_acon/data/\"}], \"exec_env\": {\"dp_name\": \"load_simple_acon\"}}, \"dp_name\": \"load_simple_acon\", \"environment\": \"\", \"workspace_id\": \"\", \"job_id\": \"\", \"job_name\": \"\", \"run_id\": \"\", \"function\": \"load_data\", \"engine_version\": \"1.17.0\", \"start_timestamp\": \"2023-12-29 22:43:27.654809\", \"year\": 2023, \"month\": 12}"
  },
  {
    "path": "tests/resources/feature/engine_usage_stats/load_simple_acon/data/source.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/engine_usage_stats/table_manager/data/control.json",
    "content": "{\"acon\": {\"function\": \"execute_sql\", \"sql\": \"select 1\", \"exec_env\": {\"dp_name\": \"table_manager\"}}, \"dp_name\": \"table_manager\", \"environment\": \"\", \"workspace_id\": \"\", \"job_id\": \"\", \"job_name\": \"\", \"run_id\": \"\", \"function\": \"manage_table\", \"engine_version\": \"1.17.0\", \"start_timestamp\": \"2024-01-03 00:00:00.000000\", \"year\": 2024, \"month\": 1}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_aq_dso/data/control/dummy_table.csv",
    "content": "reqtsn|datapakid|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20210812171010000000000|94|100|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20210812171010000000000|95|101|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20210812171010000000000|96|102|1|4|20160601|2016-06-01 10:01:12.000|customer99||3000|\n20210812181010000000000|97|103|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20210812181010000000000|98|104|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20210812181010000000000|99|105|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|\n20211112171010000000000|1|1|2|1|20170215|2017-02-15 10:01:12.000|customer2|article4|1000|\n20211112171010000000000|1|2|2|2|20170215|2017-02-15 10:01:12.000|customer2|article6|5000|\n20211112171010000000000|1|3|2|3|20170215|2017-02-15 10:01:12.000|customer2|article1|3000|\n20211112171010000000000|1|4|3|1|20170215|2017-02-15 10:01:12.000|customer1|article5|20000|\n20211112171010000000000|2|5|3|2|20170215|2017-02-15 10:01:12.000|customer1|article2|12000|\n20211112171010000000000|2|6|3|3|20170215|2017-02-15 10:01:12.000|customer1|article4|9000|\n20211112171010000000000|2|7|4|1|20170430|2017-04-30 10:01:12.000|customer3|article3|8000|\n20211112171010000000000|2|8|4|2|20170430|2017-04-30 10:01:12.000|customer3|article7|7000|\n20211112171010000000000|3|9|4|3|20170430|2017-04-30 10:01:12.000|customer3|article1|3000|\n20211112171010000000000|3|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|\n20211113121010000000000|1|1|5|1|20170510|2017-05-10 01:01:01.000|customer4|article6|15000|\n20211113121010000000000|1|2|5|2|20170510|2017-05-10 01:01:01.000|customer4|article3|10000|\n20211113121010000000000|1|3|5|3|20170510|2017-05-10 01:01:01.000|customer4|article5|8000|\n20211113121010000000000|1|4|6|1|20170601|2017-06-01 01:01:01.000|customer2|article4|10000|\n20211113121010000000000|1|5|6|2|20170601|2017-06-01 01:01:01.000|customer2|article1|5000|\n20211113121010000000000|2|6|6|3|20170601|2017-06-01 01:01:01.000|customer2|article2|9000|\n20211117111010000000000|2|7|6|2|20170602|2017-06-02 01:01:01.000|customer5|article1|5320|\n20211117111010000000000|3|8|6|3|20170602|2017-06-02 01:01:01.000|customer5|article2|9320|\n20211118111010000000000|3|9|6|2|20170603|2017-06-03 01:01:01.000|customer6|article1|5010|\n20211118111010000000000|4|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_aq_dso/data/control/dummy_table_join_condition.csv",
    "content": "reqtsn|datapakid|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20210812171010000000000|94|100|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20210812171010000000000|95|101|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20210812171010000000000|96|102|1|4|20160601|2016-06-01 10:01:12.000|customer99||3000|\n20210812181010000000000|97|103|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20210812181010000000000|98|104|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20210812181010000000000|99|105|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_aq_dso/data/control/dummy_table_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"reqtsn\",\n      \"type\": \"decimal(23,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_aq_dso/data/source/dummy_table.csv",
    "content": "reqtsn|datapakid|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20210812171010000000000|94|100|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20210812171010000000000|95|101|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20210812171010000000000|96|102|1|4|20160601|2016-06-01 10:01:12.000|customer99||3000|\n20210812181010000000000|97|103|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20210812181010000000000|98|104|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20210812181010000000000|99|105|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_aq_dso/data/source/dummy_table_1.csv",
    "content": "reqtsn|datapakid|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20210712171010000000000|1|1|3|1|20170510|2017-05-10 01:01:01.000|customer40|article60|15|\n20211112171010000000000|1|1|2|1|20170215|2017-02-15 10:01:12.000|customer2|article4|1000|\n20211112171010000000000|1|2|2|2|20170215|2017-02-15 10:01:12.000|customer2|article6|5000|\n20211112171010000000000|1|3|2|3|20170215|2017-02-15 10:01:12.000|customer2|article1|3000|\n20211112171010000000000|1|4|3|1|20170215|2017-02-15 10:01:12.000|customer1|article5|20000|\n20211112171010000000000|2|5|3|2|20170215|2017-02-15 10:01:12.000|customer1|article2|12000|\n20211112171010000000000|2|6|3|3|20170215|2017-02-15 10:01:12.000|customer1|article4|9000|\n20211112171010000000000|2|7|4|1|20170430|2017-04-30 10:01:12.000|customer3|article3|8000|\n20211112171010000000000|2|8|4|2|20170430|2017-04-30 10:01:12.000|customer3|article7|7000|\n20211112171010000000000|3|9|4|3|20170430|2017-04-30 10:01:12.000|customer3|article1|3000|\n20211112171010000000000|3|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_aq_dso/data/source/dummy_table_2.csv",
    "content": "reqtsn|datapakid|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20211113121010000000000|1|1|5|1|20170510|2017-05-10 01:01:01.000|customer4|article6|15000|\n20211113121010000000000|1|2|5|2|20170510|2017-05-10 01:01:01.000|customer4|article3|10000|\n20211113121010000000000|1|3|5|3|20170510|2017-05-10 01:01:01.000|customer4|article5|8000|\n20211113121010000000000|1|4|6|1|20170601|2017-06-01 01:01:01.000|customer2|article4|10000|\n20211113121010000000000|1|5|6|2|20170601|2017-06-01 01:01:01.000|customer2|article1|5000|\n20211113121010000000000|2|6|6|3|20170601|2017-06-01 01:01:01.000|customer2|article2|9000|\n20211117111010000000000|2|7|6|2|20170602|2017-06-02 01:01:01.000|customer5|article1|5320|\n20211117111010000000000|3|8|6|3|20170602|2017-06-02 01:01:01.000|customer5|article2|9320|\n20211118111010000000000|3|9|6|2|20170603|2017-06-03 01:01:01.000|customer6|article1|5010|\n20211118111010000000000|4|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_aq_dso/data/source/rspmrequest.csv",
    "content": "request_tsn|storage|last_operation_type|last_process_tsn|last_time_stamp|records|records_read|records_updated|creation_end_time|uname|source|request_status|request_status_before_deletion|last_request_status|request_is_in_process|tlogo|datatarget|syst_date|syst_time|housekeeping_status\n20210712171010000000000|AQ|C|20210712171010000000000|20211006073103000116000|643705|0|0|20211006073103000116000|UNAME|SOURCE|GG|||N|ADSO|DUMMY_TABLE|20211006|073100|00\n20210812181010000000000|AQ|C|20211006073059000008000|20211006073103000116000|643705|0|0|20211006073103000116000|UNAME|SOURCE|GG|||N|ADSO|DUMMY_TABLE|20211006|073100|00\n20210912171010000000000|AQ|C|20211006073059000008000|20211006073103000116000|643705|0|0|20211006073103000116000|UNAME|SOURCE|GG|||N|ADSO|DUMMY_TABLE|20211006|073100|00\n20211112171010000000000|AQ|C|20211206073059000008000|20211206073103000116000|643705|0|0|20211206073103000116000|UNAME|SOURCE|GG|||N|ADSO|DUMMY_TABLE|20211206|073100|00\n20211113121010000000000|AQ|C|20211206073059000008000|20211206073103000116000|643705|0|0|20211206073103000116000|UNAME|SOURCE|GG|||N|ADSO|DUMMY_TABLE|20211206|073100|00\n20211115111010000000000|AQ|D|20211020123121000011000|20211020123121000097000|381824|0|0|20211020113419000145000|UNAME|SOURCE|D|GG||N|ADSO|DUMMY_TABLE|20211020|113416|00\n20211116111010000000000|CL|D|20211020123121000011000|20211020123121000097000|381824|0|0|20211020113419000145000|UNAME|SOURCE|D|GG||N|ADSO|DUMMY_TABLE|20211020|113416|00\n20211117111010000000000|AQ|C|20211020123734000053000|20211020123735000009000|431528|0|0|20211020123240000008000|UNAME|SOURCE|GR|GR||N|ADSO|DUMMY_TABLE|20211020|123236|00\n20211118111010000000000|AQ|C|20211020223734000053000|20211020223735000009000|431528|0|0|20211020223240000008000|UNAME|SOURCE|GR|GR||N|ADSO|DUMMY_TABLE|20211020|223236|00\n20211118111010000000000|CL|D|20211020123734000053000|20211020123735000009000|431528|0|0|20211020123240000008000|UNAME|SOURCE|D|GG||N|ADSO|DUMMY_TABLE|20211020|123236|00"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_aq_dso/dummy_table_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"reqtsn\",\n      \"type\": \"decimal(23,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_aq_dso/rspmrequest_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"request_tsn\",\n      \"type\": \"decimal(23,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"storage\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"last_operation_type\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"last_process_tsn\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"last_time_stamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"records\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"records_read\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"records_updated\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"creation_end_time\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"uname\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"source\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_status\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_status_before_deletion\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"last_request_status\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_is_in_process\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"tlogo\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datatarget\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"syst_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"syst_time\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"housekeeping_status\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/data/control/dummy_table.csv",
    "content": "reqtsn|datapakid|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20210713151010000000000|0|0|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20210713151010000000000|0|0|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20210713151010000000000|0|0|1|4|20160601|2016-06-01 10:01:12.000|customer99||3000|\n20210713151010000000000|0|0|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20210713151010000000000|0|0|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20210713151010000000000|0|0|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|\n20211112171010000000000|1|1|2|1|20170215|2017-02-15 10:01:12.000|customer2|article4|1000|\n20211112171010000000000|1|2|2|2|20170215|2017-02-15 10:01:12.000|customer2|article6|5000|\n20211112171010000000000|1|3|2|3|20170215|2017-02-15 10:01:12.000|customer2|article1|3000|\n20211112171010000000000|1|4|3|1|20170215|2017-02-15 10:01:12.000|customer1|article5|20000|\n20211112171010000000000|2|5|3|2|20170215|2017-02-15 10:01:12.000|customer1|article2|12000|\n20211112171010000000000|2|6|3|3|20170215|2017-02-15 10:01:12.000|customer1|article4|9000|\n20211112171010000000000|2|7|4|1|20170430|2017-04-30 10:01:12.000|customer3|article3|8000|\n20211112171010000000000|2|8|4|2|20170430|2017-04-30 10:01:12.000|customer3|article7|7000|\n20211112171010000000000|3|9|4|3|20170430|2017-04-30 10:01:12.000|customer3|article1|3000|\n20211112171010000000000|3|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|\n20211113121010000000000|1|1|5|1|20170510|2017-05-10 01:01:01.000|customer4|article6|15000|\n20211113121010000000000|1|2|5|2|20170510|2017-05-10 01:01:01.000|customer4|article3|10000|\n20211113121010000000000|1|3|5|3|20170510|2017-05-10 01:01:01.000|customer4|article5|8000|\n20211113121010000000000|1|4|6|1|20170601|2017-06-01 01:01:01.000|customer2|article4|10000|\n20211113121010000000000|1|5|6|2|20170601|2017-06-01 01:01:01.000|customer2|article1|5000|\n20211113121010000000000|2|6|6|3|20170601|2017-06-01 01:01:01.000|customer2|article2|9000|\n20211117111010000000000|2|7|6|2|20170602|2017-06-02 01:01:01.000|customer5|article1|5320|\n20211117111010000000000|3|8|6|3|20170602|2017-06-02 01:01:01.000|customer5|article2|9320|\n20211118111010000000000|3|9|6|2|20170603|2017-06-03 01:01:01.000|customer6|article1|5010|\n20211118111010000000000|4|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/data/control/dummy_table_join_condition.csv",
    "content": "reqtsn|datapakid|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20210713151010000000000|0|0|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20210713151010000000000|0|0|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20210713151010000000000|0|0|1|4|20160601|2016-06-01 10:01:12.000|customer99||3000|\n20210713151010000000000|0|0|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20210713151010000000000|0|0|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20210713151010000000000|0|0|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/data/control/dummy_table_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"reqtsn\",\n      \"type\": \"decimal(23,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/data/source/dummy_table.csv",
    "content": "salesorder|item|date|time|customer|/bic/article|amount|order_date\n1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n1|4|20160601|2016-06-01 10:01:12.000|customer99||3000|\n1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/data/source/dummy_table_cl_1.csv",
    "content": "reqtsn|datapakid|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20210712171010000000000|1|1|3|1|20170510|2017-05-10 01:01:01.000|customer40|article60|15|\n20211112171010000000000|1|1|2|1|20170215|2017-02-15 10:01:12.000|customer2|article4|1000|\n20211112171010000000000|1|2|2|2|20170215|2017-02-15 10:01:12.000|customer2|article6|5000|\n20211112171010000000000|1|3|2|3|20170215|2017-02-15 10:01:12.000|customer2|article1|3000|\n20211112171010000000000|1|4|3|1|20170215|2017-02-15 10:01:12.000|customer1|article5|20000|\n20211112171010000000000|2|5|3|2|20170215|2017-02-15 10:01:12.000|customer1|article2|12000|\n20211112171010000000000|2|6|3|3|20170215|2017-02-15 10:01:12.000|customer1|article4|9000|\n20211112171010000000000|2|7|4|1|20170430|2017-04-30 10:01:12.000|customer3|article3|8000|\n20211112171010000000000|2|8|4|2|20170430|2017-04-30 10:01:12.000|customer3|article7|7000|\n20211112171010000000000|3|9|4|3|20170430|2017-04-30 10:01:12.000|customer3|article1|3000|\n20211112171010000000000|3|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/data/source/dummy_table_cl_2.csv",
    "content": "reqtsn|datapakid|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20211113121010000000000|1|1|5|1|20170510|2017-05-10 01:01:01.000|customer4|article6|15000|\n20211113121010000000000|1|2|5|2|20170510|2017-05-10 01:01:01.000|customer4|article3|10000|\n20211113121010000000000|1|3|5|3|20170510|2017-05-10 01:01:01.000|customer4|article5|8000|\n20211113121010000000000|1|4|6|1|20170601|2017-06-01 01:01:01.000|customer2|article4|10000|\n20211113121010000000000|1|5|6|2|20170601|2017-06-01 01:01:01.000|customer2|article1|5000|\n20211113121010000000000|2|6|6|3|20170601|2017-06-01 01:01:01.000|customer2|article2|9000|\n20211117111010000000000|2|7|6|2|20170602|2017-06-02 01:01:01.000|customer5|article1|5320|\n20211117111010000000000|3|8|6|3|20170602|2017-06-02 01:01:01.000|customer5|article2|9320|\n20211118111010000000000|3|9|6|2|20170603|2017-06-03 01:01:01.000|customer6|article1|5010|\n20211118111010000000000|4|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/data/source/rspmrequest.csv",
    "content": "REQUEST_TSN|STORAGE|LAST_OPERATION_TYPE|LAST_PROCESS_TSN|LAST_TIME_STAMP|RECORDS|RECORDS_READ|RECORDS_UPDATED|CREATION_END_TIME|UNAME|SOURCE|REQUEST_STATUS|REQUEST_STATUS_BEFORE_DELETION|LAST_REQUEST_STATUS|REQUEST_IS_IN_PROCESS|TLOGO|DATATARGET|SYST_DATE|SYST_TIME|HOUSEKEEPING_STATUS\n20210712171010000000000|AT|C|20211006073059000008000|20211006073103000116000|643705|0|0|20211006073103000116000|UNAME|SOURCE|GG|||N|ADSO|DUMMY_TABLE|20211006|073100|00\n20211112171010000000000|AT|C|20211206073059000008000|20211206073103000116000|643705|0|0|20211206073103000116000|UNAME|SOURCE|GG|||N|ADSO|DUMMY_TABLE|20211206|073100|00\n20211113121010000000000|AT|C|20211206073059000008000|20211206073103000116000|643705|0|0|20211206073103000116000|UNAME|SOURCE|GG|||N|ADSO|DUMMY_TABLE|20211206|073100|00\n20211115111010000000000|AT|D|20211020123121000011000|20211020123121000097000|381824|0|0|20211020113419000145000|UNAME|SOURCE|D|GG||N|ADSO|DUMMY_TABLE|20211020|113416|00\n20211116111010000000000|CL|D|20211020123121000011000|20211020123121000097000|381824|0|0|20211020113419000145000|UNAME|SOURCE|D|GG||N|ADSO|DUMMY_TABLE|20211020|113416|00\n20211117111010000000000|AT|C|20211020123734000053000|20211020123735000009000|431528|0|0|20211020123240000008000|UNAME|SOURCE|GG|GG||N|ADSO|DUMMY_TABLE|20211020|123236|00\n20211118111010000000000|AT|C|20211020223734000053000|20211020223735000009000|431528|0|0|20211020223240000008000|UNAME|SOURCE|GG|GG||N|ADSO|DUMMY_TABLE|20211020|223236|00\n20211118111010000000000|CL|D|20211020123734000053000|20211020123735000009000|431528|0|0|20211020123240000008000|UNAME|SOURCE|D|GG||N|ADSO|DUMMY_TABLE|20211020|123236|00"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/dummy_table_cl_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"reqtsn\",\n      \"type\": \"decimal(23,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/dummy_table_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_b4/extract_cl_dso/rspmrequest_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"request_tsn\",\n      \"type\": \"decimal(23,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"storage\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"last_operation_type\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"last_process_tsn\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"last_time_stamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"records\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"records_read\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"records_updated\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"creation_end_time\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"uname\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"source\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_status\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_status_before_deletion\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"last_request_status\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_is_in_process\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"tlogo\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datatarget\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"syst_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"syst_time\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"housekeeping_status\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/derive_changelog_table_name/RSBASIDOC_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"slogsys\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"rlogsys\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"tsprefix\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/derive_changelog_table_name/RSTSODS_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"odsname_tech\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"odsname\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"userapp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"version\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/derive_changelog_table_name/data/source/RSBASIDOC.csv",
    "content": "slogsys|rlogsys|tsprefix\nDHACLNT003|DHACLNT003|OA\nFFEWFEWCLN|FFEWFEWCLN|CA\nPHACLNT003|DHACLNT003|CA\nPHACLNT003|DHACLNT003|CB\nAHACLNT003|DHACLNT001|CT\nAHACLNT003|DHACLNT002|CD"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/derive_changelog_table_name/data/source/RSTSODS.csv",
    "content": "odsname_tech|odsname|userapp|version\ntest_table_OA|8test_table_OA|CHANGELOG|000\ntestchartable_OA|8testchartable_OA|CHANGELOG|000\ntestrtable_OA|8testrtable_OA|CHANGELOG|000\ntest_test_table_OA|8test_test_table_OA|CHANGELOG|000\ntest_table_OA|8test_table_OA|CHANGELOG|001\ntest_table_OA|8test_table_OA|NOTCHANGELOG|000\ntesttable_OA|8testtable_OA|CHANGELOG|000\ntesttable_OA|8testtable_OA|CHANGELOG|001\ntesttable_OA|8testtable_OA|NOTCHANGELOG|000"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/data/control/dummy_table.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20211004151010|0|0|0|0|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20211004151010|0|0|0|0|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20211004151010|0|0|0|0|1|4|20160601|2016-06-01 10:01:12.000|customer99||3000|\n20211004151010|0|0|0|0|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20211004151010|0|0|0|0|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20211004151010|0|0|0|0|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|1|1|1|2|1|20170215|2017-02-15 10:01:12.000|customer2|article4|1000|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|1|1|2|2|2|20170215|2017-02-15 10:01:12.000|customer2|article6|5000|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|1|2|3|2|3|20170215|2017-02-15 10:01:12.000|customer2|article1|3000|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|1|2|4|3|1|20170215|2017-02-15 10:01:12.000|customer1|article5|20000|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|2|1|5|3|2|20170215|2017-02-15 10:01:12.000|customer1|article2|12000|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|2|1|6|3|3|20170215|2017-02-15 10:01:12.000|customer1|article4|9000|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|2|2|7|4|1|20170430|2017-04-30 10:01:12.000|customer3|article3|8000|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|2|2|8|4|2|20170430|2017-04-30 10:01:12.000|customer3|article7|7000|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|3|1|9|4|3|20170430|2017-04-30 10:01:12.000|customer3|article1|3000|\n20211104151010|ODSR_1C6Q7CHLJJ08WG131T491L1ZF|3|1|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|\n20211112171010|ODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|1|1|5|1|20170510|2017-05-10 01:01:01.000|customer4|article6|15000|\n20211112171010|ODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|1|2|5|2|20170510|2017-05-10 01:01:01.000|customer4|article3|10000|\n20211112171010|ODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|2|3|5|3|20170510|2017-05-10 01:01:01.000|customer4|article5|8000|\n20211112171010|ODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|2|4|6|1|20170601|2017-06-01 01:01:01.000|customer2|article4|10000|\n20211112171010|ODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|3|5|6|2|20170601|2017-06-01 01:01:01.000|customer2|article1|5000|\n20211112171010|ODSR_2C6Q7CHLJJ08WG131T491L1ZF|2|1|6|6|3|20170601|2017-06-01 01:01:01.000|customer2|article2|9000|\n20211113121010|ODSR_3C6Q7CHLJJ08WG131T491L1ZF|2|2|7|6|2|20170602|2017-06-02 01:01:01.000|customer5|article1|5320|\n20211113121010|ODSR_3C6Q7CHLJJ08WG131T491L1ZF|3|1|8|6|3|20170602|2017-06-02 01:01:01.000|customer5|article2|9320|\n20211114111010|ODSR_4C6Q7CHLJJ08WG131T491L1ZA|4|1|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|\n20211114111010|ODSR_4C6Q7CHLJJ08WG131T491L1ZF|3|2|9|6|2|20170603|2017-06-03 01:01:01.000|customer6|article1|5010|\n20211114111010|ODSR_4C6Q7CHLJJ08WG131T491L1ZF|4|1|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/data/control/dummy_table_join_condition.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20211004151010|0|0|0|0|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20211004151010|0|0|0|0|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20211004151010|0|0|0|0|1|4|20160601|2016-06-01 10:01:12.000|customer99||3000|\n20211004151010|0|0|0|0|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20211004151010|0|0|0|0|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20211004151010|0|0|0|0|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|\n20211114111010|ODSR_4C6Q7CHLJJ08WG131T491L1ZA|4|1|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/data/control/dummy_table_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"decimal(15,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/data/source/dummy_table.csv",
    "content": "salesorder|item|date|time|customer|/bic/article|amount|order_date\n1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n1|4|20160601|2016-06-01 10:01:12.000|customer99||3000|\n1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/data/source/dummy_table_cl_1.csv",
    "content": "request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\nDTPR_OLD_REQUEST_TO_IGNORE_444|1|1|1|3|1|20170510|2017-05-10 01:01:01.000|customer40|article60|15|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|1|1|1|2|1|20170215|2017-02-15 10:01:12.000|customer2|article4|1000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|1|1|2|2|2|20170215|2017-02-15 10:01:12.000|customer2|article6|5000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|1|2|3|2|3|20170215|2017-02-15 10:01:12.000|customer2|article1|3000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|1|2|4|3|1|20170215|2017-02-15 10:01:12.000|customer1|article5|20000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|2|1|5|3|2|20170215|2017-02-15 10:01:12.000|customer1|article2|12000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|2|1|6|3|3|20170215|2017-02-15 10:01:12.000|customer1|article4|9000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|2|2|7|4|1|20170430|2017-04-30 10:01:12.000|customer3|article3|8000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|2|2|8|4|2|20170430|2017-04-30 10:01:12.000|customer3|article7|7000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|3|1|9|4|3|20170430|2017-04-30 10:01:12.000|customer3|article1|3000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZF|3|1|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|\nODSR_1C6Q7CHLJJ08WG131T491L1ZA|3|1|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/data/source/dummy_table_cl_2.csv",
    "content": "request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\nODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|1|1|5|1|20170510|2017-05-10 01:01:01.000|customer4|article6|15000|\nODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|1|2|5|2|20170510|2017-05-10 01:01:01.000|customer4|article3|10000|\nODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|2|3|5|3|20170510|2017-05-10 01:01:01.000|customer4|article5|8000|\nODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|2|4|6|1|20170601|2017-06-01 01:01:01.000|customer2|article4|10000|\nODSR_2C6Q7CHLJJ08WG131T491L1ZF|1|3|5|6|2|20170601|2017-06-01 01:01:01.000|customer2|article1|5000|\nODSR_2C6Q7CHLJJ08WG131T491L1ZF|2|1|6|6|3|20170601|2017-06-01 01:01:01.000|customer2|article2|9000|\nODSR_3C6Q7CHLJJ08WG131T491L1ZF|2|2|7|6|2|20170602|2017-06-02 01:01:01.000|customer5|article1|5320|\nODSR_3C6Q7CHLJJ08WG131T491L1ZF|3|1|8|6|3|20170602|2017-06-02 01:01:01.000|customer5|article2|9320|\nODSR_4C6Q7CHLJJ08WG131T491L1ZF|3|2|9|6|2|20170603|2017-06-03 01:01:01.000|customer6|article1|5010|\nODSR_4C6Q7CHLJJ08WG131T491L1ZF|4|1|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|\nODSR_4C6Q7CHLJJ08WG131T491L1ZA|4|1|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/data/source/rsodsactreq.csv",
    "content": "odsobject|request|datapakid|activate|sidconversion|actrequest|operation|status|paketsize|timestamp\ndummy_table|DTPR_OLD_REQUEST_TO_IGNORE_444|0|||DTPR_OLD_REQUEST_TO_IGNORE_444|A|0|0000020000|20211004151010\ndummy_table|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|0|||ODSR_1C6Q7CHLJJ08WG131T491L1ZF|A|0|0000020000|20211104151010\ndummy_table|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|0|||ODSR_2C6Q7CHLJJ08WG131T491L1ZF|A|0|0000020000|20211112171010\ndummy_table|DTPR_F89Y1VBE6JO079PMFTL1X8BPY|0|||ODSR_3C6Q7CHLJJ08WG131T491L1ZF|A|0|0000020000|20211113121010\ndummy_table|DTPR_F99Y1VBE6JO079PMFTL1X8BPY|0|||ODSR_4C6Q7CHLJJ08WG131T491L1ZF|A|0|0000020000|20211114111010\ndummy_table|ODSR_4C6Q7CHLJJ08WG131T491L1ZA|0|||ODSR_4C6Q7CHLJJ08WG131T491L1ZA|A|0|0000020000|20211114111010"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/dummy_table_cl_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/dummy_table_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_dso/rsodsactreq_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"odsobject\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"activate\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"sidconversion\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"actrequest\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"operation\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"status\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"paketsize\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"timestamp\",\n      \"type\": \"decimal(15,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/data/control/dummy_table.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20211004151010|DTPR_INIT_REQUEST_123|1|1|1|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20211004151010|DTPR_INIT_REQUEST_123|1|1|3|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20211004151010|DTPR_INIT_REQUEST_123|1|1|3|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20211004151010|DTPR_INIT_REQUEST_123|2|2|1|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20211004151010|DTPR_INIT_REQUEST_123|2|3|1|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|1|1|2|1|20170215|2017-02-15 10:01:12.000|customer2|article4|1000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|1|2|2|2|20170215|2017-02-15 10:01:12.000|customer2|article6|5000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|2|3|2|3|20170215|2017-02-15 10:01:12.000|customer2|article1|3000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|2|4|3|1|20170215|2017-02-15 10:01:12.000|customer1|article5|20000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|1|5|3|2|20170215|2017-02-15 10:01:12.000|customer1|article2|12000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|1|6|3|3|20170215|2017-02-15 10:01:12.000|customer1|article4|9000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|2|7|4|1|20170430|2017-04-30 10:01:12.000|customer3|article3|8000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|2|8|4|2|20170430|2017-04-30 10:01:12.000|customer3|article7|7000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|3|1|9|4|3|20170430|2017-04-30 10:01:12.000|customer3|article1|3000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|3|1|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|1|1|5|1|20170510|2017-05-10 01:01:01.000|customer4|article6|15000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|1|2|5|2|20170510|2017-05-10 01:01:01.000|customer4|article3|10000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|2|3|5|3|20170510|2017-05-10 01:01:01.000|customer4|article5|8000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|2|4|6|1|20170601|2017-06-01 01:01:01.000|customer2|article4|10000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|3|5|6|2|20170601|2017-06-01 01:01:01.000|customer2|article1|5000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|2|1|6|6|3|20170601|2017-06-01 01:01:01.000|customer2|article2|9000|\n20211113121010|DTPR_F89Y1VBE6JO079PMFTL1X8BPY|2|2|7|6|2|20170602|2017-06-02 01:01:01.000|customer5|article1|5320|\n20211113121010|DTPR_F89Y1VBE6JO079PMFTL1X8BPY|3|1|8|6|3|20170602|2017-06-02 01:01:01.000|customer5|article2|9320|\n20211114111010|DTPR_F99Y1VBE6JO079PMFTL1X8BPY|3|2|9|6|2|20170603|2017-06-03 01:01:01.000|customer6|article1|5010|\n20211114111010|DTPR_F99Y1VBE6JO079PMFTL1X8BPY|4|1|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/data/control/dummy_table_actreq_timestamp.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20211003161010|DTPR_INIT_REQUEST_123|1|1|1|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20211003161010|DTPR_INIT_REQUEST_123|1|1|3|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20211003161010|DTPR_INIT_REQUEST_123|1|1|3|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20211003161010|DTPR_INIT_REQUEST_123|2|2|1|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20211003161010|DTPR_INIT_REQUEST_123|2|3|1|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|1|1|2|1|20170215|2017-02-15 10:01:12.000|customer2|article4|1000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|1|2|2|2|20170215|2017-02-15 10:01:12.000|customer2|article6|5000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|2|3|2|3|20170215|2017-02-15 10:01:12.000|customer2|article1|3000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|2|4|3|1|20170215|2017-02-15 10:01:12.000|customer1|article5|20000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|1|5|3|2|20170215|2017-02-15 10:01:12.000|customer1|article2|12000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|1|6|3|3|20170215|2017-02-15 10:01:12.000|customer1|article4|9000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|2|7|4|1|20170430|2017-04-30 10:01:12.000|customer3|article3|8000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|2|8|4|2|20170430|2017-04-30 10:01:12.000|customer3|article7|7000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|3|1|9|4|3|20170430|2017-04-30 10:01:12.000|customer3|article1|3000|\n20211104151010|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|3|1|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|1|1|5|1|20170510|2017-05-10 01:01:01.000|customer4|article6|15000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|1|2|5|2|20170510|2017-05-10 01:01:01.000|customer4|article3|10000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|2|3|5|3|20170510|2017-05-10 01:01:01.000|customer4|article5|8000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|2|4|6|1|20170601|2017-06-01 01:01:01.000|customer2|article4|10000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|3|5|6|2|20170601|2017-06-01 01:01:01.000|customer2|article1|5000|\n20211112171010|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|2|1|6|6|3|20170601|2017-06-01 01:01:01.000|customer2|article2|9000|\n20211113121010|DTPR_F89Y1VBE6JO079PMFTL1X8BPY|2|2|7|6|2|20170602|2017-06-02 01:01:01.000|customer5|article1|5320|\n20211113121010|DTPR_F89Y1VBE6JO079PMFTL1X8BPY|3|1|8|6|3|20170602|2017-06-02 01:01:01.000|customer5|article2|9320|\n20211114111010|DTPR_F99Y1VBE6JO079PMFTL1X8BPY|3|2|9|6|2|20170603|2017-06-03 01:01:01.000|customer6|article1|5010|\n20211114111010|DTPR_F99Y1VBE6JO079PMFTL1X8BPY|4|1|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/data/control/dummy_table_join_condition.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\n20211004151010|DTPR_INIT_REQUEST_123|1|1|1|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\n20211004151010|DTPR_INIT_REQUEST_123|1|1|3|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\n20211004151010|DTPR_INIT_REQUEST_123|1|1|3|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\n20211004151010|DTPR_INIT_REQUEST_123|2|2|1|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\n20211004151010|DTPR_INIT_REQUEST_123|2|3|1|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/data/control/dummy_table_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"decimal(15,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/data/source/dummy_table.csv",
    "content": "request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\nDTPR_INIT_REQUEST_123|1|1|1|1|1|20160601|2016-06-01 10:01:12.000|customer1|article1|1000|\nDTPR_INIT_REQUEST_123|1|1|3|1|2|20160601|2016-06-01 10:01:12.000|customer1|article2|2000|\nDTPR_INIT_REQUEST_123|1|1|3|1|3|20160601|2016-06-01 10:01:12.000|customer1|article3|500|\nDTPR_INIT_REQUEST_123|2|2|1|2|1|20160701|2016-07-01 10:01:12.000|customer11|article33|500|\nDTPR_INIT_REQUEST_123|2|3|1|3|1|20160701|2016-07-01 10:01:13.000|customer11|article33|500|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/data/source/dummy_table_1.csv",
    "content": "request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\nDTPR_OLD_REQUEST_TO_IGNORE_444|1|1|1|3|1|20170510|2017-05-10 01:01:01.000|customer40|article60|15|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|1|1|2|1|20170215|2017-02-15 10:01:12.000|customer2|article4|1000|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|1|2|2|2|20170215|2017-02-15 10:01:12.000|customer2|article6|5000|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|2|3|2|3|20170215|2017-02-15 10:01:12.000|customer2|article1|3000|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|1|2|4|3|1|20170215|2017-02-15 10:01:12.000|customer1|article5|20000|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|1|5|3|2|20170215|2017-02-15 10:01:12.000|customer1|article2|12000|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|1|6|3|3|20170215|2017-02-15 10:01:12.000|customer1|article4|9000|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|2|7|4|1|20170430|2017-04-30 10:01:12.000|customer3|article3|8000|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|2|2|8|4|2|20170430|2017-04-30 10:01:12.000|customer3|article7|7000|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|3|1|9|4|3|20170430|2017-04-30 10:01:12.000|customer3|article1|3000|\nDTPR_F49Y1VBE6JO079PMFTL1X8BPY|3|1|10|4|4|20170430|2017-04-30 10:01:12.000|customer3|article2|5000|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/data/source/dummy_table_2.csv",
    "content": "request|datapakid|partno|record|salesorder|item|date|time|customer|/bic/article|amount|order_date\nDTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|1|1|5|1|20170510|2017-05-10 01:01:01.000|customer4|article6|15000|\nDTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|1|2|5|2|20170510|2017-05-10 01:01:01.000|customer4|article3|10000|\nDTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|2|3|5|3|20170510|2017-05-10 01:01:01.000|customer4|article5|8000|\nDTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|2|4|6|1|20170601|2017-06-01 01:01:01.000|customer2|article4|10000|\nDTPR_F69Y1VBE6JO079PMFTL1X8BPY|1|3|5|6|2|20170601|2017-06-01 01:01:01.000|customer2|article1|5000|\nDTPR_F69Y1VBE6JO079PMFTL1X8BPY|2|1|6|6|3|20170601|2017-06-01 01:01:01.000|customer2|article2|9000|\nDTPR_F89Y1VBE6JO079PMFTL1X8BPY|2|2|7|6|2|20170602|2017-06-02 01:01:01.000|customer5|article1|5320|\nDTPR_F89Y1VBE6JO079PMFTL1X8BPY|3|1|8|6|3|20170602|2017-06-02 01:01:01.000|customer5|article2|9320|\nDTPR_F99Y1VBE6JO079PMFTL1X8BPY|3|2|9|6|2|20170603|2017-06-03 01:01:01.000|customer6|article1|5010|\nDTPR_F99Y1VBE6JO079PMFTL1X8BPY|4|1|10|6|3|20170603|2017-06-03 01:01:01.000|customer6|article2|50|"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/data/source/rsodsactreq.csv",
    "content": "odsobject|request|datapakid|activate|sidconversion|actrequest|operation|status|paketsize|timestamp\ndummy_table|DTPR_OLD_REQUEST_TO_IGNORE_444|0|||DTPR_OLD_REQUEST_TO_IGNORE_444|A|0|0000020000|20211003151010\ndummy_table|DTPR_F49Y1VBE6JO079PMFTL1X8BPY|0|||ODSR_1C6Q7CHLJJ08WG131T491L1ZF|A|0|0000020000|20211104151010\ndummy_table|DTPR_F69Y1VBE6JO079PMFTL1X8BPY|0|||ODSR_2C6Q7CHLJJ08WG131T491L1ZF|A|0|0000020000|20211112171010\ndummy_table|DTPR_F89Y1VBE6JO079PMFTL1X8BPY|0|||ODSR_3C6Q7CHLJJ08WG131T491L1ZF|A|0|0000020000|20211113121010\ndummy_table|DTPR_F99Y1VBE6JO079PMFTL1X8BPY|0|||ODSR_4C6Q7CHLJJ08WG131T491L1ZF|A|0|0000020000|20211114111010\ndummy_table|DTPR_INIT_REQUEST_123|0|||INIT_RECORD_L23F|A|0|0000010000|20211003161010"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/dummy_table_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"time\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"/bic/article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/extract_from_sap_bw/extract_write_optimised_dso/rsodsactreq_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"odsobject\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"activate\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"sidconversion\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"actrequest\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"operation\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"status\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"paketsize\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"timestamp\",\n      \"type\": \"decimal(15,0)\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/check_restore_status/acon_check_restore_status_directory.json",
    "content": "{\n  \"function\": \"check_restore_status\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\"\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/check_restore_status/acon_check_restore_status_single_object.json",
    "content": "{\n  \"function\": \"check_restore_status\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\"\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/copy_object/acon_copy_directory.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_directory\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/copy_object/acon_copy_directory_dry_run.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_directory\",\n  \"dry_run\": true\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/copy_object/acon_copy_single_object.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_single_file\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/copy_object/acon_copy_single_object_dry_run.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_single_file\",\n  \"dry_run\": true\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/delete_objects/acon_delete_objects.json",
    "content": "{\n  \"function\": \"delete_objects\",\n  \"bucket\": \"test_bucket\",\n  \"object_paths\": [\"test_single_file.json\", \"test_directory\"],\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/delete_objects/acon_delete_objects_dry_run.json",
    "content": "{\n  \"function\": \"delete_objects\",\n  \"bucket\": \"test_bucket\",\n  \"object_paths\": [\"test_single_file.json\", \"test_directory\"],\n  \"dry_run\": true\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/request_restore/acon_request_restore_directory.json",
    "content": "{\n  \"function\": \"request_restore\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Bulk\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/request_restore/acon_request_restore_single_object.json",
    "content": "{\n  \"function\": \"request_restore\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Bulk\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/request_restore_to_destination_and_wait/acon_request_restore_to_destination_and_wait_directory.json",
    "content": "{\n  \"function\": \"request_restore_to_destination_and_wait\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_directory\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Expedited\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/request_restore_to_destination_and_wait/acon_request_restore_to_destination_and_wait_single_object.json",
    "content": "{\n  \"function\": \"request_restore_to_destination_and_wait\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_single_file\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Expedited\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager/request_restore_to_destination_and_wait/acon_request_restore_to_destination_and_wait_single_object_raise_error.json",
    "content": "{\n  \"function\": \"request_restore_to_destination_and_wait\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_single_file\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Bulk\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_dbfs/copy_objects/acon_copy_directory.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"\",\n  \"source_object\": \"tests/lakehouse/dbfs/test_directory\",\n  \"destination_bucket\": \"\",\n  \"destination_object\": \"tests/lakehouse/dbfs/destination_directory\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_dbfs/copy_objects/acon_copy_directory_dry_run.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"\",\n  \"source_object\": \"tests/lakehouse/dbfs/test_directory\",\n  \"destination_bucket\": \"\",\n  \"destination_object\": \"tests/lakehouse/dbfs/destination_directory\",\n  \"dry_run\": true\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_dbfs/copy_objects/acon_copy_single_object.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"\",\n  \"source_object\": \"tests/lakehouse/dbfs/test_single_file.json\",\n  \"destination_bucket\": \"\",\n  \"destination_object\": \"tests/lakehouse/dbfs/destination_single_file.json\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_dbfs/delete_objects/acon_delete_objects.json",
    "content": "{\n  \"function\": \"delete_objects\",\n  \"bucket\": \"\",\n  \"object_paths\": [\"tests/lakehouse/dbfs/destination_directory\"],\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_dbfs/delete_objects/acon_delete_objects_dry_run.json",
    "content": "{\n  \"function\": \"delete_objects\",\n  \"bucket\": \"\",\n  \"object_paths\": [\"tests/lakehouse/dbfs/test_directory\", \"tests/lakehouse/dbfs/destination_directory\"],\n  \"dry_run\": true\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_dbfs/move_objects/acon_move_objects.json",
    "content": "{\n  \"function\": \"move_objects\",\n  \"bucket\": \"\",\n  \"source_object\": \"tests/lakehouse/dbfs/test_directory\",\n  \"destination_bucket\": \"\",\n  \"destination_object\": \"tests/lakehouse/dbfs/test_mv_directory\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_dbfs/move_objects/acon_move_objects_dry_run.json",
    "content": "{\n  \"function\": \"move_objects\",\n  \"bucket\": \"\",\n  \"source_object\": \"tests/lakehouse/dbfs/test_directory\",\n  \"destination_bucket\": \"\",\n  \"destination_object\": \"tests/lakehouse/dbfs/test_mv_directory\",\n  \"dry_run\": true\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/check_restore_status/acon_check_restore_status_directory.json",
    "content": "{\n  \"function\": \"check_restore_status\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\"\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/check_restore_status/acon_check_restore_status_single_object.json",
    "content": "{\n  \"function\": \"check_restore_status\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\"\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/copy_objects/acon_copy_directory.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_directory\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/copy_objects/acon_copy_directory_dry_run.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_directory\",\n  \"dry_run\": true\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/copy_objects/acon_copy_single_object.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_single_file\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/copy_objects/acon_copy_single_object_dry_run.json",
    "content": "{\n  \"function\": \"copy_objects\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_single_file\",\n  \"dry_run\": true\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/delete_objects/acon_delete_objects.json",
    "content": "{\n  \"function\": \"delete_objects\",\n  \"bucket\": \"test_bucket\",\n  \"object_paths\": [\"test_single_file.json\", \"test_directory\"],\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/delete_objects/acon_delete_objects_dry_run.json",
    "content": "{\n  \"function\": \"delete_objects\",\n  \"bucket\": \"test_bucket\",\n  \"object_paths\": [\"test_single_file.json\", \"test_directory\"],\n  \"dry_run\": true\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/request_restore/acon_request_restore_directory.json",
    "content": "{\n  \"function\": \"request_restore\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Bulk\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/request_restore/acon_request_restore_single_object.json",
    "content": "{\n  \"function\": \"request_restore\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Bulk\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/request_restore_to_destination_and_wait/acon_request_restore_to_destination_and_wait_directory.json",
    "content": "{\n  \"function\": \"request_restore_to_destination_and_wait\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_directory\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_directory\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Expedited\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/request_restore_to_destination_and_wait/acon_request_restore_to_destination_and_wait_single_object.json",
    "content": "{\n  \"function\": \"request_restore_to_destination_and_wait\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_single_file\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Expedited\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/file_manager_s3/request_restore_to_destination_and_wait/acon_request_restore_to_destination_and_wait_single_object_raise_error.json",
    "content": "{\n  \"function\": \"request_restore_to_destination_and_wait\",\n  \"bucket\": \"test_bucket\",\n  \"source_object\": \"test_single_file.json\",\n  \"destination_bucket\": \"destination_bucket\",\n  \"destination_object\": \"destination_single_file\",\n  \"restore_expiration\": 1,\n  \"retrieval_tier\": \"Bulk\",\n  \"dry_run\": false\n}"
  },
  {
    "path": "tests/resources/feature/full_load/full_overwrite/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/full_load/full_overwrite/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"repartitioned_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"repartition\",\n          \"args\": {\n            \"num_partitions\": 1,\n            \"cols\": [\"date\", \"customer\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/full_load/full_overwrite/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/full_load/full_overwrite/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/full_load/full_overwrite/data\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/full_load/full_overwrite/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/full_load/full_overwrite/data/control/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|10000\n1|2|20160601|customer1|article2|20000\n1|3|20160601|customer1|article3|5000\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000\n5|1|20170510|customer4|article6|15000\n5|2|20170510|customer4|article3|10000\n5|3|20170510|customer4|article5|8000\n6|1|20170601|customer2|article4|10000\n6|2|20170601|customer2|article1|5000\n6|3|20170601|customer2|article2|9000"
  },
  {
    "path": "tests/resources/feature/full_load/full_overwrite/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/full_load/full_overwrite/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|10000\n1|2|20160601|customer1|article2|20000\n1|3|20160601|customer1|article3|5000\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000\n5|1|20170510|customer4|article6|15000\n5|2|20170510|customer4|article3|10000\n5|3|20170510|customer4|article5|8000\n6|1|20170601|customer2|article4|10000\n6|2|20170601|customer2|article1|5000\n6|3|20170601|customer2|article2|9000"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/full_load/with_filter/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"filtered_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"expression_filter\",\n          \"args\": {\n            \"exp\": \"date like '2016%'\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"filtered_sales\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"parquet\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/full_load/with_filter/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/full_load/with_filter/data\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"parquet\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/full_load/with_filter/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter/data/control/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter_partition_overwrite/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/full_load/with_filter_partition_overwrite/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"filtered_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"expression_filter\",\n          \"args\": {\n            \"exp\": \"date like '2016%'\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"filtered_sales\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/full_load/with_filter_partition_overwrite/data\",\n      \"options\": {\n        \"replaceWhere\": \"date like '2016%'\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter_partition_overwrite/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/full_load/with_filter_partition_overwrite/data\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\",\n        \"customer\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/full_load/with_filter_partition_overwrite/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter_partition_overwrite/data/control/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|10000\n1|2|20160601|customer1|article2|20000\n1|3|20160601|customer1|article3|5000\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter_partition_overwrite/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|100\n2|2|20170215|customer2|article6|500\n2|3|20170215|customer2|article1|300\n3|1|20170215|customer1|article5|2000\n3|2|20170215|customer1|article2|1200\n3|3|20170215|customer1|article4|900\n4|1|20170430|customer3|article3|800\n4|2|20170430|customer3|article7|700\n4|3|20170430|customer3|article1|300\n4|4|20170430|customer3|article2|500\n5|1|20170510|customer4|article6|1500\n5|2|20170510|customer4|article3|1000\n5|3|20170510|customer4|article5|800\n6|1|20170601|customer2|article4|1000\n6|2|20170601|customer2|article1|500\n6|3|20170601|customer2|article2|900"
  },
  {
    "path": "tests/resources/feature/full_load/with_filter_partition_overwrite/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|10000\n1|2|20160601|customer1|article2|20000\n1|3|20160601|customer1|article3|5000\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000\n5|1|20170510|customer4|article6|15000\n5|2|20170510|customer4|article3|10000\n5|3|20170510|customer4|article5|8000\n6|1|20170601|customer2|article4|10000\n6|2|20170601|customer2|article1|5000\n6|3|20170601|customer2|article2|9000"
  },
  {
    "path": "tests/resources/feature/gab/control/data/vw_dummy_sales_kpi.csv",
    "content": "cadence|order_date|to_date|category_name|qty_articles|total_amount|total_amount_last_year|avg_total_amount_last_2_years|discounted_total_amount\nYEAR|2016-01-01|2016-12-31|category_a|3|7000|0|0|3920.0000000000005\nYEAR|2017-01-01|2017-12-31|category_a|10|15000|7000|7000|8400\nYEAR|2018-01-01|2018-12-31|category_a|4|36|15000|11000|20.160000000000004\nYEAR|2017-01-01|2017-12-31|category_b|5|11000|0|0|6160.000000000001\n"
  },
  {
    "path": "tests/resources/feature/gab/control/data/vw_nam_orders_all_snapshot.csv",
    "content": "cadence|order_date|to_date|sales_order_schedule|delivery_country_cod|orders|total_sales|orders_last_cad|orders_last_year|orders_avg_last_3_1|orders_derived\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY6|8|808|0|0|0|4\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY2|10|1010|0|0|0|5\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY2|4|404|0|0|0|2\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY7|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY11|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY3|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY1|24|2424|0|0|0|12\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY2|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY3|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY5|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY1|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY9|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY4|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY3|6|606|0|0|0|3\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY3|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY4|9|909|0|0|0|4.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY8|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY9|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY6|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY1|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY5|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY8|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY4|9|909|0|0|0|4.5\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY1|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY10|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY1|12|1213|0|0|0|6\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY5|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY2|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY1|78|7878|0|0|0|39\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY2|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY5|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY2|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY6|8|808|0|0|0|4\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY6|4|404|0|0|0|2\nQUARTER|2022-01-01|2022-03-31|10102419|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY1|78|7878|0|0|0|39\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY2|10|1010|0|0|0|5\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY1|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY1|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY4|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY3|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY3|6|606|0|0|0|3\nQUARTER|2022-01-
01|2022-03-31|10102415|COUNTRY2|4|404|0|0|0|2\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY2|5|505|0|0|0|2.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY6|4|404|0|0|0|2\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY7|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102419|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY1|24|2424|0|0|0|12\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY1|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY6|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY11|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY1|12|1213|0|0|0|6\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY10|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY2|5|505|0|0|0|2.5\n"
  },
  {
    "path": "tests/resources/feature/gab/control/data/vw_nam_orders_filtered_snapshot.csv",
    "content": "cadence|order_date|to_date|sales_order_schedule|delivery_country_cod|orders|total_sales|orders_last_cad|orders_last_year|orders_avg_last_3_1|orders_derived\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY6|8|808|0|0|0|4\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY3|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY3|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY3|6|606|0|0|0|3\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY3|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY6|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY6|8|808|0|0|0|4\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY6|4|404|0|0|0|2\nQUARTER|2022-01-01|2022-03-31|10102419|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY3|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY3|6|606|0|0|0|3\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY6|4|404|0|0|0|2\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102419|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY6|2|202|0|0|0|1\n"
  },
  {
    "path": "tests/resources/feature/gab/control/data/vw_negative_offset_orders_all.csv",
    "content": "cadence|order_date|to_date|sales_order_schedule|delivery_country_cod|orders|total_sales|orders_last_cad|orders_last_year|orders_avg_last_3_1|orders_derived\nWEEK|2022-01-02|2022-01-07|10102413|COUNTRY1|24|2424|21|21|21|12\nWEEK|2022-01-02|2022-01-09|10102413|COUNTRY3|3|303|3|3|9|1.5\nWEEK|2022-01-02|2022-01-09|10102417|COUNTRY3|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-09|10102413|COUNTRY2|5|505|5|5|13|2.5\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY5|3|303|3|3|6|1.5\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY8|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-09|10102415|COUNTRY6|3|303|3|3|9|1.5\nWEEK|2022-01-02|2022-01-08|10102416|COUNTRY3|2|202|2|2|4|1\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY11|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY4|9|909|9|9|9|4.5\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY7|3|303|3|3|3|1.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY1|71|7171|0|0|0|35.5\nWEEK|2022-01-02|2022-01-07|10102415|COUNTRY3|2|202|2|2|2|1\nWEEK|2022-01-02|2022-01-09|10102417|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY7|3|303|3|3|9|1.5\nWEEK|2022-01-02|2022-01-07|10102417|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY8|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY10|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-07|10102418|COUNTRY1|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-09|10102416|COUNTRY3|2|202|2|2|6|1\nWEEK|2022-01-02|2022-01-08|10102413|COUNTRY4|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY1|78|7878|78|78|227|39\nWEEK|2022-01-02|2022-01-09|10102418|COUNTRY3|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY11|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-09|10102413|COUNTRY6|4|404|4|4|12|2\nWEEK|2022-01-02|2022-01-06|10102418|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-07|10102418|COUNTRY3|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY6|8|808|8|8|24|4\nWEEK|2022-01-02|2022-01-08|10102413|COUNTRY3|3|303|3|3|6|1.5\nWEEK|2022-01-02|2022-01-06|10102417|COUNTRY2|2|202|0|0|0|1\nWEEK|2022-01-02|2022-01-08|10102415|COUNTRY1|12|1213|12|12|22|6\nWEEK|2022-01-02|2022-01-08|10102415|COUNTRY3|2|202|2|2|4|1\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY8|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-07|10102413|COUNTRY5|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY11|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY2|10|1010|10|10|27|5\nWEEK|2022-01-02|2022-01-09|10102416|COUNTRY2|2|202|2|2|6|1\nWEEK|2022-01-02|2022-01-08|10102416|COUNTRY2|2|202|2|2|4|1\nWEEK|2022-01-02|2022-01-06|10102413|COUNTRY1|21|2121|0|0|0|10.5\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY6|8|808|8|8|16|4\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY9|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-09|10102418|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-06|10102415|COUNTRY6|3|303|0|0|0|1.5\nWEEK|2022-01-02|2022-01-07|10102415|COUNTRY6|3|303|3|3|3|1.5\nWEEK|2022-01-02|2022-01-06|10102417|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-08|10102413|COUNTRY6|4|404|4|4|8|2\nWEEK|2022-01-02|2022-01-06|10102418|COUNTRY1|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-09|10102413|COUNTRY4|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-08|10102418|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-07|10102417|COUNTRY1|2|202|2|2|2|1\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY3|6|606|6|6|17|3\nWEEK|2022-01-02|2022-01-08|10102415|COUNTRY2|4|404|3|3|6|2\nWEEK|2022-01-02|2022-01-07|10102413|COUNTRY4|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-09|10102417|COUNTRY1|2|202|2|2|6|1\nWEEK|2022-01-02|2022-01-08|1010241
8|COUNTRY3|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY4|9|909|0|0|0|4.5\nWEEK|2022-01-02|2022-01-09|10102417|COUNTRY2|2|202|2|2|6|1\nWEEK|2022-01-02|2022-01-09|10102416|COUNTRY1|3|303|3|3|9|1.5\nWEEK|2022-01-02|2022-01-08|10102417|COUNTRY2|2|202|2|2|4|1\nWEEK|2022-01-02|2022-01-08|10102417|COUNTRY3|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-09|10102415|COUNTRY2|4|404|4|4|10|2\nWEEK|2022-01-02|2022-01-06|10102416|COUNTRY1|3|303|0|0|0|1.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY3|5|505|0|0|0|2.5\nWEEK|2022-01-02|2022-01-09|10102413|COUNTRY1|24|2424|24|24|69|12\nWEEK|2022-01-02|2022-01-06|10102419|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-06|10102413|COUNTRY5|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY4|9|909|9|9|18|4.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY2|8|808|0|0|0|4\nWEEK|2022-01-02|2022-01-07|10102418|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY9|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-06|10102413|COUNTRY3|3|303|0|0|0|1.5\nWEEK|2022-01-02|2022-01-07|10102417|COUNTRY3|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-06|10102413|COUNTRY2|4|404|0|0|0|2\nWEEK|2022-01-02|2022-01-08|10102413|COUNTRY2|5|505|4|4|8|2.5\nWEEK|2022-01-02|2022-01-09|10102418|COUNTRY1|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-06|10102413|COUNTRY6|4|404|0|0|0|2\nWEEK|2022-01-02|2022-01-07|10102413|COUNTRY6|4|404|4|4|4|2\nWEEK|2022-01-02|2022-01-06|10102415|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY4|9|909|9|9|27|4.5\nWEEK|2022-01-02|2022-01-07|10102416|COUNTRY2|2|202|2|2|2|1\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY1|78|7878|71|71|71|39\nWEEK|2022-01-02|2022-01-06|10102417|COUNTRY1|2|202|0|0|0|1\nWEEK|2022-01-02|2022-01-07|10102415|COUNTRY2|3|303|3|3|3|1.5\nWEEK|2022-01-02|2022-01-06|10102418|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY6|8|808|8|8|8|4\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY7|3|303|3|3|6|1.5\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY9|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-06|10102416|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-02|2022-01-08|10102418|COUNTRY1|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY3|6|606|5|5|5|3\nWEEK|2022-01-02|2022-01-06|10102416|COUNTRY2|2|202|0|0|0|1\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY5|3|303|0|0|0|1.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY6|8|808|0|0|0|4\nWEEK|2022-01-02|2022-01-09|10102415|COUNTRY3|2|202|2|2|6|1\nWEEK|2022-01-02|2022-01-07|10102416|COUNTRY3|2|202|2|2|2|1\nWEEK|2022-01-02|2022-01-07|10102416|COUNTRY1|3|303|3|3|3|1.5\nWEEK|2022-01-02|2022-01-07|10102415|COUNTRY1|12|1213|10|10|10|6\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY9|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-07|10102419|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-06|10102415|COUNTRY2|3|303|0|0|0|1.5\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY5|3|303|3|3|3|1.5\nWEEK|2022-01-02|2022-01-08|10102413|COUNTRY5|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-08|10102413|COUNTRY1|24|2424|24|24|45|12\nWEEK|2022-01-02|2022-01-08|10102416|COUNTRY6|2|202|2|2|4|1\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY3|6|606|6|6|11|3\nWEEK|2022-01-02|2022-01-09|10102415|COUNTRY1|12|1213|12|12|34|6\nWEEK|2022-01-02|2022-01-08|10102419|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY2|10|1010|9|9|17|5\nWEEK|2022-01-02|2022-01-06|10102415|COUNTRY1|10|1011|0|0|0|5\nWEEK|2022-01-02|2022-01-07|10102417|COUNTRY2|2|202|2|2|2|1\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY11|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY1|78|7878|78|78|149|39\nWEEK
|2022-01-02|2022-01-09|10102419|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-07|10102416|COUNTRY6|2|202|2|2|2|1\nWEEK|2022-01-02|2022-01-08|10102415|COUNTRY6|3|303|3|3|6|1.5\nWEEK|2022-01-02|2022-01-09|10102416|COUNTRY6|2|202|2|2|6|1\nWEEK|2022-01-02|2022-01-06|10102417|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY8|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-06|10102413|COUNTRY4|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY2|9|909|8|8|8|4.5\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY10|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-08|10102417|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-07|10102413|COUNTRY3|3|303|3|3|3|1.5\nWEEK|2022-01-02|2022-01-08|10102417|COUNTRY1|2|202|2|2|4|1\nWEEK|2022-01-02|2022-01-07|10102413|COUNTRY2|4|404|4|4|4|2\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY5|3|303|3|3|9|1.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY7|3|303|0|0|0|1.5\nWEEK|2022-01-02|2022-01-06|10102416|COUNTRY6|2|202|0|0|0|1\nWEEK|2022-01-02|2022-01-09|10102413|COUNTRY5|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY10|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-08|10102416|COUNTRY1|3|303|3|3|6|1.5\n"
  },
  {
    "path": "tests/resources/feature/gab/control/data/vw_negative_offset_orders_filtered.csv",
    "content": "cadence|order_date|to_date|sales_order_schedule|delivery_country_cod|orders|total_sales|orders_last_cad|orders_last_year|orders_avg_last_3_1|orders_derived\nWEEK|2022-01-02|2022-01-09|10102413|COUNTRY3|3|303|3|3|9|1.5\nWEEK|2022-01-02|2022-01-09|10102417|COUNTRY3|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-09|10102415|COUNTRY6|3|303|3|3|9|1.5\nWEEK|2022-01-02|2022-01-08|10102416|COUNTRY3|2|202|2|2|4|1\nWEEK|2022-01-02|2022-01-07|10102415|COUNTRY3|2|202|2|2|2|1\nWEEK|2022-01-02|2022-01-09|10102417|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-07|10102417|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-09|10102416|COUNTRY3|2|202|2|2|6|1\nWEEK|2022-01-02|2022-01-09|10102418|COUNTRY3|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-09|10102413|COUNTRY6|4|404|4|4|12|2\nWEEK|2022-01-02|2022-01-06|10102418|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-07|10102418|COUNTRY3|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY6|8|808|8|8|24|4\nWEEK|2022-01-02|2022-01-08|10102413|COUNTRY3|3|303|3|3|6|1.5\nWEEK|2022-01-02|2022-01-08|10102415|COUNTRY3|2|202|2|2|4|1\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY6|8|808|8|8|16|4\nWEEK|2022-01-02|2022-01-09|10102418|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-06|10102415|COUNTRY6|3|303|0|0|0|1.5\nWEEK|2022-01-02|2022-01-07|10102415|COUNTRY6|3|303|3|3|3|1.5\nWEEK|2022-01-02|2022-01-06|10102417|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-08|10102413|COUNTRY6|4|404|4|4|8|2\nWEEK|2022-01-02|2022-01-08|10102418|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-09|10102412|COUNTRY3|6|606|6|6|17|3\nWEEK|2022-01-02|2022-01-08|10102418|COUNTRY3|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-08|10102417|COUNTRY3|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY3|5|505|0|0|0|2.5\nWEEK|2022-01-02|2022-01-06|10102419|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-07|10102418|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-06|10102413|COUNTRY3|3|303|0|0|0|1.5\nWEEK|2022-01-02|2022-01-07|10102417|COUNTRY3|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-06|10102413|COUNTRY6|4|404|0|0|0|2\nWEEK|2022-01-02|2022-01-07|10102413|COUNTRY6|4|404|4|4|4|2\nWEEK|2022-01-02|2022-01-06|10102415|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-02|2022-01-06|10102418|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY6|8|808|8|8|8|4\nWEEK|2022-01-02|2022-01-06|10102416|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-02|2022-01-07|10102412|COUNTRY3|6|606|5|5|5|3\nWEEK|2022-01-02|2022-01-06|10102412|COUNTRY6|8|808|0|0|0|4\nWEEK|2022-01-02|2022-01-09|10102415|COUNTRY3|2|202|2|2|6|1\nWEEK|2022-01-02|2022-01-07|10102416|COUNTRY3|2|202|2|2|2|1\nWEEK|2022-01-02|2022-01-07|10102419|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-02|2022-01-08|10102416|COUNTRY6|2|202|2|2|4|1\nWEEK|2022-01-02|2022-01-08|10102412|COUNTRY3|6|606|6|6|11|3\nWEEK|2022-01-02|2022-01-08|10102419|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-09|10102419|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-02|2022-01-07|10102416|COUNTRY6|2|202|2|2|2|1\nWEEK|2022-01-02|2022-01-08|10102415|COUNTRY6|3|303|3|3|6|1.5\nWEEK|2022-01-02|2022-01-09|10102416|COUNTRY6|2|202|2|2|6|1\nWEEK|2022-01-02|2022-01-06|10102417|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-02|2022-01-08|10102417|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-02|2022-01-07|10102413|COUNTRY3|3|303|3|3|3|1.5\nWEEK|2022-01-02|2022-01-06|10102416|COUNTRY6|2|202|0|0|0|1\n"
  },
  {
    "path": "tests/resources/feature/gab/control/data/vw_orders_all.csv",
    "content": "cadence|order_date|to_date|sales_order_schedule|delivery_country_cod|orders|total_sales|orders_last_cad|orders_last_year|orders_avg_last_3_1|orders_derived\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY1|71|7171|0|0|0|35.5\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY1|7|707|71|0|71|3.5\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY10|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY11|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY2|8|808|0|0|0|4\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY2|1|101|8|0|8|0.5\nDAY|2022-01-08|2022-01-08|10102412|COUNTRY2|1|101|1|0|9|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY3|5|505|0|0|0|2.5\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY3|1|101|5|0|5|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY4|9|909|0|0|0|4.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY5|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY6|8|808|0|0|0|4\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY7|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY8|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY9|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY1|21|2121|0|0|0|10.5\nDAY|2022-01-07|2022-01-07|10102413|COUNTRY1|3|303|21|0|21|1.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY2|4|404|0|0|0|2\nDAY|2022-01-08|2022-01-08|10102413|COUNTRY2|1|101|4|0|4|0.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY3|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY4|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY5|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY6|4|404|0|0|0|2\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY1|10|1011|0|0|0|5\nDAY|2022-01-07|2022-01-07|10102415|COUNTRY1|2|202|10|0|10|1\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY2|3|303|0|0|0|1.5\nDAY|2022-01-08|2022-01-08|10102415|COUNTRY2|1|101|3|0|3|0.5\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY3|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY6|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY1|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY2|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY3|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY6|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY1|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY2|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY3|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY1|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY3|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102419|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY1|78|7878|0|0|0|39\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY10|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY11|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY2|10|1010|0|0|0|5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY3|6|606|0|0|0|3\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY4|9|909|0|0|0|4.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY5|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY6|8|808|0|0|0|4\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY7|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY8|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY9|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY1|24|2424|0|0|0|12\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY2|5|505|0|0|0|2.5\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nMONTH|2
022-01-01|2022-01-31|10102413|COUNTRY4|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY5|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY6|4|404|0|0|0|2\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY1|12|1213|0|0|0|6\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY2|4|404|0|0|0|2\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY3|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY1|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY2|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY3|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY6|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY1|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY2|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY1|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102419|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY1|78|7878|0|0|0|39\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY10|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY11|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY2|10|1010|0|0|0|5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY3|6|606|0|0|0|3\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY4|9|909|0|0|0|4.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY5|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY6|8|808|0|0|0|4\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY7|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY8|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY9|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY1|24|2424|0|0|0|12\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY2|5|505|0|0|0|2.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY4|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY5|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY6|4|404|0|0|0|2\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY1|12|1213|0|0|0|6\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY2|4|404|0|0|0|2\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY3|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY1|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY2|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY3|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY6|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY1|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY2|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY1|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102419|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY1|78|7878|0|0|0|39\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY10|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY11|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY2|10|1010|0|0|0|5\nW
EEK|2022-01-03|2022-01-09|10102412|COUNTRY3|6|606|0|0|0|3\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY4|9|909|0|0|0|4.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY5|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY6|8|808|0|0|0|4\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY7|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY8|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY9|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY1|24|2424|0|0|0|12\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY2|5|505|0|0|0|2.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY3|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY4|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY5|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY6|4|404|0|0|0|2\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY1|12|1213|0|0|0|6\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY2|4|404|0|0|0|2\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY6|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY1|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY2|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY6|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY1|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY2|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY1|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102419|COUNTRY6|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY1|78|7878|0|0|0|39\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY10|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY11|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY2|10|1010|0|0|0|5\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY3|6|606|0|0|0|3\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY4|9|909|0|0|0|4.5\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY5|3|303|0|0|0|1.5\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY6|8|808|0|0|0|4\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY7|3|303|0|0|0|1.5\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY8|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY9|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102413|COUNTRY1|24|2424|0|0|0|12\nYEAR|2022-01-01|2022-12-31|10102413|COUNTRY2|5|505|0|0|0|2.5\nYEAR|2022-01-01|2022-12-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nYEAR|2022-01-01|2022-12-31|10102413|COUNTRY4|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102413|COUNTRY5|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102413|COUNTRY6|4|404|0|0|0|2\nYEAR|2022-01-01|2022-12-31|10102415|COUNTRY1|12|1213|0|0|0|6\nYEAR|2022-01-01|2022-12-31|10102415|COUNTRY2|4|404|0|0|0|2\nYEAR|2022-01-01|2022-12-31|10102415|COUNTRY3|2|202|0|0|0|1\nYEAR|2022-01-01|2022-12-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nYEAR|2022-01-01|2022-12-31|10102416|COUNTRY1|3|303|0|0|0|1.5\nYEAR|2022-01-01|2022-12-31|10102416|COUNTRY2|2|202|0|0|0|1\nYEAR|2022-01-01|2022-12-31|10102416|COUNTRY3|2|202|0|0|0|1\nYEAR|2022-01-01|2022-12-31|10102416|COUNTRY6|2|202|0|0|0|1\nYEAR|2022-01-01|2022-12-31|10102417|COUNTRY1|2|202|0|0|0|1\nYEAR|2022-01-01|2022-12-31|10102417|COUNTRY2|2|202|0|0|0|1\nYEAR|2022-01-01|2022-12-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102417|COUNTRY6|1|101|0|0|0|
0.5\nYEAR|2022-01-01|2022-12-31|10102418|COUNTRY1|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102419|COUNTRY6|1|101|0|0|0|0.5\n"
  },
  {
    "path": "tests/resources/feature/gab/control/data/vw_orders_all_snapshot.csv",
    "content": "cadence|order_date|to_date|sales_order_schedule|delivery_country_cod|orders|total_sales|orders_last_cad|orders_last_year|orders_avg_last_3_1|orders_derived\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY1|71|7171|0|0|0|35.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY6|4|404|0|0|0|2\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY8|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY2|4|404|0|0|0|2\nDAY|2022-01-08|2022-01-08|10102415|COUNTRY2|1|101|3|0|3|0.5\nDAY|2022-01-06|2022-01-06|10102419|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY3|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY4|9|909|0|0|0|4.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY7|3|303|0|0|0|1.5\nDAY|2022-01-07|2022-01-07|10102415|COUNTRY1|2|202|10|0|10|1\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY3|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY4|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY5|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY2|2|202|0|0|0|1\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY1|7|707|71|0|71|3.5\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY2|1|101|8|0|8|0.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY1|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY5|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY1|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY1|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY3|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY9|1|101|0|0|0|0.5\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY3|1|101|5|0|5|0.5\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY3|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY3|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY6|8|808|0|0|0|4\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY2|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY1|21|2121|0|0|0|10.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY2|8|808|0|0|0|4\nDAY|2022-01-07|2022-01-07|10102413|COUNTRY1|3|303|21|0|21|1.5\nDAY|2022-01-08|2022-01-08|10102413|COUNTRY2|1|101|4|0|4|0.5\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY1|10|1011|0|0|0|5\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY6|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY11|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY6|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY2|3|303|0|0|0|1.5\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY10|1|101|0|0|0|0.5\nDAY|2022-01-08|2022-01-08|10102412|COUNTRY2|1|101|1|0|9|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY3|5|505|0|0|0|2.5\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY2|2|202|2|2|6|1\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY9|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-06|10102418|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY7|3|303|3|3|3|1.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY3|3|303|3|3|9|1.5\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY2|4|404|4|4|10|2\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY8|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-08|10102413|COUNTRY4|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY6|2|202|2|2|6|1\nWEEK|2022-01-03|2022-01-07|10102415|COUNTRY1|12|1213|10|10|10|6\nWEEK|2022-01-03|2022-01-06|10102417|COUNTRY1|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY8|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-08|10102415|COUNTRY6|3|303|3|3|6|1.5\nWEEK|2022-01-03|2022-01-0
9|10102416|COUNTRY2|2|202|2|2|6|1\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY6|8|808|8|8|8|4\nWEEK|2022-01-03|2022-01-07|10102415|COUNTRY2|3|303|3|3|3|1.5\nWEEK|2022-01-03|2022-01-07|10102417|COUNTRY3|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-08|10102418|COUNTRY3|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY7|3|303|3|3|9|1.5\nWEEK|2022-01-03|2022-01-06|10102419|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-07|10102418|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-08|10102413|COUNTRY2|5|505|4|4|8|2.5\nWEEK|2022-01-03|2022-01-07|10102417|COUNTRY2|2|202|2|2|2|1\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY11|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-06|10102415|COUNTRY2|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-06|10102413|COUNTRY1|21|2121|0|0|0|10.5\nWEEK|2022-01-03|2022-01-07|10102413|COUNTRY4|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-07|10102418|COUNTRY3|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY2|10|1010|10|10|27|5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY6|4|404|4|4|12|2\nWEEK|2022-01-03|2022-01-06|10102413|COUNTRY4|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-08|10102418|COUNTRY1|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY3|6|606|6|6|17|3\nWEEK|2022-01-03|2022-01-06|10102415|COUNTRY6|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-06|10102416|COUNTRY1|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-08|10102417|COUNTRY2|2|202|2|2|4|1\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY8|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-06|10102418|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY1|71|7171|0|0|0|35.5\nWEEK|2022-01-03|2022-01-06|10102418|COUNTRY1|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-06|10102417|COUNTRY2|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY9|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-06|10102416|COUNTRY2|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-06|10102415|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-06|10102413|COUNTRY6|4|404|0|0|0|2\nWEEK|2022-01-03|2022-01-06|10102415|COUNTRY1|10|1011|0|0|0|5\nWEEK|2022-01-03|2022-01-08|10102415|COUNTRY3|2|202|2|2|4|1\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY4|9|909|0|0|0|4.5\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY10|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-07|10102416|COUNTRY1|3|303|3|3|3|1.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY4|9|909|9|9|27|4.5\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY7|3|303|3|3|6|1.5\nWEEK|2022-01-03|2022-01-08|10102416|COUNTRY3|2|202|2|2|4|1\nWEEK|2022-01-03|2022-01-07|10102418|COUNTRY1|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY3|5|505|0|0|0|2.5\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY6|3|303|3|3|9|1.5\nWEEK|2022-01-03|2022-01-08|10102416|COUNTRY1|3|303|3|3|6|1.5\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY5|3|303|3|3|6|1.5\nWEEK|2022-01-03|2022-01-07|10102413|COUNTRY6|4|404|4|4|4|2\nWEEK|2022-01-03|2022-01-06|10102417|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-08|10102419|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY3|6|606|5|5|5|3\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY2|9|909|8|8|8|4.5\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY2|10|1010|9|9|17|5\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY3|2|202|2|2|6|1\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY3|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY5|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-07|10102415|COUNTRY6|3|303|3|3|3|1.5\nWEEK|2022-01-03|2022-01-06|10102413|COUNTRY5|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY9|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-08|10102413|COUNTRY6|4|404|4|4|8|2\nW
EEK|2022-01-03|2022-01-09|10102417|COUNTRY3|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-06|10102417|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-08|10102415|COUNTRY2|4|404|3|3|6|2\nWEEK|2022-01-03|2022-01-08|10102417|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-08|10102415|COUNTRY1|12|1213|12|12|22|6\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY3|2|202|2|2|6|1\nWEEK|2022-01-03|2022-01-07|10102413|COUNTRY5|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-06|10102416|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY5|3|303|3|3|3|1.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY2|5|505|5|5|13|2.5\nWEEK|2022-01-03|2022-01-07|10102413|COUNTRY2|4|404|4|4|4|2\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY1|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY10|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY8|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-07|10102413|COUNTRY3|3|303|3|3|3|1.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY6|8|808|8|8|24|4\nWEEK|2022-01-03|2022-01-08|10102416|COUNTRY6|2|202|2|2|4|1\nWEEK|2022-01-03|2022-01-06|10102413|COUNTRY2|4|404|0|0|0|2\nWEEK|2022-01-03|2022-01-07|10102419|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY7|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-07|10102417|COUNTRY1|2|202|2|2|2|1\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY1|78|7878|78|78|227|39\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY5|3|303|3|3|9|1.5\nWEEK|2022-01-03|2022-01-09|10102419|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-06|10102413|COUNTRY3|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY1|2|202|2|2|6|1\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY4|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-06|10102416|COUNTRY6|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-07|10102415|COUNTRY3|2|202|2|2|2|1\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY4|9|909|9|9|9|4.5\nWEEK|2022-01-03|2022-01-08|10102413|COUNTRY1|24|2424|24|24|45|12\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY1|78|7878|78|78|149|39\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY9|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-08|10102413|COUNTRY5|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-07|10102417|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY3|6|606|6|6|11|3\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY11|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY11|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY10|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY1|24|2424|24|24|69|12\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY6|8|808|0|0|0|4\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY5|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-08|10102417|COUNTRY1|2|202|2|2|4|1\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY1|78|7878|71|71|71|39\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY6|8|808|8|8|16|4\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY1|12|1213|12|12|34|6\nWEEK|2022-01-03|2022-01-07|10102416|COUNTRY6|2|202|2|2|2|1\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY11|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-08|10102417|COUNTRY3|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-08|10102418|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-07|10102413|COUNTRY1|24|2424|21|21|21|12\nWEEK|2022-01-03|2022-01-08|10102413|COUNTRY3|3|303|3|3|6|1.5\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY2|8|808|0|0|0|4\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY4|9|909|9|9|18|4.5\nWEEK|2022-01-03|2022-01-07|10102416|COUNTRY3|2|202|2|2|2|1\nWEEK|2022-01-03|2022-01-08|10102416|COUNTRY2|2|202|2|2|4|1\nWEEK|2022-01-03|20
22-01-09|10102416|COUNTRY1|3|303|3|3|9|1.5\nWEEK|2022-01-03|2022-01-07|10102416|COUNTRY2|2|202|2|2|2|1\n"
  },
  {
    "path": "tests/resources/feature/gab/control/data/vw_orders_filtered.csv",
    "content": "cadence|order_date|to_date|sales_order_schedule|delivery_country_cod|orders|total_sales|orders_last_cad|orders_last_year|orders_avg_last_3_1|orders_derived\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY3|5|505|0|0|0|2.5\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY3|1|101|5|0|5|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY6|8|808|0|0|0|4\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY3|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY6|4|404|0|0|0|2\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY3|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY6|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY3|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY6|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY3|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY3|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102419|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY3|6|606|0|0|0|3\nMONTH|2022-01-01|2022-01-31|10102412|COUNTRY6|8|808|0|0|0|4\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102413|COUNTRY6|4|404|0|0|0|2\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY3|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY3|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102416|COUNTRY6|2|202|0|0|0|1\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102417|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nMONTH|2022-01-01|2022-01-31|10102419|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY3|6|606|0|0|0|3\nQUARTER|2022-01-01|2022-03-31|10102412|COUNTRY6|8|808|0|0|0|4\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102413|COUNTRY6|4|404|0|0|0|2\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY3|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY3|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102416|COUNTRY6|2|202|0|0|0|1\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102417|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nQUARTER|2022-01-01|2022-03-31|10102419|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY3|6|606|0|0|0|3\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY6|8|808|0|0|0|4\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY3|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY6|4|404|0|0|0|2\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY6|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY6|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102419|COUNTRY6|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY3|6|606|0|0|0|3\nYEAR|2022-01-01|2022-12-31|10102412|COUNTRY6
|8|808|0|0|0|4\nYEAR|2022-01-01|2022-12-31|10102413|COUNTRY3|3|303|0|0|0|1.5\nYEAR|2022-01-01|2022-12-31|10102413|COUNTRY6|4|404|0|0|0|2\nYEAR|2022-01-01|2022-12-31|10102415|COUNTRY3|2|202|0|0|0|1\nYEAR|2022-01-01|2022-12-31|10102415|COUNTRY6|3|303|0|0|0|1.5\nYEAR|2022-01-01|2022-12-31|10102416|COUNTRY3|2|202|0|0|0|1\nYEAR|2022-01-01|2022-12-31|10102416|COUNTRY6|2|202|0|0|0|1\nYEAR|2022-01-01|2022-12-31|10102417|COUNTRY3|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102417|COUNTRY6|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102418|COUNTRY3|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102418|COUNTRY6|1|101|0|0|0|0.5\nYEAR|2022-01-01|2022-12-31|10102419|COUNTRY6|1|101|0|0|0|0.5\n"
  },
  {
    "path": "tests/resources/feature/gab/control/data/vw_orders_filtered_snapshot.csv",
    "content": "cadence|order_date|to_date|sales_order_schedule|delivery_country_cod|orders|total_sales|orders_last_cad|orders_last_year|orders_avg_last_3_1|orders_derived\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY6|4|404|0|0|0|2\nDAY|2022-01-06|2022-01-06|10102419|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY3|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102413|COUNTRY3|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102418|COUNTRY3|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY6|1|101|0|0|0|0.5\nDAY|2022-01-07|2022-01-07|10102412|COUNTRY3|1|101|5|0|5|0.5\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY3|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102417|COUNTRY3|1|101|0|0|0|0.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY6|8|808|0|0|0|4\nDAY|2022-01-06|2022-01-06|10102416|COUNTRY6|2|202|0|0|0|1\nDAY|2022-01-06|2022-01-06|10102415|COUNTRY6|3|303|0|0|0|1.5\nDAY|2022-01-06|2022-01-06|10102412|COUNTRY3|5|505|0|0|0|2.5\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-06|10102418|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY3|3|303|3|3|9|1.5\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY6|2|202|2|2|6|1\nWEEK|2022-01-03|2022-01-08|10102415|COUNTRY6|3|303|3|3|6|1.5\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY6|8|808|8|8|8|4\nWEEK|2022-01-03|2022-01-07|10102417|COUNTRY3|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-08|10102418|COUNTRY3|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-06|10102419|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-07|10102418|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-07|10102418|COUNTRY3|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-09|10102413|COUNTRY6|4|404|4|4|12|2\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY3|6|606|6|6|17|3\nWEEK|2022-01-03|2022-01-06|10102415|COUNTRY6|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-06|10102418|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-06|10102415|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-06|10102413|COUNTRY6|4|404|0|0|0|2\nWEEK|2022-01-03|2022-01-08|10102415|COUNTRY3|2|202|2|2|4|1\nWEEK|2022-01-03|2022-01-08|10102416|COUNTRY3|2|202|2|2|4|1\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY3|5|505|0|0|0|2.5\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY6|3|303|3|3|9|1.5\nWEEK|2022-01-03|2022-01-07|10102413|COUNTRY6|4|404|4|4|4|2\nWEEK|2022-01-03|2022-01-06|10102417|COUNTRY6|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-08|10102419|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-07|10102412|COUNTRY3|6|606|5|5|5|3\nWEEK|2022-01-03|2022-01-09|10102415|COUNTRY3|2|202|2|2|6|1\nWEEK|2022-01-03|2022-01-09|10102418|COUNTRY3|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-07|10102415|COUNTRY6|3|303|3|3|3|1.5\nWEEK|2022-01-03|2022-01-08|10102413|COUNTRY6|4|404|4|4|8|2\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY3|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-06|10102417|COUNTRY3|1|101|0|0|0|0.5\nWEEK|2022-01-03|2022-01-08|10102417|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-09|10102416|COUNTRY3|2|202|2|2|6|1\nWEEK|2022-01-03|2022-01-06|10102416|COUNTRY3|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-07|10102413|COUNTRY3|3|303|3|3|3|1.5\nWEEK|2022-01-03|2022-01-09|10102412|COUNTRY6|8|808|8|8|24|4\nWEEK|2022-01-03|2022-01-08|10102416|COUNTRY6|2|202|2|2|4|1\nWEEK|2022-01-03|2022-01-07|10102419|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-09|10102419|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-06|10102413|COUNTRY3|3|303|0|0|0|1.5\nWEEK|2022-01-03|2022-01-06|10102416|COUNTRY6|2|202|0|0|0|1\nWEEK|2022-01-03|2022-01-07|
10102415|COUNTRY3|2|202|2|2|2|1\nWEEK|2022-01-03|2022-01-07|10102417|COUNTRY6|1|101|1|1|1|0.5\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY3|6|606|6|6|11|3\nWEEK|2022-01-03|2022-01-09|10102417|COUNTRY6|1|101|1|1|3|0.5\nWEEK|2022-01-03|2022-01-06|10102412|COUNTRY6|8|808|0|0|0|4\nWEEK|2022-01-03|2022-01-08|10102412|COUNTRY6|8|808|8|8|16|4\nWEEK|2022-01-03|2022-01-07|10102416|COUNTRY6|2|202|2|2|2|1\nWEEK|2022-01-03|2022-01-08|10102417|COUNTRY3|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-08|10102418|COUNTRY6|1|101|1|1|2|0.5\nWEEK|2022-01-03|2022-01-08|10102413|COUNTRY3|3|303|3|3|6|1.5\nWEEK|2022-01-03|2022-01-07|10102416|COUNTRY3|2|202|2|2|2|1\n"
  },
  {
    "path": "tests/resources/feature/gab/control/schema/vw_dummy_sales_kpi.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\":\"cadence\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"order_date\",\n      \"type\":\"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"to_date\",\n      \"type\":\"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"category_name\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"qty_articles\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"total_amount\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"total_amount_last_year\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"avg_total_amount_last_2_years\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"discounted_total_amount\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/gab/control/schema/vw_orders.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"cadence\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"to_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"sales_order_schedule\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"delivery_country_cod\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"orders\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"total_sales\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"orders_last_cad\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"orders_last_year\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"orders_avg_last_3_1\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"orders_derived\",\n      \"type\":\"double\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/gab/setup/column_list/calendar.json",
    "content": "{\n  \"calendar_date\": \"date\",\n  \"day_en\": \"string\",\n  \"weeknum_mon\": \"int\",\n  \"weekstart_mon\": \"date\",\n  \"weekend_mon\": \"date\",\n  \"weekstart_sun\": \"date\",\n  \"weekend_sun\": \"date\",\n  \"month_start\": \"date\",\n  \"month_end\": \"date\",\n  \"quarter_start\": \"date\",\n  \"quarter_end\": \"date\",\n  \"year_start\": \"date\",\n  \"year_end\": \"date\"\n}"
  },
  {
    "path": "tests/resources/feature/gab/setup/column_list/dummy_sales_kpi.json",
    "content": "{\n  \"order_date\": \"date\",\n  \"article_id\": \"string\",\n  \"amount\": \"int\"\n}"
  },
  {
    "path": "tests/resources/feature/gab/setup/column_list/gab_log_events.json",
    "content": "{\n  \"run_start_time\": \"timestamp\",\n  \"run_end_time\": \"timestamp\",\n  \"input_start_date\": \"timestamp\",\n  \"input_end_date\": \"timestamp\",\n  \"query_id\": \"string\",\n  \"query_label\": \"string\",\n  \"cadence\": \"string\",\n  \"stage_name\": \"string\",\n  \"stage_query\": \"string\",\n  \"status\": \"string\",\n  \"error_code\": \"string\"\n}"
  },
  {
    "path": "tests/resources/feature/gab/setup/column_list/gab_use_case_results.json",
    "content": "{\n  \"query_id\": \"string\",\n  \"cadence\": \"string\",\n  \"from_date\": \"date\",\n  \"to_date\": \"date\",\n  \"d1\": \"string\",\n  \"d2\": \"string\",\n  \"d3\": \"string\",\n  \"d4\": \"string\",\n  \"d5\": \"string\",\n  \"d6\": \"string\",\n  \"d7\": \"string\",\n  \"d8\": \"string\",\n  \"d9\": \"string\",\n  \"d10\": \"string\",\n  \"d11\": \"string\",\n  \"d12\": \"string\",\n  \"d13\": \"string\",\n  \"d14\": \"string\",\n  \"d15\": \"string\",\n  \"d16\": \"string\",\n  \"d17\": \"string\",\n  \"d18\": \"string\",\n  \"d19\": \"string\",\n  \"d20\": \"string\",\n  \"d21\": \"string\",\n  \"d22\": \"string\",\n  \"d23\": \"string\",\n  \"d24\": \"string\",\n  \"d25\": \"string\",\n  \"d26\": \"string\",\n  \"d27\": \"string\",\n  \"d28\": \"string\",\n  \"d29\": \"string\",\n  \"d30\": \"string\",\n  \"d31\": \"string\",\n  \"d32\": \"string\",\n  \"d33\": \"string\",\n  \"d34\": \"string\",\n  \"d35\": \"string\",\n  \"d36\": \"string\",\n  \"d37\": \"string\",\n  \"d38\": \"string\",\n  \"d39\": \"string\",\n  \"d40\": \"string\",\n  \"m1\": \"double\",\n  \"m2\": \"double\",\n  \"m3\": \"double\",\n  \"m4\": \"double\",\n  \"m5\": \"double\",\n  \"m6\": \"double\",\n  \"m7\": \"double\",\n  \"m8\": \"double\",\n  \"m9\": \"double\",\n  \"m10\": \"double\",\n  \"m11\": \"double\",\n  \"m12\": \"double\",\n  \"m13\": \"double\",\n  \"m14\": \"double\",\n  \"m15\": \"double\",\n  \"m16\": \"double\",\n  \"m17\": \"double\",\n  \"m18\": \"double\",\n  \"m19\": \"double\",\n  \"m20\": \"double\",\n  \"m21\": \"double\",\n  \"m22\": \"double\",\n  \"m23\": \"double\",\n  \"m24\": \"double\",\n  \"m25\": \"double\",\n  \"m26\": \"double\",\n  \"m27\": \"double\",\n  \"m28\": \"double\",\n  \"m29\": \"double\",\n  \"m30\": \"double\",\n  \"m31\": \"double\",\n  \"m32\": \"double\",\n  \"m33\": \"double\",\n  \"m34\": \"double\",\n  \"m35\": \"double\",\n  \"m36\": \"double\",\n  \"m37\": \"double\",\n  \"m38\": \"double\",\n  \"m39\": \"double\",\n  \"m40\": \"double\",\n  \"lh_created_on\": \"timestamp\"\n}"
  },
  {
    "path": "tests/resources/feature/gab/setup/column_list/lkp_query_builder.json",
    "content": "{\n  \"query_id\": \"int\",\n  \"query_label\": \"string\",\n  \"query_type\": \"string\",\n  \"mappings\": \"string\",\n  \"intermediate_stages\": \"string\",\n  \"recon_window\": \"string\",\n  \"timezone_offset\": \"int\",\n  \"start_of_the_week\": \"string\",\n  \"is_active\": \"string\",\n  \"queue\": \"string\",\n  \"lh_created_on\": \"timestamp\"\n}"
  },
  {
    "path": "tests/resources/feature/gab/setup/column_list/order_events.json",
    "content": "{\n  \"request_timestamp\": \"string\",\n  \"data_pack_id\": \"string\",\n  \"record_number\": \"int\",\n  \"update_mode\": \"string\",\n  \"sales_order_header\": \"string\",\n  \"sales_order_schedule\": \"string\",\n  \"sales_order_item\": \"string\",\n  \"orgsales_orgp\": \"string\",\n  \"order_header_key\": \"string\",\n  \"order_line_key\": \"string\",\n  \"derived_order_header\": \"string\",\n  \"derived_order_line_k\": \"string\",\n  \"return_reason\": \"string\",\n  \"reqmnt_category\": \"string\",\n  \"delivery_status10\": \"string\",\n  \"req_del_dt_item\": \"date\",\n  \"reason_for_rejsize\": \"string\",\n  \"invoice_item_price\": \"string\",\n  \"id_of_the_customer\": \"string\",\n  \"logistics_profit_ctr\": \"string\",\n  \"material_availabilit\": \"date\",\n  \"mso_store\": \"string\",\n  \"name_of_orderer\": \"string\",\n  \"overall_delivery_sta\": \"string\",\n  \"overall_processing_s20\": \"string\",\n  \"overall_processing_s21\": \"string\",\n  \"coupon_code\": \"string\",\n  \"org_grape_bapcx\": \"string\",\n  \"cust_service_rep\": \"string\",\n  \"customer_purchase_or25\": \"date\",\n  \"delivery_country_cod\": \"string\",\n  \"delivery_city_code\": \"string\",\n  \"delivery_post_code\": \"string\",\n  \"delivery_state_code\": \"string\",\n  \"delivery_status30\": \"string\",\n  \"ops_del_block_sohdr\": \"string\",\n  \"ops_del_block_soscl\": \"string\",\n  \"ecom_crm_id\": \"string\",\n  \"conf_del_date_size\": \"date\",\n  \"created_on\": \"date\",\n  \"time\": \"string\",\n  \"sales_doc_item_cat\": \"string\",\n  \"shipping_campaign_id\": \"string\",\n  \"shipping_coupon_code\": \"string\",\n  \"shipping_city\": \"string\",\n  \"shipping_postal_code\": \"string\",\n  \"shp_promotion_code\": \"string\",\n  \"size_grid\": \"string\",\n  \"main_chan_frm_src\": \"string\",\n  \"prctr_billing\": \"string\",\n  \"prere_indfrm_src\": \"string\",\n  \"reg__clr_from_src\": \"string\",\n  \"update_flag\": \"string\",\n  \"usage\": \"string\",\n  \"so_header_usgindp\": \"string\",\n  \"vas_customer_defined\": \"string\",\n  \"adidas_group_article\": \"string\",\n  \"billto_cust\": \"string\",\n  \"requirement_type\": \"string\",\n  \"shipto_cust__r2\": \"string\",\n  \"soldto_cust_r2\": \"string\",\n  \"sales_doc_category\": \"string\",\n  \"product_division\": \"string\",\n  \"promotion_code\": \"string\",\n  \"sd_categ_precdoc\": \"string\",\n  \"so_hdrpreceding_doc\": \"string\",\n  \"so_itmpreceding_doc\": \"string\",\n  \"so_scl_prec_doc\": \"string\",\n  \"article__region__s\": \"string\",\n  \"reference_1\": \"string\",\n  \"mkt_place_order_num\": \"string\",\n  \"sales_representative\": \"string\",\n  \"subtotal_1_source\": \"decimal\",\n  \"subtotal_2_source\": \"decimal\",\n  \"subtotal_3_source\": \"decimal\",\n  \"subtotal_4_source\": \"decimal\",\n  \"subtotal_5_source\": \"decimal\",\n  \"subtotal_6_source\": \"decimal\",\n  \"grid_value\": \"string\",\n  \"orgcompcodep\": \"string\",\n  \"created_by\": \"string\",\n  \"miscdistchcopap\": \"string\",\n  \"document_currency\": \"string\",\n  \"reason_for_order\": \"string\",\n  \"opsplantp\": \"string\",\n  \"sales_group\": \"string\",\n  \"sales_office\": \"string\",\n  \"sales_unit\": \"string\",\n  \"storage_location\": \"string\",\n  \"so_net_price_2\": \"decimal\",\n  \"sales_order_net_valu\": \"decimal\",\n  \"so_conf_qty\": \"decimal\",\n  \"so_cum_order_qty\": \"decimal\",\n  \"so_net_price\": \"decimal\",\n  \"so_net_value\": \"decimal\",\n  \"so_org_qty\": \"decimal\",\n  
\"so_conf_qty_actual\": \"decimal\",\n  \"sales_order_qty\": \"decimal\",\n  \"sales_odr_qty_actual\": \"decimal\",\n  \"article_campaign_id\": \"string\",\n  \"sales_document_type\": \"string\",\n  \"order_date_header\": \"date\",\n  \"billing_city\": \"string\",\n  \"billing_postal_code\": \"string\",\n  \"customer_po_time\": \"string\",\n  \"customer_purchase_or101\": \"string\",\n  \"overall_rej_status\": \"string\",\n  \"changed_on\": \"date\",\n  \"epoch_status\": \"string\",\n  \"sales_order_canqty\": \"decimal\",\n  \"epoch_entry_type\": \"string\",\n  \"epoch_entry_by\": \"string\",\n  \"epoch_order_type\": \"string\",\n  \"epoch_line_type\": \"string\",\n  \"omnihub_marketplace\": \"string\",\n  \"confirmed_delivery_t\": \"string\",\n  \"shipping_city_addres112\": \"string\",\n  \"shipping_city_addres113\": \"string\",\n  \"shipping_city_addres114\": \"string\",\n  \"billing_city_address115\": \"string\",\n  \"billing_city_address116\": \"string\",\n  \"billing_city_address117\": \"string\",\n  \"omnihub_seller_org\": \"string\",\n  \"omnihub_locale_code\": \"string\",\n  \"customer_po_type\": \"string\",\n  \"omnihub_carrier_serv\": \"string\",\n  \"qualifier\": \"string\",\n  \"omnihub_document_typ\": \"string\",\n  \"omnihub_return_code\": \"string\",\n  \"refund_process_date\": \"date\",\n  \"refund_process_time\": \"string\",\n  \"omni_cancel_reason\": \"string\",\n  \"sales_order_ecom_fre\": \"decimal\",\n  \"omnihub_custom_order\": \"string\",\n  \"vas_packing_type_so\": \"string\",\n  \"vas_spl_ser_type_so\": \"string\",\n  \"vas_tktlbl_type_so\": \"string\",\n  \"exchange_flag\": \"string\",\n  \"exchange_type\": \"string\",\n  \"customer_po_timedw\": \"string\",\n  \"cnc_store_id\": \"string\",\n  \"last_hold__type\": \"string\",\n  \"last_hold_released_t\": \"string\",\n  \"last_hold_release_dt\": \"date\",\n  \"dynamic_pricing_iden\": \"string\",\n  \"dynamic_pricing_valu\": \"string\",\n  \"dymamic_pricing_amnt\": \"decimal\",\n  \"exchange_reason\": \"string\",\n  \"omnihub_site_id\": \"string\",\n  \"international_shipme\": \"string\",\n  \"exchange_variant\": \"string\",\n  \"secondary_article_ca\": \"string\",\n  \"secondary_article_pr\": \"string\",\n  \"secondary_coupon_cod\": \"string\",\n  \"double_discount_flag\": \"string\",\n  \"extraction_date\": \"string\",\n  \"lhe_batch_id\": \"int\",\n  \"lhe_row_id\": \"bigint\",\n  \"source_update_date\": \"date\",\n  \"source_update_time\": \"string\"\n}"
  },
  {
    "path": "tests/resources/feature/gab/setup/data/dummy_sales_kpi.csv",
    "content": "order_date|article_id|amount\n2017-02-15|article1|600\n2017-02-15|article6|1000\n2017-02-15|article2|2400\n2017-02-15|article4|2000\n2017-02-15|article5|4000\n2017-04-30|article7|1400\n2017-04-30|article2|1000\n2017-04-30|article3|1600\n2017-04-30|article1|600\n2016-06-01|article2|4000\n2016-06-01|article1|2000\n2016-06-01|article3|1000\n2017-05-10|article5|1600\n2017-05-10|article6|3000\n2017-05-10|article3|2000\n2017-06-01|article4|2000\n2017-06-01|article1|1000\n2017-06-01|article2|1800\n2018-07-11|article3|6\n2018-07-11|article1|2\n2018-07-01|article2|18\n2018-07-01|article1|10\n"
  },
  {
    "path": "tests/resources/feature/gab/setup/data/lkp_query_builder.csv",
    "content": "query_id|query_label|query_type|mappings|intermediate_stages|recon_window|timezone_offset|start_of_the_week|is_active|queue|lh_created_on\n742783030|order_events|GLOBAL|{ 'vw_orders_all': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': {} }, 'vw_orders_filtered': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': 'd2 in (\"COUNTRY6\", \"COUNTRY3\")' } }|{'1': {'file_path': 'order_events/1_order_events.sql','table_alias': 'order_events_query','storage_level': 'MEMORY_ONLY','project_date_column': 'order_date_header','filter_date_column': 'order_date_header','repartition': {}}}|{'DAY': {}, 'WEEK': {'recon_window': {'DAY': {'snapshot': 'N'}}}, 'MONTH': {'recon_window': {'DAY': {'snapshot': 'N'}}}, 'QUARTER': {'recon_window': {'DAY': {'snapshot': 'N'}}}, 'YEAR': {'recon_window': {'DAY': {'snapshot': 'N'}}}}|0|SUNDAY|Y|Medium|2024-02-08T11:33:49.76Z\n74776315|dummy_sales_kpi|GLOBAL|{ 'vw_dummy_sales_kpi': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'category_name' }, 'metric': { 'm1': { 'metric_name': 'qty_articles', 'calculated_metric': {}, 'derived_metric': {} }, 'm2': { 'metric_name': 'total_amount', 'calculated_metric': { 'last_cadence': [ { 'label': 'total_amount_last_year', 'window': '1' } ], 'window_function': [ { 'label': 'avg_total_amount_last_2_years', 'window': [ 2, 1 ], 'agg_func': 'avg' } ] }, 'derived_metric': [ { 'label': 'discounted_total_amount', 'formula': 'total_amount*0.56' } ] } }, 'filter': {} } }|{ '1': { 'file_path': 'dummy_sales_kpi/1_article_category.sql', 'table_alias': 'article_categories', 'storage_level': 'MEMORY_ONLY', 'project_date_column': '', 'filter_date_column': '', 'repartition': {} }, '2': { 'file_path': 'dummy_sales_kpi/2_dummy_sales_kpi.sql', 'table_alias': 'dummy_sales_kpi', 'storage_level': 'MEMORY_ONLY', 'project_date_column': 'order_date', 'filter_date_column': 'order_date', 'repartition': {} } }|{'YEAR': {}}|0|MONDAY|Y|Low|2024-03-07T15:38:52.922Z\n742783031|order_events_snapshot|GLOBAL|{ 'vw_orders_all_snapshot': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 
'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': {} }, 'vw_orders_filtered_snapshot': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': 'd2 in (\"COUNTRY6\", \"COUNTRY3\")' } }|{ '1': { 'file_path': 'order_events/1_order_events.sql', 'table_alias': 'order_events_query', 'storage_level': 'MEMORY_ONLY', 'project_date_column': 'order_date_header', 'filter_date_column': 'order_date_header', 'repartition': {} } }|{'DAY': {}, 'WEEK': {'recon_window': {'DAY': {'snapshot': 'Y'}}}, 'MONTH': {'recon_window': {'DAY': {'snapshot': 'N'}}}, 'QUARTER': {'recon_window': {'DAY': {'snapshot': 'N'}}}, 'YEAR': {'recon_window': {'DAY': {'snapshot': 'N'}}}}|0|SUNDAY|Y|Medium|2024-03-25T10:17:51.907Z\n742783032|order_events_nam|NAM|{ 'vw_nam_orders_all_snapshot': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': {} }, 'vw_nam_orders_filtered_snapshot': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': 'd2 in (\"COUNTRY6\", \"COUNTRY3\")' } }|{ '1': { 'file_path': 'order_events/1_order_events.sql', 'table_alias': 'order_events_query', 'storage_level': 'MEMORY_ONLY', 'project_date_column': 'order_date_header', 'filter_date_column': 'order_date_header', 'repartition': {} } }|{'DAY': {}, 'WEEK': {'recon_window': {'DAY': {'snapshot': 'Y'}}}, 'MONTH': {'recon_window': {'DAY': {'snapshot': 'N'}}}, 'QUARTER': {'recon_window': {'DAY': {'snapshot': 'N'}}}, 'YEAR': {'recon_window': {'DAY': {'snapshot': 'N'}}}}|0|MONDAY|Y|Medium|2024-03-25T10:19:12.597Z\n742783034|order_events_negative_timezone_offset|GLOBAL|{ 'vw_negative_offset_orders_all': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 
'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': {} }, 'vw_negative_offset_orders_filtered': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': 'd2 in (\"COUNTRY6\", \"COUNTRY3\")' } }|{ '1': { 'file_path': 'order_events/1_order_events.sql', 'table_alias': 'order_events_query', 'storage_level': 'MEMORY_ONLY', 'project_date_column': 'order_date_header', 'filter_date_column': 'order_date_header', 'repartition': {'numPartitions':3, 'keys':['order_date']} } }|{'WEEK': {'recon_window': {'DAY': {'snapshot': 'Y'}}}}|-3|MONDAY|Y|Medium|2024-03-25T10:20:27.992Z\n742783035|order_events_empty_reconciliation_window|GLOBAL|{ 'vw_negative_offset_orders_all': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': {} }, 'vw_negative_offset_orders_filtered': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': 'd2 in (\"COUNTRY6\", \"COUNTRY3\")' } }|{ '1': { 'file_path': 'order_events/1_order_events.sql', 'table_alias': 'order_events_query', 'storage_level': 'MEMORY_ONLY', 'project_date_column': 'order_date_header', 'filter_date_column': 'order_date_header', 'repartition': {'numPartitions':3, 'keys':['order_date']} } }|{}|-3|MONDAY|Y|Medium|2024-03-25T10:20:27.992Z\n742783036|order_events_unexisting_cadence|GLOBAL|{ 'vw_negative_offset_orders_all': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 
'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': {} }, 'vw_negative_offset_orders_filtered': { 'dimensions': { 'from_date': 'order_date', 'to_date': 'to_date', 'd1': 'sales_order_schedule', 'd2': 'delivery_country_cod' }, 'metric': { 'm1': { 'metric_name': 'orders', 'calculated_metric': { 'last_cadence': [ { 'label': 'orders_last_cad', 'window': '1' } ], 'last_year_cadence': [ { 'label': 'orders_last_year', 'window': 1 } ], 'window_function': [ { 'label': 'orders_avg_last_3_1', 'window': [ 3, 1 ], 'agg_func': 'sum' } ] }, 'derived_metric': [ { 'label': 'orders_derived', 'formula': 'orders*0.5' } ] }, 'm2': { 'metric_name': 'total_sales', 'calculated_metric': {}, 'derived_metric': {} } }, 'filter': 'd2 in (\"COUNTRY6\", \"COUNTRY3\")' } }|{ '1': { 'file_path': 'order_events/1_order_events.sql', 'table_alias': 'order_events_query', 'storage_level': 'MEMORY_ONLY', 'project_date_column': 'order_date_header', 'filter_date_column': 'order_date_header', 'repartition': {'numPartitions':3, 'keys':['order_date']} } }|{'UNEXINSTING_CADENCE': {'recon_window': {'DAY': {'snapshot': 'Y'}}}}|-3|MONDAY|Y|Medium|2024-03-25T10:20:27.992Z\n"
  },
  {
    "path": "tests/resources/feature/gab/setup/data/order_events.csv",
    "content": "request_timestamp|data_pack_id|record_number|update_mode|sales_order_header|sales_order_schedule|sales_order_item|orgsales_orgp|order_header_key|order_line_key|derived_order_header|derived_order_line_k|return_reason|reqmnt_category|delivery_status10|req_del_dt_item|reason_for_rejsize|invoice_item_price|id_of_the_customer|logistics_profit_ctr|material_availabilit|mso_store|name_of_orderer|overall_delivery_sta|overall_processing_s20|overall_processing_s21|coupon_code|org_grape_bapcx|cust_service_rep|customer_purchase_or25|delivery_country_cod|delivery_city_code|delivery_post_code|delivery_state_code|delivery_status30|ops_del_block_sohdr|ops_del_block_soscl|ecom_crm_id|conf_del_date_size|created_on|time|sales_doc_item_cat|shipping_campaign_id|shipping_coupon_code|shipping_city|shipping_postal_code|shp_promotion_code|size_grid|main_chan_frm_src|prctr_billing|prere_indfrm_src|reg__clr_from_src|update_flag|usage|so_header_usgindp|vas_customer_defined|adidas_group_article|billto_cust|requirement_type|shipto_cust__r2|soldto_cust_r2|sales_doc_category|product_division|promotion_code|sd_categ_precdoc|so_hdrpreceding_doc|so_itmpreceding_doc|so_scl_prec_doc|article__region__s|reference_1|mkt_place_order_num|sales_representative|subtotal_1_source|subtotal_2_source|subtotal_3_source|subtotal_4_source|subtotal_5_source|subtotal_6_source|grid_value|orgcompcodep|created_by|miscdistchcopap|document_currency|reason_for_order|opsplantp|sales_group|sales_office|sales_unit|storage_location|so_net_price_2|sales_order_net_valu|so_conf_qty|so_cum_order_qty|so_net_price|so_net_value|so_org_qty|so_conf_qty_actual|sales_order_qty|sales_odr_qty_actual|article_campaign_id|sales_document_type|order_date_header|billing_city|billing_postal_code|customer_po_time|customer_purchase_or101|overall_rej_status|changed_on|epoch_status|sales_order_canqty|epoch_entry_type|epoch_entry_by|epoch_order_type|epoch_line_type|omnihub_marketplace|confirmed_delivery_t|shipping_city_addres112|shipping_city_addres113|shipping_city_addres114|billing_city_address115|billing_city_address116|billing_city_address117|omnihub_seller_org|omnihub_locale_code|customer_po_type|omnihub_carrier_serv|qualifier|omnihub_document_typ|omnihub_return_code|refund_process_date|refund_process_time|omni_cancel_reason|sales_order_ecom_fre|omnihub_custom_order|vas_packing_type_so|vas_spl_ser_type_so|vas_tktlbl_type_so|exchange_flag|exchange_type|customer_po_timedw|cnc_store_id|last_hold__type|last_hold_released_t|last_hold_release_dt|dynamic_pricing_iden|dynamic_pricing_valu|dymamic_pricing_amnt|exchange_reason|omnihub_site_id|international_shipme|exchange_variant|secondary_article_ca|secondary_article_pr|secondary_coupon_cod|double_discount_flag|extraction_date|lhe_batch_id|lhe_row_id|source_update_date|source_update_time\nnull|null|null|null|VALUE1|10102413|10102413|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING1|null|AC|null|null|STRING1|2022-01-10|2022-01-06|81351101|VALUE1|null|null|CITY1|888420101|null|2300101|404|null|RER|CRC|XCX|null|null|null|STRING2|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING2|STRING1|STRING1|null|null|null|null|null|null|null|230001|null|null|101|STRING1|null|null|null|null|PCP|101|6500124.00|null|101.000|101.000|6500124.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY1|885201|8123401001|STRING1|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING1|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING1|null|null|null|null|null|124814001|null|STRING1|2140001|2022-01-06|null|null|null|STRING1|COMP1COUNTRY1|null|null|STRING1|STRING1|STRING1|No||2|1|null|null\nnull|null|null|null|VALUE3|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING3|null|AC|null|null|STRING3|2022-01-11|2022-01-06|81351103|VALUE1|null|null|CITY3|888420103|null|2300104|404|null|RER|CRC|XCX|null|null|null|STRING4|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING4|STRING4|STRING4|null|null|15002.00|null|15002.00|null|null|230003|null|null|101|STRING1|null|null|null|null|PCP|101|6500126.00|null|101.000|101.000|6500126.00|85003.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY3|885203|8123401003|STRING4|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING3|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING3|null|null|null|null|null|124814003|null|STRING1|2140003|2022-01-06|null|null|null|STRING3|COMP1COUNTRY3|null|null|null|null|null|No||2|3|null|null\nnull|null|null|null|VALUE4|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING4|METHOD1|AC|null|null|STRING4|2022-01-11|2022-01-06|81351104|VALUE1|null|null|CITY4|888420104|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING5|null|101|null|null|null|1001|null|null|null|null|null|STRING5|STRING5|STRING5|null|null|15003.00|null|15003.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85004.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY4|885204|8123401004|STRING5|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING4|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING4|null|null|null|null|null|124814004|null|STRING1|2140004|2022-01-06|null|null|null|STRING4|COMP1COUNTRY1|null|null|null|null|null|No||2|4|null|null\nnull|null|null|null|VALUE7|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE3|null|null|2022-01-05|COUNTRY1|null|STRING7|null|AC|null|null|STRING7|2022-01-11|2022-01-06|81351107|VALUE1|null|null|CITY7|888420107|null|2300110|404|null|RER|RCR|XCX|null|null|null|STRING10|null|101|null|null|null|1001|STRING4|null|null|null|null|STRING10|STRING10|STRING10|null|null|null|null|null|null|null|230007|null|null|101|STRING1|null|null|null|null|PCP|101|6500123.00|null|101.000|101.000|6500123.00|null|101.000|101.000|101.000|101.000|STRING4|STRING1|2022-01-06|CITY7|885207|8123401007|STRING10|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING7|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING7|null|null|null|null|null|124814007|null|STRING1|2140007|2022-01-06|null|null|null|STRING7|COMP1COUNTRY1|null|null|null|null|null|No||2|9|null|null\nnull|null|null|null|VALUE8|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE4|null|null|2022-01-05|COUNTRY1|null|STRING8|null|AC|null|null|STRING8|2022-01-13|2022-01-06|81351108|VALUE1|null|null|CITY8|888420108|null|2300111|404|null|RER|RCR|XCX|null|null|STRING1|STRING11|null|101|null|null|null|1001|STRING5|null|null|null|null|STRING11|STRING11|STRING11|null|null|15004.00|null|15004.00|null|null|230008|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85009.00|101.000|101.000|101.000|101.000|STRING5|STRING1|2022-01-06|CITY8|885208|8123401008|STRING11|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE2|STRING8|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING8|null|EFL|null|null|null|124814008|null|STRING1|2140008|2022-01-06|null|null|null|STRING8|COMP1COUNTRY1|null|null|null|null|null|No||2|10|null|null\nnull|null|null|null|VALUE8|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE4|null|null|2022-01-05|COUNTRY1|null|STRING8|null|AC|null|null|STRING8|2022-01-13|2022-01-06|81351108|VALUE1|null|null|CITY8|888420108|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING12|null|101|null|null|null|1002|STRING5|null|null|null|null|STRING12|STRING11|STRING11|null|null|15005.00|null|15005.00|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500131.00|null|101.000|101.000|6500131.00|85010.00|101.000|101.000|101.000|101.000|STRING5|STRING1|2022-01-06|CITY8|885208|8123401008|STRING11|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING8|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING8|null|null|null|null|null|124814008|null|STRING1|2140008|2022-01-06|null|null|null|STRING8|COMP1COUNTRY1|null|null|null|null|null|No||2|11|null|null\nnull|null|null|null|VALUE8|10102415|10102415|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE4|null|null|2022-01-05|COUNTRY1|null|STRING8|null|AC|null|null|STRING8|2022-01-13|2022-01-06|81351108|VALUE1|null|null|CITY8|888420108|null|2300113|404|null|RER|RCR|XCX|null|null|null|STRING13|null|101|null|null|null|1003|STRING5|null|null|null|null|STRING13|STRING11|STRING11|null|null|15006.00|null|15006.00|null|null|230010|null|null|101|STRING1|null|null|null|null|PCP|101|6500132.00|null|102.000|102.000|6500132.00|85011.00|102.000|102.000|102.000|102.000|STRING5|STRING1|2022-01-06|CITY8|885208|8123401008|STRING11|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING8|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING8|null|null|null|null|null|124814008|null|STRING1|2140008|2022-01-06|null|null|null|STRING8|COMP1COUNTRY1|null|null|null|null|null|No||2|12|null|null\nnull|null|null|null|VALUE9|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING9|null|AC|null|null|STRING9|2022-01-11|2022-01-06|81351109|VALUE1|null|null|CITY9|888420109|null|2300114|404|null|RER|RCR|XCX|null|null|null|STRING14|null|101|null|null|null|1002|null|null|null|null|null|STRING14|STRING14|STRING14|null|null|15003.00|null|15003.00|null|null|230011|null|null|101|STRING1|null|null|null|null|PCP|101|6500133.00|null|101.000|101.000|6500133.00|85012.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY9|885209|8123401009|STRING14|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING9|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING9|null|null|null|null|null|124814009|null|STRING1|2140009|2022-01-06|null|null|null|STRING9|COMP1COUNTRY1|null|null|null|null|null|No||2|13|null|null\nnull|null|null|null|VALUE10|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE5|null|null|2022-01-05|COUNTRY1|null|STRING10|null|AC|null|null|STRING10|2022-01-13|2022-01-06|81351110|VALUE1|null|null|CITY10|888420110|null|2300115|404|null|RER|RCR|XCX|null|null|STRING2|STRING15|null|101|null|null|null|1001|STRING5|null|null|null|null|STRING15|STRING15|STRING15|null|null|15007.00|null|15007.00|null|null|230012|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|85004.00|101.000|101.000|101.000|101.000|STRING6|STRING1|2022-01-06|CITY10|885210|8123401010|STRING15|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE2|STRING10|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING10|null|EFL|null|null|null|124814010|null|STRING1|2140010|2022-01-06|null|null|null|STRING10|COMP1COUNTRY1|null|null|null|null|null|No||2|14|null|null\nnull|null|null|null|VALUE11|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY4|null|STRING11|null|AC|null|null|STRING11|2022-01-11|2022-01-06|81351111|VALUE1|null|null|CITY11|888420111|null|2300116|404|null|RER|CRC|XCX|null|null|null|STRING16|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING16|STRING16|STRING16|null|null|15008.00|null|15008.00|null|null|230013|null|null|101|STRING1|null|null|null|null|PCP|101|6500134.00|null|101.000|101.000|6500134.00|85013.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY11|885211|8123401011|STRING16|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING11|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB4|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING11|null|null|null|null|null|124814011|null|STRING1|2140011|2022-01-06|null|null|null|STRING11|COMP1COUNTRY4|null|null|null|null|null|No||2|15|null|null\nnull|null|null|null|VALUE13|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING13|null|AC|null|null|STRING13|2022-01-11|2022-01-06|81351113|VALUE1|null|null|CITY13|888420113|null|2300103|404|null|RER|CRC|XCX|null|null|null|STRING18|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING18|STRING18|STRING18|null|null|15010.00|null|15010.00|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85015.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY13|885213|8123401013|STRING18|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING13|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING13|null|null|null|null|null|124814013|null|STRING1|2140013|2022-01-06|null|null|null|STRING13|COMP1COUNTRY1|null|null|null|null|null|No||2|17|null|null\nnull|null|null|null|VALUE14|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING14|null|AC|null|null|STRING14|2022-01-11|2022-01-06|81351114|VALUE1|null|null|CITY14|888420114|null|2300103|404|null|RER|RCR|XCX|null|null|null|STRING19|null|101|null|null|null|1001|null|null|null|null|null|STRING19|STRING19|STRING19|null|null|15003.00|null|15003.00|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500123.00|null|101.000|101.000|6500123.00|85016.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY14|885214|8123401014|STRING19|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING14|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING14|null|null|null|null|null|124814014|null|STRING1|2140014|2022-01-06|null|null|null|STRING14|COMP1COUNTRY1|null|null|null|null|null|No||2|18|null|null\nnull|null|null|null|VALUE15|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AA|2022-01-14|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY4|null|STRING15|null|AA|null|null|STRING15|2022-01-14|2022-01-06|81351115|VALUE1|null|null|CITY15|888420115|null|2300120|404|null|RER|RCR|XCX|null|null|STRING3|STRING20|null|101|null|null|null|1002|null|null|null|null|null|STRING20|STRING20|STRING20|null|null|15003.00|null|null|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|85009.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY15|885215|8123401015|STRING20|null|2022-01-06|STRING4|10020.000|STRING3|STORE1|STRING1|TYPE2|STRING15|10240403|null|null|null|null|null|null|COMPANY1|COUNTRYAB4|STRING1|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING15|null|EFL|null|null|null|124814015|null|null|null|null|null|null|null|STRING15|COMP1COUNTRY4|null|null|null|null|null|No||2|19|null|null\nnull|null|null|null|VALUE16|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-15|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY5|null|STRING16|null|AA|null|null|STRING16|null|2022-01-06|81351116|VALUE1|null|null|CITY16|888420116|null|2300121|404|null|RER|CRC|XCX|null|null|null|STRING21|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING21|STRING21|STRING21|null|null|null|15003.00|null|null|null|230016|null|null|101|STRING2|null|null|null|null|PCP|101|6500124.00|null|101.000|101.000|6500124.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY16|885216|8123401016|STRING21|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING16|10240404|null|null|null|null|null|null|COMPANY1|COUNTRYAB5|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING16|null|null|null|null|null|124814016|null|null|null|null|null|null|null|STRING16|COMP1COUNTRY5|null|null|STRING2|STRING1|null|Yes||2|20|null|null\nnull|null|null|null|VALUE16|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-15|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY5|null|STRING16|null|AA|null|null|STRING16|null|2022-01-06|81351116|VALUE1|null|null|CITY16|888420116|null|2300105|404|null|RER|CRC|XCX|null|null|null|STRING22|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING22|STRING21|STRING21|null|null|null|15003.00|null|null|null|230004|null|null|101|STRING2|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY16|885216|8123401016|STRING21|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING16|10240404|null|null|null|null|null|null|COMPANY1|COUNTRYAB5|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING16|null|null|null|null|null|124814016|null|null|null|null|null|null|null|STRING16|COMP1COUNTRY5|null|null|STRING2|STRING1|null|Yes||2|21|null|null\nnull|null|null|null|VALUE17|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING17|null|AC|null|null|STRING17|2022-01-11|2022-01-06|81351117|VALUE1|null|null|CITY17|888420117|null|2300123|404|null|RER|CRC|XCX|null|null|null|STRING23|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING23|STRING23|STRING23|null|null|15010.00|null|15010.00|null|null|230017|null|null|101|STRING1|null|null|null|null|PCP|101|6500136.00|null|101.000|101.000|6500136.00|85008.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY17|885217|8123401017|STRING23|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING17|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING17|null|null|null|null|null|124814017|null|STRING1|2140016|2022-01-06|null|null|null|STRING17|COMP1COUNTRY1|null|null|null|null|null|No||2|22|null|null\nnull|null|null|null|VALUE18|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING18|null|AC|null|null|STRING18|2022-01-11|2022-01-06|81351118|VALUE1|null|null|CITY18|888420118|null|2300114|404|null|RER|CRC|XCX|null|null|null|STRING24|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING24|STRING24|STRING24|null|null|null|null|null|null|null|230011|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY18|885218|8123401018|STRING24|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING18|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING18|null|null|null|null|null|124814018|null|STRING1|2140017|2022-01-06|null|null|null|STRING18|COMP1COUNTRY1|null|null|null|null|null|No||2|23|null|null\nnull|null|null|null|VALUE18|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING18|null|AC|null|null|STRING18|2022-01-11|2022-01-06|81351118|VALUE1|null|null|CITY18|888420118|null|2300114|404|null|RER|RCR|XCX|null|null|null|STRING14|null|101|null|null|null|1002|null|null|null|null|null|STRING14|STRING24|STRING24|null|null|15003.00|null|15003.00|null|null|230011|null|null|101|STRING1|null|null|null|null|PCP|101|6500133.00|null|101.000|101.000|6500133.00|85012.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY18|885218|8123401018|STRING24|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING18|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING18|null|null|null|null|null|124814018|null|STRING1|2140017|2022-01-06|null|null|null|STRING18|COMP1COUNTRY1|null|null|null|null|null|No||2|24|null|null\nnull|null|null|null|VALUE19|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING19|null|AC|null|null|STRING19|2022-01-10|2022-01-06|81351119|VALUE1|null|null|CITY19|888420119|null|2300103|404|null|RER|CRC|XCX|null|null|null|STRING25|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING25|STRING26|STRING26|null|null|15008.00|null|15008.00|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500134.00|null|101.000|101.000|6500134.00|85013.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY19|885219|8123401019|STRING26|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING19|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING19|null|null|null|null|null|124814019|null|STRING1|2140018|2022-01-06|null|null|null|STRING19|COMP1COUNTRY1|null|null|null|null|null|No||2|25|null|null\nnull|null|null|null|VALUE20|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING20|null|AC|null|null|STRING20|2022-01-11|2022-01-06|81351120|VALUE1|null|null|CITY20|888420120|null|2300107|404|null|RER|RCR|XCX|null|null|null|STRING26|null|101|null|null|null|1001|null|null|null|null|null|STRING26|STRING27|STRING27|null|null|15003.00|null|15003.00|null|null|230005|null|null|101|STRING1|null|null|null|null|PCP|101|6500136.00|null|101.000|101.000|6500136.00|85017.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY20|885220|8123401020|STRING27|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING20|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING20|null|null|null|null|null|124814020|null|STRING1|2140019|2022-01-06|null|null|null|STRING20|COMP1COUNTRY1|null|null|null|null|null|No||2|26|null|null\nnull|null|null|null|VALUE21|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING21|null|AC|null|null|STRING21|2022-01-11|2022-01-06|81351121|VALUE1|null|null|CITY21|888420121|null|2300128|404|null|RER|CRC|XCX|null|null|null|STRING27|null|101|null|null|null|1003|STRING2|null|null|null|null|STRING27|STRING28|STRING28|null|null|null|null|null|null|null|230018|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY21|885221|8123401021|STRING28|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING21|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING21|null|null|null|null|null|124814021|null|STRING1|2140010|2022-01-06|null|null|null|STRING21|COMP1COUNTRY1|null|null|null|null|null|No||2|27|null|null\nnull|null|null|null|VALUE22|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING22|null|AC|null|null|STRING22|2022-01-11|2022-01-06|81351122|VALUE1|null|null|CITY22|888420122|null|2300129|404|null|RER|RCR|XCX|null|null|null|STRING28|null|101|null|null|null|1002|null|null|null|null|null|STRING28|STRING29|STRING29|null|null|15003.00|null|15003.00|null|null|230019|null|null|101|STRING1|null|null|null|null|PCP|101|6500137.00|null|101.000|101.000|6500137.00|85018.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY22|885222|8123401022|STRING29|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING22|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING22|null|null|null|null|null|124814022|null|STRING1|2140011|2022-01-06|null|null|null|STRING22|COMP1COUNTRY1|null|null|null|null|null|No||2|28|null|null\nnull|null|null|null|VALUE23|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING23|null|AC|null|null|STRING23|2022-01-13|2022-01-06|81351123|VALUE1|null|null|CITY23|888420123|null|2300117|404|null|RER|RCR|XCX|null|null|null|STRING29|null|101|null|null|null|1002|null|null|null|null|null|STRING29|STRING30|STRING30|null|null|15003.00|null|15003.00|null|null|230014|null|null|101|STRING1|null|null|null|null|PCP|101|6500124.00|null|101.000|101.000|6500124.00|85019.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY23|885223|8123401023|STRING30|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING23|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING23|null|null|null|null|null|124814023|null|STRING1|2140020|2022-01-06|null|null|null|STRING23|COMP1COUNTRY6|null|null|null|null|null|No||2|29|null|null\nnull|null|null|null|VALUE23|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING23|null|AC|null|null|STRING23|2022-01-13|2022-01-06|81351123|VALUE1|null|null|CITY23|888420123|null|2300117|404|null|RER|RCR|XCX|null|null|null|STRING30|null|101|null|null|null|1002|null|null|null|null|null|STRING30|STRING30|STRING30|null|null|15003.00|null|15003.00|null|null|230014|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85004.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY23|885223|8123401023|STRING30|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING23|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING23|null|null|null|null|null|124814023|null|STRING1|2140020|2022-01-06|null|null|null|STRING23|COMP1COUNTRY6|null|null|null|null|null|No||2|30|null|null\nnull|null|null|null|VALUE23|10102415|10102415|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING23|null|AC|null|null|STRING23|2022-01-13|2022-01-06|81351123|VALUE1|null|null|CITY23|888420123|null|2300117|404|null|RER|RCR|XCX|null|null|null|STRING31|null|101|null|null|null|1002|null|null|null|null|null|STRING31|STRING30|STRING30|null|null|15003.00|null|15003.00|null|null|230014|null|null|101|STRING1|null|null|null|null|PCP|101|6500137.00|null|101.000|101.000|6500137.00|85018.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY23|885223|8123401023|STRING30|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING23|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING23|null|null|null|null|null|124814023|null|STRING1|2140020|2022-01-06|null|null|null|STRING23|COMP1COUNTRY6|null|null|null|null|null|No||2|31|null|null\nnull|null|null|null|VALUE23|10102416|10102416|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING23|null|AC|null|null|STRING23|2022-01-13|2022-01-06|81351123|VALUE1|null|null|CITY23|888420123|null|2300133|404|null|RER|RCR|XCX|null|null|null|STRING32|null|101|null|null|null|1001|null|null|null|null|null|STRING32|STRING30|STRING30|null|null|15003.00|null|15003.00|null|null|230020|null|null|101|STRING1|null|null|null|null|PCP|101|6500124.00|null|101.000|101.000|6500124.00|85019.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY23|885223|8123401023|STRING30|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING23|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING23|null|null|null|null|null|124814023|null|STRING1|2140020|2022-01-06|null|null|null|STRING23|COMP1COUNTRY6|null|null|null|null|null|No||2|32|null|null\nnull|null|null|null|VALUE24|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE7|null|null|2022-01-05|COUNTRY1|null|STRING24|null|AC|null|null|STRING24|2022-01-13|2022-01-06|81351124|VALUE1|null|null|CITY24|888420124|null|2300134|404|null|RER|RCR|XCX|null|null|null|STRING33|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING33|STRING34|STRING34|null|null|15011.00|null|15011.00|null|null|230021|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85016.00|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY24|885224|8123401024|STRING34|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING24|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING24|null|null|null|null|null|124814024|null|STRING1|2140021|2022-01-06|null|null|null|STRING24|COMP1COUNTRY1|null|null|null|null|null|No||2|33|null|null\nnull|null|null|null|VALUE25|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING25|null|AC|null|null|STRING25|2022-01-11|2022-01-06|81351125|VALUE1|null|null|CITY25|888420125|null|2300135|404|null|RER|CRC|XCX|null|null|null|STRING34|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING34|STRING35|STRING35|null|null|15012.00|null|15012.00|null|null|230022|null|null|101|STRING1|null|null|null|null|PCP|101|6500138.00|null|101.000|101.000|6500138.00|85020.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY25|885225|8123401025|STRING35|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING25|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING25|null|null|null|null|null|124814025|null|STRING1|2140022|2022-01-06|null|null|null|STRING25|COMP1COUNTRY2|null|null|null|null|null|No||2|34|null|null\nnull|null|null|null|VALUE25|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING25|null|AC|null|null|STRING25|2022-01-11|2022-01-06|81351125|VALUE1|null|null|CITY25|888420125|null|2300103|404|null|RER|CRC|XCX|null|null|null|STRING35|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING35|STRING35|STRING35|null|null|15012.00|null|15012.00|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500138.00|null|101.000|101.000|6500138.00|85020.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY25|885225|8123401025|STRING35|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING25|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING25|null|null|null|null|null|124814025|null|STRING1|2140022|2022-01-06|null|null|null|STRING25|COMP1COUNTRY2|null|null|null|null|null|No||2|35|null|null\nnull|null|null|null|VALUE25|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING25|null|AC|null|null|STRING25|2022-01-11|2022-01-06|81351125|VALUE1|null|null|CITY25|888420125|null|2300107|404|null|RER|CRC|XCX|null|null|null|STRING36|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING36|STRING35|STRING35|null|null|15013.00|null|15013.00|null|null|230005|null|null|101|STRING1|null|null|null|null|PCP|101|6500138.00|null|101.000|101.000|6500138.00|85021.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY25|885225|8123401025|STRING35|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING25|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING25|null|null|null|null|null|124814025|null|STRING1|2140022|2022-01-06|null|null|null|STRING25|COMP1COUNTRY2|null|null|null|null|null|No||2|36|null|null\nnull|null|null|null|VALUE25|10102416|10102416|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING25|null|AC|null|null|STRING25|2022-01-11|2022-01-06|81351125|VALUE1|null|null|CITY25|888420125|null|2300138|404|null|RER|CRC|XCX|null|null|null|STRING37|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING37|STRING35|STRING35|null|null|15013.00|null|15013.00|null|null|230023|null|null|101|STRING1|null|null|null|null|PCP|101|6500138.00|null|101.000|101.000|6500138.00|85021.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY25|885225|8123401025|STRING35|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING25|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING25|null|null|null|null|null|124814025|null|STRING1|2140022|2022-01-06|null|null|null|STRING25|COMP1COUNTRY2|null|null|null|null|null|No||2|37|null|null\nnull|null|null|null|VALUE25|10102417|10102417|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING25|null|AC|null|null|STRING25|2022-01-11|2022-01-06|81351125|VALUE1|null|null|CITY25|888420125|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING38|null|101|null|null|null|1002|null|null|null|null|null|STRING38|STRING35|STRING35|null|null|15003.00|null|15003.00|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85004.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY25|885225|8123401025|STRING35|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING25|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING25|null|null|null|null|null|124814025|null|STRING1|2140022|2022-01-06|null|null|null|STRING25|COMP1COUNTRY2|null|null|null|null|null|No||2|38|null|null\nnull|null|null|null|VALUE26|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY7|null|STRING26|null|AC|null|null|STRING26|2022-01-11|2022-01-06|81351126|VALUE1|null|null|CITY26|888420126|null|2300140|404|null|RER|RCR|XCX|null|null|null|STRING39|null|101|null|null|null|1002|null|null|null|null|null|STRING39|STRING40|STRING40|null|null|15003.00|null|15003.00|null|null|230025|null|null|101|STRING3|null|null|null|null|PCP|101|6500139.00|null|101.000|101.000|6500139.00|85022.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY26|885226|8123401026|STRING40|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING26|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB7|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING26|null|null|null|null|null|124814026|null|STRING1|2140023|2022-01-06|null|null|null|STRING26|COMP1COUNTRY7|null|null|null|null|null|No||2|39|null|null\nnull|null|null|null|VALUE27|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE8|null|null|2022-01-05|COUNTRY1|null|STRING27|null|AC|null|null|STRING27|2022-01-10|2022-01-06|81351127|VALUE1|null|null|CITY27|888420127|null|2300114|404|null|RER|RCR|XCX|null|null|null|STRING40|null|101|null|null|null|1002|STRING1|null|null|null|null|STRING40|STRING41|STRING41|null|null|15014.00|null|15014.00|null|null|230011|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85023.00|101.000|101.000|101.000|101.000|STRING1|STRING1|2022-01-06|CITY27|885227|8123401027|STRING41|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING27|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING27|null|null|null|null|null|124814027|null|STRING1|2140024|2022-01-06|null|null|null|STRING27|COMP1COUNTRY1|null|null|null|null|null|No||2|40|null|null\nnull|null|null|null|VALUE27|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE8|null|null|2022-01-05|COUNTRY1|null|STRING27|null|AC|null|null|STRING27|2022-01-10|2022-01-06|81351127|VALUE1|null|null|CITY27|888420127|null|2300142|404|null|RER|RCR|XCX|null|null|null|STRING41|null|101|null|null|null|1002|STRING1|null|null|null|null|STRING41|STRING41|STRING41|null|null|15014.00|null|15014.00|null|null|230026|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85023.00|101.000|101.000|101.000|101.000|STRING1|STRING1|2022-01-06|CITY27|885227|8123401027|STRING41|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING27|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING27|null|null|null|null|null|124814027|null|STRING1|2140024|2022-01-06|null|null|null|STRING27|COMP1COUNTRY1|null|null|null|null|null|No||2|41|null|null\nnull|null|null|null|VALUE27|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING27|null|AC|null|null|STRING27|2022-01-10|2022-01-06|81351127|VALUE1|null|null|CITY27|888420127|null|2300143|404|null|RER|CRC|XCX|null|null|null|STRING42|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING42|STRING41|STRING41|null|null|null|null|null|null|null|230027|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY27|885227|8123401027|STRING41|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING27|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING27|null|null|null|null|null|124814027|null|STRING1|2140024|2022-01-06|null|null|null|STRING27|COMP1COUNTRY1|null|null|STRING1|STRING1|STRING2|Yes||2|42|null|null\nnull|null|null|null|VALUE28|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE9|null|null|2022-01-05|COUNTRY1|null|STRING28|null|AC|null|null|STRING28|2022-01-11|2022-01-06|81351128|VALUE1|null|null|CITY20|888420128|null|2300101|404|null|RER|RCR|XCX|null|null|null|STRING43|null|101|null|null|null|1002|STRING6|null|null|null|null|STRING43|STRING44|STRING44|null|null|null|null|null|null|null|230001|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|null|101.000|101.000|101.000|101.000|STRING8|STRING1|2022-01-06|CITY20|885228|8123401028|STRING44|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING28|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING28|null|null|null|null|null|124814028|null|STRING1|2140025|2022-01-06|null|null|null|STRING28|COMP1COUNTRY1|null|null|null|null|null|No||2|43|null|null\nnull|null|null|null|VALUE28|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING28|null|AC|null|null|STRING28|2022-01-11|2022-01-06|81351128|VALUE1|null|null|CITY20|888420128|null|2300116|404|null|RER|CRC|XCX|null|null|null|STRING44|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING44|STRING44|STRING44|null|null|null|null|null|null|null|230013|null|null|101|STRING1|null|null|null|null|PCP|101|6500136.00|null|101.000|101.000|6500136.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY20|885228|8123401028|STRING44|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING28|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING28|null|null|null|null|null|124814028|null|STRING1|2140025|2022-01-06|null|null|null|STRING28|COMP1COUNTRY1|null|null|STRING3|STRING2|STRING3|Yes||2|44|null|null\nnull|null|null|null|VALUE29|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING29|METHOD2|AC|null|null|STRING29|2022-01-11|2022-01-06|81351129|VALUE1|null|null|CITY28|888420129|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING45|null|101|null|null|null|1001|null|null|null|null|null|STRING45|STRING46|STRING46|null|null|15003.00|null|15003.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85024.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY28|885229|8123401029|STRING46|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING29|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING29|null|null|null|null|null|124814029|null|STRING1|2140026|2022-01-06|null|null|null|STRING29|COMP1COUNTRY1|null|null|null|null|null|No||2|45|null|null\nnull|null|null|null|VALUE30|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY4|null|STRING30|null|AC|null|null|STRING30|2022-01-11|2022-01-06|81351130|VALUE1|null|null|CITY29|888420130|null|2300103|404|null|RER|RCR|XCX|null|null|null|STRING46|null|101|null|null|null|1001|null|null|null|null|null|STRING46|STRING47|STRING47|null|null|15003.00|null|15003.00|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY29|885230|8123401030|STRING47|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING30|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB4|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING30|null|null|null|null|null|124814030|null|STRING1|2140027|2022-01-06|null|null|null|STRING30|COMP1COUNTRY4|null|null|null|null|null|No||2|46|null|null\nnull|null|null|null|VALUE31|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY4|null|STRING31|null|AA|null|null|STRING31|2022-01-13|2022-01-06|81351131|VALUE1|null|null|CITY11|888420131|null|2300116|404|null|RER|CRC|XCX|null|null|null|STRING47|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING47|STRING48|STRING48|null|null|null|null|null|null|null|230013|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY11|885231|8123401031|STRING48|null|2022-01-06|STRING4|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING31|10240405|null|null|null|null|null|null|COMPANY1|COUNTRYAB4|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING31|null|null|null|null|null|124814031|null|null|null|null|null|null|null|STRING31|COMP1COUNTRY4|null|null|STRING4|STRING1|null|Yes||2|47|null|null\nnull|null|null|null|VALUE32|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING32|null|AC|null|null|STRING32|2022-01-11|2022-01-06|81351132|VALUE1|null|null|CITY30|888420132|null|2300103|404|null|RER|CRC|XCX|null|null|null|STRING48|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING48|STRING49|STRING49|null|null|15015.00|null|15015.00|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500141.00|null|101.000|101.000|6500141.00|85026.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY30|885232|8123401032|STRING49|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING32|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING32|null|null|null|null|null|124814032|null|STRING1|2140008|2022-01-06|null|null|null|STRING32|COMP1COUNTRY2|null|null|null|null|null|No||2|48|null|null\nnull|null|null|null|VALUE32|10102413|10102413|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING32|null|AC|null|null|STRING32|2022-01-11|2022-01-06|81351132|VALUE1|null|null|CITY30|888420132|null|2300103|404|null|RER|RCR|XCX|null|null|null|STRING49|null|101|null|null|null|1001|null|null|null|null|null|STRING49|STRING49|STRING49|null|null|15003.00|null|15003.00|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85027.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY30|885232|8123401032|STRING49|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING32|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING32|null|null|null|null|null|124814032|null|STRING1|2140008|2022-01-06|null|null|null|STRING32|COMP1COUNTRY2|null|null|null|null|null|No||2|49|null|null\nnull|null|null|null|VALUE33|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE10|null|null|2022-01-05|COUNTRY1|null|STRING33|null|AC|null|null|STRING33|2022-01-13|2022-01-06|81351133|VALUE1|null|null|CITY31|888420133|null|2300151|404|null|RER|RCR|XCX|null|null|null|STRING50|null|101|null|null|null|1001|STRING5|null|null|null|null|STRING50|STRING51|STRING51|null|null|15016.00|null|15016.00|null|null|230028|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85028.00|101.000|101.000|101.000|101.000|STRING9|STRING1|2022-01-06|CITY31|885233|8123401033|STRING51|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING33|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING33|null|null|null|null|null|124814033|null|STRING1|2140028|2022-01-06|null|null|null|STRING33|COMP1COUNTRY1|null|null|null|null|null|No||2|50|null|null\nnull|null|null|null|VALUE33|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE10|null|null|2022-01-05|COUNTRY1|null|STRING33|null|AC|null|null|STRING33|2022-01-13|2022-01-06|81351133|VALUE1|null|null|CITY31|888420133|null|2300151|404|null|RER|RCR|XCX|null|null|null|STRING51|null|101|null|null|null|1001|STRING5|null|null|null|null|STRING51|STRING51|STRING51|null|null|15016.00|null|15016.00|null|null|230028|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85028.00|101.000|101.000|101.000|101.000|STRING9|STRING1|2022-01-06|CITY31|885233|8123401033|STRING51|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING33|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING33|null|null|null|null|null|124814033|null|STRING1|2140028|2022-01-06|null|null|null|STRING33|COMP1COUNTRY1|null|null|null|null|null|No||2|51|null|null\nnull|null|null|null|VALUE34|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE11|null|null|2022-01-05|COUNTRY1|null|STRING34|null|AC|null|null|STRING34|2022-01-11|2022-01-06|81351134|VALUE1|null|null|CITY20|888420134|null|2300104|404|null|RER|RCR|XCX|null|null|null|STRING52|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING52|STRING53|STRING53|null|null|15011.00|null|15011.00|null|null|230003|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85016.00|101.000|101.000|101.000|101.000|STRING10|STRING1|2022-01-06|CITY20|885234|8123401034|STRING53|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING34|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING34|null|null|null|null|null|124814034|null|STRING1|2140029|2022-01-06|null|null|null|STRING34|COMP1COUNTRY1|null|null|null|null|null|No||2|52|null|null\nnull|null|null|null|VALUE35|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE12|null|null|2022-01-05|COUNTRY8|null|STRING35|null|AC|null|null|STRING35|2022-01-11|2022-01-06|81351135|VALUE1|null|null|CITY32|888420135|null|2300111|404|null|RER|RCR|XCX|null|null|null|STRING53|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING53|STRING54|STRING54|null|null|null|null|null|null|null|230008|null|null|101|STRING4|null|null|null|null|PCP|101|6500143.00|null|101.000|101.000|6500143.00|null|101.000|101.000|101.000|101.000|STRING11|STRING1|2022-01-06|CITY32|885235|8123401035|STRING54|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING35|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB8|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING35|null|null|null|null|null|124814035|null|STRING1|2140030|2022-01-06|null|null|null|STRING35|COMP1COUNTRY8|null|null|null|null|null|No||2|53|null|null\nnull|null|null|null|VALUE36|10102412|10102412|null|3,02E+25|3,02E+25|null|null|XXX|null|AA|2022-01-15|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY7|null|STRING36|null|AA|null|null|STRING36|null|2022-01-06|81351136|VALUE1|null|null|CITY33|888420136|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING54|null|101|null|null|null|1001|STRING7|null|6024050701|null|null|STRING54|STRING55|STRING55|null|null|null|null|null|null|null|230004|null|null|101|STRING3|null|null|null|null|PCP|101|6500144.00|null|101.000|101.000|6500144.00|null|101.000|101.000|101.000|101.000|STRING12|STRING1|2022-01-06|CITY33|885236|8123401036|STRING55|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING36|10240406|null|null|null|null|null|null|COMPANY2|COUNTRYAB7|STRING3|STRING3|1923002|1923001|123051|null|null|null|10349200.00|STRING36|null|null|null|EOD|STRING1|124814036|null|null|null|null|null|null|null|STRING36|null|null|null|STRING5|STRING3|null|Yes||2|54|null|null\nnull|null|null|null|VALUE38|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING38|null|AC|null|null|STRING38|2022-01-11|2022-01-06|81351138|VALUE1|null|null|CITY35|888420138|null|2300110|404|null|RER|CRC|XCX|null|null|null|STRING56|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING56|STRING57|STRING57|null|null|15010.00|null|15010.00|null|null|230007|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85015.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY35|885238|8123401038|STRING57|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING38|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING38|null|null|null|null|null|124814038|null|STRING1|2140032|2022-01-06|null|null|null|STRING38|COMP1COUNTRY1|null|null|null|null|null|No||2|56|null|null\nnull|null|null|null|VALUE39|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING39|null|AC|null|null|STRING39|2022-01-11|2022-01-06|81351139|VALUE1|null|null|CITY36|888420139|null|2300103|404|null|RER|RCR|XCX|null|null|null|STRING57|null|101|null|null|null|1001|null|null|null|null|null|STRING57|STRING58|STRING58|null|null|15003.00|null|15003.00|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY36|885239|8123401039|STRING58|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING39|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING39|null|null|null|null|null|124814039|null|STRING1|2140033|2022-01-06|null|null|null|STRING39|COMP1COUNTRY1|null|null|null|null|null|No||2|57|null|null\nnull|null|null|null|VALUE40|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE13|null|null|2022-01-05|COUNTRY1|null|STRING40|METHOD3|AC|null|null|STRING40|2022-01-13|2022-01-06|81351140|VALUE1|null|null|CITY37|888420140|null|2300114|404|null|RER|RCR|XCX|null|null|null|STRING58|null|101|null|null|null|1002|STRING1|null|null|null|null|STRING58|STRING59|STRING59|null|null|15001.00|null|15001.00|null|null|230011|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|85027.00|101.000|101.000|101.000|101.000|STRING1|STRING1|2022-01-06|CITY37|885240|8123401040|STRING59|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING40|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING40|null|null|null|null|null|124814040|null|STRING1|2140034|2022-01-06|null|null|null|STRING40|COMP1COUNTRY1|null|null|null|null|null|No||2|58|null|null\nnull|null|null|null|VALUE41|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AA|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|VALUE14|null|null|2022-01-05|COUNTRY5|null|STRING41|null|AA|null|null|STRING41|2022-01-10|2022-01-06|81351141|VALUE1|null|null|CITY38|888420141|null|2300123|404|null|RER|RCR|XCX|null|null|null|STRING59|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING59|STRING60|STRING60|null|null|null|null|null|null|null|230017|null|null|101|STRING2|null|null|null|null|PCP|101|6500126.00|null|101.000|101.000|6500126.00|null|101.000|101.000|101.000|101.000|STRING8|STRING1|2022-01-06|CITY38|885241|8123401041|STRING60|null|2022-01-06|STRING4|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING41|10240406|null|null|null|null|null|null|COMPANY1|COUNTRYAB5|STRING1|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING41|null|null|null|null|null|124814041|null|null|null|null|null|null|null|STRING41|COMP1COUNTRY5|null|null|null|null|null|No||2|59|null|null\nnull|null|null|null|VALUE42|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-15|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|VALUE15|null|null|2022-01-05|COUNTRY9|null|STRING42|METHOD4|AA|null|null|STRING42|null|2022-01-06|81351142|VALUE1|null|null|CITY39|888420142|null|2300110|404|null|RER|RCR|XCX|null|null|null|STRING60|null|101|null|null|null|1001|STRING1|null|null|null|null|STRING60|STRING61|STRING61|null|null|null|null|null|null|null|230007|null|null|101|STRING5|null|null|null|null|PCP|101|6500146.00|null|101.000|101.000|6500146.00|null|101.000|101.000|101.000|101.000|STRING13|STRING2|2022-01-06|CITY39|885242|8123401042|STRING61|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING3|TYPE1|STRING42|10240407|null|null|null|null|null|null|COMPANY1|COUNTRYAB9|STRING2|null|1923002|1923001|null|null|null|null|10349200.00|STRING42|null|null|null|null|null|124814031|3059002|null|null|null|null|null|null|STRING42|COMP1COUNTRY9|null|null|null|null|null|No||2|60|null|null\nnull|null|null|null|VALUE43|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-14|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY7|null|STRING43|null|AA|null|null|STRING43|2022-01-14|2022-01-06|81351143|VALUE1|null|null|CITY40|888420143|null|2300116|404|null|RER|CRC|XCX|null|null|null|STRING61|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING61|STRING62|STRING62|null|null|null|null|null|null|null|230013|null|null|101|STRING3|null|null|null|null|PCP|101|6500147.00|null|101.000|101.000|6500147.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY40|885243|8123401043|STRING62|null|2022-01-06|STRING4|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING43|10240408|null|null|null|null|null|null|COMPANY1|COUNTRYAB7|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING43|null|null|null|null|null|124814042|null|null|null|null|null|null|null|STRING43|COMP1COUNTRY7|null|null|STRING6|STRING1|null|Yes||2|61|null|null\nnull|null|null|null|VALUE44|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE16|null|null|2022-01-05|COUNTRY1|null|STRING44|null|AC|null|null|STRING44|2022-01-13|2022-01-06|81351144|VALUE1|null|null|CITY41|888420144|null|2300163|404|null|RER|RCR|XCX|null|null|STRING4|STRING62|null|101|null|null|null|1001|STRING5|null|null|null|null|STRING62|STRING63|STRING63|null|null|15014.00|null|15014.00|null|null|230029|null|null|101|STRING1|null|null|null|null|PCP|101|6500148.00|null|101.000|101.000|6500148.00|85030.00|101.000|101.000|101.000|101.000|STRING4|STRING1|2022-01-06|CITY41|885244|8123401044|STRING63|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE2|STRING44|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING44|null|EFL|null|null|null|124814043|null|STRING1|2140035|2022-01-06|null|null|null|STRING44|COMP1COUNTRY1|null|null|null|null|null|No||2|62|null|null\nnull|null|null|null|VALUE45|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING45|null|AC|null|null|STRING45|2022-01-10|2022-01-06|81351145|VALUE1|null|null|CITY42|888420145|null|2300164|404|null|RER|RCR|XCX|null|null|null|STRING63|null|101|null|null|null|1002|null|null|null|null|null|STRING63|STRING64|STRING64|null|null|15003.00|null|15003.00|null|null|230030|null|null|101|STRING1|null|null|null|null|PCP|101|6500149.00|null|101.000|101.000|6500149.00|85031.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY42|885245|8123401045|STRING64|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING45|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING45|null|null|null|null|null|124814044|null|STRING1|2140036|2022-01-06|null|null|null|STRING45|COMP1COUNTRY1|null|null|null|null|null|No||2|63|null|null\nnull|null|null|null|VALUE46|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE17|null|null|2022-01-05|COUNTRY1|null|STRING46|METHOD5|AC|null|null|STRING46|2022-01-11|2022-01-06|81351146|VALUE1|null|null|CITY20|888420146|null|2300165|404|null|RER|RCR|XCX|null|null|null|STRING64|null|101|null|null|null|1001|STRING8|null|null|null|null|STRING64|STRING65|STRING65|null|null|15017.00|null|15017.00|null|null|230031|null|null|101|STRING1|null|null|null|null|PCP|101|6500148.00|null|101.000|101.000|6500148.00|85032.00|101.000|101.000|101.000|101.000|STRING14|STRING1|2022-01-06|CITY20|885246|8123401046|STRING65|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING46|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING46|null|null|null|null|null|124814045|null|STRING1|2140037|2022-01-06|null|null|null|STRING46|COMP1COUNTRY1|null|null|null|null|null|No||2|64|null|null\nnull|null|null|null|VALUE47|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING47|METHOD3|AC|null|null|STRING47|2022-01-11|2022-01-06|81351147|VALUE1|null|null|CITY20|888420147|null|2300120|404|null|RER|CRC|XCX|null|null|null|STRING65|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING65|STRING66|STRING66|null|null|null|null|null|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY20|885247|8123401047|STRING66|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING47|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING47|null|null|null|null|null|124814046|null|STRING1|2140003|2022-01-06|null|null|null|STRING47|COMP1COUNTRY1|null|null|null|null|null|No||2|65|null|null\nnull|null|null|null|VALUE47|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING47|METHOD3|AC|null|null|STRING47|2022-01-11|2022-01-06|81351147|VALUE1|null|null|CITY20|888420147|null|2300120|404|null|RER|CRC|XCX|null|null|null|STRING66|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING66|STRING66|STRING66|null|null|null|null|null|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500133.00|null|101.000|101.000|6500133.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY20|885247|8123401047|STRING66|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING47|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING47|null|null|null|null|null|124814046|null|STRING1|2140003|2022-01-06|null|null|null|STRING47|COMP1COUNTRY1|null|null|null|null|null|No||2|66|null|null\nnull|null|null|null|VALUE47|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING47|METHOD3|AC|null|null|STRING47|2022-01-11|2022-01-06|81351147|VALUE1|null|null|CITY20|888420147|null|2300120|404|null|RER|CRC|XCX|null|null|null|STRING67|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING67|STRING66|STRING66|null|null|15014.00|null|15014.00|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85023.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY20|885247|8123401047|STRING66|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING47|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING47|null|null|null|null|null|124814046|null|STRING1|2140003|2022-01-06|null|null|null|STRING47|COMP1COUNTRY1|null|null|null|null|null|No||2|67|null|null\nnull|null|null|null|VALUE49|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING49|null|AC|null|null|STRING49|2022-01-10|2022-01-06|81351149|VALUE1|null|null|CITY44|888420149|null|2300128|404|null|RER|RCR|XCX|null|null|null|STRING70|null|101|null|null|null|1002|null|null|null|null|null|STRING70|STRING71|STRING71|null|null|15003.00|null|15003.00|null|null|230018|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85024.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY44|885249|8123401049|STRING71|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING49|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING49|null|null|null|null|null|124814048|null|STRING1|2140039|2022-01-06|null|null|null|STRING49|COMP1COUNTRY1|null|null|null|null|null|No||2|70|null|null\nnull|null|null|null|VALUE49|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING49|null|AC|null|null|STRING49|2022-01-10|2022-01-06|81351149|VALUE1|null|null|CITY44|888420149|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING71|null|101|null|null|null|1002|null|null|null|null|null|STRING71|STRING71|STRING71|null|null|15003.00|null|15003.00|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500150.00|null|101.000|101.000|6500150.00|85034.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY44|885249|8123401049|STRING71|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING49|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING49|null|null|null|null|null|124814048|null|STRING1|2140039|2022-01-06|null|null|null|STRING49|COMP1COUNTRY1|null|null|null|null|null|No||2|71|null|null\nnull|null|null|null|VALUE50|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY4|null|STRING50|null|AA|null|null|STRING50|2022-01-13|2022-01-06|81351142|VALUE1|null|null|CITY45|888420150|null|2300116|404|null|RER|CRC|XCX|null|null|null|STRING72|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING72|STRING73|STRING73|null|null|null|null|null|null|null|230013|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY45|885250|8123401042|STRING73|null|2022-01-06|STRING4|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING50|10240407|null|null|null|null|null|null|COMPANY2|COUNTRYAB4|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING50|null|null|null|null|null|124814049|null|null|null|null|null|null|null|STRING50|COMP2COUNTRY4|null|null|null|null|null|No||2|72|null|null\nnull|null|null|null|VALUE51|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING51|null|AC|null|null|STRING51|2022-01-11|2022-01-06|81351150|VALUE1|null|null|CITY46|888420151|null|2300121|404|null|RER|CRC|XCX|null|null|null|STRING73|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING73|STRING74|STRING74|null|null|null|null|null|null|null|230016|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY46|885251|8123401050|STRING74|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING51|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING51|null|null|null|null|null|124814050|null|STRING1|2140030|2022-01-06|null|null|null|STRING51|COMP1COUNTRY3|null|null|null|null|null|No||2|73|null|null\nnull|null|null|null|VALUE51|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING51|null|AC|null|null|STRING51|2022-01-11|2022-01-06|81351150|VALUE1|null|null|CITY46|888420151|null|2300175|404|null|RER|CRC|XCX|null|null|null|STRING74|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING74|STRING74|STRING74|null|null|15019.00|null|15019.00|null|null|230032|null|null|101|STRING1|null|null|null|null|PCP|101|6500151.00|null|101.000|101.000|6500151.00|85009.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY46|885251|8123401050|STRING74|null|2022-01-06|STRING5|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING51|10240409|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING51|null|null|null|null|null|124814050|null|STRING1|2140030|2022-01-06|null|null|null|STRING51|COMP1COUNTRY3|null|null|null|null|null|No||2|74|null|null\nnull|null|null|null|VALUE52|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE18|null|null|2022-01-05|COUNTRY1|null|STRING52|null|AC|null|null|STRING52|2022-01-11|2022-01-06|81351151|VALUE1|null|null|CITY47|888420152|null|2300115|404|null|RER|RCR|XCX|null|null|null|STRING75|null|101|null|null|null|1001|STRING5|null|null|null|null|STRING75|STRING76|STRING76|null|null|15007.00|null|15007.00|null|null|230012|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|85004.00|101.000|101.000|101.000|101.000|STRING6|STRING1|2022-01-06|CITY47|885252|8123401051|STRING76|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING52|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING52|null|null|null|null|null|124814051|null|STRING1|2140040|2022-01-06|null|null|null|STRING52|COMP1COUNTRY1|null|null|null|null|null|No||2|75|null|null\nnull|null|null|null|VALUE53|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING53|METHOD6|AC|null|null|STRING53|2022-01-11|2022-01-06|81351152|VALUE1|null|null|CITY48|888420153|null|2300177|404|null|RER|RCR|XCX|null|null|null|STRING76|null|101|null|null|null|1001|null|null|null|null|null|STRING76|STRING77|STRING77|null|null|15003.00|null|15003.00|null|null|230033|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|85015.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY48|885253|8123401052|STRING77|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING53|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING53|null|null|null|null|null|124814052|null|STRING1|2140041|2022-01-06|null|null|null|STRING53|COMP1COUNTRY1|null|null|null|null|null|No||2|76|null|null\nnull|null|null|null|VALUE54|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE19|null|null|2022-01-05|COUNTRY1|null|STRING54|null|AC|null|null|STRING54|2022-01-11|2022-01-06|81351153|VALUE1|null|null|CITY49|888420154|null|2300128|404|null|RER|RCR|XCX|null|null|null|STRING77|null|101|null|null|null|1003|STRING6|null|null|null|null|STRING77|STRING78|STRING78|null|null|15020.00|null|15020.00|null|null|230018|null|null|101|STRING1|null|null|null|null|PCP|101|6500152.00|null|101.000|101.000|6500152.00|85035.00|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY49|885254|8123401053|STRING78|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING54|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING54|null|null|null|null|null|124814053|null|STRING1|2140042|2022-01-06|null|null|null|STRING54|COMP1COUNTRY1|null|null|null|null|null|No||2|77|null|null\nnull|null|null|null|VALUE54|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE19|null|null|2022-01-05|COUNTRY1|null|STRING54|null|AC|null|null|STRING54|2022-01-11|2022-01-06|81351153|VALUE1|null|null|CITY49|888420154|null|2300133|404|null|RER|RCR|XCX|null|null|null|STRING78|null|101|null|null|null|1002|STRING6|null|null|null|null|STRING78|STRING78|STRING78|null|null|null|null|null|null|null|230020|null|null|101|STRING1|null|null|null|null|PCP|101|6500124.00|null|101.000|101.000|6500124.00|null|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY49|885254|8123401053|STRING78|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING54|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING54|null|null|null|null|null|124814053|null|STRING1|2140042|2022-01-06|null|null|null|STRING54|COMP1COUNTRY1|null|null|null|null|null|No||2|78|null|null\nnull|null|null|null|VALUE55|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING55|null|AC|null|null|STRING55|2022-01-13|2022-01-06|81351154|VALUE1|null|null|CITY50|888420155|null|2300180|404|null|RER|RCR|XCX|null|null|null|STRING79|null|101|null|null|null|1001|null|null|null|null|null|STRING79|STRING80|STRING80|null|null|15003.00|null|15003.00|null|null|230034|null|null|101|STRING1|null|null|null|null|PCP|101|6500148.00|null|101.000|101.000|6500148.00|85036.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY50|885255|8123401054|STRING80|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING55|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING55|null|null|null|null|null|124814054|null|STRING1|2140043|2022-01-06|null|null|null|STRING55|COMP1COUNTRY6|null|null|null|null|null|No||2|79|null|null\nnull|null|null|null|VALUE56|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE20|null|null|2022-01-05|COUNTRY1|null|STRING56|null|AC|null|null|STRING56|2022-01-11|2022-01-06|81351155|VALUE1|null|null|CITY51|888420156|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING80|null|101|null|null|null|1002|STRING5|null|null|null|null|STRING80|STRING81|STRING81|null|null|15016.00|null|15016.00|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85028.00|101.000|101.000|101.000|101.000|STRING6|STRING1|2022-01-06|CITY51|885256|8123401055|STRING81|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING56|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING56|null|null|null|null|null|124814055|null|STRING1|2140021|2022-01-06|null|null|null|STRING56|COMP1COUNTRY1|null|null|null|null|null|No||2|80|null|null\nnull|null|null|null|VALUE56|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING56|null|AC|null|null|STRING56|2022-01-11|2022-01-06|81351155|VALUE1|null|null|CITY51|888420156|null|2300175|404|null|RER|RCR|XCX|null|null|null|STRING81|null|101|null|null|null|1001|null|null|null|null|null|STRING81|STRING81|STRING81|null|null|15003.00|null|15003.00|null|null|230032|null|null|101|STRING1|null|null|null|null|PCP|101|6500141.00|null|101.000|101.000|6500141.00|85037.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY51|885256|8123401055|STRING81|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING56|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING56|null|null|null|null|null|124814055|null|STRING1|2140021|2022-01-06|null|null|null|STRING56|COMP1COUNTRY1|null|null|null|null|null|No||2|81|null|null\nnull|null|null|null|VALUE56|10102415|10102415|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE20|null|null|2022-01-05|COUNTRY1|null|STRING56|null|AC|null|null|STRING56|2022-01-11|2022-01-06|81351155|VALUE1|null|null|CITY51|888420156|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING82|null|101|null|null|null|1002|STRING5|null|null|null|null|STRING82|STRING81|STRING81|null|null|15021.00|null|15021.00|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85038.00|101.000|101.000|101.000|101.000|STRING6|STRING1|2022-01-06|CITY51|885256|8123401055|STRING81|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING56|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING56|null|null|null|null|null|124814055|null|STRING1|2140021|2022-01-06|null|null|null|STRING56|COMP1COUNTRY1|null|null|null|null|null|No||2|82|null|null\nnull|null|null|null|VALUE56|10102416|10102416|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-15|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE20|null|null|2022-01-05|COUNTRY1|null|STRING56|null|AC|null|null|STRING56|null|2022-01-06|81351155|VALUE1|null|null|CITY51|888420156|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING83|null|101|null|null|null|1002|STRING5|null|null|null|null|STRING83|STRING81|STRING81|null|null|15020.00|null|15020.00|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85023.00|101.000|101.000|101.000|101.000|STRING6|STRING1|2022-01-06|CITY51|885256|8123401055|STRING81|null|2022-01-06|STRING6|10020.000|STRING1|STORE1|STRING1|TYPE3|STRING56|10240402|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|null|1923002|1923001|null|null|null|null|10349200.00|STRING57|null|null|null|null|null|124814055|null|STRING1|2140021|2022-01-06|null|null|null|STRING56|COMP1COUNTRY1|null|null|null|null|null|No||2|83|null|null\nnull|null|null|null|VALUE57|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING57|null|AC|null|null|STRING57|2022-01-11|2022-01-06|81351156|VALUE1|null|null|CITY52|888420157|null|2300185|404|null|RER|RCR|XCX|null|null|null|STRING84|null|101|null|null|null|1001|null|null|null|null|null|STRING84|STRING85|STRING85|null|null|15003.00|null|15003.00|null|null|230035|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|85006.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY52|885257|8123401056|STRING85|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING57|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING58|null|null|null|null|null|124814056|null|STRING1|2140044|2022-01-06|null|null|null|STRING57|COMP1COUNTRY6|null|null|null|null|null|No||2|84|null|null\nnull|null|null|null|VALUE57|10102413|10102413|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING57|null|AC|null|null|STRING57|2022-01-11|2022-01-06|81351156|VALUE1|null|null|CITY52|888420157|null|2300185|404|null|RER|RCR|XCX|null|null|null|STRING85|null|101|null|null|null|1001|null|null|null|null|null|STRING85|STRING85|STRING85|null|null|15003.00|null|15003.00|null|null|230035|null|null|101|STRING1|null|null|null|null|PCP|101|6500153.00|null|101.000|101.000|6500153.00|85039.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY52|885257|8123401056|STRING85|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING57|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING58|null|null|null|null|null|124814056|null|STRING1|2140044|2022-01-06|null|null|null|STRING57|COMP1COUNTRY6|null|null|null|null|null|No||2|85|null|null\nnull|null|null|null|VALUE57|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING57|null|AC|null|null|STRING57|2022-01-11|2022-01-06|81351156|VALUE1|null|null|CITY52|888420157|null|2300185|404|null|RER|RCR|XCX|null|null|null|STRING86|null|101|null|null|null|1001|null|null|null|null|null|STRING86|STRING85|STRING85|null|null|15003.00|null|15003.00|null|null|230035|null|null|101|STRING1|null|null|null|null|PCP|101|6500137.00|null|101.000|101.000|6500137.00|85018.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY52|885257|8123401056|STRING85|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING57|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING58|null|null|null|null|null|124814056|null|STRING1|2140044|2022-01-06|null|null|null|STRING57|COMP1COUNTRY6|null|null|null|null|null|No||2|86|null|null\nnull|null|null|null|VALUE58|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING58|null|AC|null|null|STRING58|2022-01-13|2022-01-06|81351157|VALUE1|null|null|CITY53|888420158|null|2300101|404|null|RER|RCR|XCX|null|null|STRING5|STRING87|null|101|null|null|null|1002|null|null|null|null|null|STRING87|STRING88|STRING88|null|null|15003.00|null|null|null|null|230001|null|null|101|STRING1|null|null|null|null|PCP|101|6500126.00|null|101.000|101.000|6500126.00|85025.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY53|885258|8123401057|STRING88|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE2|STRING58|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING59|null|EFL|null|null|null|124814057|null|STRING1|2140045|2022-01-06|null|null|null|STRING58|COMP1COUNTRY3|null|null|null|null|null|No||2|87|null|null\nnull|null|null|null|VALUE59|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING59|null|AC|null|null|STRING59|2022-01-11|2022-01-06|81351158|VALUE1|null|null|CITY54|888420159|null|2300105|404|null|RER|CRC|XCX|null|null|null|STRING88|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING88|STRING89|STRING89|null|null|null|null|null|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500123.00|null|101.000|101.000|6500123.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY54|885259|8123401058|STRING89|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING59|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING60|null|null|null|null|null|124814058|null|STRING1|2140046|2022-01-06|null|null|null|STRING59|COMP1COUNTRY1|null|null|null|null|null|No||2|88|null|null\nnull|null|null|null|VALUE59|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING59|null|AC|null|null|STRING59|2022-01-11|2022-01-06|81351158|VALUE1|null|null|CITY54|888420159|null|2300190|404|null|RER|RCR|XCX|null|null|null|STRING89|null|101|null|null|null|1003|null|null|null|null|null|STRING89|STRING89|STRING89|null|null|15003.00|null|15003.00|null|null|230036|null|null|101|STRING1|null|null|null|null|PCP|101|6500154.00|null|101.000|101.000|6500154.00|85040.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY54|885259|8123401058|STRING89|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING59|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING60|null|null|null|null|null|124814058|null|STRING1|2140046|2022-01-06|null|null|null|STRING59|COMP1COUNTRY1|null|null|null|null|null|No||2|89|null|null\nnull|null|null|null|VALUE59|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING59|null|AC|null|null|STRING59|2022-01-11|2022-01-06|81351158|VALUE1|null|null|CITY54|888420159|null|2300191|404|null|RER|RCR|XCX|null|null|null|STRING90|null|101|null|null|null|1003|null|null|null|null|null|STRING90|STRING89|STRING89|null|null|15003.00|null|15003.00|null|null|230037|null|null|101|STRING1|null|null|null|null|PCP|101|6500154.00|null|101.000|101.000|6500154.00|85040.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY54|885259|8123401058|STRING89|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING59|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING60|null|null|null|null|null|124814058|null|STRING1|2140046|2022-01-06|null|null|null|STRING59|COMP1COUNTRY1|null|null|null|null|null|No||2|90|null|null\nnull|null|null|null|VALUE60|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE2|null|null|2022-01-05|COUNTRY1|null|STRING60|null|AC|null|null|STRING60|2022-01-13|2022-01-06|81351159|VALUE1|null|null|CITY55|888420160|null|2300192|404|null|RER|RCR|XCX|null|null|null|STRING91|null|101|null|null|null|1002|STRING3|null|null|null|null|STRING91|STRING92|STRING92|null|null|null|null|null|null|null|230038|null|null|101|STRING1|null|null|null|null|PCP|101|6500123.00|null|101.000|101.000|6500123.00|null|101.000|101.000|101.000|101.000|STRING3|STRING1|2022-01-06|CITY55|885260|8123401059|STRING92|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING60|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING61|null|null|null|null|null|124814059|null|STRING1|2140047|2022-01-06|null|null|null|STRING60|COMP1COUNTRY1|null|null|null|null|null|No||2|91|null|null\nnull|null|null|null|VALUE61|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING61|null|AC|null|null|STRING61|2022-01-11|2022-01-06|81351160|VALUE1|null|null|CITY20|888420161|null|2300123|404|null|RER|RCR|XCX|null|null|null|STRING92|null|101|null|null|null|1001|null|null|null|null|null|STRING92|STRING93|STRING93|null|null|15003.00|null|15003.00|null|null|230017|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY20|885261|8123401060|STRING93|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING61|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING62|null|null|null|null|null|124814060|null|STRING1|2140048|2022-01-06|null|null|null|STRING61|COMP1COUNTRY1|null|null|null|null|null|No||2|92|null|null\nnull|null|null|null|VALUE61|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE21|null|null|2022-01-05|COUNTRY1|null|STRING61|null|AC|null|null|STRING61|2022-01-11|2022-01-06|81351160|VALUE1|null|null|CITY20|888420161|null|2300129|404|null|RER|RCR|XCX|null|null|null|STRING93|null|101|null|null|null|1002|STRING6|null|null|null|null|STRING93|STRING93|STRING93|null|null|15022.00|null|15022.00|null|null|230019|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85041.00|101.000|101.000|101.000|101.000|STRING8|STRING1|2022-01-06|CITY20|885261|8123401060|STRING93|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING61|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING62|null|null|null|null|null|124814060|null|STRING1|2140048|2022-01-06|null|null|null|STRING61|COMP1COUNTRY1|null|null|null|null|null|No||2|93|null|null\nnull|null|null|null|VALUE63|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING63|null|AC|null|null|STRING63|2022-01-11|2022-01-06|81351162|VALUE1|null|null|CITY57|888420163|null|2300101|404|null|RER|CRC|XCX|null|null|null|STRING95|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING95|STRING96|STRING96|null|null|null|null|null|null|null|230001|null|null|101|STRING1|null|null|null|null|PCP|101|6500134.00|null|101.000|101.000|6500134.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY57|885263|8123401062|STRING96|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING63|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING64|null|null|null|null|null|124814062|null|STRING1|2140044|2022-01-06|null|null|null|STRING63|COMP1COUNTRY1|null|null|STRING1|STRING1|null|Yes||2|95|null|null\nnull|null|null|null|VALUE64|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING64|null|AC|null|null|STRING64|2022-01-11|2022-01-06|81351163|VALUE1|null|null|CITY58|888420164|null|2300165|404|null|RER|RCR|XCX|null|null|null|STRING96|null|101|null|null|null|1002|STRING7|null|null|null|null|STRING96|STRING97|STRING97|null|null|15016.00|null|15016.00|null|null|230031|null|null|101|STRING1|null|null|null|null|PCP|101|6500152.00|null|101.000|101.000|6500152.00|85042.00|101.000|101.000|101.000|101.000|STRING12|STRING1|2022-01-06|CITY58|885264|8123401063|STRING97|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING64|10240401|null|null|null|null|null|null|COMPANY2|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING65|null|null|null|null|null|124814063|null|STRING1|2140050|2022-01-06|null|null|null|STRING64|COMP2COUNTRY3|null|null|null|null|null|No||2|96|null|null\nnull|null|null|null|VALUE64|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING64|null|AC|null|null|STRING64|2022-01-11|2022-01-06|81351163|VALUE1|null|null|CITY58|888420164|null|2300164|404|null|RER|RCR|XCX|null|null|null|STRING97|null|101|null|null|null|1002|STRING7|null|null|null|null|STRING97|STRING97|STRING97|null|null|null|null|null|null|null|230030|null|null|101|STRING1|null|null|null|null|PCP|101|6500150.00|null|101.000|101.000|6500150.00|null|101.000|101.000|101.000|101.000|STRING12|STRING1|2022-01-06|CITY58|885264|8123401063|STRING97|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING64|10240401|null|null|null|null|null|null|COMPANY2|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING65|null|null|null|null|null|124814063|null|STRING1|2140050|2022-01-06|null|null|null|STRING64|COMP2COUNTRY3|null|null|null|null|null|No||2|97|null|null\nnull|null|null|null|VALUE64|10102415|10102415|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING64|null|AC|null|null|STRING64|2022-01-11|2022-01-06|81351163|VALUE1|null|null|CITY58|888420164|null|2300164|404|null|RER|RCR|XCX|null|null|null|STRING98|null|101|null|null|null|1002|STRING7|null|null|null|null|STRING98|STRING97|STRING97|null|null|null|null|null|null|null|230030|null|null|101|STRING1|null|null|null|null|PCP|101|6500150.00|null|101.000|101.000|6500150.00|null|101.000|101.000|101.000|101.000|STRING12|STRING1|2022-01-06|CITY58|885264|8123401063|STRING97|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING64|10240401|null|null|null|null|null|null|COMPANY2|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING65|null|null|null|null|null|124814063|null|STRING1|2140050|2022-01-06|null|null|null|STRING64|COMP2COUNTRY3|null|null|null|null|null|No||2|98|null|null\nnull|null|null|null|VALUE64|10102416|10102416|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING64|null|AC|null|null|STRING64|2022-01-11|2022-01-06|81351163|VALUE1|null|null|CITY58|888420164|null|2300175|404|null|RER|RCR|XCX|null|null|null|STRING99|null|101|null|null|null|1001|STRING7|null|null|null|null|STRING99|STRING97|STRING97|null|null|15011.00|null|15011.00|null|null|230032|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85016.00|101.000|101.000|101.000|101.000|STRING12|STRING1|2022-01-06|CITY58|885264|8123401063|STRING97|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING64|10240401|null|null|null|null|null|null|COMPANY2|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING65|null|null|null|null|null|124814063|null|STRING1|2140050|2022-01-06|null|null|null|STRING64|COMP2COUNTRY3|null|null|null|null|null|No||2|99|null|null\nnull|null|null|null|VALUE65|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING65|null|AC|null|null|STRING65|2022-01-11|2022-01-06|81351164|VALUE1|null|null|CITY59|888420165|null|2300120|404|null|RER|RCR|XCX|null|null|null|STRING100|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING100|STRING101|STRING101|null|null|null|null|null|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|null|101.000|101.000|101.000|101.000|STRING15|STRING1|2022-01-06|CITY59|885265|8123401064|STRING101|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING65|10240401|null|null|null|null|null|null|COMPANY2|COUNTRYAB10|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING66|null|null|null|null|null|124814064|null|STRING1|2140051|2022-01-06|null|null|null|STRING65|COMP2COUNTRY6|null|null|null|null|null|No||2|100|null|null\nnull|null|null|null|VALUE66|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING66|null|AC|null|null|STRING66|2022-01-13|2022-01-06|81351165|VALUE1|null|null|CITY60|888420166|null|2300116|404|null|RER|CRC|XCX|null|null|null|STRING101|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING101|STRING102|STRING102|null|null|15023.00|null|15023.00|null|null|230013|null|null|101|STRING1|null|null|null|null|PCP|101|6500149.00|null|101.000|101.000|6500149.00|85043.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY60|885266|8123401065|STRING102|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING66|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING67|null|null|null|null|null|124814065|null|STRING1|2140052|2022-01-06|null|null|null|STRING66|COMP1COUNTRY1|null|null|null|null|null|No||2|101|null|null\nnull|null|null|null|VALUE67|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE23|null|null|2022-01-05|COUNTRY1|null|STRING67|null|AC|null|null|STRING67|2022-01-11|2022-01-06|81351166|VALUE1|null|null|CITY61|888420167|null|2300120|404|null|RER|RCR|XCX|null|null|null|STRING102|null|101|null|null|null|1001|STRING5|null|null|null|null|STRING102|STRING103|STRING103|null|null|15017.00|null|15017.00|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500126.00|null|101.000|101.000|6500126.00|85044.00|101.000|101.000|101.000|101.000|STRING5|STRING1|2022-01-06|CITY61|885267|8123401066|STRING103|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING67|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING68|null|null|null|null|null|124814066|null|STRING1|2140053|2022-01-06|null|null|null|STRING67|COMP1COUNTRY1|null|null|null|null|null|No||2|102|null|null\nnull|null|null|null|VALUE68|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING68|null|AC|null|null|STRING68|2022-01-11|2022-01-06|81351167|VALUE1|null|null|CITY62|888420168|null|2300140|404|null|RER|CRC|XCX|null|null|null|STRING103|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING103|STRING104|STRING104|null|null|null|null|null|null|null|230025|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY62|885268|8123401067|STRING104|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING68|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING69|null|null|null|null|null|124814067|null|STRING1|2140054|2022-01-06|null|null|null|STRING68|COMP1COUNTRY1|null|null|STRING7|STRING2|null|Yes||2|103|null|null\nnull|null|null|null|VALUE69|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING69|null|AC|null|null|STRING69|2022-01-11|2022-01-06|81351168|VALUE1|null|null|CITY63|888420169|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING104|null|101|null|null|null|1002|null|null|null|null|null|STRING104|STRING105|STRING105|null|null|15003.00|null|15003.00|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85024.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY63|885269|8123401068|STRING105|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING69|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING70|null|null|null|null|null|124814068|null|STRING1|2140055|2022-01-06|null|null|null|STRING69|COMP1COUNTRY1|null|null|null|null|null|No||2|104|null|null\nnull|null|null|null|VALUE70|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING70|METHOD2|AC|null|null|STRING70|2022-01-11|2022-01-06|81351169|VALUE1|null|null|CITY64|888420170|null|2300180|404|null|RER|RCR|XCX|null|null|null|STRING85|null|101|null|null|null|1001|null|null|null|null|null|STRING105|STRING106|STRING106|null|null|15003.00|null|15003.00|null|null|230034|null|null|101|STRING1|null|null|null|null|PCP|101|6500153.00|null|101.000|101.000|6500153.00|85039.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY64|885270|8123401069|STRING106|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING70|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING71|null|null|null|null|null|124814069|null|STRING1|2140056|2022-01-06|null|null|null|STRING70|COMP1COUNTRY1|null|null|null|null|null|No||2|105|null|null\nnull|null|null|null|VALUE71|10102412|10102412|null|3,02E+25|3,02E+25|null|null|XYX|null|AA|2022-01-15|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY2|null|STRING71|null|AA|null|null|STRING71|null|2022-01-06|81351170|VALUE1|null|null|CITY65|888420171|null|2300207|404|null|RER|RCR|XCX|null|null|null|STRING105|null|101|null|null|null|1002|null|null|6024050703|null|null|STRING106|STRING107|STRING107|null|null|15003.00|null|15003.00|null|null|230039|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85004.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY65|885271|8123401070|STRING107|null|2022-01-06|STRING4|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING71|10240410|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING3|STRING3|1923002|1923001|123124061|null|null|null|10349200.00|STRING72|null|null|null|EOD|STRING1|124814070|null|null|null|null|null|null|null|STRING71|null|null|STRING1|null|null|null|No||2|106|null|null\nnull|null|null|null|VALUE72|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AA|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|VALUE2|null|null|2022-01-05|COUNTRY4|null|STRING72|null|AA|null|null|STRING72|2022-01-13|2022-01-06|81351170|VALUE1|null|null|CITY66|888420172|null|2300120|404|null|RER|RCR|XCX|null|null|null|STRING106|null|101|null|null|null|1001|STRING3|null|null|null|null|STRING107|STRING108|STRING108|null|null|null|null|null|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500131.00|null|101.000|101.000|6500131.00|null|101.000|101.000|101.000|101.000|STRING3|STRING1|2022-01-06|CITY66|885272|8123401070|STRING108|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING72|10240410|null|null|null|null|null|null|COMPANY1|COUNTRYAB4|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING73|null|null|null|null|null|124814071|null|null|null|null|null|null|null|STRING72|COMP1COUNTRY4|null|null|null|null|null|No||2|107|null|null\nnull|null|null|null|VALUE73|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE24|null|null|2022-01-05|COUNTRY1|null|STRING73|null|AC|null|null|STRING73|2022-01-13|2022-01-06|81351171|VALUE1|null|null|CITY67|888420173|null|2300121|404|null|RER|RCR|XCX|null|null|null|STRING107|null|101|null|null|null|1001|STRING1|null|null|null|null|STRING108|STRING109|STRING109|null|null|15024.00|null|15024.00|null|null|230016|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|85045.00|101.000|101.000|101.000|101.000|STRING1|STRING1|2022-01-06|CITY67|885273|8123401071|STRING109|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING73|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING74|null|null|null|null|null|124814072|null|STRING1|2140057|2022-01-06|null|null|null|STRING73|COMP1COUNTRY1|null|null|null|null|null|No||2|108|null|null\nnull|null|null|null|VALUE74|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING74|null|AC|null|null|STRING74|2022-01-13|2022-01-06|81351172|VALUE1|null|null|CITY68|888420174|null|2300143|404|null|RER|RCR|XCX|null|null|null|STRING108|null|101|null|null|null|1001|null|null|null|null|null|STRING109|STRING110|STRING110|null|null|15003.00|null|15003.00|null|null|230027|null|null|101|STRING1|null|null|null|null|PCP|101|6500137.00|null|101.000|101.000|6500137.00|85018.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY68|885274|8123401072|STRING110|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING74|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING75|null|null|null|null|null|124814073|null|STRING1|2140058|2022-01-06|null|null|null|STRING74|COMP1COUNTRY1|null|null|null|null|null|No||2|109|null|null\nnull|null|null|null|VALUE75|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING75|null|AC|null|null|STRING75|2022-01-11|2022-01-06|81351173|VALUE1|null|null|CITY69|888420175|null|2300211|404|null|RER|RCR|XCX|null|null|null|STRING109|null|101|null|null|null|1003|null|null|null|null|null|STRING110|STRING111|STRING111|null|null|15003.00|null|15003.00|null|null|230040|null|null|101|STRING1|null|null|null|null|PCP|101|6500155.00|null|101.000|101.000|6500155.00|85046.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY69|885275|8123401073|STRING111|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING75|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING76|null|null|null|null|null|124814074|null|STRING1|2140039|2022-01-06|null|null|null|STRING75|COMP1COUNTRY2|null|null|null|null|null|No||2|110|null|null\nnull|null|null|null|VALUE75|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING75|null|AC|null|null|STRING75|2022-01-11|2022-01-06|81351173|VALUE1|null|null|CITY69|888420175|null|2300211|404|null|RER|RCR|XCX|null|null|null|STRING110|null|101|null|null|null|1003|null|null|null|null|null|STRING111|STRING111|STRING111|null|null|15003.00|null|15003.00|null|null|230040|null|null|101|STRING1|null|null|null|null|PCP|101|6500155.00|null|101.000|101.000|6500155.00|85046.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY69|885275|8123401073|STRING111|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING75|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING76|null|null|null|null|null|124814074|null|STRING1|2140039|2022-01-06|null|null|null|STRING75|COMP1COUNTRY2|null|null|null|null|null|No||2|111|null|null\nnull|null|null|null|VALUE75|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING75|null|AC|null|null|STRING75|2022-01-11|2022-01-06|81351173|VALUE1|null|null|CITY69|888420175|null|2300135|404|null|RER|RCR|XCX|null|null|null|STRING111|null|101|null|null|null|1001|null|null|null|null|null|STRING112|STRING111|STRING111|null|null|15003.00|null|15003.00|null|null|230022|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85027.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY69|885275|8123401073|STRING111|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING75|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING76|null|null|null|null|null|124814074|null|STRING1|2140039|2022-01-06|null|null|null|STRING75|COMP1COUNTRY2|null|null|null|null|null|No||2|112|null|null\nnull|null|null|null|VALUE75|10102416|10102416|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING75|null|AC|null|null|STRING75|2022-01-11|2022-01-06|81351173|VALUE1|null|null|CITY69|888420175|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING112|null|101|null|null|null|1002|null|null|null|null|null|STRING113|STRING111|STRING111|null|null|15003.00|null|15003.00|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500156.00|null|101.000|101.000|6500156.00|85023.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY69|885275|8123401073|STRING111|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING75|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING76|null|null|null|null|null|124814074|null|STRING1|2140039|2022-01-06|null|null|null|STRING75|COMP1COUNTRY2|null|null|null|null|null|No||2|113|null|null\nnull|null|null|null|VALUE75|10102417|10102417|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING75|null|AC|null|null|STRING75|2022-01-11|2022-01-06|81351173|VALUE1|null|null|CITY69|888420175|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING113|null|101|null|null|null|1002|null|null|null|null|null|STRING114|STRING111|STRING111|null|null|15003.00|null|15003.00|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85004.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY69|885275|8123401073|STRING111|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING75|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING76|null|null|null|null|null|124814074|null|STRING1|2140039|2022-01-06|null|null|null|STRING75|COMP1COUNTRY2|null|null|null|null|null|No||2|114|null|null\nnull|null|null|null|VALUE76|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE25|null|null|2022-01-05|COUNTRY2|null|STRING76|null|AC|null|null|STRING76|2022-01-11|2022-01-06|81351174|VALUE1|null|null|CITY70|888420176|null|2300133|404|null|RER|RCR|XCX|null|null|null|STRING114|null|101|null|null|null|1001|STRING1|null|null|null|null|STRING115|STRING116|STRING116|null|null|15017.00|15004.00|15017.00|null|null|230020|null|null|101|STRING1|null|null|null|null|PCP|101|6500148.00|5003.00|101.000|101.000|6500148.00|85032.00|101.000|101.000|101.000|101.000|STRING16|STRING1|2022-01-06|CITY70|885276|8123401074|STRING116|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING76|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING77|null|null|null|null|null|124814075|null|STRING1|2140059|2022-01-06|null|null|null|STRING76|COMP1COUNTRY2|null|null|null|null|null|No||2|115|null|null\nnull|null|null|null|VALUE77|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE26|null|null|2022-01-05|COUNTRY1|null|STRING77|null|AC|null|null|STRING77|2022-01-13|2022-01-06|81351175|VALUE1|null|null|CITY71|888420177|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING115|null|101|null|null|null|1002|STRING1|null|null|null|null|STRING116|STRING117|STRING117|null|null|15024.00|null|15024.00|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|85045.00|101.000|101.000|101.000|101.000|STRING1|STRING1|2022-01-06|CITY71|885277|8123401075|STRING117|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING77|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING78|null|null|null|null|null|124814076|null|STRING1|2140060|2022-01-06|null|null|null|STRING77|COMP1COUNTRY1|null|null|null|null|null|No||2|116|null|null\nnull|null|null|null|VALUE78|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING78|null|AC|null|null|STRING78|2022-01-11|2022-01-06|81351176|VALUE1|null|null|CITY72|888420178|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING116|null|101|null|null|null|1001|null|null|null|null|null|STRING117|STRING118|STRING118|null|null|15003.00|null|15003.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85004.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY72|885278|8123401076|STRING118|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING78|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING79|null|null|null|null|null|124814077|null|STRING1|2140061|2022-01-06|null|null|null|STRING78|COMP1COUNTRY1|null|null|null|null|null|No||2|117|null|null\nnull|null|null|null|VALUE79|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING79|null|AC|null|null|STRING79|2022-01-11|2022-01-06|81351177|VALUE1|null|null|CITY73|888420179|null|2300114|404|null|RER|RCR|XCX|null|null|null|STRING5|null|101|null|null|null|1001|null|null|null|null|null|STRING118|STRING119|STRING119|null|null|15003.00|null|15003.00|null|null|230011|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85004.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY73|885279|8123401077|STRING119|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING79|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING80|null|null|null|null|null|124814078|null|STRING1|2140062|2022-01-06|null|null|null|STRING79|COMP1COUNTRY1|null|null|null|null|null|No||2|118|null|null\nnull|null|null|null|VALUE80|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING80|null|AC|null|null|STRING80|2022-01-11|2022-01-06|81351178|VALUE1|null|null|CITY74|888420180|null|2300140|404|null|RER|RCR|XCX|null|null|null|STRING117|null|101|null|null|null|1002|null|null|null|null|null|STRING119|STRING120|STRING120|null|null|15003.00|null|15003.00|null|null|230025|null|null|101|STRING1|null|null|null|null|PCP|101|6500145.00|null|101.000|101.000|6500145.00|85029.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY74|885280|8123401078|STRING120|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING80|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING81|null|null|null|null|null|124814079|null|STRING1|2140063|2022-01-06|null|null|null|STRING80|COMP1COUNTRY2|null|null|null|null|null|No||2|119|null|null\nnull|null|null|null|VALUE80|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING80|null|AC|null|null|STRING80|2022-01-11|2022-01-06|81351178|VALUE1|null|null|CITY74|888420180|null|2300140|404|null|RER|RCR|XCX|null|null|null|STRING118|null|101|null|null|null|1002|null|null|null|null|null|STRING120|STRING120|STRING120|null|null|15003.00|15005.00|15003.00|null|null|230025|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|5004.00|101.000|101.000|6500140.00|85024.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY74|885280|8123401078|STRING120|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING80|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING81|null|null|null|null|null|124814079|null|STRING1|2140063|2022-01-06|null|null|null|STRING80|COMP1COUNTRY2|null|null|null|null|null|No||2|120|null|null\nnull|null|null|null|VALUE80|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING80|null|AC|null|null|STRING80|2022-01-11|2022-01-06|81351178|VALUE1|null|null|CITY74|888420180|null|2300140|404|null|RER|RCR|XCX|null|null|null|STRING119|null|101|null|null|null|1002|null|null|null|null|null|STRING121|STRING120|STRING120|null|null|15003.00|15005.00|15003.00|null|null|230025|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|5005.00|101.000|101.000|6500140.00|85024.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY74|885280|8123401078|STRING120|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING80|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING81|null|null|null|null|null|124814079|null|STRING1|2140063|2022-01-06|null|null|null|STRING80|COMP1COUNTRY2|null|null|null|null|null|No||2|121|null|null\nnull|null|null|null|VALUE81|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE27|null|null|2022-01-05|COUNTRY1|null|STRING81|null|AC|null|null|STRING81|2022-01-11|2022-01-06|81351179|VALUE1|null|null|CITY75|888420181|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING120|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING122|STRING123|STRING123|null|null|15025.00|null|15025.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500141.00|null|101.000|101.000|6500141.00|85047.00|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY75|885281|8123401079|STRING123|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING81|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING82|null|null|null|null|null|124814080|null|STRING1|2140013|2022-01-06|null|null|null|STRING81|COMP1COUNTRY1|null|null|null|null|null|No||2|122|null|null\nnull|null|null|null|VALUE81|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE27|null|null|2022-01-05|COUNTRY1|null|STRING81|null|AC|null|null|STRING81|2022-01-11|2022-01-06|81351179|VALUE1|null|null|CITY75|888420181|null|2300224|404|null|RER|RCR|XCX|null|null|null|STRING121|null|101|null|null|null|1003|STRING6|null|null|null|null|STRING123|STRING123|STRING123|null|null|null|null|null|null|null|230041|null|null|101|STRING1|null|null|null|null|PCP|101|6500157.00|null|101.000|101.000|6500157.00|null|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY75|885281|8123401079|STRING123|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING81|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING82|null|null|null|null|null|124814080|null|STRING1|2140013|2022-01-06|null|null|null|STRING81|COMP1COUNTRY1|null|null|null|null|null|No||2|123|null|null\nnull|null|null|null|VALUE81|10102416|10102416|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE27|null|null|2022-01-05|COUNTRY1|null|STRING81|null|AC|null|null|STRING81|2022-01-11|2022-01-06|81351179|VALUE1|null|null|CITY75|888420181|null|2300225|404|null|RER|RCR|XCX|null|null|null|STRING122|null|101|null|null|null|1002|STRING6|null|null|null|null|STRING124|STRING123|STRING123|null|null|null|null|null|null|null|230042|null|null|101|STRING1|null|null|null|null|PCP|101|6500123.00|null|101.000|101.000|6500123.00|null|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY75|885281|8123401079|STRING123|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING81|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING82|null|null|null|null|null|124814080|null|STRING1|2140013|2022-01-06|null|null|null|STRING81|COMP1COUNTRY1|null|null|null|null|null|No||2|124|null|null\nnull|null|null|null|VALUE81|10102417|10102417|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE27|null|null|2022-01-05|COUNTRY1|null|STRING81|null|AC|null|null|STRING81|2022-01-11|2022-01-06|81351179|VALUE1|null|null|CITY75|888420181|null|2300225|404|null|RER|RCR|XCX|null|null|null|STRING123|null|101|null|null|null|1002|STRING6|null|null|null|null|STRING125|STRING123|STRING123|null|null|15024.00|null|15024.00|null|null|230042|null|null|101|STRING1|null|null|null|null|PCP|101|6500148.00|null|101.000|101.000|6500148.00|85048.00|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY75|885281|8123401079|STRING123|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING81|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING82|null|null|null|null|null|124814080|null|STRING1|2140013|2022-01-06|null|null|null|STRING81|COMP1COUNTRY1|null|null|null|null|null|No||2|125|null|null\nnull|null|null|null|VALUE81|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING81|null|AC|null|null|STRING81|2022-01-11|2022-01-06|81351179|VALUE1|null|null|CITY75|888420181|null|2300224|404|null|RER|RCR|XCX|null|null|null|STRING124|null|101|null|null|null|1003|null|null|null|null|null|STRING126|STRING123|STRING123|null|null|15003.00|null|15003.00|null|null|230041|null|null|101|STRING1|null|null|null|null|PCP|101|6500157.00|null|101.000|101.000|6500157.00|85049.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY75|885281|8123401079|STRING123|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING81|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING82|null|null|null|null|null|124814080|null|STRING1|2140013|2022-01-06|null|null|null|STRING81|COMP1COUNTRY1|null|null|null|null|null|No||2|126|null|null\nnull|null|null|null|VALUE82|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING82|null|AC|null|null|STRING82|2022-01-11|2022-01-06|81351180|VALUE1|null|null|CITY76|888420182|null|2300185|404|null|RER|CRC|XCX|null|null|null|STRING125|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING127|STRING128|STRING128|null|null|15014.00|null|15014.00|null|null|230035|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85023.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY76|885282|8123401080|STRING128|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING82|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING83|null|null|null|null|null|124814081|null|STRING1|2140064|2022-01-06|null|null|null|STRING82|COMP1COUNTRY1|null|null|null|null|null|No||2|127|null|null\nnull|null|null|null|VALUE83|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING83|null|AC|null|null|STRING83|2022-01-11|2022-01-06|81351181|VALUE1|null|null|CITY77|888420183|null|2300121|404|null|RER|RCR|XCX|null|null|null|STRING126|null|101|null|null|null|1002|null|null|null|null|null|STRING128|STRING129|STRING129|null|null|15003.00|null|15003.00|null|null|230016|null|null|101|STRING1|null|null|null|null|PCP|101|6500123.00|null|101.000|101.000|6500123.00|85016.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY77|885283|8123401081|STRING129|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING83|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING84|null|null|null|null|null|124814082|null|STRING1|2140011|2022-01-06|null|null|null|STRING83|COMP1COUNTRY1|null|null|null|null|null|No||2|128|null|null\nnull|null|null|null|VALUE84|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING84|null|AC|null|null|STRING84|2022-01-11|2022-01-06|81351182|VALUE1|null|null|CITY78|888420184|null|2300103|404|null|RER|CRC|XCX|null|null|null|STRING127|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING129|STRING130|STRING130|null|null|null|null|null|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500126.00|null|101.000|101.000|6500126.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY78|885284|8123401082|STRING130|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING84|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING85|null|null|null|null|null|124814083|null|STRING1|2140065|2022-01-06|null|null|null|STRING84|COMP1COUNTRY1|null|null|STRING7|STRING2|null|Yes||2|129|null|null\nnull|null|null|null|VALUE85|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-15|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING85|null|AC|null|null|STRING85|null|2022-01-06|81351183|VALUE1|null|null|CITY79|888420185|null|2300135|404|null|RER|RCR|XCX|null|null|null|STRING128|null|101|null|null|null|1001|null|null|null|null|null|STRING130|STRING131|STRING131|null|null|15003.00|null|15003.00|null|null|230022|null|null|101|STRING1|null|null|null|null|PCP|101|6500131.00|null|101.000|101.000|6500131.00|85050.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY79|885285|8123401083|STRING131|null|2022-01-06|STRING6|10020.000|STRING2|STORE1|STRING1|TYPE3|STRING85|10240402|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|null|1923002|1923001|null|null|null|null|10349200.00|STRING86|null|null|null|null|null|124814084|null|STRING1|2140033|2022-01-06|null|null|null|STRING85|COMP1COUNTRY1|null|null|null|null|null|No||2|130|null|null\nnull|null|null|null|VALUE85|10102413|10102413|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING85|null|AC|null|null|STRING85|2022-01-10|2022-01-06|81351183|VALUE1|null|null|CITY79|888420185|null|2300232|404|null|RER|RCR|XCX|null|null|STRING6|STRING30|null|101|null|null|null|1002|null|null|null|null|null|STRING131|STRING131|STRING131|null|null|15003.00|null|15003.00|null|null|230043|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85004.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY79|885285|8123401083|STRING131|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE2|STRING85|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING87|null|EFL|null|null|null|124814084|null|STRING1|2140033|2022-01-06|null|null|null|STRING85|COMP1COUNTRY1|null|null|null|null|null|No||2|131|null|null\nnull|null|null|null|VALUE86|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING86|null|AC|null|null|STRING86|2022-01-11|2022-01-06|81351184|VALUE1|null|null|CITY80|888420186|null|2300233|404|null|RER|RCR|XCX|null|null|null|STRING129|null|101|null|null|null|1001|null|null|null|null|null|STRING132|STRING133|STRING133|null|null|15003.00|null|15003.00|null|null|230044|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85027.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY80|885286|8123401084|STRING133|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING86|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING88|null|null|null|null|null|124814085|null|STRING1|2140066|2022-01-06|null|null|null|STRING86|COMP1COUNTRY1|null|null|null|null|null|No||2|132|null|null\nnull|null|null|null|VALUE86|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING86|null|AC|null|null|STRING86|2022-01-11|2022-01-06|81351184|VALUE1|null|null|CITY80|888420186|null|2300234|404|null|RER|RCR|XCX|null|null|null|STRING130|null|101|null|null|null|1001|null|null|null|null|null|STRING133|STRING133|STRING133|null|null|15003.00|null|15003.00|null|null|230045|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85027.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY80|885286|8123401084|STRING133|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING86|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING88|null|null|null|null|null|124814085|null|STRING1|2140066|2022-01-06|null|null|null|STRING86|COMP1COUNTRY1|null|null|null|null|null|No||2|133|null|null\nnull|null|null|null|VALUE86|10102415|10102415|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING86|null|AC|null|null|STRING86|2022-01-11|2022-01-06|81351184|VALUE1|null|null|CITY80|888420186|null|2300142|404|null|RER|CRC|XCX|null|null|null|STRING131|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING134|STRING133|STRING133|null|null|null|null|null|null|null|230026|null|null|101|STRING1|null|null|null|null|PCP|101|6500145.00|null|101.000|101.000|6500145.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY80|885286|8123401084|STRING133|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING86|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING88|null|null|null|null|null|124814085|null|STRING1|2140066|2022-01-06|null|null|null|STRING86|COMP1COUNTRY1|null|null|null|null|null|No||2|134|null|null\nnull|null|null|null|VALUE87|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING87|null|AC|null|null|STRING87|2022-01-13|2022-01-06|81351185|VALUE1|null|null|CITY81|888420187|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING132|null|101|null|null|null|1002|null|null|null|null|null|STRING135|STRING136|STRING136|null|null|15003.00|null|15003.00|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500150.00|null|101.000|101.000|6500150.00|85034.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY81|885287|8123401085|STRING136|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING87|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING89|null|null|null|null|null|124814086|null|STRING1|2140067|2022-01-06|null|null|null|STRING87|COMP1COUNTRY1|null|null|null|null|null|No||2|135|null|null\nnull|null|null|null|VALUE88|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY4|null|STRING88|null|AC|null|null|STRING88|2022-01-11|2022-01-06|81351186|VALUE1|null|null|CITY82|888420188|null|2300111|404|null|RER|RCR|XCX|null|null|null|STRING133|null|101|null|null|null|1001|null|null|null|null|null|STRING136|STRING137|STRING137|null|null|15003.00|null|15003.00|null|null|230008|null|null|101|STRING1|null|null|null|null|PCP|101|6500158.00|null|101.000|101.000|6500158.00|85051.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY82|885288|8123401086|STRING137|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING88|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB4|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING90|null|null|null|null|null|124814087|null|STRING1|2140068|2022-01-06|null|null|null|STRING88|COMP1COUNTRY4|null|null|null|null|null|No||2|136|null|null\nnull|null|null|null|VALUE89|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING89|null|AC|null|null|STRING89|2022-01-11|2022-01-06|81351187|VALUE1|null|null|CITY83|888420189|null|2300116|404|null|RER|RCR|XCX|null|null|null|STRING134|null|101|null|null|null|1001|null|null|null|null|null|STRING137|STRING138|STRING138|null|null|15003.00|null|15003.00|null|null|230013|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY83|885289|8123401087|STRING138|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING89|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING91|null|null|null|null|null|124814088|null|STRING1|2140069|2022-01-06|null|null|null|STRING89|COMP1COUNTRY1|null|null|null|null|null|No||2|137|null|null\nnull|null|null|null|VALUE90|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING90|METHOD7|AC|null|null|STRING90|2022-01-13|2022-01-06|81351188|VALUE1|null|null|CITY84|888420190|null|2300128|404|null|RER|CRC|XCX|null|null|null|STRING135|null|101|null|null|null|1003|STRING2|null|null|null|null|STRING138|STRING139|STRING139|null|null|null|null|null|null|null|230018|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY84|885290|8123401088|STRING139|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING90|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING92|null|null|null|null|null|124814089|null|STRING1|2140070|2022-01-06|null|null|null|STRING90|COMP1COUNTRY1|null|null|null|null|null|No||2|138|null|null\nnull|null|null|null|VALUE91|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE28|null|null|2022-01-05|COUNTRY4|null|STRING91|null|AC|null|null|STRING91|2022-01-11|2022-01-06|81351189|VALUE1|null|null|CITY85|888420191|null|2300240|404|null|RER|RCR|XCX|null|null|null|STRING136|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING139|STRING140|STRING140|null|null|15026.00|null|15026.00|null|null|230046|null|null|101|STRING1|null|null|null|null|PCP|101|6500136.00|null|101.000|101.000|6500136.00|85052.00|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY85|885291|8123401089|STRING140|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING91|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB4|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING93|null|null|null|null|null|124814090|null|STRING1|2140071|2022-01-06|null|null|null|STRING91|COMP1COUNTRY4|null|null|null|null|null|No||2|139|null|null\nnull|null|null|null|VALUE91|10102413|10102413|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE28|null|null|2022-01-05|COUNTRY4|null|STRING91|null|AC|null|null|STRING91|2022-01-11|2022-01-06|81351189|VALUE1|null|null|CITY85|888420191|null|2300120|404|null|RER|RCR|XCX|null|null|null|STRING137|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING140|STRING140|STRING140|null|null|15026.00|null|15026.00|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500136.00|null|101.000|101.000|6500136.00|85052.00|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY85|885291|8123401089|STRING140|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING91|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB4|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING93|null|null|null|null|null|124814090|null|STRING1|2140071|2022-01-06|null|null|null|STRING91|COMP1COUNTRY4|null|null|null|null|null|No||2|140|null|null\nnull|null|null|null|VALUE92|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE29|null|null|2022-01-05|COUNTRY1|null|STRING92|null|AC|null|null|STRING92|2022-01-13|2022-01-06|81351190|VALUE1|null|null|CITY86|888420192|null|2300120|404|null|RER|RCR|XCX|null|null|null|STRING138|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING141|STRING142|STRING142|null|null|null|null|null|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|null|101.000|101.000|101.000|101.000|STRING8|STRING1|2022-01-06|CITY86|885292|8123401090|STRING142|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING92|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING94|null|null|null|null|null|124814091|null|STRING1|2140072|2022-01-06|null|null|null|STRING92|COMP1COUNTRY1|null|null|null|null|null|No||2|141|null|null\nnull|null|null|null|VALUE93|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING93|null|AC|null|null|STRING93|2022-01-11|2022-01-06|81351191|VALUE1|null|null|CITY46|888420193|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING139|null|101|null|null|null|1002|null|null|null|null|null|STRING142|STRING143|STRING143|null|null|15003.00|null|15003.00|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85027.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY46|885293|8123401091|STRING143|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING93|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING95|null|null|null|null|null|124814092|null|STRING1|2140028|2022-01-06|null|null|null|STRING93|COMP1COUNTRY3|null|null|null|null|null|No||2|142|null|null\nnull|null|null|null|VALUE93|10102413|10102413|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING93|null|AC|null|null|STRING93|2022-01-11|2022-01-06|81351191|VALUE1|null|null|CITY46|888420193|null|2300113|404|null|RER|RCR|XCX|null|null|null|STRING140|null|101|null|null|null|1002|null|null|null|null|null|STRING143|STRING143|STRING143|null|null|15003.00|null|15003.00|null|null|230010|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85027.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY46|885293|8123401091|STRING143|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING93|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING95|null|null|null|null|null|124814092|null|STRING1|2140028|2022-01-06|null|null|null|STRING93|COMP1COUNTRY3|null|null|null|null|null|No||2|143|null|null\nnull|null|null|null|VALUE93|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING93|null|AC|null|null|STRING93|2022-01-11|2022-01-06|81351191|VALUE1|null|null|CITY46|888420193|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING141|null|101|null|null|null|1002|null|null|null|null|null|STRING144|STRING143|STRING143|null|null|15003.00|null|15003.00|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85024.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY46|885293|8123401091|STRING143|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING93|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING95|null|null|null|null|null|124814092|null|STRING1|2140028|2022-01-06|null|null|null|STRING93|COMP1COUNTRY3|null|null|null|null|null|No||2|144|null|null\nnull|null|null|null|VALUE93|10102416|10102416|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING93|null|AC|null|null|STRING93|2022-01-11|2022-01-06|81351191|VALUE1|null|null|CITY46|888420193|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING142|null|101|null|null|null|1002|null|null|null|null|null|STRING145|STRING143|STRING143|null|null|15003.00|null|15003.00|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85024.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY46|885293|8123401091|STRING143|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING93|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING95|null|null|null|null|null|124814092|null|STRING1|2140028|2022-01-06|null|null|null|STRING93|COMP1COUNTRY3|null|null|null|null|null|No||2|145|null|null\nnull|null|null|null|VALUE93|10102417|10102417|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE2|null|null|2022-01-05|COUNTRY3|null|STRING93|null|AC|null|null|STRING93|2022-01-11|2022-01-06|81351191|VALUE1|null|null|CITY46|888420193|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING143|null|101|null|null|null|1002|STRING3|null|null|null|null|STRING146|STRING143|STRING143|null|null|null|null|null|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500156.00|null|101.000|101.000|6500156.00|null|101.000|101.000|101.000|101.000|STRING3|STRING1|2022-01-06|CITY46|885293|8123401091|STRING143|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING93|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING95|null|null|null|null|null|124814092|null|STRING1|2140028|2022-01-06|null|null|null|STRING93|COMP1COUNTRY3|null|null|null|null|null|No||2|146|null|null\nnull|null|null|null|VALUE93|10102418|10102418|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY3|null|STRING93|null|AC|null|null|STRING93|2022-01-11|2022-01-06|81351191|VALUE1|null|null|CITY46|888420193|null|2300112|404|null|RER|CRC|XCX|null|null|null|STRING144|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING147|STRING143|STRING143|null|null|null|null|null|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY46|885293|8123401091|STRING143|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING93|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING95|null|null|null|null|null|124814092|null|STRING1|2140028|2022-01-06|null|null|null|STRING93|COMP1COUNTRY3|null|null|null|null|null|No||2|147|null|null\nnull|null|null|null|VALUE94|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE2|null|null|2022-01-05|COUNTRY1|null|STRING94|null|AC|null|null|STRING94|2022-01-10|2022-01-06|81351192|VALUE1|null|null|CITY44|888420194|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING145|null|101|null|null|null|1001|STRING3|null|null|null|null|STRING148|STRING149|STRING149|null|null|15027.00|null|15027.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500148.00|null|101.000|101.000|6500148.00|85053.00|101.000|101.000|101.000|101.000|STRING3|STRING1|2022-01-06|CITY44|885294|8123401092|STRING149|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING94|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING96|null|null|null|null|null|124814093|null|STRING1|2140073|2022-01-06|null|null|null|STRING94|COMP1COUNTRY1|null|null|null|null|null|No||2|148|null|null\nnull|null|null|null|VALUE95|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY2|null|STRING95|null|AC|null|null|STRING95|2022-01-11|2022-01-06|81351193|VALUE1|null|null|CITY87|888420195|null|2300240|404|null|RER|RCR|XCX|null|null|null|STRING146|null|101|null|null|null|1001|null|null|null|null|null|STRING149|STRING150|STRING150|null|null|15003.00|null|15003.00|null|null|230046|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY87|885295|8123401093|STRING150|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING95|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING97|null|null|null|null|null|124814094|null|STRING1|2140074|2022-01-06|null|null|null|STRING95|COMP1COUNTRY2|null|null|null|null|null|No||2|149|null|null\nnull|null|null|null|VALUE97|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING97|null|AC|null|null|STRING97|2022-01-10|2022-01-06|81351195|VALUE1|null|null|CITY88|888420197|null|2300165|404|null|RER|CRC|XCX|null|null|null|STRING150|null|101|null|null|null|1002|STRING9|null|null|null|null|STRING153|STRING154|STRING154|null|null|null|null|null|null|null|230031|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY88|885297|8123401095|STRING154|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING97|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING99|null|null|null|null|null|124814096|null|STRING1|2140076|2022-01-06|null|null|null|STRING97|COMP1COUNTRY6|null|null|null|null|null|No||2|153|null|null\nnull|null|null|null|VALUE97|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING97|null|AC|null|null|STRING97|2022-01-10|2022-01-06|81351195|VALUE1|null|null|CITY88|888420197|null|2300133|404|null|RER|CRC|XCX|null|null|null|STRING151|null|101|null|null|null|1002|STRING9|null|null|null|null|STRING154|STRING154|STRING154|null|null|15024.00|null|15024.00|null|null|230020|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|85045.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY88|885297|8123401095|STRING154|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING97|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING99|null|null|null|null|null|124814096|null|STRING1|2140076|2022-01-06|null|null|null|STRING97|COMP1COUNTRY6|null|null|null|null|null|No||2|154|null|null\nnull|null|null|null|VALUE98|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING98|null|AC|null|null|STRING98|2022-01-11|2022-01-06|81351196|VALUE1|null|null|CITY89|888420198|null|2300135|404|null|RER|RCR|XCX|null|null|null|STRING152|null|101|null|null|null|1001|null|null|null|null|null|STRING155|STRING156|STRING156|null|null|15003.00|null|15003.00|null|null|230022|null|null|101|STRING1|null|null|null|null|PCP|101|6500159.00|null|101.000|101.000|6500159.00|85054.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY89|885298|8123401096|STRING156|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING98|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING100|null|null|null|null|null|124814097|null|STRING1|2140077|2022-01-06|null|null|null|STRING98|COMP1COUNTRY1|null|null|null|null|null|No||2|155|null|null\nnull|null|null|null|VALUE99|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING99|null|AC|null|null|STRING99|2022-01-11|2022-01-06|81351197|VALUE1|null|null|CITY89|888420199|null|2300180|404|null|RER|CRC|XCX|null|null|null|STRING153|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING156|STRING157|STRING157|null|null|null|null|null|null|null|230034|null|null|101|STRING1|null|null|null|null|PCP|101|6500160.00|null|101.000|101.000|6500160.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY89|885299|8123401097|STRING157|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING99|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING101|null|null|null|null|null|124814098|null|STRING1|2140078|2022-01-06|null|null|null|STRING99|COMP1COUNTRY1|null|null|STRING1|STRING1|null|Yes||2|156|null|null\nnull|null|null|null|VALUE99|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING99|null|AC|null|null|STRING99|2022-01-11|2022-01-06|81351197|VALUE1|null|null|CITY89|888420199|null|2300114|404|null|RER|CRC|XCX|null|null|null|STRING154|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING157|STRING157|STRING157|null|null|null|null|null|null|null|230011|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY89|885299|8123401097|STRING157|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING99|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING101|null|null|null|null|null|124814098|null|STRING1|2140078|2022-01-06|null|null|null|STRING99|COMP1COUNTRY1|null|null|STRING1|STRING1|null|Yes||2|157|null|null\nnull|null|null|null|VALUE99|10102415|10102415|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING99|null|AC|null|null|STRING99|2022-01-11|2022-01-06|81351197|VALUE1|null|null|CITY89|888420199|null|2300114|404|null|RER|CRC|XCX|null|null|null|STRING155|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING158|STRING157|STRING157|null|null|null|null|null|null|null|230011|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY89|885299|8123401097|STRING157|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING99|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING101|null|null|null|null|null|124814098|null|STRING1|2140078|2022-01-06|null|null|null|STRING99|COMP1COUNTRY1|null|null|STRING1|STRING1|null|Yes||2|158|null|null\nnull|null|null|null|VALUE100|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING100|null|AC|null|null|STRING100|2022-01-10|2022-01-06|81351198|VALUE1|null|null|CITY27|888420200|null|2300128|404|null|RER|CRC|XCX|null|null|null|STRING156|null|101|null|null|null|1003|STRING2|null|null|null|null|STRING159|STRING160|STRING160|null|null|null|null|null|null|null|230018|null|null|101|STRING1|null|null|null|null|PCP|101|6500152.00|null|101.000|101.000|6500152.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY27|885300|8123401098|STRING160|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING100|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING102|null|null|null|null|null|124814099|null|STRING1|2140079|2022-01-06|null|null|null|STRING100|COMP1COUNTRY1|null|null|STRING1|STRING1|STRING4|Yes||2|159|null|null\nnull|null|null|null|VALUE100|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE30|null|null|2022-01-05|COUNTRY1|null|STRING100|null|AC|null|null|STRING100|2022-01-10|2022-01-06|81351198|VALUE1|null|null|CITY27|888420200|null|2300108|404|null|RER|RCR|XCX|null|null|null|STRING157|null|101|null|null|null|1002|STRING1|null|null|null|null|STRING160|STRING160|STRING160|null|null|null|null|null|null|null|230006|null|null|101|STRING1|null|null|null|null|PCP|101|6500137.00|null|101.000|101.000|6500137.00|null|101.000|101.000|101.000|101.000|STRING1|STRING1|2022-01-06|CITY27|885300|8123401098|STRING160|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING100|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING102|null|null|null|null|null|124814099|null|STRING1|2140079|2022-01-06|null|null|null|STRING100|COMP1COUNTRY1|null|null|null|null|null|No||2|160|null|null\nnull|null|null|null|VALUE101|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AA|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|VALUE31|null|null|2022-01-05|COUNTRY2|null|STRING101|null|AA|null|null|STRING101|2022-01-13|2022-01-06|81351199|VALUE1|null|null|CITY90|888420201|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING158|null|101|null|null|null|1001|STRING1|null|null|null|null|STRING161|STRING162|STRING162|null|null|15028.00|null|15028.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500131.00|null|101.000|101.000|6500131.00|85055.00|101.000|101.000|101.000|101.000|STRING16|STRING1|2022-01-06|CITY90|885301|8123401099|STRING162|null|2022-01-06|STRING4|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING101|10240411|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING103|null|null|null|null|null|124814100|null|null|null|null|null|null|null|STRING101|COMP1COUNTRY2|null|null|null|null|null|No||2|161|null|null\nnull|null|null|null|VALUE102|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING102|null|AC|null|null|STRING102|2022-01-13|2022-01-06|81351200|VALUE1|null|null|CITY91|888420202|null|2300140|404|null|RER|CRC|XCX|null|null|null|STRING159|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING162|STRING163|STRING163|null|null|null|null|null|null|null|230025|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY91|885302|8123401100|STRING163|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING102|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING104|null|null|null|null|null|124814101|null|STRING1|2140080|2022-01-06|null|null|null|STRING102|COMP1COUNTRY1|null|null|null|null|null|No||2|162|null|null\nnull|null|null|null|VALUE102|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING102|null|AC|null|null|STRING102|2022-01-13|2022-01-06|81351200|VALUE1|null|null|CITY91|888420202|null|2300121|404|null|RER|CRC|XCX|null|null|null|STRING160|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING163|STRING163|STRING163|null|null|null|null|null|null|null|230016|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY91|885302|8123401100|STRING163|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING102|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING104|null|null|null|null|null|124814101|null|STRING1|2140080|2022-01-06|null|null|null|STRING102|COMP1COUNTRY1|null|null|null|null|null|No||2|163|null|null\nnull|null|null|null|VALUE104|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING104|null|AC|null|null|STRING104|2022-01-11|2022-01-06|81351202|VALUE1|null|null|CITY93|888420204|null|2300240|404|null|RER|CRC|XCX|null|null|null|STRING162|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING165|STRING166|STRING166|null|null|15002.00|null|15002.00|null|null|230046|null|null|101|STRING1|null|null|null|null|PCP|101|6500126.00|null|101.000|101.000|6500126.00|85003.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY93|885304|8123401102|STRING166|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING104|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING106|null|null|null|null|null|124814103|null|STRING1|2140082|2022-01-06|null|null|null|STRING104|COMP1COUNTRY1|null|null|null|null|null|No||2|165|null|null\nnull|null|null|null|VALUE105|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-15|null|null|null|null|2022-01-05|3901235|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY1|null|STRING105|null|AA|null|null|STRING105|null|2022-01-06|81351199|VALUE1|null|null|CITY94|888420205|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING163|null|101|null|null|null|1001|null|null|null|null|null|STRING166|STRING167|STRING167|null|null|15003.00|null|15003.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING2|2022-01-06|CITY94|885305|8123401099|STRING167|null|2022-01-06|STRING4|10020.000|STRING4|STORE1|STRING3|TYPE1|STRING105|10240412|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING2|null|1923002|1923001|null|null|null|null|10349200.00|null|null|null|null|null|null|124814104|3059001|null|null|null|null|null|null|STRING105|COMP1COUNTRY1|null|null|null|null|null|No||2|166|null|null\nnull|null|null|null|VALUE105|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-15|null|null|null|null|2022-01-05|3901235|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY1|null|STRING105|null|AA|null|null|STRING105|null|2022-01-06|81351199|VALUE1|null|null|CITY94|888420205|null|2300120|404|null|RER|RCR|XCX|null|null|null|STRING163|null|101|null|null|null|1001|null|null|null|null|null|STRING167|STRING167|STRING167|null|null|15003.00|null|15003.00|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING2|2022-01-06|CITY94|885305|8123401099|STRING167|null|2022-01-06|STRING4|10020.000|STRING4|STORE1|STRING3|TYPE1|STRING105|10240412|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING2|null|1923002|1923001|null|null|null|null|10349200.00|null|null|null|null|null|null|124814104|3059001|null|null|null|null|null|null|STRING105|COMP1COUNTRY1|null|null|null|null|null|No||2|167|null|null\nnull|null|null|null|VALUE105|10102415|10102415|null|3,02E+25|3,02E+25|null|null| 
|null|AA|2022-01-15|null|null|null|null|2022-01-05|3901235|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY1|null|STRING105|null|AA|null|null|STRING105|null|2022-01-06|81351199|VALUE1|null|null|CITY94|888420205|null|2300269|404|null|RER|RCR|XCX|null|null|null|STRING163|null|101|null|null|null|1001|null|null|null|null|null|STRING168|STRING167|STRING167|null|null|15003.00|null|15003.00|null|null|230047|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING2|2022-01-06|CITY94|885305|8123401099|STRING167|null|2022-01-06|STRING4|10020.000|STRING4|STORE1|STRING3|TYPE1|STRING105|10240412|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING2|null|1923002|1923001|null|null|null|null|10349200.00|null|null|null|null|null|null|124814104|3059001|null|null|null|null|null|null|STRING105|COMP1COUNTRY1|null|null|null|null|null|No||2|168|null|null\nnull|null|null|null|VALUE105|10102416|10102416|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-15|null|null|null|null|2022-01-05|3901235|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY1|null|STRING105|null|AA|null|null|STRING105|null|2022-01-06|81351199|VALUE1|null|null|CITY94|888420205|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING164|null|101|null|null|null|1001|null|null|null|null|null|STRING169|STRING167|STRING167|null|null|15003.00|null|15003.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING2|2022-01-06|CITY94|885305|8123401099|STRING167|null|2022-01-06|STRING4|10020.000|STRING4|STORE1|STRING3|TYPE1|STRING105|10240412|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING2|null|1923002|1923001|null|null|null|null|10349200.00|null|null|null|null|null|null|124814104|3059001|null|null|null|null|null|null|STRING105|COMP1COUNTRY1|null|null|null|null|null|No||2|169|null|null\nnull|null|null|null|VALUE105|10102417|10102417|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-15|null|null|null|null|2022-01-05|3901235|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY1|null|STRING105|null|AA|null|null|STRING105|null|2022-01-06|81351199|VALUE1|null|null|CITY94|888420205|null|2300120|404|null|RER|RCR|XCX|null|null|null|STRING164|null|101|null|null|null|1001|null|null|null|null|null|STRING170|STRING167|STRING167|null|null|15003.00|null|15003.00|null|null|230015|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING2|2022-01-06|CITY94|885305|8123401099|STRING167|null|2022-01-06|STRING4|10020.000|STRING4|STORE1|STRING3|TYPE1|STRING105|10240412|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING2|null|1923002|1923001|null|null|null|null|10349200.00|null|null|null|null|null|null|124814104|3059001|null|null|null|null|null|null|STRING105|COMP1COUNTRY1|null|null|null|null|null|No||2|170|null|null\nnull|null|null|null|VALUE105|10102418|10102418|null|3,02E+25|3,02E+25|null|null| 
|null|AA|2022-01-15|null|null|null|null|2022-01-05|3901235|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY1|null|STRING105|null|AA|null|null|STRING105|null|2022-01-06|81351199|VALUE1|null|null|CITY94|888420205|null|2300269|404|null|RER|RCR|XCX|null|null|null|STRING164|null|101|null|null|null|1001|null|null|null|null|null|STRING171|STRING167|STRING167|null|null|15003.00|null|15003.00|null|null|230047|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING2|2022-01-06|CITY94|885305|8123401099|STRING167|null|2022-01-06|STRING4|10020.000|STRING4|STORE1|STRING3|TYPE1|STRING105|10240412|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING2|null|1923002|1923001|null|null|null|null|10349200.00|null|null|null|null|null|null|124814104|3059001|null|null|null|null|null|null|STRING105|COMP1COUNTRY1|null|null|null|null|null|No||2|171|null|null\nnull|null|null|null|VALUE107|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE32|null|null|2022-01-05|COUNTRY1|null|STRING107|null|AC|null|null|STRING107|2022-01-11|2022-01-06|81351204|VALUE1|null|null|CITY95|888420207|null|2300117|404|null|RER|RCR|XCX|null|null|null|STRING167|null|101|null|null|null|1002|STRING5|null|null|null|null|STRING175|STRING176|STRING176|null|null|15020.00|null|15020.00|null|null|230014|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85023.00|101.000|101.000|101.000|101.000|STRING6|STRING1|2022-01-06|CITY95|885307|8123401104|STRING176|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING107|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING109|null|null|null|null|null|124814106|null|STRING1|2140084|2022-01-06|null|null|null|STRING107|COMP1COUNTRY1|null|null|null|null|null|No||2|175|null|null\nnull|null|null|null|VALUE108|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING108|null|AC|null|null|STRING108|2022-01-11|2022-01-06|81351205|VALUE1|null|null|CITY20|888420208|null|2300101|404|null|RER|CRC|XCX|null|null|null|STRING168|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING176|STRING177|STRING177|null|null|null|null|null|null|null|230001|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY20|885308|8123401105|STRING177|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING108|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING110|null|null|null|null|null|124814107|null|STRING1|2140017|2022-01-06|null|null|null|STRING108|COMP1COUNTRY1|null|null|STRING7|STRING2|STRING5|Yes||2|176|null|null\nnull|null|null|null|VALUE108|10102413|10102413|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE33|null|null|2022-01-05|COUNTRY1|null|STRING108|null|AC|null|null|STRING108|2022-01-11|2022-01-06|81351205|VALUE1|null|null|CITY20|888420208|null|2300101|404|null|RER|RCR|XCX|null|null|null|STRING169|null|101|null|null|null|1002|STRING10|null|null|null|null|STRING177|STRING177|STRING177|null|null|15014.00|null|15014.00|null|null|230001|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85056.00|101.000|101.000|101.000|101.000|STRING7|STRING1|2022-01-06|CITY20|885308|8123401105|STRING177|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING108|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING110|null|null|null|null|null|124814107|null|STRING1|2140017|2022-01-06|null|null|null|STRING108|COMP1COUNTRY1|null|null|null|null|null|No||2|177|null|null\nnull|null|null|null|VALUE108|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING108|null|AC|null|null|STRING108|2022-01-11|2022-01-06|81351205|VALUE1|null|null|CITY20|888420208|null|2300143|404|null|RER|CRC|XCX|null|null|null|STRING170|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING178|STRING177|STRING177|null|null|15014.00|null|15014.00|null|null|230027|null|null|101|STRING1|null|null|null|null|PCP|101|6500140.00|null|101.000|101.000|6500140.00|85056.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY20|885308|8123401105|STRING177|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING108|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING110|null|null|null|null|null|124814107|null|STRING1|2140017|2022-01-06|null|null|null|STRING108|COMP1COUNTRY1|null|null|STRING7|STRING2|STRING5|Yes||2|178|null|null\nnull|null|null|null|VALUE109|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE34|null|null|2022-01-05|COUNTRY6|null|STRING109|null|AC|null|null|STRING109|2022-01-10|2022-01-06|81351206|VALUE1|null|null|CITY96|888420209|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING171|null|101|null|null|null|1002|STRING5|null|null|null|null|STRING179|STRING180|STRING180|null|null|15022.00|null|15022.00|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|85013.00|101.000|101.000|101.000|101.000|STRING6|STRING1|2022-01-06|CITY96|885309|8123401106|STRING180|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING109|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING1|STRING2|1923001|1923001|null|null|null|null|null|STRING111|null|null|null|null|null|124814108|null|STRING1|2140024|2022-01-06|null|null|null|STRING109|COMP1COUNTRY6|null|null|null|null|null|No||2|179|null|null\nnull|null|null|null|VALUE111|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE36|null|null|2022-01-05|COUNTRY1|null|STRING111|null|AC|null|null|STRING111|2022-01-11|2022-01-06|81351208|VALUE1|null|null|CITY98|888420211|null|2300180|404|null|RER|RCR|XCX|null|null|null|STRING173|null|101|null|null|null|1002|STRING6|null|null|null|null|STRING181|STRING182|STRING182|null|null|15022.00|null|15022.00|null|null|230034|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85041.00|101.000|101.000|101.000|101.000|STRING18|STRING1|2022-01-06|CITY98|885311|8123401108|STRING182|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING111|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING113|null|null|null|null|null|124814110|null|STRING1|2140086|2022-01-06|null|null|null|STRING111|COMP1COUNTRY1|null|null|null|null|null|No||2|181|null|null\nnull|null|null|null|VALUE111|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE36|null|null|2022-01-05|COUNTRY1|null|STRING111|null|AC|null|null|STRING111|2022-01-11|2022-01-06|81351208|VALUE1|null|null|CITY98|888420211|null|2300112|404|null|RER|RCR|XCX|null|null|null|STRING174|null|101|null|null|null|1001|STRING6|null|null|null|null|STRING182|STRING182|STRING182|null|null|null|null|null|null|null|230009|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|null|101.000|101.000|101.000|101.000|STRING18|STRING1|2022-01-06|CITY98|885311|8123401108|STRING182|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING111|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING113|null|null|null|null|null|124814110|null|STRING1|2140086|2022-01-06|null|null|null|STRING111|COMP1COUNTRY1|null|null|null|null|null|No||2|182|null|null\nnull|null|null|null|VALUE112|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-14|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING112|null|AC|null|null|STRING112|2022-01-14|2022-01-06|81351209|VALUE1|null|null|CITY99|888420212|null|2300101|404|null|RER|RCR|XCX|null|null|STRING7|STRING175|null|101|null|null|null|1001|null|null|null|null|null|STRING183|STRING184|STRING184|null|null|15003.00|null|15003.00|null|null|230001|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|85006.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY99|885312|8123401109|STRING184|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE2|STRING112|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING114|null|EFL|null|null|null|124814111|null|STRING1|2140087|2022-01-06|null|null|null|STRING112|COMP1COUNTRY1|null|null|null|null|null|No||2|183|null|null\nnull|null|null|null|VALUE113|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AA|2022-01-15|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY4|null|STRING113|METHOD8|AA|null|null|STRING113|null|2022-01-06|81351210|VALUE1|null|null|CITY100|888420213|null|2300116|404|null|RER|RCR|XCX|null|null|null|STRING176|null|101|null|null|null|1001|null|null|null|null|null|STRING184|STRING185|STRING185|null|null|15003.00|null|15003.00|null|null|230013|null|null|101|STRING1|null|null|null|null|PCP|101|6500158.00|null|101.000|101.000|6500158.00|85051.00|101.000|101.000|101.000|101.000|null|STRING2|2022-01-06|CITY100|885313|8123401110|STRING185|null|2022-01-06|STRING4|10020.000|STRING1|STORE1|STRING3|TYPE1|STRING113|10240413|null|null|null|null|null|null|COMPANY1|COUNTRYAB4|STRING2|STRING1|1923002|1923001|null|null|null|null|10349200.00|STRING115|null|null|null|null|null|124814112|3059003|null|null|null|null|null|null|STRING113|COMP1COUNTRY4|null|null|null|null|null|No||2|184|null|null\nnull|null|null|null|VALUE114|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING114|METHOD9|AC|null|null|STRING114|2022-01-11|2022-01-06|81351211|VALUE1|null|null|CITY101|888420214|null|2300211|404|null|RER|RCR|XCX|null|null|null|STRING177|null|101|null|null|null|1003|null|null|null|null|null|STRING185|STRING186|STRING186|null|null|15003.00|null|15003.00|null|null|230040|null|null|101|STRING1|null|null|null|null|PCP|101|6500156.00|null|101.000|101.000|6500156.00|85023.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY101|885314|8123401111|STRING186|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING114|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING116|null|null|null|null|null|124814113|null|STRING1|2140088|2022-01-06|null|null|null|STRING114|COMP1COUNTRY1|null|null|null|null|null|No||2|185|null|null\nnull|null|null|null|VALUE115|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING115|null|AC|null|null|STRING115|2022-01-11|2022-01-06|81351212|VALUE1|null|null|CITY102|888420215|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING132|null|101|null|null|null|1002|null|null|null|null|null|STRING135|STRING187|STRING187|null|null|15003.00|null|15003.00|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500150.00|null|101.000|101.000|6500150.00|85034.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY102|885315|8123401112|STRING187|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING115|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING117|null|null|null|null|null|124814114|null|STRING1|2140089|2022-01-06|null|null|null|STRING115|COMP1COUNTRY1|null|null|null|null|null|No||2|186|null|null\nnull|null|null|null|VALUE116|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE37|null|null|2022-01-05|COUNTRY11|null|STRING116|null|AC|null|null|STRING116|2022-01-11|2022-01-06|81351213|VALUE1|null|null|CITY103|888420216|null|2300143|404|null|RER|RCR|XCX|null|null|null|STRING178|null|101|null|null|null|1002|STRING5|null|null|null|null|STRING186|STRING188|STRING188|null|null|null|null|null|null|null|230027|null|null|101|STRING6|null|null|null|null|PCP|101|6500161.00|null|101.000|101.000|6500161.00|null|101.000|101.000|101.000|101.000|STRING6|STRING1|2022-01-06|CITY103|885316|8123401113|STRING188|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING116|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB12|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING118|null|null|null|null|null|124814115|null|STRING1|2140090|2022-01-06|null|null|null|STRING116|COMP1COUNTRY10|null|null|null|null|null|No||2|187|null|null\nnull|null|null|null|VALUE117|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING117|null|AC|null|null|STRING117|2022-01-10|2022-01-06|81351214|VALUE1|null|null|CITY104|888420217|null|2300240|404|null|RER|RCR|XCX|null|null|null|STRING179|null|101|null|null|null|1001|null|null|null|null|null|STRING187|STRING189|STRING189|null|null|15003.00|null|15003.00|null|null|230046|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85025.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY104|885317|8123401114|STRING189|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING117|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING119|null|null|null|null|null|124814116|null|STRING1|2140059|2022-01-06|null|null|null|STRING117|COMP1COUNTRY1|null|null|null|null|null|No||2|188|null|null\nnull|null|null|null|VALUE118|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-13|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE38|null|null|2022-01-05|COUNTRY1|null|STRING118|null|AC|null|null|STRING118|2022-01-13|2022-01-06|81351215|VALUE1|null|null|CITY105|888420218|null|2300101|404|null|RER|RCR|XCX|null|null|null|STRING180|null|101|null|null|null|1002|STRING1|null|null|null|null|STRING188|STRING190|STRING190|null|null|15001.00|null|15001.00|null|null|230001|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|85027.00|101.000|101.000|101.000|101.000|STRING1|STRING1|2022-01-06|CITY105|885318|8123401115|STRING190|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING118|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING120|null|null|null|null|null|124814117|null|STRING1|2140091|2022-01-06|null|null|null|STRING118|COMP1COUNTRY1|null|null|null|null|null|No||2|189|null|null\nnull|null|null|null|VALUE119|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY6|null|STRING119|null|AC|null|null|STRING119|2022-01-11|2022-01-06|81351216|VALUE1|null|null|CITY106|888420219|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING181|null|101|null|null|null|1001|null|null|null|null|null|STRING189|STRING191|STRING191|null|null|15003.00|null|15003.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500137.00|null|101.000|101.000|6500137.00|85018.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY106|885319|8123401116|STRING191|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING119|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING121|null|null|null|null|null|124814118|null|STRING1|2140092|2022-01-06|null|null|null|STRING119|COMP1COUNTRY6|null|null|null|null|null|No||2|190|null|null\nnull|null|null|null|VALUE120|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY6|null|STRING120|null|AA|null|null|STRING120|2022-01-10|2022-01-06|81351217|VALUE1|null|null|CITY107|888420220|null|2300105|404|null|RER|CRC|XCX|null|null|null|STRING182|null|101|null|null|null|1001|STRING9|null|null|null|null|STRING190|STRING192|STRING192|null|null|15029.00|null|15029.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500126.00|null|101.000|101.000|6500126.00|85057.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY107|885320|8123401117|STRING192|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING120|10240414|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING122|null|null|null|null|null|124814119|null|null|null|null|null|null|null|STRING120|COMP1COUNTRY6|null|null|STRING8|STRING2|STRING6|Yes||2|191|null|null\nnull|null|null|null|VALUE120|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY6|null|STRING120|null|AA|null|null|STRING120|2022-01-10|2022-01-06|81351217|VALUE1|null|null|CITY107|888420220|null|2300293|404|null|RER|CRC|XCX|null|null|null|STRING183|null|101|null|null|null|1002|STRING9|null|null|null|null|STRING191|STRING192|STRING192|null|null|15004.00|null|15004.00|null|null|230048|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|85024.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-06|CITY107|885320|8123401117|STRING192|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING120|10240414|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING122|null|null|null|null|null|124814119|null|null|null|null|null|null|null|STRING120|COMP1COUNTRY6|null|null|STRING8|STRING2|STRING6|Yes||2|192|null|null\nnull|null|null|null|VALUE120|10102415|10102415|null|3,02E+25|3,02E+25|null|null| 
|null|AA|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|VALUE39|null|null|2022-01-05|COUNTRY6|null|STRING120|null|AA|null|null|STRING120|2022-01-10|2022-01-06|81351217|VALUE1|null|null|CITY107|888420220|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING184|null|101|null|null|null|1002|STRING6|null|null|null|null|STRING192|STRING192|STRING192|null|null|15022.00|null|15022.00|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500127.00|null|101.000|101.000|6500127.00|85041.00|101.000|101.000|101.000|101.000|STRING19|STRING1|2022-01-06|CITY107|885320|8123401117|STRING192|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING120|10240414|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING122|null|null|null|null|null|124814119|null|null|null|null|null|null|null|STRING120|COMP1COUNTRY6|null|null|null|null|null|No||2|193|null|null\nnull|null|null|null|VALUE120|10102416|10102416|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|VALUE39|null|null|2022-01-05|COUNTRY6|null|STRING120|null|AA|null|null|STRING120|2022-01-10|2022-01-06|81351217|VALUE1|null|null|CITY107|888420220|null|2300128|404|null|RER|RCR|XCX|null|null|null|STRING185|null|101|null|null|null|1002|STRING6|null|null|null|null|STRING193|STRING192|STRING192|null|null|null|null|null|null|null|230018|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|null|101.000|101.000|101.000|101.000|STRING19|STRING1|2022-01-06|CITY107|885320|8123401117|STRING192|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING120|10240414|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING122|null|null|null|null|null|124814119|null|null|null|null|null|null|null|STRING120|COMP1COUNTRY6|null|null|null|null|null|No||2|194|null|null\nnull|null|null|null|VALUE120|10102417|10102417|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|VALUE39|null|null|2022-01-05|COUNTRY6|null|STRING120|null|AA|null|null|STRING120|2022-01-10|2022-01-06|81351217|VALUE1|null|null|CITY107|888420220|null|2300296|404|null|RER|RCR|XCX|null|null|null|STRING186|null|101|null|null|null|1003|STRING6|null|null|null|null|STRING194|STRING192|STRING192|null|null|null|null|null|null|null|230049|null|null|101|STRING1|null|null|null|null|PCP|101|6500132.00|null|101.000|101.000|6500132.00|null|101.000|101.000|101.000|101.000|STRING19|STRING1|2022-01-06|CITY107|885320|8123401117|STRING192|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING120|10240414|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING122|null|null|null|null|null|124814119|null|null|null|null|null|null|null|STRING120|COMP1COUNTRY6|null|null|null|null|null|No||2|195|null|null\nnull|null|null|null|VALUE120|10102418|10102418|null|3,02E+25|3,02E+25|null|null| 
|null|AA|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|VALUE39|null|null|2022-01-05|COUNTRY6|null|STRING120|null|AA|null|null|STRING120|2022-01-10|2022-01-06|81351217|VALUE1|null|null|CITY107|888420220|null|2300139|404|null|RER|RCR|XCX|null|null|null|STRING187|null|101|null|null|null|1002|STRING6|null|null|null|null|STRING195|STRING192|STRING192|null|null|null|null|null|null|null|230024|null|null|101|STRING1|null|null|null|null|PCP|101|6500123.00|null|101.000|101.000|6500123.00|null|101.000|101.000|101.000|101.000|STRING19|STRING1|2022-01-06|CITY107|885320|8123401117|STRING192|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING120|10240414|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING122|null|null|null|null|null|124814119|null|null|null|null|null|null|null|STRING120|COMP1COUNTRY6|null|null|null|null|null|No||2|196|null|null\nnull|null|null|null|VALUE120|10102419|10102419|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY6|null|STRING120|null|AA|null|null|STRING120|2022-01-10|2022-01-06|81351217|VALUE1|null|null|CITY107|888420220|null|2300111|404|null|RER|RCR|XCX|null|null|null|STRING188|null|101|null|null|null|1001|null|null|null|null|null|STRING196|STRING192|STRING192|null|null|15003.00|null|15003.00|null|null|230008|null|null|101|STRING1|null|null|null|null|PCP|101|6500131.00|null|101.000|101.000|6500131.00|85050.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY107|885320|8123401117|STRING192|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING120|10240414|null|null|null|null|null|null|COMPANY1|COUNTRYAB6|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING122|null|null|null|null|null|124814119|null|null|null|null|null|null|null|STRING120|COMP1COUNTRY6|null|null|null|null|null|No||2|197|null|null\nnull|null|null|null|VALUE121|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AA|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AA|AA|AA|null|null|null|2022-01-05|COUNTRY5|null|STRING121|null|AA|null|null|STRING121|2022-01-10|2022-01-06|81351218|VALUE1|null|null|CITY108|888420221|null|2300123|404|null|RER|RCR|XCX|null|null|null|STRING189|null|101|null|null|null|1001|null|null|null|null|null|STRING197|STRING199|STRING199|null|null|15003.00|null|15003.00|null|null|230017|null|null|101|STRING2|null|null|null|null|PCP|101|6500153.00|null|101.000|101.000|6500153.00|85039.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-06|CITY108|885321|8123401118|STRING199|null|2022-01-06|STRING4|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING121|10240415|null|null|null|null|null|null|COMPANY1|COUNTRYAB5|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING123|null|null|null|null|null|124814120|null|null|null|null|null|null|null|STRING121|COMP1COUNTRY5|null|null|null|null|null|No||2|198|null|null\nnull|null|null|null|VALUE1|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|VALUE1|null|null|2022-01-05|COUNTRY1|null|STRING1|null|AC|null|null|STRING1|2022-01-10|2022-01-06|81351101|VALUE1|null|null|CITY1|888420101|null|2300101|404|null|RER|RCR|XCX|null|null|null|STRING1|null|101|null|null|null|1002|STRING1|null|null|null|null|STRING1|STRING1|STRING1|null|null|null|null|null|null|null|230001|null|null|101|STRING1|null|null|null|null|PCP|101|6500123.00|null|101.000|101.000|6500123.00|null|101.000|101.000|101.000|101.000|STRING1|STRING1|2022-01-06|CITY1|885201|8123401001|STRING1|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING1|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING1|null|null|null|null|null|124814001|null|STRING1|2140001|2022-01-06|null|null|null|STRING1|COMP1COUNTRY1|null|null|null|null|null|No||2|0|null|null\nnull|null|null|null|VALUE2|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|VALUE2|null|null|2022-01-04|COUNTRY2|null|STRING2|null|AC|null|null|STRING2|2022-01-11|2022-01-07|81351102|VALUE1|null|null|CITY2|888420102|null|2300103|404|null|RER|RCR|XCX|null|null|null|STRING3|null|101|null|null|null|1001|STRING3|null|null|null|null|STRING3|STRING3|STRING3|null|null|15001.00|null|15001.00|null|null|230002|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85002.00|101.000|101.000|101.000|101.000|STRING3|STRING1|2022-01-07|CITY2|885202|8123401002|STRING3|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING2|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB2|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING2|null|null|null|null|null|124814002|null|STRING1|2140002|2022-01-07|null|null|null|STRING2|COMP1COUNTRY2|null|null|null|null|null|No||2|2|null|null\nnull|null|null|null|VALUE5|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING5|null|AC|null|null|STRING5|2022-01-11|2022-01-07|81351105|VALUE1|null|null|CITY5|888420105|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING6|null|101|null|null|null|1001|null|null|null|null|null|STRING6|STRING6|STRING6|null|null|15003.00|null|15003.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500128.00|null|101.000|101.000|6500128.00|85005.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-07|CITY5|885205|8123401005|STRING6|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING5|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING5|null|null|null|null|null|124814005|null|STRING1|2140005|2022-01-07|null|null|null|STRING5|COMP1COUNTRY1|null|null|null|null|null|No||2|5|null|null\nnull|null|null|null|VALUE12|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-13|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|VALUE6|null|null|2022-01-04|COUNTRY1|null|STRING12|null|AC|null|null|STRING12|2022-01-13|2022-01-07|81351112|VALUE1|null|null|CITY12|888420112|null|2300117|404|null|RER|RCR|XCX|null|null|null|STRING17|null|101|null|null|null|1002|STRING5|null|null|null|null|STRING17|STRING17|STRING17|null|null|15009.00|null|15009.00|null|null|230014|null|null|101|STRING1|null|null|null|null|PCP|101|6500124.00|null|101.000|101.000|6500124.00|85014.00|101.000|101.000|101.000|101.000|STRING4|STRING1|2022-01-07|CITY12|885212|8123401012|STRING17|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING12|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING12|null|null|null|null|null|124814012|null|STRING1|2140012|2022-01-07|null|null|null|STRING12|COMP1COUNTRY1|null|null|null|null|null|No||2|16|null|null\nnull|null|null|null|VALUE37|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING37|null|AC|null|null|STRING37|2022-01-11|2022-01-07|81351137|VALUE1|null|null|CITY34|888420137|null|2300128|404|null|RER|RCR|XCX|null|null|null|STRING55|null|101|null|null|null|1002|null|null|null|null|null|STRING55|STRING56|STRING56|null|null|15003.00|null|15003.00|null|null|230018|null|null|101|STRING1|null|null|null|null|PCP|101|6500145.00|null|101.000|101.000|6500145.00|85029.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-07|CITY34|885237|8123401037|STRING56|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING37|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING37|null|null|null|null|null|124814037|null|STRING1|2140031|2022-01-07|null|null|null|STRING37|COMP1COUNTRY1|null|null|null|null|null|No||2|55|null|null\nnull|null|null|null|VALUE48|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING48|null|AC|null|null|STRING48|2022-01-11|2022-01-07|81351148|VALUE1|null|null|CITY43|888420148|null|2300135|404|null|RER|CRC|XCX|null|null|null|STRING68|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING68|STRING69|STRING69|null|null|null|null|null|null|null|230022|null|null|101|STRING1|null|null|null|null|PCP|101|null|null|101.000|101.000|null|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-07|CITY43|885248|8123401048|STRING69|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING48|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING48|null|null|null|null|null|124814047|null|STRING1|2140038|2022-01-07|null|null|null|STRING48|COMP1COUNTRY1|null|null|null|null|null|No||2|68|null|null\nnull|null|null|null|VALUE48|10102413|10102413|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING48|null|AC|null|null|STRING48|2022-01-11|2022-01-07|81351148|VALUE1|null|null|CITY43|888420148|null|2300135|404|null|RER|CRC|XCX|null|null|null|STRING69|null|101|null|null|null|1001|STRING2|null|null|null|null|STRING69|STRING69|STRING69|null|null|15018.00|null|15018.00|null|null|230022|null|null|101|STRING1|null|null|null|null|PCP|101|6500134.00|null|101.000|101.000|6500134.00|85033.00|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-07|CITY43|885248|8123401048|STRING69|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING48|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING48|null|null|null|null|null|124814047|null|STRING1|2140038|2022-01-07|null|null|null|STRING48|COMP1COUNTRY1|null|null|null|null|null|No||2|69|null|null\nnull|null|null|null|VALUE62|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|VALUE22|null|null|2022-01-04|COUNTRY3|null|STRING62|null|AC|null|null|STRING62|2022-01-11|2022-01-07|81351161|VALUE1|null|null|CITY56|888420162|null|2300140|404|null|RER|RCR|XCX|null|null|null|STRING94|null|101|null|null|null|1002|STRING5|null|null|null|null|STRING94|STRING95|STRING95|null|null|15004.00|null|15004.00|null|null|230025|null|null|101|STRING1|null|null|null|null|PCP|101|6500125.00|null|101.000|101.000|6500125.00|85009.00|101.000|101.000|101.000|101.000|STRING6|STRING1|2022-01-07|CITY56|885262|8123401061|STRING95|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING62|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB3|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING63|null|null|null|null|null|124814061|null|STRING1|2140049|2022-01-07|null|null|null|STRING62|COMP1COUNTRY3|null|null|null|null|null|No||2|94|null|null\nnull|null|null|null|VALUE96|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING96|null|AC|null|null|STRING96|2022-01-11|2022-01-07|81351194|VALUE1|null|null|CITY73|888420196|null|2300121|404|null|RER|RCR|XCX|null|null|null|STRING147|null|101|null|null|null|1002|null|null|null|null|null|STRING150|STRING151|STRING151|null|null|15003.00|null|15003.00|null|null|230016|null|null|101|STRING1|null|null|null|null|PCP|101|6500145.00|null|101.000|101.000|6500145.00|85029.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-07|CITY73|885296|8123401094|STRING151|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING96|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING98|null|null|null|null|null|124814095|null|STRING1|2140075|2022-01-07|null|null|null|STRING96|COMP1COUNTRY1|null|null|null|null|null|No||2|150|null|null\nnull|null|null|null|VALUE96|10102413|10102413|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING96|null|AC|null|null|STRING96|2022-01-11|2022-01-07|81351194|VALUE1|null|null|CITY73|888420196|null|2300121|404|null|RER|RCR|XCX|null|null|null|STRING148|null|101|null|null|null|1002|null|null|null|null|null|STRING151|STRING151|STRING151|null|null|15003.00|null|15003.00|null|null|230016|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85027.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-07|CITY73|885296|8123401094|STRING151|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING96|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING98|null|null|null|null|null|124814095|null|STRING1|2140075|2022-01-07|null|null|null|STRING96|COMP1COUNTRY1|null|null|null|null|null|No||2|151|null|null\nnull|null|null|null|VALUE96|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING96|null|AC|null|null|STRING96|2022-01-11|2022-01-07|81351194|VALUE1|null|null|CITY73|888420196|null|2300121|404|null|RER|RCR|XCX|null|null|null|STRING149|null|101|null|null|null|1002|null|null|null|null|null|STRING152|STRING151|STRING151|null|null|15003.00|null|15003.00|null|null|230016|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85027.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-07|CITY73|885296|8123401094|STRING151|null|2022-01-06|STRING1|10020.000|STRING2|STORE1|STRING1|TYPE1|STRING96|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING3|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING98|null|null|null|null|null|124814095|null|STRING1|2140075|2022-01-07|null|null|null|STRING96|COMP1COUNTRY1|null|null|null|null|null|No||2|152|null|null\nnull|null|null|null|VALUE103|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY10|null|STRING103|null|AC|null|null|STRING103|2022-01-11|2022-01-07|81351201|VALUE1|null|null|CITY92|888420203|null|2300128|404|null|RER|CRC|XCX|null|null|null|STRING161|null|101|null|null|null|1002|STRING2|null|null|null|null|STRING164|STRING165|STRING165|null|null|null|null|null|null|null|230018|null|null|101|STRING1|null|null|null|null|PCP|101|6500135.00|null|101.000|101.000|6500135.00|null|101.000|101.000|101.000|101.000|STRING2|STRING1|2022-01-07|CITY92|885303|8123401101|STRING165|null|2022-01-06|STRING1|10020.000|STRING1|STORE1|STRING1|TYPE1|STRING103|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB11|STRING3|STRING2|1923001|1923001|null|null|null|null|null|STRING105|null|null|null|null|null|124814102|null|STRING1|2140081|2022-01-07|null|null|null|STRING103|COMP1COUNTRY9|null|null|null|null|null|No||2|164|null|null\nnull|null|null|null|VALUE106|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING106|null|AC|null|null|STRING106|2022-01-11|2022-01-07|81351203|VALUE1|null|null|CITY20|888420206|null|2300180|404|null|RER|RCR|XCX|null|null|null|STRING64|null|101|null|null|null|1001|null|null|null|null|null|STRING172|STRING173|STRING173|null|null|15003.00|null|15003.00|null|null|230034|null|null|101|STRING1|null|null|null|null|PCP|101|6500148.00|null|101.000|101.000|6500148.00|85036.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-07|CITY20|885306|8123401103|STRING173|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING106|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING107|null|null|null|null|null|124814105|null|STRING1|2140083|2022-01-07|null|null|null|STRING106|COMP1COUNTRY1|null|null|null|null|null|No||2|172|null|null\nnull|null|null|null|VALUE106|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING106|null|AC|null|null|STRING106|2022-01-11|2022-01-07|81351203|VALUE1|null|null|CITY20|888420206|null|2300180|404|null|RER|RCR|XCX|null|null|null|STRING165|null|101|null|null|null|1001|null|null|null|null|null|STRING173|STRING173|STRING173|null|null|15003.00|null|15003.00|null|null|230034|null|null|101|STRING1|null|null|null|null|PCP|101|6500148.00|null|101.000|101.000|6500148.00|85036.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-07|CITY20|885306|8123401103|STRING173|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING106|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING107|null|null|null|null|null|124814105|null|STRING1|2140083|2022-01-07|null|null|null|STRING106|COMP1COUNTRY1|null|null|null|null|null|No||2|173|null|null\nnull|null|null|null|VALUE106|10102415|10102415|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-14|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|null|null|null|2022-01-04|COUNTRY1|null|STRING106|null|AC|null|null|STRING106|2022-01-14|2022-01-07|81351203|VALUE1|null|null|CITY20|888420206|null|2300233|404|null|RER|RCR|XCX|null|null|null|STRING166|null|101|null|null|null|1001|null|null|null|null|null|STRING174|STRING173|STRING173|null|null|15003.00|null|15003.00|null|null|230044|null|null|101|STRING1|null|null|null|null|PCP|101|6500142.00|null|101.000|101.000|6500142.00|85027.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-07|CITY20|885306|8123401103|STRING173|null|2022-01-06|STRING6|10020.000|STRING3|STORE1|STRING1|TYPE3|STRING106|10240402|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|null|1923002|1923001|null|null|null|null|10349200.00|STRING108|null|null|null|null|null|124814105|null|STRING1|2140083|2022-01-07|null|null|null|STRING106|COMP1COUNTRY1|null|null|null|null|null|No||2|174|null|null\nnull|null|null|null|VALUE110|10102412|10102412|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-11|null|null|null|null|2022-01-04|null|STRING1|AC|AC|AC|VALUE35|null|null|2022-01-04|COUNTRY1|null|STRING110|null|AC|null|null|STRING110|2022-01-11|2022-01-07|81351207|VALUE1|null|null|CITY97|888420210|null|2300105|404|null|RER|RCR|XCX|null|null|null|STRING172|null|101|null|null|null|1001|STRING5|null|null|null|null|STRING180|STRING181|STRING181|null|null|15014.00|null|15014.00|null|null|230004|null|null|101|STRING1|null|null|null|null|PCP|101|6500148.00|null|101.000|101.000|6500148.00|85030.00|101.000|101.000|101.000|101.000|STRING17|STRING1|2022-01-07|CITY97|885310|8123401107|STRING181|null|2022-01-06|STRING1|10020.000|STRING3|STORE1|STRING1|TYPE1|STRING110|10240401|null|null|null|null|null|null|COMPANY1|COUNTRYAB1|STRING1|STRING2|1923001|1923001|null|null|null|null|10349200.00|STRING112|null|null|null|null|null|124814109|null|STRING1|2140085|2022-01-07|null|null|null|STRING110|COMP1COUNTRY1|null|null|null|null|null|No||2|180|null|null\nnull|null|null|null|VALUE6|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-12|null|null|null|null|2022-01-03|null|STRING1|AC|AC|AC|null|null|null|2022-01-03|COUNTRY2|null|STRING6|null|AC|null|null|STRING6|2022-01-12|2022-01-08|81351106|VALUE1|null|null|CITY6|888420106|null|2300107|404|null|RER|RCR|XCX|null|null|null|STRING7|null|101|null|null|null|1001|null|null|null|null|null|STRING7|STRING7|STRING7|null|null|15003.00|null|15003.00|null|null|230005|null|null|101|STRING1|null|null|null|null|PCP|101|6500129.00|null|101.000|101.000|6500129.00|85006.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-08|CITY6|885206|8123401006|STRING7|null|2022-01-06|STRING2|10020.000|STRING1|STORE1|STRING2|TYPE1|STRING6|10240402|null|null|null|null|null|null|COMPANY2|COUNTRYAB2|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING6|null|null|null|null|null|124814006|null|STRING1|2140006|2022-01-08|null|null|null|STRING6|COMP2COUNTRY2|null|null|null|null|null|No||2|6|null|null\nnull|null|null|null|VALUE6|10102413|10102413|null|3,02E+25|3,02E+25|null|null| |null|AB|2022-01-12|null|null|null|null|2022-01-03|null|STRING1|AB|AB|AB|null|null|null|2022-01-03|COUNTRY2|null|STRING6|null|AB|null|null|STRING6|2022-01-12|2022-01-08|81351106|VALUE1|null|null|CITY6|888420106|null|2300108|404|null|RER|RCR|XCX|null|null|null|STRING8|null|101|null|null|null|1003|null|null|null|null|null|STRING8|STRING7|STRING7|null|null|15003.00|15001.00|15003.00|null|null|230006|null|null|101|STRING1|null|null|null|null|PCP|101|6500130.00|5001.00|101.000|101.000|6500130.00|85007.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-08|CITY6|885206|8123401006|STRING7|null|2022-01-06|STRING3|10020.000|STRING1|STORE1|STRING2|TYPE1|STRING6|10240402|null|null|null|null|null|null|COMPANY2|COUNTRYAB2|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING6|null|null|null|null|null|124814006|null|STRING1|2140006|2022-01-08|null|null|null|STRING6|COMP2COUNTRY2|null|null|null|null|null|No||2|7|null|null\nnull|null|null|null|VALUE6|10102415|10102415|null|3,02E+25|3,02E+25|null|null| 
|null|AC|2022-01-12|null|null|null|null|2022-01-03|null|STRING1|AC|AC|AC|null|null|null|2022-01-03|COUNTRY2|null|STRING6|null|AC|null|null|STRING6|2022-01-12|2022-01-08|81351106|VALUE1|null|null|CITY6|888420106|null|2300107|404|null|RER|RCR|XCX|null|null|null|STRING9|null|101|null|null|null|1001|null|null|null|null|null|STRING9|STRING7|STRING7|null|null|15003.00|15002.00|15003.00|null|null|230005|null|null|101|STRING1|null|null|null|null|PCP|101|6500126.00|5002.00|101.000|101.000|6500126.00|85008.00|101.000|101.000|101.000|101.000|null|STRING1|2022-01-08|CITY6|885206|8123401006|STRING7|null|2022-01-06|STRING2|10020.000|STRING1|STORE1|STRING2|TYPE1|STRING6|10240402|null|null|null|null|null|null|COMPANY2|COUNTRYAB2|STRING3|STRING3|1923002|1923001|null|null|null|null|10349200.00|STRING6|null|null|null|null|null|124814006|null|STRING1|2140006|2022-01-08|null|null|null|STRING6|COMP2COUNTRY2|null|null|null|null|null|No||2|8|null|null\nnull|null|null|null|VALUE122|10102412|10102412|null|3,02E+25|3,02E+25|null|null| |null|AC|2022-01-10|null|null|null|null|2022-01-05|null|STRING1|AC|AC|AC|null|null|null|2022-01-05|COUNTRY1|null|STRING122|null|AC|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|101|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|PCP|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null|null||2|199|null|null\n"
  },
  {
    "path": "tests/resources/feature/gab/setup/schema/dummy_sales_kpi.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"order_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article_id\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/gab/setup/schema/lkp_query_builder.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"query_id\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"query_label\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"query_type\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"mappings\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"intermediate_stages\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recon_window\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"timezone_offset\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"start_of_the_week\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"is_active\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"queue\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"lh_created_on\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/gab/setup/schema/order_events.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\"name\": \"request_timestamp\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"data_pack_id\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"record_number\", \"type\": \"integer\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"update_mode\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_order_header\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_order_schedule\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_order_item\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"orgsales_orgp\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"order_header_key\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"order_line_key\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"derived_order_header\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"derived_order_line_k\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"return_reason\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"reqmnt_category\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"delivery_status10\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"req_del_dt_item\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"reason_for_rejsize\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"invoice_item_price\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"id_of_the_customer\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"logistics_profit_ctr\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"material_availabilit\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"mso_store\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"name_of_orderer\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"overall_delivery_sta\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"overall_processing_s20\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"overall_processing_s21\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"coupon_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"org_grape_bapcx\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"cust_service_rep\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"customer_purchase_or25\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"delivery_country_cod\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"delivery_city_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"delivery_post_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"delivery_state_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"delivery_status30\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"ops_del_block_sohdr\", \"type\": \"string\",\"nullable\": 
true,\"metadata\": {}},\n    {\"name\": \"ops_del_block_soscl\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"ecom_crm_id\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"conf_del_date_size\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"created_on\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"time\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_doc_item_cat\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"shipping_campaign_id\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"shipping_coupon_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"shipping_city\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"shipping_postal_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"shp_promotion_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"size_grid\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"main_chan_frm_src\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"prctr_billing\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"prere_indfrm_src\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"reg__clr_from_src\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"update_flag\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"usage\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_header_usgindp\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"vas_customer_defined\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"adidas_group_article\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"billto_cust\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"requirement_type\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"shipto_cust__r2\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"soldto_cust_r2\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_doc_category\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"product_division\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"promotion_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sd_categ_precdoc\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_hdrpreceding_doc\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_itmpreceding_doc\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_scl_prec_doc\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"article__region__s\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"reference_1\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"mkt_place_order_num\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_representative\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"subtotal_1_source\", \"type\": 
\"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"subtotal_2_source\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"subtotal_3_source\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"subtotal_4_source\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"subtotal_5_source\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"subtotal_6_source\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"grid_value\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"orgcompcodep\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"created_by\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"miscdistchcopap\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"document_currency\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"reason_for_order\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"opsplantp\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_group\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_office\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_unit\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"storage_location\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_net_price_2\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_order_net_valu\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_conf_qty\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_cum_order_qty\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_net_price\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_net_value\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_org_qty\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"so_conf_qty_actual\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_order_qty\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_odr_qty_actual\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"article_campaign_id\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_document_type\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"order_date_header\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"billing_city\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"billing_postal_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"customer_po_time\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"customer_purchase_or101\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"overall_rej_status\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"changed_on\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"epoch_status\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_order_canqty\", 
\"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"epoch_entry_type\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"epoch_entry_by\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"epoch_order_type\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"epoch_line_type\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"omnihub_marketplace\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"confirmed_delivery_t\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"shipping_city_addres112\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"shipping_city_addres113\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"shipping_city_addres114\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"billing_city_address115\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"billing_city_address116\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"billing_city_address117\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"omnihub_seller_org\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"omnihub_locale_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"customer_po_type\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"omnihub_carrier_serv\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"qualifier\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"omnihub_document_typ\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"omnihub_return_code\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"refund_process_date\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"refund_process_time\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"omni_cancel_reason\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"sales_order_ecom_fre\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"omnihub_custom_order\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"vas_packing_type_so\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"vas_spl_ser_type_so\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"vas_tktlbl_type_so\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"exchange_flag\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"exchange_type\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"customer_po_timedw\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"cnc_store_id\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"last_hold__type\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"last_hold_released_t\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"last_hold_release_dt\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"dynamic_pricing_iden\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": 
\"dynamic_pricing_valu\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"dymamic_pricing_amnt\", \"type\": \"decimal\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"exchange_reason\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"omnihub_site_id\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"international_shipme\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"exchange_variant\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"secondary_article_ca\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"secondary_article_pr\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"secondary_coupon_cod\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"double_discount_flag\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"extraction_date\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"lhe_batch_id\", \"type\": \"integer\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"lhe_row_id\", \"type\": \"long\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"source_update_date\", \"type\": \"date\",\"nullable\": true,\"metadata\": {}},\n    {\"name\": \"source_update_time\", \"type\": \"string\",\"nullable\": true,\"metadata\": {}}\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/gab/usecases/dummy_sales_kpi/1_article_category.sql",
    "content": "SELECT\n    \"category_a\" AS category_name\n   ,\"article1\" AS article_id\nUNION\nSELECT\n     \"category_a\" AS category_name\n    ,\"article2\" AS article_id\nUNION\nSELECT\n     \"category_a\" AS category_name\n    ,\"article3\" AS article_id\nUNION\nSELECT\n     \"category_a\" AS category_name\n    ,\"article4\" AS article_id\nUNION\nSELECT\n     \"category_b\" AS category_name\n    ,\"article5\" AS article_id\nUNION\nSELECT\n     \"category_b\" AS category_name\n    ,\"article6\" AS article_id\nUNION\nSELECT\n     \"category_b\" AS category_name\n    ,\"article7\" AS article_id\n"
  },
  {
    "path": "tests/resources/feature/gab/usecases/dummy_sales_kpi/2_dummy_sales_kpi.sql",
    "content": "SELECT\n   {% if replace_offset_value == 0 %} {{ project_date_column }} {% else %} ({{ project_date_column }} + interval '{{offset_value}}' hour) {% endif %} AS order_date,\n  {{ to_date }} AS to_date,\n  b.category_name,\n  COUNT(a.article_id) qty_articles,\n  SUM(amount) total_amount\nFROM\n  `{{ database }}`.`dummy_sales_kpi` a {{ joins }}\n  LEFT JOIN article_categories b\n    ON a.article_id = b.article_id\nWHERE\n  TO_DATE({{ filter_date_column }}, 'yyyyMMdd') >= (\n          '{{start_date}}' + INTERVAL '{{offset_value}}' HOUR\n  )\n  AND TO_DATE({{ filter_date_column }}, 'yyyyMMdd') < (\n          '{{ end_date}}' + INTERVAL '{{offset_value}}' HOUR\n  )\nGROUP BY\n  1,2,3"
  },
  {
    "path": "tests/resources/feature/gab/usecases/dummy_sales_kpi/scenario/dummy_sales_kpi.json",
    "content": "{\n  \"query_label_filter\": [\"dummy_sales_kpi\"],\n  \"queue_filter\": [\"Low\"],\n  \"cadence_filter\": [\"DAY\",\"WEEK\",\"MONTH\",\"QUARTER\",\"YEAR\"],\n  \"target_database\": \"test_db\",\n  \"start_date\": \"2016-01-01\",\n  \"end_date\": \"2018-12-31\",\n  \"rerun_flag\": \"N\",\n  \"target_table\": \"gab_use_case_results\",\n  \"source_database\": \"test_db\",\n  \"gab_base_path\": \"/app/tests/lakehouse/in/feature/gab/usecases_sql/\",\n  \"lookup_table\": \"lkp_query_builder\",\n  \"calendar_table\": \"dim_calendar\"\n}\n"
  },
  {
    "path": "tests/resources/feature/gab/usecases/order_events/1_order_events.sql",
    "content": "SELECT\n    {{ to_date }} AS to_date,\n    {% if replace_offset_value == 0 %} {{ project_date_column }} {% else %} ({{ project_date_column }} + INTERVAL '{{offset_value}}' HOUR) {% endif %} AS order_date,\n    sales_order_schedule,\n    delivery_country_cod,\n    COUNT(*) orders,\n    SUM(sales_order_qty) total_sales\nFROM {{ database }}.order_events {{ joins }}\nWHERE\n{{ filter_date_column }} >= (\n        '{{start_date}}' + INTERVAL '{{offset_value}}' HOUR\n)\nAND {{ filter_date_column }} < (\n        '{{ end_date}}' + INTERVAL '{{offset_value}}' HOUR\n)\nAND order_date_header IS NOT NULL\nGROUP BY ALL"
  },
  {
    "path": "tests/resources/feature/gab/usecases/order_events/scenario/order_events.json",
    "content": "{\n  \"query_label_filter\": [\"order_events\"],\n  \"queue_filter\": [\"Medium\"],\n  \"cadence_filter\": [\"All\"],\n  \"target_database\": \"test_db\",\n  \"start_date\": \"2022-01-01\",\n  \"end_date\": \"2022-12-31\",\n  \"rerun_flag\": \"N\",\n  \"target_table\": \"gab_use_case_results\",\n  \"source_database\": \"test_db\",\n  \"gab_base_path\": \"/app/tests/lakehouse/in/feature/gab/usecases_sql/\",\n  \"lookup_table\": \"lkp_query_builder\"\n}\n"
  },
  {
    "path": "tests/resources/feature/gab/usecases/order_events/scenario/order_events_nam.json",
    "content": "{\n  \"query_label_filter\": [\"order_events_nam\"],\n  \"queue_filter\": [\"Medium\"],\n  \"cadence_filter\": [\"MONTH\",\"QUARTER\"],\n  \"target_database\": \"test_db\",\n  \"start_date\": \"2022-01-01\",\n  \"end_date\": \"2022-12-31\",\n  \"rerun_flag\": \"N\",\n  \"target_table\": \"gab_use_case_results\",\n  \"source_database\": \"test_db\",\n  \"gab_base_path\": \"/app/tests/lakehouse/in/feature/gab/usecases_sql/\",\n  \"lookup_table\": \"lkp_query_builder\"\n}\n"
  },
  {
    "path": "tests/resources/feature/gab/usecases/order_events/scenario/order_events_negative_timezone_offset.json",
    "content": "{\n  \"query_label_filter\": [\"order_events_negative_timezone_offset\"],\n  \"queue_filter\": [\"Medium\"],\n  \"cadence_filter\": [\"WEEK\"],\n  \"target_database\": \"test_db\",\n  \"start_date\": \"2022-01-01\",\n  \"end_date\": \"2022-12-31\",\n  \"rerun_flag\": \"Y\",\n  \"target_table\": \"gab_use_case_results\",\n  \"source_database\": \"test_db\",\n  \"gab_base_path\": \"/app/tests/lakehouse/in/feature/gab/usecases_sql/\",\n  \"lookup_table\": \"lkp_query_builder\"\n}\n"
  },
  {
    "path": "tests/resources/feature/gab/usecases/order_events/scenario/order_events_snapshot.json",
    "content": "{\n  \"query_label_filter\": [\"order_events_snapshot\"],\n  \"queue_filter\": [\"Medium\"],\n  \"cadence_filter\": [\"DAY\",\"WEEK\"],\n  \"target_database\": \"test_db\",\n  \"start_date\": \"2022-01-01\",\n  \"end_date\": \"2022-12-31\",\n  \"rerun_flag\": \"N\",\n  \"target_table\": \"gab_use_case_results\",\n  \"source_database\": \"test_db\",\n  \"gab_base_path\": \"/app/tests/lakehouse/in/feature/gab/usecases_sql/\",\n  \"lookup_table\": \"lkp_query_builder\"\n}\n"
  },
  {
    "path": "tests/resources/feature/gab/usecases/order_events/scenario/skip_use_case_by_empty_reconciliation.json",
    "content": "{\n  \"query_label_filter\": [\"order_events_empty_reconciliation_window\"],\n  \"queue_filter\": [\"Medium\"],\n  \"cadence_filter\": [\"WEEK\"],\n  \"target_database\": \"test_db\",\n  \"start_date\": \"2022-01-01\",\n  \"end_date\": \"2022-12-31\",\n  \"rerun_flag\": \"Y\",\n  \"target_table\": \"gab_use_case_results\",\n  \"source_database\": \"test_db\",\n  \"gab_base_path\": \"/app/tests/lakehouse/in/feature/gab/usecases_sql/\",\n  \"lookup_table\": \"lkp_query_builder\"\n}\n"
  },
  {
    "path": "tests/resources/feature/gab/usecases/order_events/scenario/skip_use_case_by_empty_requested_cadence.json",
    "content": "{\n  \"query_label_filter\": [\"order_events_negative_timezone_offset\"],\n  \"queue_filter\": [\"Medium\"],\n  \"cadence_filter\": [\"\"],\n  \"target_database\": \"test_db\",\n  \"start_date\": \"2022-01-01\",\n  \"end_date\": \"2022-12-31\",\n  \"rerun_flag\": \"Y\",\n  \"target_table\": \"gab_use_case_results\",\n  \"source_database\": \"test_db\",\n  \"gab_base_path\": \"/app/tests/lakehouse/in/feature/gab/usecases_sql/\",\n  \"lookup_table\": \"lkp_query_builder\"\n}\n"
  },
  {
    "path": "tests/resources/feature/gab/usecases/order_events/scenario/skip_use_case_by_not_configured_cadence.json",
    "content": "{\n  \"query_label_filter\": [\"order_events_negative_timezone_offset\"],\n  \"queue_filter\": [\"Medium\"],\n  \"cadence_filter\": [\"YEAR\"],\n  \"target_database\": \"test_db\",\n  \"start_date\": \"2022-01-01\",\n  \"end_date\": \"2022-12-31\",\n  \"rerun_flag\": \"Y\",\n  \"target_table\": \"gab_use_case_results\",\n  \"source_database\": \"test_db\",\n  \"gab_base_path\": \"/app/tests/lakehouse/in/feature/gab/usecases_sql/\",\n  \"lookup_table\": \"lkp_query_builder\"\n}\n"
  },
  {
    "path": "tests/resources/feature/gab/usecases/order_events/scenario/skip_use_case_by_unexisting_cadence.json",
    "content": "{\n  \"query_label_filter\": [\"order_events_unexisting_cadence\"],\n  \"queue_filter\": [\"Medium\"],\n  \"cadence_filter\": [\"WEEK\"],\n  \"target_database\": \"test_db\",\n  \"start_date\": \"2022-01-01\",\n  \"end_date\": \"2022-12-31\",\n  \"rerun_flag\": \"Y\",\n  \"target_table\": \"gab_use_case_results\",\n  \"source_database\": \"test_db\",\n  \"gab_base_path\": \"/app/tests/lakehouse/in/feature/gab/usecases_sql/\",\n  \"lookup_table\": \"lkp_query_builder\"\n}\n"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/default/data/ctr_heart_tbl_heartb_feed.csv",
    "content": "sensor_source|sensor_id|sensor_read_type|asset_description|upstream_key|preprocess_query|latest_event_fetched_timestamp|trigger_job_id|trigger_job_name|status|status_change_timestamp|job_start_timestamp|job_end_timestamp|job_state|dependency_flag\ndelta_table|dummy_delta_table|streaming|dummy_heartbeat_asset||||1927384615203749|data-product_job_name_orders|||||UNPAUSED|TRUE \nsap_bw|dummy_sap_asset|batch|dummy_heartbeat_sap_bw|LOAD_DATE|||2604918372561094|data-product_job_name_sales|||||UNPAUSED|TRUE\nkafka|sales: domain.workspace.load.dummy_topic|streaming|dummy_heartbeat_kafka||||2604918372561094|data-product_job_name_sales|||||UNPAUSED|TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/default/data/ctrl_heart_tbl_exec_sensor.csv",
    "content": "sensor_source|sensor_id|sensor_read_type|asset_description|upstream_key|preprocess_query|latest_event_fetched_timestamp|trigger_job_id|trigger_job_name|status|status_change_timestamp|job_start_timestamp|job_end_timestamp|job_state|dependency_flag\ndelta_table|dummy_delta_table|streaming|dummy_heartbeat_asset|||2025-08-14 23:00:00|1927384615203749|data-product_job_name_orders|NEW_EVENT_AVAILABLE|2025-08-14 23:00:00|||UNPAUSED|TRUE \nsap_bw|dummy_sap_asset|batch|dummy_heartbeat_sap_bw|LOAD_DATE|||2604918372561094|data-product_job_name_sales|||||UNPAUSED|TRUE\nkafka|sales: domain.workspace.load.dummy_topic|streaming|dummy_heartbeat_kafka||||2604918372561094|data-product_job_name_sales|||||UNPAUSED|TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/default/data/ctrl_heart_tbl_trigger_job.csv",
    "content": "sensor_source|sensor_id|sensor_read_type|asset_description|upstream_key|preprocess_query|latest_event_fetched_timestamp|trigger_job_id|trigger_job_name|status|status_change_timestamp|job_start_timestamp|job_end_timestamp|job_state|dependency_flag\ndelta_table|dummy_delta_table|streaming|dummy_heartbeat_asset|||2025-08-14 23:00:00|1927384615203749|data-product_job_name_orders|COMPLETED|2025-08-14 23:00:00||2025-08-14 23:00:00|UNPAUSED|TRUE \ndelta_table|dummy_order|batch|dummy_heartbeat_asset||||1015557820139870|data-product_job_name_orders|IN_PROGRESS|2025-08-14 23:00:00|2025-08-14 23:00:00||UNPAUSED|true\nkafka|sales: domain.workspace.load.dummy_topic|streaming|dummy_heartbeat_kafka||||2604918372561094|data-product_job_name_sales|||||UNPAUSED|TRUE\nsap_bw|dummy_sap_asset|batch|dummy_heartbeat_sap_bw|LOAD_DATE|||2604918372561094|data-product_job_name_sales|||||UNPAUSED|TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/default/data/ctrl_heart_tbl_updated.csv",
    "content": "sensor_source|sensor_id|sensor_read_type|asset_description|upstream_key|preprocess_query|latest_event_fetched_timestamp|trigger_job_id|trigger_job_name|status|status_change_timestamp|job_start_timestamp|job_end_timestamp|job_state|dependency_flag\ndelta_table|dummy_delta_table|streaming|dummy_heartbeat_asset|||2025-08-14 23:00:00|1927384615203749|data-product_job_name_orders|COMPLETED|2025-08-14 23:00:00||2025-08-14 23:00:00|UNPAUSED|TRUE \nsap_bw|dummy_sap_asset|batch|dummy_heartbeat_sap_bw|LOAD_DATE|||2604918372561094|data-product_job_name_sales|||||UNPAUSED|TRUE\nkafka|sales: domain.workspace.load.dummy_topic|streaming|dummy_heartbeat_kafka||||2604918372561094|data-product_job_name_sales|||||UNPAUSED|TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/default/data/ctrl_sensor_tbl_upd_status.json",
    "content": "{\"sensor_id\": \"multiple_sensors_delta_table_hello_world_sensor\",\"assets\": [\"multiple_sensors_delta_table_hello_world\"],\"status\": \"ACQUIRED_NEW_DATA\",\"status_change_timestamp\": \"2024-10-29 14:30:38.268544\",\"checkpoint_location\": \"s3://lh-sadp-template-eu-west-1-as12/checkpoints/lakehouse_engine/sensors/multiple_sensors_delta_table_hello_world_sensor\",\"upstream_key\": null,\"upstream_value\": null}\n{\"sensor_id\": \"multiple_sensors_sap_bw_hello_world_sensor\",\"assets\": [\"multiple_sensors_sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:48:18.406151\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"once_with_retry_sap_bw_hello_world_sensor\",\"assets\": [\"once_with_retry_sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:29:37.167015\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"lmu_table_batch_sensor\",\"assets\": [\"lmu_article_description\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2025-02-13 14:24:10.528557\",\"checkpoint_location\": null,\"upstream_key\": \"date\",\"upstream_value\": \"20200201010101\"}\n{\"sensor_id\": \"sap_bw_hello_world_sensor\",\"assets\": [\"sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:28:18.24358\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"dummy_delta_table_1927384615203749\",\"assets\": [\"dummy_heartbeat_asset\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2025-08-14 23:00:00.00000\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"38172649503821\"}"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/default/schema/ctrl_heart_tbl_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\":\"sensor_source\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"sensor_id\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"sensor_read_type\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"asset_description\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"upstream_key\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"preprocess_query\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"latest_event_fetched_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"trigger_job_id\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"trigger_job_name\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"status\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"status_change_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"job_start_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"job_end_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"job_state\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"dependency_flag\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/default/schema/ctrl_heart_tbl_trig_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\":\"sensor_source\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"sensor_id\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"sensor_read_type\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"asset_description\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"upstream_key\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"preprocess_query\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"trigger_job_id\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"trigger_job_name\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"status\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"job_state\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"dependency_flag\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/heartbeat_paused_sensor_new_record/data/ctr_heart_tbl_heartb_feed.csv",
    "content": "sensor_source|sensor_id|sensor_read_type|asset_description|upstream_key|preprocess_query|latest_event_fetched_timestamp|trigger_job_id|trigger_job_name|status|status_change_timestamp|job_start_timestamp|job_end_timestamp|job_state|dependency_flag\ndelta_table|dummy_delta_table|streaming|dummy_heartbeat_asset||||1927384615203749|data-product_job_name_orders|||||PAUSED|TRUE\nsap_bw|dummy_sap_asset|batch|dummy_heartbeat_sap_bw|LOAD_DATE|||2604918372561094|data-product_job_name_sales||||||TRUE\nkafka|sales: domain.workspace.load.dummy_topic|streaming|dummy_heartbeat_kafka||||2604918372561094|data-product_job_name_sales|||||COMPLETE|TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/heartbeat_paused_sensor_new_record/data/ctrl_heart_tbl_exec_sensor.csv",
    "content": "sensor_source|sensor_id|sensor_read_type|asset_description|upstream_key|preprocess_query|latest_event_fetched_timestamp|trigger_job_id|trigger_job_name|status|status_change_timestamp|job_start_timestamp|job_end_timestamp|job_state|dependency_flag\ndelta_table|dummy_delta_table|streaming|dummy_heartbeat_asset||||1927384615203749|data-product_job_name_orders|||||PAUSED|TRUE\nsap_bw|dummy_sap_asset|batch|dummy_heartbeat_sap_bw|LOAD_DATE|||2604918372561094|data-product_job_name_sales||||||TRUE\nkafka|sales: domain.workspace.load.dummy_topic|streaming|dummy_heartbeat_kafka||||2604918372561094|data-product_job_name_sales|||||COMPLETE|TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/heartbeat_paused_sensor_new_record/data/ctrl_heart_tbl_trigger_job.csv",
    "content": "sensor_source|sensor_id|sensor_read_type|asset_description|upstream_key|preprocess_query|latest_event_fetched_timestamp|trigger_job_id|trigger_job_name|status|status_change_timestamp|job_start_timestamp|job_end_timestamp|job_state|dependency_flag\ndelta_table|dummy_delta_table|streaming|dummy_heartbeat_asset||||1927384615203749|data-product_job_name_orders|||||PAUSED|TRUE\ndelta_table|dummy_order|batch|dummy_heartbeat_asset||||1015557820139870|data-product_job_name_orders|IN PROGRESS||||UNPAUSED|true\nkafka|sales: domain.workspace.load.dummy_topic|streaming|dummy_heartbeat_kafka||||2604918372561094|data-product_job_name_sales|COMPLETED|2025-08-14 23:00:00||2025-08-14 23:00:00|COMPLETE|TRUE\nsap_bw|dummy_sap_asset|batch|dummy_heartbeat_sap_bw|LOAD_DATE|||2604918372561094|data-product_job_name_sales|COMPLETED|2025-08-14 23:00:00||2025-08-14 23:00:00||TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/heartbeat_paused_sensor_new_record/data/ctrl_heart_tbl_updated.csv",
    "content": "sensor_source|sensor_id|sensor_read_type|asset_description|upstream_key|preprocess_query|latest_event_fetched_timestamp|trigger_job_id|trigger_job_name|status|status_change_timestamp|job_start_timestamp|job_end_timestamp|job_state|dependency_flag\ndelta_table|dummy_delta_table|streaming|dummy_heartbeat_asset||||1927384615203749|data-product_job_name_orders|||||PAUSED|TRUE\nsap_bw|dummy_sap_asset|batch|dummy_heartbeat_sap_bw|LOAD_DATE|||2604918372561094|data-product_job_name_sales|COMPLETED|2025-08-14 23:00:00||2025-08-14 23:00:00||TRUE\nkafka|sales: domain.workspace.load.dummy_topic|streaming|dummy_heartbeat_kafka||||2604918372561094|data-product_job_name_sales|COMPLETED|2025-08-14 23:00:00 ||2025-08-14 23:00:00|COMPLETE|TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/heartbeat_paused_sensor_new_record/data/ctrl_sensor_tbl_upd_status.json",
    "content": "{\"sensor_id\": \"multiple_sensors_delta_table_hello_world_sensor\",\"assets\": [\"multiple_sensors_delta_table_hello_world\"],\"status\": \"ACQUIRED_NEW_DATA\",\"status_change_timestamp\": \"2024-10-29 14:30:38.268544\",\"checkpoint_location\": \"s3://lh-sadp-template-eu-west-1-as12/checkpoints/lakehouse_engine/sensors/multiple_sensors_delta_table_hello_world_sensor\",\"upstream_key\": null,\"upstream_value\": null}\n{\"sensor_id\": \"multiple_sensors_sap_bw_hello_world_sensor\",\"assets\": [\"multiple_sensors_sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:48:18.406151\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"once_with_retry_sap_bw_hello_world_sensor\",\"assets\": [\"once_with_retry_sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:29:37.167015\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"lmu_table_batch_sensor\",\"assets\": [\"lmu_article_description\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2025-02-13 14:24:10.528557\",\"checkpoint_location\": null,\"upstream_key\": \"date\",\"upstream_value\": \"20200201010101\"}\n{\"sensor_id\": \"sap_bw_hello_world_sensor\",\"assets\": [\"sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:28:18.24358\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"dummy_sap_asset_2604918372561094\",\"assets\": [\"dummy_heartbeat_sap_bw\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2025-08-14 23:00:00\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"38172649503821\"}\n{\"sensor_id\": \"sales__domain_workspace_load_dummy_topic_2604918372561094\",\"assets\": null,\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2025-08-14 23:00:00\",\"checkpoint_location\": null,\"upstream_key\": \"None\",\"upstream_value\": \"None\"}"
  },
  {
    "path": "tests/resources/feature/heartbeat/control/heartbeat_paused_sensor_new_record/schema/ctrl_heart_tbl_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\":\"sensor_source\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"sensor_id\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"sensor_read_type\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"asset_description\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"upstream_key\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"preprocess_query\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"latest_event_fetched_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"trigger_job_id\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"trigger_job_name\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"status\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"status_change_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"job_start_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"job_end_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"job_state\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"dependency_flag\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/default/column_list/heartbeat_sensor_control_table.json",
    "content": "{\n  \"sensor_source\": \"string\",\n  \"sensor_id\": \"string\",\n  \"sensor_read_type\": \"string\",\n  \"asset_description\": \"string\",\n  \"upstream_key\": \"string\",\n  \"preprocess_query\": \"string\",\n  \"latest_event_fetched_timestamp\": \"timestamp\",\n  \"trigger_job_id\": \"string\",\n  \"trigger_job_name\": \"string\",\n  \"status\": \"string\",\n  \"status_change_timestamp\": \"timestamp\",\n  \"job_start_timestamp\": \"timestamp\",\n  \"job_end_timestamp\": \"timestamp\",\n  \"job_state\": \"string\",\n  \"dependency_flag\": \"string\"\n}"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/default/column_list/sensor_table.json",
    "content": "{\n  \"sensor_id\": \"string\",\n  \"assets\": \"array<string>\",\n  \"status\": \"string\",\n  \"status_change_timestamp\": \"timestamp\",\n  \"checkpoint_location\": \"string\",\n  \"upstream_key\": \"string\",\n  \"upstream_value\": \"string\"\n}"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/default/data/setup_heartbeat_data.csv",
    "content": "sensor_source,sensor_id,sensor_read_type,asset_description,upstream_key,preprocess_query,trigger_job_id,trigger_job_name,job_state,dependency_flag\ndelta_table,dummy_delta_table,streaming,dummy_heartbeat_asset,,,1927384615203749,data-product_job_name_orders,UNPAUSED,TRUE \nsap_bw,dummy_sap_asset,batch,dummy_heartbeat_sap_bw,LOAD_DATE,,2604918372561094,data-product_job_name_sales,UNPAUSED,TRUE\nkafka,sales: domain.workspace.load.dummy_topic,streaming,dummy_heartbeat_kafka,,,2604918372561094,data-product_job_name_sales,UNPAUSED,TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/default/data/setup_sensor_data.json",
    "content": "{\"sensor_id\": \"multiple_sensors_delta_table_hello_world_sensor\",\"assets\": [\"multiple_sensors_delta_table_hello_world\"],\"status\": \"ACQUIRED_NEW_DATA\",\"status_change_timestamp\": \"2024-10-29 14:30:38.268544\",\"checkpoint_location\": \"s3://lh-sadp-template-eu-west-1-as12/checkpoints/lakehouse_engine/sensors/multiple_sensors_delta_table_hello_world_sensor\",\"upstream_key\": null,\"upstream_value\": null}\n{\"sensor_id\": \"multiple_sensors_sap_bw_hello_world_sensor\",\"assets\": [\"multiple_sensors_sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:48:18.406151\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"once_with_retry_sap_bw_hello_world_sensor\",\"assets\": [\"once_with_retry_sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:29:37.167015\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"lmu_table_batch_sensor\",\"assets\": [\"lmu_article_description\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2025-02-13 14:24:10.528557\",\"checkpoint_location\": null,\"upstream_key\": \"date\",\"upstream_value\": \"20200201010101\"}\n{\"sensor_id\": \"sap_bw_hello_world_sensor\",\"assets\": [\"sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:28:18.24358\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"dummy_delta_table_1927384615203749\",\"assets\": [\"dummy_heartbeat_asset\"],\"status\": \"ACQUIRED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:28:18.24358\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"38172649503821\"}"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/default/schema/schema_sensor_df.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\":\"sensor_id\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"assets\",\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      },\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"status\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"status_change_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"checkpoint_location\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"upstream_key\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"upstream_value\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/heartbeat_paused_sensor_new_record/column_list/heartbeat_sensor_control_table.json",
    "content": "{\n  \"sensor_source\": \"string\",\n  \"sensor_id\": \"string\",\n  \"sensor_read_type\": \"string\",\n  \"asset_description\": \"string\",\n  \"upstream_key\": \"string\",\n  \"preprocess_query\": \"string\",\n  \"latest_event_fetched_timestamp\": \"timestamp\",\n  \"trigger_job_id\": \"string\",\n  \"trigger_job_name\": \"string\",\n  \"status\": \"string\",\n  \"status_change_timestamp\": \"timestamp\",\n  \"job_start_timestamp\": \"timestamp\",\n  \"job_end_timestamp\": \"timestamp\",\n  \"job_state\": \"string\",\n  \"dependency_flag\": \"string\"\n}"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/heartbeat_paused_sensor_new_record/column_list/sensor_table.json",
    "content": "{\n  \"sensor_id\": \"string\",\n  \"assets\": \"array<string>\",\n  \"status\": \"string\",\n  \"status_change_timestamp\": \"timestamp\",\n  \"checkpoint_location\": \"string\",\n  \"upstream_key\": \"string\",\n  \"upstream_value\": \"string\"\n}"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/heartbeat_paused_sensor_new_record/data/setup_heartbeat_data.csv",
    "content": "sensor_source,sensor_id,sensor_read_type,asset_description,upstream_key,preprocess_query,trigger_job_id,trigger_job_name,job_state,dependency_flag\ndelta_table,dummy_delta_table,streaming,dummy_heartbeat_asset,,,1927384615203749,data-product_job_name_orders,PAUSED,TRUE\nsap_bw,dummy_sap_asset,batch,dummy_heartbeat_sap_bw,LOAD_DATE,,2604918372561094,data-product_job_name_sales,,TRUE\nkafka,sales: domain.workspace.load.dummy_topic,streaming,dummy_heartbeat_kafka,,,2604918372561094,data-product_job_name_sales,COMPLETE,TRUE"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/heartbeat_paused_sensor_new_record/data/setup_sensor_data.json",
    "content": "{\"sensor_id\": \"multiple_sensors_delta_table_hello_world_sensor\",\"assets\": [\"multiple_sensors_delta_table_hello_world\"],\"status\": \"ACQUIRED_NEW_DATA\",\"status_change_timestamp\": \"2024-10-29 14:30:38.268544\",\"checkpoint_location\": \"s3://lh-sadp-template-eu-west-1-as12/checkpoints/lakehouse_engine/sensors/multiple_sensors_delta_table_hello_world_sensor\",\"upstream_key\": null,\"upstream_value\": null}\n{\"sensor_id\": \"multiple_sensors_sap_bw_hello_world_sensor\",\"assets\": [\"multiple_sensors_sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:48:18.406151\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"once_with_retry_sap_bw_hello_world_sensor\",\"assets\": [\"once_with_retry_sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:29:37.167015\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"lmu_table_batch_sensor\",\"assets\": [\"lmu_article_description\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2025-02-13 14:24:10.528557\",\"checkpoint_location\": null,\"upstream_key\": \"date\",\"upstream_value\": \"20200201010101\"}\n{\"sensor_id\": \"sap_bw_hello_world_sensor\",\"assets\": [\"sap_bw_hello_world\"],\"status\": \"PROCESSED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:28:18.24358\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"20220903195523\"}\n{\"sensor_id\": \"dummy_sap_asset_2604918372561094\",\"assets\": [\"dummy_heartbeat_sap_bw\"],\"status\": \"ACQUIRED_NEW_DATA\",\"status_change_timestamp\": \"2023-08-14 08:28:18.24358\",\"checkpoint_location\": null,\"upstream_key\": \"LOAD_DATE\",\"upstream_value\": \"38172649503821\"}"
  },
  {
    "path": "tests/resources/feature/heartbeat/setup/heartbeat_paused_sensor_new_record/schema/schema_sensor_df.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\":\"sensor_id\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"assets\",\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      },\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"status\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"status_change_timestamp\",\n      \"type\":\"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"checkpoint_location\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"upstream_key\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\":\"upstream_value\",\n      \"type\":\"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/jdbc_reader/jdbc_format/correct_arguments/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"jdbc\",\n      \"options\": {\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/in/feature/jdbc_reader/jdbc_format/correct_arguments/tests.db\",\n        \"dbtable\": \"jdbc_format\",\n        \"driver\": \"org.sqlite.JDBC\",\n        \"numPartitions\": 1\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"db_table\": \"test_db.jdbc_format_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/jdbc_reader/jdbc_format/correct_arguments/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/jdbc_reader/jdbc_format/correct_arguments/data/control/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/jdbc_reader/jdbc_format/correct_arguments/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/jdbc_reader/jdbc_format/predicates/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"jdbc\",\n      \"options\": {\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/in/feature/jdbc_reader/jdbc_format/predicates/tests.db\",\n        \"dbtable\": \"options\",\n        \"driver\": \"org.sqlite.JDBC\",\n        \"predicates\": \"[customer=customer1]\"\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"db_table\": \"test_db.options_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/jdbc_reader/jdbc_format/predicates/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/jdbc_reader/jdbc_format/wrong_arguments/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"jdbc\",\n      \"options\": {\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/in/feature/jdbc_reader/jdbc_format/wrong_arguments/tests.db\",\n        \"table\": \"error_because_should_be_dbtable\",\n        \"driver\": \"org.sqlite.JDBC\",\n        \"numPartitions\": 1\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"db_table\": \"test_db.jdbc_format_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/jdbc_reader/jdbc_format/wrong_arguments/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/jdbc_reader/jdbc_function/correct_arguments/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"jdbc\",\n      \"jdbc_args\": {\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/in/feature/jdbc_reader/jdbc_function/correct_arguments/tests.db\",\n        \"table\": \"jdbc_function\",\n        \"properties\": {\n          \"driver\": \"org.sqlite.JDBC\"\n        }\n      },\n      \"options\": {\n        \"numPartitions\": 1\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"db_table\": \"test_db.jdbc_function_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/jdbc_reader/jdbc_function/correct_arguments/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/jdbc_reader/jdbc_function/correct_arguments/data/control/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/jdbc_reader/jdbc_function/correct_arguments/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/jdbc_reader/jdbc_function/wrong_arguments/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"jdbc\",\n      \"jdbc_args\": {\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/in/feature/jdbc_reader/jdbc_function/wrong_arguments/tests.db\",\n        \"dbtable\": \"error_because_should_be_table_or_query\",\n        \"properties\": {\n          \"driver\": \"org.sqlite.JDBC\"\n        }\n      },\n      \"options\": {\n        \"numPartitions\": 1\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"db_table\": \"test_db.jdbc_function_table\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/jdbc_reader/jdbc_function/wrong_arguments/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/materialize_cdf/acon_create_table.json",
    "content": "{\n  \"function\": \"create_table\",\n  \"path\": \"file:///app/tests/resources/feature/materialize_cdf/data/table/streaming_with_cdf.sql\"\n}"
  },
  {
    "path": "tests/resources/feature/materialize_cdf/control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"_change_type\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"_commit_version\",\n      \"type\": \"long\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"_commit_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}\n"
  },
  {
    "path": "tests/resources/feature/materialize_cdf/data/control/part-01_cdf.csv",
    "content": "salesorder|item|date|customer|article|amount|_change_type|_commit_version\n1|1|20160601|customer1|article1|1000|insert|1\n1|2|20160601|customer1|article2|2000|insert|1\n1|3|20160601|customer1|article3|500|insert|1\n2|1|20170215|customer2|article4|1000|insert|1\n2|2|20170215|customer2|article6|5000|insert|1\n2|3|20170215|customer2|article1|3000|insert|1\n3|1|20170215|customer1|article5|20000|insert|1\n3|2|20170215|customer1|article2|12000|insert|1\n3|3|20170215|customer1|article4|9000|insert|1\n4|1|20170430|customer3|article3|8000|insert|1\n4|2|20170430|customer3|article7|7000|insert|1\n4|3|20170430|customer3|article1|3000|insert|1\n4|4|20170430|customer3|article2|5000|insert|1"
  },
  {
    "path": "tests/resources/feature/materialize_cdf/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000"
  },
  {
    "path": "tests/resources/feature/materialize_cdf/data/source/part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n5|1|20180601|customer1|article1|1000\n5|2|20180601|customer1|article2|2000\n5|3|20180601|customer1|article3|500\n6|1|20190215|customer2|article4|1000\n6|2|20190215|customer2|article6|5000\n6|3|20190215|customer2|article1|3000\n"
  },
  {
    "path": "tests/resources/feature/materialize_cdf/data/table/streaming_with_cdf.sql",
    "content": "CREATE TABLE test_db.streaming_with_cdf (salesorder INT, item INT, date INT, customer STRING, article STRING, amount INT)\nUSING DELTA\nPARTITIONED BY (date)\nLOCATION 'file:///app/tests/lakehouse/out/feature/materialize_cdf/streaming_with_cdf'\nTBLPROPERTIES(\n  'delta.enableChangeDataFeed'='true'\n)"
  },
  {
    "path": "tests/resources/feature/materialize_cdf/streaming_with_clean_and_vacuum.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"DROPMALFORMED\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/materialize_cdf/streaming_with_cdf/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.streaming_with_cdf\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/materialize_cdf/streaming_with_cdf/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/materialize_cdf/streaming_with_cdf/data\"\n    }\n  ],\n  \"terminate_specs\": [\n    {\n      \"function\": \"expose_cdf\",\n      \"args\": {\n        \"db_table\": \"test_db.streaming_with_cdf\",\n        \"materialized_cdf_location\": \"file:///app/tests/lakehouse/out/feature/materialize_cdf/streaming_with_cdf/cdf_data\",\n        \"materialized_cdf_options\": {\n          \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/materialize_cdf/streaming_with_cdf/cdf_checkpoint\"\n        },\n        \"vacuum_cdf\": true,\n        \"vacuum_hours\": 240,\n        \"clean_cdf\": true,\n        \"days_to_keep\": 1\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.sources.partitionColumnTypeInference.enabled\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/materialize_cdf/streaming_without_clean_cdf.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"DROPMALFORMED\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/materialize_cdf/streaming_with_cdf/data\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.streaming_with_cdf\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/materialize_cdf/streaming_with_cdf/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/materialize_cdf/streaming_with_cdf/data\"\n    }\n  ],\n  \"terminate_specs\": [\n    {\n      \"function\": \"expose_cdf\",\n      \"args\": {\n        \"db_table\": \"test_db.streaming_with_cdf\",\n        \"materialized_cdf_location\": \"file:///app/tests/lakehouse/out/feature/materialize_cdf/streaming_with_cdf/cdf_data\",\n        \"materialized_cdf_options\": {\n          \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/materialize_cdf/streaming_with_cdf/cdf_checkpoint\"\n        },\n        \"clean_cdf\": false\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.sources.partitionColumnTypeInference.enabled\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/notification/test_attachement.txt",
    "content": "Test attachemment"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/current.json",
    "content": "[\n {\n   \"country\": \"pt\",\n   \"consumer\": 1,\n   \"date\": \"20211112\",\n   \"net_sales\": 200\n },\n {\n   \"country\": \"ge\",\n   \"consumer\": 2,\n   \"date\": \"20211113\",\n   \"net_sales\": 400\n },\n {\n   \"country\": \"pt\",\n   \"consumer\": 3,\n   \"date\": \"20211114\",\n   \"net_sales\": 600\n }\n]"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/current_different_rows.json",
    "content": "[\n  {\n    \"country\": \"pt\",\n    \"consumer\": 1,\n    \"date\": \"20211112\",\n    \"net_sales\": 200\n  },\n  {\n    \"country\": \"ge\",\n    \"consumer\": 2,\n    \"date\": \"20211113\",\n    \"net_sales\": 400\n  },\n  {\n    \"country\": \"pt\",\n    \"consumer\": 3,\n    \"date\": \"20211114\",\n    \"net_sales\": 600\n  },\n  {\n    \"country\": \"es\",\n    \"consumer\": 4,\n    \"date\": \"20211115\",\n    \"net_sales\": 250\n  }\n]"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/current_fail.json",
    "content": "[\n {\n   \"country\": \"pt\",\n   \"consumer\": 1,\n   \"date\": \"20211112\",\n   \"net_sales\": 100\n },\n {\n   \"country\": \"ge\",\n   \"consumer\": 2,\n   \"date\": \"20211113\",\n   \"net_sales\": 400\n },\n {\n   \"country\": \"pt\",\n   \"consumer\": 3,\n   \"date\": \"20211114\",\n   \"net_sales\": 600\n }\n]"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/current_nulls_and_zeros.json",
    "content": "[\n {\n   \"country\": \"pt\",\n   \"consumer\": 1,\n   \"date\": \"20211112\",\n   \"net_sales\": null\n },\n {\n   \"country\": \"ge\",\n   \"consumer\": 2,\n   \"date\": \"20211113\",\n   \"net_sales\": 0\n },\n {\n   \"country\": \"pt\",\n   \"consumer\": 3,\n   \"date\": \"20211114\",\n   \"net_sales\": null\n }\n]"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/current_nulls_and_zeros_fail.json",
    "content": "[\n {\n   \"country\": \"pt\",\n   \"consumer\": 1,\n   \"date\": \"20211112\",\n   \"net_sales\": 0\n },\n {\n   \"country\": \"ge\",\n   \"consumer\": 2,\n   \"date\": \"20211113\",\n   \"net_sales\": 0\n },\n {\n   \"country\": \"pt\",\n   \"consumer\": 3,\n   \"date\": \"20211114\",\n   \"net_sales\": null\n }\n]"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/truth.json",
    "content": "[\n {\n   \"country\": \"pt\",\n   \"consumer\": 1,\n   \"date\": \"20211112\",\n   \"net_sales\": 200\n },\n {\n   \"country\": \"ge\",\n   \"consumer\": 2,\n   \"date\": \"20211113\",\n   \"net_sales\": 400\n },\n {\n   \"country\": \"pt\",\n   \"consumer\": 3,\n   \"date\": \"20211114\",\n   \"net_sales\": 600\n }\n]"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/truth_different_rows.json",
    "content": "[\n  {\n    \"country\": \"pt\",\n    \"consumer\": 1,\n    \"date\": \"20211112\",\n    \"net_sales\": 200\n  },\n  {\n    \"country\": \"ge\",\n    \"consumer\": 2,\n    \"date\": \"20211113\",\n    \"net_sales\": 400\n  },\n  {\n    \"country\": \"pt\",\n    \"consumer\": 3,\n    \"date\": \"20211114\",\n    \"net_sales\": 600\n  },\n  {\n    \"country\": \"uk\",\n    \"consumer\": 4,\n    \"date\": \"20211115\",\n    \"net_sales\": 250\n  }\n]"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/truth_empty.json",
    "content": "[]"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/truth_nulls_and_zeros.json",
    "content": "[\n {\n   \"country\": \"pt\",\n   \"consumer\": 1,\n   \"date\": \"20211112\",\n   \"net_sales\": null\n },\n {\n   \"country\": \"ge\",\n   \"consumer\": 2,\n   \"date\": \"20211113\",\n   \"net_sales\": 0\n },\n {\n   \"country\": \"pt\",\n   \"consumer\": 3,\n   \"date\": \"20211114\",\n   \"net_sales\": null\n }\n]"
  },
  {
    "path": "tests/resources/feature/reconciliation/data/truth_nulls_and_zeros_fail.json",
    "content": "[\n {\n   \"country\": \"pt\",\n   \"consumer\": 1,\n   \"date\": \"20211112\",\n   \"net_sales\": null\n },\n {\n   \"country\": \"ge\",\n   \"consumer\": 2,\n   \"date\": \"20211113\",\n   \"net_sales\": 0\n },\n {\n   \"country\": \"pt\",\n   \"consumer\": 3,\n   \"date\": \"20211114\",\n   \"net_sales\": null\n }\n]"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/batch_append_disabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/source_append_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"db_table\": \"test_db.schema_evolution_append_load\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"appended_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"ARTICLE\": \"article\"\n            }\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"date\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"appended_sales\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/append_load/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": false\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/batch_append_disabled_cast.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/source_append_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"db_table\": \"test_db.schema_evolution_append_load\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"appended_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"cast\",\n          \"args\": {\n            \"cols\": {\n              \"code\": \"StringType\"\n            }\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"date\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"appended_sales\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/append_load/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": false\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/batch_append_enabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/source_append_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"db_table\": \"test_db.schema_evolution_append_load\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"appended_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"ARTICLE\": \"article\"\n            },\n            \"escape_col_names\": false\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"date\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"appended_sales\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/append_load/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/batch_append_enabled_cast.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/source_append_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"db_table\": \"test_db.schema_evolution_append_load\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"date\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"appended_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"cast\",\n          \"args\": {\n            \"cols\": {\n              \"code\": \"StringType\"\n            }\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"date\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"appended_sales\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/append_load/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/batch_init_disabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/source_part-01_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/data\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/append_load/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": false\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/batch_init_enabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/source_part-01_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/append_load/data\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/append_load/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/control/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code|new_column\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100|1|\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100|1|\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2|\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3|\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4|\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50|6|\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50|6|\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2|new\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|3|new"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/control/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100|1\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100|1\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50|6\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50|6\n20180110120052t|request1|1|1|1|7|1|N|20180110||article2|120|2\n20180110120052t|request1|1|1|8|4|1|X|20170430||article3|80|3"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/control/part-05.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code|request_id\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100|1|\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100|1|\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2|\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3|\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4|\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50|6|\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50|6|\n20180110120052t||1|1|1|7|1|N|20180110|customer5|article2|120|2|request1\n20180110120052t||1|1|8|4|1|X|20170430|customer3|article3|80|3|request1"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/control/part-06.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount|code\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article6|50|2\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70|7\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50|2\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150|6\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80|5\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5||120|2\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30|1\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200|5\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30|1\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100|3\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100|4"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100|1\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100|1\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50|6\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50|6"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/source/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code|new_column\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2|new\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100|1|new\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1|new\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50|6|new\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2|new\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120|2|new\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90|4|new\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|3|new"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/source/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|ARTICLE|amount|code\n20180110120052t|request1|1|1|1|7|1|N|20180110|article2|120|2\n20180110120052t|request1|1|1|2|1|1|X|20160601|article1|100|1\n20180110120052t|request1|1|1|3|1|1||20160601|article1|150|1\n20180110120052t|request1|1|1|4|2|2|X|20170215|article6|50|6\n20180110120052t|request1|1|1|5|2|2||20170215|article2|50|2\n20180110120052t|request1|1|1|6|3|2|D|20170215|article2|120|2\n20180110120052t|request1|1|1|7|3|3|R|20170215|article4|-90|4\n20180110120052t|request1|1|1|8|4|1|X|20170430|article3|80|3"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/source/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100|1\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50|6\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120|2\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90|4\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|3"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/source/part-05.csv",
    "content": "actrequest_timestamp|request_id|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100|1\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50|6\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120|2\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90|4\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|3"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/data/source/part-06.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100|1\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50|6\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120|2\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90|4\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|3"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/schema/control/control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/schema/control/control_schema_add_column.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"new_column\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/schema/control/control_schema_rename.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_id\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/schema/source/source_part-01_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/schema/source/source_part-02_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"new_column\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/schema/source/source_part-03_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/schema/source/source_part-04_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/schema/source/source_part-05_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_id\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/append_load/schema/source/source_part-06_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/batch_delta_disabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/source_delta_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/delta_load/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"cast\",\n          \"args\": {\n            \"cols\": {\n              \"code\": \"StringType\"\n            }\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/delta_load/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": false\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/batch_delta_disabled_rename.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/source_delta_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/delta_load/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"ARTICLE\": \"article\"\n            },\n            \"escape_col_names\": false\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/delta_load/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": false\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/batch_delta_enabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/source_delta_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/data\"\n    },\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/delta_load/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"max_sales_bronze_timestamp\",\n      \"input_id\": \"sales_bronze\",\n      \"transformers\": [\n        {\n          \"function\": \"get_max_value\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\"\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"cast\",\n          \"args\": {\n            \"cols\": {\n              \"code\": \"StringType\"\n            }\n          }\n        },\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"ARTICLE\": \"article\"\n            },\n            \"escape_col_names\": false\n          }\n        },\n        {\n          \"function\": \"incremental_filter\",\n          \"args\": {\n            \"input_col\": \"actrequest_timestamp\",\n            \"increment_df\": \"max_sales_bronze_timestamp\"\n          }\n        },\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/delta_load/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\",\n        \"delete_predicate\": \"new.recordmode in ('R','D','X')\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/batch_init_disabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/source_part-01_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"ranking_key_asc\": [\n              \"recordmode\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/delta_load/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": false\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/batch_init_enabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/source_part-01_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/delta_load/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"condensed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"condense_record_mode_cdc\",\n          \"args\": {\n            \"business_key\": [\n              \"salesorder\",\n              \"item\"\n            ],\n            \"ranking_key_desc\": [\n              \"actrequest_timestamp\",\n              \"datapakid\",\n              \"partno\",\n              \"record\"\n            ],\n            \"ranking_key_asc\": [\n              \"recordmode\"\n            ],\n            \"record_mode_col\": \"recordmode\",\n            \"valid_record_modes\": [\n              \"\",\n              \"N\",\n              \"R\",\n              \"D\",\n              \"X\"\n            ]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"condensed_sales\",\n      \"write_type\": \"merge\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/delta_load/data\",\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.date <=> new.date\"\n      }\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/control/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount|code|new_column\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2|\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4|\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2|new\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70|7|\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50|2|\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150|6|\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80|5|\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2|new\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1|new\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3|\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30|1|\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200|5|\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30|1|\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100|3|\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100|4|"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/control/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70|7\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50|2\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150|6\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80|5\n20180110120052t|request1|1|1|1|7|1|N|20180110||article2|120|2\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30|1\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200|5\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30|1\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100|3\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100|4"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/control/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article6|50|2\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70|7\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50|2\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150|6\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80|5\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5||120|2\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30|1\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200|5\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30|1\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100|3\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100|4"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/control/part-05.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code|request_id\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2|\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4|\n20180110120052t|0|1|1|5|2|2||20170215|customer2|article2|50|2|request1\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70|7|\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50|2|\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150|6|\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80|5|\n20180110120052t||1|1|1|7|1|N|20180110|customer5|article2|120|2|request1\n20180110120052t|0|1|1|3|1|1||20160601|customer1|article1|150|1|request1\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3|\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30|1|\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200|5|\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30|1|\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100|3|\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100|4|"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/control/part-06.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount|code\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70|7\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50|2\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150|6\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80|5\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30|1\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200|5\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30|1\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100|3\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100|4"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100|1\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100|1\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50|6\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50|6\n00000000000000t|0|0|0|0|2|3|N|20170215|customer2|article1|30|1\n00000000000000t|0|0|0|0|3|1|N|20170215|customer1|article5|200|5\n00000000000000t|0|0|0|0|3|2|N|20170215|customer1|article2|120|2\n00000000000000t|0|0|0|0|3|3|N|20170215|customer1|article4|90|4\n00000000000000t|0|0|0|0|4|1|N|20170430|customer3|article3|80|3\n00000000000000t|0|0|0|0|4|2|N|20170430|customer3|article7|70|7\n00000000000000t|0|0|0|0|4|3|N|20170430|customer3|article1|30|1\n00000000000000t|0|0|0|0|4|4|N|20170430|customer3|article2|50|2\n00000000000000t|0|0|0|0|5|1|N|20170510|customer4|article6|150|6\n00000000000000t|0|0|0|0|5|2|N|20170510|customer4|article3|100|3\n00000000000000t|0|0|0|0|5|3|N|20170510|customer4|article5|80|5\n00000000000000t|0|0|0|0|6|1|N|20170601|customer2|article4|100|4\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/source/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code|new_column\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2|new\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100|1|new\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1|new\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50|6|new\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2|new\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120|2|new\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90|4|new\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|3|new"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/source/part-03.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|ARTICLE|amount|code\n20180110120052t|request1|1|1|1|7|1|N|20180110|article2|120|2\n20180110120052t|request1|1|1|2|1|1|X|20160601|article1|100|1\n20180110120052t|request1|1|1|3|1|1||20160601|article1|150|1\n20180110120052t|request1|1|1|4|2|2|X|20170215|article6|50|6\n20180110120052t|request1|1|1|5|2|2||20170215|article2|50|2\n20180110120052t|request1|1|1|6|3|2|D|20170215|article2|120|2\n20180110120052t|request1|1|1|7|3|3|R|20170215|article4|-90|4\n20180110120052t|request1|1|1|8|4|1|X|20170430|article3|80|3"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/source/part-04.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100|1\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50|6\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120|2\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90|4\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|3"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/source/part-05.csv",
    "content": "actrequest_timestamp|request_id|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100|1\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50|6\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120|2\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90|4\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|3"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/data/source/part-06.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5|article2|120|2\n20180110120052t|request1|1|1|2|1|1|X|20160601|customer1|article1|100|1\n20180110120052t|request1|1|1|3|1|1||20160601|customer1|article1|150|1\n20180110120052t|request1|1|1|4|2|2|X|20170215|customer2|article6|50|6\n20180110120052t|request1|1|1|5|2|2||20170215|customer2|article2|50|2\n20180110120052t|request1|1|1|6|3|2|D|20170215|customer1|article2|120|2\n20180110120052t|request1|1|1|7|3|3|R|20170215|customer1|article4|-90|4\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3|article3|80|3"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/schema/control/control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/schema/control/control_schema_add_column.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"new_column\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/schema/control/control_schema_rename.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_id\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/schema/source/source_part-01_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/schema/source/source_part-02_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"new_column\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/schema/source/source_part-03_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/schema/source/source_part-04_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/schema/source/source_part-05_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_id\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/delta_load/schema/source/source_part-06_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/batch_init.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/full_load/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/full_load/data\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/full_load/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/batch_merge_disabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/full_load/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/full_load/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"transformed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"cast\",\n          \"args\": {\n            \"cols\": {\n              \"code\": \"StringType\"\n            }\n          }\n        },\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"ARTICLE\": \"article\"\n            }\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"transformed_sales\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/full_load/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": false\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/batch_merge_enabled.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/full_load/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/full_load/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"transformed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"cast\",\n          \"args\": {\n            \"cols\": {\n              \"code\": \"StringType\"\n            }\n          }\n        },\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"ARTICLE\": \"article\"\n            }\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/full_load/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.databricks.delta.schema.autoMerge.enabled\": true\n  }\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/batch_overwrite.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/full_load/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/schema_evolution/full_load/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"transformed_sales\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"cast\",\n          \"args\": {\n            \"cols\": {\n              \"code\": \"StringType\"\n            }\n          }\n        },\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"ARTICLE\": \"article\"\n            },\n            \"escape_col_names\": false\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"transformed_sales\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/schema_evolution/full_load/data\",\n      \"options\": {\n        \"overwriteSchema\": true\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/data/control/part-02.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|article|amount|code|new_column\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100|1|\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100|1|\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2|\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3|\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4|\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50|6|\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50|6|\n20180110120052t|request1|1|1|1|7|1|N|20180110|customer5||120|2|new\n20180110120052t|request1|1|1|8|4|1|X|20170430|customer3||80|3|new"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/data/source/part-01.csv",
    "content": "actrequest_timestamp|request|datapakid|partno|record|salesorder|item|recordmode|date|customer|ARTICLE|amount|code\n00000000000000t|0|0|0|0|1|1|N||customer1|article1|100|1\n00000000000000t|0|0|0|0|1|1||20160601|customer1|article1|100|1\n00000000000000t|0|0|0|0|1|2|N|20160601|customer1|article2|200|2\n00000000000000t|0|0|0|0|1|3|N|20160601|customer1|article3|50|3\n00000000000000t|0|0|0|0|2|1|N|20170215|customer2|article4|10|4\n00000000000000t|0|0|0|0|2|2||20170215|customer2|article6|50|6\n00000000000000t|0|0|0|0|2|2|N||customer2|article6|50|6"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/data/source/part-02.csv",
    "content": "actrequest_timestamp|request_id|datapakid|partno|record|salesorder|item|recordmode|date|ARTICLE|amount|code|new_column\n20180110120052t|request1|1|1|1|7|1|N|20180110|article2|120|2|new\n20180110120052t|request1|1|1|2|1|1|X|20160601|article1|100|1|new\n20180110120052t|request1|1|1|3|1|1||20160601|article1|150|1|new\n20180110120052t|request1|1|1|4|2|2|X|20170215|article6|50|6|new\n20180110120052t|request1|1|1|5|2|2||20170215|article2|50|2|new\n20180110120052t|request1|1|1|6|3|2|D|20170215|article2|120|2|new\n20180110120052t|request1|1|1|7|3|3|R|20170215|article4|-90|4|new\n20180110120052t|request1|1|1|8|4|1|X|20170430|article3|80|3|new"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/schema/control/control_schema_merge_enabled.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_id\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"new_column\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/schema/control/control_schema_overwrite.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_id\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"new_column\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/schema/source/source_part-01_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/schema_evolution/full_load/schema/source/source_part-02_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"actrequest_timestamp\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"request_id\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"datapakid\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"partno\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"record\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"recordmode\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ARTICLE\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"code\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"new_column\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/sftp_reader/data/file.csv",
    "content": "column1|column2\n1|1"
  },
  {
    "path": "tests/resources/feature/sftp_reader/data/file1.csv",
    "content": "column1|column2\n2|2"
  },
  {
    "path": "tests/resources/feature/sftp_reader/data/file2.csv",
    "content": "column1|column2\n3|3"
  },
  {
    "path": "tests/resources/feature/sftp_reader/data/file3.json",
    "content": "{\"colUserName\":\"TestName\", \"colCity\":\"TestCity\", \"colState\":\"TestState\"}"
  },
  {
    "path": "tests/resources/feature/sftp_reader/data/file4.xml",
    "content": "<?xml version='1.0' encoding='utf-8'?>\n<data xmlns=\"http://example.com\">\n    <row>\n        <name>userOne</name>\n        <age>50</age>\n        <city>CityTest</city>\n    </row>\n    <row>\n        <name>userTwo</name>\n        <age>40</age>\n        <city>CityTest2</city>\n    </row>\n    <row>\n        <name>userThree</name>\n        <age>30</age>\n        <city>CityTest3</city>\n    </row>\n</data>"
  },
  {
    "path": "tests/resources/feature/sftp_reader/data/file5.txt",
    "content": "value1\nvalue2\nvalue3"
  },
  {
    "path": "tests/resources/feature/sharepoint/exceptions/acons/drive_exception.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sharepoint_input\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/sharepoint/data/\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sharepoint_output\",\n      \"input_id\": \"sharepoint_input\",\n      \"data_format\": \"sharepoint\",\n      \"sharepoint_opts\": {\n        \"client_id\": \"CLIENT_ID\",\n        \"tenant_id\": \"TENANT_TEST\",\n        \"secret\": \"CLIENT_SECRET\",\n        \"site_name\": \"mock_site\",\n        \"drive_name\": \"\",\n        \"folder_relative_path\": \"sp_test\",\n        \"file_name\": \"sharepoint_test.csv\",\n        \"local_path\": \"mock_path\",\n        \"conflict_behaviour\": \"replace\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/exceptions/acons/endpoint_exception.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sharepoint_input\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/sharepoint/data/\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sharepoint_output\",\n      \"input_id\": \"sharepoint_input\",\n      \"data_format\": \"sharepoint\",\n      \"sharepoint_opts\": {\n        \"client_id\": \"CLIENT_ID\",\n        \"tenant_id\": \"TENANT_TEST\",\n        \"secret\": \"CLIENT_SECRET\",\n        \"site_name\": \"mock_site\",\n        \"drive_name\": \"mock_drive\",\n        \"folder_relative_path\": \"sp_test\",\n        \"file_name\": \"sharepoint_test.csv\",\n        \"local_path\": \"mock_path\",\n        \"conflict_behaviour\": \"replace\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/exceptions/acons/local_path_exception.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sharepoint_input\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/sharepoint/data/\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sharepoint_output\",\n      \"input_id\": \"sharepoint_input\",\n      \"data_format\": \"sharepoint\",\n      \"sharepoint_opts\": {\n        \"client_id\": \"CLIENT_ID\",\n        \"tenant_id\": \"TENANT_TEST\",\n        \"secret\": \"CLIENT_SECRET\",\n        \"site_name\": \"mock_site\",\n        \"drive_name\": \"mock_drive\",\n        \"folder_relative_path\": \"sp_test\",\n        \"file_name\": \"sharepoint_test.csv\",\n        \"local_path\": \"\",\n        \"conflict_behaviour\": \"replace\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/exceptions/acons/site_exception.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sharepoint_input\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/sharepoint/data/\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sharepoint_output\",\n      \"input_id\": \"sharepoint_input\",\n      \"data_format\": \"sharepoint\",\n      \"sharepoint_opts\": {\n        \"client_id\": \"CLIENT_ID\",\n        \"tenant_id\": \"TENANT_TEST\",\n        \"secret\": \"CLIENT_SECRET\",\n        \"site_name\": \"\",\n        \"drive_name\": \"mock_drive\",\n        \"folder_relative_path\": \"sp_test\",\n        \"file_name\": \"sharepoint_test.csv\",\n        \"local_path\": \"mock_path\",\n        \"conflict_behaviour\": \"replace\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/exceptions/acons/streaming_exception.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sharepoint_input\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/sharepoint/data/\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sharepoint_output\",\n      \"input_id\": \"sharepoint_input\",\n      \"data_format\": \"sharepoint\",\n      \"sharepoint_opts\": {\n        \"client_id\": \"CLIENT_ID\",\n        \"tenant_id\": \"TENANT_TEST\",\n        \"secret\": \"CLIENT_SECRET\",\n        \"site_name\": \"files_ingestion\",\n        \"drive_name\": \"Exports_DART_dev\",\n        \"folder_relative_path\": \"sp_test\",\n        \"file_name\": \"sharepoint_test.csv\",\n        \"local_path\": \"LOCAL_PATH\",\n        \"conflict_behaviour\": \"replace\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/exceptions/schemas/schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_file_name_and_file_pattern_conflict_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_name\": \"sample_1.csv\",\n                \"file_pattern\": \"sample_*\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                }\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_conflict_file_name_pattern\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_conflict_file_name_pattern/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_file_name_unsupported_extension_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_name\": \"bad.txt\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                }\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_bad_extension\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_bad_extension/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_csv_archive_enabled_success.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_folder_archive_enabled\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder_archive_enabled/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_csv_archive_success_subfolder_override_success.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_pattern\": \"*\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true,\n                \"archive_success_subfolder\": \"processed\"\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_folder_archive_success_subfolder_override\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder_archive_success_subfolder_override/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_csv_no_csv_files_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_folder_no_csv_files\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder_no_csv_files/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_csv_one_file_schema_mismatch_custom_error_subfolder_should_archive_error.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_pattern\": \"*\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true,\n                \"archive_error_subfolder\": \"failed\"\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_folder_schema_mismatch_custom_error_subfolder\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder_schema_mismatch_custom_error_subfolder/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_csv_one_file_schema_mismatch_should_archive_error.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_pattern\": \"*\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true,\n                \"archive_success_subfolder\": \"done\",\n                \"archive_error_subfolder\": \"error\"\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_folder_schema_mismatch\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder_schema_mismatch/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_csv_pattern_matches_no_files_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_pattern\": \"does_not_match_*.csv\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_folder_pattern_matches_no_files\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder_pattern_matches_no_files/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_csv_pattern_success.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_pattern\": \"sample_*\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": false\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_folder_pattern\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder_pattern/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_csv_success.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_pattern\": \"*\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": false\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_folder\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_path_does_not_exist_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"missing_folder\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": false\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_missing_folder\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_missing_folder/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_folder_relative_path_looks_like_file_unsupported_extension_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test/bad.txt\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": false\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_folder_path_bad_ext\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_folder_path_bad_ext/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_archive_default_enabled_success.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_name\": \"sample_1.csv\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                }\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_single_archive_default_enabled\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_single_archive_default_enabled/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_archive_enabled_success.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_name\": \"sample_1.csv\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_single_archive_enabled\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_single_archive_enabled/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_archive_success_subfolder_override_success.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_name\": \"sample_1.csv\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true,\n                \"archive_success_subfolder\": \"processed\"\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_single_archive_success_subfolder_override\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_single_archive_success_subfolder_override/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_download_error_should_archive_error.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_name\": \"sample_1.csv\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_single_download_error\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_single_download_error/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_empty_file_should_archive_error.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_name\": \"empty.csv\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true,\n                \"archive_success_subfolder\": \"done\",\n                \"archive_error_subfolder\": \"error\"\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_single_empty_file\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_single_empty_file/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_full_path_success.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test/sample_1.csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": false\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_single_full_path\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_full_path/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_full_path_with_file_name_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test/sample_1.csv\",\n                \"file_name\": \"other.csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": false\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_conflict\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_conflict/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_full_path_with_file_pattern_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test/sample_1.csv\",\n                \"file_pattern\": \"*.csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                }\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_full_path_with_file_pattern_should_fail\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_full_path_with_file_pattern_should_fail/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_full_path_with_file_type_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test/sample_1.csv\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                }\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_full_path_with_file_type_should_fail\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_full_path_with_file_type_should_fail/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_spark_load_fails_should_archive_error.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_name\": \"sample_1.csv\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": true\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_single_spark_load_fails\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_single_spark_load_fails/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_single_csv_success.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_name\": \"sample_1.csv\",\n                \"file_type\": \"csv\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": false\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_single\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/acons/read_unsupported_file_type_should_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sharepoint_input\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"sharepoint\",\n            \"sharepoint_opts\": {\n                \"client_id\": \"CLIENT_ID\",\n                \"tenant_id\": \"TENANT_ID\",\n                \"secret\": \"SECRET\",\n                \"site_name\": \"mock_site\",\n                \"drive_name\": \"mock_drive\",\n                \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/tmp/\",\n                \"folder_relative_path\": \"sp_test\",\n                \"file_type\": \"json\",\n                \"local_options\": {\n                    \"header\": true,\n                    \"delimiter\": \",\",\n                    \"inferSchema\": true\n                },\n                \"archive_enabled\": false\n            }\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sharepoint_output\",\n            \"input_id\": \"sharepoint_input\",\n            \"data_format\": \"delta\",\n            \"db_table\": \"test_db.sharepoint_reader_bad_file_type\",\n            \"write_type\": \"overwrite\",\n            \"location\": \"/app/tests/lakehouse/out/feature/sharepoint/reader/delta_bad_file_type/\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/data/bad_schema.csv",
    "content": "col_a,col_c\n1,999"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/data/other.csv",
    "content": "col_a,col_b\n999,999\n"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/data/sample_1.csv",
    "content": "col_a,col_b\n1,2"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/data/sample_2.csv",
    "content": "col_a,col_b\n3,4\n"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/mocks/get_drive_id.json",
    "content": "{\n    \"value\": [\n        {\n            \"name\": \"mock_drive\",\n            \"id\": \"test_drive_id\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/mocks/get_file_metadata.json",
    "content": "{\n    \"id\": \"test_item_id\",\n    \"name\": \"sample.csv\",\n    \"createdDateTime\": \"2026-01-01T00:00:00Z\",\n    \"lastModifiedDateTime\": \"2026-01-01T00:00:00Z\",\n    \"@microsoft.graph.downloadUrl\": \"https://download.mock/sample.csv\"\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/mocks/get_site_id.json",
    "content": "{\n    \"id\": \"test_site_id\",\n    \"displayName\": \"mock_site\"\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/reader/mocks/rename_file.json",
    "content": "{}"
  },
  {
    "path": "tests/resources/feature/sharepoint/writer/acons/write_to_local_success.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sharepoint_input\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"location\": \"/app/tests/lakehouse/in/feature/sharepoint/data/\",\n      \"schema\": {\n        \"type\": \"struct\",\n        \"fields\": [\n          {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n          },\n          {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n          }\n        ]\n      }\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sharepoint_output\",\n      \"input_id\": \"sharepoint_input\",\n      \"data_format\": \"sharepoint\",\n      \"sharepoint_opts\": {\n        \"client_id\": \"CLIENT_ID\",\n        \"tenant_id\": \"TENANT_TEST\",\n        \"secret\": \"CLIENT_SECRET\",\n        \"site_name\": \"mock_site\",\n        \"drive_name\": \"mock_drive\",\n        \"folder_relative_path\": \"sp_test\",\n        \"file_name\": \"sharepoint_test\",\n        \"local_path\": \"/app/tests/lakehouse/out/feature/sharepoint/writer/data/\",\n        \"conflict_behaviour\": \"replace\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/writer/data/file_control.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/sharepoint/writer/data/file_source.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/sharepoint/writer/mocks/create_upload_session.json",
    "content": "{\n    \"uploadUrl\": \"test_site_id\"\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/writer/mocks/get_drive_id.json",
    "content": "{\n    \"value\": [\n        {\n            \"name\": \"mock_drive\",\n            \"id\": \"test_drive_id\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/writer/mocks/get_site_id.json",
    "content": "{\n    \"id\": \"test_site_id\",\n    \"displayName\": \"mock_site\"\n}"
  },
  {
    "path": "tests/resources/feature/sharepoint/writer/schemas/schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/compute_table_statistics/table_stats_complex_default_scenario1.json",
    "content": "{\n  \"function\": \"compute_table_statistics\",\n  \"table_or_view\": \"test_db.DummyTableBronzeComplexDefaultScenario1\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/compute_table_statistics/table_stats_complex_default_scenario2.json",
    "content": "{\n  \"function\": \"compute_table_statistics\",\n  \"table_or_view\": \"test_db.DummyTableBronzeComplexDefaultScenario2\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/compute_table_statistics/table_stats_complex_different_delimiter_scenario1.json",
    "content": "{\n  \"function\": \"compute_table_statistics\",\n  \"table_or_view\": \"test_db.DummyTableBronzeComplexDifferentDelimiterScenario1\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/compute_table_statistics/table_stats_complex_different_delimiter_scenario2.json",
    "content": "{\n  \"function\": \"compute_table_statistics\",\n  \"table_or_view\": \"test_db.DummyTableBronzeComplexDifferentDelimiterScenario2\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/compute_table_statistics/table_stats_simple_split_scenario.json",
    "content": "{\n  \"function\": \"compute_table_statistics\",\n  \"table_or_view\": \"test_db.DummyTableBronzeSimpleSplitScenario\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/create/acon_create_table.json",
    "content": "{\n  \"function\": \"create_table\",\n  \"path\": \"file:///app/tests/lakehouse/in/feature/table_manager/create/table/test_table.sql\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/create/acon_create_table_complex_default_scenario.json",
    "content": "{\n  \"function\": \"create_table\",\n  \"path\": \"file:///app/tests/lakehouse/in/feature/table_manager/create/table/test_table_complex_default_scenario.sql\",\n  \"delimiter\": \";\",\n  \"advanced_parser\": true\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/create/acon_create_table_complex_different_delimiter_scenario.json",
    "content": "{\n  \"function\": \"create_table\",\n  \"path\": \"file:///app/tests/lakehouse/in/feature/table_manager/create/table/test_table_complex_different_delimiter_scenario.sql\",\n  \"delimiter\": \"===\",\n  \"advanced_parser\": true\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/create/acon_create_table_simple_split_scenario.json",
    "content": "{\n  \"function\": \"create_table\",\n  \"path\": \"file:///app/tests/lakehouse/in/feature/table_manager/create/table/test_table_simple_split_scenario.sql\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/create/acon_create_view.json",
    "content": "{\n  \"function\": \"create_view\",\n  \"path\": \"file:///app/tests/lakehouse/in/feature/table_manager/create/view/test_view.sql\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/create/acon_create_view_complex_default_scenario.json",
    "content": "{\n  \"function\": \"create_view\",\n  \"path\": \"file:///app/tests/lakehouse/in/feature/table_manager/create/view/test_view_complex_default_scenario.sql\",\n  \"advanced_parser\": true\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/create/acon_create_view_complex_different_delimiter_scenario.json",
    "content": "{\n  \"function\": \"create_view\",\n  \"path\": \"file:///app/tests/lakehouse/in/feature/table_manager/create/view/test_view_complex_different_delimiter_scenario.sql\",\n  \"delimiter\": \"===\",\n  \"advanced_parser\": true\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/create/acon_create_view_simple_split_scenario.json",
    "content": "{\n  \"function\": \"create_view\",\n  \"path\": \"file:///app/tests/lakehouse/in/feature/table_manager/create/view/test_view_simple_split_scenario.sql\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/create/table/test_table_complex_default_scenario.sql",
    "content": "-- New table manager test table, to check if new parser works as expected and deals well with different delimiters (;).\n-- The parser must be able to deal with the delimiters that are inside of \"\", '', --, /* */.\nCREATE TABLE test_db.DummyTableBronzeComplexDefaultScenario1\n    (\n        id INT COMMENT 'id with special (< characters ;',\n        col1 STRING COMMENT 'col1 with >) special character \" and ;',\n        col2 INT COMMENT 'col2 with () special character \\\" and ;',\n        col3 BOOLEAN COMMENT 'col3 with special <> character \\\" and ;',\n        col4 STRING COMMENT \"col4 with special /* character ;\",\n        year INT COMMENT \"year with */ special character ;\",\n        month INT COMMENT \"month with special -- character ;\",\n        day INT COMMENT \"day with special \\\" character ;\"\n    )\nUSING DELTA PARTITIONED BY (year, month, day)\nLOCATION 'file:///app/tests/lakehouse/out/feature/table_manager/dummy_table_bronze/data_complex_default_scenario1'\nTBLPROPERTIES('lakehouse.primary_key'=' id, `col1`');\n-- New table manager test table, to check if new parser works as expected and deals well to different delimiters (;).\n-- The parser must be able to deal with the delimiters that are inside of \"\", '', --, /* */.\n/* New table manager test table, to check if new parser works as expected and deals well to different delimiters (;).\nThe parser must be able to deal with the delimiters that are inside of \"\", '', -- */\nCREATE TABLE test_db.DummyTableBronzeComplexDefaultScenario2\n    (\n    id INT COMMENT 'id with special (< characters ;',\n    col1 STRING COMMENT 'col1 with >) special character \" and ;',\n    col2 INT COMMENT 'col2 with () special character \\\" and ;',\n    col3 BOOLEAN COMMENT 'col3 with special <> character \\\" and ;',\n    col4 STRING COMMENT \"col4 with special /* character ;\",\n    year INT COMMENT \"year with */ special character ;\",\n    month INT COMMENT \"month with special -- character ;\",\n    day INT COMMENT \"day with special \\\" character ;\"\n    )\nUSING DELTA PARTITIONED BY (year, month, day)\nLOCATION 'file:///app/tests/lakehouse/out/feature/table_manager/dummy_table_bronze/data_complex_default_scenario2'\nTBLPROPERTIES('lakehouse.primary_key'=' id, `col1`')"
  },
  {
    "path": "tests/resources/feature/table_manager/create/table/test_table_complex_different_delimiter_scenario.sql",
    "content": "-- New table manager test table, to check if new parser works as expected and deals well with different delimiters (===).\n-- The parser must be able to deal with the delimiters that are inside of \"\", '', --, /* */.\nCREATE TABLE test_db.DummyTableBronzeComplexDifferentDelimiterScenario1\n    (\n        id INT COMMENT 'id with special (< characters ;',\n        col1 STRING COMMENT 'col1 with >) special character \" and ;',\n        col2 INT COMMENT 'col2 with () special character \\\" and ;',\n        col3 BOOLEAN COMMENT 'col3 with special <> character \\\" and ;',\n        col4 STRING COMMENT \"col4 with special /* character ;\",\n        year INT COMMENT \"year with */ special character ;\",\n        month INT COMMENT \"month with special -- character ;\",\n        day INT COMMENT \"day with special \\\" character ;\"\n    )\nUSING DELTA PARTITIONED BY (year, month, day)\nLOCATION 'file:///app/tests/lakehouse/out/feature/table_manager/dummy_table_bronze/data_complex_different_delimiter_scenario1'\nTBLPROPERTIES('lakehouse.primary_key'=' id, `col1`')===\n-- New table manager test table, to check if new parser works as expected and deals well to different delimiters (===).\n-- The parser must be able to deal with the delimiters that are inside of \"\", '', --, /* */.\n/* New table manager test table, to check if new parser works as expected and deals well to different delimiters (===).\nThe parser must be able to deal with the delimiters that are inside of \"\", '', -- */\nCREATE TABLE test_db.DummyTableBronzeComplexDifferentDelimiterScenario2\n    (\n    id INT COMMENT 'id with special (< characters ;',\n    col1 STRING COMMENT 'col1 with >) special character \" and ;',\n    col2 INT COMMENT 'col2 with () special character \\\" and ;',\n    col3 BOOLEAN COMMENT 'col3 with special <> character \\\" and ;',\n    col4 STRING COMMENT \"col4 with special /* character ;\",\n    year INT COMMENT \"year with */ special character ;\",\n    month INT COMMENT \"month with special -- character ;\",\n    day INT COMMENT \"day with special \\\" character ;\"\n    )\nUSING DELTA PARTITIONED BY (year, month, day)\nLOCATION 'file:///app/tests/lakehouse/out/feature/table_manager/dummy_table_bronze/data_complex_different_delimiter_scenario2'\nTBLPROPERTIES('lakehouse.primary_key'=' id, `col1`')"
  },
  {
    "path": "tests/resources/feature/table_manager/create/table/test_table_simple_split_scenario.sql",
    "content": "CREATE TABLE test_db.DummyTableBronzeSimpleSplitScenario\n    (id INT, col1 STRING, col2 INT, col3 BOOLEAN, col4 STRING, year INT, month INT, day INT)\nUSING DELTA PARTITIONED BY (year, month, day)\nLOCATION 'file:///app/tests/lakehouse/out/feature/table_manager/dummy_table_bronze/data_simple_split_scenario'\nTBLPROPERTIES('lakehouse.primary_key'=' id, `col1`')"
  },
  {
    "path": "tests/resources/feature/table_manager/create/view/test_view_complex_default_scenario.sql",
    "content": "-- New table manager test view, to check if new parser works as expected and deals well with different delimiters (;).\n-- The parser must be able to deal with the delimiters that are inside of \"\", '', --, /* */.\nCREATE VIEW test_db.DummyViewBronzeComplexDefaultScenario1 (id,col1,col2,col3,col4) AS\n    SELECT id,col1,CONCAT_WS(\";\",col2) AS col2,col3,col4\n    FROM test_db.DummyTableBronzeComplexDefaultScenario1;\n-- New table manager test view, to check if new parser works as expected and deals well with different delimiters (;).\n-- The parser must be able to deal with the delimiters that are inside of \"\", '', --, /* */.\nCREATE VIEW test_db.DummyViewBronzeComplexDefaultScenario2 (id,col1,col2,col3,col4) AS\n    SELECT id,col1,col2,CONCAT_WS(\";\",col3) AS col3,col4\n    FROM test_db.DummyTableBronzeComplexDefaultScenario2"
  },
  {
    "path": "tests/resources/feature/table_manager/create/view/test_view_complex_different_delimiter_scenario.sql",
    "content": "-- New table manager test view, to check if new parser works as expected and deals well with different delimiters (===).\n-- The parser must be able to deal with the delimiters that are inside of \"\", '', --, /* */.\nCREATE VIEW test_db.DummyViewBronzeComplexDifferentDelimiterScenario1 (id,col1,col2,col3,col4) AS\n    SELECT id,col1,CONCAT_WS(\";\",col2) AS col2,col3,col4\n    FROM test_db.DummyTableBronzeComplexDifferentDelimiterScenario1===\n-- New table manager test view, to check if new parser works as expected and deals well with different delimiters (===).\n-- The parser must be able to deal with the delimiters that are inside of \"\", '', --, /* */.\nCREATE VIEW test_db.DummyViewBronzeComplexDifferentDelimiterScenario2 (id,col1,col2,col3,col4) AS\n    SELECT id,col1,col2,CONCAT_WS(\";\",col3) AS col3,col4\n    FROM test_db.DummyTableBronzeComplexDifferentDelimiterScenario2"
  },
  {
    "path": "tests/resources/feature/table_manager/create/view/test_view_simple_split_scenario.sql",
    "content": "CREATE VIEW test_db.DummyViewBronzeSimpleSplitScenario (id,col1,col2,col3,col4) AS\n    SELECT id,col1,col2,col3,col4\n    FROM test_db.DummyTableBronzeSimpleSplitScenario"
  },
  {
    "path": "tests/resources/feature/table_manager/delete/acon_delete_where_table_simple_split_scenario.json",
    "content": "{\n  \"function\": \"delete_where\",\n  \"table_or_view\": \"test_db.DummyTableBronzeSimpleSplitScenario\",\n  \"where_clause\": \"year=2021\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/describe/acon_describe_simple_split_scenario.json",
    "content": "{\n  \"function\": \"describe\",\n  \"table_or_view\": \"test_db.DummyTableBronzeSimpleSplitScenario\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/drop/acon_drop_table_simple_split_scenario.json",
    "content": "{\n  \"function\": \"drop_table\",\n  \"table_or_view\": \"test_db.DummyTableBronzeSimpleSplitScenario\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/drop/acon_drop_view_simple_split_scenario.json",
    "content": "{\n  \"function\": \"drop_view\",\n  \"table_or_view\": \"test_db.DummyViewBronzeSimpleSplitScenario\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/execute_sql/acon_execute_sql_complex_default_scenario.json",
    "content": "{\n  \"function\": \"execute_sql\",\n  \"sql\": \"/* New table manager test view, to check if new parser works as expected and deals well to different delimiters (;).The parser must be able to deal with the delimiters that are inside of \\\"\\\", '', --, */ ALTER TABLE test_db.DummyTableBronzeComplexDefaultScenario1 ALTER COLUMN col1 COMMENT 'comment ; for col1'; /* New table manager test view, to check if new parser works as expected and deals well to different delimiters (;). The parser must be able to deal with the delimiters that are inside of \\\"\\\", '', --, */ ALTER TABLE test_db.DummyTableBronzeComplexDefaultScenario1 ALTER COLUMN col2 COMMENT 'comment for col2'; /* New table manager test view, to check if new parser works as expected and deals well to different delimiters (;). The parser must be able to deal with the delimiters that are inside of \\\"\\\", '', --, */ ALTER TABLE test_db.DummyTableBronzeComplexDefaultScenario2 ALTER COLUMN col1 COMMENT 'comment \\\" for col1'; /* New table manager test view, to check if new parser works as expected and deals well to different delimiters (;). The parser must be able to deal with the delimiters that are inside of \\\"\\\", '', --, */ ALTER TABLE test_db.DummyTableBronzeComplexDefaultScenario2 ALTER COLUMN col2 COMMENT 'comment () <> for col2'\",\n  \"advanced_parser\": \"True\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/execute_sql/acon_execute_sql_complex_different_delimiter_scenario.json",
    "content": "{\n  \"function\": \"execute_sql\",\n  \"sql\": \"/* New table manager test view, to check if new parser works as expected and deals well to different delimiters (===).The parser must be able to deal with the delimiters that are inside of \\\"\\\", '', --, */ ALTER TABLE test_db.DummyTableBronzeComplexDefaultScenario1 ALTER COLUMN col1 COMMENT 'comment === for col1'=== /* New table manager test view, to check if new parser works as expected and deals well to different delimiters (===). The parser must be able to deal with the delimiters that are inside of \\\"\\\", '', --, */ ALTER TABLE test_db.DummyTableBronzeComplexDefaultScenario1 ALTER COLUMN col2 COMMENT 'comment for col2'=== /* New table manager test view, to check if new parser works as expected and deals well to different delimiters (===). The parser must be able to deal with the delimiters that are inside of \\\"\\\", '', --, */ ALTER TABLE test_db.DummyTableBronzeComplexDefaultScenario2 ALTER COLUMN col1 COMMENT 'comment \\\" for col1'=== /* New table manager test view, to check if new parser works as expected and deals well to different delimiters (===). The parser must be able to deal with the delimiters that are inside of \\\"\\\", '', --, */ ALTER TABLE test_db.DummyTableBronzeComplexDefaultScenario2 ALTER COLUMN col2 COMMENT 'comment () <> for col2'\",\n  \"delimiter\": \"===\",\n  \"advanced_parser\": \"True\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/execute_sql/acon_execute_sql_simple_split_scenario.json",
    "content": "{\n  \"function\": \"execute_sql\",\n  \"sql\": \"ALTER TABLE test_db.DummyTableBronzeSimpleSplitScenario ALTER COLUMN col1 COMMENT 'comment for col1'\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/get_tbl_pk/get_tbl_pk_simple_split_scenario.json",
    "content": "{\n  \"function\": \"get_tbl_pk\",\n  \"table_or_view\": \"test_db.DummyTableBronzeSimpleSplitScenario\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/optimize/optimize_location.json",
    "content": "{\n  \"function\": \"optimize\",\n  \"path\": \"file:///app/tests/lakehouse/out/feature/table_manager/dummy_table_bronze/data\",\n  \"where_clause\": \"year >= 2021 and month >= 09 and day > 01\",\n  \"optimize_zorder_col_list\": \"col1,col2\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/optimize/optimize_location_simple_split_scenario.json",
    "content": "{\n  \"function\": \"optimize\",\n  \"path\": \"file:///app/tests/lakehouse/out/feature/table_manager/dummy_table_bronze/data_simple_split_scenario\",\n  \"where_clause\": \"year >= 2021 and month >= 09 and day > 01\",\n  \"optimize_zorder_col_list\": \"col1,col2\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/optimize/optimize_table.json",
    "content": "{\n  \"function\": \"optimize\",\n  \"table_or_view\": \"test_db.DummyTableBronze\",\n  \"where_clause\": \"year >= 2021 and month >= 09 and day > 01\",\n  \"optimize_zorder_col_list\": \"col1,col2\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/optimize/optimize_table_simple_split_scenario.json",
    "content": "{\n  \"function\": \"optimize\",\n  \"table_or_view\": \"test_db.DummyTableBronzeSimpleSplitScenario\",\n  \"where_clause\": \"year >= 2021 and month >= 09 and day > 01\",\n  \"optimize_zorder_col_list\": \"col1,col2\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/show_tbl_properties/show_tbl_properties_simple_split_scenario.json",
    "content": "{\n  \"function\": \"show_tbl_properties\",\n  \"table_or_view\": \"test_db.DummyTableBronzeSimpleSplitScenario\"\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/vacuum/acon_vacuum_location.json",
    "content": "{\n  \"function\": \"vacuum\",\n  \"path\": \"file:///app/tests/lakehouse/out/feature/table_manager/dummy_table_bronze/data\",\n  \"vacuum_hours\": 185\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/vacuum/acon_vacuum_location_simple_split_scenario.json",
    "content": "{\n  \"function\": \"vacuum\",\n  \"path\": \"file:///app/tests/lakehouse/out/feature/table_manager/dummy_table_bronze/data_simple_split_scenario\",\n  \"vacuum_hours\": 185\n}"
  },
  {
    "path": "tests/resources/feature/table_manager/vacuum/acon_vacuum_table_simple_split_scenario.json",
    "content": "{\n  \"function\": \"vacuum\",\n  \"table_or_view\": \"test_db.DummyTableBronzeSimpleSplitScenario\",\n  \"vacuum_hours\": 168\n}"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/acons/batch.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sales_historical\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/schema/sales_schema.json\",\n            \"options\": {\n                \"header\": true,\n                \"delimiter\": \"|\",\n                \"mode\": \"FAILFAST\"\n            },\n            \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/source/sales_historical/\"\n        },\n        {\n            \"spec_id\": \"sales_new\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/schema/sales_schema.json\",\n            \"options\": {\n                \"header\": true,\n                \"delimiter\": \"|\",\n                \"mode\": \"FAILFAST\"\n            },\n            \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/source/sales_new/\"\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"incremented_historical\",\n            \"input_id\": \"sales_historical\",\n            \"transformers\": [\n                {\n                    \"function\": \"with_literals\",\n                    \"args\": {\n                        \"literals\": {\n                            \"is_historical\": true\n                        }\n                    }\n                }\n            ]\n        },\n        {\n            \"spec_id\": \"incremented_new\",\n            \"input_id\": \"sales_new\",\n            \"transformers\": [\n                {\n                    \"function\": \"with_literals\",\n                    \"args\": {\n                        \"literals\": {\n                            \"is_historical\": false\n                        }\n                    }\n                }\n            ]\n        },\n        {\n            \"spec_id\": \"union_dataframes\",\n            \"input_id\": \"incremented_historical\",\n            \"transformers\": [\n                {\n                    \"function\": \"union\",\n                    \"args\": {\"union_with\": [\"incremented_new\"]}\n                }\n            ]\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sales\",\n            \"input_id\": \"union_dataframes\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"date\"],\n            \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/chain_transformations/batch/data\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/acons/streaming.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sales_historical\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"csv\",\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/schema/sales_schema.json\",\n            \"options\": {\n                \"header\": true,\n                \"delimiter\": \"|\",\n                \"mode\": \"FAILFAST\"\n            },\n            \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/source/sales_historical/\"\n        },\n        {\n            \"spec_id\": \"sales_new\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"csv\",\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/schema/sales_schema.json\",\n            \"options\": {\n                \"header\": true,\n                \"delimiter\": \"|\",\n                \"mode\": \"FAILFAST\"\n            },\n            \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/source/sales_new/\"\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"incremented_historical\",\n            \"input_id\": \"sales_historical\",\n            \"transformers\": [\n                {\n                    \"function\": \"with_literals\",\n                    \"args\": {\n                        \"literals\": {\n                            \"is_historical\": true\n                        }\n                    }\n                }\n            ]\n        },\n        {\n            \"spec_id\": \"incremented_new\",\n            \"input_id\": \"sales_new\",\n            \"transformers\": [\n                {\n                    \"function\": \"with_literals\",\n                    \"args\": {\n                        \"literals\": {\n                            \"is_historical\": false\n                        }\n                    }\n                }\n            ]\n        },\n        {\n            \"spec_id\": \"union_dataframes\",\n            \"input_id\": \"incremented_historical\",\n            \"transformers\": [\n                {\n                    \"function\": \"union\",\n                    \"args\": {\"union_with\": [\"incremented_new\"]}\n                }\n            ]\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sales\",\n            \"input_id\": \"union_dataframes\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"date\"],\n            \"options\": {\n                \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/chain_transformations/streaming/checkpoint\"\n            },\n            \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/chain_transformations/streaming/data\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/acons/streaming_batch.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sales_historical\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"csv\",\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/schema/sales_schema.json\",\n            \"options\": {\n                \"header\": true,\n                \"delimiter\": \"|\",\n                \"mode\": \"FAILFAST\"\n            },\n            \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/source/sales_historical/\"\n        },\n        {\n            \"spec_id\": \"sales_new\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"csv\",\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/schema/sales_schema.json\",\n            \"options\": {\n                \"header\": true,\n                \"delimiter\": \"|\",\n                \"mode\": \"FAILFAST\"\n            },\n            \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/source/sales_new/\"\n        },\n        {\n            \"spec_id\": \"customers\",\n            \"read_type\": \"batch\",\n            \"data_format\": \"csv\",\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/schema/customer_schema.json\",\n            \"options\": {\n                \"header\": true,\n                \"delimiter\": \"|\",\n                \"mode\": \"FAILFAST\"\n            },\n            \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/source/customers/\"\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"incremented_historical\",\n            \"input_id\": \"sales_historical\",\n            \"transformers\": [\n                {\n                    \"function\": \"with_literals\",\n                    \"args\": {\n                        \"literals\": {\n                            \"is_historical\": true\n                        }\n                    }\n                }\n            ]\n        },\n        {\n            \"spec_id\": \"incremented_new\",\n            \"input_id\": \"sales_new\",\n            \"transformers\": [\n                {\n                    \"function\": \"with_literals\",\n                    \"args\": {\n                        \"literals\": {\n                            \"is_historical\": false\n                        }\n                    }\n                }\n            ]\n        },\n        {\n            \"spec_id\": \"union_dataframes\",\n            \"input_id\": \"incremented_historical\",\n            \"transformers\": [\n                {\n                    \"function\": \"union\",\n                    \"args\": {\"union_with\": [\"incremented_new\"]}\n                }\n            ]\n        },\n        {\n            \"spec_id\": \"join_with_customers\",\n            \"input_id\": \"union_dataframes\",\n            \"force_streaming_foreach_batch_processing\": true,\n            \"transformers\": [\n                {\n                    \"function\": \"join\",\n                    \"args\": {\n                        \"join_with\": \"customers\",\n                        \"join_type\": \"left outer\",\n                        \"join_condition\": \"a.customer = b.customer\",\n                        \"select_cols\": [\"a.*\", \"b.name as 
customer_name\"]\n                    }\n                },\n                {\"function\": \"with_row_id\"}\n            ]\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sales\",\n            \"input_id\": \"join_with_customers\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"partitions\": [\"date\"],\n            \"options\": {\n                \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/chain_transformations/streaming_batch/checkpoint\"\n            },\n            \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/chain_transformations/streaming_batch/data\"\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/acons/write_streaming_struct_data.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sales_source\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"mode\": \"FAILFAST\",\n                \"header\": true,\n                \"delimiter\": \"|\"\n            },\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/schema/struct_data_schema.json\",\n            \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/source/struct_data/\"\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"first_transform\",\n            \"input_id\": \"sales_source\",\n            \"transformers\": [\n                {\n                    \"function\": \"cast\",\n                    \"args\": {\n                        \"cols\": {\n                            \"date\": \"StringType\",\n                            \"amount\": \"StringType\"\n                        }\n                    }\n                },\n                {\n                    \"function\": \"rename\",\n                    \"args\": {\n                        \"cols\": {\n                            \"date\": \"date2\",\n                            \"customer\": \"customer2\"\n                        }\n                    }\n                },\n                {\n                    \"function\": \"with_expressions\",\n                    \"args\": {\n                        \"cols_and_exprs\": {\n                            \"constant\": \"'just a constant'\",\n                            \"length_customer2\": \"length(customer2)\"\n                        }\n                    }\n                },\n                {\n                    \"function\": \"from_json\",\n                    \"args\": {\n                        \"input_col\": \"sample\",\n                        \"schema\": {\n                            \"type\": \"struct\",\n                            \"fields\": [\n                                {\n                                    \"name\": \"field1\",\n                                    \"type\": \"string\",\n                                    \"nullable\": true,\n                                    \"metadata\": {}\n                                },\n                                {\n                                    \"name\": \"field2\",\n                                    \"type\": \"string\",\n                                    \"nullable\": true,\n                                    \"metadata\": {}\n                                },\n                                {\n                                    \"name\": \"field3\",\n                                    \"type\": \"double\",\n                                    \"nullable\": true,\n                                    \"metadata\": {}\n                                },\n                                {\n                                    \"name\": \"field4\",\n                                    \"type\": {\n                                        \"type\": \"struct\",\n                                        \"fields\": [\n                                            {\n                                                \"name\": \"field1\",\n                                                \"type\": \"string\",\n                                                \"nullable\": true,\n                                                \"metadata\": 
{}\n                                            },\n                                            {\n                                                \"name\": \"field2\",\n                                                \"type\": \"string\",\n                                                \"nullable\": true,\n                                                \"metadata\": {}\n                                            }\n                                        ]\n                                    },\n                                    \"nullable\": true,\n                                    \"metadata\": {}\n                                }\n                            ]\n                        }\n                    }\n                },\n                {\n                    \"function\": \"to_json\",\n                    \"args\": {\n                        \"in_cols\": [\n                            \"item\",\n                            \"amount\"\n                        ],\n                        \"out_col\": \"item_amount_json\"\n                    }\n                },\n                {\n                    \"function\": \"flatten_schema\",\n                    \"args\": {\n                        \"max_level\": 1\n                    }\n                }\n            ]\n        },\n        {\n            \"spec_id\": \"second_transform\",\n            \"input_id\": \"first_transform\",\n            \"force_streaming_foreach_batch_processing\": true,\n            \"transformers\": [\n                {\n                    \"function\": \"column_filter_exp\",\n                    \"args\": {\n                        \"exp\": [\"salesorder\",\"item\",\"article\",\"sample_json_field1\",\"sample_json_field4\",\"item_amount_json\"]\n                    }\n                }\n            ]\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sales_bronze\",\n            \"input_id\": \"second_transform\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/chain_transformations/write_streaming_struct_data/data\",\n            \"options\": {\n                \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/chain_transformations/write_streaming_struct_data/checkpoint\"\n            }\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/acons/write_streaming_struct_data_fail.json",
    "content": "{\n    \"input_specs\": [\n        {\n            \"spec_id\": \"sales_source\",\n            \"read_type\": \"streaming\",\n            \"data_format\": \"csv\",\n            \"options\": {\n                \"mode\": \"FAILFAST\",\n                \"header\": true,\n                \"delimiter\": \"|\"\n            },\n            \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/schema/struct_data_schema.json\",\n            \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/chain_transformations/source/struct_data/\"\n        }\n    ],\n    \"transform_specs\": [\n        {\n            \"spec_id\": \"first_transform\",\n            \"input_id\": \"sales_source\",\n            \"force_streaming_foreach_batch_processing\": true,\n            \"transformers\": [\n                {\n                    \"function\": \"cast\",\n                    \"args\": {\n                        \"cols\": {\n                            \"date\": \"StringType\",\n                            \"amount\": \"StringType\"\n                        }\n                    }\n                },\n                {\n                    \"function\": \"rename\",\n                    \"args\": {\n                        \"cols\": {\n                            \"date\": \"date2\",\n                            \"customer\": \"customer2\"\n                        }\n                    }\n                },\n                {\n                    \"function\": \"with_expressions\",\n                    \"args\": {\n                        \"cols_and_exprs\": {\n                            \"constant\": \"'just a constant'\",\n                            \"length_customer2\": \"length(customer2)\"\n                        }\n                    }\n                },\n                {\n                    \"function\": \"from_json\",\n                    \"args\": {\n                        \"input_col\": \"sample\",\n                        \"schema\": {\n                            \"type\": \"struct\",\n                            \"fields\": [\n                                {\n                                    \"name\": \"field1\",\n                                    \"type\": \"string\",\n                                    \"nullable\": true,\n                                    \"metadata\": {}\n                                },\n                                {\n                                    \"name\": \"field2\",\n                                    \"type\": \"string\",\n                                    \"nullable\": true,\n                                    \"metadata\": {}\n                                },\n                                {\n                                    \"name\": \"field3\",\n                                    \"type\": \"double\",\n                                    \"nullable\": true,\n                                    \"metadata\": {}\n                                },\n                                {\n                                    \"name\": \"field4\",\n                                    \"type\": {\n                                        \"type\": \"struct\",\n                                        \"fields\": [\n                                            {\n                                                \"name\": \"field1\",\n                                                \"type\": \"string\",\n                                                \"nullable\": 
true,\n                                                \"metadata\": {}\n                                            },\n                                            {\n                                                \"name\": \"field2\",\n                                                \"type\": \"string\",\n                                                \"nullable\": true,\n                                                \"metadata\": {}\n                                            }\n                                        ]\n                                    },\n                                    \"nullable\": true,\n                                    \"metadata\": {}\n                                }\n                            ]\n                        }\n                    }\n                },\n                {\n                    \"function\": \"to_json\",\n                    \"args\": {\n                        \"in_cols\": [\n                            \"item\",\n                            \"amount\"\n                        ],\n                        \"out_col\": \"item_amount_json\"\n                    }\n                },\n                {\n                    \"function\": \"flatten_schema\",\n                    \"args\": {\n                        \"max_level\": 1\n                    }\n                }\n            ]\n        },\n        {\n            \"spec_id\": \"second_transform\",\n            \"input_id\": \"first_transform\",\n            \"force_streaming_foreach_batch_processing\": true,\n            \"transformers\": [\n                {\n                    \"function\": \"column_filter_exp\",\n                    \"args\": {\n                        \"exp\": [\"salesorder\",\"item\",\"article\",\"sample_json_field1\",\"sample_json_field4\",\"item_amount_json\"]\n                    }\n                }\n            ]\n        }\n    ],\n    \"output_specs\": [\n        {\n            \"spec_id\": \"sales_bronze\",\n            \"input_id\": \"second_transform\",\n            \"write_type\": \"append\",\n            \"data_format\": \"delta\",\n            \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/chain_transformations/write_streaming_struct_data_fail/data\",\n            \"options\": {\n                \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/chain_transformations/write_streaming_struct_data_fail/checkpoint\"\n            }\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/control/chain_control.csv",
    "content": "salesorder|item|date|customer|article|amount|is_historical|customer_name|lhe_row_id\n0|1|20140601|customer1|article1|1000|true|Anna|0\n0|2|20140601|customer1|article2|2000|true|Anna|8589934592\n0|3|20140601|customer1|article3|500|true|Anna|1\n1|1|20150601|customer1|article1|1000|true|Anna|2\n1|2|20150601|customer1|article2|2000|true|Anna|8589934593\n1|3|20150601|customer1|article3|500|true|Anna|8589934594\n2|1|20160215|customer2|article4|1000|false|John|3\n2|2|20160215|customer2|article6|5000|false|John|8589934595\n2|3|20160215|customer2|article1|3000|false|John|4\n3|1|20160215|customer1|article5|20000|false|Anna|8589934596\n6|1|20160218|customer3|article7|100|false|Sarah|5\n6|2|20160218|customer3|article9|500|false|Sarah|6\n6|3|20160218|customer3|article8|300|false|Sarah|8589934597\n7|1|20160218|customer5|article7|2000|false||8589934598"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/control/struct_data.json",
    "content": "[\n {\n   \"salesorder\": 1,\n   \"item\": 1,\n   \"article\": \"article1\",\n   \"amount\": \"1000\",\n   \"sample\": \"{\\\"field1\\\": \\\"value1\\\", \\\"field2\\\": \\\"value2\\\", \\\"field4\\\": {\\\"field1\\\": \\\"value1\\\", \\\"field2\\\": \\\"value2\\\"}}\",\n   \"date2\": \"20160601\",\n   \"customer2\": \"customer1\",\n   \"constant\": \"just a constant\",\n   \"length_customer2\": 9,\n   \"sample_json_field1\": \"value1\",\n   \"sample_json_field2\": \"value2\",\n   \"sample_json_field3\": null,\n   \"sample_json_field4\": {\"field1\": \"value1\", \"field2\": \"value2\"},\n   \"item_amount_json\": \"{\\\"item\\\":1,\\\"amount\\\":\\\"1000\\\"}\"\n },\n {\n   \"salesorder\": 1,\n   \"item\": 2,\n   \"article\": \"article2\",\n   \"amount\": \"2000\",\n   \"sample\": \"{\\\"field1\\\": \\\"value3\\\", \\\"field2\\\": \\\"value4\\\", \\\"field4\\\": {\\\"field1\\\": \\\"1value\\\", \\\"field2\\\": \\\"2value\\\"}}\",\n   \"date2\": \"20160601\",\n   \"customer2\": \"customer1\",\n   \"constant\": \"just a constant\",\n   \"length_customer2\": 9,\n   \"sample_json_field1\": \"value3\",\n   \"sample_json_field2\": \"value4\",\n   \"sample_json_field3\": null,\n   \"sample_json_field4\": {\"field1\": \"1value\", \"field2\": \"2value\"},\n   \"item_amount_json\": \"{\\\"item\\\":2,\\\"amount\\\":\\\"2000\\\"}\"\n },\n {\n   \"salesorder\": 1,\n   \"item\": 3,\n   \"article\": \"article3\",\n   \"amount\": \"500\",\n   \"sample\": \"{\\\"field1\\\": \\\"value5\\\", \\\"field3\\\": 6.25, \\\"field4\\\": {\\\"field1\\\": \\\"1value1\\\", \\\"field2\\\": \\\"2value2\\\"}}\",\n   \"date2\": \"20160601\",\n   \"customer2\": \"customer1\",\n   \"constant\": \"just a constant\",\n   \"length_customer2\": 9,\n   \"sample_json_field1\": \"value5\",\n   \"sample_json_field2\": null,\n   \"sample_json_field3\": 6.25,\n   \"sample_json_field4\": {\"field1\": \"1value1\", \"field2\": \"2value2\"},\n   \"item_amount_json\": \"{\\\"item\\\":3,\\\"amount\\\":\\\"500\\\"}\"\n }\n]"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/schema/customer_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"name\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"birth_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/schema/sales_schema.json",
    "content": "{\n    \"type\": \"struct\",\n    \"fields\": [\n        {\n            \"name\": \"salesorder\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n        },\n        {\n            \"name\": \"item\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n        },\n        {\n            \"name\": \"date\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n        },\n        {\n            \"name\": \"customer\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n        },\n        {\n            \"name\": \"article\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n        },\n        {\n            \"name\": \"amount\",\n            \"type\": \"integer\",\n            \"nullable\": true,\n            \"metadata\": {}\n        }\n    ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/schema/struct_data_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"sample\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/source/customers.csv",
    "content": "customer|name|birth_date\ncustomer1|Anna|01012002\ncustomer2|John|04051980\ncustomer3|Sarah|02051940"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/source/sales_historical.csv",
    "content": "salesorder|item|date|customer|article|amount\n0|1|20140601|customer1|article1|1000\n0|2|20140601|customer1|article2|2000\n0|3|20140601|customer1|article3|500\n1|1|20150601|customer1|article1|1000\n1|2|20150601|customer1|article2|2000\n1|3|20150601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/source/sales_new.csv",
    "content": "salesorder|item|date|customer|article|amount\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000\n6|1|20160218|customer3|article7|100\n6|2|20160218|customer3|article9|500\n6|3|20160218|customer3|article8|300\n7|1|20160218|customer5|article7|2000"
  },
  {
    "path": "tests/resources/feature/transformations/chain_transformations/source/struct_data.csv",
    "content": "salesorder|item|date|customer|article|amount|sample\n1|1|20160601|customer1|article1|1000|{\"field1\":\"value1\",\"field2\":\"value2\",\"field4\":{\"field1\":\"value1\",\"field2\":\"value2\"}}\n1|2|20160601|customer1|article2|2000|{\"field1\":\"value3\",\"field2\":\"value4\",\"field4\":{\"field1\":\"1value\",\"field2\":\"2value\"}}\n1|3|20160601|customer1|article3|500|{\"field1\":\"value5\",\"field3\":6.25,\"field4\":{\"field1\":\"1value1\",\"field2\":\"2value2\"}}"
  },
  {
    "path": "tests/resources/feature/transformations/column_creators/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/column_creators/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/column_creators/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"with_literals\",\n          \"args\": {\n            \"literals\": {\n              \"dummy_string\": \"this is a string\",\n              \"dummy_int\": 100,\n              \"dummy_double\": 10.2,\n              \"dummy_boolean\": true\n            }\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/column_creators/batch/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/column_creators/batch/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_creators/data/control/part-01.json",
    "content": "[\n {\n   \"salesorder\": 1,\n   \"item\": 1,\n   \"article\": \"article1\",\n   \"amount\": 1000,\n   \"date\": 20160601,\n   \"customer\": \"customer1\",\n   \"dummy_string\": \"this is a string\",\n   \"dummy_int\": 100,\n   \"dummy_double\": 10.2,\n   \"dummy_boolean\": true\n },\n {\n   \"salesorder\": 1,\n   \"item\": 2,\n   \"article\": \"article2\",\n   \"amount\": 2000,\n   \"date\": 20160601,\n   \"customer\": \"customer1\",\n   \"dummy_string\": \"this is a string\",\n   \"dummy_int\": 100,\n   \"dummy_double\": 10.2,\n   \"dummy_boolean\": true\n },\n {\n   \"salesorder\": 1,\n   \"item\": 3,\n   \"article\": \"article3\",\n   \"amount\": 500,\n   \"date\": 20160601,\n   \"customer\": \"customer1\",\n   \"dummy_string\": \"this is a string\",\n   \"dummy_int\": 100,\n   \"dummy_double\": 10.2,\n   \"dummy_boolean\": true\n }\n]"
  },
  {
    "path": "tests/resources/feature/transformations/column_creators/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/column_creators/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_creators/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/column_creators/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/column_creators/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"with_literals\",\n          \"args\": {\n            \"literals\": {\n              \"dummy_string\": \"this is a string\",\n              \"dummy_int\": 100,\n              \"dummy_double\": 10.2,\n              \"dummy_boolean\": true\n            }\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/column_creators/streaming/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/column_creators/streaming/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/explode_arrays/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"json\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/explode_arrays/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/explode_arrays/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"date\": \"date2\",\n              \"customer\": \"customer2\"\n            }\n          }\n        },\n        {\n          \"function\": \"with_expressions\",\n          \"args\": {\n            \"cols_and_exprs\": {\n              \"constant\": \"'just a constant'\",\n              \"length_customer2\": \"length(customer2)\"\n            }\n          }\n        },\n        {\n          \"function\": \"to_json\",\n          \"args\": {\n            \"in_cols\": [\n              \"item\",\n              \"amount\"\n            ],\n            \"out_col\": \"item_amount_json\"\n          }\n        },\n        {\n          \"function\": \"explode_columns\",\n          \"args\": {\n            \"explode_arrays\": true\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/column_reshapers/explode_arrays/batch/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/explode_arrays/data/control/part-01.csv",
    "content": "salesorder|item|article|amount|manufacturing_countries|sub_articles|date2|customer2|constant|length_customer2|item_amount_json\n1|1|article1|1000|Portugal|article101|20220101|customer1|just a constant|9|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|Portugal|article102|20220101|customer1|just a constant|9|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|Spain|article101|20220101|customer1|just a constant|9|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|Spain|article102|20220101|customer1|just a constant|9|{\"item\":1,\"amount\":1000}\n1|2|article2|1000|Portugal|article201|20220102|customer2|just a constant|9|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|Portugal|article202|20220102|customer2|just a constant|9|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|Portugal|article203|20220102|customer2|just a constant|9|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|Algeria|article201|20220102|customer2|just a constant|9|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|Algeria|article202|20220102|customer2|just a constant|9|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|Algeria|article203|20220102|customer2|just a constant|9|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|Italy|article201|20220102|customer2|just a constant|9|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|Italy|article202|20220102|customer2|just a constant|9|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|Italy|article203|20220102|customer2|just a constant|9|{\"item\":2,\"amount\":1000}\n2|1|article3|1200|Norway|article301|20220102|customer3|just a constant|9|{\"item\":1,\"amount\":1200}\n2|1|article4|1500|Portugal|article401|20220103|customer2|just a constant|9|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|Portugal|article402|20220103|customer2|just a constant|9|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|Portugal|article403|20220103|customer2|just a constant|9|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|Malaysia|article401|20220103|customer2|just a constant|9|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|Malaysia|article402|20220103|customer2|just a constant|9|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|Malaysia|article403|20220103|customer2|just a constant|9|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|Germany|article401|20220103|customer2|just a constant|9|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|Germany|article402|20220103|customer2|just a constant|9|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|Germany|article403|20220103|customer2|just a constant|9|{\"item\":1,\"amount\":1500}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/explode_arrays/data/source/part-01.json",
    "content": "{\"salesorder\": 1,\"item\": 1,\"date\": 20220101,\"customer\":\"customer1\",\"article\": \"article1\",\"amount\": 1000,\"manufacturing_countries\": [\"Portugal\", \"Spain\"], \"sub_articles\": [\"article101\", \"article102\"]}\n{\"salesorder\": 1,\"item\": 2,\"date\": 20220102,\"customer\":\"customer2\",\"article\": \"article2\",\"amount\": 1000,\"manufacturing_countries\": [\"Portugal\", \"Algeria\", \"Italy\"], \"sub_articles\": [\"article201\", \"article202\", \"article203\"]}\n{\"salesorder\": 2,\"item\": 1,\"date\": 20220102,\"customer\":\"customer3\",\"article\": \"article3\",\"amount\": 1200,\"manufacturing_countries\": [\"Norway\"], \"sub_articles\": [\"article301\"]}\n{\"salesorder\": 2,\"item\": 1,\"date\": 20220103,\"customer\":\"customer2\",\"article\": \"article4\",\"amount\": 1500,\"manufacturing_countries\": [\"Portugal\", \"Malaysia\", \"Germany\"], \"sub_articles\": [\"article401\", \"article402\", \"article403\"]}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/explode_arrays/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"manufacturing_countries\",\n      \"nullable\": true,\n      \"metadata\": {},\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"name\": \"sub_articles\",\n      \"nullable\": true,\n      \"metadata\": {},\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/explode_arrays/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"json\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/explode_arrays/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/explode_arrays/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"date\": \"date2\",\n              \"customer\": \"customer2\"\n            }\n          }\n        },\n        {\n          \"function\": \"with_expressions\",\n          \"args\": {\n            \"cols_and_exprs\": {\n              \"constant\": \"'just a constant'\",\n              \"length_customer2\": \"length(customer2)\"\n            }\n          }\n        },\n        {\n          \"function\": \"to_json\",\n          \"args\": {\n            \"in_cols\": [\n              \"item\",\n              \"amount\"\n            ],\n            \"out_col\": \"item_amount_json\"\n          }\n        },\n        {\n          \"function\": \"explode_columns\",\n          \"args\": {\n            \"explode_arrays\": true\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/column_reshapers/explode_arrays/streaming/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/column_reshapers/explode_arrays/streaming/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"json\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"date\": \"date2\",\n              \"customer\": \"customer2\"\n            }\n          }\n        },\n        {\n          \"function\": \"with_expressions\",\n          \"args\": {\n            \"cols_and_exprs\": {\n              \"constant\": \"'just a constant'\",\n              \"length_customer2\": \"length(customer2)\"\n            }\n          }\n        },\n        {\n          \"function\": \"from_json\",\n          \"args\": {\n            \"input_col\": \"agg_fields\",\n            \"schema\": {\n              \"type\": \"struct\",\n              \"fields\": [\n                {\n                  \"name\": \"field1\",\n                  \"nullable\": true,\n                  \"metadata\": {},\n                  \"type\": {\n                    \"containsNull\": true,\n                    \"elementType\": \"string\",\n                    \"type\": \"array\"\n                  }\n                },\n                {\n                  \"name\": \"field2\",\n                  \"type\": {\n                    \"type\": \"struct\",\n                    \"fields\": [\n                      {\n                        \"name\": \"field1\",\n                        \"type\": \"string\",\n                        \"nullable\": true,\n                        \"metadata\": {}\n                      },\n                      {\n                        \"name\": \"field2\",\n                        \"type\": \"string\",\n                        \"nullable\": true,\n                        \"metadata\": {}\n                      }\n                    ]\n                  },\n                  \"nullable\": true,\n                  \"metadata\": {}\n                }\n              ]\n            }\n          }\n        },\n        {\n          \"function\": \"to_json\",\n          \"args\": {\n            \"in_cols\": [\n              \"item\",\n              \"amount\"\n            ],\n            \"out_col\": \"item_amount_json\"\n          }\n        },\n        {\n          \"function\": \"flatten_schema\",\n          \"args\": {\n            \"max_level\": 2\n          }\n        },\n        {\n          \"function\": \"explode_columns\",\n          \"args\": {\n            \"explode_arrays\": true,\n            \"map_cols_to_explode\": [\n              \"sample\"\n            ]\n          }\n        },\n        {\n          \"function\": \"flatten_schema\",\n          \"args\": {\n            \"max_level\": 2\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/batch/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/data/control/part-01.csv",
    "content": "salesorder|item|article|amount|sub_articles|sample_key|sample_value|agg_fields|date2|customer2|constant|length_customer2|agg_fields_json_field1|agg_fields_json_field2_field1|agg_fields_json_field2_field2|item_amount_json\n1|1|article1|1000|article101|field1|value1|{\"field1\":[\"Portugal\",\"Spain\"],\"field2\":{\"field1\":\"value1\",\"field2\":\"value2\"}}|20220101|customer1|just a constant|9|Portugal|value1|value2|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|article101|field2|value2|{\"field1\":[\"Portugal\",\"Spain\"],\"field2\":{\"field1\":\"value1\",\"field2\":\"value2\"}}|20220101|customer1|just a constant|9|Portugal|value1|value2|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|article101|field1|value1|{\"field1\":[\"Portugal\",\"Spain\"],\"field2\":{\"field1\":\"value1\",\"field2\":\"value2\"}}|20220101|customer1|just a constant|9|Spain|value1|value2|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|article101|field2|value2|{\"field1\":[\"Portugal\",\"Spain\"],\"field2\":{\"field1\":\"value1\",\"field2\":\"value2\"}}|20220101|customer1|just a constant|9|Spain|value1|value2|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|article102|field1|value1|{\"field1\":[\"Portugal\",\"Spain\"],\"field2\":{\"field1\":\"value1\",\"field2\":\"value2\"}}|20220101|customer1|just a constant|9|Portugal|value1|value2|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|article102|field2|value2|{\"field1\":[\"Portugal\",\"Spain\"],\"field2\":{\"field1\":\"value1\",\"field2\":\"value2\"}}|20220101|customer1|just a constant|9|Portugal|value1|value2|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|article102|field1|value1|{\"field1\":[\"Portugal\",\"Spain\"],\"field2\":{\"field1\":\"value1\",\"field2\":\"value2\"}}|20220101|customer1|just a constant|9|Spain|value1|value2|{\"item\":1,\"amount\":1000}\n1|1|article1|1000|article102|field2|value2|{\"field1\":[\"Portugal\",\"Spain\"],\"field2\":{\"field1\":\"value1\",\"field2\":\"value2\"}}|20220101|customer1|just a constant|9|Spain|value1|value2|{\"item\":1,\"amount\":1000}\n1|2|article2|1000|article201|field1|value3|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article201|field2|value4|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article201|field1|value5|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article201|field2|value6|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article202|field1|value3|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article202|field2|value4|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article202|field1|value5|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a 
constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article202|field2|value6|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article203|field1|value3|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article203|field2|value4|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article203|field1|value5|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n1|2|article2|1000|article203|field2|value6|{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}|20220102|customer2|just a constant|9|Italy|value4|value5|{\"item\":2,\"amount\":1000}\n2|1|article3|1200|article301|field1|value7|{\"field1\":[\"Malaysia\",\"Germany\"]}|20220102|customer3|just a constant|9|Malaysia|||{\"item\":1,\"amount\":1200}\n2|1|article3|1200|article301|field2|value8|{\"field1\":[\"Malaysia\",\"Germany\"]}|20220102|customer3|just a constant|9|Malaysia|||{\"item\":1,\"amount\":1200}\n2|1|article3|1200|article301|field1|value7|{\"field1\":[\"Malaysia\",\"Germany\"]}|20220102|customer3|just a constant|9|Germany|||{\"item\":1,\"amount\":1200}\n2|1|article3|1200|article301|field2|value8|{\"field1\":[\"Malaysia\",\"Germany\"]}|20220102|customer3|just a constant|9|Germany|||{\"item\":1,\"amount\":1200}\n2|1|article4|1500|article401|field1|value9|{\"field2\":{\"field1\":\"value2\",\"field2\":\"value3\"}}|20220103|customer2|just a constant|9||value2|value3|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|article401|field2|value10|{\"field2\":{\"field1\":\"value2\",\"field2\":\"value3\"}}|20220103|customer2|just a constant|9||value2|value3|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|article402|field1|value9|{\"field2\":{\"field1\":\"value2\",\"field2\":\"value3\"}}|20220103|customer2|just a constant|9||value2|value3|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|article402|field2|value10|{\"field2\":{\"field1\":\"value2\",\"field2\":\"value3\"}}|20220103|customer2|just a constant|9||value2|value3|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|article403|field1|value9|{\"field2\":{\"field1\":\"value2\",\"field2\":\"value3\"}}|20220103|customer2|just a constant|9||value2|value3|{\"item\":1,\"amount\":1500}\n2|1|article4|1500|article403|field2|value10|{\"field2\":{\"field1\":\"value2\",\"field2\":\"value3\"}}|20220103|customer2|just a constant|9||value2|value3|{\"item\":1,\"amount\":1500}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/data/source/part-01.json",
    "content": "{\"salesorder\":1,\"item\":1,\"date\":20220101,\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":1000,\"sub_articles\":[\"article101\",\"article102\"],\"sample\":[{\"field1\":\"value1\",\"field2\":\"value2\"}],\"agg_fields\":{\"field1\":[\"Portugal\",\"Spain\"],\"field2\":{\"field1\":\"value1\",\"field2\":\"value2\"}}}\n{\"salesorder\":1,\"item\":2,\"date\":20220102,\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":1000,\"sub_articles\":[\"article201\",\"article202\",\"article203\"],\"sample\":[{\"field1\":\"value3\",\"field2\":\"value4\"},{\"field1\":\"value5\",\"field2\":\"value6\"}],\"agg_fields\":{\"field1\":[\"Italy\"],\"field2\":{\"field1\":\"value4\",\"field2\":\"value5\"}}}\n{\"salesorder\":2,\"item\":1,\"date\":20220102,\"customer\":\"customer3\",\"article\":\"article3\",\"amount\":1200,\"sub_articles\":[\"article301\"],\"sample\":[{\"field1\":\"value7\",\"field2\":\"value8\"}],\"agg_fields\":{\"field1\":[\"Malaysia\",\"Germany\"]}}\n{\"salesorder\":2,\"item\":1,\"date\":20220103,\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":1500,\"sub_articles\":[\"article401\",\"article402\",\"article403\"],\"sample\":[{\"field1\":\"value9\",\"field2\":\"value10\"}],\"agg_fields\":{\"field2\":{\"field1\":\"value2\",\"field2\":\"value3\"}}}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"sub_articles\",\n      \"nullable\": true,\n      \"metadata\": {},\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": \"string\",\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"name\": \"sample\",\n      \"nullable\": true,\n      \"metadata\": {},\n      \"type\": {\n        \"containsNull\": true,\n        \"elementType\": {\n          \"keyType\": \"string\",\n          \"type\": \"map\",\n          \"valueContainsNull\": true,\n          \"valueType\": \"string\"\n        },\n        \"type\": \"array\"\n      }\n    },\n    {\n      \"name\": \"agg_fields\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"json\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"date\": \"date2\",\n              \"customer\": \"customer2\"\n            }\n          }\n        },\n        {\n          \"function\": \"with_expressions\",\n          \"args\": {\n            \"cols_and_exprs\": {\n              \"constant\": \"'just a constant'\",\n              \"length_customer2\": \"length(customer2)\"\n            }\n          }\n        },\n        {\n          \"function\": \"from_json\",\n          \"args\": {\n            \"input_col\": \"agg_fields\",\n            \"schema\": {\n              \"type\": \"struct\",\n              \"fields\": [\n                {\n                  \"name\": \"field1\",\n                  \"nullable\": true,\n                  \"metadata\": {},\n                  \"type\": {\n                    \"containsNull\": true,\n                    \"elementType\": \"string\",\n                    \"type\": \"array\"\n                  }\n                },\n                {\n                  \"name\": \"field2\",\n                  \"type\": {\n                    \"type\": \"struct\",\n                    \"fields\": [\n                      {\n                        \"name\": \"field1\",\n                        \"type\": \"string\",\n                        \"nullable\": true,\n                        \"metadata\": {}\n                      },\n                      {\n                        \"name\": \"field2\",\n                        \"type\": \"string\",\n                        \"nullable\": true,\n                        \"metadata\": {}\n                      }\n                    ]\n                  },\n                  \"nullable\": true,\n                  \"metadata\": {}\n                }\n              ]\n            }\n          }\n        },\n        {\n          \"function\": \"to_json\",\n          \"args\": {\n            \"in_cols\": [\n              \"item\",\n              \"amount\"\n            ],\n            \"out_col\": \"item_amount_json\"\n          }\n        },\n        {\n          \"function\": \"flatten_schema\",\n          \"args\": {\n            \"max_level\": 2\n          }\n        },\n        {\n          \"function\": \"explode_columns\",\n          \"args\": {\n            \"explode_arrays\": true,\n            \"map_cols_to_explode\": [\n              \"sample\"\n            ]\n          }\n        },\n        {\n          \"function\": \"flatten_schema\",\n          \"args\": {\n            \"max_level\": 2\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/streaming/data\",\n      
\"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/column_reshapers/flatten_and_explode_arrays_and_maps/streaming/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_schema/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"json\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/flatten_schema/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/flatten_schema/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"date\": \"date2\",\n              \"customer\": \"customer2\"\n            }\n          }\n        },\n        {\n          \"function\": \"with_expressions\",\n          \"args\": {\n            \"cols_and_exprs\": {\n              \"constant\": \"'just a constant'\",\n              \"length_customer2\": \"length(customer2)\"\n            }\n          }\n        },\n        {\n          \"function\": \"from_json\",\n          \"args\": {\n            \"input_col\": \"sample\",\n            \"schema\": {\n              \"type\": \"struct\",\n              \"fields\": [\n                {\n                  \"name\": \"field1\",\n                  \"type\": \"string\",\n                  \"nullable\": true,\n                  \"metadata\": {}\n                },\n                {\n                  \"name\": \"field2\",\n                  \"type\": \"string\",\n                  \"nullable\": true,\n                  \"metadata\": {}\n                },\n                {\n                  \"name\": \"field3\",\n                  \"type\": \"double\",\n                  \"nullable\": true,\n                  \"metadata\": {}\n                },\n                {\n                  \"name\": \"field4\",\n                  \"type\": {\n                    \"type\": \"struct\",\n                    \"fields\": [\n                      {\n                        \"name\": \"field1\",\n                        \"type\": \"string\",\n                        \"nullable\": true,\n                        \"metadata\": {}\n                      },\n                      {\n                        \"name\": \"field2\",\n                        \"type\": \"string\",\n                        \"nullable\": true,\n                        \"metadata\": {}\n                      }\n                    ]\n                  },\n                  \"nullable\": true,\n                  \"metadata\": {}\n                }\n              ]\n            }\n          }\n        },\n        {\n          \"function\": \"to_json\",\n          \"args\": {\n            \"in_cols\": [\n              \"item\",\n              \"amount\"\n            ],\n            \"out_col\": \"item_amount_json\"\n          }\n        },\n        {\n          \"function\": \"flatten_schema\",\n          \"args\": {\n            \"max_level\": 2\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/column_reshapers/flatten_schema/batch/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_schema/data/control/part-01.csv",
    "content": "salesorder|item|article|amount|sample|date2|customer2|constant|length_customer2|sample_json_field1|sample_json_field2|sample_json_field3|sample_json_field4_field1|sample_json_field4_field2|item_amount_json\n1|1|article1|1000|{\"field1\":\"value1\",\"field2\":\"value2\",\"field4\":{\"field1\":\"value1\",\"field2\":\"value2\"}}|20220101|customer1|just a constant|9|value1|value2||value1|value2|{\"item\":1,\"amount\":1000}\n1|2|article2|1000|{\"field1\":\"value3\",\"field2\":\"value4\",\"field4\":{\"field1\":\"1value\",\"field2\":\"2value\"}}|20220102|customer2|just a constant|9|value3|value4||1value|2value|{\"item\":2,\"amount\":1000}\n2|1|article3|1200|{\"field1\":\"value5\",\"field3\":6.25,\"field4\":{\"field1\":\"1value1\",\"field2\":\"2value2\"}}|20220102|customer3|just a constant|9|value5||6.25|1value1|2value2|{\"item\":1,\"amount\":1200}\n2|1|article4|1500|{\"field1\":\"value5\",\"field3\":6.25,\"field4\":{\"field1\":\"1value1\",\"field2\":\"2value2\"}}|20220103|customer2|just a constant|9|value5||6.25|1value1|2value2|{\"item\":1,\"amount\":1500}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_schema/data/source/part-01.json",
    "content": "{\"salesorder\":1,\"item\":1,\"date\":20220101,\"customer\":\"customer1\",\"article\":\"article1\",\"amount\":1000,\"sample\":{\"field1\":\"value1\",\"field2\":\"value2\",\"field4\":{\"field1\":\"value1\",\"field2\":\"value2\"}}}\n{\"salesorder\":1,\"item\":2,\"date\":20220102,\"customer\":\"customer2\",\"article\":\"article2\",\"amount\":1000,\"sample\":{\"field1\":\"value3\",\"field2\":\"value4\",\"field4\":{\"field1\":\"1value\",\"field2\":\"2value\"}}}\n{\"salesorder\":2,\"item\":1,\"date\":20220102,\"customer\":\"customer3\",\"article\":\"article3\",\"amount\":1200,\"sample\":{\"field1\":\"value5\",\"field3\":6.25,\"field4\":{\"field1\":\"1value1\",\"field2\":\"2value2\"}}}\n{\"salesorder\":2,\"item\":1,\"date\":20220103,\"customer\":\"customer2\",\"article\":\"article4\",\"amount\":1500,\"sample\":{\"field1\":\"value5\",\"field3\":6.25,\"field4\":{\"field1\":\"1value1\",\"field2\":\"2value2\"}}}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_schema/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"sample\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/column_reshapers/flatten_schema/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"json\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/flatten_schema/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/column_reshapers/flatten_schema/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"rename\",\n          \"args\": {\n            \"cols\": {\n              \"date\": \"date2\",\n              \"customer\": \"customer2\"\n            }\n          }\n        },\n        {\n          \"function\": \"with_expressions\",\n          \"args\": {\n            \"cols_and_exprs\": {\n              \"constant\": \"'just a constant'\",\n              \"length_customer2\": \"length(customer2)\"\n            }\n          }\n        },\n        {\n          \"function\": \"from_json\",\n          \"args\": {\n            \"input_col\": \"sample\",\n            \"schema\": {\n              \"type\": \"struct\",\n              \"fields\": [\n                {\n                  \"name\": \"field1\",\n                  \"type\": \"string\",\n                  \"nullable\": true,\n                  \"metadata\": {}\n                },\n                {\n                  \"name\": \"field2\",\n                  \"type\": \"string\",\n                  \"nullable\": true,\n                  \"metadata\": {}\n                },\n                {\n                  \"name\": \"field3\",\n                  \"type\": \"double\",\n                  \"nullable\": true,\n                  \"metadata\": {}\n                },\n                {\n                  \"name\": \"field4\",\n                  \"type\": {\n                    \"type\": \"struct\",\n                    \"fields\": [\n                      {\n                        \"name\": \"field1\",\n                        \"type\": \"string\",\n                        \"nullable\": true,\n                        \"metadata\": {}\n                      },\n                      {\n                        \"name\": \"field2\",\n                        \"type\": \"string\",\n                        \"nullable\": true,\n                        \"metadata\": {}\n                      }\n                    ]\n                  },\n                  \"nullable\": true,\n                  \"metadata\": {}\n                }\n              ]\n            }\n          }\n        },\n        {\n          \"function\": \"to_json\",\n          \"args\": {\n            \"in_cols\": [\n              \"item\",\n              \"amount\"\n            ],\n            \"out_col\": \"item_amount_json\"\n          }\n        },\n        {\n          \"function\": \"flatten_schema\",\n          \"args\": {\n            \"max_level\": 2\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_source\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/column_reshapers/flatten_schema/streaming/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/column_reshapers/flatten_schema/streaming/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/data_maskers/data/control/drop_columns.csv",
    "content": "salesorder|item|date|amount\n1|1|20160601|1000\n1|2|20160601|2000\n1|3|20160601|500"
  },
  {
    "path": "tests/resources/feature/transformations/data_maskers/data/control/hash_masking.csv",
    "content": "salesorder|item|date|amount|customer|customer_hash|article|article_hash\n1|1|20160601|-14577491|customer1|dea26157fa355301663174eac368538cff8939f36681d6712dedba439ab98b70|article1|36b3061d4fb72c32379a2ad0f05ace632371107ce414a1b3d51ef64247f53952\n1|2|20160601|1268485177|customer1|dea26157fa355301663174eac368538cff8939f36681d6712dedba439ab98b70|article2|8e3ba57e23105c9aaceb58b2ad0f5de979199a7732a6ee3734404ca7745c6fef\n1|3|20160601|-2108627946|customer1|dea26157fa355301663174eac368538cff8939f36681d6712dedba439ab98b70|article3|12717ebdf09ca4f2b2318796b6653e9b96989eda7726da4d94b73a3614476ae6"
  },
  {
    "path": "tests/resources/feature/transformations/data_maskers/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/data_maskers/drop_columns.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/data_maskers/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/data_maskers/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"masked_data\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"column_dropper\",\n          \"args\": {\n            \"cols\": [\"customer\", \"article\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"masked_data\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/data_maskers/drop_columns/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/data_maskers/drop_columns_control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/data_maskers/hash_masking.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/data_maskers/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/data_maskers/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"masked_data\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"hash_masker\",\n          \"args\": {\n            \"cols\": [\"customer\", \"article\"]\n          }\n        },\n        {\n          \"function\": \"hash_masker\",\n          \"args\": {\n            \"cols\": [\"amount\"],\n            \"approach\": \"MURMUR3\",\n            \"suffix\": \"\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"masked_data\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/data_maskers/hash_masking/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/data_maskers/hash_masking_control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer_hash\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article_hash\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/data_maskers/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/date_transformers/control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date2\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date3\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date2\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date2_day\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date2_month\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date2_week\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date2_quarter\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date2_year\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date2_day\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date2_month\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date2_week\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date2_quarter\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date2_year\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/date_transformers/data/control/part-01.csv",
    "content": "salesorder|order_date|order_date2|order_date3|ship_date|ship_date2|order_date2_day|order_date2_month|order_date2_week|order_date2_quarter|order_date2_year|ship_date2_day|ship_date2_month|ship_date2_week|ship_date2_quarter|ship_date2_year|\n1|2016-06-01|2016-06-01|16-1-6|16-6-2|2016-02-06 23:40:43|1|6|22|2|2016|6|2|5|1|2016|\n2|2016-07-03|2016-07-03|16-3-7|16-22-5|2016-05-22 22:12:54|3|7|26|3|2016|22|5|20|2|2016|\n3|2017-01-02|2017-01-02|17-2-1|17-1-3|2017-03-01 07:43:11|2|1|1|1|2017|1|3|9|1|2017|\n"
  },
  {
    "path": "tests/resources/feature/transformations/date_transformers/data/source/part-01.csv",
    "content": "salesorder|order_date|order_date2|order_date3|ship_date|ship_date2\n1|2016-06-01|01-06-2016|20160601|2016-06-02 23:40:43|2016-02-06T23:40:43.000Z\n2|2016-07-03|03-07-2016|20160703|2016-22-05 22:12:54|2016-05-22T22:12:54.000Z\n3|2017-01-02|02-01-2017|20170102|2017-01-03 07:43:11|2017-03-01T07:43:11.000Z"
  },
  {
    "path": "tests/resources/feature/transformations/date_transformers/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date\",\n      \"type\": \"date\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date2\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"order_date3\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date2\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/date_transformers/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/date_transformers/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/date_transformers/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_with_new_dates\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"add_current_date\",\n          \"args\": {\n            \"output_col\": \"curr_date\"\n          }\n        },\n        {\n          \"function\": \"convert_to_date\",\n          \"args\": {\n            \"cols\": [\"order_date2\"],\n            \"source_format\": \"dd-MM-yyyy\"\n          }\n        },\n        {\n          \"function\": \"convert_to_date\",\n          \"args\": {\n            \"cols\": [\"order_date3\"],\n            \"source_format\": \"yyyyMMdd\"\n          }\n        },\n        {\n          \"function\": \"convert_to_timestamp\",\n          \"args\": {\n            \"cols\": [\"ship_date\"],\n            \"source_format\": \"yyyy-dd-MM HH:mm:ss\"\n          }\n        },\n        {\n          \"function\": \"format_date\",\n          \"args\": {\n            \"cols\": [\"order_date3\", \"ship_date\"],\n            \"target_format\": \"yy-d-M\"\n          }\n        },\n        {\n          \"function\": \"get_date_hierarchy\",\n          \"args\": {\n            \"cols\": [\"order_date2\", \"ship_date2\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_with_new_dates\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/date_transformers/streaming/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/date_transformers/streaming/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/drop_duplicate_rows/batch.json",
    "content": "{\n    \"input_specs\": [\n      {\n        \"spec_id\": \"orders_source\",\n        \"read_type\": \"batch\",\n        \"data_format\": \"csv\",\n        \"options\": {\n          \"mode\": \"FAILFAST\",\n          \"header\": true,\n          \"delimiter\": \"|\"\n        },\n        \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/drop_duplicate_rows/source_schema.json\",\n        \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/drop_duplicate_rows/data/part-01.csv\"\n      }\n    ],\n    \"transform_specs\": [\n      {\n        \"spec_id\": \"orders_duplicate_no_args\",\n        \"input_id\": \"orders_source\",\n        \"transformers\": [\n          {\n            \"function\": \"drop_duplicate_rows\"\n          }\n        ]\n      },\n      {\n        \"spec_id\": \"orders_duplicate_empty\",\n        \"input_id\": \"orders_source\",\n        \"transformers\": [\n          {\n            \"function\": \"drop_duplicate_rows\",\n            \"args\": {\n              \"cols\": []\n            }\n          }\n        ]\n      },\n      {\n        \"spec_id\": \"orders_duplicate\",\n        \"input_id\": \"orders_source\",\n        \"transformers\": [\n          {\n            \"function\": \"drop_duplicate_rows\",\n            \"args\": {\n              \"cols\": [\"order_number\",\"item_number\"]\n            }\n          }\n        ]\n      }\n    ],\n    \"output_specs\": [\n      {\n        \"spec_id\": \"orders_duplicate_no_args_write\",\n        \"input_id\": \"orders_duplicate_no_args\",\n        \"write_type\": \"overwrite\",\n        \"data_format\": \"delta\",\n        \"partitions\": [\"date\"],\n        \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/drop_duplicate_rows/batch/orders_duplicate_no_args/data\"\n      },\n      {\n        \"spec_id\": \"orders_duplicate_empty_write\",\n        \"input_id\": \"orders_duplicate_empty\",\n        \"write_type\": \"overwrite\",\n        \"data_format\": \"delta\",\n        \"partitions\": [\"date\"],\n        \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/drop_duplicate_rows/batch/orders_duplicate_empty/data\"\n      },\n      {\n        \"spec_id\": \"orders_duplicate_write\",\n        \"input_id\": \"orders_duplicate\",\n        \"write_type\": \"overwrite\",\n        \"data_format\": \"delta\",\n        \"partitions\": [\"date\"],\n        \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/drop_duplicate_rows/batch/columns/data\"\n      }\n    ]\n  }"
  },
  {
    "path": "tests/resources/feature/transformations/drop_duplicate_rows/data/control/batch_distinct.json",
    "content": "[\n  {\n    \"order_number\": 1,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 10,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 20,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 22,\n    \"date\": 20220102,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 120,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 4,\n    \"article_number\": \"article3\",\n    \"amount\": 120,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  }, \n  {\n    \"order_number\": 2,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 3,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 300,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 200,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  }\n]"
  },
  {
    "path": "tests/resources/feature/transformations/drop_duplicate_rows/data/control/batch_drop_duplicates.json",
    "content": "[\n  {\n    \"order_number\": 1,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 10,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 20,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 120,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 4,\n    \"article_number\": \"article3\",\n    \"amount\": 120,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  }, \n  {\n    \"order_number\": 2,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 3,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 300,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 200,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  }\n]"
  },
  {
    "path": "tests/resources/feature/transformations/drop_duplicate_rows/data/control/streaming_distinct.json",
    "content": "[\n  {\n    \"order_number\": 1,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 10,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 20,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 22,\n    \"date\": 20220102,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 120,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 4,\n    \"article_number\": \"article3\",\n    \"amount\": 120,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  }, \n  {\n    \"order_number\": 2,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 3,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 300,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 200,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 3,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 10,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 3,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 15,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 3,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 220,\n    \"date\": 20220103,\n    \"customer_number\": \"customer3\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 4,\n    \"item_number\": 1,\n    \"article_number\": \"article3\",\n    \"amount\": 350,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 5,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 3,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 5,\n    \"item_number\": 1,\n    \"article_number\": \"article2\",\n    \"amount\": 300,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 5,\n    \"item_number\": 2,\n    \"article_number\": 
\"article4\",\n    \"amount\": 10,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"spain\",\n    \"city\": \"madrid\"\n  }\n]"
  },
  {
    "path": "tests/resources/feature/transformations/drop_duplicate_rows/data/control/streaming_drop_duplicates.json",
    "content": "[\n  {\n    \"order_number\": 1,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 10,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 20,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 120,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 4,\n    \"article_number\": \"article3\",\n    \"amount\": 120,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  }, \n  {\n    \"order_number\": 2,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 3,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 300,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 200,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 3,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 10,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 3,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 15,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 4,\n    \"item_number\": 1,\n    \"article_number\": \"article3\",\n    \"amount\": 350,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 5,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 3,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 5,\n    \"item_number\": 2,\n    \"article_number\": \"article4\",\n    \"amount\": 10,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"spain\",\n    \"city\": \"madrid\"\n  }\n]"
  },
  {
    "path": "tests/resources/feature/transformations/drop_duplicate_rows/data/source/part-01.csv",
    "content": "order_number|item_number|date|customer_number|country|city|article_number|amount\n1|1|20220101|customer1|portugal|porto|article1|10\n1|2|20220101|customer1|portugal|porto|article2|20\n1|2|20220102|customer1|portugal|porto|article2|22\n1|3|20220101|customer1|portugal|porto|article3|120\n1|3|20220101|customer1|portugal|porto|article3|120\n1|4|20220101|customer1|portugal|porto|article3|120\n1|4|20220101|customer1|portugal|porto|article3|120\n2|1|20220102|customer2|germany|nuremberg|article1|3\n2|2|20220102|customer2|germany|nuremberg|article2|300\n2|3|20220102|customer2|germany|nuremberg|article3|200\n2|3|20220102|customer2|germany|nuremberg|article3|200"
  },
  {
    "path": "tests/resources/feature/transformations/drop_duplicate_rows/data/source/part-02.csv",
    "content": "order_number|item_number|date|customer_number|country|city|article_number|amount\n3|1|20220101|customer1|portugal|porto|article1|10\n3|2|20220101|customer1|portugal|porto|article2|15\n3|2|20220103|customer3|portugal|porto|article2|220\n4|1|20220101|customer1|portugal|porto|article3|350\n4|1|20220101|customer1|portugal|porto|article3|350\n5|1|20220102|customer2|germany|nuremberg|article1|3\n5|1|20220102|customer2|germany|nuremberg|article2|300\n5|2|20220102|customer2|spain|madrid|article4|10\n5|2|20220102|customer2|spain|madrid|article4|10\n5|2|20220102|customer2|spain|madrid|article4|10"
  },
  {
    "path": "tests/resources/feature/transformations/drop_duplicate_rows/source_schema.json",
    "content": "{\n    \"type\": \"struct\",\n    \"fields\": [\n      {\n        \"name\": \"order_number\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"item_number\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"date\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"customer_number\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"country\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"city\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"article_number\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"amount\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      }\n    ]\n  }"
  },
  {
    "path": "tests/resources/feature/transformations/drop_duplicate_rows/streaming.json",
    "content": "{\n    \"input_specs\": [\n      {\n        \"spec_id\": \"orders_source\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"csv\",\n        \"options\": {\n          \"mode\": \"FAILFAST\",\n          \"header\": true,\n          \"delimiter\": \"|\"\n        },\n        \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/drop_duplicate_rows/source_schema.json\",\n        \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/drop_duplicate_rows/data\"\n      }\n    ],\n    \"transform_specs\": [\n      {\n        \"spec_id\": \"orders_duplicate_no_args\",\n        \"input_id\": \"orders_source\",\n        \"transformers\": [\n          {\n            \"function\": \"drop_duplicate_rows\"\n          }\n        ]\n      },\n      {\n        \"spec_id\": \"orders_duplicate_empty\",\n        \"input_id\": \"orders_source\",\n        \"transformers\": [\n          {\n            \"function\": \"drop_duplicate_rows\",\n            \"args\": {\n              \"cols\": []\n            }\n          }\n        ]\n      },\n      {\n        \"spec_id\": \"orders_duplicate\",\n        \"input_id\": \"orders_source\",\n        \"transformers\": [\n          {\n            \"function\": \"drop_duplicate_rows\",\n            \"args\": {\n              \"cols\": [\"order_number\",\"item_number\"]\n            }\n          }\n        ]\n      }\n    ],\n    \"output_specs\": [\n      {\n        \"spec_id\": \"orders_duplicate_no_args_write\",\n        \"input_id\": \"orders_duplicate_no_args\",\n        \"write_type\": \"append\",\n        \"data_format\": \"delta\",\n        \"partitions\": [\"date\"],\n        \"options\": {\n          \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/drop_duplicate_rows/streaming/orders_duplicate_no_args/checkpoint\"\n        },\n        \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/drop_duplicate_rows/streaming/orders_duplicate_no_args/data\"\n      },\n      {\n        \"spec_id\": \"orders_duplicate_empty_write\",\n        \"input_id\": \"orders_duplicate_empty\",\n        \"write_type\": \"append\",\n        \"data_format\": \"delta\",\n        \"partitions\": [\"date\"],\n        \"options\": {\n          \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/drop_duplicate_rows/streaming/orders_duplicate_empty/checkpoint\"\n        },\n        \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/drop_duplicate_rows/streaming/orders_duplicate_empty/data\"\n      },\n      {\n        \"spec_id\": \"orders_duplicate_write\",\n        \"input_id\": \"orders_duplicate\",\n        \"write_type\": \"append\",\n        \"data_format\": \"delta\",\n        \"partitions\": [\"date\"],\n        \"options\": {\n          \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/drop_duplicate_rows/streaming/columns/checkpoint\"\n        },\n        \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/drop_duplicate_rows/streaming/columns/data\"\n      }\n    ]\n  }"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/sales\"\n    },\n    {\n      \"spec_id\": \"customers\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/customer_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/customers\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"join_with_customers\",\n      \"input_id\": \"sales\",\n      \"transformers\": [\n        {\n          \"function\": \"join\",\n          \"args\": {\n            \"join_with\": \"customers\",\n            \"join_type\": \"left outer\",\n            \"join_condition\": \"a.customer = b.customer\",\n            \"select_cols\": [\"a.*\", \"b.name as customer_name\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"join_with_customers\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.batch_join\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/joiners/batch/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/control_scenario_1_and_2_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer_name\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/control_scenario_3_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"name\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/customer_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"name\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"birth_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/data/control/control_scenario_1_and_2.csv",
    "content": "salesorder|item|date|customer|article|amount|customer_name\n1|1|20160601|customer1|article1|1000|Anna\n1|2|20160601|customer1|article2|2000|Anna\n1|3|20160601|customer1|article3|500|Anna\n2|1|20170215|customer2|article4|1000|John\n2|2|20170215|customer2|article6|5000|John\n2|3|20170215|customer2|article1|3000|John\n3|1|20170215|customer1|article5|20000|Anna\n3|2|20170215|customer1|article2|12000|Anna\n3|3|20170215|customer1|article4|9000|Anna\n4|1|20170430|customer3|article3|8000|Sarah\n4|2|20170430|customer3|article7|7000|Sarah\n4|3|20170430|customer3|article1|3000|Sarah\n4|4|20170430|customer3|article2|5000|Sarah"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/data/control/control_scenario_3.csv",
    "content": "salesorder|item|date|customer|article|amount|name\n1|1|20160601|customer1|article1|1000|Anna\n1|2|20160601|customer1|article2|2000|Anna\n1|3|20160601|customer1|article3|500|Anna\n2|1|20170215|customer2|article4|1000|John\n2|2|20170215|customer2|article6|5000|John\n2|3|20170215|customer2|article1|3000|John\n3|1|20170215|customer1|article5|20000|Anna\n3|2|20170215|customer1|article2|12000|Anna\n3|3|20170215|customer1|article4|9000|Anna\n4|1|20170430|customer3|article3|8000|Sarah\n4|2|20170430|customer3|article7|7000|Sarah\n4|3|20170430|customer3|article1|3000|Sarah\n4|4|20170430|customer3|article2|5000|Sarah"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/data/source/customer-part-01.csv",
    "content": "customer|name|birth_date\ncustomer1|Anna|01012002\ncustomer2|John|04051980\ncustomer3|Sarah|02051940"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/data/source/sales-part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/data/source/sales-part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n2|1|20170215|customer2|article4|1000\n2|2|20170215|customer2|article6|5000\n2|3|20170215|customer2|article1|3000\n3|1|20170215|customer1|article5|20000\n3|2|20170215|customer1|article2|12000\n3|3|20170215|customer1|article4|9000\n4|1|20170430|customer3|article3|8000\n4|2|20170430|customer3|article7|7000\n4|3|20170430|customer3|article1|3000\n4|4|20170430|customer3|article2|5000"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/streaming.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/sales\"\n    },\n    {\n      \"spec_id\": \"customers\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/customer_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/customers\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"join_with_customers\",\n      \"input_id\": \"sales\",\n      \"transformers\": [\n        {\n          \"function\": \"join\",\n          \"args\": {\n            \"join_with\": \"customers\",\n            \"join_type\": \"left outer\",\n            \"join_condition\": \"a.customer = b.customer\",\n            \"select_cols\": [\"a.*\", \"b.name as customer_name\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"join_with_customers\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.streaming_join\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/joiners/streaming/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/joiners/streaming/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/streaming_foreachBatch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/sales\"\n    },\n    {\n      \"spec_id\": \"customers\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/customer_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/customers\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"join_with_customers\",\n      \"input_id\": \"sales\",\n      \"force_streaming_foreach_batch_processing\": true,\n      \"transformers\": [\n        {\n          \"function\": \"join\",\n          \"args\": {\n            \"join_with\": \"customers\",\n            \"join_type\": \"left outer\",\n            \"join_condition\": \"a.customer = b.customer\",\n            \"select_cols\": [\"a.*\", \"b.name as customer_name\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"join_with_customers\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.streaming_join_foreachBatch\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/joiners/streaming_foreachBatch/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/joiners/streaming_foreachBatch/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/streaming_without_broadcast.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/sales\"\n    },\n    {\n      \"spec_id\": \"customers\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/customer_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/customers\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"join_with_customers\",\n      \"input_id\": \"sales\",\n      \"transformers\": [\n        {\n          \"function\": \"join\",\n          \"args\": {\n            \"join_with\": \"customers\",\n            \"join_type\": \"left outer\",\n            \"join_condition\": \"a.customer = b.customer\",\n            \"select_cols\": [\"a.*\", \"b.name as customer_name\"],\n            \"broadcast_join\": false\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"join_with_customers\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.streaming_without_broadcast\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"customer\",\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/joiners/streaming_without_broadcast/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/joiners/streaming_without_broadcast/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/joiners/streaming_without_column_rename.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/sales\"\n    },\n    {\n      \"spec_id\": \"customers\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/customer_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/joiners/data/customers\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"join_with_customers\",\n      \"input_id\": \"sales\",\n      \"transformers\": [\n        {\n          \"function\": \"join\",\n          \"args\": {\n            \"join_with\": \"customers\",\n            \"join_type\": \"left outer\",\n            \"join_condition\": \"a.customer = b.customer\",\n            \"select_cols\": [\"a.*\", \"b.name\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"join_with_customers\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.streaming_join_without_column_rename\",\n      \"data_format\": \"delta\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/joiners/streaming_without_column_rename/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/joiners/streaming_without_column_rename/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/multiple_transform/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"orders_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/multiple_transform/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/multiple_transform/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"orders_customer_cols\",\n      \"input_id\": \"orders_source\",\n      \"transformers\": [\n        {\n          \"function\": \"column_filter_exp\",\n          \"args\": {\n            \"exp\": [\"date\", \"country\", \"customer_number\"]\n          }\n        }\n      ]\n    },\n    {\n      \"spec_id\": \"orders_kpi_cols\",\n      \"input_id\": \"orders_source\",\n      \"transformers\": [\n        {\n          \"function\": \"column_filter_exp\",\n          \"args\": {\n            \"exp\": [\"date\", \"city\", \"amount\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"orders_bronze_customer_cols\",\n      \"input_id\": \"orders_customer_cols\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/multiple_transform/batch/orders_customer_cols/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/multiple_transform/batch/orders_customer_cols/checkpoint\"\n      }\n    },\n    {\n      \"spec_id\": \"orders_bronze_kpi_cols\",\n      \"input_id\": \"orders_kpi_cols\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/multiple_transform/batch/orders_kpi_cols/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/multiple_transform/batch/orders_kpi_cols/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/multiple_transform/data/control/part-01.json",
    "content": "[\n  {\n    \"order_number\": 1,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 10,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 20,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 1,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 120,\n    \"date\": 20220101,\n    \"customer_number\": \"customer1\",\n    \"country\": \"portugal\",\n    \"city\": \"porto\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 1,\n    \"article_number\": \"article1\",\n    \"amount\": 3,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 2,\n    \"article_number\": \"article2\",\n    \"amount\": 300,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  },\n  {\n    \"order_number\": 2,\n    \"item_number\": 3,\n    \"article_number\": \"article3\",\n    \"amount\": 200,\n    \"date\": 20220102,\n    \"customer_number\": \"customer2\",\n    \"country\": \"germany\",\n    \"city\": \"nuremberg\"\n  }\n]"
  },
  {
    "path": "tests/resources/feature/transformations/multiple_transform/data/source/part-01.csv",
    "content": "order_number|item_number|date|customer_number|country|city|article_number|amount\n1|1|20220101|customer1|portugal|porto|article1|10\n1|2|20220101|customer1|portugal|porto|article2|20\n1|3|20220101|customer1|portugal|porto|article3|120\n2|1|20220102|customer2|germany|nuremberg|article1|3\n2|2|20220102|customer2|germany|nuremberg|article2|300\n2|3|20220102|customer2|germany|nuremberg|article3|200"
  },
  {
    "path": "tests/resources/feature/transformations/multiple_transform/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"order_number\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item_number\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer_number\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"country\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"city\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article_number\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/null_handlers/control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": false,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": false,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"float\",\n      \"nullable\": false,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/null_handlers/data/control/replace_nulls.csv",
    "content": "salesorder|customer|amount\n1|customer1|-999\n-999|customer2|200.5\n3|UNKNOWN|100.0"
  },
  {
    "path": "tests/resources/feature/transformations/null_handlers/data/control/replace_nulls_col_subset.csv",
    "content": "salesorder|customer|amount\n1|customer1|-999\n|customer2|200.5\n3||100.0"
  },
  {
    "path": "tests/resources/feature/transformations/null_handlers/data/source/part-01.csv",
    "content": "salesorder|customer|amount\n1|customer1|\n|customer2|200.50\n3||100.00"
  },
  {
    "path": "tests/resources/feature/transformations/null_handlers/replace_nulls.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/null_handlers/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/null_handlers/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_without_nulls\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"replace_nulls\"\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_without_nulls\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/null_handlers/replace_nulls/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/null_handlers/replace_nulls/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/null_handlers/replace_nulls_col_subset.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\"\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/null_handlers/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/null_handlers/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"sales_without_nulls\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"replace_nulls\",\n          \"args\": {\n            \"subset_cols\": [\"amount\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"sales_without_nulls\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/null_handlers/replace_nulls_col_subset/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/null_handlers/replace_nulls_col_subset/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/null_handlers/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"float\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/optimizers/data/source/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/regex_transformers/with_regex_value/batch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_source\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"with_filepath\": true,\n      \"options\": {\n        \"mode\": \"FAILFAST\",\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"inferSchema\": true\n      },\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/regex_transformers/with_regex_value/source_schema.json\",\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/regex_transformers/with_regex_value/data\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"with_extraction_date\",\n      \"input_id\": \"sales_source\",\n      \"transformers\": [\n        {\n          \"function\": \"with_regex_value\",\n          \"args\": {\n            \"input_col\": \"lhe_extraction_filepath\",\n            \"output_col\": \"extraction_date\",\n            \"drop_input_col\": true,\n            \"regex\": \".*WE_SO_SCL_(\\\\d+).csv\"\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_bronze\",\n      \"input_id\": \"with_extraction_date\",\n      \"write_type\": \"overwrite\",\n      \"data_format\": \"delta\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/regex_transformers/with_regex_value/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/regex_transformers/with_regex_value/control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"extraction_date\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/regex_transformers/with_regex_value/data/control/part-01.csv",
    "content": "salesorder|item|date|customer|article|amount|extraction_date\n1|1|20160601|customer1|article1|1000|202108111400000029\n1|2|20160601|customer1|article2|2000|202108111400000029\n1|3|20160601|customer1|article3|500|202108111400000029"
  },
  {
    "path": "tests/resources/feature/transformations/regex_transformers/with_regex_value/data/source/WE_SO_SCL_202108111400000029.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20160601|customer1|article1|1000\n1|2|20160601|customer1|article2|2000\n1|3|20160601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/regex_transformers/with_regex_value/source_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/batch_union.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_historical/sales-historical-part-01.csv\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_new/sales-new-part-01.csv\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/batch_union/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/batch_unionByName.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_historical/sales-historical-part-01.csv\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_new/sales-new-part-01.csv\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union_by_name\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/batch_unionByName/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/batch_unionByName_diff_schema.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_historical/sales-historical-part-01.csv\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_new/sales-new-part-01.csv\"\n    },\n    {\n      \"spec_id\": \"sales_shipment\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_shipment_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_shipment/sales-shipment-part-01.csv\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union_by_name\",\n          \"args\": {\n            \"union_with\": [\"sales_new\", \"sales_shipment\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/batch_unionByName_diff_schema/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/batch_unionByName_diff_schema_error.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_historical/sales-historical-part-01.csv\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_new/sales-new-part-01.csv\"\n    },\n    {\n      \"spec_id\": \"sales_shipment\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_shipment_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_shipment/sales-shipment-part-01.csv\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union_by_name\",\n          \"args\": {\n            \"union_with\": [\"sales_new\", \"sales_shipment\"],\n            \"allow_missing_columns\": false\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/batch_unionByName_diff_schema_error/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/batch_union_diff_schema.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_historical/sales-historical-part-01.csv\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_new/sales-new-part-01.csv\"\n    },\n    {\n      \"spec_id\": \"sales_shipment\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_shipment_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_shipment/sales-shipment-part-01.csv\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\", \"sales_shipment\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/batch_union_diff_schema/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/control/control_sales.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20150601|customer1|article1|1000\n1|2|20150601|customer1|article2|2000\n1|3|20150601|customer1|article3|500\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/control/control_sales_shipment.csv",
    "content": "salesorder|item|date|customer|article|amount|ship_date\n1|1|20150601|customer1|article1|1000|\n1|2|20150601|customer1|article2|2000|\n1|3|20150601|customer1|article3|500|\n2|1|20160215|customer2|article4|1000|\n2|2|20160215|customer2|article6|5000|\n2|3|20160215|customer2|article1|3000|\n3|1|20160215|customer1|article5|20000|\n4|1|20170215|customer2|article4|1000|20170216\n4|2|20170215|customer2|article6|5000|20170216\n5|1|20170215|customer1|article5|20000|20170216\n5|3|20170215|customer2|article1|3000|20170216"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/control/control_sales_shipment_streaming.csv",
    "content": "salesorder|item|date|customer|article|amount|ship_date\n0|1|20140601|customer1|article1|1000|\n0|2|20140601|customer1|article2|2000|\n0|3|20140601|customer1|article3|500|\n1|1|20150601|customer1|article1|1000|\n1|2|20150601|customer1|article2|2000|\n1|3|20150601|customer1|article3|500|\n2|1|20160215|customer2|article4|1000|\n2|2|20160215|customer2|article6|5000|\n2|3|20160215|customer2|article1|3000|\n3|1|20160215|customer1|article5|20000|\n4|1|20170215|customer2|article4|1000|20170216\n4|2|20170215|customer2|article6|5000|20170216\n5|1|20170215|customer1|article5|20000|20170216\n5|3|20170215|customer2|article1|3000|20170216\n6|1|20160218|customer3|article7|100|\n6|2|20160218|customer3|article9|500|\n6|3|20160218|customer3|article8|300|\n7|1|20160218|customer5|article7|2000|\n8|1|20190215|customer2|article4|1000|20190216\n8|2|20190215|customer2|article6|5000|20190216\n9|3|20190215|customer2|article1|3000|20190216\n9|1|20190215|customer1|article5|20000|20190216"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/control/control_sales_shipment_streaming_foreachBatch.csv",
    "content": "salesorder|item|date|customer|article|amount|ship_date\n0|1|20140601|customer1|article1|1000|\n0|2|20140601|customer1|article2|2000|\n0|3|20140601|customer1|article3|500|\n1|1|20150601|customer1|article1|1000|\n1|2|20150601|customer1|article2|2000|\n1|3|20150601|customer1|article3|500|\n2|1|20160215|customer2|article4|1000|\n2|2|20160215|customer2|article6|5000|\n2|3|20160215|customer2|article1|3000|\n3|1|20160215|customer1|article5|20000|\n4|1|20170215|customer2|article4|1000|20170216\n4|2|20170215|customer2|article6|5000|20170216\n5|1|20170215|customer1|article5|20000|20170216\n5|3|20170215|customer2|article1|3000|20170216\n6|1|20160218|customer3|article7|100|\n6|2|20160218|customer3|article9|500|\n6|3|20160218|customer3|article8|300|\n7|1|20160218|customer5|article7|2000|\n8|1|20190215|customer2|article4|1000|20190216\n8|2|20190215|customer2|article6|5000|20190216\n9|3|20190215|customer2|article1|3000|20190216\n9|1|20190215|customer1|article5|20000|20190216"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/control/control_sales_streaming.csv",
    "content": "salesorder|item|date|customer|article|amount\n0|1|20140601|customer1|article1|1000\n0|2|20140601|customer1|article2|2000\n0|3|20140601|customer1|article3|500\n1|1|20150601|customer1|article1|1000\n1|2|20150601|customer1|article2|2000\n1|3|20150601|customer1|article3|500\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000\n6|1|20160218|customer3|article7|100\n6|2|20160218|customer3|article9|500\n6|3|20160218|customer3|article8|300\n7|1|20160218|customer5|article7|2000"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/control/control_sales_streaming_foreachBatch.csv",
    "content": "salesorder|item|date|customer|article|amount\n0|1|20140601|customer1|article1|1000\n0|2|20140601|customer1|article2|2000\n0|3|20140601|customer1|article3|500\n1|1|20150601|customer1|article1|1000\n1|2|20150601|customer1|article2|2000\n1|3|20150601|customer1|article3|500\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000\n6|1|20160218|customer3|article7|100\n6|2|20160218|customer3|article9|500\n6|3|20160218|customer3|article8|300\n7|1|20160218|customer5|article7|2000"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/source/sales-historical-part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20150601|customer1|article1|1000\n1|2|20150601|customer1|article2|2000\n1|3|20150601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/source/sales-historical-part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n0|1|20140601|customer1|article1|1000\n0|2|20140601|customer1|article2|2000\n0|3|20140601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/source/sales-new-part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/source/sales-new-part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n6|1|20160218|customer3|article7|100\n6|2|20160218|customer3|article9|500\n6|3|20160218|customer3|article8|300\n7|1|20160218|customer5|article7|2000"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/source/sales-shipment-part-01.csv",
    "content": "salesorder|item|date|customer|article|amount|ship_date\n4|1|20170215|customer2|article4|1000|20170216\n4|2|20170215|customer2|article6|5000|20170216\n5|3|20170215|customer2|article1|3000|20170216\n5|1|20170215|customer1|article5|20000|20170216"
  },
  {
    "path": "tests/resources/feature/transformations/unions/data/source/sales-shipment-part-02.csv",
    "content": "salesorder|item|date|customer|article|amount|ship_date\n8|1|20190215|customer2|article4|1000|20190216\n8|2|20190215|customer2|article6|5000|20190216\n9|3|20190215|customer2|article1|3000|20190216\n9|1|20190215|customer1|article5|20000|20190216"
  },
  {
    "path": "tests/resources/feature/transformations/unions/sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/sales_shipment_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"ship_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/streaming_union.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_new/\"\n    },\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_historical/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_new\",\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_historical\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/streaming_union/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/streaming_union/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/streaming_unionByName_diff_schema.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_new/\"\n    },\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_shipment\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_shipment_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_shipment/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_new\",\n      \"transformers\": [\n        {\n          \"function\": \"union_by_name\",\n          \"args\": {\n            \"union_with\": [\"sales_historical\", \"sales_shipment\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/streaming_unionByName_diff_schema/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/streaming_unionByName_diff_schema/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/streaming_unionByName_diff_schema_foreachBatch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_new/\"\n    },\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_shipment\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_shipment_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_shipment/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_new\",\n      \"force_streaming_foreach_batch_processing\": true,\n      \"transformers\": [\n        {\n          \"function\": \"union_by_name\",\n          \"args\": {\n            \"union_with\": [\"sales_historical\", \"sales_shipment\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/streaming_unionByName_diff_schema_foreachBatch/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/streaming_unionByName_diff_schema_foreachBatch/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/unions/streaming_union_foreachBatch.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_new/\"\n    },\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/unions/data/sales/sales_historical/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_new\",\n      \"force_streaming_foreach_batch_processing\": true,\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_historical\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/streaming_union_foreachBatch/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/unions/streaming_union_foreachBatch/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates/data/control/streaming_drop_duplicates.csv",
    "content": "order_number|item_number|date|customer_number|country|city|article_number|amount\n1|2|2017-05-10 01:01:01|customer1|portugal|porto|article2|20\n2|1|2017-05-10 01:01:01|customer2|germany|nuremberg|article1|3\n2|2|2017-05-10 01:01:01|customer2|germany|nuremberg|article2|300\n3|1|2017-05-12 01:01:01|customer1|portugal|porto|article1|10\n3|2|2017-05-12 01:01:01|customer1|portugal|porto|article2|15\n3|2|2017-05-12 01:01:01|customer3|portugal|porto|article2|220\n1|1|2017-05-10 01:01:01|customer1|portugal|porto|article1|10\n1|2|2017-05-10 01:01:01|customer1|portugal|porto|article2|22\n1|3|2017-05-10 01:01:01|customer1|portugal|porto|article3|120\n1|4|2017-05-10 01:01:01|customer1|portugal|porto|article3|120\n2|3|2017-05-10 01:01:01|customer2|germany|nuremberg|article3|200\n4|1|2017-05-12 01:01:01|customer1|portugal|porto|article3|350\n5|1|2017-05-12 01:01:01|customer2|germany|nuremberg|article1|3\n5|1|2017-05-12 01:01:01|customer2|germany|nuremberg|article2|300\n5|2|2017-05-12 01:01:01|customer2|spain|madrid|article4|10\n5|2|2017-05-10 10:01:12|customer2|spain|madrid|article4|10"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates/data/source/part-01.csv",
    "content": "order_number|item_number|date|customer_number|country|city|article_number|amount\n1|1|2017-05-10 01:01:01.000|customer1|portugal|porto|article1|10\n1|2|2017-05-10 01:01:01.000|customer1|portugal|porto|article2|20\n1|2|2017-05-10 01:01:01.000|customer1|portugal|porto|article2|22\n1|3|2017-05-10 01:01:01.000|customer1|portugal|porto|article3|120\n1|3|2017-05-10 01:01:01.000|customer1|portugal|porto|article3|120\n1|4|2017-05-10 01:01:01.000|customer1|portugal|porto|article3|120\n1|4|2017-05-10 01:01:01.000|customer1|portugal|porto|article3|120\n2|1|2017-05-10 01:01:01.000|customer2|germany|nuremberg|article1|3\n2|2|2017-05-10 01:01:01.000|customer2|germany|nuremberg|article2|300\n2|3|2017-05-10 01:01:01.000|customer2|germany|nuremberg|article3|200\n2|3|2017-05-10 01:01:01.000|customer2|germany|nuremberg|article3|200\n5|2|2017-05-10 10:01:12.000|customer2|spain|madrid|article4|10"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates/data/source/part-02.csv",
    "content": "order_number|orders_duplicate_no_args|date|customer_number|country|city|article_number|amount\n3|1|2017-05-12 01:01:01.000|customer1|portugal|porto|article1|10\n3|2|2017-05-12 01:01:01.000|customer1|portugal|porto|article2|15\n3|2|2017-05-12 01:01:01.000|customer3|portugal|porto|article2|220\n4|1|2017-05-12 01:01:01.000|customer1|portugal|porto|article3|350\n4|1|2017-05-12 01:01:01.000|customer1|portugal|porto|article3|350\n5|1|2017-05-12 01:01:01.000|customer2|germany|nuremberg|article1|3\n5|1|2017-05-12 01:01:01.000|customer2|germany|nuremberg|article2|300\n5|2|2017-05-12 01:01:01.000|customer2|spain|madrid|article4|10\n5|2|2017-05-12 01:01:01.000|customer2|spain|madrid|article4|10\n5|2|2017-05-06 10:01:12.000|customer2|spain|madrid|article4|10\n5|2|2017-05-04 10:01:12.000|customer2|spain|madrid|article4|1000\n1|1|2017-05-10 01:01:01.000|customer1|portugal|porto|article1|10"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates/source_schema.json",
    "content": "{\n    \"type\": \"struct\",\n    \"fields\": [\n      {\n        \"name\": \"order_number\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"item_number\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"date\",\n        \"type\": \"timestamp\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"customer_number\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"country\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"city\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"article_number\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"amount\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      }\n    ]\n  }"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates/streaming_drop_duplicates.json",
    "content": "{\n    \"input_specs\": [\n      {\n        \"spec_id\": \"orders_source\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"csv\",\n        \"options\": {\n          \"mode\": \"FAILFAST\",\n          \"header\": true,\n          \"delimiter\": \"|\"\n        },\n        \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_drop_duplicates/source_schema.json\",\n        \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_drop_duplicates/data\"\n      }\n    ],\n    \"transform_specs\": [\n      {\n        \"spec_id\": \"orders_duplicate_no_args\",\n        \"input_id\": \"orders_source\",\n        \"transformers\": [\n          {\n            \"function\": \"drop_duplicate_rows\",\n            \"args\": {\n                 \"watermarker\": {\"col\": \"date\", \"watermarking_time\":\"2 days\"}\n            }\n          }\n        ]\n      }\n    ],\n  \"dq_specs\": [\n    {\n      \"spec_id\": \"dq_validator\",\n      \"input_id\": \"orders_duplicate_no_args\",\n      \"dq_type\": \"validator\",\n      \"store_backend\": \"file_system\",\n      \"local_fs_root_dir\": \"/app/tests/lakehouse/out/feature/transformations/watermarker/streaming_drop_duplicates/dq\",\n      \"result_sink_db_table\": \"test_db.validator_full_overwrite\",\n      \"result_sink_explode\": true,\n      \"result_sink_extra_columns\": [\"validation_results.result.*\"],\n      \"source\": \"orders_source\",\n      \"dq_functions\": [\n        {\n          \"function\": \"expect_column_to_exist\",\n          \"args\": {\n            \"column\": \"date\"\n          }\n        },\n        {\n          \"function\": \"expect_table_row_count_to_be_between\",\n          \"args\": {\n            \"min_value\": 0,\n            \"max_value\": 20\n          }\n        }\n      ]\n    }\n  ],\n    \"output_specs\": [\n      {\n        \"spec_id\": \"orders_duplicate_no_args_write\",\n        \"input_id\": \"orders_duplicate_no_args\",\n        \"write_type\": \"append\",\n        \"data_format\": \"delta\",\n        \"options\": {\n          \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_drop_duplicates/checkpoint\"\n        },\n        \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_drop_duplicates/data\"\n      }\n    ]\n  }"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/data/control/streaming_drop_duplicates_overall_watermark.csv",
    "content": "order_number|item_number|date|customer_number|country|city|article_number|amount\n1|1|2017-05-10 01:01:01|customer1|portugal|porto|article1|10\n1|2|2017-05-10 01:01:01|customer1|portugal|porto|article2|22\n1|4|2017-05-10 01:01:01|customer1|portugal|porto|article3|120\n2|1|2017-05-10 01:01:01|customer2|germany|nuremberg|article1|3\n2|2|2017-05-10 01:01:01|customer2|germany|nuremberg|article2|300\n2|3|2017-05-10 01:01:01|customer2|germany|nuremberg|article3|200\n3|1|2017-05-12 01:01:01|customer1|portugal|porto|article1|10\n3|2|2017-05-12 01:01:01|customer1|portugal|porto|article2|15\n3|2|2017-05-12 01:01:01|customer3|portugal|porto|article2|220\n4|1|2017-05-12 01:01:01|customer1|portugal|porto|article3|350\n5|1|2017-05-12 01:01:01|customer2|germany|nuremberg|article1|3\n1|3|2017-05-10 01:01:01|customer1|portugal|porto|article3|120\n5|2|2017-05-10 10:01:12|customer2|spain|madrid|article4|10\n5|2|2017-05-12 01:01:03|customer2|spain|madrid|article4|10"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/data/source/part-01.csv",
    "content": "order_number|item_number|date|customer_number|country|city|article_number|amount\n1|1|2017-05-10 01:01:01.000|customer1|portugal|porto|article1|10\n1|2|2017-05-10 01:01:01.000|customer1|portugal|porto|article2|20\n1|2|2017-05-10 01:01:01.000|customer1|portugal|porto|article2|22\n1|3|2017-05-10 01:01:01.000|customer1|portugal|porto|article3|120\n1|3|2017-05-10 01:01:01.000|customer1|portugal|porto|article3|120\n1|4|2017-05-10 01:01:01.000|customer1|portugal|porto|article3|120\n1|4|2017-05-10 01:01:01.000|customer1|portugal|porto|article3|120\n2|1|2017-05-10 01:01:01.000|customer2|germany|nuremberg|article1|3\n2|2|2017-05-10 01:01:01.000|customer2|germany|nuremberg|article2|300\n2|3|2017-05-10 01:01:01.000|customer2|germany|nuremberg|article3|200\n2|3|2017-05-10 01:01:01.000|customer2|germany|nuremberg|article3|200\n5|2|2017-05-10 10:01:12.000|customer2|spain|madrid|article4|10"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/data/source/part-02.csv",
    "content": "order_number|orders_duplicate_no_args|date|customer_number|country|city|article_number|amount\n3|1|2017-05-12 01:01:01.000|customer1|portugal|porto|article1|10\n3|2|2017-05-12 01:01:01.000|customer1|portugal|porto|article2|15\n3|2|2017-05-12 01:01:01.000|customer3|portugal|porto|article2|220\n4|1|2017-05-12 01:01:01.000|customer1|portugal|porto|article3|350\n4|1|2017-05-12 01:01:01.000|customer1|portugal|porto|article3|350\n5|1|2017-05-12 01:01:01.000|customer2|germany|nuremberg|article1|3\n5|1|2017-05-12 01:01:01.000|customer2|germany|nuremberg|article2|300\n5|2|2017-05-12 01:01:01.000|customer2|spain|madrid|article4|10\n5|2|2017-05-12 01:01:01.000|customer2|spain|madrid|article4|10\n5|2|2017-05-12 01:01:02.000|customer2|spain|madrid|article4|10\n5|2|2017-05-12 01:01:03.000|customer2|spain|madrid|article4|10\n5|2|2017-05-06 10:01:12.000|customer23|spain|madrid|article4|10\n5|2|2017-05-04 10:01:12.000|customer22|spain|madrid|article4|1000\n1|1|2017-05-10 01:01:01.000|customer1|portugal|porto|article1|10"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/source_schema.json",
    "content": "{\n    \"type\": \"struct\",\n    \"fields\": [\n      {\n        \"name\": \"order_number\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"item_number\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"date\",\n        \"type\": \"timestamp\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"customer_number\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"country\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"city\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"article_number\",\n        \"type\": \"string\",\n        \"nullable\": true,\n        \"metadata\": {}\n      },\n      {\n        \"name\": \"amount\",\n        \"type\": \"integer\",\n        \"nullable\": true,\n        \"metadata\": {}\n      }\n    ]\n  }"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/streaming_drop_duplicates_overall_watermark.json",
    "content": "{\n    \"input_specs\": [\n      {\n        \"spec_id\": \"orders_source\",\n        \"read_type\": \"streaming\",\n        \"data_format\": \"csv\",\n        \"options\": {\n          \"mode\": \"FAILFAST\",\n          \"header\": true,\n          \"delimiter\": \"|\"\n        },\n        \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/source_schema.json\",\n        \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/data\"\n      }\n    ],\n    \"transform_specs\": [\n      {\n        \"spec_id\": \"watermarking_orders\",\n        \"input_id\": \"orders_source\",\n        \"transformers\": [\n          {\n            \"function\": \"with_watermark\",\n            \"args\" : {\"watermarker_column\": \"date\", \"watermarker_time\":\"2 days\"}\n          },\n          {\"function\":\"drop_duplicate_rows\"},\n          {\n            \"function\": \"group_and_rank\",\n            \"args\": {\n              \"group_key\": [\n                \"order_number\",\n                \"item_number\",\n                \"customer_number\",\n                \"city\"\n              ],\n              \"ranking_key\": [\n                \"date\"\n              ]\n            }\n          }\n        ]\n      }\n    ],\n    \"output_specs\": [\n      {\n        \"spec_id\": \"orders_duplicate_no_args_write\",\n        \"input_id\": \"watermarking_orders\",\n        \"write_type\": \"append\",\n        \"data_format\": \"delta\",\n        \"partitions\": [\"date\"],\n        \"options\": {\n          \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/checkpoint\"\n        },\n        \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_drop_duplicates_overall_watermark/data\"\n      }\n    ]\n  }"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_inner_join/customer_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"name\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"birth_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_inner_join/data/control/streaming_inner_join.csv",
    "content": "salesorder|item|date|customer|article|amount|customer_name\n3|1|2017-05-12 01:01:01|customer1|article5|20000|Anna\n3|2|2017-05-12 01:01:01|customer1|article2|12000|Anna\n3|3|2017-05-12 01:01:01|customer1|article4|9000|Anna\n4|1|2017-05-12 01:01:01|customer3|article3|8000|Sarah\n4|2|2017-05-12 01:01:01|customer3|article7|7000|Sarah\n4|3|2017-05-12 01:01:01|customer3|article1|3000|Sarah\n4|4|2017-05-12 01:01:01|customer3|article2|5000|Sarah\n2|1|2017-05-12 01:01:01|customer2|article4|1000|John\n2|2|2017-05-12 01:01:01|customer2|article6|5000|John\n2|3|2017-05-12 01:01:01|customer2|article1|3000|John\n1|1|2017-05-10 01:01:01|customer1|article1|1000|Anna\n1|2|2017-05-10 01:01:01|customer1|article2|2000|Anna\n1|3|2017-05-10 01:01:01|customer1|article3|500|Anna"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_inner_join/data/source/customer-part-01.csv",
    "content": "customer|name|birth_date|date\ncustomer1|Anna|01012002|2017-05-10 01:01:01.000\ncustomer2|John|04051980|2017-05-10 01:01:01.000\ncustomer3|Sarah|02051940|2017-05-10 01:01:01.000\ncustomer7|George|02051940|2017-05-10 01:01:01.000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_inner_join/data/source/sales-part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|2017-05-10 01:01:01.000|customer1|article1|1000\n1|2|2017-05-10 01:01:01.000|customer1|article2|2000\n1|3|2017-05-10 01:01:01.000|customer1|article3|500\n1|3|2017-05-10 01:01:01.000|customer10|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_inner_join/data/source/sales-part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n2|1|2017-05-12 01:01:01.000|customer2|article4|1000\n2|2|2017-05-12 01:01:01.000|customer2|article6|5000\n2|3|2017-05-12 01:01:01.000|customer2|article1|3000\n3|1|2017-05-12 01:01:01.000|customer1|article5|20000\n3|2|2017-05-12 01:01:01.000|customer1|article2|12000\n3|3|2017-05-12 01:01:01.000|customer1|article4|9000\n4|1|2017-05-12 01:01:01.000|customer3|article3|8000\n4|2|2017-05-12 01:01:01.000|customer3|article7|7000\n4|3|2017-05-12 01:01:01.000|customer3|article1|3000\n4|4|2017-05-12 01:01:01.000|customer3|article2|5000\n4|4|2017-05-07 01:01:01.000|customer3|article2|5000\n1|3|2017-05-14 01:01:01.000|customer100|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_inner_join/sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_inner_join/streaming_inner_join.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_inner_join/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_inner_join/data/sales\"\n    },\n    {\n      \"spec_id\": \"customers\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_inner_join/customer_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_inner_join/data/customers\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"join_with_customers\",\n      \"input_id\": \"sales\",\n      \"transformers\": [\n        {\n          \"function\": \"join\",\n          \"args\": {\n            \"join_with\": \"customers\",\n            \"join_type\": \"inner\",\n            \"join_condition\": \"a.customer = b.customer and a.date between b.date and b.date + interval 4 days\",\n            \"select_cols\": [\"a.*\", \"b.name as customer_name\"],\n            \"watermarker\": {\"a\":{\"col\":  \"date\", \"watermarking_time\":  \"2 days\"}, \"b\": {\"col\":  \"date\", \"watermarking_time\":  \"2 days\"}}\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"join_with_customers\",\n      \"write_type\": \"append\",\n      \"db_table\": \"test_db.streaming_inner_join\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_inner_join/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_inner_join/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.stateStore.stateSchemaCheck\": false\n  }\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_inner_join/streaming_inner_join_control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer_name\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/customer_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"customerId\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customerClickTime\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/control/streaming_left_outer_join.csv",
    "content": "customerId|customerBuyTime|customerClickTime\n0|2018-03-06 04:32:09.076|2018-03-06 04:32:31.941\n1|2018-03-06 04:32:09.276|\n2|2018-03-06 04:32:09.476|\n3|2018-03-06 04:32:10.676|2018-03-06 04:32:31.941\n3|2018-03-06 07:54:10.876|2018-03-06 07:54:10.876\n4|2018-03-06 04:32:10.876|\n5|2018-03-06 04:32:10.076|2018-03-06 04:32:32.341\n10|2018-03-06 04:32:00|\n10|2018-03-06 04:53:10.676|2018-03-06 04:53:10.676\n11|2018-03-06 04:53:10.876|2018-03-06 04:54:00.676\n11|2018-03-06 04:53:10.876|2018-03-06 04:54:00.676\n15|2018-03-06 04:32:05.676|\n21|2018-03-03 04:53:10.876|\n"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/customer-part-01.csv",
    "content": "customerId|customerClickTime\n0|2018-03-06T04:32:31.941+0000\n3|2018-03-06T04:32:31.941+0000\n5|2018-03-06T04:32:32.341+0000\n8|2018-03-06T04:32:32.941+0000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/customer-part-02.csv",
    "content": "customerId|customerClickTime\n10|2018-03-06T04:53:10.676+0000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/customer-part-03.csv",
    "content": "customerId|customerClickTime\n11|2018-03-06T04:54:00.676+0000\n0|2018-03-06T07:53:10.876+0000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/customer-part-04.csv",
    "content": "customerId|customerClickTime"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/customer-part-05.csv",
    "content": "customerId|customerClickTime\n3|2018-03-06T07:54:10.876+0000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/sales-part-01.csv",
    "content": "customerId|customerBuyTime\n0|2018-03-06T04:32:09.076+0000\n1|2018-03-06T04:32:09.276+0000\n2|2018-03-06T04:32:09.476+0000\n21|2018-03-03T04:53:10.876+0000\n10|2018-03-06T04:32:00.000+0000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/sales-part-02.csv",
    "content": "customerId|customerBuyTime\n3|2018-03-06T04:32:10.676+0000\n4|2018-03-06T04:32:10.876+0000\n5|2018-03-06T04:32:10.076+0000\n15|2018-03-06T04:32:05.676+0000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/sales-part-03.csv",
    "content": "customerId|customerBuyTime\n10|2018-03-06T04:53:10.676+0000\n11|2018-03-06T04:53:10.876+0000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/sales-part-04.csv",
    "content": "customerId|customerBuyTime\n11|2018-03-06T04:53:10.876+0000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/data/source/sales-part-05.csv",
    "content": "customerId|customerBuyTime\n3|2018-03-06T07:54:10.876+0000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"customerId\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customerBuyTime\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/streaming_left_outer_join.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_left_outer_join/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_left_outer_join/data/sales\"\n    },\n    {\n      \"spec_id\": \"customers\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_left_outer_join/customer_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_left_outer_join/data/customers\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"join_with_customers\",\n      \"input_id\": \"sales\",\n      \"transformers\": [\n        {\n          \"function\": \"join\",\n          \"args\": {\n            \"left_df_alias\": \"df\",\n            \"right_df_alias\": \"join_with\",\n            \"join_with\": \"customers\",\n            \"join_type\": \"left outer\",\n            \"join_condition\": \"df.customerId = join_with.customerId and join_with.customerClickTime BETWEEN df.customerBuyTime AND df.customerBuyTime + INTERVAL 1 MINUTE\",\n            \"select_cols\": [\"df.*\", \"join_with.customerClickTime\"],\n            \"watermarker\": {\"df\":{\"col\":  \"customerBuyTime\", \"watermarking_time\":  \"10 seconds\"}, \"join_with\": {\"col\":  \"customerClickTime\", \"watermarking_time\":  \"20 seconds\"}}\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"join_with_customers\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"customerId\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_left_outer_join/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_left_outer_join/data\"\n    }\n  ],\n  \"exec_env\": {\n    \"spark.sql.streaming.stateStore.stateSchemaCheck\": false\n  }\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_left_outer_join/streaming_left_outer_join_control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"customerId\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customerBuyTime\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customerClickTime\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_right_outer_join/customer_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"name\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"birth_date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_right_outer_join/data/control/streaming_right_outer_join.csv",
    "content": "salesorder|item|date|customer|article|amount|customer_name\n3|1|2017-05-12 01:01:01|customer1|article5|20000|Anna\n3|2|2017-05-12 01:01:01|customer1|article2|12000|Anna\n3|3|2017-05-12 01:01:01|customer1|article4|9000|Anna\n4|1|2017-05-12 01:01:01|customer3|article3|8000|Sarah\n2|1|2017-05-12 01:01:01|customer2|article4|1000|John\n2|2|2017-05-12 01:01:01|customer2|article6|5000|John\n2|3|2017-05-12 01:01:01|customer2|article1|3000|John\n1|1|2017-05-12 00:00:01|customer1|article1|1000|Anna\n1|2|2017-05-12 00:00:01|customer1|article2|2000|Anna\n1|3|2017-05-12 00:00:01|customer1|article3|500|Anna|\n||||||Fran|"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_right_outer_join/data/source/customer-part-01.csv",
    "content": "customer|name|birth_date|date\ncustomer1|Anna|01012002|2017-05-12 23:01:01.000\ncustomer2|John|04051980|2017-05-12 23:01:01.000\ncustomer3|Sarah|02051940|2017-05-12 23:01:01.000\ncustomer5|Fran|02051940|2017-05-05 00:01:01.000\ncustomer6|Nuno|02051940|2017-05-12 00:01:01.000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_right_outer_join/data/source/sales-part-01.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|2017-05-12 00:00:01.000|customer1|article1|1000\n1|2|2017-05-12 00:00:01.000|customer1|article2|2000\n1|3|2017-05-12 00:00:01.000|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_right_outer_join/data/source/sales-part-02.csv",
    "content": "salesorder|item|date|customer|article|amount\n2|1|2017-05-12 01:01:01.000|customer2|article4|1000\n2|2|2017-05-12 01:01:01.000|customer2|article6|5000\n2|3|2017-05-12 01:01:01.000|customer2|article1|3000\n3|1|2017-05-12 01:01:01.000|customer1|article5|20000\n3|2|2017-05-12 01:01:01.000|customer1|article2|12000\n3|3|2017-05-12 01:01:01.000|customer1|article4|9000\n4|1|2017-05-12 01:01:01.000|customer3|article3|8000\n4|3|2017-05-12 01:01:01.000|customer800|article1|3000\n4|4|2017-05-05 01:01:01.000|customer3|article2|5000\n4|4|2017-05-07 01:01:01.000|customer800|article2|5000"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_right_outer_join/sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_right_outer_join/streaming_right_outer_join.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_right_outer_join/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_right_outer_join/data/sales\"\n    },\n    {\n      \"spec_id\": \"customers\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_right_outer_join/customer_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/transformations/watermarker/streaming_right_outer_join/data/customers\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"join_with_customers\",\n      \"input_id\": \"sales\",\n      \"transformers\": [\n        {\n          \"function\": \"join\",\n          \"args\": {\n            \"left_df_alias\": \"df\",\n            \"right_df_alias\": \"join_with\",\n            \"join_with\": \"customers\",\n            \"join_type\": \"right outer\",\n            \"join_condition\": \"df.customer = join_with.customer and join_with.date >= df.date AND join_with.date <= df.date + interval 1 days\",\n            \"select_cols\": [\"df.*\", \"join_with.name as customer_name\"],\n            \"watermarker\": {\"df\":{\"col\": \"date\", \"watermarking_time\": \"2 days\"}, \"join_with\": {\"col\": \"date\", \"watermarking_time\": \"2 days\"}}\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"join_with_customers\",\n      \"write_type\": \"merge\",\n      \"db_table\": \"test_db.streaming_outer_join\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\n        \"date\"\n      ],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_right_outer_join/checkpoint\"\n      },\n      \"merge_opts\": {\n        \"merge_predicate\": \"current.salesorder = new.salesorder and current.item = new.item and current.customer_name == new.customer_name\",\n        \"update_predicate\": \"new.date >= current.date\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/transformations/watermarker/streaming_right_outer_join/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/transformations/watermarker/streaming_right_outer_join/streaming_right_outer_join_control_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"timestamp\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer_name\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_batch_console.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        },\n        {\n          \"function\": \"coalesce\",\n          \"args\": {\n            \"num_partitions\": 1\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"console\",\n      \"options\": {\n        \"limit\": 8,\n        \"truncate\": false,\n        \"vertical\": false\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_batch_dataframe.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        },\n        {\n          \"function\": \"coalesce\",\n          \"args\": {\n            \"num_partitions\": 1\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"dataframe\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_batch_files.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"location\": \"file:///app/tests/lakehouse/out/feature/writers/write_batch_files/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_batch_jdbc.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        },\n        {\"function\": \"coalesce\",\n          \"args\": {\"num_partitions\": 1}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"jdbc\",\n      \"partitions\": [\"date\"],\n      \"options\":{\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/out/feature/writers/write_batch_jdbc/test.db\",\n        \"dbtable\": \"write_batch_jdbc\",\n        \"driver\": \"org.sqlite.JDBC\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_batch_rest_api.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        },\n        {\"function\": \"with_literals\",\n          \"args\": {\"literals\": {\"payload\": \"{\\\"a\\\": \\\"a value\\\"}\"}}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"rest_api\",\n      \"options\": {\n        \"rest_api_url\": \"https://www.dummy-url.local/dummy-endpoint\",\n        \"rest_api_method\": \"post\",\n        \"rest_api_header\": {\"Authorization\": \"Bearer dummytoken\"}\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_batch_table.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        },\n        {\"function\": \"coalesce\",\n          \"args\": {\"num_partitions\": 1}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"db_table\": \"test_db.write_batch_table\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/writers/write_batch_table/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_console.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"console\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_dataframe.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"dataframe\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_df_with_checkpoint.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\",\n        \"maxFilesPerTrigger\": \"1\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\",\n        \"maxFilesPerTrigger\": \"1\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"dataframe\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_df_with_checkpoint/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_files.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_files/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_files/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_foreachBatch_console.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"force_streaming_foreach_batch_processing\": true,\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"console\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_foreachBatch_dataframe.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"force_streaming_foreach_batch_processing\": true,\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"dataframe\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_foreachBatch_df_with_checkpoint.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\",\n        \"maxFilesPerTrigger\": \"1\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"force_streaming_foreach_batch_processing\": true,\n      \"transformers\": [\n        {\n          \"function\": \"union\",\n          \"args\": {\n            \"union_with\": [\"sales_new\"]\n          }\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"dataframe\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_foreachBatch_df_with_checkpoint/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_foreachBatch_files.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"force_streaming_foreach_batch_processing\": true,\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_foreachBatch_files/checkpoint\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_foreachBatch_files/data\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_foreachBatch_jdbc.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"force_streaming_foreach_batch_processing\": true,\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        },\n        {\"function\": \"coalesce\",\n          \"args\": {\"num_partitions\": 1}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"jdbc\",\n      \"partitions\": [\"date\"],\n      \"options\":{\n        \"url\": \"jdbc:sqlite:/app/tests/lakehouse/out/feature/writers/write_streaming_foreachBatch_jdbc/test.db\",\n        \"dbtable\": \"write_streaming_foreachBatch_jdbc\",\n        \"driver\": \"org.sqlite.JDBC\",\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_foreachBatch_jdbc/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_foreachBatch_table.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"batch\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"force_streaming_foreach_batch_processing\": true,\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        },\n        {\"function\": \"coalesce\",\n          \"args\": {\"num_partitions\": 1}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"db_table\": \"test_db.write_streaming_foreachBatch_table\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_foreachBatch_table/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_foreachBatch_table/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_multiple_dfs.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"bronze_sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"bronze_sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"input_id\": \"bronze_sales_historical\",\n      \"data_format\": \"dataframe\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"input_id\": \"bronze_sales_new\",\n      \"data_format\": \"dataframe\"\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_rest_api.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        },\n        {\"function\": \"with_literals\",\n          \"args\": {\"literals\": {\"payload\": \"{\\\"a\\\": \\\"a value\\\"}\"}}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"data_format\": \"rest_api\",\n      \"options\": {\n        \"rest_api_url\": \"https://www.dummy-url.local/dummy-endpoint\",\n        \"rest_api_method\": \"put\",\n        \"rest_api_basic_auth_username\": \"dummy_user\",\n        \"rest_api_basic_auth_password\": \"dummy_password\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/acons/write_streaming_table.json",
    "content": "{\n  \"input_specs\": [\n    {\n      \"spec_id\": \"sales_historical\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n      \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_historical/\"\n    },\n    {\n      \"spec_id\": \"sales_new\",\n      \"read_type\": \"streaming\",\n      \"data_format\": \"csv\",\n      \"schema_path\": \"file:///app/tests/lakehouse/in/feature/writers/schema/sales_schema.json\",\n      \"options\": {\n        \"header\": true,\n        \"delimiter\": \"|\",\n        \"mode\": \"FAILFAST\"\n      },\n    \"location\": \"file:///app/tests/lakehouse/in/feature/writers/source/sales_new/\"\n    }\n  ],\n  \"transform_specs\": [\n    {\n      \"spec_id\": \"union_dataframes\",\n      \"input_id\": \"sales_historical\",\n      \"transformers\": [\n        {\"function\": \"union\",\n          \"args\": {\"union_with\": [\"sales_new\"]}\n        },\n        {\"function\": \"coalesce\",\n          \"args\": {\"num_partitions\": 1}\n        }\n      ]\n    }\n  ],\n  \"output_specs\": [\n    {\n      \"spec_id\": \"sales\",\n      \"input_id\": \"union_dataframes\",\n      \"write_type\": \"append\",\n      \"data_format\": \"delta\",\n      \"partitions\": [\"date\"],\n      \"db_table\": \"test_db.write_streaming_table\",\n      \"location\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_table/data\",\n      \"options\": {\n        \"checkpointLocation\": \"file:///app/tests/lakehouse/out/feature/writers/write_streaming_table/checkpoint\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/control/writers_control.csv",
    "content": "salesorder|item|date|customer|article|amount\n0|1|20140601|customer1|article1|1000\n0|2|20140601|customer1|article2|2000\n0|3|20140601|customer1|article3|500\n1|1|20150601|customer1|article1|1000\n1|2|20150601|customer1|article2|2000\n1|3|20150601|customer1|article3|500\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000\n6|1|20160218|customer3|article7|100\n6|2|20160218|customer3|article9|500\n6|3|20160218|customer3|article8|300\n7|1|20160218|customer5|article7|2000"
  },
  {
    "path": "tests/resources/feature/writers/control/writers_control_streaming_dataframe_1.csv",
    "content": "salesorder|item|date|customer|article|amount\n0|1|20140601|customer1|article1|1000\n0|2|20140601|customer1|article2|2000\n0|3|20140601|customer1|article3|500\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000\n"
  },
  {
    "path": "tests/resources/feature/writers/control/writers_control_streaming_dataframe_2.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20150601|customer1|article1|1000\n1|2|20150601|customer1|article2|2000\n1|3|20150601|customer1|article3|500\n6|1|20160218|customer3|article7|100\n6|2|20160218|customer3|article9|500\n6|3|20160218|customer3|article8|300\n7|1|20160218|customer5|article7|2000"
  },
  {
    "path": "tests/resources/feature/writers/control/writers_control_streaming_dataframe_foreachBatch_1.csv",
    "content": "salesorder|item|date|customer|article|amount\n0|1|20140601|customer1|article1|1000\n0|2|20140601|customer1|article2|2000\n0|3|20140601|customer1|article3|500\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000\n"
  },
  {
    "path": "tests/resources/feature/writers/control/writers_control_streaming_dataframe_foreachBatch_2.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20150601|customer1|article1|1000\n1|2|20150601|customer1|article2|2000\n1|3|20150601|customer1|article3|500\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000\n6|1|20160218|customer3|article7|100\n6|2|20160218|customer3|article9|500\n6|3|20160218|customer3|article8|300\n7|1|20160218|customer5|article7|2000"
  },
  {
    "path": "tests/resources/feature/writers/schema/sales_schema.json",
    "content": "{\n  \"type\": \"struct\",\n  \"fields\": [\n    {\n      \"name\": \"salesorder\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"item\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"date\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"customer\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"article\",\n      \"type\": \"string\",\n      \"nullable\": true,\n      \"metadata\": {}\n    },\n    {\n      \"name\": \"amount\",\n      \"type\": \"integer\",\n      \"nullable\": true,\n      \"metadata\": {}\n    }\n  ]\n}"
  },
  {
    "path": "tests/resources/feature/writers/source/sales_historical_1.csv",
    "content": "salesorder|item|date|customer|article|amount\n0|1|20140601|customer1|article1|1000\n0|2|20140601|customer1|article2|2000\n0|3|20140601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/writers/source/sales_historical_2.csv",
    "content": "salesorder|item|date|customer|article|amount\n1|1|20150601|customer1|article1|1000\n1|2|20150601|customer1|article2|2000\n1|3|20150601|customer1|article3|500"
  },
  {
    "path": "tests/resources/feature/writers/source/sales_new_1.csv",
    "content": "salesorder|item|date|customer|article|amount\n2|1|20160215|customer2|article4|1000\n2|2|20160215|customer2|article6|5000\n2|3|20160215|customer2|article1|3000\n3|1|20160215|customer1|article5|20000\n"
  },
  {
    "path": "tests/resources/feature/writers/source/sales_new_2.csv",
    "content": "salesorder|item|date|customer|article|amount\n6|1|20160218|customer3|article7|100\n6|2|20160218|customer3|article9|500\n6|3|20160218|customer3|article8|300\n7|1|20160218|customer5|article7|2000"
  },
  {
    "path": "tests/resources/unit/custom_configs/custom_engine_config.yaml",
    "content": "notif_disallowed_email_servers:\n  - dummy.file.server"
  },
  {
    "path": "tests/resources/unit/heartbeat/heartbeat_acon_creation/setup/column_list/heartbeat_sensor_control_table.json",
    "content": "{\n  \"sensor_source\": \"string\",\n  \"sensor_id\": \"string\",\n  \"sensor_read_type\": \"string\",\n  \"asset_description\": \"string\",\n  \"upstream_key\": \"string\",\n  \"preprocess_query\": \"string\",\n  \"latest_event_fetched_timestamp\": \"timestamp\",\n  \"trigger_job_id\": \"string\",\n  \"trigger_job_name\": \"string\",\n  \"status\": \"string\",\n  \"status_change_timestamp\": \"timestamp\",\n  \"job_start_timestamp\": \"timestamp\",\n  \"job_end_timestamp\": \"timestamp\",\n  \"job_state\": \"string\",\n  \"dependency_flag\": \"string\"\n}"
  },
  {
    "path": "tests/resources/unit/heartbeat/heartbeat_acon_creation/setup/column_list/sensor_table.json",
    "content": "{\n  \"sensor_id\": \"string\",\n  \"assets\": \"array<string>\",\n  \"status\": \"string\",\n  \"status_change_timestamp\": \"timestamp\",\n  \"checkpoint_location\": \"string\",\n  \"upstream_key\": \"string\",\n  \"upstream_value\": \"string\"\n}"
  },
  {
    "path": "tests/resources/unit/heartbeat/heartbeat_anchor_job/setup/column_list/heartbeat_sensor_control_table.json",
    "content": "{\n  \"sensor_source\": \"string\",\n  \"sensor_id\": \"string\",\n  \"sensor_read_type\": \"string\",\n  \"asset_description\": \"string\",\n  \"upstream_key\": \"string\",\n  \"preprocess_query\": \"string\",\n  \"latest_event_fetched_timestamp\": \"timestamp\",\n  \"trigger_job_id\": \"string\",\n  \"trigger_job_name\": \"string\",\n  \"status\": \"string\",\n  \"status_change_timestamp\": \"timestamp\",\n  \"job_start_timestamp\": \"timestamp\",\n  \"job_end_timestamp\": \"timestamp\",\n  \"job_state\": \"string\",\n  \"dependency_flag\": \"string\"\n}"
  },
  {
    "path": "tests/resources/unit/heartbeat/heartbeat_anchor_job/setup/column_list/sensor_table.json",
    "content": "{\n  \"sensor_id\": \"string\",\n  \"assets\": \"array<string>\",\n  \"status\": \"string\",\n  \"status_change_timestamp\": \"timestamp\",\n  \"checkpoint_location\": \"string\",\n  \"upstream_key\": \"string\",\n  \"upstream_value\": \"string\"\n}"
  },
  {
    "path": "tests/resources/unit/sharepoint_reader/data/sample_ok.csv",
    "content": "col_a,col_b\n1,2"
  },
  {
    "path": "tests/resources/unit/sharepoint_reader/data/sample_other_delim.csv",
    "content": "col_a;col_b\n1;2"
  },
  {
    "path": "tests/unit/__init__.py",
    "content": "\"\"\"Tests utilities.\"\"\"\n"
  },
  {
    "path": "tests/unit/test_acon_validation.py",
    "content": "\"\"\"Unit tests for ACON validators.\"\"\"\n\nimport pytest\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"Validate delete objects function\",\n            \"acon\": {\n                \"operations\": [\n                    {\n                        \"manager\": \"file\",\n                        \"function\": \"delete_objects\",\n                        \"bucket\": \"example-bucket\",\n                        \"object_paths\": [\"path/to/delete/\"],\n                        \"dry_run\": True,\n                    }\n                ],\n            },\n        },\n        {\n            \"name\": \"Validate copy objects function with missing parameters\",\n            \"acon\": {\n                \"operations\": [\n                    {\n                        \"manager\": \"file\",\n                        \"function\": \"copy_objects\",\n                        \"bucket\": \"example-bucket\",\n                        \"source_object\": [\"path/to/copy/\"],\n                    }\n                ]\n            },\n            \"exception\": \"\"\"Errors found during validation:\nMissing mandatory parameters for file manager function copy_objects: ['destination_bucket', 'destination_object', 'dry_run']\nType validation errors for file manager function copy_objects: [\"Parameter 'source_object' expected str, got list\"]\"\"\",  # noqa: E501\n        },\n        {\n            \"name\": \"Validate list of operations\",\n            \"acon\": {\n                \"operations\": [\n                    {\n                        \"manager\": \"file\",\n                        \"function\": \"delete_objects\",\n                        \"bucket\": \"example-bucket\",\n                        \"object_paths\": [\"path/to/delete/\"],\n                        \"dry_run\": True,\n                    },\n                    {\n                        \"manager\": \"table\",\n                        \"function\": \"execute_sql\",\n                        \"sql\": \"create example_table\",\n                    },\n                    {\n                        \"manager\": \"table\",\n                        \"function\": \"optimize\",\n                        \"table_or_view\": \"example_table\",\n                    },\n                ],\n            },\n        },\n        {\n            \"name\": \"Validate list of operations with errors\",\n            \"acon\": {\n                \"operations\": [\n                    {\n                        \"manager\": \"file\",\n                        \"function\": \"delete_objects\",\n                        \"bucket\": \"example-bucket\",\n                        \"object_paths\": \"path/to/delete/\",\n                        \"dry_run\": \"test string\",\n                    },\n                    {\n                        \"manager\": \"table\",\n                        \"function\": \"execute_sql\",\n                        \"sql\": 10,\n                    },\n                    {\n                        \"manager\": \"table\",\n                        \"function\": \"optimize_dataset\",\n                        \"table_or_view\": \"example_table\",\n                    },\n                ]\n            },\n            \"exception\": \"\"\"Errors found during validation:\nType validation errors for file manager function delete_objects: [\"Parameter 'object_paths' expected list, got str\", \"Parameter 'dry_run' expected bool, got str\"]\nType validation errors for table manager 
function execute_sql: [\"Parameter 'sql' expected str, got int\"]\nFunction 'optimize_dataset' not supported for table manager\"\"\",  # noqa: E501\n        },\n    ],\n)\ndef test_manager_validation(scenario: dict) -> None:\n    \"\"\"Test to validate manager acons.\"\"\"\n    from lakehouse_engine.engine import validate_manager_list\n\n    acon = scenario[\"acon\"]\n    exception = scenario.get(\"exception\", None)\n\n    if exception:\n        with pytest.raises(Exception) as e:\n            validate_manager_list(acon)\n        assert str(e.value) == exception\n    else:\n        validate_manager_list(acon)\n"
  },
  {
    "path": "tests/unit/test_custom_configs.py",
    "content": "\"\"\"Unit tests for overwritten the default configs.\"\"\"\n\nfrom lakehouse_engine.core import exec_env\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom tests.conftest import UNIT_RESOURCES\n\nLOGGER = LoggingHandler(__name__).get_logger()\n\nTEST_PATH = \"custom_configs\"\nTEST_RESOURCES = f\"{UNIT_RESOURCES}/{TEST_PATH}\"\n\n\ndef test_custom_config() -> None:\n    \"\"\"Testing using a custom configuration.\"\"\"\n    default_configs = exec_env.ExecEnv.ENGINE_CONFIG.notif_disallowed_email_servers\n    LOGGER.info(f\"Default disallowed email server: {default_configs}\")\n\n    # Testing custom configurations using a dictionary\n    exec_env.ExecEnv.set_default_engine_config(\n        custom_configs_dict={\"notif_disallowed_email_servers\": [\"dummy.server.test\"]},\n    )\n    dict_custom_configs = exec_env.ExecEnv.ENGINE_CONFIG.notif_disallowed_email_servers\n    LOGGER.info(\n        f\"Custom disallowed email server using dictionary: {dict_custom_configs}\"\n    )\n    assert default_configs != dict_custom_configs\n\n    # Testing custom configurations using a file\n    exec_env.ExecEnv.set_default_engine_config(\n        custom_configs_file_path=f\"{TEST_RESOURCES}/custom_engine_config.yaml\",\n    )\n    file_custom_configs = exec_env.ExecEnv.ENGINE_CONFIG.notif_disallowed_email_servers\n    LOGGER.info(\n        f\"Custom disallowed email server using configuration file: \"\n        f\"{file_custom_configs}\"\n    )\n    assert default_configs != file_custom_configs\n\n    # Resetting to the default configurations\n    exec_env.ExecEnv.set_default_engine_config(package=\"tests.configs\")\n    reset_configs = exec_env.ExecEnv.ENGINE_CONFIG.notif_disallowed_email_servers\n    LOGGER.info(f\"Reset disallowed email server: {reset_configs}\")\n    assert default_configs == reset_configs\n"
  },
  {
    "path": "tests/unit/test_databricks_utils.py",
    "content": "\"\"\"Unit tests for DatabricksUtils in lakehouse_engine.utils.databricks_utils.\"\"\"\n\nimport sys\nimport types\nfrom unittest.mock import MagicMock, patch\n\nfrom lakehouse_engine.utils.databricks_utils import DatabricksUtils\n\nCONTEXT_KEYS = {\n    \"runId\": \"76890\",\n    \"jobId\": \"657890\",\n    \"jobName\": \"sadp-template-dummy_job\",\n    \"workspaceId\": \"213245431\",\n    \"usagePolicyId\": \"4567890\",\n}\nCONTROL_DATA = {\n    \"run_id\": \"76890\",\n    \"job_id\": \"657890\",\n    \"job_name\": \"sadp-template-dummy_job\",\n    \"workspace_id\": \"213245431\",\n    \"policy_id\": \"4567890\",\n    \"dp_name\": \"sadp-template\",\n    \"environment\": \"dev\",\n}\n\n\ndef test_get_usage_context_for_serverless() -> None:\n    \"\"\"Test for get_usage_context_for_serverless method in DatabricksUtils.\"\"\"\n    # Create a fake module and function\n    fake_module = types.ModuleType(\"dbruntime.databricks_repl_context\")\n    fake_module.get_context = MagicMock(  # type: ignore[attr-defined]\n        return_value=MagicMock()\n    )\n    sys.modules[\"dbruntime\"] = types.ModuleType(\"dbruntime\")\n    sys.modules[\"dbruntime.databricks_repl_context\"] = fake_module\n\n    mock_context = MagicMock(**CONTEXT_KEYS)\n\n    # Patch get_context to return our mock context\n    with patch(\n        \"dbruntime.databricks_repl_context.get_context\", return_value=mock_context\n    ):\n\n        with patch(\n            \"lakehouse_engine.core.exec_env.ExecEnv.get_environment\", return_value=\"dev\"\n        ):\n            usage_stats: dict = {}\n            DatabricksUtils.get_usage_context_for_serverless(usage_stats)\n\n            assert (\n                usage_stats == CONTROL_DATA\n            ), f\"Expected usage_stats to be {CONTROL_DATA}, but got {usage_stats}\"\n\n    # Clean up after test\n    del sys.modules[\"dbruntime.databricks_repl_context\"]\n    del sys.modules[\"dbruntime\"]\n"
  },
  {
    "path": "tests/unit/test_failure_notification_creation.py",
    "content": "\"\"\"Unit tests for the creation of failure notifications.\"\"\"\n\nimport re\nimport time\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.terminators.notifier_factory import NotifierFactory\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom tests.utils.smtp_server import SMTPServer\n\nLOGGER = LoggingHandler(__name__).get_logger()\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"Email notification creation using a template.\",\n            \"spec\": [\n                TerminatorSpec(\n                    function=\"notify\",\n                    args={\n                        \"server\": \"localhost\",\n                        \"port\": \"1025\",\n                        \"type\": \"email\",\n                        \"template\": \"failure_notification_email\",\n                        \"from\": \"test-email@email.com\",\n                        \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                        \"on_failure\": True,\n                    },\n                ),\n            ],\n            \"server\": \"localhost\",\n            \"port\": \"1025\",\n            \"expected\": \"\"\"\n            Job local in workspace local has\n            failed with the exception: Test exception\"\"\",\n        },\n    ],\n)\ndef test_failure_notification_creation(scenario: dict) -> None:\n    \"\"\"Testing notification creation.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    expected_output = scenario[\"expected\"]\n\n    try:\n        port = scenario[\"port\"]\n        server = scenario[\"server\"]\n\n        smtp_server = SMTPServer(server, port)\n        smtp_server.start()\n\n        # We sleep so the subprocess has time to start the debug smtp server\n        time.sleep(2)\n\n        NotifierFactory.generate_failure_notification(\n            scenario[\"spec\"], ValueError(\"Test exception\")\n        )\n\n        message = _parse_email_output(smtp_server.get_last_message().as_string())\n\n        assert message == expected_output\n\n    finally:\n        smtp_server.stop()\n\n\ndef _parse_email_output(mail_content: str) -> str:\n    \"\"\"Parse the mail that was received in the debug smtp server.\n\n    The regex is fetching the data between the encoding's field 'bit' and\n    the next boundary of the email.\n    Example notification content:\n        Content-Type: multipart/mixed; boundary=\"===============1362798268250904879==\"\n        MIME-Version: 1.0\n        From: test-email@email.com\n        To: test-email1@email.com, test-email2@email.com\n        CC:\n        BCC:\n        Subject: Service Failure\n        Importance: normal\n        X-Peer: ('::1', 49472, 0, 0)\n        X-MailFrom: test-email@email.com\n        X-RcptTo: test-email1@email.com, test-email2@email.com\n\n        --===============1362798268250904879==\n        Content-Type: text/text; charset=\"us-ascii\"\n        MIME-Version: 1.0\n        Content-Transfer-Encoding: 7bit\n\n\n                    Job local in workspace local has\n                    failed with the exception: Test exception\n        --===============1362798268250904879==--\n\n    Args:\n        mail_content: The content of the email to parse.\n\n    Returns:\n        The parsed email message.\n    \"\"\"\n    message = re.search(\"(?<=bit\\n).*?(?=--=)\", mail_content, re.S).group()[1:-1]\n\n    return str(message)\n"
  },
  {
    "path": "tests/unit/test_heartbeat_acon_creation.py",
    "content": "\"\"\"Module that tests the Acon creation function from the heartbeat module.\"\"\"\n\nfrom unittest.mock import Mock, patch\n\nimport pytest\nfrom pyspark.sql import DataFrame\n\nfrom lakehouse_engine.algorithms.sensors.heartbeat import Heartbeat\nfrom lakehouse_engine.core.definitions import HeartbeatConfigSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import LAKEHOUSE, UNIT_RESOURCES\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"heartbeat_acon_creation\"\nFEATURE_TEST_RESOURCES = f\"{UNIT_RESOURCES}/heartbeat/{TEST_NAME}\"\n_LOGGER = LoggingHandler(__name__).get_logger()\n\n_SETUP_DELTA_TABLES = [\"heartbeat_sensor_control_table\", \"sensor_table\"]\n\n\ndef _create_heartbeat_table() -> None:\n    \"\"\"Create the necessary tables required for using Heartbeat.\"\"\"\n    _LOGGER.info(\"Creating heartbeat tables\")\n    for table in _SETUP_DELTA_TABLES:\n        DataframeHelpers.create_delta_table(\n            cols=SchemaUtils.from_file_to_dict(\n                f\"file:///{FEATURE_TEST_RESOURCES}/setup/column_list/{table}.json\"\n            ),\n            table=table,\n        )\n\n\ndef _select_all(table: str) -> DataFrame:\n    \"\"\"Select all records from the specified table.\n\n    Args:\n        table (str): The name of the table.\n    \"\"\"\n    return ExecEnv.SESSION.sql(f\"SELECT * FROM  {table} ORDER BY sensor_id\")  # nosec\n\n\ndef _check_acon(heartbeat_table: str, acon: dict, acon_result_list: dict) -> None:\n    \"\"\"Validates the generated ACON.\n\n    Args:\n        heartbeat_table (str): The name of the heartbeat control table.\n        acon (dict): The initial ACON that feeds the heartbeat algorithm.\n        acon_result_list (dict): The expected ACON configuration.\n    \"\"\"\n    _LOGGER.info(\"Checking acon creation.\")\n    for control_table_row in _select_all(heartbeat_table).collect():\n        result = Heartbeat._get_sensor_acon_from_heartbeat(\n            HeartbeatConfigSpec.create_from_acon(acon), control_table_row\n        )\n        print(result)\n\n        assert result == acon_result_list[control_table_row[\"sensor_id\"]]\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"use_case_name\": \"delta_table\",\n            \"rows_to_add\": {\n                \"heartbeat\": \"\"\"\n                    (\"delta_table\",\"dummy_order\",\"batch\",\n                    \"delta_table_order_events\",NULL,NULL,NULL,\n                    \"9274610384726150\",\"dummy_order_events\",\"COMPLETED\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"TRUE\")\n                    \"\"\",\n            },\n            \"results\": {\n                \"dummy_order\": {\n                    \"sensor_id\": \"dummy_order_9274610384726150\",\n                    \"assets\": [\"delta_table_order_events_9274610384726150\"],\n                    \"control_db_table_name\": \"test_db.sensor_table\",\n                    \"input_spec\": {\n                        \"spec_id\": \"sensor_upstream\",\n                        \"read_type\": \"batch\",\n                        \"data_format\": \"delta\",\n                        \"db_table\": \"dummy_order\",\n                        \"options\": None,\n                        \"location\": None,\n                        \"schema\": None,\n              
      },\n                    \"preprocess_query\": None,\n                    \"base_checkpoint_location\": None,\n                    \"fail_on_empty_result\": False,\n                },\n            },\n        },\n        {\n            \"use_case_name\": \"kafka\",\n            \"rows_to_add\": {\n                \"heartbeat\": \"\"\"\n                    (\"kafka\",\n                    \"sales: sales.dummy_deliveries\",\n                    \"batch\",\"delta_table_order_events\",NULL,NULL,NULL,\n                    \"1847362093847561\",\"dummy_order_events\",\"COMPLETED\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"TRUE\")\n                    \"\"\",\n            },\n            \"results\": {\n                \"sales: sales.dummy_deliveries\": {\n                    \"sensor_id\": \"sales__sales_dummy_deliveries_1847362093847561\",\n                    \"assets\": [\"delta_table_order_events_1847362093847561\"],\n                    \"control_db_table_name\": \"test_db.sensor_table\",\n                    \"input_spec\": {\n                        \"spec_id\": \"sensor_upstream\",\n                        \"read_type\": \"batch\",\n                        \"data_format\": \"kafka\",\n                        \"db_table\": None,\n                        \"options\": {\n                            \"kafka.bootstrap.servers\": [\"server1\", \"server2\"],\n                            \"subscribe\": \"sales.dummy_deliveries\",\n                            \"startingOffsets\": \"earliest\",\n                            \"kafka.security.protocol\": \"SSL\",\n                            \"kafka.ssl.truststore.location\": \"trust_store_location\",\n                            \"kafka.ssl.truststore.password\": \"key\",\n                            \"kafka.ssl.keystore.location\": \"keystore_location\",\n                            \"kafka.ssl.keystore.password\": \"key\",\n                        },\n                        \"location\": None,\n                        \"schema\": None,\n                    },\n                    \"preprocess_query\": None,\n                    \"base_checkpoint_location\": None,\n                    \"fail_on_empty_result\": False,\n                }\n            },\n        },\n        {\n            \"use_case_name\": \"sap_b4\",\n            \"rows_to_add\": {\n                \"heartbeat\": \"\"\"\n                    (\"sap_b4\",\"SAP_DUMMY_ID\",\"batch\",\n                    \"dummy_tables\",\"LOAD_DATE\",NULL,NULL,\n                    \"6039184726153847\",\"dummy_order_events\",\"COMPLETED\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"FALSE\"),\n                    (\"sap_b4\",\"SAP_DUMMY_ID2\",\"batch\",\n                    \"dummy_tables\",\"LOAD_DATE\",NULL,NULL,\n                    \"7482910364728193\",\"dummy_order_events\",\"COMPLETED\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"FALSE\")\n                    \"\"\",\n            },\n            \"results\": {\n                \"SAP_DUMMY_ID\": {\n                    \"sensor_id\": \"SAP_DUMMY_ID_6039184726153847\",\n                    \"assets\": [\"dummy_tables_6039184726153847\"],\n                    \"control_db_table_name\": \"test_db.sensor_table\",\n                    \"input_spec\": {\n                        \"spec_id\": \"sensor_upstream\",\n                        \"read_type\": \"batch\",\n                        \"data_format\": \"sap_b4\",\n                        \"db_table\": None,\n                        \"options\": {\n                          
  \"prepareQuery\": (\n                                \"WITH sensor_new_data AS (SELECT CHAIN_ID, \"\n                                \"CONCAT(DATUM, ZEIT) AS LOAD_DATE, ANALYZED_STATUS \"\n                                \"FROM sap_table \"\n                                \"WHERE UPPER(CHAIN_ID) = UPPER('SAP_DUMMY_ID') \"\n                                \"AND UPPER(ANALYZED_STATUS) = UPPER('G'))\"\n                            ),\n                            \"query\": (\n                                \"SELECT COUNT(1) as count, \"\n                                \"'LOAD_DATE' as UPSTREAM_KEY, \"\n                                \"max(LOAD_DATE) as UPSTREAM_VALUE FROM sensor_new_data \"\n                                \"WHERE LOAD_DATE > '19000101000000' HAVING COUNT(1) > 0\"\n                            ),\n                        },\n                        \"location\": None,\n                        \"schema\": None,\n                    },\n                    \"preprocess_query\": None,\n                    \"base_checkpoint_location\": None,\n                    \"fail_on_empty_result\": False,\n                },\n                \"SAP_DUMMY_ID2\": {\n                    \"sensor_id\": \"SAP_DUMMY_ID2_7482910364728193\",\n                    \"assets\": [\"dummy_tables_7482910364728193\"],\n                    \"control_db_table_name\": \"test_db.sensor_table\",\n                    \"input_spec\": {\n                        \"spec_id\": \"sensor_upstream\",\n                        \"read_type\": \"batch\",\n                        \"data_format\": \"sap_b4\",\n                        \"db_table\": None,\n                        \"options\": {\n                            \"prepareQuery\": (\n                                \"WITH sensor_new_data AS (SELECT CHAIN_ID, \"\n                                \"CONCAT(DATUM, ZEIT) AS LOAD_DATE, ANALYZED_STATUS \"\n                                \"FROM sap_table \"\n                                \"WHERE \"\n                                \"UPPER(CHAIN_ID) = UPPER('SAP_DUMMY_ID2') \"\n                                \"AND UPPER(ANALYZED_STATUS) = UPPER('G'))\"\n                            ),\n                            \"query\": (\n                                \"SELECT COUNT(1) as count, \"\n                                \"'LOAD_DATE' as UPSTREAM_KEY, \"\n                                \"max(LOAD_DATE) as UPSTREAM_VALUE FROM sensor_new_data \"\n                                \"WHERE LOAD_DATE > '19000101000000' HAVING COUNT(1) > 0\"\n                            ),\n                        },\n                        \"location\": None,\n                        \"schema\": None,\n                    },\n                    \"preprocess_query\": None,\n                    \"base_checkpoint_location\": None,\n                    \"fail_on_empty_result\": False,\n                },\n            },\n        },\n    ],\n)\n@patch(\"lakehouse_engine.utils.databricks_utils.DatabricksUtils.get_db_utils\")\ndef test_get_sensor_acon(mock_get_db_utils: Mock, scenario: dict) -> None:\n    \"\"\"Test the acon creation.\n\n    Args:\n        mock_get_db_utils (Mock): The mocked object.\n        scenario (dict): The test scenario to execute.\n\n    Scenarios:\n        1- For delta tables source.\n        2- For kafka topics source.\n        3- For SAP sources. 
In this scenario we have two records\n            that will yield two different acons.\n    \"\"\"\n    scenario_name = scenario[\"use_case_name\"]\n    records = scenario[\"rows_to_add\"].get(\"heartbeat\")\n    acon_result_list = scenario[\"results\"]\n\n    heartbeat_table = \"test_db.heartbeat_sensor_control_table\"\n    sensor_table = \"test_db.sensor_table\"\n\n    acon = {\n        \"sensor_source\": scenario_name,\n        \"data_format\": \"delta\",\n        \"heartbeat_sensor_db_table\": heartbeat_table,\n        \"lakehouse_engine_sensor_db_table\": sensor_table,\n        \"token\": \"my-token\",\n        \"domain\": \"adidas-domain.cloud.databricks.com\",\n    }\n\n    _LOGGER.info(f\"Scenario: {scenario_name}\")\n\n    _create_heartbeat_table()\n\n    _LOGGER.info(\"Inserting records in heartbeat table.\")\n    ExecEnv.SESSION.sql(\n        f\"\"\"INSERT INTO {heartbeat_table}\n            VALUES {records}\"\"\"  # nosec\n    )\n\n    if scenario_name == \"sap_b4\":\n        _LOGGER.info(\"Inserting records in sensors table.\")\n        acon.update(\n            {\n                \"data_format\": \"sap_b4\",\n                \"jdbc_db_table\": \"sap_table\",\n                \"options\": {\n                    \"prepareQuery\": \"\",\n                    \"query\": \"\",\n                },\n            }\n        )\n\n    if scenario_name == \"kafka\":\n        acon.update(\n            {\n                \"data_format\": \"kafka\",\n                \"kafka_configs\": {\n                    \"sales\": {\n                        \"kafka_bootstrap_servers_list\": [\"server1\", \"server2\"],\n                        \"kafka_ssl_truststore_location\": \"trust_store_location\",\n                        \"kafka_ssl_keystore_location\": \"keystore_location\",\n                        \"truststore_pwd_secret_key\": \"trust_store_key\",\n                        \"keystore_pwd_secret_key\": \"keystore_pwd_secret_key\",\n                    }\n                },\n            }\n        )\n\n        mock_db_utils = Mock()\n        mock_secrets = Mock()\n        mock_secrets.get.return_value = \"key\"\n        mock_db_utils.secrets = mock_secrets\n        mock_get_db_utils.return_value = mock_db_utils\n\n        _check_acon(heartbeat_table, acon, acon_result_list)\n    else:\n        _check_acon(heartbeat_table, acon, acon_result_list)\n\n    for table in _SETUP_DELTA_TABLES:\n        LocalStorage.clean_folder(f\"{LAKEHOUSE}{table}\")\n        ExecEnv.SESSION.sql(f\"\"\"DROP TABLE IF EXISTS test_db.{table}\"\"\")  # nosec\n"
  },
  {
    "path": "tests/unit/test_heartbeat_anchor_job.py",
    "content": "\"\"\"Module that tests the anchor job function from the heartbeat module.\"\"\"\n\nfrom unittest.mock import Mock, patch\n\nimport pytest\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.engine import trigger_heartbeat_sensor_jobs\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom lakehouse_engine.utils.schema_utils import SchemaUtils\nfrom tests.conftest import LAKEHOUSE, UNIT_RESOURCES\nfrom tests.utils.dataframe_helpers import DataframeHelpers\nfrom tests.utils.local_storage import LocalStorage\n\nTEST_NAME = \"heartbeat_anchor_job\"\nFEATURE_TEST_RESOURCES = f\"{UNIT_RESOURCES}/heartbeat/{TEST_NAME}\"\n_LOGGER = LoggingHandler(__name__).get_logger()\n\n_SETUP_DELTA_TABLES = [\"heartbeat_sensor_control_table\"]\n\n\ndef _create_heartbeat_table() -> None:\n    \"\"\"Create the necessary tables required for using Heartbeat.\"\"\"\n    _LOGGER.info(\"Creating tables\")\n    for table in _SETUP_DELTA_TABLES:\n        DataframeHelpers.create_delta_table(\n            cols=SchemaUtils.from_file_to_dict(\n                f\"file:///{FEATURE_TEST_RESOURCES}/setup/column_list/{table}.json\"\n            ),\n            table=table,\n        )\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"use_case_name\": \"delta_table_trigger_2_jobs\",\n            \"sensor_source\": \"delta_table\",\n            \"trigger_jobs_records\": {\n                \"heartbeat\": \"\"\"\n                    (\"delta_table\",\"dummy_orders\",\"batch\",\n                    \"delta_table_order_events\",NULL,NULL,NULL,\n                    \"3849201756384721\",\"events_orders\",\"NEW_EVENT_AVAILABLE\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"TRUE\"),\n                    (\"delta_table\",\"dummy_sales\",\"batch\",\n                    \"delta_table_order_events\",NULL,NULL,NULL,\n                    \"3849201756384721\",\"events_orders\",\"NEW_EVENT_AVAILABLE\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"TRUE\"),\n                    (\"delta_table\",\"dummy_test\",\"batch\",\n                    \"delta_table_order_events\",NULL,NULL,NULL,\n                    \"7601938475620193\",\"events_orders\",\"NEW_EVENT_AVAILABLE\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"TRUE\"),\n                    (\"delta_table\",\"dummy_test2\",\"batch\",\n                    \"delta_table_order_events\",NULL,NULL,NULL,\n                    \"7601938475620193\",\"events_orders\",\"NEW_EVENT_AVAILABLE\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"TRUE\")\n                    \"\"\",\n            },\n            \"jobs_triggered_count\": 2,\n            \"job_id\": [\"3849201756384721\", \"7601938475620193\"],\n        },\n        {\n            \"use_case_name\": \"kafka_trigger_1_job\",\n            \"sensor_source\": \"kafka\",\n            \"trigger_jobs_records\": {\n                \"heartbeat\": \"\"\"\n                    (\"kafka\",\"dummy_test3\",\"batch\",\n                    \"delta_table_order_events\",NULL,NULL,NULL,\n                    \"5918374620193847\",\"events_orders\",\"COMPLETE\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"FALSE\"),\n                    (\"kafka\",\"dummy_test4\",\"batch\",\n                    \"delta_table_order_events\",NULL,NULL,NULL,\n                    \"5918374620193847\",\"events_orders\",\"NEW_EVENT_AVAILABLE\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"TRUE\")\n                    \"\"\",\n            },\n            
\"jobs_triggered_count\": 1,\n            \"job_id\": [\"5918374620193847\"],\n        },\n        {\n            \"use_case_name\": \"sap_b4_no_trigger\",\n            \"sensor_source\": \"sap_b4\",\n            \"trigger_jobs_records\": {\n                \"heartbeat\": \"\"\"\n                    (\"sap_b4\",\"dummy_test3\",\"batch\",\n                    \"delta_table_order_events\",NULL,NULL,NULL,\n                    \"8203746159283746\",\"events_orders\",\"NEW_EVENT_AVAILABLE\",\n                    NULL,NULL,NULL,\"PAUSED\",\"FALSE\"),\n                    (\"sap_b4\",\"dummy_test4\",\"batch\",\n                    \"delta_table_order_events\",NULL,NULL,NULL,\n                    \"8203746159283746\",\"events_orders\",\"COMPLETE\",\n                    NULL,NULL,NULL,\"UNPAUSED\",\"TRUE\")\n                    \"\"\"\n            },\n            \"jobs_triggered_count\": 0,\n        },\n    ],\n)\n@patch(\n    \"lakehouse_engine.core.sensor_manager.SensorJobRunManager.run_job\",\n    return_value=(\"run_id\", None),\n)\ndef test_anchor_job(mock_run_job: Mock, scenario: dict) -> None:\n    \"\"\"Test the number of jobs triggered.\n\n    Args:\n        mock_run_job (Mock): The mocked object.\n        scenario: The test scenario to execute.\n\n    Scenarios:\n        1- 2 different jobs id's each one with two hard dependencies.\n            From the 4 records in the table, only two should trigger a job.\n        2- 1 job id with two records that can trigger the job.\n            Only 1 comply with the specifications to trigger a job.\n        3- 1 job id with two records that can trigger the job.\n            None comply with the specifications to trigger a job.\n    \"\"\"\n    scenario_name = scenario[\"use_case_name\"]\n    sensor_source = scenario[\"sensor_source\"]\n    records = scenario[\"trigger_jobs_records\"].get(\"heartbeat\")\n    jobs_triggered_count = scenario[\"jobs_triggered_count\"]\n\n    heartbeat_table = \"test_db.heartbeat_sensor_control_table\"\n    sensor_table = \"test_db.sensor_table\"\n\n    acon = {\n        \"heartbeat_sensor_db_table\": heartbeat_table,\n        \"lakehouse_engine_sensor_db_table\": sensor_table,\n        \"data_format\": \"delta\",\n        \"sensor_source\": sensor_source,\n        \"token\": \"my-token\",\n        \"domain\": \"adidas-domain.cloud.databricks.com\",\n    }\n\n    _LOGGER.info(f\"Scenario: {scenario_name}\")\n\n    _create_heartbeat_table()\n\n    ExecEnv.SESSION.sql(\n        f\"\"\"INSERT INTO {heartbeat_table}\n            VALUES {records}\"\"\"  # nosec\n    )\n\n    trigger_heartbeat_sensor_jobs(acon=acon)\n    assert mock_run_job.call_count == jobs_triggered_count\n\n    if jobs_triggered_count > 0:\n        triggered_job_id = scenario[\"job_id\"]\n        for call_args in mock_run_job.call_args_list:\n            assert call_args[0][0] in triggered_job_id\n\n    for table in _SETUP_DELTA_TABLES:\n        LocalStorage.clean_folder(f\"{LAKEHOUSE}{table}\")\n        ExecEnv.SESSION.sql(f\"\"\"DROP TABLE IF EXISTS test_db.{table}\"\"\")  # nosec\n"
  },
  {
    "path": "tests/unit/test_log_filter_sensitive_data.py",
    "content": "\"\"\"Unit tests focusing on the logging filter FilterSensitiveData.\"\"\"\n\nimport logging\nfrom typing import Any\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\nSTR_MSGS_TO_LOG = [\n    {  # Sample acon being logged, password has comma and double quotes\n        \"original_log\": \"Read Algorithm Configuration: {'input_specs': [{'spec_id': \"\n        \"'source', 'read_type': 'batch', 'data_format': 'sap_bw', 'options': \"\n        \"{'driver': 'org.sqlite.JDBC', 'user': 'user', 'password': 'p,w\\\"d', \"\n        \"'url': 'jdbc:url', 'dbtable': 'table', 'numPartitions': 2, 'extraction_type': \"\n        \"'delta', 'partitionColumn': 'item', 'lowerBound': 1, 'upperBound': 3}}], \"\n        \"'output_specs': [{'spec_id': 'bronze', 'input_id': 'source', 'write_type': \"\n        \"'append', 'data_format': 'delta', 'partitions': ['actrequest_timestamp'], \"\n        \"'location': 'file:////path'}]}\",\n        \"masked_log\": \"Read Algorithm Configuration: {'input_specs': [{'spec_id': \"\n        \"'source', 'read_type': 'batch', 'data_format': 'sap_bw', 'options': \"\n        \"{'driver': 'org.sqlite.JDBC', 'user': 'user', 'masked_cred': '******', \"\n        \"'url': 'jdbc:url', 'dbtable': 'table', 'numPartitions': 2, 'extraction_type': \"\n        \"'delta', 'partitionColumn': 'item', 'lowerBound': 1, 'upperBound': 3}}], \"\n        \"'output_specs': [{'spec_id': 'bronze', 'input_id': 'source', 'write_type': \"\n        \"'append', 'data_format': 'delta', 'partitions': ['actrequest_timestamp'], \"\n        \"'location': 'file:////path'}]}\",\n    },\n    {  # no single neither double quotes\n        \"original_log\": \"prop1: prop2, password: pwd, secret: secret\",\n        \"masked_log\": \"prop1: prop2, masked_cred: ******, \" \"masked_cred: ******, \",\n    },\n    {  # double quotes, password has single quotes and comma, ends with secret and space\n        # and additional log\n        \"original_log\": '\"prop1\": \"prop2\", \"password\": \"p,w\\'d\", '\n        '\"secret\": \"secret\" other logs',\n        \"masked_log\": '\"prop1\": \"prop2\", \"masked_cred\": \"******\", '\n        '\"masked_cred\": \"******\", other logs',\n    },\n    {\n        \"original_log\": \"Read Algorithm Configuration: {'input_specs': [{'spec_id': \"\n        \"'source', 'read_type': 'streaming', 'data_format': 'kafka', 'options': \"\n        \"{'kafka.ssl.truststore.password': 'p,w\\\"d', 'kafka.ssl.keystore.password': \"\n        \"'p,w\\\"d'}}], 'output_specs': [{'spec_id': 'bronze', 'input_id': 'source', \"\n        \"'write_type': 'append', 'data_format': 'delta', 'partitions': \"\n        \"['actrequest_timestamp'], 'location': 'file:////path'}]}\",\n        \"masked_log\": \"Read Algorithm Configuration: {'input_specs': [{'spec_id': \"\n        \"'source', 'read_type': 'streaming', 'data_format': 'kafka', 'options': \"\n        \"{'masked_cred': '******', 'masked_cred': '******', }], \"\n        \"'output_specs': [{'spec_id': 'bronze', 'input_id': 'source', 'write_type': \"\n        \"'append', 'data_format': 'delta', 'partitions': ['actrequest_timestamp'], \"\n        \"'location': 'file:////path'}]}\",\n    },\n]\nDICT_MSGS_TO_LOG = [\n    # fmt: off\n    {  # test with dict, because we rely on space after comma for the replace\n        # and python might change the dict structure in the future\n        \"original_log\": {\"secret\":\"dummy_pwd\",\"prop\":\"prop_val\"},  # noqa: E231\n        \"masked_log\": \"{'masked_cred': '******', 'prop': 
'prop_val'}\",\n    },\n    # fmt: on\n]\nLOGGER = LoggingHandler(__name__).get_logger()\n\n\ndef test_log_filter_sensitive_data(caplog: Any) -> None:\n    \"\"\"Test the logging filter FilterSensitiveData.\n\n    Given a set of messages, each message is logged (original_log) and tested\n    against the expected output (masked_log).\n\n    :param caplog: captures the log.\n    \"\"\"\n    with caplog.at_level(logging.INFO):\n        for str_msg in STR_MSGS_TO_LOG:\n            LOGGER.info(str_msg[\"original_log\"])\n            assert str_msg[\"masked_log\"] in caplog.text\n\n        for dict_msg in DICT_MSGS_TO_LOG:\n            LOGGER.info(dict_msg[\"original_log\"])\n            assert dict_msg[\"masked_log\"] in caplog.text\n"
  },
  {
    "path": "tests/unit/test_notification_creation.py",
    "content": "\"\"\"Unit tests for notification creation functions.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.terminators.notifier_factory import NotifierFactory\nfrom lakehouse_engine.terminators.notifiers.email_notifier import EmailNotifier\nfrom lakehouse_engine.terminators.notifiers.exceptions import (\n    NotifierConfigException,\n    NotifierTemplateConfigException,\n    NotifierTemplateNotFoundException,\n)\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\nfrom tests.conftest import FEATURE_RESOURCES\n\nLOGGER = LoggingHandler(__name__).get_logger()\nTEST_ATTACHEMENTS_PATH = FEATURE_RESOURCES + \"/notification/\"\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"Email notification creation using a template.\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"template\": \"failure_notification_email\",\n                    \"from\": \"test-email@email.com\",\n                    \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                    \"exception\": \"test-exception\",\n                },\n            ),\n            \"expected\": \"\"\"\n            Job local in workspace local has\n            failed with the exception: test-exception\"\"\",\n        },\n        {\n            \"name\": \"Error: missing template\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"template\": \"missing template\",\n                    \"from\": \"test-email@email.com\",\n                    \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                    \"exception\": \"test-exception\",\n                },\n            ),\n            \"expected\": \"Template missing template does not exist\",\n        },\n        {\n            \"name\": \"Error: Malformed acon\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"from\": \"test-email@email.com\",\n                    \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                    \"exception\": \"test-exception\",\n                },\n            ),\n            \"expected\": \"Malformed Notification Definition\",\n        },\n    ],\n)\ndef test_notification_creation(scenario: dict) -> None:\n    \"\"\"Testing notification creation.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    notifier = NotifierFactory.get_notifier(scenario[\"spec\"])\n\n    if \"Error: \" in scenario[\"name\"]:\n        with pytest.raises(\n            (\n                NotifierTemplateNotFoundException,\n                NotifierConfigException,\n                NotifierTemplateConfigException,\n            )\n        ) as e:\n            notifier.create_notification()\n        assert str(e.value) == scenario[\"expected\"]\n    else:\n        notifier.create_notification()\n        assert notifier.notification[\"message\"] == 
scenario[\"expected\"]\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        TerminatorSpec(\n            function=\"notify\",\n            args={\n                \"server\": \"localhost\",\n                \"port\": \"1025\",\n                \"type\": \"email\",\n                \"from\": \"test-email@email.com\",\n                \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                \"subject\": \"test-subject\",\n                \"message\": \"test-message\",\n            },\n        ),\n        TerminatorSpec(\n            function=\"notify\",\n            args={\n                \"server\": \"localhost\",\n                \"port\": \"1025\",\n                \"type\": \"email\",\n                \"from\": \"test-email@email.com\",\n                \"cc\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                \"bcc\": [\"test-email3@email.com\", \"test-email4@email.com\"],\n                \"mimetype\": \"html\",\n                \"subject\": \"test-subject\",\n                \"message\": \"test-message\",\n                \"attachments\": [\n                    f\"{TEST_ATTACHEMENTS_PATH}test_attachement.txt\",\n                    f\"{TEST_ATTACHEMENTS_PATH}test_image.png\",\n                ],\n            },\n        ),\n    ],\n)\ndef test_office365_notification_creation(scenario: TerminatorSpec) -> None:\n    \"\"\"Testing Office 365 notification creation.\"\"\"\n    notifier = EmailNotifier(scenario)\n    body = notifier._create_graph_api_email_body()\n    for recipient, test_recipient in zip(\n        body.message.to_recipients, scenario.args.get(\"to\", [])\n    ):\n        assert recipient.email_address.address == test_recipient\n    for recipient, test_recipient in zip(\n        body.message.cc_recipients, scenario.args.get(\"cc\", [])\n    ):\n        assert recipient.email_address.address == test_recipient\n    for recipient, test_recipient in zip(\n        body.message.bcc_recipients, scenario.args.get(\"bcc\", [])\n    ):\n        assert recipient.email_address.address == test_recipient\n\n    if body.message.attachments:\n        for attachment, test_attachment in zip(\n            body.message.attachments, scenario.args.get(\"attachments\")\n        ):\n            assert attachment.name == test_attachment.split(\"/\")[-1]\n            with open(test_attachment, \"rb\") as file:\n                assert attachment.content_bytes == file.read()\n"
  },
  {
    "path": "tests/unit/test_notification_factory.py",
    "content": "\"\"\"Unit tests for notification factory module.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import TerminatorSpec\nfrom lakehouse_engine.terminators.notifier_factory import NotifierFactory\nfrom lakehouse_engine.terminators.notifiers.exceptions import NotifierNotFoundException\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\nLOGGER = LoggingHandler(__name__).get_logger()\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"Error: wrong type of notifier\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"snailmail\",\n                    \"template\": \"failure_notification_email\",\n                    \"from\": \"test-email@email.com\",\n                    \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                },\n            ),\n            \"expected\": \"The requested notification format snailmail is not supported.\",\n        },\n        {\n            \"name\": \"Creation of email\",\n            \"spec\": TerminatorSpec(\n                function=\"notify\",\n                args={\n                    \"server\": \"localhost\",\n                    \"port\": \"1025\",\n                    \"type\": \"email\",\n                    \"template\": \"failure_notification_email\",\n                    \"from\": \"test-email@email.com\",\n                    \"to\": [\"test-email1@email.com\", \"test-email2@email.com\"],\n                },\n            ),\n            \"expected\": \"email\",\n        },\n    ],\n)\ndef test_notification_factory(scenario: dict) -> None:\n    \"\"\"Testing notification factory.\n\n    Args:\n        scenario: scenario to test.\n    \"\"\"\n    if \"Error: \" in scenario[\"name\"]:\n        with pytest.raises(NotifierNotFoundException) as e:\n            notifier = NotifierFactory.get_notifier(scenario[\"spec\"])\n\n        assert scenario[\"expected\"] == str(e.value)\n    else:\n        notifier = NotifierFactory.get_notifier(scenario[\"spec\"])\n\n        assert notifier.type == scenario[\"expected\"]\n"
  },
  {
    "path": "tests/unit/test_prisma_dq_rule_id.py",
    "content": "\"\"\"Test the manual definition of dq functions when using prisma dq framework.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.definitions import DQFunctionSpec, DQSpec, DQType\nfrom lakehouse_engine.utils.dq_utils import PrismaUtils\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n_LOGGER = LoggingHandler(__name__).get_logger()\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"Definition of DQ Functions using parameters without duplicates\",\n            \"spec_id\": \"spec_without_duplicates\",\n            \"dq_spec\": {\n                \"dq_functions\": [\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                            \"meta\": {\n                                \"dq_rule_id\": \"rule_2\",\n                                \"execution_point\": \"in_motion\",\n                                \"schema\": \"test_db\",\n                                \"table\": \"dummy_sales\",\n                                \"column\": \"\",\n                                \"dimension\": \"\",\n                                \"filters\": \"\",\n                                \"note\": \"Test Notes\",\n                            },\n                        },\n                    },\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                            \"meta\": {\n                                \"dq_rule_id\": \"rule_1\",\n                                \"execution_point\": \"in_motion\",\n                                \"schema\": \"test_db\",\n                                \"table\": \"dummy_sales\",\n                                \"column\": \"\",\n                                \"dimension\": \"\",\n                                \"filters\": \"\",\n                                \"note\": \"Test Notes\",\n                            },\n                        },\n                    },\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                            \"meta\": {\n                                \"dq_rule_id\": \"rule_3\",\n                                \"execution_point\": \"in_motion\",\n                                \"schema\": \"test_db\",\n                                \"table\": \"dummy_sales\",\n                                \"column\": \"\",\n                                \"dimension\": \"\",\n                                \"filters\": \"\",\n                                \"note\": \"Test Notes\",\n                            },\n                        },\n                    },\n                ],\n            },\n        },\n        {\n            \"name\": \"Error: Definition of DQ Functions using parameters \"\n            \"with duplicates\",\n            \"spec_id\": \"spec_with_duplicates\",\n            \"dq_spec\": {\n                \"dq_functions\": [\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                            \"meta\": {\n                                \"dq_rule_id\": 
\"rule_2\",\n                                \"execution_point\": \"in_motion\",\n                                \"schema\": \"test_db\",\n                                \"table\": \"dummy_sales\",\n                                \"column\": \"\",\n                                \"dimension\": \"\",\n                                \"filters\": \"\",\n                                \"note\": \"Test Notes\",\n                            },\n                        },\n                    },\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                            \"meta\": {\n                                \"dq_rule_id\": \"rule_1\",\n                                \"execution_point\": \"in_motion\",\n                                \"schema\": \"test_db\",\n                                \"table\": \"dummy_sales\",\n                                \"column\": \"\",\n                                \"dimension\": \"\",\n                                \"filters\": \"\",\n                                \"note\": \"Test Notes\",\n                            },\n                        },\n                    },\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                            \"meta\": {\n                                \"dq_rule_id\": \"rule_2\",\n                                \"execution_point\": \"in_motion\",\n                                \"schema\": \"test_db\",\n                                \"table\": \"dummy_sales\",\n                                \"column\": \"\",\n                                \"dimension\": \"\",\n                                \"filters\": \"\",\n                                \"note\": \"Test Notes\",\n                            },\n                        },\n                    },\n                ],\n            },\n        },\n    ],\n)\ndef test_prisma_manual_function_definition(scenario: dict) -> None:\n    \"\"\"Test the manual definition of dq functions when using prisma dq framework.\n\n    Args:\n        scenario (dict): The test scenario.\n    \"\"\"\n    dq_functions = [\n        DQFunctionSpec(function=dq_function[\"function\"], args=dq_function[\"args\"])\n        for dq_function in scenario[\"dq_spec\"][\"dq_functions\"]\n    ]\n\n    dq_spec_list = [\n        DQSpec(\n            spec_id=scenario[\"spec_id\"],\n            input_id=scenario[\"name\"],\n            dq_type=DQType.PRISMA.value,\n            dq_functions=dq_functions,\n        )\n    ]\n\n    if \"Error: \" in scenario[\"name\"]:\n        error = PrismaUtils.validate_rule_id_duplication(specs=dq_spec_list)\n        expected_error = {\"dq_spec_id: spec_with_duplicates\": \"rule_2; rule_1; rule_2\"}\n        _LOGGER.critical(\n            f\"A duplicate dq_rule_id was found!!!\"\n            \"Please verify the following list:\"\n            f\"{error}\"\n        )\n        assert error == expected_error\n    else:\n        PrismaUtils.validate_rule_id_duplication(specs=dq_spec_list)\n"
  },
  {
    "path": "tests/unit/test_prisma_function_definition.py",
    "content": "\"\"\"Test the manual definition of dq functions when using prisma dq framework.\"\"\"\n\nimport pytest\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.dq_processors.exceptions import DQSpecMalformedException\nfrom lakehouse_engine.utils.dq_utils import DQUtils\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"name\": \"Error: missing meta parameters\",\n            \"dq_spec\": {\n                \"dq_functions\": [\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                            \"meta\": {\n                                \"table\": \"test_table\",\n                                \"execution_point\": \"in_motion\",\n                            },\n                        },\n                    },\n                ],\n            },\n            \"expected\": \"The dq function meta field must contain all the \"\n            \"fields defined\"\n            \": ['dq_rule_id', 'execution_point', 'filters', 'schema', \"\n            \"'table', 'column', 'dimension'].\\n\"\n            \"Found fields: ['table', 'execution_point'].\\n\"\n            \"Diff: ['column', 'dimension', 'dq_rule_id', 'filters', 'schema']\",\n        },\n        {\n            \"name\": \"Error: missing meta\",\n            \"dq_spec\": {\n                \"dq_functions\": [\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                        },\n                    },\n                ],\n            },\n            \"expected\": \"The dq function must have a meta field containing all the \"\n            \"fields defined: ['dq_rule_id', \"\n            \"'execution_point', 'filters', 'schema', 'table', 'column', \"\n            \"'dimension'].\",\n        },\n        {\n            \"name\": \"Definition of DQ Functions\",\n            \"dq_spec\": {\n                \"dq_functions\": [\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                            \"meta\": {\n                                \"dq_rule_id\": \"rule_2\",\n                                \"execution_point\": \"in_motion\",\n                                \"schema\": \"test_db\",\n                                \"table\": \"dummy_sales\",\n                                \"column\": \"\",\n                                \"dimension\": \"\",\n                                \"filters\": \"\",\n                            },\n                        },\n                    },\n                ],\n            },\n            \"expected\": None,\n        },\n        {\n            \"name\": \"Definition of DQ Functions with extra params\",\n            \"dq_spec\": {\n                \"dq_functions\": [\n                    {\n                        \"function\": \"expect_column_to_exist\",\n                        \"args\": {\n                            \"column\": \"test_column\",\n                            \"meta\": {\n                                \"dq_rule_id\": \"rule_2\",\n                                \"execution_point\": \"in_motion\",\n                                \"schema\": \"test_db\",\n                                
\"table\": \"dummy_sales\",\n                                \"column\": \"\",\n                                \"dimension\": \"\",\n                                \"filters\": \"\",\n                                \"note\": \"Test Notes\",\n                            },\n                        },\n                    },\n                ],\n            },\n            \"expected\": None,\n        },\n    ],\n)\ndef test_prisma_manual_function_definition(scenario: dict) -> None:\n    \"\"\"Test the manual definition of dq functions when using prisma dq framework.\n\n    Args:\n        scenario (dict): The test scenario.\n    \"\"\"\n    dq_spec = scenario[\"dq_spec\"]\n    if \"Error: \" in scenario[\"name\"]:\n        with pytest.raises(DQSpecMalformedException) as e:\n            DQUtils.validate_dq_functions(\n                spec=dq_spec,\n                execution_point=\"in_motion\",\n                extra_meta_arguments=ExecEnv.ENGINE_CONFIG.dq_functions_column_list,\n            )\n        assert str(e.value) == scenario[\"expected\"]\n    else:\n        DQUtils.validate_dq_functions(\n            spec=dq_spec,\n            execution_point=\"in_motion\",\n            extra_meta_arguments=ExecEnv.ENGINE_CONFIG.dq_functions_column_list,\n        )\n"
  },
  {
    "path": "tests/unit/test_rest_api_functions.py",
    "content": "\"\"\"Test REST api related functions that cannot be tested inside Spark.\"\"\"\n\nimport logging\nfrom collections import namedtuple\nfrom typing import Any\nfrom unittest.mock import patch\n\nfrom pyspark.sql import Row\n\nfrom lakehouse_engine.core.definitions import OutputSpec\nfrom lakehouse_engine.io.writers.rest_api_writer import RestApiWriter\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\nLOGGER = LoggingHandler(__name__).get_logger()\nRestResponse = namedtuple(\"RestResponse\", \"status_code text\")\n\n\n@patch(\n    \"lakehouse_engine.io.writers.rest_api_writer.execute_api_request\",\n    return_value=RestResponse(status_code=200, text=\"ok\"),\n)\ndef test_send_payload_to_rest_api_simple_params(_: Any, caplog: Any) -> None:\n    \"\"\"Test if the REST API payload creation process is correct w/ simple params.\n\n    Args:\n        _: ignored patch.\n        caplog: captures the log.\n    \"\"\"\n    output_spec = OutputSpec(\n        spec_id=\"test_output\",\n        input_id=\"test_input\",\n        write_type=\"overwrite\",\n        data_format=\"rest_api\",\n        options={\n            \"rest_api_url\": \"https://www.dummy-url.local/dummy-endpoint\",\n            \"rest_api_method\": \"post\",\n            \"rest_api_header\": {\"Authorization\": \"Bearer dummytoken\"},\n        },\n    )\n    row = Row(payload='{\"dummy_payload\":\"dummy value\"}')\n    func = RestApiWriter._get_func_to_send_payload_to_rest_api(output_spec)\n    func(row)\n\n    str_to_assert = \"Final payload: {'dummy_payload': 'dummy value'}\"\n\n    with caplog.at_level(logging.DEBUG):\n        assert str_to_assert in caplog.text\n\n\n@patch(\n    \"lakehouse_engine.io.writers.rest_api_writer.execute_api_request\",\n    return_value=RestResponse(status_code=200, text=\"ok\"),\n)\ndef test_send_payload_to_rest_api_with_file_params(_: Any, caplog: Any) -> None:\n    \"\"\"Test if the REST API payload creation process is correct with file params.\n\n    Args:\n        _: ignored patch.\n        caplog: captures the log.\n    \"\"\"\n    output_spec = OutputSpec(\n        spec_id=\"test_output\",\n        input_id=\"test_input\",\n        write_type=\"overwrite\",\n        data_format=\"rest_api\",\n        options={\n            \"rest_api_url\": \"https://www.dummy-url.local/dummy-endpoint\",\n            \"rest_api_method\": \"post\",\n            \"rest_api_header\": {\"Authorization\": \"Bearer dummytoken\"},\n            \"rest_api_is_file_payload\": True,\n            \"rest_api_file_payload_name\": \"anotherFileName\",\n            \"rest_api_extra_json_payload\": {\"a\": \"b\"},\n        },\n    )\n    row = Row(payload='{\"dummy_payload\":\"dummy value\"}')\n    func = RestApiWriter._get_func_to_send_payload_to_rest_api(output_spec)\n    func(row)\n\n    str_to_assert = (\n        \"Final payload: {'anotherFileName': \"\n        \"'{\\\"dummy_payload\\\":\\\"dummy value\\\"}', 'a': 'b'}\"\n    )\n\n    with caplog.at_level(logging.DEBUG):\n        assert str_to_assert in caplog.text\n"
  },
  {
    "path": "tests/unit/test_sensor.py",
    "content": "\"\"\"Module with unit tests for Sensor class.\"\"\"\n\nfrom datetime import datetime\nfrom typing import Any\nfrom unittest.mock import MagicMock, patch\n\nimport pytest\nfrom pyspark.sql.types import Row, StructType\n\nfrom lakehouse_engine.algorithms.exceptions import (\n    NoNewDataException,\n    SensorAlreadyExistsException,\n)\nfrom lakehouse_engine.algorithms.sensors.sensor import Sensor\nfrom lakehouse_engine.core.definitions import (\n    InputFormat,\n    InputSpec,\n    ReadType,\n    SensorSpec,\n    SensorStatus,\n)\nfrom lakehouse_engine.core.sensor_manager import (\n    SensorControlTableManager,\n    SensorUpstreamManager,\n)\nfrom tests.utils.dataframe_helpers import DataframeHelpers\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"create_sensor\",\n            \"sensor_data\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"assets\": [\"asset_1\"],\n                \"control_db_table_name\": \"control_sensor_table_name\",\n                \"input_spec\": {\n                    \"spec_id\": \"input_spec\",\n                    \"read_type\": ReadType.STREAMING.value,\n                    \"data_format\": InputFormat.CSV.value,\n                },\n                \"fail_on_empty_result\": False,\n                \"base_checkpoint_location\": \"s3://dummy-bucket\",\n            },\n            \"sensor_already_exists\": False,\n            \"expected_result\": SensorSpec(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_1\"],\n                control_db_table_name=\"control_sensor_table_name\",\n                input_spec=InputSpec(\n                    spec_id=\"input_spec\",\n                    read_type=ReadType.STREAMING.value,\n                    data_format=InputFormat.CSV.value,\n                ),\n                preprocess_query=None,\n                checkpoint_location=\"s3://dummy-bucket\"\n                \"/lakehouse_engine/sensors/sensor_id_1\",\n                fail_on_empty_result=False,\n            ),\n        },\n        {\n            \"scenario_name\": \"raise_exception_sensor_already_exists\",\n            \"sensor_data\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"assets\": [\"asset_1\"],\n                \"control_db_table_name\": \"control_sensor_table_name\",\n                \"input_spec\": {\n                    \"spec_id\": \"input_spec\",\n                    \"read_type\": ReadType.STREAMING.value,\n                    \"data_format\": InputFormat.CSV.value,\n                },\n                \"fail_on_empty_result\": False,\n                \"base_checkpoint_location\": \"s3://dummy-bucket\",\n            },\n            \"sensor_already_exists\": True,\n            \"expected_result\": \"There's already a sensor registered \"\n            \"with same id or assets!\",\n        },\n    ],\n)\ndef test_create_sensor(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test Sensor creation.\n\n    We will raise an exception if we try to create a Sensor\n    that already exists, otherwise we will create successfully.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    with patch.object(\n        Sensor,\n        \"_check_if_sensor_already_exists\",\n        new=MagicMock(return_value=scenario[\"sensor_already_exists\"]),\n    ) as sensor_already_exists_mock:\n        sensor_already_exists_mock.start()\n        if scenario[\"scenario_name\"] 
== \"raise_exception_sensor_already_exists\":\n            with pytest.raises(SensorAlreadyExistsException) as exception:\n                Sensor(scenario[\"sensor_data\"])\n\n            assert scenario[\"expected_result\"] == str(exception.value)\n        else:\n            subject = Sensor(scenario[\"sensor_data\"])\n\n            assert subject.spec == scenario[\"expected_result\"]\n        sensor_already_exists_mock.stop()\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"create_non_existing_sensor_with_sensor_id\",\n            \"sensor_data\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"assets\": None,\n                \"control_db_table_name\": \"control_sensor_table_name\",\n                \"input_spec\": {\n                    \"spec_id\": \"input_spec\",\n                    \"read_type\": ReadType.STREAMING.value,\n                    \"data_format\": InputFormat.CSV.value,\n                },\n                \"fail_on_empty_result\": False,\n                \"base_checkpoint_location\": \"s3://dummy-bucket\",\n            },\n            \"control_db_sensor_data\": Row(\n                sensor_id=\"sensor_id_1\",\n                assets=None,\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=datetime(2023, 5, 26, 14, 38, 16, 676508),\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n            ),\n            \"expected_result\": False,\n        },\n        {\n            \"scenario_name\": \"create_non_existing_sensor_with_assets\",\n            \"sensor_data\": {\n                \"sensor_id\": None,\n                \"assets\": [\"asset_1\"],\n                \"control_db_table_name\": \"control_sensor_table_name\",\n                \"input_spec\": {\n                    \"spec_id\": \"input_spec\",\n                    \"read_type\": ReadType.STREAMING.value,\n                    \"data_format\": InputFormat.CSV.value,\n                },\n                \"fail_on_empty_result\": False,\n                \"base_checkpoint_location\": \"s3://dummy-bucket\",\n            },\n            \"control_db_sensor_data\": Row(\n                sensor_id=None,\n                assets=[\"asset_1\"],\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=datetime(2023, 5, 26, 14, 38, 16, 676508),\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n            ),\n            \"expected_result\": False,\n        },\n        {\n            \"scenario_name\": \"create_non_existing_sensor_with_sensor_id_and_assets\",\n            \"sensor_data\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"assets\": [\"asset_1\"],\n                \"control_db_table_name\": \"control_sensor_table_name\",\n                \"input_spec\": {\n                    \"spec_id\": \"input_spec\",\n                    \"read_type\": ReadType.STREAMING.value,\n                    \"data_format\": InputFormat.CSV.value,\n                },\n                \"fail_on_empty_result\": False,\n                \"base_checkpoint_location\": \"s3://dummy-bucket\",\n            },\n            \"control_db_sensor_data\": Row(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_1\"],\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=datetime(2023, 5, 26, 14, 38, 16, 676508),\n            
    checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n            ),\n            \"expected_result\": False,\n        },\n        {\n            \"scenario_name\": \"raise_exception_as_sensor_\"\n            \"already_exist_with_same_id_and_different_asset\",\n            \"sensor_data\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"assets\": [\"asset_1\"],\n                \"control_db_table_name\": \"control_sensor_table_name\",\n                \"input_spec\": {\n                    \"spec_id\": \"input_spec\",\n                    \"read_type\": ReadType.STREAMING.value,\n                    \"data_format\": InputFormat.CSV.value,\n                },\n                \"fail_on_empty_result\": False,\n                \"base_checkpoint_location\": \"s3://dummy-bucket\",\n            },\n            \"control_db_sensor_data\": Row(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_2\"],\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=datetime(2023, 5, 26, 14, 38, 16, 676508),\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n            ),\n            \"expected_result\": \"There's already a sensor \"\n            \"registered with same id or assets!\",\n        },\n        {\n            \"scenario_name\": \"raise_exception_as_sensor_\"\n            \"already_exist_with_same_asset_and_different_id\",\n            \"sensor_data\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"assets\": [\"asset_1\"],\n                \"control_db_table_name\": \"control_sensor_table_name\",\n                \"input_spec\": {\n                    \"spec_id\": \"input_spec\",\n                    \"read_type\": ReadType.STREAMING.value,\n                    \"data_format\": InputFormat.CSV.value,\n                },\n                \"fail_on_empty_result\": False,\n                \"base_checkpoint_location\": \"s3://dummy-bucket\",\n            },\n            \"control_db_sensor_data\": Row(\n                sensor_id=\"sensor_id_2\",\n                assets=[\"asset_1\"],\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=datetime(2023, 5, 26, 14, 38, 16, 676508),\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n            ),\n            \"expected_result\": \"There's already a sensor \"\n            \"registered with same id or assets!\",\n        },\n    ],\n)\ndef test_sensor_already_exists(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test if Sensor already exists.\n\n    We will raise an exception if the Sensor already exists by sensor_id or\n    by assets.\n    If the sensor doesn't exist we will create a new Sensor.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    with patch.object(\n        SensorControlTableManager,\n        \"read_sensor_table_data\",\n        new=MagicMock(return_value=scenario[\"control_db_sensor_data\"]),\n    ) as sensor_already_exists_mock:\n        sensor_already_exists_mock.start()\n        if \"raise_exception\" in scenario[\"scenario_name\"]:\n            with pytest.raises(SensorAlreadyExistsException) as exception:\n                Sensor(scenario[\"sensor_data\"])\n\n            assert scenario[\"expected_result\"] == str(exception.value)\n        else:\n            subject = 
Sensor(scenario[\"sensor_data\"])._check_if_sensor_already_exists()\n\n            assert subject == scenario[\"expected_result\"]\n        sensor_already_exists_mock.stop()\n\n\nclass TestExecuteSensor:\n    \"\"\"Test suite containing tests for the Sensor execute method.\"\"\"\n\n    _sensor_already_exists_mock = patch.object(\n        Sensor,\n        \"_check_if_sensor_already_exists\",\n        new=MagicMock(return_value=False),\n    )\n\n    @classmethod\n    def setup_class(cls) -> None:\n        \"\"\"Start mock for all test methods in this suite.\"\"\"\n        cls._sensor_already_exists_mock.start()\n\n    @classmethod\n    def teardown_class(cls) -> None:\n        \"\"\"Clean mock after all test methods in this suite.\"\"\"\n        cls._sensor_already_exists_mock.stop()\n\n    @pytest.mark.parametrize(\n        \"scenario\",\n        [\n            {\n                \"scenario_name\": \"execute_stream_sensor\",\n                \"sensor_data\": {\n                    \"sensor_id\": \"sensor_id_1\",\n                    \"assets\": [\"asset_1\"],\n                    \"control_db_table_name\": \"control_sensor_table_name\",\n                    \"input_spec\": {\n                        \"spec_id\": \"input_spec\",\n                        \"read_type\": ReadType.STREAMING.value,\n                        \"data_format\": InputFormat.CSV.value,\n                    },\n                    \"fail_on_empty_result\": False,\n                    \"base_checkpoint_location\": \"s3://dummy-bucket\",\n                },\n                \"expected_result\": True,\n            }\n        ],\n    )\n    def test_execute_stream_sensor(self, scenario: dict, capsys: Any) -> None:\n        \"\"\"Test streaming Sensor execution.\n\n        Args:\n            scenario: scenario to test.\n            capsys: capture stdout and stderr.\n        \"\"\"\n        with patch.object(\n            SensorControlTableManager,\n            \"check_if_sensor_has_acquired_data\",\n            new=MagicMock(return_value=scenario[\"expected_result\"]),\n        ) as check_if_sensor_acquired_data_mock:\n            check_if_sensor_acquired_data_mock.start()\n            with patch.object(\n                SensorUpstreamManager,\n                \"read_new_data\",\n                new=MagicMock(\n                    return_value=DataframeHelpers.create_empty_dataframe(StructType([]))\n                ),\n            ) as sensor_new_data_mock:\n                with patch.object(\n                    Sensor,\n                    \"_run_streaming_sensor\",\n                    new=MagicMock(return_value=scenario[\"expected_result\"]),\n                ) as run_stream_sensor_mock:\n                    run_stream_sensor_mock.start()\n                    subject = Sensor(scenario[\"sensor_data\"]).execute()\n\n                    assert subject == scenario[\"expected_result\"]\n                    run_stream_sensor_mock.stop()\n                sensor_new_data_mock.stop()\n            check_if_sensor_acquired_data_mock.stop()\n\n    @pytest.mark.parametrize(\n        \"scenario\",\n        [\n            {\n                \"scenario_name\": \"execute_batch_sensor\",\n                \"sensor_data\": {\n                    \"sensor_id\": \"sensor_id_1\",\n                    \"assets\": [\"asset_1\"],\n                    \"control_db_table_name\": \"control_sensor_table_name\",\n                    \"input_spec\": {\n                        \"spec_id\": \"input_spec\",\n                        \"read_type\": 
ReadType.BATCH.value,\n                        \"data_format\": InputFormat.JDBC.value,\n                    },\n                },\n                \"expected_result\": True,\n            },\n        ],\n    )\n    def test_execute_batch_sensor(self, scenario: dict, capsys: Any) -> None:\n        \"\"\"Test batch Sensor execution.\n\n        Args:\n            scenario: scenario to test.\n            capsys: capture stdout and stderr.\n        \"\"\"\n        with patch.object(\n            SensorControlTableManager,\n            \"check_if_sensor_has_acquired_data\",\n            new=MagicMock(return_value=scenario[\"expected_result\"]),\n        ) as check_if_sensor_acquired_data_mock:\n            check_if_sensor_acquired_data_mock.start()\n            with patch.object(\n                SensorUpstreamManager,\n                \"read_new_data\",\n                new=MagicMock(\n                    return_value=DataframeHelpers.create_empty_dataframe(StructType([]))\n                ),\n            ) as sensor_new_data_mock:\n                sensor_new_data_mock.start()\n                with patch.object(\n                    Sensor,\n                    \"_run_batch_sensor\",\n                    new=MagicMock(return_value=scenario[\"expected_result\"]),\n                ) as run_batch_sensor_mock:\n                    run_batch_sensor_mock.start()\n                    subject = Sensor(scenario[\"sensor_data\"]).execute()\n\n                    assert subject == scenario[\"expected_result\"]\n                    run_batch_sensor_mock.stop()\n                sensor_new_data_mock.stop()\n            check_if_sensor_acquired_data_mock.stop()\n\n    @pytest.mark.parametrize(\n        \"scenario\",\n        [\n            {\n                \"scenario_name\": \"raise_exception_sensor_\"\n                \"input_spec_format_not_implemented\",\n                \"sensor_data\": {\n                    \"sensor_id\": \"sensor_id_1\",\n                    \"assets\": [\"asset_1\"],\n                    \"control_db_table_name\": \"control_sensor_table_name\",\n                    \"input_spec\": {\n                        \"spec_id\": \"input_spec\",\n                        \"read_type\": ReadType.BATCH.value,\n                        \"data_format\": InputFormat.DATAFRAME.value,\n                    },\n                    \"base_checkpoint_location\": \"s3://dummy-bucket\",\n                },\n                \"expected_result\": \"A sensor has not been implemented yet for \"\n                \"this data format or, this data format is not available for \"\n                \"the read_type batch. 
Check the allowed combinations of \"\n                \"read_type and data_formats: {'streaming': ['kafka', 'avro', \"\n                \"'json', 'parquet', 'csv', 'delta', \"\n                \"'cloudfiles'], 'batch': ['delta', 'jdbc']}\",\n            },\n            {\n                \"scenario_name\": \"raise_exception_sensor_\"\n                \"input_spec_format_doesnt_exists\",\n                \"sensor_data\": {\n                    \"sensor_id\": \"sensor_id_1\",\n                    \"assets\": [\"asset_1\"],\n                    \"control_db_table_name\": \"control_sensor_table_name\",\n                    \"input_spec\": {\n                        \"spec_id\": \"input_spec\",\n                        \"db_table\": \"test_db.test_table\",\n                        \"read_type\": ReadType.BATCH.value,\n                        \"data_format\": \"databricks\",\n                    },\n                    \"base_checkpoint_location\": \"s3://dummy-bucket\",\n                },\n                \"expected_result\": \"Data format databricks isn't implemented yet.\",\n            },\n        ],\n    )\n    def test_execute_sensor_raise_no_input_spec_format_implemented(\n        self, scenario: dict, capsys: Any\n    ) -> None:\n        \"\"\"Expect to raise exception for input spec format not implemented.\n\n        Args:\n            scenario: scenario to test.\n            capsys: capture stdout and stderr.\n        \"\"\"\n        with pytest.raises(NotImplementedError) as exception:\n            Sensor(scenario[\"sensor_data\"]).execute()\n\n        assert scenario[\"expected_result\"] == str(exception.value)\n\n    @pytest.mark.parametrize(\n        \"scenario\",\n        [\n            {\n                \"scenario_name\": \"raise_no_new_data_exception\",\n                \"sensor_data\": {\n                    \"sensor_id\": \"sensor_id_1\",\n                    \"assets\": [\"asset_1\"],\n                    \"control_db_table_name\": \"control_sensor_table_name\",\n                    \"input_spec\": {\n                        \"spec_id\": \"input_spec\",\n                        \"read_type\": ReadType.STREAMING.value,\n                        \"data_format\": InputFormat.KAFKA.value,\n                    },\n                    \"base_checkpoint_location\": \"s3://dummy-bucket\",\n                    \"fail_on_empty_result\": True,\n                },\n                \"expected_result\": \"No data was acquired by sensor_id_1 sensor.\",\n            },\n        ],\n    )\n    def test_execute_sensor_raise_no_new_data_exception(\n        self, scenario: dict, capsys: Any\n    ) -> None:\n        \"\"\"Expect to raise exception for empty data.\n\n        When we pass the flag `fail_on_empty_result` equals to `True`.\n\n        Args:\n            scenario: scenario to test.\n            capsys: capture stdout and stderr.\n        \"\"\"\n        with patch.object(\n            SensorControlTableManager,\n            \"check_if_sensor_has_acquired_data\",\n            new=MagicMock(return_value=False),\n        ) as check_if_sensor_acquired_data_mock:\n            check_if_sensor_acquired_data_mock.start()\n            with patch.object(\n                SensorUpstreamManager,\n                \"read_new_data\",\n                new=MagicMock(\n                    return_value=DataframeHelpers.create_empty_dataframe(StructType([]))\n                ),\n            ) as sensor_new_data_mock:\n                with patch.object(\n                    Sensor, 
\"_run_streaming_sensor\", new=MagicMock(return_value=False)\n                ) as run_stream_sensor_mock:\n                    run_stream_sensor_mock.start()\n                    with pytest.raises(NoNewDataException) as exception:\n                        Sensor(scenario[\"sensor_data\"]).execute()\n\n                    assert scenario[\"expected_result\"] == str(exception.value)\n                    run_stream_sensor_mock.stop()\n                sensor_new_data_mock.stop()\n            check_if_sensor_acquired_data_mock.stop()\n"
  },
  {
    "path": "tests/unit/test_sensor_manager.py",
    "content": "\"\"\"Module with unit tests for Sensor Manager module.\"\"\"\n\nfrom datetime import datetime\nfrom typing import Any\nfrom unittest.mock import MagicMock, patch\n\nimport pytest\nfrom delta import DeltaTable\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.types import (\n    ArrayType,\n    Row,\n    StringType,\n    StructField,\n    StructType,\n    TimestampType,\n)\n\nfrom lakehouse_engine.algorithms.sensors.sensor import SensorStatus\nfrom lakehouse_engine.core.definitions import SensorSpec\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.core.sensor_manager import (\n    SensorControlTableManager,\n    SensorUpstreamManager,\n)\nfrom lakehouse_engine.io.reader_factory import ReaderFactory\n\nTEST_DEFAULT_DATETIME = datetime(2023, 5, 26, 14, 38, 16, 676508)\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"should_return_default_update_set_when_empty_fields\",\n            \"updated_set_to_add\": {},\n        },\n        {\n            \"scenario_name\": \"should_add_just_one_field_to_update_set\",\n            \"assets\": [\"asset_1\"],\n            \"updated_set_to_add\": {\"sensors.assets\": \"updates.assets\"},\n        },\n        {\n            \"scenario_name\": \"should_add_multiple_fields_to_update_set\",\n            \"assets\": [\"asset_1\"],\n            \"checkpoint_location\": \"s3://dummy-bucket/sensors/sensor_id_1\",\n            \"upstream_key\": \"dummy_column\",\n            \"upstream_value\": \"dummy_value\",\n            \"updated_set_to_add\": {\n                \"sensors.assets\": \"updates.assets\",\n                \"sensors.checkpoint_location\": \"updates.checkpoint_location\",\n                \"sensors.upstream_key\": \"updates.upstream_key\",\n                \"sensors.upstream_value\": \"updates.upstream_value\",\n            },\n        },\n    ],\n)\ndef test_sensor_update_set(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test sensor update set adding multiple fields based in the items to add.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    expected_default_update_set = {\n        \"sensors.sensor_id\": \"updates.sensor_id\",\n        \"sensors.status\": \"updates.status\",\n        \"sensors.status_change_timestamp\": \"updates.status_change_timestamp\",\n    }\n\n    subject = SensorControlTableManager._get_sensor_update_set(\n        assets=scenario.get(\"assets\"),\n        checkpoint_location=scenario.get(\"checkpoint_location\"),\n        upstream_key=scenario.get(\"upstream_key\"),\n        upstream_value=scenario.get(\"upstream_value\"),\n    )\n\n    assert subject == {**expected_default_update_set, **scenario[\"updated_set_to_add\"]}\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"true_when_table_data_and_status_acquired_new_data\",\n            \"sensor_id\": \"sensor_id_1\",\n            \"assets\": [\"asset_1\"],\n            \"status\": SensorStatus.ACQUIRED_NEW_DATA.value,\n            \"status_change_timestamp\": datetime.now(),\n            \"checkpoint_location\": \"s3://dummy-bucket/sensors/sensor_id_1\",\n            \"upstream_key\": \"dummy_column\",\n            \"upstream_value\": \"dummy_value\",\n        },\n    ],\n)\ndef test_sensor_data(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test Sensor data construction.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and 
stderr.\n    \"\"\"\n    subject = SensorControlTableManager._convert_sensor_to_data(\n        spec=SensorSpec(\n            sensor_id=scenario[\"sensor_id\"],\n            assets=scenario[\"assets\"],\n            control_db_table_name=None,\n            checkpoint_location=scenario[\"checkpoint_location\"],\n            preprocess_query=None,\n            input_spec=None,\n        ),\n        status=scenario[\"status\"],\n        upstream_key=scenario[\"upstream_key\"],\n        upstream_value=scenario[\"upstream_value\"],\n        status_change_timestamp=scenario[\"status_change_timestamp\"],\n    )\n\n    assert subject == [\n        {\n            \"sensor_id\": scenario[\"sensor_id\"],\n            \"assets\": scenario[\"assets\"],\n            \"status\": scenario[\"status\"],\n            \"status_change_timestamp\": scenario[\"status_change_timestamp\"],\n            \"checkpoint_location\": scenario[\"checkpoint_location\"],\n            \"upstream_key\": scenario[\"upstream_key\"],\n            \"upstream_value\": scenario[\"upstream_value\"],\n        }\n    ]\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"true_when_table_data_and_status_acquired_new_data\",\n            \"sensor_id\": \"sensor_id_1\",\n            \"control_db_table_name\": \"sensor_control_db_table\",\n            \"sensor_data\": Row(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_1\"],\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=TEST_DEFAULT_DATETIME,\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n            ),\n            \"expected_result\": True,\n        },\n        {\n            \"scenario_name\": \"false_when_table_data_is_absent\",\n            \"sensor_id\": \"sensor_id_1\",\n            \"control_db_table_name\": \"sensor_control_db_table\",\n            \"sensor_data\": None,\n            \"expected_result\": False,\n        },\n        {\n            \"scenario_name\": \"false_when_table_data_is_present_and_\"\n            \"status_different_than_acquired_new_data\",\n            \"sensor_id\": \"sensor_id_1\",\n            \"control_db_table_name\": \"sensor_control_db_table\",\n            \"sensor_data\": Row(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_1\"],\n                status=SensorStatus.PROCESSED_NEW_DATA.value,\n                status_change_timestamp=TEST_DEFAULT_DATETIME,\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n            ),\n            \"expected_result\": False,\n        },\n    ],\n)\ndef test_check_if_sensor_has_acquired_data(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test if Sensor has acquired data.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    with patch.object(\n        SensorControlTableManager,\n        \"read_sensor_table_data\",\n        new=MagicMock(return_value=scenario[\"sensor_data\"]),\n    ) as sensor_table_data_mock:\n        sensor_table_data_mock.start()\n        subject = SensorControlTableManager.check_if_sensor_has_acquired_data(\n            sensor_id=scenario[\"sensor_id\"],\n            control_db_table_name=scenario[\"control_db_table_name\"],\n        )\n\n        assert subject == scenario[\"expected_result\"]\n        sensor_table_data_mock.stop()\n\n\n@pytest.fixture\ndef control_table_fixture() -> DataFrame:\n    \"\"\"Return 
a dummy dataframe in the Sensor control table schema.\"\"\"\n    schema = StructType(\n        [\n            StructField(\"sensor_id\", StringType(), False),\n            StructField(\"assets\", ArrayType(StringType(), False), True),\n            StructField(\"status\", StringType(), False),\n            StructField(\"status_change_timestamp\", TimestampType(), False),\n            StructField(\"checkpoint_location\", StringType(), True),\n        ]\n    )\n    return ExecEnv.SESSION.createDataFrame(\n        [\n            [\n                \"sensor_id_1\",\n                [],\n                SensorStatus.ACQUIRED_NEW_DATA.value,\n                TEST_DEFAULT_DATETIME,\n                \"s3://dummy-bucket/sensors/sensor_id_1\",\n            ],\n            [\n                \"sensor_id_2\",\n                [\"asset_2\"],\n                SensorStatus.PROCESSED_NEW_DATA.value,\n                TEST_DEFAULT_DATETIME,\n                \"s3://dummy-bucket/sensors/sensor_id_2\",\n            ],\n            [\n                \"sensor_id_3\",\n                [\"asset_3\"],\n                SensorStatus.ACQUIRED_NEW_DATA.value,\n                TEST_DEFAULT_DATETIME,\n                \"s3://dummy-bucket/sensors/sensor_id_3\",\n            ],\n        ],\n        schema,\n    )\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"sensor_id_is_present\",\n            \"sensor_id\": \"sensor_id_1\",\n            \"control_db_table_name\": \"sensor_control_db_table\",\n            \"assets\": None,\n            \"expected_result\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"assets\": [],\n                \"status\": SensorStatus.ACQUIRED_NEW_DATA.value,\n                \"status_change_timestamp\": TEST_DEFAULT_DATETIME,\n                \"checkpoint_location\": \"s3://dummy-bucket/sensors/sensor_id_1\",\n            },\n        },\n        {\n            \"scenario_name\": \"sensor_id_is_absent_and_assets_is_present\",\n            \"sensor_id\": None,\n            \"control_db_table_name\": \"sensor_control_db_table\",\n            \"assets\": [\"asset_2\"],\n            \"expected_result\": {\n                \"sensor_id\": \"sensor_id_2\",\n                \"assets\": [\"asset_2\"],\n                \"status\": SensorStatus.PROCESSED_NEW_DATA.value,\n                \"status_change_timestamp\": TEST_DEFAULT_DATETIME,\n                \"checkpoint_location\": \"s3://dummy-bucket/sensors/sensor_id_2\",\n            },\n        },\n        {\n            \"scenario_name\": \"sensor_id_and_sensor_asset_are_absent\",\n            \"sensor_id\": None,\n            \"control_db_table_name\": \"sensor_control_db_table\",\n            \"assets\": None,\n            \"expected_result\": \"Either sensor_id or assets \"\n            \"need to be provided as arguments.\",\n        },\n    ],\n)\ndef test_read_sensor_table_data(\n    scenario: dict, capsys: Any, control_table_fixture: DataFrame\n) -> None:\n    \"\"\"Test read data from Sensor control table.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n        control_table_fixture: fixture representing the\n            control table as DataFrame.\n    \"\"\"\n    expected_result = scenario[\"expected_result\"]\n\n    with patch.object(DeltaTable, \"forName\", MagicMock()) as delta_table_for_name_mock:\n        delta_table_for_name_mock.start()\n        with patch.object(\n            delta_table_for_name_mock.return_value,\n   
         \"toDF\",\n            MagicMock(return_value=control_table_fixture),\n        ) as delta_table_for_to_df_mock:\n            delta_table_for_to_df_mock.start()\n\n            if scenario[\"scenario_name\"] == \"sensor_id_and_sensor_asset_are_absent\":\n                with pytest.raises(ValueError) as exception:\n                    SensorControlTableManager.read_sensor_table_data(\n                        sensor_id=scenario[\"sensor_id\"],\n                        control_db_table_name=scenario[\"control_db_table_name\"],\n                        assets=scenario[\"assets\"],\n                    )\n\n                assert expected_result in str(exception.value)\n            else:\n                subject = SensorControlTableManager.read_sensor_table_data(\n                    sensor_id=scenario[\"sensor_id\"],\n                    control_db_table_name=scenario[\"control_db_table_name\"],\n                    assets=scenario[\"assets\"],\n                )\n\n                assert subject.asDict() == expected_result\n\n            delta_table_for_to_df_mock.stop()\n    delta_table_for_name_mock.stop()\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"test_if_has_new_data\",\n            \"empty_df\": False,\n            \"expected_result\": True,\n        },\n        {\n            \"scenario_name\": \"test_if_has_not_new_data\",\n            \"empty_df\": True,\n            \"expected_result\": False,\n        },\n    ],\n)\ndef test_has_new_data(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test if checking for new data works correctly where there is new data.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    new_data_df = _prepare_new_data_tests(return_empty_df=scenario[\"empty_df\"])\n\n    has_new_data = SensorUpstreamManager.get_new_data(new_data_df) is not None\n\n    assert has_new_data == scenario[\"expected_result\"]\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"sensor_db_table_and_default_dummy_value\",\n            \"sensor\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"filter_exp\": \"?upstream_key > '?upstream_value'\",\n                \"control_db_table_name\": \"test_jdbc_sensor_default_dummy_value\",\n                \"upstream_key\": \"dummy_time\",\n                \"upstream_value\": None,\n            },\n            \"sensor_data\": Row(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_1\"],\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=TEST_DEFAULT_DATETIME,\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n                upstream_key=\"dummy_time\",\n                upstream_value=None,\n            ),\n            \"expected_result\": \"SELECT COUNT(1) as count, \"\n            \"'dummy_time' as UPSTREAM_KEY, \"\n            \"max(dummy_time) as UPSTREAM_VALUE \"\n            \"FROM sensor_new_data \"\n            \"WHERE dummy_time > '-2147483647' \"\n            \"HAVING COUNT(1) > 0\",\n        },\n        {\n            \"scenario_name\": \"sensor_db_table_with_custom_value\",\n            \"sensor\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"filter_exp\": \"?upstream_key > '?upstream_value'\",\n                \"control_db_table_name\": \"test_jdbc_sensor_custom_value\",\n                \"upstream_key\": 
\"dummy_time\",\n                \"upstream_value\": \"3333333333\",\n            },\n            \"sensor_data\": Row(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_1\"],\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=TEST_DEFAULT_DATETIME,\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n                upstream_key=\"dummy_time\",\n                upstream_value=\"3333333333\",\n            ),\n            \"expected_result\": \"SELECT COUNT(1) as count, \"\n            \"'dummy_time' as UPSTREAM_KEY, \"\n            \"max(dummy_time) as UPSTREAM_VALUE \"\n            \"FROM sensor_new_data \"\n            \"WHERE dummy_time > '3333333333' \"\n            \"HAVING COUNT(1) > 0\",\n        },\n        {\n            \"scenario_name\": \"filter_exp_preprocess_query\",\n            \"sensor\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"filter_exp\": \"my_column > 'my_value'\",\n                \"control_db_table_name\": None,\n                \"upstream_key\": None,\n                \"upstream_value\": None,\n            },\n            \"sensor_data\": Row(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_1\"],\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=TEST_DEFAULT_DATETIME,\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n                upstream_key=None,\n                upstream_value=None,\n            ),\n            \"expected_result\": \"SELECT COUNT(1) as count \"\n            \"FROM sensor_new_data \"\n            \"WHERE my_column > 'my_value' \"\n            \"HAVING COUNT(1) > 0\",\n        },\n        {\n            \"scenario_name\": \"filter_exp_preprocess_query_from_upstream_table_name\",\n            \"sensor\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"filter_exp\": \"?upstream_key > '?upstream_value'\",\n                \"control_db_table_name\": \"test_jdbc_sensor_default_dummy_value\",\n                \"upstream_key\": \"dummy_time\",\n                \"upstream_value\": \"3333333333\",\n                \"upstream_table_name\": \"test_db.dummy_table\",\n            },\n            \"sensor_data\": Row(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_1\"],\n                status=SensorStatus.ACQUIRED_NEW_DATA.value,\n                status_change_timestamp=TEST_DEFAULT_DATETIME,\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n                upstream_key=\"dummy_time\",\n                upstream_value=\"3333333333\",\n            ),\n            \"expected_result\": \"SELECT COUNT(1) as count, \"\n            \"'dummy_time' as UPSTREAM_KEY, \"\n            \"max(dummy_time) as UPSTREAM_VALUE \"\n            \"FROM test_db.dummy_table \"\n            \"WHERE dummy_time > '3333333333' \"\n            \"HAVING COUNT(1) > 0\",\n        },\n        {\n            \"scenario_name\": \"raise_exception_db_name_is_defined_and_upstream_key_not\",\n            \"sensor\": {\n                \"sensor_id\": \"sensor_id_1\",\n                \"filter_exp\": \"my_column > 'my_value'\",\n                \"control_db_table_name\": \"test_jdbc_sensor_raise_exception\",\n                \"upstream_key\": None,\n                \"upstream_value\": None,\n            },\n            \"expected_result\": \"If control_db_table_name is 
defined, \"\n            \"upstream_key should \"\n            \"also be defined!\",\n        },\n    ],\n)\ndef test_if_generate_filter_exp_preprocess_query(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test filter expression for preprocess query gen.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    sensor_data = scenario[\"sensor\"]\n    expected_result = scenario[\"expected_result\"]\n    db_table = sensor_data.get(\"control_db_table_name\")\n\n    if (\n        scenario[\"scenario_name\"]\n        == \"raise_exception_db_name_is_defined_and_upstream_key_not\"\n    ):\n        with pytest.raises(ValueError) as exception:\n            SensorUpstreamManager.generate_filter_exp_query(\n                sensor_data.get(\"sensor_id\"),\n                sensor_data.get(\"filter_exp\"),\n                f\"test_db.{db_table}\" if db_table else None,\n                sensor_data.get(\"upstream_key\"),\n                sensor_data.get(\"upstream_value\"),\n            )\n\n        assert expected_result in str(exception.value)\n    else:\n        with patch.object(\n            SensorControlTableManager,\n            \"read_sensor_table_data\",\n            new=MagicMock(return_value=scenario[\"sensor_data\"]),\n        ) as sensor_table_data_mock:\n            sensor_table_data_mock.start()\n            subject = SensorUpstreamManager.generate_filter_exp_query(\n                sensor_data.get(\"sensor_id\"),\n                sensor_data.get(\"filter_exp\"),\n                f\"test_db.{db_table}\" if db_table else None,\n                sensor_data.get(\"upstream_key\"),\n                sensor_data.get(\"upstream_value\"),\n                sensor_data.get(\"upstream_table_name\"),\n            )\n\n            assert subject == expected_result\n            sensor_table_data_mock.stop()\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"generate_sensor_table_preprocess_query\",\n            \"sensor_id\": \"sensor_id_1\",\n            \"expected_result\": \"SELECT * \"  # nosec\n            \"FROM sensor_new_data \"\n            \"WHERE\"\n            \" _change_type in ('insert', 'update_postimage')\"\n            \" and sensor_id = 'sensor_id_1'\"\n            f\" and status = '{SensorStatus.PROCESSED_NEW_DATA.value}'\",\n        }\n    ],\n)\ndef test_generate_sensor_table_preprocess_query(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test if we are generating correctly the preprocess query.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    subject = SensorUpstreamManager.generate_sensor_table_preprocess_query(\n        scenario[\"sensor_id\"]\n    )\n\n    assert subject == scenario[\"expected_result\"]\n\n\n@pytest.fixture\ndef dataframe_fixture() -> DataFrame:\n    \"\"\"Return a dummy dataframe to be used in our tests.\"\"\"\n    schema = StructType([StructField(\"dummy_field\", StringType(), True)])\n    return ExecEnv.SESSION.createDataFrame(\n        [[\"a\"], [\"b\"], [\"c\"]],\n        schema,\n    )\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"read_new_data\",\n            \"preprocess_query\": None,\n            \"expected_result\": 3,\n        },\n        {\n            \"scenario_name\": \"read_new_data_with_preprocess_query\",\n            \"preprocess_query\": \"SELECT *\"\n            \"FROM sensor_new_data \"\n            \"WHERE dummy_field 
= 'b' \",\n            \"expected_result\": 1,\n        },\n    ],\n)\ndef test_read_new_data(\n    scenario: dict, capsys: Any, dataframe_fixture: DataFrame\n) -> None:\n    \"\"\"Test if we execute the preprocess query when reading new data.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n        dataframe_fixture: fixture representing a dummy dataframe to be\n            used as mock return.\n    \"\"\"\n    with patch.object(\n        ReaderFactory, \"get_data\", MagicMock(return_value=dataframe_fixture)\n    ) as reader_factory_mock:\n        reader_factory_mock.start()\n\n        new_data = SensorUpstreamManager.read_new_data(\n            sensor_spec=SensorSpec(\n                sensor_id=\"sensor_id_1\",\n                assets=[\"asset_1\"],\n                control_db_table_name=\"test_db.sensor_control_table\",\n                input_spec=None,\n                preprocess_query=scenario[\"preprocess_query\"],\n                checkpoint_location=\"s3://dummy-bucket/sensors/sensor_id_1\",\n            )\n        )\n\n        assert new_data.count() == scenario[\"expected_result\"]\n        reader_factory_mock.stop()\n\n\n@pytest.mark.parametrize(\n    \"scenario\",\n    [\n        {\n            \"scenario_name\": \"generate_sap_logchain_query\",\n            \"chain_id\": \"MY_SAP_CHAIN_ID\",\n            \"expected_result\": \"WITH sensor_new_data AS (\"\n            \"SELECT \"\n            \"CHAIN_ID, \"\n            \"CONCAT(DATUM, ZEIT) AS LOAD_DATE, \"\n            \"ANALYZED_STATUS \"\n            \"FROM SAPPHA.RSPCLOGCHAIN \"\n            \"WHERE \"\n            \"UPPER(CHAIN_ID) = UPPER('MY_SAP_CHAIN_ID') \"\n            \"AND UPPER(ANALYZED_STATUS) = UPPER('G')\"\n            \")\",  # nosec\n        },\n        {\n            \"scenario_name\": \"generate_sap_logchain_query_dbtable\",\n            \"chain_id\": \"MY_SAP_CHAIN_ID\",\n            \"dbtable\": \"test_db.test_table\",\n            \"expected_result\": \"WITH sensor_new_data AS (\"\n            \"SELECT \"\n            \"CHAIN_ID, \"\n            \"CONCAT(DATUM, ZEIT) AS LOAD_DATE, \"\n            \"ANALYZED_STATUS \"\n            \"FROM test_db.test_table \"\n            \"WHERE \"\n            \"UPPER(CHAIN_ID) = UPPER('MY_SAP_CHAIN_ID') \"\n            \"AND UPPER(ANALYZED_STATUS) = UPPER('G')\"\n            \")\",  # nosec\n        },\n        {\n            \"scenario_name\": \"generate_sap_logchain_query_status\",\n            \"chain_id\": \"MY_SAP_CHAIN_ID\",\n            \"status\": \"A\",\n            \"expected_result\": \"WITH sensor_new_data AS (\"\n            \"SELECT \"\n            \"CHAIN_ID, \"\n            \"CONCAT(DATUM, ZEIT) AS LOAD_DATE, \"\n            \"ANALYZED_STATUS \"\n            \"FROM SAPPHA.RSPCLOGCHAIN \"\n            \"WHERE \"\n            \"UPPER(CHAIN_ID) = UPPER('MY_SAP_CHAIN_ID') \"\n            \"AND UPPER(ANALYZED_STATUS) = UPPER('A')\"\n            \")\",  # nosec\n        },\n        {\n            \"scenario_name\": \"generate_sap_logchain_query_engine_table\",\n            \"chain_id\": \"MY_SAP_CHAIN_ID\",\n            \"engine_table_name\": \"test_SAPTABLE\",\n            \"expected_result\": \"WITH test_SAPTABLE AS (\"\n            \"SELECT \"\n            \"CHAIN_ID, \"\n            \"CONCAT(DATUM, ZEIT) AS LOAD_DATE, \"\n            \"ANALYZED_STATUS \"\n            \"FROM SAPPHA.RSPCLOGCHAIN \"\n            \"WHERE \"\n            \"UPPER(CHAIN_ID) = UPPER('MY_SAP_CHAIN_ID') \"\n            \"AND 
UPPER(ANALYZED_STATUS) = UPPER('G')\"\n            \")\",  # nosec\n        },\n        {\n            \"scenario_name\": \"generate_sap_logchain_query_full_custom\",\n            \"chain_id\": \"MY_SAP_CHAIN_ID\",\n            \"dbtable\": \"test_db.test_table\",\n            \"status\": \"A\",\n            \"engine_table_name\": \"test_SAPTABLE\",\n            \"expected_result\": \"WITH test_SAPTABLE AS (\"\n            \"SELECT \"\n            \"CHAIN_ID, \"\n            \"CONCAT(DATUM, ZEIT) AS LOAD_DATE, \"\n            \"ANALYZED_STATUS \"\n            \"FROM test_db.test_table \"\n            \"WHERE \"\n            \"UPPER(CHAIN_ID) = UPPER('MY_SAP_CHAIN_ID') \"\n            \"AND UPPER(ANALYZED_STATUS) = UPPER('A')\"\n            \")\",  # nosec\n        },\n        {\n            \"scenario_name\": \"raise_exception_chain_id_is_not_defined\",\n            \"chain_id\": None,\n            \"expected_result\": \"To query on log chain SAP table the chain id \"\n            \"should be defined!\",\n        },\n    ],\n)\ndef test_generate_sensor_sap_logchain_query(scenario: dict, capsys: Any) -> None:\n    \"\"\"Test if we are generating correctly the sap logchain query.\n\n    Args:\n        scenario: scenario to test.\n        capsys: capture stdout and stderr.\n    \"\"\"\n    if scenario[\"scenario_name\"] == \"raise_exception_chain_id_is_not_defined\":\n        with pytest.raises(ValueError) as exception:\n            SensorUpstreamManager.generate_sensor_sap_logchain_query(\n                scenario[\"chain_id\"],\n                scenario.get(\"dbtable\"),\n                scenario.get(\"status\"),\n                scenario.get(\"engine_table_name\"),\n            )\n\n        assert scenario[\"expected_result\"] in str(exception.value)\n    else:\n        if scenario[\"scenario_name\"] == \"generate_sap_logchain_query\":\n            subject = SensorUpstreamManager.generate_sensor_sap_logchain_query(\n                scenario.get(\"chain_id\"),\n            )\n        elif scenario[\"scenario_name\"] == \"generate_sap_logchain_query_dbtable\":\n            subject = SensorUpstreamManager.generate_sensor_sap_logchain_query(\n                scenario.get(\"chain_id\"),\n                dbtable=scenario.get(\"dbtable\"),\n            )\n        elif scenario[\"scenario_name\"] == \"generate_sap_logchain_query_status\":\n            subject = SensorUpstreamManager.generate_sensor_sap_logchain_query(\n                scenario.get(\"chain_id\"),\n                status=scenario.get(\"status\"),\n            )\n        elif scenario[\"scenario_name\"] == \"generate_sap_logchain_query_engine_table\":\n            subject = SensorUpstreamManager.generate_sensor_sap_logchain_query(\n                scenario.get(\"chain_id\"),\n                engine_table_name=scenario.get(\"engine_table_name\"),\n            )\n        else:\n            subject = SensorUpstreamManager.generate_sensor_sap_logchain_query(\n                scenario.get(\"chain_id\"),\n                scenario.get(\"dbtable\"),\n                scenario.get(\"status\"),\n                scenario.get(\"engine_table_name\"),\n            )\n\n        assert subject == scenario[\"expected_result\"]\n\n\ndef _prepare_new_data_tests(return_empty_df: bool = False) -> DataFrame:\n    schema = StructType([StructField(\"dummy_field\", StringType(), True)])\n\n    if return_empty_df:\n        return ExecEnv.SESSION.createDataFrame(\n            [],\n            schema,\n        )\n    else:\n        return 
ExecEnv.SESSION.createDataFrame(\n            [[\"a\"], [\"b\"], [\"c\"]],\n            schema,\n        )\n"
  },
  {
    "path": "tests/unit/test_sharepoint_csv_reader.py",
    "content": "\"\"\"Test Sharepoint CSV reader.\n\nUnit tests for delimiter detection and Spark CSV option resolution in\n`SharepointCsvReader`.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom typing import Any, Dict, cast\n\nfrom lakehouse_engine.io.readers.sharepoint_reader import SharepointCsvReader\n\n\nclass DummySharepointOptions:\n    \"\"\"Minimal Sharepoint options stub used to build a `SharepointCsvReader`.\n\n    Args:\n        local_options: Dictionary of local CSV read options (for example, header,\n            delimiter, sep).\n    \"\"\"\n\n    def __init__(self, local_options: Dict[str, Any]) -> None:\n        \"\"\"Initialize the dummy options with the provided local options.\"\"\"\n        self.local_options = local_options\n\n\nclass DummyInputSpec:\n    \"\"\"Minimal input spec stub that exposes `sharepoint_opts` as expected by the reader.\n\n    Args:\n        sharepoint_options: Instance containing `local_options`.\n    \"\"\"\n\n    def __init__(self, sharepoint_options: DummySharepointOptions) -> None:\n        \"\"\"Initialize the dummy input spec with the provided Sharepoint options.\"\"\"\n        self.sharepoint_opts = sharepoint_options\n\n\ndef create_csv_reader(local_options: Dict[str, Any]) -> SharepointCsvReader:\n    \"\"\"Create a `SharepointCsvReader` instance without calling its constructor.\n\n    Args:\n        local_options: Dictionary of local CSV read options.\n\n    Returns:\n        SharepointCsvReader: A partially-initialized reader instance.\n    \"\"\"\n    csv_reader: SharepointCsvReader = SharepointCsvReader.__new__(SharepointCsvReader)\n    csv_reader._input_spec = cast(\n        Any, DummyInputSpec(DummySharepointOptions(local_options))\n    )\n    return csv_reader\n\n\ndef test_detect_delimiter_uses_user_provided_delimiter() -> None:\n    \"\"\"It should always return the explicitly provided delimiter.\"\"\"\n    csv_reader: SharepointCsvReader = SharepointCsvReader.__new__(SharepointCsvReader)\n    detected: str = csv_reader.detect_delimiter(\n        file_content=b\"column_a;column_b\\n1;2\\n\",\n        provided_delimiter=\"|\",\n        expected_columns=None,\n    )\n    assert detected == \"|\"\n\n\ndef test_detect_delimiter_autodetects_semicolon() -> None:\n    \"\"\"It should infer the delimiter from the file content when none is provided.\"\"\"\n    csv_reader: SharepointCsvReader = SharepointCsvReader.__new__(SharepointCsvReader)\n    detected: str = csv_reader.detect_delimiter(\n        file_content=b\"column_a;column_b\\n1;2\\n\",\n        provided_delimiter=None,\n        expected_columns=None,\n    )\n    assert detected == \";\"\n\n\ndef test_detect_delimiter_defaults_to_comma_on_decode_error() -> None:\n    \"\"\"It should fall back to comma when content cannot be decoded for sniffing.\"\"\"\n    csv_reader: SharepointCsvReader = SharepointCsvReader.__new__(SharepointCsvReader)\n    detected: str = csv_reader.detect_delimiter(\n        file_content=b\"\\xff\\xfe\",\n        provided_delimiter=None,\n        expected_columns=None,\n    )\n    assert detected == \",\"\n\n\ndef test_resolve_csv_options_prefers_sep_over_delimiter() -> None:\n    \"\"\"`sep` should take precedence over `delimiter`, and `delimiter` should be removed.\n\n    Args:\n        None\n    Returns:\n        None\n    \"\"\"\n    csv_reader: SharepointCsvReader = create_csv_reader(\n        {\"sep\": \"|\", \"delimiter\": \",\", \"header\": True}\n    )\n    spark_options: Dict[str, Any] = csv_reader.resolve_spark_csv_options(\n        
b\"column_a,column_b\\n1,2\\n\"\n    )\n    assert spark_options[\"sep\"] == \"|\"\n    assert \"delimiter\" not in spark_options\n\n\ndef test_resolve_spark_csv_options_uses_delimiter_when_sep_missing() -> None:\n    \"\"\"If `sep` is missing, `delimiter` should be mapped into `sep` and removed.\"\"\"\n    csv_reader: SharepointCsvReader = create_csv_reader(\n        {\"delimiter\": \";\", \"header\": True}\n    )\n    spark_options: Dict[str, Any] = csv_reader.resolve_spark_csv_options(\n        b\"column_a,column_b\\n1,2\\n\"\n    )\n    assert spark_options[\"sep\"] == \";\"\n    assert \"delimiter\" not in spark_options\n\n\ndef test_resolve_spark_csv_options_autodetects_when_no_delimiter_provided() -> None:\n    \"\"\"If neither `sep` nor `delimiter` is provided, it should autodetect from content.\n\n    Args:\n        None\n    Returns:\n        None\n    \"\"\"\n    csv_reader: SharepointCsvReader = create_csv_reader({\"header\": True})\n    spark_options: Dict[str, Any] = csv_reader.resolve_spark_csv_options(\n        b\"column_a|column_b\\n1|2\\n\"\n    )\n    assert spark_options[\"sep\"] == \"|\"\n\n\ndef test_resolve_spark_csv_options_warns_when_expected_columns_names_mismatch(\n    caplog: Any,\n) -> None:\n    \"\"\"Warn when expected column names do not match the header.\n\n    Args:\n        caplog: Pytest log capture fixture.\n\n    Returns:\n        None.\n    \"\"\"\n    csv_reader: SharepointCsvReader = create_csv_reader(\n        {\n            \"header\": True,\n            \"expected_columns\": [\"col_a\", \"col_b\"],\n        }\n    )\n\n    # Header uses semicolon, delimiter should be detected as ';', but names mismatch.\n    file_content: bytes = b\"wrong_a;wrong_b\\n1;2\\n\"\n\n    with caplog.at_level(\"WARNING\"):\n        csv_reader.resolve_spark_csv_options(file_content)\n\n    assert \"Expected columns don't match CSV header\" in caplog.text\n\n\ndef test_resolve_spark_csv_options_warns_when_expected_columns_validation_fails(\n    caplog: Any,\n) -> None:\n    \"\"\"Warn when validation against the header cannot be performed.\n\n    Args:\n        caplog: Pytest log capture fixture.\n\n    Returns:\n        None.\n    \"\"\"\n    csv_reader: SharepointCsvReader = create_csv_reader(\n        {\n            \"header\": True,\n            \"expected_columns\": [\"col_a\", \"col_b\"],\n        }\n    )\n\n    # Force decode failure inside the expected_columns validation block.\n    file_content: bytes = b\"\\xff\\xfe\"\n\n    with caplog.at_level(\"WARNING\"):\n        csv_reader.resolve_spark_csv_options(file_content)\n\n    assert \"Failed to validate expected_columns against CSV header\" in caplog.text\n"
  },
  {
    "path": "tests/unit/test_spark_session.py",
    "content": "\"\"\"Test if a new spark session returns the same object as current session.\"\"\"\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\nLOGGER = LoggingHandler(__name__).get_logger()\n\n\ndef test_spark_session() -> None:\n    \"\"\"Test if a new spark session returns the same object as current session.\"\"\"\n    old_session = ExecEnv.SESSION.getActiveSession()\n    ExecEnv.get_or_create()\n    new_session = ExecEnv.SESSION.getActiveSession()\n\n    assert old_session is new_session, (\n        \"Sessions pointing to different objects.\"\n        f\"{new_session} is different than {old_session}\"\n    )\n\n    LOGGER.info(\n        f\"New session ({new_session}) is the same as previously \"\n        f\"created session ({old_session}).\"\n    )\n"
  },
  {
    "path": "tests/unit/test_version.py",
    "content": "\"\"\"Test if the correct version of the lib is being read.\"\"\"\n\nimport re\n\nfrom lakehouse_engine.utils.configs.config_utils import ConfigUtils\n\n\ndef test_version() -> None:\n    \"\"\"Test if ConfigUtils is reading the correct version from pyproject.toml.\"\"\"\n    configUtils = ConfigUtils()\n\n    current_version = re.search(\n        r\"(?<=version = \\\").*(?=\\\")\", open(\"pyproject.toml\").read()\n    ).group()\n    assert current_version == configUtils.get_engine_version()\n"
  },
  {
    "path": "tests/utils/__init__.py",
    "content": "\"\"\"Tests utilities.\"\"\"\n"
  },
  {
    "path": "tests/utils/dataframe_helpers.py",
    "content": "\"\"\"Module with helper functions to interact with test dataframes.\"\"\"\n\nimport random\nimport string\nfrom typing import Optional, OrderedDict\n\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.types import StructType\n\nfrom lakehouse_engine.core.definitions import (\n    InputFormat,\n    InputSpec,\n    OutputFormat,\n    OutputSpec,\n    ReadType,\n    WriteType,\n)\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom lakehouse_engine.io.readers.file_reader import FileReader\nfrom lakehouse_engine.io.readers.jdbc_reader import JDBCReader\nfrom lakehouse_engine.io.readers.table_reader import TableReader\nfrom lakehouse_engine.io.writers.jdbc_writer import JDBCWriter\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass DataframeHelpers(object):\n    \"\"\"Class with helper functions to interact with test dataframes.\"\"\"\n\n    _logger = LoggingHandler(__name__).get_logger()\n\n    @classmethod\n    def has_diff(\n        cls, df: DataFrame, another_df: DataFrame, group_and_order: bool = True\n    ) -> bool:\n        \"\"\"Check if a dataframe has differences comparing to another dataframe.\n\n        Note: the order of the columns and rows are not considered as differences\n        by default.\n\n        Args:\n            df: one dataframe.\n            another_df: another dataframe.\n            group_and_order: whether to group and order the DFs or not.\n\n        Returns:\n            True if it has a difference, false otherwise.\n        \"\"\"\n\n        def print_diff(desc: str, diff_df: DataFrame) -> None:\n            cls._logger.debug(desc)\n            for row in diff_df.collect():\n                cls._logger.debug(row)\n\n        cls._logger.debug(\"Checking if Dataframes have diff...\")\n        cols_to_group = df.columns\n        if group_and_order:\n            df = df.select(*cols_to_group).orderBy(*cols_to_group)\n            another_df = another_df.select(*cols_to_group).orderBy(*cols_to_group)\n\n        diff_1 = df.exceptAll(another_df)\n        diff_2 = another_df.exceptAll(df)\n        if diff_1.isEmpty() is False or diff_2.isEmpty() is False:\n            df.show(100, False)\n            another_df.show(100, False)\n            cls._logger.debug(\"Dataframes have diff...\")\n            print_diff(\"Diff 1:\", diff_1)\n            print_diff(\"Diff 2:\", diff_2)\n            return True\n        else:\n            return False\n\n    @staticmethod\n    def read_from_file(\n        location: str,\n        file_format: str = InputFormat.CSV.value,\n        schema: Optional[dict] = None,\n        options: Optional[dict] = None,\n    ) -> DataFrame:\n        \"\"\"Read data from a file into a dataframe.\n\n        Args:\n            location: location of the file(s).\n            file_format: file(s) format.\n            schema: schema of the files (only works with spark schema\n                StructType for now).\n            options: options (e.g., spark options) to read data.\n\n        Returns:\n            The dataframe that was read.\n        \"\"\"\n        if options is None and file_format == InputFormat.CSV.value:\n            options = {\"header\": True, \"delimiter\": \"|\", \"inferSchema\": True}\n        spec = InputSpec(\n            spec_id=random.choice(string.ascii_letters),  # nosec\n            read_type=ReadType.BATCH.value,\n            data_format=file_format,\n            location=location,\n            schema=schema,\n            options=options,\n        )\n        return 
FileReader(input_spec=spec).read()\n\n    @staticmethod\n    def read_from_table(db_table: str, options: Optional[dict] = None) -> DataFrame:\n        \"\"\"Read data from a table into a dataframe.\n\n        Args:\n            db_table: `database.table_name`.\n            options: options (e.g., spark options) to read data.\n\n        Returns:\n            DataFrame: the dataframe that was read.\n        \"\"\"\n        spec = InputSpec(\n            spec_id=random.choice(string.ascii_letters),  # nosec\n            read_type=ReadType.BATCH.value,\n            db_table=db_table,\n            options=options,\n        )\n        return TableReader(input_spec=spec).read()\n\n    @staticmethod\n    def read_from_jdbc(\n        uri: str, db_table: str, driver: str = \"org.sqlite.JDBC\"\n    ) -> DataFrame:\n        \"\"\"Read data from jdbc into a dataframe.\n\n        Args:\n            uri: uri for the jdbc connection.\n            db_table: `database.table_name`.\n            driver: driver class.\n\n        Returns:\n            DataFrame: the dataframe that was read.\n        \"\"\"\n        spec = InputSpec(\n            spec_id=random.choice(string.ascii_letters),  # nosec\n            db_table=db_table,\n            read_type=ReadType.BATCH.value,\n            options={\"url\": uri, \"dbtable\": db_table, \"driver\": driver},\n        )\n        return JDBCReader(input_spec=spec).read()\n\n    @staticmethod\n    def write_into_jdbc_table(\n        df: DataFrame,\n        uri: str,\n        db_table: str,\n        write_type: str = WriteType.APPEND.value,\n        driver: str = \"org.sqlite.JDBC\",\n        data: OrderedDict = None,\n    ) -> None:\n        \"\"\"Write data into a jdbc table.\n\n        Args:\n            df: dataframe containing the data to append.\n            uri: uri for the jdbc connection.\n            db_table: `database.table_name`.\n            write_type: type of writer to use for writing into the destination\n            driver: driver class.\n            data: list of all dfs generated on previous steps before writer.\n        \"\"\"\n        spec = OutputSpec(\n            spec_id=random.choice(string.ascii_letters),  # nosec\n            input_id=random.choice(string.ascii_letters),  # nosec\n            write_type=write_type,\n            data_format=OutputFormat.JDBC.value,\n            options={\"url\": uri, \"dbtable\": db_table, \"driver\": driver},\n        )\n\n        JDBCWriter(output_spec=spec, df=df.coalesce(1), data=data).write()\n\n    @staticmethod\n    def create_empty_dataframe(struct_type: StructType) -> DataFrame:\n        \"\"\"Create an empty DataFrame.\n\n        Args:\n            struct_type: dict containing a spark schema structure. [Check here](\n                https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html).\n\n        Returns:\n            An empty dataframe\n        \"\"\"\n        return ExecEnv.SESSION.createDataFrame(data=[], schema=struct_type)\n\n    @staticmethod\n    def create_dataframe(data: list, schema: StructType) -> DataFrame:\n        \"\"\"Create a DataFrame.\n\n        Args:\n            data: dict containing the data to create the DataFrame.\n            schema: dict containing a spark schema structure. 
[Check here](\n                https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructType.html).\n\n        Returns:\n            The created DataFrame.\n        \"\"\"\n        return ExecEnv.SESSION.createDataFrame(data=data, schema=schema)\n\n    @staticmethod\n    def create_delta_table(\n        cols: dict, table: str, db: str = \"test_db\", enable_cdf: bool = False\n    ) -> None:\n        \"\"\"Create a delta table for test purposes.\n\n        Args:\n            cols: dict of columns to create table and their types.\n            table: table name.\n            db: database name.\n            enable_cdf: whether to enable change data feed, or not.\n        \"\"\"\n        ExecEnv.SESSION.sql(\n            f\"\"\"\n            CREATE EXTERNAL TABLE {db}.{table} (\n                {','.join([f'{cname} {ctype}' for cname, ctype in cols.items()])}\n            )\n            USING delta\n            TBLPROPERTIES (delta.enableChangeDataFeed = {str(enable_cdf).lower()})\n            \"\"\"\n        )\n"
  },
  {
    "path": "tests/utils/dq_rules_table_utils.py",
    "content": "\"\"\"Utils for dealing with DQ Rules tables.\"\"\"\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\nfrom tests.utils.local_storage import LocalStorage\n\n\ndef _create_dq_functions_source_table(\n    test_resources_path: str,\n    lakehouse_in_path: str,\n    lakehouse_out_path: str,\n    test_name: str,\n    scenario: str,\n    table_name: str,\n) -> None:\n    \"\"\"Create test dq functions source table.\n\n    Args:\n        test_resources_path: path to the test resources.\n        lakehouse_in_path: path to the lakehouse in.\n        lakehouse_out_path: path to the lakehouse out.\n        test_name: name of the test.\n        scenario: name of the test scenario.\n        table_name: name of the test table.\n    \"\"\"\n    LocalStorage.copy_file(\n        f\"{test_resources_path}/{test_name}/data/dq_functions/{table_name}.csv\",\n        f\"{lakehouse_in_path}/{test_name}/{scenario}/dq_functions/\",\n    )\n\n    ExecEnv.SESSION.sql(\n        f\"\"\"\n        CREATE TABLE IF NOT EXISTS {table_name} (\n            dq_rule_id STRING,\n            dq_check_type STRING,\n            dq_tech_function STRING,\n            execution_point STRING,\n            schema STRING,\n            table STRING,\n            column STRING,\n            filters STRING,\n            arguments STRING,\n            expected_technical_expression STRING,\n            dimension STRING\n        )\n        USING delta\n        LOCATION '{lakehouse_out_path}/{test_name}/{scenario}/dq_functions'\n        TBLPROPERTIES(\n          'lakehouse.primary_key'='dq_rule_id',\n          'delta.enableChangeDataFeed'='false'\n        )\n        \"\"\"\n    )\n    dq_functions = (\n        ExecEnv.SESSION.read.option(\"delimiter\", \"|\")\n        .option(\"header\", True)\n        .csv(\n            f\"{lakehouse_in_path}/{test_name}/{scenario}/dq_functions/{table_name}.csv\"\n        )\n    )\n\n    dq_functions.write.saveAsTable(\n        name=f\"{table_name}\", format=\"delta\", mode=\"overwrite\"\n    )\n"
  },
  {
    "path": "tests/utils/exec_env_helpers.py",
    "content": "\"\"\"Module with helper functions to interact with test execution environment.\"\"\"\n\nfrom lakehouse_engine.core.exec_env import ExecEnv\n\n\nclass ExecEnvHelpers(object):\n    \"\"\"Class with helper functions to interact with test execution environment.\"\"\"\n\n    @staticmethod\n    def prepare_exec_env(spark_driver_memory: str) -> None:\n        \"\"\"Create single execution environment session.\"\"\"\n        ExecEnv.get_or_create(\n            app_name=\"Lakehouse Engine Tests\",\n            enable_hive_support=False,\n            config={\n                \"spark.master\": \"local[2]\",\n                \"spark.driver.memory\": spark_driver_memory,\n                \"spark.sql.warehouse.dir\": \"file:///app/tests/lakehouse/spark-warehouse/\",  # noqa: E501\n                \"spark.sql.shuffle.partitions\": \"2\",\n                \"spark.sql.extensions\": \"io.delta.sql.DeltaSparkSessionExtension\",\n                \"spark.sql.catalog.spark_catalog\": \"org.apache.spark.sql.delta.catalog.DeltaCatalog\",  # noqa: E501\n                \"spark.jars.packages\": \"io.delta:delta-spark_2.13:4.0.0,org.xerial:sqlite-jdbc:3.50.3.0\",  # noqa: E501\n                \"spark.jars.excludes\": \"net.sourceforge.f2j:arpack_combined_all\",\n                \"spark.sql.sources.parallelPartitionDiscovery.parallelism\": \"2\",\n                \"spark.sql.legacy.charVarcharAsString\": True,\n            },\n        )\n\n    @classmethod\n    def set_exec_env_config(cls, key: str, value: str) -> None:\n        \"\"\"Set any execution environment (e.g., spark) session setting.\"\"\"\n        ExecEnv.SESSION.conf.set(key, value)\n\n    @classmethod\n    def reset_default_spark_session_configs(cls) -> None:\n        \"\"\"Reset spark session configs.\"\"\"\n        cls.set_exec_env_config(\n            \"spark.databricks.delta.schema.autoMerge.enabled\", \"false\"\n        )\n        cls.set_exec_env_config(\"spark.sql.streaming.schemaInference\", \"false\")\n        cls.set_exec_env_config(\n            \"spark.sql.sources.partitionColumnTypeInference.enabled\", \"true\"\n        )\n"
  },
  {
    "path": "tests/utils/local_storage.py",
    "content": "\"\"\"Utilities to interact with the local file system used in the tests.\"\"\"\n\nimport glob\nfrom os import makedirs, path, remove\nfrom pathlib import Path\nfrom shutil import copy, copytree, rmtree\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n_LOGGER = LoggingHandler(__name__).get_logger()\n\n\nclass LocalStorage(object):\n    \"\"\"Helper class to support local storage operations in tests.\"\"\"\n\n    @staticmethod\n    def copy_file(from_path: str, to_path: str) -> None:\n        \"\"\"Copy files (supports regex) into target file or folder.\n\n        :param str from_path: path from where to copy files from (supports regex).\n        :param str to_path: path to where to copy files to.\n        \"\"\"\n        makedirs(path.dirname(to_path), exist_ok=True)\n\n        for file in glob.glob(from_path):\n            copy(file, to_path)\n\n    @staticmethod\n    def clean_folder(folder_path: str) -> None:\n        \"\"\"Clean a folder content.\n\n        :param str folder_path: path of the folder to clean.\n        \"\"\"\n        if Path(folder_path).is_dir():\n            rmtree(folder_path)\n\n    @staticmethod\n    def delete_file(file_path: str) -> None:\n        \"\"\"Delete a file.\n\n        :param str file_path: path of the file(s) to delete (supports regex).\n        \"\"\"\n        for file in glob.glob(file_path):\n            if Path(file).exists():\n                remove(file)\n\n    @staticmethod\n    def read_file(file_path: str) -> str:\n        \"\"\"Read file from directory.\n\n        Args:\n            file_path: path of the file to be read.\n        \"\"\"\n        with open(file_path, \"r\") as f:\n            result = f.read()\n        return result\n\n    @staticmethod\n    def copy_dir(source: str, destination: str) -> None:\n        \"\"\"Copy all files in a directory.\n\n        Args:\n            source: string with the source location.\n            destination: string with the destination location.\n        \"\"\"\n        copytree(source, destination, dirs_exist_ok=True)\n"
  },
  {
    "path": "tests/utils/mocks.py",
    "content": "\"\"\"Module to hold utilities Mocks tests.\"\"\"\n\nfrom __future__ import annotations\n\nfrom typing import Any, Optional\nfrom unittest.mock import MagicMock\n\n\nclass MockRESTResponse:\n    \"\"\"Mock Rest Responses for tests.\"\"\"\n\n    def __init__(\n        self,\n        status_code: int,\n        json_data: Optional[dict[str, Any]] = None,\n        content: bytes = b\"\",\n    ) -> None:\n        \"\"\"Construct MockRESTResponse instances.\n\n        :param status_code: status code.\n        :param json_data: json response.\n        :param content: raw response content.\n        \"\"\"\n        self.status_code: int = status_code\n        self.json_data: Optional[dict[str, Any]] = json_data\n        self.content: bytes = content\n        self.text: str = content.decode(\"utf-8\", errors=\"ignore\") if content else \"\"\n        self.raise_for_status: MagicMock = MagicMock()\n\n    def json(self) -> Optional[dict[str, Any]]:\n        \"\"\"Get json response.\n\n        :return dict: json response.\n        \"\"\"\n        return self.json_data\n\n    def __enter__(self) -> MockRESTResponse:\n        \"\"\"Allow use as a context manager.\"\"\"\n        return self\n\n    def __exit__(\n        self,\n        exc_type: type[BaseException] | None,\n        exc: BaseException | None,\n        tb: Any,\n    ) -> None:\n        \"\"\"Context manager exit.\"\"\"\n        return None\n"
  },
  {
    "path": "tests/utils/smtp_server.py",
    "content": "\"\"\"A simple SMTP server for testing purposes.\"\"\"\n\nfrom logging import Logger\nfrom typing import Any\n\nfrom aiosmtpd import controller\nfrom aiosmtpd.handlers import Message\n\nfrom lakehouse_engine.utils.logging_handler import LoggingHandler\n\n\nclass SMTPHandler(Message):\n    \"\"\"Custom handler to capture emails during testing.\"\"\"\n\n    def __init__(self) -> None:\n        \"\"\"Initialize the SMTP handler.\"\"\"\n        super().__init__()\n        self.messages: list = []\n\n    def handle_message(self, message: Any) -> None:\n        \"\"\"Handle incoming messages and store them for verification.\n\n        Args:\n            message: The incoming email message.\n\n        Returns:\n            A string indicating the result of the message handling.\n        \"\"\"\n        self.messages.append(message)\n\n\nclass SMTPServer:\n    \"\"\"Test SMTP server for unit testing.\"\"\"\n\n    _LOGGER: Logger = LoggingHandler(__name__).get_logger()\n\n    def __init__(self, host: str, port: int) -> None:\n        \"\"\"Initialize the SMTP server.\n\n        Args:\n            host: The hostname of the SMTP server.\n            port: The port number of the SMTP server.\n        \"\"\"\n        self.host = host\n        self.port = port\n        self.handler = SMTPHandler()\n        self.controller: controller.Controller | None = None\n\n    def start(self) -> None:\n        \"\"\"Start the SMTP server.\"\"\"\n        self.controller = controller.Controller(\n            self.handler, hostname=self.host, port=self.port\n        )\n        self.controller.start()\n        self._LOGGER.info(f\"Test SMTP server started on {self.host}:{self.port}\")\n\n    def stop(self) -> None:\n        \"\"\"Stop the SMTP server.\"\"\"\n        if self.controller:\n            self.controller.stop()\n            self._LOGGER.info(\"Test SMTP server stopped\")\n\n    def get_messages(self) -> list:\n        \"\"\"Get all captured messages.\"\"\"\n        return self.handler.messages\n\n    def clear_messages(self) -> None:\n        \"\"\"Clear all captured messages.\"\"\"\n        self.handler.messages.clear()\n\n    def get_last_message(self) -> Any:\n        \"\"\"Get the last received message.\"\"\"\n        return self.handler.messages[-1] if self.handler.messages else None\n"
  }
]