[
  {
    "path": ".github/ISSUE_TEMPLATE/1-bug-report.yml",
    "content": "name: \"🐛 Bug Report\"\ndescription: Create a new ticket for a bug.\ntitle: \"🐛 [BUG] - <title>\"\nlabels: [\n  \"bug\"\n]\n\nbody:\n  - type: textarea\n    id: environment-setting\n    attributes:\n      label: \"Environment Settings\"\n      description: Java, Pyspark version, Python version, ...\n      placeholder: Let us explain your environment settings to reproduce\n    validations:\n      required: true\n\n  - type: textarea\n    id: expected-behavior\n    attributes:\n      label: \"Expected Behavior\"\n      placeholder: A clear and concise description of what you would expect to happen.\n    validations:\n      required: true\n\n  - type: textarea\n    id: actual-behavior\n    attributes:\n      label: \"Actual Behavior\"\n      placeholder: A clear and concise description of what actually happened.\n      \n  - type: textarea\n    id: reproduction\n    attributes:\n      label: Reproduction\n      description: |\n        Please enter an explicit steps to reproduce your problem.\n        If you have any code snippets, error messages, and etc., please provide them here.\n\n      placeholder: |\n        Steps to reproduce:\n          \n          1.\n          2.\n          3.\n          4.\n    validations:\n      required: true\n\n\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/2-feature-request.yml",
    "content": "name: \"🚀 Feature Request\"\ndescription: Suggesting new desired feature and enhancement of existing feature\ntitle: \"🚀 [REQUEST] - <title>\"\nlabels: [\n  \"enhancement\", \"feature\"\n]\n\nbody:\n  - type: textarea\n    id: feature-request\n    attributes:\n      label: Feature request\n      description: |\n        Please describe the feature you want to add or needs to be enhanced.\n        If you have any related paper or code, please provide us.\n    validations:\n      required: true\n\n\n  - type: textarea\n    id: context\n    validations:\n      required: false\n    attributes:\n      label: Context\n      description: |\n        Please let us know your motivation or additional context for this suggestion.\n        Knowing the reason why it needs to be add/enhanced makes us easy to understand the need.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/3-documentation-improve.yml",
    "content": "name: \"📝 Documentation Improvement\"\ndescription: Report wrong or missing documentation. You can suggest new document or document that needs any improvement.\ntitle: \"📝 [Docs] - <title>\"\nlabels: [\n  \"docs\"\n]\n\nbody:\n  - type: checkboxes\n    attributes:\n      label: dataverse version checks\n      options:\n        - label: >\n            I have checked that the issue still exists on the latest versions of the _dataverse_.\n          required: true\n\n  - type: textarea\n    id: location\n    attributes:\n      label: Location of the documentation\n      description: >\n        Please provide the location of the documentation.\n        If you are suggesting new document, please provide appropriate place it has to be.\n    validations:\n      required: true\n\n  - type: textarea\n    id: problem\n    attributes:\n      label: Documentation problem\n      description: >\n        Please provide a description of what documentation you believe needs to be fixed/improved/added.\n    validations:\n      required: true\n\n  - type: textarea\n    id: suggestion\n    attributes:\n      label: Suggestion\n      description: >\n        Please explain the suggested fix and **why** it's better than the existing documentation.\n        Or it could be content of new document you are suggesting.\n    validations:\n      required: true"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "content": "blank_issues_enabled: true"
  },
  {
    "path": ".github/pull_request_template.md",
    "content": "## PR Checklist\nPlease check if your PR fulfills the following requirements:\n\n- [ ] The commit message follows _dataverse_ guidelines [link](https://github.com/UpstageAI/dataverse/blob/main/contribution/CONTRIBUTING.md#commit-guidelines):\n- [ ] Tests for the changes have been added (for bug fixes / features)\n- [ ] Docs have been added / updated (for bug fixes / features)\n\n\n## What does this PR do?\n<!-- Please describe the link to a relevant issue and current behavior that you are modifying.-->\n\n- Issue Number: #\n- Description: "
  },
  {
    "path": ".gitignore",
    "content": "# forbidden\n.env\nreference/\ncommon_crawl/\nnotebook/\n.cache/\nsample/\n\n# open-source \ncc_net/\ndps/\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#pdm.lock\n#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it\n#   in version control.\n#   https://pdm.fming.dev/#use-with-ide\n.pdm.toml\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n#.idea/"
  },
  {
    "path": ".pre-commit-config.yaml",
    "content": "repos:\n    - repo: https://github.com/pre-commit/pre-commit-hooks\n      rev: v3.2.0\n      hooks:\n        # -   id: trailing-whitespace\n        -   id: check-added-large-files\n        -   id: detect-private-key\n        -   id: detect-aws-credentials\n            args: [--allow-missing-credentials]\n    - repo: https://github.com/pycqa/isort\n      rev: 5.13.2\n      hooks:\n        -   id: isort\n            args: [\n                    --profile=black,\n                ]\n    - repo: https://github.com/psf/black\n      rev:  23.12.1\n      hooks:\n        -   id: black\n            args: [\n                --line-length=100,\n            ]\n\n    - repo: https://github.com/myint/autoflake\n      rev: v2.2.0\n      hooks:\n        -   id: autoflake\n            args: [\n            # --in-place,\n            # --remove-unused-variables,\n            # --remove-all-unused-imports,\n            --expand-star-imports,\n            ]\n    - repo: https://github.com/PyCQA/flake8\n      rev: 6.0.0\n      hooks:\n        -   id: flake8\n            args: [\n                \"--ignore=E203, E501, W503\", \n                ]\n            # E203: Whitespace before ':'\n            # E501: line length - because black checks and this makes error even on commented code\n            # W503: PEP8 now recommends to break before binary operator (https://peps.python.org/pep-0008/#should-a-line-break-before-or-after-a-binary-operator)"
  },
  {
    "path": ".readthedocs.yaml",
    "content": "# .readthedocs.yml\n# Read the Docs configuration file\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details\n\n# Required\nversion: 2\n\n# Set the OS, Python version and other tools you might need\nbuild:\n  os: ubuntu-20.04\n  tools:\n    python: \"3.10\"\n    # You can also specify other tool versions:\n    # nodejs: \"19\"\n    # rust: \"1.64\"\n    # golang: \"1.19\"\n    \n# Build documentation in the docs/ directory with Sphinx\nsphinx:\n  configuration: docs/source/conf.py\n\n# Build documentation with MkDocs\n#mkdocs:\n#  configuration: mkdocs.yml\n\n# Optionally build your docs in additional formats such as PDF\n#formats:\n#  - pdf\n\n# Optionally set the version of Python and requirements required to build your docs\npython:\n  install:\n    - requirements: docs/source/requirements.txt\n"
  },
  {
    "path": "LICENSE",
    "content": "                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. (Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright [yyyy] [name of copyright owner]\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "Makefile",
    "content": "\n.PHONY: aws_s3 pyspark java \n\naws_s3:\n\t@test -d $$SPARK_HOME/jars || mkdir -p $$SPARK_HOME/jars\n\t@test -f $$SPARK_HOME/jars/hadoop-aws-3.3.4.jar || wget -P $$SPARK_HOME/jars/ https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar\n\t@test -f $$SPARK_HOME/jars/aws-java-sdk-bundle-1.12.592.jar || wget -P $$SPARK_HOME/jars/ https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.592/aws-java-sdk-bundle-1.12.592.jar\n\npyspark:\n\techo \"export SPARK_HOME=$(shell pip show pyspark | grep Location | awk '{print $$2 \"/pyspark\"}')\" >> ~/.bashrc\n\techo \"export PYSPARK_PYTHON=python3\" >> ~/.bashrc\n\n# setting java environment\njava:\n\tsudo apt-get update\n\tsudo apt-get install openjdk-11-jdk\n\techo \"export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64\" >> ~/.bashrc\n"
  },
  {
    "path": "README.md",
    "content": "<div align=\"center\">\n\n<br>\n<picture>\n  <source media=\"(prefers-color-scheme: dark)\" srcset=\"docs/images/dataverse_logo-white.png\" width=300>\n  <source media=\"(prefers-color-scheme: light)\" srcset=\"docs/images/dataverse_logo-color.png\" width=300>\n  <img alt=\"DATAVERSE\" src=\"docs/images/dataverse_logo-color.png\" width=300>\n</picture>\n\n<br>\n\nThe Universe of Data. \nAll about Data, Data Science, and Data Engineering. </br>\nUpstage Solar is powered by Dataverse! Try at Upstage [Console](https://console.upstage.ai/)!\n\n[Docs](https://data-verse.gitbook.io/docs/) • [Examples](https://github.com/UpstageAI/dataverse/tree/main/examples) • [API Reference](https://data-verse.readthedocs.io/en/latest/) • [FAQ](https://data-verse.gitbook.io/docs/documents/faqs) • [Contribution Guide](https://github.com/UpstageAI/dataverse/blob/main/contribution/CONTRIBUTING.md)  • [Contact](mailto:dataverse@upstage.ai)  • [Discord](https://discord.gg/aAqF7pyq4h) • [Paper](https://arxiv.org/abs/2403.19340)\n<br><br>\n<div align=\"left\">\n\n## Welcome to Dataverse!\nDataverse is a freely-accessible open-source project that supports your **ETL(Extract, Transform and Load) pipeline with Python**. We offer a simple, standardized and user-friendly solution for data processing and management, catering to the needs of data scientists, analysts, and developers in LLM era. Even though you don't know much about Spark, you can use it easily via _dataverse_.\n\n### With Dataverse, you are empowered to\n\n- utilize a range of preprocessing functions without the need to install multiple libraries.\n- create high-quality data for analysis and training of Large Language Models (LLM).\n- leverage Spark with ease, regardless of your expertise level.\n- facilitate smoother collaboration among users with varying degress of Spark proficiency.\n- enjoy freedom from the limitations of local environments by harnessing the capabilities of AWS EMR.\n\n### Architecture of Dataverse\n![Architecture of Dataverse](./docs/images/dataverse_system_architecture_white.jpeg)\n\n### Key Features of Dataverse\n- **Block-Based**: In Dataverse, a `block` means a `registered ETL function` which is running on Spark. You can build Spark code like putting together puzzle pieces. You can easily add, take away, or re-arrange pieces to get the results you want via configure.\n- **Configure-Based**: All the setups for Spark and steps of block can be defined with configure. You don't need to know all the code. Just set up the options, and you're good to go.\n- **Extensible**: It's designed to meet your specific demands, allowing for custom features that fit perfectly with your project.\n\nIf you want to know more about Dataverse, please checkout our [docs](https://data-verse.gitbook.io/docs/).\n\nBy clicking below image, it'll take you to a short intro video!\n[![Brief Introduction](./docs/images/dataverse_hero.png)](https://youtu.be/yYyyLuPNK5s?feature=shared)\n<br>\n\n## 🌌 Installation\n### 🌠 Prerequisites\nTo use this library, the following conditions are needed:\n- Python (version between 3.10 and 3.11)\n- JDK (version 11)\n- PySpark\n\nDetail installation guide for prerequisites can be found on [here](https://data-verse.gitbook.io/docs/installation).\n\n### 🌠 Install via PyPi\n```bash\npip install dataverse\n```\n\n<br>\n\n## 🌌 Quickstart\nVarious and more detailed tutorials are [here](https://github.com/UpstageAI/dataverse/tree/main/examples).\n\n- [add_new_etl_process.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_04_add_new_etl_process.ipynb) : If you want to use your custom function, you have to register the function on Dataverse. This will guide you from register to apply it on pipeline.\n- [test_etl_process.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_05_test_etl_process.ipynb) : When you want to get test(sample) data to quickly test your ETL process, or need data from a certain point to test your ETL process.\n- [scaleout_with_EMR.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_06_scaleout_with_EMR.ipynb) : For people who want to run their pipeline on EMR cluster.\n\n\n<details>\n    <summary><u>Detail to the example etl configure.</u></summary>\n    <ul></ul>\n    <ul>\n        <li style=\"line-height:250%;\"> <b>data_ingestion___huggingface___hf2raw </b></li>\n        Load dataset from <a href=\"https://huggingface.co/datasets/allenai/ai2_arc\">Hugging Face</a>, which contains a total of 2.59k rows.\n    </ul>\n    <ul>\n        <li style=\"line-height:250%;\"> <b>utils___sampling___random </b></li>\n        To decrease the dataset size, randomly subsample 50% of data to reduce the size of dataset, with a default seed value of 42. <br/>\n        This will reduce the dataset to 1.29k rows. \n    </ul>\n    <ul>\n        <li style=\"line-height:250%;\"> <b>deduplication___minhash___lsh_jaccard </b></li>\n        Deduplicate by <code>question</code> column, 5-gram minhash jaccard similarity threshold of 0.1.\n    </ul>\n    <ul>\n        <li style=\"line-height:250%;\"> <b>data_save___parquet___ufl2parquet </b></li>\n        Save the processed dataset as a Parquet file to <code>./guideline/etl/sample/quickstart.parquet</code>.<br/>\n        The final dataset comprises around 1.14k rows.\n    </ul>\n</details>\n\n```python\n# 1. Set your ETL process as config.\n\nfrom omegaconf import OmegaConf\n\nETL_config = OmegaConf.create({\n    # Set up Spark\n    'spark': { \n        'appname': 'ETL',\n        'driver': {'memory': '4g'},\n    },\n    'etl': [\n        { \n          # Extract; You can use HuggingFace datset from hub directly!\n          'name': 'data_ingestion___huggingface___hf2raw', \n          'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}\n        },\n        {\n          # Reduce dataset scale\n          'name': 'utils___sampling___random',\n          'args': {'sample_n_or_frac': 0.5}\n        },\n        {\n          # Transform; deduplicate data via minhash\n          'name': 'deduplication___minhash___lsh_jaccard', \n          'args': {'threshold': 0.1,\n                  'ngram_size': 5,\n                  'subset': 'question'}\n        },\n        {\n          # Load; Save the data\n          'name': 'data_save___parquet___ufl2parquet',\n          'args': {'save_path': './guideline/etl/sample/quickstart.parquet'}\n        }\n      ]\n  })\n```\nAbove code block is an example of an ETL process in Dataverse. In Dataverse, the available registered ETL functions are referred to as `blocks`, and this example is comprised of four blocks. You can freely combine these blocks using config to create the ETL processes for your needs. The list of available functions and args of them can be found in the [API Reference](https://data-verse.readthedocs.io/en/latest/). Each functions 'args' should be added in dictionary format.\n\n```python\n# 2. Run ETLpipeline.\n\nfrom dataverse.etl import ETLPipeline\n\netl_pipeline = ETLPipeline()\nspark, dataset = etl_pipeline.run(config=ETL_config, verbose=True)\n```\nETLPipeline is an object designed to manage the ETL processes. By inserting `ETL_config` which is defined in the previous step into ETLpipeline object and calling the `run` method, stacked ETL blocks will execute in the order they were stacked.\n\n```python\n# 3. Result file is saved on the save_path\n```\nAs the example gave `save_path` argument to the last block of `ETL_config`, data passed through the process will be saved on the given path.\n\n<br>\n\n## 🌌 Modules\nCurrently, about 50 functions are registered as the ETL process, which means they are eagerly awaiting your use!\n| Type      | Package         | description                                                                                       |\n|-----------|-----------------|---------------------------------------------------------------------------------------------------|\n| Extract   | data_ingestion  | Loading data from any source to the preferred format                                              |\n| Transform | bias            | (WIP) Reduce skewed or prejudiced data, particularly data that reinforce stereotypes.             |\n|           | cleaning        | Remove irrelevant, redundant, or noisy information, such as stop words or special characters.     |\n|           | decontamination | (WIP) Remove contaminated data including benchmark.                                               |\n|           | deduplication   | Remove duplicated data, targeting not only identical matches but also similar data.               |\n|           | pii             | PII stands for Personally Identifiable Information. Removing sensitive information from data.     |\n|           | quality         | Improving the data quality, in the perspective of accuracy, consistency, and reliability of data. |\n|           | toxicity        | (WIP) Removing harmful, offensive, or inappropriate content within the data.                      |\n| Load      | data_save       | Saving the processed data to a preferred source like data lake, database, etc.                    |\n| Utils     | utils           | Essential tools for data processing, including sampling, logging, statistics, etc.                |\n<br>\n\n## 🌌 Dataverse supports AWS\nDataverse works with AWS S3 and EMR, enabling you to load and save data on S3 and execute ETL pipelines through EMR. Step by step guide to setting up is [here](https://data-verse.gitbook.io/docs/lets-start/aws-s3-support).\n</br>\n\n## 🌌 Dataverse use-case\n> If you have any use-cases of your own, please feel free to let us know. </br>We would love to hear about them and possibly feature your case.\n\n\n*✨* [`Upstage`](https://www.upstage.ai/) is using Dataverse for preprocessing the data for the training of [Solar Mini](https://console.upstage.ai/services/solar?utm_source=upstage.ai&utm_medium=referral&utm_campaign=Main+hero+Solar+card&utm_term=Try+API+for+Free&utm_content=home). </br>\n*✨* [`Upstage`](https://www.upstage.ai/) is using Dataverse for preprocessing the data for the [Up 1T Token Club](https://en.content.upstage.ai/1tt).\n\n\n\n## 🌌 Contributors\n<a href=\"https://github.com/UpstageAI/dataverse/graphs/contributors\">\n  <img src=\"https://contrib.rocks/image?repo=UpstageAI/dataverse\" />\n</a>\n\n## 🌌 Acknowledgements\n\nDataverse is an open-source project orchestrated by the **Data-Centric LLM Team** at [`Upstage`](https://www.upstage.ai/), designed as an data ecosystem for LLM(Large Language Model). Launched in March 2024, this initiative stands at the forefront of advancing data handling in the realm of LLM.\n\n## 🌌 License\nDataverse is completely freely-accessible open-source and licensed under the Apache-2.0 license.\n\n\n## 🌌 Citation\nIf you want to cite our 🌌 Dataverse project, feel free to use the following bibtex. You can check our paper via [link](https://arxiv.org/abs/2403.19340).\n\n```bibtex\n@misc{park2024dataverse,\n      title={Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models}, \n      author={Hyunbyung Park and Sukyung Lee and Gyoungjin Gim and Yungi Kim and Dahyun Kim and Chanjun Park},\n      year={2024},\n      eprint={2403.19340},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n"
  },
  {
    "path": "contribution/CONTRIBUTING.md",
    "content": "# __Contribution Guidelines__\nWelcome to _Dataverse_! We warmly welcome any kind of contribution 😊✨. </br>\nThis page provides an outline on how to contribute to _Dataverse_ and suggestions for nice conventions to follow. \n> __These are guidelines, NOT rules 💡__ <p>\nThis page is not the Constituion of the _Dataverse_. We are providing guidelines to help you make a useful and efficient contribution to _Dataverse_. While we think these guidelines are sensible and we appreciate when they are observed, following them isn't strictly required. We hope you won't be tired by these guidelines. Also, we'd love to hear your ideas on how to improve our guidelines! \n\n</br>\n\n# Table of Contents\n- [Questions or Feedback](#questions-or-feedback)\n- [🤝 How to Contribute?](#how-to-contribute)\n- [Tests](#tests)\n- [Directory of Dataverse](#directory-of-dataverse)\n- [Design Philosophy](#design-philosophy)\n- [Commit Guidelines](#commit-guidelines)\n- [Style Guides](#style-guides)\n\n</br>\n\n# Questions or Feedback\nJoin the conversation on our GitHub discussion board! It's the go-to spot for questions, chats, and a helping hand from the _Dataverse_ community. Drop by and say hello here: [link](https://github.com/UpstageAI/dataverse/discussions)\n\nAnd if there's a shiny new feature you're dreaming of, don't be shy—head over to our [issue page](https://github.com/UpstageAI/dataverse/issues) to let us know! Your input could help shape the future. ✨\n\n</br>\n\n# How to Contribute?\n- Any kind of improvement of document: fixing typo, enhancing grammar or semantic structuring or adding new examples.\n- Submit issues related to bugs, new desired features, or enhancement of existing features.\n- Fix a bug, implement new feature or improving existing feature.\n- Answer other users' question or help.\n\n## __Documentation__\nWe appreciate all the pull requests to fix typo / improve grammar or semantic structuring of documents. Feel free to check! <br/>\nOur API reference page is constructed with [Sphinx](https://www.sphinx-doc.org/en/master/). We adhere to the [Google style for docstrings](https://google.github.io/styleguide/pyguide.html) as a fundamental practice, so please follow this format. The source files are located within the `docs/source/` directory.\n\n## __Report a Bug / Request New Feature / Suggest Enhancements__\nPlease open an issue whenever you find a bug or have an idea to enhance _Dataverse_. Maintainers will label it or leave comment on it as soon as they check the issue. Issues labeled as `Open for contribution` mean they are open for contribution.\n\n## __Fix a Bug / Add New Feature / Improve Existing Feature__\nIf you have a particular roadmap, goals, or new feature, share it via issue. When you already fixed a bug or have new feature that enhances _Dataverse_, you can jump on to fourth step which is opening pull requests. Please note that when you open pull requests without opening an issue or maintainers' check, it can be declined if it does not aligh with philosophy of _Dataverse_.\n\n### __1️⃣ Check issues labeled as__ `Open for contribution`\nYou can find issues waiting for your contribution by filtering label with `Open for contribution`. This label does not stand alone. It is always with `Bug`, `Docs` or `Enhancement`. Issues with `Critical` or `ASAP` label are more urgent. \n\n\n### __2️⃣ Leave a comment on the issue you want to contribute__\nOnce we review your comment, we'll entrust the issue to you by swapping out the `Open for contribution` label for a `WIP` (Work in Progress) label.\n\n### __3️⃣ Work on it__\nBefore diving into coding, do take a moment to familiarize yourself with our coding style by visiting this [style guides](#style-guides). And hey, if you hit a snag while tackling the issue, don't hesitate to drop a comment right there. Our community is a supportive bunch and will jump in to assist or brainstorm with you.\n\n1. Fork the repository of _Dataverse_.\n2. Clone your fork to your local disk.\n3. Create a new branch to hold your develompment changes. </br>\nIt's not required to adhere strictly to the branch naming example provided; consider it a mild suggestion.\n```\ngit checkout -b {prefix}/{issue-number}-{description}\n```\n4. Set up a development environment\n5. Develop the features in your branch\n\n\n### __4️⃣ Create a Pull Request__\nGo ahead and visit your GitHub fork, then initiate a pull request — it's time to share your awesome work! Before you do, double-check that you've completed everything on the checklist we provided. Once you're all set, submit your contributions for the project maintainers to review.\n\nDon't worry if the maintainers have some feedback or suggest changes—it's all part of the process and happens to even our most experienced contributors. Keep your updates flowing by working in your local branch and pushing any new changes to your fork. Your pull request will update automatically for everyone to see the progress.\n\n</br>\n\n# Tests\nThe Dataverse test framework is built using [pytest](https://docs.pytest.org/en/8.0.x/). Ensure that you write a corresponding test for any new features or changes you make. You'll find the test files in the `dataverse/dataverse/tests` directory.\n\n- Create a new test file if you've introduced a new category or a sub-category for the ETL process.\n- If your addition is a new feature within an existing category or sub-category, include your tests in the existing test file.\n\n</br>\n\n# Directory of Dataverse\nFor _Dataverse_'s overarching goals: check the [docs](https://data-verse.gitbook.io/docs#future-work)\n\n```{plain text}\n📦 dataverse/dataverse\n ┣ 📂 api\n ┣ 📂 config\n ┃ ┣ 📂 etl\n ┃ ┃ ┗ 📂 sample\n ┣ 📂 etl\n ┃ ┣ 📂 {CATEGORY}\n ┣ 📂 lab\n ┣ 📂 tests\n ┗ 📂 utils\n```\n- [`📂 api`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/api): The Dataverse API serves as a\ngateway for users.\n- [`📂 config`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/config): Contains configuration files for the Dataverse application. You can also find sample configuration file for etl process under this directory.\n- [`📂 etl`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/etl): Main directory of _Dataverse_ where all of the data processors are placed. Data processors are separated with it's category.\n- [`📂 lab`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/lab): TBD. Data analysis will be supported via here.\n- [`📂 tests`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/tests): Pytest files \n- [`📂 utils`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/utils): The Utilities module functions as a collection of internal helper tools. Its key features include API utilities that simplify interaction with various external APIs, including AWS EMR. Please be aware that another utils module is also included within the etl module.\n\n</br>\n\n# Design Philosophy\n- [Principles for Configuration](#principles-for-configuration)\n- [Principles for ETL Process](#principles-for-etl-process)\n\n## Principles for Configuration\n1. `One file` rules `ALL`\n2. `10 Seconds` to know what is going on\n\n#### 1. `One file` rules `ALL`\nOne cycle of ETL, Analyzer, etc. which we could call one job, will be controled by one configuration file. We are not going to use multiple configuration files to composite one big configuration file.\n\n#### 2. `10 Seconds` to know what is going on\nThe reader should be able to know what is going on in the configuration file within 10 seconds. This is to make sure the configuration file is easy and small enough to read and understand.\n\n## Principles for ETL Process\n> When you create your own ETL process, you should follow the following principles\n\n1. No `DRY` (Don't Repeat Yourself)\n2. One file Only\n\n\n#### 1. No `DRY` (Don't Repeat Yourself)\n> No `DRY` is applied between **ETL sub-categories**.\n- So if similar ETL processes are used in same sub-categories, it could be shared.\n- But if it's used in different sub-categories, it should not be shared.\n\nAs you can see in the following example, there are 2 ETL processes `common_process_a` and `common_process_b`seems nice to be shared. But as you can see, they are not shared. They are repeated. This is because of the No `DRY` principle.\n\n\n```python\n- deduplication/\n    - exact.py\n        - \"def common_process_a():\"\n        - \"def common_process_b():\"\n        - def deduplication___exact___a():\n    - exact_datasketch.py\n        - \"def common_process_a():\"\n        - \"def common_process_b():\"\n        - def deduplication___exact_datasketch___a():\n        - def deduplication___exact_datasketch___b():\n```\n\n#### 2. One file Only\nCode that ETL process uses should be in the same file. This is because of the `One file Only` principle. Except **ETL Base class, few required utils functions, and open sources** there should be no dependency outside the file.\n\n```python\n# This is OK ✅\n- deduplication/\n    - exact.py\n        - def helper_a():\n        - def helper_b():\n        - def etl_process():\n            helper_a()\n            helper_b()\n\n                    \n# This is not allowed ❌\n- deduplication/\n    - helper.py\n        - def helper_a():\n        - def helper_b():\n    - exact.py\n        from helper import helper_a\n        from helper import helper_b\n\n        - def etl_process():\n            helper_a()\n            helper_b()\n```\nETL process itself is meant to be built to be used in various combination of ETL pipeline **So try to make it as generic as possible.** \n\n</br>\n\n# Commit Guidelines\n### Commit strategy\n- Avoid mixing multiple, unrelated modifications in a single commit. One commit is related with one issue.\n- Each commit should encapsulate a complete, autonomous upgrade to the code.\n\n### Commit messages\nPlease make sure your commit messages follow `type`: `title (#<related issue number>)` format. <br/>\nFor example:\n```plain text\n<TYPE>: Short summary with 72 characters or less (#<Issue number>)\n\nIf you have more detalied explanatory text, put it as body.\nBut the body is optional.\n```\n- Find adequate type in the below list:\n    - `NEW`: introducing a new feature\n    - `ENHANCE`: improve an existing code/feature.\n    - `FIX`: fix a code bug\n    - `DOCS`: write/update/add any kind of documents including docstring\n    - `REFACTOR`: refactor existing code without any specific improvements\n    - `STYLE`: changes that do not affect the meaning of the code (ex. white-space, line length)\n    - `TEST`: add additional testing\n    - `DEL`: remove code or files\n    - `RELEASE`: release new version of dataverse\n    - `OTHER`: anything not covered above (not recommended)\n- Use the present tense (\"Add feature\" not \"Added feature\")\n- Do not end the subject line with a punctuation\n\n</br>\n\n# Style Guides\n### Pre-commit hook\nWe provide a pre-commit git hook for style check. You can find exact check list in this [file](https://github.com/UpstageAI/dataverse/blob/main/.pre-commit-config.yaml). <br/> Please run the code below before a commit is created:\n```bash\npre-commit run\n```\n\n"
  },
  {
    "path": "dataverse/README.md",
    "content": "# Dataverse\n> The Universe of Data\n\n\n## 🌌 Config\n> Config for the Dataverse\n\n## 🌌 API\n> Interface of Dataverse for external use\n\n## 🌌 ETL\n> ETL pipeline (Extract, Transform, Load)\n\n## 🌌 LAB\n> Data Analysis & Visualization\n\n## 🌌 Utils\n> Common utilities used internally for Dataverse"
  },
  {
    "path": "dataverse/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/api/README.md",
    "content": "# API (Application Programming Interface)\n> Interface with ease and efficiency"
  },
  {
    "path": "dataverse/api/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/api/cli.py",
    "content": "\n\"\"\"\nmain entry point for the dataverse CLI tool\n\"\"\"\n\nfrom dataverse.utils.setting import SystemSetting\n\n\ndef main():\n    \"\"\"Main entry point for the cli.\"\"\"\n    print(\"🌌 Hello Welcome to Dataverse! 🌌\")\n    print(\"=\" * 50)\n    print(\"We are still under construction for CLI!\")\n    print(\"=\" * 50)\n    print(\"QUARK - By Ducky 🦆\")\n\n    # set the system setting to CLI mode\n    SystemSetting().IS_CLI = True"
  },
  {
    "path": "dataverse/api/emr.py",
    "content": "\n\"\"\"\nAPI to use AWS EMR with spark-submit\n\"\"\"\n\nimport os\nimport argparse\nimport importlib\n\nfrom dataverse.etl import ETLPipeline\n\ndef import_dynamic_etls():\n    \"\"\"\n    Import dynamic etls which was created by user.\n    \"\"\"\n    dynamic_etl_path = \"/home/hadoop/dataverse/dynamic_etl\"\n    try:\n        files = os.listdir(dynamic_etl_path)\n    except FileNotFoundError:\n        return\n    except Exception as e:\n        raise e\n\n    # Filter out non-Python files\n    files = [f for f in files if f.endswith('.py')]\n\n    # Dynamically import all Python files in the directory\n    for file in files:\n        file_path = os.path.join(dynamic_etl_path, file)\n\n        # Remove .py at the end\n        module_name = file[:-3]\n\n        spec = importlib.util.spec_from_file_location(module_name, file_path)\n        module = importlib.util.module_from_spec(spec)\n        spec.loader.exec_module(module)\n\n\ndef main(config, verbose=False):\n    \"\"\"Main entry point for the aws emr.\"\"\"\n    etl_pipeline = ETLPipeline()\n    import_dynamic_etls()\n    spark, data = etl_pipeline.run(config=config, verbose=verbose)\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--config\", help=\"config file path\")\n    parser.add_argument(\"--verbose\", action='store_true')\n    args = parser.parse_args()\n    main(args.config, args.verbose)\n"
  },
  {
    "path": "dataverse/config/README.md",
    "content": "# Configuration\n> This directory contains configuration files for the Dataverse application\n\n\n## 🌌 How to use\n\n### 🌠 Load pre-built configuration\n> you can load the pre-built configuration from path, or dict, or OmegaConf\n\n#### Load from local path\n```python\nfrom dataverse.config import Config\n\nconfig = Config.load('path/to/config.yaml')\n```\n\n#### Load from AWS S3\n> you need to set aws credential with `aws configure` to use this feature\n\n```python\nfrom dataverse.config import Config\n\nconfig = Config.load('s3://path/to/config.yaml')\n```\n\n#### Load from dict\n```python\nconfig = Config.load({\n    \"spark\": {\"appname\": \"README.md example\"}\n    \"etl\": [\n        {\"name\": \"...\", \"args\": \"...\"},\n        {\"name\": \"...\", \"args\": \"...\"},\n    ]\n})\n```\n\n### 🌠 Set the empty args with `default` value\n> the args you already set will not be changed to default\n\n```python\nfrom dataverse.config import Config\n\nconfig = Config.load('path/to/config.yaml')\nconfig = Config.set_default(config)\n```\n\n### 🌠 Get `Default` configuration\n> `default` configuration has no `etl` pre-defined\n\n```python\nfrom dataverse.config import Config\n\nconfig = Config.default()\n```\n\n\n## 🌌 About Configuration\n\n### 🌠 Why configuration is just `OmegaConf`?\n> To make it simple and easy to use. We are not going to inherit some other `base` class to make it complicated. But still `Config` interface is provided as a helper for to load, save, set default, etc.\n\n### 🌠 2 Rules for configuration\n1. `One file` rules `ALL`\n2. `10 Seconds` to know what is going on\n\n#### `One file` rules `ALL`\nOne cycle of ETL, Analyzer, etc. which we could call one job, will be controled by one configuration file. We are not going to use multiple configuration files to composite one big configuration file.\n\n#### `10 Seconds` to know what is going on\nThe reader should be able to know what is going on in the configuration file within 10 seconds. This is to make sure the configuration file is easy and small enough to read and understand.\n\n\n### 🌠 What open source to choose for configuration?\n> **`omegaconf`**\n\n- `OmegaConf`\n    - For ease understanding & usage\n    - Omegaconf supports yaml, dict, json and even `dataclass` from python.\n- `hydra`\n    - hydra was also our candidate but to make it simple we are using OmegaConf. \n    - hydra requires multiple configuration files to composite one big configuration file\n    - also many people find out using hydra itself took quite a time just to understand\n"
  },
  {
    "path": "dataverse/config/__init__.py",
    "content": "\nfrom .interface import Config"
  },
  {
    "path": "dataverse/config/interface.py",
    "content": "\"\"\"\nInterface to check & load the configurations for installation environment\n\nawesome_config = Config.load(\"/path/to/ducky_awesome_config.yaml\")\nawesome_config = Config.load({awesome: config})\n\"\"\"\n\nimport re\nimport boto3\n\nfrom pathlib import Path\nfrom typing import Union\nfrom omegaconf import OmegaConf\nfrom omegaconf import DictConfig\n\nfrom dataverse.utils.setting import SystemSetting\nfrom dataverse.utils.api import aws_s3_read\nfrom dataverse.utils.api import aws_s3_write\nfrom pathlib import Path\n\n\n\nclass Config:\n    \"\"\"\n    Interface to check & load the configurations\n    \n    This class provides a lightweight wrapper for OmegaConf and allows checking and loading configurations.\n    It supports loading configurations from various sources such as files, AWS S3, and config strings.\n    The class also provides methods for saving configurations and setting default values for missing config arguments.\n    \"\"\"\n    def __new__(cls, *args, **kwargs):\n        raise NotImplementedError(\"Config is not allowed to be instantiated\")\n\n    @classmethod\n    def load(cls, config: Union[str, dict, DictConfig, OmegaConf, Path]):\n        \"\"\"\n        Load the configuration for the etl.\n\n        Args:\n            config (Union[str, dict, OmegaConf]): The configuration for the etl.\n                - str or Path: This could have several cases:\n                    - Path to the config file.\n                    - S3 path to the config file.\n                    - Config string. This is similar to loading a `yaml` file with `open()`.\n                - dict: Config dictionary.\n                - OmegaConf: Config object.\n\n        Returns:\n            The loaded configuration.\n\n        Raises:\n            ValueError: If the provided config is not a valid path or S3 path.\n            TypeError: If the provided config is not of type str, dict, or OmegaConf.\n        \"\"\"\n        if isinstance(config, (str, Path)):\n            if isinstance(config, Path):\n                config = str(config)\n\n            # Local File\n            if Path(config).is_file():\n                config = OmegaConf.load(config)\n\n            # AWS S3\n            elif config.startswith(('s3://', 's3a://', 's3n://')):\n                aws_s3_matched = re.match(r's3[a,n]?://([^/]+)/(.*)', config)\n                if aws_s3_matched:\n                    bucket, key = aws_s3_matched.groups()\n                    config_content = aws_s3_read(bucket, key)\n                    config = OmegaConf.create(config_content)\n                else:\n                    # Assume it's a config string that starts with s3\n                    config_str = config\n                    config = OmegaConf.create(config_str)\n\n                    # Check if it's a config string or not\n                    # In case of a config string, it should create a config object\n                    # If not, it will create {'config': None}\n                    if config_str in config and config[config_str] is None:\n                        raise ValueError(f\"config {config_str} is not a valid s3 path\")\n            \n            # String Config\n            else:\n                # Assume it's a config string\n                config_str = config\n                config = OmegaConf.create(config_str)\n\n                # Same as above, check if it's a config string or not\n                if config_str in config and config[config_str] is None:\n                    raise ValueError(f\"config {config_str} is not a valid path\")\n\n        elif isinstance(config, dict):\n            config = OmegaConf.create(config)\n        elif isinstance(config, (OmegaConf, DictConfig)):\n            pass\n        else:\n            raise TypeError(f\"config should be str, dict, or OmegaConf but got {type(config)}\")\n\n        return config\n\n    @classmethod\n    def save(cls, config, path: Union[str, Path]):\n        \"\"\"\n        Saves the configuration to a specified path.\n\n        Args:\n            config: The configuration to be saved.\n            path (Union[str, Path]): The path where the configuration should be saved.\n\n        Raises:\n            ValueError: If the provided path is not a valid S3 path.\n        \"\"\"\n        if path.startswith(('s3://', 's3a://', 's3n://')):\n            aws_s3_matched = re.match(r's3[a,n]?://([^/]+)/(.*)', path)\n            if aws_s3_matched:\n                bucket, key = aws_s3_matched.groups()\n                aws_s3_write(bucket, key, config)\n            else:\n                raise ValueError(f\"config path {path} is not a valid s3 path\")\n        else:\n            OmegaConf.save(config, Path(path))\n\n    @classmethod\n    def default(cls, emr: bool = False):\n        \"\"\"\n        Fill the missing config with default values.\n\n        Args:\n            emr (bool, optional): Flag indicating whether the config is for EMR. Defaults to False.\n\n        Returns:\n            dict: Default configuration dictionary.\n        \"\"\"\n        local_dir = f\"{SystemSetting().CACHE_DIR}/.cache/dataverse/tmp\"\n\n        default = OmegaConf.create({\n            'spark': {\n                'master': 'local[10]',\n                'appname': 'default',\n                'driver': {\n                    'memory': '8G',\n                    'maxResultSize': '2G',\n                },\n                'executor': {'memory': '1G'},\n                'local': {'dir': local_dir},\n                'ui': {'port': 4040},\n            },\n            'etl': [],\n        })\n\n        if emr:\n            default.update({\n                'emr': {\n                    'id': None,\n                    'working_dir': None,\n                    'name': 'dataverse_emr',\n                    'release': 'emr-6.15.0',\n                    'idle_timeout': 3600,\n\n                    # master (driver)\n                    'master_instance': {\n                        'type': None,\n                    },\n\n                    # core (data node)\n                    'core_instance': {\n                        'type': None,\n                        'count': 2,\n                    },\n\n                    # task (executors)\n                    'task_instance': {\n                        'type': None,\n                        'count': 0,\n                    },\n\n                    # EMR cluster created by dataverse or user\n                    'auto_generated': None,\n\n                    # iam\n                    'role': {\n                        'ec2': {\n                            'name': None,\n                            'policy_arns': None,\n                        },\n                        'emr': {\n                            'name': None,\n                            'policy_arns': None,\n                        }\n                    },\n                    'instance_profile': {\n                        'name': None,\n                        'ec2_role': None,\n                    },\n\n                    # TODO: allow more options to customize e.g. cidr, tag, etc.\n                    #       but make sure vpc is temporary and not shared\n                    'vpc': {\n                        'id': None,\n                    },\n                    'subnet': {\n                        'id': None,\n                        'public_id': None,\n                        'private_id': None,\n                        'public': True,\n                    },\n                    'security_group': {\n                        'id': None,\n                    },\n                    'gateway': {\n                        'id': None,\n                    },\n                    'route_table': {\n                        'id': None,\n                    },\n                    'elastic_ip': {\n                        'id': None,\n                    },\n                    'nat_gateway': {\n                        'id': None,\n                    },\n                }\n            })\n\n        return default\n\n    @classmethod\n    def set_default(cls, config: OmegaConf, emr: bool = False):\n        \"\"\"\n        Sets the missing config arguments with default values.\n\n        Args:\n            config (OmegaConf): The configuration object to merge with default values.\n            emr (bool, optional): Whether to use EMR configuration. Defaults to False.\n\n        Returns:\n            OmegaConf: The merged configuration object.\n\n        \"\"\"\n        return OmegaConf.merge(cls.default(emr=emr), config)\n"
  },
  {
    "path": "dataverse/etl/README.md",
    "content": "# ETL (Extract, Transform, Load)\n> Dataverse ETL is \"Block-based coding powered by Spark\"\n\n- Each block is called `ETL process`\n- Combination of ETL processes is called `ETL pipeline`\n- ETL pipeline is managed by `config` file\n\n\n## 🌌 What is ETL process?\n> ETL process is the small code snippet, that is considered as a single unit of ETL pipeline. It is meant to be form various combinations to accommodate different kinds of data sources and transformations in ETL pipeline so it should be as generic as possible.\n\n```python\ndef ETL_process(data, config):\n    return data\n```\n\n\n## 🌌 What is ETL pipeline?\n> ETL pipeline is the sequence of ETL processes.\n\n```python\ndata = ETL_process_1()\ndata = ETL_process_2(data)\ndata = ETL_process_3(data)\n```\n\n\n## 🌌 How to run ETL Pipeline?\n> Define the ETL process, and add in the config file to run the ETL pipeline.\n\n```python\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.config import Config\n\n# 1. Define the ETL process in the config file\nconfig = Config.load(\"TBD\")\nconfig = Config.set_default(config)\n\n# 2. Run the ETL pipeline\netl_pipeline = ETLPipeline()\nspark, data = etl_pipeline.run(config)\n```\n\n### 🌠 What is returned after running ETL pipeline?\n> `spark` and `data` is returned after running ETL pipeline\n\n- `spark` - spark session\n- `data` - data after running ETL pipeline\n\n\n#### `spark` status depends on the last ETL process\n- `data_load` ETL process at the end\n    - spark will be terminated\n- otherwise\n    - spark will be alive\n    - you can use `spark` to do whatever you want\n\n\n## 🌌 How to add new ETL process?\n> ETL is managed by registry. Whatever ETL you make, you need to register it to registry.\n\n### 🌠 Choose what `Category` & `Sub-Category` to put your ETL process\n> First you need to check the category and sub-category of the ETL process you want to add. \n\n```python\n======================================\n- etl/\n    - CATEGORY/\n        - __init__.py\n        - SUBCATEGORY.py\n            - def CATEGORY___SUBCATEGORY___ETL_PROCESS()\n======================================\n```\n\n- `category` is the folder. This is pre-defined and you can add a new category if needed. **Check below to learn more about category**\n- `sub-category` is the python file. This is not pre-defined and you have to decide which name could be appropriate for the ETL process you want to add.\n\nNow when you know the `category` and `sub-category`, you can add a new ETL process.\nThere are only one way to add a new ETL process\n\n### 🌠 Use decorator `@register_etl` to register your ETL `function`\n\n```python\n# check the __sample/ folder for example\nfrom dataverse.etl import register_etl\n\n@register_etl\ndef category___subcategory___etl(rdd, config):\n    # do something\n    return rdd\n```\n\n#### ☣️ Inheriting `BaseETL` is NOT ALLOWED ☣️\n```python\nfrom dataverse.etl import BaseETL\nclass category___subcategory___etl(BaseETL):\n    def run(rdd, config):\n        # do something\n        return rdd\n```\n\n### 🌠 ETL Process Class Naming Convention\n> This shared the same documentary with README.md in `__sample/` folder\n\n<details>\n\n```python\n[ETL Category]___[ETL Sub-Category]___[ETL Name]\n======================================\n- \"__sample/\"\n    - github.py\n        - def __sample___github___remove_url()\n        - def __sample___github___filter_by_stars()\n- \"bias/\"\n    - mmlu.py\n        - def bias___mmlu___remove_word()\n        - def bias___mmlu___to_parquet()\n    - ducky.py\n        - def bias___ducky___fly()\n        - def bias___ducky___quark()\n======================================\n```\n\n> caveat: the combination of `[ETL Category]___[ETL Sub-Category]___[ETL Name]` MUST be unique\n\n1. `[ETL Category]` is the folder and category where the ETL is defined\n    - `[ETL Category]` MUST be one of the following pre-defined list\n        - `cleaning`\n        - `decontamination`\n        - `deduplication`\n        - `data_ingestion`\n        - `pil`\n        - `quality`\n        - `toxicity`\n        - `bias`\n        - `data_load`\n        - `utils`\n2. `[ETL Sub-Category]` is the name of the file where the ETL is defined\n    - no pre-defined list\n        - it could be a dataset name\n        - or a nickname of yours\n        - or whatever you think it's appropriate\n    - e.g. `github` or `kaggle` or `mmlu` whatever you want\n3. `[ETL Name]` naming should follow `function` naming convention, even it's `class`\n    - all lower case\n    - use underscore `_` to separate words\n4. Each is separated by `___` (triple underscore)\n    - e.g. `bias___mmlu___remove_word()`\n\n\n#### Why does folder, file name included in the ETL class name?\n- To avoid the following tmp names on dynamic construction of ETL class\n    - e.g. `tmp___ipykernel_181248___remove_url` <- jupyter notebook env\n    - e.g. `python3.10___abc___remove_url` <- dynamic class construction by `type`\n- so decided to control the name space by only `ETL class name` which includes folder, file name\n\n</details>\n\n## 🌌 Principles for ETL Process\n> When you create your own ETL process, you should follow the following principles\n\n1. No `DRY` (Don't Repeat Yourself)\n2. One file Only\n\n\n### 🌠 No `DRY` (Don't Repeat Yourself)\n> No `DRY` is applied between **ETL sub-categories**.\n- So if similar ETL processes are used in same sub-categories, it could be shared.\n- But if it's used in different sub-categories, it should not be shared.\n\nAs you can see in the following example, there are 2 ETL processes `common_process_a` and `common_process_b`seems nice to be shared. But as you can see, they are not shared. They are repeated. This is because of the No `DRY` principle.\n\n\n```python\n- deduplication/\n    - exact.py\n        - \"def common_process_a():\"\n        - \"def common_process_b():\"\n        - def deduplication___exact___a():\n    - exact_datasketch.py\n        - \"def common_process_a():\"\n        - \"def common_process_b():\"\n        - def deduplication___exact_datasketch___a():\n        - def deduplication___exact_datasketch___b():\n```\n\n### 🌠 One file Only\nCode that ETL process uses should be in the same file. This is because of the `One file Only` principle. Except **ETL Base class, few required utils functions, and open sources** there should be no dependency outside the file.\n\n```python\n# This is OK ✅\n- deduplication/\n    - exact.py\n        - def helper_a():\n        - def helper_b():\n        - def etl_process():\n            helper_a()\n            helper_b()\n\n                    \n# This is not allowed ❌\n- deduplication/\n    - helper.py\n        - def helper_a():\n        - def helper_b():\n    - exact.py\n        from helper import helper_a\n        from helper import helper_b\n\n        - def etl_process():\n            helper_a()\n            helper_b()\n\n```\n\nETL process itself is meant to be built to be used in various combination of ETL pipeline **So try to make it as generic as possible.** 😊\n\n\n\n## 🌌 How to use ETL Process by Configuration\n> Now let's learn how to use ETL process by configuration\n\n### 🌠 Register ETL process\n> This is same as above. Register ETL process using `@register_etl` decorator\n\n```python\nfrom dataverse.etl import register_etl\n\n@register_etl\ndef etl_process_start(spark, load_path, repartition=3):\n    data = spark.read.load(load_path).repartition(repartition)\n    return data\n\n@register_etl\ndef etl_process_middle(data, threshold=0.5):\n    data = data.filter(data['stars'] > threshold)\n    return data\n\n@register_etl\ndef etl_process_end(data, save_path, repartition=1):\n    data.repartition(repartition).write.save(save_path)\n    return None\n```\n\n### 🌠 Define ETL process in the config file\nYou can use the following config to run the above ETL processes in order\n- `etl_process_start` -> `etl_process_middle` -> `etl_process_end`\n\n```yaml\nspark:\n  appname: dataverse_etl_sample\n  driver:\n    memory: 4g\netl:\n  - name: etl_process_start\n    args:\n      load_path: ./sample/raw.parquet\n      repartition: 3\n  - name: etl_process_middle\n    args:\n      threshold: 0.5\n  - name: etl_process_end\n    args:\n      save_path: ./sample/ufl.parquet\n      repartition: 1\n```\n\n**Check the following real example for more details**\n- Config located at `dataverse/config/etl/sample/ETL___one_cycle.yaml`\n\n```yaml\nspark:\n  appname: dataverse_etl_sample\n  driver:\n    memory: 16g\netl:\n  - name: data_ingestion___test___generate_fake_ufl\n  - name: utils___sampling___random\n    args:\n      sample_n_or_frac: 0.1\n  - name: deduplication___minhash___lsh_jaccard\n  - name: data_load___huggingface___ufl2hf_obj\n```\n\n\n## 🌌 How to add a new ETL Category\n\n### 🌠 Add a new folder to `etl/` folder\n\n```python\n======================================\n- etl/\n    - YOUR_NEW_CATEGORY/\n        - __init__.py\n        - YOUR_NEW_SUBCATEGORY.py\n    - data_ingestion/\n    ...\n======================================\n```\n\n### 🌠 Add a new category to `ETL_CATEGORY` in `registry.py`\n> Only added category will be recognized by the ETL pipeline\n\n```python\nETL_CATEGORIES = [\n    YOUR_NEW_CATEGORY,\n    'data_ingestion',\n    'decontamination',\n    'deduplication',\n    'bias',\n    'toxicity',\n    'cleaning',\n    'pii',\n    'quality',\n    'data_load',\n    'utils',\n]\n```\n\n### 🌠 Pre-defined ETL Categories\n\n```python\n======================================\n- etl/\n    - \"__sample/\"\n        - This is to show how to use the etl package\n    - \"data_ingestion/\"\n        - converting data from one format, schema to another\n    - \"data_load/\"\n        - saving data to desired location\n    - \"quality/\"\n        - improving data quality\n        - e.g. removing data with low quality\n    - \"cleaning/\"\n        - cleaning data\n        - e.g. removing HTML tags from text\n        - e.g. data normalization\n    - \"decontamination/\"\n        - removing contamination from data\n        - e.g. removing benchmark data from data\n    - \"deduplication/\"\n        - removing duplication inside data\n    - \"pii/\"\n        - removing PII from data\n    - \"bias/\" - \n        - removing bias from data\n        - e.g. removing data with gender bias words\n    - \"toxicity/\"\n        - removing toxic data\n        - e.g. removing data with toxic words\n    - \"utils/\"\n        - utilities for the ETL process\n        - e.g. sampling, logging, error handling, etc\n======================================\n```\n\n## 🌌 How to Ignore specific ETL Sub-Category\n> If you want to ignore some of the `ETL sub-category` python files, you can add the file name to `ETL_IGNORE` in `registry.py`\n\nwhen you want to make a file just for storage purpose, you can add the file name to `ETL_IGNORE` in `registry.py`\n\n```python\nETL_IGNORE = [\n    '__init__.py',\n    'storage.py'\n]\n```\n"
  },
  {
    "path": "dataverse/etl/__init__.py",
    "content": "\nfrom .registry import ETLRegistry\nfrom .registry import register_etl\nfrom .registry import BaseETL\nfrom .pipeline import ETLPipeline"
  },
  {
    "path": "dataverse/etl/__sample/README.md",
    "content": "# Sample\n> This is a showcase"
  },
  {
    "path": "dataverse/etl/__sample/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/__sample/ducky.py",
    "content": "\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import register_etl\n\nfrom typing import Union\n\n\n@register_etl\ndef __sample___ducky___make_your_own_etl_processor(data: Union[RDD, DataFrame], *args, **kwargs):\n    \"\"\"\n    decorator will convert this function to BaseETL class\n    \"\"\"\n    print(\"make_your_own_etl_processor\")\n    return data"
  },
  {
    "path": "dataverse/etl/__sample/github.py",
    "content": "\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import BaseETL\nfrom dataverse.etl import register_etl\nfrom dataverse.etl import ETLRegistry\nfrom dataverse.etl.registry import ETLStructure\n\nfrom typing import Union\n\n\n@register_etl\ndef __sample___github___using_decorator(data: Union[RDD, DataFrame], *args, **kwargs):\n    \"\"\"\n    decorator will convert this function to BaseETL class\n    \"\"\"\n    print(\"sample using decorator\")\n    return data\n\n@register_etl\ndef __sample___github___config(data: Union[RDD, DataFrame], config: dict = None, *args, **kwargs):\n    \"\"\"\n    decorator will convert this function to BaseETL class\n    \"\"\"\n    print(\"config says\", config)\n    return data\n\nif __name__ == \"__main__\":\n    registry = ETLRegistry()\n\n    print(\"[ Testing ] registry etl using decorator\")\n    # this could seem like a function but it is actually a BaseETL class\n    etl = __sample___github___using_decorator\n    etl()(data=None)\n    print(\"is subclass of ETLStructure?\", issubclass(etl, ETLStructure), \"\\n\")\n\n    print(\"[ Testing ] registry etl using decorator with config\")\n    etl = __sample___github___config\n    etl()(data=None, config={\"hello\": \"world\"})\n    print(\"is subclass of ETLStructure?\", issubclass(etl, ETLStructure), \"\\n\")\n\n    # check is it properly registryed\n    print(\"[ Testing ] check is it properly registry\")\n    print(\"=\"*50)\n    print(registry._registry)\n    print(\"=\"*50)"
  },
  {
    "path": "dataverse/etl/bias/README.md",
    "content": ""
  },
  {
    "path": "dataverse/etl/bias/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/cleaning/README.md",
    "content": "# Cleaning\n> Data normalization, removing noise, and other data cleaning tasks.\n\n\n## 🌌 Naming Convention\n> This is a strong recommendation. You can use your own naming convention if you want.\n\n```python\ndef cleaning___[ETL Sub-Category]___[ETL Process]()\n```\n\n- `ETL Sub-Category` - the data source to handle\n    - e.g. unicode\n    - e.g. char\n    - e.g. word\n    - e.g. number \n- `ETL process name` - purpose of the ETL process\n    - e.g. remove\n    - e.g. filter\n    - e.g. normalize"
  },
  {
    "path": "dataverse/etl/cleaning/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/cleaning/char.py",
    "content": "\"\"\"\nA collection of modules for cleaning data at the character level.\nFor example: whitespace, accent characters, and unprintable characters.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport re\nimport unicodedata\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef cleaning___char___normalize_whitespace(\n    spark, data: Union[RDD, DataFrame], subset: str = \"text\", *args, **kwargs\n) -> RDD:\n    r\"\"\"\n    Normalize whitespace.\n    - Strips the leading and trailing whitespaces.\n    - Replaces all consecutive whitespaces with a single space,\n    excluding ``\\n`` and ``\\r`` characters.\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed.\n        subset (str): A subset or column to consider. Defaults to 'text'.\n\n    Returns:\n        RDD: The processed data with normalized whitespace.\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    pattern = re.compile(r\"[^\\S\\r\\n]+\")\n\n    def _normalize_whitespace(row):\n        row[subset] = re.sub(pattern, \" \", row[subset].strip())\n        return row\n\n    data = data.map(_normalize_whitespace)\n\n    return data\n\n\n@register_etl\ndef cleaning___char___remove_unprintable(\n    spark, data: Union[RDD, DataFrame], subset=\"text\", *args, **kwargs\n) -> RDD:\n    \"\"\"\n    Remove all the non-printable characters.\n\n    Code is from facebookresearch/cc_net\n    https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed.\n        subset (str): A subset or column to consider. Defaults to 'text'.\n\n    Returns:\n        RDD: The processed data with unprintable characters are removed.\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _remove_non_printable_char(row):\n        new_lines = []\n        for line in row[subset].split(\"\\n\"):\n            new_lines.append(\n                re.sub(f\"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]\", \"\", line)\n            )\n        row[subset] = \"\\n\".join(new_lines)\n        return row\n\n    data = data.map(_remove_non_printable_char)\n\n    return data\n\n\ndef strip_accents(text: str) -> str:\n    \"\"\"Strips accents from a piece of text.\"\"\"\n    nfd = unicodedata.normalize(\"NFD\", text)\n    output = [c for c in nfd if unicodedata.category(c) != \"Mn\"]\n    if len(output) == text:\n        return text\n    return \"\".join(output)\n\n\n@register_etl\ndef cleaning___char___remove_accent(\n    spark, data: Union[RDD, DataFrame], subset: str = \"text\", *args, **kwargs\n) -> RDD:\n    \"\"\"Strips accents from a piece of text.\n\n        +--------+--------+\n        | input  | output |\n        +========+========+\n        | café   | cafe   |\n        | résumé | resume |\n        +--------+--------+\n\n    Code is from facebookresearch/cc_net\n    https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed.\n        subset (str): A subset or column to consider. Defaults to 'text'.\n\n    Returns:\n        The processed data with accents removed.\n\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _strip_accents(row):\n        row[subset] = strip_accents(row[subset])\n        return row\n\n    data = data.map(_strip_accents)\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/cleaning/document.py",
    "content": "\"\"\"\nA collection of modules for cleaning data at the document level.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef cleaning___document___split_by_word(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset: str = \"text\",\n    word_per_chunk: int = 100,\n    delimiter: str = \" \",\n    *args,\n    **kwargs\n) -> RDD:\n    \"\"\"\n    Split documents into smaller chunks by word.\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed.\n        subset (str, optional): A subset or column to consider. Defaults to 'text'.\n        word_per_chunk (int, optional): Number of words per chunk. Defaults to 100.\n        delimiter (str, optional): Delimiter to split the text. Defaults to \" \".\n\n    Returns:\n        RDD: The processed data with documents split into smaller chunks.\n\n    Raises:\n        ValueError: If word_per_chunk is not a positive integer.\n\n    Examples:\n        - word_per_chunk = 2\n        - delimiter = \" \"\n        - input\n\n            +-----------------------------+\n            |            text             |\n            +=============================+\n            | \"hello world, how are you?\" |\n            +-----------------------------+\n\n        - output\n\n            +----------------+\n            |      text      |\n            +================+\n            | \"hello world,\" |\n            +----------------+\n            | \"how are\"      |\n            +----------------+\n            | \"you?\"         |\n            +----------------+\n\n    Caveats:\n        - NO normalization is done here!\n            - This doesn't consider the whitespace normalization.\n            - Recommend using other normalization before this.\n        - All the keys from the original row are copied to all the new rows created.\n            - ``id`` is not unique anymore.\n            - Make sure ``id`` is assigned after this step.\n    \"\"\"\n\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _split_by_word(row):\n        words = row[subset].split(delimiter)\n\n        # Create chunks\n        chunks = []\n        for i in range(0, len(words), word_per_chunk):\n            chunks.append(delimiter.join(words[i : i + word_per_chunk]))\n\n        # Create a new dictionary for each chunk with all the keys from the original row\n        return [{**row, subset: chunk} for chunk in chunks]\n\n    data = data.flatMap(_split_by_word)\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/cleaning/html.py",
    "content": "\"\"\"\nA collection of modules for cleaning data includes html.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing import Union\n\nimport html2text\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef cleaning___html___extract_plain_text(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset: str = \"text\",\n    use_trafilatura: bool = False,\n    *args,\n    **kwargs\n) -> RDD:\n    r\"\"\"\n    Extracts plain text from HTML.\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed.\n        subset (str, optional): A subset or column to consider. Defaults to 'text'.\n        use_trafilatura (bool, optional): Whether to use trafilatura instead of html2text. Defaults to False.\n\n    Returns:\n        The plain data extracted from html.\n\n    Caveats:\n        - ``html2text`` adds a double newline after each paragraph, which is not handled at this point.\n        - The option to use `trafilatura` is provided because extracting plain text with ``trafilatura`` does not seem to work well in some cases.\n\n            - [OK] Case::\n\n                text = \"<body><h1>My First Heading</h1><p>My first paragraph.</p></body>\"\n\n                # html2text\n                print(html2text.html2text(text))\n                >>> '# My First Heading\\n\\nMy first paragraph.\\n\\n'\n\n                # trafilatura\n                print(trafilatura.html2txt(text))\n                >>> 'My First HeadingMy first paragraph.'\n\n            - [ERROR] Case (trafilatura removes all the text)::\n\n                text = \"<p>hello <br> nice to meet you.</p>\"\n\n                # html2text\n                print(html2text.html2text(text))\n                >>> 'hello  \\nnice to meet you.\\n\\n'\n\n                # trafilatura\n                print(trafilatura.html2txt(text))\n                >>> ''\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    # this is optional\n    if use_trafilatura:\n        import trafilatura\n\n        def _html2txt(row):\n            row[subset] = trafilatura.html2txt(row[subset])\n            return row\n\n    else:\n\n        def _html2txt(row):\n            row[subset] = html2text.html2text(row[subset])\n            return row\n\n    data = data.map(_html2txt)\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/cleaning/korean.py",
    "content": "\"\"\"\nThis is only for Korean text datas.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport re\nfrom enum import IntEnum\nfrom typing import List, Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\n\nclass KoreanType(IntEnum):\n    JAUM = 0\n    MOUM = 1\n    COMPLETE = 2\n    ELSE = -1\n\n\nKOR_BEGIN = 44032\nKOR_END = 55203\nCHOSUNG_BASE = 588\nJUNGSUNG_BASE = 28\nJAUM_BEGIN = 12593\nJAUM_END = 12622\nMOUM_BEGIN = 12623\nMOUM_END = 12643\n\n# fmt: off\nCHOSUNG = [\"ㄱ\", \"ㄲ\", \"ㄴ\", \"ㄷ\", \"ㄸ\", \"ㄹ\", \"ㅁ\", \"ㅂ\", \"ㅃ\", \"ㅅ\", \"ㅆ\", \"ㅇ\", \"ㅈ\", \"ㅉ\", \"ㅊ\", \"ㅋ\", \"ㅌ\", \"ㅍ\", \"ㅎ\"]\nJUNGSUNG = [\"ㅏ\", \"ㅐ\", \"ㅑ\", \"ㅒ\", \"ㅓ\", \"ㅔ\", \"ㅕ\", \"ㅖ\", \"ㅗ\", \"ㅘ\", \"ㅙ\", \"ㅚ\", \"ㅛ\", \"ㅜ\", \"ㅝ\", \"ㅞ\", \"ㅟ\", \"ㅠ\", \"ㅡ\", \"ㅢ\", \"ㅣ\"]\nJONGSUNG = [\" \", \"ㄱ\", \"ㄲ\", \"ㄳ\", \"ㄴ\", \"ㄵ\", \"ㄶ\", \"ㄷ\", \"ㄹ\", \"ㄺ\", \"ㄻ\", \"ㄼ\", \"ㄽ\", \"ㄾ\", \"ㄿ\", \"ㅀ\", \"ㅁ\", \"ㅂ\", \"ㅄ\", \"ㅅ\", \"ㅆ\", \"ㅇ\", \"ㅈ\", \"ㅊ\", \"ㅋ\", \"ㅌ\", \"ㅍ\", \"ㅎ\"]\n\nJAUM = [\"ㄱ\", \"ㄲ\", \"ㄳ\", \"ㄴ\", \"ㄵ\", \"ㄶ\", \"ㄷ\", \"ㄸ\", \"ㄹ\", \"ㄺ\", \"ㄻ\", \"ㄼ\", \"ㄽ\", \"ㄾ\", \"ㄿ\", \"ㅀ\", \"ㅁ\", \"ㅂ\", \"ㅃ\", \"ㅄ\", \"ㅅ\", \"ㅆ\", \"ㅇ\", \"ㅈ\", \"ㅉ\", \"ㅊ\", \"ㅋ\", \"ㅌ\", \"ㅍ\", \"ㅎ\"]\nMOUM = [\"ㅏ\", \"ㅐ\", \"ㅑ\", \"ㅒ\", \"ㅓ\", \"ㅔ\", \"ㅕ\", \"ㅖ\", \"ㅗ\", \"ㅘ\", \"ㅙ\", \"ㅚ\", \"ㅛ\", \"ㅜ\", \"ㅝ\", \"ㅞ\", \"ㅟ\", \"ㅠ\", \"ㅡ\", \"ㅢ\", \"ㅣ\"]\n\n\n# fmt: on\ndef character_is_korean(c):\n    i = ord(c)\n    return (\n        (KOR_BEGIN <= i <= KOR_END)\n        or (JAUM_BEGIN <= i <= JAUM_END)\n        or (MOUM_BEGIN <= i <= MOUM_END)\n    )\n\n\ndef decompose(c):\n    if not character_is_korean(c):\n        return None\n\n    i = ord(c)\n    if JAUM_BEGIN <= i <= JAUM_END:\n        return c, \" \", \" \"\n    if MOUM_BEGIN <= i <= MOUM_END:\n        return \" \", c, \" \"\n\n    i -= KOR_BEGIN\n    cho = i // CHOSUNG_BASE\n    jung = (i - cho * CHOSUNG_BASE) // JUNGSUNG_BASE\n    jong = i - cho * CHOSUNG_BASE - jung * JUNGSUNG_BASE\n\n    return CHOSUNG[cho], JUNGSUNG[jung], JONGSUNG[jong]\n\n\ndef compose(chosung, jungsung, jongsung):\n    unicode = KOR_BEGIN\n    unicode += CHOSUNG_BASE * CHOSUNG.index(chosung)\n    unicode += JUNGSUNG_BASE * JUNGSUNG.index(jungsung)\n    unicode += JONGSUNG.index(jongsung)\n    return chr(unicode)\n\n\ndef cleaning___korean___filter_by_ratio(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset: str = \"text\",\n    filter_type: str = \"word\",\n    korean_ratio: float = 0.5,\n    *args,\n    **kwargs,\n) -> RDD:\n    \"\"\"\n    Filters out the text that has less than `korean_ratio` excluding space.\n\n    Code is from eleutherAI/dps and was modified\n    https://github.com/EleutherAI/dps/blob/master/dps/spark/prep/korean_prep.py#L52\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data(Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.\n        subset(str, optional): A subset or column to consider. Defaults to 'text'.\n        filter_type(str, optional): The type of filtering to be applied. Can be 'char' or 'word'. Defaults to 'word'.\n        korean_ratio(float, optional) : The minimum ratio of Korean characters or words required for a text to survive the filtering. Defaults to 0.5.\n\n    Returns:\n        The filtered data with it's Korean ratio.\n\n    Raises:\n        ValueError: If the filter_type is not 'char' or 'word', or if the korean_ratio is not between 0 and 1.\n\n    Examples:\n        With korean_ratio = 0.5\n\n            +------------------------------------------------+\n            |                       text                     |\n            +================================================+\n            |  \"한국어가 포함 비율이 50% 이상인 경우만 남김\" |\n            +------------------------------------------------+\n\n\n            - filter_type = 'char' -> [survive!]\n                - Korean characters: 17\n                - Non-Korean characters: 3\n                - Total characters: 20\n                - Korean character ratio: 17 / 20 > 0.5 -> True\n            - filter_type = 'word' -> [survive!]\n                - Korean characters: 6\n                - Non-Korean characters: 1\n                - Total characters: 7\n                - Korean character ratio: 6 / 7 > 0.5 -> True\n\n            +------------------------------------------------+\n            |                       text                     |\n            +================================================+\n            | \"korean including 비율이 50% 미만인 경우 제거\" |\n            +------------------------------------------------+\n\n            - filter_type = 'char' -> [remove!]\n                - Korean characters: 10\n                - Non-Korean characters: 28\n                - Total characters: 38\n                - Korean word ratio: 10 / 38 > 0.5 -> False\n            - filter_type = 'word' -> [survive!]\n                - Korean characters: 4\n                - Non-Korean characters: 3\n                - Total characters: 7\n                - Korean word ratio: 4 / 7 > 0.5 -> True\n\n    Note:\n        - The regex to count Korean characters doesn't work properly on characters that are not words.\n            - e.g 안녕\"하세요 is counted is 2 korean words - [\"안녕\", \"하세요\"]\n    \"\"\"\n    assert filter_type in [\n        \"char\",\n        \"word\",\n    ], f\"filter_type should be either `char` or `word` but got {filter_type}\"\n    assert (\n        0.0 <= korean_ratio <= 1.0\n    ), f\"korean_ratio should be between 0. ~ 1. but got {korean_ratio}\"\n\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _korean_ratio_filter(row):\n        if row[subset] is None or len(row[subset]) == 0:\n            return False\n\n        if filter_type == \"char\":\n            korean_counts = len(re.findall(\"[ㄱ-힣]\", row[subset]))\n            all_counts = len(re.sub(\"[ \\r\\n\\t\\f\\v]\", \"\", row[subset]))\n        if filter_type == \"word\":\n            korean_counts = len(re.findall(r\"\\b[\\w]*[ㄱ-힣][\\w]*\\b\", row[subset]))\n            all_counts = len(re.findall(r\"\\b\\w+\\b\", row[subset]))\n\n        if all_counts == 0:\n            return False\n\n        return (korean_counts / all_counts) >= korean_ratio\n\n    data = data.filter(_korean_ratio_filter)\n\n    return data\n\n\ndef classify_korean_type(unicode):\n    if JAUM_BEGIN <= unicode <= JAUM_END:\n        return KoreanType.JAUM\n    elif MOUM_BEGIN <= unicode <= MOUM_END:\n        return KoreanType.MOUM\n    elif KOR_BEGIN <= unicode <= KOR_END:\n        return KoreanType.COMPLETE\n    else:\n        return KoreanType.ELSE\n\n\ndef reduce_repeated_emotions(text, num_repeats=2):\n    if num_repeats > 0:\n        repeat_chars_pattern = re.compile(r\"(\\w)\\\\1{2,}\")\n        text = repeat_chars_pattern.sub(\"\\\\1\" * num_repeats, text)\n\n    return text\n\n\n@register_etl\ndef cleaning___korean___reduce_emoticon(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset: Union[str, List[str]] = \"text\",\n    num_repeats: int = 2,\n    *args,\n    **kwargs,\n) -> RDD:\n    \"\"\"\n    Reduces emoticon Korean characters.\n\n    It performs the following steps:\n\n    1. Splits complete Korean characters into individual characters, preserving only the previous jaum and next moum.\n\n        - e.g. (remain) ㅋㅋ킄ㅋㅋㅋ -> ㅋㅋ킄ㅋㅋㅋ\n        - e.g. (splited) ㅋㅋ쿠ㅜㅜㅜ -> ㅋㅋㅋㅜㅜㅜㅜ\n\n    2. Reduces repeating Korean characters.\n        - e.g. ㅋㅋㅋㅋㅋ -> ㅋㅋ\n\n    Args:\n        spark(SparkSession): The Spark session object.\n        data(Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.\n        subset(str, optional): A subset or columns to consider. Defaults to 'text'.\n        num_repeats(int, optional): The number of repeating characters to reduce. Defaults to 2.\n\n    Returns:\n        RDD: The processed data with reduced emoticon Korean characters.\n\n    Note:\n        **[ potential risk of splitting complete korean character ]**\n\n        splitting emoticon characters into individual characters has high risk inside\n        so only left one case that is `complete korean character between jaum and moum`\n        other cases were added also but due to the risk, wiped out\n\n    References:\n        - `soynlp normalizer.py <https://github.com/lovit/soynlp/blob/master/soynlp/normalizer/_normalizer.py>`_\n        - `dps korean_prep.py <https://github.com/EleutherAI/dps/blob/master/dps/spark/prep/korean_prep.py>`_\n    \"\"\"\n\n    def _reduce_korean_emotion(row):\n        text = row[subset]\n        if not text:\n            return row\n\n        korean_types = [classify_korean_type(ord(c)) for c in text]\n        last_idx = len(korean_types) - 1\n\n        normalized_text = []\n        for i, (korean_type, c) in enumerate(zip(korean_types, text)):\n            # when complete korean character is between jaum and moum\n            if (0 < i < last_idx) and (\n                korean_types[i - 1] == KoreanType.JAUM\n                and korean_type == KoreanType.COMPLETE\n                and korean_types[i + 1] == KoreanType.MOUM\n            ):\n                cho, jung, jong = decompose(c)\n\n                # case 1. when complete kor char is combination of prev jaum and next moum\n                # e.g. ㅋ(쿠)ㅜ -> ㅋ(ㅋㅜ)ㅜ\n                if cho == text[i - 1] and jung == text[i + 1] and jong == \" \":\n                    normalized_text.append(cho)\n                    normalized_text.append(jung)\n\n                # case 2. otherwise, just leave it\n                # e.g. ㅋ(쿵)ㅜ -> ㅋ(쿵)ㅜ\n                else:\n                    normalized_text.append(c)\n\n            else:\n                normalized_text.append(c)\n\n        row[subset] = reduce_repeated_emotions(\"\".join(normalized_text), num_repeats)\n\n        return row\n\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    data = data.map(_reduce_korean_emotion)\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/cleaning/length.py",
    "content": "\"\"\"\nFiltering based on length.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef cleaning___length___char_len_filter(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset: str = \"text\",\n    min_len: int = None,\n    max_len: int = None,\n    *args,\n    **kwargs\n) -> RDD:\n    \"\"\"\n    Filters the data by character length.\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed.\n        subset (str, optional): A subset or column to consider. Defaults to 'text'.\n        min_len (int, optional): The minimum length of characters to filter. If None, there is no minimum length.\n        max_len (int, optional): The maximum length of characters to filter. If None, there is no maximum length.\n\n    Returns:\n        The filtered data as an RDD.\n\n    Raises:\n        ValueError: If both min_len and max_len are None.\n\n    Note:\n        - min_len <= len <= max_len\n        - min_len and max_len can not be None at the same time.\n        - If min_len is None, then only the maximum length is considered.\n        - If max_len is None, then only the minimum length is considered.\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    assert (\n        min_len is not None or max_len is not None\n    ), \"min_len and max_len cannot be None at the same time\"\n\n    if min_len is not None and max_len is not None:\n        data = data.filter(lambda row: min_len <= len(row[subset]) <= max_len)\n    elif min_len is None:\n        data = data.filter(lambda row: len(row[subset]) <= max_len)\n    elif max_len is None:\n        data = data.filter(lambda row: min_len <= len(row[subset]))\n\n    return data\n\n\n@register_etl\ndef cleaning___length___word_len_filter(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset=\"text\",\n    min_len: int = None,\n    max_len: int = None,\n    *args,\n    **kwargs\n):\n    \"\"\"\n    filter by word length\n\n    min_len <= len <= max_len\n    - if min_len is None, then len <= max_len\n    - if max_len is None, then len >= min_len\n\n    args:\n        subset: column to filter\n        min_len: minimum length to filter\n        max_len: maximum length to filter\n\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    assert (\n        min_len is not None or max_len is not None\n    ), \"min_len and max_len cannot be None at the same time\"\n\n    if min_len is not None and max_len is not None:\n        data = data.filter(lambda row: min_len <= len(row[subset].split()) <= max_len)\n    elif min_len is None:\n        data = data.filter(lambda row: len(row[subset].split()) <= max_len)\n    elif max_len is None:\n        data = data.filter(lambda row: min_len <= len(row[subset].split()))\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/cleaning/number.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport re\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef cleaning___number___normalize(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset: str = \"text\",\n    assign_number: int = 0,\n    *args,\n    **kwargs,\n) -> RDD:\n    \"\"\"\n    Convert all the number to assigned number (e.g. 0)\n\n    Code is from facebookresearch/cc_net\n    https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py\n\n    Examples:\n\n        - input\n\n        +----------+\n        |   text   |\n        +==========+\n        |      1234|\n        | 1234.5678|\n        +----------+\n\n        - output\n\n        +----------+\n        |   text   |\n        +==========+\n        |      0000|\n        | 0000.0000|\n        +----------+\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.\n        subset (str, optional): A subset or column to consider. Defaults to 'text'.\n        assign_number (int, optional): The number to assign. Default is 0.\n\n    Returns:\n        The normalized data.\n\n    Raises:\n        AssertionError: If assign_number is not between 0 and 9 (inclusive).\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _normalize_number(row):\n        row[subset] = re.sub(r\"\\d\", str(assign_number), row[subset])\n        return row\n\n    # assign_number is between 0 ~ 9\n    assert assign_number in range(\n        10\n    ), f\"assign_number should be between 0 ~ 9 but got {assign_number}\"\n    data = data.map(_normalize_number)\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/cleaning/table.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql import functions as F\n\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef cleaning___table___merge_col_vertical(\n    spark,\n    data: Union[RDD, DataFrame],\n    col1: str = None,\n    col2: str = None,\n    merge_col_name: str = \"merge_col\",\n    *args,\n    **kwargs\n) -> RDD:\n    \"\"\"\n    Merges two columns vertically into one column.\n\n    Example:\n        Before:\n\n        +------+------+---------+\n        | col1 | col2 | species |\n        +======+======+=========+\n        | 1    | 2    | duck    |\n        +------+------+---------+\n        | 3    | 4    | duck    |\n        +------+------+---------+\n        | 5    | 6    | ducky   |\n        +------+------+---------+\n\n        After calling ``cleaning_table_merge_col_vertical(...)``:\n\n        +--------+---------+\n        | number | species |\n        +========+=========+\n        | 1      | duck    |\n        +--------+---------+\n        | 3      | duck    |\n        +--------+---------+\n        | 5      | ducky   |\n        +--------+---------+\n        | 2      | duck    |\n        +--------+---------+\n        | 4      | duck    |\n        +--------+---------+\n        | 6      | ducky   |\n        +--------+---------+\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.\n        col1 (str): The name of the first column to merge.\n        col2 (str): The name of the second column to merge.\n        merge_col_name (str, optional): The name of the merged column.\n\n    Returns:\n        The processed data with the merged column.\n\n    Raises:\n        ValueError: If col1 or col2 is not specified.\n    \"\"\"\n    if isinstance(data, RDD):\n        data = data.toDF()\n\n    assert col1 is not None, \"col1 must be specified\"\n    assert col2 is not None, \"col2 must be specified\"\n\n    rest_cols = [c for c in data.columns if c not in [col1, col2]]\n    df1 = data.select(*rest_cols, F.col(col1).alias(merge_col_name))\n    df2 = data.select(*rest_cols, F.col(col2).alias(merge_col_name))\n\n    # union the dataframes\n    data = df1.union(df2)\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/cleaning/unicode.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport re\nimport unicodedata\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\nUNICODE_PUNCT = {\n    \"，\": \",\",\n    \"。\": \".\",\n    \"、\": \",\",\n    \"„\": '\"',\n    \"”\": '\"',\n    \"“\": '\"',\n    \"«\": '\"',\n    \"»\": '\"',\n    \"１\": '\"',\n    \"」\": '\"',\n    \"「\": '\"',\n    \"《\": '\"',\n    \"》\": '\"',\n    \"´\": \"'\",\n    \"∶\": \":\",\n    \"：\": \":\",\n    \"？\": \"?\",\n    \"！\": \"!\",\n    \"（\": \"(\",\n    \"）\": \")\",\n    \"；\": \";\",\n    \"–\": \"-\",\n    \"—\": \" - \",\n    \"．\": \". \",\n    \"～\": \"~\",\n    \"’\": \"'\",\n    \"…\": \"...\",\n    \"━\": \"-\",\n    \"〈\": \"<\",\n    \"〉\": \">\",\n    \"【\": \"[\",\n    \"】\": \"]\",\n    \"％\": \"%\",\n    \"►\": \"-\",\n}\n\n\n@register_etl\ndef cleaning___unicode___remove_punct(\n    spark, data: Union[RDD, DataFrame], subset: str = \"text\", *args, **kwargs\n) -> RDD:\n    \"\"\"\n    Removes all the Unicode punctuations.\n\n    Code is from facebookresearch/cc_net\n    https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.\n        subset (str, optional): A subset or column to consider. Defaults to 'text'.\n\n    Returns:\n        The cleaned data.\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _remove_unicode_punct(row):\n        row[subset] = re.sub(f\"[{''.join(UNICODE_PUNCT.keys())}]\", \"\", row[subset])\n        return row\n\n    data = data.map(_remove_unicode_punct)\n\n    return data\n\n\n@register_etl\ndef cleaning___unicode___replace_punct(\n    spark, data: Union[RDD, DataFrame], subset: str = \"text\", *args, **kwargs\n) -> RDD:\n    \"\"\"\n    Replace all the unicode punctuations\n\n    Code is from facebookresearch/cc_net\n    https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.\n        subset (str, optional): A subset or column to consider. Defaults to 'text'.\n\n    Returns:\n        The cleaned data.\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _replace_unicode_punct(row):\n        row[subset] = \"\".join((UNICODE_PUNCT.get(c, c) for c in row[subset]))\n        return row\n\n    data = data.map(_replace_unicode_punct)\n\n    return data\n\n\n@register_etl\ndef cleaning___unicode___normalize(\n    spark, data: Union[RDD, DataFrame], subset=\"text\", *args, **kwargs\n):\n    \"\"\"\n    Normalize the unicode\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.\n        subset (str, optional): A subset or column to consider. Defaults to 'text'.\n\n    Returns:\n        The cleaned data.\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _normalize(row):\n        row[subset] = unicodedata.normalize(\"NFC\", row[subset])\n        return row\n\n    data = data.map(_normalize)\n    return data\n"
  },
  {
    "path": "dataverse/etl/data_ingestion/README.md",
    "content": "# Data Ingestion\n> Ingest various data sources into the desired format\n\n**Recommendation for Data Ingestion**\n> Use Data Ingestion to convert all datasets to unified format you choose before preprocessing(transform)\n- for `Text Only` Dataset, recommend using `ufl` format\n    - for details on `ufl` format, see below\n- for `other` dataset, consider creating a new unified format\n\n## 📚 Data Ingestion Flow\n> This is the recommended flow for data ingestion, but not mandatory\n\nThere is 2 types of data ingestion flow for standard\n- **1 step flow** (load & template)\n    - load `raw data` to `desired format` directly\n- **2 step flow** (load -> template)\n    - load `raw data` to `raw format` first with **dict type**\n    - convert `raw format` to `desired format`\n\nIf you want to create 3 steps, thats on you. Remember this is just a guideline.\n\n### 📗 Why 2 step flow?\n> To support various templates for the same data source\n\nLet's suppose we are ingesting `mmlu` dataset and our desired format is `ufl` format.\nAnd with the following 2 templates, we can create 2 different data with `ufl` format.\nTo give user a broader choice, multiple templates for the same data source is necessary and 2 step flow is the way to go.\n\n```python\n# raw format\nraw = {\n    \"question\": \"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.\",\n    \"choices\": [\"8\", \"2\", \"24\", \"120\"],\n    \"answer\": 1,\n}\n\n# template v1 - only question (q)\nufl = {\n    'id': \"b1c2d3e4f5g6h7i8j9k0\",\n    'name': \"mmlu\",\n    'text': \"Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.\",\n    'meta': {},\n}\n\n# template v2 - question, answer (qa)\nufl = {\n    'id': \"a1b2c3d4e5f6g7h8i9j0\",\n    'name': \"mmlu\",\n    'text': \"question: Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.\\nanswer: 8\",\n    'meta': {},\n}\n\n```\n\n\n## 📚 Naming Convention\n> This is a strong recommendation. You can use your own naming convention if you want.\n\n```python\ndef data_ingestion___[ETL Sub-Category]___[raw source]2[target format]()\n\n```\n-  `ETL Sub-Category` - 2 types of sub-category (python file)\n    1. Name to the data source name to handle (`specific` purpose)\n        - e.g. mmlu\n        - e.g. squad\n    2. Name `file format` itself (`general` purpose)\n        - e.g. parquet\n        - e.g. csv\n        - e.g. hugingface\n- `ETL process name`\n    - Name the ETL process as the `raw source` -> `target format`\n        - **raw source**\n            - `file format`\n                - `parquet` - (loading data from parquet)\n                - `hf` - (loading data from huggingface dataset)\n                - `csv` - (loading data from csv)\n                - etc\n            - `raw`\n                - the data is already loaded in memory as raw\n        - **target format**\n            - `ufl` - (loading data to ufl format)\n                - e.g. `parquet2ufl` means loading parquet to ufl format\n                - e.g. `hf2ufl` means loading huggingface dataset to ufl format\n            - `raw` - (loading data w/o any transformation)\n                - e.g. `parquet2raw` means loading parquet to raw format\n                - e.g. `hf2raw` means loading huggingface dataset to raw format\n            - `[YOUR_FORMAT]`\n                - this is on you\n\n**caveat**\n- `ufl` is not a file format rather a schema(data format). \n\n### 📗 1 step flow\n> direct loading raw data to desired format\n\n- In case of your data is already saved in UFL format, use `raw` loading ETL process\n    - e.g. `hf2raw` could be used as total 1 step when your data is already saved in UFL format\n\n    \n```python\n- \"data_ingestion/\"\n    # converting raw data to desired format\n    - mmlu.py\n        - def data_ingestion___mmlu___parquet2ufl()\n        - def data_ingestion___mmlu___hf2ufl()\n    - squad.py\n        - def data_ingestion___squad___hf2ufl()\n    - mnist.py\n        - def data_ingestion___mnist___csv2ufl()\n\n    # this is used when loading UFL format saved in parquet\n    - parquet.py\n        - def data_ingestion___parquet___pq2ufl()\n```\n\n### 📗 2 step flow\n> loading raw data to raw format first and then convert to desired format\n\n#### 📖 Step 1 - load raw data to raw format\n\n```python\n- \"data_ingestion/\"\n    # converting raw data to raw format\n    - huggingface.py\n        - def data_ingestion___huggingface___hf2raw()\n    - mmlu.py\n        - def data_ingestion___mmlu___parquet2raw()\n        - def data_ingestion___mmlu___hf2raw()\n    - mnist.py\n        - def data_ingestion___mnist___csv2raw()\n```\n\n#### 📖 Step 2 - convert raw format to desired format\n- Name the ETL process as the `raw format` -> `target format`\n    - e.g. `raw2ufl` means converting raw format to ufl format\n- Add template name to the end of the function name\n    - e.g. `raw2ufl_q` means converting raw format to ufl format with `question` template\n    - e.g. `raw2ufl_qa` means converting raw format to ufl format with `question & answer` template\n\n```python\n- \"data_ingestion/\"\n    # converting raw format to desired format\n    - mmlu.py\n        - def data_ingestion___mmlu___raw2ufl_q()\n        - def data_ingestion___mmlu___raw2ufl_qa()\n    - squad.py\n        - def data_ingestion___squad___raw2ufl_v1()\n    - mnist.py\n        - def data_ingestion___mnist___raw2ufl_v1()\n```\n\n\n## 📚 UFL (Upstage Format for LLM)\n> This is the schema(data format) recommended by the Upstage LLM. Dataverse standard format for preparing pretraining dataset.\n```python\n{\n\t\"id\":\"uuid\",\n\t\"name\": \"string\",\n\t\"text\":\"string\",\n\t\"meta\": \"string\",\n}\n```\n\n- `id` - uuid v1\n- `name` - name of the dataset\n- `text` - text of the dataset\n- `meta` - meta data of the dataset\n    - meta data is a stringified json object\n\n### 📗 Why stringified for meta data?\n> Meta data does not have a fixed schema. It can be anything. So, it is stringified to avoid any issues with the schema.\n\n**huggingface datasets** \n- when 2 datasets have different meta data schema, it will throw an error when merging the datasets"
  },
  {
    "path": "dataverse/etl/data_ingestion/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/data_ingestion/arrow.py",
    "content": "\"\"\"\nLoad Arrow.\nSupport direct loading of arrow saved huggingface dataset to spark dataframe.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport glob\nimport os\nfrom typing import List, Union\n\nimport numpy as np\nimport pyarrow as pa\nfrom omegaconf import ListConfig\nfrom pyspark.rdd import RDD\n\nfrom dataverse.etl import register_etl\n\n\ndef find_arrow_paths(directory):\n    \"\"\"find *.arrow files recursively\"\"\"\n    if isinstance(directory, str):\n        return glob.glob(os.path.join(directory, \"**/*.arrow\"), recursive=True)\n    elif isinstance(directory, list) or isinstance(directory, ListConfig):\n        arrow_paths = []\n        for d in directory:\n            arrow_paths.extend(find_arrow_paths(d))\n        return arrow_paths\n\n    raise ValueError(f\"directory must be str or list, got {type(directory)}\")\n\n\ndef get_dir_size(arrow_paths):\n    total_size = 0\n    for fp in arrow_paths:\n        # skip if it is not `.arrow` file\n        if not fp.endswith(\".arrow\"):\n            continue\n\n        # skip if it is symbolic link\n        if not os.path.islink(fp):\n            total_size += os.path.getsize(fp)\n\n    return total_size\n\n\ndef arrow_table_to_dict(arrow_path):\n    \"\"\"\n    speed 10000 take - 70ms\n\n    faster than\n    - pyarrow -> pydict direct loading\n    - pyarrow -> pandas -> pydict loading\n\n    TODO: speed and memory improvement\n    \"\"\"\n    in_memory_stream = pa.input_stream(arrow_path)\n    opened_stream = pa.ipc.open_stream(in_memory_stream)\n    table = opened_stream.read_all()\n\n    # get schema for field names\n    schema = table.schema\n\n    rows = []\n    # iterate over each row\n    for row in range(table.num_rows):\n        row_data = {\n            schema.field(col).name: table.column(col)[row].as_py()\n            for col in range(table.num_columns)\n        }\n        rows.append(row_data)\n\n    return rows\n\n\n@register_etl\ndef data_ingestion___arrow___hf2raw(\n    spark,\n    path: Union[str, List[str]],\n    sample_n: int = -1,\n    arrow_partition_mb_size: int = -1,\n    raw_partition_mb_size: int = 256,\n    repartition: int = -1,\n    seed: int = 42,\n    verbose: bool = True,\n    *args,\n    **kwargs,\n) -> RDD:\n    \"\"\"\n    Directly loads the arrow saved HuggingFace dataset to raw format as a dictionary.\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        path (Union[str, List[str]]): The path of the arrow folders.\n        sample_n (int, optional): The number of arrow files to be sampled. Defaults to -1.\n            If sample_n is -1, all arrow files will be loaded.\n        arrow_partition_mb_size (int, optional): The size of each arrow partition in MB. Defaults to -1.\n            If arrow_partition_size is -1, it will repartition arrow files by the number of arrow files.\n            This assumes that arrow file size is evenly distributed. When there is data skew in arrow file size, it is recommended to use the default (-1).\n        raw_partition_mb_size (int, optional): The size of each raw partition in MB. Defaults to 256.\n            This is activated only when repartition is -1.\n        repartition (int, optional): Manually choose the number of partitions. Defaults to -1.\n        seed (int, optional): The seed for sampling. Defaults to 42.\n        verbose (bool, optional): Whether to print the information of the dataset. Defaults to True.\n\n    Returns:\n        RDD: The RDD containing the raw data in dictionary format.\n\n    Examples:\n        >>> import datasets\n        >>> dataset = datasets.load_dataset('ducky')\n        >>> dataset.save_to_disk('your/path/to/ducky')\n        >>> data_ingestion___arrow___hf2raw()(spark, 'your/path/to/ducky')\n\n    Caveats:\n        Arrow paths are repartitioned by the number of arrow files.\n    \"\"\"\n    arrow_paths = find_arrow_paths(path)\n    assert len(arrow_paths) > 0, f\"no arrow files found in {path}\"\n\n    # sample from the arrow files\n    if sample_n > 0 and sample_n < len(arrow_paths):\n        np.random.seed(seed)\n        arrow_paths = np.random.choice(arrow_paths, size=sample_n, replace=False)\n\n    if arrow_partition_mb_size == -1:\n        # if data is skewed, recommend to use default (-1)\n        arrow_repartition = len(arrow_paths)\n    else:\n        # this assume that arrow file size is evenly distributed\n        assert (\n            arrow_partition_mb_size > 0\n        ), f\"arrow_partition_mb_size must be positive, got {arrow_partition_mb_size}\"\n        arrow_total_mb_size = get_dir_size(arrow_paths) / 1024 / 1024\n        arrow_repartition = arrow_total_mb_size // arrow_partition_mb_size\n        arrow_repartition += 1 if arrow_total_mb_size % arrow_partition_mb_size else 0\n        arrow_repartition = min(int(arrow_repartition), len(arrow_paths))\n\n    rdd = spark.sparkContext.parallelize(arrow_paths)\n    rdd = rdd.repartition(arrow_repartition)\n    rdd = rdd.flatMap(arrow_table_to_dict)\n\n    if repartition != -1:\n        raw_repartition = repartition\n    else:\n        assert (\n            raw_partition_mb_size > 0\n        ), f\"raw_partition_mb_size must be positive, got {raw_partition_mb_size}\"\n\n        arrow_total_mb_size = get_dir_size(arrow_paths) / 1024 / 1024\n        raw_repartition = arrow_total_mb_size // raw_partition_mb_size\n        raw_repartition += 1 if arrow_total_mb_size % raw_partition_mb_size else 0\n\n        # count the number of data points (this is expensive)\n        # this is to prevent the case where the number of data points is less than raw_repartition\n        total_data_n = rdd.count()\n        raw_repartition = min(int(raw_repartition), total_data_n)\n\n    rdd = rdd.repartition(raw_repartition)\n\n    return rdd\n"
  },
  {
    "path": "dataverse/etl/data_ingestion/common_crawl.py",
    "content": "\"\"\"\nLoad Common Crawl data from dump-id & segment files\n\nCode is from facebookresearch/cc_net with some modifications\nhttps://github.com/facebookresearch/cc_net\n\nThis is a migration of the code to Dataverse.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport functools\nimport glob\nimport gzip\nimport io\nimport json\nimport os\nimport sys\nimport tempfile\nimport time\nimport typing as tp\nimport warnings\nfrom pathlib import Path\nfrom typing import Iterable, List, Optional, TextIO, Union\nfrom urllib.parse import urlparse\n\nimport numpy as np\nimport requests\nfrom pyspark.rdd import RDD\n\nfrom dataverse.etl import register_etl\nfrom dataverse.utils.format import get_uuidv1\nfrom dataverse.utils.setting import SystemSetting\n\n\ndef parse_doc(headers: List[str], doc: List[str]) -> Optional[dict]:\n    \"\"\"Headers format is:\n    WARC/1.0\n    WARC-Type: conversion\n    WARC-Target-URI: [url]\n    WARC-Date: [crawldate: 2019-02-15T19:15:59Z]\n    WARC-Record-ID: <urn:uuid:8865156e-d5f1-4734-9c68-4b46eaf2bb7e>\n    WARC-Refers-To: <urn:uuid:340152e2-65cf-4143-b522-8ce4e2d069d7>\n    WARC-Block-Digest: sha1:S3DTWCONT2L6ORTGCY2KXEZ37LNBB7V2\n    Content-Type: text/plain\n    Content-Length: 7743\n    \"\"\"\n    if not headers or not doc:\n        return None\n    try:\n        url, date, digest, length = None, None, None, None\n        for header in headers:\n            if header.startswith(\"WARC-Target-URI:\"):\n                url = header.split()[1]\n            elif header.startswith(\"WARC-Date:\"):\n                date = header.split()[1]\n            elif header.startswith(\"WARC-Block-Digest:\"):\n                digest = header.split()[1]\n            elif header.startswith(\"Content-Length:\"):\n                length = int(header.split()[1])\n\n    except Exception:\n        # logger.warning(\"Can't parse header:\", e, headers, doc)\n        return None\n\n    # Docs are separated by two empty lines.\n    last = None\n    if not doc[-1] and not doc[-2]:\n        last = -2\n    title, doc = doc[0], doc[1:last]\n\n    return {\n        \"url\": url,\n        \"date_download\": date,\n        \"digest\": digest,\n        \"length\": length,\n        \"nlines\": len(doc),\n        \"source_domain\": urlparse(url).netloc,\n        \"title\": title,\n        \"raw_content\": \"\\n\".join(doc),\n    }\n\n\ndef group_by_docs(warc_lines: Iterable[str]) -> Iterable[dict]:\n    doc: List[str] = []\n    headers, read_headers = [], True\n    for warc in warc_lines:\n        warc = warc.strip()\n        if read_headers:\n            headers.append(warc)\n            read_headers = warc != \"\"\n            continue\n\n        if warc == \"WARC/1.0\":\n            # We reached the beginning of the new doc.\n            parsed = parse_doc(headers, doc)\n            if parsed is not None:\n                yield parsed\n            headers, doc, read_headers = [warc], [], True\n            continue\n\n        doc.append(warc)\n\n    # Return the last document\n    if doc:\n        parsed = parse_doc(headers, doc)\n        if parsed is not None:\n            yield parsed\n\n\ndef _close_when_exhausted(file) -> Iterable[str]:\n    with file:\n        yield from file\n\n\ndef open_segment_file(segment: str, verbose: bool = True) -> Iterable[str]:\n    \"\"\"\n    overwrite the open_segment function to get the WET file from the folder\n\n    args:\n        segment: path to the WET file\n    \"\"\"\n    filename = Path(segment)\n    if filename.suffix == \".gz\":\n        file: TextIO = gzip.open(filename, \"rt\")  # type: ignore\n    else:\n        file = open(filename, \"rt\")\n    return _close_when_exhausted(file)\n\n\ndef process_segment_file(segment: str, verbose: bool = True) -> Iterable[dict]:\n    for doc in group_by_docs(open_segment_file(segment, verbose=verbose)):\n        doc[\"cc_segment\"] = segment\n        yield doc\n\n\ndef find_wet_files(directory):\n    \"\"\"find *.wet, *wet.gz files recursively\"\"\"\n    return glob.glob(os.path.join(directory, \"**/*.wet\"), recursive=True) + glob.glob(\n        os.path.join(directory, \"**/*.wet.gz\"), recursive=True\n    )\n\n\nWET_URL_ROOT = \"https://data.commoncrawl.org\"\nFileDescriptor = Union[Path, List[Path], str]\nReadableFileLike = Union[Iterable[str], FileDescriptor, None]\n\n\ndef _tmp(prefix: str = None, suffix: str = None, dir: Path = None) -> Path:\n    if isinstance(prefix, Path):\n        prefix = str(prefix)\n    if isinstance(suffix, Path):\n        suffix = str(suffix)\n    _, tmp_path = tempfile.mkstemp(prefix=prefix, suffix=suffix, dir=dir)\n    return Path(tmp_path)\n\n\ndef _yield_from(files: list) -> Iterable[str]:\n    for file in files:\n        yield from open_read(file)\n\n\ndef open_read(filename: ReadableFileLike) -> Iterable[str]:\n    \"\"\"Open the given file, list of files or files matching the given glob and read lines.\n\n    `filename` is None or \"-\" -> reads from stdin\n    `filename` is a Path / str -> interprets filename as a glob and open files matching it\n    `filename` is a list -> opens sequentially all files from the list using `open_read`\n    `filename` is something else -> returns the object wrapped in a `nullcontext`\n        This allows to pass already openened files or iterables.\n\n    `open_read` will decompress gzip files, given they have \".gz\" suffix.\n    \"\"\"\n    if filename is None:\n        return sys.stdin\n\n    if isinstance(filename, list):\n        assert isinstance(filename[0], Path)\n        if len(filename) == 0:\n            return []\n        if len(filename) > 1:\n            return _yield_from(filename)\n        filename = tp.cast(Path, filename[0])\n    if isinstance(filename, str):\n        if filename.startswith(\"http://\") or filename.startswith(\"https://\"):\n            return open_remote_file(filename)\n\n        filename = Path(filename)\n    if not isinstance(filename, Path):\n        # we might have received an iterable, return it unmodified.\n        return filename  # type: ignore\n\n    # Expand glob patterns only when reading\n    files = [Path(f) for f in sorted(glob.glob(str(filename)))]\n    if len(files) > 1:\n        return _yield_from(files)\n    if len(files) == 1:\n        filename = files[0]\n\n    assert isinstance(filename, Path)\n\n    if filename.suffix == \".gz\":\n        file: TextIO = gzip.open(filename, \"rt\")  # type: ignore\n    else:\n        file = open(filename, \"rt\")\n\n    return _close_when_exhausted(file)\n\n\ndef request_get_content(url: str, n_retry: int = 3, verbose: bool = True) -> bytes:\n    \"\"\"Retrieve the binary content at url.\n\n    Retry on connection errors.\n    \"\"\"\n    t0 = time.time()\n\n    if verbose:\n        # TODO: Logging will be activated later\n        # logging.info(f\"Starting download of {url}\")\n        print(f\"Starting download of {url}\")\n\n    for i in range(1, n_retry + 1):\n        try:\n            with requests.Session() as session:\n                r = session.get(url)\n                r.raise_for_status()\n            break\n        except requests.exceptions.RequestException as e:\n            # Sleep and try again on error, unless it's a 404.\n            message = e.args[0] if isinstance(e.args[0], str) else \"\"\n            if i == n_retry or \"Client Error\" in message:\n                raise e\n            warnings.warn(f\"Swallowed error {e} while downloading {url} ({i} out of {n_retry})\")\n            time.sleep(10 * 2**i)\n\n    if verbose:\n        dl_time = time.time() - t0\n        dl_speed = len(r.content) / dl_time / 1024\n        # logging.info(\n        #     f\"Downloaded {url} [{r.status_code}] took {dl_time:.0f}s ({dl_speed:.1f}kB/s)\"\n        # )\n        print(f\"Downloaded {url} [{r.status_code}] took {dl_time:.0f}s ({dl_speed:.1f}kB/s)\")\n\n    return r.content\n\n\ndef open_remote_file(url: str, cache: Path, verbose: bool = True) -> Iterable[str]:\n    \"\"\"\n    Download the files at the given url to memory and opens it as a file.\n    Assumes that the file is small, and fetch it when this function is called.\n    \"\"\"\n    if cache and cache.exists():\n        return open_read(cache)\n\n    # TODO: open the remote file in streaming mode.\n    # The hard part is that we need to write the content on disk at the same time,\n    # to implement disk caching.\n    raw_bytes = request_get_content(url, verbose=verbose)\n    content = io.BytesIO(raw_bytes)\n    if url.endswith(\".gz\"):\n        f: TextIO = gzip.open(content, mode=\"rt\")  # type: ignore\n    else:\n        f = io.TextIOWrapper(content)\n\n    try:\n        # The file might have been created even not fully downloaded/written\n        # so make sure tmp_cache is deleted when the program exits.\n        # and only replace the cache file when the download is complete.\n        if cache and not cache.exists():\n            tmp_cache = _tmp(cache)\n            tmp_cache.write_bytes(raw_bytes)\n            if not cache.exists():\n                tmp_cache.replace(cache)\n    finally:\n        if tmp_cache.exists():\n            tmp_cache.unlink()\n\n    return _close_when_exhausted(f)\n\n\ndef cc_wet_paths_url(dump_id: str) -> str:\n    return \"/\".join([WET_URL_ROOT, \"crawl-data\", \"CC-MAIN-\" + dump_id, \"wet.paths.gz\"])\n\n\ndef segment_url(segment: str):\n    return \"/\".join((WET_URL_ROOT, segment))\n\n\ndef cc_segment_urls(dump_id: str, cache_dir: Path, verbose: bool = True) -> List[str]:\n    wet_paths = cc_wet_paths_url(dump_id)\n    wet_paths_cache = cache_dir / f\"wet_{dump_id}.paths.gz\"\n    f = open_remote_file(wet_paths, cache=wet_paths_cache, verbose=verbose)\n    return [segment.strip() for segment in f]\n\n\ndef open_segment_url(segment: str, cache_dir: Path, verbose: bool = True) -> Iterable[str]:\n    url = segment_url(segment)\n    file: Optional[Path] = None\n    if cache_dir:\n        file = cache_dir / segment.split(\"/\")[-1]\n\n    return open_remote_file(url, cache=file, verbose=verbose)\n\n\ndef process_segment_url(segment: str, cache_dir: Path, verbose: bool = True) -> Iterable[str]:\n    for doc in group_by_docs(open_segment_url(segment, cache_dir, verbose=verbose)):\n        doc[\"cc_segment\"] = segment\n        yield doc\n\n\n@register_etl\ndef data_ingestion___common_crawl___wet2raw(\n    spark,\n    wet_path: str,\n    segment_n: int = -1,\n    repartition=20,\n    seed: int = 42,\n    verbose=True,\n    *args,\n    **kwargs,\n) -> RDD:\n    \"\"\"\n    Load WET files and convert them to raw format as a dictionary.\n\n    [ what is WET? ]\n    - WET files which store extracted plain text from the data stored in the WARC.\n\n    Args:\n        spark: The Spark session.\n        wet_path: The path to the WET folder that includes WET format files.\n            This search recursively, so you don't need to specify the path to each WET file.\n            This search for all the *.wet, *.gz files in the folder.\n        segment_n: The number of segments to load. This is a sampling parameter.\n            One segment is about 1GB.\n            Set as -1 (default) to load all the segments.\n        repartition: The number of partitions.\n        seed: The random seed.\n        verbose: Whether to print the information of the dataset.\n\n    Returns:\n        rdd: The RDD containing the converted raw data.\n    \"\"\"\n    wet_paths = find_wet_files(wet_path)\n    if segment_n > 0 and segment_n < len(wet_paths):\n        np.random.seed(seed)\n        wet_paths = np.random.choice(wet_paths, size=segment_n, replace=False)\n\n    rdd = spark.sparkContext.parallelize(wet_paths)\n    rdd = rdd.flatMap(functools.partial(process_segment_file, verbose=verbose))\n    rdd = rdd.repartition(repartition)\n\n    return rdd\n\n\n@register_etl\ndef data_ingestion___common_crawl___dump2raw(\n    spark,\n    dump: str,\n    segment_n: int = -1,\n    repartition: int = 20,\n    use_cache: bool = True,\n    cache_dir: str = None,\n    seed: int = 42,\n    verbose: bool = True,\n    *args,\n    **kwargs,\n) -> RDD:\n    \"\"\"\n    Ingests data from Common Crawl dump and converts it to raw format.\n\n    Args:\n        spark (SparkSession): The Spark session.\n        dump (str): The dump ID of the Common Crawl. For example, '2023-23'.\n        segment_n (int, optional): The number of segments to load. Default is -1, which loads all segments.\n            Note that one segment is about 1GB.\n        repartition (int, optional): The number of partitions. Default is 20.\n        use_cache (bool, optional): Whether to use the cache. Default is True.\n            If you want to save disk space, set as False because the size of cache can be large.\n            FYI, on WET dump is about 10TB.\n        cache_dir (str, optional): The cache path to save the dataset.\n        seed (int, optional): The random seed. Default is 42.\n        verbose (bool, optional): Whether to print the information of the dataset. Default is True.\n\n    Returns:\n        RDD: The RDD containing the processed data.\n    \"\"\"\n    if use_cache:\n        if cache_dir is None:\n            # save the parquet at package root path\n            cache_dir = SystemSetting().CACHE_DIR\n            cache_dir = f\"{cache_dir}/.cache/dataverse/dataset/common_crawl_{dump}\"\n        else:\n            cache_dir = f\"{cache_dir}/common_crawl_{dump}\"\n    else:\n        cache_dir = None\n\n    if not isinstance(cache_dir, Path):\n        cache_dir = Path(cache_dir)\n\n    # if cache dir exist creat one\n    if cache_dir and not cache_dir.exists():\n        cache_dir.mkdir(parents=True)\n\n    wet_urls = cc_segment_urls(dump, cache_dir, verbose=verbose)\n\n    if segment_n > 0 and segment_n < len(wet_urls):\n        np.random.seed(seed)\n        wet_urls = np.random.choice(wet_urls, size=segment_n, replace=False)\n\n    rdd = spark.sparkContext.parallelize(wet_urls)\n    rdd = rdd.flatMap(\n        functools.partial(\n            process_segment_url,\n            cache_dir=cache_dir,\n            verbose=verbose,\n        )\n    )\n    rdd = rdd.repartition(repartition)\n\n    return rdd\n\n\ndef convert_bytes(data):\n    if isinstance(data, bytes):\n        return data.decode()\n    if isinstance(data, dict):\n        return {convert_bytes(key): convert_bytes(value) for key, value in data.items()}\n    if isinstance(data, list):\n        return [convert_bytes(element) for element in data]\n    return data\n\n\n@register_etl\ndef data_ingestion___common_crawl___raw2ufl(spark, data: RDD, *args, **kwargs):\n    \"\"\"\n    Converts raw format to UFL with custom template.\n\n    Args:\n        spark (SparkSession): The Spark session.\n        data (RDD): The input data.\n\n    Returns:\n        The converted data in UFL format.\n    \"\"\"\n\n    def templatev1(data):\n        new_data = {}\n        new_data[\"id\"] = get_uuidv1()\n        new_data[\"name\"] = \"common_crawl\"\n        new_data[\"text\"] = f\"{data.get('raw_content', None)}\"\n        new_data[\"meta\"] = json.dumps(\n            convert_bytes(\n                {\n                    \"title\": data.get(\"title\", None),\n                    \"url\": data.get(\"url\", None),\n                    \"date_download\": data.get(\"date_download\", None),\n                    \"digest\": data.get(\"digest\", None),\n                    \"length\": data.get(\"length\", None),\n                    \"nlines\": data.get(\"nlines\", None),\n                    \"source_domain\": data.get(\"source_domain\", None),\n                    \"cc_segment\": data.get(\"cc_segment\", None),\n                }\n            )\n        )\n        return new_data\n\n    data = data.map(lambda x: templatev1(x))\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/data_ingestion/csv.py",
    "content": "\"\"\"\nLoad CSV data\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\nfrom typing import List, Union\n\nfrom pyspark.rdd import RDD\n\nfrom dataverse.etl import register_etl\n\n# from dataverse.utils.format import huggingface2parquet, load_huggingface_dataset\n\n\n@register_etl\ndef data_ingestion___csv___csv2raw(\n    spark, path: Union[str, List[str]], repartition: int = 20, verbose: bool = True, *args, **kwargs\n) -> RDD:\n    \"\"\"\n    Converts CSV data to raw RDD.\n\n    Args:\n        spark (SparkSession): The Spark session.\n        path (Union[str, List[str]]): The path(s) to the CSV file(s).\n        repartition (int, optional): The number of partitions for the RDD. Defaults to 20.\n        verbose (bool, optional): Whether to print the information of the dataset.\n\n    Returns:\n        RDD: The raw RDD containing the CSV data.\n    \"\"\"\n    if isinstance(path, str):\n        path = [path]\n\n    df = spark.read.csv(*path, header=True)\n    rdd = df.rdd.repartition(repartition)\n    rdd = rdd.map(lambda row: row.asDict())\n\n    return rdd\n"
  },
  {
    "path": "dataverse/etl/data_ingestion/cultura_x.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport json\n\nfrom pyspark.rdd import RDD\n\nfrom dataverse.etl import register_etl\nfrom dataverse.utils.format import get_uuidv1\n\n\n@register_etl\ndef data_ingestion___cultura_x___raw2ufl(spark, ufl: RDD, *args, **kwargs):\n    \"\"\"\n    Converts raw format to UFL with custom template.\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        ufl(RDD): The input DataFrame in raw format.\n\n    Returns:\n        RDD: The transformed DataFrame in UFL format.\n    \"\"\"\n\n    def templatev1(row):\n        new_row = {}\n        new_row[\"id\"] = get_uuidv1()\n        new_row[\"name\"] = \"cultura_x\"\n        new_row[\"text\"] = row[\"text\"]\n        new_row[\"meta\"] = json.dumps(\n            {\n                \"url\": row[\"url\"],\n                \"timestamp\": row[\"timestamp\"],\n                \"source\": row[\"source\"],\n            }\n        )\n        return new_row\n\n    ufl = ufl.map(lambda x: templatev1(x))\n\n    return ufl\n"
  },
  {
    "path": "dataverse/etl/data_ingestion/huggingface.py",
    "content": "\"\"\"\nLoad Huggingface data\n\nThis is used just to load huggingface dataset without any refomatting\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\nfrom typing import List, Union\n\nfrom pyspark.rdd import RDD\n\nfrom dataverse.etl import register_etl\nfrom dataverse.utils.format import huggingface2parquet, load_huggingface_dataset\n\n\n@register_etl\ndef data_ingestion___huggingface___hf2raw(\n    spark,\n    name_or_path: Union[str, List[str]],\n    split: int = None,\n    from_disk: bool = False,\n    repartition: int = 20,\n    verbose: bool = True,\n    *args,\n    **kwargs\n) -> RDD:\n    \"\"\"\n    Convert a HuggingFace dataset to raw format as a dictionary.\n\n    Args:\n        spark (SparkSession): The Spark session.\n        name_or_path (Union[str, List[str]]): The name or path of the HuggingFace dataset.\n        split(int, optional): The split of the dataset. Defaults to None.\n        from_disk(bool, optional): Whether to load from disk. Defaults to False.\n            No split is allowed when from_disk is True.\n        repartition(int, optional): The number of partitions. Defaults to 20.\n        verbose(bool, optional): Whether to print the information of the dataset. Defaults to True.\n\n    Returns:\n        rdd: The converted dataset as an RDD of dictionaries.\n    \"\"\"\n    dataset = load_huggingface_dataset(name_or_path, split=split, from_disk=from_disk)\n    parquet_path = huggingface2parquet(dataset, verbose=verbose)\n    df = spark.read.parquet(parquet_path)\n    rdd = df.rdd.repartition(repartition)\n    rdd = rdd.map(lambda row: row.asDict())\n\n    return rdd\n"
  },
  {
    "path": "dataverse/etl/data_ingestion/parquet.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\nfrom typing import List, Union\n\nfrom pyspark.rdd import RDD\n\nfrom dataverse.etl import register_etl\n\n\n@register_etl\ndef data_ingestion___parquet___pq2raw(\n    spark, path: Union[str, List[str]], repartition=20, *args, **kwargs\n) -> RDD:\n    \"\"\"\n    Reads parquet files into an RDD and repartitions it.\n\n    Args:\n        spark (SparkSession): The Spark session.\n        path (str or list): The path of the parquet files.\n        repartition (int): The number of partitions.\n\n    Returns:\n        rdd: The repartitioned RDD containing the data from the parquet files.\n    \"\"\"\n    if isinstance(path, str):\n        path = [path]\n\n    df = spark.read.parquet(*path)\n    rdd = df.rdd.repartition(repartition)\n    rdd = rdd.map(lambda row: row.asDict())\n    return rdd\n"
  },
  {
    "path": "dataverse/etl/data_ingestion/red_pajama.py",
    "content": "\"\"\"\nSupported datasets:\nhttps://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T\nhttps://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\nfrom typing import List, Union\n\nfrom dataverse.etl import register_etl\nfrom dataverse.utils.format import (\n    get_uuidv1,\n    huggingface2parquet,\n    load_huggingface_dataset,\n)\n\n\"\"\"\n1 stage data ingestion - default\n====================================\ndirect loading ufl with one ETL process\n\"\"\"\n\n\ndef convert2ufl(row):\n    row[\"id\"] = get_uuidv1()\n    row[\"name\"] = \"red_pajama\"\n    return row\n\n\n@register_etl\ndef data_ingestion___red_pajama___parquet2ufl(spark, input_paths, repartition=20, *args, **kwargs):\n    \"\"\"\n    convert parquet file to ufl\n    \"\"\"\n    df = spark.read.parquet(*input_paths)\n    rdd = df.rdd.repartition(repartition)\n    rdd = rdd.map(lambda row: row.asDict())\n    rdd = rdd.map(lambda x: convert2ufl(x))\n\n    return rdd\n\n\n@register_etl\ndef data_ingestion___red_pajama___hf2ufl(\n    spark,\n    name_or_path: Union[str, List[str]] = \"togethercomputer/RedPajama-Data-1T-Sample\",\n    split=None,\n    from_disk=False,\n    repartition=20,\n    verbose=True,\n    *args,\n    **kwargs\n):\n    \"\"\"\n    convert huggingface dataset to ufl\n\n    Args:\n        spark (SparkSession): spark session\n        name_or_path (str or list): the name or path of the huggingface dataset\n        split (str): the split of the dataset\n        from_disk (bool): whether to load from disk\n            - no split is allowed when from_disk is True\n        repartition (int): the number of partitions\n        verbose (bool): whether to print the information of the dataset\n    \"\"\"\n    dataset = load_huggingface_dataset(name_or_path, split=split, from_disk=from_disk)\n    parquet_path = huggingface2parquet(dataset, verbose=verbose)\n\n    df = spark.read.parquet(parquet_path)\n    rdd = df.rdd.repartition(repartition)\n    rdd = rdd.map(lambda row: row.asDict())\n    rdd = rdd.map(lambda x: convert2ufl(x))\n\n    return rdd\n\n\n\"\"\"\n2 stage data ingestion - default\n====================================\nloading ufl with custom template with two ETL process\n\"\"\"\n\n\n@register_etl\ndef data_ingestion___red_pajama___hf2raw(\n    spark,\n    name_or_path: Union[str, List[str]] = \"togethercomputer/RedPajama-Data-1T-Sample\",\n    split=None,\n    repartition=20,\n    verbose=True,\n    *args,\n    **kwargs\n):\n    \"\"\"\n    convert huggingface dataset to raw format as dict\n\n    Args:\n        spark (SparkSession): spark session\n        name_or_path (str or list): the name or path of the huggingface dataset\n        split (str): the split of the dataset\n        repartition (int): the number of partitions\n        verbose (bool): whether to print the information of the dataset\n    \"\"\"\n    dataset = load_huggingface_dataset(name_or_path, split=split)\n    parquet_path = huggingface2parquet(dataset, verbose=verbose)\n    df = spark.read.parquet(parquet_path)\n    rdd = df.rdd.repartition(repartition)\n    rdd = rdd.map(lambda row: row.asDict())\n\n    return rdd\n\n\n@register_etl\ndef data_ingestion___red_pajama___raw2ufl_templatev1(spark, ufl, *args, **kwargs):\n    \"\"\"\n    convert raw format to ufl with custom template\n    \"\"\"\n\n    def templatev1(row):\n        row[\"id\"] = get_uuidv1()\n        row[\"name\"] = \"red_pajama\"\n        return row\n\n    ufl = ufl.map(lambda x: templatev1(x))\n\n    return ufl\n\n\n@register_etl\ndef data_ingestion___red_pajama___raw2ufl_templatev2(spark, ufl, *args, **kwargs):\n    ...\n    return ufl\n"
  },
  {
    "path": "dataverse/etl/data_ingestion/slim_pajama.py",
    "content": "\"\"\"\nSupported datasets:\nhttps://huggingface.co/datasets/cerebras/SlimPajama-627B\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\nfrom typing import List, Union\n\nfrom dataverse.etl import register_etl\nfrom dataverse.utils.format import huggingface2parquet, load_huggingface_dataset\n\n\n@register_etl\ndef data_ingestion___slim_pajama___parquet2ufl(spark, input_paths, repartition=20, *args, **kwargs):\n    \"\"\"\n    convert parquet file to ufl\n    \"\"\"\n    df = spark.read.parquet(*input_paths)\n    rdd = df.rdd.repartition(repartition)\n    rdd = rdd.map(lambda row: row.asDict())\n    return rdd\n\n\n@register_etl\ndef data_ingestion___slim_pajama___hf2ufl(\n    spark,\n    name_or_path: Union[str, List[str]] = \"cerebras/SlimPajama-627B\",\n    split=None,\n    from_disk=False,\n    repartition=20,\n    verbose=True,\n    *args,\n    **kwargs\n):\n    \"\"\"\n    convert huggingface dataset to ufl\n\n    Args:\n        spark (SparkSession): spark session\n        name_or_path (str or list): the name or path of the huggingface dataset\n        split (str): the split of the dataset\n        from_disk (bool): whether to load from disk\n            - no split is allowed when from_disk is True\n        repartition (int): the number of partitions\n        verbose (bool): whether to print the information of the dataset\n    \"\"\"\n    dataset = load_huggingface_dataset(name_or_path, split=split, from_disk=from_disk)\n    parquet_path = huggingface2parquet(dataset, verbose=verbose)\n\n    df = spark.read.parquet(parquet_path)\n    rdd = df.rdd.repartition(repartition)\n    rdd = rdd.map(lambda row: row.asDict())\n    return rdd\n"
  },
  {
    "path": "dataverse/etl/data_ingestion/test.py",
    "content": "\"\"\"\nspecial purpose to create fake data for testing or debugging\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport json\n\nfrom faker import Faker\nfrom pyspark.rdd import RDD\n\nfrom dataverse.etl import register_etl\n\n\n@register_etl\ndef data_ingestion___test___generate_fake_ufl(\n    spark, n: int = 100, repartition: int = 20, verbose: bool = True, *args, **kwargs\n) -> RDD:\n    \"\"\"\n    Generate fake data for testing or debugging.\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        n (int, optional): The number of data to generate. Default is 100.\n        repartition (int, optional): The number of partitions. Default is 20.\n        verbose (bool, optional): Whether to print the information of the dataset. Default is True.\n\n    Returns:\n        RDD: The generated fake data RDD.\n    \"\"\"\n    faker = Faker()\n\n    def _generate_fake_ufl(n=100):\n        while n > 0:\n            n -= 1\n            yield {\n                \"id\": faker.uuid4(),\n                \"name\": \"test_fake_ufl\",\n                \"text\": faker.text(),\n                \"meta\": json.dumps(\n                    {\n                        \"name\": faker.name(),\n                        \"age\": faker.random_int(0, 100),\n                        \"address\": faker.address(),\n                        \"job\": faker.job(),\n                    }\n                ),\n            }\n\n    rdd = spark.sparkContext.parallelize(_generate_fake_ufl(n=n))\n    rdd = rdd.repartition(repartition)\n\n    return rdd\n"
  },
  {
    "path": "dataverse/etl/data_save/README.md",
    "content": "# Data Save\n> How to save data to the destination? In other words, how to save the data to the destination?\n\n\n## 🌌 Naming Convention\n- TBD\n\n## 🌌 Supported Data Save Method\n- AWS (S3)\n- HuggingFace (Dataset)\n- Parquet"
  },
  {
    "path": "dataverse/etl/data_save/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/data_save/aws.py",
    "content": "\"\"\"\nTODO: Data saving to AWS S3\n\nThis is not implemented yet.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\n# TODO\n"
  },
  {
    "path": "dataverse/etl/data_save/huggingface.py",
    "content": "\"\"\"\nData saving to Huggingface Datasets\n\nHuggingface support spark natively!\nhttps://huggingface.co/docs/datasets/use_with_spark\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport os\nfrom typing import Union\n\nfrom datasets import Dataset\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import register_etl\n\n\n@register_etl\ndef data_save___huggingface___ufl2hf_hub(spark, ufl, hub_path, repartition=1, *args, **kwargs):\n    \"\"\"\n    TODO: Save data to Hugging Face dataset and upload to hub.\n    \"\"\"\n    NotImplementedError()\n    return None\n\n\n@register_etl\ndef data_save___huggingface___ufl2hf(\n    spark, ufl: Union[RDD, DataFrame], save_path: str, repartition: int = 1, *args, **kwargs\n) -> str:\n    \"\"\"\n    Save data to HuggingFace dataset and return the path.\n\n    Args:\n        spark(sparkSession): The Spark session.\n        ufl(Union[RDD, DataFrame]):The input data to be saved.\n        save_path(str): The path to save the HF dataset.\n        repartition(int, optional): The number of partitions to repartition the data. Defaults to 1.\n\n    Raises:\n        ValueError: If the save_path already exists.\n        AssertionError: If ufl is not an RDD or DataFrame.\n\n    Returns:\n        str: The path where the HuggingFace dataset is saved.\n    \"\"\"\n\n    if os.path.exists(save_path):\n        raise ValueError(f\"save_path {save_path} already exists\")\n\n    if isinstance(ufl, RDD):\n        ufl = ufl.toDF()\n\n    assert isinstance(ufl, DataFrame), f\"ufl must be RDD or DataFrame, got {type(ufl)}\"\n\n    ufl = ufl.repartition(repartition)\n    hf_dataset = Dataset.from_spark(ufl)\n    hf_dataset.save_to_disk(save_path)\n\n    return save_path\n\n\n@register_etl\ndef data_save___huggingface___ufl2hf_obj(\n    spark, ufl: Union[RDD, DataFrame], repartition: int = 1, *args, **kwargs\n) -> Dataset:\n    \"\"\"\n    Convert data to HuggingFace dataset object.\n\n    Args:\n        spark(sparkSession): The Spark session.\n        ufl(Union[RDD, DataFrame]):The input data to be saved.\n        repartition(int, optional): The number of partitions to repartition the data. Defaults to 1.\n\n    Returns:\n        Dataset: The HuggingFace dataset object.\n\n    Raises:\n        AssertionError: If the input data is not RDD or DataFrame.\n    \"\"\"\n    if isinstance(ufl, RDD):\n        ufl = ufl.toDF()\n\n    assert isinstance(ufl, DataFrame), f\"ufl must be RDD or DataFrame, got {type(ufl)}\"\n\n    ufl = ufl.repartition(repartition)\n    hf_dataset = Dataset.from_spark(ufl)\n\n    return hf_dataset\n"
  },
  {
    "path": "dataverse/etl/data_save/parquet.py",
    "content": "\"\"\"\nData saving to Parquets\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport os\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import register_etl\n\n\n@register_etl\ndef data_save___parquet___ufl2parquet(\n    spark,\n    ufl: Union[RDD, DataFrame],\n    save_path: str,\n    repartition: int = 1,\n    *args,\n    **kwargs,\n) -> str:\n    \"\"\"\n    Save data to parquet and return the path.\n\n    Args:\n        spark(sparkSession): The Spark session.\n        ufl(Union[RDD, DataFrame]):The input data to be saved.\n        save_path(str): The path to save the HF dataset.\n        repartition(int, optional): The number of partitions to repartition the data. Defaults to 1.\n\n    Raises:\n        ValueError: If the save_path already exists.\n\n    Returns:\n        str: The path where the parquet file is saved.\n    \"\"\"\n    if os.path.exists(save_path):\n        raise ValueError(f\"save_path {save_path} already exists\")\n\n    if isinstance(ufl, RDD):\n        ufl = ufl.toDF()\n\n    assert isinstance(ufl, DataFrame), f\"ufl must be RDD or DataFrame, got {type(ufl)}\"\n\n    ufl = ufl.repartition(repartition)\n    ufl.write.parquet(save_path, mode=\"overwrite\")\n\n    return save_path\n"
  },
  {
    "path": "dataverse/etl/decontamination/README.md",
    "content": ""
  },
  {
    "path": "dataverse/etl/decontamination/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/deduplication/README.md",
    "content": "# Deduplication\n> Deduplication is the process of removing duplicate records from a dataset.\n\nNormally this is clustered in 2 big categories:\n- **Exact Deduplication**: remove exact duplicate records\n- **Fuzzy Deduplication**: remove records that are similar to each other\n\n☣️ **caveat**️ ☣️\n> When we cluster sub-categories with just 2 big categories, it seems waste of space. So here temporalily we cluster sub-categories with more detailed categories.\n\n- part of name of full name (e.g. minhash)\n- open source name\n- etc\n\nBut we will change this to much better cluster in the future. And we need your help!\n\n💡Any ideas are welcomed!💡\n\n\n## 🌌 Exact Deduplication\n> Exact Deduplication is the process of removing exact duplicate records from a dataset.\n\n## 🌌 Fuzzy Deduplication\n> Fuzzy Deduplication is the process of removing records that are similar to each other from a dataset.\n"
  },
  {
    "path": "dataverse/etl/deduplication/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/deduplication/common_crawl.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport functools\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql import functions as F\nfrom pyspark.sql.functions import collect_list, posexplode, split\n\nfrom dataverse.etl.registry import register_etl\n\n\ndef filter_lines(row, subset=\"text\"):\n    row = row.asDict()\n    text = row[subset]\n    line_ids = row[\"line_ids\"]\n\n    text_lines = text.split(\"\\n\")\n    filtered_texts = \"\\n\".join([text_lines[line_i] for line_i in sorted(line_ids)])\n\n    del row[\"line_ids\"]\n    row[subset] = filtered_texts\n\n    return row\n\n\n@register_etl\ndef deduplication___common_crawl___exact_line(\n    spark, data: Union[RDD, DataFrame], subset=\"text\", *args, **kwargs\n) -> RDD:\n    \"\"\"\n    Performs exact line by line deduplication on the given data.\n\n    Strip and lower is applied to the line text before deduplication\n    but this will not be applied to the original text.\n\n    Examples:\n        - input\n\n            +--------+\n            |    text|\n            +========+\n            |   DuckY|\n            +--------+\n            |   dUKCY|\n            +--------+\n\n        - output\n\n            +--------+\n            |    text|\n            +========+\n            |   DuckY|\n            +--------+\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be deduplicated..\n        subset (str, optional): A subset or column to consider. Defaults to 'text'.\n\n    Returns:\n        rdd: The deduplicated data.\n\n    Raises:\n        AssertionError: If the input data is not a DataFrame.\n    \"\"\"\n    if isinstance(data, RDD):\n        data = data.toDF()\n\n    data = data.cache()\n    data = data.withColumn(\"__id__\", F.monotonically_increasing_id())\n\n    assert isinstance(data, DataFrame), f\"data must be DataFrame, got {type(data)}\"\n    line_data = data.select(\n        \"__id__\", posexplode(split(data[subset], \"\\n\")).alias(\"line_id\", \"line\")\n    )\n    line_data = line_data.withColumn(\"line\", F.lower(F.trim(line_data[\"line\"])))\n    line_data = line_data.dropDuplicates(subset=[\"line\"])\n    line_data = line_data.groupBy(\"__id__\").agg(collect_list(\"line_id\").alias(\"line_ids\"))\n\n    merged_data = data.join(line_data, on=[\"__id__\"], how=\"inner\")\n    data.unpersist()\n    line_data.unpersist()\n\n    # remove __id__\n    merged_data = merged_data.drop(\"__id__\")\n\n    # filter the lines using the line_ids\n    merged_data = merged_data.rdd.map(functools.partial(filter_lines, subset=subset))\n\n    return merged_data\n"
  },
  {
    "path": "dataverse/etl/deduplication/exact.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\n\nfrom typing import List, Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef deduplication___exact___column(\n    spark, data: Union[RDD, DataFrame], subset: List[str] = [\"text\"], *args, **kwargs\n):\n    \"\"\"\n    Exact column deduplication\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be deduplicated..\n        subset(List[str]): Subset of columns to consider for duplication check. Default to ['text'].\n\n    Returns:\n        Deduplicated DataFrame object\n    \"\"\"\n    if isinstance(data, RDD):\n        data = data.toDF()\n\n    assert isinstance(data, DataFrame), f\"data must be DataFrame, got {type(data)}\"\n    data = data.dropDuplicates(subset=subset)\n    return data\n"
  },
  {
    "path": "dataverse/etl/deduplication/minhash.py",
    "content": "\"\"\"\nCode is from ChenghaoMou/text-dedup\nhttps://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py\n\nThis is a migration of the code to Dataverse.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport hashlib\nimport functools\nimport re\nimport os\nimport struct\nimport sys\nfrom itertools import tee\nfrom operator import add\nfrom typing import Any, List, Text, Tuple, Union\n\nimport numpy as np\nimport pyspark\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame, SparkSession\nfrom pyspark.sql import functions as F\nfrom pyspark.sql import types as T\nfrom pyspark.ml.feature import NGram, RegexTokenizer\nfrom scipy.integrate import quad as integrate\n\nfrom dataverse.etl.registry import register_etl\n\n# region: Connected Components in MapReduce and Beyond, 2014\ndef generate_edges(nodes: List[int]) -> List[Tuple[int, int]]:\n    \"\"\"\n    Generate edges from a cluster. Instead of generating N^2 edges, we only need all nodes align to a single node, since\n    we will be running connected components on the edges later.\n\n    Parameters\n    ----------\n    nodes : List[int]\n        The list of nodes in the cluster.\n\n    Returns\n    -------\n    List[Tuple[int, int]]\n        The list of edges.\n\n    Examples\n    --------\n    >>> generate_edges([1, 2, 3])\n    [(2, 1), (3, 1)]\n    \"\"\"\n    if len(nodes) <= 1:\n        return []\n\n    min_node = min(nodes)\n    return [(n, min_node) for n in nodes if n != min_node]\n\n\ndef get_hash(text: str, n_bytes: int=8):\n    return int.from_bytes(\n        hashlib.sha1(text.encode(\"utf-8\")).digest()[:n_bytes], \n        sys.byteorder\n    ) \n\n\ndef get_signatures(\n    shingles: List[str], \n    band_n: int, \n    row_per_band: int, \n    mod_prime: int, \n    hash_params: Tuple[np.ndarray]\n):\n    if not shingles:\n        return []\n    \n    shingles = np.array(\n        [get_hash(shingle) for shingle in set(shingles)], \n        dtype=np.uint64\n    )\n\n    signatures = np.full(\n        shape=(band_n * row_per_band), \n        fill_value=mod_prime, \n        dtype=np.uint64\n    )\n\n    chunk_size = 2 ** 10\n    a, b = hash_params\n    for i in range(0, len(shingles), chunk_size):\n        shingles_chunk = shingles[i:i+chunk_size]\n        signatures = np.minimum(\n            signatures, \n            np.min((shingles_chunk.reshape(-1, 1) * a + b) % mod_prime, axis=0)\n        )\n\n    return [\n        f\"{idx:02d}\" \\\n        + signatures[i*row_per_band:(i+1)*row_per_band].tobytes().hex() \n        for idx, i in enumerate(range(band_n))\n    ]\n\n\n# region: MinHashLSH\ndef optimal_param(\n    threshold: float,\n    num_perm: int,\n    false_positive_weight: float = 0.5,\n    false_negative_weight: float = 0.5,\n):\n    \"\"\"\n    Compute the optimal `MinHashLSH` parameter that minimizes the weighted sum\n    of probabilities of false positive and false negative, taken from datasketch.\n\n    Parameters\n    ----------\n    threshold : float\n        The threshold for similarity.\n    num_perm : int\n        The number of permutations.\n    false_positive_weight : float\n        The weight of false positive.\n    false_negative_weight : float\n        The weight of false negative.\n\n    Returns\n    -------\n    Tuple[int, int]\n        The optimal `b` and `r` parameters.\n        The number of bands, and the number of rows per band respectively.\n\n    Examples\n    --------\n    >>> optimal_param(0.7, 256)\n    (25, 10)\n    \"\"\"\n\n    def false_positive_area(threshold: float, b: int, r: int):\n        \"\"\"Source: `datasketch.lsh`\"\"\"\n\n        def area(s):\n            return 1 - (1 - s ** float(r)) ** float(b)\n\n        a, _ = integrate(area, 0.0, threshold)\n        return a\n\n    def false_negative_area(threshold: float, b: int, r: int):\n        \"\"\"Source: `datasketch.lsh`\"\"\"\n\n        def area(s):\n            return 1 - (1 - (1 - s ** float(r)) ** float(b))\n\n        a, _ = integrate(area, threshold, 1.0)\n        return a\n\n    min_error = float(\"inf\")\n    opt = (0, 0)\n    for b in range(1, num_perm + 1):\n        max_r = int(num_perm / b)\n        for r in range(1, max_r + 1):\n            fp = false_positive_area(threshold, b, r)\n            fn = false_negative_area(threshold, b, r)\n            error = fp * false_positive_weight + fn * false_negative_weight\n            if error < min_error:\n                min_error = error\n                opt = (b, r)\n    return opt\n\n# region: Quality Control\ndef process_cluster(cluster: List[Any]) -> List[Any]:\n    return cluster[:1]\n\n@register_etl\ndef deduplication___minhash___lsh_jaccard(\n    spark: SparkSession,\n    data: Union[RDD, DataFrame],\n    threshold: float = 0.7,\n    ngram_size: int = 5,\n    min_length: int = 5,\n    num_perm: int = 250,\n    band_n: int = None,\n    row_per_band: int = None,\n    id_col: Union[str, None] = None,\n    subset: str = \"text\",\n    seed: int = 42,\n    duplicates_save_path: Union[str, None] = None,\n    *args,\n    **kwargs,\n) -> RDD:\n    \"\"\"\n    Fuzzy deduplication using MinHash and Locality Sensitive Hashing (LSH).\n\n    Args:\n        spark (SparkSession): The SparkSession object.\n        data (Union[RDD, DataFrame]): Input data to be deduplicated.\n        threshold (float, optional): Similarity threshold. Default is 0.7.\n        ngram_size (int, optional): Size of n-grams. Default is 5.\n        min_length (int, optional): Minimum token length of document to be considered. Default is 5.\n        num_perm (int, optional): Number of permutations. Default is 250.\n        band_n (int, optional): Number of bands. If not provided, it will be calculated based on the threshold and num_perm.\n        row_per_band (int, optional): Number of rows per band. If not provided, it will be calculated based on the threshold and num_perm.\n        id_col (str, optional): Key column for extract duplicated rows. If not provided, temporary id column will be created.\n        subset (str, optional): Column to deduplicate on. Default is \"text\".\n        seed (int, optional): Random seed. Default is 42.\n        duplicates_save_path (str, optional): Save path for duplicated entries. If not provided, not saving the duplicates.\n\n    Returns:\n        RDD: Deduplicated data as a DataFrame.\n    \"\"\"\n    spark.sparkContext.setCheckpointDir(\"checkpoint\")\n    from graphframes import GraphFrame\n\n    if isinstance(data, RDD):\n        data_df = data.toDF()\n    elif isinstance(data, DataFrame):\n        data_df = data\n\n    if (\n        duplicates_save_path is not None \n        and os.path.exists(duplicates_save_path)\n    ):\n        assert \"duplicates_save_path already exists.\"\n\n    temp_id_col, component_col, tokens_col, ngrams_col = \\\n        \"__id__\", \"__component__\", \"__tokens__\", \"__ngrams__\"\n    \n    exist_cols = set(data_df.columns)\n    while True:\n        if temp_id_col in exist_cols:\n            temp_id_col += \"_\"\n        elif component_col in exist_cols:\n            component_col += \"_\"\n        elif tokens_col in exist_cols:\n            tokens_col += \"_\"\n        elif ngrams_col in exist_cols:\n            ngrams_col += \"_\"\n        else:\n            break\n\n    if id_col is None:\n        id_col = temp_id_col\n        print(f\"create temp id col: {id_col}\")\n        data_df = data_df.withColumn(id_col, F.monotonically_increasing_id())\n        data_df.persist(pyspark.StorageLevel.DISK_ONLY)\n\n    if band_n is None or row_per_band is None:\n        band_n, row_per_band = optimal_param(threshold, num_perm)\n\n    mod_prime = 1 << 61 - 1 \n    gen = np.random.RandomState(seed)\n    hash_params = (\n        gen.randint(1, mod_prime, dtype=np.uint64, size=band_n * row_per_band),\n        gen.randint(0, mod_prime, dtype=np.uint64, size=band_n * row_per_band),\n    )\n\n    subset_type: str = [t for c, t in data_df.dtypes if c == subset][0]\n    if subset_type.startswith(\"str\"):\n        # assume subset col should be tokenized\n        tokens_df = RegexTokenizer(\n            inputCol=subset, \n            outputCol=tokens_col,\n            pattern=\"\\\\W\"\n        ).transform(\n            data_df\n            .select(id_col, F.col(subset).substr(1, 10_000_000).alias(subset)) \n        ).select(\n            id_col, tokens_col\n        ).filter(\n            F.size(tokens_col) >= min_length\n        )\n    elif subset_type.startswith(\"array\"):\n        print(\"already tokenized.\")\n        tokens_col = subset\n        tokens_df = data_df.select(id_col, tokens_col)\n\n    shingles_df = NGram(\n        n=ngram_size, \n        inputCol=tokens_col, \n        outputCol=ngrams_col\n    ).transform(tokens_df).select(id_col, ngrams_col)\n\n    sig_udf = F.udf(\n        functools.partial(\n            get_signatures,\n            band_n=band_n,\n            row_per_band=row_per_band,\n            mod_prime=mod_prime,\n            hash_params=hash_params\n        ), \n        returnType=T.ArrayType(T.StringType())\n    )\n    signature_df = (\n        shingles_df\n        .select(id_col, F.explode(sig_udf(ngrams_col)).alias(\"band\"))\n        .groupby(\"band\")\n        .agg(\n            F.collect_set(id_col).alias(\"ids\")\n        )\n    )\n\n    edge_udf = F.udf(\n        generate_edges, \n        returnType=T.ArrayType(T.ArrayType(data_df.schema[id_col].dataType))\n    )\n    edges_df = (\n        signature_df\n        .select(\"ids\")\n        .filter(F.size(\"ids\") > 1)\n        .select(F.explode(edge_udf(\"ids\")).alias(\"edges\"))\n        .distinct()\n        .selectExpr(\"edges[0] as src\", \"edges[1] as dst\")\n    ).persist(pyspark.StorageLevel.DISK_ONLY)\n\n    count = edges_df.count()\n    if count == 0:\n        print(\"no entry for deduplication.\")\n        edges_df.unpersist()\n        data_df.unpersist()\n        return data\n    \n    vertices_df = (\n        edges_df\n        .selectExpr(\"src as id\")\n        .union(edges_df.selectExpr(\"dst as id\"))\n        .distinct()\n    )\n\n    assignment = (\n        GraphFrame(vertices_df, edges_df)\n        .connectedComponents(broadcastThreshold=200 * (1024 ** 2))\n    )\n\n    join_df = data_df.join(\n        assignment.select(\n            F.col(\"id\").alias(id_col), \n            F.col(\"component\").alias(component_col)\n        ),\n        on=id_col,\n        how=\"left\"\n    )\n\n    if duplicates_save_path is not None:\n        duplicates_df = (\n            join_df\n            .filter(F.col(component_col).isNotNull())\n            .drop(ngrams_col)\n        )\n        if id_col == temp_id_col:\n            duplicates_df = duplicates_df.drop(id_col)\n        if tokens_col != subset:\n            duplicates_df = duplicates_df.drop(tokens_col)\n        \n        duplicates_df.write.parquet(duplicates_save_path)\n        duplicates_df.unpersist()\n\n    final_df = (\n        join_df\n        .filter(F.col(component_col).isNull())\n        .union(\n            join_df\n            .filter(F.col(component_col).isNotNull())\n            .dropDuplicates([component_col])\n        )\n        .drop(component_col, ngrams_col)\n    )\n\n    if id_col == temp_id_col:\n        final_df = final_df.drop(id_col)\n    if tokens_col != subset:\n        final_df = final_df.drop(tokens_col)\n\n    edges_df.unpersist()\n    return final_df.rdd"
  },
  {
    "path": "dataverse/etl/deduplication/polyglot.py",
    "content": "\"\"\"\nCode is from EleutherAI/dps\nhttps://github.com/EleutherAI/dps/blob/master/dps/spark/jobs/dedup_job.py\n\nThis is a migration of the deduplication job from the DPS project to the Dataverse.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport binascii\nimport random\nfrom itertools import combinations\nfrom typing import List, Union\n\nimport numpy as np\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import register_etl\n\nMERSENNE_PRIME = (1 << 61) - 1\nMAX_HASH = (1 << 32) - 1\nHASH_RANGE = 1 << 32\n\n\ndef shingle_word(text: str, n_gram: int = 15, char_level: bool = False) -> List[str]:\n    \"\"\"\n    example\n    -------\n    >>> shingle_word(\"hello world from ducky\", n_gram=2)\n    ['hello_world', 'world_from', 'from_ducky']\n\n    >>> shingle_word(\"hello world from ducky\", n_gram=2, char_level=True)\n    ['h_e', 'e_l', 'l_l', 'l_o', 'o_w', 'w_o', 'o_r', 'r_l', 'l_d', 'd_f', 'f_r', 'r_o', 'o_m', 'm_d', 'd_u', 'u_c', 'c_k', 'k_y']\n    \"\"\"\n    res = []\n    text_words = text.split() if not char_level else text\n\n    for i in range(len(text_words)):\n        shingle = text_words[i : i + n_gram]\n\n        if len(shingle) == n_gram:\n            res.append(\"_\".join(shingle).encode(\"utf-8\"))\n\n    return res\n\n\ndef generate_minhash(shingles: List, num_perm: int = 64, seed: int = 1) -> np.array:\n    def hashfunc(b: bytes) -> bytes:\n        return binascii.crc32(b) & MAX_HASH\n\n    hashvalues = np.ones(num_perm, dtype=np.uint64) * MAX_HASH\n\n    generator = np.random.RandomState(seed)\n    permutations = np.array(\n        [\n            (\n                generator.randint(1, MERSENNE_PRIME, dtype=np.uint64),\n                generator.randint(0, MERSENNE_PRIME, dtype=np.uint64),\n            )\n            for _ in range(num_perm)\n        ],\n        dtype=np.uint64,\n    ).T\n\n    for shingle in shingles:\n        hv = hashfunc(shingle)\n        a, b = permutations\n        phv = np.bitwise_and((a * hv + b) % MERSENNE_PRIME, np.uint64(MAX_HASH))\n        hashvalues = np.minimum(phv, hashvalues)\n\n    return hashvalues\n\n\ndef jaccard_by_hashvalues(src_hashvalues, tgt_hashvalues) -> float:\n    if len(src_hashvalues) != len(tgt_hashvalues):\n        raise ValueError()\n\n    return np.float(np.count_nonzero(src_hashvalues == tgt_hashvalues)) / np.float(\n        len(src_hashvalues)\n    )\n\n\ndef expand_instances_by_minhash(\n    data, expand_size: int, n_gram: int, seed: int = 1, char_level: bool = False\n):\n    shingles = shingle_word(data[\"text\"], n_gram=n_gram, char_level=char_level)\n    minhashes = generate_minhash(shingles, num_perm=expand_size, seed=seed)\n\n    for mh in minhashes.tolist():\n        yield (str(mh), [dict(**data, shingles=shingles, hashvalues=minhashes)])\n\n\ndef explore_dedup_instance(hash_groups, threshold: float = 0.8):\n    if len(hash_groups) <= 1:\n        return\n\n    group_represent_text = hash_groups[0][\"text\"]  # not to remove all text instances in group.\n    pairs = combinations(hash_groups, 2)\n\n    for d_1, d_2 in pairs:\n        sim_score = jaccard_by_hashvalues(d_1[\"hashvalues\"], d_2[\"hashvalues\"])\n        if sim_score >= threshold:\n            dedup_text = [d_1[\"text\"], d_2[\"text\"]]\n            if group_represent_text in dedup_text:\n                yield dedup_text[0] if dedup_text[0] != group_represent_text else dedup_text[1]\n            else:\n                yield random.choice(dedup_text)\n\n\n@register_etl\ndef deduplication___polyglot___minhash(\n    spark,\n    data: Union[RDD, DataFrame],\n    expand_size: int = 64,\n    n_gram: int = 15,\n    seed: int = 1,\n    char_level: bool = False,\n    sim_threshold: float = 0.8,\n    *args,\n    **kwargs,\n):\n    \"\"\"\n    Fuzzy deduplication using MinHash algorithm.\n\n    Args:\n        spark (SparkSession): The SparkSession object.\n        data (Union[RDD, DataFrame]): The input data to be deduplicated.\n        expand_size (int, optional): The size of expansion for each instance. Defaults to 64.\n        n_gram (int, optional): The size of n-gram for tokenization. Defaults to 15.\n        seed (int, optional): The seed value for random number generation. Defaults to 1.\n        char_level (bool, optional): Whether to use character-level tokenization. Defaults to False.\n        sim_threshold (float, optional): The similarity threshold for deduplication. Defaults to 0.8.\n        *args: Additional positional arguments.\n        **kwargs: Additional keyword arguments.\n\n    Returns:\n        RDD or DataFrame: The deduplicated data.\n\n    Raises:\n        None\n\n    Examples:\n        >>> deduplication___polyglot___minhash()(spark, data, expand_size=128, sim_threshold=0.9)\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    overlap_kv_rdd: RDD = (\n        data.flatMap(\n            lambda x: expand_instances_by_minhash(\n                x,\n                expand_size=expand_size,\n                n_gram=n_gram,\n                seed=seed,\n                char_level=char_level,\n            )\n        )\n        .reduceByKey(lambda x, y: x + y)\n        .flatMap(lambda x: explore_dedup_instance(x[1], threshold=sim_threshold))\n        .distinct()\n        .map(lambda x: (x, dict(text=x)))\n        .cache()\n    )\n\n    data = data.map(lambda x: (x[\"text\"], x)).subtractByKey(overlap_kv_rdd).map(lambda x: x[1])\n    return data\n"
  },
  {
    "path": "dataverse/etl/pii/README.md",
    "content": "# PII (Personally Identifiable Information)\n> Replacing, Removing, and Anonymizing PII\n\n\n## 🌌 Naming Convention\n> This is a strong recommendation. You can use your own naming convention if you want.\n\n```python\ndef cleaning___[ETL Sub-Category]___[ETL Process]()\n```\n\n- `ETL Sub-Category` - the `PII` type\n    - e.g. card number\n    - e.g. email\n    - e.g. phone number\n- `ETL process name` - what you are doing to the `PII`\n    - e.g. remove\n    - e.g. replace"
  },
  {
    "path": "dataverse/etl/pii/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/pii/card.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport random\nimport re\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef pii___card___replace_card_number(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset: str = \"text\",\n    pattern: str = r\"(\\d{4}-\\d{4}-\\d{4}-\\d{4})\",\n    random_pii: bool = True,\n    replace_pii: bool = False,\n    replace_token: str = \"[CARD_NUMBER]\",\n    start_token: str = \"\",\n    end_token: str = \"\",\n    *args,\n    **kwargs,\n) -> RDD:\n    r\"\"\"\n    Replace card number with a random number or a token\n\n    Args:\n        spark: The SparkSession object.\n        data (Union[RDD, DataFrame]): The input data to process.\n        subset (str, optional): The subset or columns to consider. Defaults to 'text'.\n        pattern (str, optional): The regex pattern to find. Defaults to r'(\\d{4}-\\d{4}-\\d{4}-\\d{4})'.\n        random_pii (bool, optional): If True, replace the pii with a random number. Defaults to True.\n        replace_pii (bool, optional): If True, replace the pii with the `replace_token`. Defaults to False.\n        replace_token (str, optional): The token to replace the pii with. Defaults to '[CARD_NUMBER]'.\n        start_token (str, optional): The start token to append where the pattern is found. Defaults to ''.\n        end_token (str, optional): The end token to append where the pattern is found. Defaults to ''.\n\n    Returns:\n        RDD: The processed data.\n\n    Caveats:\n        - `replace_pii` takes precedence over `random_pii`\n            - e.g when both are True, the card number will be replaced with the token\n            - e.g. this is 1234-1234-1234-1234 -> this is [CARD_NUMBER]\n        - `start_token` and `end_token` are used to append the token to the start and end of the card number\n            - it doens't matter with `random_card_number` or `replace_card_number` is True or False\n\n    Examples:\n        <input>\n            - text = 'card number is 1234-1234-1234-1234.'\n\n        <output>\n            - random pii\n                - text = 'card number is 2238-1534-1294-1274.'\n            - replace pii\n                - replace_token = '[CARD_NUMBER]'\n                - text = 'card number is [CARD_NUMBER].'\n            - start token\n                - start_token = '[CARD_NUMBER_START]'\n                - text = 'card number is [CARD_NUMBER_START]1234-1234-1234-1234.'\n            - end token\n                - end_token = '[CARD_NUMBER_END]'\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _replace_match(match):\n        match = match.group()\n        if replace_pii:\n            match = replace_token\n        elif random_pii:\n            match = re.sub(r\"\\d\", lambda x: str(random.randint(0, 9)), match)\n\n        return f\"{start_token}{match}{end_token}\"\n\n    def _replace_pii(row):\n        row[subset] = re.sub(pattern, _replace_match, row[subset])\n        return row\n\n    data = data.map(_replace_pii)\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/pii/nin.py",
    "content": "\"\"\"\nNIN (National Identification Number)\n=====================================\nA national identification number, national identity number, or\nnational insurance number or JMBG/EMBG is used by the governments\nof many countries as a means of tracking their citizens, permanent residents,\nand temporary residents for the purposes of work, taxation,\ngovernment benefits, health care, and other governmentally-related functions.\n\nhttps://en.wikipedia.org/wiki/National_identification_number\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport random\nimport re\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef pii___nin___replace_korean_rrn(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset: str = \"text\",\n    pattern: str = r\"\\d{6}-\\d{7}\",\n    random_pii: bool = True,\n    replace_pii: bool = False,\n    replace_token: str = \"[NIN]\",\n    start_token: str = \"\",\n    end_token: str = \"\",\n    *args,\n    **kwargs,\n) -> RDD:\n    r\"\"\"\n    Replace Korean RRN (Resident Registration Number) with a random number or a token\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data(Union[RDD, DataFrame]): The input data to be processed.\n        subset(str, optional): A subset or column to consider. Defaults to 'text'.\n        pattern(str, optional): The regex pattern to find. Defaults to r'\\d{6}-\\d{7}'.\n        random_pii(str, optional): If True, replace the pii with a random number. Defaults to True.\n        replace_pii(bool, optional): If True, replace the pii with the `replace_token`. Defaults to False.\n        replace_token(bool, optional): The token to replace the pii with. Defaults to '[NIN]'.\n        start_token(str, optional): The start token to append where the pattern is found. Defaults to ''.\n        end_token(str, optional): The end token to append where the pattern is found. Defaults to ''.\n\n    Returns:\n        rdd: The processed data with replaced Korean RRN.\n\n    Caveats:\n        - `replace_pii` takes precedence over `random_pii`\n        - `start_token` and `end_token` are used to append the token to the start and end of the number\n            - it doens't matter with `random_pii` or `replace_pii` is True or False\n\n\n    Examples:\n        <input>\n            - text = 'nin is 123456-1234567'\n\n        <output>\n            - random pii\n                - text = 'nin is 141124-1244121'\n            - replace pii\n                - replace_token = '[NIN]'\n                - text = 'nin is [NIN].'\n            - start token\n                - start_token = '[NIN_START]'\n                - text = 'nin is [NIN_START]123456-1234567'\n            - end token\n                - end_token = '[NIN_END]'\n                - text = 'nin is 123456-1234567[NIN_END].'\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _replace_match(match):\n        match = match.group()\n        if replace_pii:\n            match = replace_token\n        elif random_pii:\n            match = re.sub(r\"\\d\", lambda x: str(random.randint(0, 9)), match)\n\n        return f\"{start_token}{match}{end_token}\"\n\n    def _replace_pii(row):\n        row[subset] = re.sub(pattern, _replace_match, row[subset])\n        return row\n\n    data = data.map(_replace_pii)\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/pipeline.py",
    "content": "\"\"\"\nETL Interface\n----------------------\nuser will be interacting with this interface\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport time\nfrom pathlib import Path\nfrom typing import Union\n\nimport boto3\nfrom omegaconf import DictConfig, OmegaConf\nfrom pyspark.conf import SparkConf\nfrom pyspark.sql import SparkSession\n\nfrom dataverse.config import Config\nfrom dataverse.etl import ETLRegistry\nfrom dataverse.utils.api import AWSClient, EMRManager, aws_check_credentials\nfrom dataverse.utils.setting import SystemSetting\n\n\nclass ETLPipeline:\n    \"\"\"\n    ETL Pipeline.\n\n    This class represents an ETL (Extract, Transform, Load) pipeline.\n    It provides methods for managing and executing ETL processes.\n\n    Attributes:\n        registry (ETLRegistry): The registry of ETL processes.\n\n    Examples:\n        >>> etl_pipeline = ETLPipeline()\n        >>> etl_pipeline.status()\n        >>> etl_pipeline.search('data_ingestion', 'ufl')\n        >>> spark, data = etl_pipeline.sample()\n\n        >>> config = Config.default()\n        >>> etl_pipeline.run(config = config)\n    \"\"\"\n\n    def __init__(self):\n        self.registry = ETLRegistry()\n\n    def __len__(self):\n        return len(self.registry)\n\n    def status(self):\n        \"\"\"\n        Get the status of the registry.\n\n        Returns:\n            str: The status of the registry.\n\n        Raises:\n            None\n\n        Examples:\n            >>> etl_pipeline = EtlPipeline()\n            >>> etl_pipeline.status()\n            'If you need details of ETL Registry use `etl_pipeline.search()`'\n\n        Note:\n            This method does not show detailed information.\n            It will only info about category .\n        \"\"\"\n        print(\"If you need details of ETL Registry use `etl_pipeline.search()`\")\n        return str(self.registry)\n\n    def search(self, category=None, sub_category=None):\n        \"\"\"\n        Get detailed status of the registry by searching.\n\n        This function lets you know category, sub_category, and etl_name.\n\n        Args:\n            category (str, optional): The category to filter the search results. Defaults to None.\n            sub_category (str, optional): The sub-category to filter the search results. Defaults to None.\n\n        Returns:\n            list: A list of search results matching the specified category and sub-category.\n\n        Examples:\n            Return every ETL\n            \n            >>> etl_pipeline.search() \n            \n            Only selected category\n            \n            >>> etl_pipeline.search('data_ingestion')\n            >>> etl_pipeline.search(category='data_ingestion')\n            \n            Only selected category & sub_category\n            \n            >>> etl_pipeline.search('data_ingestion', 'ufl')\n            >>> etl_pipeline.search(category='data_ingestion', sub_category='ufl')\n        \"\"\"\n        return self.registry.search(category=category, sub_category=sub_category)\n\n    def get(self, key):\n        \"\"\"get ETL class from registry\"\"\"\n        return self.registry.get(key=key)\n\n    def setup_spark_conf(self, config, verbose=False):\n        \"\"\"\n        AWS credential setting log is not influenced by the verbose by design\n        \"\"\"\n\n        # TODO: add more spark configurations\n        spark_conf = SparkConf()\n        spark_conf.set(\"spark.master\", config.spark.master)\n        spark_conf.set(\"spark.app.name\", config.spark.appname)\n        spark_conf.set(\"spark.driver.memory\", config.spark.driver.memory)\n        spark_conf.set(\"spark.driver.maxResultSize\", config.spark.driver.maxResultSize)\n        spark_conf.set(\"spark.executor.memory\", config.spark.executor.memory)\n        spark_conf.set(\"spark.local.dir\", config.spark.local.dir)\n        spark_conf.set(\"spark.ui.port\", config.spark.ui.port)\n        spark_conf.set(\"spark.jars.packages\", \"graphframes:graphframes:0.8.3-spark3.5-s_2.12\")\n        \n        # AWS S3 Support\n        if aws_check_credentials(verbose=verbose):\n            session = boto3.Session()\n            credentials = session.get_credentials()\n\n            spark_conf.set(\"spark.hadoop.fs.s3a.access.key\", credentials.access_key)\n            spark_conf.set(\"spark.hadoop.fs.s3a.secret.key\", credentials.secret_key)\n            spark_conf.set(\"spark.hadoop.fs.s3a.impl\", \"org.apache.hadoop.fs.s3a.S3AFileSystem\")\n\n            hadoop_ver = SystemSetting().get(\"HADOOP_VERSION\")\n            spark_conf.set(\n                \"spark.jars.packages\",\n                (\n                    f\"org.apache.hadoop:hadoop-aws:{hadoop_ver}\"\n                    f\",com.amazonaws:aws-java-sdk-bundle:1.12.592\"\n                ),\n            )\n\n            # check if the credentials are temporary or not\n            try:\n                spark_conf.set(\"spark.hadoop.fs.s3a.session.token\", credentials.token)\n                spark_conf.set(\n                    \"spark.hadoop.fs.s3a.aws.credentials.provider\",\n                    \"org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider\",\n                )  # this is for temporary credentials\n                print(\"spark conf is set with [ temporary ] S3 credentials\")\n            except Exception:\n                print(\"spark conf is set with [ permanent ] S3 credentials\")\n\n        else:\n            print(\"[ No AWS Credentials Found] - Failed to set spark conf for S3\")\n\n        return spark_conf\n\n    def sample(\n        self,\n        n=100,\n        config=None,\n        sample_etl=\"data_ingestion___test___generate_fake_ufl\",\n        verbose=False,\n    ):\n        \"\"\"\n        Get the spark session and sample data.\n\n        Use this function to test the ETL pipeline quickly without config.\n\n        Args:\n            n (int): The number of data to generate. Default is 100.\n            config (Union[str, dict, OmegaConf]): Config for the ETL. Default is None.\n            sample_etl (str): The name of the sample ETL process. Default is \"data_ingestion___test___generate_fake_ufl\".\n            verbose (bool): If True, print the status. Default is False.\n\n        Returns:\n            Tuple[SparkSession, DataFrame]: The Spark session and the sampled data.\n        \"\"\"\n        if config is None:\n            config = Config.default()\n        else:\n            config = Config.load(config)\n            config = Config.set_default(config)\n\n            # remove all the ETL processes\n            config.etl = []\n\n        config.etl.append({\"name\": sample_etl, \"args\": {\"n\": n}})\n        if verbose:\n            print(\"=\" * 50)\n            print(\"[ Configuration ]\")\n            print(OmegaConf.to_yaml(config))\n            print(\"=\" * 50)\n\n        spark_conf = self.setup_spark_conf(config, verbose=verbose)\n        spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()\n        if verbose:\n            print(\"=\" * 50)\n            print(\"[ Spark Final Configuration ]\")\n            print(OmegaConf.to_yaml(spark_conf.getAll()))\n            print(\"=\" * 50)\n\n        sample_etl_class = self.get(key=sample_etl)\n        data = sample_etl_class()(spark, n=n, etl_name=sample_etl)\n\n        if verbose:\n            print(\n                (\n                    f\"{'=' * 50}\\n\"\n                    \"[ SAMPLE MODE ]\\n\"\n                    f\"{'=' * 50}\\n\"\n                    \"This is a quick way to get the sample data for testing or debugging w/o config.\\n\"\n                    \"If you want to test the ETL pipeline with your own data, please use `run` w/ config.\\n\"\n                    f\"{'=' * 50}\\n\"\n                    \"=> spark, data = etl_pipeline.sample()\\n\"\n                    \"=> data = data.map(add awesome duck to column)\\n\"\n                    f\"{'=' * 50}\\n\"\n                )\n            )\n\n        return spark, data\n\n    def run(\n        self,\n        config: Union[str, dict, DictConfig, OmegaConf, Path],\n        verbose=False,\n        cache=False,\n        emr=False,\n        *args,\n        **kwargs,\n    ):\n        \"\"\"\n        Runs the ETL process.\n\n        Args:\n            config (Union[str, dict, OmegaConf]): config for the etl\n                - str: path to the config file\n                - dict: config dict\n                - OmegaConf: config object\n            verbose (bool): if True, print the status of the etl pipeline\n                - the verbose will be applied to the ETL process as well\n                - ETL process `verbose` takes precedence over this\n            cache (bool): cache every stage of the ETL process\n            emr (bool): if True, run the ETL process on EMR\n        \"\"\"\n        # ================ [ EMR ] ===================\n        if emr:\n            return self.run_emr(\n                config,\n                verbose=verbose,\n                cache=cache,\n                *args,\n                **kwargs,\n            )\n\n        # =============== [ Set Config ] ==================\n        # mainly this is to fill the missing config args with default\n        config = Config.load(config)\n        config = Config.set_default(config)\n        if verbose:\n            print(\"=\" * 50)\n            print(\"[ Configuration ]\")\n            print(OmegaConf.to_yaml(config))\n            print(\"=\" * 50)\n\n        # ================ [ Set Spark ] ===================\n        spark_conf = self.setup_spark_conf(config, verbose=verbose)\n        spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()\n        if verbose:\n            print(\"=\" * 50)\n            print(\"[ Spark Final Configuration ]\")\n            print(OmegaConf.to_yaml(spark_conf.getAll()))\n            print(\"=\" * 50)\n\n        # ================= [ Run ETL ] ====================\n        # [ Load RDD/DataFrame ] - data ingestion\n        # [ Preprocessing ]\n        # [ Save RDD/DataFrame ] - data save\n        etl_configs = config.etl\n        total_etl_n = len(etl_configs)\n\n        # [switch] is the ETL process ended or not\n        # if not, spark session & data will be returned to continue\n        IS_ETL_FINISHED = True\n\n        data = None\n        prev_etl_name = None\n        prev_data = None  # for caching\n        for etl_i, etl_config in enumerate(etl_configs):\n            # etl_config.name format\n            # =====>[ etl_cate___etl_sub_cate___etl_name ]\n            etl_name = etl_config.name\n            etl_category = etl_name.split(\"___\")[0]\n            etl_class = self.get(key=etl_name)\n\n            # instantiate etl class\n            etl_instance = etl_class()\n\n            # this is middle creator mode\n            # if the last ETL process is not data save\n            if etl_i == total_etl_n - 1 and etl_category != \"data_save\":\n                if verbose:\n                    print(\n                        (\n                            f\"{'=' * 50}\\n\"\n                            \"[ DEBUG MODE ]\\n\"\n                            f\"{'=' * 50}\\n\"\n                            f\"Last ETL process was assigned for [ {etl_category} ]\\n\"\n                            \"Spark session will not be stopped and will be returned\\n\"\n                            \"If this is not intended, please assign [ data_save ] at the end.\\n\"\n                            f\"{'=' * 50}\\n\"\n                            \"Example:\\n\"\n                            \"=> spark, data = etl_pipeline.run(config)\\n\"\n                            \"=> data = data.map(add awesome duck to column)\\n\"\n                            f\"{'=' * 50}\\n\"\n                        )\n                    )\n                IS_ETL_FINISHED = False\n\n            # when args is not defined, set it to empty dict\n            if \"args\" in etl_config:\n                args = etl_config.args\n            else:\n                args = {}\n\n            # if verbose is not defined, set it same to the pipeline\n            if \"verbose\" not in args:\n                args[\"verbose\"] = verbose\n\n            # `etl_name` is passed to args for tracking\n            if etl_i == 0:\n                data = etl_instance(spark, **args, etl_name=etl_name, prev_etl_name=None)\n            else:\n                data = etl_instance(\n                    spark, data, **args, etl_name=etl_name, prev_etl_name=prev_etl_name\n                )\n\n            # cache the data\n            if cache:\n                if prev_data is not None:\n                    prev_data.unpersist()\n                data.cache()\n                prev_data = data\n\n            prev_etl_name = etl_name\n\n        # =============== [ Stop Spark ] ==================\n        if IS_ETL_FINISHED:\n            spark.stop()\n            if verbose:\n                print(\"=\" * 50)\n                print(\"[ Spark Successfully Done ]\")\n                print(\"=\" * 50)\n\n        return spark, data\n\n    def run_emr(\n        self,\n        config: Union[str, dict, DictConfig, OmegaConf, Path],\n        verbose=False,\n        cache=False,\n        *args,\n        **kwargs,\n    ):\n        \"\"\"\n        Runs the ETL process on an EMR cluster.\n\n        Args:\n            config (Union[str, dict, OmegaConf]): config for the etl\n                - str: path to the config file\n                - dict: config dict\n                - OmegaConf: config object\n            verbose (bool): if True, print the status of the etl pipeline\n                - the verbose will be applied to the ETL process as well\n                - ETL process `verbose` takes precedence over this\n            cache (bool): cache every stage of the ETL process\n\n        Returns:\n            None, Config:\n                - None for spark session\n                - Config for the config\n                    - originally data is returned, but it is not necessary for EMR\n        \"\"\"\n        if not aws_check_credentials(verbose=verbose):\n            raise ValueError(\"AWS EMR requires AWS credentials\")\n\n        # =============== [ Set Config ] ==================\n        config = Config.load(config)\n        config = Config.set_default(config, emr=True)\n\n        # EMR resource manager - yarn\n        config.spark.master = \"yarn\"\n\n        # reset local_dir for EMR cluster\n        config.spark.local.dir = \"/tmp\"\n\n        # ================ [ EMR ] ===================\n        # NOTE: config will be auto-updated by EMR Manager\n        emr_manager = EMRManager()\n\n        try:\n            # EMR cluster launch\n            emr_manager.launch(config)\n\n            if verbose:\n                print(\"=\" * 50)\n                print(\"[ Configuration ]\")\n                print(OmegaConf.to_yaml(config))\n                print(\"=\" * 50)\n\n            # EMR cluster environment setup & run spark\n            step_id = emr_manager.run(config, verbose=verbose)\n\n            # wait until EMR cluster step is done\n            emr_manager.wait(config, step_id)\n\n            # EMR Cluster terminate\n            # XXX: after EMR cluster is terminated, and confirmed by waiter\n            #      there is still a chance that the cluster is not terminated and cause error\n            #       - DependencyViolation (which depends on terminated cluster)\n            # FIXME: this is a temporary solution, need to find a better way to handle this\n            RETRY_TERMINATE = 5\n            for _ in range(RETRY_TERMINATE):\n                try:\n                    emr_manager.terminate(config)\n                    break\n                except AWSClient().ec2.exceptions.ClientError as e:\n                    if e.response[\"Error\"][\"Code\"] == \"DependencyViolation\":\n                        print(\"DependencyViolation - retrying to terminate EMR cluster\")\n                        time.sleep(5)\n                    else:\n                        raise e\n                except Exception as e:\n                    raise e\n\n        # ctrl + c\n        except KeyboardInterrupt:\n            print(\"KeyboardInterrupt - terminating EMR cluster\")\n            emr_manager.terminate(config)\n            raise KeyboardInterrupt\n        except Exception as e:\n            print(\"Exception - terminating EMR cluster\")\n            emr_manager.terminate(config)\n            raise e\n\n        return None, config\n"
  },
  {
    "path": "dataverse/etl/quality/README.md",
    "content": "# Quality"
  },
  {
    "path": "dataverse/etl/quality/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/quality/language.py",
    "content": "\"\"\"\nlanguage filtering from Common Crawl\n\nThis is a migration of the common crawl code to Dataverse.\nsome part of code is from facebookresearch/cc_net\nhttps://github.com/facebookresearch/cc_net/blob/main/cc_net/split_by_lang.py\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport functools\nfrom pathlib import Path\nfrom typing import List, Union\n\nimport requests\nfrom fasttext.FastText import _FastText\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl.registry import register_etl\nfrom dataverse.utils.setting import SystemSetting\n\n\ndef load_fasttext(\n    url=\"https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin\",\n):\n    \"\"\"\n    There is 2 issues found here\n    - due to unserilizable fasttext problem, we need to load the model for every task\n        - this is a problem, extremely slow\n        - we need to load the model once and use it for all tasks\n    - since this could lead to duplicated download, we need to check if the model is already downloaded\n        - so far found no duplicated download, but if there is, hope to be fixed in the future\n    \"\"\"\n    # FIXME: this is a manual check for duplicate download\n    # rd_n = np.random.randint(0, 1000000)\n    # print(rd_n, 'entered load_fasttext model!')\n\n    # Get the lid.bin file for Fasttext\n    cache_dir = SystemSetting().CACHE_DIR\n    cache_dir = Path(f\"{cache_dir}/.cache/dataverse/model\")\n    fasttext_path = cache_dir / \"fasttext\" / \"bin\" / \"lid.bin\"\n    fasttext_path.parent.mkdir(parents=True, exist_ok=True)  # Make directories if not existed\n\n    if not fasttext_path.exists():\n        # FIXME: this is a manual check for duplicate download\n        # print(rd_n, 'downloading fasttext model!')\n        response = requests.get(url, stream=True)\n\n        # Raise exception if downloading is not successful\n        response.raise_for_status()\n        with open(fasttext_path, \"wb\") as f:\n            for chunk in response.iter_content(chunk_size=8192):\n                f.write(chunk)\n\n    # FIXME: this is to suppress the warning message\n    # return fasttext.load_model(str(fasttext_path))\n    return _FastText(model_path=str(fasttext_path))\n\n\ndef language_predict_fasttext(row, model, top_k: int = 1, score_rounding: int = 2):\n    text = row[\"text\"].replace(\"\\n\", \"\")\n    labels, scores = model.predict(text, k=top_k)\n    labels = [label.replace(\"__label__\", \"\") for label in labels]\n\n    row[\"labels\"] = labels\n    row[\"scores\"] = scores.round(score_rounding)\n\n    return row\n\n\ndef language_predict_fasttext_by_partition(rows, top_k: int = 1, score_rounding: int = 2):\n    # loaded for every partition\n    model = load_fasttext()\n\n    # FIXME: not possible to use multiprocessing here because of the model is not serializable\n    # pool = multiprocessing.Pool(processes = os.cpu_count() or 0)\n    # results = pool.imap(\n    #     functools.partial(language_predict_fasttext, model=model, top_k=top_k),\n    #     rows,\n    # )\n    for row in rows:\n        yield language_predict_fasttext(row, model, top_k=top_k)\n\n\n@register_etl\ndef quality___language___fasttext_filter(\n    spark,\n    data: Union[RDD, DataFrame],\n    subset: str = \"text\",\n    top_k: int = 1,\n    score_rounding: int = 2,\n    threshold: float = 0.0,\n    whitelist: List[str] = None,\n    blacklist: List[str] = None,\n    *args,\n    **kwargs,\n) -> RDD:\n    \"\"\"\n    Filters data based on language using fasttext.\n    If language score is below threshold, that row will be filtered.\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be processed.\n        subset (str, optional): A subset or column to consider. Defaults to 'text'.\n        top_k(int, optional): The number of top languages to keep after classification. Defaults to 1.\n            - if fasttext classified 3 languages, top_k=1 will keep the top language\n                - [en, fr, de] -> [en]\n            - if fasttext classified 3 languages, top_k=2 will keep the top 2 languages\n                - [en, fr, de] -> [en, fr]\n        score_rounding(int, optional): The number of decimal places to round the scores. Defaults to 2.\n        threshold(float, optional): The minimum score to keep the language. Defaults to 0.0.\n        whitelist(List[str], optional): The list of languages to keep. Defaults to None.\n        blacklist(List[str], optional): The list of languages to remove. Defaults to None.\n\n    Raises:\n        ValueError: If both whitelist and blacklist are not None.\n\n    Returns:\n        rdd: The filtered data.\n\n    Caveats about `whitelist` and `blacklist`:\n        - [Default] If both `whitelist` and `blacklist` are None, all languages will be kept.\n        - If both `whitelist` and `blacklist` are not None, an error will be raised.\n        - If `whitelist` is not None, only the languages in the `whitelist` will be kept.\n        - If `blacklist` is not None, the languages in the `blacklist` will be removed.\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    # detect language using fasttext\n    data = data.mapPartitions(\n        functools.partial(\n            language_predict_fasttext_by_partition,\n            top_k=top_k,\n            score_rounding=score_rounding,\n        )\n    )\n\n    # filter by threshold\n    data = data.filter(lambda x: any(s >= threshold for s in x[\"scores\"][:top_k]))\n\n    # filter by whitelist and blacklist\n    if whitelist is not None and blacklist is not None:\n        raise ValueError(\"whitelist and blacklist cannot be both not None\")\n    elif whitelist is not None:\n        data = data.filter(lambda x: any(label in whitelist for label in x[\"labels\"][:top_k]))\n    elif blacklist is not None:\n        data = data.filter(lambda x: all(label not in blacklist for label in x[\"labels\"][:top_k]))\n    else:\n        # otherwise, keep all languages\n        ...\n\n    # remove labels and scores\n    data = data.map(lambda x: {k: v for k, v in x.items() if k != \"labels\" and k != \"scores\"})\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/registry.py",
    "content": "\"\"\"\nBase class to support the registration of the ETL classes\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport abc\nimport importlib.util\nimport inspect\nimport os\nfrom functools import wraps\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.utils.setting import SystemSetting\n\n# TODO: If you add category directories, add them here too\n# _sample is a special directory that is not imported\n\n# This is where you choose what categories to register\nETL_CATEGORIES = [\n    \"data_ingestion\",\n    \"decontamination\",\n    \"deduplication\",\n    \"bias\",\n    \"toxicity\",\n    \"cleaning\",\n    \"pii\",\n    \"quality\",\n    \"data_save\",\n    \"utils\",\n]\n\nIGNORE_FILES = [\n    \"__init__.py\",\n]\n\n\ndef auto_register(etl_categories=ETL_CATEGORIES):\n    \"\"\"\n    This will automatically register all ETLs to the registry\n    \"\"\"\n    etl_path = os.path.dirname(os.path.abspath(__file__))\n    for etl_category in etl_categories:\n        # Get the files(sub-categories) in the category\n        category_path = os.path.join(etl_path, etl_category)\n        files = os.listdir(category_path)\n\n        # Filter out non-Python files\n        files = [f for f in files if f.endswith(\".py\")]\n\n        # Dynamically import all Python files in the directory\n        for file in files:\n            if file in IGNORE_FILES:\n                continue\n\n            file_path = os.path.join(category_path, file)\n\n            # Remove .py at the end\n            module_name = file[:-3]\n\n            spec = importlib.util.spec_from_file_location(module_name, file_path)\n            module = importlib.util.module_from_spec(spec)\n            spec.loader.exec_module(module)\n\n\n# To avoid circular dependency\nclass ETLStructure:\n    ...\n\n\nclass ETLRegistry:\n    \"\"\"Singleton class to register the ETL classes.\n\n    This class provides a registry for ETL classes.\n    It ensures that only one instance of the registry is created and provides\n    methods to register, search, and retrieve ETL classes.\n\n    Attributes:\n        _initialized (bool): Flag to check if the class has been initialized.\n        _registry (dict): Dictionary to store the registered ETL classes.\n        _status (dict): Dictionary to store the status of the registered ETL classes.\n\n    Methods:\n        __new__(): Creates a new instance of the class if it doesn't exist.\n        __init__(): Initializes the class and registers the ETL classes.\n        __len__(): Returns the number of registered ETL classes.\n        __repr__(): Returns a string representation of the registry.\n        __str__(): Returns a string representation of the registry.\n        _update_status(key): Updates the status of the registry.\n        _convert_to_report_format(status, print_sub_category, print_etl_name): Converts the status to a report format.\n    \"\"\"\n\n    _initialized = False\n\n    def __new__(cls):\n        if not hasattr(cls, \"instance\"):\n            cls.instance = super(ETLRegistry, cls).__new__(cls)\n        return cls.instance\n\n    def __init__(self):\n        \"\"\"\n        when the class is initialized, this is called everytime\n        regardless of the singleton. So adding the flag to check\n        \"\"\"\n        if self._initialized:\n            return\n        self._initialized = True\n        self._registry = {}\n        self._status = {}\n        auto_register()\n\n    def __len__(self):\n        return len(self._registry.keys())\n\n    def __repr__(self):\n        return self._convert_to_report_format(self._status)\n\n    def __str__(self):\n        return self.__repr__()\n\n    def reset(self):\n        \"\"\"\n        reset the registry\n        \"\"\"\n        self._registry = {}\n\n    def register(self, key: str, etl: ETLStructure):\n        \"\"\"\n        Registers the ETL (Extract, Transform, Load) process.\n\n        Args:\n            key (str): The key used to identify the ETL process. Should be in the format below:\n            etl (ETLStructure): The ETL process to be registered. It should be a subclass of ETLStructure.\n\n        Raises:\n            ValueError: If the key is not all lowercase, not separated by '___', or does not have 2 layers of category.\n            TypeError: If the ETL class is not a subclass of ETLStructure.\n            KeyError: If the key is already registered.\n\n        Note:\n            - The key should be in the format of:\n                - all lowercase\n                - separated by ___\n                - it should have 2 layers of category\n            \n            - Example: <etl_type>___<file_key>___<etl_key> or <category>___<sub_category>___<etl_key>.\n        \"\"\"\n        if not key.islower():\n            raise ValueError(f\"The key [ {key} ] should be all lowercase\")\n        if \"___\" not in key:\n            raise ValueError(f\"The key [ {key} ] should be separated by ___\")\n        if len(key.split(\"___\")) != 3:\n            raise ValueError(f\"The key [ {key} ] should have 2 layers of category\")\n\n        # all the etl should be the subclass of ETLStructure\n        if not issubclass(etl, ETLStructure):\n            raise TypeError(f\"ETL class should be subclass of ETLStructure not {etl}\")\n\n        # register\n        if key in self._registry:\n            if (os.getenv(\"DATAVERSE_TEST_MODE\") == \"True\") or (\n                os.getenv(\"DATAVERSE_BUILD_DOC\") == \"true\"\n            ):\n                pass\n            else:\n                raise KeyError(f\"The key [ {key} ] is already registered\")\n        else:\n            self._registry[key] = etl\n            self._update_status(key=key)\n\n    def _update_status(self, key: str):\n        category, sub_category, _ = key.split(\"___\")\n        if category not in self._status:\n            self._status[category] = {}\n\n        if sub_category not in self._status[category]:\n            self._status[category][sub_category] = [key]\n        else:\n            self._status[category][sub_category].append(key)\n\n    def search(self, category: str = None, sub_category: str = None):\n        \"\"\"\n        Search the ETL.\n\n        Args:\n            category (str, optional): The category to search for. Defaults to None.\n            sub_category (str, optional): The sub-category to search for. Defaults to None.\n\n        Returns:\n            dict: A dictionary containing the filtered status information.\n\n        Raises:\n            AssertionError: If category is a list or not a string.\n            AssertionError: If sub_category is a list or not a string.\n            ValueError: If sub_category is specified without category.\n\n        Note:\n            - Printing all the information is fixed as default.\n            - Set print_sub_category to True to print the sub-category.\n            - Set print_etl_name to True to print the ETL name.\n        \"\"\"\n        status = self._status\n\n        filtered_status = {}\n        if category is not None:\n            assert type(category) != list, \"we do not support list search for category\"\n            assert type(category) == str, \"category must be a string\"\n            if sub_category is None:\n                filtered_status[category] = status[category]\n            else:\n                assert type(sub_category) != list, \"we do not support list search for sub-category\"\n                assert type(sub_category) == str, \"sub_category must be a string\"\n                filtered_status[category] = {sub_category: status[category][sub_category]}\n        else:\n            if sub_category is not None:\n                raise ValueError(\"sub-category cannot be specified without category\")\n            filtered_status = status\n\n        return self._convert_to_report_format(\n            filtered_status,\n            print_sub_category=True,\n            print_etl_name=True,\n        )\n\n    def _convert_to_report_format(\n        self,\n        status,\n        print_sub_category=False,\n        print_etl_name=False,\n    ):\n        \"\"\"\n        convert status to report format\n\n        This includes the number of ETLs in each category and sub-category\n        and depending on the options, it can include the name of the ETLs\n\n        Args:\n            status (dict): the status from `search`\n        \"\"\"\n        # count the number of etls\n        stats = {}\n        total = 0\n        categories = list(status.keys())\n        for category in categories:\n            if category not in stats:\n                stats[category] = {}\n                stats[category][\"__total__\"] = 0\n\n            sub_categories = list(status[category].keys())\n            for sub_category in sub_categories:\n                sub_n = len(status[category][sub_category])\n                stats[category][sub_category] = sub_n\n                stats[category][\"__total__\"] += sub_n\n                total += sub_n\n\n        # convert to the report format\n        infos = []\n\n        infos.append(\"=\" * 50)\n        infos.append(f\"Total [ {total} ]\")\n        infos.append(\"=\" * 50)\n\n        for category in categories:\n            infos.append(f\"{category} [ {stats[category]['__total__']} ]\")\n            sub_categories = list(status[category].keys())\n\n            if print_sub_category:\n                for sub_category in sub_categories:\n                    infos.append(f\"{' ' * 4}- {sub_category} [ {stats[category][sub_category]} ]\")\n\n                    if print_etl_name:\n                        for etl in status[category][sub_category]:\n                            infos.append(f\"{' ' * 8}- {etl}\")\n\n        return \"\\n\".join(infos)\n\n    def get(self, key: str) -> ETLStructure:\n        \"\"\"\n        Retrieves the ETLStructure associated with the given key.\n\n        Args:\n            key (str): The key used to retrieve the ETLStructure. Should be in the format below.\n\n        Returns:\n            ETLStructure: The ETLStructure associated with the given key.\n\n        Raises:\n            ValueError: If the key is not all lowercase, not separated by '___', or does not have 2 layers of category.\n            KeyError: If the key is not registered in the registry.\n\n        Note:\n            - The key should be in the format of:\n                - all lowercase\n                - separated by ___\n                - it should have 2 layers of category\n            \n            - Example: <etl_type>___<file_key>___<etl_key> or <category>___<sub_category>___<etl_key>.\n        \"\"\"\n        if not key.islower():\n            raise ValueError(f\"The key [ {key} ] should be all lowercase\")\n        if \"___\" not in key:\n            raise ValueError(f\"The key [ {key} ] should be separated by ___\")\n        if len(key.split(\"___\")) != 3:\n            raise ValueError(f\"The key [ {key} ] should have 2 layers of category\")\n\n        if key not in self._registry:\n            raise KeyError(f\"The key {key} is not registered\")\n\n        return self._registry[key]\n\n    def get_all(self):\n        \"\"\"\n        get all the etls\n        \n        Returns:\n            list: list of all registered etls\n        \"\"\"\n        return list(self._registry.values())\n\n\nclass ETLAutoRegistry(abc.ABCMeta, type):\n    def __new__(cls, name, bases, attrs):\n        \"\"\"\n        Metaclass to register the ETL classes automatically to the registry\n        \"\"\"\n        # singleton registry\n        new_class = super().__new__(cls, name, bases, attrs)\n\n        # BaseETL is base class and should not be registered\n        # Another reason is BaseETL is not initialized yet before `__new__` is done but\n        # the registry will verify the class is subclass of BaseETL and raise error\n        # because BaseETL is not initialized yet :)\n        if name != \"BaseETL\":\n            if \"__file_path__\" not in attrs:\n                raise TypeError(\n                    \"Direct inheritance from BaseETL not allowed. Use @register_etl decorator.\"\n                )\n\n            registry = ETLRegistry()\n            registry.register(key=name, etl=new_class)\n\n        return new_class\n\n\nclass BaseETL(ETLStructure, metaclass=ETLAutoRegistry):\n    \"\"\"\n    Base class for spark ETL.\n\n    This class provides a base structure for implementing spark ETL processes.\n    If you need to use `self` directly, inherit this class.\n\n    Methods:\n        run(self, data, *args, **kwargs):\n            Run the preprocessing logic.\n            This method should be implemented by subclasses.\n\n        __call__(self, *args, **kwargs):\n            Call the `run` method to perform the preprocessing.\n    \"\"\"\n\n    @abc.abstractmethod\n    def run(self, data: Union[RDD, DataFrame], *args, **kwargs):\n        \"\"\"\n        run the preprocessing\n        \"\"\"\n        raise NotImplementedError()\n\n    def __call__(self, *args, **kwargs):\n        \"\"\"\n        call the method to do the preprocessing\n        \"\"\"\n        return self.run(*args, **kwargs)\n\n\ndef add_self(func):\n    \"\"\"\n    Decorator to add self to the function\n    intent is to make the function as a method\n    \"\"\"\n\n    @wraps(func)\n    def wrapper(self, *args, **kwargs):\n        return func(*args, **kwargs)\n\n    wrapper.__doc__ = func.__doc__\n    return wrapper\n\n\ndef register_etl(func):\n    \"\"\"\n    Decorator to register a function as an ETL.\n\n    Args:\n        func (callable): The function to be registered as an ETL.\n\n    Returns:\n        type: A dynamically created class that inherits from BaseETL and wraps the original function.\n\n    Raises:\n        None.\n\n    About Attributes:\n        - __file_path__ (str): The file path of the function where it is defined.\n        - __etl_dir__ (bool): If the file is in the etl directory. If not, it means it's a dynamically registered user-defined ETL.\n\n    Example:\n        >>> @register_etl\n        >>> def my_etl_function():\n        >>>    pass\n\n    Note:\n        The registered ETL function should not rely on the `self` parameter.\n\n        If you need to use `self`, directly inherit the BaseETL class.\n    \"\"\"\n    ETL_DIR = os.path.join(SystemSetting().DATAVERSE_HOME, \"etl\")\n    etl_file_path = inspect.getfile(func)\n\n    # I know using class name without snake case is awkward\n    # but I want to keep the class name as it is and user won't know it\n    etl_cls = type(\n        func.__name__,\n        (BaseETL,),\n        {\n            \"run\": add_self(func),\n            \"__file_path__\": etl_file_path,\n            \"__etl_dir__\": etl_file_path.startswith(ETL_DIR),\n        },\n    )\n\n    etl_cls.__doc__ = func.__doc__\n    etl_cls.__is_etl__ = True\n    return etl_cls\n"
  },
  {
    "path": "dataverse/etl/toxicity/README.md",
    "content": ""
  },
  {
    "path": "dataverse/etl/toxicity/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/utils/README.md",
    "content": "# Utils\n> Utilities for the ETL process. Not really part of the ETL process but useful for the ETL process.\n\nThis could be including the following\n- logging\n- error handling\n- data validation\n- sampling\n- etc\n\n"
  },
  {
    "path": "dataverse/etl/utils/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/etl/utils/log.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import register_etl\n\n\n@register_etl\ndef utils___log___count(\n    spark, data: Union[RDD, DataFrame], prev_etl_name: str = None, *args, **kwargs\n) -> Union[RDD, DataFrame]:\n    \"\"\"\n    Simply count the number of rows in the data\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to extract the nouns from.\n        prev_etl_name (str, optional): name of the previous ETL process. Defaults to None.\n\n    Returns:\n        Union[RDD, DataFrame]: The input data. Nothing is changed.\n    \"\"\"\n    total_data = data.count()\n    print(\"=\" * 50)\n    print(f\"After [ {prev_etl_name} ] - Total data: {total_data}\")\n    print(\"=\" * 50)\n\n    return data\n"
  },
  {
    "path": "dataverse/etl/utils/sampling.py",
    "content": "\"\"\"\nSampling module for data ingestion\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import register_etl\n\n\n@register_etl\ndef utils___sampling___random(\n    spark,\n    data: Union[RDD, DataFrame],\n    replace: bool = False,\n    sample_n_or_frac: float = 0.1,\n    seed: int = 42,\n    *args,\n    **kwargs\n) -> RDD:\n    \"\"\"\n    Randomly sample the input RDD.\n\n    Args:\n        spark (SparkSession): The Spark session object.\n        data (Union[RDD, DataFrame]): The input data to be sampled.\n        replace (bool, optional): Whether to sample with replacement. Defaults to False.\n        sample_n_or_frac (float, optional): Number of samples to take or fraction of the RDD to sample. Defaults to 0.1\n        seed (int, optional): Seed for the random number generator. Defaults to 42.\n\n    Returns:\n        RDD: Sampled RDD\n    \"\"\"\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    if isinstance(sample_n_or_frac, float):\n        data = data.sample(replace, sample_n_or_frac, seed)\n\n    # XXX: Take too long, 1M sample takes over 10 mins and didn't finish\n    elif isinstance(sample_n_or_frac, int):\n        data = data.takeSample(replace, sample_n_or_frac, seed)\n    return data\n"
  },
  {
    "path": "dataverse/etl/utils/statistics.py",
    "content": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom operator import add\nfrom typing import Union\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import register_etl\n\n\n@register_etl\ndef utils___statistics___korean_nouns(\n    spark, data: Union[RDD, DataFrame], subset: str = \"text\", *args, **kwargs\n) -> RDD:\n    \"\"\"\n    Get the frequency of each noun in the given subset of the data.\n\n    Args:\n        spark: The SparkSession object.\n        data: The data to extract the nouns from.\n        subset: The subset of the data to extract the nouns from. Defaults to 'text'.\n\n    Returns:\n        RDD[List[Tuple[str, int]]]: The frequency of each noun in the given subset of the data.\n\n    Raises:\n        ImportError: If konlpy or Mecab is not installed.\n\n    Examples:\n        >>> data = [\n        ...     {'text': '오리는 꽥꽥 웁니다. 거위는'},\n        ...     {'text': '안녕 세상!'},\n        ...     {'text': '사람들은 꽥꽥 울지 않습니다. 오리가 웁니다'},\n        ... ]\n        >>> result = utils___statistics___korean_nouns()(spark, data, subset='text')\n        >>> result.collect()\n        [('오리', 2), ('거위', 1), ('세상', 1), ('사람', 1)]\n\n    Caveats:\n        - This function works for Korean text only.\n        - The function returns the frequency of each noun, not the unique noun list.\n    \"\"\"\n\n    # konlpy & mecab\n    try:\n        from konlpy.tag import Mecab\n    except ImportError:\n        raise ImportError(\n            \"Please install konlpy & Mecab:\\n\" \"pip install konlpy\\n\" \"pip install mecab-python3\\n\"\n        )\n\n    if isinstance(data, DataFrame):\n        data = data.rdd\n\n    mecab = Mecab()\n\n    def _parse_korean_nouns(text):\n        try:\n            if text is not None:\n                return mecab.nouns(text)\n            else:\n                return []\n        except Exception:\n            # Log the exception for debugging purposes\n            return []\n\n    # Count the frequency of each noun\n    data = data.flatMap(lambda x: _parse_korean_nouns(x[subset]))\n    noun_counts = data.map(lambda noun: (noun, 1)).reduceByKey(add)\n\n    return noun_counts\n"
  },
  {
    "path": "dataverse/lab/README.md",
    "content": "# Lab \n> Space Laboratory for data analysis\n\nThis will be further supported.\n- Data Exploration\n- Data Visualization \n- ETC"
  },
  {
    "path": "dataverse/lab/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/tests/conftest.py",
    "content": "import os\nimport sys\n\nimport pytest\n\nsys.path.append(\"./\")  # to find etl folder as module\n\n\n@pytest.fixture(scope=\"session\", autouse=True)\ndef set_test_mode_env():\n    old_value = os.getenv(\"DATAVERSE_TEST_MODE\")\n\n    # Activate test mode\n    os.environ[\"DATAVERSE_TEST_MODE\"] = \"True\"\n\n    # Bring back to previous value\n    yield\n\n    if old_value is None:\n        del os.environ[\"DATAVERSE_TEST_MODE\"]\n    else:\n        os.environ[\"DATAVERSE_TEST_MODE\"] = old_value\n"
  },
  {
    "path": "dataverse/tests/test_cleaning_accent.py",
    "content": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline, register_etl\n\n\n@register_etl\ndef helper___test___generate_accent(spark, *args, **kwargs):\n    data = [(\"café\",), (\"résumé\",), (\"piñata\",)]\n    df = spark.createDataFrame(data, [\"text\"])\n    return df\n\n\ndef test_cleaning___accent____remove():\n    from etl.cleaning.accent import cleaning___accent___remove  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"spark\": {\n                \"appname\": \"TEST-cleaning-accent\",\n                \"driver\": {\"memory\": \"4g\"},\n                \"args\": {\n                    \"verbose\": True,\n                },\n            },\n            \"etl\": [\n                {\"name\": \"helper___test___generate_accent\"},\n                {\"name\": \"cleaning___accent___remove\"},\n            ],\n        }\n    )\n    spark, result = etl_pipeline.run(ETL_config)\n    result_df = result.toDF()\n    expected_data = [(\"cafe\",), (\"resume\",), (\"pinata\",)]\n    expected_df = spark.createDataFrame(expected_data, [\"text\"])\n\n    assert expected_df.collect() == result_df.collect()\n"
  },
  {
    "path": "dataverse/tests/test_cleaning_char.py",
    "content": "import random\nimport re\nfrom typing import Union\n\nfrom omegaconf import OmegaConf\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\nfaker_seed = 42\nunprintable_chars = list(map(chr, range(32))) + list(map(chr, range(127, 160)))\n\n\n@register_etl\ndef helper___test___generate_whitespace(\n    spark, data: Union[RDD, DataFrame], subset=\"text\", *args, **kwargs\n):\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _generate_whitespace(row):\n        row[subset] = row[subset].replace(\" \", \" \" * random.randint(1, 5))\n        row[subset] = \" \" * random.randint(0, 5) + row[subset] + \" \" * random.randint(0, 5)\n        return row\n\n    data = data.map(_generate_whitespace)\n\n    return data\n\n\ndef test_cleaning___char___normalize_whitespace():\n    from etl.cleaning.char import cleaning___char___normalize_whitespace  # noqa: F401\n    from etl.data_ingestion.test import (  # noqa: F401\n        data_ingestion___test___generate_fake_ufl,\n    )\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"spark\": {\n                \"appname\": \"Test-cleaning-char\",\n                \"driver\": {\"memory\": \"16g\"},\n                \"verbose\": True,\n            },\n            \"etl\": [\n                {\"name\": \"data_ingestion___test___generate_fake_ufl\"},\n                {\n                    \"name\": \"helper___test___generate_whitespace\",\n                    \"args\": {\"subset\": \"text\"},\n                },\n                {\"name\": \"cleaning___char___normalize_whitespace\"},\n            ],\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n\n    double_space_pattern = re.compile(r\"[\\s\\r\\n]{2,}\")\n\n    for row in result.collect():\n        assert row[\"text\"] == row[\"text\"].strip()\n        assert re.findall(double_space_pattern, row[\"text\"]) == []\n\n\n@register_etl\ndef helper___test___generate_unprintable(\n    spark, data: Union[RDD, DataFrame], subset=\"text\", *args, **kwargs\n):\n    if isinstance(data, DataFrame):\n        data = data.rdd\n        data = data.map(lambda row: row.asDict())\n\n    def _insert_unprintable_chars(text):\n        unprintable_chars = list(map(chr, range(32))) + list(map(chr, range(127, 160)))\n        for _ in range(random.randint(1, 20)):\n            position = random.randint(0, len(text) - 1)\n            char = random.choice(unprintable_chars)\n            text = text[:position] + char + text[position:]\n        return text\n\n    def _generate_unprintable(row):\n        row[subset] = _insert_unprintable_chars(row[subset])\n        return row\n\n    data = data.map(_generate_unprintable)\n\n    return data\n\n\ndef test_cleaning___char___remove_unprintable():\n    from etl.cleaning.char import cleaning___char___remove_unprintable  # noqa: F401\n    from etl.data_ingestion.test import (  # noqa: F401\n        data_ingestion___test___generate_fake_ufl,\n    )\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"spark\": {\n                \"appname\": \"Test-cleaning-char\",\n                \"driver\": {\"memory\": \"16g\"},\n                \"verbose\": True,\n            },\n            \"etl\": [\n                {\n                    \"name\": \"data_ingestion___test___generate_fake_ufl\",\n                    \"args\": {\"faker_seed\": faker_seed},\n                },\n                {\n                    \"name\": \"helper___test___generate_unprintable\",\n                    \"args\": {\"subset\": \"text\"},\n                },\n            ],\n        }\n    )\n    _, unprintable_check = etl_pipeline.run(ETL_config)\n    for row in unprintable_check.collect():\n        assert any(chars in row[\"text\"] for chars in unprintable_chars)\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"spark\": {\n                \"appname\": \"Test-cleaning-char\",\n                \"driver\": {\"memory\": \"16g\"},\n                \"verbose\": True,\n            },\n            \"etl\": [\n                {\n                    \"name\": \"data_ingestion___test___generate_fake_ufl\",\n                    \"args\": {\"faker_seed\": faker_seed},\n                },\n                {\n                    \"name\": \"helper___test___generate_unprintable\",\n                    \"args\": {\"subset\": \"text\"},\n                },\n                {\n                    \"name\": \"cleaning___char___remove_unprintable\",\n                    \"args\": {\"subset\": \"text\"},\n                },\n            ],\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n\n    for row in result.collect():\n        assert any(chars not in row[\"text\"] for chars in unprintable_chars)\n"
  },
  {
    "path": "dataverse/tests/test_cleaning_document.py",
    "content": "import pytest\nfrom faker import Faker\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\n\nfaker_seed = 42\nword_per_chunk = 5\ndelimiter = \" \"\n\n\n@pytest.fixture(scope=\"function\", autouse=True)\ndef fake_data_rdd():\n    Faker.seed(faker_seed)\n\n\ndef test_cleaning___document___split_by_word():\n    from etl.cleaning.document import cleaning___document___split_by_word  # noqa: F401\n    from etl.data_ingestion.test import (  # noqa: F401\n        data_ingestion___test___generate_fake_ufl,\n    )\n\n    etl_pipeline = ETLPipeline()\n    ETL_Config = OmegaConf.create(\n        {\n            \"spark\": {\n                \"appname\": \"TestCleaningDocument\",\n                \"driver\": {\"memory\": \"16g\"},\n                \"verbose\": True,\n            },\n            \"etl\": [\n                {\"name\": \"data_ingestion___test___generate_fake_ufl\"},\n            ],\n        }\n    )\n    _, original = etl_pipeline.run(ETL_Config)\n\n    etl_pipeline = ETLPipeline()\n    ETL_Config = OmegaConf.create(\n        {\n            \"spark\": {\n                \"appname\": \"TestCleaningDocument\",\n                \"driver\": {\"memory\": \"16g\"},\n                \"verbose\": True,\n            },\n            \"etl\": [\n                {\n                    \"name\": \"data_ingestion___test___generate_fake_ufl\",\n                    \"args\": {\"faker_seed\": faker_seed},\n                },\n                {\n                    \"name\": \"cleaning___document___split_by_word\",\n                    \"args\": {\n                        \"word_per_chunk\": word_per_chunk,\n                        \"subset\": \"text\",\n                        \"delimiter\": delimiter,\n                    },\n                },\n            ],\n        }\n    )\n    _, result = etl_pipeline.run(ETL_Config)\n\n    # check it is splitted properly\n    assert all(len(row[\"text\"].split(delimiter)) <= word_per_chunk for row in result.collect())\n\n    # check combined version of splitted is same with the original\n    result_combine = delimiter.join(result.map(lambda x: x[\"text\"]).collect())\n    original_combine = delimiter.join(original.map(lambda x: x[\"text\"]).collect())\n\n    assert len(original_combine) == len(result_combine)\n    assert original_combine == result_combine\n"
  },
  {
    "path": "dataverse/tests/test_cleaning_html.py",
    "content": "import random\n\nfrom faker import Faker\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\nfaker_seed = 42\nrandom_seed = 42\n\n\n@register_etl\ndef helper___test___generate_html(\n    spark,\n    n=100,\n    repartition=20,\n    faker_seed=None,\n    random_seed=None,\n    verbose=True,\n    *args,\n    **kwargs,\n):\n    faker = Faker()\n    if faker_seed is not None:\n        Faker.seed(faker_seed)\n    if random_seed is not None:\n        random.seed(random_seed)\n\n    def _generate_fake_html_format():\n        tags = [\"p\", \"h1\", \"h2\", \"div\", \"span\", \"a\", \"ul\", \"ol\", \"li\", \"strong\", \"em\"]\n        html_content = \"\"\n        for _ in range(random.randint(3, 10)):\n            tag = random.choice(tags)\n\n            if tag in [\"ul\", \"ol\"]:\n                items = \"\"\n                for _ in range(random.randint(2, 4)):\n                    items += f\"<li>{faker.sentence()}</li>\"\n                html_content += f\"<{tag}>{items}</{tags}>\"\n\n            elif tag == \"a\":\n                html_content += f'<a href=\"{faker.url()}\">{faker.word()}</a>'\n\n            else:\n                html_content += f\"<{tag}>{faker.text()}</{tag}>\"\n\n        return html_content\n\n    def _generate_fake_html(n=100):\n        while n > 0:\n            n -= 1\n            fake_html = _generate_fake_html_format()\n            yield {\"id\": faker.uuid4(), \"text\": fake_html}\n\n    rdd = spark.sparkContext.parallelize(_generate_fake_html(n=n))\n    rdd = rdd.repartition(repartition)\n    return rdd\n\n\ndef test_cleaning___html___extract_plain_text():\n    from etl.cleaning.html import cleaning___html___extract_plain_text  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\n                    \"name\": \"helper___test___generate_html\",\n                    \"args\": {\"faker_seed\": faker_seed},\n                },\n                {\n                    \"name\": \"cleaning___html___extract_plain_text\",\n                    \"args\": {\"subset\": \"text\"},\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n\n    for row in result.collect():\n        assert \"<\" not in row[\"text\"]\n        assert \">\" not in row[\"text\"]\n\n\ndef test_cleaning___html___extract_plain_text_trafilatura():\n    from etl.cleaning.html import cleaning___html___extract_plain_text  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\n                    \"name\": \"helper___test___generate_html\",\n                    \"args\": {\"faker_seed\": faker_seed},\n                },\n                {\n                    \"name\": \"cleaning___html___extract_plain_text\",\n                    \"args\": {\"subset\": \"text\", \"trafilatura\": True},\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n\n    for row in result.collect():\n        assert \"<\" not in row[\"text\"]\n        assert \">\" not in row[\"text\"]\n"
  },
  {
    "path": "dataverse/tests/test_cleaning_korean.py",
    "content": "import random\n\nfrom faker import Faker\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\nfaker_seed = 42\nrandom_seed = 42\n\n\n@register_etl\ndef helper___test___generate_korean(\n    spark,\n    n=100,\n    repartition=20,\n    create_type=\"word\",\n    faker_seed=None,\n    random_seed=None,\n    verbose=True,\n    *args,\n    **kwargs,\n):\n    \"\"\"\n    generate fake data that is mixed with korean and english.\n    This creates data based on random ratio for each row.\n\n    Args:\n        spark (SparkSession): spark session\n        n (int): the number of data to generate\n        repartition (int): the number of partitions\n        create_type (str): handles type of creating random data.\n        faker_seed (int, optional): Random seed of faker library. Defaults to None.\n        random_seed (int, optional): Random seed of random library. Defaults to None.\n        verbose (bool): whether to print the information of the dataset\n    \"\"\"\n    assert create_type in [\n        \"char\",\n        \"word\",\n    ], \"this is following filter_type of function `cleaning___korean___filter_by_ratio`\"\n\n    faker = Faker([\"en_US\", \"ko_KR\"])\n    if faker_seed is not None:\n        Faker.seed(faker_seed)\n    faker_en = faker[\"en_US\"]\n    faker_ko = faker[\"ko_KR\"]\n\n    if random_seed is not None:\n        random.seed(random_seed)\n\n    from etl.cleaning.korean import JAUM, KOR_BEGIN, KOR_END, MOUM\n\n    jamo = JAUM + MOUM\n\n    def _generate_fake_korean_english_mixed(\n        korean_ratio, total_count, create_type, space_ratio=0.3\n    ):\n        if create_type == \"word\":\n            words = []\n            for _ in range(total_count):\n                if random.random() < korean_ratio:\n                    words.append(faker_ko.name())\n                else:\n                    words.append(faker_en.last_name())\n            return \" \".join(words)\n        else:  # create_type == \"char\"\n            chars = \"\"\n            korean_length = int(total_count * korean_ratio)\n            for _ in range(korean_length):\n                korean_type = random.choice([\"jamo\", \"eumjeol\"])\n                cur_kor = (\n                    chr(random.randint(KOR_BEGIN, KOR_END))\n                    if korean_type == \"eumjeol\"\n                    else random.choice(jamo)\n                )\n                chars += cur_kor\n                if random.random() < space_ratio:\n                    chars += \" \"\n            english_length = total_count - korean_length\n            english_text = faker_en.sentence(nb_words=english_length)\n            chars += english_text[:english_length]\n            return chars\n\n    def _generate_fake_korean(n=100, create_type=\"word\"):\n        while n > 0:\n            n -= 1\n            korean_ratio = random.random()\n            total_count = random.randint(0, 300)\n            fake_korean = _generate_fake_korean_english_mixed(\n                korean_ratio, total_count, create_type\n            )\n            yield {\n                \"id\": faker.uuid4(),\n                \"text\": fake_korean,\n                \"korean_ratio\": korean_ratio,\n            }\n\n    rdd = spark.sparkContext.parallelize(_generate_fake_korean(n=n, create_type=create_type))\n    rdd = rdd.repartition(repartition)\n\n    return rdd\n\n\ndef test_cleaning___korean___filter_by_ratio():\n    from etl.cleaning.korean import cleaning___korean___filter_by_ratio  # noqa\n\n    filter_type = \"word\"\n    korean_ratio = 0.6\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\n                    \"name\": \"helper___test___generate_korean\",\n                    \"args\": {\n                        \"faker_seed\": faker_seed,\n                        \"random_seed\": random_seed,\n                        \"n\": 1000,\n                        \"create_type\": filter_type,\n                    },\n                },\n                {\n                    \"name\": \"cleaning___korean___filter_by_ratio\",\n                    \"args\": {\n                        \"subset\": \"text\",\n                        \"korean_ratio\": korean_ratio,\n                        \"filter_type\": filter_type,\n                    },\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n\n    assert any(row[\"korean_ratio\"] < korean_ratio for row in result.collect())\n\n\ndef test_cleaning___korean___filter_by_ratio_chars():\n    from etl.cleaning.korean import cleaning___korean___filter_by_ratio  # noqa\n\n    filter_type = \"char\"\n    korean_ratio = 0.6\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\n                    \"name\": \"helper___test___generate_korean\",\n                    \"args\": {\n                        \"faker_seed\": faker_seed,\n                        \"random_seed\": random_seed,\n                        \"n\": 1000,\n                        \"create_type\": filter_type,\n                    },\n                },\n                {\n                    \"name\": \"cleaning___korean___filter_by_ratio\",\n                    \"args\": {\n                        \"subset\": \"text\",\n                        \"korean_ratio\": korean_ratio,\n                        \"filter_type\": filter_type,\n                    },\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n\n    print(result.collect())\n    assert any(row[\"korean_ratio\"] < korean_ratio for row in result.collect())\n\n\n@register_etl\ndef helper___test___generate_korean_emoticon(spark, *args, **kwargs):\n    data = spark.createDataFrame(\n        [(1, \"안녕하세요ㅋㅋㅋㅋㅋ\"), (2, \"ㅎㅎㅎㅎㅎ잘 지내세요?\"), (3, \"그래요ㅜㅜㅜㅜ\"), (4, \"ㅋㅋ쿵ㅜㅜㅋ쿠ㅜㅜ\")],\n        [\"id\", \"text\"],\n    )\n    return data\n\n\ndef test_cleaning___korean___reduce_emoticon():\n    from etl.cleaning.korean import cleaning___korean___reduce_emoticon  # noqa\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_korean_emoticon\"},\n                {\"name\": \"cleaning___korean___reduce_emoticon\"},\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n\n    expected_result = [\n        (1, \"안녕하세요ㅋㅋ\"),\n        (2, \"ㅎㅎ잘 지내세요?\"),\n        (3, \"그래요ㅜㅜ\"),\n        (4, \"ㅋㅋ쿵ㅜㅜㅋㅋㅜㅜ\"),\n    ]\n\n    for expected, result_row in zip(expected_result, result.collect()):\n        assert (\n            expected[1] == result_row[\"text\"]\n        ), f'Expected {expected[1]}, but got {result_row[\"text\"]}'\n"
  },
  {
    "path": "dataverse/tests/test_cleaning_length.py",
    "content": "import pytest\nfrom faker import Faker\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\nfaker_seed = 42\n\n\n@register_etl\ndef helper___test___generate_data_for_test_length(spark, n=10, faker_seed=None, *args, **kwargs):\n    faker = Faker()\n    if faker_seed is not None:\n        Faker.seed(faker_seed)\n\n    data = []\n    for _ in range(n):\n        fake_data = faker.paragraph()\n        data.append((fake_data, len(fake_data), len(fake_data.split())))\n    data.append((\"\", len(\"\"), len(\"\".split())))\n    df = spark.createDataFrame(data, [\"text\", \"char_length\", \"word_length\"])\n    return df\n\n\ndef test_cleaning___length___char_len_filter():\n    from etl.cleaning.length import cleaning___length___char_len_filter  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    original = OmegaConf.create(\n        {\"etl\": [{\"name\": \"helper___test___generate_data_for_test_length\"}]}\n    )\n    _, result = etl_pipeline.run(original)\n\n    max_value, min_value = (\n        result.select(\"char_length\").rdd.max()[0],\n        result.select(\"char_length\").rdd.min()[0],\n    )\n    print(\"*----------------------------*\")\n    print(f\"max length of test data is {max_value}. min length of test data is {min_value}\")\n\n    min_max_both = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_data_for_test_length\"},\n                {\n                    \"name\": \"cleaning___length___char_len_filter\",\n                    \"args\": {\"min_len\": 3, \"max_len\": 40},\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(min_max_both)\n    assert all((row[\"char_length\"] >= 3) and (row[\"char_length\"] <= 40) for row in result.collect())\n    assert result.filter(lambda row: row[\"text\"] == \"\").count() == 0\n\n    min_only = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_data_for_test_length\"},\n                {\"name\": \"cleaning___length___char_len_filter\", \"args\": {\"min_len\": 3}},\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(min_only)\n    assert all(row[\"char_length\"] >= 3 for row in result.collect())\n    assert result.filter(lambda row: row[\"text\"] == \"\").count() == 0\n\n    max_only = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_data_for_test_length\"},\n                {\"name\": \"cleaning___length___char_len_filter\", \"args\": {\"max_len\": 3}},\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(max_only)\n    assert all(row[\"char_length\"] <= 3 for row in result.collect())\n    assert result.filter(lambda row: row[\"text\"] == \"\").count() > 0\n\n    nothing_given = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_data_for_test_length\"},\n                {\"name\": \"cleaning___length___char_len_filter\"},\n            ]\n        }\n    )\n    with pytest.raises(AssertionError):\n        _, result = etl_pipeline.run(nothing_given)\n\n\ndef test_cleaning___length___word_len_filter():\n    from etl.cleaning.length import cleaning___length___word_len_filter  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    original = OmegaConf.create(\n        {\"etl\": [{\"name\": \"helper___test___generate_data_for_test_length\"}]}\n    )\n    _, result = etl_pipeline.run(original)\n\n    max_value, min_value = (\n        result.select(\"word_length\").rdd.max()[0],\n        result.select(\"word_length\").rdd.min()[0],\n    )\n    print(\"*----------------------------*\")\n    print(f\"max length of test data is {max_value}. min length of test data is {min_value}\")\n\n    min_max_both = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_data_for_test_length\"},\n                {\n                    \"name\": \"cleaning___length___word_len_filter\",\n                    \"args\": {\"min_len\": 3, \"max_len\": 40},\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(min_max_both)\n    assert all((row[\"word_length\"] >= 3) and (row[\"word_length\"] <= 40) for row in result.collect())\n    assert result.filter(lambda row: row[\"text\"] == \"\").count() == 0\n\n    min_only = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_data_for_test_length\"},\n                {\"name\": \"cleaning___length___word_len_filter\", \"args\": {\"min_len\": 3}},\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(min_only)\n    assert all(row[\"word_length\"] >= 3 for row in result.collect())\n    assert result.filter(lambda row: row[\"text\"] == \"\").count() == 0\n\n    max_only = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_data_for_test_length\"},\n                {\"name\": \"cleaning___length___word_len_filter\", \"args\": {\"max_len\": 3}},\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(max_only)\n    assert all(row[\"word_length\"] <= 3 for row in result.collect())\n    assert result.filter(lambda row: row[\"text\"] == \"\").count() > 0\n\n    nothing_given = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_data_for_test_length\"},\n                {\"name\": \"cleaning___length___word_len_filter\"},\n            ]\n        }\n    )\n    with pytest.raises(AssertionError):\n        _, result = etl_pipeline.run(nothing_given)\n"
  },
  {
    "path": "dataverse/tests/test_cleaning_number.py",
    "content": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef helper___test___generate_number(spark, *args, **kwargs):\n    data = [\n        (\"1234 apples and 5678 oranges \",),\n        (\"9876.54321 dollars\",),\n        (\"This is random 1-3462-01.xx 87\",),\n        (\"**6*342* history 0.6242 00002\",),\n        (\"#eff000, af2f33, random color codes 1110 (013-0802-1143)\",),\n        (\"88888888888888888888-888\",),\n    ]\n    df = spark.createDataFrame(data, [\"text\"])\n    return df\n\n\ndef test_cleaning___number___normalize():\n    from etl.cleaning.number import cleaning___number___normalize  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_number\"},\n                {\"name\": \"cleaning___number___normalize\", \"args\": {\"assign_number\": 8}},\n            ]\n        }\n    )\n    spark, result = etl_pipeline.run(ETL_config)\n    result_df = result.toDF()\n    expected_data = [\n        (\"8888 apples and 8888 oranges \",),\n        (\"8888.88888 dollars\",),\n        (\"This is random 8-8888-88.xx 88\",),\n        (\"**8*888* history 8.8888 88888\",),\n        (\"#eff888, af8f88, random color codes 8888 (888-8888-8888)\",),\n        (\"88888888888888888888-888\",),\n    ]\n    expected_df = spark.createDataFrame(expected_data, [\"text\"])\n    assert expected_df.collect() == result_df.collect()\n"
  },
  {
    "path": "dataverse/tests/test_cleaning_table.py",
    "content": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef helper___test___generate_table(spark, *args, **kwargs):\n    data = [(1, 2, \"duck\"), (3, 4, \"duck\"), (5, 6, \"ducky\")]\n    df = spark.createDataFrame(data, [\"column1\", \"column2\", \"species\"])\n    return df\n\n\ndef test_cleaning___table___merge_col_vertical():\n    from etl.cleaning.table import cleaning___table___merge_col_vertical  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_table\"},\n                {\n                    \"name\": \"cleaning___table___merge_col_vertical\",\n                    \"args\": {\n                        \"col1\": \"column1\",\n                        \"col2\": \"column2\",\n                        \"merge_col_name\": \"number\",\n                    },\n                },\n            ]\n        }\n    )\n    spark, result = etl_pipeline.run(ETL_config)\n    expected_data = [\n        (\"duck\", 1),\n        (\"duck\", 3),\n        (\"ducky\", 5),\n        (\"duck\", 2),\n        (\"duck\", 4),\n        (\"ducky\", 6),\n    ]\n    expected_df = spark.createDataFrame(expected_data, [\"species\", \"number\"])\n    assert result.collect() == expected_df.collect()\n"
  },
  {
    "path": "dataverse/tests/test_cleaning_unicode.py",
    "content": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef helper___test___generate_unicode_data(spark, *args, **kwargs):\n    data = [\n        (\"，。、„”“«»１」「《》´∶：？！（）；–—．～’…━〈〉【】％►\",),\n        (\"Hello， world！！0 1 2 ñ\",),\n        (\"This is fun。ction for —dataverse━\",),\n        (\"You can use 《dataverse》” for your ETL cycle？\",),\n        (\"Test sentence.\",),\n        (\"～～～～\",),\n    ]\n    df = spark.createDataFrame(data, [\"text\"])\n    return df\n\n\ndef helper___test___generate_expected_unicode_data(spark, type=\"remove\"):\n    assert type in [\"remove\", \"replace\", \"normalize\"]\n\n    if type == \"remove\":\n        expected_data = [\n            (\"\",),\n            (\"Hello world0 1 2 ñ\",),\n            (\"This is function for dataverse\",),\n            (\"You can use dataverse for your ETL cycle\",),\n            (\"Test sentence.\",),\n            (\"\",),\n        ]\n    elif type == \"replace\":\n        expected_data = [\n            (''',.,\"\"\"\"\"\"\"\"\"\"'::?!();- - . ~'...-<>[]%-''',),\n            (\"Hello, world!!0 1 2 ñ\",),\n            (\"This is fun.ction for  - dataverse-\",),\n            ('You can use \"dataverse\"\" for your ETL cycle?',),\n            (\"Test sentence.\",),\n            (\"~~~~\",),\n        ]\n\n    else:  # type == \"normalize\"\n        expected_data = [\n            (\"，。、„”“«»１」「《》´∶：？！（）；–—．～’…━〈〉【】％►\",),\n            (\"Hello， world！！0 1 2 ñ\",),\n            (\"This is fun。ction for —dataverse━\",),\n            (\"You can use 《dataverse》” for your ETL cycle？\",),\n            (\"Test sentence.\",),\n            (\"～～～～\",),\n        ]\n\n    return spark.createDataFrame(expected_data, [\"text\"])\n\n\ndef test_cleaning___unicode___remove_punct():\n    from etl.cleaning.unicode import cleaning___unicode___remove_punct  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_unicode_data\"},\n                {\"name\": \"cleaning___unicode___remove_punct\"},\n            ]\n        }\n    )\n    spark, result = etl_pipeline.run(ETL_config)\n    expected = helper___test___generate_expected_unicode_data(spark, type=\"remove\")\n\n    assert all(\n        result_row[\"text\"] == expected_row[\"text\"]\n        for (result_row, expected_row) in zip(result.collect(), expected.collect())\n    )\n\n\ndef test_cleaning___unicode___replace_punct():\n    from etl.cleaning.unicode import cleaning___unicode___replace_punct  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_unicode_data\"},\n                {\"name\": \"cleaning___unicode___replace_punct\"},\n            ]\n        }\n    )\n    spark, result = etl_pipeline.run(ETL_config)\n    expected = helper___test___generate_expected_unicode_data(spark, type=\"replace\")\n\n    assert all(\n        result_row[\"text\"] == expected_row[\"text\"]\n        for (result_row, expected_row) in zip(result.collect(), expected.collect())\n    )\n\n\ndef test_cleaning___unicode___normalize():\n    from etl.cleaning.unicode import cleaning___unicode___normalize  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_unicode_data\"},\n                {\"name\": \"cleaning___unicode___normalize\"},\n            ]\n        }\n    )\n    spark, result = etl_pipeline.run(ETL_config)\n    expected = helper___test___generate_expected_unicode_data(spark, type=\"normalize\")\n    assert all(\n        result_row[\"text\"] == expected_row[\"text\"]\n        for (result_row, expected_row) in zip(result.collect(), expected.collect())\n    )\n"
  },
  {
    "path": "dataverse/tests/test_deduplication_common_crawl.py",
    "content": "from omegaconf import OmegaConf\nfrom pyspark.sql import Row\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef helper___test___generate_exact_line(spark, *args, **kwagrs):\n    data = [\n        Row(text=\"DataversE\\ndATAVERSE\\nQuack\\nQUaCk\\nquack\", line_ids=[0, 2]),\n        Row(text=\"hello\\nHELLO\\nWorld\\nWoRLD\", line_ids=[0, 2]),\n    ]\n    df = spark.createDataFrame(data)\n    return df\n\n\ndef test_deduplication___common_crawl___exact_line():\n    from etl.deduplication.common_crawl import (  # noqa: F401\n        deduplication___common_crawl___exact_line,\n    )\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_exact_line\"},\n                {\"name\": \"deduplication___common_crawl___exact_line\"},\n            ]\n        }\n    )\n\n    spark, result = etl_pipeline.run(ETL_config)\n    result = spark.createDataFrame(result)\n\n    expected = [{\"text\": \"DataversE\\nQuack\"}, {\"text\": \"hello\\nWorld\"}]\n    expected = spark.createDataFrame(expected)\n\n    assert set(result.collect()) == set(expected.collect())\n"
  },
  {
    "path": "dataverse/tests/test_deduplication_exact.py",
    "content": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef helper___test___generate_duplicated_data(spark, *args, **kwargs):\n    data = [(1, \"dataverse\"), (2, \"apple\"), (3, \"dataverse\"), (4, \"carrot\")]\n    columns = [\"id\", \"text\"]\n    df = spark.createDataFrame(data, columns)\n    return df\n\n\ndef test_deduplication___exact_column():\n    from etl.deduplication.exact import deduplication___exact___column  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n    columns = [\"id\", \"text\"]\n\n    # subset : text\n    ETL_config_text = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_duplicated_data\"},\n                {\n                    \"name\": \"deduplication___exact___column\",\n                    \"args\": {\"subset\": [\"text\"]},\n                },\n            ]\n        }\n    )\n    spark, result_text = etl_pipeline.run(ETL_config_text)\n    expected_text = [(1, \"dataverse\"), (2, \"apple\"), (4, \"carrot\")]\n    expected_text = spark.createDataFrame(expected_text, schema=columns)\n    assert sorted(result_text.collect()) == sorted(expected_text.collect())\n\n    # subset : text and id\n    ETL_config_text_id = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___generate_duplicated_data\"},\n                {\n                    \"name\": \"deduplication___exact___column\",\n                    \"args\": {\"subset\": [\"text\", \"id\"]},\n                },\n            ]\n        }\n    )\n    spark, result_text_id = etl_pipeline.run(ETL_config_text_id)\n    expected_text_id = [(1, \"dataverse\"), (2, \"apple\"), (3, \"dataverse\"), (4, \"carrot\")]\n    expected_text_id = spark.createDataFrame(expected_text_id, schema=columns)\n    assert set(result_text_id.collect()) == set(expected_text_id.collect())\n"
  },
  {
    "path": "dataverse/tests/test_deduplication_minhash.py",
    "content": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n"
  },
  {
    "path": "dataverse/tests/test_deduplication_polyglot.py",
    "content": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n\n@register_etl\ndef helper___test___create_data_for_polyglot_minhash(spark, *args, **kwargs):\n    data = [\n        {\"text\": \"hello wolrd! Welcome to dataverse.\"},\n        {\"text\": \"hello wolrd! Welcome to dataverrrrse.\"},\n        {\"text\": \"a totally different sentence\"},\n    ]\n    df = spark.createDataFrame(data)\n    return df\n\n\ndef test_deduplication___polyglot___minhash():\n    from etl.deduplication.polyglot import (  # noqa: F401\n        deduplication___polyglot___minhash,\n    )\n\n    etl_pipeline = ETLPipeline()\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___create_data_for_polyglot_minhash\"},\n                {\n                    \"name\": \"deduplication___polyglot___minhash\",\n                    \"args\": {\n                        \"expand_size\": 64,\n                        \"n_gram\": 2,\n                        \"seed\": 1,\n                        \"char_level\": False,\n                        \"sim_threshold\": 0.2,\n                    },\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n\n    assert result.count() == 2\n\n    texts = set(map(lambda x: x[\"text\"], result.collect()))\n    assert (\"hello wolrd! Welcome to dataverse.\" in texts) or (\n        \"hello wolrd! Welcome to dataverrrrse.\" in texts\n    )\n    assert \"a totally different sentence\" in texts\n"
  },
  {
    "path": "dataverse/tests/test_pii_card.py",
    "content": "import re\n\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\nsample_card = \"1234-1234-1234-1234\"\n\n\n@register_etl\ndef helper___test___create_data_for_pii_card(spark, *args, **kwargs):\n    return spark.createDataFrame([{\"text\": f\"Your card No. is {sample_card}\"}])\n\n\ndef test_pii___card___replace():\n    from etl.pii.card import pii___card___replace_card_number  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n\n    # Case 1: replace with random number\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___create_data_for_pii_card\"},\n                {\n                    \"name\": \"pii___card___replace_card_number\",\n                    \"args\": {\"random_pii\": True},\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n    assert re.match(r\"(\\d{4}-\\d{4}-\\d{4}-\\d{4})\", result.collect()[0][\"text\"]) != sample_card\n\n    # Case 2: replace with replace token\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___create_data_for_pii_card\"},\n                {\n                    \"name\": \"pii___card___replace_card_number\",\n                    \"args\": {\"replace_pii\": True, \"replace_token\": \"[CARD_NUMBER]\"},\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n    expected = \"Your card No. is [CARD_NUMBER]\"\n\n    assert result.collect()[0][\"text\"] == expected\n\n    # Case 3: replace with replace token and add start, end token\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___create_data_for_pii_card\"},\n                {\n                    \"name\": \"pii___card___replace_card_number\",\n                    \"args\": {\n                        \"replace_pii\": True,\n                        \"replace_token\": \"[CARD_NUMBER]\",\n                        \"start_token\": \"[CARD_START]\",\n                        \"end_token\": \"[CARD_END]\",\n                    },\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n    expected = \"Your card No. is [CARD_START][CARD_NUMBER][CARD_END]\"\n    assert result.collect()[0][\"text\"] == expected\n\n    # Case 4: can't detect with different pattern\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___create_data_for_pii_card\"},\n                {\n                    \"name\": \"pii___card___replace_card_number\",\n                    \"args\": {\n                        \"pattern\": r\"(\\d{5}-\\d{5}-\\d{5}-\\d{5})\",\n                        \"replace_pii\": True,\n                        \"replace_token\": \"[CARD_NUMBER]\",\n                        \"start_token\": \"[CARD_START]\",\n                        \"end_token\": \"[CARD_END]\",\n                    },\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n    expected = f\"Your card No. is {sample_card}\"\n    assert result.collect()[0][\"text\"] == expected\n"
  },
  {
    "path": "dataverse/tests/test_pii_nin.py",
    "content": "import re\n\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\nsample_nin = \"240101-0111111\"\n\n\n@register_etl\ndef helper___test___create_data_for_korean_rnn(spark, *args, **kwargs):\n    return spark.createDataFrame([{\"text\": f\"nin is {sample_nin}\"}])\n\n\ndef test_pii___nin___replace_korean_rnns():\n    from etl.pii.nin import pii___nin___replace_korean_rrn  # noqa: F401\n\n    etl_pipeline = ETLPipeline()\n\n    # Case 1: Random PII\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___create_data_for_korean_rnn\"},\n                {\n                    \"name\": \"pii___nin___replace_korean_rrn\",\n                    \"args\": {\"random_pii\": True, \"replace_pii\": False},\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n    assert re.search(r\"\\d{6}-\\d{7}\", result.collect()[0][\"text\"]) != sample_nin\n\n    # Case 2: Replace PII with Token\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___create_data_for_korean_rnn\"},\n                {\n                    \"name\": \"pii___nin___replace_korean_rrn\",\n                    \"args\": {\"replace_pii\": True, \"replace_token\": \"[REDACTED]\"},\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n    assert result.collect()[0][\"text\"] == \"nin is [REDACTED]\"\n\n    # Case 3: Add start and end tokens\n    ETL_config = OmegaConf.create(\n        {\n            \"etl\": [\n                {\"name\": \"helper___test___create_data_for_korean_rnn\"},\n                {\n                    \"name\": \"pii___nin___replace_korean_rrn\",\n                    \"args\": {\n                        \"replace_pii\": True,\n                        \"replace_token\": \"[REDACTED]\",\n                        \"start_token\": \"[START]\",\n                        \"end_token\": \"[END]\",\n                    },\n                },\n            ]\n        }\n    )\n    _, result = etl_pipeline.run(ETL_config)\n    assert result.collect()[0][\"text\"] == \"nin is [START][REDACTED][END]\"\n"
  },
  {
    "path": "dataverse/utils/README.md",
    "content": "# Utils\n> Common utilities\n\n## API\n\n## Format\n\n## Setting\n"
  },
  {
    "path": "dataverse/utils/__init__.py",
    "content": ""
  },
  {
    "path": "dataverse/utils/analyze/README.md",
    "content": "# Analyze\n> gaining insight of whatever you want to know\n\n## Naming Convention\n- `<target>_<function_name>`\n    - e.g. `python_is_script_executable`\n    - e.g. `jupyter_find_orginal_file`\n\n\n"
  },
  {
    "path": "dataverse/utils/analyze/__init__.py",
    "content": "\nfrom .python import python_is_script_executable\nfrom .pip import pip_get_package_path"
  },
  {
    "path": "dataverse/utils/analyze/pip.py",
    "content": "\nimport pkg_resources\n\n\ndef pip_get_package_path(package_name):\n    try:\n        package = pkg_resources.get_distribution(package_name)\n        return package.location\n    except pkg_resources.DistributionNotFound:\n        print(f\"Package '{package_name}' is not installed.\")\n        return None\n"
  },
  {
    "path": "dataverse/utils/analyze/python.py",
    "content": "\nimport ast\n\ndef python_is_script_executable(file_path, verbose=False):\n    \"\"\"\n    check if a python script is executable\n    in other words, check if the python script does not contains any declaration nodes\n    (imports, functions, classes, etc.)\n\n    declaration nodes:\n    - imports\n    - functions\n    - classes\n    - variables\n\n    Args:\n        file_path (str): path to the python script to check\n    Returns:\n        bool: True if the python script is executable, False otherwise\n    \"\"\"\n    with open(file_path, 'r') as file:\n        source_code = file.read()\n\n    # Parse source code into an AST\n    module = ast.parse(source_code)\n    for node in module.body:\n        if not isinstance(node, (\n            ast.Import,\n            ast.ImportFrom,\n            ast.FunctionDef,\n            ast.ClassDef,\n            ast.Assign\n        )):\n            if verbose:\n                print(\"found executable code: {}\".format(node))\n            return True\n\n    if verbose:\n        print(\"found no executable code\")\n    return False\n"
  },
  {
    "path": "dataverse/utils/api/README.md",
    "content": "# API\n> This is a collection of API wrapper utilities for external sources\n\n## 🥹 Use `original API` as much as you can\n> **Recommend to use the `original API` as much as you can** rather than using this `wrapper`. **Because `original API` is universal and this `wrapper` is not.**\n\nThis is just for **ease usage of some external sources**. This is not a MUST to use. If you feel like you can do it yourself, we strongly recommend to do so.\n\nOur purpose is to make a code easier to read and understand and normally `original API` is easy to read and understand for many people. This `wrapper` is just for some people who are not familiar with the `original API` or want to make a code more readable.\n\n\n### ✅ Recommended (`original API`)\n```python\nimport boto3\n\ns3 = boto3.client(\"s3\")\nbuckets = s3.list_buckets()['Buckets']\nbucket_names = []\nfor bucket in buckets:\n    bucket_names.append(bucket['Name'])\n```\n\n### ❌ Not Recommended (`wrapper`)\n```python\nfrom dataverse.utils.api import aws_s3_list_buckets\n\nbucket_names = aws_s3_list_buckets()\n```\n\n## Support API\n- aws\n\n## Naming Convention\n- `<api_name>_<function_name>`\n    - e.g. `aws_s3_upload_file`\n    - e.g. `aws_s3_download_file`"
  },
  {
    "path": "dataverse/utils/api/__init__.py",
    "content": "\n# AWS\nfrom .aws import AWSClient \nfrom .aws import EMRManager\n\nfrom .aws import aws_check_credentials\nfrom .aws import aws_get_state\nfrom .aws import aws_set_state\n\n# EC2\nfrom .aws import aws_ec2_instance_at_az\nfrom .aws import aws_ec2_instance_info\nfrom .aws import aws_ec2_all_instance_info\nfrom .aws import aws_ec2_get_price\n\n# SSM\nfrom .aws import aws_ssm_run_commands\n\n# VPC\nfrom .aws import aws_vpc_create\nfrom .aws import aws_vpc_delete\nfrom .aws import aws_subnet_create\nfrom .aws import aws_subnet_delete\nfrom .aws import aws_subnet_az\nfrom .aws import aws_emr_security_group_create\nfrom .aws import aws_security_group_delete\nfrom .aws import aws_security_group_remove_dependency\nfrom .aws import aws_gateway_create\nfrom .aws import aws_gateway_delete\nfrom .aws import aws_route_table_create\nfrom .aws import aws_route_table_delete\nfrom .aws import aws_route_table_asscociate_subnet\n\nfrom .aws import aws_elastic_ip_allocate\nfrom .aws import aws_elastic_ip_release\nfrom .aws import aws_nat_gateway_create\nfrom .aws import aws_nat_gateway_delete\n\nfrom .aws import aws_iam_role_create\nfrom .aws import aws_iam_role_delete\nfrom .aws import aws_iam_instance_profile_create\nfrom .aws import aws_iam_instance_profile_delete\n\n# S3\nfrom .aws import aws_s3_path_parse\nfrom .aws import aws_s3_create_bucket\nfrom .aws import aws_s3_delete_bucket\nfrom .aws import aws_s3_read\nfrom .aws import aws_s3_download\nfrom .aws import aws_s3_upload\nfrom .aws import aws_s3_write\nfrom .aws import aws_s3_delete\nfrom .aws import aws_s3_list_buckets\nfrom .aws import aws_s3_ls \nfrom .aws import aws_s3_get_object_type"
  },
  {
    "path": "dataverse/utils/api/aws.py",
    "content": "\n\"\"\"\nUsage:\n\n```python\nfrom dataverse.utils.api import aws_s3_list_buckets\nfrom dataverse.utils.api import aws_s3_list\n\naws_s3_list_buckets()\naws_s3_list(\"bucket\")\n```\n\"\"\"\n\n\nimport os\nimport glob\nimport re\nimport shutil\nimport tarfile\nimport tempfile\nimport json\nimport time\nimport boto3\nimport datetime\nimport ipaddress\nimport pkg_resources\nfrom omegaconf import OmegaConf\n\nfrom dataverse.utils.analyze import python_is_script_executable\n\n\n# TODO: get the information from AWS when it's supported someday\n# reference - https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-emr-supported-instance-types.html\nEMR_SUPPORTED_EC2_INSTANCES = [\n    \"m1.small\", \"m1.medium\", \"m1.large\", \"m1.xlarge\", \"m3.xlarge\", \"m3.2xlarge\",\n    \"c1.medium\", \"c1.xlarge\", \"c3.xlarge\", \"c3.2xlarge\", \"c3.4xlarge\", \"c3.8xlarge\",\n    \"cc1.4xlarge\", \"cc2.8xlarge\",\n    \"c4.large\", \"c4.xlarge\", \"c4.2xlarge\", \"c4.4xlarge\", \"c4.8xlarge\",\n    \"c5.xlarge\", \"c5.9xlarge\", \"c5.2xlarge\", \"c5.4xlarge\", \"c5.9xlarge\", \"c5.18xlarge\",\n    \"c5d.xlarge\", \"c5d.2xlarge\", \"c5d.4xlarge\", \"c5d.9xlarge\", \"c5d.18xlarge\",\n    \"m2.xlarge\", \"m2.2xlarge\", \"m2.4xlarge\",\n    \"r3.xlarge\", \"r3.2xlarge\", \"r3.4xlarge\", \"r3.8xlarge\",\n    \"cr1.8xlarge\",\n    \"m4.large\", \"m4.xlarge\", \"m4.2xlarge\", \"m4.4xlarge\", \"m4.10xlarge\", \"m4.16large\",\n    \"m5.xlarge\", \"m5.2xlarge\", \"m5.4xlarge\", \"m5.12xlarge\", \"m5.24xlarge\",\n    \"m5d.xlarge\", \"m5d.2xlarge\", \"m5d.4xlarge\", \"m5d.12xlarge\", \"m5d.24xlarge\",\n    \"r4.large\", \"r4.xlarge\", \"r4.2xlarge\", \"r4.4xlarge\", \"r4.8xlarge\", \"r4.16xlarge\",\n    \"h1.4xlarge\",\n    \"hs1.2xlarge\", \"hs1.4xlarge\", \"hs1.8xlarge\",\n    \"i2.xlarge\", \"i2.2xlarge\", \"i2.4xlarge\", \"i2.8xlarge\",\n    \"d2.xlarge\", \"d2.2xlarge\", \"d2.4xlarge\", \"d2.8xlarge\",\n    \"g2.2xlarge\",\n    \"cg1.4xlarge\"\n]\n\ndef aws_check_credentials(verbose=True):\n    \"\"\"\n    simple check if aws credentials are valid\n\n    Returns:\n        bool: True if valid, False if not valid\n    \"\"\"\n    sts = boto3.client('sts')\n    try:\n        sts.get_caller_identity()\n        return True\n    except Exception as e:\n        if verbose:\n            print(e)\n        return False\n\nclass AWSClient:\n    \"\"\"\n    AWS Client Information\n    \"\"\"\n    # Singleton\n    _initialized = False\n\n    def __new__(cls):\n        if not hasattr(cls, 'instance'):\n            cls.instance = super(AWSClient, cls).__new__(cls)\n        return cls.instance\n\n    def __init__(self):\n        if self._initialized:\n            return\n        self.region = boto3.session.Session().region_name\n        if self.region is None:\n            raise Exception(\"AWS Region is not set. Set the AWS Region with `aws configure`\")\n\n        self.sts = boto3.client('sts')\n        self.iam = boto3.client('iam')\n        self.s3 = boto3.client('s3')\n        self.ec2 = boto3.client('ec2')\n        self.emr = boto3.client('emr')\n        self.ssm = boto3.client('ssm')\n        self.user_id = self.sts.get_caller_identity()['UserId']\n        self.account_id = self.sts.get_caller_identity()['Account']\n        self._initialized = True\n\n    def __str__(self) -> str:\n        self.__repr__()\n\n    def __repr__(self) -> str:\n        return f\"AWSClient(region={self.region}, user_id={self.user_id})\"\n\n\n# --------------------------------------------------------------------------------\n# AWS State\n\"\"\"\n[ What is State? ]\n>>> state management of operating aws services for dataverse\n\nstate will be managed by python dictionary and saved as json file in aws s3.\nThis will be synced with running AWS services and it will be created for each user.\n\n[ stored information ]\n- cache, meta, config, codes, etc.\n\"\"\"\ndef aws_get_state():\n    # to avoid circular import\n    from dataverse.utils.setting import SystemSetting\n\n    aws_bucket = SystemSetting()['AWS_BUCKET']\n    state_path = f'{AWSClient().user_id}/state.json'\n\n    # get state from aws s3\n    try:\n        content = aws_s3_read(aws_bucket, state_path)\n        state = json.loads(content)\n\n    # FIXME: exception should distinguish between key not found and other errors\n    except:\n        state = {}\n        aws_s3_write(aws_bucket, state_path, json.dumps(state))\n\n    return state\n\ndef aws_set_state(state):\n    # to avoid circular import\n    from dataverse.utils.setting import SystemSetting\n\n    aws_bucket = SystemSetting()['AWS_BUCKET']\n    state_path = f'{AWSClient().user_id}/state.json'\n    aws_s3_write(aws_bucket, state_path, json.dumps(state))\n\n\n# --------------------------------------------------------------------------------\n# AWS EC2 Resource\ndef aws_ec2_instance_at_az(az):\n    \"\"\"\n    get all instance info at the given AZ\n    \"\"\"\n    response = AWSClient().ec2.describe_instance_type_offerings(\n        LocationType='availability-zone',\n        Filters=[\n            {\n                'Name': 'location',\n                'Values': [\n                    az,\n                ]\n            },\n        ]\n    )\n    instances = [inst['InstanceType'] for inst in response['InstanceTypeOfferings']]\n\n    return instances\n\ndef aws_ec2_instance_info(instance):\n    \"\"\"\n    get instance info from aws\n    \"\"\"\n    response = AWSClient().ec2.describe_instance_types(\n        InstanceTypes=[instance],\n    )\n\n    return response\n\ndef aws_ec2_all_instance_info():\n    \"\"\"\n    get all instance types information\n    \"\"\"\n    instance_info = {}\n    token = ''\n    while True:\n        if token == '':\n            response = AWSClient().ec2.describe_instance_types()\n        else:\n            response = AWSClient().ec2.describe_instance_types(NextToken=token)\n\n        for instance_type in response['InstanceTypes']:\n            instance_info[instance_type['InstanceType']] = {\n                'vcpu': instance_type['VCpuInfo']['DefaultVCpus'],\n                'memory': instance_type['MemoryInfo']['SizeInMiB']\n            }\n\n        if 'NextToken' in response:\n            token = response['NextToken']\n        else:\n            break\n\n    return instance_info\n\ndef aws_ec2_get_price(instance_type):\n    response = AWSClient().ec2.describe_spot_price_history(\n        InstanceTypes=[instance_type],\n        ProductDescriptions=['Linux/UNIX (Amazon VPC)'],\n        StartTime=datetime.datetime.now(),\n        MaxResults=1,\n    )\n\n    return response['SpotPriceHistory'][0]['SpotPrice']\n\n\n# --------------------------------------------------------------------------------\n# AWS SSM (Systems Manager)\ndef aws_ssm_run_commands(instance_ids, commands, verbose=True, return_output=False):\n    \"\"\"\n    Run commands on a list of EC2 instances using AWS SSM.\n    \"\"\"\n    if return_output:\n        results = {}\n    for command in commands:\n        if verbose:\n            print(f\"Sending following command to all instances...\")\n            print(\"==========================================\")\n            print(command)\n            print(\"==========================================\")\n\n        command_id = AWSClient().ssm.send_command(\n            InstanceIds=instance_ids,\n            DocumentName=\"AWS-RunShellScript\",\n            Parameters={\"commands\": [command]},\n            TimeoutSeconds=3600,\n        )[\"Command\"][\"CommandId\"]\n\n        while True:\n            # verify the previous step succeeded before running the next step.\n            cmd_result = AWSClient().ssm.list_commands(CommandId=command_id)[\"Commands\"][0]\n            if cmd_result[\"StatusDetails\"] == \"Success\":\n                if verbose or return_output:\n                    command_invocation = AWSClient().ssm.get_command_invocation(\n                        CommandId=command_id,\n                        InstanceId=instance_ids[0], # assume all instances are the same\n                    )\n                if verbose:\n                    print(\"=========== Standard output ============\")\n                    print(command_invocation[\"StandardOutputContent\"])\n                    print(\"==========================================\")\n                    print(f\"Command succeeded.\")\n                if return_output:\n                    results[command] = command_invocation[\"StandardOutputContent\"]\n                break\n            elif cmd_result[\"StatusDetails\"] in [\"Pending\", \"InProgress\"]:\n                if verbose:\n                    print(f\"Command status is {cmd_result['StatusDetails']}, waiting...\")\n                time.sleep(10)\n            else:\n                if verbose:\n                    print(f\"Command status is {cmd_result['StatusDetails']}, quitting.\")\n                    # get more detailed information about the command failure\n                    command_invocation = AWSClient().ssm.get_command_invocation(\n                        CommandId=command_id,\n                        InstanceId=instance_ids[0], # assume all instances are the same\n                    )\n                    print(\"============= Error output ==============\")\n                    print(command_invocation[\"StandardErrorContent\"])\n                    print(\"=========== Standard output ============\")\n                    print(command_invocation[\"StandardOutputContent\"])\n                    print(\"==========================================\")\n                raise RuntimeError(\n                    f\"Command failed to run. [ {cmd_result['StatusDetails']} ]\"\n                )\n    if return_output:\n        return results\n\n\n\n# --------------------------------------------------------------------------------\n# AWS EMR\n\nclass EMRManager:\n    \"\"\"\n    one EMR manager per one EMR cluster\n    \"\"\"\n    def launch(self, config):\n        \"\"\"\n        auto setup environments and launch emr cluster\n\n        Args:\n            config (OmegaConf): config for the etl\n        \"\"\"\n        # clean unused resources\n        self._clean()\n\n        if config.emr.id is not None:\n            config.emr.auto_generated = False\n\n            return config.emr.id\n\n        # TODO: modify interface for custom policy\n        # create role & instance profile\n        self._role_setup(config)\n        self._instance_profile_setup(config)\n\n        # create vpc\n        self._vpc_setup(config)\n\n        # create emr cluster\n        # XXX: wait until instance profile is ready\n        #      otherwise, emr cluster creation will fail\n        # FIXME: convert to smart solution (e.g. waiter)\n        #        currently AWS doesn't support waiter available option for instance profile\n        # NOTE: I've tried to make waiter using `describe_instance_profile` but it didn't work\n        time.sleep(7)\n\n        # set default instance type\n        self._set_default_instance(config)\n\n        emr_id = self._emr_cluster_create(config)\n        config.emr.id = emr_id\n        config.emr.auto_generated = True\n\n        return emr_id\n\n    def _role_setup(self, config):\n        \"\"\"\n        TODO: modify interface for custom policy\n        \"\"\"\n\n        # [ EC2 ] --------------------------------------------------\n        ec2_trust_policy = {\n            \"Version\": \"2008-10-17\",\n            \"Statement\": [\n                {\n                    \"Sid\": \"\",\n                    \"Effect\": \"Allow\",\n                    \"Principal\": {\n                        \"Service\": \"ec2.amazonaws.com\"\n                    },\n                    \"Action\": \"sts:AssumeRole\"\n                }\n            ]\n        }\n        ec2_role = 'Dataverse_EMR_EC2_DefaultRole'\n        ec2_policy = 'AmazonElasticMapReduceforEC2Role'\n        ssm_policy = 'AmazonSSMManagedInstanceCore'\n\n        # add timestamp to temporary role name\n        timestamp = datetime.datetime.now().strftime(\"%Y%m%d%H%M%S\")\n        ec2_role = f\"{ec2_role}_{timestamp}\"\n        ec2_policy_arns = [\n            f\"arn:aws:iam::aws:policy/service-role/{ec2_policy}\",\n            f\"arn:aws:iam::aws:policy/{ssm_policy}\"\n        ]\n\n        aws_iam_role_create(\n            role_name=ec2_role,\n            trust_policy=ec2_trust_policy,\n            policy_arns=ec2_policy_arns,\n            description='Role for Dataverse EMR EC2',\n        )\n        config.emr.role.ec2.name = ec2_role\n        config.emr.role.ec2.policy_arns = ec2_policy_arns\n\n        # [ EMR ] --------------------------------------------------\n        emr_trust_policy = {\n            \"Version\": \"2008-10-17\",\n            \"Statement\": [\n                {\n                    \"Sid\": \"\",\n                    \"Effect\": \"Allow\",\n                    \"Principal\": {\n                        \"Service\": \"elasticmapreduce.amazonaws.com\"\n                    },\n                    \"Action\": \"sts:AssumeRole\",\n                    \"Condition\": {\n                        \"StringEquals\": {\n                            \"aws:SourceAccount\": AWSClient().account_id\n                        },\n                        \"ArnLike\": {\n                            \"aws:SourceArn\": f\"arn:aws:elasticmapreduce:{AWSClient().region}:{AWSClient().account_id}:*\"\n                        }\n                    }\n                }\n            ]\n        }\n        emr_role = 'Dataverse_EMR_DefaultRole'\n        emr_policy = 'AmazonElasticMapReduceRole'\n\n        # add timestamp to temporary role name\n        timestamp = datetime.datetime.now().strftime(\"%Y%m%d%H%M%S\")\n        emr_role = f\"{emr_role}_{timestamp}\"\n        emr_policy_arns = [f\"arn:aws:iam::aws:policy/service-role/{emr_policy}\"]\n\n        aws_iam_role_create(\n            role_name=emr_role,\n            trust_policy=emr_trust_policy,\n            policy_arns=emr_policy_arns,\n            description='Role for Dataverse EMR',\n        )\n        config.emr.role.emr.name = emr_role\n        config.emr.role.emr.policy_arns = emr_policy_arns\n\n    def _instance_profile_setup(self, config):\n        \"\"\"\n        TODO: modify interface for custom policy\n        \"\"\"\n        ec2_role = config.emr.role.ec2.name\n        instance_profile_name = 'Dataverse_EMR_EC2_DefaultRole_InstanceProfile'\n\n        # add timestamp to temporary role name\n        timestamp = datetime.datetime.now().strftime(\"%Y%m%d%H%M%S\")\n        instance_profile_name = f\"{instance_profile_name}_{timestamp}\"\n\n        aws_iam_instance_profile_create(\n            instance_profile_name=instance_profile_name,\n            role_name=ec2_role,\n        )\n        config.emr.instance_profile.name = instance_profile_name\n        config.emr.instance_profile.ec2_role = ec2_role\n\n    def _vpc_setup(self, config):\n        \"\"\"\n        config will be automatically updated\n        \"\"\"\n\n        # VPC\n        vpc_id = aws_vpc_create()\n        config.emr.vpc.id = vpc_id\n\n        # if private subnet is required\n        subnet_args = {\n            'vpc_id': vpc_id,\n            'tag_name': 'Dataverse-Temporary-Subnet-Public',\n        }\n        if not config.emr.subnet.public:\n            vpcs = AWSClient().ec2.describe_vpcs(VpcIds=[vpc_id])\n            cidr_block = vpcs['Vpcs'][0]['CidrBlock']\n            ip_net = ipaddress.ip_network(cidr_block)\n\n            # split the network into two subnets\n            public_subnet, private_subnet = list(ip_net.subnets())\n            subnet_args['cird_block'] = str(public_subnet)\n\n        # Subnet\n        subnet_id = aws_subnet_create(**subnet_args)\n        config.emr.subnet.id = subnet_id\n        config.emr.subnet.public_id = subnet_id\n\n        # Internet Gateway\n        gateway_id = aws_gateway_create(vpc_id)\n        config.emr.gateway.id = gateway_id\n\n        # Route Table\n        route_table_id = aws_route_table_create(\n            vpc_id=vpc_id,\n            gateway_id=gateway_id,\n            destination_cidr_block='0.0.0.0/0',\n            tag_name='Dataverse-Route-Table-Public',\n        )\n        aws_route_table_asscociate_subnet(subnet_id, route_table_id)\n        config.emr.route_table.id = route_table_id\n\n        if not config.emr.subnet.public:\n            # add NAT Gateway to public subnet\n            elastic_ip_id = aws_elastic_ip_allocate(vpc_id=vpc_id)\n            config.emr.elastic_ip.id = elastic_ip_id\n\n            nat_gateway_id = aws_nat_gateway_create(\n                vpc_id=vpc_id,\n                subnet_id=subnet_id,\n                elastic_ip_id=elastic_ip_id,\n            )\n            config.emr.nat_gateway.id = nat_gateway_id\n\n            # create private subnet\n            private_subnet_id = aws_subnet_create(\n                vpc_id=vpc_id,\n                cird_block=str(private_subnet),\n                tag_name='Dataverse-Temporary-Subnet-Private',\n            )\n            config.emr.subnet.id = private_subnet_id\n            config.emr.subnet.private_id = private_subnet_id\n\n            # add NAT Gateway to private subnet\n            private_route_table_id = aws_route_table_create(\n                vpc_id=vpc_id,\n                nat_gateway_id=nat_gateway_id,\n                destination_cidr_block='0.0.0.0/0',\n                tag_name='Dataverse-Route-Table-Private',\n            )\n            aws_route_table_asscociate_subnet(\n                subnet_id=private_subnet_id,\n                route_table_id=private_route_table_id,\n            )\n\n        # set state\n        state = aws_get_state()\n        state['vpc'][vpc_id]['public_subnet'] = config.emr.subnet.public\n        aws_set_state(state)\n\n    def _set_default_instance(\n        self,\n        config,\n        min_memory=2048,\n        max_memory=8192,\n    ):\n        \"\"\"\n        choose default instance type by memory\n\n        args:\n            config (OmegaConf): config for the etl\n            min_memory (int): minimum memory size (MiB)\n            max_memory (int): maximum memory size (MiB)\n        \"\"\"\n        subnet_id = config.emr.subnet.id\n        az = aws_subnet_az(subnet_id)\n        instances = aws_ec2_instance_at_az(az=az)\n\n        # find memory size is bigger specified min/max memory\n        candidate = None\n        _min_candidate_memory = float('inf')\n        for instance in instances:\n\n            # check if instance is supported by EMR\n            if instance not in EMR_SUPPORTED_EC2_INSTANCES:\n                continue\n\n            instance_info = aws_ec2_instance_info(instance)\n            memory = instance_info['InstanceTypes'][0]['MemoryInfo']['SizeInMiB']\n            if min_memory <= memory <= max_memory:\n                if memory < _min_candidate_memory:\n                    candidate = instance\n                    _min_candidate_memory = memory\n\n        if candidate is None:\n            raise Exception(f\"Unable to find instance type with memory between {min_memory} and {max_memory}\")\n\n\n        instance_info = aws_ec2_instance_info(candidate)\n        vcpu = instance_info['InstanceTypes'][0]['VCpuInfo']['DefaultVCpus']\n        memory = instance_info['InstanceTypes'][0]['MemoryInfo']['SizeInMiB']\n        print(\n            f\"{'=' * 80}\\n\"\n            f\"Default instance type is [ {candidate} ]\\n\"\n            f\"{'=' * 80}\\n\"\n            f\" vCPU: {vcpu}\\n\"\n            f\" Memory: {memory}\\n\"\n            f\" Price: {aws_ec2_get_price(candidate)}\\n\"\n            f\"{'=' * 80}\\n\"\n        )\n\n        if config.emr.master_instance.type is None:\n            config.emr.master_instance.type = candidate\n        if config.emr.core_instance.type is None:\n            config.emr.core_instance.type = candidate\n        if config.emr.task_instance.type is None:\n            config.emr.task_instance.type = candidate\n\n    def _emr_cluster_create(self, config):\n        \"\"\"\n        create aws emr cluster\n\n        Args:\n            config (OmegaConf): config for the etl\n        \"\"\"\n        # to avoid circular import\n        from dataverse.utils.setting import SystemSetting\n        log_dir = f\"s3://{SystemSetting().AWS_BUCKET}/{AWSClient().user_id}/emr/logs\"\n\n        # instance group setting\n        instance_groups = [\n            {\n                'Name': 'master nodes',\n                'Market': 'ON_DEMAND',\n                'InstanceRole': 'MASTER',\n                'InstanceType': config.emr.master_instance.type,\n                'InstanceCount': 1,\n            },\n            {\n                'Name': 'core nodes',\n                'Market': 'ON_DEMAND',\n                'InstanceRole': 'CORE',\n                'InstanceType': config.emr.core_instance.type,\n                'InstanceCount': config.emr.core_instance.count,\n            },\n        ]\n\n        # task is optional\n        if config.emr.task_instance.count > 0:\n            instance_groups.append(\n                {\n                    'Name': 'task nodes',\n                    'Market': 'ON_DEMAND',\n                    'InstanceRole': 'TASK',\n                    'InstanceType': config.emr.task_instance.type,\n                    'InstanceCount': config.emr.task_instance.count,\n                }\n            )\n\n        # create emr cluster\n        emr_id = AWSClient().emr.run_job_flow(\n            Name=config.emr.name,\n            ReleaseLabel=config.emr.release,\n            AutoTerminationPolicy={\n                \"IdleTimeout\": config.emr.idle_timeout,\n            },\n            Instances={\n                'InstanceGroups': instance_groups,\n                'KeepJobFlowAliveWhenNoSteps': True,\n                'TerminationProtected': False,\n                'Ec2SubnetId': config.emr.subnet.id,\n            },\n            Applications=[{'Name': 'Spark'}],\n            VisibleToAllUsers=True,\n            JobFlowRole=config.emr.instance_profile.name,\n            ServiceRole=config.emr.role.emr.name,\n            Tags=[\n                {\n                    'Key': 'Name',\n                    'Value': config.emr.name,\n                },\n            ],\n            LogUri=log_dir,\n        )['JobFlowId']\n\n        # wait until emr cluster is ready\n        waiter = AWSClient().emr.get_waiter('cluster_running')\n        waiter.wait(ClusterId=emr_id)\n\n        # set state\n        state = aws_get_state()\n        if 'emr' not in state:\n            state['emr'] = {}\n\n        state['emr'][emr_id] = {\n            'vpc_id': config.emr.vpc.id,\n        }\n\n        # instance profile\n        if config.emr.instance_profile.name is not None:\n            state['emr'][emr_id]['instance_profile'] = config.emr.instance_profile.name\n\n        # role\n        if 'role' not in state['emr'][emr_id]:\n            state['emr'][emr_id]['role'] = {}\n\n        if config.emr.role.emr.name is not None:\n            state['emr'][emr_id]['role']['emr'] = config.emr.role.emr.name\n        if config.emr.role.ec2.name is not None:\n            state['emr'][emr_id]['role']['ec2'] = config.emr.role.ec2.name\n\n        aws_set_state(state)\n\n        config.emr.id = emr_id\n\n        return emr_id\n\n    def run(self, config, verbose=False):\n        # setup environment\n        self._setup(config, verbose=verbose)\n\n        # run emr\n        # get pip installed packages path\n        location = self._get_pip_package_path(config, verbose=verbose)\n        emr_main = os.path.join(location, 'dataverse', 'api', 'emr.py')\n\n        response = AWSClient().emr.add_job_flow_steps(\n            JobFlowId=config.emr.id,\n            Steps=[\n                {\n                    'Name': 'Run Dataverse python script on Master node',\n                    'ActionOnFailure': 'CONTINUE',\n                    'HadoopJarStep': {\n                        'Jar': 'command-runner.jar',\n                        'Args': [\n                            'python3',\n                            emr_main,\n                            '--config',\n                            '/home/hadoop/dataverse/config.yaml',\n                        ]\n                    }\n                },\n            ]\n        )\n        step_id = response['StepIds'][0]\n\n        return step_id\n\n    def _setup(self, config, verbose=False):\n        \"\"\"\n        [ upload to S3 ]\n        - config for `dataverse`\n        - dataverse site-packages source code\n        - requirements.txt\n        - dynamic etl files\n\n        [ move s3 to ec2 ]\n        - move uploaded files in S3 from local to EMR cluster\n\n        [ setup environment on EMR cluster ]\n        - set aws region\n        - install pip dependencies for `dataverse`\n        - set `dataverse` package at EMR cluster pip installed packages path\n        \"\"\"\n        # generate working directory\n        self._get_working_dir(config)\n\n        # upload to necessary dataverse files to S3\n        self._upload_config(config)\n        self._upload_source_code(config)\n        self._upload_dependencies(config)\n        self._upload_dynamic_etl_files(config)\n\n        # move uploaded files in S3 from local to EMR cluster\n        self._move_s3_to_ec2(config, verbose=verbose)\n\n        # setup environment on EMR cluster\n        self._setup_aws(config, verbose=verbose)\n        self._setup_dependencies(config, verbose=verbose)\n        self._setup_source_code(config, verbose=verbose)\n\n    def _get_working_dir(self, config):\n        \"\"\"\n        get working directory path for the emr cluster\n        if not provided, it will be automatically generated\n        \"\"\"\n        # to avoid circular import\n        from dataverse.utils.setting import SystemSetting\n\n        if config.emr.working_dir is not None:\n            working_dir = config.emr.working_dir\n            if working_dir.startswith(('s3://', 's3a://', 's3n://')):\n                aws_s3_matched = re.match(r's3[a,n]?://([^/]+)/(.*)', working_dir)\n                if not aws_s3_matched:\n                    raise ValueError(f\"EMR working directory {working_dir} is not a valid s3 path\")\n        else:\n            # [ emr versioning ] - emr_YYYY-MM-DD_HH:MM:SS_<emr_id>\n            # datetime first for ascending order\n            bucket = SystemSetting()['AWS_BUCKET']\n            user_id = AWSClient().user_id\n            working_dir_name = datetime.datetime.now().strftime(f\"emr_%Y-%m-%d_%H:%M:%S_{config.emr.id}\")\n\n            working_dir = f\"s3://{bucket}/{user_id}/emr/{working_dir_name}\"\n            config.emr.working_dir = working_dir\n\n        return working_dir\n\n    def _upload_config(self, config):\n        \"\"\"\n        upload config for `dataverse` to S3\n        \"\"\"\n        working_dir = self._get_working_dir(config)\n        bucket, key = aws_s3_path_parse(working_dir)\n\n        aws_s3_write(bucket, f\"{key}/config.yaml\", OmegaConf.to_yaml(config))\n\n    def _upload_source_code(self, config):\n        \"\"\"\n        upload pip site-packages source code to S3\n\n        caveat:\n            this doesn't include wheel files or meta data for pip packages\n        \"\"\"\n        # to avoid circular import\n        from dataverse.utils.setting import SystemSetting\n\n        temp_dir = tempfile.mkdtemp()\n        zip_file = os.path.join(temp_dir, 'dataverse.tar.gz')\n\n        dataverse_home = SystemSetting().DATAVERSE_HOME\n        with tarfile.open(zip_file, \"w:gz\") as tar:\n            tar.add(dataverse_home, arcname=os.path.basename(dataverse_home))\n\n        working_dir = self._get_working_dir(config)\n        bucket, key = aws_s3_path_parse(working_dir)\n\n        aws_s3_upload(bucket, f'{key}/dataverse.tar.gz', zip_file)\n\n        shutil.rmtree(temp_dir)\n\n    def _upload_dependencies(self, config, package_name=\"dataverse\"):\n        # get all dependencies\n        requirements = []\n        for r in pkg_resources.get_distribution(package_name).requires():\n            requirements.append(str(r))\n\n        # create requirements.txt\n        temp_dir = tempfile.mkdtemp()\n        dependency_file = os.path.join(temp_dir, 'requirements.txt')\n\n        with open(dependency_file, 'w') as f:\n            for requirement in requirements:\n                f.write(f\"{requirement}\\n\")\n\n        # upload requirements.txt to S3\n        working_dir = self._get_working_dir(config)\n        bucket, key = aws_s3_path_parse(working_dir)\n\n        aws_s3_upload(bucket, f'{key}/requirements.txt', dependency_file)\n\n        shutil.rmtree(temp_dir)\n\n    def _upload_dynamic_etl_files(self, config):\n        # to avoid circular import\n        from dataverse.etl import ETLRegistry\n\n        # get all etl files\n        dynamic_etl_file_paths = []\n        for etl in ETLRegistry().get_all():\n            # not part of the dataverse source but dynamically loaded by user\n            if not etl.__etl_dir__:\n                file_path = etl.__file_path__\n\n                # jupyter notebook is not supported\n                # TODO: allow jupyter notebook\n                # NOTE: reason why jupyter notebook is not supported is because\n                #       the filename point at the temporary file path not the `.ipynb` file\n                if 'ipykernel' in file_path:\n                    raise ValueError(\n                        'Dynamic ETL from jupyter notebook not supported. Only from .py files\\n'\n                        f\"[ {file_path} ] is given which is temporary jupyter cell execution file\\n\"\n                    )\n\n                # only declaration is allowed\n                # TODO: analyze the code and only parse necessary dynamic etl code\n                # NOTE: this is to prevent execution of the code\n                if python_is_script_executable(file_path):\n                    raise ValueError(\n                        'Dynamic ETL file should only contain declaration (imports, functions, classes, etc.)'\n                        f\"[ {file_path} ] includes execution.\\n\"\n                    )\n\n                # check not from of jupyter notebook\n                dynamic_etl_file_paths.append(file_path)\n\n        # upload etl files to S3\n        working_dir = self._get_working_dir(config)\n        bucket, key = aws_s3_path_parse(working_dir)\n\n        # if dynamic_etl dir exists, remove it\n        # NOTE: this is to prevent old files from being uploaded\n        #       in case that user is using setup multiple times with same working_dir\n        try:\n            aws_s3_delete(bucket, f'{key}/dynamic_etl')\n        except:\n            pass\n\n        for file_path in dynamic_etl_file_paths:\n            aws_s3_upload(\n                bucket=bucket,\n                key=f'{key}/dynamic_etl/{os.path.basename(file_path)}',\n                local_path=file_path\n            )\n\n    def _move_s3_to_ec2(self, config, verbose=False):\n        \"\"\"\n        move uploaded files in S3 from local to EMR cluster\n        \"\"\"\n        nodes = AWSClient().emr.list_instances(\n            ClusterId=config.emr.id\n        )[\"Instances\"]\n        instance_ids = [node[\"Ec2InstanceId\"] for node in nodes]\n\n        # remove existing dataverse directory\n        commands = [\n            \"rm -r /home/hadoop/dataverse\",\n        ]\n        try:\n            aws_ssm_run_commands(instance_ids, commands, verbose=verbose)\n        except:\n            pass\n\n        commands = [\n            f\"aws s3 cp {config.emr.working_dir} /home/hadoop/dataverse --recursive\",\n        ]\n        aws_ssm_run_commands(instance_ids, commands, verbose=verbose)\n\n    def _get_pip_package_path(self, config, verbose=False):\n        \"\"\"\n        get pip installed packages path\n        \"\"\"\n        nodes = AWSClient().emr.list_instances(\n            ClusterId=config.emr.id\n        )[\"Instances\"]\n        instance_ids = [node[\"Ec2InstanceId\"] for node in nodes]\n\n        commands = [\"pip3 show numpy\"]\n        result = aws_ssm_run_commands(\n            instance_ids,\n            commands,\n            verbose=verbose,\n            return_output=True,\n        )\n        location = re.findall(r'Location: (.*)\\n', result['pip3 show numpy'])[0]\n\n        return location\n\n    def _setup_aws(self, config, verbose=False):\n        \"\"\"\n        setup aws environment on EMR cluster\n        \"\"\"\n        nodes = AWSClient().emr.list_instances(\n            ClusterId=config.emr.id\n        )[\"Instances\"]\n        instance_ids = [node[\"Ec2InstanceId\"] for node in nodes]\n\n        commands = [\n            f\"aws configure set region {AWSClient().region}\",\n        ]\n        aws_ssm_run_commands(instance_ids, commands, verbose=verbose)\n\n    def _setup_dependencies(self, config, verbose=False):\n        nodes = AWSClient().emr.list_instances(\n            ClusterId=config.emr.id\n        )[\"Instances\"]\n        instance_ids = [node[\"Ec2InstanceId\"] for node in nodes]\n\n        commands = [\n            \"sudo yum install -y python3-devel\",\n            \"pip3 install wheel setuptools pip --upgrade\",\n        ]\n        aws_ssm_run_commands(instance_ids, commands, verbose=verbose)\n\n        # NOTE: unknown unlimited loop caused by `pip3 install -r requirements.txt`\n        #       so I split the following command separately\n        commands = [\n            \"pip3 install -r /home/hadoop/dataverse/requirements.txt\",\n        ]\n        aws_ssm_run_commands(instance_ids, commands, verbose=verbose)\n\n    def _setup_source_code(self, config, verbose=False):\n        \"\"\"\n        copy dataverse source code to pip installed packages path\n        \"\"\"\n        nodes = AWSClient().emr.list_instances(\n            ClusterId=config.emr.id\n        )[\"Instances\"]\n        instance_ids = [node[\"Ec2InstanceId\"] for node in nodes]\n\n        # unzip dataverse.tar.gz and copy to pip installed packages path\n        commands = [\n            \"tar -xzf /home/hadoop/dataverse/dataverse.tar.gz -C /home/hadoop/dataverse\",\n        ]\n        aws_ssm_run_commands(instance_ids, commands, verbose=verbose)\n\n        # get pip installed packages path\n        location = self._get_pip_package_path(config, verbose=verbose)\n\n        # copy dataverse source code to pip installed packages path\n        commands = [\n            f\"cp -r /home/hadoop/dataverse/dataverse {location}\",\n        ]\n        aws_ssm_run_commands(instance_ids, commands, verbose=verbose)\n\n    def wait(self, config, step_id, verbose=True):\n        \"\"\"\n        waiter for emr step\n        \"\"\"\n        while True:\n            response = AWSClient().emr.describe_step(\n                ClusterId=config.emr.id,\n                StepId=step_id,\n            )\n            state = response['Step']['Status']['State']\n            if state == 'Pending':\n                time.sleep(10)\n                if verbose:\n                    print(\"[ Dataverse ] step pending...\")\n                continue\n            if state in ['COMPLETED', 'FAILED', 'CANCELLED']:\n                if verbose:\n                    print(f\"[ Dataverse ] step status: {state}. Done.\")\n                break\n            if verbose:\n                if 'Message' in response['Step']['Status']['StateChangeReason']:\n                    print(response['Step']['Status']['StateChangeReason']['Message'])\n                time.sleep(10)\n\n    def terminate(self, config):\n        \"\"\"\n        terminate emr cluster\n\n        Args:\n            config (OmegaConf): config for the etl\n        \"\"\"\n        # only terminate auto generated emr cluster\n        if config.emr.auto_generated is False:\n            print('EMR cluster is not auto generated. Not terminating.')\n            return\n\n        if config.emr.id is None:\n            print('EMR cluster is not launched. Proceeding to clean resources.')\n        else:\n            AWSClient().emr.terminate_job_flows(JobFlowIds=[config.emr.id])\n\n            # wait until emr cluster is terminated\n            waiter = AWSClient().emr.get_waiter('cluster_terminated')\n            waiter.wait(ClusterId=config.emr.id)\n\n            # set state\n            state = aws_get_state()\n            if 'emr' in state and config.emr.id in state['emr']:\n                del state['emr'][config.emr.id]\n                aws_set_state(state)\n\n        # clean unused resources\n        self._clean()\n\n    def _clean(self):\n        \"\"\"\n        clean unused resources related to EMR\n        \"\"\"\n        self._clean_stopped_emr()\n        self._clean_unused_vpc()\n        self._clean_unused_iam_instance_profile()\n        self._clean_unused_iam_role()\n\n    def _clean_stopped_emr(self):\n        \"\"\"\n        check stopped EMR and update the state\n        \"\"\"\n        state = aws_get_state()\n\n        # get all emr ids\n        emr_ids = []\n        if 'emr' in state:\n            for emr_id in state['emr']:\n                emr_ids.append(emr_id)\n\n        # remove stopped emr from state\n        REMOVE_STATES = [\n            'TERMINATED',\n            'TERMINATED_WITH_ERRORS'\n        ]\n        for emr_id in emr_ids:\n            emr_info = AWSClient().emr.describe_cluster(ClusterId=emr_id)\n            if emr_info['Cluster']['Status']['State'] in REMOVE_STATES:\n                del state['emr'][emr_id]\n        aws_set_state(state)\n\n    def _clean_unused_vpc(self):\n        \"\"\"\n        check the AWS state and clean vpc that is not used by any emr cluster\n        \"\"\"\n        state = aws_get_state()\n\n        # get all vpc ids that are used by emr\n        used_vpc_ids = []\n        if 'emr' in state:\n            for emr_id in state['emr']:\n                used_vpc_ids.append(state['emr'][emr_id]['vpc_id'])\n\n        # get all vpc ids that are created\n        all_vpc_ids = []\n        if 'vpc' in state:\n            for vpc_id in state['vpc']:\n                all_vpc_ids.append(vpc_id)\n\n        # clean unused vpc\n        unused_vpc_ids = list(set(all_vpc_ids) - set(used_vpc_ids))\n\n        for vpc_id in unused_vpc_ids:\n            aws_vpc_delete(vpc_id)\n\n    def _clean_unused_iam_role(self):\n        \"\"\"\n        check the AWS state and clean iam role that is not used by any emr cluster\n        \"\"\"\n        state = aws_get_state()\n\n        # get all iam role names that are used by emr\n        used_iam_role_names = []\n        if 'emr' in state:\n            for emr_id in state['emr']:\n                if 'ec2' in state['emr'][emr_id]['role']:\n                    used_iam_role_names.append(state['emr'][emr_id]['role']['ec2'])\n                if 'emr' in state['emr'][emr_id]['role']:\n                    used_iam_role_names.append(state['emr'][emr_id]['role']['emr'])\n\n        # get all iam role names that are created\n        all_iam_role_names = []\n        if 'iam' in state and 'role' in state['iam']:\n            for role_name in state['iam']['role']:\n                all_iam_role_names.append(role_name)\n\n        # clean unused iam role\n        unused_iam_role_names = list(set(all_iam_role_names) - set(used_iam_role_names))\n\n        for role_name in unused_iam_role_names:\n            aws_iam_role_delete(role_name)\n\n    def _clean_unused_iam_instance_profile(self):\n        \"\"\"\n        check the AWS state and clean iam instance profile that is not used by any emr cluster\n        \"\"\"\n        state = aws_get_state()\n\n        # get all iam instance profile names that are used by emr\n        used_iam_instance_profile_names = []\n        if 'emr' in state:\n            for emr_id in state['emr']:\n                used_iam_instance_profile_names.append(state['emr'][emr_id]['instance_profile'])\n\n        # get all iam instance profile names that are created\n        all_iam_instance_profile_names = []\n        if 'iam' in state and 'instance_profile' in state['iam']:\n            for instance_profile_name in state['iam']['instance_profile']:\n                all_iam_instance_profile_names.append(instance_profile_name)\n\n        # clean unused iam instance profile\n        unused_iam_instance_profile_names = list(set(all_iam_instance_profile_names) - set(used_iam_instance_profile_names))\n\n        for instance_profile_name in unused_iam_instance_profile_names:\n            aws_iam_instance_profile_delete(instance_profile_name)\n\n    def terminate_by_id(self, emr_id):\n        \"\"\"\n        when you want to terminate emr cluster without config\n\n        ```python\n        from dataverse.utils.api import EMRManager\n        EMRManager().terminate_by_id('j-3C05XDxxxxxxx')\n        ```\n        \"\"\"\n        # to avoid circular import\n        from dataverse.config import Config\n\n        config = Config.default(emr=True)\n        config.emr.id = emr_id\n        self.terminate(config)\n\n\n# --------------------------------------------------------------------------------\n\ndef aws_iam_role_create(\n    role_name,\n    trust_policy,\n    policy_arns,\n    description='Role for Dataverse',\n    max_session_duration=3600,\n):\n\n    # create role\n    try:\n        AWSClient().iam.create_role(\n            RoleName=role_name,\n            Description=description,\n            AssumeRolePolicyDocument=json.dumps(trust_policy),\n            MaxSessionDuration=max_session_duration,\n        )\n\n        # attach policy\n        for policy_arn in policy_arns:\n            AWSClient().iam.attach_role_policy(\n                RoleName=role_name,\n                PolicyArn=policy_arn,\n            )\n\n        # set state\n        state = aws_get_state()\n        if 'iam' not in state:\n            state['iam'] = {}\n\n        if 'role' not in state['iam']:\n            state['iam']['role'] = {}\n\n        state['iam']['role'][role_name] = {\n            'policy_arns': policy_arns,\n        }\n        aws_set_state(state)\n    except AWSClient().iam.exceptions.EntityAlreadyExistsException:\n        print(f\"{role_name} already exists.\")\n    except Exception as e:\n        raise e\n\n    # wait until role is ready\n    waiter = AWSClient().iam.get_waiter('role_exists')\n    waiter.wait(RoleName=role_name)\n\ndef aws_iam_role_delete(role_name):\n    try:\n        # detach policy\n        response = AWSClient().iam.list_attached_role_policies(RoleName=role_name)\n        for policy in response['AttachedPolicies']:\n            AWSClient().iam.detach_role_policy(\n                RoleName=role_name,\n                PolicyArn=policy['PolicyArn'],\n            )\n\n        # delete role\n        AWSClient().iam.delete_role(RoleName=role_name)\n    except AWSClient().iam.exceptions.NoSuchEntityException:\n        print(f\"{role_name} does not exist.\")\n    except Exception as e:\n        raise e\n\n    # set state\n    state = aws_get_state()\n    if 'iam' in state and 'role' in state['iam']:\n        if role_name in state['iam']['role']:\n            del state['iam']['role'][role_name]\n            aws_set_state(state)\n\ndef aws_iam_instance_profile_create(instance_profile_name, role_name):\n    try:\n        AWSClient().iam.create_instance_profile(\n            InstanceProfileName=instance_profile_name\n        )\n        AWSClient().iam.add_role_to_instance_profile(\n            InstanceProfileName=instance_profile_name,\n            RoleName=role_name\n        )\n\n        # set state\n        state = aws_get_state()\n        if 'iam' not in state:\n            state['iam'] = {}\n\n        if 'instance_profile' not in state['iam']:\n            state['iam']['instance_profile'] = {}\n\n        state['iam']['instance_profile'][instance_profile_name] = {\n            'role_name': role_name,\n        }\n        aws_set_state(state)\n    except AWSClient().iam.exceptions.EntityAlreadyExistsException:\n        print(f\"{instance_profile_name} already exists.\")\n    except Exception as e:\n        raise e\n\n    # wait until instance profile is ready\n    waiter = AWSClient().iam.get_waiter('instance_profile_exists')\n    waiter.wait(InstanceProfileName=instance_profile_name)\n\n    # FIXME: wait until instance profile is available\n    ...\n\ndef aws_iam_instance_profile_delete(instance_profile_name):\n    # remove role from instance profile\n    response = AWSClient().iam.get_instance_profile(InstanceProfileName=instance_profile_name)\n    role_name = response['InstanceProfile']['Roles'][0]['RoleName']\n    AWSClient().iam.remove_role_from_instance_profile(\n        InstanceProfileName=instance_profile_name,\n        RoleName=role_name,\n    )\n\n    # delete instance profile\n    AWSClient().iam.delete_instance_profile(InstanceProfileName=instance_profile_name)\n\n    # set state\n    state = aws_get_state()\n    if 'iam' in state and 'instance_profile' in state['iam']:\n        if instance_profile_name in state['iam']['instance_profile']:\n            del state['iam']['instance_profile'][instance_profile_name]\n            aws_set_state(state)\n\ndef aws_iam_remove_all_instance_profile():\n    \"\"\"\n    WARNING: this will remove all instance profile that has Dataverse in it\n             which means it might remove other instance profile that not from you\n    \"\"\"\n    # get all instance profile\n    instance_profiles = AWSClient().iam.list_instance_profiles()[\"InstanceProfiles\"]\n    # remove all the instance_profile that has Dataverse in it\n    for profile in instance_profiles:\n        if \"Dataverse\" in profile[\"InstanceProfileName\"]:\n            aws_iam_instance_profile_delete(profile[\"InstanceProfileName\"])\n\n\ndef aws_vpc_create(cidr_block=None, tag_name='Dataverse-Temporary-VPC'):\n\n    # load all vpcs ids to check if the cidr block is occupied\n    vpcs = AWSClient().ec2.describe_vpcs()\n    second_octets = []\n    for vpc in vpcs['Vpcs']:\n        second_octet = int(vpc['CidrBlock'].split('.')[1])\n        second_octets.append(second_octet)\n\n    # auto generate cidr block if not provided\n    if cidr_block is None:\n        is_network_available = False\n        for octet in range(0, 255):\n            if octet not in second_octets:\n                is_network_available = True\n                break\n\n        if is_network_available:\n            cidr_block = '10.' + str(octet) + '.0.0/16'\n        else:\n            raise Exception('Unable to find an available CIDR block for VPC.')\n\n    # user provided cidr block\n    elif cidr_block.split('.')[1] in second_octets:\n        raise Exception('The CIDR block is already occupied.')\n\n    # create vpc\n    vpc = AWSClient().ec2.create_vpc(CidrBlock=cidr_block)\n    vpc_id = vpc['Vpc']['VpcId']\n    AWSClient().ec2.create_tags(\n        Resources=[vpc_id],\n        Tags=[\n            {'Key': 'Name', 'Value': tag_name},\n        ]\n    )\n\n    # update state\n    state = aws_get_state()\n    if 'vpc' not in state:\n        state['vpc'] = {}\n\n    state['vpc'][vpc_id] = {'public_subnet': False}\n    aws_set_state(state)\n\n    # wait until vpc is ready\n    waiter = AWSClient().ec2.get_waiter('vpc_available')\n    waiter.wait(VpcIds=[vpc_id])\n\n    return vpc_id\n\ndef aws_vpc_delete(vpc_id):\n    if isinstance(vpc_id, str):\n        vpc_ids = [vpc_id]\n    elif isinstance(vpc_id, list):\n        vpc_ids = vpc_id\n\n    for vpc_id in vpc_ids:\n        state = aws_get_state()\n\n        # [ DEPENDENCY ] remove all dependencies\n        # ------------------------------------------------------------\n        # dataverse managed dependency\n        if state['vpc'][vpc_id]:\n            if 'nat_gateway' in state['vpc'][vpc_id]:\n                aws_nat_gateway_delete(vpc_id, state['vpc'][vpc_id]['nat_gateway'])\n            if 'elastic_ip' in state['vpc'][vpc_id]:\n                aws_elastic_ip_release(vpc_id, state['vpc'][vpc_id]['elastic_ip'])\n            if 'subnet' in state['vpc'][vpc_id]:\n                # NOTE: set retry because terminated EMR cluster iterrupts subnet deletion\n                #       by dependency problem for few seconds\n                # HACK: this is a hacky solution and should be fixed in the future\n                RETRY_SUBNET_DELETION = 5\n                for _ in range(RETRY_SUBNET_DELETION):\n                    try:\n                        aws_subnet_delete(vpc_id, state['vpc'][vpc_id]['subnet'])\n                        break\n                    except AWSClient().ec2.exceptions.ClientError as e:\n                        if e.response['Error']['Code'] == 'DependencyViolation':\n                            time.sleep(5)\n                            continue\n                        else:\n                            raise e\n                    except Exception as e:\n                        raise e\n            if 'security_group' in state['vpc'][vpc_id]:\n                aws_security_group_delete(vpc_id, state['vpc'][vpc_id]['security_group'])\n            if 'gateway' in state['vpc'][vpc_id]:\n                aws_gateway_delete(vpc_id, state['vpc'][vpc_id]['gateway'])\n            if 'route_table' in state['vpc'][vpc_id]:\n                aws_route_table_delete(vpc_id, state['vpc'][vpc_id]['route_table'])\n\n        # EMR managed dependency\n        vpc = boto3.resource('ec2').Vpc(vpc_id)\n\n        # NOTE: remove dependency between security groups\n        for security_group in vpc.security_groups.all():\n            aws_security_group_remove_dependency(security_group.id)\n\n        for security_group in vpc.security_groups.all():\n            if security_group.group_name == \"default\":\n                continue\n            aws_security_group_delete(vpc_id, security_group.id)\n        # ------------------------------------------------------------\n\n        try:\n            AWSClient().ec2.delete_vpc(VpcId=vpc_id)\n        # when vpc doesn't exist\n        except AWSClient().ec2.exceptions.ClientError as e:\n            if e.response['Error']['Code'] == 'InvalidVpcID.NotFound':\n                print(f\"VPC {vpc_id} doesn't exist.\")\n        # re-thrown other exceptions\n        except Exception as e:\n            raise e\n\n        if 'vpc' in state and vpc_id in state['vpc']:\n            del state['vpc'][vpc_id]\n        aws_set_state(state)\n\ndef aws_subnet_create(vpc_id, cird_block=None, tag_name='Dataverse-Temporary-Subnet'):\n    if cird_block is None:\n        # Get VPC information to determine CIDR block\n        vpcs = AWSClient().ec2.describe_vpcs(VpcIds=[vpc_id])\n        cird_block = vpcs['Vpcs'][0]['CidrBlock']\n\n    # create subnet\n    subnet = AWSClient().ec2.create_subnet(CidrBlock=str(cird_block), VpcId=vpc_id)\n    subnet_id = subnet['Subnet']['SubnetId']\n    AWSClient().ec2.create_tags(\n        Resources=[subnet_id],\n        Tags=[\n            {'Key': 'Name', 'Value': tag_name},\n        ]\n    )\n\n    # update state\n    state = aws_get_state()\n    if 'subnet' not in state['vpc'][vpc_id]:\n        state['vpc'][vpc_id]['subnet'] = []\n\n    state['vpc'][vpc_id]['subnet'].append(subnet_id)\n    aws_set_state(state)\n\n    # wait until subnet is ready\n    waiter = AWSClient().ec2.get_waiter('subnet_available')\n    waiter.wait(SubnetIds=[subnet_id])\n\n    return subnet_id\n\ndef aws_subnet_delete(vpc_id, subnet_id):\n    if isinstance(subnet_id, str):\n        subnet_ids = [subnet_id]\n    elif isinstance(subnet_id, list):\n        subnet_ids = subnet_id\n\n    for subnet_id in subnet_ids:\n        AWSClient().ec2.delete_subnet(SubnetId=subnet_id)\n        state = aws_get_state()\n\n        if 'vpc' in state and vpc_id in state['vpc']:\n            if 'subnet' in state['vpc'][vpc_id] and subnet_id in state['vpc'][vpc_id]['subnet']:\n                state['vpc'][vpc_id]['subnet'].remove(subnet_id)\n                aws_set_state(state)\n\ndef aws_subnet_az(subnet_id):\n    \"\"\"\n    when subnet id is give find the AZ\n    \"\"\"\n    response = AWSClient().ec2.describe_subnets(SubnetIds=[subnet_id])\n    az = response['Subnets'][0]['AvailabilityZone']\n\n    return az\n\ndef aws_emr_security_group_create(\n        vpc_id,\n        port=4040,\n        group_name='DataverseEMRSecurityGroup',\n        description='Dataverse EMR security group',\n        tag_name='Dataverse-Temporary-EMR-Security-Group'\n    ):\n    \"\"\"\n    Create a security group for EMR.\n    # TODO: Create a new function for general purpose.\n    ...\n\n    args:\n        vpc_id (str): The VPC ID.\n        port (int): The port to open for pyspark UI\n        group_name (str): The name of the security group.\n        description (str): The description of the security group.\n    \"\"\"\n    security_group = AWSClient().ec2.create_security_group(\n        GroupName=group_name,\n        Description=description,\n        VpcId=vpc_id,\n    )\n    security_group_id = security_group['GroupId']\n    AWSClient().ec2.authorize_security_group_ingress(\n        GroupId=security_group_id,\n        IpPermissions=[\n            {\n                'IpProtocol': 'tcp',\n                'FromPort': port,\n                'ToPort': port,\n                'IpRanges': [{'CidrIp': '0.0.0.0/0'}]\n            },\n        ])\n    AWSClient().ec2.create_tags(\n        Resources=[security_group_id],\n        Tags=[\n            {'Key': 'Name', 'Value': tag_name},\n        ]\n    )\n\n    # set state\n    state = aws_get_state()\n    if 'security_group' not in state['vpc'][vpc_id]:\n        state['vpc'][vpc_id]['security_group'] = []\n\n    state['vpc'][vpc_id]['security_group'].append(security_group_id)\n    aws_set_state(state)\n\n    return security_group_id\n\ndef aws_security_group_delete(vpc_id, security_group_id):\n    if isinstance(security_group_id, str):\n        security_group_ids = [security_group_id]\n    elif isinstance(security_group_id, list):\n        security_group_ids = security_group_id\n\n    for security_group_id in security_group_ids:\n        AWSClient().ec2.delete_security_group(GroupId=security_group_id)\n        state = aws_get_state()\n\n        if 'vpc' in state and vpc_id in state['vpc']:\n            if 'security_group' in state['vpc'][vpc_id] and security_group_id in state['vpc'][vpc_id]['security_group']:\n                state['vpc'][vpc_id]['security_group'].remove(security_group_id)\n                aws_set_state(state)\n\ndef aws_security_group_remove_dependency(security_group_id):\n    \"\"\"\n    \"\"\"\n    response = AWSClient().ec2.describe_security_groups(\n        GroupIds=[security_group_id]\n    )\n\n    # Removing inbound rules\n    inbound_rules = response['SecurityGroups'][0]['IpPermissions']\n    if inbound_rules:\n        AWSClient().ec2.revoke_security_group_ingress(\n            GroupId=security_group_id,\n            IpPermissions=inbound_rules\n        )\n\n    # Removing outbound rules\n    outbound_rules = response['SecurityGroups'][0]['IpPermissionsEgress']\n    if outbound_rules:\n        AWSClient().ec2.revoke_security_group_egress(\n            GroupId=security_group_id,\n            IpPermissions=outbound_rules\n        )\n\ndef aws_gateway_create(vpc_id, tag_name='Dataverse-Gateway'):\n    \"\"\"\n    Create a gateway for public subnet.\n    \"\"\"\n    gateway = AWSClient().ec2.create_internet_gateway()\n    gateway_id = gateway['InternetGateway']['InternetGatewayId']\n\n    # attach gateway to vpc\n    AWSClient().ec2.attach_internet_gateway(\n        InternetGatewayId=gateway_id,\n        VpcId=vpc_id\n    )\n    AWSClient().ec2.create_tags(\n        Resources=[gateway_id],\n        Tags=[\n            {'Key': 'Name', 'Value': tag_name},\n        ]\n    )\n\n    # set state\n    state = aws_get_state()\n    if 'gateway' not in state['vpc'][vpc_id]:\n        state['vpc'][vpc_id]['gateway'] = []\n     \n    state['vpc'][vpc_id]['gateway'].append(gateway_id)\n    aws_set_state(state)\n\n    # wait until gateway is ready\n    waiter = AWSClient().ec2.get_waiter('internet_gateway_exists')\n    waiter.wait(InternetGatewayIds=[gateway_id])\n\n    return gateway_id\n\ndef aws_gateway_delete(vpc_id, gateway_id):\n    if isinstance(gateway_id, str):\n        gateway_ids = [gateway_id]\n    elif isinstance(gateway_id, list):\n        gateway_ids = gateway_id\n\n    for gateway_id in gateway_ids:\n        # detach gateway from vpc\n        AWSClient().ec2.detach_internet_gateway(\n            InternetGatewayId=gateway_id,\n            VpcId=vpc_id\n        )\n        AWSClient().ec2.delete_internet_gateway(InternetGatewayId=gateway_id)\n        state = aws_get_state()\n        if 'vpc' in state and vpc_id in state['vpc']:\n            if 'gateway' in state['vpc'][vpc_id] and gateway_id in state['vpc'][vpc_id]['gateway']:\n                state['vpc'][vpc_id]['gateway'].remove(gateway_id)\n                aws_set_state(state)\n\ndef aws_elastic_ip_allocate(vpc_id, tag_name='Dataverse-Elastic-IP'):\n    \"\"\"\n    Allocate an elastic ip.\n    \"\"\"\n    elastic_ip = AWSClient().ec2.allocate_address(Domain='vpc')\n    elastic_ip_id = elastic_ip['AllocationId']\n    AWSClient().ec2.create_tags(\n        Resources=[elastic_ip_id],\n        Tags=[\n            {'Key': 'Name', 'Value': tag_name},\n        ]\n    )\n\n    # set state\n    state = aws_get_state()\n    if 'vpc' not in state:\n        state['vpc'] = {}\n    if vpc_id not in state['vpc']:\n        state['vpc'][vpc_id] = {}\n    if 'elastic_ip' not in state['vpc'][vpc_id]:\n        state['vpc'][vpc_id]['elastic_ip'] = []\n\n    state['vpc'][vpc_id]['elastic_ip'].append(elastic_ip_id)\n    aws_set_state(state)\n\n    # TODO: wait until elastic ip is ready\n    ...\n\n    return elastic_ip_id\n\ndef aws_elastic_ip_release(vpc_id, elastic_ip_id):\n    if isinstance(elastic_ip_id, str):\n        elastic_ip_ids = [elastic_ip_id]\n    elif isinstance(elastic_ip_id, list):\n        elastic_ip_ids = elastic_ip_id\n\n    for elastic_ip_id in elastic_ip_ids:\n        try:\n            AWSClient().ec2.release_address(AllocationId=elastic_ip_id)\n            state = aws_get_state()\n            if 'vpc' in state and vpc_id in state['vpc']:\n                if 'elastic_ip' in state['vpc'][vpc_id] and elastic_ip_id in state['vpc'][vpc_id]['elastic_ip']:\n                    state['vpc'][vpc_id]['elastic_ip'].remove(elastic_ip_id)\n                    aws_set_state(state)\n        except AWSClient().ec2.exceptions.ClientError as e:\n            if e.response['Error']['Code'] == 'InvalidAllocationID.NotFound':\n                print(f\"Elastic IP id {elastic_ip_id} doesn't exist.\")\n            else:\n                raise e\n        except Exception as e:\n            raise e\n\ndef aws_nat_gateway_create(\n    vpc_id,\n    subnet_id,\n    elastic_ip_id,\n    tag_name='Dataverse-NAT-Gateway'\n):\n    \"\"\"\n    Create a NAT gateway for private subnet.\n    \"\"\"\n    # create NAT gateway\n    nat_gateway = AWSClient().ec2.create_nat_gateway(\n        AllocationId=elastic_ip_id,\n        SubnetId=subnet_id,\n    )\n    nat_gateway_id = nat_gateway['NatGateway']['NatGatewayId']\n\n    # set tag\n    AWSClient().ec2.create_tags(\n        Resources=[nat_gateway_id],\n        Tags=[\n            {'Key': 'Name', 'Value': tag_name},\n        ]\n    )\n\n    # set state\n    state = aws_get_state()\n    if 'vpc' not in state:\n        state['vpc'] = {}\n    if vpc_id not in state['vpc']:\n        state['vpc'][vpc_id] = {}\n    if 'nat_gateway' not in state['vpc'][vpc_id]:\n        state['vpc'][vpc_id]['nat_gateway'] = []\n\n    state['vpc'][vpc_id]['nat_gateway'].append(nat_gateway_id)\n    aws_set_state(state)\n\n    # wait until NAT gateway is ready\n    waiter = AWSClient().ec2.get_waiter('nat_gateway_available')\n    waiter.wait(NatGatewayIds=[nat_gateway_id])\n\n    return nat_gateway_id\n\ndef aws_nat_gateway_delete(vpc_id, nat_gateway_id):\n    if isinstance(nat_gateway_id, str):\n        nat_gateway_ids = [nat_gateway_id]\n    elif isinstance(nat_gateway_id, list):\n        nat_gateway_ids = nat_gateway_id\n\n    for nat_gateway_id in nat_gateway_ids:\n        # delete NAT gateway\n        AWSClient().ec2.delete_nat_gateway(NatGatewayId=nat_gateway_id)\n\n        # set state\n        state = aws_get_state()\n        if 'vpc' in state and vpc_id in state['vpc']:\n            if 'nat_gateway' in state['vpc'][vpc_id] and nat_gateway_id in state['vpc'][vpc_id]['nat_gateway']:\n                state['vpc'][vpc_id]['nat_gateway'].remove(nat_gateway_id)\n                aws_set_state(state)\n\n        # wait until NAT gateway is deleted\n        waiter = AWSClient().ec2.get_waiter('nat_gateway_deleted')\n        waiter.wait(NatGatewayIds=[nat_gateway_id])\n\ndef aws_route_table_create(\n    vpc_id,\n    gateway_id=None,\n    nat_gateway_id=None,\n    tag_name='Dataverse-Route-Table',\n    destination_cidr_block='0.0.0.0/0',\n):\n    \"\"\"\n    Create a route table for subnet.\n    \"\"\"\n    route_table = AWSClient().ec2.create_route_table(VpcId=vpc_id)\n    route_table_id = route_table['RouteTable']['RouteTableId']\n    args = {\n        'DestinationCidrBlock': destination_cidr_block,\n        'RouteTableId': route_table_id,\n    }\n    if gateway_id is not None:\n        args['GatewayId'] = gateway_id\n    if nat_gateway_id is not None:\n        args['NatGatewayId'] = nat_gateway_id\n\n    AWSClient().ec2.create_route(**args)\n    AWSClient().ec2.create_tags(\n        Resources=[route_table_id],\n        Tags=[\n            {'Key': 'Name', 'Value': tag_name},\n        ]\n    )\n\n    # set state\n    state = aws_get_state()\n    if 'route_table' not in state['vpc'][vpc_id]:\n        state['vpc'][vpc_id]['route_table'] = []\n\n    state['vpc'][vpc_id]['route_table'].append(route_table_id)\n    aws_set_state(state)\n\n    # TODO: wait until route table is ready\n    #       didn't found waiter for route table\n    ...\n\n    return route_table_id\n\ndef aws_route_table_delete(vpc_id, route_table_id):\n    if isinstance(route_table_id, str):\n        route_table_ids = [route_table_id]\n    elif isinstance(route_table_id, list):\n        route_table_ids = route_table_id\n\n    for route_table_id in route_table_ids:\n        AWSClient().ec2.delete_route_table(RouteTableId=route_table_id)\n        state = aws_get_state()\n        if 'vpc' in state and vpc_id in state['vpc']:\n            if 'route_table' in state['vpc'][vpc_id] and route_table_id in state['vpc'][vpc_id]['route_table']:\n                state['vpc'][vpc_id]['route_table'].remove(route_table_id)\n                aws_set_state(state)\n\ndef aws_route_table_asscociate_subnet(subnet_id, route_table_id):\n    route_table = boto3.resource('ec2').RouteTable(route_table_id)\n    route_table.associate_with_subnet(SubnetId=subnet_id)\n\ndef aws_s3_path_parse(path):\n    \"\"\"\n    parse aws s3 path to bucket and key\n    \"\"\"\n    aws_s3_matched = re.match(r's3[a,n]?://([^/]+)/(.*)', path)\n    if aws_s3_matched:\n        bucket = aws_s3_matched.group(1)\n        path = aws_s3_matched.group(2)\n    else:\n        raise Exception(f\"Invalid S3 path: {path}\")\n\n    return bucket, path\n\ndef aws_s3_create_bucket(bucket):\n    \"\"\"\n    create aws s3 bucket\n\n    Args:\n        bucket (str): bucket name (must be unique)\n        location (str): aws region name\n    \"\"\"\n    AWSClient().s3.create_bucket(\n        Bucket=bucket,\n        CreateBucketConfiguration={'LocationConstraint': AWSClient().region}\n    )\n\ndef aws_s3_delete_bucket(bucket):\n    \"\"\"\n    delete aws s3 bucket\n\n    Args:\n        bucket (str): bucket name\n    \"\"\"\n    AWSClient().s3.delete_bucket(Bucket=bucket)\n\ndef aws_s3_read(bucket, key):\n    \"\"\"\n    Args:\n        bucket (str): bucket name\n        key (str): key (aws s3 file path)\n\n    Usage:\n        aws_s3_read('tmp', 'this/is/path.json')\n    \"\"\"\n    obj = AWSClient().s3.get_object(Bucket=bucket, Key=key)\n    text = obj['Body'].read().decode('utf-8')\n\n    return text\n\ndef aws_s3_download(bucket, key, local_path):\n    \"\"\"\n    Args:\n        bucket (str): bucket name\n        key (str): key (aws s3 file path)\n        local_path (str): local path to save file\n\n    Usage:\n        aws_s3_download('tmp', 'this/is/path.json', 'path.json')\n    \"\"\"\n    obj_type = aws_s3_get_object_type(bucket, key)\n    if obj_type == 'folder':\n        paginator = AWSClient().s3.get_paginator('list_objects')\n        page_iterator = paginator.paginate(Bucket=bucket)\n        for page in page_iterator:\n            for obj in page['Contents']:\n                bucket_key = obj['Key']\n\n                if not bucket_key.startswith(key):\n                    continue\n\n                rel_bucket_path = bucket_key.replace(key, '')\n                if rel_bucket_path.startswith('/'):\n                    rel_bucket_path = rel_bucket_path[1:]\n\n                local_file_path = os.path.join(local_path, rel_bucket_path)\n                os.makedirs(os.path.dirname(local_file_path), exist_ok=True)\n                AWSClient().s3.download_file(bucket, bucket_key, local_file_path)\n    elif obj_type == 'file':\n        AWSClient().s3.download_file(bucket, key, local_path)\n    elif obj_type == 'no_obj':\n        raise Exception(f\"Object doesn't exist: {key}\")\n\ndef aws_s3_upload(bucket, key, local_path):\n    \"\"\"\n    Args:\n        bucket (str): bucket name\n        key (str): key (aws s3 file path)\n        local_path (str): local path to save file\n\n    Usage:\n        aws_s3_upload('tmp', 'this/is/path.json', 'path.json')\n    \"\"\"\n    if os.path.isdir(local_path):\n        files = glob.glob(os.path.join(local_path, \"*/*\"))\n        for file in files:\n            rel_path = os.path.relpath(file, local_path)\n            _key = os.path.join(key, rel_path)\n\n            # NOTE: no need to upload folder\n            if os.path.isdir(file):\n                continue\n\n            AWSClient().s3.upload_file(file, bucket, _key)\n    else:\n        AWSClient().s3.upload_file(local_path, bucket, key)\n\ndef aws_s3_write(bucket, key, obj):\n    \"\"\"\n    Args:\n        bucket (str): bucket name\n        key (str): key (aws s3 file path)\n        obj (str): object to write\n\n    Usage:\n        aws_s3_write('tmp', 'this/is/path.json', '{\"hello\": \"world\"}')\n    \"\"\"\n    AWSClient().s3.put_object(Bucket=bucket, Key=key, Body=obj)\n\ndef aws_s3_delete(bucket, key):\n    \"\"\"\n    Args:\n        bucket (str): bucket name\n        key (str): key (aws s3 file path)\n\n    Usage:\n        aws_s3_delete('tmp', 'this/is/path.json')\n    \"\"\"\n    obj_type = aws_s3_get_object_type(bucket, key)\n\n    if obj_type == 'folder':\n        paginator = AWSClient().s3.get_paginator('list_objects')\n        page_iterator = paginator.paginate(Bucket=bucket)\n        for page in page_iterator:\n            for obj in page['Contents']:\n                bucket_key = obj['Key']\n\n                if not bucket_key.startswith(key):\n                    continue\n\n                AWSClient().s3.delete_object(Bucket=bucket, Key=bucket_key)\n    elif obj_type == 'file':\n        AWSClient().s3.delete_object(Bucket=bucket, Key=key)\n    elif obj_type == 'no_obj':\n        raise Exception(f\"Object doesn't exist: {key}\")\n\ndef aws_s3_list_buckets():\n    \"\"\"\n    get all buckets from aws s3\n    \"\"\"\n    buckets = AWSClient().s3.list_buckets()['Buckets']\n    bucket_names = []\n    for bucket in buckets:\n        bucket_names.append(bucket['Name'])\n\n    return bucket_names\n\ndef aws_s3_ls(query=None):\n    \"\"\"\n    ls command for aws s3\n    this is made to be similar to linux ls command\n    and unified to only single args usage to make it simple\n\n    Args:\n        query (str): file search query\n    Returns:\n        list: list of files/folders\n            - list ends with '/' if it is a folder\n\n    Usage:\n\n    ```python\n    - bucket/\n        - subfolder1/\n            - duck_folder1/\n            - duck_folder2/\n            - duck_file.txt\n        - subfolder2/\n        - subfile1.json\n    ```\n    >>> aws_list()\n    - bucket/\n\n    >>> aws_list(bucket)\n    - subfolder1/\n    - subfolder2/\n    - subfile1.json\n\n    >>> aws_list(bucket/subfolder1\")\n    - ducky_folder1/\n    - ducky_folder2/\n    - ducky_file.txt\n    \"\"\"\n    if query is None or query == \"\":\n        return aws_s3_list_buckets()\n    elif len(query.split(\"/\")) > 1:\n        bucket, prefix = query.split(\"/\", 1)\n    else:\n        bucket = query\n        prefix = \"\"\n\n    if prefix and not prefix.endswith(\"/\"):\n        prefix += \"/\"\n\n    results = AWSClient().s3.list_objects_v2(\n        Bucket=bucket,\n        Prefix=prefix,\n        Delimiter=\"/\",\n    )\n    objects = []\n\n    # TODO: no limit to 1,000 objects - use pagination\n    ...\n\n    # files\n    if \"Contents\" in results:\n        objects.extend(list(obj[\"Key\"] for obj in results[\"Contents\"]))\n\n    # subfolders\n    if \"CommonPrefixes\" in results:\n        objects.extend(list(obj[\"Prefix\"] for obj in results[\"CommonPrefixes\"]))\n\n    # set default\n    remove_prefix = True\n    if remove_prefix:\n        # remove the prefix itself\n        objects = list(obj.replace(prefix, \"\") for obj in objects)\n\n        # remove ''\n        objects = list(obj for obj in objects if obj)\n    else:\n        for obj in objects:\n            if obj == prefix:\n                objects.remove(obj)\n\n    return objects\n\ndef aws_s3_get_object_type(bucket, key):\n    \"\"\"\n    get object type from s3\n\n    NOTE:\n        S3 don't have a concept of folder\n        so this is a hardcoded solution to check key is file/folder or doesn't exist\n\n    TODO:\n        if there is edge case that this function doesn't cover\n        please add it to the test case\n    \"\"\"\n    results = AWSClient().s3.list_objects_v2(\n        Bucket=bucket,\n        Prefix=key,\n        Delimiter=\"/\",\n    )\n    if 'CommonPrefixes' in results:\n        prefix_folders = results['CommonPrefixes'][0]['Prefix'].split('/')\n        key_folders = key.split('/')\n\n        # remove ''\n        prefix_folders = [x for x in prefix_folders if x != '']\n        key_folders = [x for x in key_folders if x != '']\n\n        # check key exacly match prefix\n        for key_folder in key_folders:\n            if key_folder not in prefix_folders:\n                return 'no_obj'\n        return 'folder'\n    elif 'Contents' in results:\n        content = results['Contents'][0]['Key']\n        if content == key:\n            if content.endswith('/'):\n                return 'folder'\n            else:\n                return 'file'\n        else:\n            return 'folder'\n    else:\n        return 'no_obj'"
  },
  {
    "path": "dataverse/utils/format/README.md",
    "content": "# Format\n> ETL is backed by spark and `format` is a helpers to reformat data. It could be **collection of converters** converts from any data format into spark readable format or any utils that helps with data format.\n\n## Support Format\n- ufl\n- huggingface"
  },
  {
    "path": "dataverse/utils/format/__init__.py",
    "content": "\nfrom .huggingface import huggingface2parquet\nfrom .huggingface import load_huggingface_dataset\nfrom .ufl import get_uuidv1\nfrom .ufl import get_uuidv4"
  },
  {
    "path": "dataverse/utils/format/huggingface.py",
    "content": "import os\nimport datasets\nfrom pathlib import Path\nfrom omegaconf import ListConfig\n\nfrom dataverse.utils.setting import SystemSetting\n\n\ndef load_huggingface_dataset(name_or_path, split=None, from_disk=False):\n    \"\"\"\n    load huggingface dataset\n\n    Args:\n        name_or_path (str or list): the name or path of the huggingface dataset\n        split (str): the split of the dataset\n    \"\"\"\n    if from_disk:\n        if split is not None:\n            raise ValueError(\"split is not supported when from_disk is True\")\n\n        # load huggingface dataset from disk\n        if isinstance(name_or_path, str):\n            dataset = datasets.load_from_disk(name_or_path)\n        elif isinstance(name_or_path, list):\n            dataset = datasets.load_from_disk(*name_or_path)\n        elif isinstance(name_or_path, ListConfig):\n            dataset_list = [datasets.load_from_disk(nop) for nop in name_or_path]\n            dataset = datasets.concatenate_datasets(dataset_list)\n        else:\n            raise ValueError(f\"Unsupported type of name_or_path: {type(name_or_path)}\")\n    else:\n        # load huggingface dataset\n        if isinstance(name_or_path, str):\n            dataset = datasets.load_dataset(name_or_path, split=split)\n        elif isinstance(name_or_path, list):\n            dataset = datasets.load_dataset(*name_or_path, split=split)\n        elif isinstance(name_or_path, ListConfig):\n            dataset = datasets.load_dataset(*name_or_path, split=split)\n        else:\n            raise ValueError(f\"Unsupported type of name_or_path: {type(name_or_path)}\")\n\n    return dataset\n\n\ndef huggingface2parquet(\n    dataset: datasets.Dataset, cache_dir: str = None, verbose: bool = True, **kwargs\n):\n    \"\"\"\n    Convert a huggingface dataset to parquet format and save it to the path.\n\n    Args:\n        dataset (datasets.Dataset): a huggingface dataset\n        cache_dir (str): cache path to save the dataset\n        verbose (bool): whether to print the information of the dataset\n    \"\"\"\n    # check the dataset which has train, test, validation or other splits\n    # concatenate all the splits into one\n    dataset_list = []\n\n    # check the dataset has splits\n    try:\n        for split in dataset.keys():\n            dataset_list.append(dataset[split])\n    except:\n        dataset_list.append(dataset)\n\n    dataset = datasets.concatenate_datasets(dataset_list)\n\n    # save the dataset to parquet\n    # FIXME: this is a temporary solution to store the dataset in the package root path\n    #        we will change it to a better solution in the future\n    if cache_dir is None:\n        # save the parquet at package root path\n        cache_dir = SystemSetting().CACHE_DIR\n\n    dataset_path = f\"{cache_dir}/.cache/dataverse/dataset/huggingface_{dataset._fingerprint}.parquet\"\n\n    # check the dataset exist\n    if os.path.exists(dataset_path):\n        if verbose:\n            print(f\"Dataset already exists at {dataset_path}\")\n        return dataset_path\n\n    os.makedirs(f\"{cache_dir}/.cache/dataverse/dataset\", exist_ok=True)\n    dataset.to_parquet(dataset_path)\n\n    return dataset_path\n\n\nif __name__ == \"__main__\":\n    # test the function\n    dataset = load_huggingface_dataset([\"glue\", \"mrpc\"])\n    dataset_path = huggingface2parquet(dataset, verbose=True)\n\n    print(f\"Dataset saved at {dataset_path}\")\n"
  },
  {
    "path": "dataverse/utils/format/ufl.py",
    "content": "\n\"\"\"\nUFL (Upstage Format for LLM)\n\"\"\"\n\nimport uuid\n\ndef get_uuidv1():\n    return uuid.uuid1().hex\n\ndef get_uuidv4():\n    return uuid.uuid4().hex"
  },
  {
    "path": "dataverse/utils/setting/README.md",
    "content": "\n# Setting\n> Setting includes Environment Variables, User Secrets\n\n## System Settings\n> The heart of the system. It contains the configuration of the system.\n\n### naming convention\n1. Only CAPITALIZED format\n    - e.g. `CACHE_DIR` (O)\n    - e.g. `cache_dir` (X)\n2. Only alphanumeric and underscore\n    - e.g. `CACHE_DIR2` (O)\n    - e.g. `cache-dir` (X)\n    - e.g. `CACHE_@DIR` (X)\n3. Only one underscore between words\n    - e.g. `CACHE__DIR` (X)\n4. No underscore at the start/end of the key\n    - e.g. `_CACHE_DIR` (X)\n    - e.g. `CACHE_DIR_` (X)\n\n### System Setting Policy\n- Only memory (not stored in the file)\n- Only updated by `Environment Variables`\n- Default Setting Manually updated\n    - check `system.py.SystemSetting.default_setting()`\n- No update after the system is initialized\n    - If you want to change the setting, you must restart the system.\n\n### How to modify?\n- Only by Setting Environment Variables\n\n```bash\n# dynamic\nCACHE_DIR=/path/to/cache/dir python3 main.py\n\n# static\nexport CACHE_DIR=/path/to/cache/dir\npython3 main.py\n```\n\n### How to use `SystemSetting`\n> **This MUST be used internally by the system**. But just in case, you can use it in 3 ways.\n\n```python\nfrom dataverse.utils.setting import SystemSetting\n\n# get the setting\ncache_dir = SystemSetting().get('CACHE_DIR')\ncache_dir = SystemSetting()['CACHE_DIR']\ncache_dir = SystemSetting().CACHE_DIR\n\n# set the setting\nSystemSetting().set('CACHE_DIR', '/path/to/cache/dir')\nSystemSetting()['CACHE_DIR'] = '/path/to/cache/dir'\nSystemSetting().CACHE_DIR = '/path/to/cache/dir'\n```\n\n\n## User Settings\n> API keys, passwords, or other sensitive information of user.\n\n### naming convention\n1. Only CAPITALIZED format\n    - e.g. `GITHUB_API` (O)\n    - e.g. `github_api` (X)\n2. Only alphanumeric and underscore\n    - e.g. `GITHUB_API2` (O)\n    - e.g. `github-api` (X)\n    - e.g. `GITHUB_@API` (X)\n3. Only one underscore between words\n    - e.g. `GITHUB__API` (X)\n4. No underscore at the start/end of the key\n    - e.g. `_GITHUB_API` (X)\n    - e.g. `GITHUB_API_` (X)\n\n### Where does it store?\n> Setting will be stored in `CACHE_DIR` set in `SystemSetting` with the name of `user_setting.json`.\n\n```python\nfrom dataverse.utils.setting import SystemSetting\n\n{SystemSetting().CACHE_DIR}/.cache/dataverse/setting/user_setting.json\n```\n\n \n### How to modify?\n1. You could modify the `user_setting.json` file directly\n2. or can use proxy class `UserSetting`\n    - this is synchronized with the `user_setting.json` file\n\n```python\nfrom dataverse.utils.setting import UserSetting\n```\n\n### How to use `UserSetting` proxy?\n> There is 3 ways to use it.\n\n\n```python\nfrom dataverse.utils.setting import UserSetting\n\n# get the value\ngithub_api = UserSetting().get('GITHUB_API')\ngithub_api = UserSetting()['GITHUB_API']\ngithub_api = UserSetting().GITHUB_API\n\n# set the value\nUserSetting().set('GITHUB_API', 'your_github_api_key')\nUserSetting()['GITHUB_API'] = 'your_github_api_key'\nUserSetting().GITHUB_API = 'your_github_api_key'\n```"
  },
  {
    "path": "dataverse/utils/setting/__init__.py",
    "content": "\nfrom dataverse.utils.setting.user import UserSetting\nfrom dataverse.utils.setting.system import SystemSetting"
  },
  {
    "path": "dataverse/utils/setting/system.py",
    "content": "\n\"\"\"\nInterface for system setting\n\"\"\"\n\nimport os\nimport re\nimport uuid\nimport json\nimport boto3\nimport pyspark\nfrom pathlib import Path\n\nimport dataverse\nfrom dataverse.utils.api import aws_check_credentials\nfrom dataverse.utils.api import aws_s3_create_bucket\nfrom dataverse.utils.api import aws_s3_list_buckets\n\n\nclass SystemSetting:\n    \"\"\"\n    System Setting CRUD interface\n\n    system setting holds all the variables\n    that influence the behavior of the dataverse system.\n\n    Also, this class is a singleton class, so you can use it anywhere in the code\n    \n    [ MEMORY ONLY ]\n    - system setting is stored in memory only\n    - system setting is not persistent\n\n    [ Update by Env Variable ]\n    - system setting can be updated by env variable\n\n    [ Manual Update ]\n    - default system setting can be updated manually\n    - check `default_setting()`\n\n    [ No Update after Initialization ]\n    - system could be updated but not reflected in the program\n    - need to restart the program to use new system setting\n    \"\"\"\n    # Singleton\n    _initialized = False\n\n    # TODO: system setting per user [Candidate]\n    ...\n\n    def __new__(cls):\n        if not hasattr(cls, 'instance'):\n            cls.instance = super(SystemSetting, cls).__new__(cls)\n        return cls.instance\n\n    def __init__(self):\n        # when the class is initialized, this is called everytime\n        # regardless of the singleton. So adding the flag to check\n        if self._initialized:\n            return\n\n        self.default_setting()\n        self.update_by_env()\n        self._initialized = True\n\n    def _get_aws_bucket(self, verbose=True):\n        \"\"\"\n        the bucket will be used to store the dataverse info\n        - cache\n        - log\n        - etc\n\n        - format\n            - dataverse-{MAGIC_NUMBER}-{UUID}\n        \"\"\"\n        # if aws credential is not valid, return None\n        if not aws_check_credentials():\n            return None\n\n        identify_prefix = f'dataverse-{self.MAGIC_NUMBER}-'\n        for bucket in aws_s3_list_buckets():\n            if identify_prefix in bucket:\n\n                # check if the last part is uuid\n                uuid_part = bucket.replace(identify_prefix, \"\")\n                try:\n                    uuid.UUID(uuid_part)\n\n                    # Use this bucket for your package operations\n                    if verbose:\n                        print(\"Detected Dataverse Bucket: \" + bucket)\n\n                    return bucket\n                except ValueError:\n                    # not a valid UUID, so ignore this bucket\n                    pass \n\n        # if there is no relevant bucket, create one\n        bucket = f'dataverse-{self.MAGIC_NUMBER}-{uuid.uuid1()}'\n        aws_s3_create_bucket(bucket)\n\n        return bucket\n\n\n    def default_setting(self):\n        \"\"\"\n        Reset the system setting to default\n\n        Default setting:\n        - `MAGIC_NUMBER`: magic number for dataverse\n        - `CACHE_DIR`: default cache directory\n        - `IS_CLI`: if the program is running in CLI mode\n        - `AWS_BUCKET`: default aws bucket name for dataverse info\n        - `SPARK_VERSION`: spark version\n        - `HADOOP_VERSION`: hadoop version\n        \"\"\"\n        self.system_setting = {}\n\n        # MAGIC NUMBER\n        # dv - Dataverse\n        # 42 - (The Hitchhiker's Guide to the Galaxy)\n        #      The Answer to the Ultimate Question of Life, the Universe, and Everything\n        self.MAGIC_NUMBER = \"dv42\"\n\n        # DATAVERSE\n        self.DATAVERSE_HOME = os.path.dirname(dataverse.__file__)\n\n        # HARD CODED DEFAULT SETTING\n        self.CACHE_DIR = Path.home().as_posix()\n        self.IS_CLI = False\n\n        # AWS SETTING\n        self.AWS_BUCKET = self._get_aws_bucket()\n\n        # SPARK VERSION\n        self.SPARK_VERSION = pyspark.__version__\n\n        # HADOOP VERSION\n        jars = Path(pyspark.__file__).parent / \"jars\"\n        hadoop_jar = list(jars.glob(\"hadoop-client-runtime*.jar\"))\n        self.HADOOP_VERSION = re.findall(r\"\\d+\\.\\d+\\.\\d+\", hadoop_jar[0].name)[-1]\n\n        # TODO: add more default setting here\n        ...\n\n    def update_by_env(self):\n        \"\"\"\n        Update the system setting by env variable\n        \"\"\"\n        # check if the env variable is set\n        for key in self.system_setting:\n            if key in os.environ:\n                self.system_setting[key] = os.environ[key]\n\n    def check_naming_convention(self, key):\n        \"\"\"\n        1. only CAPITALIZED format\n            - e.g. CACHE_DIR (O)\n            - e.g. cache_dir (X)\n        2. only alphanumeric and underscore\n            - e.g. CACHE_DIR2 (O)\n            - e.g. cache-dir (X)\n            - e.g. CACHE_@DIR (X)\n        3. only one underscore between words\n            - e.g. CACHE__DIR (X)\n        4. no underscore at the start/end of the key\n            - e.g. _CACHE_DIR (X)\n            - e.g. CACHE_DIR_ (X)\n        \"\"\"\n        # 1. only CAPITALIZED format\n        if key != key.upper():\n            raise ValueError(f\"key [ {key} ] is not in Capitalized format\")\n\n        # 2. only alphanumeric and underscore\n        for char in key:\n            if not char.isalnum() and char != \"_\":\n                raise ValueError(f\"key [ {key} ] should only contains alphanumeric and underscore\")\n\n        # 3. only one underscore between words\n        if \"_\" in key:\n            # check if there is only one underscore\n            divided_keys = key.split(\"_\")\n            if \"\" in divided_keys:\n                raise ValueError(f\"key [ {key} ] contains more than one underscore\")\n\n        # 4. no underscore at the start/end of the key\n        if key.startswith(\"_\") or key.endswith(\"_\"):\n            raise ValueError(f\"key [ {key} ] contains underscore at the start/end of the key\")\n\n    def get(self, key):\n        \"\"\"\n        \"\"\"\n        if key not in self.system_setting:\n            raise KeyError(f\"key [ {key} ] does not exist in SYSTEM setting\")\n        return self.system_setting[key]\n\n    def set(self, key, value):\n        \"\"\"\n        \"\"\"\n        self.check_naming_convention(key)\n        self.system_setting[key] = value\n\n    # Support dot-like access, e.g. setting.CACHE_DIR\n    def __getattr__(self, key):\n        if key in [\n            \"_initialized\",\n            \"system_setting\"\n        ]:\n            return super().__getattr__(key)\n        else:\n            return self.get(key)\n\n    def __setattr__(self, key, value):\n        if key in [\n            \"_initialized\",\n            \"system_setting\"\n        ]:\n            super().__setattr__(key, value)\n        else:\n            self.set(key, value)\n\n    # Support dict-like access, e.g. setting[\"CACHE_DIR\"]\n    def __getitem__(self, key):\n        return self.get(key)\n\n    def __setitem__(self, key, value):\n        self.set(key, value)\n\n    def delete(self, key):\n        \"\"\"\n        \"\"\"\n        if key in self.system_setting:\n            self.system_setting.pop(key, None)\n        else:\n            raise KeyError(f\"key [ {key} ] does not exist in SYSTEM setting\")\n\n    def list(self):\n        \"\"\"\n        List all settings\n        \"\"\"\n        print(self.system_setting)\n\n    def __repr__(self):\n        return json.dumps(self.system_setting, indent=4)\n\n    def __str__(self):\n        return json.dumps(self.system_setting, indent=4)\n"
  },
  {
    "path": "dataverse/utils/setting/user.py",
    "content": "\n\"\"\"\nInterface for user setting\n\"\"\"\n\nimport os\nimport json\nfrom pathlib import Path\n\nfrom dataverse.utils.setting.system import SystemSetting\n\n\nclass UserSetting:\n    \"\"\"\n    Proxy for user setting CRUD, synchronized with the user_setting.json file\n\n    To manage user API keys, passwords, or other sensitive information,\n    you can use directly store the information in the user_setting.json file\n    or use this class to store the information as proxy. \n\n    Also, this class is a singleton class, so you can use it anywhere in the code\n\n    caveat:\n        - this is just a storage and does not include any logic more than CRUD\n        - anything more than CRUD is a responsibility of outside of this class\n\n    what does it means resposibility of outside of this class:\n        When user wants to save API key and if does not exist\n        asking user to input the API key with stdin might be a good idea.\n        But this is not the responsibility of this class.\n\n        This class will only return API key exists or not and base on that,\n        outside of this class will ask user to input the API key\n        or raise error or whatever it needs to do.\n    \"\"\"\n    # Singleton\n    _initialized = False\n\n    # TODO: system setting per user [Candidate]\n    ...\n\n    def __new__(cls):\n        if not hasattr(cls, 'instance'):\n            cls.instance = super(UserSetting, cls).__new__(cls)\n        return cls.instance\n\n    def __init__(self):\n        # when the class is initialized, this is called everytime\n        # regardless of the singleton. So adding the flag to check\n        if self._initialized:\n            return\n\n        # create the user setting path\n        os.makedirs(f\"{SystemSetting().CACHE_DIR}/.cache/dataverse/setting\", exist_ok=True)\n        self.user_setting_path = os.path.join(\n            SystemSetting().CACHE_DIR,\n            \".cache/dataverse/setting/user_setting.json\"\n        )\n\n        # load the user setting, if not exist, empty dict will be assigned\n        self.user_setting = self.load(self.user_setting_path)\n        self._initialized = True\n\n    def reset(self):\n        \"\"\"\n        reset the setting\n        \"\"\"\n        self.user_setting = {}\n        self.sync_file()\n\n    def sync_file(self):\n        \"\"\"\n        sync (class -> file)\n        \"\"\"\n        with open(self.user_setting_path, \"w\") as f:\n            json.dump(self.user_setting, f, indent=4)\n\n    def sync_class(self):\n        \"\"\"\n        sync (file -> class)\n        \"\"\"\n        # sync the file to make sure the dict is up-to-date\n        self.user_setting = self.load(self.user_setting_path)\n\n    def load(self, path):\n        \"\"\"\n        Load the user setting file\n        \"\"\"\n        # check if user setting file exists\n        if not os.path.exists(path):\n            return {}\n\n        # read the file\n        with open(path, \"r\") as f:\n            json_file = json.load(f)\n\n        return json_file\n\n    def check_naming_convention(self, key):\n        \"\"\"\n        1. only CAPITALIZED format\n            - e.g. GITHUB_API (O)\n            - e.g. github_api (X)\n        2. only alphanumeric and underscore\n            - e.g. GITHUB_API2 (O)\n            - e.g. github-api (X)\n            - e.g. GITHUB_@API (X)\n        3. only one underscore between words\n            - e.g. GITHUB__API (X)\n        4. no underscore at the start/end of the key\n            - e.g. _GITHUB_API (X)\n            - e.g. GITHUB_API_ (X)\n        \"\"\"\n        # 1. only CAPITALIZED format\n        if key != key.upper():\n            raise ValueError(f\"key [ {key} ] is not in Capitalized format\")\n\n        # 2. only alphanumeric and underscore\n        for char in key:\n            if not char.isalnum() and char != \"_\":\n                raise ValueError(f\"key [ {key} ] should only contains alphanumeric and underscore\")\n\n        # 3. only one underscore between words\n        if \"_\" in key:\n            # check if there is only one underscore\n            divided_keys = key.split(\"_\")\n            if \"\" in divided_keys:\n                raise ValueError(f\"key [ {key} ] contains more than one underscore\")\n\n        # 4. no underscore at the start/end of the key\n        if key.startswith(\"_\") or key.endswith(\"_\"):\n            raise ValueError(f\"key [ {key} ] contains underscore at the start/end of the key\")\n\n\n    def get(self, key):\n        \"\"\"\n        \"\"\"\n        self.sync_class()\n        if key not in self.user_setting:\n            raise KeyError(f\"key [ {key} ] does not exist in USER setting\")\n        return self.user_setting[key]\n\n    def set(self, key, value):\n        \"\"\"\n        \"\"\"\n        self.check_naming_convention(key)\n        self.user_setting[key] = value\n        self.sync_file()\n\n    # Support dot-like access, e.g. setting.GITHUB_API\n    def __getattr__(self, key):\n        if key in [\n            \"_initialized\",\n            \"user_setting\",\n            \"user_setting_path\"\n        ]:\n            return super().__getattr__(key)\n        else:\n            return self.get(key)\n\n    def __setattr__(self, key, value):\n        if key in [\n            \"_initialized\",\n            \"user_setting\",\n            \"user_setting_path\"\n        ]:\n            super().__setattr__(key, value)\n        else:\n            self.set(key, value)\n\n    # Support dict-like access, e.g. setting[\"GITHUB_API\"]\n    def __getitem__(self, key):\n        return self.get(key)\n\n    def __setitem__(self, key, value):\n        self.set(key, value)\n\n    def delete(self, key):\n        \"\"\"\n        \"\"\"\n        if key in self.user_setting:\n            self.user_setting.pop(key, None)\n            self.sync_file()\n        else:\n            raise KeyError(f\"key [ {key} ] does not exist in USER setting\")\n\n    def list(self):\n        \"\"\"\n        List all settings\n        \"\"\"\n        self.sync_class()\n        print(self.user_setting)\n\n    def __repr__(self):\n        self.sync_class()\n        return json.dumps(self.user_setting, indent=4)\n\n    def __str__(self):\n        self.sync_class()\n        return json.dumps(self.user_setting, indent=4)\n"
  },
  {
    "path": "docs/Makefile",
    "content": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the environment for the first two.\nSPHINXOPTS    ?=\nSPHINXBUILD   ?= sphinx-build\nSOURCEDIR     = source\nBUILDDIR      = build\n\n# Put it first so that \"make\" without argument is like \"make help\".\nhelp:\n\t@$(SPHINXBUILD) -M help \"$(SOURCEDIR)\" \"$(BUILDDIR)\" $(SPHINXOPTS) $(O)\n\n.PHONY: help Makefile html\n\nhtml:\n\t@echo \"Setting environment variable for the build process...\"\n\t@export DATAVERSE_BUILD_DOC=true && \\\n\t$(SPHINXBUILD) -b html \"$(SOURCEDIR)\" \"$(BUILDDIR)/html\" $(SPHINXOPTS) $(O) && \\\n\techo \"Build is finished. Cleaning up environment variable.\" && \\\n\tunset DATAVERSE_BUILD_DOC\n# Catch-all target: route all unknown targets to Sphinx using the new\n# \"make mode\" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).\n%: Makefile\n\t@$(SPHINXBUILD) -M $@ \"$(SOURCEDIR)\" \"$(BUILDDIR)\" $(SPHINXOPTS) $(O)\n"
  },
  {
    "path": "docs/make.bat",
    "content": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\nset SOURCEDIR=source\r\nset BUILDDIR=build\r\n\r\n%SPHINXBUILD% >NUL 2>NUL\r\nif errorlevel 9009 (\r\n\techo.\r\n\techo.The 'sphinx-build' command was not found. Make sure you have Sphinx\r\n\techo.installed, then set the SPHINXBUILD environment variable to point\r\n\techo.to the full path of the 'sphinx-build' executable. Alternatively you\r\n\techo.may add the Sphinx directory to PATH.\r\n\techo.\r\n\techo.If you don't have Sphinx installed, grab it from\r\n\techo.https://www.sphinx-doc.org/\r\n\texit /b 1\r\n)\r\n\r\nif \"%1\" == \"\" goto help\r\n\r\n%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%\r\ngoto end\r\n\r\n:help\r\n%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%\r\n\r\n:end\r\npopd\r\n"
  },
  {
    "path": "docs/source/citation.rst",
    "content": "===================\nCitation\n===================\n\n\nIf you want to cite our *Dataverse* project, feel free to use the following bibtex::\n\n    @misc{dataverse,\n        title = {Dataverse},\n        author = {Hyunbyung Park, Sukyung Lee, Gyoungjin Gim, Yungi Kim, Dahyun Kim, Chanjun Park},\n        year = {2024},\n        publisher = {GitHub, Upstage AI},\n        howpublished = {\\url{https://github.com/UpstageAI/dataverse}},\n        }"
  },
  {
    "path": "docs/source/conf.py",
    "content": "# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common options. For a full\n# list see the documentation:\n# https://www.sphinx-doc.org/en/master/usage/configuration.html\n\n# -- Path setup --------------------------------------------------------------\n\nimport inspect\n\n# If extensions (or modules to document with autodoc) are in another directory,\n# add these directories to sys.path here. If the directory is relative to the\n# documentation root, use os.path.abspath to make it absolute, like shown here.\n#\nimport os\nimport sys\n\nimport sphinx_pdj_theme\nfrom sphinx.application import Sphinx\n\nsys.path.insert(0, os.path.abspath(\"../..\"))\n\n\n# -- Project information -----------------------------------------------------\n\nproject = \"dataverse\"\ncopyright = \"2024, Upstage AI\"\nauthor = \"Upstage AI\"\n\n# The full version, including alpha/beta/rc tags\nrelease = \"1.0.4\"\n\n\n# -- General configuration ---------------------------------------------------\n\n# Add any Sphinx extension module names here, as strings. They can be\n# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom\n# ones.\nextensions = [\n    \"sphinx.ext.autodoc\",\n    \"sphinx.ext.todo\",\n    \"sphinx.ext.napoleon\",\n    \"sphinx.ext.autosummary\",\n    \"sphinx.ext.githubpages\",\n]\ntodo_include_todos = True\nnapoleon_google_docstring = True\nnapoleon_numpy_docstring = True\n\n# Add any paths that contain templates here, relative to this directory.\ntemplates_path = [\"_templates\"]\n\n# List of patterns, relative to source directory, that match files and\n# directories to ignore when looking for source files.\n# This pattern also affects html_static_path and html_extra_path.\nexclude_patterns = []\n\n\n# -- Options for HTML output -------------------------------------------------\n\n# The theme to use for HTML and HTML Help pages.  See the documentation for\n# a list of builtin themes.\n\nhtml_permalinks_icon = \"<span>#</span>\"\nhtml_theme = \"sphinx_rtd_theme\"\n\n# Add any paths that contain custom static files (such as style sheets) here,\n# relative to this directory. They are copied after the builtin static files,\n# so a file named \"default.css\" will overwrite the builtin \"default.css\".\n\nhtml_static_path = [\"../images/\"]\n\n\n# -- Handle register_etl decorator -------------------------------------------------\n\nsys.path.insert(0, os.path.abspath(\"../../dataverse\"))\n\ndef process_signature(\n    app: Sphinx, what: str, name: str, obj, options, signature, return_annotation\n):\n    if what == \"function\" and hasattr(obj, \"run\"):\n        original_func = obj.run.__wrapped__\n        new_signature = inspect.signature(original_func)\n        parameters = list(new_signature.parameters.values())\n        new_signature = new_signature.replace(\n            parameters=[\n                inspect.Parameter(\"self\", inspect.Parameter.POSITIONAL_OR_KEYWORD)\n            ]\n            + parameters\n        )\n        return str(new_signature), return_annotation\n\n    return signature, return_annotation\n\n\ndef skip_undoc_members(app, what, name, obj, skip, options):\n    if inspect.isclass(obj) and not hasattr(obj, \"__is_etl__\"):\n        return None\n    return True\n\n\ndef setup(app):\n    app.connect(\"autodoc-process-signature\", process_signature)\n    app.connect(\"autodoc-skip-member\", skip_undoc_members)\n"
  },
  {
    "path": "docs/source/config/config.interface.rst",
    "content": "config\n========================\n\n.. automodule:: config.interface.Config\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\n    .. automethod:: config.interface.Config.load\n    .. automethod:: config.interface.Config.save\n    .. automethod:: config.interface.Config.default\n    .. automethod:: config.interface.Config.set_default\n\n\n\n"
  },
  {
    "path": "docs/source/etl/etl.bias.rst",
    "content": "etl.bias\n================\n\nReducing skewed or prejudiced data,\nwith a particular emphasis on data that reinforces stereotypes of LLMs.\n\n.. automodule:: etl.bias\n   :members:\n   :undoc-members:\n   :show-inheritance:\n"
  },
  {
    "path": "docs/source/etl/etl.cleaning.rst",
    "content": "etl.cleaning\n====================\nRemoving irrelevant, redun-dant, or noisy information from the data,\nsuch as stop words or special characters.\n\netl.cleaning.char module\n------------------------\n\n.. automodule:: etl.cleaning.char\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.cleaning.char.cleaning___char___remove_accent\n.. autofunction:: etl.cleaning.char.cleaning___char___normalize_whitespace\n.. autofunction:: etl.cleaning.char.cleaning___char___remove_unprintable\n\n\netl.cleaning.document module\n----------------------------\n\n.. automodule:: etl.cleaning.document\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n\n.. autofunction:: etl.cleaning.document.cleaning___document___split_by_word\n\netl.cleaning.html module\n------------------------\n\n.. automodule:: etl.cleaning.html\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.cleaning.html.cleaning___html___extract_plain_text\n\netl.cleaning.korean module\n--------------------------\n\n.. automodule:: etl.cleaning.korean\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.cleaning.korean.cleaning___korean___filter_by_ratio\n.. autofunction:: etl.cleaning.korean.cleaning___korean___reduce_emoticon\n\n\netl.cleaning.length module\n--------------------------\n\n.. automodule:: etl.cleaning.length\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.cleaning.length.cleaning___length___char_len_filter\n.. autofunction:: etl.cleaning.length.cleaning___length___word_len_filter\n\netl.cleaning.number module\n--------------------------\n\n.. automodule:: etl.cleaning.number\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.cleaning.number.cleaning___number___normalize\n\netl.cleaning.table module\n-------------------------\n\n.. automodule:: etl.cleaning.table\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.cleaning.table.cleaning___table___merge_col_vertical\n\netl.cleaning.unicode module\n---------------------------\n\n.. automodule:: etl.cleaning.unicode\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.cleaning.unicode.cleaning___unicode___remove_punct\n.. autofunction:: etl.cleaning.unicode.cleaning___unicode___replace_punct\n.. autofunction:: etl.cleaning.unicode.cleaning___unicode___normalize"
  },
  {
    "path": "docs/source/etl/etl.data_ingestion.rst",
    "content": "etl.data\\_ingestion\n===========================\nFacilitating the loading of data from various sources\n(e.g., data in Huggingface Hub, and parquet/csv/arrow format data in local storage)\ninto a preferred format.\n\netl.data\\_ingestion.arrow module\n--------------------------------\n\n.. automodule:: etl.data_ingestion.arrow\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_ingestion.arrow.data_ingestion___arrow___hf2raw\n\netl.data\\_ingestion.common\\_crawl module\n----------------------------------------\n\n.. automodule:: etl.data_ingestion.common_crawl\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_ingestion.common_crawl.data_ingestion___common_crawl___wet2raw\n.. autofunction:: etl.data_ingestion.common_crawl.data_ingestion___common_crawl___dump2raw\n.. autofunction:: etl.data_ingestion.common_crawl.data_ingestion___common_crawl___raw2ufl\n\netl.data\\_ingestion.csv module\n------------------------------\n\n.. automodule:: etl.data_ingestion.csv\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_ingestion.csv.data_ingestion___csv___csv2raw\n\netl.data\\_ingestion.cultura\\_x module\n-------------------------------------\n\n.. automodule:: etl.data_ingestion.cultura_x\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_ingestion.cultura_x.data_ingestion___cultura_x___raw2ufl\n\netl.data\\_ingestion.huggingface module\n--------------------------------------\n\n.. automodule:: etl.data_ingestion.huggingface\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_ingestion.huggingface.data_ingestion___huggingface___hf2raw\n\netl.data\\_ingestion.parquet module\n----------------------------------\n\n.. automodule:: etl.data_ingestion.parquet\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_ingestion.parquet.data_ingestion___parquet___pq2raw\n\netl.data\\_ingestion.red\\_pajama module\n--------------------------------------\n\n.. automodule:: etl.data_ingestion.red_pajama\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_ingestion.red_pajama.data_ingestion___red_pajama___parquet2ufl\n.. autofunction:: etl.data_ingestion.red_pajama.data_ingestion___red_pajama___hf2ufl\n.. autofunction:: etl.data_ingestion.red_pajama.data_ingestion___red_pajama___hf2raw\n.. autofunction:: etl.data_ingestion.red_pajama.data_ingestion___red_pajama___raw2ufl_templatev1\n\n\netl.data\\_ingestion.slim\\_pajama module\n---------------------------------------\n\n.. automodule:: etl.data_ingestion.slim_pajama\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_ingestion.slim_pajama.data_ingestion___slim_pajama___parquet2ufl\n.. autofunction:: etl.data_ingestion.slim_pajama.data_ingestion___slim_pajama___hf2ufl\n\netl.data\\_ingestion.test module\n-------------------------------\n\n.. automodule:: etl.data_ingestion.test\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_ingestion.test.data_ingestion___test___generate_fake_ufl"
  },
  {
    "path": "docs/source/etl/etl.data_save.rst",
    "content": "etl.data\\_save\n======================\nPersisting the processed data into a preferred destination,\nsuch as a data lake or database.\n\netl.data\\_save.aws module\n-------------------------\n\n.. automodule:: etl.data_save.aws\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\netl.data\\_save.huggingface module\n---------------------------------\n\n.. automodule:: etl.data_save.huggingface\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_save.huggingface.data_save___huggingface___ufl2hf_hub\n.. autofunction:: etl.data_save.huggingface.data_save___huggingface___ufl2hf\n.. autofunction:: etl.data_save.huggingface.data_save___huggingface___ufl2hf_obj\n\n\netl.data\\_save.parquet module\n-----------------------------\n\n.. automodule:: etl.data_save.parquet\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.data_save.parquet.data_save___parquet___ufl2parquet\n\n"
  },
  {
    "path": "docs/source/etl/etl.decontamination.rst",
    "content": "etl.decontamination\n===========================\n\nIdentifying and removing contaminated data such as benchmark datasets.\n\n.. automodule:: etl.decontamination\n   :members:\n   :undoc-members:\n   :show-inheritance:\n"
  },
  {
    "path": "docs/source/etl/etl.deduplication.rst",
    "content": "etl.deduplication\n=========================\nEliminating duplicated data on dataset by dataset basis or globally across multiple datasets.\n\netl.deduplication.common\\_crawl module\n--------------------------------------\n\n.. automodule:: etl.deduplication.common_crawl\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.deduplication.common_crawl.deduplication___common_crawl___exact_line\n\netl.deduplication.exact module\n------------------------------\n\n.. automodule:: etl.deduplication.exact\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.deduplication.exact.deduplication___exact___column\n\netl.deduplication.minhash module\n--------------------------------\n\n.. automodule:: etl.deduplication.minhash\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.deduplication.minhash.deduplication___minhash___lsh_jaccard\n\netl.deduplication.polyglot module\n---------------------------------\n\n.. automodule:: etl.deduplication.polyglot\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.deduplication.polyglot.deduplication___polyglot___minhash\n"
  },
  {
    "path": "docs/source/etl/etl.pii.rst",
    "content": "etl.pii\n===============\n\nEnsuring the removal of sensitive information, such as personally identifiable data, from the dataset.\n  \netl.pii.card module\n-------------------\n\n.. automodule:: etl.pii.card\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.pii.card.pii___card___replace_card_number\n\netl.pii.nin module\n------------------\n\n.. automodule:: etl.pii.nin\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.pii.nin.pii___nin___replace_korean_rrn\n"
  },
  {
    "path": "docs/source/etl/etl.pipeline.rst",
    "content": "etl.pipeline\n=====================\n\nETL Interface: user will be interacting with this interface\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\n.. autoclass:: etl.pipeline.ETLPipeline\n   :members:\n\n   .. autofunction:: etl.pipeline.ETLPipeline.status\n   .. autofunction:: etl.pipeline.ETLPipeline.search\n   .. autofunction:: etl.pipeline.ETLPipeline.get\n   .. autofunction:: etl.pipeline.ETLPipeline.setup_spark_conf\n   .. autofunction:: etl.pipeline.ETLPipeline.sample\n   .. autofunction:: etl.pipeline.ETLPipeline.run\n   .. autofunction:: etl.pipeline.ETLPipeline.run_emr\n"
  },
  {
    "path": "docs/source/etl/etl.quality.rst",
    "content": "etl.quality\n===================\n\nImproving the quality of data from the perspectives of accuracy, consistency, and reliability for LLMs.\n\netl.quality.language module\n---------------------------\n\n.. automodule:: etl.quality.language\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.quality.language.quality___language___fasttext_filter\n"
  },
  {
    "path": "docs/source/etl/etl.registry.rst",
    "content": "etl.registry\n=====================\nBase class to support the registration of the ETL classes\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\n\n.. autofunction:: etl.registry.auto_register\n\n.. autoclass:: etl.registry.ETLStructure\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autoclass:: etl.registry.ETLRegistry\n   :members:\n\n   .. autofunction:: etl.registry.ETLRegistry.register\n   .. autofunction:: etl.registry.ETLRegistry.search\n   .. autofunction:: etl.registry.ETLRegistry.get\n   .. autofunction:: etl.registry.ETLRegistry.get_all\n   .. autofunction:: etl.registry.ETLRegistry.reset\n\n.. autoclass:: etl.registry.ETLAutoRegistry\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autoclass:: etl.registry.BaseETL\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.registry.register_etl\n\n\n\n\n\n\n"
  },
  {
    "path": "docs/source/etl/etl.rst",
    "content": "etl\n===========\n\n.. toctree::\n   :maxdepth: 1\n\n   etl.bias\n   etl.cleaning\n   etl.data_ingestion\n   etl.data_save\n   etl.decontamination\n   etl.deduplication\n   etl.pii\n   etl.quality\n   etl.toxicity\n   etl.utils\n   etl.pipeline\n   etl.registry\n"
  },
  {
    "path": "docs/source/etl/etl.toxicity.rst",
    "content": "etl.toxicity\n====================\n\nIdentifying and eliminating harmful, offensive, or inappropriate content within the data.\n\n.. automodule:: etl.toxicity\n   :members:\n   :undoc-members:\n   :show-inheritance:\n"
  },
  {
    "path": "docs/source/etl/etl.utils.rst",
    "content": "etl.utils\n=================\nProviding essential functionalities for data processing,\nincluding sampling, logging, and statistical analysis.\n\netl.utils.log module\n--------------------\n\n.. automodule:: etl.utils.log\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.utils.log.utils___log___count\n\netl.utils.sampling module\n-------------------------\n\n.. automodule:: etl.utils.sampling\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.utils.sampling.utils___sampling___random\n\netl.utils.statistics module\n---------------------------\n\n.. automodule:: etl.utils.statistics\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n.. autofunction:: etl.utils.statistics.utils___statistics___korean_nouns"
  },
  {
    "path": "docs/source/index.rst",
    "content": ".. dataverse documentation master file, created by\n   sphinx-quickstart on Thu Feb 29 19:54:35 2024.\n   You can adapt this file completely to your liking, but it should at least\n   contain the root `toctree` directive.\n\n===================\nDataverse\n===================\n\n.. image:: ../images/dataverse_logo-color.png\n\nDataverse is a freely-accessible open-source project that supports your ETL pipeline with Python.\nWe offer a simple, standardized and user-friendly solution for data processing and management, catering to the needs of data scientists, analysts, and developers in LLM era. Even though you don't know much about Spark, you can use it easily via dataverse.\n\n\nWith Dataverse, you are empowered to\n--------------------------------------\n\n- utilize a range of preprocessing functions without the need to install multiple libraries.\n- create high-quality data for analysis and training of Large Language Models (LLM).\n- leverage Spark with ease, regardless of your expertise level.\n- facilitate smoother collaboration among users with varying degress of Spark proficiency.\n- enjoy freedom from the limitations of local environments by harnessing the capabilities of AWS EMR.\n\n\nArchitecture of Dataverse\n--------------------------------------\n.. image:: ../images/dataverse_system_architecture_white.jpeg\n\n\nKey Features of Dataverse\n--------------------------------------\n- **Block-Based**: In Dataverse, a `block` means a `registered ETL function` which is running on Spark. You can build Spark code like putting together puzzle pieces. You can easily add, take away, or re-arrange pieces to get the results you want via configure.\n- **Configure-Based**: All the setups for Spark and steps of block can be defined with configure. You don't need to know all the code. Just set up the options, and you're good to go.\n- **Extensible**: It's designed to meet your specific demands, allowing for custom features that fit perfectly with your project.\n\nIf you want to know more about Dataverse, please checkout our `docs <https://data-verse.gitbook.io/docs/>`__.\n\n\n.. toctree::\n   :maxdepth: 2\n   :hidden:\n   :caption: Getting Started\n\n   installation\n   quickstart\n   citation\n\n.. toctree::\n   :maxdepth: 5\n   :hidden:\n   :caption: Documentation\n\n   etl/etl\n   config/config.interface\n\n\nIndices and tables\n==================\n\n* :ref:`genindex`\n* :ref:`modindex`\n* :ref:`search`\n"
  },
  {
    "path": "docs/source/installation.rst",
    "content": "===================================\nInstallation\n===================================\n\n\nDataverse can be installed using pip\n---------------------------------------\n\n.. code-block:: python\n    \n    pip install dataverse\n\nIn order to use *Dataverse*, there are prerequisites you need to have: Python, Spark and Java.\nIn this `link <https://data-verse.gitbook.io/docs/lets-start/installation>`__, you can find guidelines for installing Apache Spark and JDK.\n\n\nDataverse supports AWS S3 and EMR\n------------------------------------\nWe are providing step by step guides via `link <https://data-verse.gitbook.io/docs/lets-start/aws-setting-guides>`__ to set up AWS S3 and EMR on *Dataverse*.\n"
  },
  {
    "path": "docs/source/quickstart.rst",
    "content": "\n===================\nQuickstart\n===================\n\n\nVarious and more detailed tutorials are `here <https://github.com/UpstageAI/dataverse/tree/main/examples>`__.\n\n- `add_new_etl_process.ipynb <https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_04_add_new_etl_process.ipynb>`__ : If you want to use your custom function, you have to register the function on Dataverse. This will guide you from register to apply it on pipeline.\n- `test_etl_process.ipynb <https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_05_test_etl_process.ipynb>`__ : When you want to get test(sample) data to quickly test your ETL process, or need data from a certain point to test your ETL process.\n- `scaleout_with_EMR.ipynb <https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_06_scaleout_with_EMR.ipynb>`__ : For people who want to run their pipeline on EMR cluster.\n\n\n1. Set your ETL process as config.\n``````````````````````````````````\n.. code-block:: python\n\n  from omegaconf import OmegaConf\n  ETL_config = OmegaConf.create({\n\n      # Set up Spark\n      'spark': { \n          'appname': 'ETL',\n          'driver': {'memory': '4g'},\n      },\n      'etl': [\n          { \n            # Extract; You can use HuggingFace datset from hub directly!\n            'name': 'data_ingestion___huggingface___hf2raw', \n            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}\n          },\n          {\n            # Reduce dataset scale\n            'name': 'utils___sampling___random',\n            'args': {'sample_n_or_frac': 0.5}\n          },\n          {\n            # Transform; deduplicate data via minhash\n            'name': 'deduplication___minhash___lsh_jaccard', \n            'args': {'threshold': 0.1,\n                    'ngram_size': 5,\n                    'subset': 'question'}\n          },\n          {\n            # Load; Save the data\n            'name': 'data_save___parquet___ufl2parquet',\n            'args': {'save_path': './guideline/etl/sample/quickstart.parquet'}\n          }\n        ]\n    })\n\nAbove code block is an example of an ETL process in *Dataverse*.\nIn *Dataverse*, the available registered ETL functions are referred to as ``blocks``, and this example is comprised of four blocks. You can freely combine these blocks using config to create the ETL processes for your needs.\nThe list of available functions and args of them can be found in the `API Reference <https://data-verse.readthedocs.io/en/latest/>`__. Each functions 'args' should be added in dictionary format.\n\n\n2. Run ETLpipeline.\n```````````````````\n\n.. code-block:: python\n\n  from dataverse.etl import ETLPipeline\n\n  etl_pipeline = ETLPipeline()\n  spark, dataset = etl_pipeline.run(config=ETL_config, verbose=True)\n\nETLPipeline is an object designed to manage the ETL processes.\nBy inserting ``ETL_config`` which is defined in the previous step into ETLpipeline object and calling the ``run`` method,\nstacked ETL blocks will execute in the order they were stacked.\n\n\n3. Result file is saved on the save_path\n```````````````````````````````````````````\n\n.. code-block:: python\n\n  import pandas as pd\n  pd.read_parquet('./guideline/etl/sample/quickstart.parquet')\n\nAs the example gave ``save_path`` argument to the last block of ``ETL_config``, \ndata passed through the process will be saved on the given path."
  },
  {
    "path": "docs/source/requirements.txt",
    "content": "sphinx\nsphinx-pdj-theme\nsphinx-rtd-theme\nrequests\nnumpy\npandas\nfasttext-wheel\nomegaconf\ndatasets\npyspark\nscipy\ntrafilatura\nhtml2text\nfaker\nboto3\npre-commit==3.6.0\nbotocore\nrsa\ns3transfer\nisort\npytest\n"
  },
  {
    "path": "examples/README.md",
    "content": "# 🌍 Examples\n> This is a example collection for `dataverse`. We will talk about the basic usage of `dataverse`, knowhows, and how to use it in your project.\n\n\n### 🙋  I'm very new to Dataverse\nIntroduces very basic, but core steps to use Dataverse.\n\n- [ETL_01_how_to_run.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_01_how_to_run.ipynb)\n- [ETL_02_one_cycle.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_02_one_cycle.ipynb)\n\n### 🙋 I want to use my custom function\nIf you want to use your custom function, you have to register the function on Dataverse. These will guide you from register to apply it on pipeline.\n\n- [ETL_03_create_new_etl_process.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_03_create_new_etl_process.ipynb)\n- [ETL_04_add_new_etl_process.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_04_add_new_etl_process.ipynb)\n\n### 🙋 I need to test my ETL process with samples\nWhen you want to get test(sample) data to quickly test your ETL process, or need data from a certain point to test your ETL process\n\n- [ETL_05_test_etl_process.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_05_test_etl_process.ipynb)\n\n\n### 🙋 I want to run it on EMR cluster\n\nCheck AWS S3 Support for settings\n- [ETL_06_scaleout_with_EMR.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_06_scaleout_with_EMR.ipynb)\n\n### 🙋 Is there any real-world dataset to use Dataverse?\nShows how to use common crawl data.\n\n- [EX_use_common_crawl_data.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/EX_use_common_crawl_data.ipynb)\n\n### 🙋 I want to use Pyspark UI\nHelps you to use Pyspark UI to monitor the spark job in Docker environment.\n\n- [EX_use_pyspark_ui.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/EX_use_pyspark_ui.ipynb)\n\n\n"
  },
  {
    "path": "examples/etl/ETL_01_how_to_run.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# ETL how to run?\\n\",\n    \"> At here we will talk about how to run ETL. There is 2 steps to run ETL.\\n\",\n    \"\\n\",\n    \"1. prepare config\\n\",\n    \"2. put config to ETLPipeline\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 1. prepare config\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"from pathlib import Path\\n\",\n    \"from dataverse.config import Config \\n\",\n    \"from omegaconf import OmegaConf\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Option 1: When you cloned the Dataverse repository\\n\",\n    \"- This method loads the config file from the directory based on the Dataverse repository.\\n\",\n    \"- If you haven't cloned the repository, please follow option 2.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# E = Extract, T = Transform, L = Load\\n\",\n    \"main_path = Path(os.path.abspath('../..'))\\n\",\n    \"E_path = main_path / \\\"./dataverse/config/etl/sample/data_ingestion___sampling.yaml\\\"\\n\",\n    \"T_path = main_path / \\\"./dataverse/config/etl/sample/data_preprocess___dedup.yaml\\\"\\n\",\n    \"L_path = main_path / \\\"./dataverse/config/etl/sample/data_save___hf_obj.yaml\\\"\\n\",\n    \"\\n\",\n    \"E_config = Config.load(E_path)\\n\",\n    \"T_config = Config.load(T_path)\\n\",\n    \"L_config = Config.load(L_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Option 2: When you HAVEN'T Cloned the Dataverse Repository\\n\",\n    \"- With this method, we define each E, T, L config in the shell.\\n\",\n    \"- These configs are exactly the same as each file mentioned above.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"E_config = OmegaConf.create({\\n\",\n    \"    'spark': { \\n\",\n    \"        'appname': 'dataverse_etl_sample',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        { \\n\",\n    \"          'name': 'data_ingestion___test___generate_fake_ufl', \\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"          'name': 'utils___sampling___random',\\n\",\n    \"          'args': {'sample_n_or_frac': 0.1}\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"          'name': 'data_save___parquet___ufl2parquet',\\n\",\n    \"          'args': {'save_path': \\\"./sample/sample_ufl.parquet\\\"}\\n\",\n    \"        },\\n\",\n    \"      ]\\n\",\n    \"  })\\n\",\n    \"\\n\",\n    \"T_config = OmegaConf.create({\\n\",\n    \"    'spark': { \\n\",\n    \"        'appname': 'dataverse_etl_sample',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        { \\n\",\n    \"          'name': 'data_ingestion___parquet___pq2raw', \\n\",\n    \"          'args': {'path': \\\"./sample/sample_ufl.parquet\\\"}\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"          'name': 'deduplication___minhash___lsh_jaccard',\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"          'name': 'data_save___parquet___ufl2parquet',\\n\",\n    \"          'args': {'save_path': \\\"./sample/preprocess_ufl.parquet\\\"}\\n\",\n    \"        },\\n\",\n    \"      ]\\n\",\n    \"  })\\n\",\n    \"\\n\",\n    \"L_config = OmegaConf.create({\\n\",\n    \"    'spark': { \\n\",\n    \"        'appname': 'dataverse_etl_sample',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        { \\n\",\n    \"          'name': 'data_ingestion___parquet___pq2raw', \\n\",\n    \"          'args': {'path': './sample/preprocess_ufl.parquet'}\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"          'name': 'data_save___huggingface___ufl2hf_obj',\\n\",\n    \"        },\\n\",\n    \"      ]\\n\",\n    \"  })\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Extract Config\\n\",\n    \"\\n\",\n    \"- load fake generation UFL data\\n\",\n    \"- sample 10% of total data to reduce the size of dataset\\n\",\n    \"- save to parquet `dataverse/sample/sample_ufl.parquet`\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: dataverse_etl_sample\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___test___generate_fake_ufl\\n\",\n      \"- name: utils___sampling___random\\n\",\n      \"  args:\\n\",\n      \"    sample_n_or_frac: 0.1\\n\",\n      \"- name: data_save___parquet___ufl2parquet\\n\",\n      \"  args:\\n\",\n      \"    save_path: ./sample/sample_ufl.parquet\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(OmegaConf.to_yaml(E_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Transform Config\\n\",\n    \"\\n\",\n    \"- load parquet `./sample/sample_ufl.parquet`\\n\",\n    \"- deduplicate by `text` column, 15-gram minhash jaccard similarity\\n\",\n    \"- save to parquet `./sample/preprocess_ufl.parquet`\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: dataverse_etl_sample\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___parquet___pq2raw\\n\",\n      \"  args:\\n\",\n      \"    path: ./sample/sample_ufl.parquet\\n\",\n      \"- name: deduplication___minhash___lsh_jaccard\\n\",\n      \"- name: data_save___parquet___ufl2parquet\\n\",\n      \"  args:\\n\",\n      \"    save_path: ./sample/preprocess_ufl.parquet\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(OmegaConf.to_yaml(T_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Load Config\\n\",\n    \"\\n\",\n    \"- load parquet `./sample/preprocess_ufl.parquet`\\n\",\n    \"- convert to huggingface dataset and return the object\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: dataverse_etl_sample\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___parquet___pq2raw\\n\",\n      \"  args:\\n\",\n      \"    path: ./sample/preprocess_ufl.parquet\\n\",\n      \"- name: data_save___huggingface___ufl2hf_obj\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(OmegaConf.to_yaml(L_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 2. put config to ETLPipeline\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"An error occurred (ExpiredToken) when calling the GetCallerIdentity operation: The security token included in the request is expired\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[ No AWS Credentials Found] - Failed to set spark conf for S3\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\",\n      \"24/04/15 22:10:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"24/04/15 22:10:33 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[ No AWS Credentials Found] - Failed to set spark conf for S3\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"24/04/15 22:10:38 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\",\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[ No AWS Credentials Found] - Failed to set spark conf for S3\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"24/04/15 22:10:45 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# raw -> ufl\\n\",\n    \"etl_pipeline.run(E_config)\\n\",\n    \"\\n\",\n    \"# ufl -> dedup -> ufl\\n\",\n    \"etl_pipeline.run(T_config)\\n\",\n    \"\\n\",\n    \"# ufl -> hf_obj\\n\",\n    \"spark, dataset = etl_pipeline.run(L_config)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['id', 'meta', 'name', 'text'],\\n\",\n       \"    num_rows: 14\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'id': 'a3715cee-e252-4360-9a15-93a3fcc832fb',\\n\",\n       \" 'meta': '{\\\"name\\\": \\\"Caitlin Hughes\\\", \\\"age\\\": 55, \\\"address\\\": \\\"517 Cassandra Mountains\\\\\\\\nJamesberg, NM 13313\\\", \\\"job\\\": \\\"Orthoptist\\\"}',\\n\",\n       \" 'name': 'test_fake_ufl',\\n\",\n       \" 'text': 'Necessary miss set choice car hour. Only man interest affect. Cover black protect successful president court memory.'}\"\n      ]\n     },\n     \"execution_count\": 10,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dataset[0]\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"llm\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.13\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/etl/ETL_02_one_cycle.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# ETL one cycle\\n\",\n    \"> Normally ETL is processed by 3 steps, E, T, L :) but we could do it by one cycle, ETL.\\n\",\n    \"\\n\",\n    \"We are going to use the 3 configs from `ETL_how_to_run.ipynb` and merge it to one config file.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 1. prepare config\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"from pathlib import Path\\n\",\n    \"from dataverse.config import Config \\n\",\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"# E = Extract, T = Transform, L = Load\\n\",\n    \"main_path = Path(os.path.abspath('../..'))\\n\",\n    \"ETL_path = main_path / \\\"./dataverse/config/etl/sample/ETL___one_cycle.yaml\\\"\\n\",\n    \"\\n\",\n    \"ETL_config = Config.load(ETL_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Wait! If you haven't clone the repository, run the shell script below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ETL_config = OmegaConf.create({\\n\",\n    \"    'spark': { \\n\",\n    \"        'appname': 'dataverse_etl_sample',\\n\",\n    \"        'driver': {'memory': '16g'}  \\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_ingestion___test___generate_fake_ufl'\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"            'name': 'utils___sampling___random',\\n\",\n    \"            'args': {'sample_n_or_frac': 0.1}\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"            'name': 'deduplication___minhash___lsh_jaccard'\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_save___huggingface___ufl2hf_obj'\\n\",\n    \"        }\\n\",\n    \"    ]\\n\",\n    \"})\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 ETL Config\\n\",\n    \"> One cycle from raw to huggingface dataset\\n\",\n    \"\\n\",\n    \"- load fake generation UFL data\\n\",\n    \"- sample 10% of total data to reduce the size of dataset\\n\",\n    \"- deduplicate by `text` column, 15-gram minhash jaccard similarity\\n\",\n    \"- convert to huggingface dataset and return the object\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: dataverse_etl_sample\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___test___generate_fake_ufl\\n\",\n      \"- name: utils___sampling___random\\n\",\n      \"  args:\\n\",\n      \"    sample_n_or_frac: 0.1\\n\",\n      \"- name: deduplication___minhash___lsh_jaccard\\n\",\n      \"- name: data_save___huggingface___ufl2hf_obj\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(OmegaConf.to_yaml(ETL_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 2. put config to ETLPipeline\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[ No AWS Credentials Found] - Failed to set spark conf for S3\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\",\n      \"24/04/15 22:26:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"24/04/15 22:26:20 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# raw -> hf_obj\\n\",\n    \"spark, dataset = etl_pipeline.run(ETL_config)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['id', 'meta', 'name', 'text'],\\n\",\n       \"    num_rows: 14\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'id': '32ff39e5-2a88-45dc-a69e-b59b05f51216',\\n\",\n       \" 'meta': '{\\\"name\\\": \\\"Laura White\\\", \\\"age\\\": 49, \\\"address\\\": \\\"126 Javier Islands Apt. 925\\\\\\\\nPort Jasonshire, UT 60978\\\", \\\"job\\\": \\\"Mining engineer\\\"}',\\n\",\n       \" 'name': 'test_fake_ufl',\\n\",\n       \" 'text': 'Your whose admit ask herself. Public mission far program tough.\\\\nEconomic talk few minute. Budget face yeah along difference. Evening heart throughout.'}\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dataset[0]\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"llm\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.13\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/etl/ETL_03_create_new_etl_process.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# ETL create new etl process\\n\",\n    \"> Create your custom ETL process to the ETL pipeline.\\n\",\n    \"\\n\",\n    \"when you want to create your own ETL process, it could be tricky.\\n\",\n    \"here is a simple example to show you where to start to create your own ETL process.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 1. Start from ETL Pipeline you wanna add your own ETL process\\n\",\n    \"> simple ETL pipeline to load huggingface dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: ETL\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___huggingface___hf2raw\\n\",\n      \"  args:\\n\",\n      \"    name_or_path:\\n\",\n      \"    - ai2_arc\\n\",\n      \"    - ARC-Challenge\\n\",\n      \"- name: data_save___huggingface___ufl2hf_obj\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"# load from dict\\n\",\n    \"ETL_config = OmegaConf.create({\\n\",\n    \"    'spark': {\\n\",\n    \"        'appname': 'ETL',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_ingestion___huggingface___hf2raw',\\n\",\n    \"            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_save___huggingface___ufl2hf_obj'\\n\",\n    \"        }\\n\",\n    \"    ]\\n\",\n    \"})\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(ETL_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\",\n      \"23/11/14 19:02:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"23/11/14 19:02:29 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\",\n      \"Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"application/vnd.jupyter.widget-view+json\": {\n       \"model_id\": \"9233872e69a545e5a338bdc9b1154537\",\n       \"version_major\": 2,\n       \"version_minor\": 0\n      },\n      \"text/plain\": [\n       \"  0%|          | 0/3 [00:00<?, ?it/s]\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    },\n    {\n     \"data\": {\n      \"application/vnd.jupyter.widget-view+json\": {\n       \"model_id\": \"c5a31a78b3e9459fb28f8680b14491a8\",\n       \"version_major\": 2,\n       \"version_minor\": 0\n      },\n      \"text/plain\": [\n       \"Creating parquet from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Downloading and preparing dataset spark/-1076059055 to /root/.cache/huggingface/datasets/spark/-1076059055/0.0.0...\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/-1076059055/0.0.0. Subsequent calls will reuse this data.\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['answerKey', 'choices', 'id', 'question'],\\n\",\n       \"    num_rows: 2590\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 2,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"\\n\",\n    \"# raw -> hf_obj\\n\",\n    \"spark, dataset = etl_pipeline.run(ETL_config)\\n\",\n    \"dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 2. choose where you wanna add your own ETL process\\n\",\n    \"> remove or comment out the following ETL process from config!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: ETL\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___huggingface___hf2raw\\n\",\n      \"  args:\\n\",\n      \"    name_or_path:\\n\",\n      \"    - ai2_arc\\n\",\n      \"    - ARC-Challenge\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"# load from dict\\n\",\n    \"ETL_config = OmegaConf.create({\\n\",\n    \"    'spark': {\\n\",\n    \"        'appname': 'ETL',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_ingestion___huggingface___hf2raw',\\n\",\n    \"            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}\\n\",\n    \"        },\\n\",\n    \"        \\n\",\n    \"        # TODO: you want to add your own ETL process from here\\n\",\n    \"\\n\",\n    \"        # TODO: if so, you need to add the following ETL process!\\n\",\n    \"        #       remove or comment out the following ETL process\\n\",\n    \"        # {\\n\",\n    \"        #     'name': 'data_load___huggingface___ufl2hf_obj'\\n\",\n    \"        # }\\n\",\n    \"    ]\\n\",\n    \"})\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(ETL_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/11/14 19:02:42 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"application/vnd.jupyter.widget-view+json\": {\n       \"model_id\": \"1835993218fb488a9cc02bcdef0f49b9\",\n       \"version_major\": 2,\n       \"version_minor\": 0\n      },\n      \"text/plain\": [\n       \"  0%|          | 0/3 [00:00<?, ?it/s]\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"\\n\",\n    \"# raw -> spark, data[rdd, Dataframe]\\n\",\n    \"spark, data = etl_pipeline.run(ETL_config)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"\\n\",\n       \"            <div>\\n\",\n       \"                <p><b>SparkSession - in-memory</b></p>\\n\",\n       \"                \\n\",\n       \"        <div>\\n\",\n       \"            <p><b>SparkContext</b></p>\\n\",\n       \"\\n\",\n       \"            <p><a href=\\\"http://instance-3730:4040\\\">Spark UI</a></p>\\n\",\n       \"\\n\",\n       \"            <dl>\\n\",\n       \"              <dt>Version</dt>\\n\",\n       \"                <dd><code>v3.4.1</code></dd>\\n\",\n       \"              <dt>Master</dt>\\n\",\n       \"                <dd><code>local[10]</code></dd>\\n\",\n       \"              <dt>AppName</dt>\\n\",\n       \"                <dd><code>ETL</code></dd>\\n\",\n       \"            </dl>\\n\",\n       \"        </div>\\n\",\n       \"        \\n\",\n       \"            </div>\\n\",\n       \"        \"\n      ],\n      \"text/plain\": [\n       \"<pyspark.sql.session.SparkSession at 0x7fe58d1ede40>\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"spark\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"PythonRDD[13] at RDD at PythonRDD.scala:53\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 3. Check the current process so far!\\n\",\n    \"> use spark to check the current process so far!\\n\",\n    \"- `collect` is heavy so recommend to use `take` instead of `collect`!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'id': 'Mercury_7029645',\\n\",\n       \"  'question': 'Metal atoms will most likely form ions by the',\\n\",\n       \"  'choices': Row(text=['loss of electrons.', 'loss of protons.', 'gain of electrons.', 'gain of protons.'], label=['A', 'B', 'C', 'D']),\\n\",\n       \"  'answerKey': 'A'},\\n\",\n       \" {'id': 'Mercury_7216598',\\n\",\n       \"  'question': 'Which phrase does not describe asexual reproduction in organisms?',\\n\",\n       \"  'choices': Row(text=['requires two parents', 'little variation in offspring', 'only one type of cell involved', 'duplicates its genetic material'], label=['A', 'B', 'C', 'D']),\\n\",\n       \"  'answerKey': 'A'},\\n\",\n       \" {'id': 'MDSA_2008_5_40',\\n\",\n       \"  'question': 'A student is investigating changes in the states of matter. The student fills a graduated cylinder with 50 milliliters of packed snow. The graduated cylinder has a mass of 50 grams when empty and 95 grams when filled with the snow. The packed snow changes to liquid water when the snow is put in a warm room. Which statement best describes this process?',\\n\",\n       \"  'choices': Row(text=['Cooling causes the snow to melt.', 'Cooling causes the snow to freeze.', 'Heating causes the snow to freeze.', 'Heating causes the snow to melt.'], label=['A', 'B', 'C', 'D']),\\n\",\n       \"  'answerKey': 'D'}]\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"data.take(3)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 4. Create your own ETL process\\n\",\n    \"> what do you want to do after all? \\n\",\n    \"\\n\",\n    \"Let's say you want to add `filter` process to the ETL pipeline.\\n\",\n    \"- you want to remove `choices` key from the dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'id': 'Mercury_7029645',\\n\",\n       \"  'question': 'Metal atoms will most likely form ions by the',\\n\",\n       \"  'answerKey': 'A'},\\n\",\n       \" {'id': 'Mercury_7216598',\\n\",\n       \"  'question': 'Which phrase does not describe asexual reproduction in organisms?',\\n\",\n       \"  'answerKey': 'A'},\\n\",\n       \" {'id': 'MDSA_2008_5_40',\\n\",\n       \"  'question': 'A student is investigating changes in the states of matter. The student fills a graduated cylinder with 50 milliliters of packed snow. The graduated cylinder has a mass of 50 grams when empty and 95 grams when filled with the snow. The packed snow changes to liquid water when the snow is put in a warm room. Which statement best describes this process?',\\n\",\n       \"  'answerKey': 'D'}]\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"data.take(3)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Hey! it's working ;)! `choices` key are removed from the dataset!\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 5. Working? It's time to add to the ETL Registry\\n\",\n    \"> working great? it's time to move to how to add to the ETL Registry!\\n\",\n    \"[ETL_add_new_etl_process.ipynb](https://github.com/UpstageAI/dataverse/blob/main/guideline/etl/ETL_add_new_etl_process.ipynb)\\n\",\n    \"\\n\",\n    \"Check out the guideline from above notebook. and for preview here is the function template to add to the ETL Registry.\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"# before\\n\",\n    \"data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})\\n\",\n    \"\\n\",\n    \"# after\\n\",\n    \"def your___custom___etl_process(spark, data, *args, **kwargs):\\n\",\n    \"    # add your custom process here\\n\",\n    \"    # here we are going to simply remove 'choices' key\\n\",\n    \"    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})\\n\",\n    \"\\n\",\n    \"    return data\\n\",\n    \"```\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"llm\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.13\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/etl/ETL_04_add_new_etl_process.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# ETL add new etl process\\n\",\n    \"> Add your custom ETL process to the ETL pipeline.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Original ETL Pipeline \\n\",\n    \"> This is simple ETL pipeline to load huggingface dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: ETL\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___huggingface___hf2raw\\n\",\n      \"  args:\\n\",\n      \"    name_or_path:\\n\",\n      \"    - ai2_arc\\n\",\n      \"    - ARC-Challenge\\n\",\n      \"- name: data_save___huggingface___ufl2hf_obj\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"# load from dict\\n\",\n    \"ETL_config = OmegaConf.create({\\n\",\n    \"    'spark': {\\n\",\n    \"        'appname': 'ETL',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_ingestion___huggingface___hf2raw',\\n\",\n    \"            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}\\n\",\n    \"        },\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_save___huggingface___ufl2hf_obj'\\n\",\n    \"        }\\n\",\n    \"    ]\\n\",\n    \"})\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(ETL_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\",\n      \"23/11/14 19:26:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"23/11/14 19:26:56 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\",\n      \"23/11/14 19:26:56 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\\n\",\n      \"Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"application/vnd.jupyter.widget-view+json\": {\n       \"model_id\": \"0aeed70bb9b34721aa5f6e8abf72a85f\",\n       \"version_major\": 2,\n       \"version_minor\": 0\n      },\n      \"text/plain\": [\n       \"  0%|          | 0/3 [00:00<?, ?it/s]\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Downloading and preparing dataset spark/14056872 to /root/.cache/huggingface/datasets/spark/14056872/0.0.0...\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/14056872/0.0.0. Subsequent calls will reuse this data.\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['answerKey', 'choices', 'id', 'question'],\\n\",\n       \"    num_rows: 2590\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 2,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"\\n\",\n    \"# raw -> hf_obj\\n\",\n    \"spark, dataset = etl_pipeline.run(ETL_config)\\n\",\n    \"dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'answerKey': 'A',\\n\",\n       \" 'choices': {'text': ['loss of electrons.',\\n\",\n       \"   'loss of protons.',\\n\",\n       \"   'gain of electrons.',\\n\",\n       \"   'gain of protons.'],\\n\",\n       \"  'label': ['A', 'B', 'C', 'D']},\\n\",\n       \" 'id': 'Mercury_7029645',\\n\",\n       \" 'question': 'Metal atoms will most likely form ions by the'}\"\n      ]\n     },\n     \"execution_count\": 3,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dataset[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Add Custom ETL Process\\n\",\n    \"\\n\",\n    \"1. create your custom ETL process\\n\",\n    \"2. check ETL process is registered\\n\",\n    \"3. wrap it with `register_etl` decorator\\n\",\n    \"4. add your custom ETL process to the ETL config\\n\",\n    \"5. run the ETL pipeline\\n\",\n    \"\\n\",\n    \"Here you are going to make a simple \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from dataverse.etl import register_etl\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 1. create your custom ETL process\\n\",\n    \"\\n\",\n    \"- naming convention is `cate___sub-cate___name`\\n\",\n    \"    - e.g. `huggingface___dataset___load_dataset`\\n\",\n    \"- for input because we are using huggingface dataset `List[Dict]` format will be inserted\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"# ai2_arc format\\n\",\n    \"[\\n\",\n    \"    {\\n\",\n    \"        'id': ...,\\n\",\n    \"        'choices': ...,\\n\",\n    \"        'question': ...,\\n\",\n    \"        'answerKey': ...,\\n\",\n    \"    },\\n\",\n    \"    {...},\\n\",\n    \"    ...\\n\",\n    \"]\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Make a spark process assuming `List[Dict]` is given. Here we are simply going to remove `choices` key from each data point\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def your___custom___etl_process(spark, data, *args, **kwargs):\\n\",\n    \"    # add your custom process here\\n\",\n    \"    # here we are going to simply remove 'choices' key\\n\",\n    \"    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})\\n\",\n    \"\\n\",\n    \"    return data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 2. check ETL process is registered\\n\",\n    \"\\n\",\n    \"ETL Pipeline only runs registered ETL process\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"An error occurred (InvalidClientTokenId) when calling the GetCallerIdentity operation: The security token included in the request is invalid\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/root/anaconda3/envs/dataverse/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\\n\",\n      \"  from .autonotebook import tqdm as notebook_tqdm\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"==================================================\\n\",\n       \"Total [ 43 ]\\n\",\n       \"==================================================\\n\",\n       \"data_ingestion [ 16 ]\\n\",\n       \"deduplication [ 4 ]\\n\",\n       \"cleaning [ 13 ]\\n\",\n       \"pii [ 2 ]\\n\",\n       \"quality [ 1 ]\\n\",\n       \"data_load [ 4 ]\\n\",\n       \"utils [ 3 ]\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLRegistry \\n\",\n    \"\\n\",\n    \"# we can see our custom is not registered yet\\n\",\n    \"ETLRegistry()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 3. wrap it with `register_etl` decorator\\n\",\n    \"\\n\",\n    \"How to register your custom ETL process?\\n\",\n    \"Simply wrap it with `register_etl` decorator\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"@register_etl\\n\",\n    \"def your_custom_etl_process():\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"@register_etl\\n\",\n    \"def your___custom___etl_process(spark, data, *args, **kwargs):\\n\",\n    \"    # remove all text\\n\",\n    \"    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'choices'})\\n\",\n    \"\\n\",\n    \"    return data\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"==================================================\\n\",\n       \"Total [ 44 ]\\n\",\n       \"==================================================\\n\",\n       \"data_ingestion [ 16 ]\\n\",\n       \"deduplication [ 4 ]\\n\",\n       \"cleaning [ 13 ]\\n\",\n       \"pii [ 2 ]\\n\",\n       \"quality [ 1 ]\\n\",\n       \"data_load [ 4 ]\\n\",\n       \"utils [ 3 ]\\n\",\n       \"your [ 1 ]\"\n      ]\n     },\n     \"execution_count\": 8,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# you will see your custom etl is registered\\n\",\n    \"ETLRegistry()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 4. add your custom ETL process to the ETL config\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: ETL\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___huggingface___hf2raw\\n\",\n      \"  args:\\n\",\n      \"    name_or_path:\\n\",\n      \"    - ai2_arc\\n\",\n      \"    - ARC-Challenge\\n\",\n      \"- name: your___custom___etl_process\\n\",\n      \"- name: data_save___huggingface___ufl2hf_obj\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"# load from dict\\n\",\n    \"ETL_config = OmegaConf.create({\\n\",\n    \"    'spark': {\\n\",\n    \"        'appname': 'ETL',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_ingestion___huggingface___hf2raw',\\n\",\n    \"            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}\\n\",\n    \"        },\\n\",\n    \"\\n\",\n    \"        # ======== add your custom etl here ========\\n\",\n    \"        {\\n\",\n    \"            'name': 'your___custom___etl_process'\\n\",\n    \"        },\\n\",\n    \"        # ==========================================\\n\",\n    \"\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_save___huggingface___ufl2hf_obj'\\n\",\n    \"        }\\n\",\n    \"    ]\\n\",\n    \"})\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(ETL_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 5. run the ETL pipeline\\n\",\n    \"\\n\",\n    \"You can check that ETL process you added customly works great and `choices` are removed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/11/14 19:27:13 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\",\n      \"23/11/14 19:27:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"application/vnd.jupyter.widget-view+json\": {\n       \"model_id\": \"7ef3d804674d408ba6696c00e6e58bd1\",\n       \"version_major\": 2,\n       \"version_minor\": 0\n      },\n      \"text/plain\": [\n       \"  0%|          | 0/3 [00:00<?, ?it/s]\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Downloading and preparing dataset spark/1082445423 to /root/.cache/huggingface/datasets/spark/1082445423/0.0.0...\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/1082445423/0.0.0. Subsequent calls will reuse this data.\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['answerKey', 'id', 'question'],\\n\",\n       \"    num_rows: 2590\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 10,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"\\n\",\n    \"# raw -> hf_obj\\n\",\n    \"spark, dataset = etl_pipeline.run(ETL_config)\\n\",\n    \"dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'answerKey': 'A',\\n\",\n       \" 'id': 'Mercury_7029645',\\n\",\n       \" 'question': 'Metal atoms will most likely form ions by the'}\"\n      ]\n     },\n     \"execution_count\": 11,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dataset[0]\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"llm\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.13\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/etl/ETL_05_test_etl_process.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# ETL test etl process\\n\",\n    \"> when you want to get `test`(sample) data to quickly test your ETL process, or need `data from a certain point` to test your ETL process, you can check how to do it here.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Get `test`(sample) data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 get `test`(sample) data `w/o config`\\n\",\n    \"> when you have created a ETL process and don't wanna set config from the scratch here is a quick way to get the sample data\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\",\n      \"23/11/14 19:37:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"total data # : 100\\n\",\n      \"sample data :\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'id': 'e2ce9284-8691-471b-88e3-ba29a5888fd1',\\n\",\n       \"  'name': 'test_fake_ufl',\\n\",\n       \"  'text': 'Simple toward doctor any. Rich name reality bad family. Gas mind even important stay describe official.\\\\nThere recognize campaign wind on. Drop sport however central read.',\\n\",\n       \"  'meta': '{\\\"name\\\": \\\"Amanda Ross\\\", \\\"age\\\": 60, \\\"address\\\": \\\"302 Rebecca Camp\\\\\\\\nPatrickborough, CT 40755\\\", \\\"job\\\": \\\"Broadcast engineer\\\"}'}]\"\n      ]\n     },\n     \"execution_count\": 1,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"spark, data = etl_pipeline.sample()\\n\",\n    \"\\n\",\n    \"# default sampling will return 100 `ufl` data\\n\",\n    \"print(f\\\"total data # : {data.count()}\\\")\\n\",\n    \"print(f\\\"sample data :\\\")\\n\",\n    \"data.take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"when you want to increase the sample size do the following\\n\",\n    \"```python\\n\",\n    \"spark, data = etl_pipeline.sample(n=10000)\\n\",\n    \"spark, data = etl_pipeline.sample(10000)\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"total data # : 10000\\n\",\n      \"sample data :\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'id': '79081a73-5c82-432d-bf4a-f7de8bf59d12',\\n\",\n       \"  'name': 'test_fake_ufl',\\n\",\n       \"  'text': 'Serious teacher follow they entire between. Far see issue view throughout order field.\\\\nWant senior sell amount picture. Tree cell low edge.',\\n\",\n       \"  'meta': '{\\\"name\\\": \\\"Jack Yoder\\\", \\\"age\\\": 75, \\\"address\\\": \\\"083 Diana Parkway Suite 438\\\\\\\\nLake Amberport, AS 76996\\\", \\\"job\\\": \\\"Haematologist\\\"}'}]\"\n      ]\n     },\n     \"execution_count\": 2,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"spark, data = etl_pipeline.sample(10000)\\n\",\n    \"print(f\\\"total data # : {data.count()}\\\")\\n\",\n    \"print(f\\\"sample data :\\\")\\n\",\n    \"data.take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 get `test`(sample) data `w/ config`\\n\",\n    \"> this might took some time to get the data but you can choose your own data\\n\",\n    \"- this was also introduced in `ETL_03_create_new_etl_process.ipynb`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Getting sample data `you want`\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: ETL\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___huggingface___hf2raw\\n\",\n      \"  args:\\n\",\n      \"    name_or_path:\\n\",\n      \"    - ai2_arc\\n\",\n      \"    - ARC-Challenge\\n\",\n      \"- name: utils___sampling___random\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"# load from dict\\n\",\n    \"ETL_config = OmegaConf.create({\\n\",\n    \"    'spark': {\\n\",\n    \"        'appname': 'ETL',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_ingestion___huggingface___hf2raw',\\n\",\n    \"            'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}\\n\",\n    \"        },\\n\",\n    \"        {'name': 'utils___sampling___random'}\\n\",\n    \"    ]\\n\",\n    \"})\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(ETL_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/11/14 19:38:01 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Found cached dataset ai2_arc (/root/.cache/huggingface/datasets/ai2_arc/ARC-Challenge/1.0.0/1569c2591ea2683779581d9fb467203d9aa95543bb9b75dcfde5da92529fd7f6)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"application/vnd.jupyter.widget-view+json\": {\n       \"model_id\": \"efc259f86fec4f76a1165f661ebf13d2\",\n       \"version_major\": 2,\n       \"version_minor\": 0\n      },\n      \"text/plain\": [\n       \"  0%|          | 0/3 [00:00<?, ?it/s]\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"total data # : 280\\n\",\n      \"sample data :\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'id': 'Mercury_7029645',\\n\",\n       \"  'question': 'Metal atoms will most likely form ions by the',\\n\",\n       \"  'choices': Row(text=['loss of electrons.', 'loss of protons.', 'gain of electrons.', 'gain of protons.'], label=['A', 'B', 'C', 'D']),\\n\",\n       \"  'answerKey': 'A'}]\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"spark, data = etl_pipeline.run(ETL_config)\\n\",\n    \"print(f\\\"total data # : {data.count()}\\\")\\n\",\n    \"print(f\\\"sample data :\\\")\\n\",\n    \"data.take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Test your ETL process\\n\",\n    \"> its time to test your ETL process with the sample data. define ETL process and run it\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/11/14 19:38:06 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"total data # : 100\\n\",\n      \"sample data :\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'id': 'eec9b075-b786-454c-a398-f69d8cf39739',\\n\",\n       \"  'name': 'test_fake_ufl',\\n\",\n       \"  'text': 'Country toward ago old right.\\\\nNewspaper hotel although short. Hair actually building.\\\\nWe build then blue hundred perform wall.',\\n\",\n       \"  'meta': '{\\\"name\\\": \\\"Michael Aguirre\\\", \\\"age\\\": 18, \\\"address\\\": \\\"8324 Jennings Road Apt. 378\\\\\\\\nLatoyahaven, MT 27716\\\", \\\"job\\\": \\\"Television camera operator\\\"}'}]\"\n      ]\n     },\n     \"execution_count\": 5,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"from dataverse.etl import register_etl\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"\\n\",\n    \"# get sample data\\n\",\n    \"spark, data = etl_pipeline.sample()\\n\",\n    \"print(f\\\"total data # : {data.count()}\\\")\\n\",\n    \"print(f\\\"sample data :\\\")\\n\",\n    \"data.take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"@register_etl\\n\",\n    \"def test___your___etl_process(spark, data, *args, **kwargs):\\n\",\n    \"    # add your custom process here\\n\",\n    \"    # here we are going to simply remove 'id' key\\n\",\n    \"    data = data.map(lambda x: {k: v for k, v in x.items() if k != 'id'})\\n\",\n    \"\\n\",\n    \"    return data\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'name': 'test_fake_ufl',\\n\",\n       \"  'text': 'Country toward ago old right.\\\\nNewspaper hotel although short. Hair actually building.\\\\nWe build then blue hundred perform wall.',\\n\",\n       \"  'meta': '{\\\"name\\\": \\\"Michael Aguirre\\\", \\\"age\\\": 18, \\\"address\\\": \\\"8324 Jennings Road Apt. 378\\\\\\\\nLatoyahaven, MT 27716\\\", \\\"job\\\": \\\"Television camera operator\\\"}'}]\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# test right away\\n\",\n    \"# - successfully removed `id` key\\n\",\n    \"etl = test___your___etl_process\\n\",\n    \"etl()(spark, data).take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'name': 'test_fake_ufl',\\n\",\n       \"  'text': 'Country toward ago old right.\\\\nNewspaper hotel although short. Hair actually building.\\\\nWe build then blue hundred perform wall.',\\n\",\n       \"  'meta': '{\\\"name\\\": \\\"Michael Aguirre\\\", \\\"age\\\": 18, \\\"address\\\": \\\"8324 Jennings Road Apt. 378\\\\\\\\nLatoyahaven, MT 27716\\\", \\\"job\\\": \\\"Television camera operator\\\"}'}]\"\n      ]\n     },\n     \"execution_count\": 8,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# test it is registered by calling it from etl_pipeline\\n\",\n    \"# - successfully removed `id` key\\n\",\n    \"etl = etl_pipeline.get('test___your___etl_process')\\n\",\n    \"etl()(spark, data).take(1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Experiments on the data itself\\n\",\n    \"> there is no chosen way to use this `test`(sample) data. you can do whatever you want with it. here are some examples\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[{'id': 'eec9b075-b786-454c-a398-f69d8cf39739',\\n\",\n       \"  'name': 'test_fake_ufl',\\n\",\n       \"  'text': 'Country toward ago old right.\\\\nNewspaper hotel although short. Hair actually building.\\\\nWe build then blue hundred perform wall.',\\n\",\n       \"  'meta': '{\\\"name\\\": \\\"Michael Aguirre\\\", \\\"age\\\": 18, \\\"address\\\": \\\"8324 Jennings Road Apt. 378\\\\\\\\nLatoyahaven, MT 27716\\\", \\\"job\\\": \\\"Television camera operator\\\"}',\\n\",\n       \"  'duck': 'is quarking (physics)'}]\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"data.map(lambda x: {**x, 'duck': 'is quarking (physics)'}).take(1)\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"llm\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.11\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/etl/ETL_06_scaleout_with_EMR.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# ETL scaleout with EMR\\n\",\n    \"> when you have money but don't have enough device to process your data, it's time to use EMR\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Set AWS Credentials\\n\",\n    \"> This notebook assumes that you have already set your AWS credentials in your local machine. If not, please follow the steps below to set your AWS credentials.\\n\",\n    \"\\n\",\n    \"```bash\\n\",\n    \"aws configure\\n\",\n    \"    - key: <your access key>\\n\",\n    \"    - secret: <your secret key>\\n\",\n    \"    - region: <your region>\\n\",\n    \"aws configure set aws_session_token <your session token>\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"AWS credentials are valid\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from dataverse.utils.api import aws_check_credentials \\n\",\n    \"\\n\",\n    \"# check aws credentials\\n\",\n    \"# NOTE: `True` means credentials are valid\\n\",\n    \"if aws_check_credentials() == True:\\n\",\n    \"    print(\\\"AWS credentials are valid\\\")\\n\",\n    \"else:\\n\",\n    \"    raise Exception(\\\"AWS credentials are invalid\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Set up Temporary Data & Environment\\n\",\n    \"> Here you don't need to prepare any data. We will create a temporary data and set temporary environment for you.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Create Temporary Folder at Local & AWS S3\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import tempfile\\n\",\n    \"import uuid\\n\",\n    \"\\n\",\n    \"from dataverse.utils.api import aws_s3_upload\\n\",\n    \"from dataverse.utils.api import aws_s3_create_bucket\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"# create temp local & s3 path\\n\",\n    \"tmp_folder = tempfile.TemporaryDirectory()\\n\",\n    \"tmp_bucket = uuid.uuid4().hex\\n\",\n    \"\\n\",\n    \"aws_s3_create_bucket(bucket=tmp_bucket)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Create Temporary Data and upload to Local & AWS S3\\n\",\n    \"> Data will be duplicated\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"import pandas as pd\\n\",\n    \"from dataverse.utils.api import aws_s3_upload\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"# create sample data and upload to s3\\n\",\n    \"sample_path = os.path.join(tmp_folder.name, 'duplicate.json')\\n\",\n    \"\\n\",\n    \"# create ufl data that has duplication\\n\",\n    \"ufl = [\\n\",\n    \"    {'text': \\\"random text\\\\nduplication\\\"},\\n\",\n    \"    {'text': \\\"fixed text\\\\nduplication\\\"},\\n\",\n    \"    {'text': \\\"fixed text\\\\nduplication\\\\nDUPLICATION\\\"},\\n\",\n    \"]\\n\",\n    \"df = pd.DataFrame(ufl)\\n\",\n    \"df.to_parquet(sample_path)\\n\",\n    \"\\n\",\n    \"bucket = aws_s3_upload(\\n\",\n    \"    bucket=tmp_bucket,\\n\",\n    \"    key='duplicate.json',\\n\",\n    \"    local_path=sample_path,\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Temporary Dynamic ETL\\n\",\n    \"> To show you that you can add temporal dynamic ETL \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Writing dynamic_etl.py\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"%%writefile dynamic_etl.py\\n\",\n    \"from dataverse.etl import register_etl\\n\",\n    \"from pyspark.rdd import RDD\\n\",\n    \"from pyspark.sql import DataFrame\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"@register_etl\\n\",\n    \"def test___add___one(spark, data, subset='text', *args, **kwargs):\\n\",\n    \"    if isinstance(data, DataFrame):\\n\",\n    \"        data = data.rdd\\n\",\n    \"        data = data.map(lambda row: row.asDict())\\n\",\n    \"\\n\",\n    \"    def _add_one(row):\\n\",\n    \"        row[subset] = row[subset] + '1'\\n\",\n    \"        return row\\n\",\n    \"\\n\",\n    \"    data = data.map(_add_one)\\n\",\n    \"\\n\",\n    \"    return data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Create Temporary Config\\n\",\n    \"- load parquet from s3\\n\",\n    \"- exact deduplicate by line splitted by newline\\n\",\n    \"- add `1` text at the end of each data `text`\\n\",\n    \"- save as parquet to s3\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Detected Dataverse Bucket: dataverse-dv42-d853ea88-c87d-486f-b3b5-d780203bc262\\n\",\n      \"spark:\\n\",\n      \"  master: local[10]\\n\",\n      \"  appname: default\\n\",\n      \"  driver:\\n\",\n      \"    memory: 8G\\n\",\n      \"    maxResultSize: 2G\\n\",\n      \"  executor:\\n\",\n      \"    memory: 1G\\n\",\n      \"  local:\\n\",\n      \"    dir: /root/.cache/dataverse/tmp\\n\",\n      \"  ui:\\n\",\n      \"    port: 4040\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___parquet___pq2ufl\\n\",\n      \"  args:\\n\",\n      \"    path: s3a://581f4bedcaf24703b248e73d4ecefabd/duplicate.json\\n\",\n      \"    repartition: 1\\n\",\n      \"- name: deduplication___common_crawl___exact_line\\n\",\n      \"- name: test___add___one\\n\",\n      \"- name: data_load___parquet___ufl2parquet\\n\",\n      \"  args:\\n\",\n      \"    save_path: s3a://581f4bedcaf24703b248e73d4ecefabd/deduplicate.parquet\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from dataverse.config import Config\\n\",\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"load_path = f\\\"s3a://{tmp_bucket}/duplicate.json\\\"\\n\",\n    \"save_path = f\\\"s3a://{tmp_bucket}/deduplicate.parquet\\\"\\n\",\n    \"\\n\",\n    \"config = Config.default()\\n\",\n    \"config.etl.append({\\n\",\n    \"    'name': 'data_ingestion___parquet___pq2ufl',\\n\",\n    \"    'args': {\\n\",\n    \"        'path': load_path,\\n\",\n    \"        'repartition': 1\\n\",\n    \"    }}\\n\",\n    \")\\n\",\n    \"config.etl.append({'name': 'deduplication___common_crawl___exact_line'})\\n\",\n    \"config.etl.append({'name': 'test___add___one'})\\n\",\n    \"config.etl.append({\\n\",\n    \"    'name': 'data_load___parquet___ufl2parquet',\\n\",\n    \"    'args': {'save_path': save_path}})\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 ETLPipeline with `Local`\\n\",\n    \"> We will test our ETL pipeline with local machine first\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Import `dynamic_etl.py` to add custom ETL\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Detected Dataverse Bucket: dataverse-dv42-d853ea88-c87d-486f-b3b5-d780203bc262\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# you can import before running the etl\\n\",\n    \"import dynamic_etl\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 run ETL Pipeline with Local machine\\n\",\n    \"> as the config specified\\n\",\n    \"\\n\",\n    \"- we will load data from s3\\n\",\n    \"- exact deduplicate by line splitted by newline\\n\",\n    \"- add `1` text at the end of each data `text`\\n\",\n    \"- and save as parquet to s3\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark conf is set with [ temporary ] S3 credentials\\n\",\n      \":: loading settings :: url = jar:file:/data/project/private/ducky/anaconda3/envs/llm/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Ivy Default Cache set to: /root/.ivy2/cache\\n\",\n      \"The jars for the packages stored in: /root/.ivy2/jars\\n\",\n      \"org.apache.hadoop#hadoop-aws added as a dependency\\n\",\n      \"com.amazonaws#aws-java-sdk-bundle added as a dependency\\n\",\n      \":: resolving dependencies :: org.apache.spark#spark-submit-parent-bbfd8d7f-e9d3-48d9-b3a0-6ac07189c03d;1.0\\n\",\n      \"\\tconfs: [default]\\n\",\n      \"\\tfound org.apache.hadoop#hadoop-aws;3.3.4 in central\\n\",\n      \"\\tfound org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central\\n\",\n      \"\\tfound com.amazonaws#aws-java-sdk-bundle;1.12.592 in central\\n\",\n      \":: resolution report :: resolve 128ms :: artifacts dl 4ms\\n\",\n      \"\\t:: modules in use:\\n\",\n      \"\\tcom.amazonaws#aws-java-sdk-bundle;1.12.592 from central in [default]\\n\",\n      \"\\torg.apache.hadoop#hadoop-aws;3.3.4 from central in [default]\\n\",\n      \"\\torg.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]\\n\",\n      \"\\t:: evicted modules:\\n\",\n      \"\\tcom.amazonaws#aws-java-sdk-bundle;1.12.262 by [com.amazonaws#aws-java-sdk-bundle;1.12.592] in [default]\\n\",\n      \"\\t---------------------------------------------------------------------\\n\",\n      \"\\t|                  |            modules            ||   artifacts   |\\n\",\n      \"\\t|       conf       | number| search|dwnlded|evicted|| number|dwnlded|\\n\",\n      \"\\t---------------------------------------------------------------------\\n\",\n      \"\\t|      default     |   4   |   0   |   0   |   1   ||   3   |   0   |\\n\",\n      \"\\t---------------------------------------------------------------------\\n\",\n      \":: retrieving :: org.apache.spark#spark-submit-parent-bbfd8d7f-e9d3-48d9-b3a0-6ac07189c03d\\n\",\n      \"\\tconfs: [default]\\n\",\n      \"\\t0 artifacts copied, 3 already retrieved (0kB/4ms)\\n\",\n      \"23/12/14 21:39:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\",\n      \"23/12/14 21:39:58 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\",\n      \"23/12/14 21:40:01 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties\\n\",\n      \"23/12/14 21:40:06 WARN BlockManager: Block rdd_20_0 already exists on this machine; not re-adding it\\n\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"spark, data = etl_pipeline.run(config=config)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 download data from s3 and check the result\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"s3a://581f4bedcaf24703b248e73d4ecefabd/deduplicate.parquet\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# aws s3 path\\n\",\n    \"print(save_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>text</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>random text\\\\nduplication1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>fixed text1</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"                        text\\n\",\n       \"0  random text\\\\nduplication1\\n\",\n       \"1                fixed text1\"\n      ]\n     },\n     \"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.utils.api import aws_s3_path_parse\\n\",\n    \"from dataverse.utils.api import aws_s3_download\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"bucket, key = aws_s3_path_parse(save_path)\\n\",\n    \"aws_s3_download(\\n\",\n    \"    bucket=bucket,\\n\",\n    \"    key=key,\\n\",\n    \"    local_path=os.path.join(tmp_folder.name, 'deduplicate.parquet'),\\n\",\n    \")\\n\",\n    \"pd.read_parquet(os.path.join(tmp_folder.name, 'deduplicate.parquet'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Remove Result at local & AWS S3\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import shutil\\n\",\n    \"from dataverse.utils.api import aws_s3_delete\\n\",\n    \"\\n\",\n    \"# remove saved deduplicate.parquet\\n\",\n    \"shutil.rmtree(os.path.join(tmp_folder.name, 'deduplicate.parquet'))\\n\",\n    \"aws_s3_delete(bucket=tmp_bucket, key='deduplicate.parquet')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 ETLPipeline with `EMR`\\n\",\n    \"> Works good? Let's scale out with EMR!\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 run ETL Pipeline with EMR Machine\\n\",\n    \"> add `emr=True` to ETL pipeline. that's all! Auto handle EMR cluster for you!\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"- set `verbose=True` to see the log of EMR cluster\\n\",\n    \"- return value `data` will be returned as config set by `Dataverse` EMR Manager\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"# before - local\\n\",\n    \"spark, data = etl_pipeline(config)\\n\",\n    \"\\n\",\n    \"# after - EMR\\n\",\n    \"spark, config = etl_pipeline(config, emr=True)\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"================================================================================\\n\",\n      \"Default instance type is [ c5.xlarge ]\\n\",\n      \"================================================================================\\n\",\n      \" vCPU: 4\\n\",\n      \" Memory: 8192\\n\",\n      \" Price: 0.088100\\n\",\n      \"================================================================================\\n\",\n      \"\\n\",\n      \"[ Dataverse ] step status: COMPLETED. Done.\\n\",\n      \"DependencyViolation occured when terminating EMR cluster. Retrying one more time\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"spark, config = etl_pipeline.run(config=config, emr=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 download data from s3 and check the result\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>text</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>random text\\\\nduplication1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>fixed text1</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"                        text\\n\",\n       \"0  random text\\\\nduplication1\\n\",\n       \"1                fixed text1\"\n      ]\n     },\n     \"execution_count\": 15,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.utils.api import aws_s3_path_parse\\n\",\n    \"from dataverse.utils.api import aws_s3_download\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"bucket, key = aws_s3_path_parse(save_path)\\n\",\n    \"aws_s3_download(\\n\",\n    \"    bucket=bucket,\\n\",\n    \"    key=key,\\n\",\n    \"    local_path=os.path.join(tmp_folder.name, 'deduplicate.parquet'),\\n\",\n    \")\\n\",\n    \"pd.read_parquet(os.path.join(tmp_folder.name, 'deduplicate.parquet'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 Remove Result at local & AWS S3\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import shutil\\n\",\n    \"from dataverse.utils.api import aws_s3_delete\\n\",\n    \"\\n\",\n    \"# remove saved deduplicate.parquet\\n\",\n    \"shutil.rmtree(os.path.join(tmp_folder.name, 'deduplicate.parquet'))\\n\",\n    \"aws_s3_delete(bucket=tmp_bucket, key='deduplicate.parquet')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Set `EMR` custom config\\n\",\n    \"> Wanna customize your EMR cluster? Let's do it!\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"from dataverse.config import Config\\n\",\n    \"\\n\",\n    \"# if you have your own EMR cluster, you can set your own EMR cluster config\\n\",\n    \"config = Config.default(emr=True)\\n\",\n    \"config.emr.id = 'j-XXXXXXXXXXXXX'(your emr cluster id)\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  master: local[10]\\n\",\n      \"  appname: default\\n\",\n      \"  driver:\\n\",\n      \"    memory: 8G\\n\",\n      \"    maxResultSize: 2G\\n\",\n      \"  executor:\\n\",\n      \"    memory: 1G\\n\",\n      \"  local:\\n\",\n      \"    dir: /root/.cache/dataverse/tmp\\n\",\n      \"  ui:\\n\",\n      \"    port: 4040\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___parquet___pq2ufl\\n\",\n      \"  args:\\n\",\n      \"    path: s3a://576768809f8a4181b034ef7921613d41/duplicate.json\\n\",\n      \"    repartition: 1\\n\",\n      \"- name: deduplication___common_crawl___exact_line\\n\",\n      \"- name: test___add___one\\n\",\n      \"- name: data_load___parquet___ufl2parquet\\n\",\n      \"  args:\\n\",\n      \"    save_path: s3a://576768809f8a4181b034ef7921613d41/deduplicate.parquet\\n\",\n      \"emr:\\n\",\n      \"  id: null\\n\",\n      \"  working_dir: null\\n\",\n      \"  name: dataverse_emr\\n\",\n      \"  release: emr-6.15.0\\n\",\n      \"  idle_timeout: 3600\\n\",\n      \"  master_instance:\\n\",\n      \"    type: null\\n\",\n      \"  core_instance:\\n\",\n      \"    type: null\\n\",\n      \"    count: 5\\n\",\n      \"  task_instance:\\n\",\n      \"    type: null\\n\",\n      \"    count: 0\\n\",\n      \"  auto_generated: null\\n\",\n      \"  role:\\n\",\n      \"    ec2:\\n\",\n      \"      name: null\\n\",\n      \"      policy_arns: null\\n\",\n      \"    emr:\\n\",\n      \"      name: null\\n\",\n      \"      policy_arns: null\\n\",\n      \"  instance_profile:\\n\",\n      \"    name: null\\n\",\n      \"    ec2_role: null\\n\",\n      \"  vpc:\\n\",\n      \"    id: null\\n\",\n      \"  subnet:\\n\",\n      \"    id: null\\n\",\n      \"    public_id: null\\n\",\n      \"    private_id: null\\n\",\n      \"    public: true\\n\",\n      \"  security_group:\\n\",\n      \"    id: null\\n\",\n      \"  gateway:\\n\",\n      \"    id: null\\n\",\n      \"  route_table:\\n\",\n      \"    id: null\\n\",\n      \"  elastic_ip:\\n\",\n      \"    id: null\\n\",\n      \"  nat_gateway:\\n\",\n      \"    id: null\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from dataverse.config import Config\\n\",\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"load_path = f\\\"s3a://{tmp_bucket}/duplicate.json\\\"\\n\",\n    \"save_path = f\\\"s3a://{tmp_bucket}/deduplicate.parquet\\\"\\n\",\n    \"\\n\",\n    \"# TODO: add `emr=True` to get the emr config\\n\",\n    \"# =========================================\\n\",\n    \"config = Config.default(emr=True)\\n\",\n    \"# =========================================\\n\",\n    \"\\n\",\n    \"config.etl.append({\\n\",\n    \"    'name': 'data_ingestion___parquet___pq2ufl',\\n\",\n    \"    'args': {\\n\",\n    \"        'path': load_path,\\n\",\n    \"        'repartition': 1\\n\",\n    \"    }}\\n\",\n    \")\\n\",\n    \"config.etl.append({'name': 'deduplication___common_crawl___exact_line'})\\n\",\n    \"config.etl.append({'name': 'test___add___one'})\\n\",\n    \"config.etl.append({\\n\",\n    \"    'name': 'data_load___parquet___ufl2parquet',\\n\",\n    \"    'args': {'save_path': save_path}})\\n\",\n    \"\\n\",\n    \"# TODO: add `emr=True` to get the emr config\\n\",\n    \"# =========================================\\n\",\n    \"config.emr.core_instance.count = 5\\n\",\n    \"\\n\",\n    \"# TODO: there are more config options for emr\\n\",\n    \"#       check `dataverse.config.Config.default`\\n\",\n    \"# =========================================\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"spark, config = etl_pipeline.run(config=config, emr=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 download data from s3 and check the result\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>text</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>random text\\\\nduplication1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>fixed text1</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"                        text\\n\",\n       \"0  random text\\\\nduplication1\\n\",\n       \"1                fixed text1\"\n      ]\n     },\n     \"execution_count\": 8,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.utils.api import aws_s3_path_parse\\n\",\n    \"from dataverse.utils.api import aws_s3_download\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"bucket, key = aws_s3_path_parse(save_path)\\n\",\n    \"aws_s3_download(\\n\",\n    \"    bucket=bucket,\\n\",\n    \"    key=key,\\n\",\n    \"    local_path=os.path.join(tmp_folder.name, 'deduplicate.parquet'),\\n\",\n    \")\\n\",\n    \"pd.read_parquet(os.path.join(tmp_folder.name, 'deduplicate.parquet'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Remove Temporary Data & Environment\\n\",\n    \"> it's time to clean up\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from dataverse.utils.api import aws_s3_delete\\n\",\n    \"from dataverse.utils.api import aws_s3_delete_bucket\\n\",\n    \"\\n\",\n    \"!rm dynamic_etl.py\\n\",\n    \"\\n\",\n    \"# remove temp folder\\n\",\n    \"tmp_folder.cleanup()\\n\",\n    \"\\n\",\n    \"# remove temp bucket\\n\",\n    \"aws_s3_delete(bucket=tmp_bucket, key='')\\n\",\n    \"aws_s3_delete_bucket(bucket=tmp_bucket)\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"llm\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.11\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/etl/EX_use_common_crawl_data.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Use Common Crawl Data\\n\",\n    \"> How to use common crawl data? There is 2 ways to achieve this\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Dump-ID\\n\",\n    \"> common crawl dump id related to the date of the crawl. ex: 2023-23\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: CommonCrawl\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___common_crawl___dump2raw\\n\",\n      \"  args:\\n\",\n      \"    dump: 2023-23\\n\",\n      \"    segment_n: 1\\n\",\n      \"- name: data_ingestion___common_crawl___raw2ufl\\n\",\n      \"- name: cleaning___normalization___number\\n\",\n      \"- name: deduplication___common_crawl___exact_line\\n\",\n      \"- name: quality___language___fasttext_filter\\n\",\n      \"  args:\\n\",\n      \"    whitelist:\\n\",\n      \"    - ko\\n\",\n      \"    threshold: 0.5\\n\",\n      \"- name: data_save___huggingface___ufl2hf_obj\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"# load from dict\\n\",\n    \"ETL_config = OmegaConf.create({\\n\",\n    \"    'spark': {\\n\",\n    \"        'appname': 'CommonCrawl',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_ingestion___common_crawl___dump2raw',\\n\",\n    \"            'args': {\\n\",\n    \"                'dump': \\\"2023-23\\\",\\n\",\n    \"                'segment_n': 1,\\n\",\n    \"            }\\n\",\n    \"        },\\n\",\n    \"        {'name': 'data_ingestion___common_crawl___raw2ufl'},\\n\",\n    \"        {'name': 'cleaning___normalization___number'},\\n\",\n    \"        {'name': 'deduplication___common_crawl___exact_line'},\\n\",\n    \"        {\\n\",\n    \"            'name': 'quality___language___fasttext_filter',\\n\",\n    \"            'args': {\\n\",\n    \"                'whitelist': ['ko'],\\n\",\n    \"                'threshold': 0.5,\\n\",\n    \"            }\\n\",\n    \"        },\\n\",\n    \"        {'name': 'data_save___huggingface___ufl2hf_obj'}\\n\",\n    \"    ]\\n\",\n    \"})\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(ETL_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\",\n      \"23/11/14 22:09:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"23/11/14 22:09:41 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\",\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Downloading and preparing dataset spark/-572665896 to /root/.cache/huggingface/datasets/spark/-572665896/0.0.0...\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/-572665896/0.0.0. Subsequent calls will reuse this data.\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['id', 'meta', 'name', 'text'],\\n\",\n       \"    num_rows: 292\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 2,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"\\n\",\n    \"# raw -> hf_obj\\n\",\n    \"spark, dataset = etl_pipeline.run(ETL_config)\\n\",\n    \"dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'id': '19ee2ac082ef11eeae4262800acfdc4f',\\n\",\n       \" 'meta': '{\\\"title\\\": \\\"\\\\\\\\uc640\\\\\\\\uae00\\\\\\\\uc640\\\\\\\\uae00 - \\\\\\\\uc7ac\\\\\\\\ubbf8\\\", \\\"url\\\": \\\"http://wagle.isplus.joins.com/app/index.php?mid=wg_fun&page=6\\\", \\\"date_download\\\": \\\"2023-06-05T00:45:09Z\\\", \\\"digest\\\": \\\"sha1:UDASCLMI7FRAUR5PKBHJZ6DZSBZPZTFI\\\", \\\"length\\\": 2557, \\\"nlines\\\": 45, \\\"source_domain\\\": \\\"wagle.isplus.joins.com\\\", \\\"cc_segment\\\": \\\"crawl-data/CC-MAIN-2023-23/segments/1685224650409.64/wet/CC-MAIN-20230604225057-20230605015057-00644.warc.wet.gz\\\"}',\\n\",\n       \" 'name': 'common_crawl',\\n\",\n       \" 'text': \\\"조인스\\\\n와글와글 전체 목록\\\\n조회\\\\n0000 매경기 재평가되는 맨유짤 더레즈 0000-00-00\\\\n0000 아스날이 0경기 0승을 한 이유? 구0000너 0000-00-00\\\\n0000 은근 축구 혼자서 다하는 선수 풋스타 0000-00-00\\\\n0000 [오피셜] 아스날, 리그 0위로 0라운드 종료 아스날아.. 0000-00-00\\\\n0000 [놀람] 놀랄 수 밖에 없는 첼시 선발라인업 케파멘디 0000-00-00\\\\n0000 [감동] 분데스리가 00번의 과거와 미래 포항항 0000-00-00\\\\n0000 ???:너네들 재미있어보이네~ 에밀홀딩 0000-00-00\\\\n0000 [정보]0000년 0회 이상 우승한팀 어우뮌x0 0000-00-00\\\\n0000 커뮤니티실드에서의 리버풀 해리킼웰 0000-00-00\\\\n0000 [유머] ?? : 아스날... 생각보다 강팀이잖아..? 금발롱 0000-00-00\\\\n0000 (감동)??:우....승...뭐라고? 티아구메.. 0000-00-00\\\\n0000 ????:우승팀이 이정도라니 나설필요가 없겠는걸 '질'? 0년0우승.. 0000-00-00\\\\n0000 위닝의 저주(?) 베르바턴 0000-00-00\\\\n0000 [유머] 현시점 최강팀 보이빕 0000-00-00\\\\n0000 다시보는 바르셀로나 보드진 영입 큰일은바.. 0000-00-00\\\\n0000 “축구의 신” 뮌헨콜라 0000-00-00\\\\n0000 라리가 0형제.jpg 헤르니고르 0000-00-00\\\\n0000 ???: 어이! 바르샤, 한잔 해~! 사비에르 0000-00-00\\\\n0000 그래도 아직 레바뮌 맞지 ㅋㅋ 킴미희 0000-00-00\\\\n0000 뮌헨-돌문-맨유 내리갈굼 퓰리식혜 0000-00-00\\\\n0000 최근 펩과르디올라 챔스 성적ㅋㅋ 펩몬드 0000-00-00\\\\n0000 ?????: 만나서 반갑다 00000 후안펩시 0000-00-00\\\\n0000 챔스 아탈란타 상대로 유일하게 클린시트한 키퍼 누구? 안녕카일 0000-00-00\\\\n0000 PSG VS 아탈란타 네이마르 요약 축신마르 0000-00-00\\\\n0000 맨유팬의 불타는 행복회로 에덴하자드 0000-00-00\\\\n0000 최근 경기당 0골씩 넣고 있는 축구선수 썬가드 0000-00-00\\\\n0000 [유머] 램파드 인터뷰떴다! 램램반장 0000-00-00\\\\n0000 [감동] ??? : 우...승...뭐라고? 아스날팬임 0000-00-00\\\\n0000 ???: 야 토트넘! 라이트구너 0000-00-00\\\\n0000 드디어 완성된 PL 빅0 사네는뮌.. 0000-00-00\\\\n0000 [유머] 와 아스날 매수ㅋㅋㅋ 김티어니 0000-00-00\\\\n0000 ???: 하늘은 왜 아스날을 낳고 베베루니 0000-00-00\\\\n0000 00/00시즌 린가드 유니버스 끝 린가디 0000-00-00\\\\n0000 0류 맨유 팬카페 맨유더마스 0000-00-00\\\\n0000 프리미어리그 운명공동체 0팀 AlMacdo 0000-00-00\\\\n쓰기\\\"}\"\n      ]\n     },\n     \"execution_count\": 3,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dataset[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 WET folder\\n\",\n    \"> use pre-downloaded WET files\\n\",\n    \"\\n\",\n    \"We are going to use the cache common crawl as we just downloaded while processing dump-id ETL example right before. Time to use it!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from dataverse.utils.setting import SystemSetting\\n\",\n    \"from pathlib import Path\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"An error occurred (InvalidClientTokenId) when calling the GetCallerIdentity operation: The security token included in the request is invalid\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"wet_path = Path(SystemSetting().CACHE_DIR) / '.cache' / 'dataverse' / 'dataset' / 'common_crawl_2023-23'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: CommonCrawl\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___common_crawl___wet2raw\\n\",\n      \"  args:\\n\",\n      \"    wet_path: /root/.cache/dataverse/dataset/common_crawl_2023-23\\n\",\n      \"- name: data_ingestion___common_crawl___raw2ufl\\n\",\n      \"- name: cleaning___normalization___number\\n\",\n      \"- name: deduplication___common_crawl___exact_line\\n\",\n      \"- name: quality___language___fasttext_filter\\n\",\n      \"  args:\\n\",\n      \"    whitelist:\\n\",\n      \"    - ko\\n\",\n      \"    threshold: 0.5\\n\",\n      \"- name: data_save___huggingface___ufl2hf_obj\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"# load from dict\\n\",\n    \"ETL_config = OmegaConf.create({\\n\",\n    \"    'spark': {\\n\",\n    \"        'appname': 'CommonCrawl',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_ingestion___common_crawl___wet2raw',\\n\",\n    \"            'args': {\\n\",\n    \"                'wet_path': str(wet_path),\\n\",\n    \"            }\\n\",\n    \"        },\\n\",\n    \"        {'name': 'data_ingestion___common_crawl___raw2ufl'},\\n\",\n    \"        {'name': 'cleaning___normalization___number'},\\n\",\n    \"        {'name': 'deduplication___common_crawl___exact_line'},\\n\",\n    \"        {\\n\",\n    \"            'name': 'quality___language___fasttext_filter',\\n\",\n    \"            'args': {\\n\",\n    \"                'whitelist': ['ko'],\\n\",\n    \"                'threshold': 0.5,\\n\",\n    \"            }\\n\",\n    \"        },\\n\",\n    \"        {'name': 'data_save___huggingface___ufl2hf_obj'}\\n\",\n    \"    ]\\n\",\n    \"})\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(ETL_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/11/14 22:10:11 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Downloading and preparing dataset spark/-1399168669 to /root/.cache/huggingface/datasets/spark/-1399168669/0.0.0...\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/-1399168669/0.0.0. Subsequent calls will reuse this data.\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['id', 'meta', 'name', 'text'],\\n\",\n       \"    num_rows: 292\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"\\n\",\n    \"# raw -> hf_obj\\n\",\n    \"spark, dataset = etl_pipeline.run(ETL_config)\\n\",\n    \"dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'id': '29551d2082ef11eea9d462800acfdc4f',\\n\",\n       \" 'meta': '{\\\"title\\\": \\\"\\\\\\\\uc640\\\\\\\\uae00\\\\\\\\uc640\\\\\\\\uae00 - \\\\\\\\uc7ac\\\\\\\\ubbf8\\\", \\\"url\\\": \\\"http://wagle.isplus.joins.com/app/index.php?mid=wg_fun&page=6\\\", \\\"date_download\\\": \\\"2023-06-05T00:45:09Z\\\", \\\"digest\\\": \\\"sha1:UDASCLMI7FRAUR5PKBHJZ6DZSBZPZTFI\\\", \\\"length\\\": 2557, \\\"nlines\\\": 45, \\\"source_domain\\\": \\\"wagle.isplus.joins.com\\\", \\\"cc_segment\\\": \\\"/root/.cache/dataverse/dataset/common_crawl_2023-23/CC-MAIN-20230604225057-20230605015057-00644.warc.wet.gz\\\"}',\\n\",\n       \" 'name': 'common_crawl',\\n\",\n       \" 'text': \\\"조인스\\\\n와글와글 전체 목록\\\\n조회\\\\n0000 매경기 재평가되는 맨유짤 더레즈 0000-00-00\\\\n0000 아스날이 0경기 0승을 한 이유? 구0000너 0000-00-00\\\\n0000 은근 축구 혼자서 다하는 선수 풋스타 0000-00-00\\\\n0000 [오피셜] 아스날, 리그 0위로 0라운드 종료 아스날아.. 0000-00-00\\\\n0000 [놀람] 놀랄 수 밖에 없는 첼시 선발라인업 케파멘디 0000-00-00\\\\n0000 [감동] 분데스리가 00번의 과거와 미래 포항항 0000-00-00\\\\n0000 ???:너네들 재미있어보이네~ 에밀홀딩 0000-00-00\\\\n0000 [정보]0000년 0회 이상 우승한팀 어우뮌x0 0000-00-00\\\\n0000 커뮤니티실드에서의 리버풀 해리킼웰 0000-00-00\\\\n0000 [유머] ?? : 아스날... 생각보다 강팀이잖아..? 금발롱 0000-00-00\\\\n0000 (감동)??:우....승...뭐라고? 티아구메.. 0000-00-00\\\\n0000 ????:우승팀이 이정도라니 나설필요가 없겠는걸 '질'? 0년0우승.. 0000-00-00\\\\n0000 위닝의 저주(?) 베르바턴 0000-00-00\\\\n0000 [유머] 현시점 최강팀 보이빕 0000-00-00\\\\n0000 다시보는 바르셀로나 보드진 영입 큰일은바.. 0000-00-00\\\\n0000 “축구의 신” 뮌헨콜라 0000-00-00\\\\n0000 라리가 0형제.jpg 헤르니고르 0000-00-00\\\\n0000 ???: 어이! 바르샤, 한잔 해~! 사비에르 0000-00-00\\\\n0000 그래도 아직 레바뮌 맞지 ㅋㅋ 킴미희 0000-00-00\\\\n0000 뮌헨-돌문-맨유 내리갈굼 퓰리식혜 0000-00-00\\\\n0000 최근 펩과르디올라 챔스 성적ㅋㅋ 펩몬드 0000-00-00\\\\n0000 ?????: 만나서 반갑다 00000 후안펩시 0000-00-00\\\\n0000 챔스 아탈란타 상대로 유일하게 클린시트한 키퍼 누구? 안녕카일 0000-00-00\\\\n0000 PSG VS 아탈란타 네이마르 요약 축신마르 0000-00-00\\\\n0000 맨유팬의 불타는 행복회로 에덴하자드 0000-00-00\\\\n0000 최근 경기당 0골씩 넣고 있는 축구선수 썬가드 0000-00-00\\\\n0000 [유머] 램파드 인터뷰떴다! 램램반장 0000-00-00\\\\n0000 [감동] ??? : 우...승...뭐라고? 아스날팬임 0000-00-00\\\\n0000 ???: 야 토트넘! 라이트구너 0000-00-00\\\\n0000 드디어 완성된 PL 빅0 사네는뮌.. 0000-00-00\\\\n0000 [유머] 와 아스날 매수ㅋㅋㅋ 김티어니 0000-00-00\\\\n0000 ???: 하늘은 왜 아스날을 낳고 베베루니 0000-00-00\\\\n0000 00/00시즌 린가드 유니버스 끝 린가디 0000-00-00\\\\n0000 0류 맨유 팬카페 맨유더마스 0000-00-00\\\\n0000 프리미어리그 운명공동체 0팀 AlMacdo 0000-00-00\\\\n쓰기\\\"}\"\n      ]\n     },\n     \"execution_count\": 8,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dataset[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 WET folder - Add MinhashLSH fuzzy deduplication\\n\",\n    \"> same but more preprocessing! \\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"spark:\\n\",\n      \"  appname: CommonCrawl\\n\",\n      \"  driver:\\n\",\n      \"    memory: 16g\\n\",\n      \"etl:\\n\",\n      \"- name: data_ingestion___common_crawl___wet2raw\\n\",\n      \"  args:\\n\",\n      \"    wet_path: /root/.cache/dataverse/dataset/common_crawl_2023-23\\n\",\n      \"- name: data_ingestion___common_crawl___raw2ufl\\n\",\n      \"- name: cleaning___normalization___number\\n\",\n      \"- name: deduplication___minhash___lsh_jaccard\\n\",\n      \"- name: deduplication___common_crawl___exact_line\\n\",\n      \"- name: quality___language___fasttext_filter\\n\",\n      \"  args:\\n\",\n      \"    whitelist:\\n\",\n      \"    - ko\\n\",\n      \"    threshold: 0.5\\n\",\n      \"- name: data_save___huggingface___ufl2hf_obj\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from omegaconf import OmegaConf\\n\",\n    \"\\n\",\n    \"# load from dict\\n\",\n    \"ETL_config = OmegaConf.create({\\n\",\n    \"    'spark': {\\n\",\n    \"        'appname': 'CommonCrawl',\\n\",\n    \"        'driver': {'memory': '16g'},\\n\",\n    \"    },\\n\",\n    \"    'etl': [\\n\",\n    \"        {\\n\",\n    \"            'name': 'data_ingestion___common_crawl___wet2raw',\\n\",\n    \"            'args': {\\n\",\n    \"                'wet_path': str(wet_path),\\n\",\n    \"            }\\n\",\n    \"        },\\n\",\n    \"        {'name': 'data_ingestion___common_crawl___raw2ufl'},\\n\",\n    \"        {'name': 'cleaning___normalization___number'},\\n\",\n    \"        {'name': 'deduplication___minhash___lsh_jaccard'},\\n\",\n    \"        {'name': 'deduplication___common_crawl___exact_line'},\\n\",\n    \"        {\\n\",\n    \"            'name': 'quality___language___fasttext_filter',\\n\",\n    \"            'args': {\\n\",\n    \"                'whitelist': ['ko'],\\n\",\n    \"                'threshold': 0.5,\\n\",\n    \"            }\\n\",\n    \"        },\\n\",\n    \"        {'name': 'data_save___huggingface___ufl2hf_obj'}\\n\",\n    \"    ]\\n\",\n    \"})\\n\",\n    \"\\n\",\n    \"print(OmegaConf.to_yaml(ETL_config))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/11/14 22:10:34 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Downloading and preparing dataset spark/2085970941 to /root/.cache/huggingface/datasets/spark/2085970941/0.0.0...\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Dataset spark downloaded and prepared to /root/.cache/huggingface/datasets/spark/2085970941/0.0.0. Subsequent calls will reuse this data.\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Dataset({\\n\",\n       \"    features: ['id', 'meta', 'name', 'text'],\\n\",\n       \"    num_rows: 285\\n\",\n       \"})\"\n      ]\n     },\n     \"execution_count\": 10,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"\\n\",\n    \"# raw -> hf_obj\\n\",\n    \"spark, dataset = etl_pipeline.run(ETL_config)\\n\",\n    \"dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'id': '3aa3dddc82ef11ee898d62800acfdc4f',\\n\",\n       \" 'meta': '{\\\"title\\\": \\\"\\\\\\\\ub3d9\\\\\\\\uc601\\\\\\\\uc0c1 | \\\\\\\\uc6b0\\\\\\\\ub9ac \\\\\\\\ud568\\\\\\\\uaed8 \\\\\\\\ub9cc\\\\\\\\ub4e4\\\\\\\\uc5b4 \\\\\\\\ubd05\\\\\\\\uc2dc\\\\\\\\ub2e4.(57.\\\\\\\\ud478\\\\\\\\ucd08\\\\\\\\ubcf6\\\\\\\\uc74c)\\\\\\\\u200b\\\", \\\"url\\\": \\\"https://dprktoday.com/videos/16055?list=\\\", \\\"date_download\\\": \\\"2023-06-05T01:01:45Z\\\", \\\"digest\\\": \\\"sha1:6TKZ4VWGQESC6HVNGQS3ESIE4BR63V25\\\", \\\"length\\\": 4007, \\\"nlines\\\": 317, \\\"source_domain\\\": \\\"dprktoday.com\\\", \\\"cc_segment\\\": \\\"/root/.cache/dataverse/dataset/common_crawl_2023-23/CC-MAIN-20230604225057-20230605015057-00644.warc.wet.gz\\\"}',\\n\",\n       \" 'name': 'common_crawl',\\n\",\n       \" 'text': '첫페지로\\\\n날자별열람\\\\n손전화홈페지열람기\\\\n조선어 English 中国语 Русский\\\\n정치\\\\n경제\\\\n군사\\\\n사회문화\\\\n조국통일\\\\n관광\\\\n력사\\\\n로작\\\\n기 사\\\\n동영상\\\\n사 진\\\\n음악감상\\\\n전체\\\\n혁명활동소식\\\\n기록영화\\\\n회고록《세기와 더불어》\\\\n《조선의 오늘》동영상\\\\n조선중앙TV\\\\nU C C\\\\n국제친선전람관을 찾아서 |\\\\n국가선물관을 찾아서 |\\\\n특집 |\\\\n생활의 랑만과 정서 |\\\\n미덕의 향기 |\\\\n인물소개 |\\\\n예술공연 |\\\\n아동무대 |\\\\n조선영화 |\\\\nTV예술영화 |\\\\nTV련속소설 |\\\\nTV련속극 |\\\\nTV극 |\\\\nTV기록영화 |\\\\nTV기록편집물 |\\\\n사이프로편집물 |\\\\n만화영화 |\\\\n인기동영상 |\\\\n화면취재시간 |\\\\n민족의 자취를 찾아서 |\\\\n우리함께 |\\\\n조선의 숨결 |\\\\n이 시각 평양, 그 한토막 |\\\\n나는 좋아요 |\\\\n료리백과 |\\\\n[료리만들기]\\\\n우리 함께 만들어 봅시다.(00.푸초볶음)\\\\u200b\\\\n0 0:00 [0000-00-00]\\\\n온면\\\\n돼지고기졸임\\\\n봄철음식 -달래무우김치, 냉이고추장무침-\\\\n감자가루군만두\\\\n닭알료리, 청포채\\\\n0 0분 [0000-00-00]\\\\n우리 함께 만들어 봅시다.(000.뜨더국)\\\\n우리 함께 만들어 봅시다.(000.닭위졸임)\\\\n우리 함께 만들어 봅시다.(000.미꾸라지풋고추졸임)\\\\n0 0:0 [0000-00-00]\\\\n우리 함께 만들어 봅시다.(000.무우채김치)\\\\n우리 함께 만들어 봅시다.(000.칼제비국)\\\\n우리 함께 만들어 봅시다.(000.고등어졸임)\\\\n우리 함께 만들어 봅시다.(000.록두묵채)\\\\n|\\\\n감상글(0) |\\\\n동영상보기 |\\\\n추천하기\\\\n료리만들기 000건\\\\n0분 00초\\\\n0 [0000-00-00]\\\\n0분\\\\n0분 0초\\\\n←되돌이\\\\n현대조선을 빛내이신 절세위인들 | 회고록 《세기와 더불어》 | 정치 | 경제 | 군사 | 사회문화 | 조국통일 | 관광 | 력사\\\\n기사 | 동영상 | 사진 | 음악감상 | 통일신보 | 다매체편집물 | 도서 | 도서련재 | 록음물 | 그림책 | 조선우표 | 조선미술 | 명제품 | 특산료리 | 독자목소리 | 감상글\\\\n홈페지봉사에 관한 문의\\\\nCopyright© 0000-0000 《평양모란봉편집사》 All Rights Reserved'}\"\n      ]\n     },\n     \"execution_count\": 11,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"dataset[0]\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"llm\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.13\"\n  },\n  \"orig_nbformat\": 4\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/etl/EX_use_pyspark_ui.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Use Pyspark UI\\n\",\n    \"> you can use pyspark UI to monitor the spark job.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 🌌 Using in Docker Environment\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 🌠 when pyspark default port `4040` is not available\\n\",\n    \"> In a Docker environment, to access PySpark's UI, the port PySpark's UI runs on inside the Docker container should be mapped to a certain port on your host machine to make it accessible. By default, PySpark's UI runs on port `4040` inside the container. If this port is not available, you can configure PySpark to use a different port.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### pyspark UI attempt with `4040` (default)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\",\n      \"23/11/18 07:22:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"23/11/18 07:22:16 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\",\n      \"23/11/18 07:22:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"spark, data = etl_pipeline.sample()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"if you access `https://{your_ip_address}:4040` and nothing will be shown\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# stop the spark session after you are done\\n\",\n    \"spark.stop()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### change pyspark UI port to which is `available`\\n\",\n    \"> the point here is to let you know that you can change the port to the port you want\\n\",\n    \"\\n\",\n    \"- here for example, let's assume `30360` port is available\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/11/18 07:22:23 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from dataverse.etl import ETLPipeline\\n\",\n    \"\\n\",\n    \"etl_pipeline = ETLPipeline()\\n\",\n    \"spark, data = etl_pipeline.sample(config={'spark': {'ui': {'port': 30360}}})\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"if you access `https://{your_ip_address}:30360` and voila! you can see the pyspark UI\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"100\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# check out the changes in the UI\\n\",\n    \"data.count()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# stop the spark session after you are done\\n\",\n    \"spark.stop()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"llm\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.11\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples/etl/README.md",
    "content": "\n# 🗺️ ETL (Extract, Transform, Load)"
  },
  {
    "path": "requirements.txt",
    "content": "requests\nnumpy\npandas\nfasttext-wheel\nomegaconf\npyarrow==14.0.1\ndatasets\npyspark\nscipy\ntrafilatura\nhtml2text\nfaker\nawscli\nboto3\npre-commit==3.6.0\nbotocore\nrsa\ns3transfer\nisort\npytest\ngraphframes-latest\n"
  },
  {
    "path": "setup.py",
    "content": "import os\n\nfrom setuptools import find_packages, setup\n\nbasedir = os.path.abspath(os.path.dirname(__file__))\nrequirements_path = os.path.join(basedir, \"requirements.txt\")\n\n\ndef get_requirements():\n    \"\"\"Get package requirements from a requirements file (ex: requirements.txt).\"\"\"\n    with open(requirements_path, \"r\") as f:\n        return f.read().splitlines()\n\n\ndef get_extras_require():\n    extras_require = {\n        \"aws\": [\n            \"awscli==1.32.36\",\n            \"botocore==1.34.36\",\n            \"rsa==4.7.2\",\n            \"s3transfer==0.10.0\",\n        ],\n        \"dev\": [\n            \"black==22.12.0\",\n            \"isort>=5.10.1\",\n            \"flake8>=4.0.1\",\n            \"pytest>=7.4.4\",\n            \"pre-commit==3.6.0\",\n        ],\n    }\n\n    extras_require.update({\"all\": [i[0] for i in extras_require.values()]})\n    return extras_require\n\n\nsetup(\n    name=\"dataverse\",\n    version=\"1.0.4\",\n    packages=find_packages(),\n    author=\"Dataverse Team\",\n    author_email=\"dataverse@upstage.ai\",\n    description=\"An open-source simplifies ETL workflow with Python based on Spark\",\n    license=\"Apache License 2.0\",\n    include_package_data=True,\n    install_requires=get_requirements(),\n    entry_points={\"console_scripts\": [\"dataverse = dataverse.api.cli:main\"]},\n)\n"
  }
]