Repository: UpstageAI/dataverse
Branch: main
Commit: a0adedc316a4
Files: 148
Total size: 464.3 KB
Directory structure:
gitextract_q82io2no/
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── 1-bug-report.yml
│ │ ├── 2-feature-request.yml
│ │ ├── 3-documentation-improve.yml
│ │ └── config.yml
│ └── pull_request_template.md
├── .gitignore
├── .pre-commit-config.yaml
├── .readthedocs.yaml
├── LICENSE
├── Makefile
├── README.md
├── contribution/
│ └── CONTRIBUTING.md
├── dataverse/
│ ├── README.md
│ ├── __init__.py
│ ├── api/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── cli.py
│ │ └── emr.py
│ ├── config/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ └── interface.py
│ ├── etl/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── __sample/
│ │ │ ├── README.md
│ │ │ ├── __init__.py
│ │ │ ├── ducky.py
│ │ │ └── github.py
│ │ ├── bias/
│ │ │ ├── README.md
│ │ │ └── __init__.py
│ │ ├── cleaning/
│ │ │ ├── README.md
│ │ │ ├── __init__.py
│ │ │ ├── char.py
│ │ │ ├── document.py
│ │ │ ├── html.py
│ │ │ ├── korean.py
│ │ │ ├── length.py
│ │ │ ├── number.py
│ │ │ ├── table.py
│ │ │ └── unicode.py
│ │ ├── data_ingestion/
│ │ │ ├── README.md
│ │ │ ├── __init__.py
│ │ │ ├── arrow.py
│ │ │ ├── common_crawl.py
│ │ │ ├── csv.py
│ │ │ ├── cultura_x.py
│ │ │ ├── huggingface.py
│ │ │ ├── parquet.py
│ │ │ ├── red_pajama.py
│ │ │ ├── slim_pajama.py
│ │ │ └── test.py
│ │ ├── data_save/
│ │ │ ├── README.md
│ │ │ ├── __init__.py
│ │ │ ├── aws.py
│ │ │ ├── huggingface.py
│ │ │ └── parquet.py
│ │ ├── decontamination/
│ │ │ ├── README.md
│ │ │ └── __init__.py
│ │ ├── deduplication/
│ │ │ ├── README.md
│ │ │ ├── __init__.py
│ │ │ ├── common_crawl.py
│ │ │ ├── exact.py
│ │ │ ├── minhash.py
│ │ │ └── polyglot.py
│ │ ├── pii/
│ │ │ ├── README.md
│ │ │ ├── __init__.py
│ │ │ ├── card.py
│ │ │ └── nin.py
│ │ ├── pipeline.py
│ │ ├── quality/
│ │ │ ├── README.md
│ │ │ ├── __init__.py
│ │ │ └── language.py
│ │ ├── registry.py
│ │ ├── toxicity/
│ │ │ ├── README.md
│ │ │ └── __init__.py
│ │ └── utils/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── log.py
│ │ ├── sampling.py
│ │ └── statistics.py
│ ├── lab/
│ │ ├── README.md
│ │ └── __init__.py
│ ├── tests/
│ │ ├── conftest.py
│ │ ├── test_cleaning_accent.py
│ │ ├── test_cleaning_char.py
│ │ ├── test_cleaning_document.py
│ │ ├── test_cleaning_html.py
│ │ ├── test_cleaning_korean.py
│ │ ├── test_cleaning_length.py
│ │ ├── test_cleaning_number.py
│ │ ├── test_cleaning_table.py
│ │ ├── test_cleaning_unicode.py
│ │ ├── test_deduplication_common_crawl.py
│ │ ├── test_deduplication_exact.py
│ │ ├── test_deduplication_minhash.py
│ │ ├── test_deduplication_polyglot.py
│ │ ├── test_pii_card.py
│ │ └── test_pii_nin.py
│ └── utils/
│ ├── README.md
│ ├── __init__.py
│ ├── analyze/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── pip.py
│ │ └── python.py
│ ├── api/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ └── aws.py
│ ├── format/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── huggingface.py
│ │ └── ufl.py
│ └── setting/
│ ├── README.md
│ ├── __init__.py
│ ├── system.py
│ └── user.py
├── docs/
│ ├── Makefile
│ ├── make.bat
│ └── source/
│ ├── citation.rst
│ ├── conf.py
│ ├── config/
│ │ └── config.interface.rst
│ ├── etl/
│ │ ├── etl.bias.rst
│ │ ├── etl.cleaning.rst
│ │ ├── etl.data_ingestion.rst
│ │ ├── etl.data_save.rst
│ │ ├── etl.decontamination.rst
│ │ ├── etl.deduplication.rst
│ │ ├── etl.pii.rst
│ │ ├── etl.pipeline.rst
│ │ ├── etl.quality.rst
│ │ ├── etl.registry.rst
│ │ ├── etl.rst
│ │ ├── etl.toxicity.rst
│ │ └── etl.utils.rst
│ ├── index.rst
│ ├── installation.rst
│ ├── quickstart.rst
│ └── requirements.txt
├── examples/
│ ├── README.md
│ └── etl/
│ ├── ETL_01_how_to_run.ipynb
│ ├── ETL_02_one_cycle.ipynb
│ ├── ETL_03_create_new_etl_process.ipynb
│ ├── ETL_04_add_new_etl_process.ipynb
│ ├── ETL_05_test_etl_process.ipynb
│ ├── ETL_06_scaleout_with_EMR.ipynb
│ ├── EX_use_common_crawl_data.ipynb
│ ├── EX_use_pyspark_ui.ipynb
│ └── README.md
├── requirements.txt
└── setup.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/ISSUE_TEMPLATE/1-bug-report.yml
================================================
name: "🐛 Bug Report"
description: Create a new ticket for a bug.
title: "🐛 [BUG] - <title>"
labels: [
"bug"
]
body:
- type: textarea
id: environment-setting
attributes:
label: "Environment Settings"
description: Java, Pyspark version, Python version, ...
placeholder: Describe your environment settings so we can reproduce the issue.
validations:
required: true
- type: textarea
id: expected-behavior
attributes:
label: "Expected Behavior"
placeholder: A clear and concise description of what you would expect to happen.
validations:
required: true
- type: textarea
id: actual-behavior
attributes:
label: "Actual Behavior"
placeholder: A clear and concise description of what actually happened.
- type: textarea
id: reproduction
attributes:
label: Reproduction
description: |
Please provide explicit steps to reproduce your problem.
If you have any code snippets, error messages, etc., please include them here.
placeholder: |
Steps to reproduce:
1.
2.
3.
4.
validations:
required: true
================================================
FILE: .github/ISSUE_TEMPLATE/2-feature-request.yml
================================================
name: "🚀 Feature Request"
description: Suggest a new feature or an enhancement to an existing feature.
title: "🚀 [REQUEST] - <title>"
labels: [
"enhancement", "feature"
]
body:
- type: textarea
id: feature-request
attributes:
label: Feature request
description: |
Please describe the feature you want added or enhanced.
If you have any related papers or code, please share them with us.
validations:
required: true
- type: textarea
id: context
validations:
required: false
attributes:
label: Context
description: |
Please let us know your motivation or additional context for this suggestion.
Knowing why it should be added or enhanced helps us understand the need.
================================================
FILE: .github/ISSUE_TEMPLATE/3-documentation-improve.yml
================================================
name: "📝 Documentation Improvement"
description: Report wrong or missing documentation. You can suggest a new document or point out an existing one that needs improvement.
title: "📝 [Docs] - <title>"
labels: [
"docs"
]
body:
- type: checkboxes
attributes:
label: dataverse version checks
options:
- label: >
I have checked that the issue still exists on the latest version of _dataverse_.
required: true
- type: textarea
id: location
attributes:
label: Location of the documentation
description: >
Please provide the location of the documentation.
If you are suggesting a new document, please indicate where it should live.
validations:
required: true
- type: textarea
id: problem
attributes:
label: Documentation problem
description: >
Please provide a description of what documentation you believe needs to be fixed/improved/added.
validations:
required: true
- type: textarea
id: suggestion
attributes:
label: Suggestion
description: >
Please explain the suggested fix and **why** it's better than the existing documentation.
Alternatively, this can be the content of the new document you are suggesting.
validations:
required: true
================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: true
================================================
FILE: .github/pull_request_template.md
================================================
## PR Checklist
Please check if your PR fulfills the following requirements:
- [ ] The commit message follows _dataverse_ guidelines [link](https://github.com/UpstageAI/dataverse/blob/main/contribution/CONTRIBUTING.md#commit-guidelines):
- [ ] Tests for the changes have been added (for bug fixes / features)
- [ ] Docs have been added / updated (for bug fixes / features)
## What does this PR do?
<!-- Please describe the link to a relevant issue and current behavior that you are modifying.-->
- Issue Number: #
- Description:
================================================
FILE: .gitignore
================================================
# forbidden
.env
reference/
common_crawl/
notebook/
.cache/
sample/
# open-source
cc_net/
dps/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.2.0
hooks:
# - id: trailing-whitespace
- id: check-added-large-files
- id: detect-private-key
- id: detect-aws-credentials
args: [--allow-missing-credentials]
- repo: https://github.com/pycqa/isort
rev: 5.13.2
hooks:
- id: isort
args: [
--profile=black,
]
- repo: https://github.com/psf/black
rev: 23.12.1
hooks:
- id: black
args: [
--line-length=100,
]
- repo: https://github.com/myint/autoflake
rev: v2.2.0
hooks:
- id: autoflake
args: [
# --in-place,
# --remove-unused-variables,
# --remove-all-unused-imports,
--expand-star-imports,
]
- repo: https://github.com/PyCQA/flake8
rev: 6.0.0
hooks:
- id: flake8
args: [
"--ignore=E203, E501, W503",
]
# E203: Whitespace before ':'
# E501: line length - because black checks and this makes error even on commented code
# W503: PEP8 now recommends to break before binary operator (https://peps.python.org/pep-0008/#should-a-line-break-before-or-after-a-binary-operator)
================================================
FILE: .readthedocs.yaml
================================================
# .readthedocs.yml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
# Required
version: 2
# Set the OS, Python version and other tools you might need
build:
os: ubuntu-20.04
tools:
python: "3.10"
# You can also specify other tool versions:
# nodejs: "19"
# rust: "1.64"
# golang: "1.19"
# Build documentation in the docs/ directory with Sphinx
sphinx:
configuration: docs/source/conf.py
# Build documentation with MkDocs
#mkdocs:
# configuration: mkdocs.yml
# Optionally build your docs in additional formats such as PDF
#formats:
# - pdf
# Optionally set the version of Python and requirements required to build your docs
python:
install:
- requirements: docs/source/requirements.txt
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: Makefile
================================================
.PHONY: aws_s3 pyspark java
aws_s3:
@test -d $$SPARK_HOME/jars || mkdir -p $$SPARK_HOME/jars
@test -f $$SPARK_HOME/jars/hadoop-aws-3.3.4.jar || wget -P $$SPARK_HOME/jars/ https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
@test -f $$SPARK_HOME/jars/aws-java-sdk-bundle-1.12.592.jar || wget -P $$SPARK_HOME/jars/ https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.592/aws-java-sdk-bundle-1.12.592.jar
pyspark:
echo "export SPARK_HOME=$(shell pip show pyspark | grep Location | awk '{print $$2 "/pyspark"}')" >> ~/.bashrc
echo "export PYSPARK_PYTHON=python3" >> ~/.bashrc
# setting java environment
java:
sudo apt-get update
sudo apt-get install openjdk-11-jdk
echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> ~/.bashrc
================================================
FILE: README.md
================================================
<div align="center">
<br>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="docs/images/dataverse_logo-white.png" width=300>
<source media="(prefers-color-scheme: light)" srcset="docs/images/dataverse_logo-color.png" width=300>
<img alt="DATAVERSE" src="docs/images/dataverse_logo-color.png" width=300>
</picture>
<br>
The Universe of Data.
All about Data, Data Science, and Data Engineering. <br/>
Upstage Solar is powered by Dataverse! Try it at the Upstage [Console](https://console.upstage.ai/)!
[Docs](https://data-verse.gitbook.io/docs/) • [Examples](https://github.com/UpstageAI/dataverse/tree/main/examples) • [API Reference](https://data-verse.readthedocs.io/en/latest/) • [FAQ](https://data-verse.gitbook.io/docs/documents/faqs) • [Contribution Guide](https://github.com/UpstageAI/dataverse/blob/main/contribution/CONTRIBUTING.md) • [Contact](mailto:dataverse@upstage.ai) • [Discord](https://discord.gg/aAqF7pyq4h) • [Paper](https://arxiv.org/abs/2403.19340)
<br><br>
<div align="left">
## Welcome to Dataverse!
Dataverse is a freely accessible open-source project that supports your **ETL (Extract, Transform and Load) pipeline with Python**. We offer a simple, standardized and user-friendly solution for data processing and management, catering to the needs of data scientists, analysts, and developers in the LLM era. Even if you don't know much about Spark, you can use it easily via _dataverse_.
### With Dataverse, you are empowered to
- utilize a range of preprocessing functions without the need to install multiple libraries.
- create high-quality data for analysis and training of Large Language Models (LLM).
- leverage Spark with ease, regardless of your expertise level.
- facilitate smoother collaboration among users with varying degrees of Spark proficiency.
- enjoy freedom from the limitations of local environments by harnessing the capabilities of AWS EMR.
### Architecture of Dataverse

### Key Features of Dataverse
- **Block-Based**: In Dataverse, a `block` is a `registered ETL function` that runs on Spark. You can build Spark code like putting together puzzle pieces, easily adding, removing, or rearranging pieces to get the results you want via configuration.
- **Configure-Based**: All Spark settings and block steps are defined in a configuration. You don't need to know all the code. Just set up the options, and you're good to go.
- **Extensible**: It's designed to meet your specific demands, allowing for custom features that fit perfectly with your project.
If you want to know more about Dataverse, please check out our [docs](https://data-verse.gitbook.io/docs/).
Clicking the image below will take you to a short intro video!
[](https://youtu.be/yYyyLuPNK5s?feature=shared)
<br>
## 🌌 Installation
### 🌠 Prerequisites
To use this library, the following conditions are needed:
- Python (version between 3.10 and 3.11)
- JDK (version 11)
- PySpark
A detailed installation guide for the prerequisites can be found [here](https://data-verse.gitbook.io/docs/installation).
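As a quick sanity check before installing, the interpreter version can be verified against the supported range (a minimal sketch; the `supported` helper is illustrative and not part of Dataverse):

```python
import sys

def supported(major: int, minor: int) -> bool:
    # Dataverse requires Python 3.10 or 3.11
    return (major, minor) in [(3, 10), (3, 11)]

print(supported(*sys.version_info[:2]))
```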
### 🌠 Install via PyPI
```bash
pip install dataverse
```
<br>
## 🌌 Quickstart
Various and more detailed tutorials are [here](https://github.com/UpstageAI/dataverse/tree/main/examples).
- [add_new_etl_process.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_04_add_new_etl_process.ipynb) : If you want to use your own custom function, you have to register it on Dataverse. This will guide you from registering the function to applying it in a pipeline.
- [test_etl_process.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_05_test_etl_process.ipynb) : For when you want sample data to quickly test your ETL process, or need data from a certain point in the pipeline.
- [scaleout_with_EMR.ipynb](https://github.com/UpstageAI/dataverse/blob/main/examples/etl/ETL_06_scaleout_with_EMR.ipynb) : For people who want to run their pipeline on EMR cluster.
<details>
<summary><u>Details of the example ETL config.</u></summary>
<ul></ul>
<ul>
<li style="line-height:250%;"> <b>data_ingestion___huggingface___hf2raw </b></li>
Load dataset from <a href="https://huggingface.co/datasets/allenai/ai2_arc">Hugging Face</a>, which contains a total of 2.59k rows.
</ul>
<ul>
<li style="line-height:250%;"> <b>utils___sampling___random </b></li>
To decrease the dataset size, randomly subsample 50% of the data (default seed 42). <br/>
This reduces the dataset to 1.29k rows.
</ul>
<ul>
<li style="line-height:250%;"> <b>deduplication___minhash___lsh_jaccard </b></li>
Deduplicate on the <code>question</code> column using 5-gram MinHash with a Jaccard similarity threshold of 0.1.
</ul>
<ul>
<li style="line-height:250%;"> <b>data_save___parquet___ufl2parquet </b></li>
Save the processed dataset as a Parquet file to <code>./guideline/etl/sample/quickstart.parquet</code>.<br/>
The final dataset comprises around 1.14k rows.
</ul>
</details>
```python
# 1. Set your ETL process as config.
from omegaconf import OmegaConf
ETL_config = OmegaConf.create({
# Set up Spark
'spark': {
'appname': 'ETL',
'driver': {'memory': '4g'},
},
'etl': [
{
# Extract; You can use a Hugging Face dataset from the hub directly!
'name': 'data_ingestion___huggingface___hf2raw',
'args': {'name_or_path': ['ai2_arc', 'ARC-Challenge']}
},
{
# Reduce dataset scale
'name': 'utils___sampling___random',
'args': {'sample_n_or_frac': 0.5}
},
{
# Transform; deduplicate data via minhash
'name': 'deduplication___minhash___lsh_jaccard',
'args': {'threshold': 0.1,
'ngram_size': 5,
'subset': 'question'}
},
{
# Load; Save the data
'name': 'data_save___parquet___ufl2parquet',
'args': {'save_path': './guideline/etl/sample/quickstart.parquet'}
}
]
})
```
The code block above is an example of an ETL process in Dataverse. The available registered ETL functions are referred to as `blocks`, and this example is composed of four of them. You can freely combine blocks via config to create the ETL processes you need. The list of available functions and their arguments can be found in the [API Reference](https://data-verse.readthedocs.io/en/latest/). Each function's `args` should be given in dictionary format.
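As the config shows, block names follow a `<category>___<module>___<function>` convention. A minimal sketch of splitting one apart (the `parse_block_name` helper is illustrative, not part of Dataverse):

```python
def parse_block_name(name: str) -> dict:
    """Split a registered block name into its category, module, and function parts."""
    category, module, func = name.split("___")
    return {"category": category, "module": module, "function": func}

print(parse_block_name("deduplication___minhash___lsh_jaccard"))
# → {'category': 'deduplication', 'module': 'minhash', 'function': 'lsh_jaccard'}
```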
```python
# 2. Run the ETL pipeline.
from dataverse.etl import ETLPipeline
etl_pipeline = ETLPipeline()
spark, dataset = etl_pipeline.run(config=ETL_config, verbose=True)
```
ETLPipeline is an object designed to manage ETL processes. Pass the `ETL_config` defined in the previous step into the ETLPipeline object and call the `run` method; the stacked ETL blocks will then execute in the order they were stacked.
```python
# 3. The result file is saved to the save_path
```
Since the example passed a `save_path` argument to the last block of `ETL_config`, the data that passed through the process will be saved to the given path.
<br>
## 🌌 Modules
Currently, about 50 functions are registered as ETL processes, eagerly awaiting your use!
| Type | Package | Description |
|-----------|-----------------|---------------------------------------------------------------------------------------------------|
| Extract | data_ingestion | Load data from any source into the preferred format. |
| Transform | bias | (WIP) Reduce skewed or prejudiced data, particularly data that reinforces stereotypes. |
| | cleaning | Remove irrelevant, redundant, or noisy information, such as stop words or special characters. |
| | decontamination | (WIP) Remove contaminated data, including benchmark data. |
| | deduplication | Remove duplicated data, targeting not only identical matches but also similar data. |
| | pii | Remove personally identifiable information (PII) and other sensitive data. |
| | quality | Improve data quality in terms of accuracy, consistency, and reliability. |
| | toxicity | (WIP) Remove harmful, offensive, or inappropriate content from the data. |
| Load | data_save | Save the processed data to a preferred destination such as a data lake or database. |
| Utils | utils | Essential tools for data processing, including sampling, logging, statistics, etc. |
<br>
## 🌌 Dataverse supports AWS
Dataverse works with AWS S3 and EMR, enabling you to load and save data on S3 and execute ETL pipelines on EMR. A step-by-step setup guide is [here](https://data-verse.gitbook.io/docs/lets-start/aws-s3-support).
</br>
## 🌌 Dataverse use-case
> If you have any use-cases of your own, please feel free to let us know. </br>We would love to hear about them and possibly feature your case.
*✨* [`Upstage`](https://www.upstage.ai/) is using Dataverse for preprocessing the data for the training of [Solar Mini](https://console.upstage.ai/services/solar?utm_source=upstage.ai&utm_medium=referral&utm_campaign=Main+hero+Solar+card&utm_term=Try+API+for+Free&utm_content=home). </br>
*✨* [`Upstage`](https://www.upstage.ai/) is using Dataverse for preprocessing the data for the [Up 1T Token Club](https://en.content.upstage.ai/1tt).
## 🌌 Contributors
<a href="https://github.com/UpstageAI/dataverse/graphs/contributors">
<img src="https://contrib.rocks/image?repo=UpstageAI/dataverse" />
</a>
## 🌌 Acknowledgements
Dataverse is an open-source project orchestrated by the **Data-Centric LLM Team** at [`Upstage`](https://www.upstage.ai/), designed as a data ecosystem for LLMs (Large Language Models). Launched in March 2024, this initiative stands at the forefront of advancing data handling for LLMs.
## 🌌 License
Dataverse is a fully open-source project licensed under the Apache-2.0 license.
## 🌌 Citation
If you want to cite the 🌌 Dataverse project, feel free to use the following BibTeX. You can check our paper via this [link](https://arxiv.org/abs/2403.19340).
```bibtex
@misc{park2024dataverse,
title={Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models},
author={Hyunbyung Park and Sukyung Lee and Gyoungjin Gim and Yungi Kim and Dahyun Kim and Chanjun Park},
year={2024},
eprint={2403.19340},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
================================================
FILE: contribution/CONTRIBUTING.md
================================================
# __Contribution Guidelines__
Welcome to _Dataverse_! We warmly welcome any kind of contribution 😊✨. </br>
This page provides an outline on how to contribute to _Dataverse_ and suggestions for nice conventions to follow.
> __These are guidelines, NOT rules 💡__ <p>
This page is not the Constitution of _Dataverse_. We provide these guidelines to help you make useful and efficient contributions to _Dataverse_. While we think they are sensible and we appreciate when they are observed, following them isn't strictly required. We hope you won't be tired out by these guidelines. Also, we'd love to hear your ideas on how to improve them!
</br>
# Table of Contents
- [Questions or Feedback](#questions-or-feedback)
- [🤝 How to Contribute?](#how-to-contribute)
- [Tests](#tests)
- [Directory of Dataverse](#directory-of-dataverse)
- [Design Philosophy](#design-philosophy)
- [Commit Guidelines](#commit-guidelines)
- [Style Guides](#style-guides)
</br>
# Questions or Feedback
Join the conversation on our GitHub discussion board! It's the go-to spot for questions, chats, and a helping hand from the _Dataverse_ community. Drop by and say hello here: [link](https://github.com/UpstageAI/dataverse/discussions)
And if there's a shiny new feature you're dreaming of, don't be shy—head over to our [issue page](https://github.com/UpstageAI/dataverse/issues) to let us know! Your input could help shape the future. ✨
</br>
# How to Contribute?
- Any kind of documentation improvement: fixing typos, enhancing grammar or semantic structuring, or adding new examples.
- Submit issues related to bugs, desired new features, or enhancements of existing features.
- Fix a bug, implement a new feature, or improve an existing feature.
- Answer other users' questions or help them.
## __Documentation__
We appreciate all pull requests that fix typos or improve the grammar or semantic structure of our documents. Feel free to check! <br/>
Our API reference page is constructed with [Sphinx](https://www.sphinx-doc.org/en/master/). We adhere to the [Google style for docstrings](https://google.github.io/styleguide/pyguide.html) as a fundamental practice, so please follow this format. The source files are located within the `docs/source/` directory.
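For reference, a minimal Google-style docstring looks like the sketch below (the function is a hypothetical illustration, not part of the Dataverse API):

```python
def sample_rows(data, frac):
    """Randomly sample a fraction of rows.

    Args:
        data: Input dataset exposing a ``sample`` method.
        frac (float): Fraction of rows to keep, between 0 and 1.

    Returns:
        The sampled dataset.
    """
    return data.sample(frac)
```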
## __Report a Bug / Request New Feature / Suggest Enhancements__
Please open an issue whenever you find a bug or have an idea to enhance _Dataverse_. Maintainers will label it or leave a comment as soon as they check the issue. Issues labeled `Open for contribution` are open for contribution.
## __Fix a Bug / Add New Feature / Improve Existing Feature__
If you have a particular roadmap, goals, or new feature, share it via an issue. If you have already fixed a bug or built a new feature that enhances _Dataverse_, you can jump to the fourth step and open a pull request. Please note that a pull request opened without a prior issue or a maintainers' check may be declined if it does not align with the philosophy of _Dataverse_.
### __1️⃣ Check issues labeled as__ `Open for contribution`
You can find issues waiting for your contribution by filtering labels with `Open for contribution`. This label does not stand alone; it always accompanies `Bug`, `Docs`, or `Enhancement`. Issues with a `Critical` or `ASAP` label are more urgent.
### __2️⃣ Leave a comment on the issue you want to contribute__
Once we review your comment, we'll entrust the issue to you by swapping out the `Open for contribution` label for a `WIP` (Work in Progress) label.
### __3️⃣ Work on it__
Before diving into coding, do take a moment to familiarize yourself with our coding style by visiting the [style guides](#style-guides). And hey, if you hit a snag while tackling the issue, don't hesitate to drop a comment right there. Our community is a supportive bunch and will jump in to assist or brainstorm with you.
1. Fork the repository of _Dataverse_.
2. Clone your fork to your local disk.
3. Create a new branch to hold your development changes. </br>
It's not required to adhere strictly to the branch naming example provided; consider it a mild suggestion.
```
git checkout -b {prefix}/{issue-number}-{description}
```
4. Set up a development environment
5. Develop the features in your branch
### __4️⃣ Create a Pull Request__
Go ahead and visit your GitHub fork, then initiate a pull request — it's time to share your awesome work! Before you do, double-check that you've completed everything on the checklist we provided. Once you're all set, submit your contributions for the project maintainers to review.
Don't worry if the maintainers have some feedback or suggest changes—it's all part of the process and happens to even our most experienced contributors. Keep your updates flowing by working in your local branch and pushing any new changes to your fork. Your pull request will update automatically for everyone to see the progress.
</br>
# Tests
The Dataverse test framework is built using [pytest](https://docs.pytest.org/en/8.0.x/). Ensure that you write a corresponding test for any new features or changes you make. You'll find the test files in the `dataverse/dataverse/tests` directory.
- Create a new test file if you've introduced a new category or a sub-category for the ETL process.
- If your addition is a new feature within an existing category or sub-category, include your tests in the existing test file.
</br>
# Directory of Dataverse
For _Dataverse_'s overarching goals: check the [docs](https://data-verse.gitbook.io/docs#future-work)
```{plain text}
📦 dataverse/dataverse
┣ 📂 api
┣ 📂 config
┃ ┣ 📂 etl
┃ ┃ ┗ 📂 sample
┣ 📂 etl
┃ ┣ 📂 {CATEGORY}
┣ 📂 lab
┣ 📂 tests
┗ 📂 utils
```
- [`📂 api`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/api): The Dataverse API serves as a
gateway for users.
- [`📂 config`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/config): Contains configuration files for the Dataverse application. You can also find a sample configuration file for the ETL process under this directory.
- [`📂 etl`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/etl): Main directory of _Dataverse_, where all of the data processors are placed. Data processors are separated by category.
- [`📂 lab`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/lab): TBD. Data analysis will be supported here.
- [`📂 tests`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/tests): Pytest files
- [`📂 utils`](https://github.com/UpstageAI/dataverse/tree/main/dataverse/utils): The Utilities module functions as a collection of internal helper tools. Its key features include API utilities that simplify interaction with various external APIs, including AWS EMR. Please be aware that another utils module is also included within the etl module.
</br>
# Design Philosophy
- [Principles for Configuration](#principles-for-configuration)
- [Principles for ETL Process](#principles-for-etl-process)
## Principles for Configuration
1. `One file` rules `ALL`
2. `10 Seconds` to know what is going on
#### 1. `One file` rules `ALL`
One cycle of ETL, Analyzer, etc., which we could call one job, is controlled by one configuration file. We are not going to composite one big configuration file out of multiple configuration files.
#### 2. `10 Seconds` to know what is going on
The reader should be able to know what is going on in the configuration file within 10 seconds. This is to make sure the configuration file is easy and small enough to read and understand.
## Principles for ETL Process
> When you create your own ETL process, you should follow the following principles
1. No `DRY` (Don't Repeat Yourself)
2. One file Only
#### 1. No `DRY` (Don't Repeat Yourself)
> No `DRY` is applied between **ETL sub-categories**.
- So if similar ETL processes are used in same sub-categories, it could be shared.
- But if it's used in different sub-categories, it should not be shared.
In the following example, the two ETL processes `common_process_a` and `common_process_b` look like good candidates for sharing. But as you can see, they are not shared; they are repeated. This is the No `DRY` principle at work.
```python
- deduplication/
- exact.py
- "def common_process_a():"
- "def common_process_b():"
- def deduplication___exact___a():
- exact_datasketch.py
- "def common_process_a():"
- "def common_process_b():"
- def deduplication___exact_datasketch___a():
- def deduplication___exact_datasketch___b():
```
#### 2. One file Only
Code that an ETL process uses should be in the same file; this is the `One file Only` principle. Except for the **ETL Base class, a few required utils functions, and open-source libraries**, there should be no dependency outside the file.
```python
# This is OK ✅
- deduplication/
- exact.py
- def helper_a():
- def helper_b():
- def etl_process():
helper_a()
helper_b()
# This is not allowed ❌
- deduplication/
- helper.py
- def helper_a():
- def helper_b():
- exact.py
from helper import helper_a
from helper import helper_b
- def etl_process():
helper_a()
helper_b()
```
An ETL process is meant to be used in various combinations within an ETL pipeline, **so try to make it as generic as possible.**
</br>
# Commit Guidelines
### Commit strategy
- Avoid mixing multiple, unrelated modifications in a single commit. One commit should relate to one issue.
- Each commit should encapsulate a complete, autonomous upgrade to the code.
### Commit messages
Please make sure your commit messages follow `type`: `title (#<related issue number>)` format. <br/>
For example:
```plain text
<TYPE>: Short summary with 72 characters or less (#<Issue number>)
If you have more detailed explanatory text, put it as the body.
But the body is optional.
```
- Find adequate type in the below list:
- `NEW`: introducing a new feature
- `ENHANCE`: improve an existing code/feature.
- `FIX`: fix a code bug
- `DOCS`: write/update/add any kind of documents including docstring
- `REFACTOR`: refactor existing code without any specific improvements
- `STYLE`: changes that do not affect the meaning of the code (ex. white-space, line length)
- `TEST`: add additional testing
- `DEL`: remove code or files
- `RELEASE`: release new version of dataverse
- `OTHER`: anything not covered above (not recommended)
- Use the present tense ("Add feature" not "Added feature")
- Do not end the subject line with punctuation
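Putting the format and type list together, a subject line can be checked mechanically. The sketch below is illustrative only (the regex and sample messages are our own, not an official hook):

```python
import re

# <TYPE>: short summary (#<issue number>)
SUBJECT = re.compile(
    r"^(NEW|ENHANCE|FIX|DOCS|REFACTOR|STYLE|TEST|DEL|RELEASE|OTHER): "
    r".{1,72} \(#\d+\)$"
)

assert SUBJECT.match("FIX: Handle missing config path (#42)")
assert not SUBJECT.match("fixed some stuff")  # no type, no issue number
```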
</br>
# Style Guides
### Pre-commit hook
We provide a pre-commit git hook for style check. You can find exact check list in this [file](https://github.com/UpstageAI/dataverse/blob/main/.pre-commit-config.yaml). <br/> Please run the code below before a commit is created:
```bash
pre-commit run
```
================================================
FILE: dataverse/README.md
================================================
# Dataverse
> The Universe of Data
## 🌌 Config
> Config for the Dataverse
## 🌌 API
> Interface of Dataverse for external use
## 🌌 ETL
> ETL pipeline (Extract, Transform, Load)
## 🌌 LAB
> Data Analysis & Visualization
## 🌌 Utils
> Common utilities used internally for Dataverse
================================================
FILE: dataverse/__init__.py
================================================
================================================
FILE: dataverse/api/README.md
================================================
# API (Application Programming Interface)
> Interface with ease and efficiency
================================================
FILE: dataverse/api/__init__.py
================================================
================================================
FILE: dataverse/api/cli.py
================================================
"""
main entry point for the dataverse CLI tool
"""
from dataverse.utils.setting import SystemSetting
def main():
"""Main entry point for the cli."""
print("🌌 Hello Welcome to Dataverse! 🌌")
print("=" * 50)
print("We are still under construction for CLI!")
print("=" * 50)
print("QUARK - By Ducky 🦆")
# set the system setting to CLI mode
SystemSetting().IS_CLI = True
================================================
FILE: dataverse/api/emr.py
================================================
"""
API to use AWS EMR with spark-submit
"""
import os
import argparse
import importlib.util
from dataverse.etl import ETLPipeline
def import_dynamic_etls():
"""
Import dynamic ETLs created by the user.
"""
dynamic_etl_path = "/home/hadoop/dataverse/dynamic_etl"
try:
files = os.listdir(dynamic_etl_path)
except FileNotFoundError:
return
except Exception as e:
raise e
# Filter out non-Python files
files = [f for f in files if f.endswith('.py')]
# Dynamically import all Python files in the directory
for file in files:
file_path = os.path.join(dynamic_etl_path, file)
# Remove .py at the end
module_name = file[:-3]
spec = importlib.util.spec_from_file_location(module_name, file_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
def main(config, verbose=False):
"""Main entry point for the aws emr."""
etl_pipeline = ETLPipeline()
import_dynamic_etls()
spark, data = etl_pipeline.run(config=config, verbose=verbose)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--config", help="config file path")
parser.add_argument("--verbose", action='store_true')
args = parser.parse_args()
main(args.config, args.verbose)
================================================
FILE: dataverse/config/README.md
================================================
# Configuration
> This directory contains configuration files for the Dataverse application
## 🌌 How to use
### 🌠 Load pre-built configuration
> you can load a pre-built configuration from a path, a dict, or an OmegaConf object
#### Load from local path
```python
from dataverse.config import Config
config = Config.load('path/to/config.yaml')
```
#### Load from AWS S3
> you need to set AWS credentials with `aws configure` to use this feature
```python
from dataverse.config import Config
config = Config.load('s3://path/to/config.yaml')
```
#### Load from dict
```python
config = Config.load({
"spark": {"appname": "README.md example"},
"etl": [
{"name": "...", "args": "..."},
{"name": "...", "args": "..."},
]
})
```
### 🌠 Set the empty args with `default` value
> the args you already set will not be changed to default
```python
from dataverse.config import Config
config = Config.load('path/to/config.yaml')
config = Config.set_default(config)
```
### 🌠 Get `Default` configuration
> `default` configuration has no `etl` pre-defined
```python
from dataverse.config import Config
config = Config.default()
```
## 🌌 About Configuration
### 🌠 Why configuration is just `OmegaConf`?
> To make it simple and easy to use. We are not going to inherit some other `base` class and make it complicated. Still, the `Config` interface is provided as a helper to load, save, set defaults, etc.
### 🌠 2 Rules for configuration
1. `One file` rules `ALL`
2. `10 Seconds` to know what is going on
#### `One file` rules `ALL`
One cycle of ETL, Analyzer, etc., which we could call one job, is controlled by one configuration file. We are not going to composite one big configuration file out of multiple configuration files.
#### `10 Seconds` to know what is going on
The reader should be able to know what is going on in the configuration file within 10 seconds. This is to make sure the configuration file is easy and small enough to read and understand.
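A minimal illustration of both rules, sketched with a plain Python dict (the block names come from the sample configs in this repository): the whole job, Spark settings plus the ordered ETL blocks, fits in one small structure a reader can scan in seconds.

```python
config = {
    "spark": {"appname": "one_file_demo", "driver": {"memory": "4G"}},
    "etl": [
        {"name": "data_ingestion___test___generate_fake_ufl"},
        {"name": "utils___sampling___random",
         "args": {"sample_n_or_frac": 0.1}},
    ],
}

# one file describes the entire job: what runs, in what order, with what args
for step in config["etl"]:
    print(step["name"])
```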
### 🌠 What open source to choose for configuration?
> **`omegaconf`**
- `OmegaConf`
- Easy to understand & use
- OmegaConf supports yaml, dict, json, and even Python `dataclass`es
- `hydra`
- hydra was also a candidate, but to keep things simple we chose OmegaConf
- hydra composes one big configuration file from multiple configuration files
- also, many people find that understanding hydra itself takes quite some time
================================================
FILE: dataverse/config/__init__.py
================================================
from .interface import Config
================================================
FILE: dataverse/config/interface.py
================================================
"""
Interface to check & load the configurations for installation environment
awesome_config = Config.load("/path/to/ducky_awesome_config.yaml")
awesome_config = Config.load({awesome: config})
"""
import re
import boto3
from pathlib import Path
from typing import Union
from omegaconf import OmegaConf
from omegaconf import DictConfig
from dataverse.utils.setting import SystemSetting
from dataverse.utils.api import aws_s3_read
from dataverse.utils.api import aws_s3_write
class Config:
"""
Interface to check & load the configurations
This class provides a lightweight wrapper for OmegaConf and allows checking and loading configurations.
It supports loading configurations from various sources such as files, AWS S3, and config strings.
The class also provides methods for saving configurations and setting default values for missing config arguments.
"""
def __new__(cls, *args, **kwargs):
raise NotImplementedError("Config is not allowed to be instantiated")
@classmethod
def load(cls, config: Union[str, dict, DictConfig, OmegaConf, Path]):
"""
Load the configuration for the etl.
Args:
config (Union[str, dict, OmegaConf]): The configuration for the etl.
- str or Path: This could have several cases:
- Path to the config file.
- S3 path to the config file.
- Config string. This is similar to loading a `yaml` file with `open()`.
- dict: Config dictionary.
- OmegaConf: Config object.
Returns:
The loaded configuration.
Raises:
ValueError: If the provided config is not a valid path or S3 path.
TypeError: If the provided config is not of type str, dict, or OmegaConf.
"""
if isinstance(config, (str, Path)):
if isinstance(config, Path):
config = str(config)
# Local File
if Path(config).is_file():
config = OmegaConf.load(config)
# AWS S3
elif config.startswith(('s3://', 's3a://', 's3n://')):
aws_s3_matched = re.match(r's3[an]?://([^/]+)/(.*)', config)
if aws_s3_matched:
bucket, key = aws_s3_matched.groups()
config_content = aws_s3_read(bucket, key)
config = OmegaConf.create(config_content)
else:
# Assume it's a config string that starts with s3
config_str = config
config = OmegaConf.create(config_str)
# Check if it's a config string or not
# In case of a config string, it should create a config object
# If not, it will create {'config': None}
if config_str in config and config[config_str] is None:
raise ValueError(f"config {config_str} is not a valid s3 path")
# String Config
else:
# Assume it's a config string
config_str = config
config = OmegaConf.create(config_str)
# Same as above, check if it's a config string or not
if config_str in config and config[config_str] is None:
raise ValueError(f"config {config_str} is not a valid path")
elif isinstance(config, dict):
config = OmegaConf.create(config)
elif isinstance(config, (OmegaConf, DictConfig)):
pass
else:
raise TypeError(f"config should be str, dict, or OmegaConf but got {type(config)}")
return config
@classmethod
def save(cls, config, path: Union[str, Path]):
"""
Saves the configuration to a specified path.
Args:
config: The configuration to be saved.
path (Union[str, Path]): The path where the configuration should be saved.
Raises:
ValueError: If the provided path is not a valid S3 path.
"""
if path.startswith(('s3://', 's3a://', 's3n://')):
aws_s3_matched = re.match(r's3[an]?://([^/]+)/(.*)', path)
if aws_s3_matched:
bucket, key = aws_s3_matched.groups()
aws_s3_write(bucket, key, config)
else:
raise ValueError(f"config path {path} is not a valid s3 path")
else:
OmegaConf.save(config, Path(path))
@classmethod
def default(cls, emr: bool = False):
"""
Fill the missing config with default values.
Args:
emr (bool, optional): Flag indicating whether the config is for EMR. Defaults to False.
Returns:
dict: Default configuration dictionary.
"""
local_dir = f"{SystemSetting().CACHE_DIR}/.cache/dataverse/tmp"
default = OmegaConf.create({
'spark': {
'master': 'local[10]',
'appname': 'default',
'driver': {
'memory': '8G',
'maxResultSize': '2G',
},
'executor': {'memory': '1G'},
'local': {'dir': local_dir},
'ui': {'port': 4040},
},
'etl': [],
})
if emr:
default.update({
'emr': {
'id': None,
'working_dir': None,
'name': 'dataverse_emr',
'release': 'emr-6.15.0',
'idle_timeout': 3600,
# master (driver)
'master_instance': {
'type': None,
},
# core (data node)
'core_instance': {
'type': None,
'count': 2,
},
# task (executors)
'task_instance': {
'type': None,
'count': 0,
},
# EMR cluster created by dataverse or user
'auto_generated': None,
# iam
'role': {
'ec2': {
'name': None,
'policy_arns': None,
},
'emr': {
'name': None,
'policy_arns': None,
}
},
'instance_profile': {
'name': None,
'ec2_role': None,
},
# TODO: allow more options to customize e.g. cidr, tag, etc.
# but make sure vpc is temporary and not shared
'vpc': {
'id': None,
},
'subnet': {
'id': None,
'public_id': None,
'private_id': None,
'public': True,
},
'security_group': {
'id': None,
},
'gateway': {
'id': None,
},
'route_table': {
'id': None,
},
'elastic_ip': {
'id': None,
},
'nat_gateway': {
'id': None,
},
}
})
return default
@classmethod
def set_default(cls, config: OmegaConf, emr: bool = False):
"""
Sets the missing config arguments with default values.
Args:
config (OmegaConf): The configuration object to merge with default values.
emr (bool, optional): Whether to use EMR configuration. Defaults to False.
Returns:
OmegaConf: The merged configuration object.
"""
return OmegaConf.merge(cls.default(emr=emr), config)
================================================
FILE: dataverse/etl/README.md
================================================
# ETL (Extract, Transform, Load)
> Dataverse ETL is "Block-based coding powered by Spark"
- Each block is called `ETL process`
- Combination of ETL processes is called `ETL pipeline`
- ETL pipeline is managed by `config` file
## 🌌 What is ETL process?
> An ETL process is a small code snippet that is considered a single unit of an ETL pipeline. It is meant to form various combinations to accommodate different kinds of data sources and transformations in an ETL pipeline, so it should be as generic as possible.
```python
def ETL_process(data, config):
return data
```
## 🌌 What is ETL pipeline?
> ETL pipeline is the sequence of ETL processes.
```python
data = ETL_process_1()
data = ETL_process_2(data)
data = ETL_process_3(data)
```
## 🌌 How to run ETL Pipeline?
> Define the ETL process, and add in the config file to run the ETL pipeline.
```python
from dataverse.etl import ETLPipeline
from dataverse.config import Config
# 1. Define the ETL process in the config file
config = Config.load("TBD")
config = Config.set_default(config)
# 2. Run the ETL pipeline
etl_pipeline = ETLPipeline()
spark, data = etl_pipeline.run(config)
```
### 🌠 What is returned after running ETL pipeline?
> `spark` and `data` are returned after running the ETL pipeline
- `spark` - spark session
- `data` - data after running ETL pipeline
#### `spark` status depends on the last ETL process
- `data_load` ETL process at the end
- spark will be terminated
- otherwise
- spark will be alive
- you can use `spark` to do whatever you want
## 🌌 How to add new ETL process?
> ETL is managed by a registry. Whatever ETL you make, you need to register it to the registry.
### 🌠 Choose what `Category` & `Sub-Category` to put your ETL process
> First you need to check the category and sub-category of the ETL process you want to add.
```python
======================================
- etl/
- CATEGORY/
- __init__.py
- SUBCATEGORY.py
- def CATEGORY___SUBCATEGORY___ETL_PROCESS()
======================================
```
- `category` is the folder. This is pre-defined and you can add a new category if needed. **Check below to learn more about category**
- `sub-category` is the python file. This is not pre-defined and you have to decide which name could be appropriate for the ETL process you want to add.
Now that you know the `category` and `sub-category`, you can add a new ETL process.
There is only one way to add a new ETL process:
### 🌠 Use decorator `@register_etl` to register your ETL `function`
```python
# check the __sample/ folder for example
from dataverse.etl import register_etl
@register_etl
def category___subcategory___etl(rdd, config):
# do something
return rdd
```
#### ☣️ Inheriting `BaseETL` is NOT ALLOWED ☣️
```python
from dataverse.etl import BaseETL
class category___subcategory___etl(BaseETL):
def run(rdd, config):
# do something
return rdd
```
### 🌠 ETL Process Class Naming Convention
> This shares the same documentation as the README.md in the `__sample/` folder
<details>
```python
[ETL Category]___[ETL Sub-Category]___[ETL Name]
======================================
- "__sample/"
- github.py
- def __sample___github___remove_url()
- def __sample___github___filter_by_stars()
- "bias/"
- mmlu.py
- def bias___mmlu___remove_word()
- def bias___mmlu___to_parquet()
- ducky.py
- def bias___ducky___fly()
- def bias___ducky___quark()
======================================
```
> caveat: the combination of `[ETL Category]___[ETL Sub-Category]___[ETL Name]` MUST be unique
1. `[ETL Category]` is the folder and category where the ETL is defined
- `[ETL Category]` MUST be one of the following pre-defined list
- `cleaning`
- `decontamination`
- `deduplication`
- `data_ingestion`
- `pii`
- `quality`
- `toxicity`
- `bias`
- `data_load`
- `utils`
2. `[ETL Sub-Category]` is the name of the file where the ETL is defined
- no pre-defined list
- it could be a dataset name
- or a nickname of yours
- or whatever you think it's appropriate
- e.g. `github` or `kaggle` or `mmlu` whatever you want
3. `[ETL Name]` should follow the `function` naming convention, even if it's a `class`
- all lower case
- use underscore `_` to separate words
4. Each is separated by `___` (triple underscore)
- e.g. `bias___mmlu___remove_word()`
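The triple underscore makes the three parts recoverable even when a sub-category name itself contains single underscores; a quick illustration in plain Python, using a name from the examples above:

```python
full_name = "bias___mmlu___remove_word"

# the triple underscore cleanly separates category / sub-category / name
category, subcategory, etl_name = full_name.split("___")
print(category, subcategory, etl_name)  # → bias mmlu remove_word
```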
#### Why are the folder and file names included in the ETL class name?
- To avoid temporary names arising from dynamic construction of the ETL class
- e.g. `tmp___ipykernel_181248___remove_url` <- jupyter notebook env
- e.g. `python3.10___abc___remove_url` <- dynamic class construction by `type`
- So we decided to control the namespace solely by the `ETL class name`, which includes the folder and file names
</details>
## 🌌 Principles for ETL Process
> When you create your own ETL process, you should follow the following principles
1. No `DRY` (Don't Repeat Yourself)
2. One file Only
### 🌠 No `DRY` (Don't Repeat Yourself)
> No `DRY` is applied between **ETL sub-categories**.
- So if similar ETL processes are used in same sub-categories, it could be shared.
- But if it's used in different sub-categories, it should not be shared.
In the following example, the two ETL processes `common_process_a` and `common_process_b` look like good candidates for sharing. But as you can see, they are not shared; they are repeated. This is the No `DRY` principle at work.
```python
- deduplication/
- exact.py
- "def common_process_a():"
- "def common_process_b():"
- def deduplication___exact___a():
- exact_datasketch.py
- "def common_process_a():"
- "def common_process_b():"
- def deduplication___exact_datasketch___a():
- def deduplication___exact_datasketch___b():
```
### 🌠 One file Only
Code that an ETL process uses should be in the same file; this is the `One file Only` principle. Except for the **ETL Base class, a few required utils functions, and open-source libraries**, there should be no dependency outside the file.
```python
# This is OK ✅
- deduplication/
- exact.py
- def helper_a():
- def helper_b():
- def etl_process():
helper_a()
helper_b()
# This is not allowed ❌
- deduplication/
- helper.py
- def helper_a():
- def helper_b():
- exact.py
from helper import helper_a
from helper import helper_b
- def etl_process():
helper_a()
helper_b()
```
An ETL process is meant to be used in various combinations within an ETL pipeline, **so try to make it as generic as possible.** 😊
## 🌌 How to use ETL Process by Configuration
> Now let's learn how to use ETL process by configuration
### 🌠 Register ETL process
> This is same as above. Register ETL process using `@register_etl` decorator
```python
from dataverse.etl import register_etl
@register_etl
def etl_process_start(spark, load_path, repartition=3):
data = spark.read.load(load_path).repartition(repartition)
return data
@register_etl
def etl_process_middle(data, threshold=0.5):
data = data.filter(data['stars'] > threshold)
return data
@register_etl
def etl_process_end(data, save_path, repartition=1):
data.repartition(repartition).write.save(save_path)
return None
```
### 🌠 Define ETL process in the config file
You can use the following config to run the above ETL processes in order
- `etl_process_start` -> `etl_process_middle` -> `etl_process_end`
```yaml
spark:
appname: dataverse_etl_sample
driver:
memory: 4g
etl:
- name: etl_process_start
args:
load_path: ./sample/raw.parquet
repartition: 3
- name: etl_process_middle
args:
threshold: 0.5
- name: etl_process_end
args:
save_path: ./sample/ufl.parquet
repartition: 1
```
**Check the following real example for more details**
- Config located at `dataverse/config/etl/sample/ETL___one_cycle.yaml`
```yaml
spark:
appname: dataverse_etl_sample
driver:
memory: 16g
etl:
- name: data_ingestion___test___generate_fake_ufl
- name: utils___sampling___random
args:
sample_n_or_frac: 0.1
- name: deduplication___minhash___lsh_jaccard
- name: data_load___huggingface___ufl2hf_obj
```
## 🌌 How to add a new ETL Category
### 🌠 Add a new folder to `etl/` folder
```python
======================================
- etl/
- YOUR_NEW_CATEGORY/
- __init__.py
- YOUR_NEW_SUBCATEGORY.py
- data_ingestion/
...
======================================
```
### 🌠 Add a new category to `ETL_CATEGORIES` in `registry.py`
> Only registered categories are recognized by the ETL pipeline
```python
ETL_CATEGORIES = [
'YOUR_NEW_CATEGORY',
'data_ingestion',
'decontamination',
'deduplication',
'bias',
'toxicity',
'cleaning',
'pii',
'quality',
'data_load',
'utils',
]
```
### 🌠 Pre-defined ETL Categories
```python
======================================
- etl/
- "__sample/"
- This is to show how to use the etl package
- "data_ingestion/"
- converting data from one format, schema to another
- "data_load/"
- saving data to desired location
- "quality/"
- improving data quality
- e.g. removing data with low quality
- "cleaning/"
- cleaning data
- e.g. removing HTML tags from text
- e.g. data normalization
- "decontamination/"
- removing contamination from data
- e.g. removing benchmark data from data
- "deduplication/"
- removing duplication inside data
- "pii/"
- removing PII from data
- "bias/"
- removing bias from data
- e.g. removing data with gender bias words
- "toxicity/"
- removing toxic data
- e.g. removing data with toxic words
- "utils/"
- utilities for the ETL process
- e.g. sampling, logging, error handling, etc
======================================
```
## 🌌 How to Ignore a specific ETL Sub-Category
> If you want the registry to skip certain `ETL sub-category` python files, for example a file kept purely for storage purposes, add the file name to `ETL_IGNORE` in `registry.py`
```python
ETL_IGNORE = [
'__init__.py',
'storage.py'
]
```
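To illustrate, a hypothetical sketch of how such a filter could be applied while scanning a category folder (Dataverse's actual discovery code may differ):

```python
# Hypothetical sketch, not Dataverse's actual discovery code: skip ignored
# files while collecting sub-category .py files for registration.
ETL_IGNORE = [
    '__init__.py',
    'storage.py'
]

def discover(filenames, ignore=ETL_IGNORE):
    """Return the sub-category .py files that should be registered."""
    return [f for f in filenames if f.endswith(".py") and f not in ignore]

files = ["__init__.py", "exact.py", "minhash.py", "storage.py", "README.md"]
print(discover(files))  # only exact.py and minhash.py survive
```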
================================================
FILE: dataverse/etl/__init__.py
================================================
from .registry import ETLRegistry
from .registry import register_etl
from .registry import BaseETL
from .pipeline import ETLPipeline
================================================
FILE: dataverse/etl/__sample/README.md
================================================
# Sample
> This is a showcase
================================================
FILE: dataverse/etl/__sample/__init__.py
================================================
================================================
FILE: dataverse/etl/__sample/ducky.py
================================================
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl import register_etl
from typing import Union
@register_etl
def __sample___ducky___make_your_own_etl_processor(data: Union[RDD, DataFrame], *args, **kwargs):
"""
decorator will convert this function to BaseETL class
"""
print("make_your_own_etl_processor")
return data
================================================
FILE: dataverse/etl/__sample/github.py
================================================
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl import BaseETL
from dataverse.etl import register_etl
from dataverse.etl import ETLRegistry
from dataverse.etl.registry import ETLStructure
from typing import Union
@register_etl
def __sample___github___using_decorator(data: Union[RDD, DataFrame], *args, **kwargs):
"""
decorator will convert this function to BaseETL class
"""
print("sample using decorator")
return data
@register_etl
def __sample___github___config(data: Union[RDD, DataFrame], config: dict = None, *args, **kwargs):
"""
decorator will convert this function to BaseETL class
"""
print("config says", config)
return data
if __name__ == "__main__":
registry = ETLRegistry()
print("[ Testing ] registry etl using decorator")
# this could seem like a function but it is actually a BaseETL class
etl = __sample___github___using_decorator
etl()(data=None)
print("is subclass of ETLStructure?", issubclass(etl, ETLStructure), "\n")
print("[ Testing ] registry etl using decorator with config")
etl = __sample___github___config
etl()(data=None, config={"hello": "world"})
print("is subclass of ETLStructure?", issubclass(etl, ETLStructure), "\n")
# check if it is properly registered
print("[ Testing ] check if it is properly registered")
print("="*50)
print(registry._registry)
print("="*50)
================================================
FILE: dataverse/etl/bias/README.md
================================================
================================================
FILE: dataverse/etl/bias/__init__.py
================================================
================================================
FILE: dataverse/etl/cleaning/README.md
================================================
# Cleaning
> Data normalization, removing noise, and other data cleaning tasks.
## 🌌 Naming Convention
> This is a strong recommendation. You can use your own naming convention if you want.
```python
def cleaning___[ETL Sub-Category]___[ETL Process]()
```
- `ETL Sub-Category` - the unit of text to handle
- e.g. unicode
- e.g. char
- e.g. word
- e.g. number
- `ETL process name` - purpose of the ETL process
- e.g. remove
- e.g. filter
- e.g. normalize
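Because of the triple-underscore convention, a registered name can be split back into its parts. A hypothetical helper (not part of Dataverse) to show the idea:

```python
# The triple underscore `___` separates category, sub-category, and process,
# so a registered name is machine-parsable.
def parse_etl_name(name):
    category, subcategory, process = name.split("___")
    return {"category": category, "subcategory": subcategory, "process": process}

print(parse_etl_name("cleaning___unicode___normalize"))
```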
================================================
FILE: dataverse/etl/cleaning/__init__.py
================================================
================================================
FILE: dataverse/etl/cleaning/char.py
================================================
"""
A collection of modules for cleaning data at the character level.
For example: whitespace, accent characters, and unprintable characters.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import re
import unicodedata
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
@register_etl
def cleaning___char___normalize_whitespace(
spark, data: Union[RDD, DataFrame], subset: str = "text", *args, **kwargs
) -> RDD:
r"""
Normalize whitespace.
- Strips the leading and trailing whitespaces.
- Replaces all consecutive whitespaces with a single space,
excluding ``\n`` and ``\r`` characters.
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed.
subset (str): A subset or column to consider. Defaults to 'text'.
Returns:
RDD: The processed data with normalized whitespace.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
pattern = re.compile(r"[^\S\r\n]+")
def _normalize_whitespace(row):
row[subset] = re.sub(pattern, " ", row[subset].strip())
return row
data = data.map(_normalize_whitespace)
return data
@register_etl
def cleaning___char___remove_unprintable(
spark, data: Union[RDD, DataFrame], subset="text", *args, **kwargs
) -> RDD:
"""
Remove all the non-printable characters.
Code is from facebookresearch/cc_net
https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed.
subset (str): A subset or column to consider. Defaults to 'text'.
Returns:
RDD: The processed data with unprintable characters are removed.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _remove_non_printable_char(row):
new_lines = []
for line in row[subset].split("\n"):
new_lines.append(
re.sub(f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]", "", line)
)
row[subset] = "\n".join(new_lines)
return row
data = data.map(_remove_non_printable_char)
return data
def strip_accents(text: str) -> str:
"""Strips accents from a piece of text."""
nfd = unicodedata.normalize("NFD", text)
output = [c for c in nfd if unicodedata.category(c) != "Mn"]
if len(output) == len(text):
return text
return "".join(output)
@register_etl
def cleaning___char___remove_accent(
spark, data: Union[RDD, DataFrame], subset: str = "text", *args, **kwargs
) -> RDD:
"""Strips accents from a piece of text.
+--------+--------+
| input | output |
+========+========+
| café | cafe |
| résumé | resume |
+--------+--------+
Code is from facebookresearch/cc_net
https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed.
subset (str): A subset or column to consider. Defaults to 'text'.
Returns:
The processed data with accents removed.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _strip_accents(row):
row[subset] = strip_accents(row[subset])
return row
data = data.map(_strip_accents)
return data
================================================
FILE: dataverse/etl/cleaning/document.py
================================================
"""
A collection of modules for cleaning data at the document level.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
@register_etl
def cleaning___document___split_by_word(
spark,
data: Union[RDD, DataFrame],
subset: str = "text",
word_per_chunk: int = 100,
delimiter: str = " ",
*args,
**kwargs
) -> RDD:
"""
Split documents into smaller chunks by word.
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
word_per_chunk (int, optional): Number of words per chunk. Defaults to 100.
delimiter (str, optional): Delimiter to split the text. Defaults to " ".
Returns:
RDD: The processed data with documents split into smaller chunks.
Raises:
ValueError: If word_per_chunk is not a positive integer.
Examples:
- word_per_chunk = 2
- delimiter = " "
- input
+-----------------------------+
| text |
+=============================+
| "hello world, how are you?" |
+-----------------------------+
- output
+----------------+
| text |
+================+
| "hello world," |
+----------------+
| "how are" |
+----------------+
| "you?" |
+----------------+
Caveats:
- NO normalization is done here!
- This doesn't consider the whitespace normalization.
- Recommend using other normalization before this.
- All the keys from the original row are copied to all the new rows created.
- ``id`` is not unique anymore.
- Make sure ``id`` is assigned after this step.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _split_by_word(row):
words = row[subset].split(delimiter)
# Create chunks
chunks = []
for i in range(0, len(words), word_per_chunk):
chunks.append(delimiter.join(words[i : i + word_per_chunk]))
# Create a new dictionary for each chunk with all the keys from the original row
return [{**row, subset: chunk} for chunk in chunks]
data = data.flatMap(_split_by_word)
return data
================================================
FILE: dataverse/etl/cleaning/html.py
================================================
"""
A collection of modules for cleaning data includes html.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import Union
import html2text
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
@register_etl
def cleaning___html___extract_plain_text(
spark,
data: Union[RDD, DataFrame],
subset: str = "text",
use_trafilatura: bool = False,
*args,
**kwargs
) -> RDD:
r"""
Extracts plain text from HTML.
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
use_trafilatura (bool, optional): Whether to use trafilatura instead of html2text. Defaults to False.
Returns:
The plain data extracted from html.
Caveats:
- ``html2text`` adds a double newline after each paragraph, which is not handled at this point.
- The option to use ``trafilatura`` is provided, but it defaults to ``False`` because extracting plain text with ``trafilatura`` does not work well in some cases.
- [OK] Case::
text = "<body><h1>My First Heading</h1><p>My first paragraph.</p></body>"
# html2text
print(html2text.html2text(text))
>>> '# My First Heading\n\nMy first paragraph.\n\n'
# trafilatura
print(trafilatura.html2txt(text))
>>> 'My First HeadingMy first paragraph.'
- [ERROR] Case (trafilatura removes all the text)::
text = "<p>hello <br> nice to meet you.</p>"
# html2text
print(html2text.html2text(text))
>>> 'hello \nnice to meet you.\n\n'
# trafilatura
print(trafilatura.html2txt(text))
>>> ''
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
# this is optional
if use_trafilatura:
import trafilatura
def _html2txt(row):
row[subset] = trafilatura.html2txt(row[subset])
return row
else:
def _html2txt(row):
row[subset] = html2text.html2text(row[subset])
return row
data = data.map(_html2txt)
return data
================================================
FILE: dataverse/etl/cleaning/korean.py
================================================
"""
This is only for Korean text data.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import re
from enum import IntEnum
from typing import List, Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
class KoreanType(IntEnum):
JAUM = 0
MOUM = 1
COMPLETE = 2
ELSE = -1
KOR_BEGIN = 44032
KOR_END = 55203
CHOSUNG_BASE = 588
JUNGSUNG_BASE = 28
JAUM_BEGIN = 12593
JAUM_END = 12622
MOUM_BEGIN = 12623
MOUM_END = 12643
# fmt: off
CHOSUNG = ["ㄱ", "ㄲ", "ㄴ", "ㄷ", "ㄸ", "ㄹ", "ㅁ", "ㅂ", "ㅃ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅉ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]
JUNGSUNG = ["ㅏ", "ㅐ", "ㅑ", "ㅒ", "ㅓ", "ㅔ", "ㅕ", "ㅖ", "ㅗ", "ㅘ", "ㅙ", "ㅚ", "ㅛ", "ㅜ", "ㅝ", "ㅞ", "ㅟ", "ㅠ", "ㅡ", "ㅢ", "ㅣ"]
JONGSUNG = [" ", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]
JAUM = ["ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄸ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅃ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅉ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]
MOUM = ["ㅏ", "ㅐ", "ㅑ", "ㅒ", "ㅓ", "ㅔ", "ㅕ", "ㅖ", "ㅗ", "ㅘ", "ㅙ", "ㅚ", "ㅛ", "ㅜ", "ㅝ", "ㅞ", "ㅟ", "ㅠ", "ㅡ", "ㅢ", "ㅣ"]
# fmt: on
def character_is_korean(c):
i = ord(c)
return (
(KOR_BEGIN <= i <= KOR_END)
or (JAUM_BEGIN <= i <= JAUM_END)
or (MOUM_BEGIN <= i <= MOUM_END)
)
def decompose(c):
if not character_is_korean(c):
return None
i = ord(c)
if JAUM_BEGIN <= i <= JAUM_END:
return c, " ", " "
if MOUM_BEGIN <= i <= MOUM_END:
return " ", c, " "
i -= KOR_BEGIN
cho = i // CHOSUNG_BASE
jung = (i - cho * CHOSUNG_BASE) // JUNGSUNG_BASE
jong = i - cho * CHOSUNG_BASE - jung * JUNGSUNG_BASE
return CHOSUNG[cho], JUNGSUNG[jung], JONGSUNG[jong]
def compose(chosung, jungsung, jongsung):
unicode = KOR_BEGIN
unicode += CHOSUNG_BASE * CHOSUNG.index(chosung)
unicode += JUNGSUNG_BASE * JUNGSUNG.index(jungsung)
unicode += JONGSUNG.index(jongsung)
return chr(unicode)
@register_etl
def cleaning___korean___filter_by_ratio(
spark,
data: Union[RDD, DataFrame],
subset: str = "text",
filter_type: str = "word",
korean_ratio: float = 0.5,
*args,
**kwargs,
) -> RDD:
"""
Filters out text whose ratio of Korean content is less than `korean_ratio`, excluding spaces.
Code is from eleutherAI/dps and was modified
https://github.com/EleutherAI/dps/blob/master/dps/spark/prep/korean_prep.py#L52
Args:
spark (SparkSession): The Spark session object.
data(Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.
subset(str, optional): A subset or column to consider. Defaults to 'text'.
filter_type(str, optional): The type of filtering to be applied. Can be 'char' or 'word'. Defaults to 'word'.
korean_ratio(float, optional) : The minimum ratio of Korean characters or words required for a text to survive the filtering. Defaults to 0.5.
Returns:
The data filtered by its Korean ratio.
Raises:
ValueError: If the filter_type is not 'char' or 'word', or if the korean_ratio is not between 0 and 1.
Examples:
With korean_ratio = 0.5
+------------------------------------------------+
| text |
+================================================+
| "한국어가 포함 비율이 50% 이상인 경우만 남김" |
+------------------------------------------------+
- filter_type = 'char' -> [survive!]
- Korean characters: 17
- Non-Korean characters: 3
- Total characters: 20
- Korean character ratio: 17 / 20 > 0.5 -> True
- filter_type = 'word' -> [survive!]
- Korean words: 6
- Non-Korean words: 1
- Total words: 7
- Korean word ratio: 6 / 7 > 0.5 -> True
+------------------------------------------------+
| text |
+================================================+
| "korean including 비율이 50% 미만인 경우 제거" |
+------------------------------------------------+
- filter_type = 'char' -> [remove!]
- Korean characters: 10
- Non-Korean characters: 18
- Total characters: 28
- Korean character ratio: 10 / 28 > 0.5 -> False
- filter_type = 'word' -> [survive!]
- Korean words: 4
- Non-Korean words: 3
- Total words: 7
- Korean word ratio: 4 / 7 > 0.5 -> True
Note:
- The regex to count Korean characters doesn't work properly on characters that are not words.
- e.g. 안녕"하세요 is counted as 2 Korean words - ["안녕", "하세요"]
"""
assert filter_type in [
"char",
"word",
], f"filter_type should be either `char` or `word` but got {filter_type}"
assert (
0.0 <= korean_ratio <= 1.0
), f"korean_ratio should be between 0. ~ 1. but got {korean_ratio}"
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _korean_ratio_filter(row):
if row[subset] is None or len(row[subset]) == 0:
return False
if filter_type == "char":
korean_counts = len(re.findall("[ㄱ-힣]", row[subset]))
all_counts = len(re.sub("[ \r\n\t\f\v]", "", row[subset]))
if filter_type == "word":
korean_counts = len(re.findall(r"\b[\w]*[ㄱ-힣][\w]*\b", row[subset]))
all_counts = len(re.findall(r"\b\w+\b", row[subset]))
if all_counts == 0:
return False
return (korean_counts / all_counts) >= korean_ratio
data = data.filter(_korean_ratio_filter)
return data
def classify_korean_type(unicode):
if JAUM_BEGIN <= unicode <= JAUM_END:
return KoreanType.JAUM
elif MOUM_BEGIN <= unicode <= MOUM_END:
return KoreanType.MOUM
elif KOR_BEGIN <= unicode <= KOR_END:
return KoreanType.COMPLETE
else:
return KoreanType.ELSE
def reduce_repeated_emotions(text, num_repeats=2):
if num_repeats > 0:
repeat_chars_pattern = re.compile(r"(\w)\1{2,}")
text = repeat_chars_pattern.sub("\\1" * num_repeats, text)
return text
@register_etl
def cleaning___korean___reduce_emoticon(
spark,
data: Union[RDD, DataFrame],
subset: Union[str, List[str]] = "text",
num_repeats: int = 2,
*args,
**kwargs,
) -> RDD:
"""
Reduces emoticon Korean characters.
It performs the following steps:
1. Splits complete Korean characters into individual characters, preserving only the previous jaum and next moum.
- e.g. (remain) ㅋㅋ킄ㅋㅋㅋ -> ㅋㅋ킄ㅋㅋㅋ
- e.g. (split) ㅋㅋ쿠ㅜㅜㅜ -> ㅋㅋㅋㅜㅜㅜㅜ
2. Reduces repeating Korean characters.
- e.g. ㅋㅋㅋㅋㅋ -> ㅋㅋ
Args:
spark(SparkSession): The Spark session object.
data(Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.
subset(str, optional): A subset or columns to consider. Defaults to 'text'.
num_repeats(int, optional): The number of repeating characters to reduce. Defaults to 2.
Returns:
RDD: The processed data with reduced emoticon Korean characters.
Note:
**[ potential risk of splitting complete korean character ]**
splitting complete Korean characters into individual characters carries high risk,
so only one case is kept: a `complete Korean character between a jaum and a moum`.
Other cases were implemented as well but, due to the risk, were removed.
References:
- `soynlp normalizer.py <https://github.com/lovit/soynlp/blob/master/soynlp/normalizer/_normalizer.py>`_
- `dps korean_prep.py <https://github.com/EleutherAI/dps/blob/master/dps/spark/prep/korean_prep.py>`_
"""
def _reduce_korean_emotion(row):
text = row[subset]
if not text:
return row
korean_types = [classify_korean_type(ord(c)) for c in text]
last_idx = len(korean_types) - 1
normalized_text = []
for i, (korean_type, c) in enumerate(zip(korean_types, text)):
# when complete korean character is between jaum and moum
if (0 < i < last_idx) and (
korean_types[i - 1] == KoreanType.JAUM
and korean_type == KoreanType.COMPLETE
and korean_types[i + 1] == KoreanType.MOUM
):
cho, jung, jong = decompose(c)
# case 1. when complete kor char is combination of prev jaum and next moum
# e.g. ㅋ(쿠)ㅜ -> ㅋ(ㅋㅜ)ㅜ
if cho == text[i - 1] and jung == text[i + 1] and jong == " ":
normalized_text.append(cho)
normalized_text.append(jung)
# case 2. otherwise, just leave it
# e.g. ㅋ(쿵)ㅜ -> ㅋ(쿵)ㅜ
else:
normalized_text.append(c)
else:
normalized_text.append(c)
row[subset] = reduce_repeated_emotions("".join(normalized_text), num_repeats)
return row
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
data = data.map(_reduce_korean_emotion)
return data
================================================
FILE: dataverse/etl/cleaning/length.py
================================================
"""
Filtering based on length.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
@register_etl
def cleaning___length___char_len_filter(
spark,
data: Union[RDD, DataFrame],
subset: str = "text",
min_len: int = None,
max_len: int = None,
*args,
**kwargs
) -> RDD:
"""
Filters the data by character length.
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
min_len (int, optional): The minimum length of characters to filter. If None, there is no minimum length.
max_len (int, optional): The maximum length of characters to filter. If None, there is no maximum length.
Returns:
The filtered data as an RDD.
Raises:
ValueError: If both min_len and max_len are None.
Note:
- min_len <= len <= max_len
- min_len and max_len can not be None at the same time.
- If min_len is None, then only the maximum length is considered.
- If max_len is None, then only the minimum length is considered.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
assert (
min_len is not None or max_len is not None
), "min_len and max_len cannot be None at the same time"
if min_len is not None and max_len is not None:
data = data.filter(lambda row: min_len <= len(row[subset]) <= max_len)
elif min_len is None:
data = data.filter(lambda row: len(row[subset]) <= max_len)
elif max_len is None:
data = data.filter(lambda row: min_len <= len(row[subset]))
return data
@register_etl
def cleaning___length___word_len_filter(
spark,
data: Union[RDD, DataFrame],
subset="text",
min_len: int = None,
max_len: int = None,
*args,
**kwargs
):
"""
Filters the data by word length.
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
min_len (int, optional): The minimum number of words to keep. If None, there is no minimum length.
max_len (int, optional): The maximum number of words to keep. If None, there is no maximum length.
Returns:
The filtered data as an RDD.
Note:
- min_len <= len <= max_len
- min_len and max_len can not be None at the same time.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
assert (
min_len is not None or max_len is not None
), "min_len and max_len cannot be None at the same time"
if min_len is not None and max_len is not None:
data = data.filter(lambda row: min_len <= len(row[subset].split()) <= max_len)
elif min_len is None:
data = data.filter(lambda row: len(row[subset].split()) <= max_len)
elif max_len is None:
data = data.filter(lambda row: min_len <= len(row[subset].split()))
return data
================================================
FILE: dataverse/etl/cleaning/number.py
================================================
"""
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import re
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
@register_etl
def cleaning___number___normalize(
spark,
data: Union[RDD, DataFrame],
subset: str = "text",
assign_number: int = 0,
*args,
**kwargs,
) -> RDD:
"""
Convert all the number to assigned number (e.g. 0)
Code is from facebookresearch/cc_net
https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py
Examples:
- input
+----------+
| text |
+==========+
| 1234|
| 1234.5678|
+----------+
- output
+----------+
| text |
+==========+
| 0000|
| 0000.0000|
+----------+
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
assign_number (int, optional): The number to assign. Default is 0.
Returns:
The normalized data.
Raises:
AssertionError: If assign_number is not between 0 and 9 (inclusive).
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _normalize_number(row):
row[subset] = re.sub(r"\d", str(assign_number), row[subset])
return row
# assign_number is between 0 ~ 9
assert assign_number in range(
10
), f"assign_number should be between 0 ~ 9 but got {assign_number}"
data = data.map(_normalize_number)
return data
================================================
FILE: dataverse/etl/cleaning/table.py
================================================
"""
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from dataverse.etl.registry import register_etl
@register_etl
def cleaning___table___merge_col_vertical(
spark,
data: Union[RDD, DataFrame],
col1: str = None,
col2: str = None,
merge_col_name: str = "merge_col",
*args,
**kwargs
) -> RDD:
"""
Merges two columns vertically into one column.
Example:
Before:
+------+------+---------+
| col1 | col2 | species |
+======+======+=========+
| 1 | 2 | duck |
+------+------+---------+
| 3 | 4 | duck |
+------+------+---------+
| 5 | 6 | ducky |
+------+------+---------+
After calling ``cleaning___table___merge_col_vertical(...)``:
+--------+---------+
| number | species |
+========+=========+
| 1 | duck |
+--------+---------+
| 3 | duck |
+--------+---------+
| 5 | ducky |
+--------+---------+
| 2 | duck |
+--------+---------+
| 4 | duck |
+--------+---------+
| 6 | ducky |
+--------+---------+
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.
col1 (str): The name of the first column to merge.
col2 (str): The name of the second column to merge.
merge_col_name (str, optional): The name of the merged column.
Returns:
The processed data with the merged column.
Raises:
ValueError: If col1 or col2 is not specified.
"""
if isinstance(data, RDD):
data = data.toDF()
assert col1 is not None, "col1 must be specified"
assert col2 is not None, "col2 must be specified"
rest_cols = [c for c in data.columns if c not in [col1, col2]]
df1 = data.select(*rest_cols, F.col(col1).alias(merge_col_name))
df2 = data.select(*rest_cols, F.col(col2).alias(merge_col_name))
# union the dataframes
data = df1.union(df2)
return data
================================================
FILE: dataverse/etl/cleaning/unicode.py
================================================
"""
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import re
import unicodedata
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
UNICODE_PUNCT = {
",": ",",
"。": ".",
"、": ",",
"„": '"',
"”": '"',
"“": '"',
"«": '"',
"»": '"',
"1": '"',
"」": '"',
"「": '"',
"《": '"',
"》": '"',
"´": "'",
"∶": ":",
":": ":",
"?": "?",
"!": "!",
"(": "(",
")": ")",
";": ";",
"–": "-",
"—": " - ",
".": ". ",
"~": "~",
"’": "'",
"…": "...",
"━": "-",
"〈": "<",
"〉": ">",
"【": "[",
"】": "]",
"%": "%",
"►": "-",
}
@register_etl
def cleaning___unicode___remove_punct(
spark, data: Union[RDD, DataFrame], subset: str = "text", *args, **kwargs
) -> RDD:
"""
Removes all the Unicode punctuations.
Code is from facebookresearch/cc_net
https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
Returns:
The cleaned data.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _remove_unicode_punct(row):
row[subset] = re.sub(f"[{''.join(UNICODE_PUNCT.keys())}]", "", row[subset])
return row
data = data.map(_remove_unicode_punct)
return data
@register_etl
def cleaning___unicode___replace_punct(
spark, data: Union[RDD, DataFrame], subset: str = "text", *args, **kwargs
) -> RDD:
"""
Replace all the unicode punctuations
Code is from facebookresearch/cc_net
https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
Returns:
The cleaned data.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _replace_unicode_punct(row):
row[subset] = "".join((UNICODE_PUNCT.get(c, c) for c in row[subset]))
return row
data = data.map(_replace_unicode_punct)
return data
@register_etl
def cleaning___unicode___normalize(
spark, data: Union[RDD, DataFrame], subset="text", *args, **kwargs
):
"""
Normalize the unicode
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed. It can be either an RDD or a DataFrame.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
Returns:
The cleaned data.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _normalize(row):
row[subset] = unicodedata.normalize("NFC", row[subset])
return row
data = data.map(_normalize)
return data
================================================
FILE: dataverse/etl/data_ingestion/README.md
================================================
# Data Ingestion
> Ingest various data sources into the desired format
**Recommendation for Data Ingestion**
> Use Data Ingestion to convert all datasets to the unified format you choose before preprocessing (transform)
- for `Text Only` Dataset, recommend using `ufl` format
- for details on `ufl` format, see below
- for `other` dataset, consider creating a new unified format
## 📚 Data Ingestion Flow
> This is the recommended flow for data ingestion, but not mandatory
There are 2 standard types of data ingestion flow
- **1 step flow** (load & template)
- load `raw data` to `desired format` directly
- **2 step flow** (load -> template)
- load `raw data` to `raw format` first with **dict type**
- convert `raw format` to `desired format`
If you want to create 3 steps, that's up to you. Remember, this is just a guideline.
### 📗 Why 2 step flow?
> To support various templates for the same data source
Let's suppose we are ingesting the `mmlu` dataset and our desired format is the `ufl` format.
With the following 2 templates, we can create 2 different datasets in the `ufl` format.
To give users a broader choice, multiple templates for the same data source are necessary, and the 2 step flow is the way to go.
```python
# raw format
raw = {
"question": "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.",
"choices": ["8", "2", "24", "120"],
"answer": 1,
}
# template v1 - only question (q)
ufl = {
'id': "b1c2d3e4f5g6h7i8j9k0",
'name': "mmlu",
'text': "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.",
'meta': {},
}
# template v2 - question, answer (qa)
ufl = {
'id': "a1b2c3d4e5f6g7h8i9j0",
'name': "mmlu",
'text': "question: Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.\nanswer: 2",
'meta': {},
}
```
## 📚 Naming Convention
> This is a strong recommendation. You can use your own naming convention if you want.
```python
def data_ingestion___[ETL Sub-Category]___[raw source]2[target format]()
```
- `ETL Sub-Category` - 2 types of sub-category (python file)
1. Name it after the data source it handles (`specific` purpose)
- e.g. mmlu
- e.g. squad
2. Name it after the `file format` itself (`general` purpose)
- e.g. parquet
- e.g. csv
- e.g. huggingface
- `ETL process name`
- Name the ETL process as the `raw source` -> `target format`
- **raw source**
- `file format`
- `parquet` - (loading data from parquet)
- `hf` - (loading data from huggingface dataset)
- `csv` - (loading data from csv)
- etc
- `raw`
- the data is already loaded in memory as raw
- **target format**
- `ufl` - (loading data to ufl format)
- e.g. `parquet2ufl` means loading parquet to ufl format
- e.g. `hf2ufl` means loading huggingface dataset to ufl format
- `raw` - (loading data w/o any transformation)
- e.g. `parquet2raw` means loading parquet to raw format
- e.g. `hf2raw` means loading huggingface dataset to raw format
- `[YOUR_FORMAT]`
- this is up to you
**caveat**
- `ufl` is not a file format but rather a schema (data format).
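The triple-underscore convention above is machine-parseable. Below is a minimal sketch of how a name decomposes; `parse_etl_name` is a hypothetical helper for illustration only — the real registry lives in `dataverse/etl/registry.py` and may parse names differently.

```python
def parse_etl_name(func_name: str) -> dict:
    """Split an ETL function name into its naming-convention parts (sketch only)."""
    category, sub_category, process = func_name.split("___")
    # the process part is "[raw source]2[target format]"
    raw_source, _, target_format = process.partition("2")
    return {
        "category": category,            # e.g. data_ingestion
        "sub_category": sub_category,    # data source name or file format
        "raw_source": raw_source,        # e.g. parquet, hf, csv, raw
        "target_format": target_format,  # e.g. ufl, raw (may carry a template suffix)
    }

info = parse_etl_name("data_ingestion___mmlu___parquet2ufl")
# -> {'category': 'data_ingestion', 'sub_category': 'mmlu',
#     'raw_source': 'parquet', 'target_format': 'ufl'}
```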
### 📗 1 step flow
> direct loading raw data to desired format
- If your data is already saved in UFL format, use a `raw` loading ETL process
- e.g. `hf2raw` can serve as the entire 1-step flow when your data is already saved in UFL format
```python
- "data_ingestion/"
# converting raw data to desired format
- mmlu.py
- def data_ingestion___mmlu___parquet2ufl()
- def data_ingestion___mmlu___hf2ufl()
- squad.py
- def data_ingestion___squad___hf2ufl()
- mnist.py
- def data_ingestion___mnist___csv2ufl()
# this is used when loading UFL format saved in parquet
- parquet.py
- def data_ingestion___parquet___pq2ufl()
```
### 📗 2 step flow
> loading raw data to raw format first and then convert to desired format
#### 📖 Step 1 - load raw data to raw format
```python
- "data_ingestion/"
# converting raw data to raw format
- huggingface.py
- def data_ingestion___huggingface___hf2raw()
- mmlu.py
- def data_ingestion___mmlu___parquet2raw()
- def data_ingestion___mmlu___hf2raw()
- mnist.py
- def data_ingestion___mnist___csv2raw()
```
#### 📖 Step 2 - convert raw format to desired format
- Name the ETL process as the `raw format` -> `target format`
- e.g. `raw2ufl` means converting raw format to ufl format
- Add template name to the end of the function name
- e.g. `raw2ufl_q` means converting raw format to ufl format with `question` template
- e.g. `raw2ufl_qa` means converting raw format to ufl format with `question & answer` template
```python
- "data_ingestion/"
# converting raw format to desired format
- mmlu.py
- def data_ingestion___mmlu___raw2ufl_q()
- def data_ingestion___mmlu___raw2ufl_qa()
- squad.py
- def data_ingestion___squad___raw2ufl_v1()
- mnist.py
- def data_ingestion___mnist___raw2ufl_v1()
```
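A Step 2 template function could look like the sketch below. It is plain Python with hypothetical names; in Dataverse such a function would be wrapped with `@register_etl` and applied to a Spark RDD via `rdd.map`. It assumes the `answer` field indexes into `choices`, as in the mmlu example earlier.

```python
import json
import uuid

def raw2ufl_qa(row: dict) -> dict:
    """Hypothetical `qa` template: one raw mmlu-style row -> one UFL record."""
    answer = row["choices"][row["answer"]]  # `answer` indexes into `choices`
    return {
        "id": str(uuid.uuid1()),  # UFL ids are uuid v1
        "name": "mmlu",
        "text": f"question: {row['question']}\nanswer: {answer}",
        "meta": json.dumps({}),  # meta is a stringified json object
    }

row = {
    "question": "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.",
    "choices": ["8", "2", "24", "120"],
    "answer": 1,
}
ufl_record = raw2ufl_qa(row)
```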
## 📚 UFL (Upstage Format for LLM)
> This is the schema (data format) recommended by the Upstage LLM team and the Dataverse standard format for preparing pretraining datasets.
```python
{
"id":"uuid",
"name": "string",
"text":"string",
"meta": "string",
}
```
- `id` - uuid v1
- `name` - name of the dataset
- `text` - text of the dataset
- `meta` - meta data of the dataset
- meta data is a stringified json object
### 📗 Why stringified for meta data?
> Meta data does not have a fixed schema; it can be anything. So it is stringified to avoid any schema issues.
**huggingface datasets**
- when 2 datasets have different meta data schemas, merging them will throw an error
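The idea can be sketched in plain Python (the dataset names here are made up): because `meta` is always a string, records from sources with incompatible meta schemas still share the same four top-level fields and can be merged safely.

```python
import json

# two sources whose meta fields have incompatible schemas
meta_a = {"url": "https://example.com", "lang": "en"}
meta_b = {"page": 3, "tags": ["math", "algebra"]}

# stringifying collapses both to the same column type (string),
# so the top-level UFL schema is identical across datasets
row_a = {"id": "a1", "name": "source_a", "text": "hello", "meta": json.dumps(meta_a)}
row_b = {"id": "b1", "name": "source_b", "text": "world", "meta": json.dumps(meta_b)}

# consumers recover the original structure with json.loads
assert json.loads(row_a["meta"])["lang"] == "en"
assert json.loads(row_b["meta"])["tags"] == ["math", "algebra"]
```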
================================================
FILE: dataverse/etl/data_ingestion/__init__.py
================================================
================================================
FILE: dataverse/etl/data_ingestion/arrow.py
================================================
"""
Load Arrow.
Supports direct loading of an arrow-saved HuggingFace dataset into Spark.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import glob
import os
from typing import List, Union
import numpy as np
import pyarrow as pa
from omegaconf import ListConfig
from pyspark.rdd import RDD
from dataverse.etl import register_etl
def find_arrow_paths(directory):
"""find *.arrow files recursively"""
if isinstance(directory, str):
return glob.glob(os.path.join(directory, "**/*.arrow"), recursive=True)
elif isinstance(directory, list) or isinstance(directory, ListConfig):
arrow_paths = []
for d in directory:
arrow_paths.extend(find_arrow_paths(d))
return arrow_paths
raise ValueError(f"directory must be str or list, got {type(directory)}")
def get_dir_size(arrow_paths):
total_size = 0
for fp in arrow_paths:
# skip if it is not `.arrow` file
if not fp.endswith(".arrow"):
continue
# skip if it is symbolic link
if not os.path.islink(fp):
total_size += os.path.getsize(fp)
return total_size
def arrow_table_to_dict(arrow_path):
"""
Speed: taking 10000 rows takes ~70ms.
This is faster than:
- pyarrow -> pydict direct loading
- pyarrow -> pandas -> pydict loading
TODO: speed and memory improvement
"""
in_memory_stream = pa.input_stream(arrow_path)
opened_stream = pa.ipc.open_stream(in_memory_stream)
table = opened_stream.read_all()
# get schema for field names
schema = table.schema
rows = []
# iterate over each row
for row in range(table.num_rows):
row_data = {
schema.field(col).name: table.column(col)[row].as_py()
for col in range(table.num_columns)
}
rows.append(row_data)
return rows
@register_etl
def data_ingestion___arrow___hf2raw(
spark,
path: Union[str, List[str]],
sample_n: int = -1,
arrow_partition_mb_size: int = -1,
raw_partition_mb_size: int = 256,
repartition: int = -1,
seed: int = 42,
verbose: bool = True,
*args,
**kwargs,
) -> RDD:
"""
Directly loads the arrow saved HuggingFace dataset to raw format as a dictionary.
Args:
spark (SparkSession): The Spark session object.
path (Union[str, List[str]]): The path of the arrow folders.
sample_n (int, optional): The number of arrow files to be sampled. Defaults to -1.
If sample_n is -1, all arrow files will be loaded.
arrow_partition_mb_size (int, optional): The size of each arrow partition in MB. Defaults to -1.
If arrow_partition_mb_size is -1, it will repartition arrow files by the number of arrow files.
This assumes that arrow file size is evenly distributed. When there is data skew in arrow file size, it is recommended to use the default (-1).
raw_partition_mb_size (int, optional): The size of each raw partition in MB. Defaults to 256.
This is activated only when repartition is -1.
repartition (int, optional): Manually choose the number of partitions. Defaults to -1.
seed (int, optional): The seed for sampling. Defaults to 42.
verbose (bool, optional): Whether to print the information of the dataset. Defaults to True.
Returns:
RDD: The RDD containing the raw data in dictionary format.
Examples:
>>> import datasets
>>> dataset = datasets.load_dataset('ducky')
>>> dataset.save_to_disk('your/path/to/ducky')
>>> data_ingestion___arrow___hf2raw()(spark, 'your/path/to/ducky')
Caveats:
Arrow paths are repartitioned by the number of arrow files.
"""
arrow_paths = find_arrow_paths(path)
assert len(arrow_paths) > 0, f"no arrow files found in {path}"
# sample from the arrow files
if sample_n > 0 and sample_n < len(arrow_paths):
np.random.seed(seed)
arrow_paths = np.random.choice(arrow_paths, size=sample_n, replace=False)
if arrow_partition_mb_size == -1:
# if data is skewed, recommend to use default (-1)
arrow_repartition = len(arrow_paths)
else:
# this assumes that arrow file size is evenly distributed
assert (
arrow_partition_mb_size > 0
), f"arrow_partition_mb_size must be positive, got {arrow_partition_mb_size}"
arrow_total_mb_size = get_dir_size(arrow_paths) / 1024 / 1024
arrow_repartition = arrow_total_mb_size // arrow_partition_mb_size
arrow_repartition += 1 if arrow_total_mb_size % arrow_partition_mb_size else 0
arrow_repartition = min(int(arrow_repartition), len(arrow_paths))
rdd = spark.sparkContext.parallelize(arrow_paths)
rdd = rdd.repartition(arrow_repartition)
rdd = rdd.flatMap(arrow_table_to_dict)
if repartition != -1:
raw_repartition = repartition
else:
assert (
raw_partition_mb_size > 0
), f"raw_partition_mb_size must be positive, got {raw_partition_mb_size}"
arrow_total_mb_size = get_dir_size(arrow_paths) / 1024 / 1024
raw_repartition = arrow_total_mb_size // raw_partition_mb_size
raw_repartition += 1 if arrow_total_mb_size % raw_partition_mb_size else 0
# count the number of data points (this is expensive)
# this is to prevent the case where the number of data points is less than raw_repartition
total_data_n = rdd.count()
raw_repartition = min(int(raw_repartition), total_data_n)
rdd = rdd.repartition(raw_repartition)
return rdd
================================================
FILE: dataverse/etl/data_ingestion/common_crawl.py
================================================
"""
Load Common Crawl data from dump-id & segment files
Code is from facebookresearch/cc_net with some modifications
https://github.com/facebookresearch/cc_net
This is a migration of the code to Dataverse.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import functools
import glob
import gzip
import io
import json
import os
import sys
import tempfile
import time
import typing as tp
import warnings
from pathlib import Path
from typing import Iterable, List, Optional, TextIO, Union
from urllib.parse import urlparse
import numpy as np
import requests
from pyspark.rdd import RDD
from dataverse.etl import register_etl
from dataverse.utils.format import get_uuidv1
from dataverse.utils.setting import SystemSetting
def parse_doc(headers: List[str], doc: List[str]) -> Optional[dict]:
"""Headers format is:
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: [url]
WARC-Date: [crawldate: 2019-02-15T19:15:59Z]
WARC-Record-ID: <urn:uuid:8865156e-d5f1-4734-9c68-4b46eaf2bb7e>
WARC-Refers-To: <urn:uuid:340152e2-65cf-4143-b522-8ce4e2d069d7>
WARC-Block-Digest: sha1:S3DTWCONT2L6ORTGCY2KXEZ37LNBB7V2
Content-Type: text/plain
Content-Length: 7743
"""
if not headers or not doc:
return None
try:
url, date, digest, length = None, None, None, None
for header in headers:
if header.startswith("WARC-Target-URI:"):
url = header.split()[1]
elif header.startswith("WARC-Date:"):
date = header.split()[1]
elif header.startswith("WARC-Block-Digest:"):
digest = header.split()[1]
elif header.startswith("Content-Length:"):
length = int(header.split()[1])
except Exception:
# logger.warning("Can't parse header:", e, headers, doc)
return None
# Docs are separated by two empty lines.
last = None
if not doc[-1] and not doc[-2]:
last = -2
title, doc = doc[0], doc[1:last]
return {
"url": url,
"date_download": date,
"digest": digest,
"length": length,
"nlines": len(doc),
"source_domain": urlparse(url).netloc,
"title": title,
"raw_content": "\n".join(doc),
}
def group_by_docs(warc_lines: Iterable[str]) -> Iterable[dict]:
doc: List[str] = []
headers, read_headers = [], True
for warc in warc_lines:
warc = warc.strip()
if read_headers:
headers.append(warc)
read_headers = warc != ""
continue
if warc == "WARC/1.0":
# We reached the beginning of the new doc.
parsed = parse_doc(headers, doc)
if parsed is not None:
yield parsed
headers, doc, read_headers = [warc], [], True
continue
doc.append(warc)
# Return the last document
if doc:
parsed = parse_doc(headers, doc)
if parsed is not None:
yield parsed
def _close_when_exhausted(file) -> Iterable[str]:
with file:
yield from file
def open_segment_file(segment: str, verbose: bool = True) -> Iterable[str]:
"""
Overrides the cc_net open_segment function to read the WET file from a local folder.
Args:
segment: path to the WET file
"""
filename = Path(segment)
if filename.suffix == ".gz":
file: TextIO = gzip.open(filename, "rt") # type: ignore
else:
file = open(filename, "rt")
return _close_when_exhausted(file)
def process_segment_file(segment: str, verbose: bool = True) -> Iterable[dict]:
for doc in group_by_docs(open_segment_file(segment, verbose=verbose)):
doc["cc_segment"] = segment
yield doc
def find_wet_files(directory):
"""find *.wet, *wet.gz files recursively"""
return glob.glob(os.path.join(directory, "**/*.wet"), recursive=True) + glob.glob(
os.path.join(directory, "**/*.wet.gz"), recursive=True
)
WET_URL_ROOT = "https://data.commoncrawl.org"
FileDescriptor = Union[Path, List[Path], str]
ReadableFileLike = Union[Iterable[str], FileDescriptor, None]
def _tmp(prefix: str = None, suffix: str = None, dir: Path = None) -> Path:
if isinstance(prefix, Path):
prefix = str(prefix)
if isinstance(suffix, Path):
suffix = str(suffix)
_, tmp_path = tempfile.mkstemp(prefix=prefix, suffix=suffix, dir=dir)
return Path(tmp_path)
def _yield_from(files: list) -> Iterable[str]:
for file in files:
yield from open_read(file)
def open_read(filename: ReadableFileLike) -> Iterable[str]:
"""Open the given file, list of files or files matching the given glob and read lines.
`filename` is None or "-" -> reads from stdin
`filename` is a Path / str -> interprets filename as a glob and open files matching it
`filename` is a list -> opens sequentially all files from the list using `open_read`
`filename` is something else -> returns the object wrapped in a `nullcontext`
This allows passing already opened files or iterables.
`open_read` will decompress gzip files, given they have ".gz" suffix.
"""
if filename is None:
return sys.stdin
if isinstance(filename, list):
# check emptiness before indexing to avoid an IndexError on an empty list
if len(filename) == 0:
return []
assert isinstance(filename[0], Path)
if len(filename) > 1:
return _yield_from(filename)
filename = tp.cast(Path, filename[0])
if isinstance(filename, str):
if filename.startswith("http://") or filename.startswith("https://"):
return open_remote_file(filename)
filename = Path(filename)
if not isinstance(filename, Path):
# we might have received an iterable, return it unmodified.
return filename # type: ignore
# Expand glob patterns only when reading
files = [Path(f) for f in sorted(glob.glob(str(filename)))]
if len(files) > 1:
return _yield_from(files)
if len(files) == 1:
filename = files[0]
assert isinstance(filename, Path)
if filename.suffix == ".gz":
file: TextIO = gzip.open(filename, "rt") # type: ignore
else:
file = open(filename, "rt")
return _close_when_exhausted(file)
def request_get_content(url: str, n_retry: int = 3, verbose: bool = True) -> bytes:
"""Retrieve the binary content at url.
Retry on connection errors.
"""
t0 = time.time()
if verbose:
# TODO: Logging will be activated later
# logging.info(f"Starting download of {url}")
print(f"Starting download of {url}")
for i in range(1, n_retry + 1):
try:
with requests.Session() as session:
r = session.get(url)
r.raise_for_status()
break
except requests.exceptions.RequestException as e:
# Sleep and try again on error, unless it's a 404.
message = e.args[0] if e.args and isinstance(e.args[0], str) else ""
if i == n_retry or "Client Error" in message:
raise e
warnings.warn(f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})")
time.sleep(10 * 2**i)
if verbose:
dl_time = time.time() - t0
dl_speed = len(r.content) / dl_time / 1024
# logging.info(
# f"Downloaded {url} [{r.status_code}] took {dl_time:.0f}s ({dl_speed:.1f}kB/s)"
# )
print(f"Downloaded {url} [{r.status_code}] took {dl_time:.0f}s ({dl_speed:.1f}kB/s)")
return r.content
def open_remote_file(url: str, cache: Path = None, verbose: bool = True) -> Iterable[str]:
"""
Downloads the file at the given url into memory and opens it as a file.
Assumes that the file is small, and fetches it when this function is called.
"""
if cache and cache.exists():
return open_read(cache)
# TODO: open the remote file in streaming mode.
# The hard part is that we need to write the content on disk at the same time,
# to implement disk caching.
raw_bytes = request_get_content(url, verbose=verbose)
content = io.BytesIO(raw_bytes)
if url.endswith(".gz"):
f: TextIO = gzip.open(content, mode="rt") # type: ignore
else:
f = io.TextIOWrapper(content)
tmp_cache = None
try:
# The file might have been created even though not fully downloaded/written,
# so make sure tmp_cache is deleted when the program exits,
# and only replace the cache file once the download is complete.
if cache and not cache.exists():
tmp_cache = _tmp(cache)
tmp_cache.write_bytes(raw_bytes)
if not cache.exists():
tmp_cache.replace(cache)
finally:
# guard against tmp_cache never being assigned (e.g. cache is None)
if tmp_cache is not None and tmp_cache.exists():
tmp_cache.unlink()
return _close_when_exhausted(f)
def cc_wet_paths_url(dump_id: str) -> str:
return "/".join([WET_URL_ROOT, "crawl-data", "CC-MAIN-" + dump_id, "wet.paths.gz"])
def segment_url(segment: str):
return "/".join((WET_URL_ROOT, segment))
def cc_segment_urls(dump_id: str, cache_dir: Path, verbose: bool = True) -> List[str]:
wet_paths = cc_wet_paths_url(dump_id)
wet_paths_cache = cache_dir / f"wet_{dump_id}.paths.gz"
f = open_remote_file(wet_paths, cache=wet_paths_cache, verbose=verbose)
return [segment.strip() for segment in f]
def open_segment_url(segment: str, cache_dir: Path, verbose: bool = True) -> Iterable[str]:
url = segment_url(segment)
file: Optional[Path] = None
if cache_dir:
file = cache_dir / segment.split("/")[-1]
return open_remote_file(url, cache=file, verbose=verbose)
def process_segment_url(segment: str, cache_dir: Path, verbose: bool = True) -> Iterable[str]:
for doc in group_by_docs(open_segment_url(segment, cache_dir, verbose=verbose)):
doc["cc_segment"] = segment
yield doc
@register_etl
def data_ingestion___common_crawl___wet2raw(
spark,
wet_path: str,
segment_n: int = -1,
repartition=20,
seed: int = 42,
verbose=True,
*args,
**kwargs,
) -> RDD:
"""
Load WET files and convert them to raw format as a dictionary.
[ what is WET? ]
- WET files store plain text extracted from the data stored in WARC files.
Args:
spark: The Spark session.
wet_path: The path to the WET folder that includes WET format files.
The search is recursive, so you don't need to specify the path to each WET file.
All *.wet and *.wet.gz files in the folder are picked up.
segment_n: The number of segments to load. This is a sampling parameter.
One segment is about 1GB.
Set as -1 (default) to load all the segments.
repartition: The number of partitions.
seed: The random seed.
verbose: Whether to print the information of the dataset.
Returns:
rdd: The RDD containing the converted raw data.
"""
wet_paths = find_wet_files(wet_path)
if segment_n > 0 and segment_n < len(wet_paths):
np.random.seed(seed)
wet_paths = np.random.choice(wet_paths, size=segment_n, replace=False)
rdd = spark.sparkContext.parallelize(wet_paths)
rdd = rdd.flatMap(functools.partial(process_segment_file, verbose=verbose))
rdd = rdd.repartition(repartition)
return rdd
@register_etl
def data_ingestion___common_crawl___dump2raw(
spark,
dump: str,
segment_n: int = -1,
repartition: int = 20,
use_cache: bool = True,
cache_dir: str = None,
seed: int = 42,
verbose: bool = True,
*args,
**kwargs,
) -> RDD:
"""
Ingests data from Common Crawl dump and converts it to raw format.
Args:
spark (SparkSession): The Spark session.
dump (str): The dump ID of the Common Crawl. For example, '2023-23'.
segment_n (int, optional): The number of segments to load. Default is -1, which loads all segments.
Note that one segment is about 1GB.
repartition (int, optional): The number of partitions. Default is 20.
use_cache (bool, optional): Whether to use the cache. Default is True.
If you want to save disk space, set as False because the size of cache can be large.
FYI, one WET dump is about 10 TB.
cache_dir (str, optional): The cache path to save the dataset.
seed (int, optional): The random seed. Default is 42.
verbose (bool, optional): Whether to print the information of the dataset. Default is True.
Returns:
RDD: The RDD containing the processed data.
"""
if use_cache:
if cache_dir is None:
# save the parquet at package root path
cache_dir = SystemSetting().CACHE_DIR
cache_dir = f"{cache_dir}/.cache/dataverse/dataset/common_crawl_{dump}"
else:
cache_dir = f"{cache_dir}/common_crawl_{dump}"
else:
cache_dir = None
if cache_dir is not None and not isinstance(cache_dir, Path):
cache_dir = Path(cache_dir)
# if the cache dir does not exist, create one
if cache_dir and not cache_dir.exists():
cache_dir.mkdir(parents=True)
wet_urls = cc_segment_urls(dump, cache_dir, verbose=verbose)
if segment_n > 0 and segment_n < len(wet_urls):
np.random.seed(seed)
wet_urls = np.random.choice(wet_urls, size=segment_n, replace=False)
rdd = spark.sparkContext.parallelize(wet_urls)
rdd = rdd.flatMap(
functools.partial(
process_segment_url,
cache_dir=cache_dir,
verbose=verbose,
)
)
rdd = rdd.repartition(repartition)
return rdd
def convert_bytes(data):
if isinstance(data, bytes):
return data.decode()
if isinstance(data, dict):
return {convert_bytes(key): convert_bytes(value) for key, value in data.items()}
if isinstance(data, list):
return [convert_bytes(element) for element in data]
return data
@register_etl
def data_ingestion___common_crawl___raw2ufl(spark, data: RDD, *args, **kwargs):
"""
Converts raw format to UFL with custom template.
Args:
spark (SparkSession): The Spark session.
data (RDD): The input data.
Returns:
The converted data in UFL format.
"""
def templatev1(data):
new_data = {}
new_data["id"] = get_uuidv1()
new_data["name"] = "common_crawl"
new_data["text"] = f"{data.get('raw_content', None)}"
new_data["meta"] = json.dumps(
convert_bytes(
{
"title": data.get("title", None),
"url": data.get("url", None),
"date_download": data.get("date_download", None),
"digest": data.get("digest", None),
"length": data.get("length", None),
"nlines": data.get("nlines", None),
"source_domain": data.get("source_domain", None),
"cc_segment": data.get("cc_segment", None),
}
)
)
return new_data
data = data.map(lambda x: templatev1(x))
return data
================================================
FILE: dataverse/etl/data_ingestion/csv.py
================================================
"""
Load CSV data
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import List, Union
from pyspark.rdd import RDD
from dataverse.etl import register_etl
# from dataverse.utils.format import huggingface2parquet, load_huggingface_dataset
@register_etl
def data_ingestion___csv___csv2raw(
spark, path: Union[str, List[str]], repartition: int = 20, verbose: bool = True, *args, **kwargs
) -> RDD:
"""
Converts CSV data to raw RDD.
Args:
spark (SparkSession): The Spark session.
path (Union[str, List[str]]): The path(s) to the CSV file(s).
repartition (int, optional): The number of partitions for the RDD. Defaults to 20.
verbose (bool, optional): Whether to print the information of the dataset.
Returns:
RDD: The raw RDD containing the CSV data.
"""
if isinstance(path, str):
path = [path]
# spark.read.csv accepts a list of paths directly; unpacking the list
# would pass the second path into the schema argument
df = spark.read.csv(path, header=True)
rdd = df.rdd.repartition(repartition)
rdd = rdd.map(lambda row: row.asDict())
return rdd
================================================
FILE: dataverse/etl/data_ingestion/cultura_x.py
================================================
"""
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import json
from pyspark.rdd import RDD
from dataverse.etl import register_etl
from dataverse.utils.format import get_uuidv1
@register_etl
def data_ingestion___cultura_x___raw2ufl(spark, ufl: RDD, *args, **kwargs):
"""
Converts raw format to UFL with custom template.
Args:
spark (SparkSession): The Spark session object.
ufl (RDD): The input RDD in raw format.
Returns:
RDD: The transformed RDD in UFL format.
"""
def templatev1(row):
new_row = {}
new_row["id"] = get_uuidv1()
new_row["name"] = "cultura_x"
new_row["text"] = row["text"]
new_row["meta"] = json.dumps(
{
"url": row["url"],
"timestamp": row["timestamp"],
"source": row["source"],
}
)
return new_row
ufl = ufl.map(lambda x: templatev1(x))
return ufl
================================================
FILE: dataverse/etl/data_ingestion/huggingface.py
================================================
"""
Load Huggingface data
This is used just to load a huggingface dataset without any reformatting
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import List, Union
from pyspark.rdd import RDD
from dataverse.etl import register_etl
from dataverse.utils.format import huggingface2parquet, load_huggingface_dataset
@register_etl
def data_ingestion___huggingface___hf2raw(
spark,
name_or_path: Union[str, List[str]],
split: str = None,
from_disk: bool = False,
repartition: int = 20,
verbose: bool = True,
*args,
**kwargs
) -> RDD:
"""
Convert a HuggingFace dataset to raw format as a dictionary.
Args:
spark (SparkSession): The Spark session.
name_or_path (Union[str, List[str]]): The name or path of the HuggingFace dataset.
split (str, optional): The split of the dataset. Defaults to None.
from_disk (bool, optional): Whether to load from disk. Defaults to False.
No split is allowed when from_disk is True.
repartition (int, optional): The number of partitions. Defaults to 20.
verbose (bool, optional): Whether to print the information of the dataset. Defaults to True.
Returns:
rdd: The converted dataset as an RDD of dictionaries.
"""
dataset = load_huggingface_dataset(name_or_path, split=split, from_disk=from_disk)
parquet_path = huggingface2parquet(dataset, verbose=verbose)
df = spark.read.parquet(parquet_path)
rdd = df.rdd.repartition(repartition)
rdd = rdd.map(lambda row: row.asDict())
return rdd
================================================
FILE: dataverse/etl/data_ingestion/parquet.py
================================================
"""
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import List, Union
from pyspark.rdd import RDD
from dataverse.etl import register_etl
@register_etl
def data_ingestion___parquet___pq2raw(
spark, path: Union[str, List[str]], repartition=20, *args, **kwargs
) -> RDD:
"""
Reads parquet files into an RDD and repartitions it.
Args:
spark (SparkSession): The Spark session.
path (str or list): The path of the parquet files.
repartition (int): The number of partitions.
Returns:
rdd: The repartitioned RDD containing the data from the parquet files.
"""
if isinstance(path, str):
path = [path]
df = spark.read.parquet(*path)
rdd = df.rdd.repartition(repartition)
rdd = rdd.map(lambda row: row.asDict())
return rdd
================================================
FILE: dataverse/etl/data_ingestion/red_pajama.py
================================================
"""
Supported datasets:
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import List, Union
from dataverse.etl import register_etl
from dataverse.utils.format import (
get_uuidv1,
huggingface2parquet,
load_huggingface_dataset,
)
"""
1 stage data ingestion - default
====================================
direct loading ufl with one ETL process
"""
def convert2ufl(row):
row["id"] = get_uuidv1()
row["name"] = "red_pajama"
return row
@register_etl
def data_ingestion___red_pajama___parquet2ufl(spark, input_paths, repartition=20, *args, **kwargs):
"""
convert parquet file to ufl
"""
df = spark.read.parquet(*input_paths)
rdd = df.rdd.repartition(repartition)
rdd = rdd.map(lambda row: row.asDict())
rdd = rdd.map(lambda x: convert2ufl(x))
return rdd
@register_etl
def data_ingestion___red_pajama___hf2ufl(
spark,
name_or_path: Union[str, List[str]] = "togethercomputer/RedPajama-Data-1T-Sample",
split=None,
from_disk=False,
repartition=20,
verbose=True,
*args,
**kwargs
):
"""
convert huggingface dataset to ufl
Args:
spark (SparkSession): spark session
name_or_path (str or list): the name or path of the huggingface dataset
split (str): the split of the dataset
from_disk (bool): whether to load from disk
- no split is allowed when from_disk is True
repartition (int): the number of partitions
verbose (bool): whether to print the information of the dataset
"""
dataset = load_huggingface_dataset(name_or_path, split=split, from_disk=from_disk)
parquet_path = huggingface2parquet(dataset, verbose=verbose)
df = spark.read.parquet(parquet_path)
rdd = df.rdd.repartition(repartition)
rdd = rdd.map(lambda row: row.asDict())
rdd = rdd.map(lambda x: convert2ufl(x))
return rdd
"""
2 stage data ingestion - default
====================================
loading ufl with custom template with two ETL process
"""
@register_etl
def data_ingestion___red_pajama___hf2raw(
spark,
name_or_path: Union[str, List[str]] = "togethercomputer/RedPajama-Data-1T-Sample",
split=None,
repartition=20,
verbose=True,
*args,
**kwargs
):
"""
convert huggingface dataset to raw format as dict
Args:
spark (SparkSession): spark session
name_or_path (str or list): the name or path of the huggingface dataset
split (str): the split of the dataset
repartition (int): the number of partitions
verbose (bool): whether to print the information of the dataset
"""
dataset = load_huggingface_dataset(name_or_path, split=split)
parquet_path = huggingface2parquet(dataset, verbose=verbose)
df = spark.read.parquet(parquet_path)
rdd = df.rdd.repartition(repartition)
rdd = rdd.map(lambda row: row.asDict())
return rdd
@register_etl
def data_ingestion___red_pajama___raw2ufl_templatev1(spark, ufl, *args, **kwargs):
"""
convert raw format to ufl with custom template
"""
def templatev1(row):
row["id"] = get_uuidv1()
row["name"] = "red_pajama"
return row
ufl = ufl.map(lambda x: templatev1(x))
return ufl
@register_etl
def data_ingestion___red_pajama___raw2ufl_templatev2(spark, ufl, *args, **kwargs):
...
return ufl
================================================
FILE: dataverse/etl/data_ingestion/slim_pajama.py
================================================
"""
Supported datasets:
https://huggingface.co/datasets/cerebras/SlimPajama-627B
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import List, Union
from dataverse.etl import register_etl
from dataverse.utils.format import huggingface2parquet, load_huggingface_dataset
@register_etl
def data_ingestion___slim_pajama___parquet2ufl(spark, input_paths, repartition=20, *args, **kwargs):
"""
convert parquet file to ufl
"""
df = spark.read.parquet(*input_paths)
rdd = df.rdd.repartition(repartition)
rdd = rdd.map(lambda row: row.asDict())
return rdd
@register_etl
def data_ingestion___slim_pajama___hf2ufl(
spark,
name_or_path: Union[str, List[str]] = "cerebras/SlimPajama-627B",
split=None,
from_disk=False,
repartition=20,
verbose=True,
*args,
**kwargs
):
"""
convert huggingface dataset to ufl
Args:
spark (SparkSession): spark session
name_or_path (str or list): the name or path of the huggingface dataset
split (str): the split of the dataset
from_disk (bool): whether to load from disk
- no split is allowed when from_disk is True
repartition (int): the number of partitions
verbose (bool): whether to print the information of the dataset
"""
dataset = load_huggingface_dataset(name_or_path, split=split, from_disk=from_disk)
parquet_path = huggingface2parquet(dataset, verbose=verbose)
df = spark.read.parquet(parquet_path)
rdd = df.rdd.repartition(repartition)
rdd = rdd.map(lambda row: row.asDict())
return rdd
================================================
FILE: dataverse/etl/data_ingestion/test.py
================================================
"""
special purpose to create fake data for testing or debugging
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import json
from faker import Faker
from pyspark.rdd import RDD
from dataverse.etl import register_etl
@register_etl
def data_ingestion___test___generate_fake_ufl(
spark, n: int = 100, repartition: int = 20, verbose: bool = True, *args, **kwargs
) -> RDD:
"""
Generate fake data for testing or debugging.
Args:
spark (SparkSession): The Spark session object.
n (int, optional): The number of records to generate. Default is 100.
repartition (int, optional): The number of partitions. Default is 20.
verbose (bool, optional): Whether to print the information of the dataset. Default is True.
Returns:
RDD: The generated fake data RDD.
"""
faker = Faker()
def _generate_fake_ufl(n=100):
for _ in range(n):
yield {
"id": faker.uuid4(),
"name": "test_fake_ufl",
"text": faker.text(),
"meta": json.dumps(
{
"name": faker.name(),
"age": faker.random_int(0, 100),
"address": faker.address(),
"job": faker.job(),
}
),
}
rdd = spark.sparkContext.parallelize(_generate_fake_ufl(n=n))
rdd = rdd.repartition(repartition)
return rdd
================================================
FILE: dataverse/etl/data_save/README.md
================================================
# Data Save
> How do we save data to its destination?
## 🌌 Naming Convention
- TBD
## 🌌 Supported Data Save Method
- AWS (S3)
- HuggingFace (Dataset)
- Parquet
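
All three save modules follow the same shape: refuse to overwrite an existing path, coerce the input into a DataFrame, repartition, then write. A minimal sketch of that guard-and-write pattern, using plain Python and JSON lines as a stand-in for Spark and parquet (the `save_ufl` helper is hypothetical, not part of the Dataverse API):

```python
import json
import os


def save_ufl(rows, save_path):
    """Save rows (a list of dicts) as JSON lines, refusing to overwrite."""
    if os.path.exists(save_path):
        raise ValueError(f"save_path {save_path} already exists")
    with open(save_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return save_path


rows = [{"id": 1, "text": "ducky"}, {"id": 2, "text": "goose"}]
path = save_ufl(rows, "sample_ufl.jsonl")
```

The real modules do the same guard (`raise ValueError` on an existing `save_path`) before handing the DataFrame to `write.parquet` or `Dataset.from_spark`.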
================================================
FILE: dataverse/etl/data_save/__init__.py
================================================
================================================
FILE: dataverse/etl/data_save/aws.py
================================================
"""
TODO: Data saving to AWS S3
This is not implemented yet.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
# TODO
================================================
FILE: dataverse/etl/data_save/huggingface.py
================================================
"""
Data saving to Huggingface Datasets
Hugging Face Datasets supports Spark natively!
https://huggingface.co/docs/datasets/use_with_spark
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import os
from typing import Union
from datasets import Dataset
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl import register_etl
@register_etl
def data_save___huggingface___ufl2hf_hub(spark, ufl, hub_path, repartition=1, *args, **kwargs):
"""
TODO: Save data to Hugging Face dataset and upload to hub.
"""
raise NotImplementedError("Saving to the Hugging Face Hub is not implemented yet.")
@register_etl
def data_save___huggingface___ufl2hf(
spark, ufl: Union[RDD, DataFrame], save_path: str, repartition: int = 1, *args, **kwargs
) -> str:
"""
Save data to HuggingFace dataset and return the path.
Args:
spark(sparkSession): The Spark session.
ufl(Union[RDD, DataFrame]):The input data to be saved.
save_path(str): The path to save the HF dataset.
repartition(int, optional): The number of partitions to repartition the data. Defaults to 1.
Raises:
ValueError: If the save_path already exists.
AssertionError: If ufl is not an RDD or DataFrame.
Returns:
str: The path where the HuggingFace dataset is saved.
"""
if os.path.exists(save_path):
raise ValueError(f"save_path {save_path} already exists")
if isinstance(ufl, RDD):
ufl = ufl.toDF()
assert isinstance(ufl, DataFrame), f"ufl must be RDD or DataFrame, got {type(ufl)}"
ufl = ufl.repartition(repartition)
hf_dataset = Dataset.from_spark(ufl)
hf_dataset.save_to_disk(save_path)
return save_path
@register_etl
def data_save___huggingface___ufl2hf_obj(
spark, ufl: Union[RDD, DataFrame], repartition: int = 1, *args, **kwargs
) -> Dataset:
"""
Convert data to HuggingFace dataset object.
Args:
spark(sparkSession): The Spark session.
ufl(Union[RDD, DataFrame]):The input data to be saved.
repartition(int, optional): The number of partitions to repartition the data. Defaults to 1.
Returns:
Dataset: The HuggingFace dataset object.
Raises:
AssertionError: If the input data is not RDD or DataFrame.
"""
if isinstance(ufl, RDD):
ufl = ufl.toDF()
assert isinstance(ufl, DataFrame), f"ufl must be RDD or DataFrame, got {type(ufl)}"
ufl = ufl.repartition(repartition)
hf_dataset = Dataset.from_spark(ufl)
return hf_dataset
================================================
FILE: dataverse/etl/data_save/parquet.py
================================================
"""
Data saving to Parquet
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import os
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl import register_etl
@register_etl
def data_save___parquet___ufl2parquet(
spark,
ufl: Union[RDD, DataFrame],
save_path: str,
repartition: int = 1,
*args,
**kwargs,
) -> str:
"""
Save data to parquet and return the path.
Args:
spark(sparkSession): The Spark session.
ufl(Union[RDD, DataFrame]):The input data to be saved.
save_path(str): The path to save the parquet file.
repartition(int, optional): The number of partitions to repartition the data. Defaults to 1.
Raises:
ValueError: If the save_path already exists.
Returns:
str: The path where the parquet file is saved.
"""
if os.path.exists(save_path):
raise ValueError(f"save_path {save_path} already exists")
if isinstance(ufl, RDD):
ufl = ufl.toDF()
assert isinstance(ufl, DataFrame), f"ufl must be RDD or DataFrame, got {type(ufl)}"
ufl = ufl.repartition(repartition)
ufl.write.parquet(save_path, mode="overwrite")
return save_path
================================================
FILE: dataverse/etl/decontamination/README.md
================================================
================================================
FILE: dataverse/etl/decontamination/__init__.py
================================================
================================================
FILE: dataverse/etl/deduplication/README.md
================================================
# Deduplication
> Deduplication is the process of removing duplicate records from a dataset.
Normally this is grouped into 2 big categories:
- **Exact Deduplication**: remove exact duplicate records
- **Fuzzy Deduplication**: remove records that are similar to each other
☣️ **caveat**️ ☣️
> Grouping every sub-category under just these 2 big categories wastes the namespace, so for now we temporarily cluster sub-categories under more detailed names:
- part of the algorithm's name (e.g. minhash)
- the name of the open-source project it came from
- etc.
We plan to replace this with a better clustering in the future, and we need your help!
💡Any ideas are welcomed!💡
## 🌌 Exact Deduplication
> Exact Deduplication is the process of removing exact duplicate records from a dataset.
## 🌌 Fuzzy Deduplication
> Fuzzy Deduplication is the process of removing records that are similar to each other from a dataset.
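
To make the fuzzy side concrete, here is a minimal, self-contained sketch of the MinHash idea behind the `minhash` and `polyglot` modules: each document is reduced to a signature of per-"permutation" minimum hashes, and the fraction of matching signature slots estimates the Jaccard similarity of the underlying shingle sets. The helper names are illustrative, not part of the Dataverse API, and salting the hash stands in for true permutations:

```python
import hashlib


def shingles(text, n=2):
    """Word-level n-gram shingles of a text."""
    words = text.split()
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(shingle_set, num_perm=128):
    """One minimum hash per 'permutation', simulated by salting the hash."""
    sig = []
    for p in range(num_perm):
        salt = str(p).encode()
        sig.append(
            min(
                int.from_bytes(hashlib.sha1(salt + s.encode()).digest()[:8], "big")
                for s in shingle_set
            )
        )
    return sig


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
sim = estimated_jaccard(minhash_signature(a), minhash_signature(b))
```

With 128 slots the estimate typically lands within a few points of the true Jaccard similarity; the real modules add LSH banding on top so that only likely duplicates are ever compared pairwise.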
================================================
FILE: dataverse/etl/deduplication/__init__.py
================================================
================================================
FILE: dataverse/etl/deduplication/common_crawl.py
================================================
"""
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import functools
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.functions import collect_list, posexplode, split
from dataverse.etl.registry import register_etl
def filter_lines(row, subset="text"):
row = row.asDict()
text = row[subset]
line_ids = row["line_ids"]
text_lines = text.split("\n")
filtered_texts = "\n".join([text_lines[line_i] for line_i in sorted(line_ids)])
del row["line_ids"]
row[subset] = filtered_texts
return row
@register_etl
def deduplication___common_crawl___exact_line(
spark, data: Union[RDD, DataFrame], subset="text", *args, **kwargs
) -> RDD:
"""
Performs exact line by line deduplication on the given data.
Strip and lowercasing are applied to each line before deduplication,
but the original text is left unchanged.
Examples:
- input
+--------+
| text|
+========+
| DuckY|
+--------+
| dUKCY|
+--------+
- output
+--------+
| text|
+========+
| DuckY|
+--------+
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be deduplicated.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
Returns:
rdd: The deduplicated data.
Raises:
AssertionError: If the input data is not a DataFrame.
"""
if isinstance(data, RDD):
data = data.toDF()
data = data.cache()
data = data.withColumn("__id__", F.monotonically_increasing_id())
assert isinstance(data, DataFrame), f"data must be DataFrame, got {type(data)}"
line_data = data.select(
"__id__", posexplode(split(data[subset], "\n")).alias("line_id", "line")
)
line_data = line_data.withColumn("line", F.lower(F.trim(line_data["line"])))
line_data = line_data.dropDuplicates(subset=["line"])
line_data = line_data.groupBy("__id__").agg(collect_list("line_id").alias("line_ids"))
merged_data = data.join(line_data, on=["__id__"], how="inner")
data.unpersist()
line_data.unpersist()
# remove __id__
merged_data = merged_data.drop("__id__")
# filter the lines using the line_ids
merged_data = merged_data.rdd.map(functools.partial(filter_lines, subset=subset))
return merged_data
================================================
FILE: dataverse/etl/deduplication/exact.py
================================================
"""
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
from typing import List, Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
@register_etl
def deduplication___exact___column(
spark, data: Union[RDD, DataFrame], subset: List[str] = ["text"], *args, **kwargs
):
"""
Exact column deduplication
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be deduplicated.
subset(List[str]): Subset of columns to consider for duplication check. Default to ['text'].
Returns:
Deduplicated DataFrame object
"""
if isinstance(data, RDD):
data = data.toDF()
assert isinstance(data, DataFrame), f"data must be DataFrame, got {type(data)}"
data = data.dropDuplicates(subset=subset)
return data
================================================
FILE: dataverse/etl/deduplication/minhash.py
================================================
"""
Code is from ChenghaoMou/text-dedup
https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
This is a migration of the code to Dataverse.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import hashlib
import functools
import re
import os
import struct
import sys
from itertools import tee
from operator import add
from typing import Any, List, Text, Tuple, Union
import numpy as np
import pyspark
from pyspark.rdd import RDD
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.ml.feature import NGram, RegexTokenizer
from scipy.integrate import quad as integrate
from dataverse.etl.registry import register_etl
# region: Connected Components in MapReduce and Beyond, 2014
def generate_edges(nodes: List[int]) -> List[Tuple[int, int]]:
"""
Generate edges from a cluster. Instead of generating N^2 edges, we only need to connect every node to a
single representative node, since we will be running connected components on the edges later.
Parameters
----------
nodes : List[int]
The list of nodes in the cluster.
Returns
-------
List[Tuple[int, int]]
The list of edges.
Examples
--------
>>> generate_edges([1, 2, 3])
[(2, 1), (3, 1)]
"""
if len(nodes) <= 1:
return []
min_node = min(nodes)
return [(n, min_node) for n in nodes if n != min_node]
def get_hash(text: str, n_bytes: int=8):
return int.from_bytes(
hashlib.sha1(text.encode("utf-8")).digest()[:n_bytes],
sys.byteorder
)
def get_signatures(
shingles: List[str],
band_n: int,
row_per_band: int,
mod_prime: int,
hash_params: Tuple[np.ndarray]
):
if not shingles:
return []
shingles = np.array(
[get_hash(shingle) for shingle in set(shingles)],
dtype=np.uint64
)
signatures = np.full(
shape=(band_n * row_per_band),
fill_value=mod_prime,
dtype=np.uint64
)
chunk_size = 2 ** 10
a, b = hash_params
for i in range(0, len(shingles), chunk_size):
shingles_chunk = shingles[i:i+chunk_size]
signatures = np.minimum(
signatures,
np.min((shingles_chunk.reshape(-1, 1) * a + b) % mod_prime, axis=0)
)
return [
f"{band_i:02d}" + signatures[band_i * row_per_band : (band_i + 1) * row_per_band].tobytes().hex()
for band_i in range(band_n)
]
# region: MinHashLSH
def optimal_param(
threshold: float,
num_perm: int,
false_positive_weight: float = 0.5,
false_negative_weight: float = 0.5,
):
"""
Compute the optimal `MinHashLSH` parameter that minimizes the weighted sum
of probabilities of false positive and false negative, taken from datasketch.
Parameters
----------
threshold : float
The threshold for similarity.
num_perm : int
The number of permutations.
false_positive_weight : float
The weight of false positive.
false_negative_weight : float
The weight of false negative.
Returns
-------
Tuple[int, int]
The optimal `b` and `r` parameters.
The number of bands, and the number of rows per band respectively.
Examples
--------
>>> optimal_param(0.7, 256)
(25, 10)
"""
def false_positive_area(threshold: float, b: int, r: int):
"""Source: `datasketch.lsh`"""
def area(s):
return 1 - (1 - s ** float(r)) ** float(b)
a, _ = integrate(area, 0.0, threshold)
return a
def false_negative_area(threshold: float, b: int, r: int):
"""Source: `datasketch.lsh`"""
def area(s):
return 1 - (1 - (1 - s ** float(r)) ** float(b))
a, _ = integrate(area, threshold, 1.0)
return a
min_error = float("inf")
opt = (0, 0)
for b in range(1, num_perm + 1):
max_r = int(num_perm / b)
for r in range(1, max_r + 1):
fp = false_positive_area(threshold, b, r)
fn = false_negative_area(threshold, b, r)
error = fp * false_positive_weight + fn * false_negative_weight
if error < min_error:
min_error = error
opt = (b, r)
return opt
# region: Quality Control
def process_cluster(cluster: List[Any]) -> List[Any]:
return cluster[:1]
@register_etl
def deduplication___minhash___lsh_jaccard(
spark: SparkSession,
data: Union[RDD, DataFrame],
threshold: float = 0.7,
ngram_size: int = 5,
min_length: int = 5,
num_perm: int = 250,
band_n: int = None,
row_per_band: int = None,
id_col: Union[str, None] = None,
subset: str = "text",
seed: int = 42,
duplicates_save_path: Union[str, None] = None,
*args,
**kwargs,
) -> RDD:
"""
Fuzzy deduplication using MinHash and Locality Sensitive Hashing (LSH).
Args:
spark (SparkSession): The SparkSession object.
data (Union[RDD, DataFrame]): Input data to be deduplicated.
threshold (float, optional): Similarity threshold. Default is 0.7.
ngram_size (int, optional): Size of n-grams. Default is 5.
min_length (int, optional): Minimum token length of document to be considered. Default is 5.
num_perm (int, optional): Number of permutations. Default is 250.
band_n (int, optional): Number of bands. If not provided, it will be calculated based on the threshold and num_perm.
row_per_band (int, optional): Number of rows per band. If not provided, it will be calculated based on the threshold and num_perm.
id_col (str, optional): Key column used to extract duplicated rows. If not provided, a temporary id column will be created.
subset (str, optional): Column to deduplicate on. Default is "text".
seed (int, optional): Random seed. Default is 42.
duplicates_save_path (str, optional): Save path for duplicated entries. If not provided, not saving the duplicates.
Returns:
RDD: Deduplicated data as a DataFrame.
"""
spark.sparkContext.setCheckpointDir("checkpoint")
from graphframes import GraphFrame
if isinstance(data, RDD):
data_df = data.toDF()
elif isinstance(data, DataFrame):
data_df = data
if (
duplicates_save_path is not None
and os.path.exists(duplicates_save_path)
):
raise ValueError(f"duplicates_save_path {duplicates_save_path} already exists.")
temp_id_col, component_col, tokens_col, ngrams_col = \
"__id__", "__component__", "__tokens__", "__ngrams__"
exist_cols = set(data_df.columns)
while True:
if temp_id_col in exist_cols:
temp_id_col += "_"
elif component_col in exist_cols:
component_col += "_"
elif tokens_col in exist_cols:
tokens_col += "_"
elif ngrams_col in exist_cols:
ngrams_col += "_"
else:
break
if id_col is None:
id_col = temp_id_col
print(f"create temp id col: {id_col}")
data_df = data_df.withColumn(id_col, F.monotonically_increasing_id())
data_df.persist(pyspark.StorageLevel.DISK_ONLY)
if band_n is None or row_per_band is None:
band_n, row_per_band = optimal_param(threshold, num_perm)
mod_prime = (1 << 61) - 1  # Mersenne prime 2^61 - 1; without the parentheses this evaluates to 1 << 60
gen = np.random.RandomState(seed)
hash_params = (
gen.randint(1, mod_prime, dtype=np.uint64, size=band_n * row_per_band),
gen.randint(0, mod_prime, dtype=np.uint64, size=band_n * row_per_band),
)
subset_type: str = [t for c, t in data_df.dtypes if c == subset][0]
if subset_type.startswith("str"):
# assume subset col should be tokenized
tokens_df = RegexTokenizer(
inputCol=subset,
outputCol=tokens_col,
pattern="\\W"
).transform(
data_df
.select(id_col, F.col(subset).substr(1, 10_000_000).alias(subset))
).select(
id_col, tokens_col
).filter(
F.size(tokens_col) >= min_length
)
elif subset_type.startswith("array"):
print("already tokenized.")
tokens_col = subset
tokens_df = data_df.select(id_col, tokens_col)
shingles_df = NGram(
n=ngram_size,
inputCol=tokens_col,
outputCol=ngrams_col
).transform(tokens_df).select(id_col, ngrams_col)
sig_udf = F.udf(
functools.partial(
get_signatures,
band_n=band_n,
row_per_band=row_per_band,
mod_prime=mod_prime,
hash_params=hash_params
),
returnType=T.ArrayType(T.StringType())
)
signature_df = (
shingles_df
.select(id_col, F.explode(sig_udf(ngrams_col)).alias("band"))
.groupby("band")
.agg(
F.collect_set(id_col).alias("ids")
)
)
edge_udf = F.udf(
generate_edges,
returnType=T.ArrayType(T.ArrayType(data_df.schema[id_col].dataType))
)
edges_df = (
signature_df
.select("ids")
.filter(F.size("ids") > 1)
.select(F.explode(edge_udf("ids")).alias("edges"))
.distinct()
.selectExpr("edges[0] as src", "edges[1] as dst")
).persist(pyspark.StorageLevel.DISK_ONLY)
count = edges_df.count()
if count == 0:
print("no entry for deduplication.")
edges_df.unpersist()
data_df.unpersist()
return data
vertices_df = (
edges_df
.selectExpr("src as id")
.union(edges_df.selectExpr("dst as id"))
.distinct()
)
assignment = (
GraphFrame(vertices_df, edges_df)
.connectedComponents(broadcastThreshold=200 * (1024 ** 2))
)
join_df = data_df.join(
assignment.select(
F.col("id").alias(id_col),
F.col("component").alias(component_col)
),
on=id_col,
how="left"
)
if duplicates_save_path is not None:
duplicates_df = (
join_df
.filter(F.col(component_col).isNotNull())
.drop(ngrams_col)
)
if id_col == temp_id_col:
duplicates_df = duplicates_df.drop(id_col)
if tokens_col != subset:
duplicates_df = duplicates_df.drop(tokens_col)
duplicates_df.write.parquet(duplicates_save_path)
duplicates_df.unpersist()
final_df = (
join_df
.filter(F.col(component_col).isNull())
.union(
join_df
.filter(F.col(component_col).isNotNull())
.dropDuplicates([component_col])
)
.drop(component_col, ngrams_col)
)
if id_col == temp_id_col:
final_df = final_df.drop(id_col)
if tokens_col != subset:
final_df = final_df.drop(tokens_col)
edges_df.unpersist()
return final_df.rdd
================================================
FILE: dataverse/etl/deduplication/polyglot.py
================================================
"""
Code is from EleutherAI/dps
https://github.com/EleutherAI/dps/blob/master/dps/spark/jobs/dedup_job.py
This is a migration of the deduplication job from the DPS project to Dataverse.
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import binascii
import random
from itertools import combinations
from typing import List, Union
import numpy as np
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl import register_etl
MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1
HASH_RANGE = 1 << 32
def shingle_word(text: str, n_gram: int = 15, char_level: bool = False) -> List[str]:
"""
example
-------
>>> shingle_word("hello world from ducky", n_gram=2)
['hello_world', 'world_from', 'from_ducky']
>>> shingle_word("hello world from ducky", n_gram=2, char_level=True)
['h_e', 'e_l', 'l_l', 'l_o', 'o_w', 'w_o', 'o_r', 'r_l', 'l_d', 'd_f', 'f_r', 'r_o', 'o_m', 'm_d', 'd_u', 'u_c', 'c_k', 'k_y']
"""
res = []
text_words = text.split() if not char_level else text
for i in range(len(text_words)):
shingle = text_words[i : i + n_gram]
if len(shingle) == n_gram:
res.append("_".join(shingle).encode("utf-8"))
return res
def generate_minhash(shingles: List, num_perm: int = 64, seed: int = 1) -> np.array:
def hashfunc(b: bytes) -> bytes:
return binascii.crc32(b) & MAX_HASH
hashvalues = np.ones(num_perm, dtype=np.uint64) * MAX_HASH
generator = np.random.RandomState(seed)
permutations = np.array(
[
(
generator.randint(1, MERSENNE_PRIME, dtype=np.uint64),
generator.randint(0, MERSENNE_PRIME, dtype=np.uint64),
)
for _ in range(num_perm)
],
dtype=np.uint64,
).T
for shingle in shingles:
hv = hashfunc(shingle)
a, b = permutations
phv = np.bitwise_and((a * hv + b) % MERSENNE_PRIME, np.uint64(MAX_HASH))
hashvalues = np.minimum(phv, hashvalues)
return hashvalues
def jaccard_by_hashvalues(src_hashvalues, tgt_hashvalues) -> float:
if len(src_hashvalues) != len(tgt_hashvalues):
raise ValueError("hashvalue arrays must have the same length")
# np.float was removed in NumPy 1.24; use the builtin float instead
return float(np.count_nonzero(src_hashvalues == tgt_hashvalues)) / float(len(src_hashvalues))
def expand_instances_by_minhash(
data, expand_size: int, n_gram: int, seed: int = 1, char_level: bool = False
):
shingles = shingle_word(data["text"], n_gram=n_gram, char_level=char_level)
minhashes = generate_minhash(shingles, num_perm=expand_size, seed=seed)
for mh in minhashes.tolist():
yield (str(mh), [dict(**data, shingles=shingles, hashvalues=minhashes)])
def explore_dedup_instance(hash_groups, threshold: float = 0.8):
if len(hash_groups) <= 1:
return
group_represent_text = hash_groups[0]["text"] # not to remove all text instances in group.
pairs = combinations(hash_groups, 2)
for d_1, d_2 in pairs:
sim_score = jaccard_by_hashvalues(d_1["hashvalues"], d_2["hashvalues"])
if sim_score >= threshold:
dedup_text = [d_1["text"], d_2["text"]]
if group_represent_text in dedup_text:
yield dedup_text[0] if dedup_text[0] != group_represent_text else dedup_text[1]
else:
yield random.choice(dedup_text)
@register_etl
def deduplication___polyglot___minhash(
spark,
data: Union[RDD, DataFrame],
expand_size: int = 64,
n_gram: int = 15,
seed: int = 1,
char_level: bool = False,
sim_threshold: float = 0.8,
*args,
**kwargs,
):
"""
Fuzzy deduplication using MinHash algorithm.
Args:
spark (SparkSession): The SparkSession object.
data (Union[RDD, DataFrame]): The input data to be deduplicated.
expand_size (int, optional): The size of expansion for each instance. Defaults to 64.
n_gram (int, optional): The size of n-gram for tokenization. Defaults to 15.
seed (int, optional): The seed value for random number generation. Defaults to 1.
char_level (bool, optional): Whether to use character-level tokenization. Defaults to False.
sim_threshold (float, optional): The similarity threshold for deduplication. Defaults to 0.8.
*args: Additional positional arguments.
**kwargs: Additional keyword arguments.
Returns:
RDD or DataFrame: The deduplicated data.
Raises:
None
Examples:
>>> deduplication___polyglot___minhash()(spark, data, expand_size=128, sim_threshold=0.9)
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
overlap_kv_rdd: RDD = (
data.flatMap(
lambda x: expand_instances_by_minhash(
x,
expand_size=expand_size,
n_gram=n_gram,
seed=seed,
char_level=char_level,
)
)
.reduceByKey(lambda x, y: x + y)
.flatMap(lambda x: explore_dedup_instance(x[1], threshold=sim_threshold))
.distinct()
.map(lambda x: (x, dict(text=x)))
.cache()
)
data = data.map(lambda x: (x["text"], x)).subtractByKey(overlap_kv_rdd).map(lambda x: x[1])
return data
================================================
FILE: dataverse/etl/pii/README.md
================================================
# PII (Personally Identifiable Information)
> Replacing, Removing, and Anonymizing PII
## 🌌 Naming Convention
> This is a strong recommendation. You can use your own naming convention if you want.
```python
def pii___[ETL Sub-Category]___[ETL Process]()
```
- `ETL Sub-Category` - the `PII` type
- e.g. card number
- e.g. email
- e.g. phone number
- `ETL process name` - what you are doing to the `PII`
- e.g. remove
- e.g. replace
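
As a concrete illustration of a `replace`-style process, here is the core token-replacement logic used by modules such as `card.py`, stripped of the Spark plumbing (the `mask_pii` helper name is illustrative, not part of the Dataverse API):

```python
import re

# Regex for 16-digit card numbers in 4-4-4-4 form, as in pii/card.py
CARD_PATTERN = r"\d{4}-\d{4}-\d{4}-\d{4}"


def mask_pii(text, pattern=CARD_PATTERN, replace_token="[CARD_NUMBER]",
             start_token="", end_token=""):
    """Replace every match with start_token + replace_token + end_token."""
    def _sub(_match):
        return f"{start_token}{replace_token}{end_token}"
    return re.sub(pattern, _sub, text)


masked = mask_pii("card number is 1234-1234-1234-1234.")
# masked == "card number is [CARD_NUMBER]."
```

The real ETL processes apply this same `re.sub` to one column of every row in an RDD, and can alternatively randomize the digits instead of substituting a token.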
================================================
FILE: dataverse/etl/pii/__init__.py
================================================
================================================
FILE: dataverse/etl/pii/card.py
================================================
"""
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import random
import re
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
@register_etl
def pii___card___replace_card_number(
spark,
data: Union[RDD, DataFrame],
subset: str = "text",
pattern: str = r"(\d{4}-\d{4}-\d{4}-\d{4})",
random_pii: bool = True,
replace_pii: bool = False,
replace_token: str = "[CARD_NUMBER]",
start_token: str = "",
end_token: str = "",
*args,
**kwargs,
) -> RDD:
r"""
Replace card number with a random number or a token
Args:
spark: The SparkSession object.
data (Union[RDD, DataFrame]): The input data to process.
subset (str, optional): The subset or columns to consider. Defaults to 'text'.
pattern (str, optional): The regex pattern to find. Defaults to r'(\d{4}-\d{4}-\d{4}-\d{4})'.
random_pii (bool, optional): If True, replace the pii with a random number. Defaults to True.
replace_pii (bool, optional): If True, replace the pii with the `replace_token`. Defaults to False.
replace_token (str, optional): The token to replace the pii with. Defaults to '[CARD_NUMBER]'.
start_token (str, optional): The start token to append where the pattern is found. Defaults to ''.
end_token (str, optional): The end token to append where the pattern is found. Defaults to ''.
Returns:
RDD: The processed data.
Caveats:
- `replace_pii` takes precedence over `random_pii`
- e.g when both are True, the card number will be replaced with the token
- e.g. this is 1234-1234-1234-1234 -> this is [CARD_NUMBER]
- `start_token` and `end_token` are appended to the start and end of the card number
- they are applied regardless of whether `random_pii` or `replace_pii` is True or False
Examples:
<input>
- text = 'card number is 1234-1234-1234-1234.'
<output>
- random pii
- text = 'card number is 2238-1534-1294-1274.'
- replace pii
- replace_token = '[CARD_NUMBER]'
- text = 'card number is [CARD_NUMBER].'
- start token
- start_token = '[CARD_NUMBER_START]'
- text = 'card number is [CARD_NUMBER_START]1234-1234-1234-1234.'
- end token
- end_token = '[CARD_NUMBER_END]'
- text = 'card number is 1234-1234-1234-1234[CARD_NUMBER_END].'
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _replace_match(match):
match = match.group()
if replace_pii:
match = replace_token
elif random_pii:
match = re.sub(r"\d", lambda x: str(random.randint(0, 9)), match)
return f"{start_token}{match}{end_token}"
def _replace_pii(row):
row[subset] = re.sub(pattern, _replace_match, row[subset])
return row
data = data.map(_replace_pii)
return data
================================================
FILE: dataverse/etl/pii/nin.py
================================================
"""
NIN (National Identification Number)
=====================================
A national identification number, national identity number, or
national insurance number or JMBG/EMBG is used by the governments
of many countries as a means of tracking their citizens, permanent residents,
and temporary residents for the purposes of work, taxation,
government benefits, health care, and other governmentally-related functions.
https://en.wikipedia.org/wiki/National_identification_number
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import random
import re
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
@register_etl
def pii___nin___replace_korean_rrn(
spark,
data: Union[RDD, DataFrame],
subset: str = "text",
pattern: str = r"\d{6}-\d{7}",
random_pii: bool = True,
replace_pii: bool = False,
replace_token: str = "[NIN]",
start_token: str = "",
end_token: str = "",
*args,
**kwargs,
) -> RDD:
r"""
Replace Korean RRN (Resident Registration Number) with a random number or a token
Args:
spark (SparkSession): The Spark session object.
data(Union[RDD, DataFrame]): The input data to be processed.
subset(str, optional): A subset or column to consider. Defaults to 'text'.
pattern(str, optional): The regex pattern to find. Defaults to r'\d{6}-\d{7}'.
random_pii(bool, optional): If True, replace the pii with a random number. Defaults to True.
replace_pii(bool, optional): If True, replace the pii with the `replace_token`. Defaults to False.
replace_token(str, optional): The token to replace the pii with. Defaults to '[NIN]'.
start_token(str, optional): The start token to append where the pattern is found. Defaults to ''.
end_token(str, optional): The end token to append where the pattern is found. Defaults to ''.
Returns:
rdd: The processed data with replaced Korean RRN.
Caveats:
- `replace_pii` takes precedence over `random_pii`
- `start_token` and `end_token` are appended to the start and end of the number
- they are applied regardless of whether `random_pii` or `replace_pii` is True or False
Examples:
<input>
- text = 'nin is 123456-1234567'
<output>
- random pii
- text = 'nin is 141124-1244121'
- replace pii
- replace_token = '[NIN]'
- text = 'nin is [NIN].'
- start token
- start_token = '[NIN_START]'
- text = 'nin is [NIN_START]123456-1234567'
- end token
- end_token = '[NIN_END]'
- text = 'nin is 123456-1234567[NIN_END].'
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
def _replace_match(match):
match = match.group()
if replace_pii:
match = replace_token
elif random_pii:
match = re.sub(r"\d", lambda x: str(random.randint(0, 9)), match)
return f"{start_token}{match}{end_token}"
def _replace_pii(row):
row[subset] = re.sub(pattern, _replace_match, row[subset])
return row
data = data.map(_replace_pii)
return data
================================================
FILE: dataverse/etl/pipeline.py
================================================
"""
ETL Interface
----------------------
user will be interacting with this interface
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import time
from pathlib import Path
from typing import Union
import boto3
from omegaconf import DictConfig, OmegaConf
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from dataverse.config import Config
from dataverse.etl import ETLRegistry
from dataverse.utils.api import AWSClient, EMRManager, aws_check_credentials
from dataverse.utils.setting import SystemSetting
class ETLPipeline:
"""
ETL Pipeline.
This class represents an ETL (Extract, Transform, Load) pipeline.
It provides methods for managing and executing ETL processes.
Attributes:
registry (ETLRegistry): The registry of ETL processes.
Examples:
>>> etl_pipeline = ETLPipeline()
>>> etl_pipeline.status()
>>> etl_pipeline.search('data_ingestion', 'ufl')
>>> spark, data = etl_pipeline.sample()
>>> config = Config.default()
>>> etl_pipeline.run(config = config)
"""
def __init__(self):
self.registry = ETLRegistry()
def __len__(self):
return len(self.registry)
def status(self):
"""
Get the status of the registry.
Returns:
str: The status of the registry.
Raises:
None
Examples:
>>> etl_pipeline = EtlPipeline()
>>> etl_pipeline.status()
'If you need details of ETL Registry use `etl_pipeline.search()`'
Note:
This method does not show detailed information.
It only shows information about categories.
"""
print("If you need details of ETL Registry use `etl_pipeline.search()`")
return str(self.registry)
def search(self, category=None, sub_category=None):
"""
Get detailed status of the registry by searching.
This function lets you know category, sub_category, and etl_name.
Args:
category (str, optional): The category to filter the search results. Defaults to None.
sub_category (str, optional): The sub-category to filter the search results. Defaults to None.
Returns:
list: A list of search results matching the specified category and sub-category.
Examples:
Return every ETL
>>> etl_pipeline.search()
Only selected category
>>> etl_pipeline.search('data_ingestion')
>>> etl_pipeline.search(category='data_ingestion')
Only selected category & sub_category
>>> etl_pipeline.search('data_ingestion', 'ufl')
>>> etl_pipeline.search(category='data_ingestion', sub_category='ufl')
"""
return self.registry.search(category=category, sub_category=sub_category)
def get(self, key):
"""get ETL class from registry"""
return self.registry.get(key=key)
def setup_spark_conf(self, config, verbose=False):
"""
By design, AWS credential logging is not affected by the `verbose` flag
"""
# TODO: add more spark configurations
spark_conf = SparkConf()
spark_conf.set("spark.master", config.spark.master)
spark_conf.set("spark.app.name", config.spark.appname)
spark_conf.set("spark.driver.memory", config.spark.driver.memory)
spark_conf.set("spark.driver.maxResultSize", config.spark.driver.maxResultSize)
spark_conf.set("spark.executor.memory", config.spark.executor.memory)
spark_conf.set("spark.local.dir", config.spark.local.dir)
spark_conf.set("spark.ui.port", config.spark.ui.port)
spark_conf.set("spark.jars.packages", "graphframes:graphframes:0.8.3-spark3.5-s_2.12")
# AWS S3 Support
if aws_check_credentials(verbose=verbose):
session = boto3.Session()
credentials = session.get_credentials()
spark_conf.set("spark.hadoop.fs.s3a.access.key", credentials.access_key)
spark_conf.set("spark.hadoop.fs.s3a.secret.key", credentials.secret_key)
spark_conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_ver = SystemSetting().get("HADOOP_VERSION")
spark_conf.set(
"spark.jars.packages",
(
f"org.apache.hadoop:hadoop-aws:{hadoop_ver}"
f",com.amazonaws:aws-java-sdk-bundle:1.12.592"
),
)
# check if the credentials are temporary or not
try:
spark_conf.set("spark.hadoop.fs.s3a.session.token", credentials.token)
spark_conf.set(
"spark.hadoop.fs.s3a.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
) # this is for temporary credentials
print("spark conf is set with [ temporary ] S3 credentials")
except Exception:
print("spark conf is set with [ permanent ] S3 credentials")
else:
print("[ No AWS Credentials Found] - Failed to set spark conf for S3")
return spark_conf
def sample(
self,
n=100,
config=None,
sample_etl="data_ingestion___test___generate_fake_ufl",
verbose=False,
):
"""
Get the spark session and sample data.
Use this function to test the ETL pipeline quickly without config.
Args:
n (int): The number of data to generate. Default is 100.
config (Union[str, dict, OmegaConf]): Config for the ETL. Default is None.
sample_etl (str): The name of the sample ETL process. Default is "data_ingestion___test___generate_fake_ufl".
verbose (bool): If True, print the status. Default is False.
Returns:
Tuple[SparkSession, DataFrame]: The Spark session and the sampled data.
"""
if config is None:
config = Config.default()
else:
config = Config.load(config)
config = Config.set_default(config)
# remove all the ETL processes
config.etl = []
config.etl.append({"name": sample_etl, "args": {"n": n}})
if verbose:
print("=" * 50)
print("[ Configuration ]")
print(OmegaConf.to_yaml(config))
print("=" * 50)
spark_conf = self.setup_spark_conf(config, verbose=verbose)
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
if verbose:
print("=" * 50)
print("[ Spark Final Configuration ]")
print(OmegaConf.to_yaml(spark_conf.getAll()))
print("=" * 50)
sample_etl_class = self.get(key=sample_etl)
data = sample_etl_class()(spark, n=n, etl_name=sample_etl)
if verbose:
print(
(
f"{'=' * 50}\n"
"[ SAMPLE MODE ]\n"
f"{'=' * 50}\n"
"This is a quick way to get the sample data for testing or debugging w/o config.\n"
"If you want to test the ETL pipeline with your own data, please use `run` w/ config.\n"
f"{'=' * 50}\n"
"=> spark, data = etl_pipeline.sample()\n"
"=> data = data.map(add awesome duck to column)\n"
f"{'=' * 50}\n"
)
)
return spark, data
def run(
self,
config: Union[str, dict, DictConfig, OmegaConf, Path],
verbose=False,
cache=False,
emr=False,
*args,
**kwargs,
):
"""
Runs the ETL process.
Args:
config (Union[str, dict, OmegaConf]): config for the etl
- str: path to the config file
- dict: config dict
- OmegaConf: config object
verbose (bool): if True, print the status of the etl pipeline
- the verbose will be applied to the ETL process as well
- ETL process `verbose` takes precedence over this
cache (bool): cache every stage of the ETL process
emr (bool): if True, run the ETL process on EMR
"""
# ================ [ EMR ] ===================
if emr:
return self.run_emr(
config,
verbose=verbose,
cache=cache,
*args,
**kwargs,
)
# =============== [ Set Config ] ==================
# mainly this is to fill the missing config args with default
config = Config.load(config)
config = Config.set_default(config)
if verbose:
print("=" * 50)
print("[ Configuration ]")
print(OmegaConf.to_yaml(config))
print("=" * 50)
# ================ [ Set Spark ] ===================
spark_conf = self.setup_spark_conf(config, verbose=verbose)
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
if verbose:
print("=" * 50)
print("[ Spark Final Configuration ]")
print(OmegaConf.to_yaml(spark_conf.getAll()))
print("=" * 50)
# ================= [ Run ETL ] ====================
# [ Load RDD/DataFrame ] - data ingestion
# [ Preprocessing ]
# [ Save RDD/DataFrame ] - data save
etl_configs = config.etl
total_etl_n = len(etl_configs)
# [switch] is the ETL process ended or not
# if not, spark session & data will be returned to continue
IS_ETL_FINISHED = True
data = None
prev_etl_name = None
prev_data = None # for caching
for etl_i, etl_config in enumerate(etl_configs):
# etl_config.name format
# =====>[ etl_cate___etl_sub_cate___etl_name ]
etl_name = etl_config.name
etl_category = etl_name.split("___")[0]
etl_class = self.get(key=etl_name)
# instantiate etl class
etl_instance = etl_class()
# this is middle creator mode
# if the last ETL process is not data save
if etl_i == total_etl_n - 1 and etl_category != "data_save":
if verbose:
print(
(
f"{'=' * 50}\n"
"[ DEBUG MODE ]\n"
f"{'=' * 50}\n"
f"Last ETL process was assigned for [ {etl_category} ]\n"
"Spark session will not be stopped and will be returned\n"
"If this is not intended, please assign [ data_save ] at the end.\n"
f"{'=' * 50}\n"
"Example:\n"
"=> spark, data = etl_pipeline.run(config)\n"
"=> data = data.map(add awesome duck to column)\n"
f"{'=' * 50}\n"
)
)
IS_ETL_FINISHED = False
# when args is not defined, set it to empty dict
if "args" in etl_config:
args = etl_config.args
else:
args = {}
# if verbose is not defined, set it same to the pipeline
if "verbose" not in args:
args["verbose"] = verbose
# `etl_name` is passed to args for tracking
if etl_i == 0:
data = etl_instance(spark, **args, etl_name=etl_name, prev_etl_name=None)
else:
data = etl_instance(
spark, data, **args, etl_name=etl_name, prev_etl_name=prev_etl_name
)
# cache the data
if cache:
if prev_data is not None:
prev_data.unpersist()
data.cache()
prev_data = data
prev_etl_name = etl_name
# =============== [ Stop Spark ] ==================
if IS_ETL_FINISHED:
spark.stop()
if verbose:
print("=" * 50)
print("[ Spark Successfully Done ]")
print("=" * 50)
return spark, data
def run_emr(
self,
config: Union[str, dict, DictConfig, OmegaConf, Path],
verbose=False,
cache=False,
*args,
**kwargs,
):
"""
Runs the ETL process on an EMR cluster.
Args:
config (Union[str, dict, OmegaConf]): config for the etl
- str: path to the config file
- dict: config dict
- OmegaConf: config object
verbose (bool): if True, print the status of the etl pipeline
- the verbose will be applied to the ETL process as well
- ETL process `verbose` takes precedence over this
cache (bool): cache every stage of the ETL process
Returns:
None, Config:
- None for spark session
- Config for the config
- originally the data would also be returned, but it is not needed for EMR
"""
if not aws_check_credentials(verbose=verbose):
raise ValueError("AWS EMR requires AWS credentials")
# =============== [ Set Config ] ==================
config = Config.load(config)
config = Config.set_default(config, emr=True)
# EMR resource manager - yarn
config.spark.master = "yarn"
# reset local_dir for EMR cluster
config.spark.local.dir = "/tmp"
# ================ [ EMR ] ===================
# NOTE: config will be auto-updated by EMR Manager
emr_manager = EMRManager()
try:
# EMR cluster launch
emr_manager.launch(config)
if verbose:
print("=" * 50)
print("[ Configuration ]")
print(OmegaConf.to_yaml(config))
print("=" * 50)
# EMR cluster environment setup & run spark
step_id = emr_manager.run(config, verbose=verbose)
# wait until EMR cluster step is done
emr_manager.wait(config, step_id)
# EMR Cluster terminate
# XXX: after EMR cluster is terminated, and confirmed by waiter
# there is still a chance that the cluster is not terminated and cause error
# - DependencyViolation (which depends on terminated cluster)
# FIXME: this is a temporary solution, need to find a better way to handle this
RETRY_TERMINATE = 5
for _ in range(RETRY_TERMINATE):
try:
emr_manager.terminate(config)
break
except AWSClient().ec2.exceptions.ClientError as e:
if e.response["Error"]["Code"] == "DependencyViolation":
print("DependencyViolation - retrying to terminate EMR cluster")
time.sleep(5)
else:
raise e
except Exception as e:
raise e
# ctrl + c
except KeyboardInterrupt:
print("KeyboardInterrupt - terminating EMR cluster")
emr_manager.terminate(config)
raise KeyboardInterrupt
except Exception as e:
print("Exception - terminating EMR cluster")
emr_manager.terminate(config)
raise e
return None, config
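`run()` above relies on the three-part ETL key convention, `category___sub_category___name`, and treats a pipeline whose last step is not in the `data_save` category as debug mode (returning the Spark session and data instead of stopping). A small sketch of that parsing logic; the helper names are hypothetical, not part of Dataverse:

```python
def parse_etl_name(etl_name: str):
    """Split an ETL key of the form category___sub_category___name."""
    category, sub_category, name = etl_name.split("___", 2)
    return category, sub_category, name

def is_debug_pipeline(etl_names):
    """True when the last step is not a data_save step,
    i.e. run() would leave the Spark session open and return (spark, data)."""
    return parse_etl_name(etl_names[-1])[0] != "data_save"

steps = [
    "data_ingestion___test___generate_fake_ufl",
    "deduplication___minhash___lsh_jaccard",
]
```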
================================================
FILE: dataverse/etl/quality/README.md
================================================
# Quality
================================================
FILE: dataverse/etl/quality/__init__.py
================================================
================================================
FILE: dataverse/etl/quality/language.py
================================================
"""
language filtering from Common Crawl
This is a migration of the common crawl code to Dataverse.
Some parts of the code are from facebookresearch/cc_net
https://github.com/facebookresearch/cc_net/blob/main/cc_net/split_by_lang.py
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import functools
from pathlib import Path
from typing import List, Union
import requests
from fasttext.FastText import _FastText
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.etl.registry import register_etl
from dataverse.utils.setting import SystemSetting
def load_fasttext(
url="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin",
):
"""
There are two known issues here:
- fasttext models are not serializable, so the model must be loaded for every task
- this is extremely slow; ideally the model would be loaded once and shared across all tasks
- since per-task loading could lead to duplicated downloads, we check whether the model is already downloaded
- no duplicated download has been observed so far; if one occurs, it should be fixed in the future
"""
# FIXME: this is a manual check for duplicate download
# rd_n = np.random.randint(0, 1000000)
# print(rd_n, 'entered load_fasttext model!')
# Get the lid.bin file for Fasttext
cache_dir = SystemSetting().CACHE_DIR
cache_dir = Path(f"{cache_dir}/.cache/dataverse/model")
fasttext_path = cache_dir / "fasttext" / "bin" / "lid.bin"
fasttext_path.parent.mkdir(parents=True, exist_ok=True)  # Create parent directories if they don't exist
if not fasttext_path.exists():
# FIXME: this is a manual check for duplicate download
# print(rd_n, 'downloading fasttext model!')
response = requests.get(url, stream=True)
# Raise exception if downloading is not successful
response.raise_for_status()
with open(fasttext_path, "wb") as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
# FIXME: this is to suppress the warning message
# return fasttext.load_model(str(fasttext_path))
return _FastText(model_path=str(fasttext_path))
def language_predict_fasttext(row, model, top_k: int = 1, score_rounding: int = 2):
text = row["text"].replace("\n", "")
labels, scores = model.predict(text, k=top_k)
labels = [label.replace("__label__", "") for label in labels]
row["labels"] = labels
row["scores"] = scores.round(score_rounding)
return row
def language_predict_fasttext_by_partition(rows, top_k: int = 1, score_rounding: int = 2):
# loaded for every partition
model = load_fasttext()
# FIXME: multiprocessing is not possible here because the model is not serializable
# pool = multiprocessing.Pool(processes = os.cpu_count() or 0)
# results = pool.imap(
# functools.partial(language_predict_fasttext, model=model, top_k=top_k),
# rows,
# )
for row in rows:
yield language_predict_fasttext(row, model, top_k=top_k)
@register_etl
def quality___language___fasttext_filter(
spark,
data: Union[RDD, DataFrame],
subset: str = "text",
top_k: int = 1,
score_rounding: int = 2,
threshold: float = 0.0,
whitelist: List[str] = None,
blacklist: List[str] = None,
*args,
**kwargs,
) -> RDD:
"""
Filters data by language using fasttext.
Rows whose language score falls below the threshold are filtered out.
Args:
spark (SparkSession): The Spark session object.
data (Union[RDD, DataFrame]): The input data to be processed.
subset (str, optional): A subset or column to consider. Defaults to 'text'.
top_k (int, optional): The number of top languages to keep after classification. Defaults to 1.
- if fasttext classified 3 languages, top_k=1 will keep the top language
- [en, fr, de] -> [en]
- if fasttext classified 3 languages, top_k=2 will keep the top 2 languages
- [en, fr, de] -> [en, fr]
score_rounding (int, optional): The number of decimal places to round the scores. Defaults to 2.
threshold (float, optional): The minimum score required to keep a row. Defaults to 0.0.
whitelist (List[str], optional): The list of languages to keep. Defaults to None.
blacklist (List[str], optional): The list of languages to remove. Defaults to None.
Raises:
ValueError: If both whitelist and blacklist are not None.
Returns:
rdd: The filtered data.
Caveats about `whitelist` and `blacklist`:
- [Default] If both `whitelist` and `blacklist` are None, all languages will be kept.
- If both `whitelist` and `blacklist` are not None, an error will be raised.
- If `whitelist` is not None, only the languages in the `whitelist` will be kept.
- If `blacklist` is not None, the languages in the `blacklist` will be removed.
"""
if isinstance(data, DataFrame):
data = data.rdd
data = data.map(lambda row: row.asDict())
# detect language using fasttext
data = data.mapPartitions(
functools.partial(
language_predict_fasttext_by_partition,
top_k=top_k,
score_rounding=score_rounding,
)
)
# filter by threshold
data = data.filter(lambda x: any(s >= threshold for s in x["scores"][:top_k]))
# filter by whitelist and blacklist
if whitelist is not None and blacklist is not None:
raise ValueError("whitelist and blacklist cannot both be set")
elif whitelist is not None:
data = data.filter(lambda x: any(label in whitelist for label in x["labels"][:top_k]))
elif blacklist is not None:
data = data.filter(lambda x: all(label not in blacklist for label in x["labels"][:top_k]))
else:
# otherwise, keep all languages
...
# remove labels and scores
data = data.map(lambda x: {k: v for k, v in x.items() if k != "labels" and k != "scores"})
return data
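The threshold, whitelist, and blacklist rules above can be mirrored on plain Python dicts, which makes the filter semantics easy to verify without a Spark session. This `language_filter` helper is a hypothetical stand-in for the RDD filters, not part of Dataverse:

```python
def language_filter(rows, top_k=1, threshold=0.0, whitelist=None, blacklist=None):
    """Pure-Python analogue of the RDD filters in quality___language___fasttext_filter."""
    if whitelist is not None and blacklist is not None:
        raise ValueError("whitelist and blacklist cannot both be set")
    out = []
    for row in rows:
        # drop rows whose best scores are all below the threshold
        if not any(s >= threshold for s in row["scores"][:top_k]):
            continue
        labels = row["labels"][:top_k]
        # whitelist: keep only rows with at least one whitelisted label
        if whitelist is not None and not any(l in whitelist for l in labels):
            continue
        # blacklist: drop rows with any blacklisted label
        if blacklist is not None and not all(l not in blacklist for l in labels):
            continue
        # strip the temporary labels/scores columns, as the ETL does
        out.append({k: v for k, v in row.items() if k not in ("labels", "scores")})
    return out

rows = [
    {"text": "hello", "labels": ["en"], "scores": [0.99]},
    {"text": "bonjour", "labels": ["fr"], "scores": [0.40]},
]
```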
================================================
FILE: dataverse/etl/registry.py
================================================
"""
Base class to support the registration of the ETL classes
Copyright (c) 2024-present Upstage Co., Ltd.
Apache-2.0 license
"""
import abc
import importlib.util
import inspect
import os
from functools import wraps
from typing import Union
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from dataverse.utils.setting import SystemSetting
# TODO: If you add category directories
SYMBOL INDEX (288 symbols across 60 files)
FILE: dataverse/api/cli.py
function main (line 9) | def main():
FILE: dataverse/api/emr.py
function import_dynamic_etls (line 12) | def import_dynamic_etls():
function main (line 39) | def main(config, verbose=False):
FILE: dataverse/config/interface.py
class Config (line 23) | class Config:
method __new__ (line 31) | def __new__(cls, *args, **kwargs):
method load (line 35) | def load(cls, config: Union[str, dict, DictConfig, OmegaConf, Path]):
method save (line 101) | def save(cls, config, path: Union[str, Path]):
method default (line 123) | def default(cls, emr: bool = False):
method set_default (line 227) | def set_default(cls, config: OmegaConf, emr: bool = False):
FILE: dataverse/etl/__sample/ducky.py
function __sample___ducky___make_your_own_etl_processor (line 12) | def __sample___ducky___make_your_own_etl_processor(data: Union[RDD, Data...
FILE: dataverse/etl/__sample/github.py
function __sample___github___using_decorator (line 15) | def __sample___github___using_decorator(data: Union[RDD, DataFrame], *ar...
function __sample___github___config (line 23) | def __sample___github___config(data: Union[RDD, DataFrame], config: dict...
FILE: dataverse/etl/cleaning/char.py
function cleaning___char___normalize_whitespace (line 20) | def cleaning___char___normalize_whitespace(
function cleaning___char___remove_unprintable (line 53) | def cleaning___char___remove_unprintable(
function strip_accents (line 88) | def strip_accents(text: str) -> str:
function cleaning___char___remove_accent (line 98) | def cleaning___char___remove_accent(
FILE: dataverse/etl/cleaning/document.py
function cleaning___document___split_by_word (line 17) | def cleaning___document___split_by_word(
FILE: dataverse/etl/cleaning/html.py
function cleaning___html___extract_plain_text (line 18) | def cleaning___html___extract_plain_text(
FILE: dataverse/etl/cleaning/korean.py
class KoreanType (line 18) | class KoreanType(IntEnum):
function character_is_korean (line 44) | def character_is_korean(c):
function decompose (line 53) | def decompose(c):
function compose (line 71) | def compose(chosung, jungsung, jongsung):
function cleaning___korean___filter_by_ratio (line 79) | def cleaning___korean___filter_by_ratio(
function classify_korean_type (line 182) | def classify_korean_type(unicode):
function reduce_repeated_emotions (line 193) | def reduce_repeated_emotions(text, num_repeats=2):
function cleaning___korean___reduce_emoticon (line 202) | def cleaning___korean___reduce_emoticon(
FILE: dataverse/etl/cleaning/length.py
function cleaning___length___char_len_filter (line 17) | def cleaning___length___char_len_filter(
function cleaning___length___word_len_filter (line 67) | def cleaning___length___word_len_filter(
FILE: dataverse/etl/cleaning/number.py
function cleaning___number___normalize (line 16) | def cleaning___number___normalize(
FILE: dataverse/etl/cleaning/table.py
function cleaning___table___merge_col_vertical (line 16) | def cleaning___table___merge_col_vertical(
FILE: dataverse/etl/cleaning/unicode.py
function cleaning___unicode___remove_punct (line 54) | def cleaning___unicode___remove_punct(
function cleaning___unicode___replace_punct (line 85) | def cleaning___unicode___replace_punct(
function cleaning___unicode___normalize (line 116) | def cleaning___unicode___normalize(
FILE: dataverse/etl/data_ingestion/arrow.py
function find_arrow_paths (line 21) | def find_arrow_paths(directory):
function get_dir_size (line 34) | def get_dir_size(arrow_paths):
function arrow_table_to_dict (line 48) | def arrow_table_to_dict(arrow_path):
function data_ingestion___arrow___hf2raw (line 78) | def data_ingestion___arrow___hf2raw(
FILE: dataverse/etl/data_ingestion/common_crawl.py
function parse_doc (line 37) | def parse_doc(headers: List[str], doc: List[str]) -> Optional[dict]:
function group_by_docs (line 85) | def group_by_docs(warc_lines: Iterable[str]) -> Iterable[dict]:
function _close_when_exhausted (line 112) | def _close_when_exhausted(file) -> Iterable[str]:
function open_segment_file (line 117) | def open_segment_file(segment: str, verbose: bool = True) -> Iterable[str]:
function process_segment_file (line 132) | def process_segment_file(segment: str, verbose: bool = True) -> Iterable...
function find_wet_files (line 138) | def find_wet_files(directory):
function _tmp (line 150) | def _tmp(prefix: str = None, suffix: str = None, dir: Path = None) -> Path:
function _yield_from (line 159) | def _yield_from(files: list) -> Iterable[str]:
function open_read (line 164) | def open_read(filename: ReadableFileLike) -> Iterable[str]:
function request_get_content (line 211) | def request_get_content(url: str, n_retry: int = 3, verbose: bool = True...
function open_remote_file (line 248) | def open_remote_file(url: str, cache: Path, verbose: bool = True) -> Ite...
function cc_wet_paths_url (line 282) | def cc_wet_paths_url(dump_id: str) -> str:
function segment_url (line 286) | def segment_url(segment: str):
function cc_segment_urls (line 290) | def cc_segment_urls(dump_id: str, cache_dir: Path, verbose: bool = True)...
function open_segment_url (line 297) | def open_segment_url(segment: str, cache_dir: Path, verbose: bool = True...
function process_segment_url (line 306) | def process_segment_url(segment: str, cache_dir: Path, verbose: bool = T...
function data_ingestion___common_crawl___wet2raw (line 313) | def data_ingestion___common_crawl___wet2raw(
function data_ingestion___common_crawl___dump2raw (line 357) | def data_ingestion___common_crawl___dump2raw(
function convert_bytes (line 424) | def convert_bytes(data):
function data_ingestion___common_crawl___raw2ufl (line 435) | def data_ingestion___common_crawl___raw2ufl(spark, data: RDD, *args, **k...
FILE: dataverse/etl/data_ingestion/csv.py
function data_ingestion___csv___csv2raw (line 17) | def data_ingestion___csv___csv2raw(
FILE: dataverse/etl/data_ingestion/cultura_x.py
function data_ingestion___cultura_x___raw2ufl (line 15) | def data_ingestion___cultura_x___raw2ufl(spark, ufl: RDD, *args, **kwargs):
FILE: dataverse/etl/data_ingestion/huggingface.py
function data_ingestion___huggingface___hf2raw (line 18) | def data_ingestion___huggingface___hf2raw(
FILE: dataverse/etl/data_ingestion/parquet.py
function data_ingestion___parquet___pq2raw (line 13) | def data_ingestion___parquet___pq2raw(
FILE: dataverse/etl/data_ingestion/red_pajama.py
function convert2ufl (line 25) | def convert2ufl(row):
function data_ingestion___red_pajama___parquet2ufl (line 32) | def data_ingestion___red_pajama___parquet2ufl(spark, input_paths, repart...
function data_ingestion___red_pajama___hf2ufl (line 45) | def data_ingestion___red_pajama___hf2ufl(
function data_ingestion___red_pajama___hf2raw (line 86) | def data_ingestion___red_pajama___hf2raw(
function data_ingestion___red_pajama___raw2ufl_templatev1 (line 115) | def data_ingestion___red_pajama___raw2ufl_templatev1(spark, ufl, *args, ...
function data_ingestion___red_pajama___raw2ufl_templatev2 (line 131) | def data_ingestion___red_pajama___raw2ufl_templatev2(spark, ufl, *args, ...
FILE: dataverse/etl/data_ingestion/slim_pajama.py
function data_ingestion___slim_pajama___parquet2ufl (line 15) | def data_ingestion___slim_pajama___parquet2ufl(spark, input_paths, repar...
function data_ingestion___slim_pajama___hf2ufl (line 26) | def data_ingestion___slim_pajama___hf2ufl(
FILE: dataverse/etl/data_ingestion/test.py
function data_ingestion___test___generate_fake_ufl (line 17) | def data_ingestion___test___generate_fake_ufl(
FILE: dataverse/etl/data_save/huggingface.py
function data_save___huggingface___ufl2hf_hub (line 22) | def data_save___huggingface___ufl2hf_hub(spark, ufl, hub_path, repartiti...
function data_save___huggingface___ufl2hf (line 31) | def data_save___huggingface___ufl2hf(
function data_save___huggingface___ufl2hf_obj (line 67) | def data_save___huggingface___ufl2hf_obj(
FILE: dataverse/etl/data_save/parquet.py
function data_save___parquet___ufl2parquet (line 18) | def data_save___parquet___ufl2parquet(
FILE: dataverse/etl/deduplication/common_crawl.py
function filter_lines (line 17) | def filter_lines(row, subset="text"):
function deduplication___common_crawl___exact_line (line 32) | def deduplication___common_crawl___exact_line(
FILE: dataverse/etl/deduplication/exact.py
function deduplication___exact___column (line 16) | def deduplication___exact___column(
FILE: dataverse/etl/deduplication/minhash.py
function generate_edges (line 33) | def generate_edges(nodes: List[int]) -> List[Tuple[int, int]]:
function get_hash (line 60) | def get_hash(text: str, n_bytes: int=8):
function get_signatures (line 67) | def get_signatures(
function optimal_param (line 105) | def optimal_param(
function process_cluster (line 170) | def process_cluster(cluster: List[Any]) -> List[Any]:
function deduplication___minhash___lsh_jaccard (line 174) | def deduplication___minhash___lsh_jaccard(
FILE: dataverse/etl/deduplication/polyglot.py
function shingle_word (line 27) | def shingle_word(text: str, n_gram: int = 15, char_level: bool = False) ...
function generate_minhash (line 49) | def generate_minhash(shingles: List, num_perm: int = 64, seed: int = 1) ...
function jaccard_by_hashvalues (line 76) | def jaccard_by_hashvalues(src_hashvalues, tgt_hashvalues) -> float:
function expand_instances_by_minhash (line 85) | def expand_instances_by_minhash(
function explore_dedup_instance (line 95) | def explore_dedup_instance(hash_groups, threshold: float = 0.8):
function deduplication___polyglot___minhash (line 113) | def deduplication___polyglot___minhash(
FILE: dataverse/etl/pii/card.py
function pii___card___replace_card_number (line 17) | def pii___card___replace_card_number(
FILE: dataverse/etl/pii/nin.py
function pii___nin___replace_korean_rrn (line 27) | def pii___nin___replace_korean_rrn(
FILE: dataverse/etl/pipeline.py
class ETLPipeline (line 25) | class ETLPipeline:
method __init__ (line 45) | def __init__(self):
method __len__ (line 48) | def __len__(self):
method status (line 51) | def status(self):
method search (line 73) | def search(self, category=None, sub_category=None):
method get (line 103) | def get(self, key):
method setup_spark_conf (line 107) | def setup_spark_conf(self, config, verbose=False):
method sample (line 157) | def sample(
method run (line 222) | def run(
method run_emr (line 356) | def run_emr(
FILE: dataverse/etl/quality/language.py
function load_fasttext (line 25) | def load_fasttext(
function language_predict_fasttext (line 62) | def language_predict_fasttext(row, model, top_k: int = 1, score_rounding...
function language_predict_fasttext_by_partition (line 73) | def language_predict_fasttext_by_partition(rows, top_k: int = 1, score_r...
function quality___language___fasttext_filter (line 88) | def quality___language___fasttext_filter(
FILE: dataverse/etl/registry.py
function auto_register (line 42) | def auto_register(etl_categories=ETL_CATEGORIES):
class ETLStructure (line 71) | class ETLStructure:
class ETLRegistry (line 75) | class ETLRegistry:
method __new__ (line 99) | def __new__(cls):
method __init__ (line 104) | def __init__(self):
method __len__ (line 116) | def __len__(self):
method __repr__ (line 119) | def __repr__(self):
method __str__ (line 122) | def __str__(self):
method reset (line 125) | def reset(self):
method register (line 131) | def register(self, key: str, etl: ETLStructure):
method _update_status (line 175) | def _update_status(self, key: str):
method search (line 185) | def search(self, category: str = None, sub_category: str = None):
method _convert_to_report_format (line 229) | def _convert_to_report_format(
method get (line 281) | def get(self, key: str) -> ETLStructure:
method get_all (line 315) | def get_all(self):
class ETLAutoRegistry (line 325) | class ETLAutoRegistry(abc.ABCMeta, type):
method __new__ (line 326) | def __new__(cls, name, bases, attrs):
class BaseETL (line 349) | class BaseETL(ETLStructure, metaclass=ETLAutoRegistry):
method run (line 366) | def run(self, data: Union[RDD, DataFrame], *args, **kwargs):
method __call__ (line 372) | def __call__(self, *args, **kwargs):
function add_self (line 379) | def add_self(func):
function register_etl (line 393) | def register_etl(func):
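The registry.py listing above outlines a decorator-plus-singleton pattern: `ETLRegistry` is a singleton (note the `__new__` override), `register_etl` registers plain functions, and `search`/`get` look entries up by key. A minimal sketch of that pattern follows; the triple-underscore key format (`category___sub_category___name`) is an assumption inferred from the function names elsewhere in this index, and the real classes carry more structure (`ETLStructure`, the `ETLAutoRegistry` metaclass) than shown here.

```python
class ETLRegistry:
    """Singleton registry mapping ETL keys to callables (sketch only)."""
    _instance = None

    def __new__(cls):
        # Every construction returns the same instance (singleton pattern)
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._registry = {}
        return cls._instance

    def register(self, key, etl):
        # Reject duplicate keys so two ETL blocks cannot shadow each other
        if key in self._registry:
            raise KeyError(f"ETL '{key}' is already registered")
        self._registry[key] = etl

    def get(self, key):
        return self._registry[key]

    def search(self, category=None):
        # Keys are assumed to look like "category___sub_category___name"
        return [k for k in self._registry
                if category is None or k.split("___")[0] == category]


def register_etl(func):
    """Decorator: auto-register a function under its own name."""
    ETLRegistry().register(func.__name__, func)
    return func


@register_etl
def cleaning___char___lowercase(data):
    # Toy ETL step: lowercase every row of a list-like dataset
    return [row.lower() for row in data]
```

The test-helper functions in `dataverse/tests/` (e.g. `helper___test___generate_accent`) appear to rely on exactly this mechanism: decorating a local function makes it addressable by name from a pipeline config.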
FILE: dataverse/etl/utils/log.py
function utils___log___count (line 15) | def utils___log___count(
FILE: dataverse/etl/utils/sampling.py
function utils___sampling___random (line 17) | def utils___sampling___random(
FILE: dataverse/etl/utils/statistics.py
function utils___statistics___korean_nouns (line 16) | def utils___statistics___korean_nouns(
FILE: dataverse/tests/conftest.py
function set_test_mode_env (line 10) | def set_test_mode_env():
FILE: dataverse/tests/test_cleaning_accent.py
function helper___test___generate_accent (line 7) | def helper___test___generate_accent(spark, *args, **kwargs):
function test_cleaning___accent____remove (line 13) | def test_cleaning___accent____remove():
FILE: dataverse/tests/test_cleaning_char.py
function helper___test___generate_whitespace (line 17) | def helper___test___generate_whitespace(
function test_cleaning___char___normalize_whitespace (line 34) | def test_cleaning___char___normalize_whitespace():
function helper___test___generate_unprintable (line 68) | def helper___test___generate_unprintable(
function test_cleaning___char___remove_unprintable (line 92) | def test_cleaning___char___remove_unprintable():
FILE: dataverse/tests/test_cleaning_document.py
function fake_data_rdd (line 13) | def fake_data_rdd():
function test_cleaning___document___split_by_word (line 17) | def test_cleaning___document___split_by_word():
FILE: dataverse/tests/test_cleaning_html.py
function helper___test___generate_html (line 14) | def helper___test___generate_html(
function test_cleaning___html___extract_plain_text (line 61) | def test_cleaning___html___extract_plain_text():
function test_cleaning___html___extract_plain_text_trafilatura (line 86) | def test_cleaning___html___extract_plain_text_trafilatura():
FILE: dataverse/tests/test_cleaning_korean.py
function helper___test___generate_korean (line 14) | def helper___test___generate_korean(
function test_cleaning___korean___filter_by_ratio (line 105) | def test_cleaning___korean___filter_by_ratio():
function test_cleaning___korean___filter_by_ratio_chars (line 139) | def test_cleaning___korean___filter_by_ratio_chars():
function helper___test___generate_korean_emoticon (line 175) | def helper___test___generate_korean_emoticon(spark, *args, **kwargs):
function test_cleaning___korean___reduce_emoticon (line 183) | def test_cleaning___korean___reduce_emoticon():
FILE: dataverse/tests/test_cleaning_length.py
function helper___test___generate_data_for_test_length (line 12) | def helper___test___generate_data_for_test_length(spark, n=10, faker_see...
function test_cleaning___length___char_len_filter (line 26) | def test_cleaning___length___char_len_filter():
function test_cleaning___length___word_len_filter (line 93) | def test_cleaning___length___word_len_filter():
FILE: dataverse/tests/test_cleaning_number.py
function helper___test___generate_number (line 8) | def helper___test___generate_number(spark, *args, **kwargs):
function test_cleaning___number___normalize (line 21) | def test_cleaning___number___normalize():
FILE: dataverse/tests/test_cleaning_table.py
function helper___test___generate_table (line 8) | def helper___test___generate_table(spark, *args, **kwargs):
function test_cleaning___table___merge_col_vertical (line 14) | def test_cleaning___table___merge_col_vertical():
FILE: dataverse/tests/test_cleaning_unicode.py
function helper___test___generate_unicode_data (line 8) | def helper___test___generate_unicode_data(spark, *args, **kwargs):
function helper___test___generate_expected_unicode_data (line 21) | def helper___test___generate_expected_unicode_data(spark, type="remove"):
function test_cleaning___unicode___remove_punct (line 56) | def test_cleaning___unicode___remove_punct():
function test_cleaning___unicode___replace_punct (line 77) | def test_cleaning___unicode___replace_punct():
function test_cleaning___unicode___normalize (line 98) | def test_cleaning___unicode___normalize():
FILE: dataverse/tests/test_deduplication_common_crawl.py
function helper___test___generate_exact_line (line 9) | def helper___test___generate_exact_line(spark, *args, **kwagrs):
function test_deduplication___common_crawl___exact_line (line 18) | def test_deduplication___common_crawl___exact_line():
FILE: dataverse/tests/test_deduplication_exact.py
function helper___test___generate_duplicated_data (line 8) | def helper___test___generate_duplicated_data(spark, *args, **kwargs):
function test_deduplication___exact_column (line 15) | def test_deduplication___exact_column():
FILE: dataverse/tests/test_deduplication_polyglot.py
function helper___test___create_data_for_polyglot_minhash (line 8) | def helper___test___create_data_for_polyglot_minhash(spark, *args, **kwa...
function test_deduplication___polyglot___minhash (line 18) | def test_deduplication___polyglot___minhash():
FILE: dataverse/tests/test_pii_card.py
function helper___test___create_data_for_pii_card (line 12) | def helper___test___create_data_for_pii_card(spark, *args, **kwargs):
function test_pii___card___replace (line 16) | def test_pii___card___replace():
FILE: dataverse/tests/test_pii_nin.py
function helper___test___create_data_for_korean_rnn (line 12) | def helper___test___create_data_for_korean_rnn(spark, *args, **kwargs):
function test_pii___nin___replace_korean_rnns (line 16) | def test_pii___nin___replace_korean_rnns():
FILE: dataverse/utils/analyze/pip.py
function pip_get_package_path (line 5) | def pip_get_package_path(package_name):
FILE: dataverse/utils/analyze/python.py
function python_is_script_executable (line 4) | def python_is_script_executable(file_path, verbose=False):
FILE: dataverse/utils/api/aws.py
function aws_check_credentials (line 56) | def aws_check_credentials(verbose=True):
class AWSClient (line 72) | class AWSClient:
method __new__ (line 79) | def __new__(cls):
method __init__ (line 84) | def __init__(self):
method __str__ (line 101) | def __str__(self) -> str:
method __repr__ (line 104) | def __repr__(self) -> str:
function aws_get_state (line 120) | def aws_get_state():
function aws_set_state (line 139) | def aws_set_state(state):
function aws_ec2_instance_at_az (line 150) | def aws_ec2_instance_at_az(az):
function aws_ec2_instance_info (line 169) | def aws_ec2_instance_info(instance):
function aws_ec2_all_instance_info (line 179) | def aws_ec2_all_instance_info():
function aws_ec2_get_price (line 204) | def aws_ec2_get_price(instance_type):
function aws_ssm_run_commands (line 217) | def aws_ssm_run_commands(instance_ids, commands, verbose=True, return_ou...
class EMRManager (line 282) | class EMRManager:
method launch (line 286) | def launch(self, config):
method _role_setup (line 326) | def _role_setup(self, config):
method _instance_profile_setup (line 405) | def _instance_profile_setup(self, config):
method _vpc_setup (line 423) | def _vpc_setup(self, config):
method _set_default_instance (line 503) | def _set_default_instance(
method _emr_cluster_create (line 561) | def _emr_cluster_create(self, config):
method run (line 660) | def run(self, config, verbose=False):
method _setup (line 691) | def _setup(self, config, verbose=False):
method _get_working_dir (line 724) | def _get_working_dir(self, config):
method _upload_config (line 750) | def _upload_config(self, config):
method _upload_source_code (line 759) | def _upload_source_code(self, config):
method _upload_dependencies (line 783) | def _upload_dependencies(self, config, package_name="dataverse"):
method _upload_dynamic_etl_files (line 805) | def _upload_dynamic_etl_files(self, config):
method _move_s3_to_ec2 (line 857) | def _move_s3_to_ec2(self, config, verbose=False):
method _get_pip_package_path (line 880) | def _get_pip_package_path(self, config, verbose=False):
method _setup_aws (line 900) | def _setup_aws(self, config, verbose=False):
method _setup_dependencies (line 914) | def _setup_dependencies(self, config, verbose=False):
method _setup_source_code (line 933) | def _setup_source_code(self, config, verbose=False):
method wait (line 957) | def wait(self, config, step_id, verbose=True):
method terminate (line 981) | def terminate(self, config):
method _clean (line 1011) | def _clean(self):
method _clean_stopped_emr (line 1020) | def _clean_stopped_emr(self):
method _clean_unused_vpc (line 1043) | def _clean_unused_vpc(self):
method _clean_unused_iam_role (line 1067) | def _clean_unused_iam_role(self):
method _clean_unused_iam_instance_profile (line 1094) | def _clean_unused_iam_instance_profile(self):
method terminate_by_id (line 1118) | def terminate_by_id(self, emr_id):
function aws_iam_role_create (line 1137) | def aws_iam_role_create(
function aws_iam_role_delete (line 1182) | def aws_iam_role_delete(role_name):
function aws_iam_instance_profile_create (line 1206) | def aws_iam_instance_profile_create(instance_profile_name, role_name):
function aws_iam_instance_profile_delete (line 1240) | def aws_iam_instance_profile_delete(instance_profile_name):
function aws_iam_remove_all_instance_profile (line 1259) | def aws_iam_remove_all_instance_profile():
function aws_vpc_create (line 1272) | def aws_vpc_create(cidr_block=None, tag_name='Dataverse-Temporary-VPC'):
function aws_vpc_delete (line 1322) | def aws_vpc_delete(vpc_id):
function aws_subnet_create (line 1390) | def aws_subnet_create(vpc_id, cird_block=None, tag_name='Dataverse-Tempo...
function aws_subnet_delete (line 1420) | def aws_subnet_delete(vpc_id, subnet_id):
function aws_subnet_az (line 1435) | def aws_subnet_az(subnet_id):
function aws_emr_security_group_create (line 1444) | def aws_emr_security_group_create(
function aws_security_group_delete (line 1495) | def aws_security_group_delete(vpc_id, security_group_id):
function aws_security_group_remove_dependency (line 1510) | def aws_security_group_remove_dependency(security_group_id):
function aws_gateway_create (line 1533) | def aws_gateway_create(vpc_id, tag_name='Dataverse-Gateway'):
function aws_gateway_delete (line 1566) | def aws_gateway_delete(vpc_id, gateway_id):
function aws_elastic_ip_allocate (line 1585) | def aws_elastic_ip_allocate(vpc_id, tag_name='Dataverse-Elastic-IP'):
function aws_elastic_ip_release (line 1615) | def aws_elastic_ip_release(vpc_id, elastic_ip_id):
function aws_nat_gateway_create (line 1637) | def aws_nat_gateway_create(
function aws_nat_gateway_delete (line 1679) | def aws_nat_gateway_delete(vpc_id, nat_gateway_id):
function aws_route_table_create (line 1700) | def aws_route_table_create(
function aws_route_table_delete (line 1743) | def aws_route_table_delete(vpc_id, route_table_id):
function aws_route_table_asscociate_subnet (line 1757) | def aws_route_table_asscociate_subnet(subnet_id, route_table_id):
function aws_s3_path_parse (line 1761) | def aws_s3_path_parse(path):
function aws_s3_create_bucket (line 1774) | def aws_s3_create_bucket(bucket):
function aws_s3_delete_bucket (line 1787) | def aws_s3_delete_bucket(bucket):
function aws_s3_read (line 1796) | def aws_s3_read(bucket, key):
function aws_s3_download (line 1810) | def aws_s3_download(bucket, key, local_path):
function aws_s3_upload (line 1843) | def aws_s3_upload(bucket, key, local_path):
function aws_s3_write (line 1867) | def aws_s3_write(bucket, key, obj):
function aws_s3_delete (line 1879) | def aws_s3_delete(bucket, key):
function aws_s3_list_buckets (line 1906) | def aws_s3_list_buckets():
function aws_s3_ls (line 1917) | def aws_s3_ls(query=None):
function aws_s3_get_object_type (line 1997) | def aws_s3_get_object_type(bucket, key):
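aws.py groups its boto3 wrappers behind a singleton `AWSClient` plus flat helper functions, and `aws_s3_path_parse` presumably splits an `s3://` URI into a bucket and key for the other `aws_s3_*` helpers. A hypothetical re-implementation of that split is sketched below; the real function's edge-case handling may differ.

```python
def aws_s3_path_parse(path):
    """Split an S3 URI into (bucket, key).

    Hypothetical sketch of the helper listed in aws.py: strips an
    optional "s3://" scheme, then splits on the first "/".
    """
    prefix = "s3://"
    if path.startswith(prefix):
        path = path[len(prefix):]
    # partition splits on the first "/" only, so keys may contain slashes
    bucket, _, key = path.partition("/")
    if not bucket:
        raise ValueError(f"no bucket in S3 path: {path!r}")
    return bucket, key
```

Centralizing this parse lets every download/upload/delete helper accept either a full `s3://bucket/key` URI or a bare `bucket/key` string.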
FILE: dataverse/utils/format/huggingface.py
function load_huggingface_dataset (line 9) | def load_huggingface_dataset(name_or_path, split=None, from_disk=False):
function huggingface2parquet (line 45) | def huggingface2parquet(
FILE: dataverse/utils/format/ufl.py
function get_uuidv1 (line 8) | def get_uuidv1():
function get_uuidv4 (line 11) | def get_uuidv4():
FILE: dataverse/utils/setting/system.py
class SystemSetting (line 20) | class SystemSetting:
method __new__ (line 50) | def __new__(cls):
method __init__ (line 55) | def __init__(self):
method _get_aws_bucket (line 65) | def _get_aws_bucket(self, verbose=True):
method default_setting (line 104) | def default_setting(self):
method update_by_env (line 145) | def update_by_env(self):
method check_naming_convention (line 154) | def check_naming_convention(self, key):
method get (line 189) | def get(self, key):
method set (line 196) | def set(self, key, value):
method __getattr__ (line 203) | def __getattr__(self, key):
method __setattr__ (line 212) | def __setattr__(self, key, value):
method __getitem__ (line 222) | def __getitem__(self, key):
method __setitem__ (line 225) | def __setitem__(self, key, value):
method delete (line 228) | def delete(self, key):
method list (line 236) | def list(self):
method __repr__ (line 242) | def __repr__(self):
method __str__ (line 245) | def __str__(self):
FILE: dataverse/utils/setting/user.py
class UserSetting (line 13) | class UserSetting:
method __new__ (line 42) | def __new__(cls):
method __init__ (line 47) | def __init__(self):
method reset (line 64) | def reset(self):
method sync_file (line 71) | def sync_file(self):
method sync_class (line 78) | def sync_class(self):
method load (line 85) | def load(self, path):
method check_naming_convention (line 99) | def check_naming_convention(self, key):
method get (line 135) | def get(self, key):
method set (line 143) | def set(self, key, value):
method __getattr__ (line 151) | def __getattr__(self, key):
method __setattr__ (line 161) | def __setattr__(self, key, value):
method __getitem__ (line 172) | def __getitem__(self, key):
method __setitem__ (line 175) | def __setitem__(self, key, value):
method delete (line 178) | def delete(self, key):
method list (line 187) | def list(self):
method __repr__ (line 194) | def __repr__(self):
method __str__ (line 198) | def __str__(self):
FILE: docs/source/conf.py
function process_signature (line 78) | def process_signature(
function skip_undoc_members (line 96) | def skip_undoc_members(app, what, name, obj, skip, options):
function setup (line 102) | def setup(app):
FILE: setup.py
function get_requirements (line 9) | def get_requirements():
function get_extras_require (line 15) | def get_extras_require():
Condensed preview — 148 files, each showing path, character count, and a content snippet (full structured content is 518K chars).
[
{
"path": ".github/ISSUE_TEMPLATE/1-bug-report.yml",
"chars": 1200,
"preview": "name: \"🐛 Bug Report\"\ndescription: Create a new ticket for a bug.\ntitle: \"🐛 [BUG] - <title>\"\nlabels: [\n \"bug\"\n]\n\nbody:\n "
},
{
"path": ".github/ISSUE_TEMPLATE/2-feature-request.yml",
"chars": 790,
"preview": "name: \"🚀 Feature Request\"\ndescription: Suggesting new desired feature and enhancement of existing feature\ntitle: \"🚀 [REQ"
},
{
"path": ".github/ISSUE_TEMPLATE/3-documentation-improve.yml",
"chars": 1302,
"preview": "name: \"📝 Documentation Improvement\"\ndescription: Report wrong or missing documentation. You can suggest new document or "
},
{
"path": ".github/ISSUE_TEMPLATE/config.yml",
"chars": 26,
"preview": "blank_issues_enabled: true"
},
{
"path": ".github/pull_request_template.md",
"chars": 533,
"preview": "## PR Checklist\nPlease check if your PR fulfills the following requirements:\n\n- [ ] The commit message follows _datavers"
},
{
"path": ".gitignore",
"chars": 3174,
"preview": "# forbidden\n.env\nreference/\ncommon_crawl/\nnotebook/\n.cache/\nsample/\n\n# open-source \ncc_net/\ndps/\n\n# Byte-compiled / opti"
},
{
"path": ".pre-commit-config.yaml",
"chars": 1400,
"preview": "repos:\n - repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v3.2.0\n hooks:\n # - id: trail"
},
{
"path": ".readthedocs.yaml",
"chars": 791,
"preview": "# .readthedocs.yml\n# Read the Docs configuration file\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html fo"
},
{
"path": "LICENSE",
"chars": 11357,
"preview": " Apache License\n Version 2.0, January 2004\n "
},
{
"path": "Makefile",
"chars": 799,
"preview": "\n.PHONY: aws_s3 pyspark java \n\naws_s3:\n\t@test -d $$SPARK_HOME/jars || mkdir -p $$SPARK_HOME/jars\n\t@test -f $$SPARK_HOME/"
},
{
"path": "README.md",
"chars": 11080,
"preview": "<div align=\"center\">\n\n<br>\n<picture>\n <source media=\"(prefers-color-scheme: dark)\" srcset=\"docs/images/dataverse_logo-w"
},
{
"path": "contribution/CONTRIBUTING.md",
"chars": 10951,
"preview": "# __Contribution Guidelines__\nWelcome to _Dataverse_! We warmly welcome any kind of contribution 😊✨. </br>\nThis page pro"
},
{
"path": "dataverse/README.md",
"chars": 282,
"preview": "# Dataverse\n> The Universe of Data\n\n\n## 🌌 Config\n> Config for the Dataverse\n\n## 🌌 API\n> Interface of Dataverse for exter"
},
{
"path": "dataverse/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/api/README.md",
"chars": 78,
"preview": "# API (Application Programming Interface)\n> Interface with ease and efficiency"
},
{
"path": "dataverse/api/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/api/cli.py",
"chars": 404,
"preview": "\n\"\"\"\nmain entry point for the dataverse CLI tool\n\"\"\"\n\nfrom dataverse.utils.setting import SystemSetting\n\n\ndef main():\n "
},
{
"path": "dataverse/api/emr.py",
"chars": 1350,
"preview": "\n\"\"\"\nAPI to use AWS EMR with spark-submit\n\"\"\"\n\nimport os\nimport argparse\nimport importlib\n\nfrom dataverse.etl import ETL"
},
{
"path": "dataverse/config/README.md",
"chars": 2458,
"preview": "# Configuration\n> This directory contains configuration files for the Dataverse application\n\n\n## 🌌 How to use\n\n### 🌠 Loa"
},
{
"path": "dataverse/config/__init__.py",
"chars": 30,
"preview": "\nfrom .interface import Config"
},
{
"path": "dataverse/config/interface.py",
"chars": 8299,
"preview": "\"\"\"\nInterface to check & load the configurations for installation environment\n\nawesome_config = Config.load(\"/path/to/du"
},
{
"path": "dataverse/etl/README.md",
"chars": 10536,
"preview": "# ETL (Extract, Transform, Load)\n> Dataverse ETL is \"Block-based coding powered by Spark\"\n\n- Each block is called `ETL p"
},
{
"path": "dataverse/etl/__init__.py",
"chars": 133,
"preview": "\nfrom .registry import ETLRegistry\nfrom .registry import register_etl\nfrom .registry import BaseETL\nfrom .pipeline impor"
},
{
"path": "dataverse/etl/__sample/README.md",
"chars": 29,
"preview": "# Sample\n> This is a showcase"
},
{
"path": "dataverse/etl/__sample/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/__sample/ducky.py",
"chars": 374,
"preview": "\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import register_etl\n\nfrom typing imp"
},
{
"path": "dataverse/etl/__sample/github.py",
"chars": 1437,
"preview": "\n\nfrom pyspark.rdd import RDD\nfrom pyspark.sql import DataFrame\n\nfrom dataverse.etl import BaseETL\nfrom dataverse.etl im"
},
{
"path": "dataverse/etl/bias/README.md",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/bias/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/cleaning/README.md",
"chars": 485,
"preview": "# Cleaning\n> Data normalization, removing noise, and other data cleaning tasks.\n\n\n## 🌌 Naming Convention\n> This is a str"
},
{
"path": "dataverse/etl/cleaning/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/cleaning/char.py",
"chars": 3756,
"preview": "\"\"\"\nA collection of modules for cleaning data at the character level.\nFor example: whitespace, accent characters, and un"
},
{
"path": "dataverse/etl/cleaning/document.py",
"chars": 2656,
"preview": "\"\"\"\nA collection of modules for cleaning data at the document level.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApach"
},
{
"path": "dataverse/etl/cleaning/html.py",
"chars": 2421,
"preview": "\"\"\"\nA collection of modules for cleaning data includes html.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 li"
},
{
"path": "dataverse/etl/cleaning/korean.py",
"chars": 9547,
"preview": "\"\"\"\nThis is only for Korean text datas.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport re\n"
},
{
"path": "dataverse/etl/cleaning/length.py",
"chars": 2998,
"preview": "\"\"\"\nFiltering based on length.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing import "
},
{
"path": "dataverse/etl/cleaning/number.py",
"chars": 1816,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport re\nfrom typing import Union\n\nfrom pyspar"
},
{
"path": "dataverse/etl/cleaning/table.py",
"chars": 2318,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing import Union\n\nfrom pyspark.rdd impo"
},
{
"path": "dataverse/etl/cleaning/unicode.py",
"chars": 3350,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport re\nimport unicodedata\nfrom typing import"
},
{
"path": "dataverse/etl/data_ingestion/README.md",
"chars": 5915,
"preview": "# Data Ingestion\n> Ingest various data sources into the desired format\n\n**Recommendation for Data Ingestion**\n> Use Data"
},
{
"path": "dataverse/etl/data_ingestion/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/data_ingestion/arrow.py",
"chars": 5615,
"preview": "\"\"\"\nLoad Arrow.\nSupport direct loading of arrow saved huggingface dataset to spark dataframe.\n\nCopyright (c) 2024-presen"
},
{
"path": "dataverse/etl/data_ingestion/common_crawl.py",
"chars": 15117,
"preview": "\"\"\"\nLoad Common Crawl data from dump-id & segment files\n\nCode is from facebookresearch/cc_net with some modifications\nht"
},
{
"path": "dataverse/etl/data_ingestion/csv.py",
"chars": 1051,
"preview": "\"\"\"\nLoad CSV data\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\nfrom typing import List, Union\n\nf"
},
{
"path": "dataverse/etl/data_ingestion/cultura_x.py",
"chars": 991,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport json\n\nfrom pyspark.rdd import RDD\n\nfrom "
},
{
"path": "dataverse/etl/data_ingestion/huggingface.py",
"chars": 1590,
"preview": "\"\"\"\nLoad Huggingface data\n\nThis is used just to load huggingface dataset without any refomatting\n\nCopyright (c) 2024-pre"
},
{
"path": "dataverse/etl/data_ingestion/parquet.py",
"chars": 836,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\nfrom typing import List, Union\n\nfrom pyspark.rdd"
},
{
"path": "dataverse/etl/data_ingestion/red_pajama.py",
"chars": 3535,
"preview": "\"\"\"\nSupported datasets:\nhttps://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T\nhttps://huggingface.co/datase"
},
{
"path": "dataverse/etl/data_ingestion/slim_pajama.py",
"chars": 1617,
"preview": "\"\"\"\nSupported datasets:\nhttps://huggingface.co/datasets/cerebras/SlimPajama-627B\n\nCopyright (c) 2024-present Upstage Co."
},
{
"path": "dataverse/etl/data_ingestion/test.py",
"chars": 1506,
"preview": "\"\"\"\nspecial purpose to create fake data for testing or debugging\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2."
},
{
"path": "dataverse/etl/data_save/README.md",
"chars": 216,
"preview": "# Data Save\n> How to save data to the destination? In other words, how to save the data to the destination?\n\n\n## 🌌 Namin"
},
{
"path": "dataverse/etl/data_save/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/data_save/aws.py",
"chars": 139,
"preview": "\"\"\"\nTODO: Data saving to AWS S3\n\nThis is not implemented yet.\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 l"
},
{
"path": "dataverse/etl/data_save/huggingface.py",
"chars": 2524,
"preview": "\"\"\"\nData saving to Huggingface Datasets\n\nHuggingface support spark natively!\nhttps://huggingface.co/docs/datasets/use_wi"
},
{
"path": "dataverse/etl/data_save/parquet.py",
"chars": 1249,
"preview": "\"\"\"\nData saving to Parquets\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport os\nfrom typing "
},
{
"path": "dataverse/etl/decontamination/README.md",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/decontamination/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/deduplication/README.md",
"chars": 883,
"preview": "# Deduplication\n> Deduplication is the process of removing duplicate records from a dataset.\n\nNormally this is clustered"
},
{
"path": "dataverse/etl/deduplication/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/deduplication/common_crawl.py",
"chars": 2586,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport functools\nfrom typing import Union\n\nfrom"
},
{
"path": "dataverse/etl/deduplication/exact.py",
"chars": 903,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\n\nfrom typing import List, Union\n\nfrom pyspark.r"
},
{
"path": "dataverse/etl/deduplication/minhash.py",
"chars": 10947,
"preview": "\"\"\"\nCode is from ChenghaoMou/text-dedup\nhttps://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py\n"
},
{
"path": "dataverse/etl/deduplication/polyglot.py",
"chars": 5331,
"preview": "\"\"\"\nCode is from EleutherAI/dps\nhttps://github.com/EleutherAI/dps/blob/master/dps/spark/jobs/dedup_job.py\n\nThis is a mig"
},
{
"path": "dataverse/etl/pii/README.md",
"chars": 460,
"preview": "# PII (Personally Identifiable Information)\n> Replacing, Removing, and Anonymizing PII\n\n\n## 🌌 Naming Convention\n> This i"
},
{
"path": "dataverse/etl/pii/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/pii/card.py",
"chars": 3099,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nimport random\nimport re\nfrom typing import Unio"
},
{
"path": "dataverse/etl/pii/nin.py",
"chars": 3392,
"preview": "\"\"\"\nNIN (National Identification Number)\n=====================================\nA national identification number, nationa"
},
{
"path": "dataverse/etl/pipeline.py",
"chars": 15915,
"preview": "\"\"\"\nETL Interface\n----------------------\nuser will be interacting with this interface\n\nCopyright (c) 2024-present Upstag"
},
{
"path": "dataverse/etl/quality/README.md",
"chars": 9,
"preview": "# Quality"
},
{
"path": "dataverse/etl/quality/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/quality/language.py",
"chars": 6078,
"preview": "\"\"\"\nlanguage filtering from Common Crawl\n\nThis is a migration of the common crawl code to Dataverse.\nsome part of code i"
},
{
"path": "dataverse/etl/registry.py",
"chars": 14324,
"preview": "\"\"\"\nBase class to support the registration of the ETL classes\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 l"
},
{
"path": "dataverse/etl/toxicity/README.md",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/toxicity/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/utils/README.md",
"chars": 210,
"preview": "# Utils\n> Utilities for the ETL process. Not really part of the ETL process but useful for the ETL process.\n\nThis could "
},
{
"path": "dataverse/etl/utils/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/etl/utils/log.py",
"chars": 890,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing import Union\n\nfrom pyspark.rdd impo"
},
{
"path": "dataverse/etl/utils/sampling.py",
"chars": 1375,
"preview": "\"\"\"\nSampling module for data ingestion\n\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom typing"
},
{
"path": "dataverse/etl/utils/statistics.py",
"chars": 2103,
"preview": "\"\"\"\nCopyright (c) 2024-present Upstage Co., Ltd.\nApache-2.0 license\n\"\"\"\n\nfrom operator import add\nfrom typing import Uni"
},
{
"path": "dataverse/lab/README.md",
"chars": 123,
"preview": "# Lab \n> Space Laboratory for data analysis\n\nThis will be further supported.\n- Data Exploration\n- Data Visualization \n- "
},
{
"path": "dataverse/lab/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/tests/conftest.py",
"chars": 470,
"preview": "import os\nimport sys\n\nimport pytest\n\nsys.path.append(\"./\") # to find etl folder as module\n\n\n@pytest.fixture(scope=\"sess"
},
{
"path": "dataverse/tests/test_cleaning_accent.py",
"chars": 1113,
"preview": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline, register_etl\n\n\n@register_etl\ndef helper___test__"
},
{
"path": "dataverse/tests/test_cleaning_char.py",
"chars": 4687,
"preview": "import random\nimport re\nfrom typing import Union\n\nfrom omegaconf import OmegaConf\nfrom pyspark.rdd import RDD\nfrom pyspa"
},
{
"path": "dataverse/tests/test_cleaning_document.py",
"chars": 2206,
"preview": "import pytest\nfrom faker import Faker\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\n\nfaker_seed"
},
{
"path": "dataverse/tests/test_cleaning_html.py",
"chars": 3016,
"preview": "import random\n\nfrom faker import Faker\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom datav"
},
{
"path": "dataverse/tests/test_cleaning_korean.py",
"chars": 6485,
"preview": "import random\n\nfrom faker import Faker\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom datav"
},
{
"path": "dataverse/tests/test_cleaning_length.py",
"chars": 5392,
"preview": "import pytest\nfrom faker import Faker\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom datave"
},
{
"path": "dataverse/tests/test_cleaning_number.py",
"chars": 1460,
"preview": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n"
},
{
"path": "dataverse/tests/test_cleaning_table.py",
"chars": 1304,
"preview": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n"
},
{
"path": "dataverse/tests/test_cleaning_unicode.py",
"chars": 3711,
"preview": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n"
},
{
"path": "dataverse/tests/test_deduplication_common_crawl.py",
"chars": 1159,
"preview": "from omegaconf import OmegaConf\nfrom pyspark.sql import Row\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.re"
},
{
"path": "dataverse/tests/test_deduplication_exact.py",
"chars": 1845,
"preview": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n"
},
{
"path": "dataverse/tests/test_deduplication_minhash.py",
"chars": 119,
"preview": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n"
},
{
"path": "dataverse/tests/test_deduplication_polyglot.py",
"chars": 1492,
"preview": "from omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import register_etl\n\n"
},
{
"path": "dataverse/tests/test_pii_card.py",
"chars": 3067,
"preview": "import re\n\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import reg"
},
{
"path": "dataverse/tests/test_pii_nin.py",
"chars": 2130,
"preview": "import re\n\nfrom omegaconf import OmegaConf\n\nfrom dataverse.etl import ETLPipeline\nfrom dataverse.etl.registry import reg"
},
{
"path": "dataverse/utils/README.md",
"chars": 58,
"preview": "# Utils\n> Common utilities\n\n## API\n\n## Format\n\n## Setting\n"
},
{
"path": "dataverse/utils/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "dataverse/utils/analyze/README.md",
"chars": 190,
"preview": "# Analyze\n> gaining insight of whatever you want to know\n\n## Naming Convention\n- `<target>_<function_name>`\n - e.g. `"
},
{
"path": "dataverse/utils/analyze/__init__.py",
"chars": 86,
"preview": "\nfrom .python import python_is_script_executable\nfrom .pip import pip_get_package_path"
},
{
"path": "dataverse/utils/analyze/pip.py",
"chars": 296,
"preview": "\nimport pkg_resources\n\n\ndef pip_get_package_path(package_name):\n try:\n package = pkg_resources.get_distributio"
},
{
"path": "dataverse/utils/analyze/python.py",
"chars": 1036,
"preview": "\nimport ast\n\ndef python_is_script_executable(file_path, verbose=False):\n \"\"\"\n check if a python script is executab"
},
{
"path": "dataverse/utils/api/README.md",
"chars": 1187,
"preview": "# API\n> This is a collection of API wrapper utilities for external sources\n\n## 🥹 Use `original API` as much as you can\n>"
},
{
"path": "dataverse/utils/api/__init__.py",
"chars": 1603,
"preview": "\n# AWS\nfrom .aws import AWSClient \nfrom .aws import EMRManager\n\nfrom .aws import aws_check_credentials\nfrom .aws import "
},
{
"path": "dataverse/utils/api/aws.py",
"chars": 68144,
"preview": "\n\"\"\"\nUsage:\n\n```python\nfrom dataverse.utils.api import aws_s3_list_buckets\nfrom dataverse.utils.api import aws_s3_list\n\n"
},
{
"path": "dataverse/utils/format/README.md",
"chars": 256,
"preview": "# Format\n> ETL is backed by spark and `format` is a helpers to reformat data. It could be **collection of converters** c"
},
{
"path": "dataverse/utils/format/__init__.py",
"chars": 151,
"preview": "\nfrom .huggingface import huggingface2parquet\nfrom .huggingface import load_huggingface_dataset\nfrom .ufl import get_uui"
},
{
"path": "dataverse/utils/format/huggingface.py",
"chars": 3319,
"preview": "import os\nimport datasets\nfrom pathlib import Path\nfrom omegaconf import ListConfig\n\nfrom dataverse.utils.setting import"
},
{
"path": "dataverse/utils/format/ufl.py",
"chars": 144,
"preview": "\n\"\"\"\nUFL (Upstage Format for LLM)\n\"\"\"\n\nimport uuid\n\ndef get_uuidv1():\n return uuid.uuid1().hex\n\ndef get_uuidv4():\n "
},
{
"path": "dataverse/utils/setting/README.md",
"chars": 2989,
"preview": "\n# Setting\n> Setting includes Environment Variables, User Secrets\n\n## System Settings\n> The heart of the system. It cont"
},
{
"path": "dataverse/utils/setting/__init__.py",
"chars": 110,
"preview": "\nfrom dataverse.utils.setting.user import UserSetting\nfrom dataverse.utils.setting.system import SystemSetting"
},
{
"path": "dataverse/utils/setting/system.py",
"chars": 7283,
"preview": "\n\"\"\"\nInterface for system setting\n\"\"\"\n\nimport os\nimport re\nimport uuid\nimport json\nimport boto3\nimport pyspark\nfrom path"
},
{
"path": "dataverse/utils/setting/user.py",
"chars": 6019,
"preview": "\n\"\"\"\nInterface for user setting\n\"\"\"\n\nimport os\nimport json\nfrom pathlib import Path\n\nfrom dataverse.utils.setting.system"
},
{
"path": "docs/Makefile",
"chars": 926,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
},
{
"path": "docs/make.bat",
"chars": 804,
"preview": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sp"
},
{
"path": "docs/source/citation.rst",
"chars": 432,
"preview": "===================\nCitation\n===================\n\n\nIf you want to cite our *Dataverse* project, feel free to use the fol"
},
{
"path": "docs/source/conf.py",
"chars": 3311,
"preview": "# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common op"
},
{
"path": "docs/source/config/config.interface.rst",
"chars": 340,
"preview": "config\n========================\n\n.. automodule:: config.interface.Config\n :members:\n :undoc-members:\n :show-inh"
},
{
"path": "docs/source/etl/etl.bias.rst",
"chars": 215,
"preview": "etl.bias\n================\n\nReducing skewed or prejudiced data,\nwith a particular emphasis on data that reinforces stereo"
},
{
"path": "docs/source/etl/etl.cleaning.rst",
"chars": 2338,
"preview": "etl.cleaning\n====================\nRemoving irrelevant, redun-dant, or noisy information from the data,\nsuch as stop word"
},
{
"path": "docs/source/etl/etl.data_ingestion.rst",
"chars": 3064,
"preview": "etl.data\\_ingestion\n===========================\nFacilitating the loading of data from various sources\n(e.g., data in Hug"
},
{
"path": "docs/source/etl/etl.data_save.rst",
"chars": 910,
"preview": "etl.data\\_save\n======================\nPersisting the processed data into a preferred destination,\nsuch as a data lake or"
},
{
"path": "docs/source/etl/etl.decontamination.rst",
"chars": 211,
"preview": "etl.decontamination\n===========================\n\nIdentifying and removing contaminated data such as benchmark datasets.\n"
},
{
"path": "docs/source/etl/etl.deduplication.rst",
"chars": 1138,
"preview": "etl.deduplication\n=========================\nEliminating duplicated data on dataset by dataset basis or globally across m"
},
{
"path": "docs/source/etl/etl.pii.rst",
"chars": 504,
"preview": "etl.pii\n===============\n\nEnsuring the removal of sensitive information, such as personally identifiable data, from the d"
},
{
"path": "docs/source/etl/etl.pipeline.rst",
"chars": 592,
"preview": "etl.pipeline\n=====================\n\nETL Interface: user will be interacting with this interface\n\nCopyright (c) 2024-pres"
},
{
"path": "docs/source/etl/etl.quality.rst",
"chars": 363,
"preview": "etl.quality\n===================\n\nImproving the quality of data from the perspectives of accuracy, consistency, and relia"
},
{
"path": "docs/source/etl/etl.registry.rst",
"chars": 861,
"preview": "etl.registry\n=====================\nBase class to support the registration of the ETL classes\n\nCopyright (c) 2024-present"
},
{
"path": "docs/source/etl/etl.rst",
"chars": 245,
"preview": "etl\n===========\n\n.. toctree::\n :maxdepth: 1\n\n etl.bias\n etl.cleaning\n etl.data_ingestion\n etl.data_save\n etl"
},
{
"path": "docs/source/etl/etl.toxicity.rst",
"chars": 209,
"preview": "etl.toxicity\n====================\n\nIdentifying and eliminating harmful, offensive, or inappropriate content within the d"
},
{
"path": "docs/source/etl/etl.utils.rst",
"chars": 750,
"preview": "etl.utils\n=================\nProviding essential functionalities for data processing,\nincluding sampling, logging, and st"
},
{
"path": "docs/source/index.rst",
"chars": 2383,
"preview": ".. dataverse documentation master file, created by\n sphinx-quickstart on Thu Feb 29 19:54:35 2024.\n You can adapt th"
},
{
"path": "docs/source/installation.rst",
"chars": 686,
"preview": "===================================\nInstallation\n===================================\n\n\nDataverse can be installed using "
},
{
"path": "docs/source/quickstart.rst",
"chars": 3321,
"preview": "\n===================\nQuickstart\n===================\n\n\nVarious and more detailed tutorials are `here <https://github.com/"
},
{
"path": "docs/source/requirements.txt",
"chars": 200,
"preview": "sphinx\nsphinx-pdj-theme\nsphinx-rtd-theme\nrequests\nnumpy\npandas\nfasttext-wheel\nomegaconf\ndatasets\npyspark\nscipy\ntrafilatu"
},
{
"path": "examples/README.md",
"chars": 1902,
"preview": "# 🌍 Examples\n> This is a example collection for `dataverse`. We will talk about the basic usage of `dataverse`, knowhows"
},
{
"path": "examples/etl/ETL_01_how_to_run.ipynb",
"chars": 10892,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# ETL how to run?\\n\",\n \"> At her"
},
{
"path": "examples/etl/ETL_02_one_cycle.ipynb",
"chars": 5933,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# ETL one cycle\\n\",\n \"> Normally"
},
{
"path": "examples/etl/ETL_03_create_new_etl_process.ipynb",
"chars": 13894,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# ETL create new etl process\\n\",\n "
},
{
"path": "examples/etl/ETL_04_add_new_etl_process.ipynb",
"chars": 14590,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# ETL add new etl process\\n\",\n \""
},
{
"path": "examples/etl/ETL_05_test_etl_process.ipynb",
"chars": 12070,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# ETL test etl process\\n\",\n \"> w"
},
{
"path": "examples/etl/ETL_06_scaleout_with_EMR.ipynb",
"chars": 25021,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# ETL scaleout with EMR\\n\",\n \"> "
},
{
"path": "examples/etl/EX_use_common_crawl_data.ipynb",
"chars": 20163,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Use Common Crawl Data\\n\",\n \"> "
},
{
"path": "examples/etl/EX_use_pyspark_ui.ipynb",
"chars": 4410,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Use Pyspark UI\\n\",\n \"> you can"
},
{
"path": "examples/etl/README.md",
"chars": 36,
"preview": "\n# 🗺️ ETL (Extract, Transform, Load)"
},
{
"path": "requirements.txt",
"chars": 201,
"preview": "requests\nnumpy\npandas\nfasttext-wheel\nomegaconf\npyarrow==14.0.1\ndatasets\npyspark\nscipy\ntrafilatura\nhtml2text\nfaker\nawscli"
},
{
"path": "setup.py",
"chars": 1275,
"preview": "import os\n\nfrom setuptools import find_packages, setup\n\nbasedir = os.path.abspath(os.path.dirname(__file__))\nrequirement"
}
]
About this extraction
This document contains the full source code of the UpstageAI/dataverse GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction covers 148 files (464.3 KB, approximately 127.5k tokens) and includes a symbol index of 288 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract, built by Nikandr Surkov.
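The file listing above is a JSON array of manifest entries, each recording a file's `path`, its size in `chars`, and a truncated `preview` of its contents. A minimal sketch of how such a manifest could be consumed — the sample entries below are made up, chosen only to match the shape of the entries above:

```python
import json

# Hypothetical sample in the same shape as the manifest entries above:
# each entry records a file's path, its size in characters, and a preview.
manifest = json.loads("""
[
  {"path": "dataverse/utils/format/ufl.py", "chars": 144, "preview": "..."},
  {"path": "requirements.txt", "chars": 201, "preview": "requests\\nnumpy"},
  {"path": "setup.py", "chars": 1275, "preview": "import os"}
]
""")

# Total size in characters across all listed files.
total_chars = sum(entry["chars"] for entry in manifest)

# Group file paths by top-level directory (root-level files fall under ".").
by_top_dir = {}
for entry in manifest:
    top = entry["path"].split("/", 1)[0] if "/" in entry["path"] else "."
    by_top_dir.setdefault(top, []).append(entry["path"])

print(total_chars)         # 1620
print(sorted(by_top_dir))  # ['.', 'dataverse']
```

Summing `chars` per top-level directory in the same way gives a quick size breakdown of the repository without reading any file bodies.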