Repository: huggingface/datasets
Branch: main
Commit: 4a0c9c946f0d
Files: 331
Total size: 3.8 MB

Directory structure:
gitextract_njn6cbk0/

├── .dvc/
│   ├── .gitignore
│   ├── config
│   └── plots/
│       ├── confusion.json
│       ├── default.json
│       ├── scatter.json
│       └── smooth.json
├── .dvcignore
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug-report.yml
│   │   ├── config.yml
│   │   └── feature-request.yml
│   ├── conda/
│   │   ├── build.sh
│   │   └── meta.yaml
│   └── workflows/
│       ├── build_documentation.yml
│       ├── build_pr_documentation.yml
│       ├── ci.yml
│       ├── release-conda.yml
│       ├── self-assign.yaml
│       ├── trufflehog.yml
│       └── upload_pr_documentation.yml
├── .gitignore
├── .pre-commit-config.yaml
├── .zenodo.json
├── ADD_NEW_DATASET.md
├── AUTHORS
├── CITATION.cff
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── Makefile
├── README.md
├── SECURITY.md
├── benchmarks/
│   ├── benchmark_array_xd.py
│   ├── benchmark_getitem_100B.py
│   ├── benchmark_indices_mapping.py
│   ├── benchmark_iterating.py
│   ├── benchmark_map_filter.py
│   ├── format.py
│   ├── results/
│   │   ├── .gitkeep
│   │   ├── benchmark_array_xd.json
│   │   ├── benchmark_getitem_100B.json
│   │   ├── benchmark_indices_mapping.json
│   │   ├── benchmark_iterating.json
│   │   └── benchmark_map_filter.json
│   └── utils.py
├── docs/
│   ├── README.md
│   └── source/
│       ├── _config.py
│       ├── _redirects.yml
│       ├── _toctree.yml
│       ├── about_arrow.md
│       ├── about_cache.mdx
│       ├── about_dataset_features.mdx
│       ├── about_dataset_load.mdx
│       ├── about_map_batch.mdx
│       ├── about_mapstyle_vs_iterable.mdx
│       ├── access.mdx
│       ├── audio_dataset.mdx
│       ├── audio_load.mdx
│       ├── audio_process.mdx
│       ├── cache.mdx
│       ├── cli.mdx
│       ├── create_dataset.mdx
│       ├── dataset_card.mdx
│       ├── depth_estimation.mdx
│       ├── document_dataset.mdx
│       ├── document_load.mdx
│       ├── faiss_es.mdx
│       ├── filesystems.mdx
│       ├── how_to.md
│       ├── image_classification.mdx
│       ├── image_dataset.mdx
│       ├── image_load.mdx
│       ├── image_process.mdx
│       ├── index.mdx
│       ├── installation.md
│       ├── load_hub.mdx
│       ├── loading.mdx
│       ├── nifti_dataset.mdx
│       ├── nlp_load.mdx
│       ├── nlp_process.mdx
│       ├── object_detection.mdx
│       ├── package_reference/
│       │   ├── builder_classes.mdx
│       │   ├── loading_methods.mdx
│       │   ├── main_classes.mdx
│       │   ├── table_classes.mdx
│       │   └── utilities.mdx
│       ├── process.mdx
│       ├── quickstart.mdx
│       ├── repository_structure.mdx
│       ├── semantic_segmentation.mdx
│       ├── share.mdx
│       ├── stream.mdx
│       ├── tabular_load.mdx
│       ├── troubleshoot.mdx
│       ├── tutorial.md
│       ├── upload_dataset.mdx
│       ├── use_dataset.mdx
│       ├── use_with_jax.mdx
│       ├── use_with_numpy.mdx
│       ├── use_with_pandas.mdx
│       ├── use_with_polars.mdx
│       ├── use_with_pyarrow.mdx
│       ├── use_with_pytorch.mdx
│       ├── use_with_spark.mdx
│       ├── use_with_tensorflow.mdx
│       ├── video_dataset.mdx
│       └── video_load.mdx
├── notebooks/
│   └── README.md
├── pyproject.toml
├── setup.py
├── src/
│   └── datasets/
│       ├── __init__.py
│       ├── arrow_dataset.py
│       ├── arrow_reader.py
│       ├── arrow_writer.py
│       ├── builder.py
│       ├── combine.py
│       ├── commands/
│       │   ├── __init__.py
│       │   ├── datasets_cli.py
│       │   ├── delete_from_hub.py
│       │   ├── env.py
│       │   └── test.py
│       ├── config.py
│       ├── data_files.py
│       ├── dataset_dict.py
│       ├── distributed.py
│       ├── download/
│       │   ├── __init__.py
│       │   ├── download_config.py
│       │   ├── download_manager.py
│       │   └── streaming_download_manager.py
│       ├── exceptions.py
│       ├── features/
│       │   ├── __init__.py
│       │   ├── _torchcodec.py
│       │   ├── audio.py
│       │   ├── features.py
│       │   ├── image.py
│       │   ├── nifti.py
│       │   ├── pdf.py
│       │   ├── translation.py
│       │   └── video.py
│       ├── filesystems/
│       │   ├── __init__.py
│       │   └── compression.py
│       ├── fingerprint.py
│       ├── formatting/
│       │   ├── __init__.py
│       │   ├── formatting.py
│       │   ├── jax_formatter.py
│       │   ├── np_formatter.py
│       │   ├── polars_formatter.py
│       │   ├── tf_formatter.py
│       │   └── torch_formatter.py
│       ├── hub.py
│       ├── info.py
│       ├── inspect.py
│       ├── io/
│       │   ├── __init__.py
│       │   ├── abc.py
│       │   ├── csv.py
│       │   ├── generator.py
│       │   ├── json.py
│       │   ├── parquet.py
│       │   ├── spark.py
│       │   ├── sql.py
│       │   └── text.py
│       ├── iterable_dataset.py
│       ├── load.py
│       ├── naming.py
│       ├── packaged_modules/
│       │   ├── __init__.py
│       │   ├── arrow/
│       │   │   ├── __init__.py
│       │   │   └── arrow.py
│       │   ├── audiofolder/
│       │   │   ├── __init__.py
│       │   │   └── audiofolder.py
│       │   ├── cache/
│       │   │   ├── __init__.py
│       │   │   └── cache.py
│       │   ├── csv/
│       │   │   ├── __init__.py
│       │   │   └── csv.py
│       │   ├── eval/
│       │   │   ├── __init__.py
│       │   │   └── eval.py
│       │   ├── folder_based_builder/
│       │   │   ├── __init__.py
│       │   │   └── folder_based_builder.py
│       │   ├── generator/
│       │   │   ├── __init__.py
│       │   │   └── generator.py
│       │   ├── hdf5/
│       │   │   ├── __init__.py
│       │   │   └── hdf5.py
│       │   ├── imagefolder/
│       │   │   ├── __init__.py
│       │   │   └── imagefolder.py
│       │   ├── json/
│       │   │   ├── __init__.py
│       │   │   └── json.py
│       │   ├── lance/
│       │   │   ├── __init__.py
│       │   │   └── lance.py
│       │   ├── niftifolder/
│       │   │   ├── __init__.py
│       │   │   └── niftifolder.py
│       │   ├── pandas/
│       │   │   ├── __init__.py
│       │   │   └── pandas.py
│       │   ├── parquet/
│       │   │   ├── __init__.py
│       │   │   └── parquet.py
│       │   ├── pdffolder/
│       │   │   ├── __init__.py
│       │   │   └── pdffolder.py
│       │   ├── spark/
│       │   │   ├── __init__.py
│       │   │   └── spark.py
│       │   ├── sql/
│       │   │   ├── __init__.py
│       │   │   └── sql.py
│       │   ├── text/
│       │   │   ├── __init__.py
│       │   │   └── text.py
│       │   ├── videofolder/
│       │   │   ├── __init__.py
│       │   │   └── videofolder.py
│       │   ├── webdataset/
│       │   │   ├── __init__.py
│       │   │   ├── _tenbin.py
│       │   │   └── webdataset.py
│       │   └── xml/
│       │       ├── __init__.py
│       │       └── xml.py
│       ├── parallel/
│       │   ├── __init__.py
│       │   └── parallel.py
│       ├── search.py
│       ├── splits.py
│       ├── streaming.py
│       ├── table.py
│       └── utils/
│           ├── __init__.py
│           ├── _dataset_viewer.py
│           ├── _dill.py
│           ├── _filelock.py
│           ├── deprecation_utils.py
│           ├── doc_utils.py
│           ├── experimental.py
│           ├── extract.py
│           ├── file_utils.py
│           ├── filelock.py
│           ├── hub.py
│           ├── info_utils.py
│           ├── json.py
│           ├── logging.py
│           ├── metadata.py
│           ├── patching.py
│           ├── py_utils.py
│           ├── resources/
│           │   ├── __init__.py
│           │   ├── creators.json
│           │   ├── languages.json
│           │   ├── multilingualities.json
│           │   ├── readme_structure.yaml
│           │   └── size_categories.json
│           ├── sharding.py
│           ├── stratify.py
│           ├── tf_utils.py
│           ├── tqdm.py
│           ├── track.py
│           ├── typing.py
│           └── version.py
├── templates/
│   ├── README.md
│   └── README_guide.md
├── tests/
│   ├── __init__.py
│   ├── _test_patching.py
│   ├── commands/
│   │   ├── __init__.py
│   │   ├── conftest.py
│   │   └── test_test.py
│   ├── conftest.py
│   ├── distributed_scripts/
│   │   └── run_torch_distributed.py
│   ├── features/
│   │   ├── __init__.py
│   │   ├── data/
│   │   │   ├── test_audio_16000.pcm
│   │   │   ├── test_audio_48000.opus
│   │   │   └── test_nifti.nii
│   │   ├── test_array_xd.py
│   │   ├── test_audio.py
│   │   ├── test_features.py
│   │   ├── test_image.py
│   │   ├── test_nifti.py
│   │   ├── test_pdf.py
│   │   └── test_video.py
│   ├── fixtures/
│   │   ├── __init__.py
│   │   ├── files.py
│   │   ├── fsspec.py
│   │   └── hub.py
│   ├── io/
│   │   ├── __init__.py
│   │   ├── data/
│   │   │   ├── test_file.json.bz2
│   │   │   └── test_file.json.xz
│   │   ├── test_csv.py
│   │   ├── test_json.py
│   │   ├── test_parquet.py
│   │   ├── test_sql.py
│   │   └── test_text.py
│   ├── packaged_modules/
│   │   ├── __init__.py
│   │   ├── test_arrow.py
│   │   ├── test_audiofolder.py
│   │   ├── test_cache.py
│   │   ├── test_csv.py
│   │   ├── test_folder_based_builder.py
│   │   ├── test_hdf5.py
│   │   ├── test_imagefolder.py
│   │   ├── test_json.py
│   │   ├── test_lance.py
│   │   ├── test_pandas.py
│   │   ├── test_parquet.py
│   │   ├── test_spark.py
│   │   ├── test_sql.py
│   │   ├── test_text.py
│   │   ├── test_videofolder.py
│   │   └── test_webdataset.py
│   ├── test_arrow_dataset.py
│   ├── test_arrow_reader.py
│   ├── test_arrow_writer.py
│   ├── test_builder.py
│   ├── test_data_files.py
│   ├── test_dataset_dict.py
│   ├── test_dataset_list.py
│   ├── test_distributed.py
│   ├── test_download_manager.py
│   ├── test_exceptions.py
│   ├── test_experimental.py
│   ├── test_extract.py
│   ├── test_file_utils.py
│   ├── test_filelock.py
│   ├── test_filesystem.py
│   ├── test_fingerprint.py
│   ├── test_fingerprint_tokenizer_stability.py
│   ├── test_formatting.py
│   ├── test_hub.py
│   ├── test_info.py
│   ├── test_info_utils.py
│   ├── test_inspect.py
│   ├── test_iterable_dataset.py
│   ├── test_load.py
│   ├── test_metadata_util.py
│   ├── test_offline_util.py
│   ├── test_parallel.py
│   ├── test_patching.py
│   ├── test_py_utils.py
│   ├── test_search.py
│   ├── test_sharding_utils.py
│   ├── test_splits.py
│   ├── test_streaming_download_manager.py
│   ├── test_table.py
│   ├── test_tqdm.py
│   ├── test_upstream_hub.py
│   ├── test_version.py
│   └── utils.py
└── utils/
    └── release.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .dvc/.gitignore
================================================
/config.local
/tmp
/cache


================================================
FILE: .dvc/config
================================================


================================================
FILE: .dvc/plots/confusion.json
================================================
{
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {
        "values": "<DVC_METRIC_DATA>"
    },
    "title": "<DVC_METRIC_TITLE>",
    "mark": "rect",
    "encoding": {
        "x": {
            "field": "<DVC_METRIC_X>",
            "type": "nominal",
            "sort": "ascending",
            "title": "<DVC_METRIC_X_LABEL>"
        },
        "y": {
            "field": "<DVC_METRIC_Y>",
            "type": "nominal",
            "sort": "ascending",
            "title": "<DVC_METRIC_Y_LABEL>"
        },
        "color": {
            "aggregate": "count",
            "type": "quantitative"
        },
        "facet": {
            "field": "rev",
            "type": "nominal"
        }
    }
}


================================================
FILE: .dvc/plots/default.json
================================================
{
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {
        "values": "<DVC_METRIC_DATA>"
    },
    "title": "<DVC_METRIC_TITLE>",
    "mark": {
        "type": "line"
    },
    "encoding": {
        "x": {
            "field": "<DVC_METRIC_X>",
            "type": "quantitative",
            "title": "<DVC_METRIC_X_LABEL>"
        },
        "y": {
            "field": "<DVC_METRIC_Y>",
            "type": "quantitative",
            "title": "<DVC_METRIC_Y_LABEL>",
            "scale": {
                "zero": false
            }
        },
        "color": {
            "field": "rev",
            "type": "nominal"
        }
    }
}


================================================
FILE: .dvc/plots/scatter.json
================================================
{
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {
        "values": "<DVC_METRIC_DATA>"
    },
    "title": "<DVC_METRIC_TITLE>",
    "mark": "point",
    "encoding": {
        "x": {
            "field": "<DVC_METRIC_X>",
            "type": "quantitative",
            "title": "<DVC_METRIC_X_LABEL>"
        },
        "y": {
            "field": "<DVC_METRIC_Y>",
            "type": "quantitative",
            "title": "<DVC_METRIC_Y_LABEL>",
            "scale": {
                "zero": false
            }
        },
        "color": {
            "field": "rev",
            "type": "nominal"
        }
    }
}


================================================
FILE: .dvc/plots/smooth.json
================================================
{
    "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
    "data": {
        "values": "<DVC_METRIC_DATA>"
    },
    "title": "<DVC_METRIC_TITLE>",
    "mark": {
        "type": "line"
    },
    "encoding": {
        "x": {
            "field": "<DVC_METRIC_X>",
            "type": "quantitative",
            "title": "<DVC_METRIC_X_LABEL>"
        },
        "y": {
            "field": "<DVC_METRIC_Y>",
            "type": "quantitative",
            "title": "<DVC_METRIC_Y_LABEL>",
            "scale": {
                "zero": false
            }
        },
        "color": {
            "field": "rev",
            "type": "nominal"
        }
    },
    "transform": [
        {
            "loess": "<DVC_METRIC_Y>",
            "on": "<DVC_METRIC_X>",
            "groupby": [
                "rev"
            ],
            "bandwidth": 0.3
        }
    ]
}


================================================
FILE: .dvcignore
================================================
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore


================================================
FILE: .github/ISSUE_TEMPLATE/bug-report.yml
================================================
name: Bug report
description: Create a report to help reproduce and fix the bug
body:
  - type: textarea
    id: description
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is
    validations:
      required: true
  
  - type: textarea
    id: reproduction
    attributes:
      label: Steps to reproduce the bug
      description: |
        Please provide a code sample that reproduces the problem you ran into. It can be a Colab link or just a code snippet.
        If you have code snippets, error messages, or stack traces, please provide them here as well.
        Important! Use code tags to correctly format your code. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
        Do not use screenshots, as they are hard to read and (more importantly) don't allow others to copy-and-paste your code.
      placeholder: |
        Steps to reproduce the behavior:
          
          1.
          2.
          3.
    validations:
      required: true

  - type: textarea
    id: expected-behavior
    validations:
      required: true
    attributes:
      label: Expected behavior
      description: A clear and concise description of the expected results.

  - type: textarea
    id: environment-info
    attributes:
      label: Environment info
      description: Please share your environment info with us. You can run the command `datasets-cli env` and copy-paste its output below.
      placeholder: datasets version, platform, python version, ...
    validations:
      required: true


================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
contact_links:
  - name: Datasets on the Hugging Face Hub
    url: https://huggingface.co/datasets
    about: Please use the "Community" tab of the dataset on the Hugging Face Hub to open a discussion or a pull request
  - name: Forum
    url: https://discuss.huggingface.co/c/datasets/10
    about: Please ask and answer questions here, and engage with other community members


================================================
FILE: .github/ISSUE_TEMPLATE/feature-request.yml
================================================
name: Feature request
description: Suggest an idea for this project
labels: ["enhancement"]
body:
  - type: textarea
    id: feature-request
    attributes:
      label: Feature request
      description: A clear and concise description of the feature proposal.
    validations:
      required: true
  
  - type: textarea
    id: motivation
    validations:
      required: true
    attributes:
      label: Motivation
      description: |
        Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too.   

  - type: textarea
    id: contribution
    validations:
      required: true
    attributes:
      label: Your contribution
      description: |
        Is there any way that you could help, e.g. by submitting a PR? Make sure to read the CONTRIBUTING.MD [readme](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md).


================================================
FILE: .github/conda/build.sh
================================================
$PYTHON setup.py install --single-version-externally-managed --record=record.txt


================================================
FILE: .github/conda/meta.yaml
================================================
{% set name = "datasets" %}

package:
  name: "{{ name|lower }}"
  version: "{{ DATASETS_VERSION }}"

source:
  path: ../../

build:
  noarch: python

requirements:
  host:
    - python
    - pip
    - numpy >=1.17
    - pyarrow >=16.0.0
    - python-xxhash
    - dill
    - pandas
    - requests >=2.19.0
    - httpx <1.0.0
    - tqdm >=4.66.3
    - dataclasses
    - multiprocess
    - fsspec
    - huggingface_hub >=0.25.0,<2.0.0
    - packaging
  run:
    - python
    - pip
    - numpy >=1.17
    - pyarrow >=16.0.0
    - python-xxhash
    - dill
    - pandas
    - requests >=2.19.0
    - httpx <1.0.0
    - tqdm >=4.66.3
    - dataclasses
    - multiprocess
    - fsspec
    - huggingface_hub >=0.25.0,<2.0.0
    - packaging

test:
  imports:
    - datasets

about:
  home: https://huggingface.co
  license: Apache License 2.0
  license_file: LICENSE
  summary: "🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools"


================================================
FILE: .github/workflows/build_documentation.yml
================================================
name: Build documentation

on:
  push:
    branches:
      - main
      - doc-builder*
      - v*-release
      - v*-patch

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: datasets
      notebook_folder: datasets_doc
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}


================================================
FILE: .github/workflows/build_pr_documentation.yml
================================================
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: datasets


================================================
FILE: .github/workflows/ci.yml
================================================
name: CI

on:
  pull_request:
    branches:
      - main
  push:
    branches:
      - main
      - ci-*

env:
  CI_HEADERS: ${{ secrets.CI_HEADERS }}

jobs:

  check_code_quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install .[quality]
      - name: Check quality
        run: |
          ruff check tests src benchmarks utils setup.py # linter
          ruff format --check tests src benchmarks utils setup.py # formatter

  test:
    needs: check_code_quality
    strategy:
      matrix:
        test: ['unit', 'integration']
        os: [ubuntu-latest, windows-latest]
        deps_versions: [deps-latest, deps-minimum]
    continue-on-error: ${{ matrix.test == 'integration' }}
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup FFmpeg
        if: ${{ matrix.os == 'ubuntu-latest' }}
        run: |
          sudo apt update
          sudo apt install -y ffmpeg 
      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Setup conda env (windows)
        if: ${{ matrix.os == 'windows-latest' }}
        uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          miniconda-version: "latest"
          activate-environment: test
          python-version: "3.10"
      - name: Setup FFmpeg (windows)
        if: ${{ matrix.os == 'windows-latest' }}
        run: conda install "ffmpeg=7.0.1" -c conda-forge
      - name: Upgrade pip
        run: python -m pip install --upgrade pip
      - name: Install uv
        run: pip install --upgrade uv
      - name: Install dependencies
        run: uv pip install --system "datasets[tests] @ ."
      - name: Install dependencies (latest versions)
        if: ${{ matrix.deps_versions == 'deps-latest' }}
        run: uv pip install --system --upgrade pyarrow huggingface-hub "dill<0.3.9"
      - name: Install dependencies (minimum versions)
        if: ${{ matrix.deps_versions != 'deps-latest' }}
        run: uv pip install --system pyarrow==21.0.0 huggingface-hub==0.25.0 transformers dill==0.3.1.1
      - name: Print dependencies
        run: uv pip list
      - name: Test with pytest
        run: |
          python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/

  test_py314:
    needs: check_code_quality
    strategy:
      matrix:
        test: ['unit']
        os: [ubuntu-latest, windows-latest]
        deps_versions: [deps-latest]
    continue-on-error: false
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup FFmpeg
        if: ${{ matrix.os == 'ubuntu-latest' }}
        run: |
          sudo apt update
          sudo apt install -y ffmpeg
      - name: Set up Python 3.14
        uses: actions/setup-python@v5
        with:
          python-version: "3.14"
      - name: Setup conda env (windows)
        if: ${{ matrix.os == 'windows-latest' }}
        uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          miniconda-version: "latest"
          activate-environment: test
          python-version: "3.14"
      - name: Setup FFmpeg (windows)
        if: ${{ matrix.os == 'windows-latest' }}
        run: conda install "ffmpeg=7.0.1" -c conda-forge
      - name: Upgrade pip
        run: python -m pip install --upgrade pip
      - name: Install uv
        run: pip install --upgrade uv
      - name: Install dependencies
        run: uv pip install --system "datasets[tests] @ ."
      - name: Print dependencies
        run: uv pip list
      - name: Test with pytest
        run: |
          python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/

  test_py314_future:
    needs: check_code_quality
    strategy:
      matrix:
        test: ['unit']
        os: [ubuntu-latest, windows-latest]
        deps_versions: [deps-latest]
    continue-on-error: false
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup FFmpeg
        if: ${{ matrix.os == 'ubuntu-latest' }}
        run: |
          sudo apt update
          sudo apt install -y ffmpeg 
      - name: Set up Python 3.14
        uses: actions/setup-python@v5
        with:
          python-version: "3.14"
      - name: Setup conda env (windows)
        if: ${{ matrix.os == 'windows-latest' }}
        uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          miniconda-version: "latest"
          activate-environment: test
          python-version: "3.14"
      - name: Setup FFmpeg (windows)
        if: ${{ matrix.os == 'windows-latest' }}
        run: conda install "ffmpeg=7.0.1" -c conda-forge
      - name: Upgrade pip
        run: python -m pip install --upgrade pip
      - name: Install uv
        run: pip install --upgrade uv
      - name: Install dependencies
        run: uv pip install --system "datasets[tests_numpy2] @ ."
      - name: Print dependencies
        run: pip list

      - name: Test with pytest
        run: |
          python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
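
The `test` job's matrix above can be read as a Cartesian product. The following is a minimal Python sketch, assuming GitHub Actions expands `strategy.matrix` into one job per combination. The helper name `expand_matrix` is hypothetical and not part of this repository.

```python
from itertools import product

# Values copied from the `test` job's strategy.matrix in ci.yml.
MATRIX = {
    "test": ["unit", "integration"],
    "os": ["ubuntu-latest", "windows-latest"],
    "deps_versions": ["deps-latest", "deps-minimum"],
}


def expand_matrix(matrix):
    """Return one dict per matrix combination (hypothetical helper)."""
    keys = list(matrix)
    return [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]


jobs = expand_matrix(MATRIX)
for job in jobs:
    # Mirrors `continue-on-error: ${{ matrix.test == 'integration' }}`.
    job["continue_on_error"] = job["test"] == "integration"

print(len(jobs))
```

Under that assumption the matrix yields eight jobs, and the four integration jobs are the ones allowed to fail without failing the workflow.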


================================================
FILE: .github/workflows/release-conda.yml
================================================
name: Release - Conda

on:
  push:
    tags:
      - "[0-9]+.[0-9]+.[0-9]+*"

env:
  ANACONDA_API_TOKEN: ${{ secrets.ANACONDA_API_TOKEN }}

jobs:
  build_and_package:
    runs-on: ubuntu-22.04
    defaults:
      run:
        shell: bash -l {0}

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Install miniconda
        uses: conda-incubator/setup-miniconda@v2
        with:
          auto-update-conda: true
          auto-activate-base: false
          activate-environment: "build-datasets"
          python-version: "3.10"
          channels: huggingface

      - name: Setup conda env
        run: |
          conda install -c defaults anaconda-client conda-build

      - name: Extract version
        run: echo "DATASETS_VERSION=`python setup.py --version`" >> $GITHUB_ENV

      - name: Build conda packages
        run: |
          conda info
          conda build .github/conda

      - name: Upload to Anaconda
        run: |
          anaconda upload `conda build .github/conda --output -c conda-forge` --force


================================================
FILE: .github/workflows/self-assign.yaml
================================================
name: Self-assign
on:
  issue_comment:
    types: created
jobs:
  one:
    runs-on: ubuntu-latest
    if: >-
      (github.event.comment.body == '#take' ||
       github.event.comment.body == '#self-assign')
      && !github.event.issue.assignee
    steps:
      - run: |
          echo "Assigning issue ${{ github.event.issue.number }} to ${{ github.event.comment.user.login }}"
          curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" -d '{"assignees": ["${{ github.event.comment.user.login }}"]}' https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/assignees
          curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" -X "DELETE" https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/labels/help%20wanted
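
The `if:` guard of this workflow can be approximated in Python. This is an illustration only: `should_self_assign` is a hypothetical name, and the real check is evaluated by GitHub Actions expressions, whose string comparison rules (for example case handling) may differ from Python's.

```python
def should_self_assign(comment_body, assignee):
    """Approximate the workflow's `if:` guard (hypothetical helper)."""
    # The comment must be exactly one of the two trigger strings.
    is_trigger = comment_body in ("#take", "#self-assign")
    # `!github.event.issue.assignee` holds when the issue has no assignee.
    return is_trigger and not assignee
```

The sketch shows that a comment which merely contains `#take` inside a longer sentence does not trigger the assignment, and that an issue with an existing assignee is left untouched.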


================================================
FILE: .github/workflows/trufflehog.yml
================================================
on:
  push:

name: Secret Leaks

permissions:
  contents: read

jobs:
  trufflehog:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout code
      uses: actions/checkout@v5
      with:
        fetch-depth: 0
    - name: Secret Scanning
      uses: trufflesecurity/trufflehog@main
      with:
        extra_args: --results=verified


================================================
FILE: .github/workflows/upload_pr_documentation.yml
================================================
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: datasets
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}

================================================
FILE: .gitignore
================================================
# Locked files
*.lock
!dvc.lock

# Extracted dummy data
datasets/**/dummy_data-zip-extracted/

# Compiled python modules.
*.pyc

# Byte-compiled
__pycache__/
.cache/

# Python egg metadata, regenerated from source files by setuptools.
*.egg-info
.eggs/

# PyPI distribution artifacts.
build/
dist/

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# pyenv
.python-version

# Tests
.pytest_cache/

# Other
*.DS_Store

# PyCharm/vscode
.idea
.vscode

# Vim
.*.swp

# playground
/playground

# Sphinx documentation
docs/_build/
docs/source/_build/

# Benchmark results
report.json
report.md

# Ruff
.ruff_cache


================================================
FILE: .pre-commit-config.yaml
================================================
repos:
  - repo: https://github.com/charliermarsh/ruff-pre-commit # https://github.com/charliermarsh/ruff#usage
    rev: 'v0.11.8'
    hooks:
      # Run the linter.
      - id: ruff
        args: [ --fix ]
      # Run the formatter.
      - id: ruff-format


================================================
FILE: .zenodo.json
================================================
{
    "license": "Apache-2.0",
    "creators": [
        {
            "affiliation": "Hugging Face",
            "name": "Quentin Lhoest"
        },
        {
            "orcid": "0000-0003-1727-1045",
            "affiliation": "Hugging Face",
            "name": "Albert Villanova del Moral"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Patrick von Platen"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Thomas Wolf"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Mario Šaško"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Yacine Jernite"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Abhishek Thakur"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Lewis Tunstall"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Suraj Patil"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Mariama Drame"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Julien Chaumond"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Julien Plu"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Joe Davison"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Simon Brandeis"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Victor Sanh"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Teven Le Scao"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Kevin Canwen Xu"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Nicolas Patry"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Steven Liu"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Angelina McMillan-Major"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Philipp Schmid"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Sylvain Gugger"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Nathan Raw"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Sylvain Lesage"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Anton Lozhkov"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Matthew Carrigan"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Th\u00e9o Matussi\u00e8re"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Leandro von Werra"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Lysandre Debut"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Stas Bekman"
        },
        {
            "affiliation": "Hugging Face",
            "name": "Cl\u00e9ment Delangue"
        }
    ]
}

================================================
FILE: ADD_NEW_DATASET.md
================================================
# How to add a new dataset

Add datasets directly to the 🤗 Hugging Face Hub!

You can share your dataset on https://huggingface.co/datasets directly using your account, see the documentation:

* [Create a dataset and upload files on the website](https://huggingface.co/docs/datasets/upload_dataset)
* [Advanced guide using the CLI](https://huggingface.co/docs/datasets/share)


================================================
FILE: AUTHORS
================================================
# This is the list of HuggingFace Datasets authors for copyright purposes.
#
# This does not necessarily list everyone who has contributed code, since in
# some cases, their employer may be the copyright holder.  To see the full list
# of contributors, see the revision history in source control.

Google Inc.
HuggingFace Inc.


================================================
FILE: CITATION.cff
================================================
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "huggingface/datasets"
authors:
- family-names: Lhoest
  given-names: Quentin
- family-names: Villanova del Moral
  given-names: Albert
  orcid: "https://orcid.org/0000-0003-1727-1045"
- family-names: von Platen
  given-names: Patrick
- family-names: Wolf
  given-names: Thomas
- family-names: Šaško
  given-names: Mario
- family-names: Jernite
  given-names: Yacine
- family-names: Thakur
  given-names: Abhishek
- family-names: Tunstall
  given-names: Lewis
- family-names: Patil
  given-names: Suraj
- family-names: Drame
  given-names: Mariama
- family-names: Chaumond
  given-names: Julien
- family-names: Plu
  given-names: Julien
- family-names: Davison
  given-names: Joe
- family-names: Brandeis
  given-names: Simon
- family-names: Sanh
  given-names: Victor
- family-names: Le Scao
  given-names: Teven
- family-names: Canwen Xu
  given-names: Kevin
- family-names: Patry
  given-names: Nicolas
- family-names: Liu
  given-names: Steven
- family-names: McMillan-Major
  given-names: Angelina
- family-names: Schmid
  given-names: Philipp
- family-names: Gugger
  given-names: Sylvain
- family-names: Raw
  given-names: Nathan
- family-names: Lesage
  given-names: Sylvain
- family-names: Lozhkov
  given-names: Anton
- family-names: Carrigan
  given-names: Matthew
- family-names: Matussière
  given-names: Théo
- family-names: von Werra
  given-names: Leandro
- family-names: Debut
  given-names: Lysandre
- family-names: Bekman
  given-names: Stas
- family-names: Delangue
  given-names: Clément
doi: 10.5281/zenodo.4817768
repository-code: "https://github.com/huggingface/datasets"
license: Apache-2.0
preferred-citation:
  type: conference-paper
  title: "Datasets: A Community Library for Natural Language Processing"
  authors:
  - family-names: Lhoest
    given-names: Quentin
  - family-names: Villanova del Moral
    given-names: Albert
    orcid: "https://orcid.org/0000-0003-1727-1045"
  - family-names: von Platen
    given-names: Patrick
  - family-names: Wolf
    given-names: Thomas
  - family-names: Šaško
    given-names: Mario
  - family-names: Jernite
    given-names: Yacine
  - family-names: Thakur
    given-names: Abhishek
  - family-names: Tunstall
    given-names: Lewis
  - family-names: Patil
    given-names: Suraj
  - family-names: Drame
    given-names: Mariama
  - family-names: Chaumond
    given-names: Julien
  - family-names: Plu
    given-names: Julien
  - family-names: Davison
    given-names: Joe
  - family-names: Brandeis
    given-names: Simon
  - family-names: Sanh
    given-names: Victor
  - family-names: Le Scao
    given-names: Teven
  - family-names: Canwen Xu
    given-names: Kevin
  - family-names: Patry
    given-names: Nicolas
  - family-names: Liu
    given-names: Steven
  - family-names: McMillan-Major
    given-names: Angelina
  - family-names: Schmid
    given-names: Philipp
  - family-names: Gugger
    given-names: Sylvain
  - family-names: Raw
    given-names: Nathan
  - family-names: Lesage
    given-names: Sylvain
  - family-names: Lozhkov
    given-names: Anton
  - family-names: Carrigan
    given-names: Matthew
  - family-names: Matussière
    given-names: Théo
  - family-names: von Werra
    given-names: Leandro
  - family-names: Debut
    given-names: Lysandre
  - family-names: Bekman
    given-names: Stas
  - family-names: Delangue
    given-names: Clément
  collection-title: "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations"
  collection-type: proceedings
  month: 11
  year: 2021
  publisher:
    name: "Association for Computational Linguistics"
  url: "https://aclanthology.org/2021.emnlp-demo.21"
  start: 175
  end: 184
  identifiers:
    - type: other
      value: "arXiv:2109.02846"
      description: "The arXiv preprint of the paper"


================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual identity
and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
  overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
  advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
  address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
feedback@huggingface.co.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series
of actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within
the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
[https://www.contributor-covenant.org/version/2/0/code_of_conduct.html][v2.0].

Community Impact Guidelines were inspired by 
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].

For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available 
at [https://www.contributor-covenant.org/translations][translations].

[homepage]: https://www.contributor-covenant.org
[v2.0]: https://www.contributor-covenant.org/version/2/0/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations


================================================
FILE: CONTRIBUTING.md
================================================
# How to contribute to Datasets?
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](CODE_OF_CONDUCT.md)

Datasets is an open source project, so all contributions and suggestions are welcome.

You can contribute in many different ways: giving ideas, answering questions, reporting bugs, proposing enhancements,
improving the documentation, fixing bugs, and more.

Many thanks in advance to every contributor.

In order to facilitate healthy, constructive behavior in an open and inclusive community, we all respect and abide by
our [code of conduct](CODE_OF_CONDUCT.md).

## How to work on an open Issue?
The list of open Issues is available at: https://github.com/huggingface/datasets/issues

Some of them may have the label `help wanted`: that means that any contributor is welcome to work on them!

If you would like to work on any of the open Issues:

1. Make sure it is not already assigned to someone else. The assignee (if any) is shown at the top of the right column of the Issue page.

2. You can self-assign it by commenting on the Issue page with the keyword: `#self-assign`.

3. Work on your self-assigned issue and eventually create a Pull Request.

## How to create a Pull Request?
If you want to add a dataset, see the specific instructions in the section [*How to add a dataset on Hugging Face*](#how-to-add-a-dataset-on-hugging-face).

1. Fork the [repository](https://github.com/huggingface/datasets) by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.

2. Clone your fork to your local disk, and add the base repository as a remote:

    ```bash
    git clone git@github.com:<your GitHub handle>/datasets.git
    cd datasets
    git remote add upstream https://github.com/huggingface/datasets.git
    ```

3. Create a new branch to hold your development changes:

    ```bash
    git checkout -b a-descriptive-name-for-my-changes
    ```

    **Do not** work on the `main` branch.

4. Set up a development environment by running the following command in a virtual environment:

    Simple setup with code formatting only (recommended)
    ```bash
    pip install -e ".[quality]"
    ```
    
    Advanced setup with all the optional dependencies
    ```bash
    pip install -e ".[dev]"
    ```

   (If datasets was already installed in the virtual environment, remove
   it with `pip uninstall datasets` before reinstalling it in editable
   mode with the `-e` flag.)

5. Develop the features on your branch.

6. Format your code. Run `ruff` (linter and formatter) so that your newly added files look nice with the following command:

    ```bash
    make style
    ```
   
7. _(Optional)_ You can also use [`pre-commit`](https://pre-commit.com/) to format your code automatically each time you run `git commit`, instead of running `make style` manually.
To do this, install `pre-commit` via `pip install pre-commit` and then run `pre-commit install` in the project's root directory to set up the hooks.
Note that if any files were formatted by `pre-commit` hooks during committing, you have to run `git commit` again.


8. Once you're happy with your contribution, add your changed files and make a commit to record your changes locally:

    ```bash
    git add -u
    git commit
    ```

    It is a good idea to sync your copy of the code with the original
    repository regularly. This way you can quickly account for changes:

    ```bash
    git fetch upstream
    git rebase upstream/main
    ```

9. Once you are satisfied, push the changes to your fork repo using:

   ```bash
   git push -u origin a-descriptive-name-for-my-changes
   ```

   Go to the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.

## Datasets on Hugging Face

### How to add a dataset on Hugging Face

You can share your dataset on https://huggingface.co/datasets directly using your account (no need to open a PR on GitHub), see the documentation:

* [Create a dataset and upload files on the website](https://huggingface.co/docs/datasets/upload_dataset)
* [Advanced guide using the CLI](https://huggingface.co/docs/datasets/share)

### How to contribute to the dataset cards

Improving the documentation of datasets is an ever-increasing effort, and we invite users to contribute by sharing their insights with the community in the `README.md` dataset cards provided for each dataset.

If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do so, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:

* a [template](https://github.com/huggingface/datasets/blob/main/templates/README.md)
* a [guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) describing what information should go into each of the paragraphs
* and if you need inspiration, we recommend looking through a [completed example](https://huggingface.co/datasets/eli5/blob/main/README.md)

If you are a **dataset author**... you know what to do, it is your dataset after all ;) ! We would especially appreciate it if you could help us fill in information about the process of creating the dataset, and take a moment to reflect on its social impact and possible limitations if you haven't already done so in the dataset paper or in another data statement.

If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.

Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works [Datasheets for Datasets](https://huggingface.co/papers/1803.09010) and [Data Statements for NLP](https://www.aclweb.org/anthology/Q18-1041/).

Thank you for your contribution!

## Code of conduct

This project adheres to the HuggingFace [code of conduct](CODE_OF_CONDUCT.md).
By participating, you are expected to abide by this code.


================================================
FILE: LICENSE
================================================

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: Makefile
================================================
.PHONY: quality style test

check_dirs := tests src benchmarks utils

# Check that source code meets quality standards

quality:
	ruff check $(check_dirs) setup.py  # linter
	ruff format --check $(check_dirs) setup.py # formatter

# Format source code automatically

style:
	ruff check --fix $(check_dirs) setup.py # linter
	ruff format $(check_dirs) setup.py # formatter

# Run tests for the library

test:
	python -m pytest -n auto --dist=loadfile -s -v ./tests/


================================================
FILE: README.md
================================================
<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://huggingface.co/datasets/huggingface/documentation-images/raw/main/datasets-logo-dark.svg">
    <source media="(prefers-color-scheme: light)" srcset="https://huggingface.co/datasets/huggingface/documentation-images/raw/main/datasets-logo-light.svg">
    <img alt="Hugging Face Datasets Library" src="https://huggingface.co/datasets/huggingface/documentation-images/raw/main/datasets-logo-light.svg" width="352" height="59" style="max-width: 100%;">
  </picture>
  <br/>
  <br/>
</p>

<p align="center">
    <a href="https://github.com/huggingface/datasets/actions/workflows/ci.yml?query=branch%3Amain"><img alt="Build" src="https://github.com/huggingface/datasets/actions/workflows/ci.yml/badge.svg?branch=main"></a>
    <a href="https://github.com/huggingface/datasets/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue"></a>
    <a href="https://huggingface.co/docs/datasets/index.html"><img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/datasets/index.html.svg?down_color=red&down_message=offline&up_message=online"></a>
    <a href="https://github.com/huggingface/datasets/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/datasets.svg"></a>
    <a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a>
    <a href="CODE_OF_CONDUCT.md"><img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg"></a>
    <a href="https://zenodo.org/badge/latestdoi/250213286"><img src="https://zenodo.org/badge/250213286.svg" alt="DOI"></a>
</p>

🤗 Datasets is a lightweight library providing **two** main features:

- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/Pandas/PyTorch/TensorFlow/JAX),
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share)

<h3 align="center">
    <a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/main/docs/source/imgs/course_banner.png"></a>
</h3>

🤗 Datasets is designed to let the community easily add and share new datasets.

🤗 Datasets has many additional interesting features:

- Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
- Smart caching: never wait for your data to process several times.
- Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
- Built-in interoperability with NumPy, PyTorch, TensorFlow 2, JAX, Pandas, Polars and more.
- Native support for audio, image and video data.
- Enable streaming mode to save disk space and start iterating over the dataset immediately.

🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets), and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library.

# Installation

## With pip

🤗 Datasets can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):

```bash
pip install datasets
```

## With conda

🤗 Datasets can be installed using conda as follows:

```bash
conda install -c huggingface -c conda-forge datasets
```

Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.

For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation

## Installation to use with Machine Learning & Data frameworks

If you plan to use 🤗 Datasets with PyTorch (2.0+), TensorFlow (2.6+) or JAX (0.4+) you should also install PyTorch, TensorFlow or JAX.
🤗 Datasets is also well integrated with data frameworks like PyArrow, Pandas, Polars and Spark, which should be installed separately.

For more details on using the library with these frameworks, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart
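
Before relying on a framework integration, it can help to check which of the optional frameworks are importable in the current environment. The sketch below uses only the standard library; the list of module names and the helper `find_installed` are assumptions made for illustration and are not part of 🤗 Datasets.

```python
import importlib.util

# Top-level import names of the optional frameworks mentioned above.
# Assumption of this sketch: each package is importable under this name.
OPTIONAL_FRAMEWORKS = ["torch", "tensorflow", "jax", "pyarrow", "pandas", "polars", "pyspark"]


def find_installed(module_names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


installed = find_installed(OPTIONAL_FRAMEWORKS)
for name, is_installed in installed.items():
    print(f"{name}: {'installed' if is_installed else 'missing'}")
```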

# Usage

🤗 Datasets is made to be very simple to use - the API is centered around a single function, `datasets.load_dataset(dataset_name, **kwargs)`, that instantiates a dataset.

This library can be used for text/image/audio/etc. datasets. Here is a quick example that loads a text dataset:

```python
from datasets import load_dataset

# Print the IDs of 20 datasets available on the Hub
from huggingface_hub import list_datasets
print([dataset.id for dataset in list_datasets(limit=20)])

# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('rajpurkar/squad')
print(squad_dataset['train'][0])

# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
```
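
With `batched=True`, the function passed to `map` receives a batch as a dict mapping each column name to a list of values, and it returns the new columns in the same form. A minimal sketch of such a function follows. The name `add_context_length` and the sample batch are hypothetical, and the function is called on a plain dict so that the snippet runs without downloading anything.

```python
def add_context_length(batch):
    """Batched function: takes a dict of column lists, returns a dict of new column lists."""
    return {"length": [len(context) for context in batch["context"]]}


# A hand-written batch standing in for what `map(..., batched=True)` would pass in.
sample_batch = {"context": ["Short text.", "A slightly longer context."]}
new_columns = add_context_length(sample_batch)
print(new_columns)
```

With a loaded dataset, the same function could be applied as `squad_dataset.map(add_context_length, batched=True)`.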

If your dataset is bigger than your disk or if you don't want to wait to download the data, you can use streaming:

```python
# If you want to use the dataset immediately and efficiently stream the data as you iterate over the dataset
image_dataset = load_dataset('timm/imagenet-1k-wds', streaming=True)
for example in image_dataset["train"]:
    break
```
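
To peek at only the first few examples of a streamed split, one option is `itertools.islice`, which works on any iterable. The sketch below uses a hypothetical stand-in generator (`fake_stream`) instead of a real streamed dataset so that it runs offline. With the example above, the same call would be applied to `image_dataset["train"]`.

```python
from itertools import islice


def fake_stream(num_examples):
    """Hypothetical stand-in for a streamed split: yields examples lazily."""
    for i in range(num_examples):
        yield {"id": i, "text": f"example {i}"}


# Only the first three examples are produced; the rest are never generated.
first_three = list(islice(fake_stream(1_000_000), 3))
print([example["id"] for example in first_three])
```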

For more details on using the library, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart and the specific pages on:

- Loading a dataset: https://huggingface.co/docs/datasets/loading
- What's in a Dataset: https://huggingface.co/docs/datasets/access
- Processing data with 🤗 Datasets: https://huggingface.co/docs/datasets/process
    - Processing audio data: https://huggingface.co/docs/datasets/audio_process
    - Processing image data: https://huggingface.co/docs/datasets/image_process
    - Processing text data: https://huggingface.co/docs/datasets/nlp_process
- Streaming a dataset: https://huggingface.co/docs/datasets/stream
- etc.

# Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).

You can find:
- [how to upload a dataset to the Hub using your web browser or Python](https://huggingface.co/docs/datasets/upload_dataset) and also
- [how to upload it using Git](https://huggingface.co/docs/datasets/share).

# Disclaimers

You can use 🤗 Datasets to load datasets based on versioned git repositories maintained by the dataset authors. For reproducibility reasons, we ask users to pin the `revision` of the repositories they use.

If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. Thanks for your contribution to the ML community!

## BibTeX

If you want to cite our 🤗 Datasets library, you can use our [paper](https://huggingface.co/papers/2109.02846):

```bibtex
@inproceedings{lhoest-etal-2021-datasets,
    title = "Datasets: A Community Library for Natural Language Processing",
    author = "Lhoest, Quentin  and
      Villanova del Moral, Albert  and
      Jernite, Yacine  and
      Thakur, Abhishek  and
      von Platen, Patrick  and
      Patil, Suraj  and
      Chaumond, Julien  and
      Drame, Mariama  and
      Plu, Julien  and
      Tunstall, Lewis  and
      Davison, Joe  and
      {\v{S}}a{\v{s}}ko, Mario  and
      Chhablani, Gunjan  and
      Malik, Bhavitvya  and
      Brandeis, Simon  and
      Le Scao, Teven  and
      Sanh, Victor  and
      Xu, Canwen  and
      Patry, Nicolas  and
      McMillan-Major, Angelina  and
      Schmid, Philipp  and
      Gugger, Sylvain  and
      Delangue, Cl{\'e}ment  and
      Matussi{\`e}re, Th{\'e}o  and
      Debut, Lysandre  and
      Bekman, Stas  and
      Cistac, Pierric  and
      Goehringer, Thibault  and
      Mustar, Victor  and
      Lagunas, Fran{\c{c}}ois  and
      Rush, Alexander  and
      Wolf, Thomas",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-demo.21",
    pages = "175--184",
    abstract = "The scale, variety, and quantity of publicly-available NLP datasets has grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. The library is available at https://github.com/huggingface/datasets.",
    eprint={2109.02846},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
}
```

If you need to cite a specific version of our 🤗 Datasets library for reproducibility, you can use the corresponding version Zenodo DOI from this [list](https://zenodo.org/search?q=conceptrecid:%224817768%22&sort=-version&all_versions=True).


================================================
FILE: SECURITY.md
================================================
# Security Policy

## Supported Versions
<!--
Use this section to tell people about which versions of your project are
currently being supported with security updates.

| Version | Supported          |
| ------- | ------------------ |
| 5.1.x   | :white_check_mark: |
| 5.0.x   | :x:                |
| 4.0.x   | :white_check_mark: |
| < 4.0   | :x:                |
-->

Each major version is currently being supported with security updates.

| Version | Supported          |
|---------|--------------------|
| 1.x.x   | :white_check_mark: |
| 2.x.x   | :white_check_mark: |


## Reporting a Vulnerability
<!--
Use this section to tell people how to report a vulnerability.

Tell them where to go, how often they can expect to get an update on a
reported vulnerability, what to expect if the vulnerability is accepted or
declined, etc.
-->

To report a security vulnerability, please contact: security@huggingface.co


================================================
FILE: benchmarks/benchmark_array_xd.py
================================================
import json
import os
import tempfile

import datasets
from datasets.arrow_writer import ArrowWriter
from datasets.features import Array2D
from utils import generate_examples, get_duration


SHAPE_TEST_1 = (30, 487)
SHAPE_TEST_2 = (36, 1024)
SPEED_TEST_SHAPE = (100, 100)
SPEED_TEST_N_EXAMPLES = 100

DEFAULT_FEATURES = datasets.Features(
    {"text": Array2D(SHAPE_TEST_1, dtype="float32"), "image": Array2D(SHAPE_TEST_2, dtype="float32")}
)

RESULTS_BASEPATH, RESULTS_FILENAME = os.path.split(__file__)
RESULTS_FILE_PATH = os.path.join(RESULTS_BASEPATH, "results", RESULTS_FILENAME.replace(".py", ".json"))


@get_duration
def write(my_features, dummy_data, tmp_dir):
    with ArrowWriter(features=my_features, path=os.path.join(tmp_dir, "beta.arrow")) as writer:
        for key, record in dummy_data:
            example = my_features.encode_example(record)
            writer.write(example)
        num_examples, num_bytes = writer.finalize()


@get_duration
def read_unformated(feats, tmp_dir):
    dataset = datasets.Dataset.from_file(
        filename=os.path.join(tmp_dir, "beta.arrow"), info=datasets.DatasetInfo(features=feats)
    )
    for _ in dataset:
        pass


@get_duration
def read_formatted_as_numpy(feats, tmp_dir):
    dataset = datasets.Dataset.from_file(
        filename=os.path.join(tmp_dir, "beta.arrow"), info=datasets.DatasetInfo(features=feats)
    )
    dataset.set_format("numpy")
    for _ in dataset:
        pass


@get_duration
def read_batch_unformated(feats, tmp_dir):
    batch_size = 10
    dataset = datasets.Dataset.from_file(
        filename=os.path.join(tmp_dir, "beta.arrow"), info=datasets.DatasetInfo(features=feats)
    )
    for i in range(0, len(dataset), batch_size):
        _ = dataset[i : i + batch_size]


@get_duration
def read_batch_formatted_as_numpy(feats, tmp_dir):
    batch_size = 10
    dataset = datasets.Dataset.from_file(
        filename=os.path.join(tmp_dir, "beta.arrow"), info=datasets.DatasetInfo(features=feats)
    )
    dataset.set_format("numpy")
    for i in range(0, len(dataset), batch_size):
        _ = dataset[i : i + batch_size]


@get_duration
def read_col_unformated(feats, tmp_dir):
    dataset = datasets.Dataset.from_file(
        filename=os.path.join(tmp_dir, "beta.arrow"), info=datasets.DatasetInfo(features=feats)
    )
    for col in feats:
        _ = dataset[col]


@get_duration
def read_col_formatted_as_numpy(feats, tmp_dir):
    dataset = datasets.Dataset.from_file(
        filename=os.path.join(tmp_dir, "beta.arrow"), info=datasets.DatasetInfo(features=feats)
    )
    dataset.set_format("numpy")
    for col in feats:
        _ = dataset[col]


def benchmark_array_xd():
    times = {}
    read_functions = (
        read_unformated,
        read_formatted_as_numpy,
        read_batch_unformated,
        read_batch_formatted_as_numpy,
        read_col_unformated,
        read_col_formatted_as_numpy,
    )
    with tempfile.TemporaryDirectory() as tmp_dir:
        feats = datasets.Features({"image": Array2D(SPEED_TEST_SHAPE, dtype="float32")})
        data = generate_examples(features=feats, num_examples=SPEED_TEST_N_EXAMPLES)
        times["write_array2d"] = write(feats, data, tmp_dir)
        for read_func in read_functions:
            times[read_func.__name__ + " after write_array2d"] = read_func(feats, tmp_dir)

    with tempfile.TemporaryDirectory() as tmp_dir:
        # don't use fixed length for fair comparison
        # feats = datasets.Features(
        #     {"image": datasets.Sequence(datasets.Sequence(datasets.Value("float32"), SPEED_TEST_SHAPE[1]), SPEED_TEST_SHAPE[0])}
        # )
        feats = datasets.Features({"image": datasets.Sequence(datasets.Sequence(datasets.Value("float32")))})
        data = generate_examples(
            features=feats, num_examples=SPEED_TEST_N_EXAMPLES, seq_shapes={"image": SPEED_TEST_SHAPE}
        )
        times["write_nested_sequence"] = write(feats, data, tmp_dir)
        for read_func in read_functions:
            times[read_func.__name__ + " after write_nested_sequence"] = read_func(feats, tmp_dir)

    with tempfile.TemporaryDirectory() as tmp_dir:
        # don't use fixed length for fair comparison
        # feats = datasets.Features(
        #     {"image": datasets.Sequence(datasets.Value("float32"), SPEED_TEST_SHAPE[0] * SPEED_TEST_SHAPE[1])}
        # )
        feats = datasets.Features({"image": datasets.Sequence(datasets.Value("float32"))})
        data = generate_examples(
            features=feats,
            num_examples=SPEED_TEST_N_EXAMPLES,
            seq_shapes={"image": [SPEED_TEST_SHAPE[0] * SPEED_TEST_SHAPE[1]]},
        )
        times["write_flattened_sequence"] = write(feats, data, tmp_dir)
        for read_func in read_functions:
            times[read_func.__name__ + " after write_flattened_sequence"] = read_func(feats, tmp_dir)

    with open(RESULTS_FILE_PATH, "wb") as f:
        f.write(json.dumps(times).encode("utf-8"))


if __name__ == "__main__":  # useful to run the profiler
    benchmark_array_xd()


================================================
FILE: benchmarks/benchmark_getitem_100B.py
================================================
import json
import os
from dataclasses import dataclass

import numpy as np
import pyarrow as pa

import datasets
from utils import get_duration


SPEED_TEST_N_EXAMPLES = 100_000_000_000
SPEED_TEST_CHUNK_SIZE = 10_000

RESULTS_BASEPATH, RESULTS_FILENAME = os.path.split(__file__)
RESULTS_FILE_PATH = os.path.join(RESULTS_BASEPATH, "results", RESULTS_FILENAME.replace(".py", ".json"))


def generate_100B_dataset(num_examples: int, chunk_size: int) -> datasets.Dataset:
    table = pa.Table.from_pydict({"col": [0] * chunk_size})
    table = pa.concat_tables([table] * (num_examples // chunk_size))
    return datasets.Dataset(table, fingerprint="table_100B")


@dataclass
class RandIter:
    low: int
    high: int
    size: int
    seed: int

    def __post_init__(self):
        rng = np.random.default_rng(self.seed)
        self._sampled_values = rng.integers(low=self.low, high=self.high, size=self.size).tolist()

    def __iter__(self):
        return iter(self._sampled_values)

    def __len__(self):
        return self.size


@get_duration
def get_first_row(dataset: datasets.Dataset):
    _ = dataset[0]


@get_duration
def get_last_row(dataset: datasets.Dataset):
    _ = dataset[-1]


@get_duration
def get_batch_of_1024_rows(dataset: datasets.Dataset):
    _ = dataset[range(len(dataset) // 2, len(dataset) // 2 + 1024)]


@get_duration
def get_batch_of_1024_random_rows(dataset: datasets.Dataset):
    _ = dataset[RandIter(0, len(dataset), 1024, seed=42)]


def benchmark_table_100B():
    times = {"num examples": SPEED_TEST_N_EXAMPLES}
    functions = (get_first_row, get_last_row, get_batch_of_1024_rows, get_batch_of_1024_random_rows)
    print("generating dataset")
    dataset = generate_100B_dataset(num_examples=SPEED_TEST_N_EXAMPLES, chunk_size=SPEED_TEST_CHUNK_SIZE)
    print("Functions")
    for func in functions:
        print(func.__name__)
        times[func.__name__] = func(dataset)

    with open(RESULTS_FILE_PATH, "wb") as f:
        f.write(json.dumps(times).encode("utf-8"))


if __name__ == "__main__":  # useful to run the profiler
    benchmark_table_100B()


================================================
FILE: benchmarks/benchmark_indices_mapping.py
================================================
import json
import os
import tempfile

import datasets
from utils import generate_example_dataset, get_duration


SPEED_TEST_N_EXAMPLES = 500_000

RESULTS_BASEPATH, RESULTS_FILENAME = os.path.split(__file__)
RESULTS_FILE_PATH = os.path.join(RESULTS_BASEPATH, "results", RESULTS_FILENAME.replace(".py", ".json"))


@get_duration
def select(dataset: datasets.Dataset):
    _ = dataset.select(range(0, len(dataset), 2))


@get_duration
def sort(dataset: datasets.Dataset):
    _ = dataset.sort("numbers")


@get_duration
def shuffle(dataset: datasets.Dataset):
    _ = dataset.shuffle()


@get_duration
def train_test_split(dataset: datasets.Dataset):
    _ = dataset.train_test_split(0.1)


@get_duration
def shard(dataset: datasets.Dataset, num_shards=10):
    for shard_id in range(num_shards):
        _ = dataset.shard(num_shards, shard_id)


def benchmark_indices_mapping():
    times = {"num examples": SPEED_TEST_N_EXAMPLES}
    functions = (select, sort, shuffle, train_test_split, shard)
    with tempfile.TemporaryDirectory() as tmp_dir:
        print("generating dataset")
        features = datasets.Features({"text": datasets.Value("string"), "numbers": datasets.Value("float32")})
        dataset = generate_example_dataset(
            os.path.join(tmp_dir, "dataset.arrow"), features, num_examples=SPEED_TEST_N_EXAMPLES
        )
        print("Functions")
        for func in functions:
            print(func.__name__)
            times[func.__name__] = func(dataset)

    with open(RESULTS_FILE_PATH, "wb") as f:
        f.write(json.dumps(times).encode("utf-8"))


if __name__ == "__main__":  # useful to run the profiler
    benchmark_indices_mapping()


================================================
FILE: benchmarks/benchmark_iterating.py
================================================
import json
import os
import tempfile

import datasets
from utils import generate_example_dataset, get_duration


SPEED_TEST_N_EXAMPLES = 50_000
SMALL_TEST = 5_000

RESULTS_BASEPATH, RESULTS_FILENAME = os.path.split(__file__)
RESULTS_FILE_PATH = os.path.join(RESULTS_BASEPATH, "results", RESULTS_FILENAME.replace(".py", ".json"))


@get_duration
def read(dataset: datasets.Dataset, length):
    for i in range(length):
        _ = dataset[i]


@get_duration
def read_batch(dataset: datasets.Dataset, length, batch_size):
    for i in range(0, length, batch_size):
        _ = dataset[i : i + batch_size]


@get_duration
def read_formatted(dataset: datasets.Dataset, length, type):
    with dataset.formatted_as(type=type):
        for i in range(length):
            _ = dataset[i]


@get_duration
def read_formatted_batch(dataset: datasets.Dataset, length, batch_size, type):
    with dataset.formatted_as(type=type):
        for i in range(0, length, batch_size):
            _ = dataset[i : i + batch_size]


def benchmark_iterating():
    times = {"num examples": SPEED_TEST_N_EXAMPLES}
    functions = [
        (read, {"length": SMALL_TEST}),
        (read, {"length": SPEED_TEST_N_EXAMPLES}),
        (read_batch, {"length": SPEED_TEST_N_EXAMPLES, "batch_size": 10}),
        (read_batch, {"length": SPEED_TEST_N_EXAMPLES, "batch_size": 100}),
        (read_batch, {"length": SPEED_TEST_N_EXAMPLES, "batch_size": 1_000}),
        (read_formatted, {"type": "numpy", "length": SMALL_TEST}),
        (read_formatted, {"type": "pandas", "length": SMALL_TEST}),
        (read_formatted, {"type": "torch", "length": SMALL_TEST}),
        (read_formatted, {"type": "tensorflow", "length": SMALL_TEST}),
        (read_formatted_batch, {"type": "numpy", "length": SMALL_TEST, "batch_size": 10}),
        (read_formatted_batch, {"type": "numpy", "length": SMALL_TEST, "batch_size": 1_000}),
    ]

    functions_shuffled = [
        (read, {"length": SMALL_TEST}),
        (read, {"length": SPEED_TEST_N_EXAMPLES}),
        (read_batch, {"length": SPEED_TEST_N_EXAMPLES, "batch_size": 10}),
        (read_batch, {"length": SPEED_TEST_N_EXAMPLES, "batch_size": 100}),
        (read_batch, {"length": SPEED_TEST_N_EXAMPLES, "batch_size": 1_000}),
        (read_formatted, {"type": "numpy", "length": SMALL_TEST}),
        (read_formatted_batch, {"type": "numpy", "length": SMALL_TEST, "batch_size": 10}),
        (read_formatted_batch, {"type": "numpy", "length": SMALL_TEST, "batch_size": 1_000}),
    ]
    with tempfile.TemporaryDirectory() as tmp_dir:
        print("generating dataset")
        features = datasets.Features(
            {"list": datasets.Sequence(datasets.Value("float32")), "numbers": datasets.Value("float32")}
        )
        dataset = generate_example_dataset(
            os.path.join(tmp_dir, "dataset.arrow"),
            features,
            num_examples=SPEED_TEST_N_EXAMPLES,
            seq_shapes={"list": (100,)},
        )
        print("first set of iterations")
        for func, kwargs in functions:
            print(func.__name__, str(kwargs))
            times[func.__name__ + " " + " ".join(str(v) for v in kwargs.values())] = func(dataset, **kwargs)

        print("shuffling dataset")
        dataset = dataset.shuffle()
        print("Second set of iterations (after shuffling)")
        for func, kwargs in functions_shuffled:
            print("shuffled ", func.__name__, str(kwargs))
            times["shuffled " + func.__name__ + " " + " ".join(str(v) for v in kwargs.values())] = func(
                dataset, **kwargs
            )

    with open(RESULTS_FILE_PATH, "wb") as f:
        f.write(json.dumps(times).encode("utf-8"))


if __name__ == "__main__":  # useful to run the profiler
    benchmark_iterating()


================================================
FILE: benchmarks/benchmark_map_filter.py
================================================
import json
import os
import tempfile

import transformers

import datasets
from utils import generate_example_dataset, get_duration


SPEED_TEST_N_EXAMPLES = 500_000

RESULTS_BASEPATH, RESULTS_FILENAME = os.path.split(__file__)
RESULTS_FILE_PATH = os.path.join(RESULTS_BASEPATH, "results", RESULTS_FILENAME.replace(".py", ".json"))


@get_duration
def map(dataset: datasets.Dataset, **kwargs):
    _ = dataset.map(**kwargs)


@get_duration
def filter(dataset: datasets.Dataset, **kwargs):
    _ = dataset.filter(**kwargs)


def benchmark_map_filter():
    times = {"num examples": SPEED_TEST_N_EXAMPLES}
    with tempfile.TemporaryDirectory() as tmp_dir:
        features = datasets.Features({"text": datasets.Value("string"), "numbers": datasets.Value("float32")})
        dataset = generate_example_dataset(
            os.path.join(tmp_dir, "dataset.arrow"), features, num_examples=SPEED_TEST_N_EXAMPLES
        )

        tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

        def tokenize(examples):
            return tokenizer(examples["text"])

        times["map identity"] = map(dataset)

        times["map identity batched"] = map(dataset, batched=True)

        times["map no-op batched"] = map(dataset, function=lambda x: None, batched=True)

        with dataset.formatted_as(type="numpy"):
            times["map no-op batched numpy"] = map(dataset, function=lambda x: None, batched=True)

        with dataset.formatted_as(type="pandas"):
            times["map no-op batched pandas"] = map(dataset, function=lambda x: None, batched=True)

        with dataset.formatted_as(type="torch", columns="numbers"):
            times["map no-op batched pytorch"] = map(dataset, function=lambda x: None, batched=True)

        with dataset.formatted_as(type="tensorflow", columns="numbers"):
            times["map no-op batched tensorflow"] = map(dataset, function=lambda x: None, batched=True)

        times["map fast-tokenizer batched"] = map(dataset, function=tokenize, batched=True)

        times["filter"] = filter(dataset)

        # Activate later when the tokenizer supports batched inputs
        # with dataset.formatted_as(type='numpy'):
        #     times[func.__name__ + " fast-tokenizer batched numpy"] = func(dataset, function=tokenize, batched=True)

    with open(RESULTS_FILE_PATH, "wb") as f:
        f.write(json.dumps(times).encode("utf-8"))


if __name__ == "__main__":  # useful to run the profiler
    benchmark_map_filter()


================================================
FILE: benchmarks/format.py
================================================
import json
import sys


def format_json_to_md(input_json_file, output_md_file):
    with open(input_json_file, encoding="utf-8") as f:
        results = json.load(f)

    output_md = ["<details>", "<summary>Show updated benchmarks!</summary>", " "]

    for benchmark_name in sorted(results):
        benchmark_res = results[benchmark_name]

        benchmark_file_name = benchmark_name.split("/")[-1]
        output_md.append(f"### Benchmark: {benchmark_file_name}")

        title = "| metric |"
        lines = "|--------|"
        value = "| new / old (diff) |"
        for metric_name in sorted(benchmark_res):
            metric_vals = benchmark_res[metric_name]
            new_val = metric_vals["new"]
            old_val = metric_vals.get("old", None)
            dif_val = metric_vals.get("diff", None)

            val_str = f" {new_val:f}" if isinstance(new_val, (int, float)) else "None"

            if old_val is not None:
                val_str += f" / {old_val:f}" if isinstance(old_val, (int, float)) else "None"
            if dif_val is not None:
                val_str += f" ({dif_val:f})" if isinstance(dif_val, (int, float)) else "None"

            title += " " + metric_name + " |"
            lines += "---|"
            value += val_str + " |"

        output_md += [title, lines, value, " "]

    output_md.append("</details>")

    with open(output_md_file, "w", encoding="utf-8") as f:
        f.write("\n".join(output_md))


if __name__ == "__main__":
    input_json_file = sys.argv[1]
    output_md_file = sys.argv[2]

    format_json_to_md(input_json_file, output_md_file)


================================================
FILE: benchmarks/results/.gitkeep
================================================


================================================
FILE: benchmarks/results/benchmark_array_xd.json
================================================
{"write_array2d": 0.14168284999323077, "read_unformated after write_array2d": 0.04353281999647152, "read_formatted_as_numpy after write_array2d": 0.1285462469968479, "read_batch_unformated after write_array2d": 0.023109222995117307, "read_batch_formatted_as_numpy after write_array2d": 0.011352884990628809, "read_col_unformated after write_array2d": 0.037052362007671036, "read_col_formatted_as_numpy after write_array2d": 0.007985618998645805, "write_nested_sequence": 1.4927163410029607, "read_unformated after write_nested_sequence": 0.28319963401008863, "read_formatted_as_numpy after write_nested_sequence": 0.419271487990045, "read_batch_unformated after write_nested_sequence": 0.3234798710036557, "read_batch_formatted_as_numpy after write_nested_sequence": 0.03850809299910907, "read_col_unformated after write_nested_sequence": 0.29384092400141526, "read_col_formatted_as_numpy after write_nested_sequence": 0.004250421989127062, "write_flattened_sequence": 1.4521546780015342, "read_unformated after write_flattened_sequence": 0.25513897799828555, "read_formatted_as_numpy after write_flattened_sequence": 0.07564631900459062, "read_batch_unformated after write_flattened_sequence": 0.2758980469952803, "read_batch_formatted_as_numpy after write_flattened_sequence": 0.011008214991306886, "read_col_unformated after write_flattened_sequence": 0.25848906899045687, "read_col_formatted_as_numpy after write_flattened_sequence": 0.004328447001171298}

================================================
FILE: benchmarks/results/benchmark_getitem_100B.json
================================================
{"num examples": 100000000000, "get_first_row": 0.00019991099999927542, "get_last_row": 5.4411000000698095e-05, "get_batch_of_1024_rows": 0.0004897069999998394, "get_batch_of_1024_random_rows": 0.01800621099999944}

================================================
FILE: benchmarks/results/benchmark_indices_mapping.json
================================================
{"num examples": 500000, "select": 0.03741131999413483, "sort": 0.7371353159978753, "shuffle": 0.17655655200360343, "train_test_split": 0.29633847798686475, "shard": 0.01452581599005498}

================================================
FILE: benchmarks/results/benchmark_iterating.json
================================================
{"num examples": 50000, "read 5000": 0.2152090710005723, "read 50000": 2.077654693988734, "read_batch 50000 10": 1.5041199039987987, "read_batch 50000 100": 1.5411947140091797, "read_batch 50000 1000": 1.4684901159926085, "read_formatted numpy 5000": 4.584776938994764, "read_formatted pandas 5000": 3.7457121399929747, "read_formatted torch 5000": 4.565676491998602, "read_formatted tensorflow 5000": 5.269861594992108, "read_formatted_batch numpy 5000 10": 0.4242750950070331, "read_formatted_batch numpy 5000 1000": 0.007607111998368055, "shuffled read 5000": 0.22604441999283154, "shuffled read 50000": 2.268928524994408, "shuffled read_batch 50000 10": 55.44462437101174, "shuffled read_batch 50000 100": 6.876476717996411, "shuffled read_batch 50000 1000": 2.1420724369963864, "shuffled read_formatted numpy 5000": 4.8052272600034485, "shuffled read_formatted_batch numpy 5000 10": 6.500664097999106, "shuffled read_formatted_batch numpy 5000 1000": 0.0754691059992183}

================================================
FILE: benchmarks/results/benchmark_map_filter.json
================================================
{"num examples": 500000, "map identity": 10.19139202599763, "map identity batched": 0.6804238399927272, "map no-op batched": 0.5342009569867514, "map no-op batched numpy": 0.5792830920108827, "map no-op batched pandas": 0.4343639040016569, "map no-op batched pytorch": 0.5403374370071106, "map no-op batched tensorflow": 1.3869360350072384, "map fast-tokenizer batched": 8.074308118986664, "filter": 1.841787679004483}

================================================
FILE: benchmarks/utils.py
================================================
import timeit

import numpy as np

import datasets
from datasets.arrow_writer import ArrowWriter
from datasets.features.features import _ArrayXD


def get_duration(func):
    def wrapper(*args, **kwargs):
        starttime = timeit.default_timer()
        _ = func(*args, **kwargs)
        delta = timeit.default_timer() - starttime
        return delta

    wrapper.__name__ = func.__name__

    return wrapper


def generate_examples(features: dict, num_examples=100, seq_shapes=None):
    dummy_data = []
    seq_shapes = seq_shapes or {}
    for i in range(num_examples):
        example = {}
        for col_id, (k, v) in enumerate(features.items()):
            if isinstance(v, _ArrayXD):
                data = np.random.rand(*v.shape).astype(v.dtype)
            elif isinstance(v, datasets.Value):
                if v.dtype == "string":
                    data = "The small grey turtle was surprisingly fast when challenged."
                else:
                    data = np.random.randint(10, size=1).astype(v.dtype).item()
            elif isinstance(v, datasets.Sequence):
                while isinstance(v, datasets.Sequence):
                    v = v.feature
                shape = seq_shapes[k]
                data = np.random.rand(*shape).astype(v.dtype)
            example[k] = data

        dummy_data.append((i, example))

    return dummy_data


def generate_example_dataset(dataset_path, features, num_examples=100, seq_shapes=None):
    dummy_data = generate_examples(features, num_examples=num_examples, seq_shapes=seq_shapes)

    with ArrowWriter(features=features, path=dataset_path) as writer:
        for key, record in dummy_data:
            example = features.encode_example(record)
            writer.write(example)

        num_final_examples, num_bytes = writer.finalize()

    if num_final_examples != num_examples:
        raise ValueError(
            f"Error writing the dataset, wrote {num_final_examples} examples but should have written {num_examples}."
        )

    dataset = datasets.Dataset.from_file(filename=dataset_path, info=datasets.DatasetInfo(features=features))

    return dataset


================================================
FILE: docs/README.md
================================================
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Generating the documentation

To generate the documentation, you first have to build it. Several packages are necessary to build the docs;
you can install them with the following command, run from the root of the code repository:

```bash
pip install -e ".[docs]"
```

Then you need to install our special tool that builds the documentation:

```bash
pip install git+https://github.com/huggingface/doc-builder
```

---
**NOTE**

You only need to generate the documentation to inspect it locally (if you're planning changes and want to
check how they look before committing for instance). You don't have to `git commit` the built documentation.

---

## Building the documentation

Once you have set up `doc-builder` and the additional packages, you can generate the documentation by typing
the following command:

```bash
doc-builder build datasets docs/source/ --build_dir ~/tmp/test-build
```

You can adapt the `--build_dir` to set any temporary folder that you prefer. This command will create it and generate
the MDX files that will be rendered as the documentation on the main website. You can inspect them in your favorite
Markdown editor.

## Previewing the documentation

To preview the docs, first install the `watchdog` module with:

```bash
pip install watchdog
```

Then run the following command:

```bash
doc-builder preview datasets docs/source/
```

The docs will be viewable at [http://localhost:3000](http://localhost:3000). You can also preview the docs once you have opened a PR: a bot will add a comment with a link to where the documentation with your changes lives.

---
**NOTE**

The `preview` command only works with existing doc files. When you add a completely new file, you need to update `_toctree.yml` & restart `preview` command (`ctrl-c` to stop it & call `doc-builder preview ...` again).

## Adding a new element to the navigation bar

Accepted files are Markdown (.md or .mdx).

Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/datasets/blob/main/docs/source/_toctree.yml) file.

## Renaming section headers and moving sections

It helps to keep the old links working when renaming a section header and/or moving sections from one document to another. This is because the old links are likely to be used in issues, forums, and social media, and it makes for a much better user experience if users reading those months later can still easily navigate to the originally intended information.

Therefore we simply keep a little map of moved sections at the end of the document where the original section was. The key is to preserve the original anchor.

So if you renamed a section from: "Section A" to "Section B", then you can add at the end of the file:

```
Sections that were moved:

[ <a href="#section-b">Section A</a><a id="section-a"></a> ]
```
and of course if you moved it to another file, then:

```
Sections that were moved:

[ <a href="../new-file#section-b">Section A</a><a id="section-a"></a> ]
```

Use the relative style to link to the new file so that the versioned docs continue to work.

For an example of a rich moved sections set please see the very end of [the transformers Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md).


## Writing Documentation - Specification

The `huggingface/datasets` documentation follows the
[Google documentation](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) style for docstrings,
although we can write them directly in Markdown.

### Adding a new tutorial

Adding a new tutorial or section is done in two steps:

- Add a new file under `./source`. This file should be Markdown (.md or .mdx).
- Link that file in `./source/_toctree.yml` on the correct toc-tree.

Make sure to put your new file under the proper section. If in doubt, feel free to ask in a GitHub Issue or PR.

### Writing source documentation

Values that should be put in `code` should be surrounded by backticks: \`like so\`. Note that argument names
and objects like `True`, `None` or any strings should usually be put in `code`.

When mentioning a class, function or method, it is recommended to use our syntax for internal links so that our tool
adds a link to its documentation with this syntax: \[\`XXXClass\`\] or \[\`function\`\]. This requires the class or 
function to be in the main package.

If you want to create a link to some internal class or function, you need to
provide its path. For instance: \[\`table.InMemoryTable\`\]. This will be converted into a link with
`table.InMemoryTable` in the description. To get rid of the path and only keep the name of the object you are
linking to in the description, add a ~: \[\`~table.InMemoryTable\`\] will generate a link with `InMemoryTable` in the description.

The same works for methods, so you can either use \[\`XXXClass.method\`\] or \[\`~XXXClass.method\`\].

#### Defining arguments in a method

Arguments should be defined with the `Args:` (or `Arguments:` or `Parameters:`) prefix, followed by a line return and
an indentation. The argument should be followed by its type, with its shape if it is a tensor, a colon and its
description:

```
    Args:
        n_layers (`int`): The number of layers of the model.
```

If the description is too long to fit in one line, another indentation is necessary before writing the description
after the argument.

Here's an example showcasing everything so far:

```
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.

            Indices can be obtained using [`AlbertTokenizer`]. See [`~PreTrainedTokenizer.encode`] and
            [`~PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
```

For optional arguments or arguments with defaults, we use the following syntax. Imagine we have a function with the
following signature:

```
def my_function(x: str = None, a: float = 1):
```

then its documentation should look like this:

```
    Args:
        x (`str`, *optional*):
            This argument controls ...
        a (`float`, *optional*, defaults to 1):
            This argument is used to ...
```

Note that we always omit the "defaults to \`None\`" when None is the default for any argument. Also note that even
if the first line describing your argument type and its default gets long, you can't break it into several lines. You can
however write as many lines as you want in the indented description (see the example above with `input_ids`).

#### Writing a multi-line code block

Multi-line code blocks can be useful for displaying examples. They are done between two lines of three backticks as usual in Markdown:


````
```
# first line of code
# second line
# etc
```
````

#### Writing a return block

The return block should be introduced with the `Returns:` prefix, followed by a line return and an indentation.
The first line should be the type of the return, followed by a line return. No need to indent further for the elements
building the return.

Here's an example of a single value return:

```
    Returns:
        `List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token.
```

Here's an example of tuple return, comprising several objects:

```
    Returns:
        `tuple(torch.FloatTensor)` comprising various elements depending on the configuration ([`BertConfig`]) and inputs:
        - **loss** (*optional*, returned when `masked_lm_labels` is provided) `torch.FloatTensor` of shape `(1,)` --
          Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.
        - **prediction_scores** (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) --
          Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
```
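
Putting the `Args:` and `Returns:` conventions together, a complete docstring could look like the sketch below. The function `count_tokens` and its parameters are hypothetical and exist only to illustrate the layout:

```python
from typing import Optional


def count_tokens(text: str, separator: Optional[str] = None, max_tokens: int = 1000) -> int:
    """
    Count the tokens obtained by splitting `text` (hypothetical helper, for illustration only).

    Args:
        text (`str`):
            The text to split into tokens.
        separator (`str`, *optional*):
            The string to split on. When omitted, `text` is split on any run of whitespace.
        max_tokens (`int`, *optional*, defaults to 1000):
            The largest count that is reported; longer inputs are capped at this value.

    Returns:
        `int`: The number of tokens in `text`, capped at `max_tokens`.
    """
    tokens = text.split(separator)
    return min(len(tokens), max_tokens)
```

Note how `separator` omits "defaults to \`None\`" while `max_tokens` states its default, as described above.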

#### Adding an image

Because the repository is growing rapidly, it is important not to add files that would significantly weigh it down. This includes images, videos and other non-text files. We prefer to place these files in a dataset hosted on hf.co, like
the ones hosted on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing), and reference
them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images).
If you are an external contributor, feel free to add the images to your PR and ask a Hugging Face member to migrate your images
to this dataset.

## Writing documentation examples

The syntax for Example docstrings can look as follows:

```
    Example:

    ```py
    >>> from datasets import load_dataset
    >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
    >>> def add_prefix(example):
    ...     example["text"] = "Review: " + example["text"]
    ...     return example
    >>> ds = ds.map(add_prefix)
    >>> ds[0:3]["text"]
    ['Review: compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .',
        'Review: the soundtrack alone is worth the price of admission .',
        'Review: rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .']

    # process a batch of examples
    >>> ds = ds.map(lambda example: tokenizer(example["text"]), batched=True)
    # set number of processors
    >>> ds = ds.map(add_prefix, num_proc=4)
    ```
```

The docstring should give a minimal, clear example of how the respective class or function is to be used in practice and also include the expected (ideally sensible) output.
Often, readers will try out the example before even going through the function 
or class definitions. Therefore, it is of utmost importance that the example 
works as expected.
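
One way to check that an example produces the output it claims is Python's built-in `doctest` module. The sketch below is an assumption about how you might check a docstring locally, not a description of the project's tooling. The `add_prefix` function mirrors the example above but works on a plain dict, so it runs without downloading a dataset:

```python
import doctest


def add_prefix(example):
    """
    Prepend "Review: " to the `text` field of an example.

    Example:

    >>> add_prefix({"text": "the soundtrack alone is worth the price of admission ."})
    {'text': 'Review: the soundtrack alone is worth the price of admission .'}
    """
    example["text"] = "Review: " + example["text"]
    return example


# Run every `>>>` example found in the docstring and compare the result
# against the expected output written below it.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(add_prefix, "add_prefix", globs={"add_prefix": add_prefix}):
    runner.run(test)
results = runner.summarize(verbose=False)
print(results)
```

If the expected output in the docstring drifts from what the code returns, `results.failed` becomes non-zero.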


================================================
FILE: docs/source/_config.py
================================================
# docstyle-ignore
INSTALL_CONTENT = """
# Datasets installation
! pip install datasets transformers
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/datasets.git
"""

notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
default_branch_name = "main"
version_prefix = ""


================================================
FILE: docs/source/_redirects.yml
================================================
# This first_section was backported from nginx
loading_datasets: loading
share_dataset: share
quicktour: quickstart
dataset_streaming: stream
torch_tensorflow: use_dataset
splits: loading#slice-splits
processing: process
faiss_and_ea: faiss_es
features: about_dataset_features
exploring: access
package_reference/logging_methods: package_reference/utilities
# end of first_section


================================================
FILE: docs/source/_toctree.yml
================================================
- sections: 
  - local: index
    title: 🤗 Datasets
  - local: quickstart
    title: Quickstart
  - local: installation
    title: Installation
  title: Get started
- sections:
  - local: tutorial
    title: Overview
  - local: load_hub
    title: Load a dataset from the Hub
  - local: access
    title: Know your dataset
  - local: use_dataset
    title: Preprocess
  - local: create_dataset
    title: Create a dataset
  - local: upload_dataset
    title: Share a dataset to the Hub
  title: "Tutorials"
- sections:
  - local: how_to
    title: Overview
  - sections:
    - local: loading
      title: Load
    - local: process
      title: Process
    - local: stream
      title: Stream
    - local: use_with_pytorch
      title: Use with PyTorch
    - local: use_with_tensorflow
      title: Use with TensorFlow
    - local: use_with_numpy
      title: Use with NumPy
    - local: use_with_jax
      title: Use with JAX
    - local: use_with_pandas
      title: Use with Pandas
    - local: use_with_polars
      title: Use with Polars
    - local: use_with_pyarrow
      title: Use with PyArrow
    - local: use_with_spark
      title: Use with Spark
    - local: cache
      title: Cache management
    - local: filesystems
      title: Cloud storage
    - local: faiss_es
      title: Search index
    - local: cli
      title: CLI
    - local: troubleshoot
      title: Troubleshooting
    title: "General usage"
  - sections:
    - local: audio_load
      title: Load audio data
    - local: audio_process
      title: Process audio data
    - local: audio_dataset
      title: Create an audio dataset
    title: "Audio"
  - sections:
    - local: image_load
      title: Load image data
    - local: image_process
      title: Process image data
    - local: image_dataset
      title: Create an image dataset
    - local: depth_estimation
      title: Depth estimation
    - local: image_classification
      title: Image classification
    - local: semantic_segmentation
      title: Semantic segmentation
    - local: object_detection
      title: Object detection
    - local: video_load
      title: Load video data
    - local: video_dataset
      title: Create a video dataset
    - local: document_load
      title: Load document data
    - local: document_dataset
      title: Create a document dataset
    - local: nifti_dataset
      title: Create a medical imaging dataset
    title: "Vision"
  - sections:
    - local: nlp_load
      title: Load text data
    - local: nlp_process
      title: Process text data
    title: "Text"
  - sections:
    - local: tabular_load
      title: Load tabular data
    title: "Tabular"
  - sections:
    - local: share
      title: Share
    - local: dataset_card
      title: Create a dataset card
    - local: repository_structure
      title: Structure your repository
    title: "Dataset repository"
  title: "How-to guides"
- sections:
  - local: about_arrow
    title: Datasets 🤝 Arrow
  - local: about_cache
    title: The cache
  - local: about_mapstyle_vs_iterable
    title: Dataset or IterableDataset
  - local: about_dataset_features
    title: Dataset features
  - local: about_dataset_load
    title: Build and load
  - local: about_map_batch
    title: Batch mapping
  title: "Conceptual guides"
- sections:
  - local: package_reference/main_classes
    title: Main classes
  - local: package_reference/builder_classes
    title: Builder classes
  - local: package_reference/loading_methods
    title: Loading methods
  - local: package_reference/table_classes
    title: Table Classes
  - local: package_reference/utilities
    title: Utilities
  title: "Reference"


================================================
FILE: docs/source/about_arrow.md
================================================
# Datasets 🤝 Arrow

## What is Arrow?

[Arrow](https://arrow.apache.org/) enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:

* Arrow's standard format allows [zero-copy reads](https://en.wikipedia.org/wiki/Zero-copy) which removes virtually all serialization overhead.
* Arrow is language-agnostic so it supports different programming languages.
* Arrow is column-oriented so it is faster at querying and processing slices or columns of data.
* Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.
* Arrow supports many, possibly nested, column types.
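
To make the column-oriented point concrete, the following plain-Python sketch contrasts a row layout with a column layout. It only illustrates the idea and is not how Arrow stores data internally; the `rows` and `columns` variables are made up for this example:

```python
# Row-oriented: one record per example; reading one field touches every record.
rows = [
    {"id": 0, "text": "a", "label": 1},
    {"id": 1, "text": "b", "label": 0},
    {"id": 2, "text": "c", "label": 1},
]
labels_from_rows = [row["label"] for row in rows]

# Column-oriented: one contiguous sequence per field; a column is a single lookup.
columns = {
    "id": [0, 1, 2],
    "text": ["a", "b", "c"],
    "label": [1, 0, 1],
}
labels_from_columns = columns["label"]

# Slicing rows 1..2 of a single column does not touch the other columns.
label_slice = columns["label"][1:3]
```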

## Memory-mapping

🤗 Datasets uses Arrow for its local caching system. It allows datasets to be backed by an on-disk cache, which is memory-mapped for fast lookup.
This architecture allows for large datasets to be used on machines with relatively small device memory.

For example, loading the full English Wikipedia dataset only takes a few MB of RAM:

```python
>>> import os; import psutil; import timeit
>>> from datasets import load_dataset

# Process.memory_info is expressed in bytes, so convert to megabytes 
>>> mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
>>> wiki = load_dataset("wikimedia/wikipedia", "20220301.en", split="train")
>>> mem_after = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

>>> print(f"RAM memory used: {(mem_after - mem_before)} MB")
RAM memory used: 50 MB
```

This is possible because the Arrow data is actually memory-mapped from disk, and not loaded in memory.
Memory-mapping allows access to data on disk, and leverages virtual memory capabilities for fast lookups.
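
The idea can be sketched with Python's standard `mmap` module. This is only an illustration of memory-mapping a file, using a throwaway temporary file as a stand-in for a cache file; 🤗 Datasets relies on Arrow's own memory-mapping rather than this code:

```python
import mmap
import os
import tempfile

# Write a small binary file to stand in for an on-disk cache file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"0123456789" * 1000)
    path = f.name

with open(path, "rb") as f:
    # Map the file into virtual memory; pages are read from disk lazily on access.
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mapped:
        total_size = len(mapped)
        # Slicing touches only the pages backing these bytes, not the whole file.
        chunk = mapped[5000:5010]

os.remove(path)
```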

## Performance

Iterating over a memory-mapped dataset using Arrow is fast. Iterating over Wikipedia on a laptop gives you speeds of several Gbit/s:

```python
>>> s = """batch_size = 1000
... for batch in wiki.iter(batch_size):
...     ...
... """

>>> elapsed_time = timeit.timeit(stmt=s, number=1, globals=globals())
>>> print(f"Time to iterate over the {wiki.dataset_size >> 30} GB dataset: {elapsed_time:.1f} sec, "
...       f"ie. {float(wiki.dataset_size >> 27)/elapsed_time:.1f} Gb/s")
Time to iterate over the 18 GB dataset: 31.8 sec, ie. 4.8 Gb/s
```
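
The bit shifts in the snippet above are unit conversions: `>> 30` divides a byte count by 2^30 (bytes to GiB), and `>> 27` divides by 2^27, which is the same as multiplying by 8 and dividing by 2^30 (bytes to Gibit). A small sketch with made-up numbers, not measurements:

```python
# Hypothetical values chosen for illustration; they are not benchmark results.
dataset_size = 20 * 2**30  # size in bytes (20 GiB)
elapsed_time = 40.0        # seconds

size_gib = dataset_size >> 30    # bytes -> GiB (integer division by 2**30)
size_gibit = dataset_size >> 27  # bytes -> Gibit, since 8 / 2**30 == 1 / 2**27

throughput = size_gibit / elapsed_time  # Gibit per second
print(f"{size_gib} GiB in {elapsed_time:.1f} sec, i.e. {throughput:.1f} Gibit/s")
```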


================================================
FILE: docs/source/about_cache.mdx
================================================
# The cache

The cache is one of the reasons why 🤗 Datasets is so efficient. It stores previously downloaded and processed datasets so when you need to use them again, they are reloaded directly from the cache. This avoids having to download a dataset all over again or reapply processing functions. Even after you close and start another Python session, 🤗 Datasets will reload your dataset directly from the cache!

## Fingerprint

How does the cache keep track of which transforms are applied to a dataset? Well, 🤗 Datasets assigns a fingerprint to the cache file. A fingerprint keeps track of the current state of a dataset. The initial fingerprint is computed using a hash from the Arrow table, or a hash of the Arrow files if the dataset is on disk. Subsequent fingerprints are computed by combining the fingerprint of the previous state and a hash of the latest transform applied.

> [!TIP]
> Transforms are any of the processing methods from the [How-to Process](./process) guides such as [`Dataset.map`] or [`Dataset.shuffle`].

Here is what the actual fingerprints look like:

```py
>>> from datasets import Dataset
>>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1})
>>> print(dataset1._fingerprint, dataset2._fingerprint)
d19493523d95e2dc 5b86abacd4b42434
```
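
The chaining described above can be sketched in a few lines. This is a simplified illustration built on `hashlib`, not the library's actual implementation: the function names, the use of SHA-256, and the 16-character truncation are all assumptions made for this example.

```python
import hashlib


def hash_bytes(data: bytes) -> str:
    # Hypothetical helper: a short hex digest standing in for a fingerprint.
    return hashlib.sha256(data).hexdigest()[:16]


def update_fingerprint(previous: str, transform_name: str, transform_args: dict) -> str:
    # Combine the previous state's fingerprint with a description of the transform.
    # Sorting the arguments keeps the result independent of keyword order.
    payload = repr((previous, transform_name, sorted(transform_args.items())))
    return hash_bytes(payload.encode("utf-8"))


initial = hash_bytes(b"contents of the Arrow table")
after_map = update_fingerprint(initial, "map", {"batched": True, "batch_size": 1000})
after_shuffle = update_fingerprint(after_map, "shuffle", {"seed": 42})

# Replaying the same transforms from the same starting state yields the same
# fingerprint, which is what lets a later session find the existing cache file.
replayed = update_fingerprint(
    update_fingerprint(initial, "map", {"batch_size": 1000, "batched": True}),
    "shuffle",
    {"seed": 42},
)
```

Because each fingerprint depends on the previous one, changing any transform in the chain changes every fingerprint after it.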

In order for a transform to be hashable, it needs to be picklable by [dill](https://dill.readthedocs.io/en/latest/) or [pickle](https://docs.python.org/3/library/pickle). 

When you use a non-hashable transform, 🤗 Datasets uses a random fingerprint instead and raises a warning. The non-hashable transform is considered different from the previous transforms. As a result, 🤗 Datasets will recompute all the transforms. Make sure your transforms are serializable with pickle or dill to avoid this!

An example of when 🤗 Datasets recomputes everything is when caching is disabled. When this happens, the cache files are generated every time and they get written to a temporary directory. Once your Python session ends, the cache files in the temporary directory are deleted. A random hash is assigned to these cache files, instead of a fingerprint. 

> [!TIP]
> When caching is disabled, use [`Dataset.save_to_disk`] to save your transformed dataset or it will be deleted once the session ends.

## Hashing

The fingerprint of a dataset is updated by hashing the function passed to `map` as well as the `map` parameters (`batch_size`, `remove_columns`, etc.).

You can check the hash of any Python object using the [`fingerprint.Hasher`]:

```py
>>> from datasets.fingerprint import Hasher
>>> my_func = lambda example: {"length": len(example["text"])}
>>> print(Hasher.hash(my_func))
3d35e2b3e94c81d6
```

The hash is computed by dumping the object using a `dill` pickler and hashing the dumped bytes.
The pickler recursively dumps all the variables used in your function, so any change you make to an object that is used in your function will cause the hash to change.

If one of your functions doesn't seem to have the same hash across sessions, it means at least one of its variables contains a Python object that is not deterministic.
When this happens, feel free to hash any object you find suspicious to try to find the object that caused the hash to change.
For example, if you use a list for which the order of its elements is not deterministic across sessions, then the hash won't be the same across sessions either.
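
As a sketch of this failure mode, the snippet below hashes pickled objects with the standard library. It assumes `pickle` and `hashlib` as stand-ins for the `dill`-based `Hasher`, so the digests differ from what 🤗 Datasets computes, but the sensitivity to element order is the same idea:

```python
import hashlib
import pickle


def toy_hash(obj) -> str:
    # Hypothetical stand-in for Hasher.hash: dump the object, then hash the bytes.
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()[:16]


stopwords_a = ["the", "a", "of"]
stopwords_b = ["of", "the", "a"]  # same elements, different order

# The two lists pickle to different bytes, so their hashes differ.
different = toy_hash(stopwords_a) != toy_hash(stopwords_b)

# Sorting before use gives a canonical order and therefore a stable hash.
same_after_sort = toy_hash(sorted(stopwords_a)) == toy_hash(sorted(stopwords_b))
```

If a list like this is built from a `set` or from directory listings, sorting it before it is captured by your function is one way to keep the hash stable across sessions.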


================================================
FILE: docs/source/about_dataset_features.mdx
================================================
# Dataset features

[`Features`] defines the internal structure of a dataset. It is used to specify the underlying serialization format. What's more interesting to you though is that [`Features`] contains high-level information about everything from the column names and types, to the [`ClassLabel`]. You can think of [`Features`] as the backbone of a dataset.

The [`Features`] format is simple: `dict[column_name, column_type]`. It is a dictionary of column name and column type pairs. The column type provides a wide range of options for describing the type of data you have.

Let's have a look at the features of the MRPC dataset from the GLUE benchmark:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train')
>>> dataset.features
{'idx': Value('int32'),
 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
 'sentence1': Value('string'),
 'sentence2': Value('string'),
}
```

The [`Value`] feature tells 🤗 Datasets:

- The `idx` data type is `int32`.
- The `sentence1` and `sentence2` data types are `string`.

🤗 Datasets supports many other data types such as `bool`, `float32` and `binary` to name just a few.

> [!TIP]
> Refer to [`Value`] for a full list of supported data types.

The [`ClassLabel`] feature informs 🤗 Datasets that the `label` column contains two classes. The classes are labeled `not_equivalent` and `equivalent`. Labels are stored as integers in the dataset. When you retrieve the labels, [`ClassLabel.int2str`] and [`ClassLabel.str2int`] carry out the conversion from integer value to label name, and vice versa.
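
As a plain-Python sketch of the two conversions (an illustration of the mapping only, with hypothetical standalone functions; in practice you would call [`ClassLabel.int2str`] and [`ClassLabel.str2int`] on the feature itself):

```python
names = ["not_equivalent", "equivalent"]

# Labels are stored as integers: the integer is the position of the name in `names`.
str_to_int = {name: index for index, name in enumerate(names)}


def int2str(value: int) -> str:
    return names[value]


def str2int(name: str) -> int:
    return str_to_int[name]


stored = [str2int("equivalent"), str2int("not_equivalent")]
decoded = [int2str(value) for value in stored]
```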

If your data type contains a list of objects, then you want to use the [`List`] feature. Remember the SQuAD dataset?

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset('rajpurkar/squad', split='train')
>>> dataset.features
{'id': Value('string'),
 'title': Value('string'),
 'context': Value('string'),
 'question': Value('string'),
 'answers': {'text': List(Value('string')),
  'answer_start': List(Value('int32'))}}
```

The `answers` field is constructed using a dict of features because it contains two subfields, `text` and `answer_start`, which are lists of `string` and `int32`, respectively.

> [!TIP]
> See the [flatten](./process#flatten) section to learn how you can extract the nested subfields as their own independent columns.

The array feature type is useful for creating arrays of various sizes. You can create arrays with two dimensions using [`Array2D`], and even arrays with five dimensions using [`Array5D`].

```py
>>> from datasets import Array2D, Features
>>> features = Features({'a': Array2D(shape=(1, 3), dtype='int32')})
```

The array type also allows the first dimension of the array to be dynamic. This is useful for handling sequences with variable lengths such as sentences, without having to pad or truncate the input to a uniform shape.

```py
>>> from datasets import Array3D, Features
>>> features = Features({'a': Array3D(shape=(None, 5, 2), dtype='int32')})
```
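
To illustrate what a shape such as `(None, 5, 2)` accepts, here is a plain-Python sketch that checks nested lists against a shape whose first dimension may be `None`. The `matches_shape` helper is hypothetical and is not part of 🤗 Datasets; the library performs its own validation when encoding examples.

```python
def matches_shape(data, shape):
    # Hypothetical helper: True if the nested lists in `data` fit `shape`,
    # where None means "any length" for that dimension.
    if not shape:
        return not isinstance(data, list)  # reached a scalar leaf
    if not isinstance(data, list):
        return False
    expected, rest = shape[0], shape[1:]
    if expected is not None and len(data) != expected:
        return False
    return all(matches_shape(item, rest) for item in data)


shape = (None, 5, 2)
short_sequence = [[[0, 0]] * 5] * 3   # 3 x 5 x 2
long_sequence = [[[0, 0]] * 5] * 40   # 40 x 5 x 2
wrong_inner = [[[0, 0]] * 4] * 3      # 3 x 4 x 2, second dimension is not 5
```

Both `short_sequence` and `long_sequence` fit the shape because only the first dimension is free; `wrong_inner` does not.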

## Audio feature

Audio datasets have a column with type [`Audio`], which contains three important fields:

- `array`: the decoded audio data represented as a 1-dimensional array.
- `path`: the path to the downloaded audio file.
- `sampling_rate`: the sampling rate of the audio data.

When you load an audio dataset and call the audio column, the [`Audio`] feature automatically decodes and resamples the audio file:

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset[0]["audio"]
<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
```

> [!WARNING]
> Index into an audio dataset using the row index first and then the `audio` column - `dataset[0]["audio"]` - to avoid decoding and resampling all the audio files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.

With `decode=False`, the [`Audio`] type simply gives you the path or the bytes of the audio file, without decoding it into a torchcodec `AudioDecoder` object:

```py
>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train").cast_column("audio", Audio(decode=False))
>>> dataset[0]
{'audio': {'bytes': None,
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'},
 'english_transcription': 'I would like to set up a joint account with my partner',
 'intent_class': 11,
 'lang_id': 4,
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'transcription': 'I would like to set up a joint account with my partner'}
```

## Image feature

Image datasets have a column with type [`Image`], which loads `PIL.Image` objects from images stored as bytes.

When you load an image dataset and call the image column, the [`Image`] feature automatically decodes the image file:

```py
>>> from datasets import load_dataset, Image

>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train")
>>> dataset[0]["image"]
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500 at 0x125506CF8>
```

> [!WARNING]
> Index into an image dataset using the row index first and then the `image` column - `dataset[0]["image"]` - to avoid decoding all the image files in the dataset. Otherwise, this can be a slow and time-consuming process if you have a large dataset.

With `decode=False`, the [`Image`] type simply gives you the path or the bytes of the image file, without decoding it into a `PIL.Image`:

```py
>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False))
>>> dataset[0]["image"]
{'bytes': None,
 'path': '/Users/username/.cache/huggingface/datasets/downloads/extracted/772e7c1fba622cff102b85dd74bcce46e8168634df4eaade7bedd3b8d91d3cd7/train/healthy/healthy_train.265.jpg'}
```

Depending on the dataset, you may get the path to the local downloaded image, or the content of the image as bytes if the dataset is not made of individual files.

You can also define a dataset of images from numpy arrays:

```python
>>> import numpy as np; from datasets import Dataset, Features, Image
>>> ds = Dataset.from_dict({"i": [np.zeros(shape=(16, 16, 3), dtype=np.uint8)]}, features=Features({"i": Image()}))
```

In this case, the numpy arrays are encoded into PNG (or TIFF if the precision of the pixel values is important).

For multi-channel arrays like RGB or RGBA, only uint8 is supported. If you use a larger precision, you get a warning and the array is downcast to uint8.
For gray-scale images you can use the integer or float precision you want as long as it is compatible with `Pillow`. A warning is shown if your image integer or float precision is too high, and in this case the array is downcast: an int64 array is downcast to int32, and a float64 array is downcast to float32.

## Json feature

Datasets are based on Arrow, which is a columnar format, and therefore they expect every example to have the same type and subtypes, and dictionaries to have the same keys and value types.
Loading a dataset raises an error when fields have mismatching types, and fills missing fields in dictionaries with None so all dictionaries have the same keys and value types.

To avoid this and allow mixed-types without errors, you can use `on_mixed_types="use_json"` or specify `features=` with a [`Json`] type:

```python
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
  ...
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, "foo", {"subfield": "bar"}]
```

This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:

```python
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]]  # missing fields are filled with None

>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]]  # OK
```
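
One way to picture the difference between the two outputs above is the following plain-Python sketch. It does not use 🤗 Datasets and makes no claim about how [`Json`] stores values internally; `fill_missing` and `json_roundtrip` are hypothetical helpers that only mimic the two observable behaviors.

```python
import json

rows = [{"b": 0}, {"c": 0}]


def fill_missing(dicts):
    """Give every dict the same keys, using None for absent ones (struct-like behavior)."""
    keys = []
    for d in dicts:
        for key in d:
            if key not in keys:
                keys.append(key)
    return [{key: d.get(key) for key in keys} for d in dicts]


def json_roundtrip(dicts):
    """Serialize each dict on its own, so its keys are preserved as they are."""
    return [json.loads(json.dumps(d)) for d in dicts]


print(fill_missing(rows))    # [{'b': 0, 'c': None}, {'b': None, 'c': 0}]
print(json_roundtrip(rows))  # [{'b': 0}, {'c': 0}]
```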

Another example with tool calling data and the `on_mixed_types="use_json"` argument (useful to not have to specify `features=` manually):

```python
>>> messages = [
...     {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
...     {"role": "assistant", "tool_calls": [
...         {"type": "function", "function": {
...             "name": "control_light",
...             "arguments": {"room": "living room", "state": "on"}
...         }},
...         {"type": "function", "function": {
...             "name": "play_music",
...             "arguments": {"playlist": "electronic"}  # mixed-type here since keys ["playlist"] and ["room", "state"] are different
...         }}]
...     },
...     {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
...     {"role": "tool", "name": "play_music", "content": "The music is now playing."},
...     {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{"room": "living room", "state": "on"}
```


================================================
FILE: docs/source/about_dataset_load.mdx
================================================
# Build and load

Nearly every deep learning workflow begins with loading a dataset, which makes it one of the most important steps. With 🤗 Datasets, there are more than 900 datasets available to help you get started with your NLP task. All you have to do is call [`load_dataset`] to take your first step. This function is a true workhorse in every sense because it builds and loads every dataset you use.

## ELI5: `load_dataset`

Let's begin with a basic Explain Like I'm Five.

A dataset is a directory that contains:

- Some data files in generic formats (JSON, CSV, Parquet, text, etc.)
- A dataset card named `README.md` that contains documentation about the dataset as well as a YAML header to define the dataset's tags and configurations

The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
The Hub is a central repository where all the Hugging Face datasets and models are stored.

If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, txt, etc.).
Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based on the data file format. There is one builder per data file format in 🤗 Datasets:

* [`datasets.packaged_modules.text.Text`] for text
* [`datasets.packaged_modules.csv.Csv`] for CSV and TSV
* [`datasets.packaged_modules.json.Json`] for JSON and JSONL
* [`datasets.packaged_modules.parquet.Parquet`] for Parquet
* [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
* [`datasets.packaged_modules.sql.Sql`] for SQL databases
* [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
* [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders
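
As a rough illustration of inferring a builder from file extensions, here is a minimal sketch in plain Python. The mapping and the `guess_builder` helper are hypothetical simplifications based on the list above; they are not the library's actual inference logic, which handles many more cases.

```python
from pathlib import PurePosixPath

# Simplified extension table derived from the list above (assumption, not exhaustive).
EXTENSION_TO_BUILDER = {
    ".txt": "text",
    ".csv": "csv",
    ".tsv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".arrow": "arrow",
}


def guess_builder(data_files):
    """Return the single builder name matching all the given data files."""
    found = {EXTENSION_TO_BUILDER.get(PurePosixPath(f).suffix.lower()) for f in data_files}
    found.discard(None)  # ignore files with unknown extensions
    if len(found) != 1:
        raise ValueError(f"could not infer a single builder, candidates: {sorted(found)}")
    return found.pop()


print(guess_builder(["data/train.csv", "data/test.tsv"]))  # csv
```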

> [!TIP]
> Read the [Share](./upload_dataset) section to learn more about how to share a dataset.

🤗 Datasets downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive.
If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.

Now that you have a high-level understanding about how datasets are built, let's take a closer look at the nuts and bolts of how all this works.

## Building a dataset

When you load a dataset for the first time, 🤗 Datasets takes the raw data file and builds it into a table of rows and typed columns. There are two main classes responsible for building a dataset: [`BuilderConfig`] and [`DatasetBuilder`].


<div class="flex justify-center">
   <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/builderconfig.png"/>
</div>

### BuilderConfig[[datasets-builderconfig]]

[`BuilderConfig`] is the configuration class of [`DatasetBuilder`]. The [`BuilderConfig`] contains the following basic attributes about a dataset:

| Attribute     | Description                                                  |
|---------------|--------------------------------------------------------------|
| `name`        | Short name of the dataset.                                   |
| `version`     | Dataset version identifier.                                  |
| `data_dir`    | Stores the path to a local folder containing the data files. |
| `data_files`  | Stores paths to local data files.                            |
| `description` | Description of the dataset.                                  |

If you want to add additional attributes to your dataset such as the class labels, you can subclass the base [`BuilderConfig`] class. There are two ways to populate the attributes of a [`BuilderConfig`] class or subclass:

- Provide a list of predefined [`BuilderConfig`] class (or subclass) instances in the dataset's [`DatasetBuilder.BUILDER_CONFIGS`] attribute.

- When you call [`load_dataset`], any keyword arguments that are not specific to the method will be used to set the associated attributes of the [`BuilderConfig`] class. This will override the predefined attributes if a specific configuration was selected.

You can also set the [`DatasetBuilder.BUILDER_CONFIG_CLASS`] to any custom subclass of [`BuilderConfig`].
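
The two ways of populating a configuration can be pictured with a small standalone sketch. `SketchConfig`, `PREDEFINED` and `select_config` are hypothetical names used only for illustration; the real [`BuilderConfig`] has more behavior than this dataclass.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Stand-in for a configuration class with the basic attributes from the table above."""
    name: str = "default"
    version: str = "0.0.0"
    data_dir: Optional[str] = None
    data_files: Optional[tuple] = None
    description: Optional[str] = None


# First way: a list of predefined configurations, looked up by name.
PREDEFINED = {
    config.name: config
    for config in [SketchConfig(name="small", description="A small subset"), SketchConfig(name="full")]
}


def select_config(name, **overrides):
    """Second way: keyword arguments override attributes of the selected configuration."""
    return replace(PREDEFINED[name], **overrides)


config = select_config("small", data_dir="path/to/data")
print(config.name, config.data_dir, config.description)
```

The predefined instance is left untouched, since `dataclasses.replace` returns a new object.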

### DatasetBuilder[[datasets-datasetbuilder]]

[`DatasetBuilder`] accesses all the attributes inside [`BuilderConfig`] to build the actual dataset.

<div class="flex justify-center">
   <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/datasetbuilder.png"/>
</div>

There are three main methods in [`DatasetBuilder`]:

1. [`DatasetBuilder._info`] is in charge of defining the dataset attributes. When you call `dataset.info`, 🤗 Datasets returns the information stored here. Likewise, the [`Features`] are also specified here. Remember, the [`Features`] are like the skeleton of the dataset. It provides the names and types of each column.

2. [`DatasetBuilder._split_generators`] downloads or retrieves the requested data files, organizes them into splits, and defines specific arguments for the generation process. This method has a [`DownloadManager`] that downloads files or fetches them from your local filesystem. Within the [`DownloadManager`], there is a [`DownloadManager.download_and_extract`] method that takes the URLs of the original data files and downloads the requested files. Accepted inputs include: a single URL or path, or a list/dictionary of URLs or paths. Any compressed file types like TAR, GZIP and ZIP archives will be automatically extracted.

   Once the files are downloaded, [`SplitGenerator`] organizes them into splits. The [`SplitGenerator`] contains the name of the split, and any keyword arguments that are provided to the [`DatasetBuilder._generate_examples`] method. The keyword arguments can be specific to each split, and typically comprise at least the local path to the data files for each split.

3. [`DatasetBuilder._generate_examples`] reads and parses the data files for a split. Then it yields dataset examples according to the format specified in the `features` from [`DatasetBuilder._info`]. The input of [`DatasetBuilder._generate_examples`] is the `filepath` provided in the keyword arguments of the previous method.

   The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an `ArrowWriter` buffer. This means the generated samples are written by batch. If your dataset samples consume a lot of memory (images or videos), then make sure to specify a low value for the `DEFAULT_WRITER_BATCH_SIZE` attribute in [`DatasetBuilder`]. We recommend not exceeding a size of 200 MB.
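
The three steps above can be sketched in plain Python to show how they fit together. This is not a real [`DatasetBuilder`] subclass: the function names, the in-memory "files" and the `build` driver are all hypothetical, and the list used as a buffer only imitates the batched writing described above.

```python
def info():
    # Step 1: declare the column names and types.
    return {"features": {"text": "string", "label": "int64"}}


def split_generators():
    # Step 2: map each split to the keyword arguments of the generation step.
    # Hypothetical in-memory lines stand in for downloaded file paths.
    return {
        "train": {"lines": ["good\t1", "bad\t0", "fine\t1"]},
        "test": {"lines": ["meh\t0"]},
    }


def generate_examples(lines):
    # Step 3: parse the raw data lazily and yield one example at a time.
    for idx, line in enumerate(lines):
        text, label = line.split("\t")
        yield idx, {"text": text, "label": int(label)}


def build(writer_batch_size=2):
    features = info()["features"]
    splits, flushes = {}, 0
    for split_name, gen_kwargs in split_generators().items():
        written, buffer = [], []
        for _, example in generate_examples(**gen_kwargs):
            if set(example) != set(features):
                raise ValueError(f"example does not match features: {example}")
            buffer.append(example)
            if len(buffer) >= writer_batch_size:
                # Flush a full batch, so at most one batch is held in memory.
                written.extend(buffer)
                buffer.clear()
                flushes += 1
        if buffer:
            written.extend(buffer)  # flush the last, partial batch
            flushes += 1
        splits[split_name] = written
    return splits, flushes


splits, flushes = build()
print({name: len(rows) for name, rows in splits.items()}, flushes)
```

A smaller `writer_batch_size` means more flushes but fewer examples held in memory at once, which is the trade-off behind lowering `DEFAULT_WRITER_BATCH_SIZE` for large samples.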

## Maintaining integrity

To ensure a dataset is complete, [`load_dataset`] will perform a series of tests on the downloaded files to make sure everything is there. This way, you don't encounter any surprises when your requested dataset doesn't get generated as expected. [`load_dataset`] verifies:

- The number of splits in the generated `DatasetDict`.
- The number of samples in each split of the generated `DatasetDict`.
- The list of downloaded files.
- The SHA256 checksums of the downloaded files (disabled by default).

If the dataset doesn't pass the verifications, it is likely that the dataset author made some changes in the data files.

In this case, an error is raised to alert that the dataset has changed.
To ignore the error, one needs to specify `verification_mode="no_checks"` in [`load_dataset`].
Anytime you see a verification error, feel free to open a discussion or pull request in the corresponding dataset "Community" tab, so that the integrity checks for that dataset are updated.
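
To make the checks listed above more concrete, here is a minimal sketch that compares recorded metadata with what was actually downloaded and generated. Everything in it is hypothetical (`VerificationError`, `verify`, the layout of `expected`); it only mirrors the list of checks, with checksums made opt-in to match "disabled by default".

```python
import hashlib


class VerificationError(Exception):
    """Raised when the built dataset does not match the recorded metadata (hypothetical name)."""


def sha256_of(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def verify(expected, downloaded_files, splits, check_checksums=False):
    # The list of downloaded files.
    if set(downloaded_files) != set(expected["checksums"]):
        raise VerificationError("unexpected or missing downloaded files")
    # The SHA256 checksums of the downloaded files (opt-in).
    if check_checksums:
        for name, content in downloaded_files.items():
            if sha256_of(content) != expected["checksums"][name]:
                raise VerificationError(f"checksum mismatch for {name}")
    # The splits that were generated.
    if set(splits) != set(expected["num_examples"]):
        raise VerificationError("unexpected or missing splits")
    # The number of samples in each split.
    for name, rows in splits.items():
        if len(rows) != expected["num_examples"][name]:
            raise VerificationError(f"unexpected number of examples in split {name}")


files = {"train.csv": b"a,b\n1,2\n"}
expected = {
    "checksums": {"train.csv": sha256_of(files["train.csv"])},
    "num_examples": {"train": 1},
}
verify(expected, files, {"train": [{"a": 1, "b": 2}]}, check_checksums=True)  # passes silently
```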

## Security

The dataset repositories on the Hub are scanned for malware, see more information [here](https://huggingface.co/docs/hub/security#malware-scanning).


================================================
FILE: docs/source/about_map_batch.mdx
================================================
# Batch mapping

Combining the utility of [`Dataset.map`] with batch mode is very powerful. It allows you to speed up processing, and freely control the size of the generated dataset. 

## Need for speed

The primary objective of batch mapping is to speed up processing. Often, it is faster to work with batches of data instead of single examples. Naturally, batch mapping lends itself to tokenization. For example, the 🤗 [Tokenizers](https://huggingface.co/docs/tokenizers/python/latest/) library works faster with batches because it parallelizes the tokenization of all the examples in a batch.

## Input size != output size

The ability to control the size of the generated dataset can be leveraged for many interesting use-cases. In the How-to [map](#map) section, there are examples of using batch mapping to:

- Split long sentences into shorter chunks.
- Augment a dataset with additional tokens.

It is helpful to understand how this works, so you can come up with your own ways to use batch mapping. At this point, you may be wondering how you can control the size of the generated dataset. The answer is: **the mapped function does not have to return an output batch of the same size**.

In other words, your mapped function input can be a batch of size `N` and return a batch of size `M`. The output `M` can be greater than or less than `N`. This means you can concatenate your examples, divide them up, and even add more examples!

However, remember that all values in the output dictionary must contain the **same number of elements** as the other fields in the output dictionary. Otherwise, it is not possible to define the number of examples in the output returned by the mapped function. The number can vary between successive batches processed by the mapped function. For a single batch though, all values of the output dictionary should have the same length (i.e., the number of elements).
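
The length rule can be sketched with a toy batched `map` over a dictionary of columns. `map_batched` is a hypothetical plain-Python helper, not the implementation of [`Dataset.map`]; it only shows where the "same number of elements" check applies and that the output size may differ from one batch to the next.

```python
def map_batched(columns, fn, batch_size=1000, remove_columns=()):
    """Apply fn to successive batches and check that each output batch is rectangular."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        result = fn(batch)
        # Input columns are kept unless removed; returned columns are added or overwritten.
        merged = {name: values for name, values in batch.items() if name not in remove_columns}
        merged.update(result)
        lengths = {name: len(values) for name, values in merged.items()}
        if len(set(lengths.values())) > 1:
            raise ValueError(f"columns have different lengths: {lengths}")
        for name, values in merged.items():
            output.setdefault(name, []).extend(values)
    return output


columns = {"a": [0, 1, 2]}
doubled = map_batched(columns, lambda batch: {"b": batch["a"] * 2}, remove_columns=["a"])
print(doubled)  # {'b': [0, 1, 2, 0, 1, 2]}
```

Keeping column `"a"` in the same call would raise the `ValueError`, because `"a"` would have 3 elements and `"b"` would have 6.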

For example, from a dataset of 1 column and 3 rows, if you use `map` to return a new column with twice as many rows, then you will have an error.
In this case, you end up with one column with 3 rows, and one column with 6 rows. As you can see, the table will not be valid:

```py
>>> from datasets import Dataset
>>> dataset = Dataset.from_dict({"a": [0, 1, 2]})
>>> dataset.map(lambda batch: {"b": batch["a"] * 2}, batched=True)  # new column with 6 elements: [0, 1, 2, 0, 1, 2]
'ArrowInvalid: Column 1 named b expected length 3 but got length 6'
```

To make it valid, you have to drop one of the columns:

```py
>>> from datasets import Dataset
>>> dataset = Dataset.from_dict({"a": [0, 1, 2]})
>>> dataset_with_duplicates = dataset.map(lambda batch: {"b": batch["a"] * 2}, remove_columns=["a"], batched=True)
>>> len(dataset_with_duplicates)
6
```

Alternatively, you can overwrite the existing column to achieve the same result.
For example, here’s how to duplicate every row in the dataset by overwriting column `"a"`:

```py
>>> from datasets import Dataset
>>> dataset = Dataset.from_dict({"a": [0, 1, 2]})
# overwrites the existing "a" column with duplicated values
>>> duplicated_dataset = dataset.map(
...     lambda batch: {"a": [x for x in batch["a"] for _ in range(2)]},
...     batched=True
... )
>>> duplicated_dataset
Dataset({
    features: ['a'],
    num_rows: 6
})
>>> duplicated_dataset["a"]
[0, 0, 1, 1, 2, 2]
```


================================================
FILE: docs/source/about_mapstyle_vs_iterable.mdx
================================================
# Differences between Dataset and IterableDataset

There are two types of dataset objects, a [`Dataset`] and an [`IterableDataset`].
Whichever type of dataset you choose to use or create depends on the size of the dataset.
In general, an [`IterableDataset`] is ideal for big datasets (think hundreds of GBs!) due to its lazy behavior and speed advantages, while a [`Dataset`] is great for everything else.
This page will compare the differences between a [`Dataset`] and an [`IterableDataset`] to help you pick the right dataset object for you.

## Downloading and streaming

When you have a regular [`Dataset`], you can access it using `my_dataset[0]`. This provides random access to the rows.
Such datasets are also called "map-style" datasets.
For example you can download ImageNet-1k like this and access any row:

```python
from datasets import load_dataset

imagenet = load_dataset("timm/imagenet-1k-wds", split="train")  # downloads the full dataset
print(imagenet[0])
```

But one caveat is that you must have the entire dataset stored on your disk or in memory, which blocks you from accessing datasets bigger than the disk.
Because it can become inconvenient for big datasets, there exists another type of dataset, the [`IterableDataset`].
When you have an `IterableDataset`, you can access it using a `for` loop to load the data progressively as you iterate over the dataset.
This way, only a small fraction of examples is loaded in memory, and you don't write anything on disk.

For example, you can stream the ImageNet-1k dataset without downloading it on disk:

```python
from datasets import load_dataset

imagenet = load_dataset("timm/imagenet-1k-wds", split="train", streaming=True)  # will start loading the data when iterated over
for example in imagenet:
    print(example)
    break
```

Streaming can read online data without writing any file to disk.
For example, you can stream datasets made out of multiple shards, each of which is hundreds of gigabytes, like [C4](https://huggingface.co/datasets/c4) or [LAION-2B](https://huggingface.co/datasets/laion/laion2B-en).
Learn more about how to stream a dataset in the [Dataset Streaming Guide](./stream).

This is not the only difference though, because the "lazy" behavior of an `IterableDataset` is also present when it comes to dataset creation and processing.

## Creating map-style datasets and iterable datasets

You can create a [`Dataset`] using lists or dictionaries, and the data is entirely converted to Arrow so you can easily access any row:
```python
my_dataset = Dataset.from_dict({"col_1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
print(my_dataset[0])
```

To create an `IterableDataset` on the other hand, you must provide a "lazy" way to load the data.
In Python, we generally use generator functions. These functions `yield` one example at a time, which means you can't access a row by slicing it like a regular `Dataset`:
```python
def my_generator(n):
    for i in range(n):
        yield {"col_1": i}

my_iterable_dataset = IterableDataset.from_generator(my_generator, gen_kwargs={"n": 10})
for example in my_iterable_dataset:
    print(example)
    break
```

## Loading local files entirely and progressively

It is possible to convert local or remote data files to an Arrow [`Dataset`] using [`load_dataset`]:
```python
data_files = {"train": ["path/to/data.csv"]}
my_dataset = load_dataset("csv", data_files=data_files, split="train")
print(my_dataset[0])
```

However, this requires a conversion step from CSV to Arrow format, which takes time and disk space if your dataset is big.

To save disk space and skip the conversion step, you can define an `IterableDataset` by streaming from the local files directly.
This way, the data is read progressively from the local files as you iterate over the dataset:

```python
data_files = {"train": ["path/to/data.csv"]}
my_iterable_dataset = load_dataset("csv", data_files=data_files, split="train", streaming=True)
for example in my_iterable_dataset:  # this reads the CSV file progressively as you iterate over the dataset
    print(example)
    break
```

Many file formats are supported, like CSV, JSONL, and Parquet, as well as image and audio files.
You can find more information in the corresponding guides for loading [tabular](./tabular_load), [text](./nlp_load), [vision](./image_load), and [audio](./audio_load) datasets.

## Eager data processing and lazy data processing

When you process a [`Dataset`] object using [`Dataset.map`], the entire dataset is processed immediately and returned.
This is similar to how `pandas` works for example.

```python
my_dataset = my_dataset.map(process_fn)  # process_fn is applied on all the examples of the dataset
print(my_dataset[0])
```

On the other hand, due to the "lazy" nature of an `IterableDataset`, calling [`IterableDataset.map`] does not apply your `map` function over the full dataset.
Instead, your `map` function is applied on-the-fly.

Because of that, you can chain multiple processing steps and they will all run at once when you start iterating over the dataset:

```python
my_iterable_dataset = my_iterable_dataset.map(process_fn_1)
my_iterable_dataset = my_iterable_dataset.filter(filter_fn)
my_iterable_dataset = my_iterable_dataset.map(process_fn_2)

# process_fn_1, filter_fn and process_fn_2 are applied on-the-fly when iterating over the dataset
for example in my_iterable_dataset:  
    print(example)
    break
```
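
The same on-the-fly behavior can be observed with Python's built-in lazy `map` and `filter`, which is a reasonable mental model even though it is not how [`IterableDataset`] is implemented. In this sketch the three functions are hypothetical stand-ins that record when they are called.

```python
calls = []


def process_fn_1(example):
    calls.append("process_fn_1")
    return {"x": example["x"] + 1}


def filter_fn(example):
    calls.append("filter_fn")
    return example["x"] % 2 == 0


def process_fn_2(example):
    calls.append("process_fn_2")
    return {"x": example["x"] * 10}


source = ({"x": i} for i in range(1_000_000))
pipeline = map(process_fn_1, source)
pipeline = filter(filter_fn, pipeline)
pipeline = map(process_fn_2, pipeline)

calls_before_iterating = list(calls)
print(calls_before_iterating)  # [] : nothing has run yet

first = next(pipeline)
print(first)       # {'x': 20}
print(len(calls))  # 5 : only the work needed to produce one example
```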

## Exact and fast approximate shuffling

When you shuffle a [`Dataset`] using [`Dataset.shuffle`], you apply an exact shuffling of the dataset.
It works by taking a list of indices `[0, 1, 2, ... len(my_dataset) - 1]` and shuffling this list.
Then, accessing `my_dataset[0]` returns the row and index defined by the first element of the indices mapping that has been shuffled:
```python
my_dataset = my_dataset.shuffle(seed=42)
print(my_dataset[0])
```
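
The idea of an indices mapping can be shown with a short plain-Python sketch. The names are hypothetical and the random number generator is Python's, not the one used by the library; the point is only that the rows themselves stay in place and every access goes through the shuffled list of indices.

```python
import random

rows = ["row_0", "row_1", "row_2", "row_3", "row_4"]  # the data is never reordered

# Shuffle the list of indices, not the rows.
indices = list(range(len(rows)))
random.Random(42).shuffle(indices)


def get_row(position):
    """Resolve a position in the shuffled view to a row of the original data."""
    return rows[indices[position]]


shuffled_view = [get_row(position) for position in range(len(rows))]
print(shuffled_view)
```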

Since we don't have random access to the rows in the case of an `IterableDataset`, we can't use a shuffled list of indices and access a row at an arbitrary position.
This prevents the use of exact shuffling.
Instead, a fast approximate shuffling is used in [`IterableDataset.shuffle`].
It uses a shuffle buffer to sample random examples iteratively from the dataset.
Since the dataset is still read iteratively, it provides excellent speed performance:
```python
my_iterable_dataset = my_iterable_dataset.shuffle(seed=42, buffer_size=100)
for example in my_iterable_dataset:
    print(example)
    break
```
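
A shuffle buffer can be written in a few lines of plain Python. The sketch below is a generic version of the technique, assuming a hypothetical `buffer_shuffle` helper; it is not the code behind [`IterableDataset.shuffle`] and will not reproduce the library's output for a given seed.

```python
import random


def buffer_shuffle(examples, buffer_size, seed):
    """Yield examples in approximately shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            buffer.append(example)  # fill the buffer first
            continue
        # Emit a random element of the buffer and put the new example in its place.
        i = rng.randrange(buffer_size)
        yield buffer[i]
        buffer[i] = example
    # The source is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(20), buffer_size=5, seed=42))
print(shuffled)
```

Each example is read exactly once and in order, which is why the approach stays fast. The shuffling is only approximate because an example can never be emitted before it has entered the buffer.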

But using a shuffle buffer is not enough to provide a satisfactory shuffling for machine learning model training. So [`IterableDataset.shuffle`] also shuffles the dataset shards if your dataset is made of multiple files or sources:

```python
# Stream from the internet
my_iterable_dataset = load_dataset("deepmind/code_contests", split="train", streaming=True)
my_iterable_dataset.num_shards  # 39

# Stream from local files
data_files = {"train": [f"path/to/data_{i}.csv" for i in range(1024)]}
my_iterable_dataset = load_dataset("csv", data_files=data_files, split="train", streaming=True)
my_iterable_dataset.num_shards  # 1024

# From a generator function
def my_generator(n, sources):
    for source in sources:
        for example_id_for_current_source in range(n):
            yield {"example_id": f"{source}_{example_id_for_current_source}"}

gen_kwargs = {"n": 10, "sources": [f"path/to/data_{i}" for i in range(1024)]}
my_iterable_dataset = IterableDataset.from_generator(my_generator, gen_kwargs=gen_kwargs)
my_iterable_dataset.num_shards  # 1024
```

## Speed differences

Regular [`Dataset`] objects are based on Arrow which provides fast random access to the rows.
Thanks to memory mapping and the fact that Arrow is an in-memory format, reading data from disk doesn't do expensive system calls and deserialization.
It provides even faster data loading when iterating using a `for` loop by iterating on contiguous Arrow record batches.

However as soon as your [`Dataset`] has an indices mapping (via [`Dataset.shuffle`] for example), the speed can become 10x slower.
This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren't reading contiguous chunks of data anymore.
To restore the speed, you'd need to rewrite the entire dataset on your disk again using [`Dataset.flatten_indices`], which removes the indices mapping.
This may take a lot of time depending on the size of your dataset though:

```python
my_dataset[0]  # fast
my_dataset = my_dataset.shuffle(seed=42)
my_dataset[0]  # up to 10x slower
my_dataset = my_dataset.flatten_indices()  # rewrite the shuffled dataset on disk as contiguous chunks of data
my_dataset[0]  # fast again
```


In this case, we recommend switching to an [`IterableDataset`] and leveraging its fast approximate shuffling method [`IterableDataset.shuffle`].
It only shuffles the shards order and adds a shuffle buffer to your dataset, which keeps the speed of your dataset optimal.
You can also reshuffle the dataset easily:

```python
for example in enumerate(my_iterable_dataset):  # fast
    pass

shuffled_iterable_dataset = my_iterable_dataset.shuffle(seed=42, buffer_size=100)

for example in enumerate(shuffled_iterable_dataset):  # as fast as before
    pass

shuffled_iterable_dataset = my_iterable_dataset.shuffle(seed=1337, buffer_size=100)  # reshuffling using another seed is instantaneous

for example in enumerate(shuffled_iterable_dataset):  # still as fast as before
    pass
```

If you're using your dataset on multiple epochs, the effective seed used to shuffle the shard order and the shuffle buffer is `seed + epoch`.
It makes it easy to reshuffle a dataset between epochs:
```python
for epoch in range(n_epochs):
    my_iterable_dataset.set_epoch(epoch)
    for example in my_iterable_dataset:  # fast + reshuffled at each epoch using `effective_seed = seed + epoch`
        pass
```
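
The `seed + epoch` rule can be illustrated with a small sketch that derives a shard order per epoch. `shard_order` is a hypothetical helper built on Python's `random` module, so the orders it produces are not the ones the library would produce.

```python
import random


def shard_order(num_shards, seed, epoch):
    """Return the order in which shards are read for a given epoch."""
    effective_seed = seed + epoch
    order = list(range(num_shards))
    random.Random(effective_seed).shuffle(order)
    return order


for epoch in range(3):
    print(epoch, shard_order(num_shards=8, seed=42, epoch=epoch))
```

Because the order depends only on `seed + epoch`, it is reproducible for a given epoch and no state needs to be stored to reshuffle at the next one.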

To restart the iteration of a map-style dataset, you can simply skip the first examples:

```python
my_dataset = my_dataset.select(range(start_index, len(my_dataset)))
```

But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have written a custom sampler that allows resuming).

On the other hand, iterable datasets don't provide random access to a specific example index to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:

```python
>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
>>> # save in the middle of training
>>> state_dict = iterable_dataset.state_dict()
>>> # and resume later
>>> iterable_dataset.load_state_dict(state_dict)
```

Under the hood, the iterable dataset keeps track of the current shard being read and the example index in the current shard and it stores this info in the `state_dict`.

To resume from a checkpoint, the dataset skips all the shards that were previously read to restart from the current shard. 
Then it reads the shard and skips examples until it reaches the exact example from the checkpoint.

Therefore restarting a dataset is quite fast, since it will not re-read the shards that have already been iterated on. Still, resuming a dataset is generally not instantaneous since it has to restart reading from the beginning of the current shard and skip examples until it reaches the checkpoint location.
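
The resuming logic described above can be sketched as follows. `ResumableShards` is a hypothetical class written for illustration, with shards represented as in-memory lists; the keys of its state dictionary are assumptions and do not match the content of a real `state_dict`.

```python
class ResumableShards:
    """Iterate over shards while tracking the current shard and the position inside it."""

    def __init__(self, shards):
        self.shards = shards
        self._state = {"shard_idx": 0, "example_idx": 0}

    def state_dict(self):
        return dict(self._state)

    def load_state_dict(self, state_dict):
        self._state = dict(state_dict)

    def __iter__(self):
        start_shard = self._state["shard_idx"]
        skip = self._state["example_idx"]
        # Shards before start_shard are never opened again.
        for shard_idx in range(start_shard, len(self.shards)):
            for example_idx, example in enumerate(self.shards[shard_idx]):
                if example_idx < skip:
                    continue  # re-read the current shard and skip what was already seen
                self._state = {"shard_idx": shard_idx, "example_idx": example_idx + 1}
                yield example
            skip = 0
            self._state = {"shard_idx": shard_idx + 1, "example_idx": 0}


shards = [[0, 1], [2, 3], [4, 5]]
dataset = ResumableShards(shards)
iterator = iter(dataset)
seen = [next(iterator) for _ in range(3)]
checkpoint = dataset.state_dict()

resumed = ResumableShards(shards)
resumed.load_state_dict(checkpoint)
remaining = list(resumed)
print(seen, checkpoint, remaining)
```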

This can be used with the `StatefulDataLoader` from `torchdata`, see [streaming with a PyTorch DataLoader](./use_with_pytorch#stream-data).

## Switch from map-style to iterable

If you want to benefit from the "lazy" behavior of an [`IterableDataset`] or its speed advantages, you can switch your map-style [`Dataset`] to an [`IterableDataset`]:
```python
my_iterable_dataset = my_dataset.to_iterable_dataset()
```

If you want to shuffle your dataset or [use it with a PyTorch DataLoader](./use_with_pytorch#stream-data), we recommend generating a sharded [`IterableDataset`]:
```python
my_iterable_dataset = my_dataset.to_iterable_dataset(num_shards=1024)
my_iterable_dataset.num_shards  # 1024
```


================================================
FILE: docs/source/access.mdx
================================================
# Know your dataset

There are two types of dataset objects, a regular [`Dataset`] and then an ✨ [`IterableDataset`] ✨. A [`Dataset`] provides fast random access to the rows, and memory-mapping so that loading even large datasets only uses a relatively small amount of device memory. But for really, really big datasets that won't even fit on disk or in memory, an [`IterableDataset`] allows you to access and use the dataset without waiting for it to download completely!

This tutorial will show you how to load and access a [`Dataset`] and an [`IterableDataset`].

## Dataset

When you load a dataset split, you'll get a [`Dataset`] object. You can do many things with a [`Dataset`] object, which is why it's important to learn how to manipulate and interact with the data stored inside. 
 
This tutorial uses the [rotten_tomatoes](https://huggingface.co/datasets/rotten_tomatoes) dataset, but feel free to load any dataset you'd like and follow along!

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
```

### Indexing

A [`Dataset`] contains columns of data, and each column can be a different type of data. The *index*, or axis label, is used to access examples from the dataset. For example, indexing by the row returns a dictionary of an example from the dataset:

```py
# Get the first row in the dataset
>>> dataset[0]
{'label': 1,
 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
```

Use the `-` operator to start from the end of the dataset:

```py
# Get the last row in the dataset
>>> dataset[-1]
{'label': 0,
 'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .'}
```

Indexing by the column name returns a list of all the values in the column:

```py
>>> dataset["text"]
['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
 'effective but too-tepid biopic',
 ...,
 'things really get weird , though not particularly scary : the movie is all portent and no content .']
```

You can combine row and column name indexing to return a specific value at a position:

```py
>>> dataset[0]["text"]
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
```

Indexing order doesn't matter. Indexing by the column name first returns a [`Column`] object that you can index as usual with row indices:

```py
>>> import time

>>> start_time = time.time()
>>> text = dataset[0]["text"]
>>> end_time = time.time()
>>> print(f"Elapsed time: {end_time - start_time:.4f} seconds")
Elapsed time: 0.0031 seconds

>>> start_time = time.time()
>>> text = dataset["text"][0]
>>> end_time = time.time()
>>> print(f"Elapsed time: {end_time - start_time:.4f} seconds")
Elapsed time: 0.0042 seconds
```

### Slicing

Slicing returns a slice - or subset - of the dataset, which is useful for viewing several rows at once. To slice a dataset, use the `:` operator to specify a range of positions. 

```py
# Get the first three rows
>>> dataset[:3]
{'label': [1, 1, 1],
 'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic']}

# Get rows between three and six
>>> dataset[3:6]
{'label': [1, 1, 1],
 'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
  'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .']}
```
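
The return types of the indexing and slicing operations above follow from the columnar layout. The toy class below is a sketch in plain Python, assuming a hypothetical `TinyTable` that stores each column as a list; unlike a real [`Dataset`], indexing by column name here returns a plain list.

```python
class TinyTable:
    """Toy columnar table: one Python list per column."""

    def __init__(self, columns):
        self.columns = columns

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column name: all the values of that column.
            return self.columns[key]
        if isinstance(key, int):
            # Row index (negative values count from the end): one value per column.
            return {name: values[key] for name, values in self.columns.items()}
        if isinstance(key, slice):
            # Slice: a list of values per column.
            return {name: values[key] for name, values in self.columns.items()}
        raise TypeError(f"unsupported key type: {type(key).__name__}")


table = TinyTable({"label": [1, 1, 0], "text": ["great", "fun", "dull"]})
print(table[0])          # {'label': 1, 'text': 'great'}
print(table[-1])         # {'label': 0, 'text': 'dull'}
print(table[:2])         # {'label': [1, 1], 'text': ['great', 'fun']}
print(table["text"][0])  # great
```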

## IterableDataset

An [`IterableDataset`] is loaded when you set the `streaming` parameter to `True` in [`~datasets.load_dataset`]:

```py
>>> from datasets import load_dataset

>>> iterable_dataset = load_dataset("ethz/food101", split="train", streaming=True)
>>> for example in iterable_dataset:
...     print(example)
...     break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F5C520>, 'label': 6}
```

You can also create an [`IterableDataset`] from an *existing* [`Dataset`], which is faster than streaming mode because the dataset is streamed from local files:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
>>> iterable_dataset = dataset.to_iterable_dataset()
```

An [`IterableDataset`] progressively iterates over a dataset one example at a time, so you don't have to wait for the whole dataset to download before you can use it. As you can imagine, this is quite useful for large datasets you want to use immediately!

### Indexing

An [`IterableDataset`]'s behavior is different from a regular [`Dataset`]. You don't get random access to examples in an [`IterableDataset`]. Instead, you should iterate over its elements, for example, by calling `next(iter())` or with a `for` loop to return the next item from the [`IterableDataset`]:

```py
>>> next(iter(iterable_dataset))
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F0681F59B50>,
 'label': 6}

>>> for example in iterable_dataset:
...     print(example)
...     break
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DE82B0>, 'label': 6}
```

But an [`IterableDataset`] supports column indexing that returns an iterable for the column values:

```py
>>> next(iter(iterable_dataset["label"]))
6
```

### Creating a subset

You can return a subset of the dataset with a specific number of examples in it with [`IterableDataset.take`]:

```py
# Get first three examples
>>> list(iterable_dataset.take(3))
[{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=384x512 at 0x7F7479DEE9D0>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F7479DE8190>,
  'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x383 at 0x7F7479DE8310>,
  'label': 6}]
```

But unlike [slicing](access/#slicing), [`IterableDataset.take`] creates a new [`IterableDataset`]. 

## Next steps

Interested in learning more about the differences between these two types of datasets? Learn more about them in the [Differences between `Dataset` and `IterableDataset`](about_mapstyle_vs_iterable) conceptual guide.

To get more hands-on with these dataset types, check out the [Process](process) guide to learn how to preprocess a [`Dataset`] or the [Stream](stream) guide to learn how to preprocess an [`IterableDataset`].


================================================
FILE: docs/source/audio_dataset.mdx
================================================
# Create an audio dataset

You can share a dataset with your team or with anyone in the community by creating a dataset repository on the Hugging Face Hub:

```py
from datasets import load_dataset

dataset = load_dataset("<username>/my_dataset")
```

There are several methods for creating and sharing an audio dataset:

- Create an audio dataset from local files in Python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps in Python.

- Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.

> [!TIP]
> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

## Local files

You can load your own dataset using the paths to your audio files. Use the [`~Dataset.cast_column`] function to take a column of audio file paths, and cast it to the [`Audio`] feature:

```py
>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
>>> audio_dataset[0]["audio"]
<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
```

Then upload the dataset to the Hugging Face Hub using [`Dataset.push_to_hub`]:

```py
audio_dataset.push_to_hub("<username>/my_dataset")
```

This will create a dataset repository containing your audio dataset:

```
my_dataset/
├── README.md
└── data/
    └── train-00000-of-00001.parquet
```

## AudioFolder

The `AudioFolder` is a dataset builder designed to quickly load an audio dataset with several thousand audio files without requiring you to write any code.

> [!TIP]
> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `AudioFolder` creates dataset splits based on your dataset repository structure.

`AudioFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

```
folder/train/dog/golden_retriever.mp3
folder/train/dog/german_shepherd.mp3
folder/train/dog/chihuahua.mp3

folder/train/cat/maine_coon.mp3
folder/train/cat/bengal.mp3
folder/train/cat/birman.mp3
```

If the dataset follows the `AudioFolder` structure, then you can load it directly with [`load_dataset`]:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("username/dataset_name")
```

This is equivalent to passing `audiofolder` manually in [`load_dataset`] and the directory in `data_dir`:

```py
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```

You can also use `audiofolder` to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:

```
folder/train/dog/golden_retriever.mp3
folder/train/cat/maine_coon.mp3
folder/test/dog/german_shepherd.mp3
folder/test/cat/bengal.mp3
```

> [!WARNING]
> If all audio files are contained in a single directory, or if they are not at the same level of the directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.

If there is additional information you'd like to include about your dataset, like transcriptions or speaker labels, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different audio tasks like speech recognition or audio classification. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.

```
folder/train/metadata.csv
folder/train/0001.mp3
folder/train/0002.mp3
folder/train/0003.mp3
```

You can also zip your audio files. In this case, each zip should contain both the audio files and the metadata:

```
folder/train.zip
folder/test.zip
folder/validation.zip
```

Your `metadata.csv` file must have a `file_name` or `*_file_name` field which links audio files with their metadata:

```csv
file_name,additional_feature
0001.mp3,This is a first value of a text feature you added to your audio files
0002.mp3,This is a second value of a text feature you added to your audio files
0003.mp3,This is a third value of a text feature you added to your audio files
```

or using `metadata.jsonl`:

```jsonl
{"file_name": "0001.mp3", "additional_feature": "This is a first value of a text feature you added to your audio files"}
{"file_name": "0002.mp3", "additional_feature": "This is a second value of a text feature you added to your audio files"}
{"file_name": "0003.mp3", "additional_feature": "This is a third value of a text feature you added to your audio files"}
```

Here the `file_name` must be the name of the audio file next to the metadata file. More generally, it must be the relative path from the directory containing the metadata to the audio file.
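
As a minimal sketch of the relative-path rule, the snippet below writes a `metadata.csv` with Python's standard `csv` module. The folder layout, the `clips/` subdirectory, and the `transcription` column are assumptions made up for illustration. The `file_name` values are computed relative to the directory that holds the metadata file.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical layout: folder/train/metadata.csv next to folder/train/clips/*.mp3
root = Path(tempfile.mkdtemp()) / "folder" / "train"
(root / "clips").mkdir(parents=True)

# Placeholder transcriptions keyed by audio path (illustrative values only).
transcriptions = {
    root / "clips" / "0001.mp3": "first transcription",
    root / "clips" / "0002.mp3": "second transcription, with a comma",
}
for audio_path in transcriptions:
    audio_path.touch()  # empty stand-ins for real audio files

metadata_path = root / "metadata.csv"
with metadata_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # quotes fields that contain commas
    writer.writerow(["file_name", "transcription"])
    for audio_path, text in transcriptions.items():
        # file_name is the path relative to the metadata file's directory
        writer.writerow([audio_path.relative_to(root).as_posix(), text])

# Read the file back to check what was written
with metadata_path.open(newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print([row["file_name"] for row in rows])
```

Writing the file with `csv.writer` rather than by hand also takes care of quoting values that contain commas.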

It's possible to point to more than one audio in each row in your dataset, for example if both your input and output are audio files:

```jsonl
{"input_file_name": "0001.mp3", "output_file_name": "0001_output.mp3"}
{"input_file_name": "0002.mp3", "output_file_name": "0002_output.mp3"}
{"input_file_name": "0003.mp3", "output_file_name": "0003_output.mp3"}
```

You can also define lists of audio files. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:

```jsonl
{"recordings_file_names": ["0001_r0.mp3", "0001_r1.mp3"], "label": "same_person"}
{"recordings_file_names": ["0002_r0.mp3", "0002_r1.mp3"], "label": "same_person"}
{"recordings_file_names": ["0003_r0.mp3", "0003_r1.mp3"], "label": "different_person"}
```
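
Before uploading, it can help to check that every line of a `metadata.jsonl` file is valid JSON and contains at least one field following the naming rules above. The following is a rough sketch using only the standard library; `find_file_name_keys` is a hypothetical helper written for this example, not part of 🤗 Datasets.

```python
import json

def find_file_name_keys(line: str) -> list:
    """Return the keys of one metadata.jsonl row that link to audio files."""
    row = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
    names = ("file_name", "file_names")
    # Accept `file_name`, `*_file_name`, `file_names` and `*_file_names`
    return [
        key for key in row
        if any(key == name or key.endswith("_" + name) for name in names)
    ]

lines = [
    '{"file_name": "0001.mp3", "additional_feature": "text"}',
    '{"input_file_name": "0001.mp3", "output_file_name": "0001_output.mp3"}',
    '{"recordings_file_names": ["0001_r0.mp3", "0001_r1.mp3"], "label": "same_person"}',
]
for line in lines:
    print(find_file_name_keys(line))
```

A row for which the helper returns an empty list has no field linking it to an audio file.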

## WebDataset

The [WebDataset](https://github.com/webdataset/webdataset) format is based on TAR archives and is suitable for big audio datasets.
Indeed you can group your audio files in TAR archives (e.g. 1GB of audio files per TAR archive) and have thousands of TAR archives:

```
folder/train/00000.tar
folder/train/00001.tar
folder/train/00002.tar
...
```

In the archives, each example is made of files sharing the same prefix:

```
e39871fd9fd74f55.mp3
e39871fd9fd74f55.json
f18b91585c4d3f3e.mp3
f18b91585c4d3f3e.json
ede6e66b2fb59aab.mp3
ede6e66b2fb59aab.json
ed600d57fcee4f94.mp3
ed600d57fcee4f94.json
...
```

You can store labels, captions, or other metadata for your audio files in JSON or text files, for example.
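
To make the layout concrete, here is a small sketch that packs two examples into one TAR shard with Python's standard `tarfile` module. The audio bytes and transcripts are placeholders, and `add_bytes` is a hypothetical helper; real shards would contain actual audio files and are often written with dedicated tooling.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path

# Hypothetical examples: key -> (audio bytes, metadata dict). Values are placeholders.
examples = {
    "e39871fd9fd74f55": (b"fake-mp3-bytes", {"transcript": "placeholder transcript 1"}),
    "f18b91585c4d3f3e": (b"fake-mp3-bytes", {"transcript": "placeholder transcript 2"}),
}

def add_bytes(tar, name, data):
    """Add an in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

shard_path = Path(tempfile.mkdtemp()) / "00000.tar"
with tarfile.open(shard_path, "w") as tar:
    for key, (audio, meta) in examples.items():
        # Files of one example share the same prefix and are stored next to each other
        add_bytes(tar, f"{key}.mp3", audio)
        add_bytes(tar, f"{key}.json", json.dumps(meta).encode("utf-8"))

# List the archive members to confirm the layout
with tarfile.open(shard_path) as tar:
    names = tar.getnames()
print(names)
```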

Load your WebDataset and it will create one column per file suffix (here "mp3" and "json"):

```python
>>> from datasets import load_dataset

>>> dataset = load_dataset("webdataset", data_dir="/path/to/folder", split="train")
>>> dataset[0]["json"]
{"transcript": "Hello there !", "speaker": "Obi-Wan Kenobi"}
```

It's also possible to have several audio files per example like this:

```
e39871fd9fd74f55.input.mp3
e39871fd9fd74f55.output.mp3
e39871fd9fd74f55.json
f18b91585c4d3f3e.input.mp3
f18b91585c4d3f3e.output.mp3
f18b91585c4d3f3e.json
...
```

For more details on the WebDataset format and the Python library, please check the [WebDataset documentation](https://webdataset.github.io/webdataset).


================================================
FILE: docs/source/audio_load.mdx
================================================
# Load audio data

You can load an audio dataset using the [`Audio`] feature that automatically decodes and resamples the audio files when you access the examples.
Audio decoding is based on the [`torchcodec`](https://github.com/pytorch/torchcodec) python package, which uses the [`FFmpeg`](https://www.ffmpeg.org/) C library under the hood.

## Installation

To work with audio datasets, you need to have the `audio` dependencies installed.
Check out the [installation](./installation#audio) guide to learn how to install it.

## Local files

You can load your own dataset using the paths to your audio files. Use the [`~Dataset.cast_column`] function to take a column of audio file paths, and cast it to the [`Audio`] feature:

```py
>>> from datasets import Dataset, Audio
>>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
>>> audio_dataset[0]["audio"]
<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
```

## AudioFolder

You can also load a dataset with an `AudioFolder` dataset builder. It does not require writing a custom dataloader, making it useful for quickly creating and loading audio datasets with several thousand audio files.

## AudioFolder with metadata

To link your audio files with metadata information, make sure your dataset has a `metadata.csv` file. Your dataset structure might look like:

```
folder/train/metadata.csv
folder/train/first_audio_file.mp3
folder/train/second_audio_file.mp3
folder/train/third_audio_file.mp3
```

Your `metadata.csv` file must have a `file_name` column which links audio files with their metadata. An example `metadata.csv` file might look like:

```text
file_name,transcription
first_audio_file.mp3,znowu się duch z ciałem zrośnie w młodocianej wstaniesz wiosnie i możesz skutkiem tych leków umierać wstawać wiek wieków dalej tam były przestrogi jak siekać głowę jak nogi
second_audio_file.mp3,już u źwierzyńca podwojów król zasiada przy nim książęta i panowie rada a gdzie wzniosły krążył ganek rycerze obok kochanek król skinął palcem zaczęto igrzysko
third_audio_file.mp3,pewnie kędyś w obłędzie ubite minęły szlaki zaczekajmy dzień jaki poślemy szukać wszędzie dziś jutro pewnie będzie posłali wszędzie sługi czekali dzień i drugi gdy nic nie doczekali z płaczem chcą jechać dali
```

`AudioFolder` will load audio data and create a `transcription` column containing texts from `metadata.csv`:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("username/dataset_name")
>>> # OR locally:
>>> dataset = load_dataset("/path/to/folder")
```

For local datasets, this is equivalent to passing `audiofolder` manually in [`load_dataset`] and the directory in `data_dir`:

```py
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```

Metadata can also be specified as JSON Lines, in which case use `metadata.jsonl` as the name of the metadata file. This format is helpful in scenarios when one of the columns is complex, e.g. a list of floats, to avoid parsing errors or reading the complex values as strings.
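
The difference can be illustrated with the standard library alone. The sketch below round-trips one row containing a list of floats through CSV and through JSON. The `embedding` column and its values are made up for this example, and it shows the behavior of Python's `csv` and `json` modules, not how 🤗 Datasets parses metadata files.

```python
import csv
import io
import json

row = {"file_name": "0001.mp3", "embedding": [0.1, 0.2, 0.3]}  # illustrative values

# CSV round trip: the list is written as its string representation
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_buffer.seek(0)
from_csv = next(csv.DictReader(csv_buffer))

# JSON Lines round trip: the list stays a list of floats
from_jsonl = json.loads(json.dumps(row))

print(type(from_csv["embedding"]).__name__)    # str
print(type(from_jsonl["embedding"]).__name__)  # list
```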

To ignore the information in the metadata file, set `drop_metadata=True` in [`load_dataset`]:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("username/dataset_with_metadata", drop_metadata=True)
```

If you don't have a metadata file, `AudioFolder` automatically infers the label name from the directory name.
If you want to drop automatically created labels, set `drop_labels=True`.
In this case, your dataset will only contain an audio column:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("username/dataset_without_metadata", drop_labels=True)
```

Finally the `filters` argument lets you load only a subset of the dataset, based on a condition on the label or the metadata. This is especially useful if the metadata is in Parquet format, since this format enables fast filtering. It is also recommended to use this argument with `streaming=True`, because by default the dataset is fully downloaded before filtering.

```python
>>> filters = [("label", "=", 0)]
>>> dataset = load_dataset("username/dataset_name", streaming=True, filters=filters)
```

> [!TIP]
> For more information about creating your own `AudioFolder` dataset, take a look at the [Create an audio dataset](./audio_dataset) guide.

For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.

## Audio decoding

By default, audio files are decoded sequentially as torchcodec [`AudioDecoder`](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder) objects when you iterate on a dataset.
However it is possible to speed up the dataset significantly using multithreaded decoding:

```python
>>> import os
>>> num_threads = min(32, (os.cpu_count() or 1) + 4)
>>> dataset = dataset.decode(num_threads=num_threads)
>>> for example in dataset:  # up to 20 times faster !
...     ...
```

You can enable multithreading using `num_threads`. This is especially useful to speed up remote data streaming.
However it can be slower than `num_threads=0` for local data on fast disks.

If you are not interested in the audio decoded as `AudioDecoder` objects and would like to access the path/bytes instead, you can disable decoding:

```python
>>> dataset = dataset.decode(False)
```

Note: [`IterableDataset.decode`] is only available for streaming datasets at the moment.


================================================
FILE: docs/source/audio_process.mdx
================================================
# Process audio data

This guide shows specific methods for processing audio datasets. Learn how to:

- Resample the sampling rate.
- Use [`~Dataset.map`] with audio datasets.

For a guide on how to process any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./process">general process guide</a>.

## Cast

The [`~Dataset.cast_column`] function is used to cast a column to another feature to be decoded. When you use this function with the [`Audio`] feature, you can resample the sampling rate:

```py
>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```

Audio files are decoded and resampled on-the-fly, so the next time you access an example, the audio file is resampled to 16kHz:

```py
>>> audio = dataset[0]["audio"]
>>> audio
<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
>>> samples = audio.get_all_samples()
>>> samples.data
tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06,
         -1.9127e-04, -5.3330e-05]])
>>> samples.sample_rate
16000
```

<div class="flex justify-center">
  <img
    class="block dark:hidden"
    src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample.gif"
  />
  <img
    class="hidden dark:block"
    src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample-dark.gif"
  />
</div>

## Map

The [`~Dataset.map`] function helps preprocess your entire dataset at once. Depending on the type of model you're working with, you'll need to either load a [feature extractor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoFeatureExtractor) or a [processor](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoProcessor).

- For pretrained speech recognition models, load a feature extractor and tokenizer and combine them in a `processor`:

  ```py
  >>> from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

  >>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
  # after defining a vocab.json file you can instantiate a tokenizer object:
  >>> tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
  >>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_checkpoint)
  >>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
  ```

- For fine-tuned speech recognition models, you only need to load a `processor`:

  ```py
  >>> from transformers import AutoProcessor

  >>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
  ```

When you use [`~Dataset.map`] with your preprocessing function, include the `audio` column to ensure you're actually resampling the audio data:

```py
>>> def prepare_dataset(batch):
...     audio = batch["audio"]
...     batch["input_values"] = processor(audio.get_all_samples().data, sampling_rate=audio["sampling_rate"]).input_values[0]
...     batch["input_length"] = len(batch["input_values"])
...     with processor.as_target_processor():
...         batch["labels"] = processor(batch["sentence"]).input_ids
...     return batch
>>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)
```


================================================
FILE: docs/source/cache.mdx
================================================
# Cache management

When you download a dataset from Hugging Face, the data are stored locally on your computer.
Files from Hugging Face are stored as usual in the `huggingface_hub` cache, which is at `~/.cache/huggingface/hub` by default.
See the [Hub cache documentation](https://huggingface.co/docs/huggingface_hub/guides/manage-cache) for more details and how to change its location.

The Hub cache allows 🤗 Datasets to avoid re-downloading dataset files from Hugging Face every time you use them. 

🤗 Datasets also has its own cache to store datasets converted in Arrow format (the format used by [`Dataset`] objects).

This guide focuses on the 🤗 Datasets cache and will show you how to:

- Change the cache directory.
- Control how a dataset is loaded from the cache.
- Clean up cache files in the directory.
- Enable or disable caching.

## Cache directory

The default 🤗 Datasets cache directory is `~/.cache/huggingface/datasets`. It lives under the Hugging Face home directory, so you can move it by setting the `HF_HOME` shell environment variable to another directory. The datasets cache is then stored in the `datasets` subdirectory of that directory:

```
$ export HF_HOME="/path/to/another/directory"
```

Alternatively, you can set the `HF_DATASETS_CACHE` environment variable to control only the datasets-specific cache directory:

```
$ export HF_DATASETS_CACHE="/path/to/datasets_cache"
```

⚠️ This only applies to files written by the `datasets` library (e.g., Arrow files and indices).  
It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are located in `~/.cache/huggingface/hub` by default and controlled separately via the `HF_HUB_CACHE` variable:

```
$ export HF_HUB_CACHE="/path/to/hub_cache"
```

💡 If you'd like to relocate all Hugging Face caches — including datasets and hub downloads — use the `HF_HOME` variable instead:

```
$ export HF_HOME="/path/to/cache_root"
```

This results in:
- datasets cache → `/path/to/cache_root/datasets`
- hub cache → `/path/to/cache_root/hub`
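
The precedence between these variables can be summarized with a short sketch. `resolve_cache_dirs` is a hypothetical helper written for this guide; it only mirrors the rules described above and ignores other details of the real resolution logic (for example `XDG_CACHE_HOME`).

```python
import os
from pathlib import Path

def resolve_cache_dirs(env):
    """Illustrative only: mirrors the precedence described in this guide."""
    default_home = Path.home() / ".cache" / "huggingface"
    home = Path(env.get("HF_HOME", default_home))
    return {
        # The specific variables win over the location derived from HF_HOME
        "datasets": Path(env.get("HF_DATASETS_CACHE", home / "datasets")),
        "hub": Path(env.get("HF_HUB_CACHE", home / "hub")),
    }

# Only HF_HOME set: both caches move under the new root
print(resolve_cache_dirs({"HF_HOME": "/path/to/cache_root"}))

# HF_DATASETS_CACHE set as well: only the datasets cache is overridden
print(resolve_cache_dirs({
    "HF_HOME": "/path/to/cache_root",
    "HF_DATASETS_CACHE": "/path/to/datasets_cache",
}))

# What the sketch predicts for the current shell environment
print(resolve_cache_dirs(os.environ))
```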

These distinctions are especially useful when working in shared environments or networked file systems (e.g., NFS).  
See [issue #7480](https://github.com/huggingface/datasets/issues/7480) for discussion on how users encountered unexpected cache locations when `HF_HUB_CACHE` was not set alongside `HF_DATASETS_CACHE`.

When you load a dataset, you also have the option to change where the data is cached. Change the `cache_dir` parameter to the path you want:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset('username/dataset', cache_dir="/path/to/another/directory/datasets")
```

## Download mode

After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset('rajpurkar/squad', download_mode='force_redownload')
```

Refer to [`DownloadMode`] for a full list of download modes.

## Cache files

Clean up the Arrow cache files in the directory with [`Dataset.cleanup_cache_files`]:

```py
# Returns the number of removed cache files
>>> dataset.cleanup_cache_files()
2
```

## Enable or disable caching

If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in [`Dataset.map`]:

```py
>>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)
```

In the example above, 🤗 Datasets will execute the function `add_prefix` over the entire dataset again instead of loading the dataset from its previous state.

Disable caching on a global scale with [`disable_caching`]:

```py
>>> from datasets import disable_caching
>>> disable_caching()
```

When you disable caching, 🤗 Datasets will no longer reload cached files when applying transforms to datasets. Any transform you apply on your dataset will need to be reapplied.

> [!TIP]
> If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.

<a id='load_dataset_enhancing_performance'></a>

## Improve performance

Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory:

1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM.

2. Set the environment variable `HF_DATASETS_IN_MEMORY_MAX_SIZE` to a nonzero value. Note that the first method takes higher precedence.


================================================
FILE: docs/source/cli.mdx
================================================
# Command Line Interface (CLI)

🤗 Datasets provides a command line interface (CLI) with useful shell commands to interact with your dataset.

You can check the available commands:
```bash
>>> datasets-cli --help
usage: datasets-cli <command> [<args>]

positional arguments:
  {env,test,delete_from_hub}
                        datasets-cli command helpers
    env                 Print relevant system environment info.
    test                Test dataset loading.
    delete_from_hub     Delete dataset config from the Hub

optional arguments:
  -h, --help            show this help message and exit
```

## Delete from Hub

Delete a dataset configuration from a [supported dataset](repository_structure) on the Hub.

```bash
>>> datasets-cli delete_from_hub --help
usage: datasets-cli <command> [<args>] delete_from_hub [-h] [--token TOKEN] [--revision REVISION] dataset_id config_name

positional arguments:
  dataset_id           source dataset ID, e.g. USERNAME/DATASET_NAME or ORGANIZATION/DATASET_NAME
  config_name          config name to delete

optional arguments:
  -h, --help           show this help message and exit
  --token TOKEN        access token to the Hugging Face Hub
  --revision REVISION  source revision
```

For example:
```bash
>>> datasets-cli delete_from_hub USERNAME/DATASET_NAME CONFIG_NAME
```

> [!TIP]
> Do not forget that you need to log in first to your Hugging Face account:
> ```bash
> >>> hf auth login
> ```


================================================
FILE: docs/source/create_dataset.mdx
================================================
# Create a dataset

Sometimes, you may need to create a dataset if you're working with your own data. Creating a dataset with 🤗 Datasets confers all the advantages of the library to your dataset: fast loading and processing, [stream enormous datasets](stream), [memory-mapping](https://huggingface.co/course/chapter5/4?fw=pt#the-magic-of-memory-mapping), and more. You can easily and rapidly create a dataset with 🤗 Datasets low-code approaches, reducing the time it takes to start training a model. In many cases, it is as easy as [dragging and dropping](upload_dataset#upload-with-the-hub-ui) your data files into a dataset repository on the Hub.

In this tutorial, you'll learn how to use 🤗 Datasets low-code methods for creating all types of datasets:

- Folder-based builders for quickly creating an image or audio dataset
- `from_` methods for creating datasets from local files

## File-based builders

🤗 Datasets supports many common formats such as `csv`, `json/jsonl`, `parquet`, `txt`.

For example it can read a dataset made up of one or several CSV files (in this case, pass your CSV files as a list):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")
```

To get the list of supported formats and code examples, follow this guide [here](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

## Folder-based builders

There are two folder-based builders, [`ImageFolder`] and [`AudioFolder`]. These are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders take your data and automatically generate the dataset's features, splits, and labels. Under the hood:

- [`ImageFolder`] uses the [`~datasets.Image`] feature to decode an image file. Many image formats are supported, such as jpg and png. You can check the complete [list](https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/packaged_modules/imagefolder/imagefolder.py#L39) of supported image extensions.
- [`AudioFolder`] uses the [`~datasets.Audio`] feature to decode an audio file. Extensions such as wav, mp3, and even mp4 are supported, and you can check the complete [list](https://ffmpeg.org/ffmpeg-formats.html) of supported audio extensions. Decoding is done via ffmpeg.

The dataset splits are generated from the repository structure, and the label names are automatically inferred from the directory name.

For example, if your image dataset (it is the same for an audio dataset) is stored like this:

```
pokemon/train/grass/bulbasaur.png
pokemon/train/fire/charmander.png
pokemon/train/water/squirtle.png

pokemon/test/grass/ivysaur.png
pokemon/test/fire/charmeleon.png
pokemon/test/water/wartortle.png
```

Then this is how the folder-based builder generates an example:

<div class="flex justify-center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/folder-based-builder.png" />
</div>
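
The naming convention can be sketched in a few lines of Python. `describe_example` is a hypothetical helper that only illustrates how a split and a label can be read from a path of the form `<root>/<split>/<label>/<file>`; it is not the builder's implementation, which handles more patterns.

```python
from pathlib import PurePosixPath

def describe_example(path: str) -> dict:
    """Illustrative only: read split and label from <root>/<split>/<label>/<file>."""
    parts = PurePosixPath(path).parts
    split, label, file_name = parts[-3], parts[-2], parts[-1]
    return {"split": split, "label": label, "file": file_name}

for path in ["pokemon/train/grass/bulbasaur.png", "pokemon/test/fire/charmeleon.png"]:
    print(describe_example(path))
```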

Create the image dataset by specifying `imagefolder` in [`load_dataset`]:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
```

An audio dataset is created in the same way, except you specify `audiofolder` in [`load_dataset`] instead:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```

Any additional information about your dataset, such as text captions or transcriptions, can be included with a `metadata.csv` file in the folder containing your dataset. The metadata file needs to have a `file_name` column that links the image or audio file to its corresponding metadata:

```
file_name,text
bulbasaur.png,There is a plant seed on its back right from the day this Pokémon is born.
charmander.png,It has a preference for hot things.
squirtle.png,"When it retracts its long neck into its shell, it squirts out water with vigorous force."
```

To learn more about each of these folder-based builders, check out the <a href="https://huggingface.co/docs/datasets/image_dataset#imagefolder"><span class="underline decoration-yellow-400 decoration-2 font-semibold">ImageFolder</span></a> or <a href="https://huggingface.co/docs/datasets/audio_dataset#audiofolder"><span class="underline decoration-pink-400 decoration-2 font-semibold">AudioFolder</span></a> guides.

## From Python dictionaries

You can also create a dataset from data in Python dictionaries. There are two ways you can create a dataset using the `from_` methods:

    * The [`~Dataset.from_generator`] method is the most memory-efficient way to create a dataset from a [generator](https://wiki.python.org/moin/Generators) because a generator yields examples one at a time. This is especially useful when you're working with a really large dataset that may not fit in memory, since the dataset is generated on disk progressively and then memory-mapped.

    ```py
    >>> from datasets import Dataset
    >>> def gen():
    ...     yield {"pokemon": "bulbasaur", "type": "grass"}
    ...     yield {"pokemon": "squirtle", "type": "water"}
    >>> ds = Dataset.from_generator(gen)
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}
    ```

    A generator-based [`IterableDataset`] needs to be iterated over with a `for` loop for example:

    ```py
    >>> from datasets import IterableDataset
    >>> ds = IterableDataset.from_generator(gen)
    >>> for example in ds:
    ...     print(example)
    {"pokemon": "bulbasaur", "type": "grass"}
    {"pokemon": "squirtle", "type": "water"}
    ```

    * The [`~Dataset.from_dict`] method is a straightforward way to create a dataset from a dictionary:

    ```py
    >>> from datasets import Dataset
    >>> ds = Dataset.from_dict({"pokemon": ["bulbasaur", "squirtle"], "type": ["grass", "water"]})
    >>> ds[0]
    {"pokemon": "bulbasaur", "type": "grass"}
    ```

    To create an image or audio dataset, chain the [`~Dataset.cast_column`] method with [`~Dataset.from_dict`] and specify the column and feature type. For example, to create an audio dataset:

    ```py
    >>> from datasets import Dataset, Audio
    >>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
    ```

Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.


================================================
FILE: docs/source/dataset_card.mdx
================================================
# Create a dataset card

Each dataset should have a dataset card to promote responsible usage and inform users of any potential biases within the dataset.
This idea was inspired by the Model Cards proposed by [Mitchell et al., 2018](https://huggingface.co/papers/1810.03993).
Dataset cards help users understand a dataset's contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.

Creating a dataset card is easy and can be done in just a few steps:

1. Go to your dataset repository on the [Hub](https://hf.co/new-dataset) and click on **Create Dataset Card** to create a new `README.md` file in your repository.

2. Use the **Metadata UI** to select the tags that describe your dataset. You can add a `license`, `language`, `pretty_name`, the `task_categories`, `size_categories`, and any other tags that you think are relevant. These tags help users discover your dataset on the Hub.

<div class="flex justify-center">
    <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-metadata-ui.png"/>
    <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/datasets-metadata-ui-dark.png"/>
</div>

  > [!TIP]
  > For a complete, but not required, set of tag options you can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1). This'll have a few more tag options like `multilinguality` and `language_creators` which are useful but not absolutely necessary.

3. Click on the **Import dataset card template** link to automatically create a template with all the relevant fields to complete. Fill out the template sections to the best of your ability. Take a look at the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) for more detailed information about what to include in each section of the card. For fields you are unable to complete, you can write **[More Information Needed]**.

4. Once you're done, commit the changes to the `README.md` file and you'll see the completed dataset card on your repository.

YAML also allows you to customize the way your dataset is loaded by [defining splits and/or configurations](./repository_structure#define-your-splits-and-subsets-in-yaml) without the need to write any code.

Feel free to take a look at the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli), [CNN/DailyMail](https://huggingface.co/datasets/abisee/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/tblard/allocine) dataset cards as examples to help you get started.


================================================
FILE: docs/source/depth_estimation.mdx
================================================
# Depth estimation

Depth estimation datasets are used to train a model to approximate the relative distance of every pixel in an
image from the camera, also known as depth. The applications enabled by these datasets primarily lie in areas like visual machine
perception and perception in robotics. Example applications include mapping streets for self-driving cars. This guide will show you how to apply transformations
to a depth estimation dataset.

Before you start, make sure you have an up-to-date version of `albumentations` installed:

```bash
pip install -U albumentations 
```

[Albumentations](https://albumentations.ai/) is a Python library for performing data augmentation
for computer vision. It supports various computer vision tasks such as image classification, object
detection, segmentation, and keypoint estimation.

This guide uses the [NYU Depth V2](https://huggingface.co/datasets/sayakpaul/nyu_depth_v2) dataset, which
consists of video sequences from various indoor scenes, recorded by RGB and depth cameras. The dataset covers scenes from 3 cities and provides images along with
their depth maps as labels.

Load the `train` split of the dataset and take a look at an example:

```py
>>> from datasets import load_dataset

>>> train_dataset = load_dataset("sayakpaul/nyu_depth_v2", split="train")
>>> index = 17
>>> example = train_dataset[index]
>>> example
{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=640x480>,
 'depth_map': <PIL.TiffImagePlugin.TiffImageFile image mode=F size=640x480>}
```

The dataset has two fields:

* `image`: a PIL PNG image object with `uint8` data type.
* `depth_map`: a PIL TIFF image object with `float32` data type, which is the depth map of the image.

The depth maps here use the TIFF format because it supports a wide range of data types, including `float32`.
JPEG and PNG files, by contrast, can only store `uint8` or `uint16` data.
If your depth maps are saved as JPEG/PNG, use the `Image(mode="F")` type to load them as single-channel `float32` images, like regular depth maps:

```python
>>> from datasets import Image

>>> train_dataset = train_dataset.cast_column("depth_map", Image(mode="F"))
```

Next, check out an image with:

```py
>>> example["image"]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/depth_est_sample.png">
</div>

Before looking at the depth map, first convert its data type to `uint8` using `.convert("RGB")`, as PIL can't display `float32` images. Now take a look at the corresponding depth map:

```py
>>> example["depth_map"].convert("RGB")
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/depth_est_target.png">
</div>

It's all black! You'll need to add some color to the depth map to visualize it properly. You can either apply a colormap automatically at display time with `plt.imshow()`, or create a colored depth map with `plt.cm` and then display it. This example uses the latter, so the colored depth map can be saved later. The utility below is taken from the [FastDepth repository](https://github.com/dwofk/fast-depth/blob/master/utils.py).

```py 
>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> cmap = plt.cm.viridis

>>> def colored_depthmap(depth, d_min=None, d_max=None):
...     if d_min is None:
...         d_min = np.min(depth)
...     if d_max is None:
...         d_max = np.max(depth)
...     depth_relative = (depth - d_min) / (d_max - d_min)
...     return 255 * cmap(depth_relative)[:,:,:3]

>>> def show_depthmap(depth_map):
...     if not isinstance(depth_map, np.ndarray):
...         depth_map = np.array(depth_map)
...     if depth_map.ndim == 3:
...         depth_map = depth_map.squeeze()
...
...     d_min = np.min(depth_map)
...     d_max = np.max(depth_map)
...     depth_map = colored_depthmap(depth_map, d_min, d_max)
...
...     plt.imshow(depth_map.astype("uint8"))
...     plt.axis("off")
...     plt.show()

>>> show_depthmap(example["depth_map"])
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/depth_est_target_viz.png">
</div>
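
Note that `colored_depthmap` divides by `d_max - d_min`, which is zero for a constant depth map and would produce NaN values. A minimal sketch of a guarded normalization follows; `normalize_depth` is a hypothetical helper name used for illustration and is not part of this guide's utilities or of any library:

```python
import numpy as np


def normalize_depth(depth, d_min=None, d_max=None):
    """Scale a depth map to [0, 1]; hypothetical helper, not a library function."""
    depth = np.asarray(depth, dtype=np.float32)
    d_min = float(depth.min()) if d_min is None else d_min
    d_max = float(depth.max()) if d_max is None else d_max
    span = d_max - d_min
    if span == 0:
        # A flat depth map carries no contrast: return zeros instead of NaNs.
        return np.zeros_like(depth)
    return (depth - d_min) / span


depth = np.array([[1.0, 2.0], [3.0, 5.0]], dtype=np.float32)
print(normalize_depth(depth))
print(normalize_depth(np.full((2, 2), 4.0)))
```

The result could then be passed to a colormap such as `plt.cm.viridis` in place of the inline division.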

You can also visualize several different images and their corresponding depth maps.

```py
>>> def merge_into_row(input_image, depth_target):
...     if not isinstance(input_image, np.ndarray):
...         input_image = np.array(input_image)
...
...     d_min = np.min(depth_target)
...     d_max = np.max(depth_target)
...     depth_target_col = colored_depthmap(depth_target, d_min, d_max)
...     img_merge = np.hstack([input_image, depth_target_col])
...
...     return img_merge

>>> random_indices = np.random.choice(len(train_dataset), 9).tolist()
>>> plt.figure(figsize=(15, 6))
>>> for i, idx in enumerate(random_indices):
...     example = train_dataset[idx]
...     ax = plt.subplot(3, 3, i + 1)
...     image_viz = merge_into_row(
...         example["image"], example["depth_map"]
...     )
...     plt.imshow(image_viz.astype("uint8"))
...     plt.axis("off")
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/depth_est_collage.png">
</div>

Now apply some augmentations with `albumentations`. The augmentation transformations include:

* Random horizontal flipping
* Random cropping 
* Random brightness and contrast 
* Random gamma correction 
* Random hue saturation

```py 
>>> import albumentations as A

>>> crop_size = (448, 576)
>>> transforms = [
...     A.HorizontalFlip(p=0.5),
...     A.RandomCrop(crop_size[0], crop_size[1]),
...     A.RandomBrightnessContrast(),
...     A.RandomGamma(),
...     A.HueSaturationValue()
... ]
```

Additionally, define a mapping to better reflect the target key name.

```py 
>>> additional_targets = {"depth": "mask"}
>>> aug = A.Compose(transforms=transforms, additional_targets=additional_targets)
```

With `additional_targets` defined, you can pass the target depth maps to the `depth` argument of `aug` instead of `mask`. You'll notice this change
in the `apply_transforms()` function defined below.
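
Registering the depth map as a `mask`-type target matters because spatial transforms (flip, crop) must be applied identically to the image and its depth map, while pixel-level transforms (brightness, gamma, hue) should change only the image. The sketch below illustrates that contract with plain NumPy. `paired_augment` is a hypothetical helper written for this illustration; it is not how `albumentations` is implemented.

```python
import numpy as np


def paired_augment(image, depth, rng, crop_size=(2, 3)):
    """Apply one random flip and crop to both arrays; brighten the image only."""
    # Spatial transform 1: the same flip decision is shared by both arrays.
    if rng.random() < 0.5:
        image, depth = image[:, ::-1], depth[:, ::-1]
    # Spatial transform 2: the same crop window is shared by both arrays.
    crop_h, crop_w = crop_size
    top = int(rng.integers(0, image.shape[0] - crop_h + 1))
    left = int(rng.integers(0, image.shape[1] - crop_w + 1))
    image = image[top : top + crop_h, left : left + crop_w]
    depth = depth[top : top + crop_h, left : left + crop_w]
    # Pixel-level transform: applied to the image, never to the depth values.
    image = np.clip(image.astype(np.float32) * 1.2, 0, 255).astype(np.uint8)
    return image, depth


rng = np.random.default_rng(0)
image = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)
depth = np.arange(4 * 5, dtype=np.float32).reshape(4, 5)
aug_image, aug_depth = paired_augment(image, depth, rng)
print(aug_image.shape, aug_depth.shape)
```

Because both arrays share the flip decision and the crop window, every pixel in the augmented image still lines up with its depth value.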

Create a function to apply the transformation to the images as well as their depth maps:

```py 
>>> def apply_transforms(examples):
...     transformed_images, transformed_maps = [], []
...     for image, depth_map in zip(examples["image"], examples["depth_map"]):
...         image, depth_map = np.array(image), np.array(depth_map)
...         transformed = aug(image=image, depth=depth_map)
...         transformed_images.append(transformed["image"])
...         transformed_maps.append(transformed["depth"])
...
...     examples["pixel_values"] = transformed_images
...     examples["labels"] = transformed_maps
...     return examples
```

Use the [`~Dataset.set_transform`] function to apply the transformation on the fly to batches of the dataset, which consumes less disk space:

```py
>>> train_dataset.set_transform(apply_transforms)
```

You can verify the transformation worked by indexing into the `pixel_values` and `labels` of an example image:

```py
>>> example = train_dataset[index]

>>> plt.imshow(example["pixel_values"])
>>> plt.axis("off")
>>> plt.show()
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/depth_est_sample_aug.png">
</div>

Visualize the same transformation on the image's corresponding depth map:

```py 
>>> show_depthmap(example["labels"])
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/depth_est_target_aug.png">
</div>

You can also visualize multiple training samples reusing the previous `random_indices`: 

```py 
>>> plt.figure(figsize=(15, 6))

>>> for i, idx in enumerate(random_indices):
...     ax = plt.subplot(3, 3, i + 1)
...     example = train_dataset[idx]
...     image_viz = merge_into_row(
...         example["pixel_values"], example["labels"]
...     )
...     plt.imshow(image_viz.astype("uint8"))
...     plt.axis("off")
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/depth_est_aug_collage.png">
</div>

================================================
FILE: docs/source/document_dataset.mdx
================================================
# Create a document dataset

This guide will show you how to create a document dataset with `PdfFolder` and some metadata. This is a no-code solution for quickly creating a document dataset with several thousand PDFs.

> [!TIP]
> You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

## PdfFolder

The `PdfFolder` is a dataset builder designed to quickly load a document dataset with several thousand PDFs without requiring you to write any code.

> [!TIP]
> 💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `PdfFolder` creates dataset splits based on your dataset repository structure.

`PdfFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

```
folder/train/resume/0001.pdf
folder/train/resume/0002.pdf
folder/train/resume/0003.pdf

folder/train/invoice/0001.pdf
folder/train/invoice/0002.pdf
folder/train/invoice/0003.pdf
```

If the dataset follows the `PdfFolder` structure, then you can load it directly with [`load_dataset`]:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("path/to/folder")
```

This is equivalent to passing `pdffolder` manually in [`load_dataset`] and the directory in `data_dir`:

```py
>>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder")
```

You can also use `pdffolder` to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:

```
folder/train/resume/0001.pdf
folder/train/resume/0002.pdf
folder/test/invoice/0001.pdf
folder/test/invoice/0002.pdf
```
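
To make the layout convention concrete, the sketch below derives a split and a label from each relative path, assuming the `<split>/<label>/<file>.pdf` layout shown above. `infer_split_and_label` is a hypothetical helper for illustration only; it is not how `PdfFolder` resolves splits or labels internally.

```python
from pathlib import PurePosixPath


def infer_split_and_label(relative_path):
    """Return (split, label) for a path shaped like '<split>/<label>/<file>.pdf'."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) != 3:
        raise ValueError(f"expected '<split>/<label>/<file>', got {relative_path!r}")
    split, label, _file_name = parts
    return split, label


paths = [
    "train/resume/0001.pdf",
    "train/resume/0002.pdf",
    "test/invoice/0001.pdf",
    "test/invoice/0002.pdf",
]
for path in paths:
    print(path, "->", infer_split_and_label(path))
```

A path that sits at a different depth raises an error in this sketch, which mirrors the warning below about files that are not on the same level of the directory structure.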

> [!WARNING]
> If all PDF files are contained in a single directory or if they are not on the same level of directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.


If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.

```
folder/train/metadata.csv
folder/train/0001.pdf
folder/train/0002.pdf
folder/train/0003.pdf
```

Your `metadata.csv` file must have a `file_name` or `*_file_name` field which links PDF files with their metadata:

```csv
file_name,additional_feature
0001.pdf,This is a first value of a text feature you added to your PDFs
0002.pdf,This is a second value of a text feature you added to your PDFs
0003.pdf,This is a third value of a text feature you added to your PDFs
```

or using `metadata.jsonl`:

```jsonl
{"file_name": "0001.pdf", "additional_feature": "This is a first value of a text feature you added to your PDFs"}
{"file_name": "0002.pdf", "additional_feature": "This is a second value of a text feature you added to your PDFs"}
{"file_name": "0003.pdf", "additional_feature": "This is a third value of a text feature you added to your PDFs"}
```

Here the `file_name` must be the name of the PDF file next to the metadata file. More generally, it must be the relative path from the directory containing the metadata to the PDF file.
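
For more than a handful of files, the metadata file is easier to generate than to write by hand. Here is a minimal sketch using only the standard library. The `write_metadata` helper and the example feature values are assumptions made for illustration:

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(split_dir, features):
    """Write metadata.csv in split_dir; keys of `features` are paths relative to it."""
    split_dir = Path(split_dir)
    metadata_path = split_dir / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "additional_feature"])
        for file_name, value in sorted(features.items()):
            # file_name is relative to the directory that holds metadata.csv.
            writer.writerow([file_name, value])
    return metadata_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_metadata(tmp, {"0001.pdf": "first value", "0002.pdf": "second value"})
    print(path.read_text(encoding="utf-8"))
```

Using `csv.writer` rather than string concatenation keeps values that contain commas or quotes correctly escaped.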

It's possible to point to more than one PDF in each row of your dataset, for example if both your input and output are PDFs:

```jsonl
{"input_file_name": "0001.pdf", "output_file_name": "0001_output.pdf"}
{"input_file_name": "0002.pdf", "output_file_name": "0002_output.pdf"}
{"input_file_name": "0003.pdf", "output_file_name": "0003_output.pdf"}
```

You can also define lists of PDFs. In that case you need to name the field `file_names` or `*_file_names`. Here is an example:

```jsonl
{"pdfs_file_names": ["0001_part1.pdf", "0001_part2.pdf"], "label": "urgent"}
{"pdfs_file_names": ["0002_part1.pdf", "0002_part2.pdf"], "label": "urgent"}
{"pdfs_file_names": ["0003_part1.pdf", "0003_part2.pdf"], "label": "normal"}
```
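
A mistyped file name in the metadata is easy to miss. The sketch below assumes a hypothetical `missing_files` helper: it collects every value stored under a `file_name`, `*_file_name`, `file_names`, or `*_file_names` key and reports the paths that don't exist relative to the metadata file.

```python
import json
import tempfile
from pathlib import Path


def missing_files(metadata_path):
    """Return referenced paths that do not exist next to a metadata.jsonl file."""
    metadata_path = Path(metadata_path)
    root = metadata_path.parent
    missing = []
    with metadata_path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            for key, value in json.loads(line).items():
                if key == "file_names" or key.endswith("_file_names"):
                    names = value  # a list of relative paths
                elif key == "file_name" or key.endswith("_file_name"):
                    names = [value]  # a single relative path
                else:
                    continue  # ordinary metadata column
                missing.extend(name for name in names if not (root / name).is_file())
    return missing


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "0001_part1.pdf").touch()
    row = {"pdfs_file_names": ["0001_part1.pdf", "0001_part2.pdf"], "label": "urgent"}
    (root / "metadata.jsonl").write_text(json.dumps(row) + "\n", encoding="utf-8")
    print(missing_files(root / "metadata.jsonl"))  # ['0001_part2.pdf']
```

Running a check like this before loading or uploading catches broken references early.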

### OCR (Optical Character Recognition)

OCR datasets pair each PDF with the text it contains. An example `metadata.csv` may look like:

```csv
file_name,text
0001.pdf,Invoice 1234 from 01/01/1970...
0002.pdf,Software Engineer Resume. Education: ...
0003.pdf,Attention is all you need. Abstract. The ...
```

Load the dataset with `PdfFolder`, and it will create a `text` column with the text of each PDF:

```py
>>> dataset = load_dataset("pdffolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["text"]
"Invoice 1234 from 01/01/1970..."
```

### Upload dataset to the Hub

Once you've created a dataset, you can share it to the Hub using `huggingface_hub`, for example. Make sure you have the [huggingface_hub](https://huggingface.co/docs/huggingface_hub/index) library installed and you're logged in to your Hugging Face account (see the [Upload with Python tutorial](upload_dataset#upload-with-python) for more details).

Upload your dataset with `huggingface_hub.HfApi.upload_folder`:

```py
from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path="/path/to/local/dataset",
    repo_id="username/my-cool-dataset",
    repo_type="dataset",
)
```


================================================
FILE: docs/source/document_load.mdx
================================================
# Load pdf data

> [!WARNING]
> Pdf support is experimental and is subject to change.

Pdf datasets have [`Pdf`] type columns, which contain `pdfplumber` objects. 

> [!TIP]
> To work with pdf datasets, you need to have the `pdfplumber` package installed. Check out the [installation](https://github.com/jsvine/pdfplumber#installation) guide to learn how to install it.

When you load a pdf dataset and call the pdf column, the pdfs are decoded as `pdfplumber` PDF objects:

```py
>>> from datasets import load_dataset, Pdf

>>> dataset = load_dataset("path/to/pdf/folder", split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
```

> [!WARNING]
> Index into a pdf dataset using the row index first and then the `pdf` column - `dataset[0]["pdf"]` - to avoid creating all the pdf objects in the dataset. Otherwise, decoding can be slow if you have a large dataset.

For a guide on how to load any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./loading">general loading guide</a>.

## Read pages

Access pages directly from a pdf using the `.pages` attribute.

Then you can use the `pdfplumber` functions to read texts, tables and images, e.g.:
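
As a minimal sketch of reading text page by page, the helper below relies only on the `.pages` attribute and on `pdfplumber`'s `Page.extract_text()` method. The name `extract_text_by_page` is hypothetical, and the empty-string fallback for pages without extractable text is an assumption of this sketch:

```python
def extract_text_by_page(pdf):
    """Return one string per page of a pdfplumber PDF object."""
    texts = []
    for page in pdf.pages:
        # Pages without a text layer may yield an empty result; store "" for them.
        texts.append(page.extract_text() or "")
    return texts


# Usage, assuming `dataset` was loaded as shown earlier:
# texts = extract_text_by_page(dataset[0]["pdf"])
```

Indexing the row first (`dataset[0]["pdf"]`) keeps decoding limited to the one PDF being read.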


SYMBOL INDEX (3742 symbols across 178 files)

FILE: benchmarks/benchmark_array_xd.py
  function write (line 25) | def write(my_features, dummy_data, tmp_dir):
  function read_unformated (line 34) | def read_unformated(feats, tmp_dir):
  function read_formatted_as_numpy (line 43) | def read_formatted_as_numpy(feats, tmp_dir):
  function read_batch_unformated (line 53) | def read_batch_unformated(feats, tmp_dir):
  function read_batch_formatted_as_numpy (line 63) | def read_batch_formatted_as_numpy(feats, tmp_dir):
  function read_col_unformated (line 74) | def read_col_unformated(feats, tmp_dir):
  function read_col_formatted_as_numpy (line 83) | def read_col_formatted_as_numpy(feats, tmp_dir):
  function benchmark_array_xd (line 92) | def benchmark_array_xd():

FILE: benchmarks/benchmark_getitem_100B.py
  function generate_100B_dataset (line 19) | def generate_100B_dataset(num_examples: int, chunk_size: int) -> dataset...
  class RandIter (line 26) | class RandIter:
    method __post_init__ (line 32) | def __post_init__(self):
    method __iter__ (line 36) | def __iter__(self):
    method __len__ (line 39) | def __len__(self):
  function get_first_row (line 44) | def get_first_row(dataset: datasets.Dataset):
  function get_last_row (line 49) | def get_last_row(dataset: datasets.Dataset):
  function get_batch_of_1024_rows (line 54) | def get_batch_of_1024_rows(dataset: datasets.Dataset):
  function get_batch_of_1024_random_rows (line 59) | def get_batch_of_1024_random_rows(dataset: datasets.Dataset):
  function benchmark_table_100B (line 63) | def benchmark_table_100B():

FILE: benchmarks/benchmark_indices_mapping.py
  function select (line 16) | def select(dataset: datasets.Dataset):
  function sort (line 21) | def sort(dataset: datasets.Dataset):
  function shuffle (line 26) | def shuffle(dataset: datasets.Dataset):
  function train_test_split (line 31) | def train_test_split(dataset: datasets.Dataset):
  function shard (line 36) | def shard(dataset: datasets.Dataset, num_shards=10):
  function benchmark_indices_mapping (line 41) | def benchmark_indices_mapping():

FILE: benchmarks/benchmark_iterating.py
  function read (line 17) | def read(dataset: datasets.Dataset, length):
  function read_batch (line 23) | def read_batch(dataset: datasets.Dataset, length, batch_size):
  function read_formatted (line 29) | def read_formatted(dataset: datasets.Dataset, length, type):
  function read_formatted_batch (line 36) | def read_formatted_batch(dataset: datasets.Dataset, length, batch_size, ...
  function benchmark_iterating (line 42) | def benchmark_iterating():

FILE: benchmarks/benchmark_map_filter.py
  function map (line 18) | def map(dataset: datasets.Dataset, **kwargs):
  function filter (line 23) | def filter(dataset: datasets.Dataset, **kwargs):
  function benchmark_map_filter (line 27) | def benchmark_map_filter():

FILE: benchmarks/format.py
  function format_json_to_md (line 5) | def format_json_to_md(input_json_file, output_md_file):

FILE: benchmarks/utils.py
  function get_duration (line 10) | def get_duration(func):
  function generate_examples (line 22) | def generate_examples(features: dict, num_examples=100, seq_shapes=None):
  function generate_example_dataset (line 47) | def generate_example_dataset(dataset_path, features, num_examples=100, s...

FILE: src/datasets/arrow_dataset.py
  class DatasetInfoMixin (line 170) | class DatasetInfoMixin:
    method __init__ (line 175) | def __init__(self, info: DatasetInfo, split: Optional[NamedSplit]):
    method info (line 180) | def info(self):
    method split (line 185) | def split(self):
    method builder_name (line 190) | def builder_name(self) -> str:
    method citation (line 194) | def citation(self) -> str:
    method config_name (line 198) | def config_name(self) -> str:
    method dataset_size (line 202) | def dataset_size(self) -> Optional[int]:
    method description (line 206) | def description(self) -> str:
    method download_checksums (line 210) | def download_checksums(self) -> Optional[dict]:
    method download_size (line 214) | def download_size(self) -> Optional[int]:
    method features (line 218) | def features(self) -> Optional[Features]:
    method homepage (line 222) | def homepage(self) -> Optional[str]:
    method license (line 226) | def license(self) -> Optional[str]:
    method size_in_bytes (line 230) | def size_in_bytes(self) -> Optional[int]:
    method supervised_keys (line 234) | def supervised_keys(self):
    method version (line 238) | def version(self):
  class TensorflowDatasetMixin (line 242) | class TensorflowDatasetMixin:
    method _get_output_signature (line 246) | def _get_output_signature(
    method to_tf_dataset (line 343) | def to_tf_dataset(
  class DatasetTransformationNotAllowedError (line 551) | class DatasetTransformationNotAllowedError(Exception):
  function transmit_format (line 555) | def transmit_format(func):
  function update_metadata_with_features (line 598) | def update_metadata_with_features(table: Table, features: Features):
  function _check_table (line 614) | def _check_table(table) -> Table:
  function _check_column_names (line 626) | def _check_column_names(column_names: list[str]):
  function _check_valid_indices_value (line 634) | def _check_valid_indices_value(index, size):
  class NonExistentDatasetError (line 639) | class NonExistentDatasetError(Exception):
  class Column (line 645) | class Column(Sequence_):
    method __init__ (line 666) | def __init__(self, source: Union["Dataset", "Column"], column_name: str):
    method __iter__ (line 673) | def __iter__(self) -> Iterator[Any]:
    method __getitem__ (line 685) | def __getitem__(self, key: Union[int, str, list[int]]) -> Any:
    method __len__ (line 700) | def __len__(self) -> int:
    method __repr__ (line 703) | def __repr__(self):
    method __str__ (line 706) | def __str__(self):
    method __eq__ (line 709) | def __eq__(self, value):
  class Dataset (line 716) | class Dataset(DatasetInfoMixin, IndexableMixin, TensorflowDatasetMixin):
    method __init__ (line 719) | def __init__(
    method features (line 793) | def features(self) -> Features:
    method from_file (line 800) | def from_file(
    method from_buffer (line 840) | def from_buffer(
    method from_pandas (line 872) | def from_pandas(
    method from_polars (line 942) | def from_polars(
    method from_dict (line 990) | def from_dict(
    method from_list (line 1156) | def from_list(
    method from_csv (line 1290) | def from_csv(
    method from_generator (line 1345) | def from_generator(
    method from_json (line 1430) | def from_json(
    method from_parquet (line 1489) | def from_parquet(
    method from_text (line 1586) | def from_text(
    method from_spark (line 1650) | def from_spark(
    method from_sql (line 1714) | def from_sql(
    method __setstate__ (line 1770) | def __setstate__(self, state):
    method __del__ (line 1775) | def __del__(self):
    method __enter__ (line 1781) | def __enter__(self):
    method __exit__ (line 1784) | def __exit__(self, exc_type, exc_val, exc_tb):
    method save_to_disk (line 1788) | def save_to_disk(
    method _save_to_disk_single (line 1941) | def _save_to_disk_single(job_id: int, shard: "Dataset", fpath: str, st...
    method _build_local_temp_path (line 1968) | def _build_local_temp_path(uri_or_path: str) -> Path:
    method load_from_disk (line 1985) | def load_from_disk(
    method data (line 2106) | def data(self) -> Table:
    method cache_files (line 2126) | def cache_files(self) -> list[dict]:
    method num_columns (line 2144) | def num_columns(self) -> int:
    method num_rows (line 2159) | def num_rows(self) -> int:
    method column_names (line 2176) | def column_names(self) -> list[str]:
    method shape (line 2191) | def shape(self) -> tuple[int, int]:
    method unique (line 2207) | def unique(self, column: str) -> list:
    method class_encode_column (line 2238) | def class_encode_column(self, column: str, include_nulls: bool = False...
    method flatten (line 2314) | def flatten(self, new_fingerprint: Optional[str] = None, max_depth=16)...
    method cast (line 2360) | def cast(
    method cast_column (line 2445) | def cast_column(self, column: str, feature: FeatureType, new_fingerpri...
    method remove_columns (line 2489) | def remove_columns(self, column_names: Union[str, list[str]], new_fing...
    method rename_column (line 2543) | def rename_column(
    method rename_columns (line 2609) | def rename_columns(self, column_mapping: dict[str, str], new_fingerpri...
    method select_columns (line 2677) | def select_columns(self, column_names: Union[str, list[str]], new_fing...
    method _fast_select_column (line 2725) | def _fast_select_column(self, column_name: str) -> "Dataset":
    method __len__ (line 2731) | def __len__(self):
    method __iter__ (line 2748) | def __iter__(self):
    method iter (line 2777) | def iter(self, batch_size: int, drop_last_batch: bool = False):
    method __repr__ (line 2809) | def __repr__(self):
    method format (line 2813) | def format(self):
    method formatted_as (line 2822) | def formatted_as(
    method set_format (line 2854) | def set_format(
    method reset_format (line 2932) | def reset_format(self):
    method set_transform (line 2961) | def set_transform(
    method with_format (line 3004) | def with_format(
    method with_transform (line 3075) | def with_transform(
    method _getitem (line 3123) | def _getitem(self, key: Union[int, slice, str, ListLike[int]], **kwarg...
    method __getitem__ (line 3144) | def __getitem__(self, key: Union[int, slice, Iterable[int]]) -> dict: ...
    method __getitem__ (line 3148) | def __getitem__(self, key: str) -> list:  # noqa: F811
    method __getitem__ (line 3151) | def __getitem__(self, key):  # noqa: F811
    method __getitems__ (line 3158) | def __getitems__(self, keys: list) -> list:
    method cleanup_cache_files (line 3164) | def cleanup_cache_files(self) -> int:
    method _get_cache_file_path (line 3201) | def _get_cache_file_path(self, fingerprint):
    method map (line 3212) | def map(
    method _map_single (line 3668) | def _map_single(
    method batch (line 4063) | def batch(
    method filter (line 4116) | def filter(
    method flatten_indices (line 4262) | def flatten_indices(
    method _new_dataset_with_indices (line 4308) | def _new_dataset_with_indices(
    method select (line 4341) | def select(
    method _select_contiguous (line 4431) | def _select_contiguous(
    method _select_with_indices_mapping (line 4487) | def _select_with_indices_mapping(
    method skip (line 4592) | def skip(self, n: int) -> "Dataset":
    method repeat (line 4622) | def repeat(self, num_times: int) -> "Dataset":
    method take (line 4654) | def take(self, n: int) -> "Dataset":
    method sort (line 4679) | def sort(
    method shuffle (line 4809) | def shuffle(
    method train_test_split (line 4944) | def train_test_split(
    method shard (line 5220) | def shard(
    method to_csv (line 5297) | def to_csv(
    method to_dict (line 5356) | def to_dict(self, batch_size: Optional[int] = None, batched: bool = Fa...
    method to_list (line 5381) | def to_list(self) -> list:
    method to_json (line 5399) | def to_json(
    method to_pandas (line 5461) | def to_pandas(
    method to_polars (line 5500) | def to_polars(
    method to_parquet (line 5560) | def to_parquet(
    method to_sql (line 5600) | def to_sql(
    method _estimate_nbytes (line 5649) | def _estimate_nbytes(self) -> int:
    method _generate_tables_from_shards (line 5682) | def _generate_tables_from_shards(shards: list["Dataset"], batch_size: ...
    method _generate_tables_from_cache_file (line 5688) | def _generate_tables_from_cache_file(filename: str):
    method to_iterable_dataset (line 5692) | def to_iterable_dataset(self, num_shards: Optional[int] = 1) -> "Itera...
    method _push_parquet_shards_to_hub_single (line 5817) | def _push_parquet_shards_to_hub_single(
    method _push_parquet_shards_to_hub (line 5893) | def _push_parquet_shards_to_hub(
    method push_to_hub (line 6001) | def push_to_hub(
    method add_column (line 6197) | def add_column(
    method add_faiss_index (line 6250) | def add_faiss_index(
    method add_faiss_index_from_external_arrays (line 6330) | def add_faiss_index_from_external_arrays(
    method add_elasticsearch_index (line 6389) | def add_elasticsearch_index(
    method add_item (line 6459) | def add_item(self, item: dict, new_fingerprint: Optional[str] = None):
    method align_labels_with_mapping (line 6511) | def align_labels_with_mapping(self, label2id: dict, label_column: str)...
  function _push_to_repo (line 6589) | def _push_to_repo(
  function _push_to_bucket (line 6754) | def _push_to_bucket(
  function _get_updated_dataset_card (line 6826) | def _get_updated_dataset_card(
  function _concatenate_map_style_datasets (line 6959) | def _concatenate_map_style_datasets(
  function _interleave_map_style_datasets (line 7073) | def _interleave_map_style_datasets(
  function _split_by_node_map_style_dataset (line 7208) | def _split_by_node_map_style_dataset(dataset: Dataset, rank: int, world_...
  function get_indices_from_mask_function (line 7231) | def get_indices_from_mask_function(
  function async_get_indices_from_mask_function (line 7289) | async def async_get_indices_from_mask_function(

FILE: src/datasets/arrow_reader.py
  class DatasetNotOnHfGcsError (line 65) | class DatasetNotOnHfGcsError(ConnectionError):
  class MissingFilesOnHfGcsError (line 71) | class MissingFilesOnHfGcsError(ConnectionError):
  class FileInstructions (line 78) | class FileInstructions:
  function make_file_instructions (line 92) | def make_file_instructions(
  class BaseReader (line 167) | class BaseReader:
    method __init__ (line 172) | def __init__(self, path: str, info: Optional["DatasetInfo"]):
    method _get_table_from_filename (line 183) | def _get_table_from_filename(self, filename_skip_take, in_memory=False...
    method _read_files (line 187) | def _read_files(self, files, in_memory=False) -> Table:
    method get_file_instructions (line 219) | def get_file_instructions(self, name, instruction, split_infos):
    method read (line 227) | def read(
    method read_files (line 254) | def read_files(
  class ArrowReader (line 285) | class ArrowReader(BaseReader):
    method __init__ (line 291) | def __init__(self, path: str, info: Optional["DatasetInfo"]):
    method _get_table_from_filename (line 301) | def _get_table_from_filename(self, filename_skip_take, in_memory=False...
    method read_table (line 317) | def read_table(filename, in_memory=False) -> Table:
  class ParquetReader (line 332) | class ParquetReader(BaseReader):
    method __init__ (line 338) | def __init__(self, path: str, info: Optional["DatasetInfo"]):
    method _get_table_from_filename (line 348) | def _get_table_from_filename(self, filename_skip_take, **kwargs):
  class _AbsoluteInstruction (line 364) | class _AbsoluteInstruction:
  class _RelativeInstruction (line 373) | class _RelativeInstruction:
    method __post_init__ (line 382) | def __post_init__(self):
  function _str_to_read_instruction (line 397) | def _str_to_read_instruction(spec):
  function _pct_to_abs_pct1 (line 412) | def _pct_to_abs_pct1(boundary, num_examples):
  function _pct_to_abs_closest (line 423) | def _pct_to_abs_closest(boundary, num_examples):
  function _rel_to_abs_instr (line 427) | def _rel_to_abs_instr(rel_instr, name2len):
  class ReadInstruction (line 456) | class ReadInstruction:
    method _init (line 495) | def _init(self, relative_instructions):
    method _read_instruction_from_relative_instructions (line 500) | def _read_instruction_from_relative_instructions(cls, relative_instruc...
    method __init__ (line 507) | def __init__(self, split_name, rounding=None, from_=None, to=None, uni...
    method from_spec (line 537) | def from_spec(cls, spec):
    method to_spec (line 567) | def to_spec(self):
    method __add__ (line 587) | def __add__(self, other):
    method __str__ (line 602) | def __str__(self):
    method __repr__ (line 605) | def __repr__(self):
    method to_absolute (line 608) | def to_absolute(self, name2len):

FILE: src/datasets/arrow_writer.py
  function get_arrow_writer_batch_size_from_features (line 66) | def get_arrow_writer_batch_size_from_features(features: Optional[Feature...
  function get_writer_batch_size_from_features (line 105) | def get_writer_batch_size_from_features(features: Optional[Features]) ->...
  function get_writer_batch_size_from_data_size (line 145) | def get_writer_batch_size_from_data_size(num_rows: int, num_bytes: int) ...
  class SchemaInferenceError (line 169) | class SchemaInferenceError(ValueError):
  class TypedSequence (line 173) | class TypedSequence:
    method __init__ (line 213) | def __init__(
    method get_inferred_type (line 236) | def get_inferred_type(self) -> FeatureType:
    method _infer_custom_type_and_encode (line 252) | def _infer_custom_type_and_encode(data: Iterable) -> tuple[Iterable, O...
    method __arrow_array__ (line 290) | def __arrow_array__(self, type: Optional[pa.DataType] = None):
    method _arrow_array (line 296) | def _arrow_array(self, type: Optional[pa.DataType] = None):
  class OptimizedTypedSequence (line 462) | class OptimizedTypedSequence(TypedSequence):
    method __init__ (line 463) | def __init__(
  class ArrowWriter (line 487) | class ArrowWriter:
    method __init__ (line 490) | def __init__(
    method __len__ (line 552) | def __len__(self):
    method __enter__ (line 556) | def __enter__(self):
    method __exit__ (line 559) | def __exit__(self, exc_type, exc_val, exc_tb):
    method close (line 562) | def close(self):
    method _build_schema (line 572) | def _build_schema(self, inferred_schema: pa.Schema):
    method _build_writer (line 599) | def _build_writer(self, inferred_schema: pa.Schema):
    method schema (line 604) | def schema(self):
    method _build_metadata (line 615) | def _build_metadata(info: DatasetInfo, fingerprint: Optional[str] = No...
    method write_examples_on_file (line 624) | def write_examples_on_file(self):
    method write_rows_on_file (line 658) | def write_rows_on_file(self):
    method write (line 666) | def write(
    method write_row (line 684) | def write_row(self, row: pa.Table, writer_batch_size: Optional[int] = ...
    method write_batch (line 698) | def write_batch(
    method write_table (line 749) | def write_table(self, pa_table: pa.Table, writer_batch_size: Optional[...
    method finalize (line 767) | def finalize(self, close_stream=True):
  class ParquetWriter (line 789) | class ParquetWriter(ArrowWriter):
    method __init__ (line 790) | def __init__(self, *args, use_content_defined_chunking=True, write_pag...
    method _build_writer (line 797) | def _build_writer(self, inferred_schema: pa.Schema):

FILE: src/datasets/builder.py
  class InvalidConfigName (line 91) | class InvalidConfigName(ValueError):
  class BuilderConfig (line 96) | class BuilderConfig:
    method __post_init__ (line 121) | def __post_init__(self):
    method __eq__ (line 132) | def __eq__(self, o):
    method create_config_id (line 139) | def create_config_id(
    method _resolve_data_files (line 203) | def _resolve_data_files(self, base_path: str, download_config: Downloa...
  class DatasetBuilder (line 209) | class DatasetBuilder:
    method __init__ (line 300) | def __init__(
    method __getstate__ (line 417) | def __getstate__(self):
    method __setstate__ (line 420) | def __setstate__(self, d):
    method _check_legacy_cache (line 425) | def _check_legacy_cache(self) -> Optional[str]:
    method _check_legacy_cache2 (line 448) | def _check_legacy_cache2(self, dataset_module: "DatasetModule") -> Opt...
    method _create_builder_config (line 495) | def _create_builder_config(
    method builder_configs (line 589) | def builder_configs(cls) -> dict[str, BuilderConfig]:
    method cache_dir (line 598) | def cache_dir(self):
    method _use_legacy_cache_dir_if_possible (line 601) | def _use_legacy_cache_dir_if_possible(self, dataset_module: "DatasetMo...
    method _relative_data_dir (line 609) | def _relative_data_dir(self, with_version=True, with_hash=True) -> str:
    method _build_cache_dir (line 629) | def _build_cache_dir(self):
    method _info (line 664) | def _info(self) -> DatasetInfo:
    method get_imported_module_dir (line 676) | def get_imported_module_dir(cls):
    method _rename (line 680) | def _rename(self, src: str, dst: str):
    method download_and_prepare (line 683) | def download_and_prepare(
    method _download_and_prepare (line 906) | def _download_and_prepare(self, dl_manager, verification_mode, **prepa...
    method download_post_processing_resources (line 959) | def download_post_processing_resources(self, dl_manager):
    method _load_info (line 975) | def _load_info(self) -> DatasetInfo:
    method _save_info (line 978) | def _save_info(self):
    method _make_split_generators_kwargs (line 987) | def _make_split_generators_kwargs(self, prepare_split_kwargs):
    method as_dataset (line 992) | def as_dataset(
    method _build_single_dataset (line 1067) | def _build_single_dataset(
    method _as_dataset (line 1136) | def _as_dataset(self, split: Union[ReadInstruction, Split] = Split.TRA...
    method _get_dataset_fingerprint (line 1165) | def _get_dataset_fingerprint(self, split: Union[ReadInstruction, Split...
    method as_streaming_dataset (line 1173) | def as_streaming_dataset(
    method _as_streaming_dataset_single (line 1208) | def _as_streaming_dataset_single(
    method _post_process (line 1219) | def _post_process(self, dataset: Dataset, resources_paths: Mapping[str...
    method _post_processing_resources (line 1223) | def _post_processing_resources(self, split: str) -> dict[str, str]:
    method _download_post_processing_resources (line 1227) | def _download_post_processing_resources(
    method _split_generators (line 1234) | def _split_generators(self, dl_manager: Union[DownloadManager, Streami...
    method _prepare_split (line 1281) | def _prepare_split(
    method _get_examples_iterable_for_split (line 1310) | def _get_examples_iterable_for_split(self, split_generator: SplitGener...
  class Key (line 1321) | class Key:
    method __str__ (line 1325) | def __str__(self):
  class GeneratorBasedBuilder (line 1329) | class GeneratorBasedBuilder(DatasetBuilder):
    method _generate_shards (line 1338) | def _generate_shards(self, **kwargs) -> Iterator[Union[str, dict[str, ...
    method _generate_examples (line 1359) | def _generate_examples(self, **kwargs) -> Iterator[tuple[Key, dict[str...
    method _prepare_split (line 1389) | def _prepare_split(
    method _prepare_split_single (line 1547) | def _prepare_split_single(
    method _download_and_prepare (line 1632) | def _download_and_prepare(self, dl_manager, verification_mode, **prepa...
    method _get_examples_iterable_for_split (line 1639) | def _get_examples_iterable_for_split(self, split_generator: SplitGener...
  class ArrowBasedBuilder (line 1647) | class ArrowBasedBuilder(DatasetBuilder):
    method _generate_shards (line 1650) | def _generate_shards(self, **kwargs) -> Iterator[Union[str, dict[str, ...
    method _generate_tables (line 1671) | def _generate_tables(self, **kwargs) -> Iterator[tuple[Key, pa.Table]]:
    method _prepare_split (line 1691) | def _prepare_split(
    method _prepare_split_single (line 1848) | def _prepare_split_single(
    method _get_examples_iterable_for_split (line 1938) | def _get_examples_iterable_for_split(self, split_generator: SplitGener...
  class _CountableBuilderMixin (line 1946) | class _CountableBuilderMixin(DatasetBuilder):
    method _generate_num_examples (line 1948) | def _generate_num_examples(self, **kwargs) -> Iterator[int]:
    method count_examples (line 1951) | def count_examples(self, dl_manager: DownloadManager) -> dict[str, int]:
    method _count_examples (line 1956) | def _count_examples(self, split_generator: SplitGenerator) -> int:
    method _count_examples_single (line 1967) | def _count_examples_single(self, gen_kwargs: dict[str, Any]) -> int:

FILE: src/datasets/combine.py
  function interleave_datasets (line 18) | def interleave_datasets(
  function concatenate_datasets (line 168) | def concatenate_datasets(

FILE: src/datasets/commands/__init__.py
  class BaseDatasetsCLICommand (line 5) | class BaseDatasetsCLICommand(ABC):
    method register_subcommand (line 8) | def register_subcommand(parser: ArgumentParser):
    method run (line 12) | def run(self):

FILE: src/datasets/commands/datasets_cli.py
  function parse_unknown_args (line 10) | def parse_unknown_args(unknown_args):
  function main (line 14) | def main():

FILE: src/datasets/commands/delete_from_hub.py
  function _command_factory (line 8) | def _command_factory(args):
  class DeleteFromHubCommand (line 17) | class DeleteFromHubCommand(BaseDatasetsCLICommand):
    method register_subcommand (line 19) | def register_subcommand(parser):
    method __init__ (line 29) | def __init__(
    method run (line 41) | def run(self) -> None:

FILE: src/datasets/commands/env.py
  function info_command_factory (line 13) | def info_command_factory(_):
  class EnvironmentCommand (line 17) | class EnvironmentCommand(BaseDatasetsCLICommand):
    method register_subcommand (line 19) | def register_subcommand(parser: ArgumentParser):
    method run (line 23) | def run(self):
    method format_dict (line 40) | def format_dict(d):

FILE: src/datasets/commands/test.py
  function _test_command_factory (line 20) | def _test_command_factory(args):
  class TestCommand (line 35) | class TestCommand(BaseDatasetsCLICommand):
    method register_subcommand (line 39) | def register_subcommand(parser: ArgumentParser):
    method __init__ (line 75) | def __init__(
    method run (line 108) | def run(self):

FILE: src/datasets/data_files.py
  class Url (line 33) | class Url(str):
  class EmptyDatasetError (line 37) | class EmptyDatasetError(FileNotFoundError):
  function contains_wildcards (line 117) | def contains_wildcards(pattern: str) -> bool:
  function sanitize_patterns (line 121) | def sanitize_patterns(patterns: Union[dict, list, str]) -> dict[str, Uni...
  function _is_inside_unrequested_special_dir (line 162) | def _is_inside_unrequested_special_dir(matched_rel_path: str, pattern: s...
  function _is_unrequested_hidden_file_or_is_inside_unrequested_hidden_dir (line 195) | def _is_unrequested_hidden_file_or_is_inside_unrequested_hidden_dir(matc...
  function _get_data_files_patterns (line 257) | def _get_data_files_patterns(pattern_resolver: Callable[[str], list[str]...
  function resolve_pattern (line 301) | def resolve_pattern(
  function get_data_patterns (line 407) | def get_data_patterns(base_path: str, download_config: Optional[Download...
  function _get_single_origin_metadata (line 498) | def _get_single_origin_metadata(
  function _get_origin_metadata (line 522) | def _get_origin_metadata(
  class DataFilesList (line 551) | class DataFilesList(list[str]):
    method __init__ (line 569) | def __init__(self, data_files: list[str], origin_metadata: list[Single...
    method __add__ (line 573) | def __add__(self, other: "DataFilesList") -> "DataFilesList":
    method from_hf_repo (line 577) | def from_hf_repo(
    method from_local_or_remote (line 591) | def from_local_or_remote(
    method from_patterns (line 604) | def from_patterns(
    method filter (line 629) | def filter(
  class DataFilesDict (line 648) | class DataFilesDict(dict[str, DataFilesList]):
    method from_local_or_remote (line 665) | def from_local_or_remote(
    method from_hf_repo (line 687) | def from_hf_repo(
    method from_patterns (line 711) | def from_patterns(
    method filter (line 732) | def filter(
  class DataFilesPatternsList (line 741) | class DataFilesPatternsList(list[str]):
    method __init__ (line 748) | def __init__(
    method __add__ (line 756) | def __add__(self, other):
    method from_patterns (line 760) | def from_patterns(
    method resolve (line 765) | def resolve(
    method filter_extensions (line 788) | def filter_extensions(self, extensions: list[str]) -> "DataFilesPatter...
  class DataFilesPatternsDict (line 794) | class DataFilesPatternsDict(dict[str, DataFilesPatternsList]):
    method from_patterns (line 800) | def from_patterns(
    method resolve (line 815) | def resolve(
    method filter_extensions (line 825) | def filter_extensions(self, extensions: list[str]) -> "DataFilesPatter...

FILE: src/datasets/dataset_dict.py
  class bind (line 58) | class bind(partial):
    method __call__ (line 59) | def __call__(self, *fn_args, **fn_kwargs):
  class DatasetDict (line 63) | class DatasetDict(dict[Union[str, NamedSplit], "Dataset"]):
    method _check_values_type (line 66) | def _check_values_type(self):
    method _check_values_features (line 71) | def _check_values_features(self):
    method __enter__ (line 79) | def __enter__(self):
    method __exit__ (line 82) | def __exit__(self, exc_type, exc_val, exc_tb):
    method __getitem__ (line 90) | def __getitem__(self, k) -> Dataset:
    method data (line 105) | def data(self) -> dict[str, Table]:
    method cache_files (line 120) | def cache_files(self) -> dict[str, dict]:
    method num_columns (line 138) | def num_columns(self) -> dict[str, int]:
    method num_rows (line 154) | def num_rows(self) -> dict[str, int]:
    method column_names (line 170) | def column_names(self) -> dict[str, list[str]]:
    method shape (line 188) | def shape(self) -> dict[str, tuple[int]]:
    method flatten (line 203) | def flatten(self, max_depth=16) -> "DatasetDict":
    method unique (line 236) | def unique(self, column: str) -> dict[str, list]:
    method cleanup_cache_files (line 260) | def cleanup_cache_files(self) -> dict[str, int]:
    method __repr__ (line 279) | def __repr__(self):
    method cast (line 284) | def cast(self, features: Features) -> "DatasetDict":
    method cast_column (line 316) | def cast_column(self, column: str, feature) -> "DatasetDict":
    method remove_columns (line 345) | def remove_columns(self, column_names: Union[str, list[str]]) -> "Data...
    method rename_column (line 387) | def rename_column(self, original_column_name: str, new_column_name: st...
    method rename_columns (line 435) | def rename_columns(self, column_mapping: dict[str, str]) -> "DatasetDi...
    method select_columns (line 473) | def select_columns(self, column_names: Union[str, list[str]]) -> "Data...
    method class_encode_column (line 509) | def class_encode_column(self, column: str, include_nulls: bool = False...
    method formatted_as (line 542) | def formatted_as(
    method set_format (line 581) | def set_format(
    method reset_format (line 632) | def reset_format(self):
    method set_transform (line 664) | def set_transform(
    method with_format (line 693) | def with_format(
    method with_transform (line 770) | def with_transform(
    method map (line 824) | def map(
    method filter (line 996) | def filter(
    method flatten_indices (line 1109) | def flatten_indices(
    method sort (line 1161) | def sort(
    method shuffle (line 1228) | def shuffle(
    method save_to_disk (line 1311) | def save_to_disk(
    method load_from_disk (line 1386) | def load_from_disk(
    method from_csv (line 1446) | def from_csv(
    method from_json (line 1489) | def from_json(
    method from_parquet (line 1532) | def from_parquet(
    method from_text (line 1581) | def from_text(
    method align_labels_with_mapping (line 1624) | def align_labels_with_mapping(self, label2id: dict, label_column: str)...
    method push_to_hub (line 1633) | def push_to_hub(
  class IterableDatasetDict (line 1822) | class IterableDatasetDict(dict[Union[str, NamedSplit], IterableDataset]):
    method _check_values_type (line 1823) | def _check_values_type(self):
    method _check_values_features (line 1828) | def _check_values_features(self):
    method __repr__ (line 1836) | def __repr__(self):
    method num_columns (line 1842) | def num_columns(self) -> dict[str, Optional[int]]:
    method column_names (line 1859) | def column_names(self) -> dict[str, Optional[list[str]]]:
    method with_format (line 1877) | def with_format(
    method map (line 1923) | def map(
    method filter (line 2023) | def filter(
    method shuffle (line 2086) | def shuffle(
    method rename_column (line 2147) | def rename_column(self, original_column_name: str, new_column_name: st...
    method rename_columns (line 2183) | def rename_columns(self, column_mapping: dict[str, str]) -> "IterableD...
    method remove_columns (line 2211) | def remove_columns(self, column_names: Union[str, list[str]]) -> "Iter...
    method select_columns (line 2237) | def select_columns(self, column_names: Union[str, list[str]]) -> "Iter...
    method cast_column (line 2263) | def cast_column(self, column: str, feature: FeatureType) -> "IterableD...
    method cast (line 2294) | def cast(
    method push_to_hub (line 2331) | def push_to_hub(
  function _push_to_repo (line 2521) | def _push_to_repo(
  function _push_to_bucket (line 2696) | def _push_to_bucket(

FILE: src/datasets/distributed.py
  function split_dataset_by_node (line 10) | def split_dataset_by_node(dataset: DatasetType, rank: int, world_size: i...

FILE: src/datasets/download/download_config.py
  class DownloadConfig (line 10) | class DownloadConfig:
    method copy (line 72) | def copy(self) -> "DownloadConfig":
    method __setattr__ (line 75) | def __setattr__(self, name, value):

FILE: src/datasets/download/download_manager.py
  class DownloadMode (line 50) | class DownloadMode(enum.Enum):
  class DownloadManager (line 71) | class DownloadManager:
    method __init__ (line 74) | def __init__(
    method manual_dir (line 110) | def manual_dir(self):
    method downloaded_size (line 114) | def downloaded_size(self):
    method _record_sizes_checksums (line 118) | def _record_sizes_checksums(self, url_or_urls: NestedDataStructure, do...
    method download (line 131) | def download(self, url_or_urls):
    method _download_batched (line 181) | def _download_batched(
    method _download_single (line 224) | def _download_single(self, url_or_filename: str, download_config: Down...
    method iter_archive (line 234) | def iter_archive(self, path_or_buf: Union[str, io.BufferedReader]):
    method iter_files (line 259) | def iter_files(self, paths: Union[str, list[str]]):
    method extract (line 278) | def extract(self, path_or_paths):
    method download_and_extract (line 310) | def download_and_extract(self, url_or_urls):
    method get_recorded_sizes_checksums (line 328) | def get_recorded_sizes_checksums(self):
    method delete_extracted_files (line 331) | def delete_extracted_files(self):
    method manage_extracted_files (line 338) | def manage_extracted_files(self):

FILE: src/datasets/download/streaming_download_manager.py
  class StreamingDownloadManager (line 47) | class StreamingDownloadManager:
    method __init__ (line 57) | def __init__(
    method manual_dir (line 72) | def manual_dir(self):
    method download (line 75) | def download(self, url_or_urls):
    method _download_single (line 95) | def _download_single(self, urlpath: str) -> str:
    method extract (line 102) | def extract(self, url_or_urls):
    method _extract (line 124) | def _extract(self, urlpath: str) -> str:
    method download_and_extract (line 151) | def download_and_extract(self, url_or_urls):
    method iter_archive (line 171) | def iter_archive(self, urlpath_or_buf: Union[str, io.BufferedReader]) ...
    method iter_files (line 196) | def iter_files(self, urlpaths: Union[str, list[str]]) -> Iterable[str]:
    method manage_extracted_files (line 215) | def manage_extracted_files(self):
    method get_recorded_sizes_checksums (line 218) | def get_recorded_sizes_checksums(self):

FILE: src/datasets/exceptions.py
  class DatasetsError (line 12) | class DatasetsError(Exception):
  class DefunctDatasetError (line 16) | class DefunctDatasetError(DatasetsError):
  class FileNotFoundDatasetsError (line 20) | class FileNotFoundDatasetsError(DatasetsError, FileNotFoundError):
  class DataFilesNotFoundError (line 24) | class DataFilesNotFoundError(FileNotFoundDatasetsError):
  class DatasetNotFoundError (line 28) | class DatasetNotFoundError(FileNotFoundDatasetsError):
  class DatasetBuildError (line 37) | class DatasetBuildError(DatasetsError):
  class ManualDownloadError (line 41) | class ManualDownloadError(DatasetBuildError):
  class FileFormatError (line 45) | class FileFormatError(DatasetBuildError):
  class DatasetGenerationError (line 49) | class DatasetGenerationError(DatasetBuildError):
  class DatasetGenerationCastError (line 53) | class DatasetGenerationCastError(DatasetGenerationError):
    method from_cast_error (line 55) | def from_cast_error(
  class ChecksumVerificationError (line 90) | class ChecksumVerificationError(DatasetsError):
  class UnexpectedDownloadedFileError (line 94) | class UnexpectedDownloadedFileError(ChecksumVerificationError):
  class ExpectedMoreDownloadedFilesError (line 98) | class ExpectedMoreDownloadedFilesError(ChecksumVerificationError):
  class NonMatchingChecksumError (line 102) | class NonMatchingChecksumError(ChecksumVerificationError):
  class SplitsVerificationError (line 106) | class SplitsVerificationError(DatasetsError):
  class UnexpectedSplitsError (line 110) | class UnexpectedSplitsError(SplitsVerificationError):
  class ExpectedMoreSplitsError (line 114) | class ExpectedMoreSplitsError(SplitsVerificationError):
  class NonMatchingSplitsSizesError (line 118) | class NonMatchingSplitsSizesError(SplitsVerificationError):

FILE: src/datasets/features/_torchcodec.py
  class AudioDecoder (line 5) | class AudioDecoder(_AudioDecoder):
    method __getitem__ (line 6) | def __getitem__(self, key: str):

FILE: src/datasets/features/audio.py
  class Audio (line 24) | class Audio:
    method __call__ (line 93) | def __call__(self):
    method encode_example (line 96) | def encode_example(self, value: Union[str, bytes, bytearray, dict, "Au...
    method decode_example (line 164) | def decode_example(
    method flatten (line 223) | def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
    method cast_storage (line 234) | def cast_storage(self, storage: Union[pa.StringArray, pa.StructArray])...
    method embed_storage (line 280) | def embed_storage(self, storage: pa.StructArray, token_per_repo_id=Non...
  function encode_torchcodec_audio (line 321) | def encode_torchcodec_audio(audio: "AudioDecoder") -> dict:

FILE: src/datasets/features/features.py
  function _arrow_to_datasets_dtype (line 55) | def _arrow_to_datasets_dtype(arrow_type: pa.DataType) -> str:
  function string_to_arrow (line 125) | def string_to_arrow(datasets_dtype: str) -> pa.DataType:
  function _cast_to_python_objects (line 276) | def _cast_to_python_objects(obj: Any, only_1d_for_numpy: bool, optimize_...
  function cast_to_python_objects (line 467) | def cast_to_python_objects(obj: Any, only_1d_for_numpy=False, optimize_l...
  class Value (line 493) | class Value:
    method __post_init__ (line 547) | def __post_init__(self):
    method __call__ (line 554) | def __call__(self):
    method encode_example (line 557) | def encode_example(self, value):
    method __repr__ (line 573) | def __repr__(self):
  class _ArrayXD (line 577) | class _ArrayXD:
    method __post_init__ (line 578) | def __post_init__(self):
    method __call__ (line 581) | def __call__(self):
    method encode_example (line 585) | def encode_example(self, value):
  class Array2D (line 590) | class Array2D(_ArrayXD):
  class Array3D (line 615) | class Array3D(_ArrayXD):
  class Array4D (line 640) | class Array4D(_ArrayXD):
  class Array5D (line 665) | class Array5D(_ArrayXD):
  class _ArrayXDExtensionType (line 689) | class _ArrayXDExtensionType(pa.ExtensionType):
    method __init__ (line 692) | def __init__(self, shape: tuple, dtype: str):
    method __arrow_ext_serialize__ (line 705) | def __arrow_ext_serialize__(self):
    method __arrow_ext_deserialize__ (line 709) | def __arrow_ext_deserialize__(cls, storage_type, serialized):
    method __reduce__ (line 714) | def __reduce__(self):
    method __hash__ (line 717) | def __hash__(self):
    method __arrow_ext_class__ (line 720) | def __arrow_ext_class__(self):
    method _generate_dtype (line 723) | def _generate_dtype(self, dtype):
    method to_pandas_dtype (line 731) | def to_pandas_dtype(self):
  class Array2DExtensionType (line 735) | class Array2DExtensionType(_ArrayXDExtensionType):
  class Array3DExtensionType (line 739) | class Array3DExtensionType(_ArrayXDExtensionType):
  class Array4DExtensionType (line 743) | class Array4DExtensionType(_ArrayXDExtensionType):
  class Array5DExtensionType (line 747) | class Array5DExtensionType(_ArrayXDExtensionType):
  function _is_zero_copy_only (line 758) | def _is_zero_copy_only(pa_type: pa.DataType, unnest: bool = False) -> bool:
  class ArrayExtensionArray (line 780) | class ArrayExtensionArray(pa.ExtensionArray):
    method __array__ (line 781) | def __array__(self):
    method __getitem__ (line 785) | def __getitem__(self, i):
    method to_numpy (line 788) | def to_numpy(self, zero_copy_only=True):
    method to_pylist (line 832) | def to_pylist(self, maps_as_pydicts: Optional[Literal["lossy", "strict...
  class PandasArrayExtensionDtype (line 841) | class PandasArrayExtensionDtype(PandasExtensionDtype):
    method __init__ (line 844) | def __init__(self, value_type: Union["PandasArrayExtensionDtype", np.d...
    method __from_arrow__ (line 847) | def __from_arrow__(self, array: Union[pa.Array, pa.ChunkedArray]):
    method construct_array_type (line 855) | def construct_array_type(cls):
    method type (line 859) | def type(self) -> type:
    method kind (line 863) | def kind(self) -> str:
    method name (line 867) | def name(self) -> str:
    method value_type (line 871) | def value_type(self) -> np.dtype:
  class PandasArrayExtensionArray (line 875) | class PandasArrayExtensionArray(PandasExtensionArray):
    method __init__ (line 876) | def __init__(self, data: np.ndarray, copy: bool = False):
    method __array__ (line 880) | def __array__(self, dtype=None):
    method copy (line 900) | def copy(self, deep: bool = False) -> "PandasArrayExtensionArray":
    method _from_sequence (line 904) | def _from_sequence(
    method _concat_same_type (line 917) | def _concat_same_type(cls, to_concat: Sequence_["PandasArrayExtensionA...
    method dtype (line 929) | def dtype(self) -> PandasArrayExtensionDtype:
    method nbytes (line 933) | def nbytes(self) -> int:
    method isna (line 936) | def isna(self) -> np.ndarray:
    method __setitem__ (line 939) | def __setitem__(self, key: Union[int, slice, np.ndarray], value: Any) ...
    method __getitem__ (line 942) | def __getitem__(self, item: Union[int, slice, np.ndarray]) -> Union[np...
    method take (line 947) | def take(
    method __len__ (line 970) | def __len__(self) -> int:
    method __eq__ (line 973) | def __eq__(self, other) -> np.ndarray:
  function pandas_types_mapper (line 979) | def pandas_types_mapper(dtype):
  class ClassLabel (line 985) | class ClassLabel:
    method __post_init__ (line 1027) | def __post_init__(self, num_classes, names_file):
    method __call__ (line 1056) | def __call__(self):
    method str2int (line 1059) | def str2int(self, values: Union[str, Iterable]) -> Union[int, Iterable]:
    method _strval2int (line 1083) | def _strval2int(self, value: str) -> int:
    method int2str (line 1104) | def int2str(self, values: Union[int, Iterable]) -> Union[str, Iterable]:
    method encode_example (line 1134) | def encode_example(self, example_data):
    method cast_storage (line 1150) | def cast_storage(self, storage: Union[pa.StringArray, pa.IntegerArray]...
    method _load_names_from_file (line 1177) | def _load_names_from_file(names_filepath):
  class Json (line 1183) | class Json:
    method __call__ (line 1224) | def __call__(self):
    method encode_example (line 1227) | def encode_example(self, example_data):
    method decode_example (line 1237) | def decode_example(self, example_data, token_per_repo_id: Optional[dic...
    method cast_storage (line 1242) | def cast_storage(self, storage: Union[pa.Array]) -> pa.JsonArray:
  class Sequence (line 1266) | class Sequence:
    method __new__ (line 1286) | def __new__(cls, feature=None, length=-1, **kwargs):
  class List (line 1300) | class List(Sequence):
    method __repr__ (line 1320) | def __repr__(self):
  class LargeList (line 1328) | class LargeList:
    method __repr__ (line 1344) | def __repr__(self):
  function _check_non_null_non_empty_recursive (line 1370) | def _check_non_null_non_empty_recursive(obj, schema: Optional[FeatureTyp...
  function get_nested_type (line 1392) | def get_nested_type(schema: FeatureType) -> pa.DataType:
  function encode_nested_example (line 1425) | def encode_nested_example(schema, obj, level=0):
  function decode_nested_example (line 1465) | def decode_nested_example(schema, obj, token_per_repo_id: Optional[dict[...
  function register_feature (line 1533) | def register_feature(
  function generate_from_dict (line 1548) | def generate_from_dict(obj: Any):
  function generate_from_arrow_type (line 1586) | def generate_from_arrow_type(pa_type: pa.DataType) -> FeatureType:
  function numpy_to_pyarrow_listarray (line 1615) | def numpy_to_pyarrow_listarray(arr: np.ndarray, type: pa.DataType = None...
  function list_of_pa_arrays_to_pyarrow_listarray (line 1627) | def list_of_pa_arrays_to_pyarrow_listarray(l_arr: list[Optional[pa.Array...
  function list_of_np_array_to_pyarrow_listarray (line 1640) | def list_of_np_array_to_pyarrow_listarray(l_arr: list[np.ndarray], type:...
  function contains_any_np_array (line 1650) | def contains_any_np_array(data: Any):
  function any_np_array_to_pyarrow_listarray (line 1667) | def any_np_array_to_pyarrow_listarray(data: Union[np.ndarray, list], typ...
  function to_pyarrow_listarray (line 1683) | def to_pyarrow_listarray(data: Any, pa_type: _ArrayXDExtensionType) -> p...
  function _visit (line 1699) | def _visit(feature: FeatureType, func: Callable[[FeatureType], Optional[...
  function _visit_with_path (line 1723) | def _visit_with_path(
  function require_decoding (line 1755) | def require_decoding(feature: FeatureType, ignore_decode_attribute: bool...
  function require_storage_cast (line 1779) | def require_storage_cast(feature: FeatureType) -> bool:
  function require_storage_embed (line 1797) | def require_storage_embed(feature: FeatureType) -> bool:
  function keep_features_dicts_synced (line 1815) | def keep_features_dicts_synced(func):
  class Features (line 1837) | class Features(dict):
    method __init__ (line 1871) | def __init__(*args, **kwargs):
    method __reduce__ (line 1899) | def __reduce__(self):
    method type (line 1903) | def type(self):
    method arrow_schema (line 1913) | def arrow_schema(self):
    method from_arrow_schema (line 1924) | def from_arrow_schema(cls, pa_schema: pa.Schema) -> "Features":
    method from_dict (line 1958) | def from_dict(cls, dic) -> "Features":
    method to_dict (line 1986) | def to_dict(self):
    method _to_yaml_list (line 1989) | def _to_yaml_list(self) -> list:
    method _from_yaml_list (line 2064) | def _from_yaml_list(cls, yaml_data: list) -> "Features":
    method encode_example (line 2140) | def encode_example(self, example):
    method encode_column (line 2154) | def encode_column(self, column, column_name: str):
    method encode_batch (line 2170) | def encode_batch(self, batch):
    method decode_example (line 2189) | def decode_example(self, example: dict, token_per_repo_id: Optional[di...
    method decode_column (line 2212) | def decode_column(
    method decode_batch (line 2237) | def decode_batch(self, batch: dict, token_per_repo_id: Optional[dict[s...
    method copy (line 2264) | def copy(self) -> "Features":
    method reorder_fields_as (line 2284) | def reorder_fields_as(self, other: "Features") -> "Features":
    method flatten (line 2337) | def flatten(self, max_depth=16) -> "Features":
  function _is_null_feature (line 2381) | def _is_null_feature(feature) -> bool:
  function _align_features (line 2396) | def _align_features(features_list: list[Features]) -> list[Features]:
  function _check_if_features_can_be_aligned (line 2410) | def _check_if_features_can_be_aligned(features_list: list[Features]):
  function _fix_for_backward_compatible_features (line 2433) | def _fix_for_backward_compatible_features(feature: Any) -> FeatureType:

FILE: src/datasets/features/image.py
  class Image (line 47) | class Image:
    method __call__ (line 95) | def __call__(self):
    method encode_example (line 98) | def encode_example(self, value: Union[str, bytes, bytearray, dict, np....
    method decode_example (line 139) | def decode_example(self, value: dict, token_per_repo_id=None) -> "PIL....
    method flatten (line 200) | def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
    method cast_storage (line 213) | def cast_storage(self, storage: Union[pa.StringArray, pa.StructArray, ...
    method embed_storage (line 275) | def embed_storage(self, storage: pa.StructArray, token_per_repo_id=Non...
  function list_image_compression_formats (line 316) | def list_image_compression_formats() -> list[str]:
  function image_to_bytes (line 329) | def image_to_bytes(image: "PIL.Image.Image") -> bytes:
  function encode_pil_image (line 340) | def encode_pil_image(image: "PIL.Image.Image") -> dict:
  function encode_np_array (line 347) | def encode_np_array(array: np.ndarray) -> dict:
  function objects_to_list_of_image_dicts (line 390) | def objects_to_list_of_image_dicts(

FILE: src/datasets/features/nifti.py
  class Nifti1ImageWrapper (line 23) | class Nifti1ImageWrapper(nib.nifti1.Nifti1Image):
    method __init__ (line 28) | def __init__(self, nifti_image: nib.nifti1.Nifti1Image):
    method _repr_html_ (line 39) | def _repr_html_(self):
  class Nifti (line 64) | class Nifti:
    method __call__ (line 107) | def __call__(self):
    method encode_example (line 110) | def encode_example(self, value: Union[str, bytes, bytearray, dict, "ni...
    method decode_example (line 150) | def decode_example(self, value: dict, token_per_repo_id=None) -> "Nift...
    method embed_storage (line 213) | def embed_storage(self, storage: pa.StructArray, token_per_repo_id=Non...
    method flatten (line 253) | def flatten(self) -> Union["FeatureType", Dict[str, "FeatureType"]]:
    method cast_storage (line 266) | def cast_storage(self, storage: Union[pa.StringArray, pa.StructArray, ...
  function encode_nibabel_image (line 303) | def encode_nibabel_image(img: "nib.Nifti1Image", force_bytes: bool = Fal...

FILE: src/datasets/features/pdf.py
  function pdf_to_bytes (line 22) | def pdf_to_bytes(pdf: "pdfplumber.pdf.PDF") -> bytes:
  class Pdf (line 31) | class Pdf:
    method __call__ (line 75) | def __call__(self):
    method encode_example (line 78) | def encode_example(self, value: Union[str, bytes, bytearray, dict, "pd...
    method decode_example (line 113) | def decode_example(self, value: dict, token_per_repo_id=None) -> "pdfp...
    method flatten (line 171) | def flatten(self) -> Union["FeatureType", Dict[str, "FeatureType"]]:
    method cast_storage (line 184) | def cast_storage(self, storage: Union[pa.StringArray, pa.StructArray, ...
    method embed_storage (line 221) | def embed_storage(self, storage: pa.StructArray, token_per_repo_id=Non...
  function encode_pdfplumber_pdf (line 262) | def encode_pdfplumber_pdf(pdf: "pdfplumber.pdf.PDF") -> dict:

FILE: src/datasets/features/translation.py
  class Translation (line 12) | class Translation:
    method __call__ (line 41) | def __call__(self):
    method flatten (line 44) | def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
  class TranslationVariableLanguages (line 52) | class TranslationVariableLanguages:
    method __post_init__ (line 92) | def __post_init__(self):
    method __call__ (line 96) | def __call__(self):
    method encode_example (line 99) | def encode_example(self, translation_dict):
    method flatten (line 122) | def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:

FILE: src/datasets/features/video.py
  class Example (line 23) | class Example(TypedDict):
  class Video (line 29) | class Video:
    method __call__ (line 102) | def __call__(self):
    method encode_example (line 105) | def encode_example(self, value: Union[str, bytes, bytearray, Example, ...
    method decode_example (line 153) | def decode_example(
    method flatten (line 226) | def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
    method cast_storage (line 239) | def cast_storage(self, storage: Union[pa.StringArray, pa.StructArray, ...
    method embed_storage (line 291) | def embed_storage(self, storage: pa.StructArray, token_per_repo_id=Non...
  function video_to_bytes (line 332) | def video_to_bytes(video: "VideoDecoder") -> bytes:
  function encode_torchcodec_video (line 337) | def encode_torchcodec_video(video: "VideoDecoder") -> Example:
  function encode_np_array (line 346) | def encode_np_array(array: np.ndarray) -> Example:
  function hf_video_reader (line 355) | def hf_video_reader(

FILE: src/datasets/filesystems/__init__.py
  function is_remote_filesystem (line 28) | def is_remote_filesystem(fs: fsspec.AbstractFileSystem) -> bool:
  function rename (line 39) | def rename(fs: fsspec.AbstractFileSystem, src: str, dst: str):

FILE: src/datasets/filesystems/compression.py
  class BaseCompressedFileFileSystem (line 9) | class BaseCompressedFileFileSystem(AbstractArchiveFileSystem):
    method __init__ (line 19) | def __init__(
    method _strip_protocol (line 60) | def _strip_protocol(cls, path):
    method _get_dirs (line 64) | def _get_dirs(self):
    method cat (line 69) | def cat(self, path: str):
    method _open (line 73) | def _open(
  class Bz2FileSystem (line 88) | class Bz2FileSystem(BaseCompressedFileFileSystem):
  class GzipFileSystem (line 96) | class GzipFileSystem(BaseCompressedFileFileSystem):
  class Lz4FileSystem (line 104) | class Lz4FileSystem(BaseCompressedFileFileSystem):
  class XzFileSystem (line 112) | class XzFileSystem(BaseCompressedFileFileSystem):
  class ZstdFileSystem (line 120) | class ZstdFileSystem(BaseCompressedFileFileSystem):

FILE: src/datasets/fingerprint.py
  class _TempCacheDir (line 45) | class _TempCacheDir:
    method __init__ (line 51) | def __init__(self):
    method _cleanup (line 81) | def _cleanup(self):
    method cleanup (line 92) | def cleanup(self):
  function maybe_register_dataset_for_temp_dir_deletion (line 97) | def maybe_register_dataset_for_temp_dir_deletion(dataset):
  function get_datasets_with_cache_file_in_temp_dir (line 116) | def get_datasets_with_cache_file_in_temp_dir():
  function enable_caching (line 120) | def enable_caching():
  function disable_caching (line 141) | def disable_caching():
  function is_caching_enabled (line 162) | def is_caching_enabled() -> bool:
  function get_temporary_cache_files_directory (line 183) | def get_temporary_cache_files_directory() -> str:
  class Hasher (line 196) | class Hasher:
    method __init__ (line 201) | def __init__(self):
    method hash_bytes (line 205) | def hash_bytes(cls, value: Union[bytes, list[bytes]]) -> str:
    method hash (line 213) | def hash(cls, value: Any) -> str:
    method update (line 216) | def update(self, value: Any) -> None:
    method hexdigest (line 222) | def hexdigest(self) -> str:
  function generate_fingerprint (line 235) | def generate_fingerprint(dataset: "Dataset") -> str:
  function generate_random_fingerprint (line 249) | def generate_random_fingerprint(nbits: int = 64) -> str:
  function update_fingerprint (line 253) | def update_fingerprint(fingerprint, transform, transform_args):
  function validate_fingerprint (line 303) | def validate_fingerprint(fingerprint: str, max_length=64):
  function format_transform_for_fingerprint (line 323) | def format_transform_for_fingerprint(func: Callable, version: Optional[s...
  function format_kwargs_for_fingerprint (line 333) | def format_kwargs_for_fingerprint(
  function fingerprint_transform (line 378) | def fingerprint_transform(

FILE: src/datasets/formatting/__init__.py
  function _register_formatter (line 40) | def _register_formatter(
  function _register_unavailable_formatter (line 63) | def _register_unavailable_formatter(
  function get_format_type_from_alias (line 115) | def get_format_type_from_alias(format_type: Optional[str]) -> Optional[s...
  function get_formatter (line 123) | def get_formatter(format_type: Optional[str], **format_kwargs) -> Format...

FILE: src/datasets/formatting/formatting.py
  function _is_range_contiguous (line 40) | def _is_range_contiguous(key: range) -> bool:
  function _raise_bad_key_type (line 44) | def _raise_bad_key_type(key: Any):
  function _query_table_with_indices_mapping (line 50) | def _query_table_with_indices_mapping(
  function _query_table (line 80) | def _query_table(table: Table, key: Union[int, slice, range, str, Iterab...
  function _is_array_with_nulls (line 105) | def _is_array_with_nulls(pa_array: pa.Array) -> bool:
  class BaseArrowExtractor (line 109) | class BaseArrowExtractor(Generic[RowFormat, ColumnFormat, BatchFormat]):
    method extract_row (line 116) | def extract_row(self, pa_table: pa.Table) -> RowFormat:
    method extract_column (line 119) | def extract_column(self, pa_table: pa.Table) -> ColumnFormat:
    method extract_batch (line 122) | def extract_batch(self, pa_table: pa.Table) -> BatchFormat:
  function _unnest (line 126) | def _unnest(py_dict: dict[str, list[T]]) -> dict[str, T]:
  class SimpleArrowExtractor (line 131) | class SimpleArrowExtractor(BaseArrowExtractor[pa.Table, pa.Array, pa.Tab...
    method extract_row (line 132) | def extract_row(self, pa_table: pa.Table) -> pa.Table:
    method extract_column (line 135) | def extract_column(self, pa_table: pa.Table) -> pa.Array:
    method extract_batch (line 138) | def extract_batch(self, pa_table: pa.Table) -> pa.Table:
  class PythonArrowExtractor (line 142) | class PythonArrowExtractor(BaseArrowExtractor[dict, list, dict]):
    method extract_row (line 143) | def extract_row(self, pa_table: pa.Table) -> dict:
    method extract_column (line 146) | def extract_column(self, pa_table: pa.Table) -> list:
    method extract_batch (line 149) | def extract_batch(self, pa_table: pa.Table) -> dict:
  class NumpyArrowExtractor (line 153) | class NumpyArrowExtractor(BaseArrowExtractor[dict, np.ndarray, dict]):
    method __init__ (line 154) | def __init__(self, **np_array_kwargs):
    method extract_row (line 157) | def extract_row(self, pa_table: pa.Table) -> dict:
    method extract_column (line 160) | def extract_column(self, pa_table: pa.Table) -> np.ndarray:
    method extract_batch (line 163) | def extract_batch(self, pa_table: pa.Table) -> dict:
    method _arrow_array_to_numpy (line 166) | def _arrow_array_to_numpy(self, pa_array: pa.Array) -> np.ndarray:
  class PandasArrowExtractor (line 205) | class PandasArrowExtractor(BaseArrowExtractor[pd.DataFrame, pd.Series, p...
    method extract_row (line 206) | def extract_row(self, pa_table: pa.Table) -> pd.DataFrame:
    method extract_column (line 209) | def extract_column(self, pa_table: pa.Table) -> pd.Series:
    method extract_batch (line 212) | def extract_batch(self, pa_table: pa.Table) -> pd.DataFrame:
  class PythonFeaturesDecoder (line 216) | class PythonFeaturesDecoder:
    method __init__ (line 217) | def __init__(
    method decode_row (line 223) | def decode_row(self, row: dict) -> dict:
    method decode_column (line 226) | def decode_column(self, column: list, column_name: str) -> list:
    method decode_batch (line 233) | def decode_batch(self, batch: dict) -> dict:
  class PandasFeaturesDecoder (line 237) | class PandasFeaturesDecoder:
    method __init__ (line 238) | def __init__(self, features: Optional[Features]):
    method decode_row (line 241) | def decode_row(self, row: pd.DataFrame) -> pd.DataFrame:
    method decode_column (line 255) | def decode_column(self, column: pd.Series, column_name: str) -> pd.Ser...
    method decode_batch (line 265) | def decode_batch(self, batch: pd.DataFrame) -> pd.DataFrame:
  class LazyDict (line 269) | class LazyDict(MutableMapping):
    method __init__ (line 272) | def __init__(self, pa_table: pa.Table, formatter: "Formatter"):
    method __len__ (line 279) | def __len__(self):
    method __getitem__ (line 282) | def __getitem__(self, key):
    method __setitem__ (line 290) | def __setitem__(self, key, value):
    method __delitem__ (line 295) | def __delitem__(self, key) -> None:
    method __iter__ (line 300) | def __iter__(self):
    method __contains__ (line 303) | def __contains__(self, key):
    method __repr__ (line 306) | def __repr__(self):
    method __or__ (line 310) | def __or__(self, other):
    method __ror__ (line 325) | def __ror__(self, other):
    method __ior__ (line 340) | def __ior__(self, other):
    method __copy__ (line 351) | def __copy__(self):
    method copy (line 360) | def copy(self):
    method fromkeys (line 366) | def fromkeys(cls, iterable, value=None):
    method format (line 369) | def format(self, key):
    method _format_all (line 372) | def _format_all(self):
  class LazyRow (line 378) | class LazyRow(LazyDict):
    method format (line 379) | def format(self, key):
  class LazyBatch (line 383) | class LazyBatch(LazyDict):
    method format (line 384) | def format(self, key):
  class Formatter (line 388) | class Formatter(Generic[RowFormat, ColumnFormat, BatchFormat]):
    method __init__ (line 399) | def __init__(
    method __call__ (line 409) | def __call__(self, pa_table: pa.Table, query_type: str) -> Union[RowFo...
    method format_row (line 417) | def format_row(self, pa_table: pa.Table) -> RowFormat:
    method format_column (line 420) | def format_column(self, pa_table: pa.Table) -> ColumnFormat:
    method format_batch (line 423) | def format_batch(self, pa_table: pa.Table) -> BatchFormat:
  class TensorFormatter (line 427) | class TensorFormatter(Formatter[RowFormat, ColumnFormat, BatchFormat]):
    method recursive_tensorize (line 428) | def recursive_tensorize(self, data_struct: dict):
  class TableFormatter (line 432) | class TableFormatter(Formatter[RowFormat, ColumnFormat, BatchFormat]):
  class ArrowFormatter (line 437) | class ArrowFormatter(TableFormatter[pa.Table, pa.Array, pa.Table]):
    method format_row (line 441) | def format_row(self, pa_table: pa.Table) -> pa.Table:
    method format_column (line 444) | def format_column(self, pa_table: pa.Table) -> pa.Array:
    method format_batch (line 447) | def format_batch(self, pa_table: pa.Table) -> pa.Table:
  class PythonFormatter (line 451) | class PythonFormatter(Formatter[Mapping, list, Mapping]):
    method __init__ (line 452) | def __init__(self, features=None, lazy=False, token_per_repo_id=None):
    method format_row (line 456) | def format_row(self, pa_table: pa.Table) -> Mapping:
    method format_column (line 463) | def format_column(self, pa_table: pa.Table) -> list:
    method format_batch (line 468) | def format_batch(self, pa_table: pa.Table) -> Mapping:
  class PandasFormatter (line 476) | class PandasFormatter(TableFormatter[pd.DataFrame, pd.Series, pd.DataFra...
    method format_row (line 480) | def format_row(self, pa_table: pa.Table) -> pd.DataFrame:
    method format_column (line 485) | def format_column(self, pa_table: pa.Table) -> pd.Series:
    method format_batch (line 490) | def format_batch(self, pa_table: pa.Table) -> pd.DataFrame:
  class CustomFormatter (line 496) | class CustomFormatter(Formatter[dict, ColumnFormat, dict]):
    method __init__ (line 506) | def __init__(self, transform: Callable[[dict], dict], features=None, t...
    method format_row (line 510) | def format_row(self, pa_table: pa.Table) -> dict:
    method format_column (line 519) | def format_column(self, pa_table: pa.Table) -> ColumnFormat:
    method format_batch (line 538) | def format_batch(self, pa_table: pa.Table) -> dict:
  function _check_valid_column_key (line 544) | def _check_valid_column_key(key: str, columns: list[str]) -> None:
  function _check_valid_index_key (line 549) | def _check_valid_index_key(key: Union[int, slice, range, Iterable], size...
  function key_to_query_type (line 568) | def key_to_query_type(key: Union[int, slice, range, str, Iterable]) -> str:
  function query_table (line 578) | def query_table(
  function format_table (line 621) | def format_table(

FILE: src/datasets/formatting/jax_formatter.py
  class JaxFormatter (line 38) | class JaxFormatter(TensorFormatter[Mapping, "jax.Array", Mapping]):
    method __init__ (line 39) | def __init__(self, features=None, device=None, token_per_repo_id=None,...
    method _map_devices_to_str (line 67) | def _map_devices_to_str() -> dict[str, "jaxlib.xla_extension.Device"]:
    method _consolidate (line 72) | def _consolidate(self, column):
    method _tensorize (line 83) | def _tensorize(self, value):
    method _recursive_tensorize (line 131) | def _recursive_tensorize(self, data_struct):
    method recursive_tensorize (line 150) | def recursive_tensorize(self, data_struct: dict):
    method format_row (line 153) | def format_row(self, pa_table: pa.Table) -> Mapping:
    method format_column (line 158) | def format_column(self, pa_table: pa.Table) -> "jax.Array":
    method format_batch (line 165) | def format_batch(self, pa_table: pa.Table) -> Mapping:

FILE: src/datasets/formatting/np_formatter.py
  class NumpyFormatter (line 26) | class NumpyFormatter(TensorFormatter[Mapping, np.ndarray, Mapping]):
    method __init__ (line 27) | def __init__(self, features=None, token_per_repo_id=None, **np_array_k...
    method _consolidate (line 31) | def _consolidate(self, column):
    method _tensorize (line 46) | def _tensorize(self, value):
    method _recursive_tensorize (line 79) | def _recursive_tensorize(self, data_struct):
    method recursive_tensorize (line 96) | def recursive_tensorize(self, data_struct: dict):
    method format_row (line 99) | def format_row(self, pa_table: pa.Table) -> Mapping:
    method format_column (line 104) | def format_column(self, pa_table: pa.Table) -> np.ndarray:
    method format_batch (line 111) | def format_batch(self, pa_table: pa.Table) -> Mapping:

FILE: src/datasets/formatting/polars_formatter.py
  class PolarsArrowExtractor (line 32) | class PolarsArrowExtractor(BaseArrowExtractor["pl.DataFrame", "pl.Series...
    method extract_row (line 33) | def extract_row(self, pa_table: pa.Table) -> "pl.DataFrame":
    method extract_column (line 44) | def extract_column(self, pa_table: pa.Table) -> "pl.Series":
    method extract_batch (line 55) | def extract_batch(self, pa_table: pa.Table) -> "pl.DataFrame":
  class PolarsFeaturesDecoder (line 67) | class PolarsFeaturesDecoder:
    method __init__ (line 68) | def __init__(self, features: Optional[Features]):
    method decode_row (line 72) | def decode_row(self, row: "pl.DataFrame") -> "pl.DataFrame":
    method decode_column (line 86) | def decode_column(self, column: "pl.Series", column_name: str) -> "pl....
    method decode_batch (line 96) | def decode_batch(self, batch: "pl.DataFrame") -> "pl.DataFrame":
  class PolarsFormatter (line 100) | class PolarsFormatter(TableFormatter["pl.DataFrame", "pl.Series", "pl.Da...
    method __init__ (line 104) | def __init__(self, features=None, **np_array_kwargs):
    method format_row (line 111) | def format_row(self, pa_table: pa.Table) -> "pl.DataFrame":
    method format_column (line 116) | def format_column(self, pa_table: pa.Table) -> "pl.Series":
    method format_batch (line 121) | def format_batch(self, pa_table: pa.Table) -> "pl.DataFrame":

FILE: src/datasets/formatting/tf_formatter.py
  class TFFormatter (line 32) | class TFFormatter(TensorFormatter[Mapping, "tf.Tensor", Mapping]):
    method __init__ (line 33) | def __init__(self, features=None, token_per_repo_id=None, **tf_tensor_...
    method _consolidate (line 38) | def _consolidate(self, column):
    method _tensorize (line 55) | def _tensorize(self, value):
    method _recursive_tensorize (line 86) | def _recursive_tensorize(self, data_struct):
    method recursive_tensorize (line 105) | def recursive_tensorize(self, data_struct: dict):
    method format_row (line 108) | def format_row(self, pa_table: pa.Table) -> Mapping:
    method format_column (line 113) | def format_column(self, pa_table: pa.Table) -> "tf.Tensor":
    method format_batch (line 120) | def format_batch(self, pa_table: pa.Table) -> Mapping:

FILE: src/datasets/formatting/torch_formatter.py
  class TorchFormatter (line 32) | class TorchFormatter(TensorFormatter[Mapping, "torch.Tensor", Mapping]):
    method __init__ (line 33) | def __init__(self, features=None, token_per_repo_id=None, **torch_tens...
    method _consolidate (line 38) | def _consolidate(self, column):
    method _tensorize (line 49) | def _tensorize(self, value):
    method _recursive_tensorize (line 92) | def _recursive_tensorize(self, data_struct):
    method recursive_tensorize (line 106) | def recursive_tensorize(self, data_struct: dict):
    method format_row (line 109) | def format_row(self, pa_table: pa.Table) -> Mapping:
    method format_column (line 114) | def format_column(self, pa_table: pa.Table) -> "torch.Tensor":
    method format_batch (line 121) | def format_batch(self, pa_table: pa.Table) -> Mapping:

FILE: src/datasets/hub.py
  function delete_from_hub (line 20) | def delete_from_hub(
  function _delete_files (line 92) | def _delete_files(dataset_id, revision=None, token=None):

FILE: src/datasets/info.py
  class SupervisedKeysData (line 56) | class SupervisedKeysData:
  class DownloadChecksumsEntryData (line 62) | class DownloadChecksumsEntryData:
  class MissingCachedSizesConfigError (line 67) | class MissingCachedSizesConfigError(Exception):
  class NonMatchingCachedSizesError (line 71) | class NonMatchingCachedSizesError(Exception):
  class PostProcessedInfo (line 76) | class PostProcessedInfo:
    method __post_init__ (line 80) | def __post_init__(self):
    method from_dict (line 86) | def from_dict(cls, post_processed_info_dict: dict) -> "PostProcessedIn...
  class DatasetInfo (line 92) | class DatasetInfo:
    method __post_init__ (line 167) | def __post_init__(self):
    method write_to_directory (line 186) | def write_to_directory(self, dataset_info_dir, pretty_print=False, sto...
    method _dump_info (line 215) | def _dump_info(self, file, pretty_print=False):
    method _dump_license (line 219) | def _dump_license(self, file):
    method from_merge (line 224) | def from_merge(cls, dataset_infos: list["DatasetInfo"]):
    method from_directory (line 248) | def from_directory(cls, dataset_info_dir: str, storage_options: Option...
    method from_dict (line 282) | def from_dict(cls, dataset_info_dict: dict) -> "DatasetInfo":
    method update (line 286) | def update(self, other_dataset_info: "DatasetInfo", ignore_none=True):
    method copy (line 296) | def copy(self) -> "DatasetInfo":
    method _to_yaml_dict (line 299) | def _to_yaml_dict(self) -> dict:
    method _from_yaml_dict (line 314) | def _from_yaml_dict(cls, yaml_data: dict) -> "DatasetInfo":
  class DatasetInfosDict (line 324) | class DatasetInfosDict(dict[str, DatasetInfo]):
    method write_to_directory (line 325) | def write_to_directory(self, dataset_infos_dir, overwrite=False, prett...
    method from_directory (line 354) | def from_directory(cls, dataset_infos_dir) -> "DatasetInfosDict":
    method from_dataset_card_data (line 374) | def from_dataset_card_data(cls, dataset_card_data: DatasetCardData) ->...
    method to_dataset_card_data (line 392) | def to_dataset_card_data(self, dataset_card_data: DatasetCardData) -> ...

FILE: src/datasets/inspect.py
  class SplitsNotFoundError (line 38) | class SplitsNotFoundError(ValueError):
  function get_dataset_infos (line 42) | def get_dataset_infos(
  function get_dataset_config_names (line 109) | def get_dataset_config_names(
  function get_dataset_default_config_name (line 175) | def get_dataset_default_config_name(
  function get_dataset_config_info (line 237) | def get_dataset_config_info(
  function get_dataset_split_names (line 295) | def get_dataset_split_names(

FILE: src/datasets/io/abc.py
  class AbstractDatasetReader (line 8) | class AbstractDatasetReader(ABC):
    method __init__ (line 9) | def __init__(
    method read (line 30) | def read(self) -> Union[Dataset, DatasetDict, IterableDataset, Iterabl...
  class AbstractDatasetInputStream (line 34) | class AbstractDatasetInputStream(ABC):
    method __init__ (line 35) | def __init__(
    method read (line 52) | def read(self) -> Union[Dataset, IterableDataset]:

FILE: src/datasets/io/csv.py
  class CsvDatasetReader (line 15) | class CsvDatasetReader(AbstractDatasetReader):
    method __init__ (line 16) | def __init__(
    method read (line 45) | def read(self):
  class CsvDatasetWriter (line 69) | class CsvDatasetWriter:
    method __init__ (line 70) | def __init__(
    method write (line 90) | def write(self) -> int:
    method _batch_csv (line 102) | def _batch_csv(self, args):
    method _write (line 115) | def _write(self, file_obj: BinaryIO, header, index, **to_csv_kwargs) -...

FILE: src/datasets/io/generator.py
  class GeneratorDatasetInputStream (line 8) | class GeneratorDatasetInputStream(AbstractDatasetInputStream):
    method __init__ (line 9) | def __init__(
    method read (line 41) | def read(self):

FILE: src/datasets/io/json.py
  class JsonDatasetReader (line 15) | class JsonDatasetReader(AbstractDatasetReader):
    method __init__ (line 16) | def __init__(
    method read (line 48) | def read(self):
  class JsonDatasetWriter (line 72) | class JsonDatasetWriter:
    method __init__ (line 73) | def __init__(
    method write (line 93) | def write(self) -> int:
    method _batch_json (line 126) | def _batch_json(self, args):
    method _write (line 139) | def _write(

FILE: src/datasets/io/parquet.py
  class ParquetDatasetReader (line 19) | class ParquetDatasetReader(AbstractDatasetReader):
    method __init__ (line 20) | def __init__(
    method read (line 51) | def read(self):
  class ParquetDatasetWriter (line 75) | class ParquetDatasetWriter:
    method __init__ (line 76) | def __init__(
    method write (line 100) | def write(self) -> int:
    method _write (line 116) | def _write(self, file_obj: BinaryIO, batch_size: int, **parquet_writer...

FILE: src/datasets/io/spark.py
  class SparkDatasetReader (line 11) | class SparkDatasetReader(AbstractDatasetReader):
    method __init__ (line 18) | def __init__(
    method read (line 49) | def read(self):

FILE: src/datasets/io/sql.py
  class SqlDatasetReader (line 17) | class SqlDatasetReader(AbstractDatasetInputStream):
    method __init__ (line 18) | def __init__(
    method read (line 36) | def read(self):
  class SqlDatasetWriter (line 56) | class SqlDatasetWriter:
    method __init__ (line 57) | def __init__(
    method write (line 76) | def write(self) -> int:
    method _batch_sql (line 84) | def _batch_sql(self, args):
    method _write (line 96) | def _write(self, index, **to_sql_kwargs) -> int:

FILE: src/datasets/io/text.py
  class TextDatasetReader (line 9) | class TextDatasetReader(AbstractDatasetReader):
    method __init__ (line 10) | def __init__(
    method read (line 39) | def read(self):

FILE: src/datasets/iterable_dataset.py
  function identity_func (line 96) | def identity_func(x):
  function _rename_columns_fn (line 100) | def _rename_columns_fn(example: dict, column_mapping: dict[str, str]):
  function add_column_fn (line 115) | def add_column_fn(example: dict, idx: int, name: str, column: list[dict]):
  function _infer_features_from_batch (line 121) | def _infer_features_from_batch(batch: dict[str, list], try_features: Opt...
  function _examples_to_batch (line 131) | def _examples_to_batch(examples: list[dict[str, Any]]) -> dict[str, list]:
  function _batch_to_examples (line 140) | def _batch_to_examples(batch: dict[str, list]) -> Iterator[dict[str, Any]]:
  function _convert_to_arrow (line 147) | def _convert_to_arrow(
  function shift_ex_examples_rngs (line 179) | def shift_ex_examples_rngs(ex_iterable: "_BaseExamplesIterable", value: ...
  class _BaseExamplesIterable (line 194) | class _BaseExamplesIterable:
    method __init__ (line 197) | def __init__(self) -> None:
    method __iter__ (line 200) | def __iter__(self) -> Iterator[tuple[Key, dict]]:
    method iter_arrow (line 205) | def iter_arrow(self) -> Optional[Callable[[], Iterator[tuple[Key, pa.T...
    method is_typed (line 209) | def is_typed(self) -> bool:
    method features (line 213) | def features(self) -> Optional[Features]:
    method shuffle_data_sources (line 216) | def shuffle_data_sources(self, generator: np.random.Generator) -> "_Ba...
    method shard_data_sources (line 223) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 227) | def reshard_data_sources(self) -> "_BaseExamplesIterable":
    method split_shard_indices_by_worker (line 235) | def split_shard_indices_by_worker(self, num_shards: int, index: int, c...
    method num_shards (line 246) | def num_shards(self) -> int:
    method _init_state_dict (line 249) | def _init_state_dict(self) -> dict:
    method load_state_dict (line 252) | def load_state_dict(self, state_dict: dict) -> dict:
    method state_dict (line 266) | def state_dict(self) -> dict:
  class ExamplesIterable (line 272) | class ExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 273) | def __init__(
    method _init_state_dict (line 286) | def _init_state_dict(self) -> dict:
    method __iter__ (line 290) | def __iter__(self):
    method shuffle_data_sources (line 302) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Exa...
    method shard_data_sources (line 309) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 316) | def reshard_data_sources(self) -> "ExamplesIterable":
    method num_shards (line 331) | def num_shards(self) -> int:
  class ArrowExamplesIterable (line 335) | class ArrowExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 336) | def __init__(
    method iter_arrow (line 350) | def iter_arrow(self):
    method _init_state_dict (line 353) | def _init_state_dict(self) -> dict:
    method __iter__ (line 357) | def __iter__(self):
    method _iter_arrow (line 379) | def _iter_arrow(self):
    method shuffle_data_sources (line 395) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Arr...
    method shard_data_sources (line 400) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 407) | def reshard_data_sources(self) -> "ArrowExamplesIterable":
    method num_shards (line 422) | def num_shards(self) -> int:
  class RebatchedArrowExamplesIterable (line 426) | class RebatchedArrowExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 427) | def __init__(
    method iter_arrow (line 441) | def iter_arrow(self):
    method is_typed (line 445) | def is_typed(self):
    method features (line 449) | def features(self):
    method _init_state_dict (line 452) | def _init_state_dict(self) -> dict:
    method __iter__ (line 463) | def __iter__(self):
    method _iter_arrow (line 466) | def _iter_arrow(self) -> Iterator[tuple[Key, pa.Table]]:
    method shuffle_data_sources (line 556) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Reb...
    method shard_data_sources (line 564) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 572) | def reshard_data_sources(self) -> "RebatchedArrowExamplesIterable":
    method num_shards (line 578) | def num_shards(self) -> int:
  class SelectColumnsIterable (line 582) | class SelectColumnsIterable(_BaseExamplesIterable):
    method __init__ (line 583) | def __init__(self, ex_iterable: _BaseExamplesIterable, column_names: l...
    method iter_arrow (line 589) | def iter_arrow(self):
    method is_typed (line 594) | def is_typed(self):
    method features (line 598) | def features(self):
    method _init_state_dict (line 601) | def _init_state_dict(self) -> dict:
    method __iter__ (line 605) | def __iter__(self):
    method _iter_arrow (line 609) | def _iter_arrow(self) -> Iterator[tuple[Key, pa.Table]]:
    method shuffle_data_sources (line 614) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Sel...
    method shard_data_sources (line 617) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 622) | def reshard_data_sources(self) -> "SelectColumnsIterable":
    method num_shards (line 626) | def num_shards(self) -> int:
  class StepExamplesIterable (line 630) | class StepExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 631) | def __init__(self, ex_iterable: _BaseExamplesIterable, step: int, offs...
    method iter_arrow (line 638) | def iter_arrow(self):
    method is_typed (line 642) | def is_typed(self):
    method features (line 646) | def features(self):
    method _init_state_dict (line 649) | def _init_state_dict(self) -> dict:
    method __iter__ (line 657) | def __iter__(self):
    method _iter_arrow (line 666) | def _iter_arrow(self):
    method shuffle_data_sources (line 677) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Ste...
    method shard_data_sources (line 682) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 689) | def reshard_data_sources(self) -> "StepExamplesIterable":
    method num_shards (line 697) | def num_shards(self) -> int:
  class CyclingMultiSourcesExamplesIterable (line 701) | class CyclingMultiSourcesExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 702) | def __init__(
    method is_typed (line 721) | def is_typed(self):
    method features (line 725) | def features(self):
    method iter_arrow (line 729) | def iter_arrow(self):
    method _get_indices_iterator (line 733) | def _get_indices_iterator(self):
    method _init_state_dict (line 742) | def _init_state_dict(self) -> dict:
    method _iter_arrow (line 752) | def _iter_arrow(self):
    method __iter__ (line 798) | def __iter__(self):
    method shuffle_data_sources (line 842) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Cyc...
    method num_shards (line 848) | def num_shards(self) -> int:
    method shard_data_sources (line 851) | def shard_data_sources(
    method reshard_data_sources (line 877) | def reshard_data_sources(self) -> "CyclingMultiSourcesExamplesIterable":
  class VerticallyConcatenatedMultiSourcesExamplesIterable (line 884) | class VerticallyConcatenatedMultiSourcesExamplesIterable(_BaseExamplesIt...
    method __init__ (line 897) | def __init__(self, ex_iterables: list[_BaseExamplesIterable]):
    method is_typed (line 902) | def is_typed(self):
    method features (line 906) | def features(self):
    method iter_arrow (line 910) | def iter_arrow(self):
    method _init_state_dict (line 914) | def _init_state_dict(self) -> dict:
    method __iter__ (line 922) | def __iter__(self):
    method _iter_arrow (line 929) | def _iter_arrow(self):
    method shuffle_data_sources (line 936) | def shuffle_data_sources(
    method num_shards (line 950) | def num_shards(self) -> int:
    method shard_data_sources (line 953) | def shard_data_sources(
    method reshard_data_sources (line 967) | def reshard_data_sources(self) -> "VerticallyConcatenatedMultiSourcesE...
  function _check_column_names (line 973) | def _check_column_names(column_names: list[str]):
  class HorizontallyConcatenatedMultiSourcesExamplesIterable (line 983) | class HorizontallyConcatenatedMultiSourcesExamplesIterable(_BaseExamples...
    method __init__ (line 999) | def __init__(self, ex_iterables: list[_BaseExamplesIterable]):
    method iter_arrow (line 1004) | def iter_arrow(self):
    method is_typed (line 1016) | def is_typed(self):
    method features (line 1020) | def features(self):
    method _init_state_dict (line 1023) | def _init_state_dict(self) -> dict:
    method __iter__ (line 1030) | def __iter__(self):
    method _iter_arrow (line 1053) | def _iter_arrow(self):
    method shuffle_data_sources (line 1081) | def shuffle_data_sources(
    method num_shards (line 1088) | def num_shards(self) -> int:
    method shard_data_sources (line 1091) | def shard_data_sources(
    method reshard_data_sources (line 1097) | def reshard_data_sources(self) -> "HorizontallyConcatenatedMultiSource...
  class RandomlyCyclingMultiSourcesExamplesIterable (line 1102) | class RandomlyCyclingMultiSourcesExamplesIterable(CyclingMultiSourcesExa...
    method __init__ (line 1103) | def __init__(
    method shift_rngs (line 1116) | def shift_rngs(self, value: int) -> "_BaseExamplesIterable":
    method is_typed (line 1127) | def is_typed(self):
    method features (line 1131) | def features(self):
    method _get_indices_iterator (line 1134) | def _get_indices_iterator(self):
    method _init_state_dict (line 1163) | def _init_state_dict(self) -> dict:
    method shuffle_data_sources (line 1174) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Ran...
    method shard_data_sources (line 1184) | def shard_data_sources(
    method reshard_data_sources (line 1216) | def reshard_data_sources(self) -> "RandomlyCyclingMultiSourcesExamples...
  function _table_output_to_arrow (line 1226) | def _table_output_to_arrow(output) -> pa.Table:
  class MappedExamplesIterable (line 1239) | class MappedExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 1240) | def __init__(
    method iter_arrow (line 1293) | def iter_arrow(self):
    method is_typed (line 1298) | def is_typed(self):
    method features (line 1302) | def features(self):
    method _init_state_dict (line 1305) | def _init_state_dict(self) -> dict:
    method __iter__ (line 1315) | def __iter__(self):
    method _iter (line 1323) | def _iter(self):
    method _iter_arrow (line 1518) | def _iter_arrow(self, max_chunksize: Optional[int] = None) -> Iterator...
    method shuffle_data_sources (line 1590) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Map...
    method shard_data_sources (line 1607) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 1624) | def reshard_data_sources(self) -> "MappedExamplesIterable":
    method num_shards (line 1641) | def num_shards(self) -> int:
  function _add_mask (line 1645) | def _add_mask(
  function add_mask (line 1658) | def add_mask(mask_function: Callable, input: Union[dict, pa.Table], *arg...
  function async_add_mask (line 1663) | async def async_add_mask(
  class FilteredExamplesIterable (line 1670) | class FilteredExamplesIterable(MappedExamplesIterable):
    method __init__ (line 1673) | def __init__(
    method _iter (line 1705) | def _iter(self):
    method _iter_arrow (line 1711) | def _iter_arrow(self, max_chunksize: Optional[int] = None):
    method shuffle_data_sources (line 1716) | def shuffle_data_sources(self, seed: Optional[int]) -> "FilteredExampl...
    method shard_data_sources (line 1729) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 1742) | def reshard_data_sources(self) -> "FilteredExamplesIterable":
    method num_shards (line 1755) | def num_shards(self) -> int:
  class BufferShuffledExamplesIterable (line 1759) | class BufferShuffledExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 1760) | def __init__(self, ex_iterable: _BaseExamplesIterable, buffer_size: in...
    method shift_rngs (line 1766) | def shift_rngs(self, value: int) -> "_BaseExamplesIterable":
    method is_typed (line 1776) | def is_typed(self):
    method features (line 1780) | def features(self):
    method iter_arrow (line 1784) | def iter_arrow(self):
    method _init_state_dict (line 1787) | def _init_state_dict(self) -> dict:
    method load_state_dict (line 1792) | def load_state_dict(self, state_dict: dict) -> dict:
    method _iter_random_indices (line 1802) | def _iter_random_indices(rng: np.random.Generator, buffer_size: int, r...
    method __iter__ (line 1806) | def __iter__(self):
    method _iter_arrow (line 1823) | def _iter_arrow(self):
    method shuffle_data_sources (line 1840) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Buf...
    method shard_data_sources (line 1846) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 1854) | def reshard_data_sources(self) -> "BufferShuffledExamplesIterable":
    method num_shards (line 1862) | def num_shards(self) -> int:
  class SkipExamplesIterable (line 1866) | class SkipExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 1867) | def __init__(
    method iter_arrow (line 1881) | def iter_arrow(self):
    method is_typed (line 1885) | def is_typed(self):
    method features (line 1889) | def features(self):
    method _init_state_dict (line 1892) | def _init_state_dict(self) -> dict:
    method __iter__ (line 1900) | def __iter__(self):
    method _iter_arrow (line 1910) | def _iter_arrow(self):
    method split_number (line 1929) | def split_number(num, n):
    method shuffle_data_sources (line 1937) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Ski...
    method shard_data_sources (line 1949) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 1961) | def reshard_data_sources(self) -> "SkipExamplesIterable":
    method num_shards (line 1970) | def num_shards(self) -> int:
  class RepeatExamplesIterable (line 1974) | class RepeatExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 1979) | def __init__(
    method _init_state_dict (line 1988) | def _init_state_dict(self) -> dict:
    method __iter__ (line 1996) | def __iter__(self):
    method shuffle_data_sources (line 2007) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Rep...
    method shard_data_sources (line 2011) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 2018) | def reshard_data_sources(self) -> "RepeatExamplesIterable":
    method num_shards (line 2025) | def num_shards(self) -> int:
  class TakeExamplesIterable (line 2029) | class TakeExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 2030) | def __init__(
    method iter_arrow (line 2044) | def iter_arrow(self):
    method is_typed (line 2048) | def is_typed(self):
    method features (line 2052) | def features(self):
    method _init_state_dict (line 2055) | def _init_state_dict(self) -> dict:
    method __iter__ (line 2063) | def __iter__(self):
    method _iter_arrow (line 2076) | def _iter_arrow(self):
    method split_number (line 2098) | def split_number(num, n):
    method shuffle_data_sources (line 2106) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Tak...
    method shard_data_sources (line 2118) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 2135) | def reshard_data_sources(self) -> "TakeExamplesIterable":
    method num_shards (line 2144) | def num_shards(self) -> int:
  function _apply_feature_types_on_example (line 2148) | def _apply_feature_types_on_example(
  class FormattingConfig (line 2164) | class FormattingConfig:
    method is_table (line 2168) | def is_table(self) -> bool:
    method is_tensor (line 2172) | def is_tensor(self) -> bool:
  class FormattedExamplesIterable (line 2176) | class FormattedExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 2177) | def __init__(
    method iter_arrow (line 2193) | def iter_arrow(self):
    method is_typed (line 2198) | def is_typed(self):
    method features (line 2202) | def features(self):
    method _init_state_dict (line 2205) | def _init_state_dict(self) -> dict:
    method __iter__ (line 2209) | def __iter__(self):
    method _iter_arrow (line 2247) | def _iter_arrow(self) -> Iterator[tuple[Key, pa.Table]]:
    method shuffle_data_sources (line 2263) | def shuffle_data_sources(self, generator: np.random.Generator) -> "For...
    method shard_data_sources (line 2273) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method reshard_data_sources (line 2283) | def reshard_data_sources(self) -> "FormattedExamplesIterable":
    method num_shards (line 2293) | def num_shards(self) -> int:
  class DistributedConfig (line 2298) | class DistributedConfig:
  function _maybe_add_torch_iterable_dataset_parent_class (line 2303) | def _maybe_add_torch_iterable_dataset_parent_class(cls):
  function _maybe_share_with_torch_persistent_workers (line 2312) | def _maybe_share_with_torch_persistent_workers(value: Union[int, "torch....
  class IterableColumn (line 2324) | class IterableColumn:
    method __init__ (line 2345) | def __init__(self, source: Union["IterableDataset", "IterableColumn"],...
    method __iter__ (line 2349) | def __iter__(self) -> Iterator[Any]:
    method __getitem__ (line 2353) | def __getitem__(self, column_name: str) -> "IterableColumn":
  class IterableDataset (line 2357) | class IterableDataset(DatasetInfoMixin):
    method __init__ (line 2360) | def __init__(
    method num_columns (line 2383) | def num_columns(self) -> Optional[int]:
    method column_names (line 2399) | def column_names(self) -> Optional[list[str]]:
    method state_dict (line 2414) | def state_dict(self) -> dict:
    method load_state_dict (line 2467) | def load_state_dict(self, state_dict: dict) -> None:
    method __repr__ (line 2517) | def __repr__(self):
    method __getstate__ (line 2520) | def __getstate__(self):
    method __setstate__ (line 2523) | def __setstate__(self, d):
    method _head (line 2532) | def _head(self, n=5):
    method epoch (line 2536) | def epoch(self) -> int:
    method num_shards (line 2540) | def num_shards(self) -> int:
    method n_shards (line 2546) | def n_shards(self) -> int:  # backward compatibility
    method _iter_pytorch (line 2549) | def _iter_pytorch(self):
    method _is_main_process (line 2605) | def _is_main_process(self):
    method _prepare_ex_iterable_for_iteration (line 2616) | def _prepare_ex_iterable_for_iteration(
    method __iter__ (line 2673) | def __iter__(self):
    method iter (line 2694) | def iter(self, batch_size: int, drop_last_batch: bool = False):
    method __getitem__ (line 2725) | def __getitem__(self, column_name: str) -> IterableColumn:
    method from_generator (line 2729) | def from_generator(
    method from_spark (line 2784) | def from_spark(
    method from_file (line 2827) | def from_file(filename: str) -> "IterableDataset":
    method from_pandas (line 2843) | def from_pandas(
    method from_polars (line 2904) | def from_polars(
    method from_dict (line 2955) | def from_dict(
    method from_list (line 2992) | def from_list(
    method from_csv (line 3031) | def from_csv(
    method from_json (line 3074) | def from_json(
    method from_parquet (line 3121) | def from_parquet(
    method from_text (line 3206) | def from_text(
    method with_format (line 3257) | def with_format(
    method map (line 3314) | def map(
    method filter (line 3469) | def filter(
    method shuffle (line 3554) | def shuffle(
    method set_epoch (line 3623) | def set_epoch(self, epoch: int):
    method skip (line 3626) | def skip(self, n: int) -> "IterableDataset":
    method repeat (line 3668) | def repeat(self, num_times: Optional[int]) -> "IterableDataset":
    method take (line 3710) | def take(self, n: int) -> "IterableDataset":
    method shard (line 3745) | def shard(
    method reshard (line 3801) | def reshard(self) -> "IterableDataset":
    method add_column (line 3841) | def add_column(self, name: str, column: Union[list, np.array]) -> "Ite...
    method rename_column (line 3853) | def rename_column(self, original_column_name: str, new_column_name: st...
    method rename_columns (line 3883) | def rename_columns(self, column_mapping: dict[str, str]) -> "IterableD...
    method remove_columns (line 3908) | def remove_columns(self, column_names: Union[str, list[str]]) -> "Iter...
    method select_columns (line 3943) | def select_columns(self, column_names: Union[str, list[str]]) -> "Iter...
    method cast_column (line 3993) | def cast_column(self, column: str, feature: FeatureType) -> "IterableD...
    method cast (line 4039) | def cast(
    method decode (line 4085) | def decode(self, enable: bool = True, num_threads: int = 0) -> "Iterab...
    method _step (line 4178) | def _step(self, step: int, offset: int) -> "IterableDataset":
    method _resolve_features (line 4189) | def _resolve_features(self):
    method batch (line 4207) | def batch(self, batch_size: int, drop_last_batch: bool = False) -> "It...
    method to_dict (line 4230) | def to_dict(self, batch_size: Optional[int] = None, batched: bool = Fa...
    method to_list (line 4253) | def to_list(self) -> list:
    method to_pandas (line 4268) | def to_pandas(
    method to_polars (line 4308) | def to_polars(
    method to_csv (line 4345) | def to_csv(
    method to_json (line 4388) | def to_json(
    method to_sql (line 4442) | def to_sql(
    method to_parquet (line 4484) | def to_parquet(
    method _push_parquet_shards_to_hub_single (line 4532) | def _push_parquet_shards_to_hub_single(
    method _push_parquet_shards_to_hub (line 4625) | def _push_parquet_shards_to_hub(
    method push_to_hub (line 4743) | def push_to_hub(
  function _concatenate_iterable_datasets (line 4939) | def _concatenate_iterable_datasets(
  function _interleave_iterable_datasets (line 5031) | def _interleave_iterable_datasets(
  function _split_by_node_iterable_dataset (line 5116) | def _split_by_node_iterable_dataset(dataset: IterableDataset, rank: int,...
  function _apply_async (line 5149) | async def _apply_async(pool, func, x):
  function _batch_fn (line 5158) | def _batch_fn(unbatched):
  function _generate_tables_from_polars (line 5162) | def _generate_tables_from_polars(df: Union["pl.DataFrame", "pl.LazyFrame...

FILE: src/datasets/load.py
  class _InitializeConfiguredDatasetBuilder (line 105) | class _InitializeConfiguredDatasetBuilder:
    method __call__ (line 114) | def __call__(self, builder_cls, metadata_configs, default_config_name,...
  function configure_builder_class (line 123) | def configure_builder_class(
  function import_main_class (line 163) | def import_main_class(module_path) -> Optional[type[DatasetBuilder]]:
  function get_dataset_builder_class (line 180) | def get_dataset_builder_class(
  function increase_load_count (line 197) | def increase_load_count(name: str):
  function infer_module_for_data_files_list (line 210) | def infer_module_for_data_files_list(
  function infer_module_for_data_files_list_in_archives (line 247) | def infer_module_for_data_files_list_in_archives(
  function infer_module_for_data_files (line 282) | def infer_module_for_data_files(
  function create_builder_configs_from_metadata_configs (line 310) | def create_builder_configs_from_metadata_configs(
  class BuilderConfigsParameters (line 368) | class BuilderConfigsParameters:
  class DatasetModule (line 386) | class DatasetModule:
  class _DatasetModuleFactory (line 394) | class _DatasetModuleFactory:
    method get_module (line 395) | def get_module(self) -> DatasetModule:
  class LocalDatasetModuleFactory (line 399) | class LocalDatasetModuleFactory(_DatasetModuleFactory):
    method __init__ (line 403) | def __init__(
    method get_module (line 419) | def get_module(self) -> DatasetModule:
  class PackagedDatasetModuleFactory (line 508) | class PackagedDatasetModuleFactory(_DatasetModuleFactory):
    method __init__ (line 511) | def __init__(
    method get_module (line 526) | def get_module(self) -> DatasetModule:
  class HubDatasetModuleFactory (line 549) | class HubDatasetModuleFactory(_DatasetModuleFactory):
    method __init__ (line 555) | def __init__(
    method get_module (line 574) | def get_module(self) -> DatasetModule:
  class HubDatasetModuleFactoryWithParquetExport (line 718) | class HubDatasetModuleFactoryWithParquetExport(_DatasetModuleFactory):
    method __init__ (line 723) | def __init__(
    method get_module (line 734) | def get_module(self) -> DatasetModule:
  class CachedDatasetModuleFactory (line 792) | class CachedDatasetModuleFactory(_DatasetModuleFactory):
    method __init__ (line 797) | def __init__(
    method get_module (line 806) | def get_module(self) -> DatasetModule:
  class HubBucketDatasetModuleFactory (line 834) | class HubBucketDatasetModuleFactory(_DatasetModuleFactory):
    method __init__ (line 840) | def __init__(
    method get_module (line 855) | def get_module(self) -> DatasetModule:
  function dataset_module_factory (line 955) | def dataset_module_factory(
  function load_dataset_builder (line 1212) | def load_dataset_builder(
  function load_dataset (line 1373) | def load_dataset(
  function load_dataset (line 1396) | def load_dataset(
  function load_dataset (line 1420) | def load_dataset(
  function load_dataset (line 1444) | def load_dataset(
  function load_dataset (line 1467) | def load_dataset(
  function load_from_disk (line 1725) | def load_from_disk(

FILE: src/datasets/naming.py
  function camelcase_to_snakecase (line 34) | def camelcase_to_snakecase(name):
  function snakecase_to_camelcase (line 41) | def snakecase_to_camelcase(name):
  function filename_prefix_for_name (line 48) | def filename_prefix_for_name(name):
  function filename_prefix_for_split (line 54) | def filename_prefix_for_split(name, split):
  function filepattern_for_dataset_split (line 62) | def filepattern_for_dataset_split(dataset_name, split, data_dir, filetyp...
  function filenames_for_dataset_split (line 70) | def filenames_for_dataset_split(path, dataset_name, split, filetype_suff...

FILE: src/datasets/packaged_modules/__init__.py
  function _hash_python_lines (line 27) | def _hash_python_lines(lines: list[str]) -> str:

FILE: src/datasets/packaged_modules/arrow/arrow.py
  class ArrowConfig (line 15) | class ArrowConfig(datasets.BuilderConfig):
    method __post_init__ (line 20) | def __post_init__(self):
  class Arrow (line 24) | class Arrow(datasets.ArrowBasedBuilder):
    method _info (line 27) | def _info(self):
    method _split_generators (line 30) | def _split_generators(self, dl_manager):
    method _cast_table (line 50) | def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    method _generate_shards (line 57) | def _generate_shards(self, files):
    method _generate_tables (line 60) | def _generate_tables(self, files):

FILE: src/datasets/packaged_modules/audiofolder/audiofolder.py
  class AudioFolderConfig (line 9) | class AudioFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
    method __post_init__ (line 15) | def __post_init__(self):
  class AudioFolder (line 19) | class AudioFolder(folder_based_builder.FolderBasedBuilder):

FILE: src/datasets/packaged_modules/cache/cache.py
  function _get_modification_time (line 21) | def _get_modification_time(cached_directory_path):
  function _find_hash_in_cache (line 25) | def _find_hash_in_cache(
  class Cache (line 99) | class Cache(datasets.ArrowBasedBuilder):
    method __init__ (line 100) | def __init__(
    method _info (line 148) | def _info(self) -> datasets.DatasetInfo:
    method download_and_prepare (line 151) | def download_and_prepare(self, output_dir: Optional[str] = None, *args...
    method _split_generators (line 157) | def _split_generators(self, dl_manager):
    method _generate_shards (line 179) | def _generate_shards(self, files):
    method _generate_tables (line 182) | def _generate_tables(self, files):

FILE: src/datasets/packaged_modules/csv/csv.py
  class CsvConfig (line 25) | class CsvConfig(datasets.BuilderConfig):
    method __post_init__ (line 70) | def __post_init__(self):
    method pd_read_csv_kwargs (line 78) | def pd_read_csv_kwargs(self):
  class Csv (line 145) | class Csv(datasets.ArrowBasedBuilder):
    method _info (line 148) | def _info(self):
    method _split_generators (line 151) | def _split_generators(self, dl_manager):
    method _cast_table (line 169) | def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    method _generate_shards (line 180) | def _generate_shards(self, base_files, files_iterables):
    method _generate_tables (line 183) | def _generate_tables(self, base_files, files_iterables):

FILE: src/datasets/packaged_modules/eval/eval.py
  class Eval (line 15) | class Eval(datasets.GeneratorBasedBuilder):
    method _info (line 18) | def _info(self):
    method _split_generators (line 21) | def _split_generators(self, dl_manager):
    method _sort_samples_key (line 53) | def _sort_samples_key(self, sample_path: str):
    method _iter_samples_from_log_files (line 58) | def _iter_samples_from_log_files(self, log_files: Iterable[str]):
    method _generate_shards (line 71) | def _generate_shards(self, base_files, logs_files_iterables):
    method _generate_examples (line 74) | def _generate_examples(self, base_files, logs_files_iterables):

FILE: src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py
  function count_path_segments (line 24) | def count_path_segments(path):
  class FolderBasedBuilderConfig (line 29) | class FolderBasedBuilderConfig(datasets.BuilderConfig):
    method __post_init__ (line 38) | def __post_init__(self):
  class FolderBasedBuilder (line 42) | class FolderBasedBuilder(datasets.GeneratorBasedBuilder):
    method _info (line 62) | def _info(self):
    method _split_generators (line 71) | def _split_generators(self, dl_manager):
    method _split_files_and_metadata_and_archives (line 269) | def _split_files_and_metadata_and_archives(self, data_files):
    method _read_metadata (line 283) | def _read_metadata(self, metadata_file: str, metadata_ext: str = "") -...
    method _generate_shards (line 366) | def _generate_shards(self, files, metadata_files, add_metadata, add_la...
    method _generate_examples (line 374) | def _generate_examples(self, files, metadata_files, add_metadata, add_...
  function _nested_apply (line 433) | def _nested_apply(item: Any, feature_path: _VisitPath, func: Callable[[A...

FILE: src/datasets/packaged_modules/generator/generator.py
  class GeneratorConfig (line 10) | class GeneratorConfig(datasets.BuilderConfig):
    method __post_init__ (line 16) | def __post_init__(self):
  class Generator (line 25) | class Generator(datasets.GeneratorBasedBuilder):
    method _info (line 28) | def _info(self):
    method _split_generators (line 31) | def _split_generators(self, dl_manager):
    method _generate_examples (line 34) | def _generate_examples(self, **gen_kwargs):

FILE: src/datasets/packaged_modules/hdf5/hdf5.py
  class HDF5Config (line 33) | class HDF5Config(datasets.BuilderConfig):
  class HDF5 (line 40) | class HDF5(datasets.ArrowBasedBuilder):
    method _info (line 45) | def _info(self):
    method _split_generators (line 48) | def _split_generators(self, dl_manager):
    method _generate_shards (line 66) | def _generate_shards(self, files):
    method _generate_tables (line 69) | def _generate_tables(self, files):
  function _is_complex_dtype (line 102) | def _is_complex_dtype(dtype: np.dtype) -> bool:
  function _create_complex_features (line 110) | def _create_complex_features(dset) -> Features:
  function _convert_complex_to_nested (line 135) | def _convert_complex_to_nested(arr: np.ndarray) -> pa.StructArray:
  function _is_compound_dtype (line 148) | def _is_compound_dtype(dtype: np.dtype) -> bool:
  class _CompoundGroup (line 153) | class _CompoundGroup:
    method items (line 157) | def items(self):
  class _CompoundField (line 164) | class _CompoundField:
    method __post_init__ (line 170) | def __post_init__(self):
    method __getitem__ (line 173) | def __getitem__(self, key):
  function _create_compound_features (line 177) | def _create_compound_features(dset) -> Features:
  function _convert_compound_to_nested (line 182) | def _convert_compound_to_nested(arr, dset) -> pa.StructArray:
  function _is_vlen_dtype (line 193) | def _is_vlen_dtype(dtype: np.dtype) -> bool:
  function _create_vlen_features (line 199) | def _create_vlen_features(dset) -> Features:
  function _convert_vlen_to_array (line 207) | def _convert_vlen_to_array(arr: np.ndarray) -> pa.Array:
  function _recursive_infer_features (line 216) | def _recursive_infer_features(h5_obj) -> Features:
  function _infer_feature (line 231) | def _infer_feature(dset):
  function _load_array (line 241) | def _load_array(dset, path: str, start: int, end: int) -> pa.Array:
  function _recursive_load_arrays (line 267) | def _recursive_load_arrays(h5_obj, features: Features, start: int, end: ...
  function _create_sized_feature (line 303) | def _create_sized_feature(dset):
  function _create_sized_feature_impl (line 309) | def _create_sized_feature_impl(dset_shape, value_feature):
  function _sized_arrayxd (line 328) | def _sized_arrayxd(rank: int):
  function _np_to_pa_to_hf_value (line 332) | def _np_to_pa_to_hf_value(numpy_dtype: np.dtype) -> Value:
  function _first_dataset (line 336) | def _first_dataset(h5_obj, features: Features, prefix=""):
  function _check_dataset_lengths (line 348) | def _check_dataset_lengths(h5_obj, features: Features) -> int:
  function _is_group (line 363) | def _is_group(h5_obj) -> bool:
  function _is_dataset (line 369) | def _is_dataset(h5_obj) -> bool:
  function _is_file (line 375) | def _is_file(h5_obj) -> bool:
  function _has_zero_dimensions (line 381) | def _has_zero_dimensions(feature):

FILE: src/datasets/packaged_modules/imagefolder/imagefolder.py
  class ImageFolderConfig (line 9) | class ImageFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
    method __post_init__ (line 15) | def __post_init__(self):
  class ImageFolder (line 19) | class ImageFolder(folder_based_builder.FolderBasedBuilder):

FILE: src/datasets/packaged_modules/json/json.py
  function pandas_read_json (line 30) | def pandas_read_json(path_or_buf, **kwargs):
  class FullReadDisallowed (line 36) | class FullReadDisallowed(Exception):
  class JsonConfig (line 41) | class JsonConfig(datasets.BuilderConfig):
    method __post_init__ (line 54) | def __post_init__(self):
  class Json (line 58) | class Json(datasets.ArrowBasedBuilder):
    method _info (line 61) | def _info(self):
    method _split_generators (line 73) | def _split_generators(self, dl_manager):
    method _cast_table (line 97) | def _cast_table(self, pa_table: pa.Table, json_field_paths=()) -> pa.T...
    method _generate_shards (line 127) | def _generate_shards(self, base_files, files_iterables):
    method _generate_tables (line 130) | def _generate_tables(self, base_files, files_iterables, allow_full_rea...

FILE: src/datasets/packaged_modules/lance/lance.py
  class LanceConfig (line 41) | class LanceConfig(datasets.BuilderConfig):
  function resolve_dataset_uris (line 62) | def resolve_dataset_uris(files: List[str]) -> Dict[str, List[str]]:
  function _fix_hf_uri (line 72) | def _fix_hf_uri(uri: str) -> str:
  function _fix_local_version_file (line 81) | def _fix_local_version_file(uri: str) -> str:
  class Lance (line 92) | class Lance(datasets.ArrowBasedBuilder, datasets.builder._CountableBuild...
    method _info (line 96) | def _info(self):
    method _split_generators (line 99) | def _split_generators(self, dl_manager):
    method _cast_table (line 183) | def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    method _generate_shards (line 190) | def _generate_shards(
    method _generate_num_examples (line 203) | def _generate_num_examples(
    method _generate_tables (line 216) | def _generate_tables(

FILE: src/datasets/packaged_modules/niftifolder/niftifolder.py
  class NiftiFolderConfig (line 9) | class NiftiFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
    method __post_init__ (line 15) | def __post_init__(self):
  class NiftiFolder (line 19) | class NiftiFolder(folder_based_builder.FolderBasedBuilder):

FILE: src/datasets/packaged_modules/pandas/pandas.py
  class PandasConfig (line 14) | class PandasConfig(datasets.BuilderConfig):
    method __post_init__ (line 19) | def __post_init__(self):
  class Pandas (line 23) | class Pandas(datasets.ArrowBasedBuilder):
    method _info (line 26) | def _info(self):
    method _split_generators (line 33) | def _split_generators(self, dl_manager):
    method _cast_table (line 43) | def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    method _generate_shards (line 50) | def _generate_shards(self, files):
    method _generate_tables (line 53) | def _generate_tables(self, files):

FILE: src/datasets/packaged_modules/parquet/parquet.py
  class ParquetConfig (line 17) | class ParquetConfig(datasets.BuilderConfig):
    method __post_init__ (line 86) | def __post_init__(self):
  class Parquet (line 90) | class Parquet(datasets.ArrowBasedBuilder):
    method _info (line 93) | def _info(self):
    method _split_generators (line 105) | def _split_generators(self, dl_manager):
    method _cast_table (line 143) | def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    method _generate_shards (line 150) | def _generate_shards(self, files, row_groups_list):
    method _generate_more_gen_kwargs (line 160) | def _generate_more_gen_kwargs(self, files, row_groups_list):
    method _generate_tables (line 176) | def _generate_tables(self, files, row_groups_list):

FILE: src/datasets/packaged_modules/pdffolder/pdffolder.py
  class PdfFolderConfig (line 9) | class PdfFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
    method __post_init__ (line 15) | def __post_init__(self):
  class PdfFolder (line 19) | class PdfFolder(folder_based_builder.FolderBasedBuilder):

FILE: src/datasets/packaged_modules/spark/spark.py
  class SparkConfig (line 32) | class SparkConfig(datasets.BuilderConfig):
    method __post_init__ (line 37) | def __post_init__(self):
  function _reorder_dataframe_by_partition (line 41) | def _reorder_dataframe_by_partition(df: "pyspark.sql.DataFrame", new_par...
  function _generate_iterable_examples (line 49) | def _generate_iterable_examples(
  class SparkExamplesIterable (line 78) | class SparkExamplesIterable(_BaseExamplesIterable):
    method __init__ (line 79) | def __init__(
    method _init_state_dict (line 88) | def _init_state_dict(self) -> dict:
    method load_state_dict (line 93) | def load_state_dict(self, state_dict: dict) -> dict:
    method __iter__ (line 96) | def __iter__(self):
    method shuffle_data_sources (line 99) | def shuffle_data_sources(self, generator: np.random.Generator) -> "Spa...
    method shard_data_sources (line 104) | def shard_data_sources(self, num_shards: int, index: int, contiguous=T...
    method num_shards (line 109) | def num_shards(self) -> int:
  class Spark (line 113) | class Spark(datasets.DatasetBuilder):
    method __init__ (line 116) | def __init__(
    method _validate_cache_dir (line 135) | def _validate_cache_dir(self):
    method _info (line 168) | def _info(self):
    method _split_generators (line 171) | def _split_generators(self, dl_manager: datasets.download.download_man...
    method _repartition_df_if_needed (line 174) | def _repartition_df_if_needed(self, max_shard_size):
    method _prepare_split_single (line 199) | def _prepare_split_single(
    method _prepare_split (line 283) | def _prepare_split(
    method _get_examples_iterable_for_split (line 363) | def _get_examples_iterable_for_split(

FILE: src/datasets/packaged_modules/sql/sql.py
  class SqlConfig (line 25) | class SqlConfig(datasets.BuilderConfig):
    method __post_init__ (line 38) | def __post_init__(self):
    method create_config_id (line 45) | def create_config_id(
    method pd_read_sql_kwargs (line 81) | def pd_read_sql_kwargs(self):
  class Sql (line 92) | class Sql(datasets.ArrowBasedBuilder):
    method _info (line 95) | def _info(self):
    method _split_generators (line 98) | def _split_generators(self, dl_manager):
    method _cast_table (line 101) | def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    method _generate_tables (line 112) | def _generate_tables(self):

FILE: src/datasets/packaged_modules/text/text.py
  class TextConfig (line 17) | class TextConfig(datasets.BuilderConfig):
  class Text (line 45) | class Text(datasets.ArrowBasedBuilder):
    method _info (line 48) | def _info(self):
    method _split_generators (line 51) | def _split_generators(self, dl_manager):
    method _cast_table (line 73) | def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    method _generate_shards (line 86) | def _generate_shards(self, base_files, files_iterables):
    method _generate_tables (line 89) | def _generate_tables(self, base_files, files_iterables):

FILE: src/datasets/packaged_modules/videofolder/videofolder.py
  class VideoFolderConfig (line 9) | class VideoFolderConfig(folder_based_builder.FolderBasedBuilderConfig):
    method __post_init__ (line 15) | def __post_init__(self):
  class VideoFolder (line 19) | class VideoFolder(folder_based_builder.FolderBasedBuilder):

FILE: src/datasets/packaged_modules/webdataset/_tenbin.py
  function bytelen (line 40) | def bytelen(a):
  function bytedata (line 50) | def bytedata(a):
  function check_acceptable_input_type (line 80) | def check_acceptable_input_type(data, allow64):
  function str64 (line 93) | def str64(s):
  function unstr64 (line 100) | def unstr64(i):
  function check_infos (line 106) | def check_infos(data, infos, required_infos=None):
  function encode_header (line 119) | def encode_header(a, info=""):
  function decode_header (line 131) | def decode_header(h):
  function encode_list (line 143) | def encode_list(l, infos=None):  # noqa: E741
  function decode_list (line 157) | def decode_list(l, infos=False):  # noqa: E741
  function roundup (line 174) | def roundup(n, k=64):
  function encode_chunks (line 179) | def encode_chunks(l):  # noqa: E741
  function decode_chunks (line 194) | def decode_chunks(buf):
  function encode_buffer (line 211) | def encode_buffer(l, infos=None):  # noqa: E741
  function decode_buffer (line 218) | def decode_buffer(buf, infos=False):
  function write_chunk (line 223) | def write_chunk(stream, buf):
  function read_chunk (line 234) | def read_chunk(stream):
  function write (line 252) | def write(stream, l, infos=None):  # noqa: E741
  function read (line 258) | def read(stream, n=sys.maxsize, infos=False):
  function save (line 272) | def save(fname, *args, infos=None, nocheck=False):
  function load (line 280) | def load(fname, infos=False, nocheck=False):

FILE: src/datasets/packaged_modules/webdataset/webdataset.py
  class WebDataset (line 20) | class WebDataset(datasets.GeneratorBasedBuilder):
    method _get_pipeline_from_tar (line 29) | def _get_pipeline_from_tar(cls, tar_path, tar_iterator):
    method _info (line 60) | def _info(self) -> datasets.DatasetInfo:
    method _split_generators (line 63) | def _split_generators(self, dl_manager):
    method _generate_shards (line 108) | def _generate_shards(self, tar_paths, tar_iterators):
    method _generate_examples (line 111) | def _generate_examples(self, tar_paths, tar_iterators):
  function base_plus_ext (line 134) | def base_plus_ext(path):
  function text_loads (line 281) | def text_loads(data: bytes):
  function tenbin_loads (line 285) | def tenbin_loads(data: bytes):
  function msgpack_loads (line 291) | def msgpack_loads(data: bytes):
  function npy_loads (line 297) | def npy_loads(data: bytes):
  function npz_loads (line 304) | def npz_loads(data: bytes):
  function cbor_loads (line 308) | def cbor_loads(data: bytes):
  function torch_loads (line 314) | def torch_loads(data: bytes):

FILE: src/datasets/packaged_modules/xml/xml.py
  class XmlConfig (line 15) | class XmlConfig(datasets.BuilderConfig):
  class Xml (line 23) | class Xml(datasets.ArrowBasedBuilder):
    method _info (line 26) | def _info(self):
    method _split_generators (line 29) | def _split_generators(self, dl_manager):
    method _cast_table (line 47) | def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    method _generate_shards (line 60) | def _generate_shards(self, files):
    method _generate_tables (line 63) | def _generate_tables(self, files):

FILE: src/datasets/parallel/parallel.py
  class ParallelBackendConfig (line 12) | class ParallelBackendConfig:
  function parallel_map (line 17) | def parallel_map(function, iterable, num_proc, batched, batch_size, type...
  function _map_with_multiprocessing_pool (line 43) | def _map_with_multiprocessing_pool(
  function _map_with_joblib (line 77) | def _map_with_joblib(
  function parallel_backend (line 93) | def parallel_backend(backend_name: str):

FILE: src/datasets/search.py
  class MissingIndex (line 36) | class MissingIndex(Exception):
  class SearchResults (line 40) | class SearchResults(NamedTuple):
  class BatchedSearchResults (line 45) | class BatchedSearchResults(NamedTuple):
  class NearestExamplesResults (line 50) | class NearestExamplesResults(NamedTuple):
  class BatchedNearestExamplesResults (line 55) | class BatchedNearestExamplesResults(NamedTuple):
  class BaseIndex (line 60) | class BaseIndex:
    method search (line 63) | def search(self, query, k: int = 10, **kwargs) -> SearchResults:
    method search_batch (line 70) | def search_batch(self, queries, k: int = 10, **kwargs) -> BatchedSearc...
    method save (line 88) | def save(self, file: Union[str, PurePath]):
    method load (line 93) | def load(cls, file: Union[str, PurePath]) -> "BaseIndex":
  class ElasticSearchIndex (line 98) | class ElasticSearchIndex(BaseIndex):
    method __init__ (line 108) | def __init__(
    method add_documents (line 146) | def add_documents(self, documents: Union[list[str], "Dataset"], column...
    method search (line 182) | def search(self, query: str, k=10, **kwargs) -> SearchResults:
    method search_batch (line 201) | def search_batch(self, queries, k: int = 10, max_workers=10, **kwargs)...
  class FaissIndex (line 215) | class FaissIndex(BaseIndex):
    method __init__ (line 225) | def __init__(
    method add_vectors (line 255) | def add_vectors(
    method _faiss_index_to_device (line 316) | def _faiss_index_to_device(index: "faiss.Index", device: Optional[Unio...
    method search (line 349) | def search(self, query: np.array, k=10, **kwargs) -> SearchResults:
    method search_batch (line 369) | def search_batch(self, queries: np.array, k=10, **kwargs) -> BatchedSe...
    method save (line 387) | def save(self, file: Union[str, PurePath], storage_options: Optional[d...
    method load (line 400) | def load(
  class IndexableMixin (line 417) | class IndexableMixin:
    method __init__ (line 420) | def __init__(self):
    method __len__ (line 423) | def __len__(self):
    method __getitem__ (line 426) | def __getitem__(self, key):
    method is_index_initialized (line 429) | def is_index_initialized(self, index_name: str) -> bool:
    method _check_index_is_initialized (line 432) | def _check_index_is_initialized(self, index_name: str):
    method list_indexes (line 438) | def list_indexes(self) -> list[str]:
    method get_index (line 442) | def get_index(self, index_name: str) -> BaseIndex:
    method add_faiss_index (line 454) | def add_faiss_index(
    method add_faiss_index_from_external_arrays (line 495) | def add_faiss_index_from_external_arrays(
    method save_faiss_index (line 535) | def save_faiss_index(self, index_name: str, file: Union[str, PurePath]...
    method load_faiss_index (line 553) | def load_faiss_index(
    method add_elasticsearch_index (line 585) | def add_elasticsearch_index(
    method load_elasticsearch_index (line 637) | def load_elasticsearch_index(
    method drop_index (line 684) | def drop_index(self, index_name: str):
    method search (line 693) | def search(self, index_name: str, query: Union[str, np.array], k: int ...
    method search_batch (line 713) | def search_batch(
    method get_nearest_examples (line 735) | def get_nearest_examples(
    method get_nearest_examples_batch (line 759) | def get_nearest_examples_batch(

FILE: src/datasets/splits.py
  class SplitInfo (line 32) | class SplitInfo:
    method file_instructions (line 48) | def file_instructions(self):
  class SubSplitInfo (line 60) | class SubSplitInfo:
    method num_examples (line 72) | def num_examples(self):
    method file_instructions (line 77) | def file_instructions(self):
  class SplitBase (line 82) | class SplitBase(metaclass=abc.ABCMeta):
    method get_read_instruction (line 118) | def get_read_instruction(self, split_dict):
    method __eq__ (line 129) | def __eq__(self, other):
    method __ne__ (line 135) | def __ne__(self, other):
    method __add__ (line 139) | def __add__(self, other):
    method subsplit (line 143) | def subsplit(self, arg=None, k=None, percent=None, weighted=None):  # ...
  class PercentSliceMeta (line 254) | class PercentSliceMeta(type):
    method __getitem__ (line 255) | def __getitem__(cls, slice_value):
  class PercentSlice (line 261) | class PercentSlice(metaclass=PercentSliceMeta):
  class _SplitMerged (line 277) | class _SplitMerged(SplitBase):
    method __init__ (line 280) | def __init__(self, split1, split2):
    method get_read_instruction (line 284) | def get_read_instruction(self, split_dict):
    method __repr__ (line 289) | def __repr__(self):
  class _SubSplit (line 293) | class _SubSplit(SplitBase):
    method __init__ (line 296) | def __init__(self, split, slice_value):
    method get_read_instruction (line 300) | def get_read_instruction(self, split_dict):
    method __repr__ (line 303) | def __repr__(self):
  class NamedSplit (line 315) | class NamedSplit(SplitBase):
    method __init__ (line 358) | def __init__(self, name: str):
    method __str__ (line 365) | def __str__(self):
    method __repr__ (line 368) | def __repr__(self):
    method __eq__ (line 371) | def __eq__(self, other):
    method __lt__ (line 382) | def __lt__(self, other):
    method __hash__ (line 385) | def __hash__(self):
    method get_read_instruction (line 388) | def get_read_instruction(self, split_dict):
  class NamedSplitAll (line 392) | class NamedSplitAll(NamedSplit):
    method __init__ (line 395) | def __init__(self):
    method __repr__ (line 398) | def __repr__(self):
    method get_read_instruction (line 401) | def get_read_instruction(self, split_dict):
  class Split (line 407) | class Split:
    method __new__ (line 450) | def __new__(cls, name):
  class SplitReadInstruction (line 465) | class SplitReadInstruction:
    method __init__ (line 481) | def __init__(self, split_info=None):
    method add (line 487) | def add(self, sliced_split):
    method __add__ (line 494) | def __add__(self, other):
    method __getitem__ (line 504) | def __getitem__(self, slice_value):
    method get_list_sliced_split_info (line 516) | def get_list_sliced_split_info(self):
  class SplitDict (line 520) | class SplitDict(dict[str, SplitInfo]):
    method __init__ (line 523) | def __init__(self, *args, dataset_name=None, **kwargs):
    method __getitem__ (line 527) | def __getitem__(self, key: Union[SplitBase, str]):
    method __setitem__ (line 540) | def __setitem__(self, key: Union[SplitBase, str], value: SplitInfo):
    method add (line 545) | def add(self, split_info: SplitInfo):
    method total_num_examples (line 553) | def total_num_examples(self):
    method from_split_dict (line 558) | def from_split_dict(cls, split_infos: Union[list, dict], dataset_name:...
    method to_split_dict (line 575) | def to_split_dict(self):
    method copy (line 584) | def copy(self):
    method _to_yaml_list (line 587) | def _to_yaml_list(self) -> list:
    method _from_yaml_list (line 599) | def _from_yaml_list(cls, yaml_data: list) -> "SplitDict":
  class SplitGenerator (line 604) | class SplitGenerator:
    method __post_init__ (line 634) | def __post_init__(self):

FILE: src/datasets/streaming.py
  function extend_module_for_streaming (line 42) | def extend_module_for_streaming(module_path, download_config: Optional[D...
  function extend_dataset_builder_for_streaming (line 110) | def extend_dataset_builder_for_streaming(builder: "DatasetBuilder"):

FILE: src/datasets/table.py
  function inject_arrow_table_documentation (line 22) | def inject_arrow_table_documentation(arrow_table_method):
  function _in_memory_arrow_table_from_file (line 33) | def _in_memory_arrow_table_from_file(filename: str) -> pa.Table:
  function _in_memory_arrow_table_from_buffer (line 40) | def _in_memory_arrow_table_from_buffer(buffer: pa.Buffer) -> pa.Table:
  function _memory_mapped_record_batch_reader_from_file (line 47) | def _memory_mapped_record_batch_reader_from_file(filename: str) -> pa.Re...
  function read_schema_from_file (line 52) | def read_schema_from_file(filename: str) -> pa.Schema:
  function _memory_mapped_arrow_table_from_file (line 62) | def _memory_mapped_arrow_table_from_file(filename: str) -> pa.Table:
  function _deepcopy (line 68) | def _deepcopy(x, memo: dict):
  function _interpolation_search (line 78) | def _interpolation_search(arr: list[int], x: int) -> int:
  class IndexedTableMixin (line 104) | class IndexedTableMixin:
    method __init__ (line 105) | def __init__(self, table: pa.Table):
    method fast_gather (line 112) | def fast_gather(self, indices: Union[list[int], np.ndarray]) -> pa.Table:
    method fast_slice (line 129) | def fast_slice(self, offset=0, length=None) -> pa.Table:
  class Table (line 153) | class Table(IndexedTableMixin):
    method __init__ (line 165) | def __init__(self, table: pa.Table):
    method __deepcopy__ (line 169) | def __deepcopy__(self, memo: dict):
    method validate (line 178) | def validate(self, *args, **kwargs):
    method equals (line 194) | def equals(self, *args, **kwargs):
    method to_batches (line 211) | def to_batches(self, *args, **kwargs):
    method to_pydict (line 225) | def to_pydict(self, *args, **kwargs):
    method to_pylist (line 234) | def to_pylist(self, *args, **kwargs):
    method to_pandas (line 243) | def to_pandas(self, *args, **kwargs):
    method to_string (line 305) | def to_string(self, *args, **kwargs):
    method to_reader (line 308) | def to_reader(self, max_chunksize: Optional[int] = None):
    method field (line 324) | def field(self, *args, **kwargs):
    method column (line 337) | def column(self, *args, **kwargs):
    method itercolumns (line 350) | def itercolumns(self, *args, **kwargs):
    method schema (line 360) | def schema(self):
    method columns (line 370) | def columns(self):
    method num_columns (line 380) | def num_columns(self):
    method num_rows (line 390) | def num_rows(self):
    method shape (line 403) | def shape(self):
    method nbytes (line 413) | def nbytes(self):
    method column_names (line 420) | def column_names(self):
    method __eq__ (line 426) | def __eq__(self, other):
    method __getitem__ (line 429) | def __getitem__(self, i):
    method __len__ (line 432) | def __len__(self):
    method __repr__ (line 435) | def __repr__(self):
    method __str__ (line 438) | def __str__(self):
    method slice (line 441) | def slice(self, *args, **kwargs):
    method filter (line 457) | def filter(self, *args, **kwargs):
    method flatten (line 463) | def flatten(self, *args, **kwargs):
    method combine_chunks (line 477) | def combine_chunks(self, *args, **kwargs):
    method cast (line 493) | def cast(self, *args, **kwargs):
    method replace_schema_metadata (line 508) | def replace_schema_metadata(self, *args, **kwargs):
    method add_column (line 522) | def add_column(self, *args, **kwargs):
    method append_column (line 543) | def append_column(self, *args, **kwargs):
    method remove_column (line 559) | def remove_column(self, *args, **kwargs):
    method set_column (line 572) | def set_column(self, *args, **kwargs):
    method rename_columns (line 590) | def rename_columns(self, *args, **kwargs):
    method drop (line 596) | def drop(self, *args, **kwargs):
    method select (line 612) | def select(self, *args, **kwargs):
  class TableBlock (line 628) | class TableBlock(Table):
  class InMemoryTable (line 638) | class InMemoryTable(TableBlock):
    method from_file (line 654) | def from_file(cls, filename: str):
    method from_buffer (line 659) | def from_buffer(cls, buffer: pa.Buffer):
    method from_pandas (line 664) | def from_pandas(cls, *args, **kwargs):
    method from_arrays (line 722) | def from_arrays(cls, *args, **kwargs):
    method from_pydict (line 742) | def from_pydict(cls, *args, **kwargs):
    method from_pylist (line 760) | def from_pylist(cls, mapping, *args, **kwargs):
    method from_batches (line 778) | def from_batches(cls, *args, **kwargs):
    method slice (line 793) | def slice(self, offset=0, length=None):
    method filter (line 810) | def filter(self, *args, **kwargs):
    method flatten (line 816) | def flatten(self, *args, **kwargs):
    method combine_chunks (line 830) | def combine_chunks(self, *args, **kwargs):
    method cast (line 846) | def cast(self, *args, **kwargs):
    method replace_schema_metadata (line 861) | def replace_schema_metadata(self, *args, **kwargs):
    method add_column (line 875) | def add_column(self, *args, **kwargs):
    method append_column (line 896) | def append_column(self, *args, **kwargs):
    method remove_column (line 913) | def remove_column(self, *args, **kwargs):
    method set_column (line 927) | def set_column(self, *args, **kwargs):
    method rename_columns (line 946) | def rename_columns(self, *args, **kwargs):
    method drop (line 952) | def drop(self, *args, **kwargs):
    method select (line 969) | def select(self, *args, **kwargs):
  class MemoryMappedTable (line 989) | class MemoryMappedTable(TableBlock):
    method __init__ (line 1010) | def __init__(self, table: pa.Table, path: str, replays: Optional[list[...
    method from_file (line 1016) | def from_file(cls, filename: str, replays=None):
    method __getstate__ (line 1021) | def __getstate__(self):
    method __setstate__ (line 1024) | def __setstate__(self, state):
    method _apply_replays (line 1032) | def _apply_replays(table: pa.Table, replays: Optional[list[Replay]] = ...
    method _append_replay (line 1043) | def _append_replay(self, replay: Replay) -> list[Replay]:
    method slice (line 1048) | def slice(self, offset=0, length=None):
    method filter (line 1067) | def filter(self, *args, **kwargs):
    method flatten (line 1075) | def flatten(self, *args, **kwargs):
    method combine_chunks (line 1091) | def combine_chunks(self, *args, **kwargs):
    method cast (line 1109) | def cast(self, *args, **kwargs):
    method replace_schema_metadata (line 1126) | def replace_schema_metadata(self, *args, **kwargs):
    method add_column (line 1142) | def add_column(self, *args, **kwargs):
    method append_column (line 1165) | def append_column(self, *args, **kwargs):
    method remove_column (line 1184) | def remove_column(self, *args, **kwargs):
    method set_column (line 1200) | def set_column(self, *args, **kwargs):
    method rename_columns (line 1221) | def rename_columns(self, *args, **kwargs):
    method drop (line 1229) | def drop(self, *args, **kwargs):
    method select (line 1248) | def select(self, *args, **kwargs):
  class ConcatenationTable (line 1273) | class ConcatenationTable(Table):
    method __init__ (line 1299) | def __init__(self, table: pa.Table, blocks: list[list[TableBlock]]):
    method __getstate__ (line 1312) | def __getstate__(self):
    method __setstate__ (line 1315) | def __setstate__(self, state):
    method _concat_blocks (line 1327) | def _concat_blocks(blocks: list[Union[TableBlock, pa.Table]], axis: in...
    method _concat_blocks_horizontally_and_vertically (line 1344) | def _concat_blocks_horizontally_and_vertically(cls, blocks: list[list[...
    method _merge_blocks (line 1354) | def _merge_blocks(cls, blocks: TableBlockContainer, axis: Optional[int...
    method _consolidate_blocks (line 1370) | def _consolidate_blocks(cls, blocks: TableBlockContainer) -> TableBloc...
    method from_blocks (line 1379) | def from_blocks(cls, blocks: TableBlockContainer) -> "ConcatenationTab...
    method from_tables (line 1393) | def from_tables(cls, tables: list[Union[pa.Table, Table]], axis: int =...
    method _slices (line 1475) | def _slices(self):
    method slice (line 1482) | def slice(self, offset=0, length=None):
    method filter (line 1513) | def filter(self, mask, *args, **kwargs):
    method flatten (line 1524) | def flatten(self, *args, **kwargs):
    method combine_chunks (line 1542) | def combine_chunks(self, *args, **kwargs):
    method cast (line 1562) | def cast(self, target_schema, *args, **kwargs):
    method replace_schema_metadata (line 1593) | def replace_schema_metadata(self, *args, **kwargs):
    method add_column (line 1611) | def add_column(self, *args, **kwargs):
    method append_column (line 1632) | def append_column(self, *args, **kwargs):
    method remove_column (line 1649) | def remove_column(self, i, *args, **kwargs):
    method set_column (line 1673) | def set_column(self, *args, **kwargs):
    method rename_columns (line 1692) | def rename_columns(self, names, *args, **kwargs):
    method drop (line 1705) | def drop(self, columns, *args, **kwargs):
    method select (line 1726) | def select(self, columns, *args, **kwargs):
  function concat_tables (line 1746) | def concat_tables(tables: list[Table], axis: int = 0) -> Table:
  function list_table_cache_files (line 1769) | def list_table_cache_files(table: Table) -> list[str]:
  function _wrap_for_chunked_arrays (line 1790) | def _wrap_for_chunked_arrays(func):
  function _are_list_values_of_length (line 1802) | def _are_list_values_of_length(array: pa.ListArray, length: int) -> bool:
  function _combine_list_array_offsets_with_mask (line 1807) | def _combine_list_array_offsets_with_mask(array: pa.ListArray) -> pa.Array:
  function _storage_type (line 1820) | def _storage_type(type: pa.DataType) -> pa.DataType:
  function _short_str (line 1833) | def _short_str(value: Any) -> str:
  function array_cast (line 1841) | def array_cast(
  function cast_array_to_feature (line 1954) | def cast_array_to_feature(
  function embed_array_storage (line 2096) | def embed_array_storage(array: pa.Array, feature: "FeatureType", token_p...
  class CastError (line 2154) | class CastError(ValueError):
    method __init__ (line 2157) | def __init__(self, *args, table_column_names: list[str], requested_col...
    method __reduce__ (line 2162) | def __reduce__(self):
    method details (line 2168) | def details(self):
  function cast_table_to_features (line 2179) | def cast_table_to_features(table: pa.Table, features: "Features"):
  function cast_table_to_schema (line 2201) | def cast_table_to_schema(table: pa.Table, schema: pa.Schema):
  function embed_table_storage (line 2233) | def embed_table_storage(table: pa.Table, token_per_repo_id=None):
  function table_cast (line 2257) | def table_cast(table: pa.Table, schema: pa.Schema):
  function table_flatten (line 2279) | def table_flatten(table: pa.Table):
  function table_visitor (line 2321) | def table_visitor(table: pa.Table, function: Callable[[pa.Array], None]):
  function table_iter (line 2353) | def table_iter(table: Table, batch_size: int, drop_last_batch=False) -> ...

FILE: src/datasets/utils/_dataset_viewer.py
  class DatasetViewerError (line 16) | class DatasetViewerError(DatasetsError):
  function get_exported_parquet_files (line 26) | def get_exported_parquet_files(
  function get_exported_dataset_infos (line 62) | def get_exported_dataset_infos(

FILE: src/datasets/utils/_dill.py
  class Pickler (line 27) | class Pickler(dill.Pickler):
    method save (line 31) | def save(self, obj, save_persistent_id=True):
    method _batch_setitems (line 72) | def _batch_setitems(self, items, *args, **kwargs):
    method memoize (line 83) | def memoize(self, obj):
  function pklregister (line 89) | def pklregister(t):
  function _is_supported_dill_version (line 99) | def _is_supported_dill_version():
  function dump (line 111) | def dump(obj, file):
  function dumps (line 116) | def dumps(obj):
  function log (line 125) | def log(pickler, msg):
  function log (line 130) | def log(pickler, msg):
  function _save_set (line 135) | def _save_set(pickler, obj):
  function _save_regexPattern (line 149) | def _save_regexPattern(pickler, obj):
  function _save_tiktokenEncoding (line 158) | def _save_tiktokenEncoding(pickler, obj):
  function _save_torchTensor (line 167) | def _save_torchTensor(pickler, obj):
  function _save_torchGenerator (line 186) | def _save_torchGenerator(pickler, obj):
  function _save_spacyLanguage (line 200) | def _save_spacyLanguage(pickler, obj):
  function _save_transformersPreTrainedTokenizerBase (line 214) | def _save_transformersPreTrainedTokenizerBase(pickler, obj):
  function _save_code (line 264) | def _save_code(pickler, obj):
  function save_code (line 361) | def save_code(pickler, obj):

FILE: src/datasets/utils/_filelock.py
  class FileLock (line 25) | class FileLock(FileLock_):
    method __init__ (line 33) | def __init__(self, lock_file, *args, **kwargs):
    method hash_filename_if_too_long (line 44) | def hash_filename_if_too_long(cls, path: str) -> str:

FILE: src/datasets/utils/deprecation_utils.py
  function deprecated (line 14) | def deprecated(help_message: Optional[str] = None):
  class OnAccess (line 59) | class OnAccess(enum.EnumMeta):
    method __getattribute__ (line 64) | def __getattribute__(cls, name):
    method __getitem__ (line 70) | def __getitem__(cls, name):
    method __call__ (line 76) | def __call__(cls, value, names=None, *, module=None, qualname=None, ty...
  class DeprecatedEnum (line 83) | class DeprecatedEnum(enum.Enum, metaclass=OnAccess):
    method __new__ (line 88) | def __new__(cls, value):
    method help_message (line 95) | def help_message(self):
    method deprecate (line 98) | def deprecate(self):

FILE: src/datasets/utils/doc_utils.py
  function is_documented_by (line 4) | def is_documented_by(function_with_docstring: Callable):

FILE: src/datasets/utils/experimental.py
  function experimental (line 8) | def experimental(fn: Callable) -> Callable:

FILE: src/datasets/utils/extract.py
  class ExtractManager (line 27) | class ExtractManager:
    method __init__ (line 28) | def __init__(self, cache_dir: Optional[str] = None):
    method _get_output_path (line 34) | def _get_output_path(self, path: str) -> str:
    method _do_extract (line 42) | def _do_extract(self, output_path: str, force_extract: bool) -> bool:
    method extract (line 47) | def extract(self, input_path: str, force_extract: bool = False) -> str:
  class BaseExtractor (line 57) | class BaseExtractor(ABC):
    method is_extractable (line 60) | def is_extractable(cls, path: Union[Path, str], **kwargs) -> bool: ...
    method extract (line 64) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class MagicNumberBaseExtractor (line 67) | class MagicNumberBaseExtractor(BaseExtractor, ABC):
    method read_magic_number (line 71) | def read_magic_number(path: Union[Path, str], magic_number_length: int):
    method is_extractable (line 76) | def is_extractable(cls, path: Union[Path, str], magic_number: bytes = ...
  class TarExtractor (line 86) | class TarExtractor(BaseExtractor):
    method is_extractable (line 88) | def is_extractable(cls, path: Union[Path, str], **kwargs) -> bool:
    method safemembers (line 92) | def safemembers(members: tarfile.TarFile, output_path: Union[Path, str]):
    method extract (line 128) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class GzipExtractor (line 135) | class GzipExtractor(MagicNumberBaseExtractor):
    method extract (line 139) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class ZipExtractor (line 145) | class ZipExtractor(MagicNumberBaseExtractor):
    method is_extractable (line 153) | def is_extractable(cls, path: Union[Path, str], magic_number: bytes = ...
    method safemembers (line 190) | def safemembers(members: list[zipfile.ZipInfo], output_path: Union[Pat...
    method extract (line 222) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class XzExtractor (line 229) | class XzExtractor(MagicNumberBaseExtractor):
    method extract (line 233) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class RarExtractor (line 239) | class RarExtractor(MagicNumberBaseExtractor):
    method safemembers (line 243) | def safemembers(members: list["rarfile.RarInfo"], output_path: Union[P...
    method extract (line 280) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class ZstdExtractor (line 291) | class ZstdExtractor(MagicNumberBaseExtractor):
    method extract (line 295) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class Bzip2Extractor (line 305) | class Bzip2Extractor(MagicNumberBaseExtractor):
    method extract (line 309) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class SevenZipExtractor (line 315) | class SevenZipExtractor(MagicNumberBaseExtractor):
    method safemembers (line 319) | def safemembers(members: list["py7zr.FileInfo"], output_path: Union[Pa...
    method extract (line 356) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class Lz4Extractor (line 367) | class Lz4Extractor(MagicNumberBaseExtractor):
    method extract (line 371) | def extract(input_path: Union[Path, str], output_path: Union[Path, str...
  class Extractor (line 381) | class Extractor:
    method _get_magic_number_max_length (line 396) | def _get_magic_number_max_length(cls):
    method _read_magic_number (line 405) | def _read_magic_number(path: Union[Path, str], magic_number_length: int):
    method is_extractable (line 412) | def is_extractable(cls, path: Union[Path, str], return_extractor: bool...
    method infer_extractor_format (line 424) | def infer_extractor_format(cls, path: Union[Path, str]) -> Optional[st...
    method extract (line 432) | def extract(

FILE: src/datasets/utils/file_utils.py
  class _AiohttpClientError (line 54) | class _AiohttpClientError(Exception):
  function is_remote_url (line 75) | def is_remote_url(url_or_filename: str) -> bool:
  function is_local_path (line 79) | def is_local_path(url_or_filename: str) -> bool:
  function is_relative_path (line 86) | def is_relative_path(url_or_filename: str) -> bool:
  function relative_to_absolute_path (line 90) | def relative_to_absolute_path(path: T) -> T:
  function url_or_path_join (line 96) | def url_or_path_join(base_name: str, *pathnames: str) -> str:
  function url_or_path_parent (line 103) | def url_or_path_parent(url_or_path: str) -> str:
  function hash_url_to_filename (line 110) | def hash_url_to_filename(url, etag=None):
  function cached_path (line 134) | def cached_path(
  function get_datasets_user_agent (line 257) | def get_datasets_user_agent(user_agent: Optional[Union[str, dict]] = Non...
  function get_authentication_headers_for_url (line 275) | def get_authentication_headers_for_url(url: str, token: Optional[Union[s...
  function _raise_if_offline_mode_is_enabled (line 285) | def _raise_if_offline_mode_is_enabled(msg: Optional[str] = None):
  function fsspec_head (line 293) | def fsspec_head(url, storage_options=None):
  function stack_multiprocessing_download_progress_bars (line 299) | def stack_multiprocessing_download_progress_bars():
  class TqdmCallback (line 305) | class TqdmCallback(fsspec.callbacks.TqdmCallback):
    method __init__ (line 306) | def __init__(self, tqdm_kwargs=None, *args, **kwargs):
  function fsspec_get (line 315) | def fsspec_get(url, temp_file, storage_options=None, desc=None, disable_...
  function get_from_cache (line 333) | def get_from_cache(
  function add_start_docstrings (line 420) | def add_start_docstrings(*docstr):
  function add_end_docstrings (line 428) | def add_end_docstrings(*docstr):
  function estimate_dataset_size (line 436) | def estimate_dataset_size(paths):
  function readline (line 440) | def readline(f: io.RawIOBase):
  class NonStreamableDatasetError (line 512) | class NonStreamableDatasetError(Exception):
  function _get_path_extension (line 516) | def _get_path_extension(path: str) -> str:
  function _get_extraction_protocol_with_magic_number (line 526) | def _get_extraction_protocol_with_magic_number(f) -> Optional[str]:
  function _get_extraction_protocol (line 544) | def _get_extraction_protocol(urlpath: str, download_config: Optional[Dow...
  function xjoin (line 570) | def xjoin(a, *p):
  function xdirname (line 597) | def xdirname(a):
  function xexists (line 628) | def xexists(urlpath: str, download_config: Optional[DownloadConfig] = No...
  function xbasename (line 649) | def xbasename(a):
  function xsplit (line 675) | def xsplit(a):
  function xsplitext (line 702) | def xsplitext(a):
  function xisfile (line 729) | def xisfile(path, download_config: Optional[DownloadConfig] = None) -> b...
  function xgetsize (line 749) | def xgetsize(path, download_config: Optional[DownloadConfig] = None) -> ...
  function xisdir (line 777) | def xisdir(path, download_config: Optional[DownloadConfig] = None) -> bool:
  function xrelpath (line 800) | def xrelpath(path, start=None):
  class _OverridableIOWrapper (line 817) | class _OverridableIOWrapper(io.RawIOBase):
    method __init__ (line 818) | def __init__(self, f):
    method __getattribute__ (line 822) | def __getattribute__(self, attr):
    method __setattr__ (line 829) | def __setattr__(self, attr, value):
  function _add_retries_to_file_obj_read_method (line 836) | def _add_retries_to_file_obj_read_method(file_obj):
  function _prepare_path_and_storage_options (line 880) | def _prepare_path_and_storage_options(
  function _prepare_single_hop_path_and_storage_options (line 892) | def _prepare_single_hop_path_and_storage_options(
  function xopen (line 945) | def xopen(file: str, mode="r", *args, download_config: Optional[Download...
  function xlistdir (line 1006) | def xlistdir(path: str, download_config: Optional[DownloadConfig] = None...
  function xglob (line 1031) | def xglob(urlpath, *, recursive=False, download_config: Optional[Downloa...
  function xwalk (line 1057) | def xwalk(urlpath, download_config: Optional[DownloadConfig] = None, **k...
  class xPath (line 1085) | class xPath(type(Path())):
    method __str__ (line 1088) | def __str__(self):
    method exists (line 1098) | def exists(self, download_config: Optional[DownloadConfig] = None):
    method glob (line 1109) | def glob(self, pattern, download_config: Optional[DownloadConfig] = No...
    method rglob (line 1137) | def rglob(self, pattern, **kwargs):
    method parent (line 1149) | def parent(self) -> "xPath":
    method name (line 1158) | def name(self) -> str:
    method stem (line 1167) | def stem(self) -> str:
    method suffix (line 1176) | def suffix(self) -> str:
    method open (line 1184) | def open(self, *args, **kwargs):
    method joinpath (line 1196) | def joinpath(self, *p: tuple[str, ...]) -> "xPath":
    method __truediv__ (line 1207) | def __truediv__(self, p: str) -> "xPath":
    method with_suffix (line 1210) | def with_suffix(self, suffix):
  function _as_str (line 1217) | def _as_str(path: Union[str, Path, xPath]):
  function xgzip_open (line 1221) | def xgzip_open(filepath_or_buffer, *args, download_config: Optional[Down...
  function xnumpy_load (line 1231) | def xnumpy_load(filepath_or_buffer, *args, download_config: Optional[Dow...
  function xpandas_read_csv (line 1241) | def xpandas_read_csv(filepath_or_buffer, download_config: Optional[Downl...
  function xpandas_read_excel (line 1253) | def xpandas_read_excel(filepath_or_buffer, download_config: Optional[Dow...
  function xpyarrow_parquet_read_table (line 1271) | def xpyarrow_parquet_read_table(filepath_or_buffer, download_config: Opt...
  function xsio_loadmat (line 1281) | def xsio_loadmat(filepath_or_buffer, download_config: Optional[DownloadC...
  function xet_parse (line 1290) | def xet_parse(source, parser=None, download_config: Optional[DownloadCon...
  function xxml_dom_minidom_parse (line 1308) | def xxml_dom_minidom_parse(filename_or_file, download_config: Optional[D...
  class ArchiveIterable (line 1326) | class ArchiveIterable(TrackedIterableFromGenerator):
    method _iter_tar (line 1330) | def _iter_tar(f):
    method _iter_zip (line 1347) | def _iter_zip(f):
    method _iter_from_fileobj (line 1362) | def _iter_from_fileobj(cls, f) -> Generator[tuple, None, None]:
    method _iter_from_urlpath (line 1370) | def _iter_from_urlpath(
    method from_buf (line 1383) | def from_buf(cls, fileobj) -> "ArchiveIterable":
    method from_urlpath (line 1387) | def from_urlpath(cls, urlpath_or_buf, download_config: Optional[Downlo...
  class FilesIterable (line 1391) | class FilesIterable(TrackedIterableFromGenerator):
    method _iter_from_urlpaths (line 1395) | def _iter_from_urlpaths(
    method from_urlpaths (line 1419) | def from_urlpaths(cls, urlpaths, download_config: Optional[DownloadCon...

FILE: src/datasets/utils/info_utils.py
  class VerificationMode (line 22) | class VerificationMode(enum.Enum):
  function verify_checksums (line 43) | def verify_checksums(expected_checksums: Optional[dict], recorded_checks...
  function verify_splits (line 62) | def verify_splits(expected_splits: Optional[dict], recorded_splits: dict):
  function get_size_checksum_dict (line 80) | def get_size_checksum_dict(path: str, record_checksum: bool = True) -> d...
  function is_small_dataset (line 93) | def is_small_dataset(dataset_size):

FILE: src/datasets/utils/json.py
  function ujson_dumps (line 10) | def ujson_dumps(*args, **kwargs):
  function ujson_loads (line 18) | def ujson_loads(*args, **kwargs):
  function json_encode_field (line 26) | def json_encode_field(example: Any, json_field_path: str) -> Any:
  function find_mixed_struct_types_field_paths (line 45) | def find_mixed_struct_types_field_paths(examples: list, allow_root=False...
  function get_json_field_path_from_pyarrow_json_error (line 72) | def get_json_field_path_from_pyarrow_json_error(err_str: str) -> str:
  function insert_json_field_path (line 80) | def insert_json_field_path(json_field_paths: list[str], json_field_path:...
  function json_encode_fields_in_json_lines (line 90) | def json_encode_fields_in_json_lines(original_batch: bytes, json_field_p...
  function get_json_field_paths_from_feature (line 98) | def get_json_field_paths_from_feature(feature: "FeatureType") -> list[str]:
  function set_json_types_in_feature (line 112) | def set_json_types_in_feature(feature: "FeatureType", json_field_paths: ...

FILE: src/datasets/utils/logging.py
  function _get_default_logging_level (line 49) | def _get_default_logging_level():
  function _get_library_name (line 65) | def _get_library_name() -> str:
  function _get_library_root_logger (line 69) | def _get_library_root_logger() -> logging.Logger:
  function _configure_library_root_logger (line 73) | def _configure_library_root_logger() -> None:
  function _reset_library_root_logger (line 80) | def _reset_library_root_logger() -> None:
  function get_logger (line 85) | def get_logger(name: Optional[str] = None) -> logging.Logger:
  function get_verbosity (line 94) | def get_verbosity() -> int:
  function set_verbosity (line 110) | def set_verbosity(verbosity: int) -> None:
  function set_verbosity_info (line 119) | def set_verbosity_info():
  function set_verbosity_warning (line 129) | def set_verbosity_warning():
  function set_verbosity_debug (line 139) | def set_verbosity_debug():
  function set_verbosity_error (line 149) | def set_verbosity_error():
  function disable_propagation (line 159) | def disable_propagation() -> None:
  function enable_propagation (line 166) | def enable_propagation() -> None:

FILE: src/datasets/utils/metadata.py
  class _NoDuplicateSafeLoader (line 21) | class _NoDuplicateSafeLoader(yaml.SafeLoader):
    method _check_no_duplicates_on_constructed_node (line 22) | def _check_no_duplicates_on_constructed_node(self, node):
    method construct_mapping (line 30) | def construct_mapping(self, node, deep=False):
  function _split_yaml_from_readme (line 36) | def _split_yaml_from_readme(readme_content: str) -> tuple[Optional[str],...
  class MetadataConfigs (line 46) | class MetadataConfigs(dict[str, dict[str, Any]]):
    method _raise_if_data_files_field_not_valid (line 52) | def _raise_if_data_files_field_not_valid(metadata_config: dict):
    method _from_exported_parquet_files_and_dataset_infos (line 103) | def _from_exported_parquet_files_and_dataset_infos(
    method from_dataset_card_data (line 142) | def from_dataset_card_data(cls, dataset_card_data: DatasetCardData) ->...
    method to_dataset_card_data (line 166) | def to_dataset_card_data(self, dataset_card_data: DatasetCardData) -> ...
    method get_default_config_name (line 179) | def get_default_config_name(self) -> Optional[str]:

FILE: src/datasets/utils/patching.py
  class _PatchedModuleObj (line 9) | class _PatchedModuleObj:
    method __init__ (line 12) | def __init__(self, module, attrs=None):
  class patch_submodule (line 21) | class patch_submodule:
    method __init__ (line 40) | def __init__(self, obj, target: str, new, attrs=None):
    method __enter__ (line 48) | def __enter__(self):
    method __exit__ (line 102) | def __exit__(self, *exc_info):
    method start (line 106) | def start(self):
    method stop (line 111) | def stop(self):

FILE: src/datasets/utils/py_utils.py
  function size_str (line 71) | def size_str(size_in_bytes):
  function convert_file_size_to_int (line 95) | def convert_file_size_to_int(size: Union[int, str]) -> int:
  function glob_pattern_to_regex (line 139) | def glob_pattern_to_regex(pattern):
  function string_to_dict (line 158) | def string_to_dict(string: str, pattern: str) -> Optional[dict[str, str]]:
  function asdict (line 192) | def asdict(obj):
  function temporary_assignment (line 232) | def temporary_assignment(obj, attr, value):
  function temp_seed (line 243) | def temp_seed(seed: int, set_pytorch=False, set_tensorflow=False):
  function unique_values (line 296) | def unique_values(values):
  function no_op_if_value_is_null (line 305) | def no_op_if_value_is_null(func):
  function first_non_null_value (line 314) | def first_non_null_value(iterable):
  function first_non_null_non_empty_value (line 322) | def first_non_null_non_empty_value(iterable):
  function zip_dict (line 330) | def zip_dict(*dicts):
  class NonMutableDict (line 337) | class NonMutableDict(dict):
    method __init__ (line 345) | def __init__(self, *args, **kwargs):
    method __setitem__ (line 354) | def __setitem__(self, key, value):
    method update (line 359) | def update(self, other):
  class classproperty (line 365) | class classproperty(property):  # pylint: disable=invalid-name
    method __get__ (line 368) | def __get__(self, obj, objtype=None):
  function _single_map_nested (line 372) | def _single_map_nested(args):
  function map_nested (line 416) | def map_nested(
  class NestedDataStructure (line 553) | class NestedDataStructure:
    method __init__ (line 554) | def __init__(self, data=None):
    method flatten (line 557) | def flatten(self, data=None):
  function has_sufficient_disk_space (line 567) | def has_sufficient_disk_space(needed_bytes, directory="."):
  function copyfunc (line 575) | def copyfunc(func):
  function _write_generator_to_queue (line 584) | def _write_generator_to_queue(queue: queue.Queue, func: Callable[..., It...
  function _get_pool_pid (line 590) | def _get_pool_pid(pool: Union[multiprocessing.pool.Pool, multiprocess.po...
  function iflatmap_unordered (line 594) | def iflatmap_unordered(
  function iter_batched (line 630) | def iter_batched(iterable: Iterable[T], n: int) -> Iterable[list[T]]:

FILE: src/datasets/utils/sharding.py
  function _number_of_shards_in_gen_kwargs (line 4) | def _number_of_shards_in_gen_kwargs(gen_kwargs: dict) -> int:
  function _distribute_shards (line 21) | def _distribute_shards(num_shards: int, max_num_jobs: int) -> list[range]:
  function _split_gen_kwargs (line 48) | def _split_gen_kwargs(gen_kwargs: dict, max_num_jobs: int) -> list[dict]:
  function _merge_gen_kwargs (line 67) | def _merge_gen_kwargs(gen_kwargs_list: list[dict]) -> dict:
  function _shuffle_gen_kwargs (line 76) | def _shuffle_gen_kwargs(rng: np.random.Generator, gen_kwargs: dict) -> d...

FILE: src/datasets/utils/stratify.py
  function approximate_mode (line 4) | def approximate_mode(class_counts, n_draws, rng):
  function stratified_shuffle_split_generate_indices (line 54) | def stratified_shuffle_split_generate_indices(y, n_train, n_test, rng, n...

FILE: src/datasets/utils/tf_utils.py
  function minimal_tf_collate_fn (line 36) | def minimal_tf_collate_fn(features):
  function minimal_tf_collate_fn_with_renaming (line 56) | def minimal_tf_collate_fn_with_renaming(features):
  function is_numeric_pa_type (line 64) | def is_numeric_pa_type(pa_type):
  function np_get_batch (line 70) | def np_get_batch(
  function dataset_to_tf (line 118) | def dataset_to_tf(
  class SharedMemoryContext (line 231) | class SharedMemoryContext:
    method __init__ (line 234) | def __init__(self):
    method get_shm (line 238) | def get_shm(self, name, size, create):
    method get_array (line 248) | def get_array(self, name, shape, dtype, create):
    method __enter__ (line 252) | def __enter__(self):
    method __exit__ (line 255) | def __exit__(self, exc_type, exc_value, traceback):
  class NumpyMultiprocessingGenerator (line 263) | class NumpyMultiprocessingGenerator:
    method __init__ (line 264) | def __init__(
    method __iter__ (line 298) | def __iter__(self):
    method __call__ (line 395) | def __call__(self):
    method worker_loop (line 399) | def worker_loop(
    method distribute_batches (line 471) | def distribute_batches(dataset, batch_size, drop_remainder, num_worker...
  function multiprocess_dataset_to_tf (line 503) | def multiprocess_dataset_to_tf(

FILE: src/datasets/utils/tqdm.py
  function disable_progress_bars (line 61) | def disable_progress_bars() -> None:
  function enable_progress_bars (line 78) | def enable_progress_bars() -> None:
  function are_progress_bars_disabled (line 95) | def are_progress_bars_disabled() -> bool:
  class tqdm (line 105) | class tqdm(old_tqdm):
    method __init__ (line 112) | def __init__(self, *args, **kwargs):
    method __delattr__ (line 120) | def __delattr__(self, attr: str) -> None:
  function is_progress_bar_enabled (line 134) | def is_progress_bar_enabled():

FILE: src/datasets/utils/track.py
  class tracked_str (line 4) | class tracked_str(str):
    method set_origin (line 7) | def set_origin(self, origin: str):
    method get_origin (line 11) | def get_origin(self):
    method __repr__ (line 14) | def __repr__(self) -> str:
  class tracked_list (line 21) | class tracked_list(list):
    method __init__ (line 22) | def __init__(self, *args, **kwargs) -> None:
    method __iter__ (line 26) | def __iter__(self) -> Iterator:
    method __repr__ (line 32) | def __repr__(self) -> str:
  class TrackedIterableFromGenerator (line 39) | class TrackedIterableFromGenerator(Iterable):
    method __init__ (line 42) | def __init__(self, generator, *args):
    method __iter__ (line 48) | def __iter__(self):
    method __repr__ (line 54) | def __repr__(self) -> str:
    method __reduce__ (line 60) | def __reduce__(self):

FILE: src/datasets/utils/version.py
  class Version (line 30) | class Version:
    method __post_init__ (line 55) | def __post_init__(self):
    method __repr__ (line 58) | def __repr__(self):
    method tuple (line 62) | def tuple(self):
    method _validate_operand (line 65) | def _validate_operand(self, other):
    method __eq__ (line 72) | def __eq__(self, other):
    method __lt__ (line 80) | def __lt__(self, other):
    method __hash__ (line 84) | def __hash__(self):
    method from_dict (line 88) | def from_dict(cls, dic):
    method _to_yaml_string (line 92) | def _to_yaml_string(self) -> str:
  function _str_to_version_tuple (line 96) | def _str_to_version_tuple(version_str):
  function _version_tuple_to_str (line 104) | def _version_tuple_to_str(version_tuple):

FILE: tests/commands/conftest.py
  function dataset_dir (line 6) | def dataset_dir(tmp_path):

FILE: tests/commands/test_test.py
  function is_1percent_close (line 29) | def is_1percent_close(source, target):
  function test_test_command (line 34) | def test_test_command(dataset_dir):

FILE: tests/conftest.py
  function pytest_collection_modifyitems (line 11) | def pytest_collection_modifyitems(config, items):
  function set_test_cache_config (line 20) | def set_test_cache_config(tmp_path_factory, monkeypatch):
  function disable_implicit_token (line 35) | def disable_implicit_token(monkeypatch):
  function disable_tqdm_output (line 40) | def disable_tqdm_output():
  function set_update_download_counts_to_false (line 45) | def set_update_download_counts_to_false(monkeypatch):
  function set_sqlalchemy_silence_uber_warning (line 51) | def set_sqlalchemy_silence_uber_warning(monkeypatch):
  function zero_time_out_for_remote_code (line 61) | def zero_time_out_for_remote_code():

FILE: tests/distributed_scripts/run_torch_distributed.py
  class FailedTestError (line 15) | class FailedTestError(RuntimeError):
  function gen (line 19) | def gen(shards: List[str]):
  function main (line 25) | def main():

FILE: tests/features/test_array_xd.py
  function generate_examples (line 34) | def generate_examples(features: dict, num_examples=100, seq_shapes=None):
  class ExtensionTypeCompatibilityTest (line 59) | class ExtensionTypeCompatibilityTest(unittest.TestCase):
    method test_array2d_nonspecific_shape (line 60) | def test_array2d_nonspecific_shape(self):
    method test_multiple_extensions_same_row (line 81) | def test_multiple_extensions_same_row(self):
    method test_compatability_with_string_values (line 100) | def test_compatability_with_string_values(self):
    method test_extension_indexing (line 113) | def test_extension_indexing(self):
  function get_array_feature_types (line 129) | def get_array_feature_types():
  class ArrayXDTest (line 144) | class ArrayXDTest(unittest.TestCase):
    method get_features (line 145) | def get_features(self, array_feature, shape_1, shape_2):
    method get_dict_example_0 (line 154) | def get_dict_example_0(self, shape_1, shape_2):
    method get_dict_example_1 (line 161) | def get_dict_example_1(self, shape_1, shape_2):
    method get_dict_examples (line 168) | def get_dict_examples(self, shape_1, shape_2):
    method _check_getitem_output_type (line 175) | def _check_getitem_output_type(self, dataset, shape_1, shape_2, first_...
    method test_write (line 207) | def test_write(self, array_feature, shape_1, shape_2):
    method test_write_batch (line 223) | def test_write_batch(self, array_feature, shape_1, shape_2):
    method test_from_dict (line 235) | def test_from_dict(self, array_feature, shape_1, shape_2):
  class ArrayXDDynamicTest (line 244) | class ArrayXDDynamicTest(unittest.TestCase):
    method get_one_col_dataset (line 245) | def get_one_col_dataset(self, first_dim_list, fixed_shape):
    method get_two_col_datasset (line 251) | def get_two_col_datasset(self, first_dim_list, fixed_shape):
    method test_to_pylist (line 262) | def test_to_pylist(self):
    method test_to_numpy (line 274) | def test_to_numpy(self):
    method test_iter_dataset (line 307) | def test_iter_dataset(self):
    method test_to_pandas (line 317) | def test_to_pandas(self):
    method test_map_dataset (line 346) | def test_map_dataset(self):
  function test_table_to_pandas (line 361) | def test_table_to_pandas(dtype, dummy_value):
  function test_array_xd_numpy_arrow_extractor (line 371) | def test_array_xd_numpy_arrow_extractor(dtype, dummy_value):
  function test_array_xd_with_none (line 379) | def test_array_xd_with_none():
  function test_array_xd_with_np (line 419) | def test_array_xd_with_np(seq_type, dtype, shape, feature_class):
  function test_dataset_map (line 436) | def test_dataset_map(with_none):

FILE: tests/features/test_audio.py
  function tar_wav_path (line 17) | def tar_wav_path(shared_datadir, tmp_path_factory):
  function tar_mp3_path (line 26) | def tar_mp3_path(shared_datadir, tmp_path_factory):
  function iter_archive (line 34) | def iter_archive(archive_path):
  function test_audio_instantiation (line 42) | def test_audio_instantiation():
  function test_audio_feature_type_to_arrow (line 53) | def test_audio_feature_type_to_arrow():
  function test_audio_feature_encode_example (line 77) | def test_audio_feature_encode_example(shared_datadir, build_example):
  function test_audio_feature_encode_example_pcm (line 100) | def test_audio_feature_encode_example_pcm(shared_datadir, build_example):
  function test_audio_feature_encode_example_audiodecoder (line 121) | def test_audio_feature_encode_example_audiodecoder(shared_datadir, in_sa...
  function test_audio_decode_example (line 136) | def test_audio_decode_example(shared_datadir):
  function test_audio_resampling (line 152) | def test_audio_resampling(shared_datadir):
  function test_audio_decode_example_mp3 (line 165) | def test_audio_decode_example_mp3(shared_datadir):
  function test_audio_decode_example_opus (line 179) | def test_audio_decode_example_opus(shared_datadir):
  function test_audio_decode_example_pcm (line 193) | def test_audio_decode_example_pcm(shared_datadir, sampling_rate):
  function test_audio_resampling_mp3_different_sampling_rates (line 207) | def test_audio_resampling_mp3_different_sampling_rates(shared_datadir):
  function test_backwards_compatibility (line 228) | def test_backwards_compatibility(shared_datadir):
  function test_dataset_with_audio_feature (line 251) | def test_dataset_with_audio_feature(shared_datadir):
  function test_dataset_with_audio_feature_tar_wav (line 280) | def test_dataset_with_audio_feature_tar_wav(tar_wav_path):
  function test_dataset_with_audio_feature_tar_mp3 (line 314) | def test_dataset_with_audio_feature_tar_mp3(tar_mp3_path):
  function test_dataset_with_audio_feature_with_none (line 348) | def test_dataset_with_audio_feature_with_none():
  function test_resampling_at_loading_dataset_with_audio_feature (line 382) | def test_resampling_at_loading_dataset_with_audio_feature(shared_datadir):
  function test_resampling_at_loading_dataset_with_audio_feature_mp3 (line 411) | def test_resampling_at_loading_dataset_with_audio_feature_mp3(shared_dat...
  function test_resampling_after_loading_dataset_with_audio_feature (line 440) | def test_resampling_after_loading_dataset_with_audio_feature(shared_data...
  function test_resampling_after_loading_dataset_with_audio_feature_mp3 (line 473) | def test_resampling_after_loading_dataset_with_audio_feature_mp3(shared_...
  function test_dataset_cast_to_audio_features (line 518) | def test_dataset_cast_to_audio_features(shared_datadir, build_data):
  function test_dataset_concatenate_audio_features (line 533) | def test_dataset_concatenate_audio_features(shared_datadir):
  function test_dataset_concatenate_nested_audio_features (line 551) | def test_dataset_concatenate_nested_audio_features(shared_datadir):
  function test_dataset_with_audio_feature_map_is_not_decoded (line 572) | def test_dataset_with_audio_feature_map_is_not_decoded(shared_datadir):
  function test_dataset_with_audio_feature_map_is_decoded (line 594) | def test_dataset_with_audio_feature_map_is_decoded(shared_datadir):
  function test_formatted_dataset_with_audio_feature (line 624) | def test_formatted_dataset_with_audio_feature(shared_datadir):
  function jsonl_audio_dataset_path (line 676) | def jsonl_audio_dataset_path(shared_datadir, tmp_path_factory):
  function test_load_dataset_with_audio_feature (line 690) | def test_load_dataset_with_audio_feature(streaming, jsonl_audio_dataset_...
  function test_dataset_with_audio_feature_loaded_from_cache (line 708) | def test_dataset_with_audio_feature_loaded_from_cache():
  function test_dataset_with_audio_feature_undecoded (line 717) | def test_dataset_with_audio_feature_undecoded(shared_datadir):
  function test_formatted_dataset_with_audio_feature_undecoded (line 735) | def test_formatted_dataset_with_audio_feature_undecoded(shared_datadir):
  function test_dataset_with_audio_feature_map_undecoded (line 767) | def test_dataset_with_audio_feature_map_undecoded(shared_datadir):
  function test_audio_embed_storage (line 785) | def test_audio_embed_storage(shared_datadir):
  function test_audio_decode_example_opus_convert_to_stereo (line 795) | def test_audio_decode_example_opus_convert_to_stereo(shared_datadir):
  function test_audio_decode_example_opus_convert_to_mono (line 809) | def test_audio_decode_example_opus_convert_to_mono(shared_datadir):

FILE: tests/features/test_features.py
  function list_with (line 39) | def list_with(item):
  class FeaturesTest (line 43) | class FeaturesTest(TestCase):
    method test_from_arrow_schema_simple (line 44) | def test_from_arrow_schema_simple(self):
    method test_from_arrow_schema_with_sequence (line 54) | def test_from_arrow_schema_with_sequence(self):
    method test_string_to_arrow_bijection_for_primitive_types (line 64) | def test_string_to_arrow_bijection_for_primitive_types(self):
    method test_categorical_one_way (line 113) | def test_categorical_one_way(self):
    method test_feature_named_type (line 119) | def test_feature_named_type(self):
    method test_feature_named_self_as_kwarg (line 126) | def test_feature_named_self_as_kwarg(self):
    method test_class_label_feature_with_no_labels (line 133) | def test_class_label_feature_with_no_labels(self):
    method test_reorder_fields_as (line 140) | def test_reorder_fields_as(self):
    method test_flatten (line 258) | def test_flatten(self):
    method test_flatten_with_sequence (line 265) | def test_flatten_with_sequence(self):
    method test_features_dicts_are_synced (line 272) | def test_features_dicts_are_synced(self):
  function test_classlabel_init (line 297) | def test_classlabel_init(tmp_path_factory):
  function test_classlabel_str2int (line 320) | def test_classlabel_str2int():
  function test_classlabel_int2str (line 333) | def test_classlabel_int2str():
  function test_classlabel_cast_storage (line 346) | def test_classlabel_cast_storage():
  function test_class_label_to_and_from_dict (line 386) | def test_class_label_to_and_from_dict(class_label_arg, tmp_path_factory):
  function test_decode_nested_example_with_list_types (line 403) | def test_decode_nested_example_with_list_types(schema, monkeypatch):
  function test_encode_nested_example_with_list_types (line 416) | def test_encode_nested_example_with_list_types(schema):
  function test_encode_nested_example_sequence_with_none (line 422) | def test_encode_nested_example_sequence_with_none(inner_type):
  function test_encode_example (line 438) | def test_encode_example(features_dict, example, expected_encoded_example):
  function test_encode_batch_with_example_with_empty_first_elem (line 444) | def test_encode_batch_with_example_with_empty_first_elem():
  function test_encode_column_dict_with_none (line 461) | def test_encode_column_dict_with_none():
  function test_dataset_feature_with_none (line 480) | def test_dataset_feature_with_none(feature):
  function iternumpy (line 513) | def iternumpy(key1, value1, value2):
  function dict_diff (line 520) | def dict_diff(d1: dict, d2: dict):  # check if 2 dictionaries are equal
  class CastToPythonObjectsTest (line 536) | class CastToPythonObjectsTest(TestCase):
    method test_cast_to_python_objects_list (line 537) | def test_cast_to_python_objects_list(self):
    method test_cast_to_python_objects_tuple (line 543) | def test_cast_to_python_objects_tuple(self):
    method test_cast_to_python_or_numpy (line 549) | def test_cast_to_python_or_numpy(self):
    method test_cast_to_python_objects_series (line 558) | def test_cast_to_python_objects_series(self):
    method test_cast_to_python_objects_dataframe (line 567) | def test_cast_to_python_objects_dataframe(self):
    method test_cast_to_python_objects_pandas_timestamp (line 573) | def test_cast_to_python_objects_pandas_timestamp(self):
    method test_cast_to_python_objects_pandas_timedelta (line 583) | def test_cast_to_python_objects_pandas_timedelta(self):
    method test_cast_to_python_objects_torch (line 595) | def test_cast_to_python_objects_torch(self):
    method test_cast_to_python_objects_tf (line 610) | def test_cast_to_python_objects_tf(self):
    method test_cast_to_python_objects_jax (line 625) | def test_cast_to_python_objects_jax(self):
    method test_dont_iterate_over_each_element_in_a_list (line 641) | def test_dont_iterate_over_each_element_in_a_list(self, mocked_cast):
  function test_features_to_dict_and_from_dict_round_trip (line 688) | def test_features_to_dict_and_from_dict_round_trip(features: Features):
  function test_features_to_yaml_list (line 696) | def test_features_to_yaml_list(features: Features):
  function test_features_flatten_with_list_types (line 711) | def test_features_flatten_with_list_types(features_dict, expected_featur...
  function test_features_from_dict_with_list_types (line 742) | def test_features_from_dict_with_list_types(deserialized_features_dict, ...
  function test_generate_from_dict_with_list_types (line 772) | def test_generate_from_dict_with_list_types(deserialized_feature_dict, e...
  function test_features_to_yaml_list_with_large_list (line 787) | def test_features_to_yaml_list_with_large_list(features_dict, expected_f...
  function test_features_from_yaml_list_with_large_list (line 803) | def test_features_from_yaml_list_with_large_list(features_yaml_list, exp...
  function test_features_to_arrow_schema (line 809) | def test_features_to_arrow_schema(features: Features):
  function test_features_alignment (line 893) | def test_features_alignment(features: tuple[list[Features], list[Feature...
  function test_features_from_arrow_schema_primitive_data_type (line 900) | def test_features_from_arrow_schema_primitive_data_type(dtype):
  function test_features_from_arrow_schema_list_data_type (line 907) | def test_features_from_arrow_schema_list_data_type(list_dtype, scalar_dt...
  function test_features_reorder_fields_as_with_list_types (line 932) | def test_features_reorder_fields_as_with_list_types(feature, other_featu...
  function test_get_nested_type_with_scalar_feature (line 942) | def test_get_nested_type_with_scalar_feature(feature, expected_arrow_dat...
  function test_get_nested_type_with_list_feature (line 954) | def test_get_nested_type_with_list_feature(
  function test_generate_from_arrow_type_with_arrow_primitive_data_type (line 965) | def test_generate_from_arrow_type_with_arrow_primitive_data_type(arrow_p...
  function test_generate_from_arrow_type_with_arrow_nested_data_type (line 977) | def test_generate_from_arrow_type_with_arrow_nested_data_type(
  function test_check_non_null_non_empty_recursive_with_list_types (line 990) | def test_check_non_null_non_empty_recursive_with_list_types(schema):
  function test_check_non_null_non_empty_recursive_with_nested_list_types (line 1002) | def test_check_non_null_non_empty_recursive_with_nested_list_types(schema):
  function test_require_decoding_with_list_types (line 1007) | def test_require_decoding_with_list_types(feature):
  function test_require_storage_cast_with_list_types (line 1012) | def test_require_storage_cast_with_list_types(feature):
  function test_require_storage_embed_with_list_types (line 1017) | def test_require_storage_embed_with_list_types(feature):
  function test_visit_with_list_types (line 1025) | def test_visit_with_list_types(feature, expected):
  function test_is_null_feature (line 1048) | def test_is_null_feature(feature, expected):

FILE: tests/features/test_image.py
  function tar_jpg_path (line 20) | def tar_jpg_path(shared_datadir, tmp_path_factory):
  function iter_archive (line 28) | def iter_archive(archive_path):
  function test_image_instantiation (line 36) | def test_image_instantiation():
  function test_image_feature_type_to_arrow (line 44) | def test_image_feature_type_to_arrow():
  function test_image_feature_encode_example (line 67) | def test_image_feature_encode_example(shared_datadir, build_example):
  function test_image_decode_example (line 81) | def test_image_decode_example(shared_datadir):
  function test_image_decode_example_with_exif_orientation_tag (line 98) | def test_image_decode_example_with_exif_orientation_tag(shared_datadir):
  function test_image_change_mode (line 116) | def test_image_change_mode(shared_datadir):
  function test_dataset_with_image_feature (line 130) | def test_dataset_with_image_feature(shared_datadir):
  function test_dataset_with_image_feature_from_pil_image (line 163) | def test_dataset_with_image_feature_from_pil_image(infer_feature, shared...
  function test_dataset_with_image_feature_from_np_array (line 195) | def test_dataset_with_image_feature_from_np_array():
  function test_dataset_with_image_feature_tar_jpg (line 228) | def test_dataset_with_image_feature_tar_jpg(tar_jpg_path):
  function test_dataset_with_image_feature_with_none (line 263) | def test_dataset_with_image_feature_with_none():
  function test_dataset_cast_to_image_features (line 309) | def test_dataset_cast_to_image_features(shared_datadir, build_data):
  function test_dataset_cast_to_image_features_polars (line 323) | def test_dataset_cast_to_image_features_polars(shared_datadir):
  function test_dataset_concatenate_image_features (line 336) | def test_dataset_concatenate_image_features(shared_datadir):
  function test_dataset_concatenate_nested_image_features (line 350) | def test_dataset_concatenate_nested_image_features(shared_datadir):
  function test_dataset_with_image_feature_map (line 371) | def test_dataset_with_image_feature_map(shared_datadir):
  function test_formatted_dataset_with_image_feature_map (line 420) | def test_formatted_dataset_with_image_feature_map(shared_datadir):
  function test_dataset_with_image_feature_map_change_image (line 455) | def test_dataset_with_image_feature_map_change_image(shared_datadir):
  function test_formatted_dataset_with_image_feature (line 525) | def test_formatted_dataset_with_image_feature(shared_datadir):
  function img_dataset_dir (line 576) | def img_dataset_dir(shared_datadir, tmp_path):
  function test_load_dataset_with_image_feature (line 587) | def test_load_dataset_with_image_feature(shared_datadir, img_dataset_dir...
  function test_dataset_with_image_feature_undecoded (line 602) | def test_dataset_with_image_feature_undecoded(shared_datadir):
  function test_formatted_dataset_with_image_feature_undecoded (line 620) | def test_formatted_dataset_with_image_feature_undecoded(shared_datadir):
  function test_dataset_with_image_feature_map_undecoded (line 652) | def test_dataset_with_image_feature_map_undecoded(shared_datadir):
  function test_image_embed_storage (line 671) | def test_image_embed_storage(shared_datadir):
  function test_encode_np_array (line 693) | def test_encode_np_array(array, dtype_cast, expected_image_format):

FILE: tests/features/test_nifti.py
  function test_nifti_feature_encode_example (line 29) | def test_nifti_feature_encode_example(shared_datadir, nifti_file, build_...
  function test_dataset_with_nifti_feature (line 44) | def test_dataset_with_nifti_feature(shared_datadir, nifti_file):
  function test_encode_nibabel_image (line 74) | def test_encode_nibabel_image(shared_datadir):
  function test_embed_storage (line 96) | def test_embed_storage(shared_datadir):
  function test_load_zipped_file_locally (line 124) | def test_load_zipped_file_locally(shared_datadir):
  function test_nifti_lazy_loading (line 134) | def test_nifti_lazy_loading(shared_datadir):

FILE: tests/features/test_pdf.py
  function test_pdf_feature_encode_example (line 24) | def test_pdf_feature_encode_example(shared_datadir, build_example):
  function test_dataset_with_pdf_feature (line 38) | def test_dataset_with_pdf_feature(shared_datadir):

FILE: tests/features/test_video.py
  function test_video_feature_encode_example (line 24) | def test_video_feature_encode_example(shared_datadir, build_example):
  function test_dataset_with_video_feature (line 38) | def test_dataset_with_video_feature(shared_datadir):
  function test_dataset_with_video_map_and_formatted (line 76) | def test_dataset_with_video_map_and_formatted(shared_datadir):
  function test_dataset_with_video_feature_map_is_decoded (line 100) | def test_dataset_with_video_feature_map_is_decoded(shared_datadir):
  function jsonl_video_dataset_path (line 130) | def jsonl_video_dataset_path(shared_datadir, tmp_path_factory):
  function test_load_dataset_with_video_feature (line 144) | def test_load_dataset_with_video_feature(streaming, jsonl_video_dataset_...

FILE: tests/fixtures/files.py
  function dataset (line 23) | def dataset():
  function arrow_file (line 49) | def arrow_file(tmp_path_factory, dataset):
  function text_file_content (line 64) | def text_file_content():
  function text_file (line 69) | def text_file(tmp_path_factory):
  function bz2_file (line 78) | def bz2_file(tmp_path_factory):
  function gz_file (line 89) | def gz_file(tmp_path_factory):
  function lz4_file (line 100) | def lz4_file(tmp_path_factory):
  function seven_zip_file (line 112) | def seven_zip_file(tmp_path_factory, text_file):
  function tar_file (line 123) | def tar_file(tmp_path_factory, text_file):
  function xz_file (line 133) | def xz_file(tmp_path_factory):
  function zip_file (line 144) | def zip_file(tmp_path_factory, text_file):
  function zstd_file (line 154) | def zstd_file(tmp_path_factory):
  function xml_file (line 169) | def xml_file(tmp_path_factory):
  function dataset_dict (line 246) | def dataset_dict():
  function arrow_path (line 251) | def arrow_path(tmp_path_factory):
  function sqlite_path (line 259) | def sqlite_path(tmp_path_factory):
  function csv_path (line 271) | def csv_path(tmp_path_factory):
  function csv2_path (line 282) | def csv2_path(tmp_path_factory):
  function bz2_csv_path (line 293) | def bz2_csv_path(csv_path, tmp_path_factory):
  function zip_csv_path (line 306) | def zip_csv_path(csv_path, csv2_path, tmp_path_factory):
  function zip_uppercase_csv_path (line 315) | def zip_uppercase_csv_path(csv_path, csv2_path, tmp_path_factory):
  function zip_csv_with_dir_path (line 324) | def zip_csv_with_dir_path(csv_path, csv2_path, tmp_path_factory):
  function parquet_path (line 333) | def parquet_path(tmp_path_factory):
  function geoparquet_path (line 351) | def geoparquet_path(tmp_path_factory):
  function json_list_of_dicts_path (line 359) | def json_list_of_dicts_path(tmp_path_factory):
  function json_dict_of_lists_path (line 368) | def json_dict_of_lists_path(tmp_path_factory):
  function jsonl_path (line 377) | def jsonl_path(tmp_path_factory):
  function jsonl2_path (line 386) | def jsonl2_path(tmp_path_factory):
  function jsonl_312_path (line 395) | def jsonl_312_path(tmp_path_factory):
  function jsonl_str_path (line 404) | def jsonl_str_path(tmp_path_factory):
  function jsonl_missing_fields_path (line 413) | def jsonl_missing_fields_path(tmp_path_factory):
  function jsonl_mixed_types_path (line 422) | def jsonl_mixed_types_path(tmp_path_factory):
  function text_gz_path (line 431) | def text_gz_path(tmp_path_factory, text_path):
  function jsonl_gz_path (line 442) | def jsonl_gz_path(tmp_path_factory, jsonl_path):
  function zip_jsonl_path (line 453) | def zip_jsonl_path(jsonl_path, jsonl2_path, tmp_path_factory):
  function zip_nested_jsonl_path (line 462) | def zip_nested_jsonl_path(zip_jsonl_path, jsonl_path, jsonl2_path, tmp_p...
  function zip_jsonl_with_dir_path (line 470) | def zip_jsonl_with_dir_path(jsonl_path, jsonl2_path, tmp_path_factory):
  function tar_jsonl_path (line 479) | def tar_jsonl_path(jsonl_path, jsonl2_path, tmp_path_factory):
  function tar_nested_jsonl_path (line 488) | def tar_nested_jsonl_path(tar_jsonl_path, jsonl_path, jsonl2_path, tmp_p...
  function text_path (line 496) | def text_path(tmp_path_factory):
  function text2_path (line 506) | def text2_path(tmp_path_factory):
  function text_dir (line 516) | def text_dir(tmp_path_factory):
  function text_dir_with_unsupported_extension (line 526) | def text_dir_with_unsupported_extension(tmp_path_factory):
  function zip_text_path (line 536) | def zip_text_path(text_path, text2_path, tmp_path_factory):
  function zip_text_with_dir_path (line 545) | def zip_text_with_dir_path(text_path, text2_path, tmp_path_factory):
  function zip_unsupported_ext_path (line 554) | def zip_unsupported_ext_path(text_path, text2_path, tmp_path_factory):
  function text_path_with_unicode_new_lines (line 563) | def text_path_with_unicode_new_lines(tmp_path_factory):
  function image_file (line 572) | def image_file():
  function audio_file (line 577) | def audio_file():
  function audio_file_44100 (line 582) | def audio_file_44100():
  function audio_file_16000 (line 587) | def audio_file_16000():
  function tensor_file (line 592) | def tensor_file(tmp_path_factory):
  function zip_image_path (line 602) | def zip_image_path(image_file, tmp_path_factory):
  function data_dir_with_hidden_files (line 611) | def data_dir_with_hidden_files(tmp_path_factory):

FILE: tests/fixtures/fsspec.py
  class MockFileSystem (line 10) | class MockFileSystem(AbstractFileSystem):
    method __init__ (line 13) | def __init__(self, *args, local_root_dir, **kwargs):
    method mkdir (line 18) | def mkdir(self, path, *args, **kwargs):
    method makedirs (line 22) | def makedirs(self, path, *args, **kwargs):
    method rmdir (line 26) | def rmdir(self, path):
    method ls (line 30) | def ls(self, path, detail=True, *args, **kwargs):
    method info (line 38) | def info(self, path, *args, **kwargs):
    method cp_file (line 44) | def cp_file(self, path1, path2, *args, **kwargs):
    method rm_file (line 49) | def rm_file(self, path, *args, **kwargs):
    method rm (line 53) | def rm(self, path, *args, **kwargs):
    method _open (line 57) | def _open(self, path, *args, **kwargs):
    method created (line 61) | def created(self, path):
    method modified (line 65) | def modified(self, path):
    method _strip_protocol (line 70) | def _strip_protocol(cls, path):
  class TmpDirFileSystem (line 77) | class TmpDirFileSystem(MockFileSystem):
    method __init__ (line 81) | def __init__(self, *args, **kwargs):
    method _strip_protocol (line 86) | def _strip_protocol(cls, path):
  function mock_fsspec (line 94) | def mock_fsspec():
  function mockfs (line 103) | def mockfs(tmp_path_factory, mock_fsspec):
  function tmpfs (line 109) | def tmpfs(tmp_path_factory, mock_fsspec):

FILE: tests/fixtures/hub.py
  function ci_hub_config (line 32) | def ci_hub_config(monkeypatch):
  function set_ci_hub_access_token (line 51) | def set_ci_hub_access_token(ci_hub_config, monkeypatch):
  function _http_ci_user_agent (line 62) | def _http_ci_user_agent(*args, **kwargs):
  function set_hf_ci_headers (line 68) | def set_hf_ci_headers(monkeypatch):
  function hf_api (line 78) | def hf_api():
  function hf_token (line 83) | def hf_token():
  function cleanup_repo (line 88) | def cleanup_repo(hf_api: HfApi):
  function cleanup_bucket (line 96) | def cleanup_bucket(hf_api: HfApi):
  function temporary_repo (line 104) | def temporary_repo(cleanup_repo):
  function temporary_bucket (line 120) | def temporary_bucket(cleanup_bucket):
  function _hf_gated_dataset_repo_txt_data (line 136) | def _hf_gated_dataset_repo_txt_data(hf_api: HfApi, hf_token, text_file_c...
  function hf_gated_dataset_repo_txt_data (line 156) | def hf_gated_dataset_repo_txt_data(_hf_gated_dataset_repo_txt_data, ci_h...
  function hf_private_dataset_repo_txt_data_ (line 161) | def hf_private_dataset_repo_txt_data_(hf_api: HfApi, hf_token, text_file...
  function hf_private_dataset_repo_txt_data (line 180) | def hf_private_dataset_repo_txt_data(hf_private_dataset_repo_txt_data_, ...
  function hf_private_dataset_repo_zipped_txt_data_ (line 185) | def hf_private_dataset_repo_zipped_txt_data_(hf_api: HfApi, hf_token, zi...
  function hf_private_dataset_repo_zipped_txt_data (line 204) | def hf_private_dataset_repo_zipped_txt_data(hf_private_dataset_repo_zipp...
  function hf_private_dataset_repo_zipped_img_data_ (line 209) | def hf_private_dataset_repo_zipped_img_data_(hf_api: HfApi, hf_token, zi...
  function hf_private_dataset_repo_zipped_img_data (line 228) | def hf_private_dataset_repo_zipped_img_data(hf_private_dataset_repo_zipp...

FILE: tests/io/test_csv.py
  function _check_csv_dataset (line 13) | def _check_csv_dataset(dataset, expected_features):
  function test_dataset_from_csv_keep_in_memory (line 23) | def test_dataset_from_csv_keep_in_memory(keep_in_memory, csv_path, tmp_p...
  function test_dataset_from_csv_features (line 41) | def test_dataset_from_csv_features(features, csv_path, tmp_path):
  function test_dataset_from_csv_split (line 54) | def test_dataset_from_csv_split(split, csv_path, tmp_path):
  function test_dataset_from_csv_path_type (line 63) | def test_dataset_from_csv_path_type(path_type, csv_path, tmp_path):
  function _check_csv_datasetdict (line 74) | def _check_csv_datasetdict(dataset_dict, expected_features, splits=("tra...
  function test_csv_datasetdict_reader_keep_in_memory (line 86) | def test_csv_datasetdict_reader_keep_in_memory(keep_in_memory, csv_path,...
  function test_csv_datasetdict_reader_features (line 104) | def test_csv_datasetdict_reader_features(features, csv_path, tmp_path):
  function test_csv_datasetdict_reader_split (line 117) | def test_csv_datasetdict_reader_split(split, csv_path, tmp_path):
  function iter_csv_file (line 129) | def iter_csv_file(csv_path):
  function test_dataset_to_csv (line 134) | def test_dataset_to_csv(csv_path, tmp_path):
  function test_dataset_to_csv_multiproc (line 147) | def test_dataset_to_csv_multiproc(csv_path, tmp_path):
  function test_dataset_to_csv_invalidproc (line 160) | def test_dataset_to_csv_invalidproc(csv_path, tmp_path):
  function test_dataset_to_csv_fsspec (line 168) | def test_dataset_to_csv_fsspec(dataset, mockfs):

FILE: tests/io/test_json.py
  function _check_json_dataset (line 14) | def _check_json_dataset(dataset, expected_features):
  function test_dataset_from_json_keep_in_memory (line 24) | def test_dataset_from_json_keep_in_memory(keep_in_memory, jsonl_path, tm...
  function test_dataset_from_json_features (line 42) | def test_dataset_from_json_features(features, jsonl_path, tmp_path):
  function test_dataset_from_json_with_unsorted_column_names (line 60) | def test_dataset_from_json_with_unsorted_column_names(features, jsonl_31...
  function test_dataset_from_json_with_mismatched_features (line 76) | def test_dataset_from_json_with_mismatched_features(jsonl_312_path, tmp_...
  function test_dataset_from_json_with_missing_fields (line 93) | def test_dataset_from_json_with_missing_fields(jsonl_missing_fields_path...
  function test_dataset_from_json_with_mixed_types (line 108) | def test_dataset_from_json_with_mixed_types(jsonl_mixed_types_path, tmp_...
  function test_dataset_from_json_split (line 121) | def test_dataset_from_json_split(split, jsonl_path, tmp_path):
  function test_dataset_from_json_path_type (line 130) | def test_dataset_from_json_path_type(path_type, jsonl_path, tmp_path):
  function _check_json_datasetdict (line 141) | def _check_json_datasetdict(dataset_dict, expected_features, splits=("tr...
  function test_datasetdict_from_json_keep_in_memory (line 153) | def test_datasetdict_from_json_keep_in_memory(keep_in_memory, jsonl_path...
  function test_datasetdict_from_json_features (line 171) | def test_datasetdict_from_json_features(features, jsonl_path, tmp_path):
  function test_datasetdict_from_json_splits (line 183) | def test_datasetdict_from_json_splits(split, jsonl_path, tmp_path):
  function load_json (line 196) | def load_json(buffer):
  function load_json_lines (line 200) | def load_json_lines(buffer):
  class TestJsonDatasetWriter (line 204) | class TestJsonDatasetWriter:
    method test_dataset_to_json_lines (line 206) | def test_dataset_to_json_lines(self, lines, load_json_function, dataset):
    method test_dataset_to_json_orient (line 226) | def test_dataset_to_json_orient(self, orient, container, keys, len_at,...
    method test_dataset_to_json_lines_multiproc (line 245) | def test_dataset_to_json_lines_multiproc(self, lines, load_json_functi...
    method test_dataset_to_json_orient_multiproc (line 265) | def test_dataset_to_json_orient_multiproc(self, orient, container, key...
    method test_dataset_to_json_orient_invalidproc (line 283) | def test_dataset_to_json_orient_invalidproc(self, dataset):
    method test_dataset_to_json_compression (line 289) | def test_dataset_to_json_compression(self, shared_datadir, tmp_path_fa...
    method test_dataset_to_json_fsspec (line 300) | def test_dataset_to_json_fsspec(self, dataset, mockfs):

FILE: tests/io/test_parquet.py
  function _check_parquet_dataset (line 21) | def _check_parquet_dataset(dataset, expected_features):
  function test_dataset_from_parquet_keep_in_memory (line 31) | def test_dataset_from_parquet_keep_in_memory(keep_in_memory, parquet_pat...
  function test_dataset_from_parquet_features (line 49) | def test_dataset_from_parquet_features(features, parquet_path, tmp_path):
  function test_dataset_from_parquet_split (line 61) | def test_dataset_from_parquet_split(split, parquet_path, tmp_path):
  function test_dataset_from_parquet_path_type (line 70) | def test_dataset_from_parquet_path_type(path_type, parquet_path, tmp_path):
  function test_parquet_read_geoparquet (line 81) | def test_parquet_read_geoparquet(geoparquet_path, tmp_path):
  function test_parquet_read_filters (line 100) | def test_parquet_read_filters(parquet_path, tmp_path):
  function _check_parquet_datasetdict (line 110) | def _check_parquet_datasetdict(dataset_dict, expected_features, splits=(...
  function test_parquet_datasetdict_reader_keep_in_memory (line 122) | def test_parquet_datasetdict_reader_keep_in_memory(keep_in_memory, parqu...
  function test_parquet_datasetdict_reader_features (line 143) | def test_parquet_datasetdict_reader_features(streaming, features, parque...
  function test_parquet_datasetdict_reader_columns (line 160) | def test_parquet_datasetdict_reader_columns(streaming, columns, pass_fea...
  function test_parquet_datasetdict_reader_split (line 189) | def test_parquet_datasetdict_reader_split(split, parquet_path, tmp_path):
  function test_parquet_write (line 202) | def test_parquet_write(dataset, tmp_path):
  function test_parquet_write_uses_content_defined_chunking (line 210) | def test_parquet_write_uses_content_defined_chunking(dataset, tmp_path):
  function test_parquet_writer_persist_cdc_options_as_metadata (line 227) | def test_parquet_writer_persist_cdc_options_as_metadata(dataset, tmp_path):
  function test_dataset_to_parquet_keeps_features (line 261) | def test_dataset_to_parquet_keep

Condensed preview: 331 files, each showing path, character count, and a content snippet (full structured content: 4,137K chars).

[
  {
    "path": ".dvc/.gitignore",
    "chars": 26,
    "preview": "/config.local\n/tmp\n/cache\n"
  },
  {
    "path": ".dvc/config",
    "chars": 0,
    "preview": ""
  },
  {
    "path": ".dvc/plots/confusion.json",
    "chars": 740,
    "preview": "{\n    \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.json\",\n    \"data\": {\n        \"values\": \"<DVC_METRIC_DATA>\"\n"
  },
  {
    "path": ".dvc/plots/default.json",
    "chars": 677,
    "preview": "{\n    \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.json\",\n    \"data\": {\n        \"values\": \"<DVC_METRIC_DATA>\"\n"
  },
  {
    "path": ".dvc/plots/scatter.json",
    "chars": 654,
    "preview": "{\n    \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.json\",\n    \"data\": {\n        \"values\": \"<DVC_METRIC_DATA>\"\n"
  },
  {
    "path": ".dvc/plots/smooth.json",
    "chars": 889,
    "preview": "{\n    \"$schema\": \"https://vega.github.io/schema/vega-lite/v4.json\",\n    \"data\": {\n        \"values\": \"<DVC_METRIC_DATA>\"\n"
  },
  {
    "path": ".dvcignore",
    "chars": 139,
    "preview": "# Add patterns of files dvc should ignore, which could improve\n# the performance. Learn more at\n# https://dvc.org/doc/us"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug-report.yml",
    "chars": 1621,
    "preview": "name: Bug report\ndescription: Create a report to help reproduce and fix the bug\nbody:\n  - type: textarea\n    id: descrip"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "chars": 378,
    "preview": "contact_links:\n  - name: Datasets on the Hugging Face Hub\n    url: https://huggingface.co/datasets\n    about: Please use"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature-request.yml",
    "chars": 984,
    "preview": "name: Feature request\ndescription: Suggest an idea for this project\nlabels: [\"enhancement\"]\nbody:\n  - type: textarea\n   "
  },
  {
    "path": ".github/conda/build.sh",
    "chars": 81,
    "preview": "$PYTHON setup.py install --single-version-externally-managed --record=record.txt\n"
  },
  {
    "path": ".github/conda/meta.yaml",
    "chars": 993,
    "preview": "{% set name = \"datasets\" %}\n\npackage:\n  name: \"{{ name|lower }}\"\n  version: \"{{ DATASETS_VERSION }}\"\n\nsource:\n  path: .."
  },
  {
    "path": ".github/workflows/build_documentation.yml",
    "chars": 438,
    "preview": "name: Build documentation\n\non:\n  push:\n    branches:\n      - main\n      - doc-builder*\n      - v*-release\n      - v*-pat"
  },
  {
    "path": ".github/workflows/build_pr_documentation.yml",
    "chars": 401,
    "preview": "name: Build PR Documentation\n\non:\n  pull_request:\n\nconcurrency:\n  group: ${{ github.workflow }}-${{ github.head_ref || g"
  },
  {
    "path": ".github/workflows/ci.yml",
    "chars": 5520,
    "preview": "name: CI\n\non:\n  pull_request:\n    branches:\n      - main\n  push:\n    branches:\n      - main\n      - ci-*\n\nenv:\n  CI_HEAD"
  },
  {
    "path": ".github/workflows/release-conda.yml",
    "chars": 1069,
    "preview": "name: Release - Conda\n\non:\n  push:\n    tags:\n      - \"[0-9]+.[0-9]+.[0-9]+*\"\n\nenv:\n  ANACONDA_API_TOKEN: ${{ secrets.ANA"
  },
  {
    "path": ".github/workflows/self-assign.yaml",
    "chars": 812,
    "preview": "name: Self-assign\non:\n  issue_comment:\n    types: created\njobs:\n  one:\n    runs-on: ubuntu-latest\n    if: >-\n      (gith"
  },
  {
    "path": ".github/workflows/trufflehog.yml",
    "chars": 338,
    "preview": "on:\n  push:\n\nname: Secret Leaks\n\npermissions:\n  contents: read\n\njobs:\n  trufflehog:\n    runs-on: ubuntu-latest\n    steps"
  },
  {
    "path": ".github/workflows/upload_pr_documentation.yml",
    "chars": 381,
    "preview": "name: Upload PR Documentation\n\non:\n  workflow_run:\n    workflows: [\"Build PR Documentation\"]\n    types:\n      - complete"
  },
  {
    "path": ".gitignore",
    "chars": 620,
    "preview": "# Locked files\n*.lock\n!dvc.lock\n\n# Extracted dummy data\ndatasets/**/dummy_data-zip-extracted/\n\n# Compiled python modules"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 258,
    "preview": "repos:\n  - repo: https://github.com/charliermarsh/ruff-pre-commit # https://github.com/charliermarsh/ruff#usage\n    rev:"
  },
  {
    "path": ".zenodo.json",
    "chars": 3247,
    "preview": "{\n    \"license\": \"Apache-2.0\",\n    \"creators\": [\n        {\n            \"affiliation\": \"Hugging Face\",\n            \"name\""
  },
  {
    "path": "ADD_NEW_DATASET.md",
    "chars": 379,
    "preview": "# How to add one new datasets\n\nAdd datasets directly to the 🤗 Hugging Face Hub!\n\nYou can share your dataset on https://h"
  },
  {
    "path": "AUTHORS",
    "chars": 327,
    "preview": "# This is the list of HuggingFace Datasets authors for copyright purposes.\n#\n# This does not necessarily list everyone w"
  },
  {
    "path": "CITATION.cff",
    "chars": 3886,
    "preview": "cff-version: 1.2.0\nmessage: \"If you use this software, please cite it as below.\"\ntitle: \"huggingface/datasets\"\nauthors:\n"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 5491,
    "preview": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participa"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 6421,
    "preview": "# How to contribute to Datasets?\n[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa"
  },
  {
    "path": "LICENSE",
    "chars": 11358,
    "preview": "\n                                 Apache License\n                           Version 2.0, January 2004\n                  "
  },
  {
    "path": "Makefile",
    "chars": 465,
    "preview": ".PHONY: quality style test\n\ncheck_dirs := tests src benchmarks utils\n\n# Check that source code meets quality standards\n\n"
  },
  {
    "path": "README.md",
    "chars": 11216,
    "preview": "<p align=\"center\">\n  <picture>\n    <source media=\"(prefers-color-scheme: dark)\" srcset=\"https://huggingface.co/datasets/"
  },
  {
    "path": "SECURITY.md",
    "chars": 918,
    "preview": "# Security Policy\n\n## Supported Versions\n<!--\nUse this section to tell people about which versions of your project are\nc"
  },
  {
    "path": "benchmarks/benchmark_array_xd.py",
    "chars": 5040,
    "preview": "import json\nimport os\nimport tempfile\n\nimport datasets\nfrom datasets.arrow_writer import ArrowWriter\nfrom datasets.featu"
  },
  {
    "path": "benchmarks/benchmark_getitem_100B.py",
    "chars": 2100,
    "preview": "import json\nimport os\nfrom dataclasses import dataclass\n\nimport numpy as np\nimport pyarrow as pa\n\nimport datasets\nfrom u"
  },
  {
    "path": "benchmarks/benchmark_indices_mapping.py",
    "chars": 1672,
    "preview": "import json\nimport os\nimport tempfile\n\nimport datasets\nfrom utils import generate_example_dataset, get_duration\n\n\nSPEED_"
  },
  {
    "path": "benchmarks/benchmark_iterating.py",
    "chars": 3771,
    "preview": "import json\nimport os\nimport tempfile\n\nimport datasets\nfrom utils import generate_example_dataset, get_duration\n\n\nSPEED_"
  },
  {
    "path": "benchmarks/benchmark_map_filter.py",
    "chars": 2500,
    "preview": "import json\nimport os\nimport tempfile\n\nimport transformers\n\nimport datasets\nfrom utils import generate_example_dataset, "
  },
  {
    "path": "benchmarks/format.py",
    "chars": 1614,
    "preview": "import json\nimport sys\n\n\ndef format_json_to_md(input_json_file, output_md_file):\n    with open(input_json_file, encoding"
  },
  {
    "path": "benchmarks/results/.gitkeep",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "benchmarks/results/benchmark_array_xd.json",
    "chars": 1459,
    "preview": "{\"write_array2d\": 0.14168284999323077, \"read_unformated after write_array2d\": 0.04353281999647152, \"read_formatted_as_nu"
  },
  {
    "path": "benchmarks/results/benchmark_getitem_100B.json",
    "chars": 214,
    "preview": "{\"num examples\": 100000000000, \"get_first_row\": 0.00019991099999927542, \"get_last_row\": 5.4411000000698095e-05, \"get_bat"
  },
  {
    "path": "benchmarks/results/benchmark_indices_mapping.json",
    "chars": 186,
    "preview": "{\"num examples\": 500000, \"select\": 0.03741131999413483, \"sort\": 0.7371353159978753, \"shuffle\": 0.17655655200360343, \"tra"
  },
  {
    "path": "benchmarks/results/benchmark_iterating.json",
    "chars": 975,
    "preview": "{\"num examples\": 50000, \"read 5000\": 0.2152090710005723, \"read 50000\": 2.077654693988734, \"read_batch 50000 10\": 1.50411"
  },
  {
    "path": "benchmarks/results/benchmark_map_filter.json",
    "chars": 418,
    "preview": "{\"num examples\": 500000, \"map identity\": 10.19139202599763, \"map identity batched\": 0.6804238399927272, \"map no-op batch"
  },
  {
    "path": "benchmarks/utils.py",
    "chars": 2151,
    "preview": "import timeit\n\nimport numpy as np\n\nimport datasets\nfrom datasets.arrow_writer import ArrowWriter\nfrom datasets.features."
  },
  {
    "path": "docs/README.md",
    "chars": 10817,
    "preview": "<!---\nCopyright 2020 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "docs/source/_config.py",
    "chars": 401,
    "preview": "# docstyle-ignore\nINSTALL_CONTENT = \"\"\"\n# Datasets installation\n! pip install datasets transformers\n# To install from so"
  },
  {
    "path": "docs/source/_redirects.yml",
    "chars": 381,
    "preview": "# This first_section was backported from nginx\nloading_datasets: loading\nshare_dataset: share\nquicktour: quickstart\ndata"
  },
  {
    "path": "docs/source/_toctree.yml",
    "chars": 3643,
    "preview": "- sections: \n  - local: index\n    title: 🤗 Datasets\n  - local: quickstart\n    title: Quickstart\n  - local: installation\n"
  },
  {
    "path": "docs/source/about_arrow.md",
    "chars": 2364,
    "preview": "# Datasets 🤝 Arrow\n\n## What is Arrow?\n\n[Arrow](https://arrow.apache.org/) enables large amounts of data to be processed "
  },
  {
    "path": "docs/source/about_cache.mdx",
    "chars": 3461,
    "preview": "# The cache\n\nThe cache is one of the reasons why 🤗 Datasets is so efficient. It stores previously downloaded and process"
  },
  {
    "path": "docs/source/about_dataset_features.mdx",
    "chars": 9788,
    "preview": "# Dataset features\n\n[`Features`] defines the internal structure of a dataset. It is used to specify the underlying seria"
  },
  {
    "path": "docs/source/about_dataset_load.mdx",
    "chars": 7962,
    "preview": "# Build and load\n\nNearly every deep learning workflow begins with loading a dataset, which makes it one of the most impo"
  },
  {
    "path": "docs/source/about_map_batch.mdx",
    "chars": 3348,
    "preview": "# Batch mapping\n\nCombining the utility of [`Dataset.map`] with batch mode is very powerful. It allows you to speed up pr"
  },
  {
    "path": "docs/source/about_mapstyle_vs_iterable.mdx",
    "chars": 12210,
    "preview": "# Differences between Dataset and IterableDataset\n\nThere are two types of dataset objects, a [`Dataset`] and an [`Iterab"
  },
  {
    "path": "docs/source/access.mdx",
    "chars": 7476,
    "preview": "# Know your dataset\n\nThere are two types of dataset objects, a regular [`Dataset`] and then an ✨ [`IterableDataset`] ✨. "
  },
  {
    "path": "docs/source/audio_dataset.mdx",
    "chars": 7112,
    "preview": "# Create an audio dataset\n\nYou can share a dataset with your team or with anyone in the community by creating a dataset "
  },
  {
    "path": "docs/source/audio_load.mdx",
    "chars": 5556,
    "preview": "# Load audio data\n\nYou can load an audio dataset using the [`Audio`] feature that automatically decodes and resamples th"
  },
  {
    "path": "docs/source/audio_process.mdx",
    "chars": 3442,
    "preview": "# Process audio data\n\nThis guide shows specific methods for processing audio datasets. Learn how to:\n\n- Resample the sam"
  },
  {
    "path": "docs/source/cache.mdx",
    "chars": 4688,
    "preview": "# Cache management\n\nWhen you download a dataset from Hugging Face, the data are stored locally on your computer.\nFiles f"
  },
  {
    "path": "docs/source/cli.mdx",
    "chars": 1449,
    "preview": "# Command Line Interface (CLI)\n\n🤗 Datasets provides a command line interface (CLI) with useful shell commands to interac"
  },
  {
    "path": "docs/source/create_dataset.mdx",
    "chars": 6627,
    "preview": "# Create a dataset\n\nSometimes, you may need to create a dataset if you're working with your own data. Creating a dataset"
  },
  {
    "path": "docs/source/dataset_card.mdx",
    "chars": 2715,
    "preview": "# Create a dataset card\n\nEach dataset should have a dataset card to promote responsible usage and inform users of any po"
  },
  {
    "path": "docs/source/depth_estimation.mdx",
    "chars": 8418,
    "preview": "# Depth estimation\n\nDepth estimation datasets are used to train a model to approximate the relative distance of every pi"
  },
  {
    "path": "docs/source/document_dataset.mdx",
    "chars": 5380,
    "preview": "# Create a document dataset\n\nThis guide will show you how to create a document dataset with `PdfFolder` and some metadat"
  },
  {
    "path": "docs/source/document_load.mdx",
    "chars": 7222,
    "preview": "# Load pdf data\n\n> [!WARNING]\n> Pdf support is experimental and is subject to change.\n\nPdf datasets have [`Pdf`] type co"
  },
  {
    "path": "docs/source/faiss_es.mdx",
    "chars": 5653,
    "preview": "# Search index\n\n[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsea"
  },
  {
    "path": "docs/source/filesystems.mdx",
    "chars": 5667,
    "preview": "# Cloud storage\n\n## Hugging Face Datasets\n\nThe Hugging Face Dataset Hub is home to a growing collection of datasets that"
  },
  {
    "path": "docs/source/how_to.md",
    "chars": 1739,
    "preview": "# Overview\n\nThe how-to guides offer a more comprehensive overview of all the tools 🤗 Datasets offers and how to use them"
  },
  {
    "path": "docs/source/image_classification.mdx",
    "chars": 3203,
    "preview": "# Image classification\n\nImage classification datasets are used to train a model to classify an entire image. There are a"
  },
  {
    "path": "docs/source/image_dataset.mdx",
    "chars": 11341,
    "preview": "# Create an image dataset\n\nThere are two methods for creating and sharing an image dataset. This guide will show you how"
  },
  {
    "path": "docs/source/image_load.mdx",
    "chars": 7376,
    "preview": "# Load image data\n\nImage datasets have [`Image`] type columns, which contain PIL objects. \n\n> [!TIP]\n> To work with imag"
  },
  {
    "path": "docs/source/image_process.mdx",
    "chars": 3356,
    "preview": "# Process image data\n\nThis guide shows specific methods for processing image datasets. Learn how to:\n\n- Use [`~Dataset.m"
  },
  {
    "path": "docs/source/index.mdx",
    "chars": 3022,
    "preview": "# Datasets\n\n<img class=\"float-left !m-0 !border-0 !dark:border-0 !shadow-none !max-w-lg w-[150px]\" src=\"https://huggingf"
  },
  {
    "path": "docs/source/installation.md",
    "chars": 3630,
    "preview": "# Installation\n\nBefore you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets"
  },
  {
    "path": "docs/source/load_hub.mdx",
    "chars": 4260,
    "preview": "# Load a dataset from the Hub\n\nFinding high-quality datasets that are reproducible and accessible can be difficult. One "
  },
  {
    "path": "docs/source/loading.mdx",
    "chars": 19642,
    "preview": "# Load\n\nYour data can be stored in various places; they can be on your local machine's disk, in a Github repository, and"
  },
  {
    "path": "docs/source/nifti_dataset.mdx",
    "chars": 4965,
    "preview": "# Create a NIfTI dataset\n\nThis page shows how to create and share a dataset of medical images in NIfTI format (.nii / .n"
  },
  {
    "path": "docs/source/nlp_load.mdx",
    "chars": 1842,
    "preview": "# Load text data\n\nThis guide shows you how to load text datasets. To learn how to load any type of dataset, take a look "
  },
  {
    "path": "docs/source/nlp_process.mdx",
    "chars": 3147,
    "preview": "# Process text data\n\nThis guide shows specific methods for processing text datasets. Learn how to:\n\n- Tokenize a dataset"
  },
  {
    "path": "docs/source/object_detection.mdx",
    "chars": 6491,
    "preview": "# Object detection\n\nObject detection models identify something in an image, and object detection datasets are used for a"
  },
  {
    "path": "docs/source/package_reference/builder_classes.mdx",
    "chars": 763,
    "preview": "# Builder classes\n\n## Builders\n\n🤗 Datasets relies on two main classes during the dataset building process: [`DatasetBuil"
  },
  {
    "path": "docs/source/package_reference/loading_methods.mdx",
    "chars": 2546,
    "preview": "# Loading methods\n\nMethods for listing and loading datasets:\n\n## Datasets\n\n[[autodoc]] datasets.load_dataset\n\n[[autodoc]"
  },
  {
    "path": "docs/source/package_reference/main_classes.mdx",
    "chars": 5091,
    "preview": "# Main classes\n\n\n## DatasetInfo\n\n[[autodoc]] datasets.DatasetInfo\n\n## Dataset\n\nThe base class [`Dataset`] implements a D"
  },
  {
    "path": "docs/source/package_reference/table_classes.mdx",
    "chars": 2325,
    "preview": "# Table Classes\n\nEach `Dataset` object is backed by a PyArrow Table.\nA Table can be loaded from either the disk (memory "
  },
  {
    "path": "docs/source/package_reference/utilities.mdx",
    "chars": 2463,
    "preview": "# Utilities\n\n## Configure logging\n\n🤗 Datasets strives to be transparent and explicit about how it works, but this can be"
  },
  {
    "path": "docs/source/process.mdx",
    "chars": 39531,
    "preview": "# Process\n\n🤗 Datasets provides many tools for modifying the structure and content of a dataset. These tools are importan"
  },
  {
    "path": "docs/source/quickstart.mdx",
    "chars": 18724,
    "preview": "<!--Copyright 2023 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the \"Lice"
  },
  {
    "path": "docs/source/repository_structure.mdx",
    "chars": 7595,
    "preview": "# Structure your repository\n\nTo host and share your dataset, create a dataset repository on the Hugging Face Hub and upl"
  },
  {
    "path": "docs/source/semantic_segmentation.mdx",
    "chars": 6233,
    "preview": "# Semantic segmentation\n\nSemantic segmentation datasets are used to train a model to classify every pixel in an image. T"
  },
  {
    "path": "docs/source/share.mdx",
    "chars": 9073,
    "preview": "# Share a dataset using the CLI\n\nAt Hugging Face, we are on a mission to democratize good Machine Learning and we believ"
  },
  {
    "path": "docs/source/stream.mdx",
    "chars": 29041,
    "preview": "# Stream\n\nDataset streaming lets you work with a dataset without downloading it.\nThe data is streamed as you iterate ove"
  },
  {
    "path": "docs/source/tabular_load.mdx",
    "chars": 6270,
    "preview": "# Load tabular data\n\nA tabular dataset is a generic dataset used to describe any data stored in rows and columns, where "
  },
  {
    "path": "docs/source/troubleshoot.mdx",
    "chars": 5515,
    "preview": "# Troubleshooting\n\nThis guide aims to provide you the tools and knowledge required to navigate some common issues. If th"
  },
  {
    "path": "docs/source/tutorial.md",
    "chars": 1189,
    "preview": "# Overview\n\nWelcome to the 🤗 Datasets tutorials! These beginner-friendly tutorials will guide you through the fundamenta"
  },
  {
    "path": "docs/source/upload_dataset.mdx",
    "chars": 6935,
    "preview": "# Share a dataset to the Hub\n\nThe [Hub](https://huggingface.co/datasets) is home to an extensive collection of community"
  },
  {
    "path": "docs/source/use_dataset.mdx",
    "chars": 11389,
    "preview": "# Preprocess\n\nIn addition to loading datasets, 🤗 Datasets other main goal is to offer a diverse set of preprocessing fun"
  },
  {
    "path": "docs/source/use_with_jax.mdx",
    "chars": 7804,
    "preview": "# Use with JAX\n\nThis document is a quick introduction to using `datasets` with JAX, with a particular focus on how to ge"
  },
  {
    "path": "docs/source/use_with_numpy.mdx",
    "chars": 5678,
    "preview": "# Use with NumPy\n\nThis document is a quick introduction to using `datasets` with NumPy, with a particular focus on how t"
  },
  {
    "path": "docs/source/use_with_pandas.mdx",
    "chars": 2505,
    "preview": "# Use with Pandas\n\nThis document is a quick introduction to using `datasets` with Pandas, with a particular focus on how"
  },
  {
    "path": "docs/source/use_with_polars.mdx",
    "chars": 4223,
    "preview": "# Use with Polars\n\nThis document is a quick introduction to using `datasets` with Polars, with a particular focus on how"
  },
  {
    "path": "docs/source/use_with_pyarrow.mdx",
    "chars": 3483,
    "preview": "# Use with PyArrow\n\nThis document is a quick introduction to using `datasets` with PyArrow, with a particular focus on h"
  },
  {
    "path": "docs/source/use_with_pytorch.mdx",
    "chars": 9724,
    "preview": "# Use with PyTorch\n\nThis document is a quick introduction to using `datasets` with PyTorch, with a particular focus on h"
  },
  {
    "path": "docs/source/use_with_spark.mdx",
    "chars": 2903,
    "preview": "# Use with Spark\n\nThis document is a quick introduction to using 🤗 Datasets with Spark, with a particular focus on how t"
  },
  {
    "path": "docs/source/use_with_tensorflow.mdx",
    "chars": 11520,
    "preview": "# Using Datasets with TensorFlow\n\nThis document is a quick introduction to using `datasets` with TensorFlow, with a part"
  },
  {
    "path": "docs/source/video_dataset.mdx",
    "chars": 11604,
    "preview": "# Create a video dataset\n\nThis guide will show you how to create a video dataset with `VideoFolder` and some metadata. T"
  },
  {
    "path": "docs/source/video_load.mdx",
    "chars": 8764,
    "preview": "# Load video data\n\n> [!WARNING]\n> Video support is experimental and is subject to change.\n\nVideo datasets have [`Video`]"
  },
  {
    "path": "notebooks/README.md",
    "chars": 1812,
    "preview": "<!---\nCopyright 2023 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "pyproject.toml",
    "chars": 679,
    "preview": "[tool.ruff]\nline-length = 119\n\n[tool.ruff.lint]\n# Ignored rules:\n#   \"E501\" -> line length violation\n#   \"F821\" -> undef"
  },
  {
    "path": "setup.py",
    "chars": 10048,
    "preview": "# Lint as: python3\n\"\"\"HuggingFace/Datasets is an open library of datasets.\n\nNote:\n\n   VERSION needs to be formatted foll"
  },
  {
    "path": "src/datasets/__init__.py",
    "chars": 1635,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors and the TensorFlow Datasets Authors.\n#\n# Licensed under the Apache Lic"
  },
  {
    "path": "src/datasets/arrow_dataset.py",
    "chars": 339318,
    "preview": "# Copyright 2020 The HuggingFace Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "src/datasets/arrow_reader.py",
    "chars": 25215,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors and the TensorFlow Datasets Authors.\n#\n# Licensed under the Apache Lic"
  },
  {
    "path": "src/datasets/arrow_writer.py",
    "chars": 38027,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors and the TensorFlow Datasets Authors.\n#\n# Licensed under the Apache Lic"
  },
  {
    "path": "src/datasets/builder.py",
    "chars": 91564,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors and the TensorFlow Datasets Authors.\n#\n# Licensed under the Apache Lic"
  },
  {
    "path": "src/datasets/combine.py",
    "chars": 11760,
    "preview": "from typing import Optional, TypeVar\n\nfrom .arrow_dataset import Dataset, _concatenate_map_style_datasets, _interleave_m"
  },
  {
    "path": "src/datasets/commands/__init__.py",
    "chars": 312,
    "preview": "from abc import ABC, abstractmethod\nfrom argparse import ArgumentParser\n\n\nclass BaseDatasetsCLICommand(ABC):\n    @static"
  },
  {
    "path": "src/datasets/commands/datasets_cli.py",
    "chars": 1175,
    "preview": "#!/usr/bin/env python\nfrom argparse import ArgumentParser\n\nfrom datasets.commands.delete_from_hub import DeleteFromHubCo"
  },
  {
    "path": "src/datasets/commands/delete_from_hub.py",
    "chars": 1396,
    "preview": "from argparse import ArgumentParser\nfrom typing import Optional\n\nfrom datasets.commands import BaseDatasetsCLICommand\nfr"
  },
  {
    "path": "src/datasets/commands/env.py",
    "chars": 1239,
    "preview": "import platform\nfrom argparse import ArgumentParser\n\nimport fsspec\nimport huggingface_hub\nimport pandas\nimport pyarrow\n\n"
  },
  {
    "path": "src/datasets/commands/test.py",
    "chars": 7820,
    "preview": "import logging\nimport os\nfrom argparse import ArgumentParser\nfrom collections.abc import Generator\nfrom shutil import rm"
  },
  {
    "path": "src/datasets/config.py",
    "chars": 10358,
    "preview": "import importlib\nimport importlib.metadata\nimport logging\nimport os\nimport platform\nfrom pathlib import Path\nfrom typing"
  },
  {
    "path": "src/datasets/data_files.py",
    "chars": 32552,
    "preview": "import os\nimport re\nfrom functools import partial\nfrom glob import has_magic\nfrom pathlib import Path, PurePath\nfrom typ"
  },
  {
    "path": "src/datasets/dataset_dict.py",
    "chars": 128001,
    "preview": "import contextlib\nimport copy\nimport itertools\nimport json\nimport math\nimport posixpath\nimport random\nimport re\nimport t"
  },
  {
    "path": "src/datasets/distributed.py",
    "chars": 1815,
    "preview": "from typing import TypeVar\n\nfrom .arrow_dataset import Dataset, _split_by_node_map_style_dataset\nfrom .iterable_dataset "
  },
  {
    "path": "src/datasets/download/__init__.py",
    "chars": 281,
    "preview": "__all__ = [\n    \"DownloadConfig\",\n    \"DownloadManager\",\n    \"DownloadMode\",\n    \"StreamingDownloadManager\",\n]\n\nfrom .do"
  },
  {
    "path": "src/datasets/download/download_config.py",
    "chars": 3796,
    "preview": "import copy\nfrom dataclasses import dataclass, field\nfrom pathlib import Path\nfrom typing import Any, Optional, Union\n\nf"
  },
  {
    "path": "src/datasets/download/download_manager.py",
    "chars": 12778,
    "preview": "# Copyright 2020 The TensorFlow Datasets Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# "
  },
  {
    "path": "src/datasets/download/streaming_download_manager.py",
    "chars": 7553,
    "preview": "import io\nimport os\nfrom collections.abc import Iterable\nfrom typing import Optional, Union\n\nfrom ..utils.file_utils imp"
  },
  {
    "path": "src/datasets/exceptions.py",
    "chars": 4185,
    "preview": "# SPDX-License-Identifier: Apache-2.0\n# Copyright 2023 The HuggingFace Authors.\nfrom typing import Any, Optional, Union\n"
  },
  {
    "path": "src/datasets/features/__init__.py",
    "chars": 603,
    "preview": "__all__ = [\n    \"Audio\",\n    \"Array2D\",\n    \"Array3D\",\n    \"Array4D\",\n    \"Array5D\",\n    \"ClassLabel\",\n    \"Features\",\n "
  },
  {
    "path": "src/datasets/features/_torchcodec.py",
    "chars": 627,
    "preview": "import numpy as np\nfrom torchcodec.decoders import AudioDecoder as _AudioDecoder\n\n\nclass AudioDecoder(_AudioDecoder):\n  "
  },
  {
    "path": "src/datasets/features/audio.py",
    "chars": 15044,
    "preview": "import os\nfrom dataclasses import dataclass, field\nfrom io import BytesIO\nfrom pathlib import Path\nfrom typing import TY"
  },
  {
    "path": "src/datasets/features/features.py",
    "chars": 98124,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors and the TensorFlow Datasets Authors.\n#\n# Licensed under the Apache Lic"
  },
  {
    "path": "src/datasets/features/image.py",
    "chars": 17337,
    "preview": "import os\nimport sys\nimport warnings\nfrom dataclasses import dataclass, field\nfrom io import BytesIO\nfrom pathlib import"
  },
  {
    "path": "src/datasets/features/nifti.py",
    "chars": 13247,
    "preview": "import os\nfrom dataclasses import dataclass, field\nfrom pathlib import Path\nfrom typing import TYPE_CHECKING, Any, Class"
  },
  {
    "path": "src/datasets/features/pdf.py",
    "chars": 11141,
    "preview": "import os\nfrom dataclasses import dataclass, field\nfrom io import BytesIO\nfrom pathlib import Path\nfrom typing import TY"
  },
  {
    "path": "src/datasets/features/translation.py",
    "chars": 4490,
    "preview": "from dataclasses import dataclass, field\nfrom typing import TYPE_CHECKING, Any, ClassVar, Optional, Union\n\nimport pyarro"
  },
  {
    "path": "src/datasets/features/video.py",
    "chars": 16187,
    "preview": "import os\nfrom dataclasses import dataclass, field\nfrom pathlib import Path\nfrom typing import TYPE_CHECKING, Any, Class"
  },
  {
    "path": "src/datasets/filesystems/__init__.py",
    "chars": 1523,
    "preview": "import importlib\nimport shutil\nimport warnings\nfrom typing import List\n\nimport fsspec\nimport fsspec.asyn\nfrom fsspec.imp"
  },
  {
    "path": "src/datasets/filesystems/compression.py",
    "chars": 4528,
    "preview": "import os\nfrom functools import partial\nfrom typing import Optional\n\nimport fsspec\nfrom fsspec.archive import AbstractAr"
  },
  {
    "path": "src/datasets/fingerprint.py",
    "chars": 21922,
    "preview": "import inspect\nimport os\nimport random\nimport shutil\nimport tempfile\nimport weakref\nfrom functools import wraps\nfrom pat"
  },
  {
    "path": "src/datasets/formatting/__init__.py",
    "chars": 5412,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors and the TensorFlow Datasets Authors.\n#\n# Licensed under the Apache Lic"
  },
  {
    "path": "src/datasets/formatting/formatting.py",
    "chars": 26626,
    "preview": "# Copyright 2020 The HuggingFace Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "src/datasets/formatting/jax_formatter.py",
    "chars": 7412,
    "preview": "# Copyright 2021 The HuggingFace Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "src/datasets/formatting/np_formatter.py",
    "chars": 5102,
    "preview": "# Copyright 2020 The HuggingFace Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "src/datasets/formatting/polars_formatter.py",
    "chars": 4744,
    "preview": "# Copyright 2020 The HuggingFace Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "src/datasets/formatting/tf_formatter.py",
    "chars": 5236,
    "preview": "# Copyright 2020 The HuggingFace Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "src/datasets/formatting/torch_formatter.py",
    "chars": 5311,
    "preview": "# Copyright 2020 The HuggingFace Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "src/datasets/hub.py",
    "chars": 4822,
    "preview": "from itertools import chain\nfrom typing import Optional, Union\n\nfrom huggingface_hub import (\n    CommitInfo,\n    Commit"
  },
  {
    "path": "src/datasets/info.py",
    "chars": 19647,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors and the TensorFlow Datasets Authors.\n#\n# Licensed under the Apache Lic"
  },
  {
    "path": "src/datasets/inspect.py",
    "chars": 15498,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n#"
  },
  {
    "path": "src/datasets/io/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/io/abc.py",
    "chars": 1672,
    "preview": "from abc import ABC, abstractmethod\nfrom typing import Optional, Union\n\nfrom .. import Dataset, DatasetDict, Features, I"
  },
  {
    "path": "src/datasets/io/csv.py",
    "chars": 5265,
    "preview": "import multiprocessing\nimport os\nfrom typing import BinaryIO, Optional, Union\n\nimport fsspec\n\nfrom .. import Dataset, Fe"
  },
  {
    "path": "src/datasets/io/generator.py",
    "chars": 2165,
    "preview": "from typing import Callable, Optional\n\nfrom .. import Features, NamedSplit, Split\nfrom ..packaged_modules.generator.gene"
  },
  {
    "path": "src/datasets/io/json.py",
    "chars": 6697,
    "preview": "import multiprocessing\nimport os\nfrom typing import BinaryIO, Optional, Union\n\nimport fsspec\n\nfrom .. import Dataset, Fe"
  },
  {
    "path": "src/datasets/io/parquet.py",
    "chars": 5935,
    "preview": "import json\nimport os\nfrom typing import BinaryIO, Optional, Union\n\nimport fsspec\nimport pyarrow.parquet as pq\n\nfrom .. "
  },
  {
    "path": "src/datasets/io/spark.py",
    "chars": 1797,
    "preview": "from typing import Optional\n\nimport pyspark\n\nfrom .. import Features, NamedSplit\nfrom ..download import DownloadMode\nfro"
  },
  {
    "path": "src/datasets/io/sql.py",
    "chars": 4234,
    "preview": "import multiprocessing\nfrom typing import TYPE_CHECKING, Optional, Union\n\nfrom .. import Dataset, Features, config\nfrom "
  },
  {
    "path": "src/datasets/io/text.py",
    "chars": 1975,
    "preview": "from typing import Optional\n\nfrom .. import Features, NamedSplit\nfrom ..packaged_modules.text.text import Text\nfrom ..ut"
  },
  {
    "path": "src/datasets/iterable_dataset.py",
    "chars": 226399,
    "preview": "import asyncio\nimport contextlib\nimport copy\nimport inspect\nimport itertools\nimport multiprocessing.pool\nimport re\nimpor"
  },
  {
    "path": "src/datasets/load.py",
    "chars": 79625,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors and the TensorFlow Datasets Authors.\n#\n# Licensed under the Apache Lic"
  },
  {
    "path": "src/datasets/naming.py",
    "chars": 3028,
    "preview": "# Copyright 2020 The HuggingFace Datasets Authors and the TensorFlow Datasets Authors.\n#\n# Licensed under the Apache Lic"
  },
  {
    "path": "src/datasets/packaged_modules/__init__.py",
    "chars": 6578,
    "preview": "import inspect\nimport re\nfrom typing import Dict, List, Tuple\n\nfrom huggingface_hub.utils import insecure_hashlib\n\nfrom "
  },
  {
    "path": "src/datasets/packaged_modules/arrow/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/arrow/arrow.py",
    "chars": 3213,
    "preview": "from dataclasses import dataclass\nfrom typing import Optional\n\nimport pyarrow as pa\n\nimport datasets\nfrom datasets.build"
  },
  {
    "path": "src/datasets/packaged_modules/audiofolder/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/audiofolder/audiofolder.py",
    "chars": 1744,
    "preview": "import datasets\n\nfrom ..folder_based_builder import folder_based_builder\n\n\nlogger = datasets.utils.logging.get_logger(__"
  },
  {
    "path": "src/datasets/packaged_modules/cache/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/cache/cache.py",
    "chars": 8293,
    "preview": "import glob\nimport json\nimport os\nimport shutil\nimport time\nfrom pathlib import Path\nfrom typing import Optional, Union\n"
  },
  {
    "path": "src/datasets/packaged_modules/csv/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/csv/csv.py",
    "chars": 8977,
    "preview": "from dataclasses import dataclass\nfrom typing import Any, Callable, Optional, Union\n\nimport pandas as pd\nimport pyarrow "
  },
  {
    "path": "src/datasets/packaged_modules/eval/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/eval/eval.py",
    "chars": 3226,
    "preview": "import json\nimport os\nfrom itertools import islice\nfrom typing import Iterable\n\nimport pyarrow as pa\n\nimport datasets\nfr"
  },
  {
    "path": "src/datasets/packaged_modules/folder_based_builder/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/folder_based_builder/folder_based_builder.py",
    "chars": 22886,
    "preview": "import collections\nimport io\nimport itertools\nimport os\nfrom dataclasses import dataclass\nfrom typing import Any, Callab"
  },
  {
    "path": "src/datasets/packaged_modules/generator/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/generator/generator.py",
    "chars": 1380,
    "preview": "from dataclasses import dataclass\nfrom typing import Callable, Optional\n\nimport datasets\nfrom datasets.builder import Ke"
  },
  {
    "path": "src/datasets/packaged_modules/hdf5/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/hdf5/hdf5.py",
    "chars": 12268,
    "preview": "from dataclasses import dataclass, field\nfrom typing import TYPE_CHECKING, Optional\n\nimport numpy as np\nimport pyarrow a"
  },
  {
    "path": "src/datasets/packaged_modules/imagefolder/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/imagefolder/imagefolder.py",
    "chars": 1956,
    "preview": "import datasets\n\nfrom ..folder_based_builder import folder_based_builder\n\n\nlogger = datasets.utils.logging.get_logger(__"
  },
  {
    "path": "src/datasets/packaged_modules/json/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/json/json.py",
    "chars": 14317,
    "preview": "import io\nfrom dataclasses import dataclass\nfrom typing import Literal, Optional\n\nimport pandas as pd\nimport pyarrow as "
  },
  {
    "path": "src/datasets/packaged_modules/lance/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/lance/lance.py",
    "chars": 9833,
    "preview": "import re\nfrom dataclasses import dataclass\nfrom pathlib import Path\nfrom typing import TYPE_CHECKING, Dict, List, Optio"
  },
  {
    "path": "src/datasets/packaged_modules/niftifolder/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/niftifolder/niftifolder.py",
    "chars": 575,
    "preview": "import datasets\n\nfrom ..folder_based_builder import folder_based_builder\n\n\nlogger = datasets.utils.logging.get_logger(__"
  },
  {
    "path": "src/datasets/packaged_modules/pandas/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/pandas/pandas.py",
    "chars": 1954,
    "preview": "import warnings\nfrom dataclasses import dataclass\nfrom typing import Optional\n\nimport pandas as pd\nimport pyarrow as pa\n"
  },
  {
    "path": "src/datasets/packaged_modules/parquet/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/parquet/parquet.py",
    "chars": 10175,
    "preview": "from dataclasses import dataclass\nfrom typing import Literal, Optional, Union\n\nimport pyarrow as pa\nimport pyarrow.datas"
  },
  {
    "path": "src/datasets/packaged_modules/pdffolder/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/pdffolder/pdffolder.py",
    "chars": 565,
    "preview": "import datasets\n\nfrom ..folder_based_builder import folder_based_builder\n\n\nlogger = datasets.utils.logging.get_logger(__"
  },
  {
    "path": "src/datasets/packaged_modules/spark/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/spark/spark.py",
    "chars": 14685,
    "preview": "import os\nimport posixpath\nimport uuid\nfrom collections.abc import Iterable\nfrom dataclasses import dataclass\nfrom itert"
  },
  {
    "path": "src/datasets/packaged_modules/sql/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/sql/sql.py",
    "chars": 4536,
    "preview": "import sys\nfrom dataclasses import dataclass\nfrom typing import TYPE_CHECKING, Optional, Union\n\nimport pandas as pd\nimpo"
  },
  {
    "path": "src/datasets/packaged_modules/text/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "src/datasets/packaged_modules/text/text.py",
    "chars": 6862,
    "preview": "from dataclasses import dataclass\nfrom io import StringIO\nfrom typing import Literal, Optional\n\nimport pyarrow as pa\n\nim"
  }
]

// ... and 131 more files (download for full content)

About this extraction

This page contains the full source code of the huggingface/datasets GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 331 files (3.8 MB), approximately 998.6k tokens, and a symbol index with 3742 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
