Full Code of p-lambda/dsir for AI

main 4802a1c71cdb cached

51 files

939.9 KB

243.3k tokens

126 symbols

1 requests

Download .txt

Showing preview only (972K chars total). Download the full file or copy to clipboard to get everything.

Repository: p-lambda/dsir
Branch: main
Commit: 4802a1c71cdb
Files: 51
Total size: 939.9 KB

Directory structure:
gitextract_g6u__ttu/

├── .gitignore
├── LICENSE
├── README.md
├── data_selection/
│   ├── README.md
│   ├── __init__.py
│   ├── base.py
│   ├── hashed_ngram_dsir.py
│   └── utils.py
├── experimental/
│   ├── README.md
│   ├── config.sh
│   ├── data_selection/
│   │   ├── dsir_general/
│   │   │   ├── data_selection.py
│   │   │   ├── run_data_selection.py
│   │   │   └── utils.py
│   │   ├── dsir_pipeline.py
│   │   ├── heuristic_cls_pipeline.py
│   │   ├── run_cmds.sh
│   │   ├── run_dsir.sh
│   │   ├── run_dsir_helper.sh
│   │   ├── run_heuristic_cls.sh
│   │   └── run_heuristic_cls_helper.sh
│   ├── glue_eval/
│   │   ├── read_glue_results.py
│   │   ├── run_eval_exps.sh
│   │   ├── run_glue.py
│   │   ├── run_glue_dist.sh
│   │   └── run_glue_for_seed_task.sh
│   ├── preprocessing/
│   │   ├── quality_scores/
│   │   │   ├── compute_quality_stats.py
│   │   │   ├── merge_quality_scores.py
│   │   │   ├── run_merge_quality_scores.sh
│   │   │   ├── run_quality_stats.sh
│   │   │   └── run_slurm_quality_stats.sh
│   │   ├── reformat_and_chunk_data.py
│   │   ├── run.sh
│   │   └── run_slurm.sh
│   ├── requirements.txt
│   └── train/
│       ├── accelerate_config.yaml
│       ├── collator.py
│       ├── model.py
│       ├── preprocess_general.sh
│       ├── pretrain_general.sh
│       ├── requirements.txt
│       ├── run_pipeline.py
│       ├── run_pretrain_pipeline_general.sh
│       ├── run_slurm.sh
│       └── trainer.py
├── pyproject.toml
├── setup.py
└── tests/
    ├── test_hashed_ngram.py
    ├── test_utils.py
    ├── toy_pile_data.jsonl
    ├── toy_target_data.jsonl
    └── toy_target_data_2.jsonl

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

wandb
train/logs
train/wandb
train/__pycache__
glue_eval/wandb
logs/


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2023 Sang Michael Xie

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Data Selection for Language Models via Importance Resampling (DSIR)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![arXiv](https://img.shields.io/badge/arXiv-2305.10429-00ff00.svg)](https://arxiv.org/abs/2302.03169)
[![PyPI version](https://badge.fury.io/py/data-selection.svg)](https://badge.fury.io/py/data-selection)

This repository contains the [DSIR](https://arxiv.org/abs/2302.03169) data selection tool for selecting relevant language model training data from any raw data source given a target dataset, as well as pre-filtered datasets and some pretrained models.

DSIR is built for:
- fast, large-scale (trillion-token scale) data selection from large raw text datasets (Pile, RefinedWeb, RedPajama, ...). There is almost no overhead to selecting more examples (unlike retrieval), other than the time it takes to write the extra examples to disk.
- selecting data that is distributed like a given target dataset (domain-specific data, Wikipedia, ...). Relevance and diversity are balanced automatically by matching the distribution of the target dataset on a feature space (e.g., n-gram frequencies).

Compute needed:
- 1 CPU node
- a decent amount of RAM (at least 64GB for most large datasets - need to hold a few floats per example in memory)
- a high number of cores. The data selection speed scales linearly with the CPU cores.

![DSIR figure](fig1.png)

Code related to the DSIR paper's experiments are in the `experimental/` directory.

## Quickstart

Install with pip:
```
pip install data-selection
```

Install from source by cloning this repo and installing via pip:
```
git clone git@github.com:/p-lambda/dsir
pip install ./dsir
```

To select data, simply initialize a `HashedNgramDSIR` object and call the following functions:
```python
from data_selection import HashedNgramDSIR

raw_datasets = [<list of paths>]
target_datasets = [<list of paths>]

dsir = HashedNgramDSIR(raw_datasets, target_datasets, cache_dir='/path/to/dsir_cache')
dsir.fit_importance_estimator(num_tokens_to_fit='auto')
dsir.compute_importance_weights()
dsir.resample(out_dir='resampled', num_to_sample=10000000, cache_dir='/path/to/resampled_cache')
```
Running this would write 10M documents in `jsonl` files inside an output directory named `resampled`. The files will first be written to `cache_dir` and moved to `out_dir` upon completion (set `cache_dir` to `None` to skip this step). For best performance, use uncompressed `jsonl` files stored on local file storage for all data paths and use as many CPU cores as possible, which allows each file to be virtually sharded across multiple cores. Custom functions for reading the data paths and extracting the text field from each example can be provided via the
`{raw,target}_load_dataset_fn` and `{raw,target}_parse_example_fn` arguments to the constructor. The number of tokens to use for fitting the importance weight estimator can be tuned with the `num_tokens_to_fit` argument (set to `all` to fit on full dataset). Top-k retrieval instead of sampling without replacement (the default) can be done by specifying `top_k=True` to the `resample` method.

(Note: for results similar to the paper, first preprocess the documents by breaking them into equal-word-length chunks, and use `tokenizer="word_tokenize"` in the `HashedNgramDSIR` constructor.)
 
The `dsir` intermediate results (after `fit_importance_estimator` and `compute_importance_weights`) can be saved and loaded for later use, for example to resample 100M documents instead:
```python
dsir.save('/path/to/dsir_params.pkl')

# later on
dsir.load('/path/to/dsir_params.pkl')
dsir.resample(out_dir='/path/to/out_dir', num_to_sample=100000000, cache_dir='/path/to/resampled_cache')
```
The `save` method can be called at any time to save partial results.

See [Usage documentation](data_selection/README.md) for full details.


## Speed benchmark on The Pile
Using 1 CPU node with 96GB RAM and 96 cores, we can select data from the full (decompressed) Pile dataset in less than *4.5 hours*.
The Pile dataset was first decompressed and placed onto the node's local file storage. The breakdown of timings for each step are:
- *Fit importance estimator* (with `num_tokens_to_fit="auto"`): 59.28 seconds
- *Compute importance weights*: 4.36 hours
- *Resample 10M documents* (with `cache_dir=None` and `out_dir` is a local storage location): 353.68 seconds
- *Total*: 4.47 hours

Subsequent resampling with the same target data is very cheap, and the runtime does not scale with the number of documents to select (unlike retrieval). Resampling 100M documents takes the same amount of time (less than *6 minutes*) as resampling 10M documents:
- *Resample 10M documents*: 353.68 seconds
- *Resample 100M documents*: 352.69 seconds

## Examples

To select data from the Pile:
```python
from data_selection import HashedNgramDSIR

# 2-digit integers up to 29
subsets = [str(i).zfill(2) for i in range(0, 30)]

raw_datasets = [f'/path/to/pile/{subset}.jsonl' for subset in subsets]
target_datasets = ['/path/to/target.jsonl']

dsir = HashedNgramDSIR(
        raw_datasets=raw_datasets,
        target_datasets=target_datasets,
        cache_dir='/path/to/dsir_cache')
dsir.fit_importance_estimator(num_tokens_to_fit='auto')
dsir.compute_importance_weights()
dsir.resample(out_dir='/path/to/out_dir', num_to_sample=10000000, cache_dir='/path/to/resample_cache')
```

HuggingFace datasets can also be used in either `raw_datasets` or `target_datasets` (note: streaming a large raw dataset directly will be very slow - we recommend this more for target datasets):
```python
from data_selection import HashedNgramDSIR
from datasets import load_dataset

subsets = [str(i).zfill(2) for i in range(0, 30)]

raw_datasets = [f'/path/to/pile/{subset}.jsonl' for subset in subsets]
target_datasets = ['codeparrot/self-instruct-starcoder', 'SetFit/mnli']

def target_load_dataset_fn(dataset):
    if dataset == 'codeparrot/self-instruct-starcoder':
        ds = load_dataset(dataset, streaming=True, split='raw')
    else:
        ds = load_dataset(dataset, streaming=True, split='train').take(10000)
    return ds

def target_parse_example_fn(ex):
    if 'output' in ex:
        return ex['output']
    else:
        return ex['text1'] + ' ' + ex['text2']

dsir = HashedNgramDSIR(
        raw_datasets=raw_datasets,
        target_datasets=target_datasets,
        cache_dir='/path/to/dsir_cache',
        target_parse_example_fn=target_parse_example_fn,
        target_load_dataset_fn=target_load_dataset_fn,
        separate_targets=True)
dsir.fit_importance_estimator(num_tokens_to_fit='auto')
dsir.compute_importance_weights()
dsir.resample(out_dir='/path/to/out_dir', num_to_sample=10000000, cache_dir='/path/to/resample_cache')
```
For use-cases where the target datasets are quite different (here, a mix of code and natural language), we recommend passing in `separate_targets` into the constructor. `separate_targets` controls whether to select data separately for each target and then join them. For example, when including two target datasets, one natural language dataset and one code, the most heavily upweighted data when `separate_targets=False` may skew towards documents with a mix of natural language and code, such as StackExchange. When `separate_targets=True`, two separate DSIR runs will occur in parallel, selecting a mixture of documents from each target according to `target_proportions`. When `target_proportions` is unspecified, the number of documents to select for each target is weighted according to the token sizes of each target dataset.


## Citation Information
Paper: <https://arxiv.org/abs/2302.03169>
```
@article{xie2023data,
  author = {Sang Michael Xie and Shibani Santurkar and Tengyu Ma and Percy Liang},
  journal = {Advances in Neural Information Processing Systems (NeurIPS)},
  title = {Data Selection for Language Models via Importance Resampling},
  year = {2023},
}
```



================================================
FILE: data_selection/README.md
================================================
# Usage
In general, DSIR aims to select data from the raw dataset that matches the feature distribution of the target data. Thus, the choice of feature space and importance estimator on this feature space can change the behavior of DSIR for different use-cases. Extending the base DSIR class in `base.py` is simple - follow the example in `hashed_ngram_dsir.py`.

### class data_selection.DSIR
Base class for DSIR.

- `raw_datasets`: List of data paths
- `target_datasets`: List of data paths
- `cache_dir`: Directory to store cached intermediates (log importance weights)
- `raw_load_dataset_fn`: Function to load raw dataset from path
- `raw_parse_example_fn`: a function that takes in an example dict and outputs a string
- `target_load_dataset_fn`: Function to load target dataset from path
- `target_parse_example_fn`: a function that takes in an example dict and outputs a string
- `num_proc`: num cpus to parallelize over. If None, use all available cpus.
- `separate_targets`: whether to select data separately for each target and then join them. For example, when including two target datasets, one natural language dataset and one code, the most heavily upweighted data when `separate_targets=False` may skew towards documents with a mix of natural language and code, such as StackExchange. When `separate_targets=True`, two separate DSIR runs will occur in parallel, selecting a mixture of documents using each target
- `target_proportions`: weighting across multiple targets if separate_targets=True. The proportions are on the document level. Set to None to weight by the size (in tokens) of each target dataset.

#### compute_importance_weights(self) -> None:
Compute importance weights on raw dataset with self.importance_estimator.
Saves importance weights in self.log_importance_weights_dir / {index}.npy in chunks indexed by index.
Also saves other per-example metadata (numpy arrays) in self.perexample_metadata_dir / {index}.npy."""

#### resample(self, out_dir: str, num_to_sample: int, cache_dir: str = None, top_k: bool = False) -> None:
Resample raw dataset according to importance weights.

- `out_dir`: path to save resampled dataset
- `num_to_sample`: number of samples to resample
- `cache_dir`: path to cache resampled dataset
- `top_k`: if True, get top_k examples by importance weight instead of sampling


### class data_selection.HashedNgramDSIR
The main subclass we provide is DSIR with hashed n-gram features. This choice of feature space allows for efficient data selection over large datasets. 

- `raw_datasets`: List of data paths
- `target_datasets`: List of data paths
- `cache_dir`: place to store cached log_importance_weights
- `load_dataset_fn`: Function to load a dataset from a path. Defaults to default_load_dataset_fn.
- `parse_example_fn`: Function that takes in an example dict and returns a string. Defaults to returning the "text" field of the example.
- `num_proc`: number of processes to use for parallelization. Defaults to number of cores.
- `ngrams`: N in N-grams. 2 means both unigram and bigrams.
- `num_buckets`: number of buckets to hash ngrams into.
- `tokenizer`: word_tokenize or wordpunct
- `min_example_length`: minimum number of tokens in an example to be considered.
- `target_laplace_smoothing`: Smooth the target hash ngram distribution with this Laplace smoothing parameter, which is a pseudo-count. This could be useful for small target datasets.
- `separate_targets`: whether to select data separately for each target and then join them. For example, when including two target datasets, one natural language dataset and one code, the most heavily upweighted data when `separate_targets=False` may skew towards documents with a mix of natural language and code, such as StackExchange. When `separate_targets=True`, two separate DSIR runs will occur in parallel, selecting a mixture of documents using each target according to `target_proportions`.
- `target_proportions`: weighting across multiple targets if separate_targets=True. The proportions are on the document level. Set to None to weight by the size in tokens of each target dataset

#### fit_importance_estimator(self, num_tokens_to_fit: Union[str, int] = 'auto') -> None:
Fit the importance estimator.

- `num_tokens_to_fit`: number of tokens to fit the raw dataset importance estimator on. Set to "all" to fit on all tokens, and "auto" to determine the number of tokens to fit on automatically (100k * num_buckets). Set to an integer to fit on that many tokens.


================================================
FILE: data_selection/__init__.py
================================================
try:
    import importlib.metadata as importlib_metadata
except ModuleNotFoundError:
    import importlib_metadata

__version__ = importlib_metadata.version('data-selection')

from .base import DSIR
from .hashed_ngram_dsir import HashedNgramDSIR


================================================
FILE: data_selection/base.py
================================================
# base DSIR class
import os
from typing import List, Optional, Dict, Callable, Iterable, Union
import multiprocessing as mp
from pathlib import Path
import shutil
import pickle
import json
import warnings

import numpy as np
from tqdm import tqdm

from data_selection.utils import parallelize
from data_selection import __version__


def default_load_dataset_fn(path: str) -> Iterable[Dict]:
    """Load jsonl dataset from path

    Args:
        path (str): path to dataset file
    """
    with open(path, 'r') as f:
        for line in f:
            if len(line) > 0:
                yield json.loads(line)


def default_parse_example_fn(ex: Dict) -> str:
    """Default parse function from example dict to string

    Args:
        ex (Dict): example dict
    """
    return ex['text']


def _iterate_virtually_sharded_dataset(dataset: Iterable, num_shards: int, shard_idx: int):
    for i, ex in enumerate(dataset):
        if i % num_shards == shard_idx:
            yield ex
    del dataset


class DSIR():
    """Base class for data selection with importance resampling (DSIR)."""
    __version__ = __version__

    def __init__(self,
                 raw_datasets: List[str],
                 target_datasets: List[str],
                 cache_dir: str,
                 raw_load_dataset_fn: Callable[[str], Iterable[Dict]] = default_load_dataset_fn,
                 raw_parse_example_fn: Callable[[Dict], str] = default_parse_example_fn,
                 target_load_dataset_fn: Callable[[str], Iterable[Dict]] = default_load_dataset_fn,
                 target_parse_example_fn: Callable[[Dict], str] = default_parse_example_fn,
                 num_proc: Optional[int] = None,
                 separate_targets: bool = False,
                 target_proportions: Optional[List[float]] = None) -> None:
        """
        Args:
            raw_datasets: List of data paths
            target_datasets: List of data paths
            cache_dir: Directory to store cached intermediates (log importance weights)
            raw_load_dataset_fn: Function to load raw dataset from path
            raw_parse_example_fn: a function that takes in an example dict and outputs a string
            target_load_dataset_fn: Function to load target dataset from path
            target_parse_example_fn: a function that takes in an example dict and outputs a string
            num_proc: num cpus to parallelize over. If None, use all available cpus.
            separate_targets: whether to select data separately for each target and then join them
            target_proportions: weighting across multiple targets if separate_targets=True. Set to None to weight by the size of each target dataset
        """
        self.raw_datasets = raw_datasets
        self.target_datasets = target_datasets
        self.raw_parse_example_fn = raw_parse_example_fn
        self.raw_load_dataset_fn = raw_load_dataset_fn
        self.target_parse_example_fn = target_parse_example_fn
        self.target_load_dataset_fn = target_load_dataset_fn
        self.cache_dir = Path(cache_dir)
        if num_proc is None:
            try:
                # doesn't work on some systems
                self.num_proc = len(os.sched_getaffinity(0))
            except AttributeError:
                self.num_proc = mp.cpu_count()
        else:
            self.num_proc = num_proc
        self.log_importance_weights_dir = self.cache_dir / 'log_importance_weights'
        self.log_importance_weights_dir.mkdir(parents=True, exist_ok=True)
        self.perexample_metadata_dir = self.cache_dir / 'perexample_metadata'
        self.separate_targets = separate_targets
        self.target_proportions = target_proportions
        if self.target_proportions is not None:
            self.target_proportions = np.asarray(self.target_proportions) / np.sum(self.target_proportions)

    def _get_virtually_sharded_datasets(self, datasets: List[str]):
        """Return virtual shard parameters."""
        num_proc_per_shard = max(1, self.num_proc // len(datasets))
        if self.num_proc >= len(datasets):
            remainder = self.num_proc % len(datasets)
        else:
            remainder = 0

        overall_idx = 0
        shard_params = []
        for i, dataset in enumerate(datasets):
            curr_num_proc = num_proc_per_shard
            if i < remainder:
                curr_num_proc += 1
            for j in range(curr_num_proc):
                shard_params.append({'path': dataset, 'shard_idx': j, 'num_shards': curr_num_proc, 'overall_idx': overall_idx})
                overall_idx += 1
        return shard_params

    def featurizer(self, text: str) -> np.ndarray:
        """Takes a string and outputs a feature vector."""
        raise NotImplementedError

    def importance_estimator(self, features: np.ndarray) -> Union[float, np.ndarray]:
        """Takes a feature vector and outputs an importance weight."""
        raise NotImplementedError

    def get_perexample_metadata(self, ex: Dict, features: np.ndarray) -> np.ndarray:
        """Get per-example metadata.

        Args:
            ex: example dict
            features: feature vector
        """
        return NotImplementedError

    def fit_importance_estimator(self) -> None:
        """Fits parameters needed to run self.importance_estimator.

        Args:
        """
        raise NotImplementedError

    def compute_importance_weights(self) -> None:
        """Compute importance weights on raw dataset with self.importance_estimator.
        Saves importance weights in self.log_importance_weights_dir / {index}.npy in chunks indexed by index.
        Also saves other per-example metadata (numpy arrays) in self.perexample_metadata_dir / {index}.npy."""
        def job(args: Dict):
            path = args['path']
            num_shards = args['num_shards']
            shard_idx = args['shard_idx']
            overall_idx = args['overall_idx']

            log_importance_weights = []
            perexample_metadata = []

            dataset = self.raw_load_dataset_fn(path)

            iterator = _iterate_virtually_sharded_dataset(dataset, num_shards, shard_idx)
            for ex in tqdm(iterator, miniters=10000, maxinterval=1000000):
                if self.raw_parse_example_fn is not None:
                    text = self.raw_parse_example_fn(ex)
                else:
                    text = ex
                features = self.featurizer(text)
                log_importance_weights.append(self.importance_estimator(features))
                if perexample_metadata is not None:
                    try:
                        perexample_metadata.append(self.get_perexample_metadata(ex, features))
                    except NotImplementedError:
                        perexample_metadata = None

            log_importance_weights = np.asarray(log_importance_weights)
            save_path = Path(self.log_importance_weights_dir) / f"{overall_idx}.npy"
            np.save(str(save_path), log_importance_weights)
            if perexample_metadata is not None:
                self.perexample_metadata_dir.mkdir(parents=True, exist_ok=True)
                perexample_metadata = np.asarray(perexample_metadata)
                save_path = Path(self.perexample_metadata_dir) / f"{overall_idx}.npy"
                np.save(str(save_path), perexample_metadata)

        sharded_raw_datasets = self._get_virtually_sharded_datasets(self.raw_datasets)
        parallelize(job, sharded_raw_datasets, self.num_proc)

    def perexample_metadata_filter(self, concat_metadata: np.ndarray) -> np.array:
        """Return a boolean array of examples that pass the filter according to the metadata."""
        return NotImplementedError

    def resample(self, out_dir: str, num_to_sample: int, cache_dir: str = None, top_k: bool = False) -> None:
        """Resample raw dataset according to importance weights.

        Args:
            out_dir (str): path to save resampled dataset
            num_to_sample (int): number of samples to resample
            cache_dir (str): path to cache resampled dataset
            top_k (bool): if True, get top_k examples by importance weight instead of sampling
        """
        if cache_dir is None:
            cache_dir = out_dir

        out_dir = Path(out_dir)
        cache_dir = Path(cache_dir)
        cache_dir.mkdir(parents=True, exist_ok=True)

        sharded_raw_datasets = self._get_virtually_sharded_datasets(self.raw_datasets)

        # load log importance weights
        log_importance_weights_ls = [
                np.load(str(Path(self.log_importance_weights_dir) / f'{shard_params["overall_idx"]}.npy'), mmap_mode='r')
                for shard_params in sharded_raw_datasets]
        concat_log_importance_weights = np.concatenate(log_importance_weights_ls, axis=0)

        # filter examples by metadata first
        if Path(self.perexample_metadata_dir).exists():
            metadata_ls = [
                    np.load(str(Path(self.perexample_metadata_dir) / f'{shard_params["overall_idx"]}.npy'), mmap_mode='r')
                    for shard_params in sharded_raw_datasets]
            concat_metadata = np.concatenate(metadata_ls, axis=0)
            global_mask = self.perexample_metadata_filter(concat_metadata)
            del concat_metadata
        else:
            global_mask = np.ones(len(concat_log_importance_weights), dtype=bool)

        if self.separate_targets:
            # determine how many to sample per target
            num_to_sample_pertarget = [int(num_to_sample * p) for p in self.target_proportions]
            num_to_sample_pertarget[-1] += num_to_sample - sum(num_to_sample_pertarget)
        else:
            num_to_sample_pertarget = [num_to_sample]
            concat_log_importance_weights = concat_log_importance_weights[:, np.newaxis]

        chosen_mask = np.zeros(len(concat_log_importance_weights), dtype=bool)

        for i, curr_num_to_sample in enumerate(num_to_sample_pertarget):
            if curr_num_to_sample == 0:
                continue
            curr_log_importance_weights = concat_log_importance_weights[:, i]
            # apply filter
            curr_log_importance_weights = curr_log_importance_weights[global_mask]
            # noise the log_importance_weights (Gumbel top-k for sampling without replacement)
            if not top_k:
                curr_log_importance_weights += np.random.gumbel(size=curr_log_importance_weights.shape)

            # Take top-k
            nonzero_idxs = np.where(global_mask)[0]
            chosen_idxs = np.argpartition(-curr_log_importance_weights, curr_num_to_sample)[:curr_num_to_sample]
            chosen_idxs = nonzero_idxs[chosen_idxs]

            chosen_mask[chosen_idxs] = True
            # don't choose these examples again
            global_mask[chosen_idxs] = False

        del chosen_idxs
        del nonzero_idxs
        del concat_log_importance_weights
        del global_mask

        # split the global mask into per-dataset masks
        masks = []
        start_idx = 0
        for log_importance_weights in log_importance_weights_ls:
            end_idx = start_idx + len(log_importance_weights)
            masks.append(chosen_mask[start_idx:end_idx])
            start_idx = end_idx

        def job(args: Dict):
            in_path = args['in_path']
            out_path = args['out_path']
            mask = args['mask']
            shard_idx = args['shard_idx']
            num_shards = args['num_shards']

            if self.raw_load_dataset_fn.__name__ == 'default_load_dataset_fn':
                # faster to not load json lines into dicts
                curr_idx = 0
                with open(out_path, 'w') as f:
                    with open(in_path, 'r') as f_in:
                        iterator = _iterate_virtually_sharded_dataset(f_in, num_shards, shard_idx)
                        for line in tqdm(iterator, miniters=10000, maxinterval=1000000):
                            if len(line) == 0:
                                continue

                            if mask[curr_idx]:
                                f.write(line.strip() + '\n')
                            curr_idx += 1
            else:
                dataset = self.raw_load_dataset_fn(in_path)

                with open(out_path, 'w') as f:
                    iterator = _iterate_virtually_sharded_dataset(dataset, num_shards, shard_idx)
                    for i, ex in tqdm(enumerate(iterator), miniters=10000, maxinterval=1000000):
                        if mask[i]:
                            f.write(json.dumps(ex) + '\n')

        sharded_raw_datasets = self._get_virtually_sharded_datasets(self.raw_datasets)
        args = [{'out_path': cache_dir / f"{i}.jsonl",
                 'in_path': shard_params['path'],
                 'mask': masks[i],
                 'shard_idx': shard_params['shard_idx'],
                 'num_shards': shard_params['num_shards']}
                for i, shard_params in enumerate(sharded_raw_datasets)]

        parallelize(job, args, self.num_proc)

        # move the cache_dir to out_dir
        shutil.move(str(cache_dir), str(out_dir))

    def save(self, path: str) -> None:
        """Save parameters to save computation"""
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        with open(path, 'wb') as f:
            pickle.dump(self, f)

    def load(self, path: str, exclude_keys: Optional[List[str]] = None) -> None:
        """Load saved parameters.

        Args:
        path: path to saved parameters
        exclude_keys: keys to exclude from loading
        """

        with open(path, 'rb') as f:
            obj = pickle.load(f)

        if obj.__version__ != self.__version__:
            raise warnings.warn(f"Version mismatch: Saved version: {obj.__version__} != Current version: {self.__version__}")

        for k, v in obj.__dict__.items():
            if exclude_keys is not None and k in exclude_keys:
                continue
            setattr(self, k, v)


================================================
FILE: data_selection/hashed_ngram_dsir.py
================================================
from typing import List, Optional, Dict, Callable, Union, Iterable
import hashlib
from tqdm import tqdm
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import word_tokenize
from nltk import ngrams as get_ngrams
import numpy as np

from data_selection.base import (
        DSIR,
        default_load_dataset_fn,
        default_parse_example_fn,
        _iterate_virtually_sharded_dataset,
)

from data_selection.utils import parallelize


wpt = WordPunctTokenizer()


def hash_buckets(text: str, num_buckets: int = 10000) -> int:
    return int(hashlib.sha256(text.encode('utf-8')).hexdigest(), 16) % num_buckets


def get_ngram_counts(line: str,
                     n: int = 2,
                     num_buckets: int = 10000,
                     counts: Optional[np.ndarray] = None,
                     tokenizer: Callable = wpt.tokenize) -> np.ndarray:
    '''Return ngram count features given a string.

    Args:
        line: string to get ngram counts from
        n: n in ngrams
        num_buckets: number of buckets to hash ngrams into
        counts: pre-initialized counts array
        tokenizer: tokenization function to use. Defaults to word_tokenize from nltk
    '''
    words = tokenizer(line.lower())

    if counts is None:
        counts = np.zeros(num_buckets, dtype=int)

    for w in words:
        counts[hash_buckets(w, num_buckets=num_buckets)] += 1
    for i in range(2, n + 1):
        for ng in list(get_ngrams(words, i)):
            ng = ' '.join(ng)
            counts[hash_buckets(ng, num_buckets=num_buckets)] += 1
    return counts


class HashedNgramDSIR(DSIR):
    """DSIR with hashed n-gram features."""

    def __init__(self,
                 raw_datasets: List[str],
                 target_datasets: List[str],
                 cache_dir: str,
                 raw_load_dataset_fn: Callable[[str], Iterable[Dict]] = default_load_dataset_fn,
                 raw_parse_example_fn: Callable[[Dict], str] = default_parse_example_fn,
                 target_load_dataset_fn: Callable[[str], Iterable[Dict]] = default_load_dataset_fn,
                 target_parse_example_fn: Callable[[Dict], str] = default_parse_example_fn,
                 num_proc: Optional[int] = None,
                 ngrams: int = 2,
                 num_buckets: int = 10000,
                 tokenizer: str = 'wordpunct',
                 min_example_length: int = 100,
                 target_laplace_smoothing: float = 0.0,
                 separate_targets: bool = False,
                 target_proportions: Optional[List[float]] = None) -> None:
        '''Initialize the HashedNgramDSIR object.

        Args:
            raw_datasets: List of data paths
            target_datasets: List of data paths
            cache_dir: place to store cached log_importance_weights
            load_dataset_fn: Function to load a dataset from a path. Defaults to default_load_dataset_fn.
            parse_example_fn: Function that takes in an example dict and returns a string.
                              Defaults to returning the 'text' field of the example.
            num_proc: number of processes to use for parallelization. Defaults to number of cores.
            ngrams: N in N-grams. 2 means both unigram and bigrams.
            num_buckets: number of buckets to hash ngrams into.
            tokenizer: word_tokenize or wordpunct
            min_example_length: minimum number of tokens in an example to be considered.
            target_laplace_smoothing: Smooth the target hashed ngram distribution. This parameter is a pseudo-count. This could be useful for small target datasets.
            separate_targets: whether to select data separately for each target and then join them
            target_proportions: weighting across multiple targets if separate_targets=True. Set to None to weight by the size of each target dataset
        '''
        super().__init__(
                raw_datasets=raw_datasets,
                target_datasets=target_datasets,
                cache_dir=cache_dir,
                raw_load_dataset_fn=raw_load_dataset_fn,
                raw_parse_example_fn=raw_parse_example_fn,
                target_load_dataset_fn=target_load_dataset_fn,
                target_parse_example_fn=target_parse_example_fn,
                num_proc=num_proc,
                separate_targets=separate_targets,
                target_proportions=target_proportions)
        if tokenizer == 'word_tokenize':
            self.tokenizer = word_tokenize
        elif tokenizer == 'wordpunct':
            self.tokenizer = wpt.tokenize
        else:
            raise ValueError('tokenizer not recognized')
        self.ngrams = ngrams
        self.num_buckets = num_buckets
        self.min_example_length = min_example_length
        self.raw_probs = None
        self.target_probs = None
        self.log_diff = None
        self.target_laplace_smoothing = target_laplace_smoothing

    def featurizer(self, text: str) -> np.ndarray:
        return get_ngram_counts(text, tokenizer=self.tokenizer, num_buckets=self.num_buckets, n=self.ngrams)

    def importance_estimator(self, features: np.ndarray) -> Union[float, np.ndarray]:
        return np.dot(self.log_diff, features)

    def get_perexample_metadata(self, ex: Dict, features: np.ndarray) -> int:
        """Returns the example length."""
        remainder = self.ngrams * (self.ngrams - 1) / 2
        return (features.sum() + remainder) // self.ngrams

    def perexample_metadata_filter(self, concat_metadata: np.ndarray) -> np.array:
        """Filters out short examples."""
        return concat_metadata >= self.min_example_length

    def _fit_bow(self,
                 paths: List[str],
                 num_tokens_to_fit: Optional[int] = None,
                 load_dataset_fn: Callable[[str], Iterable[Dict]] = default_load_dataset_fn,
                 parse_example_fn: Callable[[Dict], str] = default_parse_example_fn) -> np.ndarray:

        sharded_datasets = self._get_virtually_sharded_datasets(paths)

        def job(args: Dict):
            path = args['path']
            num_shards = args['num_shards']
            shard_idx = args['shard_idx']

            counts = np.zeros(self.num_buckets).astype(int)
            dataset = load_dataset_fn(path)
            iterator = _iterate_virtually_sharded_dataset(dataset, num_shards, shard_idx)
            for ex in tqdm(iterator, miniters=10000, maxinterval=1000000):
                if parse_example_fn is not None:
                    text = parse_example_fn(ex)
                else:
                    text = ex
                counts = get_ngram_counts(text,
                                          n=self.ngrams,
                                          num_buckets=self.num_buckets,
                                          counts=counts,
                                          tokenizer=self.tokenizer)

                if num_tokens_to_fit is not None and counts.sum() > num_tokens_to_fit // len(sharded_datasets):
                    break

            return counts

        all_counts = parallelize(job, sharded_datasets, self.num_proc)
        counts = sum(all_counts)

        return counts

    def fit_importance_estimator(self, num_tokens_to_fit: Union[str, int] = 'auto') -> None:
        '''Fit the importance estimator.
        Args:
            num_tokens_to_fit: number of tokens to fit the raw dataset importance estimator on.
                               Set to "all" to fit on all tokens, and "auto" to determine
                               the number of tokens to fit on automatically (100k * num_buckets).
                               Set to an integer to fit on that many tokens.
        '''
        if num_tokens_to_fit == 'auto':
            num_tokens_to_fit = 100000 * self.num_buckets
        elif num_tokens_to_fit == 'all':
            num_tokens_to_fit = None

        self.raw_probs = self._fit_bow(
                self.raw_datasets,
                num_tokens_to_fit=num_tokens_to_fit,
                parse_example_fn=self.raw_parse_example_fn,
                load_dataset_fn=self.raw_load_dataset_fn)
        self.raw_probs = self.raw_probs / self.raw_probs.sum()

        if self.separate_targets:
            target_probs = []
            target_proportions = []

            for target_dataset in self.target_datasets:
                curr_target_probs = self._fit_bow(
                        [target_dataset],
                        num_tokens_to_fit=num_tokens_to_fit,
                        parse_example_fn=self.target_parse_example_fn,
                        load_dataset_fn=self.target_load_dataset_fn)
                target_proportions.append(curr_target_probs.sum())
                # smoothing
                curr_target_probs = curr_target_probs + self.target_laplace_smoothing
                curr_target_probs = curr_target_probs / curr_target_probs.sum()
                target_probs.append(curr_target_probs)
            target_proportions = np.asarray(target_proportions)
            if self.target_proportions is None:
                self.target_proportions = target_proportions / target_proportions.sum()

            self.target_probs = np.asarray(target_probs)

        else:
            self.target_probs = self._fit_bow(
                    self.target_datasets,
                    num_tokens_to_fit=None,  # fit on all tokens for target
                    parse_example_fn=self.target_parse_example_fn,
                    load_dataset_fn=self.target_load_dataset_fn)
            # smoothing
            self.target_probs = self.target_probs + self.target_laplace_smoothing
            self.target_probs = self.target_probs / self.target_probs.sum()


        self.log_diff = np.log(self.target_probs + 1e-8) - np.log(self.raw_probs + 1e-8)


================================================
FILE: data_selection/utils.py
================================================
from typing import Callable, List, Any
from joblib import Parallel, delayed


def parallelize(fn: Callable, args: List[Any], num_proc: int):
    return Parallel(n_jobs=num_proc)(delayed(fn)(arg) for arg in args)


================================================
FILE: experimental/README.md
================================================
# Code for the DSIR paper
This directory has the code for preprocessing, data selection, pretraining, and fine-tuning for the experiments in the DSIR paper. Pre-filtered datasets and pre-trained models from the paper are linked in the README at the outer directory.

## Code for data selection

To select your own subset of The Pile, all you need is a small set of target examples representing the kind of data you want to select.
This target dataset should be in jsonl format -- it can also be a dataset from HuggingFace Datasets. Note that our current workflow requires about 2TB of storage space --- we're working on reducing this! All the code should be run from the `experimental/` directory.
1. Create a virtualenv using `requirements.txt`: `virtualenv .venv; source .venv/bin/activate; pip install -r requirements.txt`
2. Download The Pile to `PILE_PATH` and change the corresponding variables in `config.sh`.
3. Run preprocessing on The Pile: Run `bash preprocessing/run_slurm.sh`. You can also run `bash preprocessing/run.sh` directly using the arguments in `preprocessing/run_slurm.sh`. This only needs to be run once. 
4. Precompute quality filter stats: Run `bash preprocessing/quality_scores/run_slurm_quality_stats.sh`. After this, run `bash preprocessing/quality_scores/run_merge_quality_scores.sh`. This only needs to be run once. (We're working on streamlining steps 3 and 4. Stay tuned!) 
5. Run DSIR: For an example, run `bash data_selection/run_cmds.sh`. For new target datasets, some information about which fields in the dataset to use should be placed in the `dsname_to_args` dictionary at the top of the `data_selection/dsir_pipeline.py` file. If you wish to retrieve from custom subsets of the Pile (for example, only select data from one chunk of the Pile), you will need to tweak one part of the code, in the main part of the script (an example is provided of how to do so as a comment). Many of the steps in DSIR can be cached and will only run the first time. For example, resampling a different number of examples with the same target dataset uses cached importance weights.

## Code for pretraining and GLUE evaluation

We provide scripts for training BERT-style masked language models on the selected data and evaluating it on GLUE in the `train` and `glue_eval` directories, respectively. All code should be run from the `experimental/` directory.
1. Install further dependencies using `train/requirements.txt`: `pip install -r train/requirements.txt`
2. Change the `PRETRAIN_OUTPUT_DIR` variable in `config.sh`.
3. Write a job command in `train/run_slurm.sh`. An example command in this file. You will need to change the path to the training data. If you want to skip preprocessing (if it's already done), set the first of two boolean variables to `false`. By setting both to `true`, there will be two jobs launched: one for preprocessing and one for pretraining. The pretraining job should take about 50 hours on 4 RTX 3090 GPUs. Kick off the jobs by running `bash train/run_slurm.sh`.
4. Evaluate the trained model by editing the evaluation job command in `glue_eval/run_eval_exps.sh` with the path to the model checkpoint. This script runs 5 seeds for each GLUE dataset. The results and finetuned models will be saved a new `finetune_runs` directory inside the pretrained model checkpoint directory. Kick off the jobs by running `bash glue_exps/run_eval_exps.sh`.
5. Read the GLUE results by running `python read_glue_results.py --results_dir </path/to/checkpoint>/finetune_runs` in the `glue_eval` directory.

## Pre-filtered datasets
Note: previous versions of the datasets had a small validation and test split (50000 examples each), but we concatenated these onto the end of the train set (in the order validation, then test) to better align with the paper. The datasets should be further shuffled during preprocessing before training.

### DSIR-filtered-pile-50M
- Target distribution: Wikipedia, BookCorpus2
- Selection method: DSIR (with importance resampling on hashed n-gram model importance weights)
- Raw dataset: The Pile
- Size: 80GB, 51.2M examples
- Used for 128-token context models in the paper. Suitable for token length 512 or 1024, but can be used for shorter token lengths.
- The dataset contains 51.2M examples, most of which are selected from Pile subsets that are not Wikipedia or books-related (BookCorpus2, Books3, Gutenberg). 4% of the data is randomly selected from Wikipedia and books-related subsets. Every example concatenates 2 snippets, possibly from different sources, to ensure that the examples are long enough for longer context models (512 or 1024 tokens). Metadata about which sources the text comes from is included with every example.
- Available on HuggingFace at https://huggingface.co/datasets/stanford-crfm/DSIR-filtered-pile-50M. Use with HuggingFace Datasets:
```python
from datasets import load_dataset
dataset = load_dataset("stanford-crfm/DSIR-filtered-pile-50M")
```

### heuristic_classification-filtered-pile-50M
- Target distribution: Wikipedia, BookCorpus2
- Selection method: Heuristic classification (FastText binary classifier)
- Raw dataset: The Pile
- Size: 80GB, 51.2M examples
- Used for 128-token context length models in the paper. Suitable for token length 512 or 1024, but can be used for shorter token lengths
- The dataset contains 51.2M examples, most of which are selected from Pile subsets that are not Wikipedia or books-related (BookCorpus2, Books3, Gutenberg). 4% of the data is randomly selected from Wikipedia and books-related subsets. Every example concatenates 2 snippets, possibly from different sources, to ensure that the examples are long enough for longer context models (512 or 1024 tokens). Metadata about which sources the text comes from is included with every example.
- Available on HuggingFace at https://huggingface.co/datasets/stanford-crfm/heuristic_classification-filtered-pile-50M. Use with HuggingFace Datasets:
```python
from datasets import load_dataset
dataset = load_dataset("stanford-crfm/heuristic_classification-filtered-pile-50M")
```
- Comparisons for training BERT-base models from scratch (50k steps, 128 max token length, 4096 batch size):

| GLUE dev                                          |  MNLI |  QNLI |   QQP |   RTE | SST2 |  MRPC |  CoLA | STSB |   Avg |
|---------------------------------------------------|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| Random selection from The Pile                    | 82.63 |  86.9 | 89.57 | 67.37 | 90.05 | 87.40 | 49.41 | 88.63 | 80.25 |
| Heuristic classification (GPT-3/Pile/PaLM method) | 82.69 | 85.95 | 89.77 | 68.59 | 88.94 | 86.03 | 48.17 | 88.62 | 79.85 |
| DSIR                                              | 83.07 | 89.11 | 89.80 | 75.09 | 90.48 | 87.70 | 54.00 | 89.17 | 82.30 |


## Pretrained models

In the table below, `{dataset}` can be replaced with one of `{ag, amazon, citation_intent, hyp, imdb, sciie, chemprot, rct-20k}` for the continued pretraining models.

| HuggingFace ID | Link | Dataset size | Max token length | Training steps | Architecture | Initialization | Description |
|---|---|---|---|---|---|---|---|
| dsir-bert-scratch-wiki_and_books | [Link](https://huggingface.co/sangmichaelxie/dsir-bert-scratch-wiki_and_books) | 6.5B tokens (51.2M examples) | 128 | 5.00E+04 | bert-base-uncased | scratch | BERT model trained on [DSIR-filtered-pile-50M](https://huggingface.co/datasets/stanford-crfm/DSIR-filtered-pile-50M/viewer/default/train?p=31445&row=3144531) |
| heuristiccls-bert-scratch-wiki_and_books | [Link](https://huggingface.co/sangmichaelxie/heuristiccls-bert-scratch-wiki_and_books) | 6.5B tokens (51.2M examples) | 128 | 5.00E+04 | bert-base-uncased | scratch | BERT model trained on Pile data filtered by heuristic classification |
| randomselect-bert-scratch | [Link](https://huggingface.co/sangmichaelxie/randomselect-bert-scratch) | 6.5B tokens (51.2M examples) | 128 | 5.00E+04 | bert-base-uncased | scratch | BERT model trained on random subset of The Pile |
| dsir-roberta-continuedpretrain-{dataset} | Link format: `https://huggingface.co/sangmichaelxie/dsir-roberta-continuedpretrain-{dataset}` | 6.4B tokens (25M examples) | 256 | 25000 | roberta-base | roberta-base | RoBERTa model with continued pretraining on data selected by DSIR with target={dataset} |
| heuristiccls-roberta-continuedpretrain-{dataset} | Link format: `https://huggingface.co/sangmichaelxie/dsir-roberta-continuedpretrain-{dataset}` | 6.4B tokens (25M examples) | 256 | 25000 | roberta-base | roberta-base | RoBERTa model with continued pretraining on data selected by heurstic classification with target={dataset} |
| randomselect-roberta-continuedpretrain | [Link](https://huggingface.co/sangmichaelxie/randomselect-roberta-continuedpretrain) | 6.4B tokens (25M examples) | 256 | 25000 | roberta-base | roberta-base | RoBERTa model with continued pretraining on random subset of The Pile |

## Citation Information
Paper: <https://arxiv.org/abs/2302.03169>
```
@article{xie2023data,
  author = {Sang Michael Xie and Shibani Santurkar and Tengyu Ma and Percy Liang},
  journal = {arXiv preprint arXiv:2302.03169},
  title = {Data Selection for Language Models via Importance Resampling},
  year = {2023},
}
```



================================================
FILE: experimental/config.sh
================================================
#!/bin/bash

CACHE='/path/to/cachedir'
ROOT_DIR='/path/to/dsir/experimental'
VIRTUAL_ENV='/path/to/.env'
PILE_PATH='/path/to/pile'
DSIR_OUTPUT_DIR='/path/to/outputdir'
PRETRAIN_OUTPUT_DIR='/path/to/model_outputdir'
WORD_VECTORS_PATH='/path/to/pretrained_fasttext_wordvecs.vec'
# Slurm
cluster_info='--partition <PARTITION_NAME>'

source ${VIRTUAL_ENV}/bin/activate


================================================
FILE: experimental/data_selection/dsir_general/data_selection.py
================================================
from pathlib import Path
import argparse
import random
import shutil
from json import loads, dumps
from tqdm import tqdm
from datasets import load_dataset
from glob import glob
from multiprocessing import Pool, cpu_count
import numpy as np
from utils import *
from time import time
import os


def compute_ngrams_raw(args, in_path: str, cache_path: Path):
    file_name = in_path.split("/")[-1]
    save_path = cache_path / f'{file_name}_{args.ngrams}grams_buckets_{args.num_buckets}_all.npy'

    if not save_path.exists():
        num_docs = 0
        st = time()
        counts = np.zeros(args.num_buckets).astype(int)
        with open(in_path, 'r') as f:
            for line in f:
                ex = loads(line)
                num_docs += 1
                if num_docs % 10000 == 0:
                    speed = num_docs / (time() - st)
                    print(num_docs, file_name, speed)
                line = ex["text"]
                curr_count = get_ngram_info(line, n=args.ngrams, num_buckets=args.num_buckets)
                counts = counts + curr_count
        np.save(str(save_path), counts)
    else:
        counts = np.load(str(save_path))
    print(file_name, "done!")
    return counts


def compute_importance_weights(args, in_path: str):
    file_name = in_path.split("/")[-1]
    out_dir = Path(args.out_path) / f'logratio_{args.ngrams}grams_buckets_{args.num_buckets}'
    out_dir.mkdir(parents=True, exist_ok=True)
    save_path = out_dir / f'{file_name}.npy'

    if not save_path.exists():
        logratios = []
        st = time()
        num_docs = 0
        with open(in_path, 'r') as f:
            for line in f:
                ex = loads(line)
                line = ex["text"]
                num_docs += 1
                if num_docs % 10000 == 0:
                    speed = num_docs / (time() - st)
                    print(num_docs, file_name, speed)
                curr_count = get_ngram_info(line, n=args.ngrams, num_buckets=args.num_buckets)
                logratio = np.inner(curr_count, args.log_diff_dist)
                logratios.append(logratio)
            logratios = np.asarray(logratios)
        np.save(str(save_path), logratios)
    else:
        logratios = np.load(str(save_path))
    print(file_name, "log ratio done!")
    return logratios


def resample(args, data_files, cache_ds_dir, streaming=False):
    retrieved_dir = args.out_path / f'retrieved'
    retrieved_path_cache = cache_ds_dir / f'retrieved_{args.num_to_retrieve}.jsonl'

    retrieved_dir.mkdir(parents=True, exist_ok=True)
    retrieved_path = retrieved_dir / f'{args.ds_name}_{args.target_ds_name}_retrieved_{args.num_to_retrieve}.jsonl'

    # merge logratio chunks
    logratios_file = retrieved_dir / 'logratios.npy'

    if not logratios_file.exists():
        logratios = []
        chunk_dir = f'{args.out_path}/logratio_{args.ngrams}grams_buckets_{args.num_buckets}'
        all_logratio_files = sorted(glob(f"{chunk_dir}/*.npy"))
        for curr_logratio_file in all_logratio_files:
            logratios.append(np.load(str(curr_logratio_file)))
        logratios = np.concatenate(logratios)
        np.save(logratios_file, logratios)
    else:
        logratios = np.load(logratios_file)

    print("logratios cnt", len(logratios))

    # noise the logratios
    logratios += np.random.gumbel(size=len(logratios))

    # choose top k
    chosen_idxs = np.argpartition(-logratios, args.num_to_retrieve)[:args.num_to_retrieve]

    global_mask = np.zeros(len(logratios)).astype(bool)
    global_mask[chosen_idxs] = True

    del nonzero_idxs
    del chosen_idxs
    del logratios

    print("Loading data...")

    # data_files = sorted(glob(f"{args.data_pool_path}/*.jsonl"))
    if not streaming:
        combined_streaming_ds = []
        for f_name in data_files:
            with open(f_name, "r") as f:
                combined_streaming_ds.extend(f.read().splitlines())
        print("data line cnt", len(combined_streaming_ds))

        assert len(combined_streaming_ds) == len(global_mask)

        with open(retrieved_path_cache, 'w') as fout:
            for i, curr_ex in tqdm(enumerate(combined_streaming_ds)):
                if global_mask[i]:
                    fout.write(curr_ex.strip() + '\n')
    else:
        combined_streaming_ds = load_dataset(
            'json',
            data_files=data_files,
            streaming=True)["train"]

        with open(retrieved_path_cache, 'w') as fout:
            for i, curr_ex in tqdm(enumerate(combined_streaming_ds)):
                # curr_ex["timestamp"] = curr_ex["timestamp"].strftime("%m/%d/%Y, %H:%M:%S")
                if global_mask[i]:
                    fout.write(dumps(curr_ex) + "\n")

    shutil.move(retrieved_path_cache, retrieved_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Data selection with DSIR')
    parser.add_argument('--data_pool_path',
                        default="",
                        type=str,
                        help='path to data pool')
    parser.add_argument('--ds_name', default="data pool", type=str, help='pretraining dataset name')
    parser.add_argument('--target_path',
                        default="",
                        type=str,
                        help='path to target data')
    parser.add_argument('--target_ds_name', default="target data", type=str, help='target dataset name')
    parser.add_argument('--output_dir', default="", type=str,
                        help='output path')
    parser.add_argument('--num_to_retrieve', type=int, default=200000, help='Number of examples to retrieve')
    parser.add_argument('--cache_dir', default="", type=str,
                        help='cache directory for datasets')
    parser.add_argument('--ngrams', type=int, default=3, help='N in N-grams. 2 means both unigram and bigram.')
    parser.add_argument('--num_buckets', type=int, default=10000, help='number of ngram hash buckets')
    parser.add_argument('--pipeline_step', default="resample", type=str,
                        help='which step of pipeline to run. (importance_weights, resample)')
    args = parser.parse_args()
    random.seed(42)

    print("PYTHONHASHSEED", os.environ.get('PYTHONHASHSEED'))

    cache_ds_dir = Path(args.cache_dir) / 'ngram_cache' / args.ds_name
    cache_ds_dir.mkdir(exist_ok=True, parents=True)

    cache_target_dir = Path(args.cache_dir) / 'ngram_cache' / args.target_ds_name
    cache_target_dir.mkdir(exist_ok=True, parents=True)

    args.out_path = Path(args.output_dir) / args.target_ds_name / args.ds_name
    args.out_path.mkdir(exist_ok=True, parents=True)

    ds_in_paths = sorted(glob(f"{args.data_pool_path}/*"))
    target_in_paths = sorted(glob(f"{args.target_path}/*"))
    with Pool(cpu_count()) as p:
        ds_all_args = [(args, in_path, cache_ds_dir) for in_path in ds_in_paths]
        ds_ngram_dist = p.starmap(compute_ngrams_raw, ds_all_args)

        target_all_args = [(args, in_path, cache_target_dir) for in_path in target_in_paths]
        target_ngram_dist = p.starmap(compute_ngrams_raw, target_all_args)
    for i in range(1, len(ds_ngram_dist)):
        ds_ngram_dist[0] = ds_ngram_dist[0] + ds_ngram_dist[i]
        ds_ngram_dist[i] = None
    ds_ngram_dist = ds_ngram_dist[0]
    ds_ngram_dist = ds_ngram_dist / ds_ngram_dist.sum()

    for i in range(1, len(target_ngram_dist)):
        target_ngram_dist[0] = target_ngram_dist[0] + target_ngram_dist[i]
        target_ngram_dist[i] = None
    target_ngram_dist = target_ngram_dist[0]
    target_ngram_dist = target_ngram_dist / target_ngram_dist.sum()
    print(ds_ngram_dist)
    print(target_ngram_dist)

    args.log_diff_dist = np.log(target_ngram_dist + 1e-8) - np.log(ds_ngram_dist + 1e-8)
    with Pool(cpu_count()) as p:
        importance_weights_args = [(args, in_path) for in_path in ds_in_paths]
        importance_weights = p.starmap(compute_importance_weights, importance_weights_args)

    resample(args, ds_in_paths, cache_ds_dir)


================================================
FILE: experimental/data_selection/dsir_general/run_data_selection.py
================================================
import subprocess
import os

environment = dict(os.environ, PYTHONHASHSEED='42')
subprocess.run(["python", "data_selection.py"], env=environment)


================================================
FILE: experimental/data_selection/dsir_general/utils.py
================================================
from nltk.tokenize import WordPunctTokenizer
from nltk import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter
import nltk
import numpy as np
import subprocess
from nltk.corpus import stopwords
import string


nltk.download('stopwords')


def hash_buckets(text, num_buckets=1e4):
    return int(abs(hash(text)) % num_buckets)


wpt = WordPunctTokenizer()


def get_ngram_info(line, n=2, num_buckets=10000):
    words = wpt.tokenize(line.lower())
    # words = line.lower().split()
    counts = np.zeros(num_buckets, dtype=int)
    for w in words:
        counts[hash_buckets(w, num_buckets=num_buckets)] += 1
    for i in range(2, n + 1):
        for ng in list(ngrams(words, i)):
            counts[hash_buckets(ng, num_buckets=num_buckets)] += 1
    return counts


def linecount(filename):
    out = subprocess.Popen(['wc', '-l', filename],
                           stdout=subprocess.PIPE).communicate()[0]
    return int(out.strip().partition(b' ')[0])


stop = set(stopwords.words('english') + list(string.punctuation))
numeric = set(list(string.digits))


def transform_text(text):
    return word_tokenize(text.lower())


def repeating_filter(x_tok, n=1):
    if len(x_tok) == 0:
        return 0
    counts = Counter(x_tok)
    if n == 1:
        ratio = (max(counts.values()) / len(x_tok))
    else:
        ratio = sum(sorted(counts.values(), reverse=True)[:n]) / len(x_tok)
    return ratio


def mostly_uninformative_filter(x_tok):
    if len(x_tok) == 0:
        return 0
    informative_ratio = (len([x for x in x_tok if x not in stop]) / len(x_tok))
    return informative_ratio


def numeric_filter(x_tok):
    if len(x_tok) == 0:
        return 0
    ratio = (len([x for x in x_tok if x not in numeric]) / len(x_tok))
    return ratio


================================================
FILE: experimental/data_selection/dsir_pipeline.py
================================================
from pathlib import Path
from itertools import zip_longest
import random
import argparse
import json
import shutil
from collections import defaultdict
import subprocess
from itertools import islice

from tqdm import tqdm
from nltk.tokenize import WordPunctTokenizer
import numpy as np
from datasets import load_dataset

import logging
logging.basicConfig(level=logging.INFO)


# Place information about datasets in the dict below.
# The columns field is a list of columns to use for DSIR.
dsname_to_args = {
    'ag_news': {'dataset_name': 'yxchar/ag-tlm',
                'task_name': None,
                'columns': ['text'], },
    'chemprot': {'dataset_name': "yxchar/chemprot-tlm",
                 'task_name': None,
                 'columns': ['text']},
    'citation_intent': {'dataset_name': "yxchar/citation_intent-tlm",
                        'task_name': None,
                        'columns': ['text']},
    'hyperpartisan': {'dataset_name': "yxchar/hyp-tlm",
                      'task_name': None,
                      'columns': ['text']},
    'rct': {'dataset_name': "yxchar/rct-20k-tlm",
            'task_name': None,
            'columns': ['text']},
    'imdb': {'dataset_name': 'yxchar/imdb-tlm',
             'task_name': None,
             'columns': ['text']},
    'sciie': {'dataset_name': 'yxchar/sciie-tlm',
              'task_name': None,
              'columns': ['text']},
    'helpfulness': {'dataset_name': 'yxchar/amazon-tlm',
                    'task_name': None,
                    'columns': ['text']},
    'pile_val': {'dataset_name': 'json',
                 'task_name': None,  # to be set later
                 'quality_scores': None,  # to be set later
                 'columns': ['contents'], },
    'pile': {'dataset_name': 'json',
             'task_name': None,  # to be set later
             'quality_scores': None,  # to be set later
             'columns': ['contents'],
             'total_lines': 1745766302, },
 }


subsets = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
           '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
           '21', '22', '23', '24', '25', '26', '27', '28', '29']


def get_quality_mask(quality_scores):
    keep = (
        (quality_scores['length'] > 40)
        & (quality_scores['length'] < 500)
        & (quality_scores['repeating_ratio'] > 0.02)
        & (quality_scores['repeating_ratio'] < 0.2)
        & (quality_scores['informative_ratio'] > 0.3)
        & (quality_scores['informative_ratio'] < 0.7)
        & (quality_scores['numeric_ratio'] > 0.8)
    )
    return keep


def hash_buckets(string, num_buckets=10e4):
    return int(abs(hash(string)) % num_buckets)


def unigrams_bigrams(text):
    words = text.split()
    return words, list(zip(words, islice(words, 1, None)))


wpt = WordPunctTokenizer()
def get_ngram_info(line, n=2, num_buckets=10000):
    words = wpt.tokenize(line.lower())
    unigrams, bigrams = words, list(zip(words, islice(words, 1, None)))

    counts = np.zeros(num_buckets, dtype=int)
    for unigram in unigrams:
        counts[hash_buckets(unigram, num_buckets=num_buckets)] += 1
    for bigram in bigrams:
        counts[hash_buckets(bigram, num_buckets=num_buckets)] += 1
    return counts


def grouper(iterable, n, *, incomplete='fill', fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, fillvalue='x') --> ABC DEF Gxx
    # grouper('ABCDEFG', 3, incomplete='strict') --> ABC DEF ValueError
    # grouper('ABCDEFG', 3, incomplete='ignore') --> ABC DEF
    args = [iter(iterable)] * n
    if incomplete == 'fill':
        return zip_longest(*args, fillvalue=fillvalue)
    if incomplete == 'strict':
        return zip(*args, strict=True)
    if incomplete == 'ignore':
        return zip(*args)
    else:
        raise ValueError('Expected fill, strict, or ignore')


def compute_ngrams_hf(ds_name, ds_dir, cache_dir, ngrams, num_buckets):
    ds_dir = Path(ds_dir)
    save_path = ds_dir / f'{ds_name}_ngramcounts.npy'
    if not save_path.exists():
        config = dsname_to_args[ds_name]
        text_cols = config["columns"]
        logging.info(f"{text_cols}")

        if ds_name == 'dbpedia':
            ds = load_dataset(config["dataset_name"], data_files=config["task_name"],
                              cache_dir=cache_dir, download_mode='force_redownload')
        else:
            ds = load_dataset(config["dataset_name"], config["task_name"],
                              cache_dir=cache_dir)

        counts = np.zeros(num_buckets).astype(int)
        for i, ex in tqdm(enumerate(ds['train']), miniters=10000, maxinterval=1000000):
            line = " ".join([ex[c] for c in text_cols])
            curr_count = get_ngram_info(line, n=ngrams)
            counts = counts + curr_count
        np.save(save_path, counts)
    else:
        counts = np.load(save_path)
    return counts


def compute_ngrams_pile(
        path, ngrams=2, num_buckets=10000,
        filter_domains=None,
        cache_dir=None):

    path_parent = Path(path).parent

    if filter_domains is None:
        save_path = path_parent / f'ngrams{ngrams}_buckets{num_buckets}_nofilter.npy'
    else:
        filter_domains_str = '_'.join(filter_domains)
        save_path = path_parent / f'ngrams{ngrams}_buckets{num_buckets}_{filter_domains_str}.npy'

    if not save_path.exists():

        counts = np.zeros(num_buckets).astype(int)
        num_docs = 0
        with open(path, 'r') as f:
            for k, line in tqdm(enumerate(f), miniters=1000000, maxinterval=1000000):
                ex = json.loads(line)
                domain = ex["metadata"]["pile_set_name"]
                if filter_domains is not None and domain not in filter_domains:
                    continue
                num_docs += 1
                line = ex["contents"]
                curr_count = get_ngram_info(line, n=ngrams)
                counts = counts + curr_count
        np.save(save_path, counts)
    else:
        counts = np.load(save_path)
    return counts


def compute_importance_weights(
        path, ds_dir, chunk_idx, target_dist, pile_dist, ngrams=2, num_buckets=10000):
    chunk_dir = Path(ds_dir) / f'logratio_chunks_ngrams{ngrams}_buckets{num_buckets}'
    chunk_dir.mkdir(parents=True, exist_ok=True)
    save_path = chunk_dir / f'{chunk_idx}.npy'

    log_diff_dist = np.log(target_dist + 1e-8) - np.log(pile_dist + 1e-8)

    if not save_path.exists():
        logratios = []
        with open(path, 'r') as f:
            for k, line in tqdm(enumerate(f), miniters=1000000, maxinterval=1000000):
                ex = json.loads(line)
                line = ex["contents"]
                curr_count = get_ngram_info(line, n=ngrams)
                logratio = np.inner(curr_count, log_diff_dist)
                logratios.append(logratio)
            logratios = np.asarray(logratios)
        np.save(save_path, logratios)
    else:
        logratios = np.load(save_path)
    return logratios


def compute_domain_idxs(filter_domains):
    # path to outer directory
    ds_path = Path(dsname_to_args['pile']['task_name'][0]).parent.parent

    domain_to_idxs = defaultdict(list)
    todo_domains = []
    for domain in filter_domains:
        domain_idxs_path = ds_path / f"{domain.replace(' ', '_')}_idxs.npy"
        if not domain_idxs_path.exists():
            todo_domains.append(domain)

    combined_streaming_ds = load_dataset(
            'json',
            data_files=dsname_to_args['pile']['task_name'],
            streaming=True)['train']

    todo_domains = set(todo_domains)
    if len(todo_domains) > 0:
        for i, ex in tqdm(enumerate(combined_streaming_ds), miniters=1000000):
            domain = ex["metadata"]["pile_set_name"]
            if domain in todo_domains:
                domain_to_idxs[domain].append(i)
        for domain, idxs in domain_to_idxs.items():
            np.save(ds_path / f"{domain.replace(' ', '_')}_idxs.npy", np.asarray(idxs))

    for domain in filter_domains:
        domain_idxs_path = ds_path / f"{domain.replace(' ', '_')}_idxs.npy"
        domain_idxs = np.load(domain_idxs_path)
        domain_to_idxs[domain] = domain_idxs

    return domain_to_idxs


def resample(ds_dir, cache_ds_dir, num_to_retrieve):
    if args.pack_every_2_examples:
        suffix = '_pack'
    else:
        suffix = '_nopack'
    retrieved_dir = ds_dir / 'retrieved'
    retrieved_path_cache = cache_ds_dir / f'retrieved_{num_to_retrieve}{suffix}.jsonl'

    retrieved_dir.mkdir(parents=True, exist_ok=True)
    retrieved_path = retrieved_dir / f'retrieved_{num_to_retrieve}{suffix}.jsonl'

    total_lines = dsname_to_args['pile']['total_lines']

    if args.qualityfilter:
        quality_scores = np.load(dsname_to_args['pile']['quality_scores'])
        global_mask = get_quality_mask(quality_scores)

    if args.ds_name == 'wiki_and_books':
        # compute the wikipedia and book masks and filter out
        filter_domains = ['Wikipedia (en)', 'BookCorpus2', 'Books3', 'Gutenberg (PG-19)']
        domain_to_idxs = compute_domain_idxs(filter_domains)
        for domain, idxs in domain_to_idxs.items():
            # ignore wiki and books during retrieval
            mask = np.ones(total_lines).astype(bool)
            mask[idxs] = False
            global_mask = global_mask & mask

    # merge logratio chunks
    logratios_file = retrieved_dir / 'logratios.npy'
    chunk_dir = Path(ds_dir) / f'logratio_chunks_ngrams{args.ngrams}_buckets{args.num_buckets}'
    if not logratios_file.exists():
        logratios = []
        for i in subsets:
            curr_logratios_file = chunk_dir / f'{i}.npy'
            logratios.append(np.load(curr_logratios_file))
        logratios = np.concatenate(logratios)
        np.save(logratios_file, logratios)
    else:
        logratios = np.load(logratios_file)

    assert(len(logratios) == total_lines)

    # noise the logratios
    logratios = logratios[global_mask]
    logratios += np.random.gumbel(size=len(logratios))

    nonzero_idxs = np.where(global_mask)[0]
    # choose top k
    chosen_idxs = np.argpartition(-logratios, num_to_retrieve)[:num_to_retrieve]
    chosen_idxs = nonzero_idxs[chosen_idxs]

    if args.ds_name == 'wiki_and_books':
        # add in some wikipedia and bookcorpus
        all_domain_idxs = []
        for domain, idxs in domain_to_idxs.items():
            # add 2 million from each domain (1million packed)
            if domain == 'Wikipedia (en)':
                num_to_add = 2000000
            else:
                num_to_add = 2000000 // 3
            np.random.shuffle(idxs)
            domain_chosen_idxs = idxs[:num_to_add]
            all_domain_idxs.append(domain_chosen_idxs)
        chosen_idxs = np.concatenate([chosen_idxs] + all_domain_idxs)

    global_mask = np.zeros(len(global_mask)).astype(bool)
    global_mask[chosen_idxs] = True

    del logratios
    del nonzero_idxs
    del chosen_idxs

    combined_streaming_ds = load_dataset(
            'json',
            data_files=dsname_to_args['pile']['task_name'],
            streaming=True)['train']

    prev_ex = None
    with open(retrieved_path_cache, 'w') as fout:
        for i, curr_ex in tqdm(enumerate(combined_streaming_ds), miniters=1000000, total=total_lines):
            if global_mask[i]:
                if args.pack_every_2_examples and prev_ex is not None:
                    prev_ex['contents'] += curr_ex['contents']
                    prev_ex['metadata']['pile_set_name'] = [
                        prev_ex['metadata']['pile_set_name'],
                        curr_ex['metadata']['pile_set_name']]
                    fout.write(json.dumps(prev_ex).strip() + '\n')
                    prev_ex = None
                elif args.pack_every_2_examples and prev_ex is None:
                    prev_ex = curr_ex
                else:
                    fout.write(json.dumps(curr_ex).strip() + '\n')

    shutil.move(retrieved_path_cache, retrieved_path)


def linecount(filename):
    out = subprocess.Popen(['wc', '-l', filename],
                           stdout=subprocess.PIPE).communicate()[0]
    return int(out.strip().partition(b' ')[0])


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Data selection with DSIR')
    parser.add_argument('--pile_path', type=str, help='path to pile')
    parser.add_argument('--ds_name', type=str, help='dataset name')
    parser.add_argument('--output_dir', type=str, help='output path')
    parser.add_argument('--num_to_retrieve', type=int, default=25600000, help='Number of examples to retrieve')
    parser.add_argument('--cache_dir', type=str,
                        help='cache directory for datasets')
    parser.add_argument('--num_proc', type=int, help='number of threads')
    parser.add_argument('--overwrite', action='store_true', help='Overwrite importance weights')
    parser.add_argument('--overwrite_preprocess', action='store_true', help='overwrite data preprocessing')
    parser.add_argument('--ngrams', type=int, default=2, help='N in N-grams. 2 means both unigram and bigram.')
    parser.add_argument('--num_buckets', type=int, default=10000, help='number of ngram hash buckets')
    parser.add_argument('--pipeline_step', type=str, help='which step of pipeline to run. (imporance_weights, resample)')
    parser.add_argument('--chunk_idx', type=str, default='01', help='which chunk of prediction')
    parser.add_argument('--num_chunks', type=int, default=29, help='Number of chunks')
    parser.add_argument('--qualityfilter', action='store_true', help='whether to implement quality filtering')
    parser.add_argument('--pack_every_2_examples', action='store_true', help='whether to pack two examples together to get longer examples')
    args = parser.parse_args()
    random.seed(121)

    chunked_dir = 'chunked'
    dsname_to_args['pile_val'].update(
            {'task_name': [f'{args.pile_path}/{chunked_dir}/VAL_128/val_128.json'],
             'quality_scores': f'{args.pile_path}/{chunked_dir}/VAL_128/val_128.json_qualityscores.npz'}
             )

    #    XXX: if using fewer subsets of the Pile, please change the subsets list and the total_lines variable.
    #    We provde an example below (note that using linecount can take a long time for large numbers of subsets - we suggest running this once and hardcoding the number):
    # subsets = ['00']
    # total_lines = sum([linecount(f'{args.pile_path}/{chunked_dir}/{subset}_128/{subset}_128.json') for subset in subsets])
    # dsname_to_args['pile']['total_lines'] = total_lines

    dsname_to_args['pile'].update(
        {'task_name': [f'{args.pile_path}/{chunked_dir}/{subset}_128/{subset}_128.json' for subset in subsets],
         'quality_scores': f'{args.pile_path}/{chunked_dir}/combined_all.json_qualityscores.npz'}
            )

    cache_ds_dir = Path(args.cache_dir) / 'ngram_cache' / args.ds_name
    cache_ds_dir.mkdir(exist_ok=True, parents=True)
    ds_dir = Path(args.output_dir) / args.ds_name
    ds_dir.mkdir(exist_ok=True, parents=True)

    if args.ds_name == 'wiki_and_books':
        filter_domains = {'Wikipedia (en)', 'BookCorpus2'}
        ds_ngram_dist = compute_ngrams_pile(
                path=dsname_to_args['pile_val']['task_name'][0],
                ngrams=args.ngrams,
                num_buckets=args.num_buckets,
                filter_domains=filter_domains,
                )
        ds_ngram_dist = ds_ngram_dist / ds_ngram_dist.sum()
    else:
        ds_ngram_dist = compute_ngrams_hf(args.ds_name, ds_dir, cache_ds_dir, ngrams=args.ngrams, num_buckets=args.num_buckets)
        ds_ngram_dist = ds_ngram_dist / ds_ngram_dist.sum()

    pile_dist = compute_ngrams_pile(
            path=dsname_to_args['pile_val']['task_name'][0],
            ngrams=args.ngrams,
            num_buckets=args.num_buckets,
            )
    pile_dist = pile_dist / pile_dist.sum()

    if args.pipeline_step == 'importance_weights':
        _ = compute_importance_weights(
                path=f"{args.pile_path}/{chunked_dir}/{args.chunk_idx}_128/{args.chunk_idx}_128.json",
                ds_dir=ds_dir,
                chunk_idx=args.chunk_idx,
                target_dist=ds_ngram_dist,
                pile_dist=pile_dist,
                ngrams=args.ngrams,
                num_buckets=args.num_buckets,
                )
    elif args.pipeline_step == 'resample':
        resample(ds_dir, cache_ds_dir, args.num_to_retrieve)


================================================
FILE: experimental/data_selection/heuristic_cls_pipeline.py
================================================
from pathlib import Path
import os
import random
import argparse
import json
import shutil
from multiprocessing import Pool
from collections import defaultdict

from tqdm import tqdm
from nltk import word_tokenize
import numpy as np
from datasets import load_dataset
import fasttext

from dsir_pipeline import (
    dsname_to_args,
    subsets,
    get_quality_mask,
    linecount,

        )


def transform_text(text):
    return ' '.join(word_tokenize(text.lower()))


def batch_process(e, text_cols, label_col, fixed_label=None):
    sent = ' '.join([e[col] for col in text_cols])
    if fixed_label is not None:
        label = fixed_label
    else:
        label = e[label_col]
    sent = transform_text(sent)

    text = f'__label__{label} , {sent}'
    return {'text': text}


def reformat_dataset(ds_name, output_dir, cache_dir, num_proc=10, fixed_label=None, filter_domains=None):
    if args.qualityfilter and ds_name == 'pile':
        if filter_domains is not None:
            ds_output_dir = Path(output_dir) / (ds_name + '_qf_wikiandbooks')
        else:
            ds_output_dir = Path(output_dir) / (ds_name + '_qf')

        ds_output_dir.mkdir(exist_ok=True)

        quality_scores = np.load(dsname_to_args['pile']['quality_scores'])
        quality_mask = get_quality_mask(quality_scores)
    else:
        ds_output_dir = Path(output_dir) / ds_name
        ds_output_dir.mkdir(exist_ok=True)

    config = dsname_to_args[ds_name]
    text_cols, label_col = config["columns"], config["label"]
    print(text_cols, label_col)

    if ds_name != 'pile':
        ds = load_dataset(config["dataset_name"], config["task_name"],
                          cache_dir=cache_dir)
    else:
        ds = load_dataset(config["dataset_name"], data_files=config["task_name"],
                          cache_dir=cache_dir)

    for split in ds:
        print(split)
        if (ds_output_dir / f"{split}.txt").exists() and not args.overwrite_preprocess:
            continue
        split_ds = ds[split]

        column_names = list(iter(split_ds).__next__().keys())
        if ds_name == 'pile':
            column_names = [c for c in column_names if c != 'metadata']

        with open(f'{ds_output_dir}/{split}.txt', 'w') as f:
            for i, ex in tqdm(enumerate(split_ds)):
                if ds_name == 'pile':
                    if args.qualityfilter:
                        if not quality_mask[i]:
                            continue
                    domain = ex['metadata']['pile_set_name']
                    if filter_domains is not None and domain not in filter_domains:
                        continue

                sent = ' '.join([ex[col] for col in text_cols])
                if fixed_label is not None:
                    label = fixed_label
                else:
                    label = ex[label_col]
                sent = transform_text(sent)

                text = f'__label__{label} , {sent}'
                f.write(json.dumps({'text': text}) + '\n')

    return ds_output_dir


def replace_label(line, label):
    toks = line.split()
    new_label = '__label__' + str(label)

    rest = ' '.join(toks[2:])

    return f"{new_label} , {rest}"


def mix_dataset(ds_dir, pile_val_dir):
    """Interleave 2 datasets for fasttext classification: the target dataset ds_dir and the pile validation set"""
    ds_dir = Path(ds_dir)
    ds2_path = pile_val_dir / "train.txt"

    split = 'train'
    mixed_path_train = ds_dir / 'mixed-train.txt'
    mixed_path_val = ds_dir / 'mixed-val.txt'
    mixed_path_train_cache = cache_ds_dir / 'mixed-train.txt'
    mixed_path_val_cache = cache_ds_dir / 'mixed-val.txt'

    if args.qualityfilter:
        mixed_path_train = mixed_path_train.parent / f"{mixed_path_train.stem}-qf.txt"
        mixed_path_val = mixed_path_val.parent / f"{mixed_path_val.stem}-qf.txt"
        mixed_path_train_cache = mixed_path_train_cache.parent / f"{mixed_path_train_cache.stem}-qf.txt"
        mixed_path_val_cache = mixed_path_val_cache.parent / f"{mixed_path_val_cache.stem}-qf.txt"

    ds1_path = ds_dir / f"{split}.txt"

    if (mixed_path_train.exists() and mixed_path_val.exists()) and not args.overwrite_preprocess:
        return mixed_path_train, mixed_path_val

    num_lines1 = linecount(ds1_path)
    num_lines2 = linecount(ds2_path)

    num_lines = min(num_lines1, num_lines2)
    num_train = int(num_lines * 0.9)

    counter = 0
    with open(mixed_path_train_cache, 'w') as fout:
        with open(mixed_path_val_cache, 'w') as fout_val:
            with open(ds1_path, 'r') as f1:
                with open(ds2_path, 'r') as f2:
                    for line1, line2 in tqdm(zip(f1, f2), miniters=100000):
                        if counter < num_train:
                            if counter > 0:
                                fout.write('\n')
                            fout.write(replace_label(line1.strip(), label=1) + '\n')
                            fout.write(replace_label(line2.strip(), label=0))
                        else:
                            if counter - num_train > 0:
                                fout_val.write('\n')
                            fout_val.write(replace_label(line1.strip(), label=1) + '\n')
                            fout_val.write(replace_label(line2.strip(), label=0))
                        counter += 1
    shutil.move(mixed_path_train_cache, mixed_path_train)
    shutil.move(mixed_path_val_cache, mixed_path_val)
    return mixed_path_train, mixed_path_val


def prepare_fasttext_dataset(ds_dir):
    ds_dir = Path(ds_dir)
    split = 'train'
    ds1_path = ds_dir / f"{split}.txt"

    train_file = ds_dir / 'fasttext-train.txt'
    val_file = ds_dir / 'fasttext-val.txt'

    if (train_file.exists() and val_file.exists()) and not args.overwrite_preprocess:
        return train_file, val_file

    train_cache = cache_ds_dir / 'fasttext-train.txt'
    val_cache = cache_ds_dir / 'fasttext-val.txt'

    num_lines = linecount(ds1_path)
    num_train = int(num_lines * 0.9)

    counter = 0
    with open(train_cache, 'w') as fout:
        with open(val_cache, 'w') as fout_val:
            with open(ds1_path, 'r') as f:
                for line in tqdm(f, miniters=1000000):
                    if counter < num_train:
                        if counter > 0:
                            fout.write('\n')
                        fout.write(line.strip())
                    else:
                        if counter - num_train > 0:
                            fout_val.write('\n')
                        fout_val.write(line.strip())
                    counter += 1
    shutil.move(train_cache, train_file)
    shutil.move(val_cache, val_file)
    return train_file, val_file


def make_prediction(line, model):
    example = json.loads(line)
    transformed_example = transform_text(example['contents'])
    prediction = model.predict(transformed_example)

    label = int(prediction[0][0].split('__label__')[1])
    prob = np.amax(prediction[1])

    if label == 0:
        prob = 1 - prob

    return prob


model = None
def process(line):
    return make_prediction(line, model)


def make_prediction_chunk(ds_path, model_path, chunk_idx):
    global model
    model = fasttext.load_model(str(model_path))
    probs = []

    num_cpus = len(os.sched_getaffinity(0))
    pool = Pool(num_cpus)
    with open(ds_path, 'r') as f:
        for prob in pool.imap(process, f, chunksize=100000):
            probs.append(prob)

            if len(probs) % 100000 == 0:
                print(len(probs), flush=True)
    return np.asarray(probs)


def predict_chunk(model_path, ds_dir, chunk_idx):

    retrieved_dir = ds_dir / 'heuristic_cls_retrieved'
    retrieved_path_cache = cache_ds_dir / 'heuristic_cls_retrieved.jsonl'

    if args.word_vectors is not None:
        retrieved_dir = retrieved_dir.parent / f"{retrieved_dir.stem}_wordvecs"
        retrieved_path_cache = retrieved_path_cache.parent / f"{retrieved_path_cache.stem}_wordvecs.jsonl"

    if args.ngrams != 2:
        retrieved_dir = retrieved_dir.parent / f"{retrieved_dir.stem}_ngrams{args.ngrams}"
        retrieved_path_cache = retrieved_path_cache.parent / f"{retrieved_path_cache.stem}_ngrams{args.ngrams}.jsonl"

    if args.qualityfilter:
        retrieved_dir = retrieved_dir.parent / f"{retrieved_dir.stem}_qf"
        retrieved_path_cache = retrieved_path_cache.parent / f"{retrieved_path_cache.stem}_qf.jsonl"

    retrieved_dir.mkdir(parents=True, exist_ok=True)

    # run through dataset once to make the predictions
    ds_path = dsname_to_args['pile']['task_name']

    chunk_dir = retrieved_dir / 'chunks_bysubset'
    chunk_dir.mkdir(exist_ok=True)
    if args.word_vectors is not None:
        probabilities_file = chunk_dir / f'pred_probs_wordvecs_{chunk_idx}.npy'
    else:
        probabilities_file = chunk_dir / f'pred_probs_{chunk_idx}.npy'

    if not probabilities_file.exists() or args.overwrite:
        probabilities = make_prediction_chunk(ds_path, model_path, chunk_idx)
        np.save(probabilities_file, probabilities)


def compute_domain_idxs(filter_domains):
    ds_path = dsname_to_args['pile']['task_name']

    domain_to_idxs = defaultdict(list)
    todo_domains = []
    for domain in filter_domains:
        domain_idxs_path = Path(ds_path).parent / f"{domain.replace(' ', '_')}_idxs.npy"
        if not domain_idxs_path.exists():
            todo_domains.append(domain)

    combined_streaming_ds = load_dataset(
            'json',
            data_files=ds_path,
            streaming=True)['train']

    todo_domains = set(todo_domains)
    if len(todo_domains) > 0:
        for i, ex in tqdm(enumerate(combined_streaming_ds), miniters=1000000):
            domain = ex["metadata"]["pile_set_name"]
            if domain in todo_domains:
                domain_to_idxs[domain].append(i)
        for domain, idxs in domain_to_idxs.items():
            np.save(Path(ds_path).parent / f"{domain.replace(' ', '_')}_idxs.npy", np.asarray(idxs))

    for domain in filter_domains:
        domain_idxs_path = Path(ds_path).parent / f"{domain.replace(' ', '_')}_idxs.npy"
        domain_idxs = np.load(domain_idxs_path)
        domain_to_idxs[domain] = domain_idxs

    return domain_to_idxs


def retrieve_from_pile(model_path, num_to_retrieve, ds_dir):

    retrieved_dir = ds_dir / 'heuristic_cls_retrieved'
    retrieved_path_cache = cache_ds_dir / 'heuristic_cls_retrieved_{num_to_retrieve}.jsonl'

    if args.word_vectors is not None:
        retrieved_dir = retrieved_dir.parent / f"{retrieved_dir.stem}_wordvecs"
        retrieved_path_cache = retrieved_path_cache.parent / f"{retrieved_path_cache.stem}_wordvecs.jsonl"

    if args.ngrams != 2:
        retrieved_dir = retrieved_dir.parent / f"{retrieved_dir.stem}_ngrams{args.ngrams}"
        retrieved_path_cache = retrieved_path_cache.parent / f"{retrieved_path_cache.stem}_ngrams{args.ngrams}.jsonl"

    if args.qualityfilter:
        retrieved_dir = retrieved_dir.parent / f"{retrieved_dir.stem}_qf"
        retrieved_path_cache = retrieved_path_cache.parent / f"{retrieved_path_cache.stem}_qf.jsonl"

    retrieved_dir.mkdir(parents=True, exist_ok=True)
    retrieved_path = retrieved_dir / f'heuristic_cls_retrieved_{num_to_retrieve}.jsonl'

    retrieved_path = retrieved_path.parent / f"{retrieved_path.stem}_retrievemode{args.retrieval_mode}.jsonl"
    retrieved_path_cache = retrieved_path_cache.parent / f"{retrieved_path_cache.stem}_retrievemode{args.retrieval_mode}.jsonl"

    if args.pack_every_2_examples:
        retrieved_path = retrieved_path.parent / f"{retrieved_path.stem}_pack.jsonl"
        retrieved_path_cache = retrieved_path_cache.parent / f"{retrieved_path_cache.stem}_pack.jsonl"
    else:
        retrieved_path = retrieved_path.parent / f"{retrieved_path.stem}_nopack.jsonl"
        retrieved_path_cache = retrieved_path_cache.parent / f"{retrieved_path_cache.stem}_nopack.jsonl"

    ds_path = dsname_to_args['pile']['task_name']
    total_lines = dsname_to_args['pile']['total_lines']

    if args.word_vectors is not None:
        probabilities_file = retrieved_dir / 'pred_probs_wordvecs.npy'
    else:
        probabilities_file = retrieved_dir / 'pred_probs.npy'

    if not probabilities_file.exists() or args.overwrite:
        chunk_dir = retrieved_dir / 'chunks_bysubset'
        probabilities_ls = []
        for i in subsets:
            if args.word_vectors is not None:
                probabilities_file_chunk = chunk_dir / f'pred_probs_wordvecs_{i}.npy'
            else:
                probabilities_file_chunk = chunk_dir / f'pred_probs_{i}.npy'

            probabilities_ls.append(np.load(probabilities_file_chunk))
        probabilities = np.concatenate(probabilities_ls)

        np.save(probabilities_file, probabilities)
    else:
        probabilities = np.load(probabilities_file)

    assert(len(probabilities) == total_lines)

    if args.qualityfilter:
        quality_scores = np.load(dsname_to_args['pile']['quality_scores'])
        global_mask = get_quality_mask(quality_scores)
    else:
        global_mask = np.ones(total_lines).astype(bool)

    if args.ds_name == 'wiki_and_books':
        # compute the wikipedia and book masks and filter out
        filter_domains = ['Wikipedia (en)', 'BookCorpus2', 'Books3', 'Gutenberg (PG-19)']
        domain_to_idxs = compute_domain_idxs(filter_domains)
        for domain, idxs in domain_to_idxs.items():
            mask = np.ones(total_lines).astype(bool)
            mask[idxs] = False
            global_mask = global_mask & mask

    num_to_retrieve = min(num_to_retrieve, len(probabilities))

    def retrieve_mask(num, probabilities, mask=None):
        if mask is not None:
            mask = mask & global_mask
        else:
            mask = global_mask.copy()

        nonzero_idxs = np.where(mask)[0]

        if num <= mask.sum():
            if args.retrieval_mode == 'topk':
                chosen_idxs = np.argpartition(-probabilities[mask], num)[:num]
            elif args.retrieval_mode == 'pareto':
                pareto_rand_mask = np.zeros(len(nonzero_idxs)).astype(bool)
                masked_probs = probabilities[mask]
                while pareto_rand_mask.sum() < num:
                    rand = np.random.pareto(9, size=len(masked_probs))
                    pareto_rand_mask = pareto_rand_mask | (rand > (1 - masked_probs))
                    print("Pareto rand mask sum: ", pareto_rand_mask.sum())
                chosen_idxs = np.where(pareto_rand_mask)[0]
                np.random.shuffle(chosen_idxs)
                chosen_idxs = chosen_idxs[:num]

            else:
                raise ValueError("not implemented")

            chosen_idxs = nonzero_idxs[chosen_idxs]

            if args.ds_name == 'wiki_and_books':
                # add in some wikipedia and bookcorpus
                all_domain_idxs = []
                for domain, idxs in domain_to_idxs.items():
                    # add 2 million from each domain
                    if domain == 'Wikipedia (en)':
                        num_to_add = args.num_wiki_and_books // 2
                    else:
                        num_to_add = args.num_wiki_and_books // 6
                    np.random.shuffle(idxs)
                    domain_chosen_idxs = idxs[:num_to_add]
                    all_domain_idxs.append(domain_chosen_idxs)
                chosen_idxs = np.concatenate([chosen_idxs] + all_domain_idxs)

            new_mask = np.zeros(len(mask)).astype(bool)
            new_mask[chosen_idxs] = True
            mask = new_mask
        else:
            pass
        return mask

    mask = retrieve_mask(num_to_retrieve, probabilities)

    prev_line = None
    with open(retrieved_path_cache, 'w') as fout:
        with open(ds_path, 'r') as f:
            for i, line in tqdm(enumerate(f), total=total_lines, miniters=1000000):
                if mask[i]:
                    if args.pack_every_2_examples and prev_line is not None:
                        prev_ex = json.loads(prev_line)
                        curr_ex = json.loads(line)
                        prev_ex['contents'] += curr_ex['contents']
                        prev_ex['metadata']['pile_set_name'] = [
                            prev_ex['metadata']['pile_set_name'],
                            curr_ex['metadata']['pile_set_name']]
                        fout.write(json.dumps(prev_ex).strip() + '\n')
                        prev_line = None
                    elif args.pack_every_2_examples and prev_line is None:
                        prev_line = line
                    else:
                        example = json.loads(line)
                        example['score'] = probabilities[i]
                        fout.write(json.dumps(example).strip() + '\n')
    shutil.move(retrieved_path_cache, retrieved_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='reformat datasets into fasttext and train classifier')
    parser.add_argument('--ds_name', type=str, help='dataset name')
    parser.add_argument('--output_dir', type=str, help='output path')
    parser.add_argument('--pile_path', type=str, help='path to pile')
    parser.add_argument('--num_to_retrieve', type=int, default=8192000,
                        help='amount of data to retrieve')
    parser.add_argument('--cache_dir', type=str,
                        help='cache directory for datasets')
    parser.add_argument('--word_vectors', default=None, type=str, help='path to word vectors. if None, word vectors are randomly initialized')
    parser.add_argument('--num_proc', type=int, default=2, help='number of threads')
    parser.add_argument('--overwrite', action='store_true', help='overwrite fasttext model and outputs')
    parser.add_argument('--overwrite_preprocess', action='store_true', help='overwrite data preprocessing')
    parser.add_argument('--ngrams', type=int, default=2, help='number of ngrams')
    parser.add_argument('--pipeline_step', type=str, help='which step of pipeline')
    parser.add_argument('--chunk_idx', type=str, default='01', help='which chunk of prediction')
    parser.add_argument('--retrieval_mode', default='pareto', type=str, help='type of retrieval')
    parser.add_argument('--qualityfilter', action='store_true', help='whether to implement quality filtering')
    parser.add_argument('--retrain_model', action='store_true', help='whether to retrain the model')
    parser.add_argument('--pack_every_2_examples', action='store_true', help='whether to pack')
    parser.add_argument('--num_wiki_and_books', type=int, default=4000000, help='number of random eaxmples from wikipedia and books')
    args = parser.parse_args()
    random.seed(121)

    chunked_dir = 'chunked'
    dsname_to_args['pile_val'].update(
            {'task_name': [f'{args.pile_path}/{chunked_dir}/VAL_128/val_128.json'],
             'quality_scores': f'{args.pile_path}/{chunked_dir}/VAL_128/val_128.json_qualityscores.npz'}
             )

    dsname_to_args['pile'].update(
        {'task_name': [f'{args.pile_path}/{chunked_dir}/{subset}_128/{subset}_128.json' for subset in subsets],
         'quality_scores': f'{args.pile_path}/{chunked_dir}/combined_all.json_qualityscores.npz'}
            )

    if args.ds_name == 'wiki_and_books':
        filter_domains = ['Wikipedia (en)', 'BookCorpus2']
        ds_dir = reformat_dataset('pile', args.output_dir, args.cache_dir,
                                  num_proc=args.num_proc, fixed_label="1",
                                  filter_domains=filter_domains)
    else:
        ds_dir = reformat_dataset(args.ds_name, args.output_dir, args.cache_dir,
                                  num_proc=args.num_proc)
    cache_ds_dir = Path(args.cache_dir) / 'retrieval_cache' / args.ds_name
    cache_ds_dir.mkdir(exist_ok=True, parents=True)

    pile_val_dir = reformat_dataset('pile', args.output_dir, args.cache_dir,
                                    num_proc=args.num_proc, fixed_label="0")

    train_file, val_file = mix_dataset(ds_dir, pile_val_dir)

    if args.word_vectors is not None:
        model_path = ds_dir / 'model_wordvecs.bin'
    else:
        model_path = ds_dir / 'model.bin'

    if args.ngrams != 2:
        model_path = model_path.parent / f'{model_path.stem}_{args.ngrams}.bin'

    if args.qualityfilter:
        model_path = model_path.parent / f"{model_path.stem}_qf.bin"

    if args.pipeline_step == 'model':
        if not model_path.exists() or args.overwrite_preprocess or args.retrain_model:
            fasttext_opts = {
                    'input': str(train_file),
                    'wordNgrams': args.ngrams,
                    'dim': 300,
                    'thread': args.num_proc,
                    'autotuneDuration': 1800,
                    'autotuneValidationFile': str(val_file)}

            fasttext_opts['pretrainedVectors'] = args.word_vectors

            print(fasttext_opts)

            model = fasttext.train_supervised(**fasttext_opts)
            model.save_model(str(model_path))

            n_samples, precision, recall = model.test(str(val_file))
            print("Precision:", precision)
            print("Recall:", recall)
            print("F1:", 2 * (precision * recall) / (precision + recall))
            del model

    elif args.pipeline_step.startswith('predict'):
        predict_chunk(model_path, ds_dir, args.chunk_idx)
    elif args.pipeline_step == 'retrieve':
        retrieve_from_pile(model_path, args.num_to_retrieve, ds_dir)
    else:
        raise ValueError('not implemented')


================================================
FILE: experimental/data_selection/run_cmds.sh
================================================
#!/bin/bash
num_buckets=10000
ngrams=2

#### target = books and wiki.
#### for the wiki_and_books target, we add 4M additional examples from the wiki and books related subsets of the Pile.
#### pack_every_2_examples reduces the number of examples by a factor of 2

TARGET='wiki_and_books'
run_name=${TARGET}_all_ngrammatching_ngrams${ngrams}_buckets${num_buckets}_qf
bash run_dsir.sh ${TARGET} ${run_name} " --qualityfilter --ngrams ${ngrams} --num_buckets ${num_buckets} --pack_every_2_examples" 98400000



================================================
FILE: experimental/data_selection/run_dsir.sh
================================================
#!/bin/bash
source config.sh


task=$1
run_name=$2
other_args=$3
num_to_retrieve=${4:-25000000}

LOGDIR=logs/data_selection/dsir/${run_name}
mkdir -p ${LOGDIR}

NUM_CHUNKS=29
CHUNK_IDX=01
predict_jid=$(sbatch \
        --parsable \
        --mem 5G \
        ${cluster_info} \
        -c 2 \
        --output ${LOGDIR}/prepare_${CHUNK_IDX} \
        data_selection/run_dsir_helper.sh ${task} "--pipeline_step prepare --chunk_idx ${CHUNK_IDX} --num_chunks ${NUM_CHUNKS} ${other_args} --num_proc 16")

dependency="--dependency afterok"
for CHUNK_IDX in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29;
do
    predict_jid=$(sbatch \
            --parsable \
            --mem 5G \
            ${cluster_info} \
            -c 2 \
            --output ${LOGDIR}/predict_${CHUNK_IDX} \
            data_selection/run_dsir_helper.sh ${task} "--pipeline_step importance_weights --chunk_idx ${CHUNK_IDX} --num_chunks ${NUM_CHUNKS} ${other_args} --num_proc 16")
    echo -n "${predict_jid} "
    dependency="${dependency}:${predict_jid}"
done

jid=$(sbatch \
        --parsable \
        ${cluster_info} \
        --mem 48G \
        ${dependency} \
        -c 4 \
        --output ${LOGDIR}/retrieve_${num_to_retrieve} \
        run_dsir_helper.sh ${task} "--pipeline_step resample ${other_args} --num_proc 8 --num_to_retrieve ${num_to_retrieve}")
echo -n "${jid} "



================================================
FILE: experimental/data_selection/run_dsir_helper.sh
================================================
#!/bin/bash

set -x

source config.sh

source ${VIRTUAL_ENV}/bin/activate
mkdir -p $CACHE
export HF_HOME=$CACHE
export TRANSFORMERS_CACHE=$CACHE
export HF_DATASETS_CACHE=$CACHE
export HF_DATASETS_IN_MEMORY_MAX_SIZE=0
export TORCH_EXTENSIONS_DIR=$CACHE

task=$1
args=$2

output_dir=${DSIR_OUTPUT_DIR}
mkdir -p ${output_dir}

# important! set this environment variable to make sure the hash
# function is consistent across different machines.
export PYTHONHASHSEED=42

python dsir_pipeline.py \
    --pile_path ${PILE_PATH} \
    --ds_name ${task} \
    --output_dir ${output_dir} \
    --cache_dir ${CACHE} \
    ${args}


================================================
FILE: experimental/data_selection/run_heuristic_cls.sh
================================================
#!/bin/bash
set -x

source config.sh

source ${VIRTUAL_ENV}/bin/activate

task=$1
run_name=$2
other_args=$3
num_to_retrieve=${4:-25000000}

LOGDIR=logs/preprocessing/heuristic_cls/${run_name}
mkdir -p ${LOGDIR}

NUM_CHUNKS=29

jid=$(sbatch \
        --parsable \
        --mem 32G \
        ${cluster_info} \
        -c 4 \
        --output ${LOGDIR}/train_fasttext_model \
        preprocessing/run_heuristic_cls_helper.sh ${task} "--pipeline_step model ${other_args} --num_proc 4")
echo -n "${jid} "

dependency="--dependency afterok"
for CHUNK_IDX in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29;
do
    predict_jid=$(sbatch \
            --parsable \
            ${cluster_info} \
            --dependency ${jid} \
            --mem 20G \
            -c 1 \
            --output ${LOGDIR}/predict_${CHUNK_IDX} \
            preprocessing/run_heuristic_cls_helper.sh ${task} "--pipeline_step predict --chunk_idx ${CHUNK_IDX} --num_chunks ${NUM_CHUNKS} ${other_args} --num_proc 1")
    echo -n "${predict_jid} "
    dependency="${dependency}:${predict_jid}"
done

jid=$(sbatch \
        --parsable \
        --mem 100G \
        ${cluster_info} \
        ${dependency} \
        -c 4 \
        --output ${LOGDIR}/retrieve_${num_to_retrieve} \
        preprocessing/run_heuristic_cls_helper.sh ${task} "--pipeline_step retrieve ${other_args} --num_chunks ${NUM_CHUNKS} --num_proc 4 --num_to_retrieve ${num_to_retrieve}")
echo -n "${jid} "



================================================
FILE: experimental/data_selection/run_heuristic_cls_helper.sh
================================================
#!/bin/bash
set -x
source config.sh

mkdir -p $CACHE
export HF_HOME=$CACHE
export TRANSFORMERS_CACHE=$CACHE
export HF_DATASETS_CACHE=$CACHE
export HF_DATASETS_IN_MEMORY_MAX_SIZE=100000000000
export TORCH_EXTENSIONS_DIR=$CACHE

task=$1
args=$2

mkdir -p ${OUTPUT_DIR}

python heuristic_cls_pipeline.py \
    --pile_path ${PILE_PATH}
    --ds_name ${task} \
    --output_dir ${OUTPUT_DIR} \
    --cache_dir ${CACHE} \
    --word_vectors ${WORD_VECTORS_PATH} \
    ${args}




================================================
FILE: experimental/glue_eval/read_glue_results.py
================================================
import pandas as pd
from pathlib import Path
from collections import defaultdict
import json
import subprocess

task_to_col = {'QNLI': 'eval_accuracy',
               'STSB': 'eval_spearmanr',
               'MRPC': 'eval_accuracy',
               'COLA': 'eval_matthews_correlation',
               'RTE': 'eval_accuracy',
               'MNLI': 'eval_accuracy',
               'SST2': 'eval_accuracy',
               'QQP': 'eval_accuracy'}

def read_file(path, task_name):
    with open(path, 'r') as f:
        curr_res = json.load(f)
    eval_res = curr_res[task_to_col[task_name]]
    del curr_res[task_to_col[task_name]]
    curr_res['dev'] = eval_res
    return curr_res

def parse_file_name(name):
    toks = name.split('_')
    task_name = toks[0]

    for tok in toks[1:]:
        if tok.startswith('EPOCHS'):
            epochs = int(tok[6:])
        elif tok.startswith('BS'):
            bs = int(tok[2:])
        elif tok.startswith('LR'):
            lr = float(tok[2:])
        elif tok.startswith('seed'):
            seed = int(tok[4:])
    return {
        'task_name': task_name,
        'epochs': epochs,
        'bs': bs,
        'lr': lr,
        'seed': seed, }


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Read GLUE results')
    parser.add_argument('--results_dir', type=str, help="directory of GLUE results")
    args = parser.parse_args()

    res = []

    results_dir = Path(args.results_dir).resolve().expanduser()
    for trial_dir in results_dir.iterdir():
        if not trial_dir.is_dir():
            continue
        curr_res = parse_file_name(trial_dir.name)

        try:
            curr_res.update(read_file(trial_dir / f'eval_results.json', curr_res['task_name']))
            res.append(curr_res)
        except Exception:
            print(f"skipped: {curr_res}")

    df = pd.DataFrame(res)
    df['dev'] *= 100
    df = df.round({'dev': 2})
    grouped_df = df.groupby(['task_name', 'epochs', 'bs', 'lr']).agg({'dev': ['mean', 'std', 'median']})
    grouped_df.columns = ['mean', 'std', 'median']
    grouped_df = grouped_df.reset_index()
    # we only run one set of hyperparams per task in the paper, so this max-median selection is a no-op
    max_median_idx = grouped_df.groupby(['task_name', 'epochs', 'bs', 'lr'])['median'].transform(max) == grouped_df['median']
    df = grouped_df[max_median_idx]
    df = df.reset_index()
    df = df.round(2)
    df.to_csv(str(results_dir / 'glue_results_eval.tsv'), sep='\t', index=False)

    subprocess.run(f"head -n 50 {str(results_dir / 'glue_results_eval.tsv')}".split())



================================================
FILE: experimental/glue_eval/run_eval_exps.sh
================================================
#!/bin/bash

# example

bash glue_eval/run_glue_dist.sh \
    /path/to/trained/checkpoint \
    "bert-base-uncased" \
    128


================================================
FILE: experimental/glue_eval/run_glue.py
================================================
#!/usr/bin/env python
# coding=utf-8
# Copyright 2020 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Finetuning the library models for sequence classification on GLUE."""
# You can also adapt this script on your own text classification task. Pointers for this are left as comments.

import logging
import os
import random
import sys
from dataclasses import dataclass, field
from typing import Optional
import json
from argparse import Namespace
import uuid
from pathlib import Path

import datasets
import numpy as np
from datasets import load_dataset, load_metric

import transformers
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EvalPrediction,
    HfArgumentParser,
    PretrainedConfig,
    Trainer,
    TrainingArguments,
    default_data_collator,
    set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version
from transformers.utils.versions import require_version


# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
check_min_version("4.12.0")

require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt")

task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

logger = logging.getLogger(__name__)


@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.

    Using `HfArgumentParser` we can turn this class
    into argparse arguments to be able to specify them on
    the command line.
    """

    task_name: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the task to train on: " + ", ".join(task_to_keys.keys())},
    )
    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    max_seq_length: int = field(
        default=128,
        metadata={
            "help": "The maximum total input sequence length after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."}
    )
    pad_to_max_length: bool = field(
        default=True,
        metadata={
            "help": "Whether to pad all samples to `max_seq_length`. "
            "If False, will pad the samples dynamically when batching to the maximum length in the batch."
        },
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of training examples to this "
            "value if set."
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
            "value if set."
        },
    )
    max_predict_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this "
            "value if set."
        },
    )
    train_file: Optional[str] = field(
        default=None, metadata={"help": "A csv or a json file containing the training data."}
    )
    validation_file: Optional[str] = field(
        default=None, metadata={"help": "A csv or a json file containing the validation data."}
    )
    test_file: Optional[str] = field(default=None, metadata={"help": "A csv or a json file containing the test data."})

    def __post_init__(self):
        if self.task_name is not None:
            self.task_name = self.task_name.lower()
            if self.task_name not in task_to_keys.keys():
                raise ValueError("Unknown task, you should pick one in " + ",".join(task_to_keys.keys()))
        elif self.dataset_name is not None:
            pass
        elif self.train_file is None or self.validation_file is None:
            raise ValueError("Need either a GLUE task, a training/validation file or a dataset name.")
        else:
            train_extension = self.train_file.split(".")[-1]
            assert train_extension in ["csv", "json"], "`train_file` should be a csv or a json file."
            validation_extension = self.validation_file.split(".")[-1]
            assert (
                validation_extension == train_extension
            ), "`validation_file` should have the same extension (csv or json) as `train_file`."


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
            "with private models)."
        },
    )


def main():
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.

    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )

    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process the small summary:
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Training/evaluation parameters {training_args}")

    # Detecting last checkpoint.
    last_checkpoint = None
    if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
        if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
            raise ValueError(
                f"Output directory ({training_args.output_dir}) already exists and is not empty. "
                "Use --overwrite_output_dir to overcome."
            )
        elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
            logger.info(
                f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
                "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
            )

    # Set seed before initializing model.
    set_seed(training_args.seed)

    # Get the datasets: you can either provide your own CSV/JSON training and evaluation files (see below)
    # or specify a GLUE benchmark task (the dataset will be downloaded automatically from the datasets Hub).
    #
    # For CSV/JSON files, this script will use as labels the column called 'label' and as pair of sentences the
    # sentences in columns called 'sentence1' and 'sentence2' if such column exists or the first two columns not named
    # label if at least two columns are provided.
    #
    # If the CSVs/JSONs contain only one non-label column, the script does single sentence classification on this
    # single column. You can easily tweak this behavior (see below)
    #
    # In distributed training, the load_dataset function guarantee that only one local process can concurrently
    # download the dataset.
    if data_args.task_name is not None:
        # a known task

        # Downloading and loading a dataset from the hub.
        raw_datasets = load_dataset("glue", data_args.task_name, cache_dir=model_args.cache_dir)

    elif data_args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        raw_datasets = load_dataset(
            data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir
        )
    else:
        # Loading a dataset from your local files.
        # CSV/JSON training and evaluation files are needed.
        data_files = {"train": data_args.train_file, "validation": data_args.validation_file}

        # Get the test dataset: you can provide your own CSV/JSON test file (see below)
        # when you use `do_predict` without specifying a GLUE benchmark task.
        if training_args.do_predict:
            if data_args.test_file is not None:
                train_extension = data_args.train_file.split(".")[-1]
                test_extension = data_args.test_file.split(".")[-1]
                assert (
                    test_extension == train_extension
                ), "`test_file` should have the same extension (csv or json) as `train_file`."
                data_files["test"] = data_args.test_file
            else:
                raise ValueError("Need either a GLUE task or a test file for `do_predict`.")

        for key in data_files.keys():
            logger.info(f"load a local file for {key}: {data_files[key]}")

        if data_args.train_file.endswith(".csv"):
            # Loading a dataset from local csv files
            raw_datasets = load_dataset("csv", data_files=data_files, cache_dir=model_args.cache_dir)
        else:
            # Loading a dataset from local json files
            raw_datasets = load_dataset("json", data_files=data_files, cache_dir=model_args.cache_dir)
    # See more about loading any type of standard or custom dataset at
    # https://huggingface.co/docs/datasets/loading_datasets.html.

    # Labels
    if data_args.task_name is not None:
        is_regression = data_args.task_name == "stsb"
        if not is_regression:
            label_list = raw_datasets["train"].features["label"].names
            num_labels = len(label_list)
        else:
            num_labels = 1
    else:
        # Trying to have good defaults here, don't hesitate to tweak to your needs.
        is_regression = raw_datasets["train"].features["label"].dtype in ["float32", "float64"]
        if is_regression:
            num_labels = 1
        else:
            # A useful fast method:
            # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique
            label_list = raw_datasets["train"].unique("label")
            label_list.sort()  # Let's sort it for determinism
            num_labels = len(label_list)

    config = AutoConfig.from_pretrained(
        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
        num_labels=num_labels,
        finetuning_task=data_args.task_name,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        # use_auth_token=None,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=model_args.use_fast_tokenizer,
        revision=model_args.model_revision,
        # use_auth_token=True if model_args.use_auth_token else None,
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        # use_auth_token=True if model_args.use_auth_token else None,
        ignore_mismatched_sizes=True,
    )

    # Preprocessing the raw_datasets
    if data_args.task_name is not None:
        sentence1_key, sentence2_key = task_to_keys[data_args.task_name]
    else:
        # Again, we try to have some nice defaults but don't hesitate to tweak to your use case.
        non_label_column_names = [name for name in raw_datasets["train"].column_names if name != "label"]
        if "sentence1" in non_label_column_names and "sentence2" in non_label_column_names:
            sentence1_key, sentence2_key = "sentence1", "sentence2"
        else:
            if len(non_label_column_names) >= 2:
                sentence1_key, sentence2_key = non_label_column_names[:2]
            else:
                sentence1_key, sentence2_key = non_label_column_names[0], None

    # Padding strategy
    if data_args.pad_to_max_length:
        padding = "max_length"
    else:
        # We will pad later, dynamically at batch creation, to the max sequence length in each batch
        padding = False

    # Some models have set the order of the labels to use, so let's make sure we do use it.
    label_to_id = None
    if (
        model.config.label2id != PretrainedConfig(num_labels=num_labels).label2id
        and data_args.task_name is not None
        and not is_regression
    ):
        # Some have all caps in their config, some don't.
        label_name_to_id = {k.lower(): v for k, v in model.config.label2id.items()}
        if list(sorted(label_name_to_id.keys())) == list(sorted(label_list)):
            label_to_id = {i: int(label_name_to_id[label_list[i]]) for i in range(num_labels)}
        else:
            logger.warning(
                "Your model seems to have been trained with labels, but they don't match the dataset: ",
                f"model labels: {list(sorted(label_name_to_id.keys()))}, dataset labels: {list(sorted(label_list))}."
                "\nIgnoring the model labels as a result.",
            )
    elif data_args.task_name is None and not is_regression:
        label_to_id = {v: i for i, v in enumerate(label_list)}

    if label_to_id is not None:
        model.config.label2id = label_to_id
        model.config.id2label = {id: label for label, id in config.label2id.items()}
    elif data_args.task_name is not None and not is_regression:
        model.config.label2id = {l: i for i, l in enumerate(label_list)}
        model.config.id2label = {id: label for label, id in config.label2id.items()}

    if data_args.max_seq_length > tokenizer.model_max_length:
        logger.warning(
            f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the"
            f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}."
        )
    max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)

    def preprocess_function(examples):
        # Tokenize the texts
        args = (
            (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key])
        )
        result = tokenizer(*args, padding=padding, max_length=max_seq_length, truncation=True)

        # Map labels to IDs (not necessary for GLUE tasks)
        if label_to_id is not None and "label" in examples:
            result["label"] = [(label_to_id[l] if l != -1 else -1) for l in examples["label"]]
        return result

    with training_args.main_process_first(desc="dataset map pre-processing"):
        raw_datasets = raw_datasets.map(
            preprocess_function,
            batched=True,
            load_from_cache_file=not data_args.overwrite_cache,
            desc="Running tokenizer on dataset",
        )
    if training_args.do_train:
        if "train" not in raw_datasets:
            raise ValueError("--do_train requires a train dataset")
        train_dataset = raw_datasets["train"]
        if data_args.max_train_samples is not None:
            train_dataset = train_dataset.select(range(data_args.max_train_samples))

    if training_args.do_eval:
        if "validation" not in raw_datasets and "validation_matched" not in raw_datasets:
            raise ValueError("--do_eval requires a validation dataset")
        eval_dataset = raw_datasets["validation_matched" if data_args.task_name == "mnli" else "validation"]
        if data_args.max_eval_samples is not None:
            eval_dataset = eval_dataset.select(range(data_args.max_eval_samples))

    if training_args.do_predict or data_args.task_name is not None or data_args.test_file is not None:
        if "test" not in raw_datasets and "test_matched" not in raw_datasets:
            raise ValueError("--do_predict requires a test dataset")
        predict_dataset = raw_datasets["test_matched" if data_args.task_name == "mnli" else "test"]
        if data_args.max_predict_samples is not None:
            predict_dataset = predict_dataset.select(range(data_args.max_predict_samples))

    # Log a few random samples from the training set:
    if training_args.do_train:
        for index in random.sample(range(len(train_dataset)), 3):
            logger.info(f"Sample {index} of the training set: {train_dataset[index]}.")

    # Get the metric function
    if data_args.task_name is not None:
        metric = load_metric("glue", data_args.task_name)
    else:
        metric = load_metric("accuracy")

    # You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a
    # predictions and label_ids field) and has to return a dictionary string to float.
    def compute_metrics(p: EvalPrediction):
        preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
        preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)
        if data_args.task_name is not None:
            result = metric.compute(predictions=preds, references=p.label_ids)
            if len(result) > 1:
                result["combined_score"] = np.mean(list(result.values())).item()
            return result
        elif is_regression:
            return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
        else:
            return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}

    # Data collator will default to DataCollatorWithPadding, so we change it if we already did the padding.
    if data_args.pad_to_max_length:
        data_collator = default_data_collator
    elif training_args.fp16:
        data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)
    else:
        data_collator = None

    # Initialize our Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

    # Training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        metrics = train_result.metrics
        max_train_samples = (
            data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
        )
        metrics["train_samples"] = min(max_train_samples, len(train_dataset))

        trainer.save_model()  # Saves the tokenizer too for easy upload

        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()

    # Evaluation
    if training_args.do_eval:
        logger.info("*** Evaluate ***")

        # Loop to handle MNLI double evaluation (matched, mis-matched)
        tasks = [data_args.task_name]
        eval_datasets = [eval_dataset]
        if data_args.task_name == "mnli":
            tasks.append("mnli-mm")
            eval_datasets.append(raw_datasets["validation_mismatched"])

        for eval_dataset, task in zip(eval_datasets, tasks):
            metrics = trainer.evaluate(eval_dataset=eval_dataset)

            max_eval_samples = (
                data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset)
            )
            metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

            trainer.log_metrics("eval", metrics)
            trainer.save_metrics("eval", metrics)

    if training_args.do_predict:
        logger.info("*** Predict ***")

        # Loop to handle MNLI double evaluation (matched, mis-matched)
        tasks = [data_args.task_name]
        predict_datasets = [predict_dataset]
        if data_args.task_name == "mnli":
            tasks.append("mnli-mm")
            predict_datasets.append(raw_datasets["test_mismatched"])

        for predict_dataset, task in zip(predict_datasets, tasks):
            # Removing the `label` columns because it contains -1 and Trainer won't like that.
            predict_dataset = predict_dataset.remove_columns("label")
            predictions = trainer.predict(predict_dataset, metric_key_prefix="predict").predictions
            predictions = np.squeeze(predictions) if is_regression else np.argmax(predictions, axis=1)

            output_predict_file = os.path.join(training_args.output_dir, f"predict_results_{task}.txt")
            if trainer.is_world_process_zero():
                with open(output_predict_file, "w") as writer:
                    logger.info(f"***** Predict results {task} *****")
                    writer.write("index\tprediction\n")
                    for index, item in enumerate(predictions):
                        if is_regression:
                            writer.write(f"{index}\t{item:3.3f}\n")
                        else:
                            item = label_list[item]
                            writer.write(f"{index}\t{item}\n")

    kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-classification"}
    if data_args.task_name is not None:
        kwargs["language"] = "en"
        kwargs["dataset_tags"] = "glue"
        kwargs["dataset_args"] = data_args.task_name
        kwargs["dataset"] = f"GLUE {data_args.task_name.upper()}"

    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)
    else:
        trainer.create_model_card(**kwargs)


def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()


if __name__ == "__main__":
    main()


================================================
FILE: experimental/glue_eval/run_glue_dist.sh
================================================
#!/bin/bash
source config.sh

gpus=1
mem=8G
cpus=2

PRETRAINED_PATH=$1
BASENAME=${2:-"bert-base-uncased"}
MAX_LEN=${3:-512}

SAVE_PATH=${PRETRAINED_PATH}
mkdir -p ${SAVE_PATH}/logs


for SEED in 1 2 3 4 5; do
    TASK_NAME="MNLI"
    LR=1e-5
    EPOCHS=10
    BATCH_SIZE=32

    mnli_jid=$(sbatch \
            --parsable \
            --gres=gpu:${gpus} \
            --mem $mem \
            -c $cpus \
            --output ${SAVE_PATH}/logs/${TASK_NAME}_${EPOCHS}_${LR}_${BATCH_SIZE}_${SEED} \
            ${cluster_info} \
            glue_eval/run_glue_for_seed_task.sh ${PRETRAINED_PATH} ${TASK_NAME} ${SEED} ${EPOCHS} ${LR} ${BATCH_SIZE} ${CACHE} ${BASENAME})
    echo -n "${mnli_jid} "

    # Following RoBERTa paper, initialize RTE/MRPC/STS from MNLI
    for TASK_NAME in "RTE" "MRPC" "STSB"; do
        if [[ "${TASK_NAME}" = "RTE" ]]; then
            LR=2e-5
            EPOCHS=10
            BATCH_SIZE=16
        elif [[ "${TASK_NAME}" = "MRPC" ]]; then
            LR=1e-5
            EPOCHS=10
            BATCH_SIZE=16
        elif [[ "${TASK_NAME}" = "STSB" ]]; then
            LR=2e-5
            EPOCHS=10
            BATCH_SIZE=16
        fi
        jid=$(sbatch \
                --parsable \
                --dependency ${mnli_jid} \
                --gres=gpu:${gpus} \
                --mem $mem \
                -c $cpus \
                --output ${SAVE_PATH}/logs/${TASK_NAME}_${EPOCHS}_${LR}_${BATCH_SIZE}_${SEED} \
                ${cluster_info} \
                glue_eval/run_glue_for_seed_task.sh ${SAVE_PATH}/finetune-runs/MNLI_EPOCHS10_BS32_LR1e-5_seed${SEED} ${TASK_NAME} ${SEED} ${EPOCHS} ${LR} ${BATCH_SIZE} ${CACHE}  ${BASENAME} ${SAVE_PATH} )
        echo -n "${jid} "
    done
done

for TASK_NAME in "COLA"; do
LR=1e-5
EPOCHS=10
BATCH_SIZE=16
for SEED in 1 2 3 4 5; do
jid=$(sbatch \
        --parsable \
        --gres=gpu:${gpus} \
        --mem $mem \
        -c $cpus \
        --output ${SAVE_PATH}/logs/${TASK_NAME}_${EPOCHS}_${LR}_${BATCH_SIZE}_${SEED} \
        ${cluster_info} \
        glue_eval/run_glue_for_seed_task.sh ${PRETRAINED_PATH} ${TASK_NAME} ${SEED} ${EPOCHS} ${LR} ${BATCH_SIZE} ${CACHE} ${BASENAME})
echo -n "${jid} "
    done
done

for TASK_NAME in "QQP" "SST2" "QNLI"; do
if [[ "${TASK_NAME}" = "QQP" ]]; then
    LR=1e-5
    EPOCHS=10
    BATCH_SIZE=32
elif [[ "${TASK_NAME}" = "SST2" ]]; then
    LR=1e-5
    EPOCHS=10
    BATCH_SIZE=32
elif [[ "${TASK_NAME}" = "QNLI" ]]; then
    LR=1e-5
    EPOCHS=10
    BATCH_SIZE=32
fi

for SEED in 1 2 3 4 5; do
    jid=$(sbatch \
        --parsable \
        --gres=gpu:${gpus} \
        --mem $mem \
        -c $cpus \
        --output ${SAVE_PATH}/logs/${TASK_NAME}_${EPOCHS}_${LR}_${BATCH_SIZE}_${SEED} \
        ${cluster_info} \
        glue_eval/run_glue_for_seed_task.sh ${PRETRAINED_PATH} ${TASK_NAME} ${SEED} ${EPOCHS} ${LR} ${BATCH_SIZE} ${CACHE} ${BASENAME})
    echo -n "${jid} "
    done
done



================================================
FILE: experimental/glue_eval/run_glue_for_seed_task.sh
================================================
#!/bin/bash
set -x

PRETRAINED_PATH=$1
TASK_NAME=$2
SEED=$3
EPOCHS=$4
LR=$5
BATCH_SIZE=$6
CACHE=$7
BASENAME=${8:-"bert-base-uncased"}
SAVE_PATH=${9:-${PRETRAINED_PATH}}
MAX_LEN=${10:-128}

mkdir -p $CACHE
export HF_HOME=$CACHE
export TRANSFORMERS_CACHE=$CACHE
export HF_DATASETS_CACHE=$CACHE
export HF_DATASETS_IN_MEMORY_MAX_SIZE=0
export TORCH_EXTENSIONS_DIR=$CACHE

export WANDB_PROJECT="glue"


FINETUNE_PATH=${SAVE_PATH}/finetune-runs
mkdir -p $FINETUNE_PATH

RUN=${TASK_NAME}_EPOCHS${EPOCHS}_BS${BATCH_SIZE}_LR${LR}_seed${SEED}
RUN_DIR=${FINETUNE_PATH}/${RUN}
mkdir -p $RUN_DIR


ACCUM=1

if [[ $(ls ${RUN_DIR}/eval_results*) ]]; then
  echo "${RUN} finished"
else
    python glue_eval/run_glue.py \
        --fp16 \
        --model_name_or_path $PRETRAINED_PATH \
        --tokenizer_name ${BASENAME} \
        --config_name ${BASENAME} \
        --task_name $TASK_NAME \
        --do_train \
        --do_eval \
        --max_seq_length ${MAX_LEN} \
        --per_device_train_batch_size ${BATCH_SIZE} \
        --gradient_accumulation_steps ${ACCUM} \
        --num_train_epochs ${EPOCHS} \
        --run_name ${RUN_DIR} \
        --logging_steps 100 \
        --save_strategy epoch \
        --evaluation_strategy epoch \
        --learning_rate ${LR} \
        --seed ${SEED} \
        --output_dir ${RUN_DIR} \
        --logging_dir ${RUN_DIR}/tensorboard \
        --max_grad_norm 1.0 \
        --lr_scheduler_type polynomial \
        --weight_decay 0.1 \
        --load_best_model_at_end \
        --save_total_limit 1 \
        --adam_beta1 0.9 \
        --adam_beta2 0.98 \
        --adam_epsilon 1e-6 \
        --warmup_ratio 0.06 \
        --overwrite_output_dir
fi


================================================
FILE: experimental/preprocessing/quality_scores/compute_quality_stats.py
================================================
from pathlib import Path
import json
import numpy as np
import os
from itertools import islice
from joblib import Parallel, delayed
from nltk.tokenize import word_tokenize
from collections import Counter
from tqdm import tqdm
from multiprocessing import Pool

import string
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

stop = set(stopwords.words('english') + list(string.punctuation))
numeric = set(list(string.digits))


def transform_text(text):
    return ' '.join(word_tokenize(text.lower()))


def length_filter(x_tok):
    return len(x_tok)


def repeating_filter(x_tok):
    if len(x_tok) == 0:
        return 0
    counts = Counter(x_tok)
    ratio = (max(counts.values()) / len(x_tok))
    return ratio


def mostly_uninformative_filter(x_tok):
    if len(x_tok) == 0:
        return 0
    informative_ratio = (len([x for x in x_tok if x not in stop]) / len(x_tok))
    return informative_ratio


def numeric_filter(x_tok):
    if len(x_tok) == 0:
        return 0
    ratio = (len([x for x in x_tok if x not in numeric]) / len(x_tok))
    return ratio


filter_funcs = [length_filter, repeating_filter, mostly_uninformative_filter, numeric_filter]


def process(example):
    line_json = json.loads(example)
    text_tok = transform_text(line_json['contents']).split()
    return [fn(text_tok) for fn in filter_funcs]


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='compute statistic for quality filtering')
    parser.add_argument('--ds_path', type=str, help='path to jsonl dataset')
    parser.add_argument('--chunksize', type=int, default=100000, help='chunk size')
    parser.add_argument('--no_parallel', action='store_true', help='dont do in parallel')
    args = parser.parse_args()

    ds_path = Path(args.ds_path)
    num_cpus = len(os.sched_getaffinity(0))

    print(f"Num cpus: {num_cpus}")

    if args.no_parallel:
        scores = []
        with open(ds_path, 'r') as f:
            for line in f:
                processed = process(line)
                scores.append(processed)
                if len(scores) % 100000 == 0:
                    print(len(scores), flush=True)

    else:
        pool = Pool(num_cpus)
        chunk_size = args.chunksize
        scores = []
        with open(ds_path, 'r') as f:
            for processed in pool.imap(process, f, chunksize=chunk_size):
                scores.append(processed)
                if len(scores) % 100000 == 0:
                    print(len(scores), flush=True)
        pool.close()
    scores = np.asarray(scores)

    np.savez(str(ds_path.parent / (ds_path.name + '_qualityscores.npz')),
             length=scores[:, 0],
             repeating_ratio=scores[:, 1],
             informative_ratio=scores[:, 2],
             numeric_ratio=scores[:, 3])


================================================
FILE: experimental/preprocessing/quality_scores/merge_quality_scores.py
================================================
import subprocess
import numpy as np
import os


import argparse
parser = argparse.ArgumentParser(description='merge quality stats')
parser.add_argument('--pile_path', type=str, help='path to pile')
args = parser.parse_args()


subsets = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
           '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
           '21', '22', '23', '24', '25', '26', '27', '28', '29']

filenames_qualityscores = [f"{args.pile_path}/chunked/{subset}_128/{subset}_128.json_qualityscores.npz" for subset in subsets]


# merge the numpy
qualityscores = [np.load(fn) for fn in filenames_qualityscores]
print("quality scores loaded")
keys = qualityscores[0].keys()
concatted = {k: np.concatenate([qs[k] for qs in qualityscores], axis=0) for k in keys}

print("saving")

np.savez(f"{args.pile_path}/chunked/combined_all.json_qualityscores.npz",
         **concatted)


================================================
FILE: experimental/preprocessing/quality_scores/run_merge_quality_scores.sh
================================================
#!/bin/bash

source config.sh

python merge_quality_scores.py --pile_path ${PILE_PATH}


================================================
FILE: experimental/preprocessing/quality_scores/run_quality_stats.sh
================================================
#!/bin/bash

DS_PATH=$1

python preprocessing/quality_scores/compute_quality_stats.py --ds_path ${DS_PATH}


================================================
FILE: experimental/preprocessing/quality_scores/run_slurm_quality_stats.sh
================================================
#!/bin/bash

source config.sh

LOGDIR=logs/preprocessing/qualitystats
mkdir -p ${LOGDIR}

for SUBSET in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29; do
    jid=$(sbatch \
            --parsable \
            ${cluster_info} \
            --mem 24G \
            -c 16 \
            --output ${LOGDIR}/chunk_${SUBSET} \
            preprocessing/quality_scores/run_quality_stats.sh ${PILE_PATH}/chunked/${SUBSET}_128/${SUBSET}_128.json)
    echo -n "${jid} "
done


# validation data
jid=$(sbatch \
        --parsable \
        ${cluster_info} \
        --mem 24G \
        -c 16 \
        --output ${LOGDIR}/chunk_val \
        run_quality_stats.sh ${PILE_PATH}/chunked/VAL_128/val_128.json)
echo -n "${jid} "



================================================
FILE: experimental/preprocessing/reformat_and_chunk_data.py
================================================
import json
import numpy as np
from datasets import load_dataset
from argparse import ArgumentParser
import os
from pathlib import Path
from tqdm import tqdm

CHUNK_LENGTH=128

parser = ArgumentParser()
parser.add_argument('--input_dir', type=str)
parser.add_argument('--output_dir', type=str)
parser.add_argument('--input_filename', default="22.jsonl.zst", type=str)
parser.add_argument('--chunk_length', default=128, type=int)
parser.add_argument('--output_filename', default=None, type=str)
parser.add_argument('--cache_dir', type=str)


def chunk_examples(examples, chunk_length=CHUNK_LENGTH):
    chunks, metadata = [], []
    for sentence, meta in zip(examples['text'], examples['meta']):
        words = sentence.split(' ')
        curr_chunks = [' '.join(words[i:i + chunk_length]) for i in range(0, len(words), chunk_length)]
        chunks += curr_chunks
        metadata += [meta] * len(curr_chunks)
    return {'contents': chunks, 'metadata': metadata}

def add_id(examples, idx):
    examples['id'] = idx
    return examples

def main(args):
    CHUNK_LENGTH = args.chunk_length

    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    if args.output_filename is None:
        args.output_filename = args.input_filename.split('.')[0] + f'_{args.chunk_length}.json'

    print("Beginning dataset load")
    ds = load_dataset('json',
                      data_files=[f'{args.input_dir}/{args.input_filename}'],
                      cache_dir=args.cache_dir,
                      streaming=True)['train']

    column_names = list(next(iter(ds)).keys())

    ds = ds.map(chunk_examples, batched=True, remove_columns=column_names)

    print("Done loading dataset")
    ds = ds.map(add_id, batched=True, with_indices=True)

    print("Saving file")
    with open(Path(args.output_dir) / args.output_filename, 'w') as f:
        for ex in tqdm(iter(ds)):
            f.write(json.dumps(ex).strip() + '\n')

if __name__=="__main__":

    args = parser.parse_args()
    main(args)



================================================
FILE: experimental/preprocessing/run.sh
================================================
#!/bin/bash
source config.sh

source ${VIRTUAL_ENV}/bin/activate

mkdir -p $CACHE
export HF_HOME=$CACHE
export TRANSFORMERS_CACHE=$CACHE
export HF_DATASETS_CACHE=$CACHE
export HF_DATASETS_IN_MEMORY_MAX_SIZE=0
export TORCH_EXTENSIONS_DIR=$CACHE

ARGS=$1

python preprocessing/reformat_and_chunk_data.py ${ARGS}




================================================
FILE: experimental/preprocessing/run_slurm.sh
================================================
#!/bin/bash

source config.sh

LOGDIR=logs/preprocess
mkdir -p ${LOGDIR}

for SUBSET in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29; do
     jid=$(sbatch \
             --parsable \
             ${cluster_info} \
             --mem 48G \
             -c 16 \
             --output ${LOGDIR}/chunk_${SUBSET} \
             preprocessing/run.sh "--input_filename ${SUBSET}.jsonl.zst --chunk_length 128 --input_dir ${PILE_PATH} --output_dir ${PILE_PATH}/chunked/${SUBSET}_128 --cache_dir ${CACHE}}")
     echo -n "${jid} "
done

# validation data
 jid=$(sbatch \
         --parsable \
         ${cluster_info} \
         --mem 48G \
         -c 16 \
         --output ${LOGDIR}/chunk_val \
         preprocessing/run.sh "--input_filename val.jsonl.zst --chunk_length 128 --input_dir ${PILE_PATH} --output_dir ${PILE_PATH}/chunked/VAL_128 --cache_dir ${CACHE}}")
 echo -n "${jid} "



================================================
FILE: experimental/requirements.txt
================================================
datasets
pandas
tqdm
numpy
joblib
nltk
fasttext


================================================
FILE: experimental/train/accelerate_config.yaml
================================================
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: true
machine_rank: 0
main_process_ip: null
main_process_port: 1234
main_training_function: main
num_machines: 1
num_processes: 4


================================================
FILE: experimental/train/collator.py
================================================
import random
import warnings
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, NewType, Optional, Tuple, Union

import torch

from transformers.tokenization_utils_base import BatchEncoding, PreTrainedTokenizerBase

@dataclass
class DataCollatorForLanguageModeling:

    tokenizer: PreTrainedTokenizerBase
    mlm: bool = True
    mlm_probability: float = 0.15
    pad_to_multiple_of: Optional[int] = None

    def __post_init__(self):
        if self.mlm and self.tokenizer.mask_token is None:
            raise ValueError(
                "This tokenizer does not have a mask token which is necessary for masked language modeling. "
                "You should pass `mlm=False` to train on causal language modeling instead."
            )

    def __call__(
        self, examples: List[Union[List[int], torch.Tensor, Dict[str, torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # Handle dict or lists with proper padding and conversion to tensor.
        batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)

        # If special token mask has been preprocessed, pop it from the dict.
        special_tokens_mask = batch.pop("special_tokens_mask", None)
        batch.pop("id", None)

        init_labels = batch.pop("labels", None)
        input_ids = batch["input_ids"].clone()
        if self.mlm:
            batch["input_ids"], batch["labels"] = self.mask_tokens(
                input_ids, special_tokens_mask=special_tokens_mask,
                init_labels=init_labels,
            )
        else:
            batch.pop("labels", None)
        return batch

    def mask_tokens(
        self, inputs: torch.Tensor, special_tokens_mask: Optional[torch.Tensor] = None,
        init_labels: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
        """
        labels = inputs.clone()
        # We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        if special_tokens_mask is None:
            special_tokens_mask = [
                self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
            ]
            special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
        else:
            special_tokens_mask = special_tokens_mask.bool()

        probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # We only compute loss on masked tokens
        if init_labels is not None:
            labels[inputs == self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)] = init_labels

        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

        # 10% of the time, we replace masked input tokens with random word
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        # The rest of the time (10% of the time) we keep the masked input tokens unchanged
        return inputs, labels

================================================
FILE: experimental/train/model.py
================================================
import logging
import torch
from torch import nn
from torch.nn import CrossEntropyLoss

from transformers import BertPreTrainedModel, BertModel
from transformers.models.bert.modeling_bert import BertOnlyMLMHead
from transformers import RobertaPreTrainedModel, RobertaModel
from transformers.models.roberta.modeling_roberta import RobertaLMHead
from transformers.modeling_outputs import MaskedLMOutput
from transformers.utils.dummy_pt_objects import NoRepeatNGramLogitsProcessor

logger = logging.getLogger(__name__)

class BertForMaskedLM(BertPreTrainedModel):

    _keys_to_ignore_on_load_unexpected = [r"pooler"]
    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]

    def __init__(self, config):
        super().__init__(config)

        if config.is_decoder:
            logger.warning(
                "If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for "
                "bi-directional self-attention."
            )

        self.bert = BertModel(config, add_pooling_layer=False)
        self.cls = BertOnlyMLMHead(config)

        self.init_weights()

    def get_output_embeddings(self):
        return self.cls.predictions.decoder

    def set_output_embeddings(self, new_embeddings):
        self.cls.predictions.decoder = new_embeddings

    def set_args(self, args):
        self.args = args

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        return_loss=True,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        loss_fct = CrossEntropyLoss()  # -100 index = padding token
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]

        # compute mlm loss
        loss = None
        if labels is not None:
            prediction_scores = self.cls(sequence_output)
            l_shape = labels.shape
            loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
        else:
            loss = self.cls(sequence_output).sum() * 0.0

        return MaskedLMOutput(
            loss=loss,
            logits=prediction_scores,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )


class RobertaForMaskedLM(RobertaPreTrainedModel):

    _keys_to_ignore_on_load_unexpected = [r"pooler"]
    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]

    def __init__(self, config):
        super().__init__(config)

        if config.is_decoder:
            logger.warning(
                "If you want to use `RobertaForMaskedLM` make sure `config.is_decoder=False` for "
                "bi-directional self-attention."
            )

        self.roberta = RobertaModel(config, add_pooling_layer=False)
        self.lm_head = RobertaLMHead(config)

        self.init_weights()

    def get_output_embeddings(self):
        return self.lm_head.decoder

    def set_output_embeddings(self, new_embeddings):
        self.lm_head.decoder = new_embeddings

    def set_args(self, args):
        self.args = args

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        return_loss=True,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        loss_fct = CrossEntropyLoss()  # -100 index = padding token
        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]

        # compute mlm loss
        mlm_loss = None
        if labels is not None:
            prediction_scores = self.lm_head(sequence_output)
            l_shape = labels.shape
            mlm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
        else:
            if self.lm_head is not None:
                mlm_loss = self.lm_head(sequence_output).sum() * 0.0
            else:
                mlm_loss = 0.0

        return MaskedLMOutput(
            loss=mlm_loss,
            logits=prediction_scores,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )



================================================
FILE: experimental/train/preprocess_general.sh
================================================
#!/bin/bash
source config.sh

source ${VIRTUAL_ENV}/bin/activate

TASK=$1
PRETRAIN_DATA_PATH=$2
CACHE=$3
OTHER_ARGS=$4
MAXLEN=128

mkdir -p $CACHE
export HF_HOME=$CACHE
export TRANSFORMERS_CACHE=$CACHE
export HF_DATASETS_CACHE=$CACHE
export HF_DATASETS_IN_MEMORY_MAX_SIZE=0
export TORCH_EXTENSIONS_DIR=$CACHE
export TMPDIR=$CACHE




python run_pipeline.py \
    --pipeline_step preprocess \
    --preprocessing_num_workers 32 \
    --cache_dir ${CACHE} \
    --dataset_path ${PRETRAIN_DATA_PATH} \
    --max_ckpts_to_keep 3 \
    --max_length ${MAXLEN} \
    --pad_to_max_length \
    --model_name_or_path bert-base-uncased \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --gradient_accumulation_steps 1 \
    --task_name $TASK \
    --save_final \
    --seed 42 \
    ${OTHER_ARGS}






================================================
FILE: experimental/train/pretrain_general.sh
================================================
#!/bin/bash
source config.sh

source ${VIRTUAL_ENV}/bin/activate


TASK=$1
PRETRAIN_DATA_PATH=$2
CUDA_DEVICES=$3
NUM_GPUS=$4
SAVENAME=$5
PORT=$6
PRETRAIN_OUTPUT_DIR=$7
CACHE=$8
OTHER_ARGS=$9
LR=${10:-"5e-4"}

mkdir -p $CACHE
export HF_HOME=$CACHE
export TRANSFORMERS_CACHE=$CACHE
export HF_DATASETS_CACHE=$CACHE
export HF_DATASETS_IN_MEMORY_MAX_SIZE=0
export TORCH_EXTENSIONS_DIR=$CACHE
export TMPDIR=$CACHE


# We expect 4 GPUs / RTX 3090
BATCH_SIZE=64
GRAD_ACCUM=16

mkdir -p $PRETRAIN_OUTPUT_DIR

WD=0.01
WARMUP=3000

OUTPUT_PATH=${PRETRAIN_OUTPUT_DIR}/${SAVENAME}_pretrain
MAXLEN=128


accelerate launch \
    --config_file ./accelerate_config.yaml \
   --main_process_port ${PORT} \
   --num_processes ${NUM_GPUS} \
    run_pipeline.py \
    --pipeline_step pretrain \
    --cache_dir ${CACHE} \
    --dataset_path ${PRETRAIN_DATA_PATH} \
    --max_train_steps 50000 \
    --steps_to_eval 1000 \
    --steps_to_save 12500 \
    --steps_to_log 100 \
    --max_ckpts_to_keep 4 \
    --max_length ${MAXLEN} \
    --pad_to_max_length \
    --model_name_or_path bert-base-uncased \
    --output_dir $OUTPUT_PATH \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --per_device_eval_batch_size 256 \
    --gradient_accumulation_steps ${GRAD_ACCUM} \
    --cuda_devices $CUDA_DEVICES \
    --task_name $TASK \
    --save_final \
    --weight_decay $WD \
    --learning_rate $LR \
    --num_warmup_steps $WARMUP \
    --seed 42 \
    ${OTHER_ARGS}




================================================
FILE: experimental/train/requirements.txt
================================================
accelerate==0.10.0
pandas==1.1.5
transformers==4.12.3
nltk>=3.6.2
torch==1.10.1
datasets
wandb
scipy
scikit-learn


================================================
FILE: experimental/train/run_pipeline.py
================================================
# Adapted from https://github.com/yaoxingcheng/TLM and HuggingFace example

import argparse
import logging
import math
import os, sys
import random
import torch
import datasets
from datasets import load_dataset, load_metric, load_from_disk, DatasetDict, concatenate_datasets
from torch.utils.data.dataloader import DataLoader
from tqdm.auto import tqdm
from pathlib import Path
import wandb

import transformers
from accelerate import Accelerator, DeepSpeedPlugin
from transformers import (
    AdamW,
    AutoConfig,
    AutoModel,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    PreTrainedTokenizerFast,
    DataCollatorWithPadding,
    PretrainedConfig,
    SchedulerType,
    default_data_collator,
    get_scheduler,
    set_seed,
)
from transformers.utils.versions import require_version
from collator import DataCollatorForLanguageModeling
from model import BertForMaskedLM
from trainer import PretrainTrainer

logger = logging.getLogger(__name__)

def parse_args():
    parser = argparse.ArgumentParser(description="Train a Transformer model")
    # about active sampling
    parser.add_argument(
        "--tokenizer_file",
        type=str,
        default=None,
        help="The name of the tokenizer file or path"
    )
    parser.add_argument(
        "--pipeline_step",
        type=str,
        default='preprocess',
        help="step of the pipeline - preprocess, pretrain"
    )
    parser.add_argument(
        "--save_final",
        action="store_true",
        help="save the final checkpoint",
    )
    parser.add_argument(
        "--from_scratch",
        action="store_true",
        help="training from scratch",
    )
    parser.add_argument(
        "--from_ckpt",
        type=str,
        default=None,
        help="restore the model training process from a checkpoint",
    )
    parser.add_argument(
        "--dataset_path",
        type=str,
        default=None,
        help="path to the pretraining dataset"
    )
    parser.add_argument(
        "--cache_dir",
        type=str,
        default='/scr',
        help="path to cache directory"
    )
    parser.add_argument(
        "--max_ckpts_to_keep",
        type=int,
        default=3,
        help="Number of checkpoints to keep"
    )
    parser.add_argument(
        "--preprocessing_num_workers",
        type=int,
        default=None,
        help="Number of preprocessors"
    )
    parser.add_argument(
        "--steps_to_log",
        type=int,
        default=None,
        help="Num steps to log training info"
    )
    parser.add_argument(
        "--steps_to_eval",
        type=int,
        default=None,
        help="Num steps to evaluate on the dev set"
    )
    parser.add_argument(
        "--steps_to_save",
        type=int,
        default=None,
        help="Num steps to save the checkpoint"
    )
    parser.add_argument(
        "--mlm_probability",
        type=float,
        default=0.15,
        help="Probability of masking"
    )
    parser.add_argument(
        "--adam_beta1",
        type=float,
        default=0.9,
        help="Adam beta 1"
    )
    parser.add_argument(
        "--adam_beta2",
        type=float,
        default=0.999,
        help="Adam beta 2"
    )
    parser.add_argument(
        "--adam_eps",
        type=float,
        default=1e-8,
        help="Adam epsilon"
    )
    parser.add_argument(
        "--max_grad_norm",
        type=float,
        default=0.0,
        help="Max gradient norm"
    )
    parser.add_argument(
        "--task_name",
        type=str,
        default=None,
        help="The name of the task to train on.",
    )
    parser.add_argument(
        "--max_length",
        type=int,
        default=128,
        help=(
            "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated,"
            " sequences shorter will be padded if `--pad_to_max_lengh` is passed."
        ),
    )
    parser.add_argument(
        "--pad_to_max_length",
        action="store_true",
        help="If passed, pad all samples to `max_length`. Otherwise, dynamic padding is used.",
    )
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
    )
    parser.add_argument(
        "--config_dir",
        type=str,
        help="Path to pretrained model or model identifier from huggingface.co/models.",
        default=None,
    )
    parser.add_argument(
        "--per_device_train_batch_size",
        type=int,
        default=8,
        help="Batch size (per device) for the training dataloader.",
    )
    parser.add_argument(
        "--per_device_eval_batch_size",
        type=int,
        default=8,
        help="Batch size (per device) for the evaluation dataloader.",
    )
    parser.add_argument(
        "--learning_rate",
        type=float,
        default=2e-5,
        help="Initial learning rate (after the potential warmup period) to use.",
    )
    parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.")
    parser.add_argument("--num_train_epochs", type=int, default=None, help="Total number of training epochs to perform.")
    parser.add_argument(
        "--max_train_steps",
        type=int,
        default=0,
        help="Total number of training steps to perform. If provided, overrides num_train_epochs.",
    )
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument(
        "--lr_scheduler_type",
        type=SchedulerType,
        default="linear",
        help="The scheduler type to use.",
        # choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"],
    )
    parser.add_argument(
        "--num_warmup_steps", type=float, default=10000, help="Number of steps for the warmup in the lr scheduler."
    )
    parser.add_argument(
        "--cuda_devices", type=str, default='0', help="visible cuda devices."
    )
    parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.")
    parser.add_argument("--seed", type=int, default=42, help="A seed for reproducible training.")
    args = parser.parse_args()
    # Sanity checks
    if args.task_name is None:
        raise ValueError("Need a task name.")

    if args.model_name_or_path is None:
        assert args.from_scratch, "no model name or path is provided but trying to initialize from a pre-trained weight"

    if args.output_dir is not None:
        os.makedirs(args.output_dir, exist_ok=True)
        with open(os.path.join(args.output_dir, "args"), "w") as f:
            for arg in vars(args):
                f.write(f"{arg}: {getattr(args, arg)}\n")

    return args

def get_logger(args, accelerator=None):
    # Make one log on every process with the configuration for debugging.
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )
    if accelerator is not None:
        logger.info(accelerator.state)

    # Setup logging, we only want one process per machine to log things on the screen.
    # accelerator.is_local_main_process is only True for one process per machine.
    if args.output_dir is not None:
        logfile = os.path.join(args.output_dir, "log")
        if accelerator is not None and accelerator.is_main_process:
            if os.path.exists(logfile):
                os.remove(logfile)
            os.mknod(logfile)
            fh = logging.FileHandler(logfile, mode='w')
            fh.setLevel(logging.INFO)
            formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(name)s - %(message)s")
            fh.setFormatter(formatter)
            logger.addHandler(fh)

    if accelerator is None:
        logger.setLevel(logging.INFO)
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()
    else:
        logger.setLevel(logging.INFO if accelerator.is_main_process else logging.ERROR)
        if accelerator.is_main_process:
            datasets.utils.logging.set_verbosity_warning()
            transformers.utils.logging.set_verbosity_info()
        else:
            datasets.utils.logging.set_verbosity_error()
            transformers.utils.logging.set_verbosity_error()

    # If passed along, set the training seed now.
    if args.seed is not None:
        set_seed(args.seed)
    return logger


def get_dataset(args, preprocessed_cache):
    data_files = {}
    data_files["train"] = args.dataset_path
    extension = Path(args.dataset_path).name.split(".")[-1]
    if extension == "txt":
        extension = "text"
    elif extension == "jsonl":
        extension = "json"
    else:
        extension = "json"

    raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=f"{args.cache_dir}/cache")
    return raw_datasets


def preprocess(args, raw_datasets, tokenizer, logger, preprocessed_cache):
    logger.info("preprocessing datasets")
    if raw_datasets is not None:
        column_names = raw_datasets["train"].column_names
        text_column_name = "contents" if "contents" in column_names else column_names[0]
    else:
        text_column_name = "contents"

    padding = "max_length"
    def tokenize_function(examples):
        # Remove empty lines
        examples[text_column_name] = [
            line for line in examples[text_column_name] if line is not None and len(line) > 0 and not line.isspace()
        ]

        tokenized_examples = tokenizer(
            examples[text_column_name],
            padding=padding,
            truncation=True,
            max_length=args.max_length,
            return_special_tokens_mask=True,
        )
        remove_cols = set(column_names)
        for k in examples:
            if k not in tokenized_examples and k not in {text_column_name, "rank", "ids", "id"} and k not in remove_cols:
                tokenized_examples[k] = examples[k][:len(examples[text_column_name])]

        return tokenized_examples

    num_cpus = len(os.sched_getaffinity(0))
    num_shards = 8
    tokenized_datasets_ls = []
    for shard_i in range(num_shards):
        preprocessed_cache.mkdir(exist_ok=True)
        preprocessed_cache_i = preprocessed_cache / f"shard_{shard_i}"
        if not preprocessed_cache_i.exists():
            raw_datasets_shard = raw_datasets["train"].shard(
                    num_shards=num_shards, index=shard_i).flatten_indices()
            logger.info(f"Processing shard {shard_i}")
            tokenized_datasets_i = raw_datasets_shard.map(
                    tokenize_function,
                    batched=True,
                    batch_size=100,
                    num_proc=num_cpus//2,
                    remove_columns=raw_datasets_shard.column_names,
                    desc="Running tokenizer on dataset line_by_line",
                )
            logger.info(f"Saving shard {shard_i} to disk")
            tokenized_datasets_i.save_to_disk(str(preprocessed_cache_i))

    for shard_i in range(num_shards):
        assert preprocessed_cache_i.exists()
        tokenized_datasets_i = load_from_disk(str(preprocessed_cache_i))
        tokenized_datasets_ls.append(tokenized_datasets_i)
    tokenized_datasets = concatenate_datasets(tokenized_datasets_ls)

    return tokenized_datasets, text_column_name

def get_model(args, load_model=True):
    # Load pretrained model and tokenizer
    #
    # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently
    # download model & vocab.
    if args.pipeline_step == 'preprocess':
        load_model = False

    if args.model_name_or_path and not args.from_scratch:
        try:
            tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
        except Exception:
            tokenizer = AutoTokenizer.from_pretrained(args.config_dir)

        if load_model:
            config = AutoConfig.from_pretrained(args.model_name_or_path)

            # set_cls = True if not args.from_scratch else False
            if args.model_name_or_path == 'bert-base-uncased' or args.model_name_or_path == 'bert-large-uncased':
                model = BertForMaskedLM.from_pretrained(
                    args.model_name_or_path,
                    from_tf=bool(".ckpt" in args.model_name_or_path),
                    config=config,
                )
            else:
                model = RobertaForMaskedLM.from_pretrained(
                    args.model_name_or_path,
                    from_tf=bool(".ckpt" in args.model_name_or_path),
                    config=config,
                )
    elif args.model_name_or_path and args.from_scratch:
        config = AutoConfig.from_pretrained(args.model_name_or_path)
        tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
        if load_model:
            if args.model_name_or_path == 'bert-base-uncased' or args.model_name_or_path == 'bert-large-uncased':
                model = BertForMaskedLM(config)
            else:
                model = RobertaForMaskedLM(config)
    else:
        config = AutoConfig.from_pretrained(args.config_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.config_dir)
        if load_model:
            if args.model_name_or_path == 'bert-base-uncased' or args.model_name_or_path == 'bert-large-uncased':
                model = BertForMaskedLM(config)
            else:
                model = RobertaForMaskedLM(config)

    if load_model:
        model.set_args(args)
    else:
        model = None
    return tokenizer, model


def main():
    args = parse_args()
    os.environ['CUDA_VISIBLE_DEVICES'] = args.cuda_devices
    os.environ['NVIDIA_VISIBLE_DEVICES'] = args.cuda_devices

    preprocessed_cache = Path(args.dataset_path).parent / 'preprocessed_cache' / f"{args.task_name}_{Path(args.dataset_path).parent.name}_{Path(args.dataset_path).stem}"

    preprocessed_cache.parent.mkdir(exist_ok=True, parents=True)

    if args.pipeline_step == 'preprocess':
        logger = get_logger(args)
        tokenizer, model = get_model(args, load_model=False)

        raw_dataset = get_dataset(args, preprocessed_cache)
        dataset, text_column_name = preprocess(args, raw_dataset, tokenizer, logger, preprocessed_cache)

    elif args.pipeline_step == 'pretrain':
        accelerator = Accelerator(fp16=True)
        args.device = accelerator.device
        logger = get_logger(args, accelerator)
        tokenizer, model = get_model(args)
        if accelerator.is_main_process:
            wandb.init(project="pretrain", name=f"{args.pipeline_step}_{args.task_name}_{Path(args.output_dir).name}_{args.seed}")


        raw_dataset = None
        dataset, text_column_name = preprocess(args, raw_dataset, tokenizer, logger, preprocessed_cache=preprocessed_cache)
        data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=args.mlm_probability)
        dataloader = DataLoader(
            dataset, shuffle=True, collate_fn=data_collator, batch_size=args.per_device_train_batch_size, num_workers=0
        )

        # Optimizer
        # Split weights in two groups, one with weight decay and the other not.
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": args.weight_decay,
            },
            {
                "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, betas=(args.adam_beta1, args.adam_beta2), eps=args.adam_eps)

        # Prepare everything with `accelerator`.
        model, optimizer, dataloader = accelerator.prepare(
            model, optimizer, dataloader
        )
        num_update_steps_per_epoch = math.ceil(len(dataloader) / args.gradient_accumulation_steps)
        if args.max_train_steps is None or args.max_train_steps == 0:
            args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
        else:
            args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

        lr_scheduler = get_scheduler(
            name=args.lr_scheduler_type,
            optimizer=optimizer,
            num_warmup_steps=int(args.num_warmup_steps),
            num_training_steps=args.max_train_steps,
        )

        trainer = PretrainTrainer(
            args=args,
            model=model,
            optimizer=optimizer,
            lr_scheduler=lr_scheduler,
            dataloader=dataloader,
            logger=logger,
            accelerator=accelerator,
            from_checkpoint=args.from_ckpt,
            tokenizer=tokenizer,
            max_grad_norm=args.max_grad_norm,
        )
        trainer.train()

if __name__ == "__main__":
    main()


================================================
FILE: experimental/train/run_pretrain_pipeline_general.sh
================================================
#!/bin/bash
set -x

source config.sh

TASK=$1
SUFFIX=$2
PORT=$3
PRETRAIN_DATA_PATH=$4
DO_PREPROCESS=${5:-"true"}
DO_PRETRAIN=${6:-"true"}
OTHER_ARGS=${7:-""}
LR=${8:-"5e-4"}


LOGDIR=logs/train
mkdir -p ${LOGDIR}

dependency=""
if [[ "${DO_PREPROCESS}" = "true" ]]; then
    jid=$(sbatch \
        --parsable \
        --mem=128g \
        ${cluster_info} \
        -c 8 \
        --output ${LOGDIR}/${TASK}_preprocess_${SUFFIX}.log \
        train/preprocess_general.sh ${TASK} ${PRETRAIN_DATA_PATH} ${CACHE} "${OTHER_ARGS}")
    echo -n "${jid} "
    dependency="--dependency afterok:${jid}"

fi

# pretrain
if [[ "${DO_PRETRAIN}" = "true" ]]; then
    # --mem=96g \
    jid=$(sbatch \
        --parsable \
        ${dependency} \
        ${cluster_info} \
        --gres=gpu:4 \
        -c 8 \
        --mem=64g \
        -t 14-0:00 \
        --output ${LOGDIR}/${TASK}_pretrain_${SUFFIX}.log \
        train/pretrain_general.sh ${TASK} ${PRETRAIN_DATA_PATH} "0,1,2,3" 4 ${TASK}_${SUFFIX} ${PORT} ${PRETRAIN_OUTPUT_DIR} ${CACHE} "${OTHER_ARGS}" ${LR})
    echo -n "${jid} "
    dependency="--dependency afterok:${jid}"
else
    dependency=""
fi


================================================
FILE: experimental/train/run_slurm.sh
================================================

# Train general-domain model from scratch
# We first try learning rate 1e-3, then lower to 8e-4 if the loss diverges

bash train/run_pretrain_pipeline_general.sh \
      wiki_and_books \
      dsir_scratch \
      60200 \
      /path/to/data.jsonl \
     "true" "true" "--from_scratch --adam_beta1 0.9 --adam_beta2 0.98 --adam_eps 1e-6 --max_grad_norm 1.0" 8e-4


================================================
FILE: experimental/train/trainer.py
================================================
import os
import shutil
import json
import torch
import math
import numpy as np
from tqdm import tqdm
import wandb
from pathlib import Path


class PretrainTrainer:

    def __init__(self,
                 args,
                 model,
                 optimizer,
                 lr_scheduler,
                 dataloader,
                 logger,
                 accelerator,
                 tokenizer,
                 max_grad_norm=0.0,
                 from_checkpoint=None,
                ):

        self.args = args
        self.model = model
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        self.dataloader = dataloader
        self.logger = logger
        self.completed_steps = 0
        self.accelerator = accelerator
        self.tokenizer = tokenizer
        self.max_grad_norm = max_grad_norm
        self._iter = iter(dataloader)
        self.from_checkpoint = from_checkpoint

    def _move_to_device(self, batch):
        for k in batch:
            batch[k] = batch[k].to(self.args.device)
        return batch

    def _save_model(self, save_path=None):
        if save_path is None:
            save_path = self.args.output_dir

        if self.accelerator.is_main_process:
           unwrapped_model = self.accelerator.unwrap_model(self.model)
           unwrapped_model.save_pretrained(save_path, save_function=self.accelerator.save)

    def _save_trained(self, save_path=None):
        if save_path is None:
            save_path = self.args.output_dir
        Path(save_path).mkdir(parents=True, exist_ok=True)
        torch.save(self.optimizer.state_dict(), os.path.join(save_path, "optimizer.pt"))
        torch.save(self.lr_scheduler.state_dict(), os.path.join(save_path, "scheduler.pt"))
        trainer_state = {
            "completed_steps": self.completed_steps,
            "best_metric": 0.0,
            "test_results": 0.0,
        }
        if self.accelerator.is_main_process:
            with open(os.path.join(save_path, "trainer_state.json"), "w") as f:
                json.dump(trainer_state, f)
        self._save_model(save_path=save_path)

    def evaluate(self):
        pass

    def _get_batch(self):
        try:
            batch = next(self._iter)
        except StopIteration:
            self._iter = iter(self.dataloader)
            batch = next(self._iter)

        batch = self._move_to_device(batch)

        if 'rank' in batch:
            batch.pop('rank')
        return batch

    def compute_loss(self):
        self.model.train()
        batch = self._get_batch()
        outputs = self.model(**batch)
        loss = outputs.loss
        loss = loss / self.args.gradient_accumulation_steps
        self.accelerator.backward(loss)
        return loss.item()

    def _prepare_from_checkpoint(self):

        if self.from_checkpoint is None:
            return

        state_file = os.path.join(self.from_checkpoint, "trainer_state.json")
        optim_file = os.path.join(self.from_checkpoint, "optimizer.pt")
        sched_file = os.path.join(self.from_checkpoint, "scheduler.pt")
        if os.path.exists(sched_file):
            sched_state = torch.load(sched_file)
            self.lr_scheduler.load_state_dict(sched_state)
        if not os.path.exists(state_file):
            return

        with open(state_file, "r") as f:
            state = json.load(f)
            self.pre_completed_steps = state["completed_steps"]
            self.best_metric = state["best_metric"]

        self.logger.info(f"pretrained steps: {self.pre_completed_steps}, best dev metric {self.best_metric}")
        self.accelerator.wait_for_everyone()

    def update(self, tr_loss, loss_step):
        if self.completed_steps % self.args.steps_to_log == 0:
            self.logger.info(
                "step {}, learning rate {}, average loss {}".format(
                    self.completed_steps,
                    self.optimizer.param_groups[0]["lr"],
                    tr_loss / loss_step
                )
            )
            if self.accelerator.is_main_process:
                wandb.log({'step': self.completed_steps, 'loss': tr_loss / loss_step,
                           'lr': self.optimizer.param_groups[0]["lr"]})
            tr_loss = 0.0
            loss_step = 0

        if self.completed_steps % self.args.steps_to_eval == 0:
            self.evaluate()

        if self.completed_steps % self.args.steps_to_save == 0:
            self.accelerator.wait_for_everyone()
            if self.accelerator.is_main_process:
                self._save_trained(
                    save_path = os.path.join(self.args.output_dir, 'checkpoint-{}'.format(self.completed_steps))
                )
                # delete outdated checkpoints
                for files in os.listdir(self.args.output_dir):
                    file_name = os.path.join(self.args.output_dir, files)
                    if os.path.isdir(file_name) and files.startswith('checkpoint-'):
                        checked_step = int(files[11:])
                        if self.completed_steps - checked_step >= self.args.max_ckpts_to_keep * self.args.steps_to_save:
                            if self.accelerator.is_main_process:
                                shutil.rmtree(file_name)

    def train(self):

        total_batch_size = self.args.per_device_train_batch_size * self.args.gradient_accumulation_steps * self.accelerator.num_processes
        self.logger.info("***** Running training *****")
        self.logger.info(f"  Num examples = {len(self.dataloader.dataset)}")
        self.logger.info(f"  Num Epochs = {self.args.num_train_epochs}")
        self.logger.info(f"  Instantaneous batch size per device = {self.args.per_device_train_batch_size}")
        self.logger.info(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
        self.logger.info(f"  Gradient Accumulation steps = {self.args.gradient_accumulation_steps}")
        self.logger.info(f"  Total optimization steps = {self.args.max_train_steps}")
        # Only show the progress bar once on each machine.
        progress_bar = tqdm(range(self.args.max_train_steps))
        self.completed_steps = 0
        self.pre_completed_steps = 0
        self._prepare_from_checkpoint()
        tr_loss = 0.0
        loss_step = 0
        for step in range(self.args.max_train_steps * self.args.gradient_accumulation_steps):
            if self.completed_steps < self.pre_completed_steps:
                self._get_batch()
                if step % self.args.gradient_accumulation_steps == 0:
                    self.completed_steps += 1
                    progress_bar.update(1)
                continue
            if step % self.args.gradient_accumulation_steps == 0:
                tr_loss += self.compute_loss()
                loss_step += 1
                if self.max_grad_norm > 0:
                    self.accelerator.clip_grad_norm_(
                            self.model.parameters(), self.max_grad_norm)
                self.optimizer.step()
                self.lr_scheduler.step()
                self.optimizer.zero_grad()
                progress_bar.update(1)
                self.completed_steps += 1
                self.update(tr_loss, loss_step)
                tr_loss = 0.0
                loss_step = 0
            else:
                with self.accelerator.no_sync(self.model):
                    tr_loss += self.compute_loss()
                    loss_step += 1

        self.evaluate()
        self._save_trained(
            save_path = os.path.join(self.args.output_dir, 'final')
        )


================================================
FILE: pyproject.toml
================================================
[build-system]
requires      = ["setuptools>=61.0.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "data-selection"
version = "1.0.3"
authors = [
  { name="Sang Michael Xie", email="xie@cs.stanford.edu" },
]
description = "Data Selection with Importance Resampling"
readme = "README.md"
requires-python = ">=3.6"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]

license = { file = "LICENSE" }
keywords = ["data selection", "importance resampling", "dsir", "nlp", "language models"]
dependencies = [
    'numpy>=1.21.6',
    'tqdm>=4.62.3',
    'joblib>=1.1.0',
    'nltk>=3.8.1',
]

[project.optional-dependencies]
dev = ["pytest"]

[project.urls]
"Homepage" = "https://github.com/p-lambda/dsir"


================================================
FILE: setup.py
================================================
from setuptools import setup, find_packages

from pathlib import Path
if __name__ == "__main__":
    # parse version from pyproject.toml
    curr_dir = Path(__file__).parent
    with open(curr_dir / 'pyproject.toml', 'r') as f:
        for line in f:
            if line.startswith('version'):
                version = line.split('=')[1].strip().strip('"')
                break

    setup(name='data-selection',
          version=version,
          description='Data Selection with Importance Resampling',
          url='https://github.com/p-lambda/dsir',
          author='Sang Michael Xie',
          author_email='xie@cs.stanford.edu',
          packages=['data_selection'],
          install_requires=[
            'numpy>=1.21.6',
            'tqdm>=4.62.3',
            'joblib>=1.1.0',
            'nltk>=3.8.1',
          ]
    )


================================================
FILE: tests/test_hashed_ngram.py
================================================
import pytest
from pathlib import Path
import numpy as np
import json
import shutil

from data_selection.hashed_ngram_dsir import HashedNgramDSIR, hash_buckets, get_ngram_counts


toy_dataset = Path(__file__).parent / "toy_pile_data.jsonl"
raw_datasets = [str(toy_dataset)] * 2
target_datasets = [str(Path(__file__).parent / "toy_target_data.jsonl")]
separated_target_datasets = [str(Path(__file__).parent / "toy_target_data.jsonl"), str(Path(__file__).parent / "toy_target_data_2.jsonl")]


def parse_example_fn(ex):
    return ex['contents']


@pytest.fixture
def dsir_obj():
    dsir = HashedNgramDSIR(
            raw_datasets=raw_datasets,
            target_datasets=target_datasets,
            cache_dir='/tmp/dsir_params',
            raw_parse_example_fn=parse_example_fn,
            target_parse_example_fn=parse_example_fn,
            num_proc=2,
            ngrams=2,
            num_buckets=10000)

    yield dsir

    if Path('/tmp/dsir_params').exists():
        shutil.rmtree('/tmp/dsir_params')


@pytest.fixture
def dsir_obj_diffparams():
    dsir = HashedNgramDSIR(
            raw_datasets=raw_datasets,
            target_datasets=target_datasets,
            cache_dir='/tmp/dsir_params',
            raw_parse_example_fn=parse_example_fn,
            target_parse_example_fn=parse_example_fn,
            num_proc=2,
            ngrams=3,
            num_buckets=50000)

    yield dsir

    if Path('/tmp/dsir_params').exists():
        shutil.rmtree('/tmp/dsir_params')


@pytest.fixture
def dsir_obj_septarget():
    dsir = HashedNgramDSIR(
            raw_datasets=raw_datasets,
            target_datasets=separated_target_datasets,
            cache_dir='/tmp/dsir_params',
            raw_parse_example_fn=parse_example_fn,
            target_parse_example_fn=parse_example_fn,
            num_proc=2,
            ngrams=2,
            num_buckets=10000,
            separate_targets=True,
            target_proportions=[0.5, 0.5])

    yield dsir

    if Path('/tmp/dsir_params').exists():
        shutil.rmtree('/tmp/dsir_params')


def test_hash_buckets():
    bucket = hash_buckets('alice')
    bucket_2 = hash_buckets('alice went')

    assert bucket == 6720
    assert bucket_2 == 114


def test_get_ngram_counts():
    line = 'Alice went to the store'

    counts = get_ngram_counts(line, n=2, num_buckets=10000)
    assert counts.shape == (10000,)
    assert counts.sum() == 9

    bucket = hash_buckets('alice')
    bucket_2 = hash_buckets('alice went')

    assert counts[bucket] > 0
    assert counts[bucket_2] > 0


def test_virtual_shards(dsir_obj):
    assert len(dsir_obj._get_virtually_sharded_datasets(raw_datasets)) == 2


def test_length_metadata(dsir_obj):
    text = "Alice walked into the store\n\n()()($#%@?)@(#(*"
    length = len(dsir_obj.tokenizer(text))
    feats = dsir_obj.featurizer(text)
    assert length == dsir_obj.get_perexample_metadata(ex=None, features=feats)


def test_fit(dsir_obj):
    dsir_obj.fit_importance_estimator(num_tokens_to_fit='all')

    assert dsir_obj.raw_probs is not None
    assert dsir_obj.raw_probs.shape == (10000,)
    assert dsir_obj.raw_probs.sum() == 1.0
    assert dsir_obj.target_probs is not None
    assert dsir_obj.target_probs.shape == (10000,)
    assert dsir_obj.target_probs.sum() == 1.0
    assert dsir_obj.log_diff is not None
    assert dsir_obj.log_diff.shape == (10000,)
    assert np.allclose(dsir_obj.log_diff, np.log(dsir_obj.target_probs + 1e-8) - np.log(dsir_obj.raw_probs + 1e-8))


    dsir_obj.fit_importance_estimator(num_tokens_to_fit='auto')

    assert dsir_obj.raw_probs is not None
    assert dsir_obj.raw_probs.shape == (10000,)
    assert dsir_obj.raw_probs.sum() == 1.0
    assert dsir_obj.target_probs is not None
    assert dsir_obj.target_probs.shape == (10000,)
    assert dsir_obj.target_probs.sum() == 1.0
    assert dsir_obj.log_diff is not None
    assert dsir_obj.log_diff.shape == (10000,)
    assert np.allclose(dsir_obj.log_diff, np.log(dsir_obj.target_probs + 1e-8) - np.log(dsir_obj.raw_probs + 1e-8))


    dsir_obj.fit_importance_estimator(num_tokens_to_fit=100000)

    assert dsir_obj.raw_probs is not None
    assert dsir_obj.raw_probs.shape == (10000,)
    assert dsir_obj.raw_probs.sum() == 1.0
    assert dsir_obj.target_probs is not None
    assert dsir_obj.target_probs.shape == (10000,)
    assert dsir_obj.target_probs.sum() == 1.0
    assert dsir_obj.log_diff is not None
    assert dsir_obj.log_diff.shape == (10000,)
    assert np.allclose(dsir_obj.log_diff, np.log(dsir_obj.target_probs + 1e-8) - np.log(dsir_obj.raw_probs + 1e-8))



def test_compute(dsir_obj):
    dsir_obj.fit_importance_estimator()

    dsir_obj.compute_importance_weights()

    log_importance_weights = [
        np.load(dsir_obj.log_importance_weights_dir / f"{i}.npy") for i in range(2)]
    assert len(log_importance_weights) == len(raw_datasets)
    assert log_importance_weights[0].shape == (1000,)

def test_resample(dsir_obj):
    dsir_obj.fit_importance_estimator()

    dsir_obj.compute_importance_weights()

    dsir_obj.resample(out_dir='/tmp/resampled', num_to_sample=2, cache_dir='/tmp/resampled_cache')

    assert Path('/tmp/resampled').exists()
    assert not Path('/tmp/resampled_cache').exists()

    all_lines = []
    for i in range(dsir_obj.num_proc):
        with open(f'/tmp/resampled/{i}.jsonl', 'r') as f:
            lines = f.readlines()
            all_lines += lines

    assert len(all_lines) == 2

    for line in all_lines:
        ex = json.loads(line)
        assert ex['id'] == 0
        length = len(dsir_obj.tokenizer(dsir_obj.raw_parse_example_fn(ex)))
        assert length >= dsir_obj.min_example_length

    shutil.rmtree('/tmp/resampled')
    if Path('/tmp/resampled_cache').exists():
        shutil.rmtree('/tmp/resampled_cache')

def test_resample_diffparams(dsir_obj_diffparams):
    dsir_obj = dsir_obj_diffparams
    dsir_obj.fit_importance_estimator()

    dsir_obj.compute_importance_weights()

    dsir_obj.resample(out_dir='/tmp/resampled', num_to_sample=2, cache_dir='/tmp/resampled_cache')

    assert Path('/tmp/resampled').exists()
    assert not Path('/tmp/resampled_cache').exists()

    all_lines = []
    for i in range(dsir_obj.num_proc):
        with open(f'/tmp/resampled/{i}.jsonl', 'r') as f:
            lines = f.readlines()
            all_lines += lines

    assert len(all_lines) == 2

    for line in all_lines:
        ex = json.loads(line)
        assert ex['id'] == 0
        length = len(dsir_obj.tokenizer(dsir_obj.raw_parse_example_fn(ex)))
        assert length >= dsir_obj.min_example_length

    shutil.rmtree('/tmp/resampled')
    if Path('/tmp/resampled_cache').exists():
        shutil.rmtree('/tmp/resampled_cache')


def test_resample_septarget(dsir_obj_septarget):
    dsir_obj = dsir_obj_septarget
    dsir_obj.fit_importance_estimator()

    dsir_obj.compute_importance_weights()

    dsir_obj.resample(out_dir='/tmp/resampled', num_to_sample=2, cache_dir='/tmp/resampled_cache')

    assert Path('/tmp/resampled').exists()
    assert not Path('/tmp/resampled_cache').exists()
    assert np.allclose(dsir_obj.target_proportions - 0.5, 0.0)

    all_lines = []
    for i in range(dsir_obj.num_proc):
        with open(f'/tmp/resampled/{i}.jsonl', 'r') as f:
            lines = f.readlines()
            all_lines += lines

    assert len(all_lines) == 2

    all_ids = []
    for line in all_lines:
        ex = json.loads(line)
        all_ids.append(ex['id'])
        length = len(dsir_obj.tokenizer(dsir_obj.raw_parse_example_fn(ex)))
        assert length >= dsir_obj.min_example_length

    assert len(set(all_ids)) == 2
    assert min(all_ids) == 0
    assert max(all_ids) == 2

    shutil.rmtree('/tmp/resampled')
    if Path('/tmp/resampled_cache').exists():
        shutil.rmtree('/tmp/resampled_cache')


def test_resample_virtual_sharding():
    dsir_obj = HashedNgramDSIR(
            raw_datasets=raw_datasets,
            target_datasets=target_datasets,
            cache_dir='/tmp/dsir_params',
            raw_parse_example_fn=parse_example_fn,
            target_parse_example_fn=parse_example_fn,
            num_proc=15,
            ngrams=2,
            num_buckets=10000)

    assert len(dsir_obj._get_virtually_sharded_datasets(raw_datasets)) == 15

    dsir_obj.fit_importance_estimator()

    dsir_obj.compute_importance_weights()

    dsir_obj.resample(out_dir='/tmp/resampled_virtual', num_to_sample=2, cache_dir='/tmp/resampled_cache_virtual')

    assert Path('/tmp/resampled_virtual').exists()
    assert not Path('/tmp/resampled_cache_virtual').exists()

    all_lines = []
    for i in range(15):
        with open(f'/tmp/resampled_virtual/{i}.jsonl', 'r') as f:
            lines = f.readlines()
            all_lines += lines

    assert len(all_lines) == 2

    for line in all_lines:
        ex = json.loads(line)
        assert ex['id'] == 0
        length = len(dsir_obj.tokenizer(dsir_obj.raw_parse_example_fn(ex)))
        assert length >= dsir_obj.min_example_length

    shutil.rmtree('/tmp/resampled_virtual')
    if Path('/tmp/resampled_virtual').exists():
        shutil.rmtree('/tmp/resampled_virtual')


def test_smoothing(dsir_obj):
    dsir_obj.fit_importance_estimator()

    target_probs_1 = dsir_obj.target_probs

    smoothing_param = 1.0
    dsir_obj_2 = HashedNgramDSIR(
            raw_datasets=raw_datasets,
            target_datasets=target_datasets,
            cache_dir='/tmp/dsir_params_2',
            raw_parse_example_fn=parse_example_fn,
            target_parse_example_fn=parse_example_fn,
            num_proc=2,
            ngrams=2,
            num_buckets=10000,
            target_laplace_smoothing=smoothing_param)
    dsir_obj_2.fit_importance_estimator()

    target_probs_2 = dsir_obj_2.target_probs

    total_counts = None
    with open(target_datasets[0], 'r') as f:
        for line in f:
            ex = json.loads(line)
            text = parse_example_fn(ex)
            counts = get_ngram_counts(text, n=2, num_buckets=10000)
            if total_counts is None:
                total_counts = counts
            else:
                total_counts += counts

    total_tokens = total_counts.sum()

    assert np.allclose(total_counts / total_tokens, target_probs_1)

    assert np.allclose((target_probs_1 * total_tokens + smoothing_param) / (total_tokens + smoothing_param * len(target_probs_1)), target_probs_2)

    if Path('/tmp/dsir_params').exists():
        shutil.rmtree('/tmp/dsir_params_2')


def test_save_load(dsir_obj):
    dsir_obj.fit_importance_estimator()

    dsir_obj.compute_importance_weights()

    dsir_obj.save('/tmp/dsir.pkl')

    dsir_obj_2 = HashedNgramDSIR([], [], '/tmp/cache')
    dsir_obj_2.load('/tmp/dsir.pkl', exclude_keys=['raw_datasets', 'target_datasets', 'cache_dir'])
    assert np.allclose(dsir_obj_2.raw_probs, dsi

Download .txt

gitextract_g6u__ttu/

├── .gitignore
├── LICENSE
├── README.md
├── data_selection/
│   ├── README.md
│   ├── __init__.py
│   ├── base.py
│   ├── hashed_ngram_dsir.py
│   └── utils.py
├── experimental/
│   ├── README.md
│   ├── config.sh
│   ├── data_selection/
│   │   ├── dsir_general/
│   │   │   ├── data_selection.py
│   │   │   ├── run_data_selection.py
│   │   │   └── utils.py
│   │   ├── dsir_pipeline.py
│   │   ├── heuristic_cls_pipeline.py
│   │   ├── run_cmds.sh
│   │   ├── run_dsir.sh
│   │   ├── run_dsir_helper.sh
│   │   ├── run_heuristic_cls.sh
│   │   └── run_heuristic_cls_helper.sh
│   ├── glue_eval/
│   │   ├── read_glue_results.py
│   │   ├── run_eval_exps.sh
│   │   ├── run_glue.py
│   │   ├── run_glue_dist.sh
│   │   └── run_glue_for_seed_task.sh
│   ├── preprocessing/
│   │   ├── quality_scores/
│   │   │   ├── compute_quality_stats.py
│   │   │   ├── merge_quality_scores.py
│   │   │   ├── run_merge_quality_scores.sh
│   │   │   ├── run_quality_stats.sh
│   │   │   └── run_slurm_quality_stats.sh
│   │   ├── reformat_and_chunk_data.py
│   │   ├── run.sh
│   │   └── run_slurm.sh
│   ├── requirements.txt
│   └── train/
│       ├── accelerate_config.yaml
│       ├── collator.py
│       ├── model.py
│       ├── preprocess_general.sh
│       ├── pretrain_general.sh
│       ├── requirements.txt
│       ├── run_pipeline.py
│       ├── run_pretrain_pipeline_general.sh
│       ├── run_slurm.sh
│       └── trainer.py
├── pyproject.toml
├── setup.py
└── tests/
    ├── test_hashed_ngram.py
    ├── test_utils.py
    ├── toy_pile_data.jsonl
    ├── toy_target_data.jsonl
    └── toy_target_data_2.jsonl

Download .txt

SYMBOL INDEX (126 symbols across 17 files)

FILE: data_selection/base.py
  function default_load_dataset_fn (line 18) | def default_load_dataset_fn(path: str) -> Iterable[Dict]:
  function default_parse_example_fn (line 30) | def default_parse_example_fn(ex: Dict) -> str:
  function _iterate_virtually_sharded_dataset (line 39) | def _iterate_virtually_sharded_dataset(dataset: Iterable, num_shards: in...
  class DSIR (line 46) | class DSIR():
    method __init__ (line 50) | def __init__(self,
    method _get_virtually_sharded_datasets (line 97) | def _get_virtually_sharded_datasets(self, datasets: List[str]):
    method featurizer (line 116) | def featurizer(self, text: str) -> np.ndarray:
    method importance_estimator (line 120) | def importance_estimator(self, features: np.ndarray) -> Union[float, n...
    method get_perexample_metadata (line 124) | def get_perexample_metadata(self, ex: Dict, features: np.ndarray) -> n...
    method fit_importance_estimator (line 133) | def fit_importance_estimator(self) -> None:
    method compute_importance_weights (line 140) | def compute_importance_weights(self) -> None:
    method perexample_metadata_filter (line 181) | def perexample_metadata_filter(self, concat_metadata: np.ndarray) -> n...
    method resample (line 185) | def resample(self, out_dir: str, num_to_sample: int, cache_dir: str = ...
    method save (line 304) | def save(self, path: str) -> None:
    method load (line 310) | def load(self, path: str, exclude_keys: Optional[List[str]] = None) ->...

FILE: data_selection/hashed_ngram_dsir.py
  function hash_buckets (line 22) | def hash_buckets(text: str, num_buckets: int = 10000) -> int:
  function get_ngram_counts (line 26) | def get_ngram_counts(line: str,
  class HashedNgramDSIR (line 54) | class HashedNgramDSIR(DSIR):
    method __init__ (line 57) | def __init__(self,
    method featurizer (line 116) | def featurizer(self, text: str) -> np.ndarray:
    method importance_estimator (line 119) | def importance_estimator(self, features: np.ndarray) -> Union[float, n...
    method get_perexample_metadata (line 122) | def get_perexample_metadata(self, ex: Dict, features: np.ndarray) -> int:
    method perexample_metadata_filter (line 127) | def perexample_metadata_filter(self, concat_metadata: np.ndarray) -> n...
    method _fit_bow (line 131) | def _fit_bow(self,
    method fit_importance_estimator (line 168) | def fit_importance_estimator(self, num_tokens_to_fit: Union[str, int] ...

FILE: data_selection/utils.py
  function parallelize (line 5) | def parallelize(fn: Callable, args: List[Any], num_proc: int):

FILE: experimental/data_selection/dsir_general/data_selection.py
  function compute_ngrams_raw (line 16) | def compute_ngrams_raw(args, in_path: str, cache_path: Path):
  function compute_importance_weights (line 41) | def compute_importance_weights(args, in_path: str):
  function resample (line 70) | def resample(args, data_files, cache_ds_dir, streaming=False):

FILE: experimental/data_selection/dsir_general/utils.py
  function hash_buckets (line 15) | def hash_buckets(text, num_buckets=1e4):
  function get_ngram_info (line 22) | def get_ngram_info(line, n=2, num_buckets=10000):
  function linecount (line 34) | def linecount(filename):
  function transform_text (line 44) | def transform_text(text):
  function repeating_filter (line 48) | def repeating_filter(x_tok, n=1):
  function mostly_uninformative_filter (line 59) | def mostly_uninformative_filter(x_tok):
  function numeric_filter (line 66) | def numeric_filter(x_tok):

FILE: experimental/data_selection/dsir_pipeline.py
  function get_quality_mask (line 64) | def get_quality_mask(quality_scores):
  function hash_buckets (line 77) | def hash_buckets(string, num_buckets=10e4):
  function unigrams_bigrams (line 81) | def unigrams_bigrams(text):
  function get_ngram_info (line 87) | def get_ngram_info(line, n=2, num_buckets=10000):
  function grouper (line 99) | def grouper(iterable, n, *, incomplete='fill', fillvalue=None):
  function compute_ngrams_hf (line 115) | def compute_ngrams_hf(ds_name, ds_dir, cache_dir, ngrams, num_buckets):
  function compute_ngrams_pile (line 141) | def compute_ngrams_pile(
  function compute_importance_weights (line 174) | def compute_importance_weights(
  function compute_domain_idxs (line 198) | def compute_domain_idxs(filter_domains):
  function resample (line 231) | def resample(ds_dir, cache_ds_dir, num_to_retrieve):
  function linecount (line 327) | def linecount(filename):

FILE: experimental/data_selection/heuristic_cls_pipeline.py
  function transform_text (line 25) | def transform_text(text):
  function batch_process (line 29) | def batch_process(e, text_cols, label_col, fixed_label=None):
  function reformat_dataset (line 41) | def reformat_dataset(ds_name, output_dir, cache_dir, num_proc=10, fixed_...
  function replace_label (line 100) | def replace_label(line, label):
  function mix_dataset (line 109) | def mix_dataset(ds_dir, pile_val_dir):
  function prepare_fasttext_dataset (line 159) | def prepare_fasttext_dataset(ds_dir):
  function make_prediction (line 195) | def make_prediction(line, model):
  function process (line 210) | def process(line):
  function make_prediction_chunk (line 214) | def make_prediction_chunk(ds_path, model_path, chunk_idx):
  function predict_chunk (line 230) | def predict_chunk(model_path, ds_dir, chunk_idx):
  function compute_domain_idxs (line 264) | def compute_domain_idxs(filter_domains):
  function retrieve_from_pile (line 296) | def retrieve_from_pile(model_path, num_to_retrieve, ds_dir):

FILE: experimental/glue_eval/read_glue_results.py
  function read_file (line 16) | def read_file(path, task_name):
  function parse_file_name (line 24) | def parse_file_name(name):

FILE: experimental/glue_eval/run_glue.py
  class DataTrainingArguments (line 74) | class DataTrainingArguments:
    method __post_init__ (line 139) | def __post_init__(self):
  class ModelArguments (line 158) | class ModelArguments:
  function main (line 193) | def main():
  function _mp_fn (line 566) | def _mp_fn(index):

FILE: experimental/preprocessing/quality_scores/compute_quality_stats.py
  function transform_text (line 22) | def transform_text(text):
  function length_filter (line 26) | def length_filter(x_tok):
  function repeating_filter (line 30) | def repeating_filter(x_tok):
  function mostly_uninformative_filter (line 38) | def mostly_uninformative_filter(x_tok):
  function numeric_filter (line 45) | def numeric_filter(x_tok):
  function process (line 55) | def process(example):

FILE: experimental/preprocessing/reformat_and_chunk_data.py
  function chunk_examples (line 20) | def chunk_examples(examples, chunk_length=CHUNK_LENGTH):
  function add_id (line 29) | def add_id(examples, idx):
  function main (line 33) | def main(args):

FILE: experimental/train/collator.py
  class DataCollatorForLanguageModeling (line 11) | class DataCollatorForLanguageModeling:
    method __post_init__ (line 18) | def __post_init__(self):
    method __call__ (line 25) | def __call__(
    method mask_tokens (line 46) | def mask_tokens(

FILE: experimental/train/model.py
  class BertForMaskedLM (line 15) | class BertForMaskedLM(BertPreTrainedModel):
    method __init__ (line 20) | def __init__(self, config):
    method get_output_embeddings (line 34) | def get_output_embeddings(self):
    method set_output_embeddings (line 37) | def set_output_embeddings(self, new_embeddings):
    method set_args (line 40) | def set_args(self, args):
    method forward (line 43) | def forward(
  class RobertaForMaskedLM (line 93) | class RobertaForMaskedLM(RobertaPreTrainedModel):
    method __init__ (line 98) | def __init__(self, config):
    method get_output_embeddings (line 112) | def get_output_embeddings(self):
    method set_output_embeddings (line 115) | def set_output_embeddings(self, new_embeddings):
    method set_args (line 118) | def set_args(self, args):
    method forward (line 121) | def forward(

FILE: experimental/train/run_pipeline.py
  function parse_args (line 39) | def parse_args():
  function get_logger (line 236) | def get_logger(args, accelerator=None):
  function get_dataset (line 279) | def get_dataset(args, preprocessed_cache):
  function preprocess (line 294) | def preprocess(args, raw_datasets, tokenizer, logger, preprocessed_cache):
  function get_model (line 352) | def get_model(args, load_model=True):
  function main (line 406) | def main():

FILE: experimental/train/trainer.py
  class PretrainTrainer (line 12) | class PretrainTrainer:
    method __init__ (line 14) | def __init__(self,
    method _move_to_device (line 40) | def _move_to_device(self, batch):
    method _save_model (line 45) | def _save_model(self, save_path=None):
    method _save_trained (line 53) | def _save_trained(self, save_path=None):
    method evaluate (line 69) | def evaluate(self):
    method _get_batch (line 72) | def _get_batch(self):
    method compute_loss (line 85) | def compute_loss(self):
    method _prepare_from_checkpoint (line 94) | def _prepare_from_checkpoint(self):
    method update (line 116) | def update(self, tr_loss, loss_step):
    method train (line 149) | def train(self):

FILE: tests/test_hashed_ngram.py
  function parse_example_fn (line 16) | def parse_example_fn(ex):
  function dsir_obj (line 21) | def dsir_obj():
  function dsir_obj_diffparams (line 39) | def dsir_obj_diffparams():
  function dsir_obj_septarget (line 57) | def dsir_obj_septarget():
  function test_hash_buckets (line 76) | def test_hash_buckets():
  function test_get_ngram_counts (line 84) | def test_get_ngram_counts():
  function test_virtual_shards (line 98) | def test_virtual_shards(dsir_obj):
  function test_length_metadata (line 102) | def test_length_metadata(dsir_obj):
  function test_fit (line 109) | def test_fit(dsir_obj):
  function test_compute (line 150) | def test_compute(dsir_obj):
  function test_resample (line 160) | def test_resample(dsir_obj):
  function test_resample_diffparams (line 188) | def test_resample_diffparams(dsir_obj_diffparams):
  function test_resample_septarget (line 218) | def test_resample_septarget(dsir_obj_septarget):
  function test_resample_virtual_sharding (line 254) | def test_resample_virtual_sharding():
  function test_smoothing (line 295) | def test_smoothing(dsir_obj):
  function test_save_load (line 336) | def test_save_load(dsir_obj):

FILE: tests/test_utils.py
  function job (line 4) | def job(arg):
  function test_parallelize (line 8) | def test_parallelize():

Download .json

Condensed preview — 51 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (1,008K chars).

[
  {
    "path": ".gitignore",
    "chars": 1869,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": "LICENSE",
    "chars": 1073,
    "preview": "MIT License\n\nCopyright (c) 2023 Sang Michael Xie\n\nPermission is hereby granted, free of charge, to any person obtaining "
  },
  {
    "path": "README.md",
    "chars": 7978,
    "preview": "# Data Selection for Language Models via Importance Resampling (DSIR)\n[![License: MIT](https://img.shields.io/badge/Lice"
  },
  {
    "path": "data_selection/README.md",
    "chars": 4497,
    "preview": "# Usage\nIn general, DSIR aims to select data from the raw dataset that matches the feature distribution of the target da"
  },
  {
    "path": "data_selection/__init__.py",
    "chars": 246,
    "preview": "try:\n    import importlib.metadata as importlib_metadata\nexcept ModuleNotFoundError:\n    import importlib_metadata\n\n__ve"
  },
  {
    "path": "data_selection/base.py",
    "chars": 14042,
    "preview": "# base DSIR class\nimport os\nfrom typing import List, Optional, Dict, Callable, Iterable, Union\nimport multiprocessing as"
  },
  {
    "path": "data_selection/hashed_ngram_dsir.py",
    "chars": 9821,
    "preview": "from typing import List, Optional, Dict, Callable, Union, Iterable\nimport hashlib\nfrom tqdm import tqdm\nfrom nltk.tokeni"
  },
  {
    "path": "data_selection/utils.py",
    "chars": 212,
    "preview": "from typing import Callable, List, Any\nfrom joblib import Parallel, delayed\n\n\ndef parallelize(fn: Callable, args: List[A"
  },
  {
    "path": "experimental/README.md",
    "chars": 9266,
    "preview": "# Code for the DSIR paper\nThis directory has the code for preprocessing, data selection, pretraining, and fine-tuning fo"
  },
  {
    "path": "experimental/config.sh",
    "chars": 365,
    "preview": "#!/bin/bash\n\nCACHE='/path/to/cachedir'\nROOT_DIR='/path/to/dsir/experimental'\nVIRTUAL_ENV='/path/to/.env'\nPILE_PATH='/pat"
  },
  {
    "path": "experimental/data_selection/dsir_general/data_selection.py",
    "chars": 8007,
    "preview": "from pathlib import Path\nimport argparse\nimport random\nimport shutil\nfrom json import loads, dumps\nfrom tqdm import tqdm"
  },
  {
    "path": "experimental/data_selection/dsir_general/run_data_selection.py",
    "chars": 146,
    "preview": "import subprocess\nimport os\n\nenvironment = dict(os.environ, PYTHONHASHSEED='42')\nsubprocess.run([\"python\", \"data_selecti"
  },
  {
    "path": "experimental/data_selection/dsir_general/utils.py",
    "chars": 1784,
    "preview": "from nltk.tokenize import WordPunctTokenizer\nfrom nltk import ngrams\nfrom nltk.tokenize import word_tokenize\nfrom collec"
  },
  {
    "path": "experimental/data_selection/dsir_pipeline.py",
    "chars": 16500,
    "preview": "from pathlib import Path\nfrom itertools import zip_longest\nimport random\nimport argparse\nimport json\nimport shutil\nfrom "
  },
  {
    "path": "experimental/data_selection/heuristic_cls_pipeline.py",
    "chars": 21574,
    "preview": "from pathlib import Path\nimport os\nimport random\nimport argparse\nimport json\nimport shutil\nfrom multiprocessing import P"
  },
  {
    "path": "experimental/data_selection/run_cmds.sh",
    "chars": 507,
    "preview": "#!/bin/bash\nnum_buckets=10000\nngrams=2\n\n#### target = books and wiki.\n#### for the wiki_and_books target, we add 4M addi"
  },
  {
    "path": "experimental/data_selection/run_dsir.sh",
    "chars": 1398,
    "preview": "#!/bin/bash\nsource config.sh\n\n\ntask=$1\nrun_name=$2\nother_args=$3\nnum_to_retrieve=${4:-25000000}\n\nLOGDIR=logs/data_select"
  },
  {
    "path": "experimental/data_selection/run_dsir_helper.sh",
    "chars": 620,
    "preview": "#!/bin/bash\n\nset -x\n\nsource config.sh\n\nsource ${VIRTUAL_ENV}/bin/activate\nmkdir -p $CACHE\nexport HF_HOME=$CACHE\nexport T"
  },
  {
    "path": "experimental/data_selection/run_heuristic_cls.sh",
    "chars": 1483,
    "preview": "#!/bin/bash\nset -x\n\nsource config.sh\n\nsource ${VIRTUAL_ENV}/bin/activate\n\ntask=$1\nrun_name=$2\nother_args=$3\nnum_to_retri"
  },
  {
    "path": "experimental/data_selection/run_heuristic_cls_helper.sh",
    "chars": 472,
    "preview": "#!/bin/bash\nset -x\nsource config.sh\n\nmkdir -p $CACHE\nexport HF_HOME=$CACHE\nexport TRANSFORMERS_CACHE=$CACHE\nexport HF_DA"
  },
  {
    "path": "experimental/glue_eval/read_glue_results.py",
    "chars": 2623,
    "preview": "import pandas as pd\nfrom pathlib import Path\nfrom collections import defaultdict\nimport json\nimport subprocess\n\ntask_to_"
  },
  {
    "path": "experimental/glue_eval/run_eval_exps.sh",
    "chars": 126,
    "preview": "#!/bin/bash\n\n# example\n\nbash glue_eval/run_glue_dist.sh \\\n    /path/to/trained/checkpoint \\\n    \"bert-base-uncased\" \\\n  "
  },
  {
    "path": "experimental/glue_eval/run_glue.py",
    "chars": 25096,
    "preview": "#!/usr/bin/env python\n# coding=utf-8\n# Copyright 2020 The HuggingFace Inc. team. All rights reserved.\n#\n# Licensed under"
  },
  {
    "path": "experimental/glue_eval/run_glue_dist.sh",
    "chars": 2921,
    "preview": "#!/bin/bash\nsource config.sh\n\ngpus=1\nmem=8G\ncpus=2\n\nPRETRAINED_PATH=$1\nBASENAME=${2:-\"bert-base-uncased\"}\nMAX_LEN=${3:-5"
  },
  {
    "path": "experimental/glue_eval/run_glue_for_seed_task.sh",
    "chars": 1684,
    "preview": "#!/bin/bash\nset -x\n\nPRETRAINED_PATH=$1\nTASK_NAME=$2\nSEED=$3\nEPOCHS=$4\nLR=$5\nBATCH_SIZE=$6\nCACHE=$7\nBASENAME=${8:-\"bert-b"
  },
  {
    "path": "experimental/preprocessing/quality_scores/compute_quality_stats.py",
    "chars": 2824,
    "preview": "from pathlib import Path\nimport json\nimport numpy as np\nimport os\nfrom itertools import islice\nfrom joblib import Parall"
  },
  {
    "path": "experimental/preprocessing/quality_scores/merge_quality_scores.py",
    "chars": 910,
    "preview": "import subprocess\nimport numpy as np\nimport os\n\n\nimport argparse\nparser = argparse.ArgumentParser(description='merge qua"
  },
  {
    "path": "experimental/preprocessing/quality_scores/run_merge_quality_scores.sh",
    "chars": 87,
    "preview": "#!/bin/bash\n\nsource config.sh\n\npython merge_quality_scores.py --pile_path ${PILE_PATH}\n"
  },
  {
    "path": "experimental/preprocessing/quality_scores/run_quality_stats.sh",
    "chars": 107,
    "preview": "#!/bin/bash\n\nDS_PATH=$1\n\npython preprocessing/quality_scores/compute_quality_stats.py --ds_path ${DS_PATH}\n"
  },
  {
    "path": "experimental/preprocessing/quality_scores/run_slurm_quality_stats.sh",
    "chars": 754,
    "preview": "#!/bin/bash\n\nsource config.sh\n\nLOGDIR=logs/preprocessing/qualitystats\nmkdir -p ${LOGDIR}\n\nfor SUBSET in 01 02 03 04 05 0"
  },
  {
    "path": "experimental/preprocessing/reformat_and_chunk_data.py",
    "chars": 2020,
    "preview": "import json\nimport numpy as np\nfrom datasets import load_dataset\nfrom argparse import ArgumentParser\nimport os\nfrom path"
  },
  {
    "path": "experimental/preprocessing/run.sh",
    "chars": 312,
    "preview": "#!/bin/bash\nsource config.sh\n\nsource ${VIRTUAL_ENV}/bin/activate\n\nmkdir -p $CACHE\nexport HF_HOME=$CACHE\nexport TRANSFORM"
  },
  {
    "path": "experimental/preprocessing/run_slurm.sh",
    "chars": 922,
    "preview": "#!/bin/bash\n\nsource config.sh\n\nLOGDIR=logs/preprocess\nmkdir -p ${LOGDIR}\n\nfor SUBSET in 01 02 03 04 05 06 07 08 09 10 11"
  },
  {
    "path": "experimental/requirements.txt",
    "chars": 48,
    "preview": "datasets\npandas\ntqdm\nnumpy\njoblib\nnltk\nfasttext\n"
  },
  {
    "path": "experimental/train/accelerate_config.yaml",
    "chars": 198,
    "preview": "compute_environment: LOCAL_MACHINE\ndistributed_type: MULTI_GPU\nfp16: true\nmachine_rank: 0\nmain_process_ip: null\nmain_pro"
  },
  {
    "path": "experimental/train/collator.py",
    "chars": 3704,
    "preview": "import random\nimport warnings\nfrom dataclasses import dataclass\nfrom typing import Any, Callable, Dict, List, NewType, O"
  },
  {
    "path": "experimental/train/model.py",
    "chars": 5670,
    "preview": "import logging\nimport torch\nfrom torch import nn\nfrom torch.nn import CrossEntropyLoss\n\nfrom transformers import BertPre"
  },
  {
    "path": "experimental/train/preprocess_general.sh",
    "chars": 825,
    "preview": "#!/bin/bash\nsource config.sh\n\nsource ${VIRTUAL_ENV}/bin/activate\n\nTASK=$1\nPRETRAIN_DATA_PATH=$2\nCACHE=$3\nOTHER_ARGS=$4\nM"
  },
  {
    "path": "experimental/train/pretrain_general.sh",
    "chars": 1451,
    "preview": "#!/bin/bash\nsource config.sh\n\nsource ${VIRTUAL_ENV}/bin/activate\n\n\nTASK=$1\nPRETRAIN_DATA_PATH=$2\nCUDA_DEVICES=$3\nNUM_GPU"
  },
  {
    "path": "experimental/train/requirements.txt",
    "chars": 114,
    "preview": "accelerate==0.10.0\npandas==1.1.5\ntransformers==4.12.3\nnltk>=3.6.2\ntorch==1.10.1\ndatasets\nwandb\nscipy\nscikit-learn\n"
  },
  {
    "path": "experimental/train/run_pipeline.py",
    "chars": 17281,
    "preview": "# Adapted from https://github.com/yaoxingcheng/TLM and HuggingFace example\n\nimport argparse\nimport logging\nimport math\ni"
  },
  {
    "path": "experimental/train/run_pretrain_pipeline_general.sh",
    "chars": 1148,
    "preview": "#!/bin/bash\nset -x\n\nsource config.sh\n\nTASK=$1\nSUFFIX=$2\nPORT=$3\nPRETRAIN_DATA_PATH=$4\nDO_PREPROCESS=${5:-\"true\"}\nDO_PRET"
  },
  {
    "path": "experimental/train/run_slurm.sh",
    "chars": 363,
    "preview": "\n# Train general-domain model from scratch\n# We first try learning rate 1e-3, then lower to 8e-4 if the loss diverges\n\nb"
  },
  {
    "path": "experimental/train/trainer.py",
    "chars": 7619,
    "preview": "import os\nimport shutil\nimport json\nimport torch\nimport math\nimport numpy as np\nfrom tqdm import tqdm\nimport wandb\nfrom "
  },
  {
    "path": "pyproject.toml",
    "chars": 819,
    "preview": "[build-system]\nrequires      = [\"setuptools>=61.0.0\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname ="
  },
  {
    "path": "setup.py",
    "chars": 840,
    "preview": "from setuptools import setup, find_packages\n\nfrom pathlib import Path\nif __name__ == \"__main__\":\n    # parse version fro"
  },
  {
    "path": "tests/test_hashed_ngram.py",
    "chars": 11699,
    "preview": "import pytest\nfrom pathlib import Path\nimport numpy as np\nimport json\nimport shutil\n\nfrom data_selection.hashed_ngram_ds"
  },
  {
    "path": "tests/test_utils.py",
    "chars": 209,
    "preview": "from data_selection import utils\n\n\ndef job(arg):\n    return arg + 1\n\n\ndef test_parallelize():\n\n    results = utils.paral"
  },
  {
    "path": "tests/toy_pile_data.jsonl",
    "chars": 766517,
    "preview": "{\"contents\":\"\\/*\\n * Copyright (c) 2015 Kaprica Security, Inc.\\n *\\n * Permission is hereby granted, free of charge, to "
  },
  {
    "path": "tests/toy_target_data.jsonl",
    "chars": 855,
    "preview": "{\"contents\":\"\\/*\\n * Copyright (c) 2015 Kaprica Security, Inc.\\n *\\n * Permission is hereby granted, free of charge, to "
  },
  {
    "path": "tests/toy_target_data_2.jsonl",
    "chars": 897,
    "preview": "{\"contents\":\"New Grub Street\\n\\nNew Grub Street is a novel by George Gissing published in 1891, which is set in the lite"
  }
]

About this extraction

This page contains the full source code of the p-lambda/dsir GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 51 files (939.9 KB), approximately 243.3k tokens, and a symbol index with 126 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo