Full Code of huggingface/dataspeech for AI

main aecbcfde51bf cached
44 files
160.4 KB
40.0k tokens
37 symbols
1 requests
Download .txt
Repository: huggingface/dataspeech
Branch: main
Commit: aecbcfde51bf
Files: 44
Total size: 160.4 KB

Directory structure:
gitextract_4p4dwrk2/

├── .gitignore
├── LICENSE
├── README.md
├── dataspeech/
│   ├── __init__.py
│   ├── cpu_enrichments/
│   │   ├── __init__.py
│   │   └── rate.py
│   └── gpu_enrichments/
│       ├── __init__.py
│       ├── pitch.py
│       ├── snr_and_reverb.py
│       └── squim.py
├── examples/
│   ├── prompt_creation/
│   │   ├── run_prompt_creation_10k.sh
│   │   ├── run_prompt_creation_1k.sh
│   │   ├── run_prompt_creation_1k_with_speaker_consistency.sh
│   │   ├── run_prompt_creation_45k.sh
│   │   ├── run_prompt_creation_dummy.sh
│   │   ├── run_prompt_creation_jenny.sh
│   │   └── speaker_ids_to_names.json
│   ├── prompt_creation_llm_swarm/
│   │   ├── nginx.template.conf
│   │   ├── run_prompt_creation_10k.sh
│   │   ├── run_prompt_creation_1k.sh
│   │   ├── run_prompt_creation_dummy.sh
│   │   ├── run_prompt_creation_full_mls.sh
│   │   └── tgi_h100.template.slurm
│   ├── tagging/
│   │   ├── run_main_10k.sh
│   │   ├── run_main_1k.sh
│   │   ├── run_main_45k.sh
│   │   └── run_main_dummy.sh
│   └── tags_to_annotations/
│       ├── run_metadata_to_text_10k.sh
│       ├── run_metadata_to_text_10k_v02.sh
│       ├── run_metadata_to_text_for_finetuning.sh
│       ├── v01_bin_edges.json
│       ├── v01_text_bins.json
│       ├── v02_bin_edges.json
│       └── v02_text_bins.json
├── main.py
├── requirements.txt
└── scripts/
    ├── filter_audio_separation.py
    ├── merge_audio_to_metadata.py
    ├── metadata_to_text.py
    ├── per_dataset_script/
    │   ├── add_gender_to_MLS.py
    │   ├── add_gender_to_libritts_r.py
    │   └── clean_libritts_r.py
    ├── run_prompt_creation.py
    └── run_prompt_creation_llm_swarm.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
artefacts/*
env_dataspeech/*
**/__pycache__/*
wip_scripts/*
plots/*
.vscode/*

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2024 The Hugging Face team.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Data-Speech

Data-Speech is a suite of utility scripts designed to tag speech datasets. 

Its aim is to provide a simple, clean codebase for applying audio transformations (or annotations) that may be requested as part of the development of speech-based AI models, such as text-to-speech engines.

Its primary use is to reproduce the annotation method from Dan Lyth and Simon King's research paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912), that labels various speaker characteristics with natural language descriptions.

Applying these tools allows us to prepare and release tagged versions of [LibriTTS-R](https://huggingface.co/datasets/parler-tts/libritts-r-filtered-speaker-descriptions), and of [the English version of MLS](https://huggingface.co/datasets/parler-tts/mls-eng-speaker-descriptions).

This repository is designed to accompany the [Parler-TTS library](https://github.com/huggingface/parler-tts), which contains the inference and training code for Parler-TTS, a new family of high-quality text-to-speech models.

---------

## 📖 Quick Index
* [Requirements](#set-up)
* [Annotating datasets to fine-tune Parler-TTS](#annotating-datasets-to-fine-tune-parler-tts)
* [Annotating datasets from scratch](#annotating-datasets-from-scratch)
* [Using Data-Speech to filter your speech datasets](#using-data-speech-to-filter-your-speech-datasets)
* [❓ FAQ](#faq)
* [Logs](#logs)


## Set-up

You first need to clone this repository before installing requirements.

```sh
git clone git@github.com:huggingface/dataspeech.git
cd dataspeech
pip install -r requirements.txt
```

## Annotating datasets to fine-tune Parler-TTS

In the following examples, we'll load 30 hours of audio data from the [Jenny TTS dataset](https://github.com/dioco-group/jenny-tts-dataset), a high-quality mono-speaker TTS dataset, from an Irish female speaker named Jenny.

The aim here is to create an annotated version of Jenny TTS, in order to fine-tune the [Parler-TTS v1 checkpoint](https://huggingface.co/parler-tts/parler-tts-mini-v1) on this dataset.

Thanks to a [script similar to what's described in the FAQ](#how-do-i-use-datasets-that-i-have-with-this-repository), we've uploaded the dataset to the HuggingFace hub, under the name [reach-vb/jenny_tts_dataset](https://huggingface.co/datasets/reach-vb/jenny_tts_dataset).

Feel free to follow the link above to listen to some samples of the Jenny TTS dataset thanks to the hub viewer.

> [!IMPORTANT]
> Refer to the section [Annotating datasets from scratch](#annotating-datasets-from-scratch) for more detailed explanations of what's going on under-the-hood.

We'll:
1. Annotate the Jenny dataset with continuous variables that measures the speech characteristics
2. Map those annotations to text bins that characterize the speech characteristics.
3. Create natural language descriptions from those text bins

### 1. Annotate the Jenny dataset

We'll use [`main.py`](main.py) to get the following continuous variables:
    - Speaking rate `(nb_phonemes / utterance_length)`
    - Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) 
    - Reverberation
    - Speech monotony

```sh
python main.py "reach-vb/jenny_tts_dataset" \
  --configuration "default" \
  --text_column_name "transcription" \
  --audio_column_name "audio" \
  --cpu_num_workers 8 \
  --rename_column \
  --repo_id "jenny-tts-tags-v1" \
  --apply_squim_quality_estimation
```

Note that the script will be faster if you have GPUs at your disposal. It will automatically scale-up to every GPUs available in your environnement.

The resulting dataset will be pushed to the HuggingFace hub under your HuggingFace handle. Mine was pushed to [ylacombe/jenny-tts-tags-v1](https://huggingface.co/datasets/ylacombe/jenny-tts-tags-v1).

### 2. Map annotations to text bins

Since the ultimate goal here is to fine-tune the [Parler-TTS v1 checkpoint](https://huggingface.co/parler-tts/parler-tts-mini-v1) on the Jenny dataset, we want to stay consistent with the text bins of the datasets on which the latter model was trained.

This is easy to do thanks to the following command:

```sh
python ./scripts/metadata_to_text.py \
    "ylacombe/jenny-tts-tags-v1" \
    --repo_id "jenny-tts-tags-v1" \
    --configuration "default" \
    --cpu_num_workers "8" \
    --path_to_bin_edges "./examples/tags_to_annotations/v02_bin_edges.json" \
    --path_to_text_bins "./examples/tags_to_annotations/v02_text_bins.json" \
    --avoid_pitch_computation \
    --apply_squim_quality_estimation
```

Thanks to [`v02_bin_edges.json`](/examples/tags_to_annotations/v02_bin_edges.json), we don't need to recompute bins from scratch and the above script takes a few seconds.

The resulting dataset will be pushed to the HuggingFace hub under your HuggingFace handle. Mine was push to [ylacombe/jenny-tts-tags-v1](https://huggingface.co/datasets/ylacombe/jenny-tts-tags-v1).

You can notice that text bins such as `slightly slowly`, `very monotone` have been added to the samples.

### 3. Create natural language descriptions from those text bins

Now that we have text bins associated to the Jenny dataset, the next step is to create natural language descriptions out of the few created features.

Here, we decided to create prompts that use the name `Jenny`, prompts that'll look like the following:
`In a very expressive voice, Jenny pronounces her words incredibly slowly. There's some background noise in this room with a bit of echo.'`

This step generally demands more resources and times and should use one or many GPUs.

[`run_prompt_creation_jenny.sh`](examples/prompt_creation/run_prompt_creation_jenny.sh) indicates how to run it on the Jenny dataset:

```sh
python ./scripts/run_prompt_creation.py \
  --speaker_name "Jenny" \
  --is_single_speaker \
  --is_new_speaker_prompt \
  --dataset_name "ylacombe/jenny-tts-tags-v1" \
  --dataset_config_name "default" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 128 \
  --attn_implementation "sdpa" \
  --output_dir "./tmp_jenny" \
  --load_in_4bit \
  --push_to_hub \
  --hub_dataset_id "jenny-tts-tagged-v1" \
  --preprocessing_num_workers 24 \
  --dataloader_num_workers 24
```

As usual, we precise the dataset name and configuration we want to annotate. `model_name_or_path` should point to a `transformers` model for prompt annotation. You can find a list of such models [here](https://huggingface.co/models?pipeline_tag=text-generation&library=transformers&sort=trending). Here, we used a version of Mistral's 7B model.

> [!NOTE]
> If you want to use this on a multi-speaker dataset, you'll have to adapt the logic of the script. First, you need to remove the `--is_single_speaker` and `--speaker_name "Jenny"` flags.
> 
> Then, there's two cases:
> 1. In case you want to associate names to some speakers, you need to pass the speaker id column name, and a JSON file which maps the speaker ids to these names. For example, `--speaker_id_column "speaker_id" --speaker_ids_to_name_json ./examples/prompt_creation/speaker_ids_to_names.json`. Feel free to take a look at [speaker_ids_to_names.json](examples/prompt_creation/speaker_ids_to_names.json) to get inspiration.
> 2. In case you don't want to associate names to speakers, you don't have to do anything else. 


## Annotating datasets from scratch

In the following examples, we'll load 1,000 hours of labelled audio data from the [LibriTTS-R dataset](https://huggingface.co/datasets/blabble-io/libritts_r) and add annotations using the dataspeech library. The resulting dataset is complete with discrete annotation tags, as well as a coherent audio
description of the spoken audio characteristics.


There are 3 steps to be completed in order to generate annotations:
1. [Annotate the speech dataset](#predict-annotations) to get the following continuous variables:
    - Speaking rate `(nb_phonemes / utterance_length)`
    - Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) 
    - Reverberation
    - Speech monotony
2. [Map the previous annotations categorical to discrete keywords bins](#map-continuous-annotations-to-key-words)
3. [Create natural language descriptions from a set of keywords](#generate-natural-language-descriptions)


### 1. Predict annotations

For the time being, [`main.py`](main.py) can be used to generate speaking rate, SNR, reverberation, PESQ, SI-SDR and pitch estimation. 

To use it, you need a dataset from the [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index) library, either locally or on the [hub](https://huggingface.co/datasets).


```sh
python main.py "blabble-io/libritts_r" \
  --configuration "dev" \
  --output_dir ./tmp_libritts_r_dev/ \
  --text_column_name "text_normalized" \
  --audio_column_name "audio" \
  --cpu_num_workers 8 \
  --rename_column \
  --apply_squim_quality_estimation
```

Here, we've used 8 processes for operations that don't use GPUs, namely to compute the speaking rate. If GPUs were present in the environnement, the operations that can be computed on GPUs - namely pitch, SNR and reverberation estimation - will use every GPUs available in the environnement.

You can learn more about the arguments you can pass to `main.py` by passing:

```sh
python main.py --help
```

In [`/examples/tagging/run_main_1k.sh`](/examples/tagging/run_main_1k.sh), we scaled up the initial command line to the whole dataset. Note that we've used the `repo_id` argument to push the dataset to the hub, resulting in [this dataset](https://huggingface.co/datasets/ylacombe/libritts-r-text-tags-v3).

The dataset viewer gives an idea of what has been done, namely:
- new columns were added:
    - `utterance_pitch_std`: Gives a measure of the standard deviation of pitch in the utterance.
    - `utterance_pitch_mean`: Gives a measure of average pitch in the utterance.
    - `snr`: Speech-to-noise ratio
    - `c50`: Reverberation estimation
    - `speaking_rate`
    - `phonemes`: which was used to compute the speaking rate
    - `pesq` and `si-sdr`: which measure intelligibility and a proxy of noise, as indicated [here](https://pytorch.org/audio/main/tutorials/squim_tutorial.html)
- the audio column was removed - this is especially useful when dealing with big datasets, as writing and pushing audio data can become a bottleneck.

![image](https://github.com/ylacombe/dataspeech/assets/52246514/f422a728-f2af-4c8f-bf2a-65c6722bc0c6)


### 2. Map continuous annotations to key-words

The next step is to map the continuous annotations from the previous steps to key-words. To do so, continous annotations are mapped to categorical bins that are then associated to key-words. For example, the speaking rate can be associated to 7 text bins which are: `"very slowly", "quite slowly", "slightly slowly", "moderate speed", "slightly fast", "quite fast", "very fast"`.

[`scripts/metadata_to_text.py`](/scripts/metadata_to_text.py) computes bins on aggregated statistics from multiple datasets:
- A speaker's pitch is calculated by averaging the pitches across its voice clips. The computed pitch estimator is then compared to speakers of the same gender to derive the pitch keyword of the speaker(very high-pitched to very low-pitched).
- The rest of the keywords are derived by [computing histograms](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html) of the continuous variables over all training samples, from which the extreme values have been eliminated, and associating a keyword with each bin.

```sh
python ./scripts/metadata_to_text.py "ylacombe/libritts-r-text-tags-v3+ylacombe/libritts-r-text-tags-v3" \
--configuration "clean+other" \
--output_dir "./tmp_tts_clean+./tmp_tts_other" \
--cpu_num_workers "8" \
--leading_split_for_bins "train" \
--plot_directory "./plots/" \
--path_to_text_bins "./examples/tags_to_annotations/v02_text_bins.json" \
--apply_squim_quality_estimation \
```
Note how we've been able to pass different datasets with different configurations by separating the relevant arguments with `"+"`.

By passing `--repo_id parler-tts/libritts-r-tags-and-text+parler-tts/libritts-r-tags-and-text`, we pushed the resulting dataset to [this hub repository](https://huggingface.co/datasets/parler-tts/libritts-r-tags-and-text).

Note that this step is a bit more subtle than the previous one, as we generally want to collect a wide variety of speech data to compute accurate key-words. 

Indeed, some datasets, such as LibriTTS-R, collect data from only one or a few sources; for LibriTTS-R, these are audiobooks, and the process of collecting or processing the data can result in homogeneous data that has little variation. In the case of LibriTTS-R, the data has been cleaned to have little noise, little reverberation, and the audiobooks collected leaves little variety in intonation.

You can learn more about the arguments you can pass to `main.py` by passing:

```sh
python main.py --help
```

### 3. Generate natural language descriptions

Now that we have text bins associated to our datasets, the next step is to create natural language descriptions. To 
achieve this, we pass the discrete features to an LLM, and have it generate a natural language description. This step 
generally demands more resources and times and should use one or many GPUs. It can be performed in one of two ways:
1. Using the [Accelerate](https://huggingface.co/docs/accelerate/index)-based script, [`scripts/run_prompt_creation.py`](/scripts/run_prompt_creation.py), or
2. Using the [TGI](https://huggingface.co/docs/text-generation-inference/en/index)-based script, [`scripts/run_prompt_creation_llm_swarm.py`](/scripts/run_prompt_creation_llm_swarm.py)

We recommend you first try the Accelerate script, since it makes no assumptions about the GPU hardware available and is 
thus easier to run. Should you need faster inference, you can switch to the TGI script, which assumes you have a SLURM 
cluster with Docker support.

### 3.1 Accelerate Inference

[`scripts/run_prompt_creation.py`](/scripts/run_prompt_creation.py) relies on [`accelerate`](https://huggingface.co/docs/accelerate/index) and [`transformers`](https://huggingface.co/docs/transformers/index) to generate natural language descriptions from LLMs. 

[`examples/prompt_creation/run_prompt_creation_1k.sh`](examples/prompt_creation/run_prompt_creation_1k.sh) indicates how to run it on LibriTTS-R
with 8 GPUs in half-precision:

```sh
accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "parler-tts/libritts-r-tags-and-text" \
  --dataset_config_name "clean" \
  --model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --torch_compile \
  --dataloader_num_workers 4 \
  --output_dir "./" \
  --load_in_4bit \
  --push_to_hub \
  --hub_dataset_id "parler-tts/libritts-r-tags-and-text-generated" \
  --is_new_speaker_prompt \
```

As usual, we define the dataset name and configuration we want to annotate. `model_name_or_path` should point to a `transformers` model for prompt annotation. You can find a list of such models [here](https://huggingface.co/models?pipeline_tag=text-generation&library=transformers&sort=trending). Here, we used an instruction-tuned version of Meta's LLaMA-3 8B model. Should you use LLaMA or Gemma, you can enable torch compile with the flag `--torch_compile` for up to 1.5x faster inference.

The folder [`examples/prompt_creation/`](examples/prompt_creation/) contains more examples. 

In particular, (`run_prompt_creation_1k_with_speaker_consistency.sh`)[examples/prompt_creation/run_prompt_creation_1k_with_speaker_consistency.sh] adapts the previous example but introduces speaker consistency. Here, "speaker consistency" simply means associating certain speakers with specific names. In this case, all descriptions linked to these speakers will specify their names, rather than generating anonymous descriptions.


> [!TIP]
> Scripts from this library can also be used as a starting point for applying other models to other datasets from the [datasets library](https://huggingface.co/docs/datasets/v2.17.0/en/index) in a large-scale settings.
> 
> For example, `scripts/run_prompt_creation.py` can be adapted to perform large-scaled inference using other LLMs and prompts.

### 3.2 TGI Inference

[`scripts/run_prompt_creation_llm_swarm.py`](/scripts/run_prompt_creation_llm_swarm.py) relies on [TGI](https://huggingface.co/docs/text-generation-inference/en/index) 
and [LLM-Swarm](https://github.com/huggingface/llm-swarm/tree/main) to generate descriptions from an LLM endpoint.
Compared to the Accelerate script, it uses continuous-batching, which improves throughput by up to 1.5x. It requires one 
extra dependency, LLM-Swarm:

```sh
pip install git+https://github.com/huggingface/llm-swarm.git
```

[`examples/prompt_creation_llm_swarm/run_prompt_creation_1k.sh`](examples/prompt_creation_llm_swarm/run_prompt_creation_1k.sh) indicates how to run it on LibriTTS-R
with 1 TGI instance:

```sh
python run_prompt_creation_llm_swarm.py \
  --dataset_name "stable-speech/libritts-r-tags-and-text" \
  --dataset_config_name "clean" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --num_instances "1" \
  --output_dir "./" \
  --push_to_hub \
  --hub_dataset_id "parler-tts/libritts-r-tags-and-text-generated"
```

Note that the script relies on the SLURM file [`examples/prompt_creation_llm_swarm/tgi_h100.template.slurm`](examples/prompt_creation_llm_swarm/tgi_h100.template.slurm),
which is a template configuration for the Hugging Face H100 cluster. You can update the config based on your cluster.

### To conclude

In the [`/examples`](/examples/) folder, we applied this recipe to both [MLS Eng](https://huggingface.co/datasets/parler-tts/mls-eng-speaker-descriptions) and [LibriTTS-R](https://huggingface.co/datasets/parler-tts/libritts-r-filtered-speaker-descriptions). The resulting datasets were used to train [Parler-TTS](https://github.com/huggingface/parler-tts), a new text-to-speech model.

This recipe is both scalable and easily modifiable and will hopefully help the TTS research community explore new ways of conditionning speech synthesis. 

## Using Data-Speech to filter your speech datasets

While the rest of the README explains how to use this repository to create text descriptions of speech utterances, Data-Speech can also be used to perform filtering on speech datasets.

For example, you can
1. Use the [`Predict annotations`](#1-predict-annotations) step to predict SNR and reverberation.
2. Filter your data sets to retain only the most qualitative samples.

You could also, to give more examples, filter on a certain pitch level (e.g only low-pitched voices), or a certain speech rate (e.g only fast speech).

## FAQ

### What kind of datasets do I need?

We rely on the [`datasets`](https://huggingface.co/docs/datasets/v2.17.0/en/index) library, which is optimized for speed and efficiency, and is deeply integrated with the [HuggingFace Hub](https://huggingface.co/datasets) which allows easy sharing and loading.

In order to use this repository, you need a speech dataset from [`datasets`](https://huggingface.co/docs/datasets/v2.17.0/en/index) with at least one audio column and a text transcription column. Additionally, you also need a gender and a speaker id column, especially if you want to compute pitch.

### How do I use datasets that I have with this repository?

If you have a local dataset, and want to create a dataset from [`datasets`](https://huggingface.co/docs/datasets/v2.17.0/en/index) to use Data-Speech, you can use the following recipes or refer to the [`dataset` docs](https://huggingface.co/docs/datasets/v2.17.0/en/index) for more complex use-cases.

1. You first need to create a csv file that contains the **full paths** to the audio. The column name for those audio files could be for example `audio`, but you can use whatever you want. You also need a column with the transcriptions of the audio, this column can be named `transcript` but you can use whatever you want.

2. Once you have this csv file, you can load it to a dataset like this:
```python
from datasets import DatasetDict

dataset = DatasetDict.from_csv({"train": PATH_TO_CSV_FILE})
```
3. You then need to convert the audio column name to [`Audio`](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Audio) so that `datasets` understand that it deals with audio files.
```python
from datasets import Audio
dataset = dataset.cast_column("audio", Audio())
```
4. You can then [push the dataset to the hub](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub):
```python
dataset.push_to_hub(REPO_ID)
```

Note that you can make the dataset private by passing [`private=True`](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub.private) to the [`push_to_hub`](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub) method. Find other possible arguments [here](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub).

When using Data-Speech, you can then use `REPO_ID` (replace this by the name you want here and above) as the dataset name.

## Logs


* [August 2024]: Updated version of Data-Speech, suited for Parler-TTS v1
  * New measures: Pesq and SI-SDR, the latter being used for better noise estimation
  * Improved prompts
  * Prompt creation can deal with speaker consistency and accents
* [April 2024]: Release of the first version of Data-Speech 


## Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:
- Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively, for publishing such a promising and clear research paper: [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://arxiv.org/abs/2402.01912).
- and the many libraries used, namely [datasets](https://huggingface.co/docs/datasets/v2.17.0/en/index), [brouhaha](https://github.com/marianne-m/brouhaha-vad/blob/main/README.md), [penn](https://github.com/interactiveaudiolab/penn/blob/master/README.md), [g2p](https://github.com/Kyubyong/g2p), [accelerate](https://huggingface.co/docs/accelerate/en/index) and [transformers](https://huggingface.co/docs/transformers/index).

## Citation

If you found this repository useful, please consider citing this work and also the original Stability AI paper:

```
@misc{lacombe-etal-2024-dataspeech,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Data-Speech},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ylacombe/dataspeech}}
}
```

```
@misc{lyth2024natural,
      title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
      author={Dan Lyth and Simon King},
      year={2024},
      eprint={2402.01912},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```

### TODOs
- [ ] Accent classification training script
- [ ] Accent classification inference script
- [x] Better speaking rate estimation with long silence removal
- [x] Better SNR estimation with other SNR models
- [ ] Add more annotation categories
- [ ] Multilingual speaking rate estimation

- [ ] (long term) Benchmark for best audio dataset format
- [ ] (long term) Compatibility with streaming


================================================
FILE: dataspeech/__init__.py
================================================
from .cpu_enrichments import rate_apply
from .gpu_enrichments import pitch_apply, snr_apply, squim_apply

================================================
FILE: dataspeech/cpu_enrichments/__init__.py
================================================
from .rate import rate_apply



================================================
FILE: dataspeech/cpu_enrichments/rate.py
================================================
from g2p import make_g2p

transducer = make_g2p('eng', 'eng-ipa')

def rate_apply(batch, rank=None, audio_column_name="audio", text_column_name="text"):
    if isinstance(batch[text_column_name], list):  
        speaking_rates = []
        phonemes_list = []
        if "speech_duration" in batch:
            for text, audio_duration in zip(batch[text_column_name], batch["speech_duration"]):
                phonemes = transducer(text).output_string
                audio_duration = audio_duration if audio_duration != 0 else 0.01
                speaking_rate = len(phonemes) / audio_duration
                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)
        else:
            for text, audio in zip(batch[text_column_name], batch[audio_column_name]):
                phonemes = transducer(text).output_string
                
                sample_rate = audio["sampling_rate"]
                audio_length = len(audio["array"].squeeze()) / sample_rate
                
                speaking_rate = len(phonemes) / audio_length

                
                speaking_rates.append(speaking_rate)
                phonemes_list.append(phonemes)
        
        batch["speaking_rate"] = speaking_rates
        batch["phonemes"] = phonemes_list
    else:
        phonemes = transducer(batch[text_column_name]).output_string
        if "speech_duration" in batch:
            audio_length = batch["speech_duration"] if batch["speech_duration"] != 0 else 0.01
        else:
            sample_rate = batch[audio_column_name]["sampling_rate"]
            audio_length = len(batch[audio_column_name]["array"].squeeze()) / sample_rate

        speaking_rate = len(phonemes) / audio_length
        
        batch["speaking_rate"] = speaking_rate
        batch["phonemes"] = phonemes

    return batch

================================================
FILE: dataspeech/gpu_enrichments/__init__.py
================================================
from .pitch import pitch_apply
from .snr_and_reverb import snr_apply
from .squim import squim_apply

================================================
FILE: dataspeech/gpu_enrichments/pitch.py
================================================
import torch 
import penn


# Here we'll use a 10 millisecond hopsize
hopsize = .01

# Provide a sensible frequency range given your domain and model
fmin = 30.
fmax = 1000.

# Select a checkpoint to use for inference. Selecting None will
# download and use FCNF0++ pretrained on MDB-stem-synth and PTDB
checkpoint = None

# Centers frames at hopsize / 2, 3 * hopsize / 2, 5 * hopsize / 2, ...
center = 'half-hop'

# (Optional) Linearly interpolate unvoiced regions below periodicity threshold
interp_unvoiced_at = .065


def pitch_apply(batch, rank=None, audio_column_name="audio", output_column_name="utterance_pitch", penn_batch_size=4096):
    if isinstance(batch[audio_column_name], list):  
        utterance_pitch_mean = []
        utterance_pitch_std = []
        for sample in batch[audio_column_name]:
            # Infer pitch and periodicity
            pitch, periodicity = penn.from_audio(
                torch.tensor(sample["array"][None, :]).float(),
                sample["sampling_rate"],
                hopsize=hopsize,
                fmin=fmin,
                fmax=fmax,
                checkpoint=checkpoint,
                batch_size=penn_batch_size,
                center=center,
                interp_unvoiced_at=interp_unvoiced_at,
                gpu=(rank or 0)% torch.cuda.device_count() if torch.cuda.device_count() > 0 else rank
                )
            
            utterance_pitch_mean.append(pitch.mean().cpu())
            utterance_pitch_std.append(pitch.std().cpu())
            
        batch[f"{output_column_name}_mean"] = utterance_pitch_mean 
        batch[f"{output_column_name}_std"] = utterance_pitch_std 
    else:
        sample = batch[audio_column_name]
        pitch, periodicity = penn.from_audio(
                torch.tensor(sample["array"][None, :]).float(),
                sample["sampling_rate"],
                hopsize=hopsize,
                fmin=fmin,
                fmax=fmax,
                checkpoint=checkpoint,
                batch_size=penn_batch_size,
                center=center,
                interp_unvoiced_at=interp_unvoiced_at,
                gpu=(rank or 0)% torch.cuda.device_count() if torch.cuda.device_count() > 0 else rank
                )        
        batch[f"{output_column_name}_mean"] = pitch.mean().cpu()
        batch[f"{output_column_name}_std"] = pitch.std().cpu()

    return batch


================================================
FILE: dataspeech/gpu_enrichments/snr_and_reverb.py
================================================
from pyannote.audio import Model
from pathlib import Path
from brouhaha.pipeline import RegressiveActivityDetectionPipeline
import torch 
from huggingface_hub import hf_hub_download
import numpy as np

model = None
ratio = 16000/270

def snr_apply(batch, rank=None, audio_column_name="audio", batch_size=32):
    global model
    if model is None:
        model = Model.from_pretrained(
            Path(hf_hub_download(repo_id="ylacombe/brouhaha-best", filename="best.ckpt")),
            strict=False,
        )
    if rank is not None or torch.cuda.device_count() > 0:
        # move the model to the right GPU if not there already
        device = f"cuda:{(rank or 0)% torch.cuda.device_count()}"
        # move to device and create pipeline here because the pipeline moves to the first GPU it finds anyway
        model.to(device)

    pipeline = RegressiveActivityDetectionPipeline(segmentation=model, batch_size = batch_size)
    if rank:
        pipeline.to(torch.device(device))
    
    device = pipeline._models["segmentation"].device

    if isinstance(batch[audio_column_name], list):  
        snr = []
        c50 = []
        vad_durations = []
        for sample in batch[audio_column_name]:
            res = pipeline({"sample_rate": sample["sampling_rate"],
                            "waveform": torch.tensor(sample["array"][None, :]).to(device).float()})
            
            mask = np.full(res["snr"].shape, False)
            for (segment, _) in res["annotation"].itertracks():
                start = int(segment.start * ratio)
                end = int(segment.end * ratio)
                mask[start:end] = True
            mask =  (~((res["snr"] == 0.0) & (res["c50"] == 0.0)) & mask)

            vad_duration = sum(map(lambda x: x[0].duration, res["annotation"].itertracks()))
            
            snr.append(res["snr"][mask].mean())
            c50.append(res["c50"][mask].mean())
            vad_durations.append(np.float32(vad_duration))
        
        # 16ms window
        batch["snr"] = snr
        batch["c50"] = c50
        batch["speech_duration"] = vad_durations
        
    else:
        res = pipeline({"sample_rate": batch[audio_column_name]["sampling_rate"],
                        "waveform": torch.tensor(batch[audio_column_name]["array"][None, :]).to(device).float()})
        
        mask = np.full(res["snr"].shape, False)
        for (segment, _) in res["annotation"].itertracks():
            start = int(segment.start * ratio)
            end = int(segment.end * ratio)
            mask[start:end] = True
        mask =  (~((res["snr"] == 0.0) & (res["c50"] == 0.0)) & mask)

        vad_duration = sum(map(lambda x: x[0].duration, res["annotation"].itertracks()))     
        
        batch["snr"] = res["snr"][mask].mean()
        batch["c50"] = res["c50"][mask].mean()
        batch["speech_duration"] = vad_duration
        
    return batch

================================================
FILE: dataspeech/gpu_enrichments/squim.py
================================================
from torchaudio.pipelines import SQUIM_OBJECTIVE
import torch 
import torchaudio

model = None
max_audio_length = 15 * SQUIM_OBJECTIVE.sample_rate

def squim_apply(batch, rank=None, audio_column_name="audio"):
    global model
    if model is None:
        model = SQUIM_OBJECTIVE.get_model()
    if rank is not None or torch.cuda.device_count() > 0:
        # move the model to the right GPU if not there already
        device = f"cuda:{(rank or 0)% torch.cuda.device_count()}"
        # move to device and create pipeline here because the pipeline moves to the first GPU it finds anyway
        model.to(device)
    else:
        device = "cpu"
    if isinstance(batch[audio_column_name], list):  
        sdr = []
        pesq = []
        stoi = []
        for sample in batch[audio_column_name]:
            waveform = torchaudio.functional.resample(torch.tensor(sample["array"])[None, :].to(device).float(), sample["sampling_rate"], SQUIM_OBJECTIVE.sample_rate)
            with torch.no_grad():
                waveform = waveform[:, :min(max_audio_length, waveform.shape[1])]
                stoi_sample, pesq_sample, sdr_sample = model(waveform)
            sdr.append(sdr_sample.cpu()[0])
            pesq.append(pesq_sample.cpu()[0])
            stoi.append(stoi_sample.cpu()[0])

        batch["sdr"] = sdr
        batch["pesq"] = pesq
        batch["stoi"] = stoi
    else:
    
        waveform = torchaudio.functional.resample(torch.tensor(batch[audio_column_name]["array"][None, :]).to(device).float(), batch[audio_column_name]["sampling_rate"], SQUIM_OBJECTIVE.sample_rate)
        with torch.no_grad():
            stoi_sample, pesq_sample, sdr_sample = model(waveform)
        batch["sdr"] = sdr_sample.cpu()[0]
        batch["pesq"] = pesq_sample.cpu()[0]
        batch["stoi"] = stoi_sample.cpu()[0]
        # TODO
    return batch



================================================
FILE: examples/prompt_creation/run_prompt_creation_10k.sh
================================================
#!/usr/bin/env bash

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "ylacombe/libritts_r_tags_tagged_10k" \
  --dataset_config_name "clean" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./libritts_r_tags_tagged_10k_generated" \
  --load_in_4bit \
  --push_to_hub \
  --hub_dataset_id "parler-tts/libritts_r_tags_tagged_10k_generated"

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "ylacombe/libritts_r_tags_tagged_10k" \
  --dataset_config_name "other" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./libritts_r_tags_tagged_10k_generated" \
  --load_in_4bit \
  --push_to_hub \
  --hub_dataset_id "parler-tts/libritts_r_tags_tagged_10k_generated"

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "ylacombe/mls-eng-10k-tags_tagged_10k" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./mls-eng-10k-tags_tagged_10k_generated" \
  --load_in_4bit \
  --push_to_hub \
  --hub_dataset_id "parler-tts/mls-eng-10k-tags_tagged_10k_generated"


================================================
FILE: examples/prompt_creation/run_prompt_creation_1k.sh
================================================
#!/usr/bin/env bash

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "parler-tts/libritts-r-tags-and-text" \
  --dataset_config_name "clean" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./" \
  --push_to_hub \
  --is_new_speaker_prompt \
  --hub_dataset_id "parler-tts/libritts-r-tags-and-text-generated"

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "parler-tts/libritts-r-tags-and-text" \
  --dataset_config_name "other" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./" \
  --push_to_hub \
  --is_new_speaker_prompt \
  --hub_dataset_id "parler-tts/libritts-r-tags-and-text-generated"


================================================
FILE: examples/prompt_creation/run_prompt_creation_1k_with_speaker_consistency.sh
================================================
#!/usr/bin/env bash

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "parler-tts/libritts-r-tags-and-text" \
  --dataset_config_name "clean" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./" \
  --push_to_hub \
  --is_new_speaker_prompt \
  --speaker_id_column 'speaker_id' \
  --hub_dataset_id "parler-tts/libritts-r-tags-and-text-generated" \
  --speaker_ids_to_name_json ./examples/prompt_creation/speaker_ids_to_names.json \

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "parler-tts/libritts-r-tags-and-text" \
  --dataset_config_name "other" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./" \
  --push_to_hub \
  --is_new_speaker_prompt \
  --speaker_id_column 'speaker_id' \
  --hub_dataset_id "parler-tts/libritts-r-tags-and-text-generated" \
  --speaker_ids_to_name_json ./examples/prompt_creation/speaker_ids_to_names.json \



================================================
FILE: examples/prompt_creation/run_prompt_creation_45k.sh
================================================
#!/usr/bin/env bash

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "ylacombe/libritts-r-text-tags-v4" \
  --dataset_config_name "clean" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./libritts_r_descriptions_clean" \
  --push_to_hub \
  --is_new_speaker_prompt \
  --speaker_id_column 'speaker_id' \
  --speaker_ids_to_name_json ./examples/prompt_creation/speaker_ids_to_names.json \
  --hub_dataset_id "ylacombe/libritts-r-descriptions-10k-v5-without-accents"

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "ylacombe/libritts-r-text-tags-v4" \
  --dataset_config_name "other" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./libritts_r_descriptions_other" \
  --push_to_hub \
  --is_new_speaker_prompt \
  --speaker_id_column 'speaker_id' \
  --speaker_ids_to_name_json ./examples/prompt_creation/speaker_ids_to_names.json \
  --hub_dataset_id "ylacombe/libritts-r-descriptions-10k-v5-without-accents"

accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \
  --dataset_name "ylacombe/mls-eng-text-tags-v5" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 64 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 4 \
  --output_dir "./mls-eng-descriptions" \
  --push_to_hub \
  --is_new_speaker_prompt \
  --speaker_id_column 'speaker_id' \
  --speaker_ids_to_name_json ./examples/prompt_creation/speaker_ids_to_names.json \
  --hub_dataset_id "parler-tts/mls-eng-speaker-descriptions"


================================================
FILE: examples/prompt_creation/run_prompt_creation_dummy.sh
================================================
#!/usr/bin/env bash

python run_prompt_creation.py \
  --dataset_name "ylacombe/libritts_r_tags_and_text" \
  --dataset_config_name "clean" \
  --dataset_split_name "dev.clean" \
  --model_name_or_path "hf-internal-testing/tiny-random-LlamaForCausalLM" \
  --per_device_eval_batch_size 2 \
  --attn_implementation "sdpa" \
  --torch_compile \
  --max_eval_samples 128 \
  --max_new_tokens 4 \
  --dataloader_num_workers 0 \
  --save_steps 32 \
  --save_total_limit 2 \
  --output_dir "./" \
  --do_sample False


================================================
FILE: examples/prompt_creation/run_prompt_creation_jenny.sh
================================================
#!/usr/bin/env bash
python ./scripts/run_prompt_creation.py \
  --speaker_name "Jenny" \
  --is_single_speaker \
  --is_new_speaker_prompt \
  --dataset_name "ylacombe/jenny-tts-tags-v1" \
  --dataset_config_name "default" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --per_device_eval_batch_size 128 \
  --attn_implementation "sdpa" \
  --dataloader_num_workers 8 \
  --output_dir "./tmp_jenny" \
  --load_in_4bit \
  --push_to_hub \
  --hub_dataset_id "jenny-tts-tagged-v1" \
  --preprocessing_num_workers 48 \
  --dataloader_num_workers 24

================================================
FILE: examples/prompt_creation/speaker_ids_to_names.json
================================================
{
    "192": "Brenda",
    "274": "Eileen",
    "392": "Joy",
    "409": "James",
    "412": "Eric",
    "505": "Aaron",
    "887": "Emily",
    "1088": "Laura",
    "1112": "Gary",
    "1355": "Jon",
    "1502": "Lea",
    "1509": "Karen",
    "1646": "Rick",
    "2588": "David",
    "2769": "Jordan",
    "2990": "Mike",
    "3114": "Yann",
    "4195": "Lauren",
    "4297": "Rose",
    "4397": "Will",
    "4719": "Jason",
    "5514": "Naomie",
    "5724": "Alisa",
    "5746": "Patrick",
    "5909": "Jerry",
    "6054": "Tina",
    "6904": "Jenna",
    "6912": "Bill",
    "7190": "Tom",
    "7434": "Carol",
    "7584": "Barbara",
    "7789": "Rebecca",
    "8684": "Anna",
    "8791": "Bruce"
  }

================================================
FILE: examples/prompt_creation_llm_swarm/nginx.template.conf
================================================
events {
    # resolve "worker_connections are not enough while connecting to upstream"
    # https://stackoverflow.com/questions/28265717/worker-connections-are-not-enough
    worker_connections 100000;
}

http {
    upstream mytgi {
        least_conn;
        {{servers}}
    }

    server {
        listen {{port}};

        location / {
            proxy_pass http://mytgi;
            proxy_read_timeout 300s;  # Increase this to 300 seconds (5 minutes)
            proxy_connect_timeout 60s;  # Increase this to 60 seconds (1 minute)
        }
    }
}


# sudo docker run  -p 80:80 --network host -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf nginx
# curl 127.0.0.1:80/generate \
#     -X POST \
#     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
#     -H 'Content-Type: application/json'

================================================
FILE: examples/prompt_creation_llm_swarm/run_prompt_creation_10k.sh
================================================
#!/usr/bin/env bash

python run_prompt_creation_llm_swarm.py \
  --dataset_name "ylacombe/mls-eng-10k-text-tags-v2" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --num_instances "2" \
  --output_dir "./mls-eng-10k-descriptions-v2" \
  --push_to_hub \
  --hub_dataset_id "stable-speech/mls-eng-10k-descriptions-v2"


================================================
FILE: examples/prompt_creation_llm_swarm/run_prompt_creation_1k.sh
================================================
#!/usr/bin/env bash

python run_prompt_creation_llm_swarm.py \
  --dataset_name "stable-speech/libritts-r-tags-and-text" \
  --dataset_config_name "clean" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --output_dir "./"

python run_prompt_creation_llm_swarm.py \
  --dataset_name "stable-speech/libritts-r-tags-and-text" \
  --dataset_config_name "other" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --output_dir "./"


================================================
FILE: examples/prompt_creation_llm_swarm/run_prompt_creation_dummy.sh
================================================
#!/usr/bin/env bash

python run_prompt_creation_llm_swarm.py \
  --dataset_name "stable-speech/libritts-r-tags-and-text" \
  --dataset_config_name "clean" \
  --dataset_split_name "train.clean.100" \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --output_dir "./libritts-r-descriptions"


================================================
FILE: examples/prompt_creation_llm_swarm/run_prompt_creation_full_mls.sh
================================================
#!/usr/bin/env bash

python ./run_prompt_creation_llm_swarm.py \
  --dataset_name 'ylacombe/mls-eng-text-tags-v5' \
  --dataset_config_name 'default' \
  --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
  --num_instances "8" \
  --output_dir "./tmp_mls_with_accent" \
  --push_to_hub \
  --hub_dataset_id 'ylacombe/mls-eng-descriptions-v5' \
  --temperature 1.2 \
  --is_new_speaker_prompt \
  --speaker_id_column 'speaker_id' \
  --speaker_ids_to_name_json ./examples/prompt_creation/speaker_ids_to_names.json \
  --accent_column 'accent'

================================================
FILE: examples/prompt_creation_llm_swarm/tgi_h100.template.slurm
================================================
#!/bin/bash
#SBATCH --job-name=llm-swarm
#SBATCH --partition hopper-prod
#SBATCH --gpus={{gpus}}
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=11G
#SBATCH -o slurm/logs/%x_%j.out

# START EDIT
source ~/.bashrc
VOLUME="/fsx/yoach/.cache"
# END EDIT

export model={{model}}
export revision={{revision}}

function unused_port() {
    N=${1:-1}
    comm -23 \
        <(seq "1025" "65535" | sort) \
        <(ss -Htan |
            awk '{print $4}' |
            cut -d':' -f2 |
            sort -u) |
        shuf |
        head -n "$N"
}
export PORT=$(unused_port)

if [ -z "$HUGGING_FACE_HUB_TOKEN" ]; then
    # try reading from file
    export HUGGING_FACE_HUB_TOKEN=$(cat "${HF_HOME}"/token)
fi

echo "Starting TGI container port $PORT"
echo "http://$(hostname -I | awk '{print $1}'):$PORT" >> {{slurm_hosts_path}}

# unset cache dirs to avoid pyxis having host env var somehow get into the container
unset HF_HUB_CACHE HF_ASSETS_CACHE HF_DATASETS_CACHE HF_MODULES_CACHE
srun --container-image='ghcr.io#huggingface/text-generation-inference:2.0' \
    --container-env=HUGGING_FACE_HUB_TOKEN,PORT \
    --container-mounts="${VOLUME}:/data" \
    --no-container-mount-home \
    --qos normal \
    /usr/local/bin/text-generation-launcher \
    --model-id $model \
    --revision $revision \
    --max-concurrent-requests 2500 \
    --max-total-tokens {{model_max_length}} \
    --max-input-length {{model_input_length}} \
    --max-batch-prefill-tokens {{model_max_length}} \

echo "End of job"

================================================
FILE: examples/tagging/run_main_10k.sh
================================================
#!/usr/bin/env bash

python main.py "blabble-io/libritts_r" \
    --configuration "clean" \
    --output_dir ./tmp_libritts_r_clean/ \
    --text_column_name "text_normalized" \
    --audio_column_name "audio" \
    --cpu_num_workers 32 \
    --num_workers_per_gpu 4 \
    --rename_column \
    --repo_id "ylacombe/libritts_r_tags"\

python main.py "blabble-io/libritts_r" \
    --configuration "other" \
    --output_dir ./tmp_libritts_r_other/ \
    --text_column_name "text_normalized" \
    --audio_column_name "audio" \
    --cpu_num_workers 32 \
    --num_workers_per_gpu 4 \
    --rename_column \
    --repo_id "ylacombe/libritts_r_tags"\

python main.py "parler-tts/mls_eng_10k" \
    --output_dir ./tmp_mls_eng_10k/ \
    --text_column_name "transcript" \
    --audio_column_name "audio" \
    --cpu_num_workers 32 \
    --num_workers_per_gpu 4 \
    --rename_column \
    --repo_id "ylacombe/mls_eng_10k_tags"\

================================================
FILE: examples/tagging/run_main_1k.sh
================================================
#!/usr/bin/env bash

python main.py "blabble-io/libritts_r" \
    --configuration "clean" \
    --output_dir ./tmp_libritts_r_clean/ \
    --text_column_name "text_normalized" \
    --audio_column_name "audio" \
    --cpu_num_workers 32 \
    --num_workers_per_gpu 4 \
    --rename_column \
    --repo_id "ylacombe/libritts-r-text-tags-v3"\
    --apply_squim_quality_estimation \


python main.py "blabble-io/libritts_r" \
    --configuration "other" \
    --output_dir ./tmp_libritts_r_other/ \
    --text_column_name "text_normalized" \
    --audio_column_name "audio" \
    --cpu_num_workers 32 \
    --num_workers_per_gpu 4 \
    --rename_column \
    --repo_id "ylacombe/libritts-r-text-tags-v3"\
    --apply_squim_quality_estimation \



================================================
FILE: examples/tagging/run_main_45k.sh
================================================
#!/usr/bin/env bash

python main.py "blabble-io/libritts_r" \
    --configuration "clean" \
    --output_dir ./tmp_libritts_r_clean/ \
    --text_column_name "text_normalized" \
    --audio_column_name "audio" \
    --cpu_num_workers 32 \
    --num_workers_per_gpu 4 \
    --rename_column \
    --repo_id "ylacombe/libritts-r-text-tags-v3"\
    --apply_squim_quality_estimation \


python main.py "blabble-io/libritts_r" \
    --configuration "other" \
    --output_dir ./tmp_libritts_r_other/ \
    --text_column_name "text_normalized" \
    --audio_column_name "audio" \
    --cpu_num_workers 32 \
    --num_workers_per_gpu 4 \
    --rename_column \
    --repo_id "ylacombe/libritts-r-text-tags-v3"\
    --apply_squim_quality_estimation \

python main.py "parler-tts/mls_eng" \
    --output_dir ./tmp_mls_eng/ \
    --text_column_name "transcript" \
    --audio_column_name "audio" \
    --cpu_num_workers 32 \
    --num_workers_per_gpu 4 \
    --rename_column \
    --repo_id "ylacombe/mls-eng-tags-v4"\
    --apply_squim_quality_estimation \


================================================
FILE: examples/tagging/run_main_dummy.sh
================================================
#!/usr/bin/env bash

python main.py "blabble-io/libritts_r" \
    --configuration "dev" \
    --output_dir ./tmp_libritts_r_dev/ \
    --text_column_name "text_normalized" \
    --audio_column_name "audio" \
    --cpu_num_workers 8 \
    --num_workers_per_gpu 4 \
    --rename_column \

================================================
FILE: examples/tags_to_annotations/run_metadata_to_text_10k.sh
================================================
#!/usr/bin/env bash

python ./scripts/metadata_to_text.py "ylacombe/mls-eng-10k-tags+ylacombe/libritts_r_tags+ylacombe/libritts_r_tags" \
    --configuration "default+clean+other" \
    --output_dir "./tmp_mls+./tmp_tts_clean+./tmp_tts_other" \
    --cpu_num_workers "8" \
    --leading_split_for_bins "train" \
    --plot_directory "./plots/" \
    --save_bin_edges "./examples/tags_to_annotations/v01_bin_edges.json" \
    --only_save_plot


================================================
FILE: examples/tags_to_annotations/run_metadata_to_text_10k_v02.sh
================================================
#!/usr/bin/env bash

python ./scripts/metadata_to_text.py "ylacombe/mls-eng-10k-tags+ylacombe/libritts_r_tags+ylacombe/libritts_r_tags" \
    --configuration "default+clean+other" \
    --output_dir "./tmp_mls+./tmp_tts_clean+./tmp_tts_other" \
    --cpu_num_workers "8" \
    --leading_split_for_bins "train" \
    --plot_directory "./plots/" \
    --save_bin_edges "./examples/tags_to_annotations/v02_bin_edges.json" \
    --path_to_text_bins ".examples/tags_to_annotations/v02_text_bins.json" \
    --pitch_std_tolerance "1.5"\
    --reverberation_std_tolerance "8."\
    --speech_monotony_std_tolerance "2."\
    --speaking_rate_std_tolerance "5.5"\
    --snr_std_tolerance "3.5"\
    --only_save_plot

================================================
FILE: examples/tags_to_annotations/run_metadata_to_text_for_finetuning.sh
================================================
#!/usr/bin/env bash

python ./scripts/metadata_to_text.py \
    "ylacombe/jenny-tts-tags-v1" \
    --repo_id "jenny-tts-tags-v1" \
    --configuration "default" \
    --cpu_num_workers "8" \
    --path_to_bin_edges "./examples/tags_to_annotations/v02_bin_edges.json" \
    --path_to_text_bins "./examples/tags_to_annotations/v02_text_bins.json" \
    --avoid_pitch_computation \
    --apply_squim_quality_estimation \



================================================
FILE: examples/tags_to_annotations/v01_bin_edges.json
================================================
{"speaking_rate": [3.508771929824561, 6.187242299296628, 8.865712668768696, 11.544183038240764, 14.22265340771283, 16.901123777184896, 19.579594146656966, 22.258064516129032], "noise": [27.179607391357422, 33.90050179617746, 40.62139620099749, 47.342290605817524, 54.063185010637554, 60.78407941545759, 67.50497382027763, 74.22586822509766], "reverberation": [30.498437881469727, 34.706024169921875, 38.91361045837402, 43.12119674682617, 47.32878303527832, 51.53636932373047, 55.74395561218262, 59.951541900634766], "speech_monotony": [0.0, 17.430070059640066, 34.86014011928013, 52.2902101789202, 69.72028023856026, 87.15035029820032, 104.5804203578404, 122.01049041748047], "pitch_bins_male": [74.04898071289062, 88.6379623413086, 103.22694396972656, 117.81592559814453, 132.4049072265625, 146.993896484375, 161.58287048339844, 176.17185974121094], "pitch_bins_female": [130.46119689941406, 149.0537567138672, 167.64630126953125, 186.23886108398438, 204.83140563964844, 223.42396545410156, 242.01651000976562, 260.60906982421875]}

================================================
FILE: examples/tags_to_annotations/v01_text_bins.json
================================================
{
    "speaker_rate_bins":
        ["very slowly", "quite slowly", "slightly slowly", "moderate speed", "slightly fast", "quite fast", "very fast"],
    "snr_bins":
        ["very noisy", "quite noisy", "slightly noisy", "moderate ambient sound", "slightly clear", "quite clear", "very clear"],
    "reverberation_bins":
        ["very roomy sounding", "quite roomy sounding", "slightly roomy sounding", "moderate reverberation", "slightly confined sounding", "quite confined sounding", "very confined sounding"],
    "utterance_level_std":
        ["very monotone", "quite monotone", "slightly monotone", "moderate intonation", "slightly expressive", "quite expressive", "very expressive"],
    "speaker_level_pitch_bins":
        ["very low pitch", "quite low pitch", "slightly low pitch", "moderate pitch", "slightly high pitch", "quite high pitch", "very high pitch"]
}

================================================
FILE: examples/tags_to_annotations/v02_bin_edges.json
================================================
{
    "speaking_rate": [0.0, 3.8258038258038254, 7.651607651607651, 11.477411477411476, 15.303215303215302, 19.129019129019127, 22.95482295482295, 26.78062678062678], 
    "noise": [17.12751579284668, 25.4012325831822, 33.67494937351772, 41.94866616385323, 50.22238295418875, 58.49609974452427, 66.76981653485979, 75.04353332519531], 
    "reverberation": [10, 35, 45, 55, 59, 60], 
    "speech_monotony": [0.0, 20.37920924595424, 40.75841849190848, 70, 90, 142.6544647216797], 
    "pitch_bins_male": [64.6531982421875, 81.66683959960938, 98.68048095703125, 115.69412231445312, 132.707763671875, 149.72140502929688, 166.73504638671875, 183.74868774414062], 
    "pitch_bins_female": [120.17855072021484, 141.6242690945264, 163.06998746883795, 184.51570584314953, 205.96142421746106, 227.40714259177264, 248.8528609660842, 270.29857934039575], 
    "si-sdr": [-17.804332733154297, -0.40644073486328125, 10, 20, 25, 28, 34.38934326171875], 
    "pesq": [1, 1.7, 2.4, 3.1, 3.6, 4, 4.499948978424072]
}

================================================
FILE: examples/tags_to_annotations/v02_text_bins.json
================================================
{
    "speaker_rate_bins":
        ["very slowly", "slowly", "slightly slowly", "moderate speed", "slightly fast", "fast", "very fast"],
    "snr_bins":
        ["very noisy", "noisy", "slightly noisy", "balanced in clarity", "slightly clean", "clean", "very clean"],
    "reverberation_bins":
        ["very distant-sounding", "distant-sounding", "slightly distant-sounding", "slightly close-sounding", "very close-sounding"],
    "utterance_level_std":
        ["very monotone", "monotone", "slightly expressive and animated", "expressive and animated", "very expressive and animated"],
    "speaker_level_pitch_bins":
        ["very low-pitch", "low-pitch", "slightly low-pitch", "moderate pitch", "slightly high-pitch", "high-pitch", "very high-pitch"]
}

================================================
FILE: main.py
================================================
from datasets import load_dataset, Audio
from multiprocess import set_start_method
from dataspeech import rate_apply, pitch_apply, snr_apply, squim_apply
import torch
import argparse


if __name__ == "__main__":
    set_start_method("spawn")
    parser = argparse.ArgumentParser()
    
    
    parser.add_argument("dataset_name", type=str, help="Path or name of the dataset. See: https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/loading_methods#datasets.load_dataset.path")
    parser.add_argument("--configuration", default=None, type=str, help="Dataset configuration to use, if necessary.")
    parser.add_argument("--output_dir", default=None, type=str, help="If specified, save the dataset on disk with this path.")
    parser.add_argument("--repo_id", default=None, type=str, help="If specified, push the dataset to the hub.")
    parser.add_argument("--audio_column_name", default="audio", type=str, help="Column name of the audio column to be enriched.")
    parser.add_argument("--text_column_name", default="text", type=str, help="Text column name.")
    parser.add_argument("--rename_column", action="store_true", help="If activated, rename audio and text column names to 'audio' and 'text'. Useful if you want to merge datasets afterwards.")
    parser.add_argument("--cpu_num_workers", default=1, type=int, help="Number of CPU workers for transformations that don't use GPUs or if no GPU are available.")
    parser.add_argument("--cpu_writer_batch_size", default=1000, type=int, help="writer_batch_size for transformations that don't use GPUs. See: https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/main_classes#datasets.Dataset.map.writer_batch_size")
    parser.add_argument("--batch_size", default=2, type=int, help="This parameters specify how many samples are passed by workers for operations that are using GPUs.")
    parser.add_argument("--penn_batch_size", default=4096, type=int, help="Pitch estimation chunks audio into smaller pieces and processes them in batch. This specify the batch size. If you are using a gpu, pick a batch size that doesn't cause memory errors.")
    parser.add_argument("--num_workers_per_gpu_for_pitch", default=1, type=int, help="Number of workers per GPU for the pitch estimation if GPUs are available. Defaults to 1 if some are avaiable. Useful if you want multiple processes per GPUs to maximise GPU usage.")
    parser.add_argument("--num_workers_per_gpu_for_snr", default=1, type=int, help="Number of workers per GPU for the SNR and reverberation estimation if GPUs are available. Defaults to 1 if some are avaiable. Useful if you want multiple processes per GPUs to maximise GPU usage.")
    parser.add_argument("--apply_squim_quality_estimation", action="store_true", help="If set, will also use torchaudio-squim estimation (SI-SNR, STOI and PESQ).")
    parser.add_argument("--num_workers_per_gpu_for_squim", default=1, type=int, help="Number of workers per GPU for the SI-SNR, STOI and PESQ estimation if GPUs are available. Defaults to 1 if some are avaiable. Useful if you want multiple processes per GPUs to maximise GPU usage.")


    args = parser.parse_args()
    
    if args.configuration:
        dataset = load_dataset(args.dataset_name, args.configuration, num_proc=args.cpu_num_workers,)
    else:
        dataset = load_dataset(args.dataset_name, num_proc=args.cpu_num_workers,)
        
    audio_column_name = "audio" if args.rename_column else args.audio_column_name
    text_column_name = "text" if args.rename_column else args.text_column_name
    if args.rename_column:
        dataset = dataset.rename_columns({args.audio_column_name: "audio", args.text_column_name: "text"})
        

    if args.apply_squim_quality_estimation:
        print("Compute SI-SDR, PESQ, STOI")
        squim_dataset = dataset.map(
            squim_apply,
            batched=True,
            batch_size=args.batch_size,
            with_rank=True if torch.cuda.device_count()>0 else False,
            num_proc=torch.cuda.device_count()*args.num_workers_per_gpu_for_squim if torch.cuda.device_count()>0 else args.cpu_num_workers,
            remove_columns=[audio_column_name], # tricks to avoid rewritting audio
            fn_kwargs={"audio_column_name": audio_column_name,},
        )

    print("Compute pitch")
    pitch_dataset = dataset.cast_column(audio_column_name, Audio(sampling_rate=16_000)).map(
        pitch_apply,
        batched=True,
        batch_size=args.batch_size,
        with_rank=True if torch.cuda.device_count()>0 else False,
        num_proc=torch.cuda.device_count()*args.num_workers_per_gpu_for_pitch if torch.cuda.device_count()>0 else args.cpu_num_workers,
        remove_columns=[audio_column_name], # tricks to avoid rewritting audio
        fn_kwargs={"audio_column_name": audio_column_name, "penn_batch_size": args.penn_batch_size},
    )

    print("Compute snr and reverb")
    snr_dataset = dataset.map(
        snr_apply,
        batched=True,
        batch_size=args.batch_size,
        with_rank=True if torch.cuda.device_count()>0 else False,
        num_proc=torch.cuda.device_count()*args.num_workers_per_gpu_for_snr if torch.cuda.device_count()>0 else args.cpu_num_workers,
        remove_columns=[audio_column_name], # tricks to avoid rewritting audio
        fn_kwargs={"audio_column_name": audio_column_name},
    )
    
    print("Compute speaking rate")
    if "speech_duration" in snr_dataset[next(iter(snr_dataset.keys()))].features:    
        rate_dataset = snr_dataset.map(
            rate_apply,
            with_rank=False,
            num_proc=args.cpu_num_workers,
            writer_batch_size= args.cpu_writer_batch_size,
            fn_kwargs={"audio_column_name": audio_column_name, "text_column_name": text_column_name},
        )
    else:
        rate_dataset = dataset.map(
            rate_apply,
            with_rank=False,
            num_proc=args.cpu_num_workers,
            writer_batch_size= args.cpu_writer_batch_size,
            remove_columns=[audio_column_name], # tricks to avoid rewritting audio
            fn_kwargs={"audio_column_name": audio_column_name, "text_column_name": text_column_name},
        )
    
    for split in dataset.keys():
        dataset[split] = pitch_dataset[split].add_column("snr", snr_dataset[split]["snr"]).add_column("c50", snr_dataset[split]["c50"])
        if "speech_duration" in snr_dataset[split]:
            dataset[split] = dataset[split].add_column("speech_duration", snr_dataset[split]["speech_duration"])
        dataset[split] = dataset[split].add_column("speaking_rate", rate_dataset[split]["speaking_rate"]).add_column("phonemes", rate_dataset[split]["phonemes"])
        if args.apply_squim_quality_estimation:
            dataset[split] = dataset[split].add_column("stoi", squim_dataset[split]["stoi"]).add_column("si-sdr", squim_dataset[split]["sdr"]).add_column("pesq", squim_dataset[split]["pesq"])
    
    if args.output_dir:
        print("Saving to disk...")
        dataset.save_to_disk(args.output_dir)
    if args.repo_id:
        print("Pushing to the hub...")
        if args.configuration:
            dataset.push_to_hub(args.repo_id, args.configuration)
        else:
            dataset.push_to_hub(args.repo_id)
    


================================================
FILE: requirements.txt
================================================
datasets[audio]
https://github.com/marianne-m/brouhaha-vad/archive/main.zip
penn
g2p
demucs
transformers
accelerate
bitsandbytes

================================================
FILE: scripts/filter_audio_separation.py
================================================
from demucs import pretrained
from demucs.apply import apply_model
from demucs.audio import convert_audio
from datasets import load_dataset
from multiprocess import set_start_method
import torch
import argparse
from datasets import Audio



demucs = pretrained.get_model('htdemucs')
source = demucs.sources

def wrap_audio(audio, sr):
    return {
        "array": audio.cpu().numpy(),
        "sampling_rate": sr
    }


# TODO(YL): make compatible with other naming and stems
def filter_stems(batch, rank=None):
    if rank is not None:
        # move the model to the right GPU if not there already
        device = f"cuda:{(rank or 0)% torch.cuda.device_count()}"
        # move to device and create pipeline here because the pipeline moves to the first GPU it finds anyway
        demucs.to(device)

    if isinstance(batch["audio"], list):  
        wavs = [convert_audio(
                    torch.tensor(audio["array"][None], device=device).to(torch.float32), audio["sampling_rate"], demucs.samplerate, demucs.audio_channels).T for audio in batch["audio"]]
        wavs_length = [audio.shape[0] for audio in wavs]
        
        wavs = torch.nn.utils.rnn.pad_sequence(wavs, batch_first=True, padding_value=0.0).transpose(1,2)
        stems = apply_model(demucs, wavs)
        
        batch["vocals"] = [wrap_audio(s[-1,:,:length].mean(0), demucs.samplerate) for (s,length) in zip(stems, wavs_length)]
        batch["others"] = [wrap_audio(s[:-1, :,:length].sum(0).mean(0), demucs.samplerate) for (s,length) in zip(stems, wavs_length)]
        
    else:
        audio = torch.tensor(batch["audio"]["array"].squeeze(), device=device).to(torch.float32)
        sample_rate = batch["audio"]["sampling_rate"]
        audio = convert_audio(
                audio, sample_rate, demucs.samplerate, demucs.audio_channels)
        stems = apply_model(demucs, audio[None])
        
        batch["vocals"] = wrap_audio(stems[0,-1].mean(0), demucs.samplerate)
        batch["others"] = wrap_audio(stems[0, :-1].sum(0).mean(0), demucs.samplerate)

    return batch
    
if __name__ == "__main__":
    set_start_method("spawn")

    parser = argparse.ArgumentParser()
    parser.add_argument("dataset_name", type=str, help="Path or name of the dataset. See: https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/loading_methods#datasets.load_dataset.path")
    parser.add_argument("--configuration", default=None, type=str, help="Dataset configuration to use, if necessary.")
    parser.add_argument("--output_dir", default=None, type=str, help="If specified, save the dataset on disk with this path.")
    parser.add_argument("--repo_id", default=None, type=str, help="If specified, push the model to the hub.")
    parser.add_argument("--audio_column_name", default="audio", type=str, help="Column name of the audio column to be separated.")
    parser.add_argument("--batch_size", default=8, type=int, help="Batch size. Speeds up operations on GPU.")
    parser.add_argument("--num_workers_per_gpu", default=1, type=int, help="Number of workers per GPU for transformations that uses GPUs if GPUs are available. Defaults to 1 if some are avaiable. Useful if you want multiple processes per GPUs to maximise GPU usage.")
    args = parser.parse_args()
    
    if args.configuration:
        dataset = load_dataset(args.dataset_name, args.configuration)
    else:
        dataset = load_dataset(args.dataset_name)    


    num_proc = torch.cuda.device_count()*args.num_workers_per_gpu if torch.cuda.device_count() >= 1 else None

    updated_dataset = dataset.map(
        filter_stems,
        batched=True,
        batch_size=args.batch_size,
        with_rank=True,
        num_proc=num_proc,
    )
    
    updated_dataset = updated_dataset.cast_column("vocals", Audio())
    updated_dataset = updated_dataset.cast_column("others", Audio())
    
    if args.output_dir:
        print("Saving to disk...")
        updated_dataset.save_to_disk(args.output_dir)
    if args.repo_id:
        print("Pushing to the hub...")
        if args.configuration:
            updated_dataset.push_to_hub(args.repo_id, args.configuration)
        else:
            updated_dataset.push_to_hub(args.repo_id)
    



================================================
FILE: scripts/merge_audio_to_metadata.py
================================================
import numpy as np
import pandas as pd
from datasets import load_dataset, concatenate_datasets
from multiprocess import set_start_method
import argparse



if __name__ == "__main__":
    set_start_method("spawn")
    parser = argparse.ArgumentParser()
    
    
    parser.add_argument("dataset_name", type=str, help="Repo id.")
    parser.add_argument("metadata_dataset_name", type=str, help="Repo id.")
    parser.add_argument("--configuration", default=None, type=str, help="Dataset configuration to use.")
    parser.add_argument("--output_dir", default=None, type=str, help="If specified, save the dasaset on disk.")
    parser.add_argument("--repo_id", default=None, type=str, help="If specified, push the model to the hub.")
    parser.add_argument("--cpu_num_workers", default=1, type=int, help="Number of CPU workers.")
    parser.add_argument("--strategy", default="concatenate", type=str, help="For now only concatenate.")
    parser.add_argument("--id_column_name", default="id", type=str, help="For now only concatenate.") # TODO
    parser.add_argument("--columns_to_drop", default=None, type=str, help="Column names to drop in the metadataset. If some columns are duplicates. Separated by '+'. ")
    

    args = parser.parse_args()
    
    args = parser.parse_args()
    
    if args.configuration:
        dataset = load_dataset(args.dataset_name, args.configuration)
    else:
        dataset = load_dataset(args.dataset_name)
        
    if args.configuration:
        metadata_dataset = load_dataset(args.metadata_dataset_name, args.configuration)
    else:
        metadata_dataset = load_dataset(args.metadata_dataset_name)

    columns_to_drop = None
    if args.columns_to_drop is not None:
        columns_to_drop = args.columns_to_drop.split("+")
        metadata_dataset = metadata_dataset.remove_columns(columns_to_drop)
    
    # TODO: for now suppose that they've kept the same ordering
    for split in dataset:
        if split in metadata_dataset:
            dataset[split] = concatenate_datasets([dataset[split], metadata_dataset[split].rename_column(args.id_column_name, f"metadata_{args.id_column_name}")], axis=1)
        else:
            raise ValueError(f"Metadataset don't have the same split {split} than dataset")
        
        if len(dataset[split].filter(lambda id1, id2: id1!=id2, input_columns=[args.id_column_name, f"metadata_{args.id_column_name}"])) != 0:
            raise ValueError(f"Concatenate didn't work. Some ids don't correspond on split {split}")
    

    if args.output_dir:
        dataset.save_to_disk(args.output_dir)
    if args.repo_id:
        if args.configuration:
            dataset.push_to_hub(args.repo_id, args.configuration)
        else:
            dataset.push_to_hub(args.repo_id)
    

================================================
FILE: scripts/metadata_to_text.py
================================================
import numpy as np
import pandas as pd
from datasets import load_dataset, DatasetDict
from multiprocess import set_start_method
import argparse
from pathlib import Path
import os
import matplotlib.pyplot as plt
import json

SPEAKER_RATE_BINS = ["very slowly", "quite slowly", "slightly slowly", "moderate speed", "slightly fast", "quite fast", "very fast"]
SNR_BINS = ["very noisy", "quite noisy", "slightly noisy", "moderate ambient sound", "slightly clear", "quite clear", "very clear"]
REVERBERATION_BINS = ["very roomy sounding", "quite roomy sounding", "slightly roomy sounding", "moderate reverberation", "slightly confined sounding", "quite confined sounding", "very confined sounding"]
UTTERANCE_LEVEL_STD = ["very monotone", "quite monotone", "slightly monotone", "moderate intonation", "slightly expressive", "quite expressive", "very expressive"]
SI_SDR_BINS = ["extremely noisy", "very noisy", "noisy", "slightly noisy", "almost no noise", "very clear"]
PESQ_BINS = ["very bad speech quality", "bad speech quality", "slightly bad speech quality", "moderate speech quality", "great speech quality", "wonderful speech quality"]

# this one is supposed to be apply to speaker-level mean pitch, and relative to gender
SPEAKER_LEVEL_PITCH_BINS = ["very low pitch", "quite low pitch", "slightly low pitch", "moderate pitch", "slightly high pitch", "quite high pitch", "very high pitch"]


def visualize_bins_to_text(values_1, values_2, name_1, name_2, text_bins, save_dir, output_column_name, default_bins=100, lower_range=None):
    # Save both histograms into a single figure
    fig, axs = plt.subplots(2, figsize=(8,6), sharex=True)
    
    # Plot histogram and vertical lines for subplot 1
    axs[0].hist(values_1, bins=default_bins, color='blue', alpha=0.7)
    _, bin_edges1 = np.histogram(values_1, bins=len(text_bins), range=(lower_range, values_1.max()) if lower_range else None)
    for edge in bin_edges1:
        axs[0].axvline(x=edge, color='red', linestyle='--', linewidth=1)


    # Plot histogram and vertical lines for subplot 2
    axs[1].hist(values_2, bins=default_bins, color='green', alpha=0.7)
    _, bin_edges2 = np.histogram(values_2, bins=len(text_bins), range=(lower_range, values_2.max()) if lower_range else None)
    for edge in bin_edges2:
        axs[1].axvline(x=edge, color='red', linestyle='--', linewidth=1)

    # Add labels and title
    axs[0].set_title(name_1)
    axs[1].set_title(name_2)
    axs[0].set_yscale('log')
    axs[1].set_yscale('log')
    axs[0].set_ylabel('Frequency')
    axs[1].set_ylabel('Frequency')
    axs[1].set_xlabel(f'{output_column_name}')

    # Adjust layout
    plt.tight_layout()

    filename = f"{output_column_name}.png"
    filepath = os.path.join(save_dir, filename)
    plt.savefig(filepath)
    print(f"Plots saved at '{filename}'!")

def bins_to_text(dataset, text_bins, column_name, output_column_name, leading_split_for_bins="train", batch_size = 4, num_workers = 1, std_tolerance=5, save_dir=None, only_save_plot=False, lower_range=None, bin_edges=None):
    '''
    Compute bins of `column_name` from the splits `leading_split_for_bins` and apply text bins to every split.
    `leading_split_for_bins` can be a string or a list.
    '''
    if bin_edges is None:
        values = []
        for df in dataset:
            for split in df:
                if leading_split_for_bins is None or leading_split_for_bins in split:
                    values.extend(df[split][column_name])
        
        # filter out outliers
        values = np.array(values)
        values = values[~np.isnan(values)]
        filtered_values = values
        if std_tolerance is not None:
            filtered_values = values[np.abs(values - np.mean(values)) < std_tolerance * np.std(values)]

        if save_dir is not None:
            visualize_bins_to_text(values, filtered_values, "Before filtering", "After filtering", text_bins, save_dir, output_column_name, lower_range=lower_range)
            
        # speaking_rate can easily have outliers
        if save_dir is not None and output_column_name=="speaking_rate":
            visualize_bins_to_text(filtered_values, filtered_values, "After filtering", "After filtering", text_bins, save_dir, f"{output_column_name}_after_filtering", lower_range=lower_range)
        
        values = filtered_values
        hist, bin_edges = np.histogram(values, bins = len(text_bins), range=(lower_range, values.max()) if lower_range else None)
        
        if only_save_plot:
            return dataset, bin_edges
    else:
        print(f"Already computed bin edges have been passed for {output_column_name}. Will use: {bin_edges}.")

    def batch_association(batch):
        index_bins = np.searchsorted(bin_edges, batch, side="left")
        # do min(max(...)) when values are outside of the main bins
        # it happens when value = min or max or have been filtered out from bins computation
        batch_bins = [text_bins[min(max(i-1, 0), len(text_bins)-1)] for i in index_bins]
        return {
            output_column_name: batch_bins
        }
    
    dataset = [df.map(batch_association, batched=True, batch_size=batch_size, input_columns=[column_name], num_proc=num_workers) for df in dataset]
    return dataset, bin_edges

def speaker_level_relative_to_gender(dataset, text_bins, speaker_column_name, gender_column_name, column_name, output_column_name, batch_size = 4, num_workers=1, std_tolerance=None, save_dir=None, only_save_plot=False, bin_edges=None):
    '''
    Computes mean values on a speaker level and computes bins on top relative to the gender column name.
    Then associate a text bin to the column.
    This time, doesn't use leading_split_for_bins, computes it for all. Could probably be optimized
    '''
    list_data = []
    for df in dataset:
        for split in df:
            panda_data = df[split].remove_columns([col for col in df[split].column_names if col not in {speaker_column_name, column_name, gender_column_name}]).to_pandas()
            list_data.append(panda_data)
        
    dataframe = pd.concat(list_data, ignore_index=True)
    dataframe = dataframe.groupby(speaker_column_name).agg({column_name: "mean", gender_column_name: "first"})
    if bin_edges is None:
        bin_edges = {}
        if save_dir is not None:
            save_dict = {}
            save_dict_afer_filtering = {}
        for category in ["male", "female"]:
            values = dataframe[dataframe[gender_column_name] == category][column_name]
            values = np.array(values)
            if save_dir is not None:
                save_dict[category] = values
            if std_tolerance is not None:
                # filter out outliers
                values = values[np.abs(values - np.mean(values)) < std_tolerance * np.std(values)]
                if save_dir is not None:
                    save_dict_afer_filtering[category] = values
            bin_edges[category] = np.histogram(values, len(text_bins))[1]
        
        if save_dir is not None:
            visualize_bins_to_text(save_dict["male"], save_dict["female"], "Male distribution", "Female distribution", text_bins, save_dir, output_column_name)
            if std_tolerance is not None:
                visualize_bins_to_text(save_dict_afer_filtering["male"], save_dict_afer_filtering["female"], "Male distribution", "Female distribution", text_bins, save_dir, f"{output_column_name}_after_filtering")

        if only_save_plot:
            return dataset, bin_edges
    else:
        print(f"Already computed bin edges have been passed for {output_column_name}. Will use: {bin_edges}.")
     
    speaker_id_to_bins = dataframe.apply(lambda x: np.searchsorted(bin_edges[x[gender_column_name]], x[column_name]), axis=1).to_dict()
        
    def batch_association(batch):
        index_bins = [speaker_id_to_bins[speaker] for speaker in batch]
        # do min(max(...)) when values are outside of the main bins
        # it happens when value = min or max or have been filtered out from bins computation
        batch_bins = [text_bins[min(max(i-1, 0), len(text_bins)-1)] for i in index_bins]
        return {
            output_column_name: batch_bins
        }
        
    
    dataset = [df.map(batch_association, batched=True, input_columns=[speaker_column_name], batch_size=batch_size, num_proc=num_workers) for df in dataset]
    return dataset, bin_edges

if __name__ == "__main__":
    set_start_method("spawn")
    parser = argparse.ArgumentParser()
    
    
    parser.add_argument("dataset_name", type=str, help="Path or name of the dataset(s). If multiple datasets, names have to be separated by `+`.")
    parser.add_argument("--configuration", default=None, type=str, help="Dataset configuration(s) to use (or configuration separated by +).")
    parser.add_argument("--output_dir", default=None, type=str, help="If specified, save the dataset(s) on disk. If multiple datasets, paths have to be separated by `+`.")
    parser.add_argument("--repo_id", default=None, type=str, help="If specified, push the dataset(s) to the hub. If multiple datasets, names have to be separated by `+`.")
    parser.add_argument("--path_to_text_bins", default=None, type=str, help="If specified, points to a JSON file which contains the text bins that will be associated to each bins. Will use default bins.")
    parser.add_argument("--path_to_bin_edges", default=None, type=str, help="If specified, points to a JSON file which contains the bin edges. Useful if you want to apply already computed bins to new datasets. If not specified, will recompute bin edges from scratch.")
    parser.add_argument("--save_bin_edges", default=None, type=str, help="If specified, it's the name of the JSON file which will contains the edge bins that have been computed. Useful if you want to reuse those bin eges on new datasets. By default, it won't save those edges..")
    parser.add_argument("--avoid_pitch_computation", default=False, action="store_true", help="If `True`, will not compute `pitch`. Note that `pitch` is computed on a speaker-level, relative to gender, so you don't need it in a mono-speaker setting.")
    parser.add_argument("--cpu_num_workers", default=1, type=int, help="Number of CPU workers.")
    parser.add_argument("--batch_size", default=16, type=int, help="Batch size in `Dataset.map` operations. https://huggingface.co/docs/datasets/v2.17.0/en/package_reference/main_classes#datasets.Dataset.map")
    parser.add_argument("--speaker_id_column_name", default="speaker_id", type=str, help="Speaker id column name. Only used if `avoid_pitch_computation=False`")
    parser.add_argument("--gender_column_name", default="gender", type=str, help="Gender column name. .Only used if `avoid_pitch_computation=False`")
    parser.add_argument("--pitch_std_tolerance", default=2., type=float, help="Standard deviation tolerance for pitch estimation. Any value that is outside mean ± std * tolerance is discared. Only used if `avoid_pitch_computation=False`.")
    parser.add_argument("--speaking_rate_std_tolerance", default=4., type=float, help="Standard deviation tolerance for speaking rate estimation. Any value that is outside mean ± std * tolerance is discared. Only used if `path_to_bin_edges=False`.")
    parser.add_argument("--snr_std_tolerance", default=3.5, type=float, help="Standard deviation tolerance for SNR estimation. Any value that is outside mean ± std * tolerance is discared. Only used if `path_to_bin_edges=False`.")
    parser.add_argument("--reverberation_std_tolerance", default=4, type=float, help="Standard deviation tolerance for reverberation estimation. Any value that is outside mean ± std * tolerance is discared. Only used if `path_to_bin_edges=False`.")
    parser.add_argument("--speech_monotony_std_tolerance", default=4, type=float, help="Standard deviation tolerance for speech monotony estimation. Any value that is outside mean ± std * tolerance is discared. Only used if `path_to_bin_edges=False`.")
    parser.add_argument("--leading_split_for_bins", default=None, type=str, help="If specified, will use every split that contains this string to compute statistics. If not specified, will use every split. Only used if `path_to_bin_edges=False`.")
    parser.add_argument("--plot_directory", default=None, type=str, help="If specified, will save visualizing plots to this directory. Only used if `path_to_bin_edges=False`.")
    parser.add_argument("--only_save_plot", default=False, action="store_true", help="If `True` and `--plot_directory` is specified, will only compute plot. Only used if `path_to_bin_edges=False`.")
    parser.add_argument("--snr_lower_range", default=None, type=float, help="The lower range of the SNR bins")
    parser.add_argument("--speaking_rate_lower_range", default=None, type=float, help="The lower range of the speaking rate bins")
    parser.add_argument("--apply_squim_quality_estimation", action="store_true", help="If set, will also compute bins for torchaudio-squim estimation (SI-SNR, PESQ).")
    parser.add_argument("--pesq_std_tolerance", default=None, type=float, help="Used if `apply_squim_quality_estimation=True`. Standard deviation tolerance for PESQ estimation. Any value that is outside mean ± std * tolerance is discared. Only used if `avoid_pitch_computation=False`.")
    parser.add_argument("--sdr_std_tolerance", default=None, type=float, help="Used if `apply_squim_quality_estimation=True`. Standard deviation tolerance for SI-SDR estimation. Any value that is outside mean ± std * tolerance is discared. Only used if `path_to_bin_edges=False`.")

    args = parser.parse_args()
    
    if args.plot_directory is None and args.only_save_plot:
        raise ValueError("`only_save_plot=true` but `plot_directory` is not specified. Please give a path to the directory where you want the plot to be saved.")
    if args.only_save_plot and args.path_to_bin_edges:
        raise ValueError("`only_save_plot=true` but `path_to_bin_edges` is specified. Since the latter is specified, we won't redo computations that would have been used for plotting. Chose one ar another. Note that if you use this script to label a new dataset for fine-tuning, I'd recommend avoiding plotting and set `only_save_plot=false`")
        
    text_bins_dict = {}
    if args.path_to_text_bins:
        with open(args.path_to_text_bins) as json_file:
            text_bins_dict = json.load(json_file)
            
    bin_edges_dict = {}
    if args.path_to_bin_edges:
        with open(args.path_to_bin_edges) as json_file:
            bin_edges_dict = json.load(json_file)

    speaker_level_pitch_bins = text_bins_dict.get("speaker_level_pitch_bins", SPEAKER_LEVEL_PITCH_BINS)
    speaker_rate_bins = text_bins_dict.get("speaker_rate_bins", SPEAKER_RATE_BINS)
    snr_bins = text_bins_dict.get("snr_bins", SNR_BINS)
    reverberation_bins = text_bins_dict.get("reverberation_bins", REVERBERATION_BINS)
    utterance_level_std = text_bins_dict.get("utterance_level_std", UTTERANCE_LEVEL_STD)
    
    if args.apply_squim_quality_estimation:
        sdr_bins = text_bins_dict.get("sdr_bins", SI_SDR_BINS)
        pesq_std = text_bins_dict.get("pesq_bins", PESQ_BINS)

    output_dirs = [args.output_dir] if args.output_dir is not None else None
    repo_ids = [args.repo_id] if args.repo_id is not None else None
    if args.configuration:
        if "+" in args.dataset_name:
            dataset_names = args.dataset_name.split("+")
            dataset_configs = args.configuration.split("+")
            if len(dataset_names) != len(dataset_configs):
                raise ValueError(f"There are {len(dataset_names)} datasets spotted but {len(dataset_configs)} configuration spotted")
            
            if args.repo_id is not None:
                repo_ids = args.repo_id.split("+")
                if len(dataset_names) != len(repo_ids):
                    raise ValueError(f"There are {len(dataset_names)} datasets spotted but {len(repo_ids)} repository ids spotted")

            if args.output_dir is not None:
                output_dirs = args.output_dir.split("+")
                if len(dataset_names) != len(output_dirs):
                    raise ValueError(f"There are {len(dataset_names)} datasets spotted but {len(output_dirs)} local paths on which to save the datasets spotted")
            
            dataset = []
            for dataset_name, dataset_config in zip(dataset_names, dataset_configs):
                tmp_dataset = load_dataset(dataset_name, dataset_config, num_proc=args.cpu_num_workers)
                dataset.append(tmp_dataset)
        else:
            dataset = [load_dataset(args.dataset_name, args.configuration, num_proc=args.cpu_num_workers)]
            dataset_configs = [args.configuration]
    else:
        if "+" in args.dataset_name:
            dataset_names = args.dataset_name.split("+")
            if args.repo_id is not None:
                repo_ids = args.repo_id.split("+")
                if len(dataset_names) != len(repo_ids):
                    raise ValueError(f"There are {len(dataset_names)} datasets spotted but {len(repo_ids)} repository ids spotted")

            if args.output_dir is not None:
                output_dirs = args.output_dir.split("+")
                if len(dataset_names) != len(output_dirs):
                    raise ValueError(f"There are {len(dataset_names)} datasets spotted but {len(output_dirs)} local paths on which to save the datasets spotted")
            
            dataset = []
            for dataset_name, dataset_config in zip(dataset_names):
                tmp_dataset = load_dataset(dataset_name, num_proc=args.cpu_num_workers)
                dataset.append(tmp_dataset)

        else:
            dataset = [load_dataset(args.dataset_name, num_proc=args.cpu_num_workers)]

    if args.plot_directory:
        Path(args.plot_directory).mkdir(parents=True, exist_ok=True)
    
    if not args.avoid_pitch_computation:
        bin_edges = None
        if "pitch_bins_male" in bin_edges_dict and "pitch_bins_female" in bin_edges_dict:
            bin_edges = {"male": bin_edges_dict["pitch_bins_male"], "female": bin_edges_dict["pitch_bins_female"]}

        dataset, pitch_bin_edges = speaker_level_relative_to_gender(dataset, speaker_level_pitch_bins, args.speaker_id_column_name, args.gender_column_name, "utterance_pitch_mean", "pitch", batch_size=args.batch_size, num_workers=args.cpu_num_workers, std_tolerance=args.pitch_std_tolerance, save_dir=args.plot_directory, only_save_plot=args.only_save_plot, bin_edges=bin_edges)

    dataset, speaking_rate_bin_edges = bins_to_text(dataset, speaker_rate_bins, "speaking_rate", "speaking_rate", batch_size=args.batch_size, num_workers=args.cpu_num_workers, leading_split_for_bins=args.leading_split_for_bins, std_tolerance=args.speaking_rate_std_tolerance, save_dir=args.plot_directory, only_save_plot=args.only_save_plot, bin_edges=bin_edges_dict.get("speaking_rate",None), lower_range=args.speaking_rate_lower_range)
    dataset, noise_bin_edges = bins_to_text(dataset, snr_bins, "snr", "noise", batch_size=args.batch_size, num_workers=args.cpu_num_workers, leading_split_for_bins=args.leading_split_for_bins, std_tolerance=args.snr_std_tolerance, save_dir=args.plot_directory, only_save_plot=args.only_save_plot, bin_edges=bin_edges_dict.get("noise",None), lower_range=args.snr_lower_range)
    dataset, reverberation_bin_edges = bins_to_text(dataset, reverberation_bins, "c50", "reverberation", batch_size=args.batch_size, num_workers=args.cpu_num_workers, leading_split_for_bins=args.leading_split_for_bins, std_tolerance=args.reverberation_std_tolerance, save_dir=args.plot_directory, only_save_plot=args.only_save_plot, bin_edges=bin_edges_dict.get("reverberation",None))
    dataset, speech_monotony_bin_edges = bins_to_text(dataset, utterance_level_std, "utterance_pitch_std", "speech_monotony", batch_size=args.batch_size, num_workers=args.cpu_num_workers, leading_split_for_bins=args.leading_split_for_bins, std_tolerance=args.speech_monotony_std_tolerance, save_dir=args.plot_directory, only_save_plot=args.only_save_plot, bin_edges=bin_edges_dict.get("speech_monotony",None))

    if args.apply_squim_quality_estimation:
        dataset, sdr_bin_edges = bins_to_text(dataset, sdr_bins, "si-sdr", "sdr_noise", batch_size=args.batch_size, num_workers=args.cpu_num_workers, leading_split_for_bins=args.leading_split_for_bins, std_tolerance=args.sdr_std_tolerance, save_dir=args.plot_directory, only_save_plot=args.only_save_plot, bin_edges=bin_edges_dict.get("si-sdr",None))
        dataset, pesq_bin_edges = bins_to_text(dataset, pesq_std, "pesq", "pesq_speech_quality", batch_size=args.batch_size, num_workers=args.cpu_num_workers, leading_split_for_bins=args.leading_split_for_bins, std_tolerance=args.pesq_std_tolerance, save_dir=args.plot_directory, only_save_plot=args.only_save_plot, bin_edges=bin_edges_dict.get("pesq",None))

    if args.save_bin_edges:
        bin_edges = {
            "speaking_rate": speaking_rate_bin_edges.tolist(),
            "noise": noise_bin_edges.tolist(),
            "reverberation": reverberation_bin_edges.tolist(),
            "speech_monotony": speech_monotony_bin_edges.tolist(),
        }
        if not args.avoid_pitch_computation:
            bin_edges["pitch_bins_male"] = pitch_bin_edges["male"].tolist()
            bin_edges["pitch_bins_female"] = pitch_bin_edges["female"].tolist()
        if args.apply_squim_quality_estimation:
            bin_edges["si-sdr"] = sdr_bin_edges.tolist()
            bin_edges["pesq"] = pesq_bin_edges.tolist()
        
        with open(args.save_bin_edges, "w") as outfile: 
            json.dump(bin_edges, outfile)
        
    if not args.only_save_plot:
        if args.output_dir:
            for output_dir, df in zip(output_dirs, dataset):
                df.save_to_disk(output_dir)
        if args.repo_id:
            for i, (repo_id, df) in enumerate(zip(repo_ids, dataset)):
                if args.configuration:
                    df.push_to_hub(repo_id, dataset_configs[i])
                else:
                    df.push_to_hub(repo_id)


================================================
FILE: scripts/per_dataset_script/add_gender_to_MLS.py
================================================
from datasets import load_dataset
from multiprocess import set_start_method
import pandas as pd
import argparse


if __name__ == "__main__":
    set_start_method("spawn")
    parser = argparse.ArgumentParser()
    
    
    parser.add_argument("dataset_name", type=str, help="Repo id or local path.")
    parser.add_argument("tsv_path", default=None, type=str, help="Text column name.")
    parser.add_argument("--configuration", default=None, type=str, help="Dataset configuration to use.")
    parser.add_argument("--output_dir", default=None, type=str, help="If specified, save the dasaset on disk.")
    parser.add_argument("--repo_id", default=None, type=str, help="If specified, push the model to the hub.")
    parser.add_argument("--speaker_id_column_name", default="speaker_id", type=str, help="Audio column name.")
    parser.add_argument("--cpu_num_workers", default=1, type=int, help="Number of CPU workers for transformations that don't use GPUs or if no GPU are available.")

    args = parser.parse_args()
    
    if args.configuration:
        dataset = load_dataset(args.dataset_name, args.configuration)
    else:
        dataset = load_dataset(args.dataset_name)
        
    speaker_id_column_name = args.speaker_id_column_name

    speaker_dataset = pd.read_csv(args.tsv_path, sep="|", on_bad_lines='skip')
    speaker_column = ' SPEAKER   ' 
    gender_column = '   GENDER   '
    speaker_dataset = speaker_dataset.set_index(speaker_column)[gender_column]
    speaker_dataset = speaker_dataset.to_dict()
    
    def map_gender(speaker_ids):
        genders = [speaker_dataset[int(speaker)].strip() for speaker in speaker_ids]
        return {"gender": ["male" if g=="M" else "female" for g in genders]}
    
    dataset = dataset.map(map_gender, batched=True, batch_size=128, input_columns=speaker_id_column_name, num_proc=args.cpu_num_workers)

    
    if args.output_dir:
        dataset.save_to_disk(args.output_dir)
    if args.repo_id:
        if args.configuration:
            dataset.push_to_hub(args.repo_id, args.configuration)
        else:
            dataset.push_to_hub(args.repo_id)
    


================================================
FILE: scripts/per_dataset_script/add_gender_to_libritts_r.py
================================================
from datasets import load_dataset
from multiprocess import set_start_method
import pandas as pd
import argparse


if __name__ == "__main__":
    set_start_method("spawn")
    parser = argparse.ArgumentParser()
    
    
    parser.add_argument("dataset_name", type=str, help="Repo id or local path.")
    parser.add_argument("tsv_path", default=None, type=str, help="Text column name.")
    parser.add_argument("--configuration", default=None, type=str, help="Dataset configuration to use.")
    parser.add_argument("--output_dir", default=None, type=str, help="If specified, save the dasaset on disk.")
    parser.add_argument("--repo_id", default=None, type=str, help="If specified, push the model to the hub.")
    parser.add_argument("--speaker_id_column_name", default="speaker_id", type=str, help="Audio column name.")
    parser.add_argument("--cpu_num_workers", default=1, type=int, help="Number of CPU workers for transformations that don't use GPUs or if no GPU are available.")

    args = parser.parse_args()
    
    if args.configuration:
        dataset = load_dataset(args.dataset_name, args.configuration)
    else:
        dataset = load_dataset(args.dataset_name)
        
    speaker_id_column_name = args.speaker_id_column_name

    speaker_dataset = pd.read_csv(args.tsv_path, sep="\t").to_dict()
    
    def map_gender(speaker_ids):
        genders = [speaker_dataset["READER"][int(speaker)] for speaker in speaker_ids]
        return {"gender": ["male" if g=="M" else "female" for g in genders]}
    
    dataset = dataset.map(map_gender, batched=True, batch_size=128, input_columns=speaker_id_column_name, num_proc=args.cpu_num_workers)

    
    if args.output_dir:
        dataset.save_to_disk(args.output_dir)
    if args.repo_id:
        if args.configuration:
            dataset.push_to_hub(args.repo_id, args.configuration)
        else:
            dataset.push_to_hub(args.repo_id)
    


================================================
FILE: scripts/per_dataset_script/clean_libritts_r.py
================================================
from datasets import load_dataset
from multiprocess import set_start_method
import pandas as pd
import argparse
from os import listdir
import os


if __name__ == "__main__":
    set_start_method("spawn")
    parser = argparse.ArgumentParser()
    
    
    parser.add_argument("dataset_name", type=str, help="Repo id or local path.")
    parser.add_argument("bad_samples_folder", default=None, type=str, help="Path to LibriTTS-R bad folder samples.")
    parser.add_argument("--configuration", default=None, type=str, help="Dataset configuration to use.")
    parser.add_argument("--output_dir", default=None, type=str, help="If specified, save the dasaset on disk.")
    parser.add_argument("--repo_id", default=None, type=str, help="If specified, push the model to the hub.")
    parser.add_argument("--speaker_id_column_name", default="speaker_id", type=str, help="Speaker id column name.")
    parser.add_argument("--cpu_num_workers", default=1, type=int, help="Number of CPU workers for transformations that don't use GPUs or if no GPU are available.")

    args = parser.parse_args()
    
    if args.configuration:
        dataset = load_dataset(args.dataset_name, args.configuration)
    else:
        dataset = load_dataset(args.dataset_name)
        
    speaker_id_column_name = args.speaker_id_column_name
    
    # speakers to exclude because of mixed gender detection
    # cf: https://github.com/line/LibriTTS-P/blob/main/data/excluded_spk_list.txt
    speakers_to_remove = {2074, 4455, 6032, 3546, 2262, 8097, 1734, 3793, 8295}
    
    def filter_speakers(speaker, speakers_to_remove):
        return int(speaker) not in speakers_to_remove 

    print(dataset)
    dataset = dataset.filter(filter_speakers, input_columns=speaker_id_column_name, num_proc=args.cpu_num_workers, fn_kwargs={"speakers_to_remove": speakers_to_remove})
    print(dataset)
    
    bad_samples_txt_files = [os.path.join(args.bad_samples_folder, f) for f in listdir(args.bad_samples_folder) if "bad_sample" in f] 

    samples_to_filter = set()
    for txt_file in bad_samples_txt_files:
        with open(txt_file, 'r') as file:
            for line in file:
                line = line.strip().split("/")[-1].split(".")[0]

                samples_to_filter.add(line)

    print(len(samples_to_filter))
    def filter_samples(id, samples_to_filter):
        return id not in samples_to_filter 
    dataset = dataset.filter(filter_samples, input_columns="id", num_proc=args.cpu_num_workers, fn_kwargs={"samples_to_filter": samples_to_filter})

    print(dataset)
    if args.output_dir:
        dataset.save_to_disk(args.output_dir)
    if args.repo_id:
        if args.configuration:
            dataset.push_to_hub(args.repo_id, args.configuration)
        else:
            dataset.push_to_hub(args.repo_id)
    


================================================
FILE: scripts/run_prompt_creation.py
================================================
import json
import logging
import os
import re
import shutil
import sys
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union

import numpy as np
import torch
from accelerate import Accelerator, skip_first_batches
from accelerate.logging import get_logger
from datasets import DatasetDict, load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
)
from datetime import timedelta
from accelerate import InitProcessGroupKwargs


logger = get_logger(__name__, log_level="INFO")


@dataclass
class ModelArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    model_name_or_path: str = field(
        metadata={"help": "The name of the model to use (via the transformers library) for the prompt annotation."},
    )
    per_device_eval_batch_size: int = field(
        metadata={"help": "The per-device batch size to use for inference."},
    )
    model_variant: str = field(
        default=None,
        metadata={"help": "If specified load weights from `variant` filename, *e.g.* pytorch_model.<variant>.bin. "},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
    )
    torch_dtype: Optional[str] = field(
        default="float16",
        metadata={
            "help": (
                "Floating-point format in which the model weights should be initialized"
                " and the computations run. Choose one of `[float32, float16, bfloat16]`."
            )
        },
    )
    attn_implementation: Optional[str] = field(
        default="sdpa",
        metadata={"help": "Which attn type to use: ['eager', 'sdpa', 'flash_attention_2']"},
    )
    load_in_8bit: Optional[bool] = field(
        default=False, metadata={"help": "Whether to use 8-bit precision for inference."}
    )
    load_in_4bit: Optional[bool] = field(
        default=False, metadata={"help": "Whether to use 4-bit precision for inference."}
    )
    bnb_4bit_quant_type: Optional[str] = field(
        default="nf4", metadata={"help": "precise the quantization type (fp4 or nf4)"}
    )
    use_bnb_nested_quant: Optional[bool] = field(default=False, metadata={"help": "use nested quantization"})
    trust_remote_code: Optional[bool] = field(
        default=False,
        metadata={
            "help": (
                "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option "
                "should only be set to `True` for repositories you trust and in which you have read the code, as it will "
                "execute code present on the Hub on your local machine."
            )
        },
    )
    use_fast_tokenizer: Optional[bool] = field(
        default=True, metadata={"help": "Use fast tokenizer for encoding/decoding input ids"}
    )
    token: Optional[bool] = field(
        default=True,
        metadata={
            "help": "Whether or not to use an authentication token when loading/uploading from the Hugging Face Hub"
        },
    )
    do_sample: Optional[bool] = field(default=True, metadata={"help": "Whether to use sampling mode for generation"})
    temperature: Optional[float] = field(default=0.6, metadata={"help": "Temperature for sampling-based generation"})
    max_new_tokens: Optional[int] = field(
        default=256, metadata={"help": "Maximum number of new tokens during generation"}
    )
    torch_compile: Optional[bool] = field(
        default=False,
        metadata={
            "help": "Whether to compile the forward pass (not sampling) in generate. Only compatible with Gemma and LlaMA."
        },
    )


@dataclass
class DataArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    output_dir: str = field(
        metadata={
            "help": "Where to save the processed dataset to disk. If unspecified, uses a 'pretty' version of the "
            "original dataset name. E.g. 'facebook/voxpopuli' will be saved under 'voxpopuli'."
        },
    )
    dataset_name: str = field(
        default=None,
        metadata={"help": "The name of the dataset to use (via the datasets library)"},
    )
    dataset_config_name: Optional[str] = field(
        default=None,
        metadata={"help": "The configuration name of the dataset to use (via the datasets library)."},
    )
    dataset_split_name: Optional[str] = field(
        default=None,
        metadata={"help": "The split name of the dataset to use (via the datasets library)."},
    )
    dataset_cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Path to cache directory for saving and loading datasets"},
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={"help": "Maximum number of samples for generation - use for debugging purposes."},
    )
    overwrite_cache: bool = field(
        default=False,
        metadata={"help": "Overwrite the cached training and evaluation sets"},
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    dataloader_num_workers: Optional[int] = field(
        default=0,
        metadata={"help": "The number of processes to use for the dataloader."},
    )
    push_to_hub: Optional[bool] = field(
        default=False,
        metadata={"help": "Whether or not to push the processed dataset to the Hub."},
    )
    hub_dataset_id: Optional[str] = field(
        default=None,
        metadata={"help": "Repository namespace if pushing to the Hugging Face Hub."},
    )
    overwrite_output_dir: Optional[bool] = field(
        default=False,
        metadata={"help": "Overwrite the content of the output directory each time the script is run."},
    )
    save_steps: Optional[int] = field(
        default=500,
        metadata={"help": "Save the generated prompts every save_steps."},
    )
    save_total_limit: Optional[int] = field(
        default=1, metadata={"help": ("If a value is passed, will limit the total number of saved checkpoints")}
    )
    speaker_name: Optional[str] = field(
        default=None,
        metadata={"help": "If `is_single_speaker`, it specified the speaker name that you want to give to the mono-speaker of your dataset."},
    )
    is_single_speaker: Optional[bool] = field(
        default=False, metadata={"help": "Whether to use a single speaker prompt, with a single name, specified by `speaker_name`."}
    )
    is_new_speaker_prompt: Optional[bool] = field(
        default=False, metadata={"help": "Whether to use the newest speaker prompt, which will be used for the next Parler-TTS."}
    )
    speaker_id_column: Optional[str] = field(
        default=None, metadata={"help": "Speaker id column name. Only used if creating a dataset with multiple speaker names (i.e if `speaker_ids_to_name_json` is specified)"}
    )
    speaker_ids_to_name_json: Optional[str] = field(
        default=None, metadata={"help": "Path to a JSON file which map some speaker ids to some names. Only used if `speaker_id_column` is specified."}
    )
    accent_column: Optional[str] = field(
        default=None, metadata={"help": "Accent column name, if any."}
    )


    def __post_init__(self):
        if self.push_to_hub and self.hub_dataset_id is None:
            raise ValueError("You must specify the `hub_dataset_id` when setting `--push_to_hub=True`")


def get_quantization_config(model_args: ModelArguments) -> Union[BitsAndBytesConfig, None]:
    if model_args.load_in_4bit:
        compute_dtype = torch.float16
        if model_args.torch_dtype not in {"auto", None}:
            compute_dtype = getattr(torch, model_args.torch_dtype)

        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=compute_dtype,
            bnb_4bit_quant_type=model_args.bnb_4bit_quant_type,
            bnb_4bit_use_double_quant=model_args.use_bnb_nested_quant,
        )
    elif model_args.load_in_8bit:
        quantization_config = BitsAndBytesConfig(
            load_in_8bit=True,
        )
    else:
        quantization_config = None

    return quantization_config


def get_current_device() -> int:
    """Get the current device. For GPU we return the local process index to enable multiple GPU training."""
    return Accelerator().local_process_index if torch.cuda.is_available() else "cpu"


def get_kbit_device_map() -> Union[Dict[str, int], None]:
    """Useful for running inference with quantized models by setting `device_map=get_peft_device_map()`"""
    return {"": get_current_device()} if torch.cuda.is_available() else None


CHECKPOINT_PREFIX = "checkpoint"
_RE_CHECKPOINT = re.compile(r"^checkpoint-(\d+).json$")


def save_checkpoint(output_dir, all_generated_ids, step):
    checkpoint_path = f"{CHECKPOINT_PREFIX}-{step}.json"
    output_path = os.path.join(output_dir, checkpoint_path)
    all_generated_ids = [ids.tolist() for ids in all_generated_ids]
    with open(output_path, "w") as file:
        json.dump(all_generated_ids, file)


def load_checkpoint(checkpoint_path):
    with open(checkpoint_path, "r") as file:
        all_generated_ids = json.load(file)
    logger.info(f"Json file {checkpoint_path} loaded.")
    all_generated_ids = [np.array(lst) for lst in all_generated_ids]
    return all_generated_ids


def sorted_checkpoints(output_dir=None) -> List[str]:
    """Helper function to sort saved checkpoints from oldest to newest."""
    ordering_and_checkpoint_path = []

    glob_checkpoints = [str(x) for x in Path(output_dir).glob(f"{CHECKPOINT_PREFIX}-*")]

    for path in glob_checkpoints:
        regex_match = re.match(f".*{CHECKPOINT_PREFIX}-([0-9]+)", path)
        if regex_match is not None and regex_match.groups() is not None:
            ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def rotate_checkpoints(save_total_limit=None, output_dir=None) -> None:
    """Helper function to delete old checkpoints."""
    if save_total_limit is None or save_total_limit <= 0:
        return
    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = sorted_checkpoints(output_dir=output_dir)
    if len(checkpoints_sorted) <= save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info(f"Deleting older checkpoint [{checkpoint}] due to args.save_total_limit")
        os.remove(checkpoint)


def get_last_checkpoint(folder, return_list=False) -> Tuple[List, int]:
    if not os.path.exists(folder) or not os.path.isdir(folder):
        os.makedirs(folder, exist_ok=True)
        return [], 0
    content = os.listdir(folder)
    checkpoints = [path for path in content if _RE_CHECKPOINT.search(path) is not None]
    if len(checkpoints) == 0:
        return [], 0
    last_checkpoint = os.path.join(folder, max(checkpoints, key=lambda x: int(_RE_CHECKPOINT.search(x).groups()[0])))
    # Find num steps saved state string pattern
    pattern = r"checkpoint-(\d+).json"
    match = re.search(pattern, last_checkpoint)
    cur_step = int(match.group(1))
    if return_list:
        # load corresponding generated ids
        all_generated_ids = load_checkpoint(last_checkpoint)
        return all_generated_ids, cur_step
    else:
        return [], cur_step


@dataclass
class DataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received to the longest sequence in the batch.
    """

    tokenizer: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_ids = {"input_ids": [feature["input_ids"] for feature in features]}
        batch = self.tokenizer.pad(input_ids, return_tensors="pt", padding="longest", return_attention_mask=True)
        return batch


PROMPT = """You will be given six descriptive keywords related to an audio sample of a person's speech. These keywords include:
1. The gender (e.g., male, female)
2. The level of reverberation (e.g., very roomy sounding, quite roomy sounding, slightly roomy sounding, moderate reverberation, slightly confined sounding, quite confined sounding, very confined sounding)
3. The amount of noise the sample (e.g., very noisy, quite noisy, slightly noisy, moderate ambient sound, slightly clear, quite clear, very clear)
4. The tone of the speaker's voice (e.g., very monotone, quite monotone, slightly monotone, moderate intonation, slightly expressive, quite expressive, very expressive)
5. The pace of the speaker's delivery (e.g., very slowly, quite slowly, slightly slowly, moderate speed, slightly fast, quite fast, very fast)
6. The pitch of the speaker's voice (e.g., very low pitch, quite low pitch, slightly low pitch, moderate pitch, slightly high pitch, quite high pitch, very high pitch)
Your task is to create a text description using these keywords that accurately describes the speech sample while ensuring the description remains grammatically correct and easy to understand. You should rearrange the keyword order as necessary, and substitute synonymous terms where appropriate. If the amount of noise is 'very noisy' and the level of reverberation is 'very roomy sounding', include terms like 'very bad recording' in the description. Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very confined sounding', include terms like 'very good recording' in the description. Otherwise, do not add extra details beyond what has been provided, and only return the generated description.
For example, given the following keywords: 'female', 'slightly roomy sounding', 'slightly noisy', 'very expressive', 'slightly low pitch', 'very slowly', a valid description would be: 'a woman with a deep voice speaks slowly but has an animated delivery in an echoey room with some background noise'.
For the keywords: '[gender]', '[reverberation]', '[noise]', '[speech_monotony]', '[pitch]', '[speaking_rate]', the corresponding description is:"
"""

NEW_PROMPT = """You will be given six descriptive keywords related to an audio sample of a person's speech. These keywords include:
1. The gender (male, female)
2. The level of reverberation (very distant-sounding, distant-sounding, slightly distant-sounding, slightly close-sounding, very close-sounding)
3. The amount of noise in the sample (extremely noisy, very noisy, noisy, slightly noisy, almost no noise, very clear)
4. The tone of the speaker's voice (very monotone, monotone, slightly expressive and animated, expressive and animated, very expressive and animated)
5. The pace of the speaker's delivery (very slowly, slowly, slightly slowly, moderate speed, slightly fast, fast, very fast)
6. The pitch of the speaker's voice (very low-pitch, low-pitch, slightly low-pitch, moderate pitch, slightly high-pitch, high-pitch, very high-pitch)

Your task is to create a text description using these keywords that accurately describes the speech sample.
If the amount of noise is 'very noisy' and the level of reverberation is 'very distant-sounding', you must include terms such as 'very poor recording' or `very bad recording` in the description. 
Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very close-sounding', you must include terms like 'very good recording' or `excellent recording` in the description. 
You can randomly omit the following terms, as they are default terms: 'moderate speed' and 'moderate pitch'.
Do not add extra details beyond what has been provided above. You can change the order of keywords, and replace synonymous terms.

For example, given the following keywords: 'female', 'slightly distant-sounding', 'noisy', 'very expressive and animated', 'very slowly', 'moderate pitch', a valid description would be: 'A woman speaks very slowly but has a very animated delivery. The recording is noisy and there is some roominess.'
Another valid description would be: 'In a noisy room, a female speaker delivers a very animated and expressive speech, at a very slow pace.'
Another valid description would be: 'A woman enunciates a very expressive speech. Her voice is slightly distant-sounding, with some background noise present. She speaks very slowly with a moderate pitch but a very expressive tone.'

Ensure that the generated description is grammatically correct, easy to understand, and concise. Only return one and only one description.

For the keywords: '[gender]', '[reverberation]', '[sdr_noise]', '[speech_monotony]', '[speaking_rate]', '[pitch]', the corresponding description is:
"""

NEW_PROMPT_WITH_ACCENT = """You will be given 7 descriptive keywords related to an audio sample of a person's speech. These keywords include:
1. The gender (male, female)
2. The level of reverberation (very distant-sounding, distant-sounding, slightly distant-sounding, slightly close-sounding, very close-sounding)
3. The amount of noise in the sample (extremely noisy, very noisy, noisy, slightly noisy, almost no noise, very clear)
4. The tone of the speaker's voice (very monotone, monotone, slightly expressive and animated, expressive and animated, very expressive and animated)
5. The pace of the speaker's delivery (very slowly, slowly, slightly slowly, moderate speed, slightly fast, fast, very fast)
6. The pitch of the speaker's voice (very low-pitch, low-pitch, slightly low-pitch, moderate pitch, slightly high-pitch, high-pitch, very high-pitch)
7. The accent of the speaker.

Your task is to create a text description using these keywords that accurately describes the speech sample.
If the amount of noise is 'very noisy' and the level of reverberation is 'very distant-sounding', you must include terms such as 'very poor recording' or `very bad recording` in the description. 
Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very close-sounding', you must include terms like 'very good recording' or `excellent recording` in the description. 
You can randomly omit the following terms, as they are default terms: 'moderate speed' and 'moderate pitch'.
Do not add extra details beyond what has been provided above. You can change the order of keywords, and replace synonymous terms.

For example, given the following keywords: 'female', 'slightly distant-sounding', 'noisy', 'very expressive and animated', 'very slowly', 'moderate pitch', 'Chinese', a valid description would be: 'A woman with a Chinese accent speaks very slowly but has a very animated delivery. The recording is noisy and there is some roominess.'
Another valid description would be: 'In a noisy room, a female speaker with a Chinese accent delivers a very animated and expressive speech, at a very slow pace.'
Another valid description would be: 'A woman with a Chinese accent enunciates a very expressive speech. Her voice is slightly distant-sounding, with some background noise present. She speaks very slowly with a moderate pitch but a very expressive tone.'

Ensure that the generated description is grammatically correct, easy to understand, and concise. Only return one and only one description.

For the keywords: '[gender]', '[reverberation]', '[sdr_noise]', '[speech_monotony]', '[speaking_rate]', '[pitch]', '[accent]', the corresponding description is:
"""


NEW_SINGLE_SPEAKER_PROMPT = """You will be given four descriptive keywords related to an audio sample of [speaker_name]'s speech. These keywords include:
1. The level of reverberation (very distant-sounding, distant-sounding, slightly distant-sounding, slightly close-sounding, very close-sounding)
3. The amount of noise in the sample (extremely noisy, very noisy, noisy, slightly noisy, almost no noise, very clear)
3. The tone of the speaker's voice (very monotone, monotone, slightly expressive and animated, expressive and animated, very expressive and animated)
4. The pace of the speaker's delivery (very slowly, slowly, slightly slowly, moderate speed, slightly fast, fast, very fast)

Your task is to create a text description using these keywords that accurately describes [speaker_name]'s speech sample.
If the amount of noise is 'very noisy' and the level of reverberation is 'very distant-sounding', you must include terms such as 'very poor recording' or `very bad recording` in the description. 
Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very close-sounding', you must include terms like 'very good recording' or `excellent recording` in the description. 
You can randomly omit the following terms, as they are default terms: 'moderate speed' and 'moderate pitch'.
Do not add extra details beyond what has been provided above. You can change the order of keywords, and replace synonymous terms.

For example, given the following keywords: 'slightly distant-sounding', 'clear', 'very expressive and animated', 'slightly fast', a valid description would be: '[speaker_name] speaks slightly fast but has a very animated delivery in a room with slight echo but no background noise.'
Another valid description would be: `In a very animated voice, [speaker_name] delivers words slightly quickly. The room is quite, but there's a bit of echo.'

Ensure that the generated description is grammatically correct, easy to understand, and concise. Only return one and only one description.

For the keywords: ''[reverberation]', '[sdr_noise]', '[speech_monotony]', '[speaking_rate]', the corresponding description is:
"""

SINGLE_SPEAKER_PROMPT = """You will be given four descriptive keywords related to an audio sample of [speaker_name]'s speech. These keywords include:
1. The level of reverberation (e.g., very roomy sounding, quite roomy sounding, slightly roomy sounding, moderate reverberation, slightly confined sounding, quite confined sounding, very confined sounding)
2. The amount of noise the sample (e.g., very noisy, quite noisy, slightly noisy, moderate ambient sound, slightly clear, quite clear, very clear)
3. The tone of the speaker's voice (e.g., very monotone, quite monotone, slightly monotone, moderate intonation, slightly expressive, quite expressive, very expressive)
4. The pace of the speaker's delivery (e.g., very slowly, quite slowly, slightly slowly, moderate speed, slightly fast, quite fast, very fast)

Your task is to create a single and only short text description using these keywords that accurately describes the speech sample while ensuring the description remains grammatically correct and easy to understand. You should rearrange the keyword order as necessary, and substitute synonymous terms where appropriate. If the amount of noise is 'very noisy' and the level of reverberation is 'very roomy sounding', you must include terms like 'very bad recording' in the description. Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very confined sounding', you must include terms like 'very good recording' in the description. Otherwise, do not add extra details beyond what has been provided, and only return the generated description.

For example, given the following keywords: 'slightly roomy sounding', 'quite noisy', 'very expressive', 'very slowly', a valid description would be: '[speaker_name] speaks very slowly but has an animated delivery in an echoey room with background noise.'.
Feel free to change the order of keywords, and to use synonyms, for example, with the previous keywords: `In a very expressive voice, [speaker_name] pronounces her words incredibly slowly. There's some background noise in this room with a bit of echo.'.

For the keywords: ''[reverberation]', '[noise]', '[speech_monotony]', '[speaking_rate]', the corresponding description is:
"""

def main():
    # 1. Parse input arguments
    parser = HfArgumentParser((ModelArguments, DataArguments))
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, data_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
    else:
        model_args, data_args = parser.parse_args_into_dataclasses()

    # 2. Setup logging
    # Make one log on every process with the configuration for debugging.
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    
    if data_args.is_single_speaker and data_args.speaker_name is None:
        raise ValueError("`is_single_speaker=True` but `speaker_name` is not specified. Specify it or remove `is_single_speaker`.")

    if not data_args.is_single_speaker and data_args.speaker_name:
        raise ValueError(f"`is_single_speaker=False` but `speaker_name=data_args.speaker_name` is not specified. Add `--is_single_speaker` or remove `speaker_name`.")


    # Create the custom configuration
    process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=3600*3))
    accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])

    if data_args.overwrite_output_dir and os.path.exists(data_args.output_dir) and os.path.isdir(data_args.output_dir):
        logger.info("Cleaning output dir from previous run...")
        shutil.rmtree(data_args.output_dir)

    # 3. Load annotated dataset
    logger.info("*** Load annotated dataset ***")
    if data_args.dataset_split_name is not None:
        raw_datasets = DatasetDict()
        data_splits = data_args.dataset_split_name.split("+")
        # load on a split-wise basis
        for split in data_splits:
            with accelerator.local_main_process_first():
                raw_datasets[split] = load_dataset(
                    data_args.dataset_name,
                    data_args.dataset_config_name,
                    split=split,
                    cache_dir=model_args.cache_dir,
                    token=model_args.token,
                    num_proc=data_args.preprocessing_num_workers,
                )
    else:
        with accelerator.local_main_process_first():
            # load all splits for annotation
            raw_datasets = load_dataset(
                data_args.dataset_name,
                data_args.dataset_config_name,
                cache_dir=model_args.cache_dir,
                token=model_args.token,
                num_proc=data_args.preprocessing_num_workers,
            )

    raw_datasets_features = set(raw_datasets[next(iter(raw_datasets))].features.keys())

    if data_args.max_eval_samples is not None:
        for split in raw_datasets:
            raw_datasets[split] = raw_datasets[split].select(range(data_args.max_eval_samples))

    EXPECTED_COLUMNS = {"gender", "pitch", "noise", "reverberation", "speech_monotony", "speaking_rate"}
    if data_args.is_single_speaker:
        EXPECTED_COLUMNS = {"noise", "reverberation", "speech_monotony", "speaking_rate"}
        
    if data_args.is_new_speaker_prompt:
        EXPECTED_COLUMNS.remove("noise")
        EXPECTED_COLUMNS.add("sdr_noise")
        
    speaker_ids_to_name = {}
    speaker_id_column = data_args.speaker_id_column
    if data_args.speaker_id_column and data_args.speaker_ids_to_name_json:
        import json
        if data_args.is_single_speaker:
            raise ValueError(f"`is_single_speaker=True` but `speaker_ids_to_name_json={data_args.speaker_ids_to_name_json}`. Specify one or another.")
        
        EXPECTED_COLUMNS.add(data_args.speaker_id_column)
        with open(data_args.speaker_ids_to_name_json, "r") as read_file:
            speaker_ids_to_name = json.load(read_file)

    if not EXPECTED_COLUMNS.issubset(raw_datasets_features):
        missing_columns = EXPECTED_COLUMNS - raw_datasets_features
        raise ValueError(
            f"Missing columns {missing_columns} from the dataset features. Got dataset features {raw_datasets_features}"
        )

    # 4. Load pre-trained model
    logger.info("*** Load pretrained model ***")
    torch_dtype = (
        model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype)
    )
    quantization_config = get_quantization_config(model_args)

    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        revision=model_args.model_revision,
        variant=model_args.model_variant,
        trust_remote_code=model_args.trust_remote_code,
        attn_implementation=model_args.attn_implementation,
        torch_dtype=torch_dtype,
        device_map=get_kbit_device_map() if quantization_config is not None else None,
        quantization_config=quantization_config,
        low_cpu_mem_usage=True,
        token=model_args.token,
    ).eval()

    if model_args.torch_compile:
        # torch compile only compatible with gemma and llama
        if not callable(getattr(model, "_setup_cache", None)):
            raise ValueError(
                f"Static k/v cache is not compatible with the model {model.__class__.__name__}. Set `--torch_compile=False"
                "for dynamic k/v cache"
            )
        model.generation_config.cache_implementation = "static"
        # compile the forward pass (but not the top-{p,k} sampling)
        model = torch.compile(model, mode="reduce-overhead", fullgraph=True)

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        use_fast=model_args.use_fast_tokenizer,
        padding_side="left",
    )
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.bos_token_id
        model.generation_config.pad_token_id = model.generation_config.eos_token_id

    speaker_name = data_args.speaker_name
    is_single_speaker = data_args.is_single_speaker
    is_new_speaker_prompt = data_args.is_new_speaker_prompt
    accent_column_name = data_args.accent_column

    def prepare_dataset(sample):
        sample_prompt = PROMPT
        if is_single_speaker:
            sample_prompt = SINGLE_SPEAKER_PROMPT if not is_new_speaker_prompt else NEW_SINGLE_SPEAKER_PROMPT
            sample_prompt = sample_prompt.replace(f"[speaker_name]", speaker_name)
        elif (speaker_id_column and speaker_ids_to_name.get(str(sample.get(speaker_id_column)), None)):
            name =  speaker_ids_to_name.get(str(sample.get(speaker_id_column)), None)
            sample_prompt = SINGLE_SPEAKER_PROMPT if not is_new_speaker_prompt else NEW_SINGLE_SPEAKER_PROMPT
            sample_prompt = sample_prompt.replace(f"[speaker_name]", name)
        elif is_new_speaker_prompt and accent_column_name is not None:
            sample_prompt = NEW_PROMPT if sample.get(accent_column_name, "Unindentified") == "Unindentified" else NEW_PROMPT_WITH_ACCENT
        elif is_new_speaker_prompt:
            sample_prompt = NEW_PROMPT
        for key in EXPECTED_COLUMNS:
            sample_prompt = sample_prompt.replace(f"[{key}]", sample[key])
        if accent_column_name is not None and sample.get(accent_column_name, "Unindentified") != "Unindentified":
            sample_prompt = sample_prompt.replace("[accent]", sample["accent"])
            
        sample_prompt = [{"role": "user", "content": sample_prompt}]
        token_ids = tokenizer.apply_chat_template(sample_prompt)
        sample["input_ids"] = token_ids
        return sample

    with accelerator.local_main_process_first():
        vectorized_datasets = raw_datasets.map(
            prepare_dataset, num_proc=data_args.preprocessing_num_workers, desc="Preparing prompts"
        )

    # Prepare everything with our `accelerator`
    model = accelerator.prepare(model)
    data_collator = DataCollatorWithPadding(tokenizer)

    def generate_step(batch):
        output_ids = accelerator.unwrap_model(model).generate(
            batch["input_ids"],
            attention_mask=batch["attention_mask"],
            do_sample=model_args.do_sample,
            temperature=model_args.temperature,
            max_new_tokens=model_args.max_new_tokens,
        )
        output_ids = accelerator.pad_across_processes(output_ids, dim=1, pad_index=tokenizer.pad_token_id)
        return output_ids

    def postprocess_dataset(batch):
        prompt_texts = tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True)
        generated_texts = tokenizer.batch_decode(batch["generated_ids"], skip_special_tokens=True)
        
        batch["text_description"] = [generated_text[len(prompt_text) :] for (prompt_text, generated_text) in zip(prompt_texts, generated_texts)]
        return batch

    for split in vectorized_datasets:
        data_loader = DataLoader(
            vectorized_datasets[split],
            batch_size=model_args.per_device_eval_batch_size,
            collate_fn=data_collator,
            num_workers=data_args.dataloader_num_workers,
            pin_memory=True,
        )
        data_loader = accelerator.prepare(data_loader)
        total_inference_steps = len(data_loader)
        progress_bar = tqdm(
            range(total_inference_steps), desc=" ... ", position=0, disable=not accelerator.is_local_main_process
        )

        split_output_dir = os.path.join(data_args.output_dir, split)
        all_generated_ids, cur_step = get_last_checkpoint(split_output_dir, accelerator.is_local_main_process)
        accelerator.wait_for_everyone()

        if cur_step > 0:
            logger.info(f"Resuming {split} from step {cur_step}")
            # efficiently skip the first n batches
            data_loader = skip_first_batches(data_loader, cur_step)
            progress_bar.update(cur_step)

        while cur_step < total_inference_steps:
            for batch in data_loader:
                generated_ids = generate_step(batch)
                generated_ids = accelerator.gather_for_metrics(generated_ids)
                if accelerator.is_local_main_process:
                    all_generated_ids.extend(generated_ids.cpu().numpy())

                cur_step += 1
                progress_bar.update(1)

                if (cur_step % data_args.save_steps == 0) or (cur_step == total_inference_steps):
                    if accelerator.is_main_process:
                        save_checkpoint(split_output_dir, all_generated_ids, cur_step)
                        rotate_checkpoints(data_args.save_total_limit, output_dir=split_output_dir)
                    accelerator.wait_for_everyone()

        if accelerator.is_local_main_process:
            vectorized_datasets[split] = vectorized_datasets[split].add_column("generated_ids", all_generated_ids)

        if accelerator.is_main_process:
            vectorized_datasets[split] = vectorized_datasets[split].map(
                postprocess_dataset,
                batched=True,
                num_proc=data_args.preprocessing_num_workers,
                desc="Postprocessing dataset",
                remove_columns=["input_ids", "generated_ids"],
            )
        accelerator.wait_for_everyone()

    if accelerator.is_main_process:
        vectorized_datasets.save_to_disk(data_args.output_dir)
        if data_args.push_to_hub:
            vectorized_datasets.push_to_hub(
                data_args.hub_dataset_id,
                config_name=data_args.dataset_config_name if data_args.dataset_config_name is not None else "default",
                token=model_args.token,
            )
    accelerator.wait_for_everyone()
    accelerator.end_training()


if __name__ == "__main__":
    main()

================================================
FILE: scripts/run_prompt_creation_llm_swarm.py
================================================
import json
import os
import re
import shutil
import sys
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Tuple, List

import logging

import math
from datasets import DatasetDict, load_dataset
from tqdm import tqdm
from transformers import (
    AutoTokenizer,
    HfArgumentParser,
)
import asyncio
from llm_swarm import LLMSwarm, LLMSwarmConfig
from huggingface_hub import AsyncInferenceClient


logger = logging.getLogger(__name__)


@dataclass
class ModelArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    model_name_or_path: str = field(
        metadata={
            "help": "The name of the model to use (via the transformers library) for the prompt annotation."
        },
    )
    num_instances: int = field(
        default=1,
        metadata={"help": "Number of TGI instances."},
    )
    per_instance_max_parallel_requests: int = field(
        default=500,
        metadata={"help": "Maximum number of parallel requests per instance."},
    )
    checkpoint_interval: Optional[int] = field(
        default=1000,
        metadata={
            "help": "Interval for streaming chunks of generation."
        },
    )
    model_revision: str = field(
        default="main",
        metadata={
            "help": "The specific model version to use (can be a branch name, tag name or commit id)."
        },
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={
            "help": "Where to store the pretrained models downloaded from huggingface.co"
        },
    )
    do_sample: Optional[bool] = field(
        default=True, metadata={"help": "Whether to use sampling mode for generation"}
    )
    temperature: Optional[float] = field(
        default=0.6, metadata={"help": "Temperature for sampling-based generation"}
    )
    max_new_tokens: Optional[int] = field(
        default=256, metadata={"help": "Maximum number of new tokens during generation"}
    )
    token: Optional[bool] = field(
        default=True,
        metadata={
            "help": "Whether or not to use an authentication token when loading/uploading from the Hugging Face Hub"
        },
    )
    debug_endpoint: Optional[str] = field(
        default=None,
        metadata={"help": "Endpoint to use for debugging (e.g. http://localhost:13120)."},
    )
    max_retries: Optional[int] = field(
        default=5,
        metadata={"help": "Maximum number of retries per sample."},
    )
    retry_delay_in_s: Optional[float] = field(
        default=5.0,
        metadata={"help": "Time to wait between successive retries in seconds."},
    )


@dataclass
class DataArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    output_dir: str = field(
        metadata={
            "help": "Where to save the processed dataset to disk. If unspecified, uses a 'pretty' version of the "
            "original dataset name. E.g. 'facebook/voxpopuli' will be saved under 'voxpopuli'."
        },
    )
    dataset_name: str = field(
        default=None,
        metadata={"help": "The name of the dataset to use (via the datasets library)"},
    )
    dataset_config_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "The configuration name of the dataset to use (via the datasets library)."
        },
    )
    dataset_split_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "The split name of the dataset to use (via the datasets library)."
        },
    )
    dataset_cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Path to cache directory for saving and loading datasets"},
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "Maximum number of samples for generation - use for debugging purposes."
        },
    )
    overwrite_cache: bool = field(
        default=False,
        metadata={"help": "Overwrite the cached training and evaluation sets"},
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    push_to_hub: Optional[bool] = field(
        default=False,
        metadata={"help": "Whether or not to push the processed dataset to the Hub."},
    )
    hub_dataset_id: Optional[str] = field(
        default=None,
        metadata={"help": "Repository namespace if pushing to the Hugging Face Hub."},
    )
    overwrite_output_dir: Optional[bool] = field(
        default=False,
        metadata={
            "help": "Overwrite the content of the output directory each time the script is run."
        },
    )
    save_steps: Optional[int] = field(
        default=100,
        metadata={"help": "Save the generated prompts every save_steps."},
    )
    save_total_limit: Optional[int] = field(
        default=1, metadata={"help": ("If a value is passed, will limit the total number of saved checkpoints")}
    )
    speaker_name: Optional[str] = field(
        default=None,
        metadata={"help": "If `is_single_speaker`, it specified the speaker name that you want to give to the mono-speaker of your dataset."},
    )
    is_single_speaker: Optional[bool] = field(
        default=False, metadata={"help": "Whether to use a single speaker prompt, with a single name, specified by `speaker_name`."}
    )
    is_new_speaker_prompt: Optional[bool] = field(
        default=False, metadata={"help": "Whether to use the newest speaker prompt, which will be used for the next Parler-TTS."}
    )
    speaker_id_column: Optional[str] = field(
        default=None, metadata={"help": "Speaker id column name. Only used if creating a dataset with multiple speaker names (i.e if `speaker_ids_to_name_json` is specified)"}
    )
    speaker_ids_to_name_json: Optional[str] = field(
        default=None, metadata={"help": "Path to a JSON file which map some speaker ids to some names. Only used if `speaker_id_column` is specified."}
    )
    accent_column: Optional[str] = field(
        default=None, metadata={"help": "Accent column name, if any."}
    )

    def __post_init__(self):
        if self.push_to_hub and self.hub_dataset_id is None:
            raise ValueError(
                "You must specify the `hub_dataset_id` when setting `--push_to_hub=True`"
            )

CHECKPOINT_PREFIX = "checkpoint"
_RE_CHECKPOINT = re.compile(r"^checkpoint-(\d+).json$")


def save_checkpoint(output_dir, all_generated_ids, step):
    checkpoint_path = f"{CHECKPOINT_PREFIX}-{step}.json"
    output_path = os.path.join(output_dir, checkpoint_path)
    with open(output_path, "w") as file:
        json.dump(all_generated_ids, file)


def load_checkpoint(checkpoint_path):
    with open(checkpoint_path, "r") as file:
        all_generated_ids = json.load(file)
    return all_generated_ids


def sorted_checkpoints(output_dir=None) -> List[str]:
    """Helper function to sort saved checkpoints from oldest to newest."""
    ordering_and_checkpoint_path = []

    glob_checkpoints = [str(x) for x in Path(output_dir).glob(f"{CHECKPOINT_PREFIX}-*")]

    for path in glob_checkpoints:
        regex_match = re.match(f".*{CHECKPOINT_PREFIX}-([0-9]+)", path)
        if regex_match is not None and regex_match.groups() is not None:
            ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def rotate_checkpoints(save_total_limit=None, output_dir=None) -> None:
    """Helper function to delete old checkpoints."""
    if save_total_limit is None or save_total_limit <= 0:
        return
    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = sorted_checkpoints(output_dir=output_dir)
    if len(checkpoints_sorted) <= save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info(f"Deleting older checkpoint [{checkpoint}] due to args.save_total_limit")
        os.remove(checkpoint)


def get_last_checkpoint(folder) -> Tuple[List, int]:
    if not os.path.exists(folder) or not os.path.isdir(folder):
        os.makedirs(folder, exist_ok=True)
        return [], 0
    content = os.listdir(folder)
    checkpoints = [path for path in content if _RE_CHECKPOINT.search(path) is not None]
    if len(checkpoints) == 0:
        return [], 0
    last_checkpoint = os.path.join(folder, max(checkpoints, key=lambda x: int(_RE_CHECKPOINT.search(x).groups()[0])))
    # Find num steps saved state string pattern
    pattern = r"checkpoint-(\d+).json"
    match = re.search(pattern, last_checkpoint)
    cur_step = int(match.group(1))
    # load corresponding generated ids
    all_generated_ids = load_checkpoint(last_checkpoint)
    return all_generated_ids, cur_step



PROMPT = """You will be given six descriptive keywords related to an audio sample of a person's speech. These keywords include:
1. The gender (e.g., male, female)
2. The level of reverberation (e.g., very roomy sounding, quite roomy sounding, slightly roomy sounding, moderate reverberation, slightly confined sounding, quite confined sounding, very confined sounding)
3. The amount of noise the sample (e.g., very noisy, quite noisy, slightly noisy, moderate ambient sound, slightly clear, quite clear, very clear)
4. The tone of the speaker's voice (e.g., very monotone, quite monotone, slightly monotone, moderate intonation, slightly expressive, quite expressive, very expressive)
5. The pace of the speaker's delivery (e.g., very slowly, quite slowly, slightly slowly, moderate speed, slightly fast, quite fast, very fast)
6. The pitch of the speaker's voice (e.g., very low pitch, quite low pitch, slightly low pitch, moderate pitch, slightly high pitch, quite high pitch, very high pitch)
Your task is to create a text description using these keywords that accurately describes the speech sample while ensuring the description remains grammatically correct and easy to understand. You should rearrange the keyword order as necessary, and substitute synonymous terms where appropriate. If the amount of noise is 'very noisy' and the level of reverberation is 'very roomy sounding', include terms like 'very bad recording' in the description. Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very confined sounding', include terms like 'very good recording' in the description. Otherwise, do not add extra details beyond what has been provided, and only return the generated description.
For example, given the following keywords: 'female', 'slightly roomy sounding', 'slightly noisy', 'very expressive', 'slightly low pitch', 'very slowly', a valid description would be: 'a woman with a deep voice speaks slowly but has an animated delivery in an echoey room with some background noise'.
For the keywords: '[gender]', '[reverberation]', '[noise]', '[speech_monotony]', '[pitch]', '[speaking_rate]', the corresponding description is:"
"""

NEW_PROMPT = """You will be given six descriptive keywords related to an audio sample of a person's speech. These keywords include:
1. The gender (male, female)
2. The level of reverberation (very distant-sounding, distant-sounding, slightly distant-sounding, slightly close-sounding, very close-sounding)
3. The amount of noise in the sample (extremely noisy, very noisy, noisy, slightly noisy, almost no noise, very clear)
4. The tone of the speaker's voice (very monotone, monotone, slightly expressive and animated, expressive and animated, very expressive and animated)
5. The pace of the speaker's delivery (very slowly, slowly, slightly slowly, moderate speed, slightly fast, fast, very fast)
6. The pitch of the speaker's voice (very low-pitch, low-pitch, slightly low-pitch, moderate pitch, slightly high-pitch, high-pitch, very high-pitch)

Your task is to create a text description using these keywords that accurately describes the speech sample.
If the amount of noise is 'very noisy' and the level of reverberation is 'very distant-sounding', you must include terms such as 'very poor recording' or `very bad recording` in the description. 
Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very close-sounding', you must include terms like 'very good recording' or `excellent recording` in the description. 
You can randomly omit the following terms, as they are default terms: 'moderate speed' and 'moderate pitch'.
Do not add extra details beyond what has been provided above. You can change the order of keywords, and replace synonymous terms.

For example, given the following keywords: 'female', 'slightly distant-sounding', 'noisy', 'very expressive and animated', 'very slowly', 'moderate pitch', a valid description would be: 'A woman speaks very slowly but has a very animated delivery. The recording is noisy and there is some roominess.'
Another valid description would be: 'In a noisy room, a female speaker delivers a very animated and expressive speech, at a very slow pace.'
Another valid description would be: 'A woman enunciates a very expressive speech. Her voice is slightly distant-sounding, with some background noise present. She speaks very slowly with a moderate pitch but a very expressive tone.'

Ensure that the generated description is grammatically correct, easy to understand, and concise. Only return one and only one description.

For the keywords: '[gender]', '[reverberation]', '[sdr_noise]', '[speech_monotony]', '[speaking_rate]', '[pitch]', the corresponding description is:
"""

NEW_PROMPT_WITH_ACCENT = """You will be given 7 descriptive keywords related to an audio sample of a person's speech. These keywords include:
1. The gender (male, female)
2. The level of reverberation (very distant-sounding, distant-sounding, slightly distant-sounding, slightly close-sounding, very close-sounding)
3. The amount of noise in the sample (extremely noisy, very noisy, noisy, slightly noisy, almost no noise, very clear)
4. The tone of the speaker's voice (very monotone, monotone, slightly expressive and animated, expressive and animated, very expressive and animated)
5. The pace of the speaker's delivery (very slowly, slowly, slightly slowly, moderate speed, slightly fast, fast, very fast)
6. The pitch of the speaker's voice (very low-pitch, low-pitch, slightly low-pitch, moderate pitch, slightly high-pitch, high-pitch, very high-pitch)
7. The accent of the speaker.

Your task is to create a text description using these keywords that accurately describes the speech sample.
If the amount of noise is 'very noisy' and the level of reverberation is 'very distant-sounding', you must include terms such as 'very poor recording' or `very bad recording` in the description. 
Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very close-sounding', you must include terms like 'very good recording' or `excellent recording` in the description. 
You can randomly omit the following terms, as they are default terms: 'moderate speed' and 'moderate pitch'.
Do not add extra details beyond what has been provided above. You can change the order of keywords, and replace synonymous terms.

For example, given the following keywords: 'female', 'slightly distant-sounding', 'noisy', 'very expressive and animated', 'very slowly', 'moderate pitch', 'Chinese', a valid description would be: 'A woman with a Chinese accent speaks very slowly but has a very animated delivery. The recording is noisy and there is some roominess.'
Another valid description would be: 'In a noisy room, a female speaker with a Chinese accent delivers a very animated and expressive speech, at a very slow pace.'
Another valid description would be: 'A woman with a Chinese accent enunciates a very expressive speech. Her voice is slightly distant-sounding, with some background noise present. She speaks very slowly with a moderate pitch but a very expressive tone.'

Ensure that the generated description is grammatically correct, easy to understand, and concise. Only return one and only one description.

For the keywords: '[gender]', '[reverberation]', '[sdr_noise]', '[speech_monotony]', '[speaking_rate]', '[pitch]', '[accent]', the corresponding description is:
"""


NEW_SINGLE_SPEAKER_PROMPT = """You will be given four descriptive keywords related to an audio sample of [speaker_name]'s speech. These keywords include:
1. The level of reverberation (very distant-sounding, distant-sounding, slightly distant-sounding, slightly close-sounding, very close-sounding)
3. The amount of noise in the sample (extremely noisy, very noisy, noisy, slightly noisy, almost no noise, very clear)
3. The tone of the speaker's voice (very monotone, monotone, slightly expressive and animated, expressive and animated, very expressive and animated)
4. The pace of the speaker's delivery (very slowly, slowly, slightly slowly, moderate speed, slightly fast, fast, very fast)

Your task is to create a text description using these keywords that accurately describes [speaker_name]'s speech sample.
If the amount of noise is 'very noisy' and the level of reverberation is 'very distant-sounding', you must include terms such as 'very poor recording' or `very bad recording` in the description. 
Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very close-sounding', you must include terms like 'very good recording' or `excellent recording` in the description. 
You can randomly omit the following terms, as they are default terms: 'moderate speed' and 'moderate pitch'.
Do not add extra details beyond what has been provided above. You can change the order of keywords, and replace synonymous terms.

For example, given the following keywords: 'slightly distant-sounding', 'clear', 'very expressive and animated', 'slightly fast', a valid description would be: '[speaker_name] speaks slightly fast but has a very animated delivery in a room with slight echo but no background noise.'
Another valid description would be: `In a very animated voice, [speaker_name] delivers words slightly quickly. The room is quite, but there's a bit of echo.'

Ensure that the generated description is grammatically correct, easy to understand, and concise. Only return one and only one description.

For the keywords: ''[reverberation]', '[sdr_noise]', '[speech_monotony]', '[speaking_rate]', the corresponding description is:
"""

SINGLE_SPEAKER_PROMPT = """You will be given four descriptive keywords related to an audio sample of [speaker_name]'s speech. These keywords include:
1. The level of reverberation (e.g., very roomy sounding, quite roomy sounding, slightly roomy sounding, moderate reverberation, slightly confined sounding, quite confined sounding, very confined sounding)
2. The amount of noise the sample (e.g., very noisy, quite noisy, slightly noisy, moderate ambient sound, slightly clear, quite clear, very clear)
3. The tone of the speaker's voice (e.g., very monotone, quite monotone, slightly monotone, moderate intonation, slightly expressive, quite expressive, very expressive)
4. The pace of the speaker's delivery (e.g., very slowly, quite slowly, slightly slowly, moderate speed, slightly fast, quite fast, very fast)

Your task is to create a single and only short text description using these keywords that accurately describes the speech sample while ensuring the description remains grammatically correct and easy to understand. You should rearrange the keyword order as necessary, and substitute synonymous terms where appropriate. If the amount of noise is 'very noisy' and the level of reverberation is 'very roomy sounding', you must include terms like 'very bad recording' in the description. Likewise, if the amount of noise is 'very clear' and the level of reverberation is 'very confined sounding', you must include terms like 'very good recording' in the description. Otherwise, do not add extra details beyond what has been provided, and only return the generated description.

For example, given the following keywords: 'slightly roomy sounding', 'quite noisy', 'very expressive', 'very slowly', a valid description would be: '[speaker_name] speaks very slowly but has an animated delivery in an echoey room with background noise.'.
Feel free to change the order of keywords, and to use synonyms, for example, with the previous keywords: `In a very expressive voice, [speaker_name] pronounces her words incredibly slowly. There's some background noise in this room with a bit of echo.'.

For the keywords: ''[reverberation]', '[noise]', '[speech_monotony]', '[speaking_rate]', the corresponding description is:
"""

# 1. Parse input arguments
parser = HfArgumentParser((ModelArguments, DataArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
    # If we pass only one argument to the script and it's the path to a json file,
    # let's parse it to get our arguments.
    model_args, data_args = parser.parse_json_file(
        json_file=os.path.abspath(sys.argv[1])
    )
else:
    model_args, data_args = parser.parse_args_into_dataclasses()

if data_args.is_single_speaker and data_args.speaker_name is None:
    raise ValueError("`is_single_speaker=True` but `speaker_name` is not specified. Specify it or remove `is_single_speaker`.")

if not data_args.is_single_speaker and data_args.speaker_name:
    raise ValueError(f"`is_single_speaker=False` but `speaker_name=data_args.speaker_name` is not specified. Add `--is_single_speaker` or remove `speaker_name`.")

EXPECTED_COLUMNS = {"gender", "pitch", "noise", "reverberation", "speech_monotony", "speaking_rate"}
if data_args.is_single_speaker:
    EXPECTED_COLUMNS = {"noise", "reverberation", "speech_monotony", "speaking_rate"}
    
if data_args.is_new_speaker_prompt:
    EXPECTED_COLUMNS.remove("noise")
    EXPECTED_COLUMNS.add("sdr_noise")

speaker_ids_to_name = {}
speaker_id_column = data_args.speaker_id_column
if data_args.speaker_id_column and data_args.speaker_ids_to_name_json:
    import json
    if data_args.is_single_speaker:
        raise ValueError(f"`is_single_speaker=True` but `speaker_ids_to_name_json={data_args.speaker_ids_to_name_json}`. Specify one or another.")
    
    EXPECTED_COLUMNS.add(data_args.speaker_id_column)
    with open(data_args.speaker_ids_to_name_json, "r") as read_file:
        speaker_ids_to_name = json.load(read_file)

speaker_name = data_args.speaker_name
is_single_speaker = data_args.is_single_speaker
is_new_speaker_prompt = data_args.is_new_speaker_prompt
accent_column_name = data_args.accent_column
    
with LLMSwarm(
    LLMSwarmConfig(
        instances=model_args.num_instances,
        inference_engine="tgi",
        slurm_template_path="./tgi_h100.template.slurm",
        load_balancer_template_path="./nginx.template.conf",
        model=model_args.model_name_or_path,
        revision=model_args.model_revision,
        per_instance_max_parallel_requests=model_args.per_instance_max_parallel_requests,
        debug_endpoint=model_args.debug_endpoint,
    )
) as llm_swarm:
    semaphore = asyncio.Semaphore(llm_swarm.suggested_max_parallel_requests)
    client = AsyncInferenceClient(model=llm_swarm.endpoint)
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        revision=model_args.model_revision,
    )

    async def process_text(sample):
        sample_prompt = PROMPT
        if is_single_speaker:
            sample_prompt = SINGLE_SPEAKER_PROMPT if not is_new_speaker_prompt else NEW_SINGLE_SPEAKER_PROMPT
            sample_prompt = sample_prompt.replace(f"[speaker_name]", speaker_name)
        elif (speaker_id_column and speaker_ids_to_name.get(str(sample.get(speaker_id_column)), None)):
            name =  speaker_ids_to_name.get(str(sample.get(speaker_id_column)), None)
            sample_prompt = SINGLE_SPEAKER_PROMPT if not is_new_speaker_prompt else NEW_SINGLE_SPEAKER_PROMPT
            sample_prompt = sample_prompt.replace(f"[speaker_name]", name)
        elif is_new_speaker_prompt and accent_column_name is not None:
            sample_prompt = NEW_PROMPT if sample.get(accent_column_name, "Unindentified") == "Unindentified" else NEW_PROMPT_WITH_ACCENT
        elif is_new_speaker_prompt:
            sample_prompt = NEW_PROMPT
        for key in EXPECTED_COLUMNS:
            sample_prompt = sample_prompt.replace(f"[{key}]", sample[key])
        if accent_column_name is not None and sample.get(accent_column_name, "Unindentified") != "Unindentified":
            sample_prompt = sample_prompt.replace("[accent]", sample["accent"])

        sample_prompt = [{"role": "user", "content": sample_prompt}]
        sample_prompt = tokenizer.apply_chat_template(sample_prompt, tokenize=False)
        attempt = 0
        while attempt < model_args.max_retries:
            try:
                async with semaphore:
                    return await client.text_generation(
                        prompt=sample_prompt,
                        max_new_tokens=model_args.max_new_tokens,
                        temperature=model_args.temperature,
                        do_sample=model_args.do_sample,
                    )
            except Exception as e:
                attempt += 1
                if attempt < model_args.max_retries:
                    print(
                        f"Request failed due to {e}\nRetrying in {model_args.retry_delay_in_s} seconds... (Attempt {attempt}/{model_args.max_retries})"
                    )
                    await asyncio.sleep(model_args.retry_delay_in_s)
                else:
                    raise ValueError(
                        f"Max retries reached. Failed with error: {e}."
                    )

    async def main():
        # 2. Setup logging
        logger.setLevel(logging.INFO)
        logging.basicConfig(
            format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
            datefmt="%m/%d/%Y %H:%M:%S",
            handlers=[logging.StreamHandler(sys.stdout)],
        )

        if (
            data_args.overwrite_output_dir
            and os.path.exists(data_args.output_dir)
            and os.path.isdir(data_args.output_dir)
        ):
            logger.info("Cleaning output dir from previous run...")
            shutil.rmtree(data_args.output_dir)

        # 3. Load annotated dataset
        logger.info("*** Load annotated dataset ***")
        if data_args.dataset_split_name is not None:
            raw_datasets = DatasetDict()
            data_splits = data_args.dataset_split_name.split("+")
            # load on a split-wise basis
            for split in data_splits:
                raw_datasets[split] = load_dataset(
                    data_args.dataset_name,
                    data_args.dataset_config_name,
                    split=split,
                    cache_dir=model_args.cache_dir,
                    token=model_args.token,
                    num_proc=data_args.preprocessing_num_workers,
                )
        else:
            # load all splits for annotation
            raw_datasets = load_dataset(
                data_args.dataset_name,
                data_args.dataset_config_name,
                cache_dir=model_args.cache_dir,
                token=model_args.token,
                num_proc=data_args.preprocessing_num_workers,
            )

        raw_datasets_features = set(
            raw_datasets[next(iter(raw_datasets))].features.keys()
        )

        if data_args.max_eval_samples is not None:
            for split in raw_datasets:
                raw_datasets[split] = raw_datasets[split].select(
                    range(data_args.max_eval_samples)
                )

        if not EXPECTED_COLUMNS.issubset(raw_datasets_features):
            missing_columns = EXPECTED_COLUMNS - raw_datasets_features
            raise ValueError(
                f"Missing columns {missing_columns} from the dataset features. Got dataset features {raw_datasets_features}"
            )

        for split in raw_datasets:
            total_samples = len(raw_datasets[split])
            total_inference_steps = math.ceil(total_samples / model_args.checkpoint_interval)

            split_output_dir = os.path.join(data_args.output_dir, split)
            progress_bar = tqdm(range(total_inference_steps), desc=f"{split}", position=0)

            all_generated_ids, inference_step = get_last_checkpoint(split_output_dir)
            if inference_step > 0:
                logger.info(f"Resuming {split} from step {inference_step}")
                progress_bar.update(inference_step)

            while inference_step < total_inference_steps:
                start_index = inference_step * model_args.checkpoint_interval
                end_index = min((inference_step + 1) * model_args.checkpoint_interval, total_samples)
                inference_chunk = raw_datasets[split].select(range(start_index, end_index))
                results = await asyncio.gather(
                    *(process_text(sample) for sample in inference_chunk)
                )
                inference_step += 1
                progress_bar.update(1)
                all_generated_ids.extend(results)

                if (inference_step % data_args.save_steps == 0) or (inference_step == total_inference_steps):
                    logger.info(f"Saving generations of step {inference_step}")
                    save_checkpoint(split_output_dir, all_generated_ids, inference_step)
                    rotate_checkpoints(data_args.save_total_limit, output_dir=split_output_dir)

            raw_datasets[split] = raw_datasets[split].add_column(
                "text_description", all_generated_ids
            )

        raw_datasets.save_to_disk(data_args.output_dir)
        if data_args.push_to_hub:
            raw_datasets.push_to_hub(
                data_args.hub_dataset_id,
                config_name=(
                    data_args.dataset_config_name
                    if data_args.dataset_config_name is not None
                    else "default"
                ),
                token=model_args.token,
            )

    asyncio.run(main())
Download .txt
gitextract_4p4dwrk2/

├── .gitignore
├── LICENSE
├── README.md
├── dataspeech/
│   ├── __init__.py
│   ├── cpu_enrichments/
│   │   ├── __init__.py
│   │   └── rate.py
│   └── gpu_enrichments/
│       ├── __init__.py
│       ├── pitch.py
│       ├── snr_and_reverb.py
│       └── squim.py
├── examples/
│   ├── prompt_creation/
│   │   ├── run_prompt_creation_10k.sh
│   │   ├── run_prompt_creation_1k.sh
│   │   ├── run_prompt_creation_1k_with_speaker_consistency.sh
│   │   ├── run_prompt_creation_45k.sh
│   │   ├── run_prompt_creation_dummy.sh
│   │   ├── run_prompt_creation_jenny.sh
│   │   └── speaker_ids_to_names.json
│   ├── prompt_creation_llm_swarm/
│   │   ├── nginx.template.conf
│   │   ├── run_prompt_creation_10k.sh
│   │   ├── run_prompt_creation_1k.sh
│   │   ├── run_prompt_creation_dummy.sh
│   │   ├── run_prompt_creation_full_mls.sh
│   │   └── tgi_h100.template.slurm
│   ├── tagging/
│   │   ├── run_main_10k.sh
│   │   ├── run_main_1k.sh
│   │   ├── run_main_45k.sh
│   │   └── run_main_dummy.sh
│   └── tags_to_annotations/
│       ├── run_metadata_to_text_10k.sh
│       ├── run_metadata_to_text_10k_v02.sh
│       ├── run_metadata_to_text_for_finetuning.sh
│       ├── v01_bin_edges.json
│       ├── v01_text_bins.json
│       ├── v02_bin_edges.json
│       └── v02_text_bins.json
├── main.py
├── requirements.txt
└── scripts/
    ├── filter_audio_separation.py
    ├── merge_audio_to_metadata.py
    ├── metadata_to_text.py
    ├── per_dataset_script/
    │   ├── add_gender_to_MLS.py
    │   ├── add_gender_to_libritts_r.py
    │   └── clean_libritts_r.py
    ├── run_prompt_creation.py
    └── run_prompt_creation_llm_swarm.py
Download .txt
SYMBOL INDEX (37 symbols across 11 files)

FILE: dataspeech/cpu_enrichments/rate.py
  function rate_apply (line 5) | def rate_apply(batch, rank=None, audio_column_name="audio", text_column_...

FILE: dataspeech/gpu_enrichments/pitch.py
  function pitch_apply (line 23) | def pitch_apply(batch, rank=None, audio_column_name="audio", output_colu...

FILE: dataspeech/gpu_enrichments/snr_and_reverb.py
  function snr_apply (line 11) | def snr_apply(batch, rank=None, audio_column_name="audio", batch_size=32):

FILE: dataspeech/gpu_enrichments/squim.py
  function squim_apply (line 8) | def squim_apply(batch, rank=None, audio_column_name="audio"):

FILE: scripts/filter_audio_separation.py
  function wrap_audio (line 15) | def wrap_audio(audio, sr):
  function filter_stems (line 23) | def filter_stems(batch, rank=None):

FILE: scripts/metadata_to_text.py
  function visualize_bins_to_text (line 22) | def visualize_bins_to_text(values_1, values_2, name_1, name_2, text_bins...
  function bins_to_text (line 56) | def bins_to_text(dataset, text_bins, column_name, output_column_name, le...
  function speaker_level_relative_to_gender (line 102) | def speaker_level_relative_to_gender(dataset, text_bins, speaker_column_...

FILE: scripts/per_dataset_script/add_gender_to_MLS.py
  function map_gender (line 35) | def map_gender(speaker_ids):

FILE: scripts/per_dataset_script/add_gender_to_libritts_r.py
  function map_gender (line 31) | def map_gender(speaker_ids):

FILE: scripts/per_dataset_script/clean_libritts_r.py
  function filter_speakers (line 35) | def filter_speakers(speaker, speakers_to_remove):
  function filter_samples (line 53) | def filter_samples(id, samples_to_filter):

FILE: scripts/run_prompt_creation.py
  class ModelArguments (line 32) | class ModelArguments:
  class DataArguments (line 111) | class DataArguments:
    method __post_init__ (line 194) | def __post_init__(self):
  function get_quantization_config (line 199) | def get_quantization_config(model_args: ModelArguments) -> Union[BitsAnd...
  function get_current_device (line 221) | def get_current_device() -> int:
  function get_kbit_device_map (line 226) | def get_kbit_device_map() -> Union[Dict[str, int], None]:
  function save_checkpoint (line 235) | def save_checkpoint(output_dir, all_generated_ids, step):
  function load_checkpoint (line 243) | def load_checkpoint(checkpoint_path):
  function sorted_checkpoints (line 251) | def sorted_checkpoints(output_dir=None) -> List[str]:
  function rotate_checkpoints (line 267) | def rotate_checkpoints(save_total_limit=None, output_dir=None) -> None:
  function get_last_checkpoint (line 283) | def get_last_checkpoint(folder, return_list=False) -> Tuple[List, int]:
  class DataCollatorWithPadding (line 305) | class DataCollatorWithPadding:
    method __call__ (line 312) | def __call__(self, features: List[Dict[str, Union[List[int], torch.Ten...
  function main (line 414) | def main():

FILE: scripts/run_prompt_creation_llm_swarm.py
  class ModelArguments (line 28) | class ModelArguments:
  class DataArguments (line 94) | class DataArguments:
    method __post_init__ (line 180) | def __post_init__(self):
  function save_checkpoint (line 190) | def save_checkpoint(output_dir, all_generated_ids, step):
  function load_checkpoint (line 197) | def load_checkpoint(checkpoint_path):
  function sorted_checkpoints (line 203) | def sorted_checkpoints(output_dir=None) -> List[str]:
  function rotate_checkpoints (line 219) | def rotate_checkpoints(save_total_limit=None, output_dir=None) -> None:
  function get_last_checkpoint (line 235) | def get_last_checkpoint(folder) -> Tuple[List, int]:
  function process_text (line 408) | async def process_text(sample):
  function main (line 450) | async def main():
Condensed preview — 44 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (173K chars).
[
  {
    "path": ".gitignore",
    "chars": 77,
    "preview": "artefacts/*\nenv_dataspeech/*\n**/__pycache__/*\nwip_scripts/*\nplots/*\n.vscode/*"
  },
  {
    "path": "LICENSE",
    "chars": 1079,
    "preview": "MIT License\n\nCopyright (c) 2024 The Hugging Face team.\n\nPermission is hereby granted, free of charge, to any person obta"
  },
  {
    "path": "README.md",
    "chars": 23712,
    "preview": "# Data-Speech\n\nData-Speech is a suite of utility scripts designed to tag speech datasets. \n\nIts aim is to provide a simp"
  },
  {
    "path": "dataspeech/__init__.py",
    "chars": 104,
    "preview": "from .cpu_enrichments import rate_apply\nfrom .gpu_enrichments import pitch_apply, snr_apply, squim_apply"
  },
  {
    "path": "dataspeech/cpu_enrichments/__init__.py",
    "chars": 30,
    "preview": "from .rate import rate_apply\n\n"
  },
  {
    "path": "dataspeech/cpu_enrichments/rate.py",
    "chars": 1845,
    "preview": "from g2p import make_g2p\n\ntransducer = make_g2p('eng', 'eng-ipa')\n\ndef rate_apply(batch, rank=None, audio_column_name=\"a"
  },
  {
    "path": "dataspeech/gpu_enrichments/__init__.py",
    "chars": 99,
    "preview": "from .pitch import pitch_apply\nfrom .snr_and_reverb import snr_apply\nfrom .squim import squim_apply"
  },
  {
    "path": "dataspeech/gpu_enrichments/pitch.py",
    "chars": 2396,
    "preview": "import torch \nimport penn\n\n\n# Here we'll use a 10 millisecond hopsize\nhopsize = .01\n\n# Provide a sensible frequency rang"
  },
  {
    "path": "dataspeech/gpu_enrichments/snr_and_reverb.py",
    "chars": 2911,
    "preview": "from pyannote.audio import Model\nfrom pathlib import Path\nfrom brouhaha.pipeline import RegressiveActivityDetectionPipel"
  },
  {
    "path": "dataspeech/gpu_enrichments/squim.py",
    "chars": 1855,
    "preview": "from torchaudio.pipelines import SQUIM_OBJECTIVE\nimport torch \nimport torchaudio\n\nmodel = None\nmax_audio_length = 15 * S"
  },
  {
    "path": "examples/prompt_creation/run_prompt_creation_10k.sh",
    "chars": 1531,
    "preview": "#!/usr/bin/env bash\n\naccelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \\\n  -"
  },
  {
    "path": "examples/prompt_creation/run_prompt_creation_1k.sh",
    "chars": 990,
    "preview": "#!/usr/bin/env bash\n\naccelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \\\n  -"
  },
  {
    "path": "examples/prompt_creation/run_prompt_creation_1k_with_speaker_consistency.sh",
    "chars": 1237,
    "preview": "#!/usr/bin/env bash\n\naccelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \\\n  -"
  },
  {
    "path": "examples/prompt_creation/run_prompt_creation_45k.sh",
    "chars": 1884,
    "preview": "#!/usr/bin/env bash\n\naccelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=8 run_prompt_creation.py \\\n  -"
  },
  {
    "path": "examples/prompt_creation/run_prompt_creation_dummy.sh",
    "chars": 511,
    "preview": "#!/usr/bin/env bash\n\npython run_prompt_creation.py \\\n  --dataset_name \"ylacombe/libritts_r_tags_and_text\" \\\n  --dataset_"
  },
  {
    "path": "examples/prompt_creation/run_prompt_creation_jenny.sh",
    "chars": 563,
    "preview": "#!/usr/bin/env bash\npython ./scripts/run_prompt_creation.py \\\n  --speaker_name \"Jenny\" \\\n  --is_single_speaker \\\n  --is_"
  },
  {
    "path": "examples/prompt_creation/speaker_ids_to_names.json",
    "chars": 704,
    "preview": "{\n    \"192\": \"Brenda\",\n    \"274\": \"Eileen\",\n    \"392\": \"Joy\",\n    \"409\": \"James\",\n    \"412\": \"Eric\",\n    \"505\": \"Aaron\","
  },
  {
    "path": "examples/prompt_creation_llm_swarm/nginx.template.conf",
    "chars": 825,
    "preview": "events {\n    # resolve \"worker_connections are not enough while connecting to upstream\"\n    # https://stackoverflow.com/"
  },
  {
    "path": "examples/prompt_creation_llm_swarm/run_prompt_creation_10k.sh",
    "chars": 334,
    "preview": "#!/usr/bin/env bash\n\npython run_prompt_creation_llm_swarm.py \\\n  --dataset_name \"ylacombe/mls-eng-10k-text-tags-v2\" \\\n  "
  },
  {
    "path": "examples/prompt_creation_llm_swarm/run_prompt_creation_1k.sh",
    "chars": 458,
    "preview": "#!/usr/bin/env bash\n\npython run_prompt_creation_llm_swarm.py \\\n  --dataset_name \"stable-speech/libritts-r-tags-and-text\""
  },
  {
    "path": "examples/prompt_creation_llm_swarm/run_prompt_creation_dummy.sh",
    "chars": 305,
    "preview": "#!/usr/bin/env bash\n\npython run_prompt_creation_llm_swarm.py \\\n  --dataset_name \"stable-speech/libritts-r-tags-and-text\""
  },
  {
    "path": "examples/prompt_creation_llm_swarm/run_prompt_creation_full_mls.sh",
    "chars": 550,
    "preview": "#!/usr/bin/env bash\n\npython ./run_prompt_creation_llm_swarm.py \\\n  --dataset_name 'ylacombe/mls-eng-text-tags-v5' \\\n  --"
  },
  {
    "path": "examples/prompt_creation_llm_swarm/tgi_h100.template.slurm",
    "chars": 1497,
    "preview": "#!/bin/bash\n#SBATCH --job-name=llm-swarm\n#SBATCH --partition hopper-prod\n#SBATCH --gpus={{gpus}}\n#SBATCH --cpus-per-task"
  },
  {
    "path": "examples/tagging/run_main_10k.sh",
    "chars": 920,
    "preview": "#!/usr/bin/env bash\n\npython main.py \"blabble-io/libritts_r\" \\\n    --configuration \"clean\" \\\n    --output_dir ./tmp_libri"
  },
  {
    "path": "examples/tagging/run_main_1k.sh",
    "chars": 742,
    "preview": "#!/usr/bin/env bash\n\npython main.py \"blabble-io/libritts_r\" \\\n    --configuration \"clean\" \\\n    --output_dir ./tmp_libri"
  },
  {
    "path": "examples/tagging/run_main_45k.sh",
    "chars": 1046,
    "preview": "#!/usr/bin/env bash\n\npython main.py \"blabble-io/libritts_r\" \\\n    --configuration \"clean\" \\\n    --output_dir ./tmp_libri"
  },
  {
    "path": "examples/tagging/run_main_dummy.sh",
    "chars": 285,
    "preview": "#!/usr/bin/env bash\n\npython main.py \"blabble-io/libritts_r\" \\\n    --configuration \"dev\" \\\n    --output_dir ./tmp_libritt"
  },
  {
    "path": "examples/tags_to_annotations/run_metadata_to_text_10k.sh",
    "chars": 442,
    "preview": "#!/usr/bin/env bash\n\npython ./scripts/metadata_to_text.py \"ylacombe/mls-eng-10k-tags+ylacombe/libritts_r_tags+ylacombe/l"
  },
  {
    "path": "examples/tags_to_annotations/run_metadata_to_text_10k_v02.sh",
    "chars": 705,
    "preview": "#!/usr/bin/env bash\n\npython ./scripts/metadata_to_text.py \"ylacombe/mls-eng-10k-tags+ylacombe/libritts_r_tags+ylacombe/l"
  },
  {
    "path": "examples/tags_to_annotations/run_metadata_to_text_for_finetuning.sh",
    "chars": 419,
    "preview": "#!/usr/bin/env bash\n\npython ./scripts/metadata_to_text.py \\\n    \"ylacombe/jenny-tts-tags-v1\" \\\n    --repo_id \"jenny-tts-"
  },
  {
    "path": "examples/tags_to_annotations/v01_bin_edges.json",
    "chars": 1032,
    "preview": "{\"speaking_rate\": [3.508771929824561, 6.187242299296628, 8.865712668768696, 11.544183038240764, 14.22265340771283, 16.90"
  },
  {
    "path": "examples/tags_to_annotations/v01_text_bins.json",
    "chars": 873,
    "preview": "{\n    \"speaker_rate_bins\":\n        [\"very slowly\", \"quite slowly\", \"slightly slowly\", \"moderate speed\", \"slightly fast\","
  },
  {
    "path": "examples/tags_to_annotations/v02_bin_edges.json",
    "chars": 999,
    "preview": "{\n    \"speaking_rate\": [0.0, 3.8258038258038254, 7.651607651607651, 11.477411477411476, 15.303215303215302, 19.129019129"
  },
  {
    "path": "examples/tags_to_annotations/v02_text_bins.json",
    "chars": 758,
    "preview": "{\n    \"speaker_rate_bins\":\n        [\"very slowly\", \"slowly\", \"slightly slowly\", \"moderate speed\", \"slightly fast\", \"fast"
  },
  {
    "path": "main.py",
    "chars": 7264,
    "preview": "from datasets import load_dataset, Audio\nfrom multiprocess import set_start_method\nfrom dataspeech import rate_apply, pi"
  },
  {
    "path": "requirements.txt",
    "chars": 128,
    "preview": "datasets[audio]\nhttps://github.com/marianne-m/brouhaha-vad/archive/main.zip\npenn\ng2p\ndemucs\ntransformers\naccelerate\nbits"
  },
  {
    "path": "scripts/filter_audio_separation.py",
    "chars": 4214,
    "preview": "from demucs import pretrained\nfrom demucs.apply import apply_model\nfrom demucs.audio import convert_audio\nfrom datasets "
  },
  {
    "path": "scripts/merge_audio_to_metadata.py",
    "chars": 2773,
    "preview": "import numpy as np\nimport pandas as pd\nfrom datasets import load_dataset, concatenate_datasets\nfrom multiprocess import "
  },
  {
    "path": "scripts/metadata_to_text.py",
    "chars": 22146,
    "preview": "import numpy as np\nimport pandas as pd\nfrom datasets import load_dataset, DatasetDict\nfrom multiprocess import set_start"
  },
  {
    "path": "scripts/per_dataset_script/add_gender_to_MLS.py",
    "chars": 2128,
    "preview": "from datasets import load_dataset\nfrom multiprocess import set_start_method\nimport pandas as pd\nimport argparse\n\n\nif __n"
  },
  {
    "path": "scripts/per_dataset_script/add_gender_to_libritts_r.py",
    "chars": 1922,
    "preview": "from datasets import load_dataset\nfrom multiprocess import set_start_method\nimport pandas as pd\nimport argparse\n\n\nif __n"
  },
  {
    "path": "scripts/per_dataset_script/clean_libritts_r.py",
    "chars": 2810,
    "preview": "from datasets import load_dataset\nfrom multiprocess import set_start_method\nimport pandas as pd\nimport argparse\nfrom os "
  },
  {
    "path": "scripts/run_prompt_creation.py",
    "chars": 36532,
    "preview": "import json\nimport logging\nimport os\nimport re\nimport shutil\nimport sys\nfrom dataclasses import dataclass, field\nfrom pa"
  },
  {
    "path": "scripts/run_prompt_creation_llm_swarm.py",
    "chars": 30565,
    "preview": "import json\nimport os\nimport re\nimport shutil\nimport sys\nfrom dataclasses import dataclass, field\nfrom pathlib import Pa"
  }
]

About this extraction

This page contains the full source code of the huggingface/dataspeech GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 44 files (160.4 KB), approximately 40.0k tokens, and a symbol index with 37 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Copied to clipboard!