Full Code of neonbjb/tortoise-tts for AI

main 8a2563ecabe9 cached
56 files
1.2 MB
599.1k tokens
428 symbols
Repository: neonbjb/tortoise-tts
Branch: main
Commit: 8a2563ecabe9
Files: 56
Total size: 1.2 MB

Directory structure:
gitextract_v5pjsrh0/

├── .github/
│   └── workflows
├── .gitignore
├── Advanced_Usage.md
├── CHANGELOG.md
├── CITATION.cff
├── Dockerfile
├── LICENSE
├── MANIFEST.in
├── README.md
├── requirements.txt
├── scripts/
│   └── tortoise_tts.py
├── setup.cfg
├── setup.py
├── tortoise/
│   ├── __init__.py
│   ├── api.py
│   ├── api_fast.py
│   ├── data/
│   │   ├── got.txt
│   │   ├── layman.txt
│   │   ├── mel_norms.pth
│   │   ├── riding_hood.txt
│   │   ├── seal_copypasta.txt
│   │   └── tokenizer.json
│   ├── do_tts.py
│   ├── eval.py
│   ├── get_conditioning_latents.py
│   ├── is_this_from_tortoise.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── arch_util.py
│   │   ├── autoregressive.py
│   │   ├── classifier.py
│   │   ├── clvp.py
│   │   ├── cvvp.py
│   │   ├── diffusion_decoder.py
│   │   ├── hifigan_decoder.py
│   │   ├── random_latent_generator.py
│   │   ├── stream_generator.py
│   │   ├── transformer.py
│   │   ├── vocoder.py
│   │   └── xtransformers.py
│   ├── read.py
│   ├── read_fast.py
│   ├── socket_client.py
│   ├── socket_server.py
│   ├── tts_stream.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── audio.py
│   │   ├── diffusion.py
│   │   ├── stft.py
│   │   ├── text.py
│   │   ├── tokenizer.py
│   │   ├── typical_sampling.py
│   │   └── wav2vec_alignment.py
│   └── voices/
│       └── cond_latent_example/
│           └── pat.pth
├── tortoise_tts.ipynb
├── tortoise_v2_examples.html
└── voice_customization_guide.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows
================================================
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Upload Python Package

on:
  release:
    types: [published]

permissions:
  contents: read

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v3
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install build
    - name: Build package
      run: python -m build
    - name: Publish package
      uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
      with:
        user: __token__
        password: ${{ secrets.PYPI_API_TOKEN }}


================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

.idea/*
.models/*
.custom/*
results/*
debug_states/*

================================================
FILE: Advanced_Usage.md
================================================
## Advanced Usage

### Generation settings

Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs
that can be turned that I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using
various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've
set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with
these settings (and it's very likely that I missed something!)

These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See
```api.tts``` for a full list.

### Prompt engineering

Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion
by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to
take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the
prompt "\[I am really sad,\] Please feed me." will only speak the words "Please feed me" (with a sad tonality).
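
The bracket convention can be illustrated with a minimal sketch. This is not Tortoise's actual redaction code, which is not shown here. `split_redacted_prompt` is a hypothetical helper that only demonstrates which words act as prompt-only context and which are spoken.

```python
import re

# Hypothetical helper, for illustration only; not part of the Tortoise API.
_BRACKETED = re.compile(r"\[([^\]]*)\]")


def split_redacted_prompt(prompt: str) -> tuple[str, str]:
    """Return (conditioning_text, spoken_text) for a bracketed prompt.

    conditioning_text keeps the bracketed words (brackets removed) so they can
    influence tone; spoken_text drops the bracketed words entirely.
    """
    conditioning_text = _BRACKETED.sub(r"\1", prompt)
    spoken_text = _BRACKETED.sub("", prompt)
    # Collapse any whitespace left behind by the substitutions.
    conditioning_text = " ".join(conditioning_text.split())
    spoken_text = " ".join(spoken_text.split())
    return conditioning_text, spoken_text


conditioning, spoken = split_redacted_prompt("[I am really sad,] Please feed me.")
print(conditioning)  # I am really sad, Please feed me.
print(spoken)        # Please feed me.
```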

### Playing with the voice latent

Tortoise ingests reference clips by feeding them individually through a small submodel that produces a point latent,
then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents
are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.

This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise and it will output
what it thinks the "average" of those two voices sounds like.
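
The averaging described above can be sketched in a few lines. This is a simplified illustration that uses plain Python lists in place of tensors. `mean_latent` and the toy vectors are assumptions made for the example, not Tortoise code.

```python
from typing import List, Sequence


def mean_latent(point_latents: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized point latents (one per reference clip)."""
    if not point_latents:
        raise ValueError("at least one latent is required")
    dim = len(point_latents[0])
    if any(len(latent) != dim for latent in point_latents):
        raise ValueError("all latents must have the same dimensionality")
    count = len(point_latents)
    return [sum(latent[i] for latent in point_latents) / count for i in range(dim)]


# Toy 3-dimensional latents standing in for two different voices.
voice_a = [1.0, 0.0, 2.0]
voice_b = [3.0, 4.0, 0.0]
print(mean_latent([voice_a, voice_b]))  # [2.0, 2.0, 1.0]
```

Under this view, mixing two voices simply means passing clips from both speakers so that their point latents are averaged together.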

#### Generating conditioning latents from voices

Use the script `get_conditioning_latents.py` to extract conditioning latents for a voice you have installed. This script
will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).

Alternatively, use the api.TextToSpeech.get_conditioning_latents() to fetch the latents.

#### Using raw conditioning latents to generate speech

After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single
".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).

## Tortoise-detect

Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip
came from Tortoise.

This classifier can be run on any computer. Usage is as follows:

```commandline
python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
```

This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier
as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false
positives.

## Model architecture

Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate
models that work together. I've assembled a write-up of the system architecture here:
[https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/)

## Training

These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of
~50k hours of speech data, most of which was transcribed by [ocotillo](http://www.github.com/neonbjb/ocotillo). Training was done on my own
[DLAS](https://github.com/neonbjb/DL-Art-School) trainer.

I currently do not have plans to release the training configurations or methodology. See the next section.

## Ethical Considerations

Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began
wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system
could be misused are many. It doesn't take much creativity to think up how.

After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:

1. It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
2. It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
3. The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
4. I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See `tortoise-detect` above.
5. If I, a tinkerer with a BS in computer science with a ~$15k computer can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.

### Diversity

The diversity expressed by ML models is strongly tied to the datasets they were trained on.

Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to
balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities
or of people who speak with strong accents.

## Looking forward

Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when
training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training
of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with
exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.

I want to mention here
that I think Tortoise could be a **lot** better. The three major components of Tortoise are either vanilla Transformer Encoder stacks
or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason
to believe that the same is not true of TTS.


================================================
FILE: CHANGELOG.md
================================================
## Changelog
#### v3.0.0; 2023/10/18
- Added fast inference for tortoise with HiFi Decoder (inspired by xtts by [coquiTTS](https://github.com/coqui-ai/TTS) 🐸, check out their multilingual model for noncommercial uses)
#### v2.8.0; 2023/9/13
- Added custom tokenizer for non-English models
#### v2.7.0; 2023/7/26
- Bug fixes
- Added Apple Silicon Support
- Updated Transformer version
#### v2.6.0; 2023/7/26
- Bug fixes

#### v2.5.0; 2023/7/09
- Added kv_cache support (5x faster)
- Added deepspeed support (10x faster)
- Added half precision support
  
#### v2.4.0; 2022/5/17
- Removed CVVP model. Found that it does not, in fact, make an appreciable difference in the output.
- Add better debugging support; existing tools now spit out debug files which can be used to reproduce bad runs.

#### v2.3.0; 2022/5/12
- New CLVP-large model for further improved decoding guidance.
- Improvements to read.py and do_tts.py (new options)

#### v2.2.0; 2022/5/5
- Added several new voices from the training set.
- Automated redaction. Wrap the text you want to use to prompt the model but not be spoken in brackets.
- Bug fixes

#### v2.1.0; 2022/5/2
- Added ability to produce totally random voices.
- Added ability to download voice conditioning latent via a script, and then use a user-provided conditioning latent.
- Added ability to use your own pretrained models.
- Refactored directory structures.
- Performance improvements & bug fixes.


================================================
FILE: CITATION.cff
================================================
cff-version: 1.3.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Betker"
  given-names: "James"
  orcid: "https://orcid.org/my-orcid?orcid=0000-0003-3259-4862"
title: "TorToiSe text-to-speech"
version: 2.0
date-released: 2022-04-28
url: "https://github.com/neonbjb/tortoise-tts"

================================================
FILE: Dockerfile
================================================
FROM nvidia/cuda:12.2.0-base-ubuntu22.04

COPY . /app

RUN apt-get update && \
    apt-get install -y --allow-unauthenticated --no-install-recommends \
    wget \
    git \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /var/lib/apt/lists/*

ENV HOME="/root"
ENV CONDA_DIR="${HOME}/miniconda"
ENV PATH="$CONDA_DIR/bin":$PATH
ENV CONDA_AUTO_UPDATE_CONDA=false
ENV PIP_DOWNLOAD_CACHE="$HOME/.pip/cache"
ENV TORTOISE_MODELS_DIR="$HOME/tortoise-tts/build/lib/tortoise/models"

RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda3.sh \
    && bash /tmp/miniconda3.sh -b -p "${CONDA_DIR}" -f -u \
    && "${CONDA_DIR}/bin/conda" init bash \
    && rm -f /tmp/miniconda3.sh \
    && echo ". '${CONDA_DIR}/etc/profile.d/conda.sh'" >> "${HOME}/.profile"

# --login option used to source bashrc (thus activating conda env) at every RUN statement
SHELL ["/bin/bash", "--login", "-c"]

RUN conda create --name tortoise python=3.9 numba inflect -y \
    && conda activate tortoise \
    && conda install --yes pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia \
    && conda install --yes transformers=4.31.0 \
    && cd /app \
    && python setup.py install


================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: MANIFEST.in
================================================
recursive-include tortoise/data *
recursive-include tortoise/voices *


================================================
FILE: README.md
================================================
# TorToiSe

Tortoise is a text-to-speech program built with the following priorities:

1. Strong multi-voice capabilities.
2. Highly realistic prosody and intonation.
   
This repo contains all the code needed to run Tortoise TTS in inference mode.

Manuscript: https://arxiv.org/abs/2305.07243
## Hugging Face space

A live demo is hosted on Hugging Face Spaces. If you'd like to avoid a queue, please duplicate the Space and add a GPU. Please note that CPU-only spaces do not work for this demo.

https://huggingface.co/spaces/Manmay/tortoise-tts

## Install via pip
```bash
pip install tortoise-tts
```

If you would like to install the latest development version, you can also install it directly from the git repository:

```bash
pip install git+https://github.com/neonbjb/tortoise-tts
```

## What's in a name?

I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model
is insanely slow. It leverages both an autoregressive decoder **and** a diffusion decoder; both known for their low
sampling rates. On a K80, expect to generate a medium sized sentence every 2 minutes.

Well... not so slow anymore: we can now get a **0.25-0.3 RTF** on 4 GB of VRAM, and with streaming we can get **< 500 ms** latency!
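
For readers unfamiliar with the term, RTF (real-time factor) is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 are faster than real time. The sketch below assumes that conventional definition, which this README does not spell out. The function name and sample numbers are illustrative only.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Example: 2.5 s of compute to produce a 10 s clip.
print(real_time_factor(2.5, 10.0))  # 0.25
```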

## Demos

See [this page](http://nonint.com/static/tortoise_v2_examples.html) for a large list of example outputs.

A cool application of Tortoise + GPT-3 (not affiliated with this repository): https://twitter.com/lexman_ai. Unfortunately, this project seems no longer to be active.

## Usage guide

### Local installation

If you want to use this on your own computer, you must have an NVIDIA GPU.

> [!TIP]
> On Windows, I **highly** recommend using the Conda installation method. I have been told that if you do not do this, you will spend a lot of time chasing dependency problems.

First, install miniconda: https://docs.conda.io/en/latest/miniconda.html

Then run the following commands, using Anaconda Prompt as the terminal (or any other terminal configured to work with conda).

This will:
1. create conda environment with minimal dependencies specified
1. activate the environment
1. install pytorch with the command provided here: https://pytorch.org/get-started/locally/
1. clone tortoise-tts
1. change the current directory to tortoise-tts
1. run tortoise python setup install script

```shell
conda create --name tortoise python=3.9 numba inflect
conda activate tortoise
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install transformers=4.29.2
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install
```

Optionally, pytorch can be installed in the base environment, so that other conda environments can use it too. To do this, simply send the `conda install pytorch...` line before activating the tortoise environment.

> [!NOTE]  
> When you want to use tortoise-tts, you will always have to ensure the `tortoise` conda environment is activated.

If you are on Windows, you may also need to install pysoundfile: `conda install -c conda-forge pysoundfile`

### Docker

Docker is an easy way to hit the ground running and a good jumping-off point, depending on your use case.

```sh
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts

docker build . -t tts

docker run --gpus all \
    -e TORTOISE_MODELS_DIR=/models \
    -v /mnt/user/data/tortoise_tts/models:/models \
    -v /mnt/user/data/tortoise_tts/results:/results \
    -v /mnt/user/data/.cache/huggingface:/root/.cache/huggingface \
    -v /root:/work \
    -it tts
```
This gives you an interactive terminal in an environment that's ready to do some tts. Now you can explore the different interfaces that tortoise exposes for tts.

For example:

```sh
cd app
conda activate tortoise
time python tortoise/do_tts.py \
    --output_path /results \
    --preset ultra_fast \
    --voice geralt \
    --text "Time flies like an arrow; fruit flies like a banana."
```

## Apple Silicon

On macOS 13+ with M1/M2 chips, you need to install the nightly version of PyTorch. As stated on the official page, you can run:

```shell
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
```

Be sure to do that after you activate the environment. If you don't use conda the commands would look like this:

```shell
python3.10 -m venv .venv
source .venv/bin/activate
pip install numba inflect psutil
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip install transformers
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install .
```

Be aware that DeepSpeed is disabled on Apple Silicon since it does not work. The flag `--use_deepspeed` is ignored.
You may need to prepend `PYTORCH_ENABLE_MPS_FALLBACK=1` to the commands below to make them work, since MPS does not support all the operations in PyTorch.


### do_tts.py

This script allows you to speak a single phrase with one or more voices.
```shell
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
```
### Socket streaming

```shell
python tortoise/socket_server.py
```

The server will listen on port 5000.


### read_fast.py

This script provides tools for reading large amounts of text with faster inference.

```shell
python tortoise/read_fast.py --textfile <your text to be read> --voice random
```

### read.py

This script provides tools for reading large amounts of text.

```shell
python tortoise/read.py --textfile <your text to be read> --voice random
```

This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series
of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and
output that as well.

Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running `read.py` with the `--regenerate`
argument.
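
The split, render, and combine flow described above can be sketched without loading any models. This is an illustrative outline only: the regex splitter and the `render` callback are assumptions standing in for Tortoise's own text splitting and speech generation, and clips are represented as plain lists of samples.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def read_text(text: str, render: Callable[[str], List[float]]) -> List[float]:
    """Render each sentence separately, then concatenate the clips in order."""
    clips = [render(sentence) for sentence in split_sentences(text)]
    combined: List[float] = []
    for clip in clips:
        combined.extend(clip)
    return combined


def fake_render(sentence: str) -> List[float]:
    """Placeholder 'speech' generator: one silent sample per character."""
    return [0.0] * len(sentence)


sentences = split_sentences("Hello there. How are you? Fine!")
audio = read_text("Hello there. How are you? Fine!", fake_render)
```

Because each sentence is rendered on its own, a single bad clip can be re-rendered and the combined output rebuilt without touching the rest.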

### API

Tortoise can be used programmatically, like so:

```python
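from tortoise import api, utils

# clips_paths: a list of paths to reference audio clips of the target voice
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]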
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```

To use deepspeed:

```python
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```

To use kv cache:

```python
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(kv_cache=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```

To run model in float16:

```python
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```
For faster runs, use all three:

```python
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True, kv_cache=True, half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
```
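
Since DeepSpeed is disabled on Apple Silicon, a script that runs on several machines may want to choose these flags at runtime. The following is a small sketch under stated assumptions: the helper `fast_tts_kwargs` and its platform check are hypothetical and not part of Tortoise's API, and only the three keyword names come from the examples above.

```python
import platform
from typing import Dict, Optional


def fast_tts_kwargs(system: Optional[str] = None,
                    machine: Optional[str] = None) -> Dict[str, bool]:
    """Return speed-oriented keyword arguments for api.TextToSpeech.

    Hypothetical helper: `system` and `machine` default to the values
    reported by the standard-library `platform` module.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    on_apple_silicon = system == "Darwin" and machine == "arm64"
    return {
        # Skip DeepSpeed on Apple Silicon, where it does not work.
        "use_deepspeed": not on_apple_silicon,
        "kv_cache": True,
        "half": True,
    }


kwargs = fast_tts_kwargs()
# tts = api.TextToSpeech(**kwargs)
```

Requesting `use_deepspeed=True` presumably still requires DeepSpeed to be installed, so a real script may also want to check for the package before enabling it.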

## Acknowledgements

This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to
credit a few of the amazing folks in the community that have helped make this happen:

- Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
- [Ramesh et al](https://arxiv.org/pdf/2102.12092.pdf) who authored the DALLE paper, which is the inspiration behind Tortoise.
- [Nichol and Dhariwal](https://arxiv.org/pdf/2102.09672.pdf) who authored the (revision of the) code that drives the diffusion model.
- [Jang et al](https://arxiv.org/pdf/2106.07889.pdf) who developed and open-sourced univnet, the vocoder this repo uses.
- [Kim and Jung](https://github.com/mindslab-ai/univnet) who implemented the univnet PyTorch model.
- [lucidrains](https://github.com/lucidrains) who writes awesome open source pytorch models, many of which are used here.
- [Patrick von Platen](https://huggingface.co/patrickvonplaten) whose guides on setting up wav2vec were invaluable to building my dataset.

## Notice

Tortoise was built entirely by the author (James Betker) using their own hardware. Their employer was not involved in any facet of Tortoise's development.

## License

Tortoise TTS is licensed under the Apache 2.0 license.

If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.


================================================
FILE: requirements.txt
================================================
tqdm
rotary_embedding_torch
transformers==4.31.0
tokenizers
inflect
progressbar
einops==0.4.1
unidecode
scipy
librosa==0.9.1
ffmpeg
numpy
numba
torchaudio
threadpoolctl
llvmlite
appdirs
nbconvert==5.3.1
tornado==4.2
pydantic==1.9.1
deepspeed==0.8.3
py-cpuinfo
hjson
psutil
sounddevice
spacy==3.7.5


================================================
FILE: scripts/tortoise_tts.py
================================================
#!/usr/bin/env python3

import argparse
import os
import sys
import tempfile
import time

import torch
import torchaudio

from tortoise.api import MODELS_DIR, TextToSpeech
from tortoise.utils.audio import get_voices, load_voices, load_audio
from tortoise.utils.text import split_and_recombine_text

parser = argparse.ArgumentParser(
    description='TorToiSe is a text-to-speech program that is capable of synthesizing speech '
                'in multiple voices with realistic prosody and intonation.')

parser.add_argument(
    'text', type=str, nargs='*',
    help='Text to speak. If omitted, text is read from stdin.')
parser.add_argument(
    '-v, --voice', type=str, default='random', metavar='VOICE', dest='voice',
    help='Selects the voice to use for generation. Use the & character to join two voices together. '
         'Use a comma to perform inference on multiple voices. Set to "all" to use all available voices. '
         'Note that multiple voices require the --output-dir option to be set.')
parser.add_argument(
    '-V, --voices-dir', metavar='VOICES_DIR', type=str, dest='voices_dir',
    help='Path to directory containing extra voices to be loaded. Use a comma to specify multiple directories.')
parser.add_argument(
    '-p, --preset', type=str, default='fast', choices=['ultra_fast', 'fast', 'standard', 'high_quality'], dest='preset',
    help='Which voice quality preset to use.')
parser.add_argument(
    '-q, --quiet', default=False, action='store_true', dest='quiet',
    help='Suppress all output.')

output_group = parser.add_mutually_exclusive_group(required=True)
output_group.add_argument(
    '-l, --list-voices', default=False, action='store_true', dest='list_voices',
    help='List available voices and exit.')
output_group.add_argument(
    '-P, --play', action='store_true', dest='play',
    help='Play the audio (requires pydub).')
output_group.add_argument(
    '-o, --output', type=str, metavar='OUTPUT', dest='output',
    help='Save the audio to a file.')
output_group.add_argument(
    '-O, --output-dir', type=str, metavar='OUTPUT_DIR', dest='output_dir',
    help='Save the audio to a directory as individual segments.')

multi_output_group = parser.add_argument_group('multi-output options (requires --output-dir)')
multi_output_group.add_argument(
    '--candidates', type=int, default=1,
    help='How many output candidates to produce per-voice. Note that only the first candidate is used in the combined output.')
multi_output_group.add_argument(
    '--regenerate', type=str, default=None,
    help='Comma-separated list of clip numbers to re-generate.')
multi_output_group.add_argument(
    '--skip-existing', action='store_true',
    help='Set to skip re-generating existing clips.')

advanced_group = parser.add_argument_group('advanced options')
advanced_group.add_argument(
    '--produce-debug-state', default=False, action='store_true',
    help='Whether or not to produce debug_states in current directory, which can aid in reproducing problems.')
advanced_group.add_argument(
    '--seed', type=int, default=None,
    help='Random seed which can be used to reproduce results.')
advanced_group.add_argument(
    '--models-dir', type=str, default=MODELS_DIR,
    help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to '
         '~/.cache/tortoise/models, so this should only be specified if you have custom checkpoints.')
advanced_group.add_argument(
    '--text-split', type=str, default=None,
    help='How big chunks to split the text into, in the format <desired_length>,<max_length>.')
advanced_group.add_argument(
    '--disable-redaction', default=False, action='store_true',
    help='Normally text enclosed in brackets is automatically redacted from the spoken output '
         '(but are still rendered by the model), this can be used for prompt engineering. '
         'Set this to disable this behavior.')
advanced_group.add_argument(
    '--device', type=str, default=None,
    help='Device to use for inference.')
advanced_group.add_argument(
    '--batch-size', type=int, default=None,
    help='Batch size to use for inference. If omitted, the batch size is set based on available GPU memory.')

tuning_group = parser.add_argument_group('tuning options (overrides preset settings)')
tuning_group.add_argument(
    '--num-autoregressive-samples', type=int, default=None,
    help='Number of samples taken from the autoregressive model, all of which are filtered using CLVP. '
         'As TorToiSe is a probabilistic model, more samples means a higher probability of creating something "great".')
tuning_group.add_argument(
    '--temperature', type=float, default=None,
    help='The softmax temperature of the autoregressive model.')
tuning_group.add_argument(
    '--length-penalty', type=float, default=None,
    help='A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs.')
tuning_group.add_argument(
    '--repetition-penalty', type=float, default=None,
    help='A penalty that prevents the autoregressive decoder from repeating itself during decoding. '
         'Can be used to reduce the incidence of long silences or "uhhhhhhs", etc.')
tuning_group.add_argument(
    '--top-p', type=float, default=None,
    help='P value used in nucleus sampling. 0 to 1. Lower values mean the decoder produces more "likely" (aka boring) outputs.')
tuning_group.add_argument(
    '--max-mel-tokens', type=int, default=None,
    help='Restricts the output length. 1 to 600. Each unit is 1/20 of a second.')
tuning_group.add_argument(
    '--cvvp-amount', type=float, default=None,
    help='How much the CVVP model should influence the output. '
    'Increasing this can in some cases reduce the likelihood of multiple speakers.')
tuning_group.add_argument(
    '--diffusion-iterations', type=int, default=None,
    help='Number of diffusion steps to perform. More steps means the network has more chances to iteratively '
         'refine the output, which should theoretically mean a higher quality output. '
         'Generally a value above 250 is not noticeably better, however.')
tuning_group.add_argument(
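    # type=bool would treat any non-empty string (including "False") as True, so parse the value explicitly.
    '--cond-free', type=lambda s: s.strip().lower() in ('1', 'true', 'yes', 'on'), default=None,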
    help='Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for '
         'each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output '
         'of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and '
         'dramatically improves realism.')
tuning_group.add_argument(
    '--cond-free-k', type=float, default=None,
    help='Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf]. '
         'As cond_free_k increases, the output becomes dominated by the conditioning-free signal. '
         'Formula is: output=cond_present_output*(cond_free_k+1)-cond_absent_output*cond_free_k')
tuning_group.add_argument(
    '--diffusion-temperature', type=float, default=None,
    help='Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0 '
         'are the "mean" prediction of the diffusion network and will sound bland and smeared. ')

usage_examples = f'''
Examples:

Read text using random voice and place it in a file:

    {parser.prog} -o hello.wav "Hello, how are you?"

Read text from stdin and play it using the tom voice:

    echo "Say it like you mean it!" | {parser.prog} -P -v tom

Read a text file using multiple voices and save the audio clips to a directory:

    {parser.prog} -O /tmp/tts-results -v tom,emma <textfile.txt
'''

try:
    args = parser.parse_args()
except SystemExit as e:
    if e.code == 0:
        print(usage_examples)
    sys.exit(e.code)

extra_voice_dirs = args.voices_dir.split(',') if args.voices_dir else []
all_voices = sorted(get_voices(extra_voice_dirs))

if args.list_voices:
    for v in all_voices:
        print(v)
    sys.exit(0)

selected_voices = all_voices if args.voice == 'all' else args.voice.split(',')
selected_voices = [v.split('&') if '&' in v else [v] for v in selected_voices]
for voices in selected_voices:
    for v in voices:
        if v != 'random' and v not in all_voices:
            parser.error(f'voice {v} not available, use --list-voices to see available voices.')

if len(args.text) == 0:
    text = ''
    for line in sys.stdin:
        text += line
else:
    text = ' '.join(args.text)
text = text.strip()
if args.text_split:
    desired_length, max_length = [int(x) for x in args.text_split.split(',')]
    if desired_length > max_length:
        parser.error(f'--text-split: desired_length ({desired_length}) must be <= max_length ({max_length})')
    texts = split_and_recombine_text(text, desired_length, max_length)
else:
    texts = split_and_recombine_text(text)
if len(texts) == 0:
    parser.error('no text provided')

if args.output_dir:
    os.makedirs(args.output_dir, exist_ok=True)
else:
    if len(selected_voices) > 1:
        parser.error('cannot have multiple voices without --output-dir')
    if args.candidates > 1:
        parser.error('cannot have multiple candidates without --output-dir')

# error out early if pydub isn't installed
if args.play:
    try:
        import pydub
        import pydub.playback
    except ImportError:
        parser.error('--play requires pydub to be installed, which can be done with "pip install pydub"')

seed = int(time.time()) if args.seed is None else args.seed
if not args.quiet:
    print('Loading tts...')
tts = TextToSpeech(models_dir=args.models_dir, enable_redaction=not args.disable_redaction,
                   device=args.device, autoregressive_batch_size=args.batch_size)
gen_settings = {
    'use_deterministic_seed': seed,
    'verbose': not args.quiet,
    'k': args.candidates,
    'preset': args.preset,
}
tuning_options = [
    'num_autoregressive_samples', 'temperature', 'length_penalty', 'repetition_penalty', 'top_p',
    'max_mel_tokens', 'cvvp_amount', 'diffusion_iterations', 'cond_free', 'cond_free_k', 'diffusion_temperature']
for option in tuning_options:
    if getattr(args, option) is not None:
        gen_settings[option] = getattr(args, option)
total_clips = len(texts) * len(selected_voices)
regenerate_clips = [int(x) for x in args.regenerate.split(',')] if args.regenerate else None
for voice_idx, voice in enumerate(selected_voices):
    audio_parts = []
    voice_samples, conditioning_latents = load_voices(voice, extra_voice_dirs)
    for text_idx, text in enumerate(texts):
        clip_name = f'{"-".join(voice)}_{text_idx:02d}'
        if args.output_dir:
            first_clip = os.path.join(args.output_dir, f'{clip_name}_00.wav')
            if (args.skip_existing or (regenerate_clips and text_idx not in regenerate_clips)) and os.path.exists(first_clip):
                audio_parts.append(load_audio(first_clip, 24000))
                if not args.quiet:
                    print(f'Skipping {clip_name}')
                continue
        if not args.quiet:
            print(f'Rendering {clip_name} ({(voice_idx * len(texts) + text_idx + 1)} of {total_clips})...')
            print('  ' + text)
        gen = tts.tts_with_preset(
            text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, **gen_settings)
        gen = gen if args.candidates > 1 else [gen]
        for candidate_idx, audio in enumerate(gen):
            audio = audio.squeeze(0).cpu()
            if candidate_idx == 0:
                audio_parts.append(audio)
            if args.output_dir:
                filename = f'{clip_name}_{candidate_idx:02d}.wav'
                torchaudio.save(os.path.join(args.output_dir, filename), audio, 24000)

    audio = torch.cat(audio_parts, dim=-1)
    if args.output_dir:
        filename = f'{"-".join(voice)}_combined.wav'
        torchaudio.save(os.path.join(args.output_dir, filename), audio, 24000)
    elif args.output:
        torchaudio.save(args.output, audio, 24000)
    elif args.play:
        f = tempfile.NamedTemporaryFile(suffix='.wav', delete=True)
        torchaudio.save(f.name, audio, 24000)
        pydub.playback.play(pydub.AudioSegment.from_wav(f.name))

    if args.produce_debug_state:
        os.makedirs('debug_states', exist_ok=True)
        dbg_state = (seed, texts, voice_samples, conditioning_latents, args)
        torch.save(dbg_state, os.path.join('debug_states', f'debug_{"-".join(voice)}.pth'))


================================================
FILE: setup.cfg
================================================
[metadata]
description_file = README.md


================================================
FILE: setup.py
================================================
import setuptools

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setuptools.setup(
    name="tortoise-tts",
    packages=setuptools.find_packages(),
    version="3.0.0",
    author="James Betker",
    author_email="james@adamant.ai",
    description="A high quality multi-voice text-to-speech library",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/neonbjb/tortoise-tts",
    project_urls={},
    scripts=[
        'scripts/tortoise_tts.py',
    ],
    include_package_data=True,
    install_requires=[
        'tqdm',
        'rotary_embedding_torch',
        'inflect',
        'progressbar',
        'einops',
        'unidecode',
        'scipy',
        'librosa',
        'transformers==4.31.0',
        'tokenizers==0.14.0',
        'scipy==1.13.1'
        # 'deepspeed==0.8.3',
    ],
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: Apache Software License",
        "Operating System :: OS Independent",
    ],
    python_requires=">=3.6",
)


================================================
FILE: tortoise/__init__.py
================================================


================================================
FILE: tortoise/api.py
================================================
import os
import random
import uuid
from time import time
from urllib import request

import torch
import torch.nn.functional as F
import progressbar
import torchaudio

from tortoise.models.classifier import AudioMiniEncoderWithClassifierHead
from tortoise.models.diffusion_decoder import DiffusionTts
from tortoise.models.autoregressive import UnifiedVoice
from tqdm import tqdm
from tortoise.models.arch_util import TorchMelSpectrogram
from tortoise.models.clvp import CLVP
from tortoise.models.cvvp import CVVP
from tortoise.models.random_latent_generator import RandomLatentConverter
from tortoise.models.vocoder import UnivNetGenerator
from tortoise.utils.audio import wav_to_univnet_mel, denormalize_tacotron_mel, TacotronSTFT
from tortoise.utils.diffusion import SpacedDiffusion, space_timesteps, get_named_beta_schedule
from tortoise.utils.tokenizer import VoiceBpeTokenizer
from tortoise.utils.wav2vec_alignment import Wav2VecAlignment
from contextlib import contextmanager
from huggingface_hub import hf_hub_download
pbar = None

DEFAULT_MODELS_DIR = os.path.join(os.path.expanduser('~'), '.cache', 'tortoise', 'models')
MODELS_DIR = os.environ.get('TORTOISE_MODELS_DIR', DEFAULT_MODELS_DIR)
MODELS = {
    'autoregressive.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/autoregressive.pth',
    'classifier.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/classifier.pth',
    'clvp2.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/clvp2.pth',
    'cvvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/cvvp.pth',
    'diffusion_decoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/diffusion_decoder.pth',
    'vocoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth',
    'rlg_auto.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth',
    'rlg_diffuser.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth',
}

def get_model_path(model_name, models_dir=MODELS_DIR):
    """
    Get path to given model, download it if it doesn't exist.
    """
    if model_name not in MODELS:
        raise ValueError(f'Model {model_name} not found in available models.')
    model_path = hf_hub_download(repo_id="Manmay/tortoise-tts", filename=model_name, cache_dir=models_dir)
    return model_path


def pad_or_truncate(t, length):
    """
    Utility function for forcing <t> to have the specified sequence length, whether by clipping it or padding it with 0s.
    """
    if t.shape[-1] == length:
        return t
    elif t.shape[-1] < length:
        return F.pad(t, (0, length-t.shape[-1]))
    else:
        return t[..., :length]


def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired_diffusion_steps=200, cond_free=True, cond_free_k=1):
    """
    Helper function to load a GaussianDiffusion instance configured for use as a vocoder.
    """
    return SpacedDiffusion(use_timesteps=space_timesteps(trained_diffusion_steps, [desired_diffusion_steps]), model_mean_type='epsilon',
                           model_var_type='learned_range', loss_type='mse', betas=get_named_beta_schedule('linear', trained_diffusion_steps),
                           conditioning_free=cond_free, conditioning_free_k=cond_free_k)


def format_conditioning(clip, cond_length=132300, device="cuda" if not torch.backends.mps.is_available() else 'mps'):
    """
    Converts the given conditioning signal to a MEL spectrogram and clips it as expected by the models.
    """
    gap = clip.shape[-1] - cond_length
    if gap < 0:
        clip = F.pad(clip, pad=(0, abs(gap)))
    elif gap > 0:
        rand_start = random.randint(0, gap)
        clip = clip[:, rand_start:rand_start + cond_length]
    mel_clip = TorchMelSpectrogram()(clip.unsqueeze(0)).squeeze(0)
    return mel_clip.unsqueeze(0).to(device)


def fix_autoregressive_output(codes, stop_token, complain=True):
    """
    This function performs some padding on coded audio that fixes a mismatch issue between what the diffusion model was
    trained on and what the autoregressive code generator creates (which has no padding or end).
    This is highly specific to the DVAE being used, so this particular coding will not necessarily work if used with
    a different DVAE. This can be inferred by feeding an audio clip padded with lots of zeros on the end through the DVAE
    and copying out the last few codes.

    Failing to do this padding will produce speech with a harsh end that sounds like "BLAH" or similar.
    """
    # Strip off the autoregressive stop token and add padding.
    stop_token_indices = (codes == stop_token).nonzero()
    if len(stop_token_indices) == 0:
        if complain:
            print("No stop tokens found in one of the generated voice clips. This typically means the spoken audio is "
                  "too long. In some cases, the output will still be good, though. Listen to it and if it is missing words, "
                  "try breaking up your input text.")
        return codes
    else:
        codes[stop_token_indices] = 83
    stm = stop_token_indices.min().item()
    codes[stm:] = 83
    if stm - 3 < codes.shape[0]:
        codes[-3] = 45
        codes[-2] = 45
        codes[-1] = 248

    return codes


def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_latents, temperature=1, verbose=True):
    """
    Uses the specified diffusion model to convert discrete codes into a spectrogram.
    """
    with torch.no_grad():
        output_seq_len = latents.shape[1] * 4 * 24000 // 22050  # This diffusion model converts from 22kHz spectrogram codes to a 24kHz spectrogram signal.
        output_shape = (latents.shape[0], 100, output_seq_len)
        precomputed_embeddings = diffusion_model.timestep_independent(latents, conditioning_latents, output_seq_len, False)

        noise = torch.randn(output_shape, device=latents.device) * temperature
        mel = diffuser.p_sample_loop(diffusion_model, output_shape, noise=noise,
                                      model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
                                     progress=verbose)
        return denormalize_tacotron_mel(mel)[:,:,:output_seq_len]


def classify_audio_clip(clip):
    """
    Returns the probability, according to Tortoise's classifier, that the given clip came from Tortoise.
    :param clip: torch tensor containing audio waveform data (get it from load_audio)
    :return: A scalar tensor holding the softmax score of the first class; values near 1 mean the clip was classified
             as coming from Tortoise, values near 0 mean it was classified as real.
    """
    classifier = AudioMiniEncoderWithClassifierHead(2, spec_dim=1, embedding_dim=512, depth=5, downsample_factor=4,
                                                    resnet_blocks=2, attn_blocks=4, num_attn_heads=4, base_channels=32,
                                                    dropout=0, kernel_size=5, distribute_zero_label=False)
    classifier.load_state_dict(torch.load(get_model_path('classifier.pth'), map_location=torch.device('cpu')))
    clip = clip.cpu().unsqueeze(0)
    results = F.softmax(classifier(clip), dim=-1)
    return results[0][0]


def pick_best_batch_size_for_gpu():
    """
    Tries to pick a batch size that will fit in your GPU. These sizes aren't guaranteed to work, but they should give
    you a good shot.
    """
    if torch.cuda.is_available():
        _, available = torch.cuda.mem_get_info()
        availableGb = available / (1024 ** 3)
        if availableGb > 14:
            return 16
        elif availableGb > 10:
            return 8
        elif availableGb > 7:
            return 4
    if torch.backends.mps.is_available():
        import psutil
        available = psutil.virtual_memory().total
        availableGb = available / (1024 ** 3)
        if availableGb > 14:
            return 16
        elif availableGb > 10:
            return 8
        elif availableGb > 7:
            return 4
    return 1

class TextToSpeech:
    """
    Main entry point into Tortoise.
    """

    def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, 
                 enable_redaction=True, kv_cache=False, use_deepspeed=False, half=False, device=None,
                 tokenizer_vocab_file=None, tokenizer_basic=False):

        """
        Constructor
        :param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing
                                          GPU OOM errors. Larger numbers generate slightly faster.
        :param models_dir: Where model weights are stored. This should only be specified if you are providing your own
                           models, otherwise use the defaults.
        :param enable_redaction: When true, text enclosed in brackets is automatically redacted from the spoken output
                                 (but are still rendered by the model). This can be used for prompt engineering.
                                 Default is true.
        :param device: Device to use when running the model. If omitted, the device will be automatically chosen.
        """
        self.models_dir = models_dir
        self.autoregressive_batch_size = pick_best_batch_size_for_gpu() if autoregressive_batch_size is None else autoregressive_batch_size
        self.enable_redaction = enable_redaction
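        if device is None:
            # Auto-select: prefer CUDA, then Apple's MPS backend, then fall back to the CPU.
            if torch.cuda.is_available():
                self.device = torch.device('cuda')
            elif torch.backends.mps.is_available():
                self.device = torch.device('mps')
            else:
                self.device = torch.device('cpu')
        else:
            # Honor an explicitly requested device.
            self.device = torch.device(device)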
        if self.enable_redaction:
            self.aligner = Wav2VecAlignment()

        self.tokenizer = VoiceBpeTokenizer(
            vocab_file=tokenizer_vocab_file,
            use_basic_cleaners=tokenizer_basic,
        )
        self.half = half
        if os.path.exists(f'{models_dir}/autoregressive.ptt'):
            # Assume this is a traced directory.
            self.autoregressive = torch.jit.load(f'{models_dir}/autoregressive.ptt')
            self.diffusion = torch.jit.load(f'{models_dir}/diffusion_decoder.ptt')
        else:
            self.autoregressive = UnifiedVoice(max_mel_tokens=604, max_text_tokens=402, max_conditioning_inputs=2, layers=30,
                                          model_dim=1024,
                                          heads=16, number_text_tokens=255, start_text_token=255, checkpointing=False,
                                          train_solo_embeddings=False).cpu().eval()
            self.autoregressive.load_state_dict(torch.load(get_model_path('autoregressive.pth', models_dir)), strict=False)
            self.autoregressive.post_init_gpt2_config(use_deepspeed=use_deepspeed, kv_cache=kv_cache, half=self.half)
            
            self.diffusion = DiffusionTts(model_channels=1024, num_layers=10, in_channels=100, out_channels=200,
                                          in_latent_channels=1024, in_tokens=8193, dropout=0, use_fp16=False, num_heads=16,
                                          layer_drop=0, unconditioned_percentage=0).cpu().eval()
            self.diffusion.load_state_dict(torch.load(get_model_path('diffusion_decoder.pth', models_dir)))

        self.clvp = CLVP(dim_text=768, dim_speech=768, dim_latent=768, num_text_tokens=256, text_enc_depth=20,
                         text_seq_len=350, text_heads=12,
                         num_speech_tokens=8192, speech_enc_depth=20, speech_heads=12, speech_seq_len=430,
                         use_xformers=True).cpu().eval()
        self.clvp.load_state_dict(torch.load(get_model_path('clvp2.pth', models_dir)))
        self.cvvp = None # CVVP model is only loaded if used.

        self.vocoder = UnivNetGenerator().cpu()
        self.vocoder.load_state_dict(torch.load(get_model_path('vocoder.pth', models_dir), map_location=torch.device('cpu'))['model_g'])
        self.vocoder.eval(inference=True)
        
        self.stft = None # TacotronSTFT is only loaded if used.

        # Random latent generators (RLGs) are loaded lazily.
        self.rlg_auto = None
        self.rlg_diffusion = None
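
    @contextmanager
    def temporary_cuda(self, model):
        """Temporarily move `model` to self.device and return it to the CPU on exit."""
        m = model.to(self.device)
        try:
            yield m
        finally:
            # Move the model back even if the body raised.
            model.cpu()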

    
    def load_cvvp(self):
        """Load CVVP model."""
        self.cvvp = CVVP(model_dim=512, transformer_heads=8, dropout=0, mel_codes=8192, conditioning_enc_depth=8, cond_mask_percentage=0,
                         speech_enc_depth=8, speech_mask_percentage=0, latent_multiplier=1).cpu().eval()
        self.cvvp.load_state_dict(torch.load(get_model_path('cvvp.pth', self.models_dir)))

    def get_conditioning_latents(self, voice_samples, return_mels=False):
        """
        Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent).
        These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic
        properties.
        :param voice_samples: List of 2 or more ~10 second reference clips, which should be torch tensors containing 22.05kHz waveform data.
        """
        with torch.no_grad():
            # Accept a single tensor as well as a list of tensors; wrap before iterating.
            if not isinstance(voice_samples, list):
                voice_samples = [voice_samples]
            voice_samples = [v.to(self.device) for v in voice_samples]

            auto_conds = []
            for vs in voice_samples:
                auto_conds.append(format_conditioning(vs, device=self.device))
            auto_conds = torch.stack(auto_conds, dim=1)
            self.autoregressive = self.autoregressive.to(self.device)
            auto_latent = self.autoregressive.get_conditioning(auto_conds)
            self.autoregressive = self.autoregressive.cpu()

            if self.stft is None:
                # Initialize STFT
                self.stft = TacotronSTFT(1024, 256, 1024, 100, 24000, 0, 12000).to(self.device)

            diffusion_conds = []
            for sample in voice_samples:
                # The diffuser operates at a sample rate of 24000 (except for the latent inputs)
                sample = torchaudio.functional.resample(sample, 22050, 24000)
                sample = pad_or_truncate(sample, 102400)
                cond_mel = wav_to_univnet_mel(sample.to(self.device), do_normalization=False,
                                              device=self.device, stft=self.stft)
                diffusion_conds.append(cond_mel)
            diffusion_conds = torch.stack(diffusion_conds, dim=1)

            self.diffusion = self.diffusion.to(self.device)
            diffusion_latent = self.diffusion.get_conditioning(diffusion_conds)
            self.diffusion = self.diffusion.cpu()

        if return_mels:
            return auto_latent, diffusion_latent, auto_conds, diffusion_conds
        else:
            return auto_latent, diffusion_latent

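    def get_random_conditioning_latents(self):
        """
        Returns a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent) produced by the random
        latent generators. tts() falls back to this when neither voice_samples nor conditioning_latents is given.
        """
        # Lazy-load the RLG models.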
        if self.rlg_auto is None:
            self.rlg_auto = RandomLatentConverter(1024).eval()
            self.rlg_auto.load_state_dict(torch.load(get_model_path('rlg_auto.pth', self.models_dir), map_location=torch.device('cpu')))
            self.rlg_diffusion = RandomLatentConverter(2048).eval()
            self.rlg_diffusion.load_state_dict(torch.load(get_model_path('rlg_diffuser.pth', self.models_dir), map_location=torch.device('cpu')))
        with torch.no_grad():
            return self.rlg_auto(torch.tensor([0.0])), self.rlg_diffusion(torch.tensor([0.0]))

    def tts_with_preset(self, text, preset='fast', **kwargs):
        """
        Calls TTS with one of a set of preset generation parameters. Options:
            'ultra_fast': Produces speech at a speed which belies the name of this repo. (Not really, but it's definitely fastest).
            'fast': Decent quality speech at a decent inference rate. A good choice for mass inference.
            'standard': Very good quality. This is generally about as good as you are going to get.
            'high_quality': Use if you want the absolute best. This is not really worth the compute, though.
        """
        # Use generally found best tuning knobs for generation.
        settings = {'temperature': .8, 'length_penalty': 1.0, 'repetition_penalty': 2.0,
                    'top_p': .8,
                    'cond_free_k': 2.0, 'diffusion_temperature': 1.0}
        # Presets are defined here.
        presets = {
            'ultra_fast': {'num_autoregressive_samples': 16, 'diffusion_iterations': 30, 'cond_free': False},
            'fast': {'num_autoregressive_samples': 96, 'diffusion_iterations': 80},
            'standard': {'num_autoregressive_samples': 256, 'diffusion_iterations': 200},
            'high_quality': {'num_autoregressive_samples': 256, 'diffusion_iterations': 400},
        }
        settings.update(presets[preset])
        settings.update(kwargs) # allow overriding of preset settings with kwargs
        return self.tts(text, **settings)

    def tts(self, text, voice_samples=None, conditioning_latents=None, k=1, verbose=True, use_deterministic_seed=None,
            return_deterministic_state=False,
            # autoregressive generation parameters follow
            num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500,
            # CVVP parameters follow
            cvvp_amount=.0,
            # diffusion generation parameters follow
            diffusion_iterations=100, cond_free=True, cond_free_k=2, diffusion_temperature=1.0,
            **hf_generate_kwargs):
        """
        Produces an audio clip of the given text being spoken with the given reference voice.
        :param text: Text to be spoken.
        :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data.
        :param conditioning_latents: A tuple of (autoregressive_conditioning_latent, diffusion_conditioning_latent), which
                                     can be provided in lieu of voice_samples. This is ignored unless voice_samples=None.
                                     Conditioning latents can be retrieved via get_conditioning_latents().
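        :param k: The number of returned clips. The most likely (as determined by Tortoise's CLVP model) clips are returned.
        :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true.
        :param use_deterministic_seed: Seed passed to deterministic_state(). If None, the current time is used as the seed.
        :param return_deterministic_state: If true, returns a tuple of the result and (seed, text, voice_samples, conditioning_latents)
                                           so the generation can be reproduced.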
        ~~AUTOREGRESSIVE KNOBS~~
        :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP.
               As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great".
        :param temperature: The softmax temperature of the autoregressive model.
        :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs.
        :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence
                                   of long silences or "uhhhhhhs", etc.
        :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs.
        :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second.
        :param typical_sampling: Turns typical sampling on or off. This sampling mode is discussed in this paper: https://arxiv.org/abs/2202.00666
                                 I was interested in the premise, but the results were not as good as I was hoping. This is off by default, but
                                 could use some tuning.
        :param typical_mass: The typical_mass parameter from the typical_sampling algorithm.
        ~~CLVP-CVVP KNOBS~~
        :param cvvp_amount: Controls the influence of the CVVP model in selecting the best output from the autoregressive model.
                            [0,1]. Values closer to 1 mean the CVVP model is more important, 0 disables the CVVP model.
        ~~DIFFUSION KNOBS~~
        :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine
                                     the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better,
                                     however.
        :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for
                          each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output
                          of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and
                          dramatically improves realism.
        :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf].
                            As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
                            Formula is: output=cond_present_output*(cond_free_k+1)-cond_absent_output*cond_free_k
        :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
                                      are the "mean" prediction of the diffusion network and will sound bland and smeared.
        ~~OTHER STUFF~~
        :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer.
                                   Extra keyword args fed to this function get forwarded directly to that API. Documentation
                                   here: https://huggingface.co/docs/transformers/internal/generation_utils
        :return: Generated audio clip(s). A single torch tensor if k=1, otherwise a list of k tensors (one per clip),
                 where the last dimension of each tensor is the sample length. Sample rate is 24kHz.
        """
        deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)

        text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device)
        text_tokens = F.pad(text_tokens, (0, 1))  # This may not be necessary.
        assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.'
        auto_conds = None
        if voice_samples is not None:
            auto_conditioning, diffusion_conditioning, auto_conds, _ = self.get_conditioning_latents(voice_samples, return_mels=True)
        elif conditioning_latents is not None:
            auto_conditioning, diffusion_conditioning = conditioning_latents
        else:
            auto_conditioning, diffusion_conditioning = self.get_random_conditioning_latents()
        auto_conditioning = auto_conditioning.to(self.device)
        diffusion_conditioning = diffusion_conditioning.to(self.device)

        diffuser = load_discrete_vocoder_diffuser(desired_diffusion_steps=diffusion_iterations, cond_free=cond_free, cond_free_k=cond_free_k)

        with torch.no_grad():
            samples = []
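            # Run at least one batch; plain floor division yields zero batches (and an empty sample list)
            # when num_autoregressive_samples is smaller than the batch size.
            num_batches = max(1, num_autoregressive_samples // self.autoregressive_batch_size)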
            stop_mel_token = self.autoregressive.stop_mel_token
            calm_token = 83  # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
            if verbose:
                print("Generating autoregressive samples..")
            if not torch.backends.mps.is_available():
                with self.temporary_cuda(self.autoregressive
                ) as autoregressive, torch.autocast(device_type="cuda", dtype=torch.float16, enabled=self.half):
                    for b in tqdm(range(num_batches), disable=not verbose):
                        codes = autoregressive.inference_speech(auto_conditioning, text_tokens,
                                                                    do_sample=True,
                                                                    top_p=top_p,
                                                                    temperature=temperature,
                                                                    num_return_sequences=self.autoregressive_batch_size,
                                                                    length_penalty=length_penalty,
                                                                    repetition_penalty=repetition_penalty,
                                                                    max_generate_length=max_mel_tokens,
                                                                    **hf_generate_kwargs)
                        padding_needed = max_mel_tokens - codes.shape[1]
                        codes = F.pad(codes, (0, padding_needed), value=stop_mel_token)
                        samples.append(codes)
            else:
                with self.temporary_cuda(self.autoregressive) as autoregressive:
                    for b in tqdm(range(num_batches), disable=not verbose):
                        codes = autoregressive.inference_speech(auto_conditioning, text_tokens,
                                                                    do_sample=True,
                                                                    top_p=top_p,
                                                                    temperature=temperature,
                                                                    num_return_sequences=self.autoregressive_batch_size,
                                                                    length_penalty=length_penalty,
                                                                    repetition_penalty=repetition_penalty,
                                                                    max_generate_length=max_mel_tokens,
                                                                    **hf_generate_kwargs)
                        padding_needed = max_mel_tokens - codes.shape[1]
                        codes = F.pad(codes, (0, padding_needed), value=stop_mel_token)
                        samples.append(codes)

            clip_results = []
            
            if not torch.backends.mps.is_available():
                with self.temporary_cuda(self.clvp) as clvp, torch.autocast(
                    device_type="cuda", dtype=torch.float16, enabled=self.half
                ):
                    if cvvp_amount > 0:
                        if self.cvvp is None:
                            self.load_cvvp()
                        self.cvvp = self.cvvp.to(self.device)
                    if verbose:
                        if self.cvvp is None:
                            print("Computing best candidates using CLVP")
                        else:
                            print(f"Computing best candidates using CLVP {((1-cvvp_amount) * 100):2.0f}% and CVVP {(cvvp_amount * 100):2.0f}%")
                    for batch in tqdm(samples, disable=not verbose):
                        for i in range(batch.shape[0]):
                            batch[i] = fix_autoregressive_output(batch[i], stop_mel_token)
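                        # CLVP scores are also needed as a fallback when no conditioning MELs are available for CVVP.
                        if cvvp_amount != 1 or auto_conds is None: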
                            clvp_out = clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False)
                        if auto_conds is not None and cvvp_amount > 0:
                            cvvp_accumulator = 0
                            for cl in range(auto_conds.shape[1]):
                                cvvp_accumulator = cvvp_accumulator + self.cvvp(auto_conds[:, cl].repeat(batch.shape[0], 1, 1), batch, return_loss=False)
                            cvvp = cvvp_accumulator / auto_conds.shape[1]
                            if cvvp_amount == 1:
                                clip_results.append(cvvp)
                            else:
                                clip_results.append(cvvp * cvvp_amount + clvp_out * (1-cvvp_amount))
                        else:
                            clip_results.append(clvp_out)
                    clip_results = torch.cat(clip_results, dim=0)
                    samples = torch.cat(samples, dim=0)
                    best_results = samples[torch.topk(clip_results, k=k).indices]
            else:
                with self.temporary_cuda(self.clvp) as clvp:
                    if cvvp_amount > 0:
                        if self.cvvp is None:
                            self.load_cvvp()
                        self.cvvp = self.cvvp.to(self.device)
                    if verbose:
                        if self.cvvp is None:
                            print("Computing best candidates using CLVP")
                        else:
                            print(f"Computing best candidates using CLVP {((1-cvvp_amount) * 100):2.0f}% and CVVP {(cvvp_amount * 100):2.0f}%")
                    for batch in tqdm(samples, disable=not verbose):
                        for i in range(batch.shape[0]):
                            batch[i] = fix_autoregressive_output(batch[i], stop_mel_token)
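                        # CLVP scores are also needed as a fallback when no conditioning MELs are available for CVVP.
                        if cvvp_amount != 1 or auto_conds is None: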
                            clvp_out = clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False)
                        if auto_conds is not None and cvvp_amount > 0:
                            cvvp_accumulator = 0
                            for cl in range(auto_conds.shape[1]):
                                cvvp_accumulator = cvvp_accumulator + self.cvvp(auto_conds[:, cl].repeat(batch.shape[0], 1, 1), batch, return_loss=False)
                            cvvp = cvvp_accumulator / auto_conds.shape[1]
                            if cvvp_amount == 1:
                                clip_results.append(cvvp)
                            else:
                                clip_results.append(cvvp * cvvp_amount + clvp_out * (1-cvvp_amount))
                        else:
                            clip_results.append(clvp_out)
                    clip_results = torch.cat(clip_results, dim=0)
                    samples = torch.cat(samples, dim=0)
                    best_results = samples[torch.topk(clip_results, k=k).indices]
            if self.cvvp is not None:
                self.cvvp = self.cvvp.cpu()
            del samples

            # The diffusion model actually wants the last hidden layer from the autoregressive model as conditioning
            # inputs. Re-produce those for the top results. This could be made more efficient by storing all of these
            # results, but will increase memory usage.
            if not torch.backends.mps.is_available():
                with self.temporary_cuda(
                    self.autoregressive
                ) as autoregressive, torch.autocast(
                    device_type="cuda", dtype=torch.float16, enabled=self.half
                ):
                    best_latents = autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
                                                    torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results,
                                                    torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
                                                    return_latent=True, clip_inputs=False)
                    del auto_conditioning
            else:
                with self.temporary_cuda(
                    self.autoregressive
                ) as autoregressive:
                    best_latents = autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
                                                    torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results,
                                                    torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
                                                    return_latent=True, clip_inputs=False)
                    del auto_conditioning

            if verbose:
                print("Transforming autoregressive outputs into audio..")
            wav_candidates = []
            if not torch.backends.mps.is_available():
                with self.temporary_cuda(self.diffusion) as diffusion, self.temporary_cuda(
                    self.vocoder
                ) as vocoder:
                    for b in range(best_results.shape[0]):
                        codes = best_results[b].unsqueeze(0)
                        latents = best_latents[b].unsqueeze(0)

                        # Once more than 8 consecutive "calm" (silence) tokens are seen, trim the latents at that point.
                        ctokens = 0
                        for code_idx in range(codes.shape[-1]):  # named code_idx to avoid shadowing the `k` argument
                            if codes[0, code_idx] == calm_token:
                                ctokens += 1
                            else:
                                ctokens = 0
                            if ctokens > 8:  # 8 tokens gives the diffusion model some "breathing room" to terminate speech.
                                latents = latents[:, :code_idx]
                                break
                        mel = do_spectrogram_diffusion(diffusion, diffuser, latents, diffusion_conditioning, temperature=diffusion_temperature, 
                                                    verbose=verbose)
                        wav = vocoder.inference(mel)
                        wav_candidates.append(wav.cpu())
            else:
                diffusion, vocoder = self.diffusion, self.vocoder
                diffusion_conditioning = diffusion_conditioning.cpu()
                for b in range(best_results.shape[0]):
                    codes = best_results[b].unsqueeze(0).cpu()
                    latents = best_latents[b].unsqueeze(0).cpu()

                    # Once more than 8 consecutive "calm" (silence) tokens are seen, trim the latents at that point.
                    ctokens = 0
                    for code_idx in range(codes.shape[-1]):  # named code_idx to avoid shadowing the `k` argument
                        if codes[0, code_idx] == calm_token:
                            ctokens += 1
                        else:
                            ctokens = 0
                        if ctokens > 8:  # 8 tokens gives the diffusion model some "breathing room" to terminate speech.
                            latents = latents[:, :code_idx]
                            break
                    mel = do_spectrogram_diffusion(diffusion, diffuser, latents, diffusion_conditioning, temperature=diffusion_temperature, 
                                                verbose=verbose)
                    wav = vocoder.inference(mel)
                    wav_candidates.append(wav.cpu())

            def potentially_redact(clip, text):
                if self.enable_redaction:
                    return self.aligner.redact(clip.squeeze(1), text).unsqueeze(1)
                return clip
            wav_candidates = [potentially_redact(wav_candidate, text) for wav_candidate in wav_candidates]

            if len(wav_candidates) > 1:
                res = wav_candidates
            else:
                res = wav_candidates[0]

            if return_deterministic_state:
                return res, (deterministic_seed, text, voice_samples, conditioning_latents)
            else:
                return res

    def deterministic_state(self, seed=None):
        """
        Seeds the random number generators that tortoise uses and returns the seed so results can be reproduced.
        If no seed is given, the current time() is used.
        """
        seed = int(time()) if seed is None else seed
        torch.manual_seed(seed)
        random.seed(seed)
        # Can't currently set this because of CUBLAS. TODO: potentially enable it if necessary.
        # torch.use_deterministic_algorithms(True)

        return seed


================================================
FILE: tortoise/api_fast.py
================================================
import os
import random
import uuid
from time import time
from urllib import request

import torch
import torch.nn.functional as F
import progressbar
import torchaudio
import numpy as np
from tortoise.models.classifier import AudioMiniEncoderWithClassifierHead
from tortoise.models.diffusion_decoder import DiffusionTts
from tortoise.models.autoregressive import UnifiedVoice
from tqdm import tqdm
from tortoise.models.arch_util import TorchMelSpectrogram
from tortoise.models.clvp import CLVP
from tortoise.models.cvvp import CVVP
from tortoise.models.hifigan_decoder import HifiganGenerator
from tortoise.models.random_latent_generator import RandomLatentConverter
from tortoise.models.vocoder import UnivNetGenerator
from tortoise.utils.audio import wav_to_univnet_mel, denormalize_tacotron_mel
from tortoise.utils.diffusion import SpacedDiffusion, space_timesteps, get_named_beta_schedule
from tortoise.utils.tokenizer import VoiceBpeTokenizer
from tortoise.utils.wav2vec_alignment import Wav2VecAlignment
from contextlib import contextmanager
from tortoise.models.stream_generator import init_stream_support
from huggingface_hub import hf_hub_download
pbar = None
init_stream_support()
DEFAULT_MODELS_DIR = os.path.join(os.path.expanduser('~'), '.cache', 'tortoise', 'models')
MODELS_DIR = os.environ.get('TORTOISE_MODELS_DIR', DEFAULT_MODELS_DIR)

MODELS = {
    'autoregressive.pth': 'https://huggingface.co/Manmay/tortoise-tts/resolve/main/autoregressive.pth',
    'classifier.pth': 'https://huggingface.co/Manmay/tortoise-tts/resolve/main/classifier.pth',
    'rlg_auto.pth': 'https://huggingface.co/Manmay/tortoise-tts/resolve/main/rlg_auto.pth',
    'hifidecoder.pth': 'https://huggingface.co/Manmay/tortoise-tts/resolve/main/hifidecoder.pth',
}

def get_model_path(model_name, models_dir=MODELS_DIR):
    """
    Get path to given model, download it if it doesn't exist.
    """
    if model_name not in MODELS:
        raise ValueError(f'Model {model_name} not found in available models.')
    model_path = hf_hub_download(repo_id="Manmay/tortoise-tts", filename=model_name, cache_dir=models_dir)
    return model_path


def pad_or_truncate(t, length):
    """
    Utility function for forcing <t> to have the specified sequence length, whether by clipping it or padding it with 0s.
    """
    if t.shape[-1] == length:
        return t
    elif t.shape[-1] < length:
        return F.pad(t, (0, length-t.shape[-1]))
    else:
        return t[..., :length]


def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired_diffusion_steps=200, cond_free=True, cond_free_k=1):
    """
    Helper function to load a GaussianDiffusion instance configured for use as a vocoder.
    """
    return SpacedDiffusion(use_timesteps=space_timesteps(trained_diffusion_steps, [desired_diffusion_steps]), model_mean_type='epsilon',
                           model_var_type='learned_range', loss_type='mse', betas=get_named_beta_schedule('linear', trained_diffusion_steps),
                           conditioning_free=cond_free, conditioning_free_k=cond_free_k)


def format_conditioning(clip, cond_length=132300, device="cuda" if not torch.backends.mps.is_available() else 'mps'):
    """
    Converts the given conditioning signal to a MEL spectrogram and clips it as expected by the models.
    """
    gap = clip.shape[-1] - cond_length
    if gap < 0:
        clip = F.pad(clip, pad=(0, abs(gap)))
    elif gap > 0:
        rand_start = random.randint(0, gap)
        clip = clip[:, rand_start:rand_start + cond_length]
    mel_clip = TorchMelSpectrogram()(clip.unsqueeze(0)).squeeze(0)
    return mel_clip.unsqueeze(0).to(device)


def fix_autoregressive_output(codes, stop_token, complain=True):
    """
    This function performs some padding on coded audio that fixes a mismatch issue between what the diffusion model was
    trained on and what the autoregressive code generator creates (which has no padding or end).
    This is highly specific to the DVAE being used, so this particular coding will not necessarily work if used with
    a different DVAE. This can be inferred by feeding an audio clip padded with lots of zeros on the end through the DVAE
    and copying out the last few codes.

    Failing to do this padding will produce speech with a harsh end that sounds like "BLAH" or similar.
    """
    # Strip off the autoregressive stop token and add padding.
    stop_token_indices = (codes == stop_token).nonzero()
    if len(stop_token_indices) == 0:
        if complain:
            print("No stop tokens found in one of the generated voice clips. This typically means the spoken audio is "
                  "too long. In some cases, the output will still be good, though. Listen to it and if it is missing words, "
                  "try breaking up your input text.")
        return codes
    else:
        codes[stop_token_indices] = 83
    stm = stop_token_indices.min().item()
    codes[stm:] = 83
    if stm - 3 < codes.shape[0]:
        codes[-3] = 45
        codes[-2] = 45
        codes[-1] = 248

    return codes


def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_latents, temperature=1, verbose=True):
    """
    Uses the specified diffusion model to convert discrete codes into a spectrogram.
    """
    with torch.no_grad():
        output_seq_len = latents.shape[1] * 4 * 24000 // 22050  # This diffusion model converts from 22kHz spectrogram codes to a 24kHz spectrogram signal.
        output_shape = (latents.shape[0], 100, output_seq_len)
        precomputed_embeddings = diffusion_model.timestep_independent(latents, conditioning_latents, output_seq_len, False)

        noise = torch.randn(output_shape, device=latents.device) * temperature
        mel = diffuser.p_sample_loop(diffusion_model, output_shape, noise=noise,
                                      model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings},
                                     progress=verbose)
        return denormalize_tacotron_mel(mel)[:,:,:output_seq_len]


def classify_audio_clip(clip):
    """
    Returns the score Tortoise's classifier assigns to the given clip having come from Tortoise.
    :param clip: torch tensor containing audio waveform data (get it from load_audio)
    :return: A scalar tensor in [0, 1] (a softmax output); higher values mean the clip was more likely generated by Tortoise.
    """
    classifier = AudioMiniEncoderWithClassifierHead(2, spec_dim=1, embedding_dim=512, depth=5, downsample_factor=4,
                                                    resnet_blocks=2, attn_blocks=4, num_attn_heads=4, base_channels=32,
                                                    dropout=0, kernel_size=5, distribute_zero_label=False)
    classifier.load_state_dict(torch.load(get_model_path('classifier.pth'), map_location=torch.device('cpu')))
    clip = clip.cpu().unsqueeze(0)
    results = F.softmax(classifier(clip), dim=-1)
    return results[0][0]


def pick_best_batch_size_for_gpu():
    """
    Tries to pick a batch size that will fit in your GPU. These sizes aren't guaranteed to work, but they should give
    you a good shot.
    """
    if torch.cuda.is_available():
        _, available = torch.cuda.mem_get_info()
        availableGb = available / (1024 ** 3)
        if availableGb > 14:
            return 16
        elif availableGb > 10:
            return 8
        elif availableGb > 7:
            return 4
    if torch.backends.mps.is_available():
        import psutil
        available = psutil.virtual_memory().total
        availableGb = available / (1024 ** 3)
        if availableGb > 14:
            return 16
        elif availableGb > 10:
            return 8
        elif availableGb > 7:
            return 4
    return 1

class TextToSpeech:
    """
    Main entry point into Tortoise.
    """

    def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, 
                 enable_redaction=True, kv_cache=False, use_deepspeed=False, half=False, device=None,
                 tokenizer_vocab_file=None, tokenizer_basic=False):

        """
        Constructor
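        :param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing
                                          GPU OOM errors. Larger numbers generate slightly faster.
        :param models_dir: Where model weights are stored. This should only be specified if you are providing your own
                           models, otherwise use the defaults.
        :param enable_redaction: When true, text enclosed in brackets is automatically redacted from the spoken output
                                 (but is still rendered by the model). This can be used for prompt engineering.
                                 Default is true.
        :param kv_cache: Forwarded to the autoregressive model's post_init_gpt2_config().
        :param use_deepspeed: Forwarded to the autoregressive model's post_init_gpt2_config().
        :param half: Stored as self.half and forwarded to the autoregressive model's post_init_gpt2_config().
        :param device: Device to use when running the model. If omitted, the device will be automatically chosen.
                       Note that MPS is always selected when it is available.
        :param tokenizer_vocab_file: Passed to VoiceBpeTokenizer as vocab_file.
        :param tokenizer_basic: Passed to VoiceBpeTokenizer as use_basic_cleaners.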
        """
        self.models_dir = models_dir
        self.autoregressive_batch_size = pick_best_batch_size_for_gpu() if autoregressive_batch_size is None else autoregressive_batch_size
        self.enable_redaction = enable_redaction
        if device is None:
            self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        else:
            self.device = torch.device(device)
            
        if torch.backends.mps.is_available():
            self.device = torch.device('mps')
        if self.enable_redaction:
            self.aligner = Wav2VecAlignment()

        self.tokenizer = VoiceBpeTokenizer(
            vocab_file=tokenizer_vocab_file,
            use_basic_cleaners=tokenizer_basic,
        )
        self.half = half
        if os.path.exists(f'{models_dir}/autoregressive.ptt'):
            # Assume this is a traced directory.
            self.autoregressive = torch.jit.load(f'{models_dir}/autoregressive.ptt')
        else:
            self.autoregressive = UnifiedVoice(max_mel_tokens=604, max_text_tokens=402, max_conditioning_inputs=2, layers=30,
                                          model_dim=1024,
                                          heads=16, number_text_tokens=255, start_text_token=255, checkpointing=False,
                                          train_solo_embeddings=False).to(self.device).eval()
            self.autoregressive.load_state_dict(torch.load(get_model_path('autoregressive.pth', models_dir)), strict=False)
            self.autoregressive.post_init_gpt2_config(use_deepspeed=use_deepspeed, kv_cache=kv_cache, half=self.half)

        self.hifi_decoder = HifiganGenerator(
            in_channels=1024, out_channels=1, resblock_type="1",
            resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]], resblock_kernel_sizes=[3, 7, 11],
            upsample_kernel_sizes=[16, 16, 4, 4], upsample_initial_channel=512, upsample_factors=[8, 8, 2, 2],
            cond_channels=1024).to(self.device).eval()
        # Resolve the decoder weights against models_dir, matching the autoregressive checkpoint above.
        hifi_model = torch.load(get_model_path('hifidecoder.pth', models_dir))
        self.hifi_decoder.load_state_dict(hifi_model, strict=False)
        # Random latent generators (RLGs) are loaded lazily.
        self.rlg_auto = None

    def get_conditioning_latents(self, voice_samples, return_mels=False):
        """
        Transforms one or more voice_samples into an autoregressive conditioning latent.
        This is an expressive learned latent that encodes aspects of the provided clips like voice, intonation, and acoustic
        properties.
        :param voice_samples: List of 2 or more ~10 second reference clips, which should be torch tensors containing 22.05kHz waveform data.
        :param return_mels: Currently has no effect; the autoregressive latent is returned either way.
        """
        with torch.no_grad():
            # Wrap a single clip before iterating, so a bare tensor is not split along its first dimension.
            if not isinstance(voice_samples, (list, tuple)):
                voice_samples = [voice_samples]
            voice_samples = [v.to(self.device) for v in voice_samples]

            auto_conds = []
            for vs in voice_samples:
                auto_conds.append(format_conditioning(vs, device=self.device))
            auto_conds = torch.stack(auto_conds, dim=1)
            auto_latent = self.autoregressive.get_conditioning(auto_conds)

        # Only the autoregressive latent is computed here.
        return auto_latent

    def get_random_conditioning_latents(self):
        # Lazy-load the RLG models.
        if self.rlg_auto is None:
            self.rlg_auto = RandomLatentConverter(1024).eval()
            self.rlg_auto.load_state_dict(torch.load(get_model_path('rlg_auto.pth', self.models_dir), map_location=torch.device('cpu')))
        with torch.no_grad():
            return self.rlg_auto(torch.tensor([0.0]))

    def tts_with_preset(self, text, preset='fast', **kwargs):
        """
        Calls TTS with one of a set of preset generation parameters. Options:
            'ultra_fast': Produces speech at a speed which belies the name of this repo. (Not really, but it's definitely fastest).
            'fast': Decent quality speech at a decent inference rate. A good choice for mass inference.
            'standard': Very good quality. This is generally about as good as you are going to get.
            'high_quality': Use if you want the absolute best. This is not really worth the compute, though.
        """
        # Start from the tuning knobs generally found to work best for generation.
        settings = {'temperature': .8, 'length_penalty': 1.0, 'repetition_penalty': 2.0,
                    'top_p': .8,
                    'cond_free_k': 2.0, 'diffusion_temperature': 1.0}
        # Presets are defined here.
        presets = {
            'ultra_fast': {'num_autoregressive_samples': 1, 'diffusion_iterations': 10},
            'fast': {'num_autoregressive_samples': 32, 'diffusion_iterations': 50},
            'standard': {'num_autoregressive_samples': 256, 'diffusion_iterations': 200},
            'high_quality': {'num_autoregressive_samples': 256, 'diffusion_iterations': 400},
        }
        settings.update(presets[preset])
        settings.update(kwargs) # allow overriding of preset settings with kwargs
        for audio_frame in self.tts(text, **settings):
            yield audio_frame

    # taken from here https://github.com/coqui-ai/TTS/blob/b4c552a112fd4c5f3477f439882eb43c2e2ce85f/TTS/tts/models/xtts.py#L600
    def handle_chunks(self, wav_gen, wav_gen_prev, wav_overlap, overlap_len):
        """Handle chunk formatting in streaming mode"""
        wav_chunk = wav_gen[:-overlap_len]
        if wav_gen_prev is not None:
            wav_chunk = wav_gen[(wav_gen_prev.shape[0] - overlap_len) : -overlap_len]
        if wav_overlap is not None:
            # cross fade the overlap section
            if overlap_len > len(wav_chunk):
                # wav_chunk is smaller than overlap_len, pass on last wav_gen
                if wav_gen_prev is not None:
                    wav_chunk = wav_gen[(wav_gen_prev.shape[0] - overlap_len):]
                else:
                    # not expected to be reached, since this problem only happens on the last chunk
                    wav_chunk = wav_gen[-overlap_len:]
                return wav_chunk, wav_gen, None
            else:
                crossfade_wav = wav_chunk[:overlap_len]
                crossfade_wav = crossfade_wav * torch.linspace(0.0, 1.0, overlap_len).to(crossfade_wav.device)
                wav_chunk[:overlap_len] = wav_overlap * torch.linspace(1.0, 0.0, overlap_len).to(wav_overlap.device)
                wav_chunk[:overlap_len] += crossfade_wav

        wav_overlap = wav_gen[-overlap_len:]
        wav_gen_prev = wav_gen
        return wav_chunk, wav_gen_prev, wav_overlap


    def tts_stream(self, text, voice_samples=None, conditioning_latents=None, k=1, verbose=True, use_deterministic_seed=None,
            return_deterministic_state=False, overlap_wav_len=1024, stream_chunk_size=40,
            # autoregressive generation parameters follow
            num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500,
            # CVVP parameters follow
            cvvp_amount=.0,
            # diffusion generation parameters follow
            diffusion_iterations=100, cond_free=True, cond_free_k=2, diffusion_temperature=1.0,
            **hf_generate_kwargs):
        """
        Streams an audio clip of the given text being spoken with the given reference voice, yielding it chunk by chunk.
        :param text: Text to be spoken.
        :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data.
                              If None, random conditioning latents from get_random_conditioning_latents() are used.
        :param conditioning_latents: Currently not read by this method; when voice_samples is None, random conditioning
                                     latents are used instead.
        :param k: The number of returned clips. The most likely (as determined by Tortoise's CLVP model) clips are returned.
        :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true.
        :param overlap_wav_len: Number of waveform samples held back from each chunk and cross-faded into the next one.
        :param stream_chunk_size: Number of autoregressive generation steps accumulated before a chunk is decoded and yielded.
                                  The first chunk waits for at least 60 steps. If not positive, audio is only yielded
                                  once generation ends.
        NOTE: k, return_deterministic_state, num_autoregressive_samples, max_mel_tokens, cvvp_amount and the diffusion
              knobs are accepted but not read by the body of this method.
        ~~AUTOREGRESSIVE KNOBS~~
        :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP.
               As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great".
        :param temperature: The softmax temperature of the autoregressive model.
        :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs.
        :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence
                                   of long silences or "uhhhhhhs", etc.
        :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs.
        :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second.
        ~~DIFFUSION KNOBS~~
        :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine
                                     the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better,
                                     however.
        :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for
                          each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output
                          of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and
                          dramatically improves realism.
        :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf].
                            As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
                            Formula is: output=cond_present_output*(cond_free_k+1)-cond_absent_output*cond_free_k
        :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
                                      are the "mean" prediction of the diffusion network and will sound bland and smeared.
        ~~OTHER STUFF~~
        :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer.
                                   Extra keyword args fed to this function get forwarded directly to that API. Documentation
                                   here: https://huggingface.co/docs/transformers/internal/generation_utils
        :return: A generator that yields successive audio chunks as torch tensors.
                 Sample rate is 24kHz.
        """
        deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)

        text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device)
        text_tokens = F.pad(text_tokens, (0, 1))  # This may not be necessary.
        assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.'
        if voice_samples is not None:
            auto_conditioning = self.get_conditioning_latents(voice_samples, return_mels=False)
        else:
            auto_conditioning  = self.get_random_conditioning_latents()
        auto_conditioning = auto_conditioning.to(self.device)

        with torch.no_grad():
            calm_token = 83  # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
            if verbose:
                print("Generating autoregressive samples..")
            with torch.autocast(
                    device_type="cuda" , dtype=torch.float16, enabled=self.half
                ):
                fake_inputs = self.autoregressive.compute_embeddings(
                    auto_conditioning,
                    text_tokens,
                )
                gpt_generator = self.autoregressive.get_generator(
                    fake_inputs=fake_inputs,
                    top_k=50,
                    top_p=top_p,
                    temperature=temperature,
                    do_sample=True,
                    num_beams=1,
                    num_return_sequences=1,
                    length_penalty=float(length_penalty),
                    repetition_penalty=float(repetition_penalty),
                    output_attentions=False,
                    output_hidden_states=True,
                    **hf_generate_kwargs,
                )
            all_latents = []
            codes_ = []
            wav_gen_prev = None
            wav_overlap = None
            is_end = False
            first_buffer = 60
            while not is_end:
                try:
                    with torch.autocast(
                        device_type="cuda", dtype=torch.float16, enabled=self.half
                    ):
                        codes, latent = next(gpt_generator)
                        all_latents += [latent]
                        codes_ += [codes]
                except StopIteration:
                    is_end = True

                if is_end or (stream_chunk_size > 0 and len(codes_) >= max(stream_chunk_size, first_buffer)):
                    first_buffer = 0
                    gpt_latents = torch.cat(all_latents, dim=0)[None, :]
                    wav_gen = self.hifi_decoder.inference(gpt_latents.to(self.device), auto_conditioning)
                    wav_gen = wav_gen.squeeze()
                    wav_chunk, wav_gen_prev, wav_overlap = self.handle_chunks(
                        wav_gen.squeeze(), wav_gen_prev, wav_overlap, overlap_wav_len
                    )
                    codes_ = []
                    yield wav_chunk

    def tts(self, text, voice_samples=None, k=1, verbose=True, use_deterministic_seed=None,
            # autoregressive generation parameters follow
            num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, 
            top_p=.8, max_mel_tokens=500,
            # CVVP parameters follow
            cvvp_amount=.0,
            **hf_generate_kwargs):
        """
        Produces an audio clip of the given text being spoken with the given reference voice.
        :param text: Text to be spoken.
        :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data.
                              If None, random conditioning latents from get_random_conditioning_latents() are used.
        :param conditioning_latents: Not a named parameter of this method; if passed, it is collected into hf_generate_kwargs.
                                     When voice_samples is None, random conditioning latents are used instead.
        :param k: The number of returned clips. The most likely (as determined by Tortoise's CLVP model) clips are returned.
        :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true.
        ~~AUTOREGRESSIVE KNOBS~~
        :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP.
               As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great".
        :param temperature: The softmax temperature of the autoregressive model.
        :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings cause the model to produce more terse outputs.
        :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence
                                   of long silences or "uhhhhhhs", etc.
        :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs.
        :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second.
        ~~DIFFUSION KNOBS~~
        (These are not named parameters of this method; if passed, they are collected into hf_generate_kwargs.)
        :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine
                                     the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better,
                                     however.
        :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for
                          each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output
                          of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and
                          dramatically improves realism.
        :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf].
                            As cond_free_k increases, the output becomes dominated by the conditioning-free signal.
                            Formula is: output=cond_present_output*(cond_free_k+1)-cond_absent_output*cond_free_k
        :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0
                                      are the "mean" prediction of the diffusion network and will sound bland and smeared.
        ~~OTHER STUFF~~
        :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer.
                                   Extra keyword args fed to this function get forwarded directly to that API. Documentation
                                   here: https://huggingface.co/docs/transformers/internal/generation_utils
        :return: Generated audio clip(s) as a torch tensor. Shape (1,S) if k=1, else (k,1,S), where S is the sample length.
                 Sample rate is 24kHz.
        """
        deterministic_seed = self.deterministic_state(seed=use_deterministic_seed)

        text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device)
        text_tokens = F.pad(text_tokens, (0, 1))  # This may not be necessary.
        assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.'
        if voice_samples is not None:
            auto_conditioning = self.get_conditioning_latents(voice_samples, return_mels=False)
        else:
            auto_conditioning  = self.get_random_conditioning_latents()
        auto_conditioning = auto_conditioning.to(self.device)

        with torch.no_grad():
            calm_token = 83  # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output"
            if verbose:
                print("Generating autoregressive samples..")
            with torch.autocast(
                    device_type="cuda" , dtype=torch.float16, enabled=self.half
                ):
                codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens,
                                                            top_k=50,
                                                            top_p=top_p,
                                                            temperature=temperature,
                                                            do_sample=True,
                                                            num_beams=1,
                                                            num_return_sequences=1,
                                                            length_penalty=float(length_penalty),
                                                            repetition_penalty=float(repetition_penalty),
                                                            output_attentions=False,
                                                            output_hidden_states=True,
                                                            **hf_generate_kwargs)
                gpt_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1),
                                torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), codes,
                                torch.tensor([codes.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device),
                                return_latent=True, clip_inputs=False)
            if verbose:
                print("generating audio..")
            wav_gen = self.hifi_decoder.inference(gpt_latents.to(self.device), auto_conditioning)
            return wav_gen

    def deterministic_state(self, seed=None):
        """
        Seeds torch and Python's random module with the given seed, or with the current time() if no seed is given,
        and returns that seed so results can be reproduced.
        """
        seed = int(time()) if seed is None else seed
        torch.manual_seed(seed)
        random.seed(seed)
        # Can't currently set this because of CUBLAS. TODO: potentially enable it if necessary.
        # torch.use_deterministic_algorithms(True)

        return seed
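

# Illustrative sketch (not part of the original module): the linear cross-fade that
# handle_chunks() applies to the overlap region, restated on plain Python lists so the
# arithmetic is easy to follow. The names `linear_ramp` and `crossfade` are hypothetical.
def linear_ramp(start, stop, n):
    """Return n evenly spaced values from start to stop inclusive (like torch.linspace)."""
    if n == 1:
        return [float(start)]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def crossfade(prev_tail, next_head):
    """Fade prev_tail out while fading next_head in; both must have the same length."""
    n = len(prev_tail)
    assert len(next_head) == n, "overlap regions must have equal length"
    fade_out = linear_ramp(1.0, 0.0, n)  # weight applied to the previous chunk's tail
    fade_in = linear_ramp(0.0, 1.0, n)   # weight applied to the new chunk's head
    return [p * o + q * i for p, o, q, i in zip(prev_tail, fade_out, next_head, fade_in)]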


================================================
FILE: tortoise/data/got.txt
================================================
Chapter One


Bran


The morning had dawned clear and cold, with a crispness that hinted at the end of summer. They set forth at daybreak to see a man beheaded, twenty in all, and Bran rode among them, nervous with excitement. This was the first time he had been deemed old enough to go with his lord father and his brothers to see the king's justice done. It was the ninth year of summer, and the seventh of Bran's life.


The man had been taken outside a small holdfast in the hills. Robb thought he was a wildling, his sword sworn to Mance Rayder, the King-beyond-the-Wall. It made Bran's skin prickle to think of it. He remembered the hearth tales Old Nan told them. The wildlings were cruel men, she said, slavers and slayers and thieves. They consorted with giants and ghouls, stole girl children in the dead of night, and drank blood from polished horns. And their women lay with the Others in the Long Night to sire terrible half-human children.


But the man they found bound hand and foot to the holdfast wall awaiting the king's justice was old and scrawny, not much taller than Robb. He had lost both ears and a finger to frostbite, and he dressed all in black, the same as a brother of the Night's Watch, except that his furs were ragged and greasy.


The breath of man and horse mingled, steaming, in the cold morning air as his lord father had the man cut down from the wall and dragged before them. Robb and Jon sat tall and still on their horses, with Bran between them on his pony, trying to seem older than seven, trying to pretend that he'd seen all this before. A faint wind blew through the holdfast gate. Over their heads flapped the banner of the Starks of Winterfell: a grey direwolf racing across an ice-white field.

Bran's father sat solemnly on his horse, long brown hair stirring in the wind. His closely trimmed beard was shot with white, making him look older than his thirty-five years. He had a grim cast to his grey eyes this day, and he seemed not at all the man who would sit before the fire in the evening and talk softly of the age of heroes and the children of the forest. He had taken off Father's face, Bran thought, and donned the face of Lord Stark of Winterfell.


There were questions asked and answers given there in the chill of morning, but afterward Bran could not recall much of what had been said. Finally his lord father gave a command, and two of his guardsmen dragged the ragged man to the ironwood stump in the center of the square. They forced his head down onto the hard black wood. Lord Eddard Stark dismounted and his ward Theon Greyjoy brought forth the sword. "Ice," that sword was called. It was as wide across as a man's hand, and taller even than Robb. The blade was Valyrian steel, spell-forged and dark as smoke. Nothing held an edge like Valyrian steel.


His father peeled off his gloves and handed them to Jory Cassel, the captain of his household guard. He took hold of Ice with both hands and said, "In the name of Robert of the House Baratheon, the First of his Name, King of the Andals and the Rhoynar and the First Men, Lord of the Seven Kingdoms and Protector of the Realm, by the word of Eddard of the House Stark, Lord of Winterfell and Warden of the North, I do sentence you to die." He lifted the greatsword high above his head.


Bran's bastard brother Jon Snow moved closer. "Keep the pony well in hand," he whispered. "And don't look away. Father will know if you do."


Bran kept his pony well in hand, and did not look away.


His father took off the man's head with a single sure stroke. Blood sprayed out across the snow, as red as summerwine. One of the horses reared and had to be restrained to keep from bolting. Bran could not take his eyes off the blood. The snows around the stump drank it eagerly, reddening as he watched.

The head bounced off a thick root and rolled. It came up near Greyjoy's feet. Theon was a lean, dark youth of nineteen who found everything amusing. He laughed, put his boot on the head, and kicked it away.


"Ass," Jon muttered, low enough so Greyjoy did not hear. He put a hand on Bran's shoulder, and Bran looked over at his bastard brother. "You did well," Jon told him solemnly. Jon was fourteen, an old hand at justice.


It seemed colder on the long ride back to Winterfell, though the wind had died by then and the sun was higher in the sky. Bran rode with his brothers, well ahead of the main party, his pony struggling hard to keep up with their horses.


"The deserter died bravely," Robb said. He was big and broad and growing every day, with his mother's coloring, the fair skin, red-brown hair, and blue eyes of the Tullys of Riverrun. "He had courage, at the least."


"No," Jon Snow said quietly. "It was not courage. This one was dead of fear. You could see it in his eyes, Stark." Jon's eyes were a grey so dark they seemed almost black, but there was little they did not see. He was of an age with Robb, but they did not look alike. Jon was slender where Robb was muscular, dark where Robb was fair, graceful and quick where his half brother was strong and fast.


Robb was not impressed. "The Others take his eyes," he swore. "He died well. Race you to the bridge?"


"Done," Jon said, kicking his horse forward. Robb cursed and followed, and they galloped off down the trail, Robb laughing and hooting, Jon silent and intent. The hooves of their horses kicked up showers of snow as they went.

Bran did not try to follow. His pony could not keep up. He had seen the ragged man's eyes, and he was thinking of them now. After a while, the sound of Robb's laughter receded, and the woods grew silent again.


So deep in thought was he that he never heard the rest of the party until his father moved up to ride beside him. "Are you well, Bran?" he asked, not unkindly.


"Yes, Father," Bran told him. He looked up. Wrapped in his furs and leathers, mounted on his great warhorse, his lord father loomed over him like a giant. "Robb says the man died bravely, but Jon says he was afraid."


"What do you think?" his father asked.


Bran thought about it. "Can a man still be brave if he's afraid?"


"That is the only time a man can be brave," his father told him. "Do you understand why I did it?"


"He was a wildling," Bran said. "They carry off women and sell them to the Others."


His lord father smiled. "Old Nan has been telling you stories again. In truth, the man was an oathbreaker, a deserter from the Night's Watch. No man is more dangerous. The deserter knows his life is forfeit if he is taken, so he will not flinch from any crime, no matter how vile. But you mistake me. The question was not why the man had to die, but why I must do it."


Bran had no answer for that. "King Robert has a headsman," he said, uncertainly.


"He does," his father admitted. "As did the Targaryen kings before him. Yet our way is the older way. The blood of the First Men still flows in the veins of the Starks, and we hold to the belief that the man who passes the sentence should swing the sword. If you would take a man's life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die.


"One day, Bran, you will be Robb's bannerman, holding a keep of your own for your brother and your king, and justice will fall to you. When that day comes, you must take no pleasure in the task, but neither must you look away. A ruler who hides behind paid executioners soon forgets what death is."


That was when Jon reappeared on the crest of the hill before them. He waved and shouted down at them. "Father, Bran, come quickly, see what Robb has found!" Then he was gone again.


Jory rode up beside them. "Trouble, my lord?"


"Beyond a doubt," his lord father said. "Come, let us see what mischief my sons have rooted out now." He sent his horse into a trot. Jory and Bran and the rest came after.


They found Robb on the riverbank north of the bridge, with Jon still mounted beside him. The late summer snows had been heavy this moonturn. Robb stood knee-deep in white, his hood pulled back so the sun shone in his hair. He was cradling something in his arm, while the boys talked in hushed, excited voices.


The riders picked their way carefully through the drifts, groping for solid footing on the hidden, uneven ground. Jory Cassel and Theon Greyjoy were the first to reach the boys. Greyjoy was laughing and joking as he rode. Bran heard the breath go out of him. "Gods!" he exclaimed, struggling to keep control of his horse as he reached for his sword.


Jory's sword was already out. "Robb, get away from it!" he called as his horse reared under him.


Robb grinned and looked up from the bundle in his arms. "She can't hurt you," he said. "She's dead, Jory."


Bran was afire with curiosity by then. He would have spurred the pony faster, but his father made them dismount beside the bridge and approach on foot. Bran jumped off and ran.


By then Jon, Jory, and Theon Greyjoy had all dismounted as well. "What in the seven hells is it?" Greyjoy was saying.


"A wolf," Robb told him.


"A freak," Greyjoy said. "Look at the size of it."


Bran's heart was thumping in his chest as he pushed through a waist-high drift to his brothers' side.


Half-buried in bloodstained snow, a huge dark shape slumped in death. Ice had formed in its shaggy grey fur, and the faint smell of corruption clung to it like a woman's perfume. Bran glimpsed blind eyes crawling with maggots, a wide mouth full of yellowed teeth. But it was the size of it that made him gasp. It was bigger than his pony, twice the size of the largest hound in his father's kennel.


"It's no freak," Jon said calmly. "That's a direwolf. They grow larger than the other kind."


Theon Greyjoy said, "There's not been a direwolf sighted south of the Wall in two hundred years."


"I see one now," Jon replied.


Bran tore his eyes away from the monster. That was when he noticed the bundle in Robb's arms. He gave a cry of delight and moved closer. The pup was a tiny ball of grey-black fur, its eyes still closed. It nuzzled blindly against Robb's chest as he cradled it, searching for milk among his leathers, making a sad little whimpery sound. Bran reached out hesitantly. "Go on," Robb told him. "You can touch him."


Bran gave the pup a quick nervous stroke, then turned as Jon said, "Here you go." His half brother put a second pup into his arms. "There are five of them." Bran sat down in the snow and hugged the wolf pup to his face. Its fur was soft and warm against his cheek.


"Direwolves loose in the realm, after so many years," muttered Hullen, the master of horse. "I like it not."


"It is a sign," Jory said.


Father frowned. "This is only a dead animal, Jory," he said. Yet he seemed troubled. Snow crunched under his boots as he moved around the body. "Do we know what killed her?"


"There's something in the throat," Robb told him, proud to have found the answer before his father even asked. "There, just under the jaw."


His father knelt and groped under the beast's head with his hand. He gave a yank and held it up for all to see. A foot of shattered antler, tines snapped off, all wet with blood.


A sudden silence descended over the party. The men looked at the antler uneasily, and no one dared to speak. Even Bran could sense their fear, though he did not understand.


His father tossed the antler to the side and cleansed his hands in the snow. "I'm surprised she lived long enough to whelp," he said. His voice broke the spell.


"Maybe she didn't," Jory said. "I've heard tales . . . maybe the bitch was already dead when the pups came."


"Born with the dead," another man put in. "Worse luck."


"No matter," said Hullen. "They be dead soon enough too."


Bran gave a wordless cry of dismay.


"The sooner the better," Theon Greyjoy agreed. He drew his sword. "Give the beast here, Bran."


The little thing squirmed against him, as if it heard and understood. "No!" Bran cried out fiercely. "It's mine."


"Put away your sword, Greyjoy," Robb said. For a moment he sounded as commanding as their father, like the lord he would someday be. "We will keep these pups."


"You cannot do that, boy," said Harwin, who was Hullen's son.


"It be a mercy to kill them," Hullen said.


Bran looked to his lord father for rescue, but got only a frown, a furrowed brow. "Hullen speaks truly, son. Better a swift death than a hard one from cold and starvation."


"No!" He could feel tears welling in his eyes, and he looked away. He did not want to cry in front of his father.


Robb resisted stubbornly. "Ser Rodrik's red bitch whelped again last week," he said. "It was a small litter, only two live pups. She'll have milk enough."


"She'll rip them apart when they try to nurse."


"Lord Stark," Jon said. It was strange to hear him call Father that, so formal. Bran looked at him with desperate hope. "There are five pups," he told Father. "Three male, two female."


"What of it, Jon?"


"You have five trueborn children," Jon said. "Three sons, two daughters. The direwolf is the sigil of your House. Your children were meant to have these pups, my lord."


Bran saw his father's face change, saw the other men exchange glances. He loved Jon with all his heart at that moment. Even at seven, Bran understood what his brother had done. The count had come right only because Jon had omitted himself. He had included the girls, included even Rickon, the baby, but not the bastard who bore the surname Snow, the name that custom decreed be given to all those in the north unlucky enough to be born with no name of their own.


Their father understood as well. "You want no pup for yourself, Jon?" he asked softly.


"The direwolf graces the banners of House Stark," Jon pointed out. "I am no Stark, Father."


Their lord father regarded Jon thoughtfully. Robb rushed into the silence he left. "I will nurse him myself, Father," he promised. "I will soak a towel with warm milk, and give him suck from that."


"Me too!" Bran echoed.


The lord weighed his sons long and carefully with his eyes. "Easy to say, and harder to do. I will not have you wasting the servants' time with this. If you want these pups, you will feed them yourselves. Is that understood?"


Bran nodded eagerly. The pup squirmed in his grasp, licked at his face with a warm tongue.


"You must train them as well," their father said. "You must train them. The kennelmaster will have nothing to do with these monsters, I promise you that. And the gods help you if you neglect them, or brutalize them, or train them badly. These are not dogs to beg for treats and slink off at a kick. A direwolf will rip a man's arm off his shoulder as easily as a dog will kill a rat. Are you sure you want this?"

"Yes, Father," Bran said.


"Yes," Robb agreed.


"The pups may die anyway, despite all you do."


"They won't die," Robb said. "We won't let them die."


"Keep them, then. Jory, Desmond, gather up the other pups. It's time we were back to Winterfell."


It was not until they were mounted and on their way that Bran allowed himself to taste the sweet air of victory. By then, his pup was snuggled inside his leathers, warm against him, safe for the long ride home. Bran was wondering what to name him.


Halfway across the bridge, Jon pulled up suddenly.


"What is it, Jon?" their lord father asked.


"Can't you hear it?"


Bran could hear the wind in the trees, the clatter of their hooves on the ironwood planks, the whimpering of his hungry pup, but Jon was listening to something else.


"There," Jon said. He swung his horse around and galloped back across the bridge. They watched him dismount where the direwolf lay dead in the snow, watched him kneel. A moment later he was riding back to them, smiling.


"He must have crawled away from the others," Jon said.


"Or been driven away," their father said, looking at the sixth pup. His fur was white, where the rest of the litter was grey. His eyes were as red as the blood of the ragged man who had died that morning. Bran thought it curious that this pup alone would have opened his eyes while the others were still blind.


"An albino," Theon Greyjoy said with wry amusement. "This one will die even faster than the others."


Jon Snow gave his father's ward a long, chilling look. "I think not, Greyjoy," he said. "This one belongs to me."

================================================
FILE: tortoise/data/layman.txt
================================================


================================================
FILE: tortoise/data/riding_hood.txt
================================================
Once upon a time there lived in a certain village a little country girl, the prettiest creature who was ever seen. Her mother was excessively fond of her; and her grandmother doted on her still more. This good woman had a little red riding hood made for her. It suited the girl so extremely well that everybody called her Little Red Riding Hood.
One day her mother, having made some cakes, said to her, "Go, my dear, and see how your grandmother is doing, for I hear she has been very ill. Take her a cake, and this little pot of butter."

Little Red Riding Hood set out immediately to go to her grandmother, who lived in another village.

As she was going through the wood, she met with a wolf, who had a very great mind to eat her up, but he dared not, because of some woodcutters working nearby in the forest. He asked her where she was going. The poor child, who did not know that it was dangerous to stay and talk to a wolf, said to him, "I am going to see my grandmother and carry her a cake and a little pot of butter from my mother."

"Does she live far off?" said the wolf.

"Oh I say," answered Little Red Riding Hood; "it is beyond that mill you see there, at the first house in the village."

"Well," said the wolf, "and I'll go and see her too. I'll go this way and go you that, and we shall see who will be there first."

The wolf ran as fast as he could, taking the shortest path, and the little girl took a roundabout way, entertaining herself by gathering nuts, running after butterflies, and gathering bouquets of little flowers. It was not long before the wolf arrived at the old woman's house. He knocked at the door: tap, tap.

"Who's there?"

"Your grandchild, Little Red Riding Hood," replied the wolf, counterfeiting her voice; "who has brought you a cake and a little pot of butter sent you by mother."

The good grandmother, who was in bed, because she was somewhat ill, cried out, "Pull the bobbin, and the latch will go up."

The wolf pulled the bobbin, and the door opened, and then he immediately fell upon the good woman and ate her up in a moment, for it had been more than three days since he had eaten. He then shut the door and got into the grandmother's bed, expecting Little Red Riding Hood, who came some time afterwards and knocked at the door: tap, tap.

"Who's there?"

Little Red Riding Hood, hearing the big voice of the wolf, was at first afraid; but believing her grandmother had a cold and was hoarse, answered, "It is your grandchild Little Red Riding Hood, who has brought you a cake and a little pot of butter mother sends you."

The wolf cried out to her, softening his voice as much as he could, "Pull the bobbin, and the latch will go up."

Little Red Riding Hood pulled the bobbin, and the door opened.

The wolf, seeing her come in, said to her, hiding himself under the bedclothes, "Put the cake and the little pot of butter upon the stool, and come get into bed with me."

Little Red Riding Hood took off her clothes and got into bed. She was greatly amazed to see how her grandmother looked in her nightclothes, and said to her, "Grandmother, what big arms you have!"

"All the better to hug you with, my dear."

"Grandmother, what big legs you have!"

"All the better to run with, my child."

"Grandmother, what big ears you have!"

"All the better to hear with, my child."

"Grandmother, what big eyes you have!"

"All the better to see with, my child."

"Grandmother, what big teeth you have got!"

"All the better to eat you up with."

And, saying these words, this wicked wolf fell upon Little Red Riding Hood, and ate her all up.

================================================
FILE: tortoise/data/seal_copypasta.txt
================================================
What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al kayda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire U S armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the U S A and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo.

================================================
FILE: tortoise/data/tokenizer.json
================================================
{"version":"1.0","truncation":null,"padding":null,"added_tokens":[{"id":0,"special":true,"content":"[STOP]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":1,"special":true,"content":"[UNK]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":2,"special":true,"content":"[SPACE]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false}],"normalizer":null,"pre_tokenizer":{"type":"Whitespace"},"post_processor":null,"decoder":null,"model":{"type":"BPE","dropout":null,"unk_token":"[UNK]","continuing_subword_prefix":null,"end_of_word_suffix":null,"fuse_unk":false,"vocab":{"[STOP]":0,"[UNK]":1,"[SPACE]":2,"!":3,"'":4,"(":5,")":6,",":7,"-":8,".":9,"/":10,":":11,";":12,"?":13,"a":14,"b":15,"c":16,"d":17,"e":18,"f":19,"g":20,"h":21,"i":22,"j":23,"k":24,"l":25,"m":26,"n":27,"o":28,"p":29,"q":30,"r":31,"s":32,"t":33,"u":34,"v":35,"w":36,"x":37,"y":38,"z":39,"th":40,"in":41,"the":42,"an":43,"er":44,"ou":45,"re":46,"on":47,"at":48,"ed":49,"en":50,"to":51,"ing":52,"and":53,"is":54,"as":55,"al":56,"or":57,"of":58,"ar":59,"it":60,"es":61,"he":62,"st":63,"le":64,"om":65,"se":66,"be":67,"ad":68,"ow":69,"ly":70,"ch":71,"wh":72,"that":73,"you":74,"li":75,"ve":76,"ac":77,"ti":78,"ld":79,"me":80,"was":81,"gh":82,"id":83,"ll":84,"wi":85,"ent":86,"for":87,"ay":88,"ro":89,"ver":90,"ic":91,"her":92,"ke":93,"his":94,"no":95,"ut":96,"un":97,"ir":98,"lo":99,"we":100,"ri":101,"ha":102,"with":103,"ght":104,"out":105,"im":106,"ion":107,"all":108,"ab":109,"one":110,"ne":111,"ge":112,"ould":113,"ter":114,"mo":115,"had":116,"ce":117,"she":118,"go":119,"sh":120,"ur":121,"am":122,"so":123,"pe":124,"my":125,"de":126,"are":127,"but":128,"ome":129,"fr":130,"ther":131,"fe":132,"su":133,"do":134,"con":135,"te":136,"ain":137,"ere":138,"po":139,"if":140,"they":141,"us":142,"ag":143,"tr":144,"now":145,"oun":146,"this":147,"have":148,"not":149,"sa":150,"il":151,"up":152,"thing":153,"from":154,"ap":155,"him":156,"ack":157,"ation":158,"ant":159,"our"
:160,"op":161,"like":162,"ust":163,"ess":164,"bo":165,"ok":166,"ul":167,"ind":168,"ex":169,"com":170,"some":171,"there":172,"ers":173,"co":174,"res":175,"man":176,"ard":177,"pl":178,"wor":179,"way":180,"tion":181,"fo":182,"ca":183,"were":184,"by":185,"ate":186,"pro":187,"ted":188,"ound":189,"own":190,"would":191,"ts":192,"what":193,"qu":194,"ally":195,"ight":196,"ck":197,"gr":198,"when":199,"ven":200,"can":201,"ough":202,"ine":203,"end":204,"per":205,"ous":206,"od":207,"ide":208,"know":209,"ty":210,"very":211,"si":212,"ak":213,"who":214,"about":215,"ill":216,"them":217,"est":218,"red":219,"ye":220,"could":221,"ong":222,"your":223,"their":224,"em":225,"just":226,"other":227,"into":228,"any":229,"whi":230,"um":231,"tw":232,"ast":233,"der":234,"did":235,"ie":236,"been":237,"ace":238,"ink":239,"ity":240,"back":241,"ting":242,"br":243,"more":244,"ake":245,"pp":246,"then":247,"sp":248,"el":249,"use":250,"bl":251,"said":252,"over":253,"get":254},"merges":["t h","i n","th e","a n","e r","o u","r e","o n","a t","e d","e n","t o","in g","an d","i s","a s","a l","o r","o f","a r","i t","e s","h e","s t","l e","o m","s e","b e","a d","o w","l y","c h","w h","th at","y ou","l i","v e","a c","t i","l d","m e","w as","g h","i d","l l","w i","en t","f or","a y","r o","v er","i c","h er","k e","h is","n o","u t","u n","i r","l o","w e","r i","h a","wi th","gh t","ou t","i m","i on","al l","a b","on e","n e","g e","ou ld","t er","m o","h ad","c e","s he","g o","s h","u r","a m","s o","p e","m y","d e","a re","b ut","om e","f r","the r","f e","s u","d o","c on","t e","a in","er e","p o","i f","the y","u s","a g","t r","n ow","ou n","th is","ha ve","no t","s a","i l","u p","th ing","fr om","a p","h im","ac k","at ion","an t","ou r","o p","li ke","u st","es s","b o","o k","u l","in d","e x","c om","s ome","the re","er s","c o","re s","m an","ar d","p l","w or","w ay","ti on","f o","c a","w ere","b y","at e","p ro","t ed","oun d","ow n","w ould","t s","wh at","q u","al ly","i ght","c 
k","g r","wh en","v en","c an","ou gh","in e","en d","p er","ou s","o d","id e","k now","t y","ver y","s i","a k","wh o","ab out","i ll","the m","es t","re d","y e","c ould","on g","you r","the ir","e m","j ust","o ther","in to","an y","wh i","u m","t w","as t","d er","d id","i e","be en","ac e","in k","it y","b ack","t ing","b r","mo re","a ke","p p","the n","s p","e l","u se","b l","sa id","o ver","ge t"]}}

================================================
FILE: tortoise/do_tts.py
================================================
import argparse
import os

import torch
import torchaudio

from api import TextToSpeech, MODELS_DIR
from utils.audio import load_voices


def str2bool(value):
    """Parse a command-line boolean. argparse's type=bool treats any non-empty string (even 'False') as True."""
    if isinstance(value, bool):
        return value
    if value.lower() in ('true', 't', 'yes', 'y', '1'):
        return True
    if value.lower() in ('false', 'f', 'no', 'n', '0'):
        return False
    raise argparse.ArgumentTypeError(f'Expected a boolean value, got {value!r}.')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--text', type=str, help='Text to speak.', default="The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.")
    parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
                                                 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='random')
    parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='fast')
    parser.add_argument('--use_deepspeed', type=str2bool, help='Use DeepSpeed to speed up inference.', default=False)
    parser.add_argument('--kv_cache', type=str2bool, help='Enable the KV cache. If you disable this, expect to wait a long time for the output.', default=True)
    parser.add_argument('--half', type=str2bool, help='Run inference in float16 (half) precision, which is faster and uses less VRAM and RAM.', default=True)
    parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/')
    parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this '
                                                      'should only be specified if you have custom checkpoints.', default=MODELS_DIR)
    parser.add_argument('--candidates', type=int, help='How many output candidates to produce per-voice.', default=3)
    parser.add_argument('--seed', type=int, help='Random seed which can be used to reproduce results.', default=None)
    parser.add_argument('--produce_debug_state', type=str2bool, help='Whether or not to produce debug_state.pth, which can aid in reproducing problems. Defaults to true.', default=True)
    parser.add_argument('--cvvp_amount', type=float, help='How much the CVVP model should influence the output. '
                                                          'Increasing this can in some cases reduce the likelihood of multiple speakers. Defaults to 0 (disabled)', default=.0)
    args = parser.parse_args()
    if torch.backends.mps.is_available():
        args.use_deepspeed = False
    os.makedirs(args.output_path, exist_ok=True)
    tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed, kv_cache=args.kv_cache, half=args.half)

    selected_voices = args.voice.split(',')
    for k, selected_voice in enumerate(selected_voices):
        if '&' in selected_voice:
            voice_sel = selected_voice.split('&')
        else:
            voice_sel = [selected_voice]
        voice_samples, conditioning_latents = load_voices(voice_sel)

        gen, dbg_state = tts.tts_with_preset(args.text, k=args.candidates, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
                                  preset=args.preset, use_deterministic_seed=args.seed, return_deterministic_state=True, cvvp_amount=args.cvvp_amount)
        if isinstance(gen, list):
            for j, g in enumerate(gen):
                torchaudio.save(os.path.join(args.output_path, f'{selected_voice}_{k}_{j}.wav'), g.squeeze(0).cpu(), 24000)
        else:
            torchaudio.save(os.path.join(args.output_path, f'{selected_voice}_{k}.wav'), gen.squeeze(0).cpu(), 24000)

        if args.produce_debug_state:
            os.makedirs('debug_states', exist_ok=True)
            torch.save(dbg_state, f'debug_states/do_tts_debug_{selected_voice}.pth')
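

"""
Illustration of how the --voice value is interpreted by the loop above: a comma separates
independent generation runs, and '&' joins several voices into one run. This is a minimal
sketch of that splitting only. The helper name `parse_voice_spec` and the voice names
`voice_a`, `voice_b` and `voice_c` are hypothetical placeholders, not part of Tortoise.

```python
def parse_voice_spec(spec):
    # One inner list per generation run (comma-separated in the CLI value).
    runs = []
    for selected_voice in spec.split(','):
        # '&' joins several voices into a single blended run; a plain name
        # simply becomes a one-element list.
        runs.append(selected_voice.split('&'))
    return runs


print(parse_voice_spec('voice_a,voice_b&voice_c'))
```

Each inner list corresponds to what the script passes to `load_voices` for one run.
"""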



================================================
FILE: tortoise/eval.py
================================================
import argparse
import os

import torchaudio

from api import TextToSpeech
from tortoise.utils.audio import load_audio

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--eval_path', type=str, help='Path to TSV test file', default="D:\\tmp\\tortoise-tts-eval\\test.tsv")
    parser.add_argument('--output_path', type=str, help='Where to put results', default="D:\\tmp\\tortoise-tts-eval\\baseline")
    parser.add_argument('--preset', type=str, help='Rendering preset.', default="standard")
    args = parser.parse_args()
    os.makedirs(args.output_path, exist_ok=True)

    tts = TextToSpeech()

    with open(args.eval_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    for line in lines:
        text, real = line.strip().split('\t')
        conds = [load_audio(real, 22050)]
        gen = tts.tts_with_preset(text, voice_samples=conds, conditioning_latents=None, preset=args.preset)
        torchaudio.save(os.path.join(args.output_path, os.path.basename(real)), gen.squeeze(0).cpu(), 24000)



================================================
FILE: tortoise/get_conditioning_latents.py
================================================
import argparse
import os
import torch

from api import TextToSpeech
from tortoise.utils.audio import load_audio, get_voices

"""
Dumps the conditioning latents for the specified voice to disk. These are expressive latents which can be used for
other ML models, or can be augmented manually and fed back into Tortoise to affect vocal qualities.
"""
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--voice', type=str, help='Selects the voice to convert to conditioning latents', default='pat2')
    parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='../results/conditioning_latents')
    args = parser.parse_args()
    os.makedirs(args.output_path, exist_ok=True)

    tts = TextToSpeech()
    voices = get_voices()
    selected_voices = args.voice.split(',')
    for voice in selected_voices:
        cond_paths = voices[voice]
        conds = []
        for cond_path in cond_paths:
            c = load_audio(cond_path, 22050)
            conds.append(c)
        conditioning_latents = tts.get_conditioning_latents(conds)
        torch.save(conditioning_latents, os.path.join(args.output_path, f'{voice}.pth'))



================================================
FILE: tortoise/is_this_from_tortoise.py
================================================
import argparse

from api import classify_audio_clip
from tortoise.utils.audio import load_audio

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--clip', type=str, help='Path to an audio clip to classify.', default="../examples/favorite_riding_hood.mp3")
    args = parser.parse_args()

    clip = load_audio(args.clip, 24000)
    clip = clip[:, :220000]
    prob = classify_audio_clip(clip)
    print(f"This classifier thinks there is a {prob*100}% chance that this clip was generated from Tortoise.")

================================================
FILE: tortoise/models/__init__.py
================================================


================================================
FILE: tortoise/models/arch_util.py
================================================
import os
import functools
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from tortoise.models.xtransformers import ContinuousTransformerWrapper, RelativePositionBias


def zero_module(module):
    """
    Zero out the parameters of a module and return it.
    """
    for p in module.parameters():
        p.detach().zero_()
    return module


class GroupNorm32(nn.GroupNorm):
    def forward(self, x):
        return super().forward(x.float()).type(x.dtype)


def normalization(channels):
    """
    Make a standard normalization layer.

    :param channels: number of input channels.
    :return: an nn.Module for normalization.
    """
    groups = 32
    if channels <= 16:
        groups = 8
    elif channels <= 64:
        groups = 16
    while channels % groups != 0:
        groups //= 2  # halve until the group count divides the channel count
    assert groups > 2
    return GroupNorm32(groups, channels)
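

"""
Illustration of the group-count selection in `normalization` above, written without torch
so it can be run on its own. This is a sketch: `pick_group_count` is a hypothetical helper
that mirrors only the arithmetic and leaves out the `assert groups > 2` guard and the
GroupNorm32 construction.

```python
def pick_group_count(channels):
    # Start from a size-dependent default, as normalization() does.
    groups = 32
    if channels <= 16:
        groups = 8
    elif channels <= 64:
        groups = 16
    # Halve until the group count divides the channel count evenly.
    while channels % groups != 0:
        groups //= 2
    return groups


for channels in (16, 64, 80, 100, 1024):
    print(channels, pick_group_count(channels))
```

For example, 80 channels start at 32 groups, which does not divide 80, so the count is
halved once to 16.
"""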


class QKVAttentionLegacy(nn.Module):
    """
    A module which performs QKV attention. Matches legacy QKVAttention + input/output heads shaping
    """

    def __init__(self, n_heads):
        super().__init__()
        self.n_heads = n_heads

    def forward(self, qkv, mask=None, rel_pos=None):
        """
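        Apply QKV attention.

        :param qkv: an [N x (H * 3 * C) x T] tensor of Qs, Ks, and Vs.
        :param mask: optional [N x T] tensor multiplied into the attention weights after the softmax.
        :param rel_pos: optional module applied to the [N x H x T x T] attention logits before the softmax.
        :return: an [N x (H * C) x T] tensor after attention.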
        """
        bs, width, length = qkv.shape
        assert width % (3 * self.n_heads) == 0
        ch = width // (3 * self.n_heads)
        q, k, v = qkv.reshape(bs * self.n_heads, ch * 3, length).split(ch, dim=1)
        scale = 1 / math.sqrt(math.sqrt(ch))
        weight = torch.einsum(
            "bct,bcs->bts", q * scale, k * scale
        )  # More stable with f16 than dividing afterwards
        if rel_pos is not None:
            weight = rel_pos(weight.reshape(bs, self.n_heads, weight.shape[-2], weight.shape[-1])).reshape(bs * self.n_heads, weight.shape[-2], weight.shape[-1])
        weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
        if mask is not None:
            # The proper way to do this is to mask before the softmax using -inf, but that doesn't work properly on CPUs.
            mask = mask.repeat(self.n_heads, 1).unsqueeze(1)
            weight = weight * mask
        a = torch.einsum("bts,bcs->bct", weight, v)

        return a.reshape(bs, -1, length)
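

"""
Illustration of the scaling used in `QKVAttentionLegacy.forward` above: multiplying q and k
each by ch ** -0.25 gives the same logits as dividing q . k by sqrt(ch), while keeping the
intermediate products smaller (the source comment notes this is more stable in f16). A
minimal pure-Python sketch with made-up vectors; `dot`, `q` and `k` are illustrative names.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


ch = 64
q = [0.5 * i for i in range(ch)]
k = [1.0 - 0.01 * i for i in range(ch)]

# Scale both operands before the product, as the module does.
scale = 1 / math.sqrt(math.sqrt(ch))
split_scaled = dot([x * scale for x in q], [y * scale for y in k])

# Conventional form: scale the product afterwards.
post_scaled = dot(q, k) / math.sqrt(ch)

print(math.isclose(split_scaled, post_scaled, rel_tol=1e-9))
```
"""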


class AttentionBlock(nn.Module):
    """
    An attention block that allows spatial positions to attend to each other.

    Originally ported from here, but adapted to the N-d case.
    https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66.
    """

    def __init__(
        self,
        channels,
        num_heads=1,
        num_head_channels=-1,
        do_checkpoint=True,
        relative_pos_embeddings=False,
    ):
        super().__init__()
        self.channels = channels
        self.do_checkpoint = do_checkpoint
        if num_head_channels == -1:
            self.num_heads = num_heads
        else:
            assert (
                channels % num_head_channels == 0
            ), f"q,k,v channels {channels} is not divisible by num_head_channels {num_head_channels}"
            self.num_heads = channels // num_head_channels
        self.norm = normalization(channels)
        self.qkv = nn.Conv1d(channels, channels * 3, 1)
        # split heads before split qkv
        self.attention = QKVAttentionLegacy(self.num_heads)

        self.proj_out = zero_module(nn.Conv1d(channels, channels, 1))
        if relative_pos_embeddings:
            self.relative_pos_embeddings = RelativePositionBias(scale=(channels // self.num_heads) ** .5, causal=False, heads=num_heads, num_buckets=32, max_distance=64)
        else:
            self.relative_pos_embeddings = None

    def forward(self, x, mask=None):
        b, c, *spatial = x.shape
        x = x.reshape(b, c, -1)
        qkv = self.qkv(self.norm(x))
        h = self.attention(qkv, mask, self.relative_pos_embeddings)
        h = self.proj_out(h)
        return (x + h).reshape(b, c, *spatial)


class Upsample(nn.Module):
    """
    An upsampling layer with an optional convolution.

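    :param channels: channels in the inputs.
    :param use_conv: a bool determining if a convolution is applied.
    :param out_channels: channels in the outputs when use_conv is set; defaults to `channels`.
    :param factor: scale factor for nearest-neighbour upsampling along the last axis.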
    """

    def __init__(self, channels, use_conv, out_channels=None, factor=4):
        super().__init__()
        self.channels = channels
        self.out_channels = out_channels or channels
        self.use_conv = use_conv
        self.factor = factor
        if use_conv:
            ksize = 5
            pad = 2
            self.conv = nn.Conv1d(self.channels, self.out_channels, ksize, padding=pad)

    def forward(self, x):
        assert x.shape[1] == self.channels
        x = F.interpolate(x, scale_factor=self.factor, mode="nearest")
        if self.use_conv:
            x = self.conv(x)
        return x


class Downsample(nn.Module):
    """
    A downsampling layer with an optional convolution.

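    :param channels: channels in the inputs.
    :param use_conv: a bool determining if a strided convolution is used; otherwise average pooling is used.
    :param out_channels: channels in the outputs; defaults to `channels` and must equal it without a convolution.
    :param factor: downsampling stride.
    :param ksize: kernel size of the convolution.
    :param pad: padding of the convolution.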
    """

    def __init__(self, channels, use_conv, out_channels=None, factor=4, ksize=5, pad=2):
        super().__init__()
        self.channels = channels
        self.out_channels = out_channels or channels
        self.use_conv = use_conv

        stride = factor
        if use_conv:
            self.op = nn.Conv1d(
                self.channels, self.out_channels, ksize, stride=stride, padding=pad
            )
        else:
            assert self.channels == self.out_channels
            self.op = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, x):
        assert x.shape[1] == self.channels
        return self.op(x)
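

"""
Illustration of the sequence lengths produced by `Downsample` above, using the standard
output-length formula for 1-d convolution and pooling with dilation 1. This is a sketch
that does not need torch; the helper names are hypothetical.

```python
def conv1d_out_length(length, ksize=5, stride=4, pad=2):
    # floor((L + 2 * pad - ksize) / stride) + 1
    return (length + 2 * pad - ksize) // stride + 1


def avgpool1d_out_length(length, factor=4):
    # Kernel size and stride are both `factor`, with no padding.
    return (length - factor) // factor + 1


print(conv1d_out_length(100), conv1d_out_length(101))
print(avgpool1d_out_length(100), avgpool1d_out_length(101))

# Two stacked stride-2 convolutions, as in a depth-2 encoder with factor=2.
length = 100
for _ in range(2):
    length = conv1d_out_length(length, stride=2)
print(length)
```

With the default ksize=5 and pad=2 the convolution rounds the length up (101 -> 26),
whereas average pooling rounds it down (101 -> 25).
"""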


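class ResBlock(nn.Module):
    """
    A residual block with two normalization/SiLU/Conv1d stages, an optional up- or
    down-sampling step, and a skip connection. The final convolution is zero-initialized,
    so the block initially outputs only its skip connection.
    """
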
    def __init__(
            self,
            channels,
            dropout,
            out_channels=None,
            use_conv=False,
            use_scale_shift_norm=False,
            up=False,
            down=False,
            kernel_size=3,
    ):
        super().__init__()
        self.channels = channels
        self.dropout = dropout
        self.out_channels = out_channels or channels
        self.use_conv = use_conv
        self.use_scale_shift_norm = use_scale_shift_norm
        padding = 1 if kernel_size == 3 else 2

        self.in_layers = nn.Sequential(
            normalization(channels),
            nn.SiLU(),
            nn.Conv1d(channels, self.out_channels, kernel_size, padding=padding),
        )

        self.updown = up or down

        if up:
            self.h_upd = Upsample(channels, False)
            self.x_upd = Upsample(channels, False)
        elif down:
            self.h_upd = Downsample(channels, False)
            self.x_upd = Downsample(channels, False)
        else:
            self.h_upd = self.x_upd = nn.Identity()

        self.out_layers = nn.Sequential(
            normalization(self.out_channels),
            nn.SiLU(),
            nn.Dropout(p=dropout),
            zero_module(
                nn.Conv1d(self.out_channels, self.out_channels, kernel_size, padding=padding)
            ),
        )

        if self.out_channels == channels:
            self.skip_connection = nn.Identity()
        elif use_conv:
            self.skip_connection = nn.Conv1d(
                channels, self.out_channels, kernel_size, padding=padding
            )
        else:
            self.skip_connection = nn.Conv1d(channels, self.out_channels, 1)

    def forward(self, x):
        if self.updown:
            in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
            h = in_rest(x)
            h = self.h_upd(h)
            x = self.x_upd(x)
            h = in_conv(h)
        else:
            h = self.in_layers(x)
        h = self.out_layers(h)
        return self.skip_connection(x) + h


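class AudioMiniEncoder(nn.Module):
    """
    Encodes an [N x spec_dim x T] spectrogram into one [N x embedding_dim] vector: ResBlocks
    with strided-convolution downsampling, a 1x1 projection, attention blocks, and finally
    the first time step of the result is returned as the embedding.
    """
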
    def __init__(self,
                 spec_dim,
                 embedding_dim,
                 base_channels=128,
                 depth=2,
                 resnet_blocks=2,
                 attn_blocks=4,
                 num_attn_heads=4,
                 dropout=0,
                 downsample_factor=2,
                 kernel_size=3):
        super().__init__()
        self.init = nn.Sequential(
            nn.Conv1d(spec_dim, base_channels, 3, padding=1)
        )
        ch = base_channels
        res = []
        for l in range(depth):
            for r in range(resnet_blocks):
                res.append(ResBlock(ch, dropout, kernel_size=kernel_size))
            res.append(Downsample(ch, use_conv=True, out_channels=ch*2, factor=downsample_factor))
            ch *= 2
        self.res = nn.Sequential(*res)
        self.final = nn.Sequential(
            normalization(ch),
            nn.SiLU(),
            nn.Conv1d(ch, embedding_dim, 1)
        )
        attn = []
        for a in range(attn_blocks):
            attn.append(AttentionBlock(embedding_dim, num_attn_heads))
        self.attn = nn.Sequential(*attn)
        self.dim = embedding_dim

    def forward(self, x):
        h = self.init(x)
        h = self.res(h)
        h = self.final(h)
        h = self.attn(h)
        return h[:, :, 0]


DEFAULT_MEL_NORM_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../data/mel_norms.pth')


class TorchMelSpectrogram(nn.Module):
    def __init__(self, filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80, mel_fmin=0, mel_fmax=8000,
                 sampling_rate=22050, normalize=False, mel_norm_file=DEFAULT_MEL_NORM_FILE):
        super().__init__()
        # These are the default tacotron values for the MEL spectrogram.
        self.filter_length = filter_length
        self.hop_length = hop_length
        self.win_length = win_length
        self.n_mel_channels = n_mel_channels
        self.mel_fmin = mel_fmin
        self.mel_fmax = mel_fmax
        self.sampling_rate = sampling_rate
        self.mel_stft = torchaudio.transforms.MelSpectrogram(n_fft=self.filter_length, hop_length=self.hop_length,
                                                             win_length=self.win_length, power=2, normalized=normalize,
                                                             sample_rate=self.sampling_rate, f_min=self.mel_fmin,
                                                             f_max=self.mel_fmax, n_mels=self.n_mel_channels,
                                                             norm="slaney")
        self.mel_norm_file = mel_norm_file
        if self.mel_norm_file is not None:
            self.mel_norms = torch.load(self.mel_norm_file)
        else:
            self.mel_norms = None

    def forward(self, inp):
        if len(inp.shape) == 3:  # Automatically squeeze out the channels dimension if it is present (assuming mono-audio)
            inp = inp.squeeze(1)
        assert len(inp.shape) == 2
        if torch.backends.mps.is_available():
            inp = inp.to('cpu')
        self.mel_stft = self.mel_stft.to(inp.device)
        mel = self.mel_stft(inp)
        # Perform dynamic range compression
        mel = torch.log(torch.clamp(mel, min=1e-5))
        if self.mel_norms is not None:
            self.mel_norms = self.mel_norms.to(mel.device)
            mel = mel / self.mel_norms.unsqueeze(0).unsqueeze(-1)
        return mel
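

# Illustrative sketch, not part of the original module: the last two steps of
# TorchMelSpectrogram.forward() (log compression with a floor, then per-channel division by
# the stored norms) restated in plain Python so the arithmetic can be checked by hand.
# `sketch_compress_and_normalize`, `floor` and the nested-list layout are hypothetical names
# chosen for this example; the real code operates on torch tensors.
import math


def sketch_compress_and_normalize(mel, mel_norms=None, floor=1e-5):
    """mel: list of channels, each a list of per-frame power values."""
    out = []
    for channel, frames in enumerate(mel):
        # Dynamic range compression: clamp to the floor, then take the natural log.
        row = [math.log(max(value, floor)) for value in frames]
        if mel_norms is not None:
            # One norm per MEL channel, broadcast over all frames of that channel.
            row = [value / mel_norms[channel] for value in row]
        out.append(row)
    return out


# Usage: a single channel with a power of 1.0 and a silent frame. The silent frame is
# clamped to the floor instead of producing log(0).
_sketch_mel = sketch_compress_and_normalize([[1.0, 0.0]], mel_norms=[2.0])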


class CheckpointedLayer(nn.Module):
    """
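    Wraps a module so it can stand in where a checkpointed layer is expected. In this implementation forward() does
    not call torch.utils.checkpoint: it asserts that no tensor kwarg requires grad, binds the kwargs with
    functools.partial, and calls the wrapped module directly on the positional args.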
    """
    def __init__(self, wrap):
        super().__init__()
        self.wrap = wrap

    def forward(self, x, *args, **kwargs):
        for k, v in kwargs.items():
            assert not (isinstance(v, torch.Tensor) and v.requires_grad)  # This would screw up checkpointing.
        partial = functools.partial(self.wrap, **kwargs)
        return partial(x, *args)


class CheckpointedXTransformerEncoder(nn.Module):
    """
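    Wraps a ContinuousTransformerWrapper, applies CheckpointedLayer to each layer, and permutes the input from
    channels-mid (b,c,s) to the channels-last (b,s,c) layout that XTransformer expects. If exit_permute is set, the
    output is permuted back to channels-mid.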
    """
    def __init__(self, needs_permute=True, exit_permute=True, checkpoint=True, **xtransformer_kwargs):
        super().__init__()
        self.transformer = ContinuousTransformerWrapper(**xtransformer_kwargs)
        self.needs_permute = needs_permute
        self.exit_permute = exit_permute

        if not checkpoint:
            return
        for i in range(len(self.transformer.attn_layers.layers)):
            n, b, r = self.transformer.attn_layers.layers[i]
            self.transformer.attn_layers.layers[i] = nn.ModuleList([n, CheckpointedLayer(b), r])

    def forward(self, x, **kwargs):
        if self.needs_permute:
            x = x.permute(0,2,1)
        h = self.transformer(x, **kwargs)
        if self.exit_permute:
            h = h.permute(0,2,1)
        return h

================================================
FILE: tortoise/models/autoregressive.py
================================================
import functools

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2Config, GPT2PreTrainedModel, LogitsProcessorList
from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
from transformers.utils.model_parallel_utils import get_device_map, assert_device_map
from tortoise.models.arch_util import AttentionBlock
from tortoise.utils.typical_sampling import TypicalLogitsWarper


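def null_position_embeddings(position_ids, dim):
    # Drop-in replacement for GPT-2's built-in positional embedding: always returns zeros.
    return torch.zeros((position_ids.shape[0], position_ids.shape[1], dim), device=position_ids.device)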


class ResBlock(nn.Module):
    """
    Basic residual convolutional block that uses GroupNorm.
    """
    def __init__(self, chan):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(chan, chan, kernel_size=3, padding=1),
            nn.GroupNorm(chan//8, chan),
            nn.ReLU(),
            nn.Conv1d(chan, chan, kernel_size=3, padding=1),
            nn.GroupNorm(chan//8, chan)
        )

    def forward(self, x):
        return F.relu(self.net(x) + x)


class GPT2InferenceModel(GPT2PreTrainedModel):
    def __init__(self, config, gpt, text_pos_emb, embeddings, norm, linear, kv_cache=False):
        super().__init__(config)
        self.transformer = gpt
        self.text_pos_embedding = text_pos_emb
        self.embeddings = embeddings
        self.final_norm = norm
        self.lm_head = nn.Sequential(norm, linear)
        self.kv_cache = kv_cache
        
        # Model parallel
        self.model_parallel = False
        self.device_map = None
        self.cached_mel_emb = None

    def parallelize(self, device_map=None):
        self.device_map = (
            get_device_map(len(self.transformer.h), range(max(1, torch.cuda.device_count())))
            if device_map is None
            else device_map
        )
        assert_device_map(self.device_map, len(self.transformer.h))
        self.transformer.parallelize(self.device_map)
        self.lm_head = self.lm_head.to(self.transformer.first_device)
        self.model_parallel = True

    def deparallelize(self):
        self.transformer.deparallelize()
        self.transformer = self.transformer.to("cpu")
        self.lm_head = self.lm_head.to("cpu")
        self.model_parallel = False
        torch.cuda.empty_cache()
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()
    
    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings
    
    def store_mel_emb(self, mel_emb):
        self.cached_mel_emb = mel_emb

    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
        token_type_ids = kwargs.get("token_type_ids", None)  # usually None
        if not self.kv_cache:
            past_key_values = None
        # only last token for inputs_ids if past is defined in kwargs
        if past_key_values:
            input_ids = input_ids[:, -1].unsqueeze(-1)
            if token_type_ids is not None:
                token_type_ids = token_type_ids[:, -1].unsqueeze(-1)

        attention_mask = kwargs.get("attention_mask", None)
        position_ids = kwargs.get("position_ids", None)

        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -1].unsqueeze(-1)
        else:
            position_ids = None
        return {
            "input_ids": input_ids,
            "past_key_values": past_key_values,
            "use_cache": kwargs.get("use_cache"),
            "position_ids": position_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids,
        }

    def forward(
        self,
        input_ids=None,
        past_key_values=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        labels=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        assert self.cached_mel_emb is not None
        assert inputs_embeds is None  # Not supported by this inference model.
        assert labels is None  # Training not supported by this inference model.
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        # Create embedding
        mel_len = self.cached_mel_emb.shape[1]
        if input_ids.shape[1] != 1:
            text_inputs = input_ids[:, mel_len:]
            text_emb = self.embeddings(text_inputs)
            text_emb = text_emb + self.text_pos_embedding(text_emb)
            if self.cached_mel_emb.shape[0] != text_emb.shape[0]:
                mel_emb = self.cached_mel_emb.repeat_interleave(
                    text_emb.shape[0] // self.cached_mel_emb.shape[0], 0
                )
            else:  # this outcome only occurs once per loop in most cases
                mel_emb = self.cached_mel_emb
            emb = torch.cat([mel_emb, text_emb], dim=1)
        else:
            emb = self.embeddings(input_ids)
            emb = emb + self.text_pos_embedding.get_fixed_embedding(
                attention_mask.shape[1] - mel_len, attention_mask.device
            )
        transformer_outputs = self.transformer(
            inputs_embeds=emb,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]

        # Set device for model parallelism
        if self.model_parallel:
            if torch.backends.mps.is_available():
                self.to(self.transformer.first_device)
            else:
                torch.cuda.set_device(self.transformer.first_device)
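            # lm_head is an nn.Sequential(norm, linear) and has no .weight of its own, so read the device
            # from its parameters instead.
            hidden_states = hidden_states.to(next(self.lm_head.parameters()).device)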

        lm_logits = self.lm_head(hidden_states)

        if not return_dict:
            return (lm_logits,) + transformer_outputs[1:]

        return CausalLMOutputWithCrossAttentions(
            loss=None,
            logits=lm_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
            cross_attentions=transformer_outputs.cross_attentions,
        )

    @staticmethod
    def _reorder_cache(past, beam_idx):
        """
        This function is used to re-order the :obj:`past_key_values` cache if
        :meth:`~transformers.PreTrainedModel.beam_search` or :meth:`~transformers.PreTrainedModel.beam_sample` is
        called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step.
        """
        return tuple(
            tuple(
                past_state.index_select(0, beam_idx.to(past_state.device))
                for past_state in layer_past
            )
            for layer_past in past
        )
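

# Illustrative sketch, not part of the original module: how prepare_inputs_for_generation()
# derives position_ids from an attention mask, restated with plain lists. The function name
# `sketch_position_ids` and the `last_only` flag are hypothetical; `last_only=True` mimics the
# branch taken when past_key_values is present and only the newest position is kept.
from itertools import accumulate


def sketch_position_ids(attention_mask, last_only=False):
    position_ids = []
    for row in attention_mask:
        # Equivalent of attention_mask.long().cumsum(-1) - 1 for one batch element.
        ids = [running_total - 1 for running_total in accumulate(row)]
        # Equivalent of masked_fill_(attention_mask == 0, 1): padded slots get a dummy position of 1.
        ids = [1 if mask == 0 else pos for mask, pos in zip(row, ids)]
        position_ids.append(ids[-1:] if last_only else ids)
    return position_ids


# Usage: a left-padded row. The two padded slots receive the dummy position 1 and the real
# tokens are numbered from 0.
_sketch_positions = sketch_position_ids([[0, 0, 1, 1, 1]])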


class ConditioningEncoder(nn.Module):
    def __init__(self,
                 spec_dim,
                 embedding_dim,
                 attn_blocks=6,
                 num_attn_heads=4,
                 do_checkpointing=False,
                 mean=False):
        super().__init__()
        attn = []
        self.init = nn.Conv1d(spec_dim, embedding_dim, kernel_size=1)
        for a in range(attn_blocks):
            attn.append(AttentionBlock(embedding_dim, num_attn_heads))
        self.attn = nn.Sequential(*attn)
        self.dim = embedding_dim
        self.do_checkpointing = do_checkpointing
        self.mean = mean

    def forward(self, x):
        h = self.init(x)
        h = self.attn(h)
        if self.mean:
            return h.mean(dim=2)
        else:
            return h[:, :, 0]


class LearnedPositionEmbeddings(nn.Module):
    def __init__(self, seq_len, model_dim, init=.02):
        super().__init__()
        self.emb = nn.Embedding(seq_len, model_dim)
        # Initializing this way is standard for GPT-2
        self.emb.weight.data.normal_(mean=0.0, std=init)

    def forward(self, x):
        sl = x.shape[1]
        return self.emb(torch.arange(0, sl, device=x.device))

    def get_fixed_embedding(self, ind, dev):
        return self.emb(torch.tensor([ind], device=dev)).unsqueeze(0)


def build_hf_gpt_transformer(layers, model_dim, heads, max_mel_seq_len, max_text_seq_len, checkpointing):
    """
    GPT-2 implemented by the HuggingFace library.
    """
    from transformers import GPT2Config, GPT2Model
    gpt_config = GPT2Config(vocab_size=256,  # Unused.
                             n_positions=max_mel_seq_len+max_text_seq_len,
                             n_ctx=max_mel_seq_len+max_text_seq_len,
                             n_embd=model_dim,
                             n_layer=layers,
                             n_head=heads,
                             gradient_checkpointing=checkpointing,
                             use_cache=not checkpointing)
    gpt = GPT2Model(gpt_config)
    # Override the built in positional embeddings
    del gpt.wpe
    gpt.wpe = functools.partial(null_position_embeddings, dim=model_dim)
    # Built-in token embeddings are unused.
    del gpt.wte
    return gpt, LearnedPositionEmbeddings(max_mel_seq_len, model_dim), LearnedPositionEmbeddings(max_text_seq_len, model_dim),\
           None, None


class MelEncoder(nn.Module):
    def __init__(self, channels, mel_channels=80, resblocks_per_reduction=2):
        super().__init__()
        self.channels = channels
        self.encoder = nn.Sequential(nn.Conv1d(mel_channels, channels//4, kernel_size=3, padding=1),
                                     nn.Sequential(*[ResBlock(channels//4) for _ in range(resblocks_per_reduction)]),
                                     nn.Conv1d(channels//4, channels//2, kernel_size=3, stride=2, padding=1),
                                     nn.GroupNorm(channels//16, channels//2),
                                     nn.ReLU(),
                                     nn.Sequential(*[ResBlock(channels//2) for _ in range(resblocks_per_reduction)]),
                                     nn.Conv1d(channels//2, channels, kernel_size=3, stride=2, padding=1),
                                     nn.GroupNorm(channels//8, channels),
                                     nn.ReLU(),
                                     nn.Sequential(*[ResBlock(channels) for _ in range(resblocks_per_reduction)]),
                                     )
        self.reduction = 4


    def forward(self, x):
        for e in self.encoder:
            x = e(x)
        return x.permute(0,2,1)
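

# Illustrative sketch, not part of the original module: why MelEncoder sets self.reduction = 4.
# The two stride-2 convolutions (kernel_size=3, padding=1) each roughly halve the sequence
# length, while the stride-1 convolutions with the same kernel and padding keep it unchanged.
# The helper names below are hypothetical; the formula is the standard Conv1d output-length
# formula with dilation 1.
def sketch_conv1d_out_len(length, kernel_size=3, stride=1, padding=1):
    return (length + 2 * padding - kernel_size) // stride + 1


def sketch_mel_encoder_out_len(mel_frames):
    length = sketch_conv1d_out_len(mel_frames, stride=1)  # initial conv and ResBlocks keep the length
    length = sketch_conv1d_out_len(length, stride=2)      # first reduction
    length = sketch_conv1d_out_len(length, stride=2)      # second reduction
    return length


# Usage: 800 MEL frames come out as 200 encoder positions, a 4x reduction.
_sketch_encoded_len = sketch_mel_encoder_out_len(800)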


class UnifiedVoice(nn.Module):
    def __init__(self, layers=8, model_dim=512, heads=8, max_text_tokens=120, max_mel_tokens=250, max_conditioning_inputs=1,
                 mel_length_compression=1024, number_text_tokens=256,
                 start_text_token=None, number_mel_codes=8194, start_mel_token=8192,
                 stop_mel_token=8193, train_solo_embeddings=False, use_mel_codes_as_input=True,
                 checkpointing=True, types=1):
        """
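        Args:
            layers: Number of layers in transformer stack.
            model_dim: Operating dimensions of the transformer.
            heads: Number of transformer heads. model_dim must be divisible by heads. Recommend model_dim//64.
            max_text_tokens: Maximum number of text tokens that will be encountered by model.
            max_mel_tokens: Maximum number of MEL tokens that will be encountered by model.
            max_conditioning_inputs: Maximum number of conditioning inputs provided to the model. If (1), conditioning input can be of format (b,80,s), otherwise (b,n,80,s).
            mel_length_compression: The factor between <number_input_samples> and <mel_tokens>. Used to compute MEL code padding given wav input length.
            number_text_tokens: Size of the text vocabulary for a single type. The text embedding and text head hold number_text_tokens*types+1 entries.
            start_text_token: ID of the text start token. Defaults to number_text_tokens*types when None. The stop text token is not configurable and is fixed at 0.
            number_mel_codes: Size of the MEL code vocabulary used by the MEL embedding and the MEL head.
            start_mel_token: ID of the MEL start token.
            stop_mel_token: ID of the MEL stop token. Also used to fill padding, see set_mel_padding().
            train_solo_embeddings: If True, creates learnable mel_solo_embedding and text_solo_embedding parameters; otherwise both are set to 0.
            use_mel_codes_as_input: If True, MEL inputs are discrete codes embedded with nn.Embedding; otherwise raw MELs are encoded by MelEncoder.
            checkpointing: Passed to GPT2Config as gradient_checkpointing; use_cache is set to its negation.
            types: Number of text types. Types are expressed by expanding the text embedding space, see forward().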
        """
        super().__init__()

        self.number_text_tokens = number_text_tokens
        self.start_text_token = number_text_tokens * types if start_text_token is None else start_text_token
        self.stop_text_token = 0
        self.number_mel_codes = number_mel_codes
        self.start_mel_token = start_mel_token
        self.stop_mel_token = stop_mel_token
        self.layers = layers
        self.heads = heads
        self.max_mel_tokens = max_mel_tokens
        self.max_text_tokens = max_text_tokens
        self.model_dim = model_dim
        self.max_conditioning_inputs = max_conditioning_inputs
        self.mel_length_compression = mel_length_compression
        self.conditioning_encoder = ConditioningEncoder(80, model_dim, num_attn_heads=heads)
        self.text_embedding = nn.Embedding(self.number_text_tokens*types+1, model_dim)
        if use_mel_codes_as_input:
            self.mel_embedding = nn.Embedding(self.number_mel_codes, model_dim)
        else:
            self.mel_embedding = MelEncoder(model_dim, resblocks_per_reduction=1)
        self.gpt, self.mel_pos_embedding, self.text_pos_embedding, self.mel_layer_pos_embedding, self.text_layer_pos_embedding = \
            build_hf_gpt_transformer(layers, model_dim, heads, self.max_mel_tokens+2+self.max_conditioning_inputs, self.max_text_tokens+2, checkpointing)
        if train_solo_embeddings:
            self.mel_solo_embedding = nn.Parameter(torch.randn(1, 1, model_dim) * .02, requires_grad=True)
            self.text_solo_embedding = nn.Parameter(torch.randn(1, 1, model_dim) * .02, requires_grad=True)
        else:
            self.mel_solo_embedding = 0
            self.text_solo_embedding = 0

        self.final_norm = nn.LayerNorm(model_dim)
        self.text_head = nn.Linear(model_dim, self.number_text_tokens*types+1)
        self.mel_head = nn.Linear(model_dim, self.number_mel_codes)

        # Initialize the embeddings per the GPT-2 scheme
        embeddings = [self.text_embedding]
        if use_mel_codes_as_input:
            embeddings.append(self.mel_embedding)
        for module in embeddings:
            module.weight.data.normal_(mean=0.0, std=.02)

    def post_init_gpt2_config(self, use_deepspeed=False, kv_cache=False, half=False):
        seq_length = self.max_mel_tokens + self.max_text_tokens + 2
        gpt_config = GPT2Config(
            vocab_size=self.max_mel_tokens,
            n_positions=seq_length,
            n_ctx=seq_length,
            n_embd=self.model_dim,
            n_layer=self.layers,
            n_head=self.heads,
            gradient_checkpointing=False,
            use_cache=True,
        )
        self.inference_model = GPT2InferenceModel(
            gpt_config,
            self.gpt,
            self.mel_pos_embedding,
            self.mel_embedding,
            self.final_norm,
            self.mel_head,
            kv_cache=kv_cache,
        )
        if use_deepspeed and torch.cuda.is_available():
            import deepspeed
            # Same DeepSpeed setup for both precisions; only the dtype differs.
            self.ds_engine = deepspeed.init_inference(model=self.inference_model,
                                                    mp_size=1,
                                                    replace_with_kernel_inject=True,
                                                    dtype=torch.float16 if half else torch.float32)
            self.inference_model = self.ds_engine.module.eval()
        else:
            self.inference_model = self.inference_model.eval()

        # self.inference_model = PrunedGPT2InferenceModel(gpt_config, self.gpt, self.mel_pos_embedding, self.mel_embedding, self.final_norm, self.mel_head)
        self.gpt.wte = self.mel_embedding

    def build_aligned_inputs_and_targets(self, input, start_token, stop_token):
        inp = F.pad(input, (1,0), value=start_token)
        tar = F.pad(input, (0,1), value=stop_token)
        return inp, tar

    def set_mel_padding(self, mel_input_tokens, wav_lengths):
        """
        Given mel tokens that are derived from a padded audio clip and the actual lengths of each batch element in
        that audio clip, reformats the tokens with STOP_MEL_TOKEN in place of the zero padding. This is required
        preformatting to create a working TTS model.
        """
        # Set padding areas within MEL (currently it is coded with the MEL code for <zero>).
        mel_lengths = torch.div(wav_lengths, self.mel_length_compression, rounding_mode='trunc')
        for b in range(len(mel_lengths)):
            actual_end = mel_lengths[b] + 1  # Due to the convolutional nature of how these tokens are generated, it would be best if the model predicts a token past the actual last token.
            if actual_end < mel_input_tokens.shape[-1]:
                mel_input_tokens[b, actual_end:] = self.stop_mel_token
        return mel_input_tokens

    def get_logits(self, speech_conditioning_inputs, first_inputs, first_head, second_inputs=None, second_head=None, get_attns=False, return_latent=False):
        if second_inputs is not None:
            emb = torch.cat([speech_conditioning_inputs, first_inputs, second_inputs], dim=1)
        else:
            emb = torch.cat([speech_conditioning_inputs, first_inputs], dim=1)

        gpt_out = self.gpt(inputs_embeds=emb, return_dict=True, output_attentions=get_attns)
        if get_attns:
            return gpt_out.attentions

        enc = gpt_out.last_hidden_state[:, 1:]  # The first logit is tied to the speech_conditioning_input
        enc = self.final_norm(enc)

        if return_latent:
            return enc[:, speech_conditioning_inputs.shape[1]:speech_conditioning_inputs.shape[1]+first_inputs.shape[1]], enc[:, -second_inputs.shape[1]:]

        first_logits = enc[:, :first_inputs.shape[1]]
        first_logits = first_head(first_logits)
        first_logits = first_logits.permute(0,2,1)
        if second_inputs is not None:
            second_logits = enc[:, -second_inputs.shape[1]:]
            second_logits = second_head(second_logits)
            second_logits = second_logits.permute(0,2,1)
            return first_logits, second_logits
        else:
            return first_logits

    def get_conditioning(self, speech_conditioning_input):
        speech_conditioning_input = speech_conditioning_input.unsqueeze(1) if len(
            speech_conditioning_input.shape) == 3 else speech_conditioning_input
        conds = []
        for j in range(speech_conditioning_input.shape[1]):
            conds.append(self.conditioning_encoder(speech_conditioning_input[:, j]))
        conds = torch.stack(conds, dim=1)
        conds = conds.mean(dim=1)
        return conds

    def forward(self, speech_conditioning_latent, text_inputs, text_lengths, mel_codes, wav_lengths, types=None, text_first=True, raw_mels=None, return_attentions=False,
                return_latent=False, clip_inputs=True):
        """
        Forward pass that uses both text and voice in either text conditioning mode or voice conditioning mode
        (actuated by `text_first`).

        speech_conditioning_latent: float tensor, (b,model_dim), e.g. the output of get_conditioning()
        text_inputs: long tensor, (b,t)
        text_lengths: long tensor, (b,)
        mel_codes:  long tensor, (b,m)
        wav_lengths: long tensor, (b,)
        raw_mels: MEL float tensor (b,80,s)

        If return_attentions is specified, only logits are returned.
        If return_latent is specified, loss & logits are not computed or returned. Only the predicted latents are returned.
        If clip_inputs is True, the inputs will be clipped to the smallest input size across each input modality.
        """
        # Types are expressed by expanding the text embedding space.
        if types is not None:
            text_inputs = text_inputs * (1+types).unsqueeze(-1)

        if clip_inputs:
            # This model will receive micro-batches with a ton of padding for both the text and MELs. Ameliorate this by
            # chopping the inputs by the maximum actual length.
            max_text_len = text_lengths.max()
            text_inputs = text_inputs[:, :max_text_len]
            max_mel_len = wav_lengths.max() // self.mel_length_compression
            mel_codes = mel_codes[:, :max_mel_len]
            if raw_mels is not None:
                raw_mels = raw_mels[:, :, :max_mel_len*4]
        mel_codes = self.set_mel_padding(mel_codes, wav_lengths)
        text_inputs = F.pad(text_inputs, (0,1), value=self.stop_text_token)
        mel_codes = F.pad(mel_codes, (0,1), value=self.stop_mel_token)

        conds = speech_conditioning_latent.unsqueeze(1)
        text_inputs, text_targets = self.build_aligned_inputs_and_targets(text_inputs, self.start_text_token, self.stop_text_token)
        text_emb = self.text_embedding(text_inputs) + self.text_pos_embedding(text_inputs)
        mel_codes, mel_targets = self.build_aligned_inputs_and_targets(mel_codes, self.start_mel_token, self.stop_mel_token)
        if raw_mels is not None:
            mel_inp = F.pad(raw_mels, (0, 8))
        else:
            mel_inp = mel_codes
        mel_emb = self.mel_embedding(mel_inp)
        mel_emb = mel_emb + self.mel_pos_embedding(mel_codes)

        if text_first:
            text_logits, mel_logits = self.get_logits(conds, text_emb, self.text_head, mel_emb, self.mel_head, get_attns=return_attentions, return_latent=return_latent)
            if return_latent:
                return mel_logits[:, :-2]  # Despite the name, these are not logits. Strip off the two tokens added by this forward pass.
        else:
            mel_logits, text_logits = self.get_logits(conds, mel_emb, self.mel_head, text_emb, self.text_head, get_attns=return_attentions, return_latent=return_latent)
            if return_latent:
                return text_logits[:, :-2]  # Despite the name, these are not logits. Strip off the two tokens added by this forward pass.

        if return_attentions:
            return mel_logits
        loss_text = F.cross_entropy(text_logits, text_targets.long())
        loss_mel = F.cross_entropy(mel_logits, mel_targets.long())
        return loss_text.mean(), loss_mel.mean(), mel_logits

    def compute_embeddings(
        self,
        cond_latents,
        text_inputs,
    ):
        text_inputs = F.pad(text_inputs, (0, 1), value=self.stop_text_token)
        text_inputs = F.pad(text_inputs, (1, 0), value=self.start_text_token)
        emb = self.text_embedding(text_inputs) + self.text_pos_embedding(text_inputs)
        conds = cond_latents.unsqueeze(1)
        emb = torch.cat([conds, emb], dim=1)
        self.inference_model.store_mel_emb(emb)
        gpt_inputs = torch.full(
            (
                emb.shape[0],
                emb.shape[1] + 1,  # +1 for the start_mel_token
            ),
            fill_value=1,
            dtype=torch.long,
            device=text_inputs.device,
        )
        gpt_inputs[:, -1] = self.start_mel_token
        return gpt_inputs

    def inference_speech(self, speech_conditioning_latent, text_inputs, input_tokens=None, num_return_sequences=1,
                         max_generate_length=None, typical_sampling=False, typical_mass=.9, **hf_generate_kwargs):
        text_inputs = F.pad(text_inputs, (0, 1), value=self.stop_text_token)
        text_inputs, _ = self.build_aligned_inputs_and_targets(text_inputs, self.start_text_token, self.stop_text_token)
        text_emb = self.text_embedding(text_inputs) + self.text_pos_embedding(text_inputs)

        conds = speech_conditioning_latent.unsqueeze(1)
        emb = torch.cat([conds, text_emb], dim=1)
        self.inference_model.store_mel_emb(emb)

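        # emb already contains the conditioning latent and conds.shape[1] == 1, so the length below is
        # emb.shape[1] + 1: one extra position for the start_mel_token written into the last slot.
        fake_inputs = torch.full((emb.shape[0], conds.shape[1] + emb.shape[1],), fill_value=1, dtype=torch.long,
                                 device=text_inputs.device)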
        fake_inputs[:, -1] = self.start_mel_token
        trunc_index = fake_inputs.shape[1]
        if input_tokens is None:
            inputs = fake_inputs
        else:
            assert num_return_sequences % input_tokens.shape[0] == 0, "The number of return sequences must be divisible by the number of input sequences"
            fake_inputs = fake_inputs.repeat(num_return_sequences, 1)
            input_tokens = input_tokens.repeat(num_return_sequences // input_tokens.shape[0], 1)
            inputs = torch.cat([fake_inputs, input_tokens], dim=1)

        logits_processor = LogitsProcessorList([TypicalLogitsWarper(mass=typical_mass)]) if typical_sampling else LogitsProcessorList()
        max_length = trunc_index + self.max_mel_tokens - 1 if max_generate_length is None else trunc_index + max_generate_length
        gen = self.inference_model.generate(inputs, bos_token_id=self.start_mel_token, pad_token_id=self.stop_mel_token, eos_token_id=self.stop_mel_token,
                                            max_length=max_length, logits_processor=logits_processor,
                                            num_return_sequences=num_return_sequences, **hf_generate_kwargs)
        return gen[:, trunc_index:]

    def get_generator(self, fake_inputs, **hf_generate_kwargs):
        return self.inference_model.generate_stream(
            fake_inputs,
            bos_token_id=self.start_mel_token,
            pad_token_id=self.stop_mel_token,
            eos_token_id=self.stop_mel_token,
            max_length=500,
            do_stream=True,
            **hf_generate_kwargs,
        )
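

# Illustrative sketch, not part of the original module: the token bookkeeping done by
# UnifiedVoice.set_mel_padding() and build_aligned_inputs_and_targets(), restated with plain
# lists. The function names and the small token values in the usage lines are hypothetical.
# The defaults mirror the constructor defaults above (mel_length_compression=1024,
# stop_mel_token=8193).
def sketch_set_mel_padding(mel_tokens, wav_lengths, compression=1024, stop_token=8193):
    padded = []
    for row, wav_len in zip(mel_tokens, wav_lengths):
        # Keep one token past the computed MEL length, as the original does.
        actual_end = wav_len // compression + 1
        padded.append(row[:actual_end] + [stop_token] * max(0, len(row) - actual_end))
    return padded


def sketch_aligned_inputs_and_targets(tokens, start_token, stop_token):
    # Inputs are shifted right by a start token; targets are extended by a stop token.
    return [start_token] + tokens, tokens + [stop_token]


# Usage: a clip of 2048 samples covers 2 MEL tokens, so positions from index 3 onward are
# overwritten with the stop token before inputs and targets are aligned.
_sketch_padded = sketch_set_mel_padding([[5, 6, 7, 8, 9]], [2048], stop_token=99)
_sketch_inp, _sketch_tar = sketch_aligned_inputs_and_targets(_sketch_padded[0], 1, 99)
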


if __name__ == '__main__':
    # Smoke test: run one training-style forward pass on random data.
    gpt = UnifiedVoice(model_dim=256, heads=4, train_solo_embeddings=True, use_mel_codes_as_input=True, max_conditioning_inputs=4)
    # forward() expects a conditioning latent of shape (b,model_dim), so encode the raw (b,n,80,s) MELs first.
    cond_latent = gpt.get_conditioning(torch.randn(2, 3, 80, 800))
    l = gpt(cond_latent,
            torch.randint(high=120, size=(2,120)),
            torch.tensor([32, 120]),
            torch.randint(high=8192, size=(2,250)),
            torch.tensor([250*256,195*256]))


================================================
FILE: tortoise/models/classifier.py
================================================
import torch
import torch.nn as nn

from tortoise.models.arch_util import Upsample, Downsample, normalization, zero_module, AttentionBlock


class ResBlock(nn.Module):
    def __init__(
        self,
        channels,
        dropout,
        out_channels=None,
        use_conv=False,
        use_scale_shift_norm=False,
        dims=2,
        up=False,
        down=False,
        kernel_size=3,
        do_checkpoint=True,
    ):
        super().__init__()
        self.channels = channels
        self.dropout = dropout
        self.out_channels = out_channels or channels
        self.use_conv = use_conv
        self.use_scale_shift_norm = use_scale_shift_norm
        self.do_checkpoint = do_checkpoint
        padding = 1 if kernel_size == 3 else 2

        self.in_layers = nn.Sequential(
            normalization(channels),
            nn.SiLU(),
            nn.Conv1d(channels, self.out_channels, kernel_size, padding=padding),
        )

        self.updown = up or down

        if up:
            self.h_upd = Upsample(channels, False, dims)
            self.x_upd = Upsample(channels, False, dims)
        elif down:
            self.h_upd = Downsample(channels, False, dims)
            self.x_upd = Downsample(channels, False, dims)
        else:
            self.h_upd = self.x_upd = nn.Identity()

        self.out_layers = nn.Sequential(
            normalization(self.out_channels),
            nn.SiLU(),
            nn.Dropout(p=dropout),
            zero_module(
                nn.Conv1d(self.out_channels, self.out_channels, kernel_size, padding=padding)
            ),
        )

        if self.out_channels == channels:
            self.skip_connection = nn.Identity()
        elif use_conv:
            # nn.Conv1d takes (in_channels, out_channels, kernel_size, ...); `dims` is not one of its arguments.
            self.skip_connection = nn.Conv1d(
                channels, self.out_channels, kernel_size, padding=padding
            )
        else:
            self.skip_connection = nn.Conv1d(channels, self.out_channels, 1)

    def forward(self, x):
        if self.updown:
            in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
            h = in_rest(x)
            h = self.h_upd(h)
            x = self.x_upd(x)
            h = in_conv(h)
        else:
            h = self.in_layers(x)
        h = self.out_layers(h)
        return self.skip_connection(x) + h


class AudioMiniEncoder(nn.Module):
    def __init__(self,
                 spec_dim,
                 embedding_dim,
                 base_channels=128,
                 depth=2,
                 resnet_blocks=2,
                 attn_blocks=4,
                 num_attn_heads=4,
                 dropout=0,
                 downsample_factor=2,
                 kernel_size=3):
        super().__init__()
        self.init = nn.Sequential(
            nn.Conv1d(spec_dim, base_channels, 3, padding=1)
        )
        ch = base_channels
        res = []
        self.layers = depth
        for l in range(depth):
            for r in range(resnet_blocks):
                res.append(ResBlock(ch, dropout, do_checkpoint=False, kernel_size=kernel_size))
            res.append(Downsample(ch, use_conv=True, out_channels=ch*2, factor=downsample_factor))
            ch *= 2
        self.res = nn.Sequential(*res)
        self.final = nn.Sequential(
            normalization(ch),
            nn.SiLU(),
            nn.Conv1d(ch, embedding_dim, 1)
        )
        attn = []
        for a in range(attn_blocks):
            attn.append(AttentionBlock(embedding_dim, num_attn_heads, do_checkpoint=False))
        self.attn = nn.Sequential(*attn)
        self.dim = embedding_dim

    def forward(self, x):
        h = self.init(x)
        h = self.res(h)
        h = self.final(h)
        for blk in self.attn:
            h = blk(h)
        return h[:, :, 0]
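

# Illustrative sketch (not part of the original module): how AudioMiniEncoder's channel width
# and time resolution evolve across its `depth` stages. Channels double after every stage, as
# in __init__ above. The length formula is an assumption: it supposes each Downsample is a
# stride-`downsample_factor` convolution with kernel size 5 and padding 2 (Downsample is
# defined in arch_util.py, which is not shown here). `_encoder_shapes_sketch` is a
# hypothetical name used only for this example.
def _encoder_shapes_sketch(seq_len, base_channels=128, depth=2, downsample_factor=2):
    shapes = [(base_channels, seq_len)]  # (channels, length) after self.init
    ch, length = base_channels, seq_len
    for _ in range(depth):
        ch *= 2  # each stage ends in Downsample(ch, out_channels=ch*2)
        length = (length + 2 * 2 - 5) // downsample_factor + 1  # assumed conv output length
        shapes.append((ch, length))
    return shapes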


class AudioMiniEncoderWithClassifierHead(nn.Module):
    def __init__(self, classes, distribute_zero_label=True, **kwargs):
        super().__init__()
        self.enc = AudioMiniEncoder(**kwargs)
        self.head = nn.Linear(self.enc.dim, classes)
        self.num_classes = classes
        self.distribute_zero_label = distribute_zero_label

    def forward(self, x, labels=None):
        h = self.enc(x)
        logits = self.head(h)
        if labels is None:
            return logits
        else:
            if self.distribute_zero_label:
                oh_labels = nn.functional.one_hot(labels, num_classes=self.num_classes)
                zeros_indices = (labels == 0).unsqueeze(-1)
                # Distribute 20% of the probability mass on all classes when zero is specified, to compensate for dataset noise.
                zero_extra_mass = torch.full_like(oh_labels, dtype=torch.float, fill_value=.2/(self.num_classes-1))
                zero_extra_mass[:, 0] = -.2
                zero_extra_mass = zero_extra_mass * zeros_indices
                oh_labels = oh_labels + zero_extra_mass
            else:
                oh_labels = labels
            loss = nn.functional.cross_entropy(logits, oh_labels)
            return loss
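

# Illustrative sketch (not part of the original module): the soft-target arithmetic that
# forward() applies when distribute_zero_label is set, written with plain Python lists so it
# can be run without torch. The helper name `_soft_targets_sketch` and the `zero_mass`
# parameter are hypothetical; the 0.2 default mirrors the constant hard-coded in forward().
def _soft_targets_sketch(labels, num_classes, zero_mass=0.2):
    targets = []
    for label in labels:
        if label == 0:
            # Move `zero_mass` off class 0 and share it equally among the other classes.
            row = [1.0 - zero_mass] + [zero_mass / (num_classes - 1)] * (num_classes - 1)
        else:
            # Non-zero labels keep a plain one-hot target.
            row = [0.0] * num_classes
            row[label] = 1.0
        targets.append(row)
    return targets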


================================================
FILE: tortoise/models/clvp.py
================================================
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import einsum

from tortoise.models.arch_util import CheckpointedXTransformerEncoder
from tortoise.models.transformer import Transformer
from tortoise.models.xtransformers import Encoder


def exists(val):
    return val is not None


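def masked_mean(t, mask, dim = 1):
    # Mean of t over `dim`, ignoring positions where mask is False. The mask indexing below assumes t is (batch, seq, feature).
    t = t.masked_fill(~mask[:, :, None], 0.)
    return t.sum(dim = dim) / mask.sum(dim = dim)[..., None]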
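

# Illustrative sketch (not part of the original module): what masked_mean computes for a
# single batch element, using plain Python lists instead of tensors. `_masked_mean_sketch`
# is a hypothetical name. It assumes at least one position is kept by the mask.
def _masked_mean_sketch(seq, mask):
    # seq: list of feature vectors, one per position; mask: list of bools, True = keep.
    kept = [vec for vec, keep in zip(seq, mask) if keep]
    # Average each feature column over the kept positions only.
    return [sum(col) / len(kept) for col in zip(*kept)]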

class CLVP(nn.Module):
    """
    CLIP model retrofitted for performing contrastive evaluation between tokenized audio data and the corresponding
    transcribed text.

    Originally from https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/dalle_pytorch.py
    """

    def __init__(
            self,
            *,
            dim_text=512,
            dim_speech=512,
            dim_latent=512,
            num_text_tokens=256,
            text_enc_depth=6,
            text_seq_len=120,
            text_heads=8,
            num_speech_tokens=8192,
            speech_enc_depth=6,
            speech_heads=8,
            speech_seq_len=250,
            text_mask_percentage=0,
            voice_mask_percentage=0,
            wav_token_compression=1024,
            use_xformers=False,
    ):
        super().__init__()
        self.text_emb = nn.Embedding(num_text_tokens, dim_text)
        self.to_text_latent = nn.Linear(dim_text, dim_latent, bias=False)

        self.speech_emb = nn.Embedding(num_speech_tokens, dim_speech)
        self.to_speech_latent = nn.Linear(dim_speech, dim_latent, bias=False)

        if use_xformers:
            self.text_transformer = CheckpointedXTransformerEncoder(
                needs_permute=False,
                exit_permute=False,
                max_seq_len=-1,
                attn_layers=Encoder(
                    dim=dim_text,
                    depth=text_enc_depth,
                    heads=text_heads,
                    ff_dropout=.1,
                    ff_mult=2,
                    attn_dropout=.1,
                    use_rmsnorm=True,
                    ff_glu=True,
                    rotary_pos_emb=True,
                ))
            self.speech_transformer = CheckpointedXTransformerEncoder(
                needs_permute=False,
                exit_permute=False,
                max_seq_len=-1,
                attn_layers=Encoder(
                    dim=dim_speech,
                    depth=speech_enc_depth,
                    heads=speech_heads,
                    ff_dropout=.1,
                    ff_mult=2,
                    attn_dropout=.1,
                    use_rmsnorm=True,
                    ff_glu=True,
                    rotary_pos_emb=True,
                ))
        else:
            self.text_transformer = Transformer(causal=False, seq_len=text_seq_len, dim=dim_text, depth=text_enc_depth,
                                                heads=text_heads)
            self.speech_transformer = Transformer(causal=False, seq_len=speech_seq_len, dim=dim_speech,
                                                  depth=speech_enc_depth, heads=speech_heads)

        self.temperature = nn.Parameter(torch.tensor(1.))
        self.text_mask_percentage = text_mask_percentage
        self.voice_mask_percentage = voice_mask_percentage
        self.wav_token_compression = wav_token_compression
        self.xformers = use_xformers
        if not use_xformers:
            self.text_pos_emb = nn.Embedding(text_seq_len, dim_text)
            self.speech_pos_emb = nn.Embedding(num_speech_tokens, dim_speech)

    def forward(
            self,
            text,
            speech_tokens,
            return_loss=False
    ):
        b, device = text.shape[0], text.device
        if self.training:
            text_mask = torch.rand_like(text.float()) > self.text_mask_percentage
            voice_mask = torch.rand_like(speech_tokens.float()) > self.voice_mask_percentage
        else:
            text_mask = torch.ones_like(text.float()).bool()
            voice_mask = torch.ones_like(speech_tokens.float()).bool()

        text_emb = self.text_emb(text)
        speech_emb = self.speech_emb(speech_tokens)

        if not self.xformers:
            text_emb += self.text_pos_emb(torch.arange(text.shape[1], device=device))
            speech_emb += self.speech_pos_emb(torch.arange(speech_emb.shape[1], device=device))

        enc_text = self.text_transformer(text_emb, mask=text_mask)
        enc_speech = self.speech_transformer(speech_emb, mask=voice_mask)

        text_latents = masked_mean(enc_text, text_mask, dim=1)
        speech_latents = masked_mean(enc_speech, voice_mask, dim=1)

        text_latents = self.to_text_latent(text_latents)
        speech_latents = self.to_speech_latent(speech_latents)

        text_latents, speech_latents = map(lambda t: F.normalize(t, p=2
SYMBOL INDEX (428 symbols across 24 files)

FILE: tortoise/api.py
  function get_model_path (line 42) | def get_model_path(model_name, models_dir=MODELS_DIR):
  function pad_or_truncate (line 52) | def pad_or_truncate(t, length):
  function load_discrete_vocoder_diffuser (line 64) | def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired...
  function format_conditioning (line 73) | def format_conditioning(clip, cond_length=132300, device="cuda" if not t...
  function fix_autoregressive_output (line 87) | def fix_autoregressive_output(codes, stop_token, complain=True):
  function do_spectrogram_diffusion (line 117) | def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditi...
  function classify_audio_clip (line 133) | def classify_audio_clip(clip):
  function pick_best_batch_size_for_gpu (line 148) | def pick_best_batch_size_for_gpu():
  class TextToSpeech (line 174) | class TextToSpeech:
    method __init__ (line 179) | def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR,
    method temporary_cuda (line 246) | def temporary_cuda(self, model):
    method load_cvvp (line 252) | def load_cvvp(self):
    method get_conditioning_latents (line 258) | def get_conditioning_latents(self, voice_samples, return_mels=False):
    method get_random_conditioning_latents (line 301) | def get_random_conditioning_latents(self):
    method tts_with_preset (line 311) | def tts_with_preset(self, text, preset='fast', **kwargs):
    method tts (line 334) | def tts(self, text, voice_samples=None, conditioning_latents=None, k=1...
    method deterministic_state (line 598) | def deterministic_state(self, seed=None):

FILE: tortoise/api_fast.py
  function get_model_path (line 41) | def get_model_path(model_name, models_dir=MODELS_DIR):
  function pad_or_truncate (line 51) | def pad_or_truncate(t, length):
  function load_discrete_vocoder_diffuser (line 63) | def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired...
  function format_conditioning (line 72) | def format_conditioning(clip, cond_length=132300, device="cuda" if not t...
  function fix_autoregressive_output (line 86) | def fix_autoregressive_output(codes, stop_token, complain=True):
  function do_spectrogram_diffusion (line 116) | def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditi...
  function classify_audio_clip (line 132) | def classify_audio_clip(clip):
  function pick_best_batch_size_for_gpu (line 147) | def pick_best_batch_size_for_gpu():
  class TextToSpeech (line 173) | class TextToSpeech:
    method __init__ (line 178) | def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR,
    method get_conditioning_latents (line 230) | def get_conditioning_latents(self, voice_samples, return_mels=False):
    method get_random_conditioning_latents (line 253) | def get_random_conditioning_latents(self):
    method tts_with_preset (line 261) | def tts_with_preset(self, text, preset='fast', **kwargs):
    method handle_chunks (line 285) | def handle_chunks(self, wav_gen, wav_gen_prev, wav_overlap, overlap_len):
    method tts_stream (line 311) | def tts_stream(self, text, voice_samples=None, conditioning_latents=No...
    method tts (line 421) | def tts(self, text, voice_samples=None, k=1, verbose=True, use_determi...
    method deterministic_state (line 504) | def deterministic_state(self, seed=None):

FILE: tortoise/models/arch_util.py
  function zero_module (line 12) | def zero_module(module):
  class GroupNorm32 (line 21) | class GroupNorm32(nn.GroupNorm):
    method forward (line 22) | def forward(self, x):
  function normalization (line 26) | def normalization(channels):
  class QKVAttentionLegacy (line 44) | class QKVAttentionLegacy(nn.Module):
    method __init__ (line 49) | def __init__(self, n_heads):
    method forward (line 53) | def forward(self, qkv, mask=None, rel_pos=None):
  class AttentionBlock (line 80) | class AttentionBlock(nn.Module):
    method __init__ (line 88) | def __init__(
    method forward (line 117) | def forward(self, x, mask=None):
  class Upsample (line 126) | class Upsample(nn.Module):
    method __init__ (line 134) | def __init__(self, channels, use_conv, out_channels=None, factor=4):
    method forward (line 145) | def forward(self, x):
  class Downsample (line 153) | class Downsample(nn.Module):
    method __init__ (line 161) | def __init__(self, channels, use_conv, out_channels=None, factor=4, ks...
    method forward (line 176) | def forward(self, x):
  class ResBlock (line 181) | class ResBlock(nn.Module):
    method __init__ (line 182) | def __init__(
    method forward (line 236) | def forward(self, x):
  class AudioMiniEncoder (line 249) | class AudioMiniEncoder(nn.Module):
    method __init__ (line 250) | def __init__(self,
    method forward (line 284) | def forward(self, x):
  class TorchMelSpectrogram (line 295) | class TorchMelSpectrogram(nn.Module):
    method __init__ (line 296) | def __init__(self, filter_length=1024, hop_length=256, win_length=1024...
    method forward (line 318) | def forward(self, inp):
  class CheckpointedLayer (line 334) | class CheckpointedLayer(nn.Module):
    method __init__ (line 339) | def __init__(self, wrap):
    method forward (line 343) | def forward(self, x, *args, **kwargs):
  class CheckpointedXTransformerEncoder (line 350) | class CheckpointedXTransformerEncoder(nn.Module):
    method __init__ (line 355) | def __init__(self, needs_permute=True, exit_permute=True, checkpoint=T...
    method forward (line 367) | def forward(self, x, **kwargs):

FILE: tortoise/models/autoregressive.py
  function null_position_embeddings (line 13) | def null_position_embeddings(range, dim):
  class ResBlock (line 17) | class ResBlock(nn.Module):
    method __init__ (line 21) | def __init__(self, chan):
    method forward (line 31) | def forward(self, x):
  class GPT2InferenceModel (line 35) | class GPT2InferenceModel(GPT2PreTrainedModel):
    method __init__ (line 36) | def __init__(self, config, gpt, text_pos_emb, embeddings, norm, linear...
    method parallelize (line 49) | def parallelize(self, device_map=None):
    method deparallelize (line 60) | def deparallelize(self):
    method get_output_embeddings (line 69) | def get_output_embeddings(self):
    method set_output_embeddings (line 72) | def set_output_embeddings(self, new_embeddings):
    method store_mel_emb (line 75) | def store_mel_emb(self, mel_emb):
    method prepare_inputs_for_generation (line 78) | def prepare_inputs_for_generation(self, input_ids, past_key_values=Non...
    method forward (line 108) | def forward(
    method _reorder_cache (line 189) | def _reorder_cache(past, beam_idx):
  class ConditioningEncoder (line 204) | class ConditioningEncoder(nn.Module):
    method __init__ (line 205) | def __init__(self,
    method forward (line 222) | def forward(self, x):
  class LearnedPositionEmbeddings (line 231) | class LearnedPositionEmbeddings(nn.Module):
    method __init__ (line 232) | def __init__(self, seq_len, model_dim, init=.02):
    method forward (line 238) | def forward(self, x):
    method get_fixed_embedding (line 242) | def get_fixed_embedding(self, ind, dev):
  function build_hf_gpt_transformer (line 246) | def build_hf_gpt_transformer(layers, model_dim, heads, max_mel_seq_len, ...
  class MelEncoder (line 269) | class MelEncoder(nn.Module):
    method __init__ (line 270) | def __init__(self, channels, mel_channels=80, resblocks_per_reduction=2):
    method forward (line 287) | def forward(self, x):
  class UnifiedVoice (line 293) | class UnifiedVoice(nn.Module):
    method __init__ (line 294) | def __init__(self, layers=8, model_dim=512, heads=8, max_text_tokens=1...
    method post_init_gpt2_config (line 358) | def post_init_gpt2_config(self, use_deepspeed=False, kv_cache=False, h...
    method build_aligned_inputs_and_targets (line 398) | def build_aligned_inputs_and_targets(self, input, start_token, stop_to...
    method set_mel_padding (line 403) | def set_mel_padding(self, mel_input_tokens, wav_lengths):
    method get_logits (line 417) | def get_logits(self, speech_conditioning_inputs, first_inputs, first_h...
    method get_conditioning (line 444) | def get_conditioning(self, speech_conditioning_input):
    method forward (line 454) | def forward(self, speech_conditioning_latent, text_inputs, text_length...
    method compute_embeddings (line 513) | def compute_embeddings(
    method inference_speech (line 535) | def inference_speech(self, speech_conditioning_latent, text_inputs, in...
    method get_generator (line 565) | def get_generator(self, fake_inputs, **hf_generate_kwargs):

FILE: tortoise/models/classifier.py
  class ResBlock (line 7) | class ResBlock(nn.Module):
    method __init__ (line 8) | def __init__(
    method forward (line 65) | def forward(self, x):
  class AudioMiniEncoder (line 78) | class AudioMiniEncoder(nn.Module):
    method __init__ (line 79) | def __init__(self,
    method forward (line 114) | def forward(self, x):
  class AudioMiniEncoderWithClassifierHead (line 123) | class AudioMiniEncoderWithClassifierHead(nn.Module):
    method __init__ (line 124) | def __init__(self, classes, distribute_zero_label=True, **kwargs):
    method forward (line 131) | def forward(self, x, labels=None):

FILE: tortoise/models/clvp.py
  function exists (line 11) | def exists(val):
  function masked_mean (line 15) | def masked_mean(t, mask, dim = 1):
  class CLVP (line 19) | class CLVP(nn.Module):
    method __init__ (line 27) | def __init__(
    method forward (line 99) | def forward(

FILE: tortoise/models/cvvp.py
  function exists (line 10) | def exists(val):
  function masked_mean (line 14) | def masked_mean(t, mask):
  class CollapsingTransformer (line 19) | class CollapsingTransformer(nn.Module):
    method __init__ (line 20) | def __init__(self, model_dim, output_dims, heads, dropout, depth, mask...
    method forward (line 43) | def forward(self, x, **transformer_kwargs):
  class ConvFormatEmbedding (line 54) | class ConvFormatEmbedding(nn.Module):
    method __init__ (line 55) | def __init__(self, *args, **kwargs):
    method forward (line 59) | def forward(self, x):
  class CVVP (line 64) | class CVVP(nn.Module):
    method __init__ (line 65) | def __init__(
    method get_grad_norm_parameter_groups (line 99) | def get_grad_norm_parameter_groups(self):
    method forward (line 105) | def forward(

FILE: tortoise/models/diffusion_decoder.py
  function is_latent (line 13) | def is_latent(t):
  function is_sequence (line 17) | def is_sequence(t):
  function timestep_embedding (line 21) | def timestep_embedding(timesteps, dim, max_period=10000):
  class TimestepBlock (line 42) | class TimestepBlock(nn.Module):
    method forward (line 44) | def forward(self, x, emb):
  class TimestepEmbedSequential (line 50) | class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
    method forward (line 51) | def forward(self, x, emb):
  class ResBlock (line 60) | class ResBlock(TimestepBlock):
    method __init__ (line 61) | def __init__(
    method forward (line 107) | def forward(self, x, emb):
  class DiffusionLayer (line 123) | class DiffusionLayer(TimestepBlock):
    method __init__ (line 124) | def __init__(self, model_channels, dropout, num_heads):
    method forward (line 129) | def forward(self, x, time_emb):
  class DiffusionTts (line 134) | class DiffusionTts(nn.Module):
    method __init__ (line 135) | def __init__(
    method get_grad_norm_parameter_groups (line 212) | def get_grad_norm_parameter_groups(self):
    method get_conditioning (line 222) | def get_conditioning(self, conditioning_input):
    method timestep_independent (line 232) | def timestep_independent(self, aligned_conditioning, conditioning_late...
    method forward (line 262) | def forward(self, x, timesteps, aligned_conditioning=None, conditionin...

FILE: tortoise/models/hifigan_decoder.py
  function get_padding (line 11) | def get_padding(k, d):
  class ResBlock1 (line 15) | class ResBlock1(torch.nn.Module):
    method __init__ (line 30) | def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)):
    method forward (line 81) | def forward(self, x):
    method remove_weight_norm (line 98) | def remove_weight_norm(self):
  class ResBlock2 (line 105) | class ResBlock2(torch.nn.Module):
    method __init__ (line 120) | def __init__(self, channels, kernel_size=3, dilation=(1, 3)):
    method forward (line 147) | def forward(self, x):
    method remove_weight_norm (line 154) | def remove_weight_norm(self):
  class HifiganGenerator (line 159) | class HifiganGenerator(torch.nn.Module):
    method __init__ (line 160) | def __init__(
    method forward (line 237) | def forward(self, x, g=None):
    method inference (line 269) | def inference(self, c, g=None):
    method remove_weight_norm (line 296) | def remove_weight_norm(self):

FILE: tortoise/models/random_latent_generator.py
  function fused_leaky_relu (line 8) | def fused_leaky_relu(input, bias=None, negative_slope=0.2, scale=2 ** 0.5):
  class EqualLinear (line 21) | class EqualLinear(nn.Module):
    method __init__ (line 22) | def __init__(
    method forward (line 34) | def forward(self, input):
  class RandomLatentConverter (line 40) | class RandomLatentConverter(nn.Module):
    method __init__ (line 41) | def __init__(self, channels):
    method forward (line 47) | def forward(self, ref):

FILE: tortoise/models/stream_generator.py
  function setup_seed (line 26) | def setup_seed(seed):
  class StreamGenerationConfig (line 37) | class StreamGenerationConfig(GenerationConfig):
    method __init__ (line 38) | def __init__(self, **kwargs):
  class NewGenerationMixin (line 43) | class NewGenerationMixin(GenerationMixin):
    method generate (line 45) | def generate(
    method sample_stream (line 722) | def sample_stream(
  function init_stream_support (line 1003) | def init_stream_support():

FILE: tortoise/models/transformer.py
  function exists (line 13) | def exists(val):
  function default (line 17) | def default(val, d):
  function cast_tuple (line 21) | def cast_tuple(val, depth = 1):
  function max_neg_value (line 27) | def max_neg_value(t):
  function stable_softmax (line 31) | def stable_softmax(t, dim = -1, alpha = 32 ** 2):
  function route_args (line 37) | def route_args(router, args, depth):
  class SequentialSequence (line 50) | class SequentialSequence(nn.Module):
    method __init__ (line 51) | def __init__(self, layers, args_route = {}, layer_dropout = 0.):
    method forward (line 58) | def forward(self, x, **kwargs):
  class DivideMax (line 68) | class DivideMax(nn.Module):
    method __init__ (line 69) | def __init__(self, dim):
    method forward (line 73) | def forward(self, x):
  class LayerScale (line 79) | class LayerScale(nn.Module):
    method __init__ (line 80) | def __init__(self, dim, depth, fn):
    method forward (line 92) | def forward(self, x, **kwargs):
  class PreNorm (line 98) | class PreNorm(nn.Module):
    method __init__ (line 99) | def __init__(self, dim, fn, sandwich = False):
    method forward (line 105) | def forward(self, x, **kwargs):
  class GEGLU (line 113) | class GEGLU(nn.Module):
    method forward (line 114) | def forward(self, x):
  class FeedForward (line 119) | class FeedForward(nn.Module):
    method __init__ (line 120) | def __init__(self, dim, dropout = 0., mult = 4.):
    method forward (line 129) | def forward(self, x):
  class Attention (line 135) | class Attention(nn.Module):
    method __init__ (line 136) | def __init__(self, dim, seq_len, causal = True, heads = 8, dim_head = ...
    method forward (line 151) | def forward(self, x, mask = None):
  class Transformer (line 182) | class Transformer(nn.Module):
    method __init__ (line 183) | def __init__(
    method forward (line 218) | def forward(self, x, **kwargs):

FILE: tortoise/models/vocoder.py
  class KernelPredictor (line 7) | class KernelPredictor(torch.nn.Module):
    method __init__ (line 10) | def __init__(
    method forward (line 66) | def forward(self, c):
    method remove_weight_norm (line 95) | def remove_weight_norm(self):
  class LVCBlock (line 104) | class LVCBlock(torch.nn.Module):
    method __init__ (line 107) | def __init__(
    method forward (line 155) | def forward(self, x, c):
    method location_variable_convolution (line 182) | def location_variable_convolution(self, x, kernel, bias, dilation=1, h...
    method remove_weight_norm (line 218) | def remove_weight_norm(self):
  class UnivNetGenerator (line 225) | class UnivNetGenerator(nn.Module):
    method __init__ (line 232) | def __init__(self, noise_dim=64, channel_size=32, dilations=[1,3,9,27]...
    method forward (line 267) | def forward(self, c, z):
    method eval (line 284) | def eval(self, inference=False):
    method remove_weight_norm (line 290) | def remove_weight_norm(self):
    method inference (line 300) | def inference(self, c, z=None):

FILE: tortoise/models/xtransformers.py
  function exists (line 27) | def exists(val):
  function default (line 31) | def default(val, d):
  function cast_tuple (line 37) | def cast_tuple(val, depth):
  class always (line 41) | class always():
    method __init__ (line 42) | def __init__(self, val):
    method __call__ (line 45) | def __call__(self, *args, **kwargs):
  class not_equals (line 49) | class not_equals():
    method __init__ (line 50) | def __init__(self, val):
    method __call__ (line 53) | def __call__(self, x, *args, **kwargs):
  class equals (line 57) | class equals():
    method __init__ (line 58) | def __init__(self, val):
    method __call__ (line 61) | def __call__(self, x, *args, **kwargs):
  function max_neg_value (line 65) | def max_neg_value(tensor):
  function l2norm (line 69) | def l2norm(t):
  function init_zero_ (line 75) | def init_zero_(layer):
  function pick_and_pop (line 83) | def pick_and_pop(keys, d):
  function group_dict_by_key (line 88) | def group_dict_by_key(cond, d):
  function string_begins_with (line 97) | def string_begins_with(prefix, str):
  function group_by_key_prefix (line 101) | def group_by_key_prefix(prefix, d):
  function groupby_prefix_and_trim (line 105) | def groupby_prefix_and_trim(prefix, d):
  class ReluSquared (line 113) | class ReluSquared(nn.Module):
    method forward (line 114) | def forward(self, x):
  class AbsolutePositionalEmbedding (line 120) | class AbsolutePositionalEmbedding(nn.Module):
    method __init__ (line 121) | def __init__(self, dim, max_seq_len):
    method forward (line 126) | def forward(self, x):
  class FixedPositionalEmbedding (line 133) | class FixedPositionalEmbedding(nn.Module):
    method __init__ (line 134) | def __init__(self, dim):
    method forward (line 139) | def forward(self, x, seq_dim=1, offset=0):
  class RelativePositionBias (line 146) | class RelativePositionBias(nn.Module):
    method __init__ (line 147) | def __init__(self, scale, causal=False, num_buckets=32, max_distance=1...
    method _relative_position_bucket (line 156) | def _relative_position_bucket(relative_position, causal=True, num_buck...
    method forward (line 177) | def forward(self, qk_dots):
  class AlibiPositionalBias (line 189) | class AlibiPositionalBias(nn.Module):
    method __init__ (line 190) | def __init__(self, heads, **kwargs):
    method _get_slopes (line 199) | def _get_slopes(heads):
    method forward (line 212) | def forward(self, qk_dots):
  class LearnedAlibiPositionalBias (line 229) | class LearnedAlibiPositionalBias(AlibiPositionalBias):
    method __init__ (line 230) | def __init__(self, heads, bidirectional=False):
    method forward (line 239) | def forward(self, qk_dots):
  class RotaryEmbedding (line 264) | class RotaryEmbedding(nn.Module):
    method __init__ (line 265) | def __init__(self, dim):
    method forward (line 270) | def forward(self, max_seq_len, device):
  function rotate_half (line 277) | def rotate_half(x):
  function apply_rotary_pos_emb (line 283) | def apply_rotary_pos_emb(t, freqs):
  class Scale (line 291) | class Scale(nn.Module):
    method __init__ (line 292) | def __init__(self, value, fn):
    method forward (line 297) | def forward(self, x, **kwargs):
  class Rezero (line 307) | class Rezero(nn.Module):
    method __init__ (line 308) | def __init__(self, fn):
    method forward (line 313) | def forward(self, x, **kwargs):
  class ScaleNorm (line 323) | class ScaleNorm(nn.Module):
    method __init__ (line 324) | def __init__(self, dim, eps=1e-5):
    method forward (line 330) | def forward(self, x):
  class RMSNorm (line 335) | class RMSNorm(nn.Module):
    method __init__ (line 336) | def __init__(self, dim, eps=1e-8):
    method forward (line 342) | def forward(self, x):
  class RMSScaleShiftNorm (line 347) | class RMSScaleShiftNorm(nn.Module):
    method __init__ (line 348) | def __init__(self, dim, eps=1e-8):
    method forward (line 355) | def forward(self, x, norm_scale_shift_inp):
  class Residual (line 367) | class Residual(nn.Module):
    method __init__ (line 368) | def __init__(self, dim, scale_residual=False):
    method forward (line 372) | def forward(self, x, residual):
  class GRUGating (line 379) | class GRUGating(nn.Module):
    method __init__ (line 380) | def __init__(self, dim, scale_residual=False):
    method forward (line 385) | def forward(self, x, residual):
  function shift (line 399) | def shift(t, amount, mask=None):
  class ShiftTokens (line 409) | class ShiftTokens(nn.Module):
    method __init__ (line 410) | def __init__(self, shifts, fn):
    method forward (line 415) | def forward(self, x, **kwargs):
  class GLU (line 429) | class GLU(nn.Module):
    method __init__ (line 430) | def __init__(self, dim_in, dim_out, activation):
    method forward (line 435) | def forward(self, x):
  class FeedForward (line 440) | class FeedForward(nn.Module):
    method __init__ (line 441) | def __init__(
    method forward (line 473) | def forward(self, x):
  class Attention (line 479) | class Attention(nn.Module):
    method __init__ (line 480) | def __init__(
    method forward (line 576) | def forward(
  class AttentionLayers (line 731) | class AttentionLayers(nn.Module):
    method __init__ (line 732) | def __init__(
    method forward (line 906) | def forward(
  class Encoder (line 1016) | class Encoder(AttentionLayers):
    method __init__ (line 1017) | def __init__(self, **kwargs):
  class Decoder (line 1022) | class Decoder(AttentionLayers):
    method __init__ (line 1023) | def __init__(self, **kwargs):
  class CrossAttender (line 1028) | class CrossAttender(AttentionLayers):
    method __init__ (line 1029) | def __init__(self, **kwargs):
  class ViTransformerWrapper (line 1033) | class ViTransformerWrapper(nn.Module):
    method __init__ (line 1034) | def __init__(
    method forward (line 1062) | def forward(
  class TransformerWrapper (line 1087) | class TransformerWrapper(nn.Module):
    method __init__ (line 1088) | def __init__(
    method init_ (line 1131) | def init_(self):
    method forward (line 1134) | def forward(
  class ContinuousTransformerWrapper (line 1187) | class ContinuousTransformerWrapper(nn.Module):
    method __init__ (line 1188) | def __init__(
    method forward (line 1217) | def forward(

FILE: tortoise/socket_client.py
  function play_audio_stream (line 5) | def play_audio_stream(client_socket):
  function send_text_to_server (line 31) | def send_text_to_server(character_name, text, server_ip='localhost', ser...

FILE: tortoise/socket_server.py
  function generate_audio_stream (line 11) | def generate_audio_stream(text, tts, voice_samples):
  function split_text (line 25) | def split_text(text, max_length=200):
  function handle_client (line 46) | def handle_client(client_socket, tts):
  function start_server (line 69) | def start_server():

FILE: tortoise/tts_stream.py
  function play_audio (line 14) | def play_audio(audio_queue):

FILE: tortoise/utils/audio.py
  function load_wav_to_torch (line 16) | def load_wav_to_torch(full_path):
  function load_audio (line 29) | def load_audio(audiopath, sampling_rate):
  function denormalize_tacotron_mel (line 63) | def denormalize_tacotron_mel(norm_mel):
  function normalize_tacotron_mel (line 67) | def normalize_tacotron_mel(mel):
  function dynamic_range_compression (line 71) | def dynamic_range_compression(x, C=1, clip_val=1e-5):
  function dynamic_range_decompression (line 80) | def dynamic_range_decompression(x, C=1):
  function get_voices (line 89) | def get_voices(extra_voice_dirs=[]):
  function save_pth (line 100) | def save_pth(conds, save_path):
  function load_voice (line 104) | def load_voice(voice, extra_voice_dirs=[]):
  function load_voices (line 127) | def load_voices(voices, extra_voice_dirs=[]):
  class TacotronSTFT (line 151) | class TacotronSTFT(torch.nn.Module):
    method __init__ (line 152) | def __init__(self, filter_length=1024, hop_length=256, win_length=1024,
    method spectral_normalize (line 165) | def spectral_normalize(self, magnitudes):
    method spectral_de_normalize (line 169) | def spectral_de_normalize(self, magnitudes):
    method mel_spectrogram (line 173) | def mel_spectrogram(self, y):
  function wav_to_univnet_mel (line 194) | def wav_to_univnet_mel(wav, do_normalization=False,

FILE: tortoise/utils/diffusion.py
  function normal_kl (line 19) | def normal_kl(mean1, logvar1, mean2, logvar2):
  function approx_standard_normal_cdf (line 49) | def approx_standard_normal_cdf(x):
  function discretized_gaussian_log_likelihood (line 57) | def discretized_gaussian_log_likelihood(x, *, means, log_scales):
  function mean_flat (line 87) | def mean_flat(tensor):
  function get_named_beta_schedule (line 94) | def get_named_beta_schedule(schedule_name, num_diffusion_timesteps):
  function betas_for_alpha_bar (line 121) | def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.9...
  class ModelMeanType (line 141) | class ModelMeanType(enum.Enum):
  class ModelVarType (line 151) | class ModelVarType(enum.Enum):
  class LossType (line 165) | class LossType(enum.Enum):
    method is_vb (line 171) | def is_vb(self):
  class GaussianDiffusion (line 175) | class GaussianDiffusion:
    method __init__ (line 192) | def __init__(
    method q_mean_variance (line 251) | def q_mean_variance(self, x_start, t):
    method q_sample (line 268) | def q_sample(self, x_start, t, noise=None):
    method q_posterior_mean_variance (line 288) | def q_posterior_mean_variance(self, x_start, x_t, t):
    method p_mean_variance (line 312) | def p_mean_variance(
    method _predict_xstart_from_eps (line 420) | def _predict_xstart_from_eps(self, x_t, t, eps):
    method _predict_xstart_from_xprev (line 427) | def _predict_xstart_from_xprev(self, x_t, t, xprev):
    method _predict_eps_from_xstart (line 437) | def _predict_eps_from_xstart(self, x_t, t, pred_xstart):
    method _scale_timesteps (line 443) | def _scale_timesteps(self, t):
    method condition_mean (line 448) | def condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
    method condition_score (line 463) | def condition_score(self, cond_fn, p_mean_var, x, t, model_kwargs=None):
    method p_sample (line 487) | def p_sample(
    method p_sample_loop (line 533) | def p_sample_loop(
    method p_sample_loop_progressive (line 579) | def p_sample_loop_progressive(
    method ddim_sample (line 623) | def ddim_sample(
    method ddim_reverse_sample (line 673) | def ddim_reverse_sample(
    method ddim_sample_loop (line 711) | def ddim_sample_loop(
    method ddim_sample_loop_progressive (line 745) | def ddim_sample_loop_progressive(
    method _vb_terms_bpd (line 795) | def _vb_terms_bpd(
    method training_losses (line 830) | def training_losses(self, model, x_start, t, model_kwargs=None, noise=...
    method autoregressive_training_losses (line 918) | def autoregressive_training_losses(self, model, x_start, t, model_outp...
    method _prior_bpd (line 990) | def _prior_bpd(self, x_start):
    method calc_bpd_loop (line 1008) | def calc_bpd_loop(self, model, x_start, clip_denoised=True, model_kwar...
  function get_named_beta_schedule (line 1066) | def get_named_beta_schedule(schedule_name, num_diffusion_timesteps):
  class SpacedDiffusion (line 1093) | class SpacedDiffusion(GaussianDiffusion):
    method __init__ (line 1102) | def __init__(self, use_timesteps, **kwargs):
    method p_mean_variance (line 1118) | def p_mean_variance(
    method training_losses (line 1123) | def training_losses(
    method autoregressive_training_losses (line 1128) | def autoregressive_training_losses(
    method condition_mean (line 1133) | def condition_mean(self, cond_fn, *args, **kwargs):
    method condition_score (line 1136) | def condition_score(self, cond_fn, *args, **kwargs):
    method _wrap_model (line 1139) | def _wrap_model(self, model, autoregressive=False):
    method _scale_timesteps (line 1147) | def _scale_timesteps(self, t):
  function space_timesteps (line 1152) | def space_timesteps(num_timesteps, section_counts):
  class _WrappedModel (line 1208) | class _WrappedModel:
    method __init__ (line 1209) | def __init__(self, model, timestep_map, rescale_timesteps, original_nu...
    method __call__ (line 1215) | def __call__(self, x, ts, **kwargs):
  class _WrappedAutoregressiveModel (line 1223) | class _WrappedAutoregressiveModel:
    method __init__ (line 1224) | def __init__(self, model, timestep_map, rescale_timesteps, original_nu...
    method __call__ (line 1230) | def __call__(self, x, x0, ts, **kwargs):
  function _extract_into_tensor (line 1237) | def _extract_into_tensor(arr, timesteps, broadcast_shape):

FILE: tortoise/utils/stft.py
  function window_sumsquare (line 42) | def window_sumsquare(window, n_frames, hop_length=200, win_length=800,
  class STFT (line 94) | class STFT(torch.nn.Module):
    method __init__ (line 96) | def __init__(self, filter_length=800, hop_length=200, win_length=800,
    method transform (line 129) | def transform(self, input_data):
    method inverse (line 159) | def inverse(self, magnitude, phase):
    method forward (line 190) | def forward(self, input_data):

FILE: tortoise/utils/text.py
  function split_and_recombine_text (line 4) | def split_and_recombine_text(text, desired_length=200, max_length=300):
  class Test (line 80) | class Test(unittest.TestCase):
    method test_split_and_recombine_text (line 81) | def test_split_and_recombine_text(self):
    method test_split_and_recombine_text_2 (line 96) | def test_split_and_recombine_text_2(self):
    method test_split_and_recombine_text_3 (line 107) | def test_split_and_recombine_text_3(self):

FILE: tortoise/utils/tokenizer.py
  function expand_abbreviations (line 38) | def expand_abbreviations(text):
  function _remove_commas (line 53) | def _remove_commas(m):
  function _expand_decimal_point (line 57) | def _expand_decimal_point(m):
  function _expand_dollars (line 61) | def _expand_dollars(m):
  function _expand_ordinal (line 82) | def _expand_ordinal(m):
  function _expand_number (line 86) | def _expand_number(m):
  function normalize_numbers (line 101) | def normalize_numbers(text):
  function expand_numbers (line 111) | def expand_numbers(text):
  function lowercase (line 115) | def lowercase(text):
  function collapse_whitespace (line 119) | def collapse_whitespace(text):
  function convert_to_ascii (line 123) | def convert_to_ascii(text):
  function basic_cleaners (line 127) | def basic_cleaners(text):
  function transliteration_cleaners (line 134) | def transliteration_cleaners(text):
  function english_cleaners (line 142) | def english_cleaners(text):
  function lev_distance (line 153) | def lev_distance(s1, s2):
  class VoiceBpeTokenizer (line 172) | class VoiceBpeTokenizer:
    method __init__ (line 173) | def __init__(self, vocab_file=None, use_basic_cleaners=False):
    method encode (line 182) | def encode(self, txt):
    method decode (line 187) | def decode(self, seq):

FILE: tortoise/utils/typical_sampling.py
  class TypicalLogitsWarper (line 5) | class TypicalLogitsWarper(LogitsWarper):
    method __init__ (line 6) | def __init__(self, mass: float = 0.9, filter_value: float = -float("In...
    method __call__ (line 11) | def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTen...

FILE: tortoise/utils/wav2vec_alignment.py
  function max_alignment (line 10) | def max_alignment(s1, s2, skip_character='~', record=None):
  class Wav2VecAlignment (line 48) | class Wav2VecAlignment:
    method __init__ (line 52) | def __init__(self, device='cuda' if not torch.backends.mps.is_availabl...
    method align (line 58) | def align(self, audio, expected_text, audio_sample_rate=24000):
    method redact (line 125) | def redact(self, audio, expected_text, audio_sample_rate=24000):

Condensed preview: 56 files, each showing path, character count, and a content snippet (1,243K chars in the full structured content).
[
  {
    "path": ".github/workflows",
    "chars": 1084,
    "preview": "# This workflow will upload a Python Package using Twine when a release is created\n# For more information see: https://d"
  },
  {
    "path": ".gitignore",
    "chars": 1852,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": "Advanced_Usage.md",
    "chars": 6541,
    "preview": "## Advanced Usage\n\n### Generation settings\n\nTortoise is primarily an autoregressive decoder model combined with a diffus"
  },
  {
    "path": "CHANGELOG.md",
    "chars": 1433,
    "preview": "## Changelog\n#### v3.0.0; 2023/10/18\n- Added fast inference for tortoise with HiFi Decoder (inspired by xtts by [coquiTT"
  },
  {
    "path": "CITATION.cff",
    "chars": 320,
    "preview": "cff-version: 1.3.0\nmessage: \"If you use this software, please cite it as below.\"\nauthors:\n- family-names: \"Betker\"\n  giv"
  },
  {
    "path": "Dockerfile",
    "chars": 1257,
    "preview": "FROM nvidia/cuda:12.2.0-base-ubuntu22.04\n\nCOPY . /app\n\nRUN apt-get update && \\\n    apt-get install -y --allow-unauthenti"
  },
  {
    "path": "LICENSE",
    "chars": 11357,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "MANIFEST.in",
    "chars": 70,
    "preview": "recursive-include tortoise/data *\nrecursive-include tortoise/voices *\n"
  },
  {
    "path": "README.md",
    "chars": 8728,
    "preview": "# TorToiSe\n\nTortoise is a text-to-speech program built with the following priorities:\n\n1. Strong multi-voice capabilitie"
  },
  {
    "path": "requirements.txt",
    "chars": 298,
    "preview": "tqdm\nrotary_embedding_torch\ntransformers==4.31.0\ntokenizers\ninflect\nprogressbar\neinops==0.4.1\nunidecode\nscipy\nlibrosa==0"
  },
  {
    "path": "scripts/tortoise_tts.py",
    "chars": 12621,
    "preview": "#!/usr/bin/env python3\n\nimport argparse\nimport os\nimport sys\nimport tempfile\nimport time\n\nimport torch\nimport torchaudio"
  },
  {
    "path": "setup.cfg",
    "chars": 40,
    "preview": "[metadata]\ndescription_file = README.md\n"
  },
  {
    "path": "setup.py",
    "chars": 1117,
    "preview": "import setuptools\n\nwith open(\"README.md\", \"r\", encoding=\"utf-8\") as fh:\n    long_description = fh.read()\n\nsetuptools.set"
  },
  {
    "path": "tortoise/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tortoise/api.py",
    "chars": 36346,
    "preview": "import os\nimport random\nimport uuid\nfrom time import time\nfrom urllib import request\n\nimport torch\nimport torch.nn.funct"
  },
  {
    "path": "tortoise/api_fast.py",
    "chars": 30646,
    "preview": "import os\nimport random\nimport uuid\nfrom time import time\nfrom urllib import request\n\nimport torch\nimport torch.nn.funct"
  },
  {
    "path": "tortoise/data/got.txt",
    "chars": 16399,
    "preview": "Chapter One\n\n\nBran\n\n\nThe morning had dawned clear and cold, with a crispness that hinted at the end of summer. They set "
  },
  {
    "path": "tortoise/data/layman.txt",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tortoise/data/riding_hood.txt",
    "chars": 3587,
    "preview": "Once upon a time there lived in a certain village a little country girl, the prettiest creature who was ever seen. Her m"
  },
  {
    "path": "tortoise/data/seal_copypasta.txt",
    "chars": 1518,
    "preview": "What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the"
  },
  {
    "path": "tortoise/data/tokenizer.json",
    "chars": 4409,
    "preview": "{\"version\":\"1.0\",\"truncation\":null,\"padding\":null,\"added_tokens\":[{\"id\":0,\"special\":true,\"content\":\"[STOP]\",\"single_word"
  },
  {
    "path": "tortoise/do_tts.py",
    "chars": 3686,
    "preview": "import argparse\nimport os\n\nimport torch\nimport torchaudio\n\nfrom api import TextToSpeech, MODELS_DIR\nfrom utils.audio imp"
  },
  {
    "path": "tortoise/eval.py",
    "chars": 1059,
    "preview": "import argparse\nimport os\n\nimport torchaudio\n\nfrom api import TextToSpeech\nfrom tortoise.utils.audio import load_audio\n\n"
  },
  {
    "path": "tortoise/get_conditioning_latents.py",
    "chars": 1194,
    "preview": "import argparse\nimport os\nimport torch\n\nfrom api import TextToSpeech\nfrom tortoise.utils.audio import load_audio, get_vo"
  },
  {
    "path": "tortoise/is_this_from_tortoise.py",
    "chars": 546,
    "preview": "import argparse\n\nfrom api import classify_audio_clip\nfrom tortoise.utils.audio import load_audio\n\nif __name__ == '__main"
  },
  {
    "path": "tortoise/models/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tortoise/models/arch_util.py",
    "chars": 13039,
    "preview": "import os\nimport functools\nimport math\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torcha"
  },
  {
    "path": "tortoise/models/autoregressive.py",
    "chars": 27160,
    "preview": "import functools\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom transformers import GPT2Config"
  },
  {
    "path": "tortoise/models/classifier.py",
    "chars": 5030,
    "preview": "import torch\nimport torch.nn as nn\n\nfrom tortoise.models.arch_util import Upsample, Downsample, normalization, zero_modu"
  },
  {
    "path": "tortoise/models/clvp.py",
    "chars": 5783,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch import einsum\n\nfrom tortoise.models.arch_u"
  },
  {
    "path": "tortoise/models/cvvp.py",
    "chars": 4928,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch import einsum\n\nfrom tortoise.models.arch_u"
  },
  {
    "path": "tortoise/models/diffusion_decoder.py",
    "chars": 15561,
    "preview": "import math\nimport random\nfrom abc import abstractmethod\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional "
  },
  {
    "path": "tortoise/models/hifigan_decoder.py",
    "chars": 10563,
    "preview": "# adopted from https://github.com/jik876/hifi-gan/blob/master/models.py\nimport torch\nfrom torch import nn\nfrom torch.nn "
  },
  {
    "path": "tortoise/models/random_latent_generator.py",
    "chars": 1649,
    "preview": "import math\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\ndef fused_leaky_relu(input, bias=None,"
  },
  {
    "path": "tortoise/models/stream_generator.py",
    "chars": 48681,
    "preview": "# Adapted from: https://github.com/LowinLi/transformers-stream-generator\n\nfrom transformers import (\n    GenerationConfi"
  },
  {
    "path": "tortoise/models/transformer.py",
    "chars": 6215,
    "preview": "from functools import partial\n\nimport torch\nimport torch.nn.functional as F\nfrom einops import rearrange\nfrom rotary_emb"
  },
  {
    "path": "tortoise/models/vocoder.py",
    "chars": 12843,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nMAX_WAV_VALUE = 32768.0\n\nclass KernelPredictor(torch"
  },
  {
    "path": "tortoise/models/xtransformers.py",
    "chars": 41987,
    "preview": "import math\nfrom collections import namedtuple\nfrom functools import partial\nfrom inspect import isfunction\n\nimport torc"
  },
  {
    "path": "tortoise/read.py",
    "chars": 5755,
    "preview": "import argparse\nimport os\nfrom time import time\n\nimport torch\nimport torchaudio\n\nfrom api import TextToSpeech, MODELS_DI"
  },
  {
    "path": "tortoise/read_fast.py",
    "chars": 4206,
    "preview": "import argparse\nimport os\nfrom time import time\n\nimport torch\nimport torchaudio\n\nfrom api_fast import TextToSpeech, MODE"
  },
  {
    "path": "tortoise/socket_client.py",
    "chars": 1488,
    "preview": "import socket\nimport sounddevice as sd\nimport numpy as np\n\ndef play_audio_stream(client_socket):\n    buffer = b''\n    st"
  },
  {
    "path": "tortoise/socket_server.py",
    "chars": 2300,
    "preview": "import spacy\nimport threading\nimport socket\nfrom tortoise.api_fast import TextToSpeech\nfrom utils.audio import load_voic"
  },
  {
    "path": "tortoise/tts_stream.py",
    "chars": 4256,
    "preview": "import argparse\nimport os\nfrom time import time\n\nimport torch\nimport torchaudio\n\nfrom api_fast import TextToSpeech, MODE"
  },
  {
    "path": "tortoise/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tortoise/utils/audio.py",
    "chars": 6847,
    "preview": "import os\nfrom glob import glob\n\nimport librosa\nimport torch\nimport torchaudio\nimport numpy as np\nfrom scipy.io.wavfile "
  },
  {
    "path": "tortoise/utils/diffusion.py",
    "chars": 49162,
    "preview": "\"\"\"\nThis is an almost carbon copy of gaussian_diffusion.py from OpenAI's ImprovedDiffusion repo, which itself:\n\nThis cod"
  },
  {
    "path": "tortoise/utils/stft.py",
    "chars": 7462,
    "preview": "\"\"\"\nBSD 3-Clause License\n\nCopyright (c) 2017, Prem Seetharaman\nAll rights reserved.\n\n* Redistribution and use in source "
  },
  {
    "path": "tortoise/utils/text.py",
    "chars": 8486,
    "preview": "import re\n\n\ndef split_and_recombine_text(text, desired_length=200, max_length=300):\n    \"\"\"Split text it into chunks of "
  },
  {
    "path": "tortoise/utils/tokenizer.py",
    "chars": 5279,
    "preview": "import os\nimport re\n\nimport inflect\nimport torch\nfrom tokenizers import Tokenizer\n\n\n# Regular expression matching whites"
  },
  {
    "path": "tortoise/utils/typical_sampling.py",
    "chars": 1595,
    "preview": "import torch\nfrom transformers import LogitsWarper\n\n\nclass TypicalLogitsWarper(LogitsWarper):\n    def __init__(self, mas"
  },
  {
    "path": "tortoise/utils/wav2vec_alignment.py",
    "chars": 6376,
    "preview": "import re\n\nimport torch\nimport torchaudio\nfrom transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor, Wav2Vec2CTC"
  },
  {
    "path": "tortoise_tts.ipynb",
    "chars": 691528,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"_pIZ3ZXNp7cf\"\n   },\n   \"source\": [\n    \"Welcom"
  },
  {
    "path": "tortoise_v2_examples.html",
    "chars": 85074,
    "preview": "<html><head><meta charset=\"UTF-8\"><title>TorToiSe - These words were never spoken.</title></head>\n<body>\n<h1>Introductio"
  },
  {
    "path": "voice_customization_guide.md",
    "chars": 2400,
    "preview": "## Voice Customization Guide\n\nTortoise was specifically trained to be a multi-speaker model. It accomplishes this by con"
  }
]

