Repository: neonbjb/tortoise-tts Branch: main Commit: 8a2563ecabe9 Files: 56 Total size: 1.2 MB Directory structure: gitextract_v5pjsrh0/ ├── .github/ │ └── workflows ├── .gitignore ├── Advanced_Usage.md ├── CHANGELOG.md ├── CITATION.cff ├── Dockerfile ├── LICENSE ├── MANIFEST.in ├── README.md ├── requirements.txt ├── scripts/ │ └── tortoise_tts.py ├── setup.cfg ├── setup.py ├── tortoise/ │ ├── __init__.py │ ├── api.py │ ├── api_fast.py │ ├── data/ │ │ ├── got.txt │ │ ├── layman.txt │ │ ├── mel_norms.pth │ │ ├── riding_hood.txt │ │ ├── seal_copypasta.txt │ │ └── tokenizer.json │ ├── do_tts.py │ ├── eval.py │ ├── get_conditioning_latents.py │ ├── is_this_from_tortoise.py │ ├── models/ │ │ ├── __init__.py │ │ ├── arch_util.py │ │ ├── autoregressive.py │ │ ├── classifier.py │ │ ├── clvp.py │ │ ├── cvvp.py │ │ ├── diffusion_decoder.py │ │ ├── hifigan_decoder.py │ │ ├── random_latent_generator.py │ │ ├── stream_generator.py │ │ ├── transformer.py │ │ ├── vocoder.py │ │ └── xtransformers.py │ ├── read.py │ ├── read_fast.py │ ├── socket_client.py │ ├── socket_server.py │ ├── tts_stream.py │ ├── utils/ │ │ ├── __init__.py │ │ ├── audio.py │ │ ├── diffusion.py │ │ ├── stft.py │ │ ├── text.py │ │ ├── tokenizer.py │ │ ├── typical_sampling.py │ │ └── wav2vec_alignment.py │ └── voices/ │ └── cond_latent_example/ │ └── pat.pth ├── tortoise_tts.ipynb ├── tortoise_v2_examples.html └── voice_customization_guide.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/workflows ================================================ # This workflow will upload a Python Package using Twine when a release is created # For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries # This workflow uses actions that are not certified by GitHub. # They are provided by a third-party and are governed by # separate terms of service, privacy policy, and support # documentation. name: Upload Python Package on: release: types: [published] permissions: contents: read jobs: deploy: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Set up Python uses: actions/setup-python@v3 with: python-version: '3.x' - name: Install dependencies run: | python -m pip install --upgrade pip pip install build - name: Build package run: python -m build - name: Publish package uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29 with: user: __token__ password: ${{ secrets.PYPI_API_TOKEN }} ================================================ FILE: .gitignore ================================================ # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ pip-wheel-metadata/ share/python-wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .nox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover *.py,cover .hypothesis/ .pytest_cache/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 db.sqlite3-journal # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder target/ # Jupyter Notebook .ipynb_checkpoints # IPython profile_default/ ipython_config.py # pyenv .python-version # pipenv # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. # However, in case of collaboration, if having platform-specific dependencies or dependencies # having no cross-platform support, pipenv may install dependencies that don't work, or not # install all needed dependencies. #Pipfile.lock # PEP 582; used by e.g. github.com/David-OConnor/pyflow __pypackages__/ # Celery stuff celerybeat-schedule celerybeat.pid # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ .dmypy.json dmypy.json # Pyre type checker .pyre/ .idea/* .models/* .custom/* results/* debug_states/* ================================================ FILE: Advanced_Usage.md ================================================ ## Advanced Usage ### Generation settings Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs that can be turned that I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using various permutations of the settings and using a metric for voice realism and intelligibility to measure their effects. I've set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with these settings (and it's very likely that I missed something!) These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See ```api.tts``` for a full list. ### Prompt engineering Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion by including things like "I am really sad," before your text. I've built an automated redaction system that you can use to take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the prompt "\[I am really sad,\] Please feed me." will only speak the words "Please feed me" (with a sad tonality). ### Playing with the voice latent Tortoise ingests reference clips by feeding them through individually through a small submodel that produces a point latent, then taking the mean of all of the produced latents. The experimentation I have done has indicated that these point latents are quite expressive, affecting everything from tone to speaking rate to speech abnormalities. This lends itself to some neat tricks. For example, you can combine feed two different voices to tortoise and it will output what it thinks the "average" of those two voices sounds like. #### Generating conditioning latents from voices Use the script `get_conditioning_latents.py` to extract conditioning latents for a voice you have installed. This script will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent). Alternatively, use the api.TextToSpeech.get_conditioning_latents() to fetch the latents. #### Using raw conditioning latents to generate speech After you've played with them, you can use them to generate speech by creating a subdirectory in voices/ with a single ".pth" file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent). ## Tortoise-detect Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip came from Tortoise. This classifier can be run on any computer, usage is as follows: ```commandline python tortoise/is_this_from_tortoise.py --clip= ``` This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false positives. ## Model architecture Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate models that work together. I've assembled a write-up of the system architecture here: [https://nonint.com/2022/04/25/tortoise-architectural-design-doc/](https://nonint.com/2022/04/25/tortoise-architectural-design-doc/) ## Training These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of ~50k hours of speech data, most of which was transcribed by [ocotillo](http://www.github.com/neonbjb/ocotillo). Training was done on my own [DLAS](https://github.com/neonbjb/DL-Art-School) trainer. I currently do not have plans to release the training configurations or methodology. See the next section.. ## Ethical Considerations Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system could be misused are many. It doesn't take much creativity to think up how. After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice: 1. It is primarily good at reading books and speaking poetry. Other forms of speech do not work well. 2. It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled. 3. The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback. 4. I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See `tortoise-detect` above. 5. If I, a tinkerer with a BS in computer science with a ~$15k computer can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do. ### Diversity The diversity expressed by ML models is strongly tied to the datasets they were trained on. Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities or of people who speak with strong accents. ## Looking forward Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck. I want to mention here that I think Tortoise could be a **lot** better. The three major components of Tortoise are either vanilla Transformer Encoder stacks or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason to believe that the same is not true of TTS. ================================================ FILE: CHANGELOG.md ================================================ ## Changelog #### v3.0.0; 2023/10/18 - Added fast inference for tortoise with HiFi Decoder (inspired by xtts by [coquiTTS](https://github.com/coqui-ai/TTS) 🐸, check out their multilingual model for noncommercial uses) #### v2.8.0; 2023/9/13 - Added custom tokenizer for non-english models #### v2.7.0; 2023/7/26 - Bug fixes - Added Apple Silicon Support - Updated Transformer version #### v2.6.0; 2023/7/26 - Bug fixes #### v2.5.0; 2023/7/09 - Added kv_cache support 5x faster - Added deepspeed support 10x faster - Added half precision support #### v2.4.0; 2022/5/17 - Removed CVVP model. Found that it does not, in fact, make an appreciable difference in the output. - Add better debugging support; existing tools now spit out debug files which can be used to reproduce bad runs. #### v2.3.0; 2022/5/12 - New CLVP-large model for further improved decoding guidance. - Improvements to read.py and do_tts.py (new options) #### v2.2.0; 2022/5/5 - Added several new voices from the training set. - Automated redaction. Wrap the text you want to use to prompt the model but not be spoken in brackets. - Bug fixes #### v2.1.0; 2022/5/2 - Added ability to produce totally random voices. - Added ability to download voice conditioning latent via a script, and then use a user-provided conditioning latent. - Added ability to use your own pretrained models. - Refactored directory structures. - Performance improvements & bug fixes. ================================================ FILE: CITATION.cff ================================================ cff-version: 1.3.0 message: "If you use this software, please cite it as below." authors: - family-names: "Betker" given-names: "James" orcid: "https://orcid.org/my-orcid?orcid=0000-0003-3259-4862" title: "TorToiSe text-to-speech" version: 2.0 date-released: 2022-04-28 url: "https://github.com/neonbjb/tortoise-tts" ================================================ FILE: Dockerfile ================================================ FROM nvidia/cuda:12.2.0-base-ubuntu22.04 COPY . /app RUN apt-get update && \ apt-get install -y --allow-unauthenticated --no-install-recommends \ wget \ git \ && apt-get autoremove -y \ && apt-get clean -y \ && rm -rf /var/lib/apt/lists/* ENV HOME="/root" ENV CONDA_DIR="${HOME}/miniconda" ENV PATH="$CONDA_DIR/bin":$PATH ENV CONDA_AUTO_UPDATE_CONDA=false ENV PIP_DOWNLOAD_CACHE="$HOME/.pip/cache" ENV TORTOISE_MODELS_DIR="$HOME/tortoise-tts/build/lib/tortoise/models" RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda3.sh \ && bash /tmp/miniconda3.sh -b -p "${CONDA_DIR}" -f -u \ && "${CONDA_DIR}/bin/conda" init bash \ && rm -f /tmp/miniconda3.sh \ && echo ". '${CONDA_DIR}/etc/profile.d/conda.sh'" >> "${HOME}/.profile" # --login option used to source bashrc (thus activating conda env) at every RUN statement SHELL ["/bin/bash", "--login", "-c"] RUN conda create --name tortoise python=3.9 numba inflect -y \ && conda activate tortoise \ && conda install --yes pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia \ && conda install --yes transformers=4.31.0 \ && cd /app \ && python setup.py install ================================================ FILE: LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ================================================ FILE: MANIFEST.in ================================================ recursive-include tortoise/data * recursive-include tortoise/voices * ================================================ FILE: README.md ================================================ # TorToiSe Tortoise is a text-to-speech program built with the following priorities: 1. Strong multi-voice capabilities. 2. Highly realistic prosody and intonation. This repo contains all the code needed to run Tortoise TTS in inference mode. Manuscript: https://arxiv.org/abs/2305.07243 ## Hugging Face space A live demo is hosted on Hugging Face Spaces. If you'd like to avoid a queue, please duplicate the Space and add a GPU. Please note that CPU-only spaces do not work for this demo. https://huggingface.co/spaces/Manmay/tortoise-tts ## Install via pip ```bash pip install tortoise-tts ``` If you would like to install the latest development version, you can also install it directly from the git repository: ```bash pip install git+https://github.com/neonbjb/tortoise-tts ``` ## What's in a name? I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model is insanely slow. It leverages both an autoregressive decoder **and** a diffusion decoder; both known for their low sampling rates. On a K80, expect to generate a medium sized sentence every 2 minutes. well..... not so slow anymore now we can get a **0.25-0.3 RTF** on 4GB vram and with streaming we can get < **500 ms** latency !!! ## Demos See [this page](http://nonint.com/static/tortoise_v2_examples.html) for a large list of example outputs. A cool application of Tortoise + GPT-3 (not affiliated with this repository): https://twitter.com/lexman_ai. Unfortunately, this project seems no longer to be active. ## Usage guide ### Local installation If you want to use this on your own computer, you must have an NVIDIA GPU. > [!TIP] > On Windows, I **highly** recommend using the Conda installation method. I have been told that if you do not do this, you will spend a lot of time chasing dependency problems. First, install miniconda: https://docs.conda.io/en/latest/miniconda.html Then run the following commands, using anaconda prompt as the terminal (or any other terminal configured to work with conda) This will: 1. create conda environment with minimal dependencies specified 1. activate the environment 1. install pytorch with the command provided here: https://pytorch.org/get-started/locally/ 1. clone tortoise-tts 1. change the current directory to tortoise-tts 1. run tortoise python setup install script ```shell conda create --name tortoise python=3.9 numba inflect conda activate tortoise conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia conda install transformers=4.29.2 git clone https://github.com/neonbjb/tortoise-tts.git cd tortoise-tts python setup.py install ``` Optionally, pytorch can be installed in the base environment, so that other conda environments can use it too. To do this, simply send the `conda install pytorch...` line before activating the tortoise environment. > [!NOTE] > When you want to use tortoise-tts, you will always have to ensure the `tortoise` conda environment is activated. If you are on windows, you may also need to install pysoundfile: `conda install -c conda-forge pysoundfile` ### Docker An easy way to hit the ground running and a good jumping off point depending on your use case. ```sh git clone https://github.com/neonbjb/tortoise-tts.git cd tortoise-tts docker build . -t tts docker run --gpus all \ -e TORTOISE_MODELS_DIR=/models \ -v /mnt/user/data/tortoise_tts/models:/models \ -v /mnt/user/data/tortoise_tts/results:/results \ -v /mnt/user/data/.cache/huggingface:/root/.cache/huggingface \ -v /root:/work \ -it tts ``` This gives you an interactive terminal in an environment that's ready to do some tts. Now you can explore the different interfaces that tortoise exposes for tts. For example: ```sh cd app conda activate tortoise time python tortoise/do_tts.py \ --output_path /results \ --preset ultra_fast \ --voice geralt \ --text "Time flies like an arrow; fruit flies like a bananna." ``` ## Apple Silicon On macOS 13+ with M1/M2 chips you need to install the nighly version of PyTorch, as stated in the official page you can do: ```shell pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu ``` Be sure to do that after you activate the environment. If you don't use conda the commands would look like this: ```shell python3.10 -m venv .venv source .venv/bin/activate pip install numba inflect psutil pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu pip install transformers git clone https://github.com/neonbjb/tortoise-tts.git cd tortoise-tts pip install . ``` Be aware that DeepSpeed is disabled on Apple Silicon since it does not work. The flag `--use_deepspeed` is ignored. You may need to prepend `PYTORCH_ENABLE_MPS_FALLBACK=1` to the commands below to make them work since MPS does not support all the operations in Pytorch. ### do_tts.py This script allows you to speak a single phrase with one or more voices. ```shell python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast ``` ### do socket streaming ```socket server python tortoise/socket_server.py ``` will listen at port 5000 ### faster inference read.py This script provides tools for reading large amounts of text. ```shell python tortoise/read_fast.py --textfile --voice random ``` ### read.py This script provides tools for reading large amounts of text. ```shell python tortoise/read.py --textfile --voice random ``` This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well. Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running `read.py` with the --regenerate argument. ### API Tortoise can be used programmatically, like so: ```python reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths] tts = api.TextToSpeech() pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast') ``` To use deepspeed: ```python reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths] tts = api.TextToSpeech(use_deepspeed=True) pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast') ``` To use kv cache: ```python reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths] tts = api.TextToSpeech(kv_cache=True) pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast') ``` To run model in float16: ```python reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths] tts = api.TextToSpeech(half=True) pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast') ``` for Faster runs use all three: ```python reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths] tts = api.TextToSpeech(use_deepspeed=True, kv_cache=True, half=True) pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast') ``` ## Acknowledgements This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community that have helped make this happen: - Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights. - [Ramesh et al](https://arxiv.org/pdf/2102.12092.pdf) who authored the DALLE paper, which is the inspiration behind Tortoise. - [Nichol and Dhariwal](https://arxiv.org/pdf/2102.09672.pdf) who authored the (revision of) the code that drives the diffusion model. - [Jang et al](https://arxiv.org/pdf/2106.07889.pdf) who developed and open-sourced univnet, the vocoder this repo uses. - [Kim and Jung](https://github.com/mindslab-ai/univnet) who implemented univnet pytorch model. - [lucidrains](https://github.com/lucidrains) who writes awesome open source pytorch models, many of which are used here. - [Patrick von Platen](https://huggingface.co/patrickvonplaten) whose guides on setting up wav2vec were invaluable to building my dataset. ## Notice Tortoise was built entirely by the author (James Betker) using their own hardware. Their employer was not involved in any facet of Tortoise's development. ## License Tortoise TTS is licensed under the Apache 2.0 license. If you use this repo or the ideas therein for your research, please cite it! A bibtex entree can be found in the right pane on GitHub. ================================================ FILE: requirements.txt ================================================ tqdm rotary_embedding_torch transformers==4.31.0 tokenizers inflect progressbar einops==0.4.1 unidecode scipy librosa==0.9.1 ffmpeg numpy numba torchaudio threadpoolctl llvmlite appdirs nbconvert==5.3.1 tornado==4.2 pydantic==1.9.1 deepspeed==0.8.3 py-cpuinfo hjson psutil sounddevice spacy==3.7.5 ================================================ FILE: scripts/tortoise_tts.py ================================================ #!/usr/bin/env python3 import argparse import os import sys import tempfile import time import torch import torchaudio from tortoise.api import MODELS_DIR, TextToSpeech from tortoise.utils.audio import get_voices, load_voices, load_audio from tortoise.utils.text import split_and_recombine_text parser = argparse.ArgumentParser( description='TorToiSe is a text-to-speech program that is capable of synthesizing speech ' 'in multiple voices with realistic prosody and intonation.') parser.add_argument( 'text', type=str, nargs='*', help='Text to speak. If omitted, text is read from stdin.') parser.add_argument( '-v, --voice', type=str, default='random', metavar='VOICE', dest='voice', help='Selects the voice to use for generation. Use the & character to join two voices together. ' 'Use a comma to perform inference on multiple voices. Set to "all" to use all available voices. ' 'Note that multiple voices require the --output-dir option to be set.') parser.add_argument( '-V, --voices-dir', metavar='VOICES_DIR', type=str, dest='voices_dir', help='Path to directory containing extra voices to be loaded. Use a comma to specify multiple directories.') parser.add_argument( '-p, --preset', type=str, default='fast', choices=['ultra_fast', 'fast', 'standard', 'high_quality'], dest='preset', help='Which voice quality preset to use.') parser.add_argument( '-q, --quiet', default=False, action='store_true', dest='quiet', help='Suppress all output.') output_group = parser.add_mutually_exclusive_group(required=True) output_group.add_argument( '-l, --list-voices', default=False, action='store_true', dest='list_voices', help='List available voices and exit.') output_group.add_argument( '-P, --play', action='store_true', dest='play', help='Play the audio (requires pydub).') output_group.add_argument( '-o, --output', type=str, metavar='OUTPUT', dest='output', help='Save the audio to a file.') output_group.add_argument( '-O, --output-dir', type=str, metavar='OUTPUT_DIR', dest='output_dir', help='Save the audio to a directory as individual segments.') multi_output_group = parser.add_argument_group('multi-output options (requires --output-dir)') multi_output_group.add_argument( '--candidates', type=int, default=1, help='How many output candidates to produce per-voice. Note that only the first candidate is used in the combined output.') multi_output_group.add_argument( '--regenerate', type=str, default=None, help='Comma-separated list of clip numbers to re-generate.') multi_output_group.add_argument( '--skip-existing', action='store_true', help='Set to skip re-generating existing clips.') advanced_group = parser.add_argument_group('advanced options') advanced_group.add_argument( '--produce-debug-state', default=False, action='store_true', help='Whether or not to produce debug_states in current directory, which can aid in reproducing problems.') advanced_group.add_argument( '--seed', type=int, default=None, help='Random seed which can be used to reproduce results.') advanced_group.add_argument( '--models-dir', type=str, default=MODELS_DIR, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to ' '~/.cache/tortoise/.models, so this should only be specified if you have custom checkpoints.') advanced_group.add_argument( '--text-split', type=str, default=None, help='How big chunks to split the text into, in the format ,.') advanced_group.add_argument( '--disable-redaction', default=False, action='store_true', help='Normally text enclosed in brackets are automatically redacted from the spoken output ' '(but are still rendered by the model), this can be used for prompt engineering. ' 'Set this to disable this behavior.') advanced_group.add_argument( '--device', type=str, default=None, help='Device to use for inference.') advanced_group.add_argument( '--batch-size', type=int, default=None, help='Batch size to use for inference. If omitted, the batch size is set based on available GPU memory.') tuning_group = parser.add_argument_group('tuning options (overrides preset settings)') tuning_group.add_argument( '--num-autoregressive-samples', type=int, default=None, help='Number of samples taken from the autoregressive model, all of which are filtered using CLVP. ' 'As TorToiSe is a probabilistic model, more samples means a higher probability of creating something "great".') tuning_group.add_argument( '--temperature', type=float, default=None, help='The softmax temperature of the autoregressive model.') tuning_group.add_argument( '--length-penalty', type=float, default=None, help='A length penalty applied to the autoregressive decoder. Higher settings causes the model to produce more terse outputs.') tuning_group.add_argument( '--repetition-penalty', type=float, default=None, help='A penalty that prevents the autoregressive decoder from repeating itself during decoding. ' 'Can be used to reduce the incidence of long silences or "uhhhhhhs", etc.') tuning_group.add_argument( '--top-p', type=float, default=None, help='P value used in nucleus sampling. 0 to 1. Lower values mean the decoder produces more "likely" (aka boring) outputs.') tuning_group.add_argument( '--max-mel-tokens', type=int, default=None, help='Restricts the output length. 1 to 600. Each unit is 1/20 of a second.') tuning_group.add_argument( '--cvvp-amount', type=float, default=None, help='How much the CVVP model should influence the output.' 'Increasing this can in some cases reduce the likelihood of multiple speakers.') tuning_group.add_argument( '--diffusion-iterations', type=int, default=None, help='Number of diffusion steps to perform. More steps means the network has more chances to iteratively' 'refine the output, which should theoretically mean a higher quality output. ' 'Generally a value above 250 is not noticeably better, however.') tuning_group.add_argument( '--cond-free', type=bool, default=None, help='Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for ' 'each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output ' 'of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and ' 'dramatically improves realism.') tuning_group.add_argument( '--cond-free-k', type=float, default=None, help='Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf]. ' 'As cond_free_k increases, the output becomes dominated by the conditioning-free signal. ' 'Formula is: output=cond_present_output*(cond_free_k+1)-cond_absenct_output*cond_free_k') tuning_group.add_argument( '--diffusion-temperature', type=float, default=None, help='Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0 ' 'are the "mean" prediction of the diffusion network and will sound bland and smeared. ') usage_examples = f''' Examples: Read text using random voice and place it in a file: {parser.prog} -o hello.wav "Hello, how are you?" Read text from stdin and play it using the tom voice: echo "Say it like you mean it!" | {parser.prog} -P -v tom Read a text file using multiple voices and save the audio clips to a directory: {parser.prog} -O /tmp/tts-results -v tom,emma max_length: parser.error(f'--text-split: desired_length ({desired_length}) must be <= max_length ({max_length})') texts = split_and_recombine_text(text, desired_length, max_length) else: texts = split_and_recombine_text(text) if len(texts) == 0: parser.error('no text provided') if args.output_dir: os.makedirs(args.output_dir, exist_ok=True) else: if len(selected_voices) > 1: parser.error('cannot have multiple voices without --output-dir"') if args.candidates > 1: parser.error('cannot have multiple candidates without --output-dir"') # error out early if pydub isn't installed if args.play: try: import pydub import pydub.playback except ImportError: parser.error('--play requires pydub to be installed, which can be done with "pip install pydub"') seed = int(time.time()) if args.seed is None else args.seed if not args.quiet: print('Loading tts...') tts = TextToSpeech(models_dir=args.models_dir, enable_redaction=not args.disable_redaction, device=args.device, autoregressive_batch_size=args.batch_size) gen_settings = { 'use_deterministic_seed': seed, 'verbose': not args.quiet, 'k': args.candidates, 'preset': args.preset, } tuning_options = [ 'num_autoregressive_samples', 'temperature', 'length_penalty', 'repetition_penalty', 'top_p', 'max_mel_tokens', 'cvvp_amount', 'diffusion_iterations', 'cond_free', 'cond_free_k', 'diffusion_temperature'] for option in tuning_options: if getattr(args, option) is not None: gen_settings[option] = getattr(args, option) total_clips = len(texts) * len(selected_voices) regenerate_clips = [int(x) for x in args.regenerate.split(',')] if args.regenerate else None for voice_idx, voice in enumerate(selected_voices): audio_parts = [] voice_samples, conditioning_latents = load_voices(voice, extra_voice_dirs) for text_idx, text in enumerate(texts): clip_name = f'{"-".join(voice)}_{text_idx:02d}' if args.output_dir: first_clip = os.path.join(args.output_dir, f'{clip_name}_00.wav') if (args.skip_existing or (regenerate_clips and text_idx not in regenerate_clips)) and os.path.exists(first_clip): audio_parts.append(load_audio(first_clip, 24000)) if not args.quiet: print(f'Skipping {clip_name}') continue if not args.quiet: print(f'Rendering {clip_name} ({(voice_idx * len(texts) + text_idx + 1)} of {total_clips})...') print(' ' + text) gen = tts.tts_with_preset( text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, **gen_settings) gen = gen if args.candidates > 1 else [gen] for candidate_idx, audio in enumerate(gen): audio = audio.squeeze(0).cpu() if candidate_idx == 0: audio_parts.append(audio) if args.output_dir: filename = f'{clip_name}_{candidate_idx:02d}.wav' torchaudio.save(os.path.join(args.output_dir, filename), audio, 24000) audio = torch.cat(audio_parts, dim=-1) if args.output_dir: filename = f'{"-".join(voice)}_combined.wav' torchaudio.save(os.path.join(args.output_dir, filename), audio, 24000) elif args.output: filename = args.output if args.output else os.tmp torchaudio.save(args.output, audio, 24000) elif args.play: f = tempfile.NamedTemporaryFile(suffix='.wav', delete=True) torchaudio.save(f.name, audio, 24000) pydub.playback.play(pydub.AudioSegment.from_wav(f.name)) if args.produce_debug_state: os.makedirs('debug_states', exist_ok=True) dbg_state = (seed, texts, voice_samples, conditioning_latents, args) torch.save(dbg_state, os.path.join('debug_states', f'debug_{"-".join(voice)}.pth')) ================================================ FILE: setup.cfg ================================================ [metadata] description_file = README.md ================================================ FILE: setup.py ================================================ import setuptools with open("README.md", "r", encoding="utf-8") as fh: long_description = fh.read() setuptools.setup( name="tortoise-tts", packages=setuptools.find_packages(), version="3.0.0", author="James Betker", author_email="james@adamant.ai", description="A high quality multi-voice text-to-speech library", long_description=long_description, long_description_content_type="text/markdown", url="https://github.com/neonbjb/tortoise-tts", project_urls={}, scripts=[ 'scripts/tortoise_tts.py', ], include_package_data=True, install_requires=[ 'tqdm', 'rotary_embedding_torch', 'inflect', 'progressbar', 'einops', 'unidecode', 'scipy', 'librosa', 'transformers==4.31.0', 'tokenizers==0.14.0', 'scipy==1.13.1' # 'deepspeed==0.8.3', ], classifiers=[ "Programming Language :: Python :: 3", "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", ], python_requires=">=3.6", ) ================================================ FILE: tortoise/__init__.py ================================================ ================================================ FILE: tortoise/api.py ================================================ import os import random import uuid from time import time from urllib import request import torch import torch.nn.functional as F import progressbar import torchaudio from tortoise.models.classifier import AudioMiniEncoderWithClassifierHead from tortoise.models.diffusion_decoder import DiffusionTts from tortoise.models.autoregressive import UnifiedVoice from tqdm import tqdm from tortoise.models.arch_util import TorchMelSpectrogram from tortoise.models.clvp import CLVP from tortoise.models.cvvp import CVVP from tortoise.models.random_latent_generator import RandomLatentConverter from tortoise.models.vocoder import UnivNetGenerator from tortoise.utils.audio import wav_to_univnet_mel, denormalize_tacotron_mel, TacotronSTFT from tortoise.utils.diffusion import SpacedDiffusion, space_timesteps, get_named_beta_schedule from tortoise.utils.tokenizer import VoiceBpeTokenizer from tortoise.utils.wav2vec_alignment import Wav2VecAlignment from contextlib import contextmanager from huggingface_hub import hf_hub_download pbar = None DEFAULT_MODELS_DIR = os.path.join(os.path.expanduser('~'), '.cache', 'tortoise', 'models') MODELS_DIR = os.environ.get('TORTOISE_MODELS_DIR', DEFAULT_MODELS_DIR) MODELS = { 'autoregressive.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/autoregressive.pth', 'classifier.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/classifier.pth', 'clvp2.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/clvp2.pth', 'cvvp.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/cvvp.pth', 'diffusion_decoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/diffusion_decoder.pth', 'vocoder.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/vocoder.pth', 'rlg_auto.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_auto.pth', 'rlg_diffuser.pth': 'https://huggingface.co/jbetker/tortoise-tts-v2/resolve/main/.models/rlg_diffuser.pth', } def get_model_path(model_name, models_dir=MODELS_DIR): """ Get path to given model, download it if it doesn't exist. """ if model_name not in MODELS: raise ValueError(f'Model {model_name} not found in available models.') model_path = hf_hub_download(repo_id="Manmay/tortoise-tts", filename=model_name, cache_dir=models_dir) return model_path def pad_or_truncate(t, length): """ Utility function for forcing to have the specified sequence length, whether by clipping it or padding it with 0s. """ if t.shape[-1] == length: return t elif t.shape[-1] < length: return F.pad(t, (0, length-t.shape[-1])) else: return t[..., :length] def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired_diffusion_steps=200, cond_free=True, cond_free_k=1): """ Helper function to load a GaussianDiffusion instance configured for use as a vocoder. """ return SpacedDiffusion(use_timesteps=space_timesteps(trained_diffusion_steps, [desired_diffusion_steps]), model_mean_type='epsilon', model_var_type='learned_range', loss_type='mse', betas=get_named_beta_schedule('linear', trained_diffusion_steps), conditioning_free=cond_free, conditioning_free_k=cond_free_k) def format_conditioning(clip, cond_length=132300, device="cuda" if not torch.backends.mps.is_available() else 'mps'): """ Converts the given conditioning signal to a MEL spectrogram and clips it as expected by the models. """ gap = clip.shape[-1] - cond_length if gap < 0: clip = F.pad(clip, pad=(0, abs(gap))) elif gap > 0: rand_start = random.randint(0, gap) clip = clip[:, rand_start:rand_start + cond_length] mel_clip = TorchMelSpectrogram()(clip.unsqueeze(0)).squeeze(0) return mel_clip.unsqueeze(0).to(device) def fix_autoregressive_output(codes, stop_token, complain=True): """ This function performs some padding on coded audio that fixes a mismatch issue between what the diffusion model was trained on and what the autoregressive code generator creates (which has no padding or end). This is highly specific to the DVAE being used, so this particular coding will not necessarily work if used with a different DVAE. This can be inferred by feeding a audio clip padded with lots of zeros on the end through the DVAE and copying out the last few codes. Failing to do this padding will produce speech with a harsh end that sounds like "BLAH" or similar. """ # Strip off the autoregressive stop token and add padding. stop_token_indices = (codes == stop_token).nonzero() if len(stop_token_indices) == 0: if complain: print("No stop tokens found in one of the generated voice clips. This typically means the spoken audio is " "too long. In some cases, the output will still be good, though. Listen to it and if it is missing words, " "try breaking up your input text.") return codes else: codes[stop_token_indices] = 83 stm = stop_token_indices.min().item() codes[stm:] = 83 if stm - 3 < codes.shape[0]: codes[-3] = 45 codes[-2] = 45 codes[-1] = 248 return codes def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_latents, temperature=1, verbose=True): """ Uses the specified diffusion model to convert discrete codes into a spectrogram. """ with torch.no_grad(): output_seq_len = latents.shape[1] * 4 * 24000 // 22050 # This diffusion model converts from 22kHz spectrogram codes to a 24kHz spectrogram signal. output_shape = (latents.shape[0], 100, output_seq_len) precomputed_embeddings = diffusion_model.timestep_independent(latents, conditioning_latents, output_seq_len, False) noise = torch.randn(output_shape, device=latents.device) * temperature mel = diffuser.p_sample_loop(diffusion_model, output_shape, noise=noise, model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings}, progress=verbose) return denormalize_tacotron_mel(mel)[:,:,:output_seq_len] def classify_audio_clip(clip): """ Returns whether or not Tortoises' classifier thinks the given clip came from Tortoise. :param clip: torch tensor containing audio waveform data (get it from load_audio) :return: True if the clip was classified as coming from Tortoise and false if it was classified as real. """ classifier = AudioMiniEncoderWithClassifierHead(2, spec_dim=1, embedding_dim=512, depth=5, downsample_factor=4, resnet_blocks=2, attn_blocks=4, num_attn_heads=4, base_channels=32, dropout=0, kernel_size=5, distribute_zero_label=False) classifier.load_state_dict(torch.load(get_model_path('classifier.pth'), map_location=torch.device('cpu'))) clip = clip.cpu().unsqueeze(0) results = F.softmax(classifier(clip), dim=-1) return results[0][0] def pick_best_batch_size_for_gpu(): """ Tries to pick a batch size that will fit in your GPU. These sizes aren't guaranteed to work, but they should give you a good shot. """ if torch.cuda.is_available(): _, available = torch.cuda.mem_get_info() availableGb = available / (1024 ** 3) if availableGb > 14: return 16 elif availableGb > 10: return 8 elif availableGb > 7: return 4 if torch.backends.mps.is_available(): import psutil available = psutil.virtual_memory().total availableGb = available / (1024 ** 3) if availableGb > 14: return 16 elif availableGb > 10: return 8 elif availableGb > 7: return 4 return 1 class TextToSpeech: """ Main entry point into Tortoise. """ def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, enable_redaction=True, kv_cache=False, use_deepspeed=False, half=False, device=None, tokenizer_vocab_file=None, tokenizer_basic=False): """ Constructor :param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing GPU OOM errors. Larger numbers generates slightly faster. :param models_dir: Where model weights are stored. This should only be specified if you are providing your own models, otherwise use the defaults. :param enable_redaction: When true, text enclosed in brackets are automatically redacted from the spoken output (but are still rendered by the model). This can be used for prompt engineering. Default is true. :param device: Device to use when running the model. If omitted, the device will be automatically chosen. """ self.models_dir = models_dir self.autoregressive_batch_size = pick_best_batch_size_for_gpu() if autoregressive_batch_size is None else autoregressive_batch_size self.enable_redaction = enable_redaction if device is None: self.device = torch.device('cuda' if torch.cuda.is_available() else'cpu') else: self.device = torch.device(device) if torch.backends.mps.is_available(): self.device = torch.device('mps') if self.enable_redaction: self.aligner = Wav2VecAlignment() self.tokenizer = VoiceBpeTokenizer( vocab_file=tokenizer_vocab_file, use_basic_cleaners=tokenizer_basic, ) self.half = half if os.path.exists(f'{models_dir}/autoregressive.ptt'): # Assume this is a traced directory. self.autoregressive = torch.jit.load(f'{models_dir}/autoregressive.ptt') self.diffusion = torch.jit.load(f'{models_dir}/diffusion_decoder.ptt') else: self.autoregressive = UnifiedVoice(max_mel_tokens=604, max_text_tokens=402, max_conditioning_inputs=2, layers=30, model_dim=1024, heads=16, number_text_tokens=255, start_text_token=255, checkpointing=False, train_solo_embeddings=False).cpu().eval() self.autoregressive.load_state_dict(torch.load(get_model_path('autoregressive.pth', models_dir)), strict=False) self.autoregressive.post_init_gpt2_config(use_deepspeed=use_deepspeed, kv_cache=kv_cache, half=self.half) self.diffusion = DiffusionTts(model_channels=1024, num_layers=10, in_channels=100, out_channels=200, in_latent_channels=1024, in_tokens=8193, dropout=0, use_fp16=False, num_heads=16, layer_drop=0, unconditioned_percentage=0).cpu().eval() self.diffusion.load_state_dict(torch.load(get_model_path('diffusion_decoder.pth', models_dir))) self.clvp = CLVP(dim_text=768, dim_speech=768, dim_latent=768, num_text_tokens=256, text_enc_depth=20, text_seq_len=350, text_heads=12, num_speech_tokens=8192, speech_enc_depth=20, speech_heads=12, speech_seq_len=430, use_xformers=True).cpu().eval() self.clvp.load_state_dict(torch.load(get_model_path('clvp2.pth', models_dir))) self.cvvp = None # CVVP model is only loaded if used. self.vocoder = UnivNetGenerator().cpu() self.vocoder.load_state_dict(torch.load(get_model_path('vocoder.pth', models_dir), map_location=torch.device('cpu'))['model_g']) self.vocoder.eval(inference=True) self.stft = None # TacotronSTFT is only loaded if used. # Random latent generators (RLGs) are loaded lazily. self.rlg_auto = None self.rlg_diffusion = None @contextmanager def temporary_cuda(self, model): m = model.to(self.device) yield m m = model.cpu() def load_cvvp(self): """Load CVVP model.""" self.cvvp = CVVP(model_dim=512, transformer_heads=8, dropout=0, mel_codes=8192, conditioning_enc_depth=8, cond_mask_percentage=0, speech_enc_depth=8, speech_mask_percentage=0, latent_multiplier=1).cpu().eval() self.cvvp.load_state_dict(torch.load(get_model_path('cvvp.pth', self.models_dir))) def get_conditioning_latents(self, voice_samples, return_mels=False): """ Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent). These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic properties. :param voice_samples: List of 2 or more ~10 second reference clips, which should be torch tensors containing 22.05kHz waveform data. """ with torch.no_grad(): voice_samples = [v.to(self.device) for v in voice_samples] auto_conds = [] if not isinstance(voice_samples, list): voice_samples = [voice_samples] for vs in voice_samples: auto_conds.append(format_conditioning(vs, device=self.device)) auto_conds = torch.stack(auto_conds, dim=1) self.autoregressive = self.autoregressive.to(self.device) auto_latent = self.autoregressive.get_conditioning(auto_conds) self.autoregressive = self.autoregressive.cpu() if self.stft is None: # Initialize STFT self.stft = TacotronSTFT(1024, 256, 1024, 100, 24000, 0, 12000).to(self.device) diffusion_conds = [] for sample in voice_samples: # The diffuser operates at a sample rate of 24000 (except for the latent inputs) sample = torchaudio.functional.resample(sample, 22050, 24000) sample = pad_or_truncate(sample, 102400) cond_mel = wav_to_univnet_mel(sample.to(self.device), do_normalization=False, device=self.device, stft=self.stft) diffusion_conds.append(cond_mel) diffusion_conds = torch.stack(diffusion_conds, dim=1) self.diffusion = self.diffusion.to(self.device) diffusion_latent = self.diffusion.get_conditioning(diffusion_conds) self.diffusion = self.diffusion.cpu() if return_mels: return auto_latent, diffusion_latent, auto_conds, diffusion_conds else: return auto_latent, diffusion_latent def get_random_conditioning_latents(self): # Lazy-load the RLG models. if self.rlg_auto is None: self.rlg_auto = RandomLatentConverter(1024).eval() self.rlg_auto.load_state_dict(torch.load(get_model_path('rlg_auto.pth', self.models_dir), map_location=torch.device('cpu'))) self.rlg_diffusion = RandomLatentConverter(2048).eval() self.rlg_diffusion.load_state_dict(torch.load(get_model_path('rlg_diffuser.pth', self.models_dir), map_location=torch.device('cpu'))) with torch.no_grad(): return self.rlg_auto(torch.tensor([0.0])), self.rlg_diffusion(torch.tensor([0.0])) def tts_with_preset(self, text, preset='fast', **kwargs): """ Calls TTS with one of a set of preset generation parameters. Options: 'ultra_fast': Produces speech at a speed which belies the name of this repo. (Not really, but it's definitely fastest). 'fast': Decent quality speech at a decent inference rate. A good choice for mass inference. 'standard': Very good quality. This is generally about as good as you are going to get. 'high_quality': Use if you want the absolute best. This is not really worth the compute, though. """ # Use generally found best tuning knobs for generation. settings = {'temperature': .8, 'length_penalty': 1.0, 'repetition_penalty': 2.0, 'top_p': .8, 'cond_free_k': 2.0, 'diffusion_temperature': 1.0} # Presets are defined here. presets = { 'ultra_fast': {'num_autoregressive_samples': 16, 'diffusion_iterations': 30, 'cond_free': False}, 'fast': {'num_autoregressive_samples': 96, 'diffusion_iterations': 80}, 'standard': {'num_autoregressive_samples': 256, 'diffusion_iterations': 200}, 'high_quality': {'num_autoregressive_samples': 256, 'diffusion_iterations': 400}, } settings.update(presets[preset]) settings.update(kwargs) # allow overriding of preset settings with kwargs return self.tts(text, **settings) def tts(self, text, voice_samples=None, conditioning_latents=None, k=1, verbose=True, use_deterministic_seed=None, return_deterministic_state=False, # autoregressive generation parameters follow num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500, # CVVP parameters follow cvvp_amount=.0, # diffusion generation parameters follow diffusion_iterations=100, cond_free=True, cond_free_k=2, diffusion_temperature=1.0, **hf_generate_kwargs): """ Produces an audio clip of the given text being spoken with the given reference voice. :param text: Text to be spoken. :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data. :param conditioning_latents: A tuple of (autoregressive_conditioning_latent, diffusion_conditioning_latent), which can be provided in lieu of voice_samples. This is ignored unless voice_samples=None. Conditioning latents can be retrieved via get_conditioning_latents(). :param k: The number of returned clips. The most likely (as determined by Tortoises' CLVP model) clips are returned. :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true. ~~AUTOREGRESSIVE KNOBS~~ :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP. As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great". :param temperature: The softmax temperature of the autoregressive model. :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings causes the model to produce more terse outputs. :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc. :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs. :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second. :param typical_sampling: Turns typical sampling on or off. This sampling mode is discussed in this paper: https://arxiv.org/abs/2202.00666 I was interested in the premise, but the results were not as good as I was hoping. This is off by default, but could use some tuning. :param typical_mass: The typical_mass parameter from the typical_sampling algorithm. ~~CLVP-CVVP KNOBS~~ :param cvvp_amount: Controls the influence of the CVVP model in selecting the best output from the autoregressive model. [0,1]. Values closer to 1 mean the CVVP model is more important, 0 disables the CVVP model. ~~DIFFUSION KNOBS~~ :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better, however. :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and dramatically improves realism. :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf]. As cond_free_k increases, the output becomes dominated by the conditioning-free signal. Formula is: output=cond_present_output*(cond_free_k+1)-cond_absenct_output*cond_free_k :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared. ~~OTHER STUFF~~ :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation here: https://huggingface.co/docs/transformers/internal/generation_utils :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length. Sample rate is 24kHz. """ deterministic_seed = self.deterministic_state(seed=use_deterministic_seed) text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device) text_tokens = F.pad(text_tokens, (0, 1)) # This may not be necessary. assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.' auto_conds = None if voice_samples is not None: auto_conditioning, diffusion_conditioning, auto_conds, _ = self.get_conditioning_latents(voice_samples, return_mels=True) elif conditioning_latents is not None: auto_conditioning, diffusion_conditioning = conditioning_latents else: auto_conditioning, diffusion_conditioning = self.get_random_conditioning_latents() auto_conditioning = auto_conditioning.to(self.device) diffusion_conditioning = diffusion_conditioning.to(self.device) diffuser = load_discrete_vocoder_diffuser(desired_diffusion_steps=diffusion_iterations, cond_free=cond_free, cond_free_k=cond_free_k) with torch.no_grad(): samples = [] num_batches = num_autoregressive_samples // self.autoregressive_batch_size stop_mel_token = self.autoregressive.stop_mel_token calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output" if verbose: print("Generating autoregressive samples..") if not torch.backends.mps.is_available(): with self.temporary_cuda(self.autoregressive ) as autoregressive, torch.autocast(device_type="cuda", dtype=torch.float16, enabled=self.half): for b in tqdm(range(num_batches), disable=not verbose): codes = autoregressive.inference_speech(auto_conditioning, text_tokens, do_sample=True, top_p=top_p, temperature=temperature, num_return_sequences=self.autoregressive_batch_size, length_penalty=length_penalty, repetition_penalty=repetition_penalty, max_generate_length=max_mel_tokens, **hf_generate_kwargs) padding_needed = max_mel_tokens - codes.shape[1] codes = F.pad(codes, (0, padding_needed), value=stop_mel_token) samples.append(codes) else: with self.temporary_cuda(self.autoregressive) as autoregressive: for b in tqdm(range(num_batches), disable=not verbose): codes = autoregressive.inference_speech(auto_conditioning, text_tokens, do_sample=True, top_p=top_p, temperature=temperature, num_return_sequences=self.autoregressive_batch_size, length_penalty=length_penalty, repetition_penalty=repetition_penalty, max_generate_length=max_mel_tokens, **hf_generate_kwargs) padding_needed = max_mel_tokens - codes.shape[1] codes = F.pad(codes, (0, padding_needed), value=stop_mel_token) samples.append(codes) clip_results = [] if not torch.backends.mps.is_available(): with self.temporary_cuda(self.clvp) as clvp, torch.autocast( device_type="cuda" if not torch.backends.mps.is_available() else 'mps', dtype=torch.float16, enabled=self.half ): if cvvp_amount > 0: if self.cvvp is None: self.load_cvvp() self.cvvp = self.cvvp.to(self.device) if verbose: if self.cvvp is None: print("Computing best candidates using CLVP") else: print(f"Computing best candidates using CLVP {((1-cvvp_amount) * 100):2.0f}% and CVVP {(cvvp_amount * 100):2.0f}%") for batch in tqdm(samples, disable=not verbose): for i in range(batch.shape[0]): batch[i] = fix_autoregressive_output(batch[i], stop_mel_token) if cvvp_amount != 1: clvp_out = clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False) if auto_conds is not None and cvvp_amount > 0: cvvp_accumulator = 0 for cl in range(auto_conds.shape[1]): cvvp_accumulator = cvvp_accumulator + self.cvvp(auto_conds[:, cl].repeat(batch.shape[0], 1, 1), batch, return_loss=False) cvvp = cvvp_accumulator / auto_conds.shape[1] if cvvp_amount == 1: clip_results.append(cvvp) else: clip_results.append(cvvp * cvvp_amount + clvp_out * (1-cvvp_amount)) else: clip_results.append(clvp_out) clip_results = torch.cat(clip_results, dim=0) samples = torch.cat(samples, dim=0) best_results = samples[torch.topk(clip_results, k=k).indices] else: with self.temporary_cuda(self.clvp) as clvp: if cvvp_amount > 0: if self.cvvp is None: self.load_cvvp() self.cvvp = self.cvvp.to(self.device) if verbose: if self.cvvp is None: print("Computing best candidates using CLVP") else: print(f"Computing best candidates using CLVP {((1-cvvp_amount) * 100):2.0f}% and CVVP {(cvvp_amount * 100):2.0f}%") for batch in tqdm(samples, disable=not verbose): for i in range(batch.shape[0]): batch[i] = fix_autoregressive_output(batch[i], stop_mel_token) if cvvp_amount != 1: clvp_out = clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False) if auto_conds is not None and cvvp_amount > 0: cvvp_accumulator = 0 for cl in range(auto_conds.shape[1]): cvvp_accumulator = cvvp_accumulator + self.cvvp(auto_conds[:, cl].repeat(batch.shape[0], 1, 1), batch, return_loss=False) cvvp = cvvp_accumulator / auto_conds.shape[1] if cvvp_amount == 1: clip_results.append(cvvp) else: clip_results.append(cvvp * cvvp_amount + clvp_out * (1-cvvp_amount)) else: clip_results.append(clvp_out) clip_results = torch.cat(clip_results, dim=0) samples = torch.cat(samples, dim=0) best_results = samples[torch.topk(clip_results, k=k).indices] if self.cvvp is not None: self.cvvp = self.cvvp.cpu() del samples # The diffusion model actually wants the last hidden layer from the autoregressive model as conditioning # inputs. Re-produce those for the top results. This could be made more efficient by storing all of these # results, but will increase memory usage. if not torch.backends.mps.is_available(): with self.temporary_cuda( self.autoregressive ) as autoregressive, torch.autocast( device_type="cuda" if not torch.backends.mps.is_available() else 'mps', dtype=torch.float16, enabled=self.half ): best_latents = autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1), torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results, torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device), return_latent=True, clip_inputs=False) del auto_conditioning else: with self.temporary_cuda( self.autoregressive ) as autoregressive: best_latents = autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1), torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), best_results, torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device), return_latent=True, clip_inputs=False) del auto_conditioning if verbose: print("Transforming autoregressive outputs into audio..") wav_candidates = [] if not torch.backends.mps.is_available(): with self.temporary_cuda(self.diffusion) as diffusion, self.temporary_cuda( self.vocoder ) as vocoder: for b in range(best_results.shape[0]): codes = best_results[b].unsqueeze(0) latents = best_latents[b].unsqueeze(0) # Find the first occurrence of the "calm" token and trim the codes to that. ctokens = 0 for k in range(codes.shape[-1]): if codes[0, k] == calm_token: ctokens += 1 else: ctokens = 0 if ctokens > 8: # 8 tokens gives the diffusion model some "breathing room" to terminate speech. latents = latents[:, :k] break mel = do_spectrogram_diffusion(diffusion, diffuser, latents, diffusion_conditioning, temperature=diffusion_temperature, verbose=verbose) wav = vocoder.inference(mel) wav_candidates.append(wav.cpu()) else: diffusion, vocoder = self.diffusion, self.vocoder diffusion_conditioning = diffusion_conditioning.cpu() for b in range(best_results.shape[0]): codes = best_results[b].unsqueeze(0).cpu() latents = best_latents[b].unsqueeze(0).cpu() # Find the first occurrence of the "calm" token and trim the codes to that. ctokens = 0 for k in range(codes.shape[-1]): if codes[0, k] == calm_token: ctokens += 1 else: ctokens = 0 if ctokens > 8: # 8 tokens gives the diffusion model some "breathing room" to terminate speech. latents = latents[:, :k] break mel = do_spectrogram_diffusion(diffusion, diffuser, latents, diffusion_conditioning, temperature=diffusion_temperature, verbose=verbose) wav = vocoder.inference(mel) wav_candidates.append(wav.cpu()) def potentially_redact(clip, text): if self.enable_redaction: return self.aligner.redact(clip.squeeze(1), text).unsqueeze(1) return clip wav_candidates = [potentially_redact(wav_candidate, text) for wav_candidate in wav_candidates] if len(wav_candidates) > 1: res = wav_candidates else: res = wav_candidates[0] if return_deterministic_state: return res, (deterministic_seed, text, voice_samples, conditioning_latents) else: return res def deterministic_state(self, seed=None): """ Sets the random seeds that tortoise uses to the current time() and returns that seed so results can be reproduced. """ seed = int(time()) if seed is None else seed torch.manual_seed(seed) random.seed(seed) # Can't currently set this because of CUBLAS. TODO: potentially enable it if necessary. # torch.use_deterministic_algorithms(True) return seed ================================================ FILE: tortoise/api_fast.py ================================================ import os import random import uuid from time import time from urllib import request import torch import torch.nn.functional as F import progressbar import torchaudio import numpy as np from tortoise.models.classifier import AudioMiniEncoderWithClassifierHead from tortoise.models.diffusion_decoder import DiffusionTts from tortoise.models.autoregressive import UnifiedVoice from tqdm import tqdm from tortoise.models.arch_util import TorchMelSpectrogram from tortoise.models.clvp import CLVP from tortoise.models.cvvp import CVVP from tortoise.models.hifigan_decoder import HifiganGenerator from tortoise.models.random_latent_generator import RandomLatentConverter from tortoise.models.vocoder import UnivNetGenerator from tortoise.utils.audio import wav_to_univnet_mel, denormalize_tacotron_mel from tortoise.utils.diffusion import SpacedDiffusion, space_timesteps, get_named_beta_schedule from tortoise.utils.tokenizer import VoiceBpeTokenizer from tortoise.utils.wav2vec_alignment import Wav2VecAlignment from contextlib import contextmanager from tortoise.models.stream_generator import init_stream_support from huggingface_hub import hf_hub_download pbar = None init_stream_support() DEFAULT_MODELS_DIR = os.path.join(os.path.expanduser('~'), '.cache', 'tortoise', 'models') MODELS_DIR = os.environ.get('TORTOISE_MODELS_DIR', DEFAULT_MODELS_DIR) MODELS = { 'autoregressive.pth': 'https://huggingface.co/Manmay/tortoise-tts/resolve/main/autoregressive.pth', 'classifier.pth': 'https://huggingface.co/Manmay/tortoise-tts/resolve/main/classifier.pth', 'rlg_auto.pth': 'https://huggingface.co/Manmay/tortoise-tts/resolve/main/rlg_auto.pth', 'hifidecoder.pth': 'https://huggingface.co/Manmay/tortoise-tts/resolve/main/hifidecoder.pth', } def get_model_path(model_name, models_dir=MODELS_DIR): """ Get path to given model, download it if it doesn't exist. """ if model_name not in MODELS: raise ValueError(f'Model {model_name} not found in available models.') model_path = hf_hub_download(repo_id="Manmay/tortoise-tts", filename=model_name, cache_dir=models_dir) return model_path def pad_or_truncate(t, length): """ Utility function for forcing to have the specified sequence length, whether by clipping it or padding it with 0s. """ if t.shape[-1] == length: return t elif t.shape[-1] < length: return F.pad(t, (0, length-t.shape[-1])) else: return t[..., :length] def load_discrete_vocoder_diffuser(trained_diffusion_steps=4000, desired_diffusion_steps=200, cond_free=True, cond_free_k=1): """ Helper function to load a GaussianDiffusion instance configured for use as a vocoder. """ return SpacedDiffusion(use_timesteps=space_timesteps(trained_diffusion_steps, [desired_diffusion_steps]), model_mean_type='epsilon', model_var_type='learned_range', loss_type='mse', betas=get_named_beta_schedule('linear', trained_diffusion_steps), conditioning_free=cond_free, conditioning_free_k=cond_free_k) def format_conditioning(clip, cond_length=132300, device="cuda" if not torch.backends.mps.is_available() else 'mps'): """ Converts the given conditioning signal to a MEL spectrogram and clips it as expected by the models. """ gap = clip.shape[-1] - cond_length if gap < 0: clip = F.pad(clip, pad=(0, abs(gap))) elif gap > 0: rand_start = random.randint(0, gap) clip = clip[:, rand_start:rand_start + cond_length] mel_clip = TorchMelSpectrogram()(clip.unsqueeze(0)).squeeze(0) return mel_clip.unsqueeze(0).to(device) def fix_autoregressive_output(codes, stop_token, complain=True): """ This function performs some padding on coded audio that fixes a mismatch issue between what the diffusion model was trained on and what the autoregressive code generator creates (which has no padding or end). This is highly specific to the DVAE being used, so this particular coding will not necessarily work if used with a different DVAE. This can be inferred by feeding a audio clip padded with lots of zeros on the end through the DVAE and copying out the last few codes. Failing to do this padding will produce speech with a harsh end that sounds like "BLAH" or similar. """ # Strip off the autoregressive stop token and add padding. stop_token_indices = (codes == stop_token).nonzero() if len(stop_token_indices) == 0: if complain: print("No stop tokens found in one of the generated voice clips. This typically means the spoken audio is " "too long. In some cases, the output will still be good, though. Listen to it and if it is missing words, " "try breaking up your input text.") return codes else: codes[stop_token_indices] = 83 stm = stop_token_indices.min().item() codes[stm:] = 83 if stm - 3 < codes.shape[0]: codes[-3] = 45 codes[-2] = 45 codes[-1] = 248 return codes def do_spectrogram_diffusion(diffusion_model, diffuser, latents, conditioning_latents, temperature=1, verbose=True): """ Uses the specified diffusion model to convert discrete codes into a spectrogram. """ with torch.no_grad(): output_seq_len = latents.shape[1] * 4 * 24000 // 22050 # This diffusion model converts from 22kHz spectrogram codes to a 24kHz spectrogram signal. output_shape = (latents.shape[0], 100, output_seq_len) precomputed_embeddings = diffusion_model.timestep_independent(latents, conditioning_latents, output_seq_len, False) noise = torch.randn(output_shape, device=latents.device) * temperature mel = diffuser.p_sample_loop(diffusion_model, output_shape, noise=noise, model_kwargs={'precomputed_aligned_embeddings': precomputed_embeddings}, progress=verbose) return denormalize_tacotron_mel(mel)[:,:,:output_seq_len] def classify_audio_clip(clip): """ Returns whether or not Tortoises' classifier thinks the given clip came from Tortoise. :param clip: torch tensor containing audio waveform data (get it from load_audio) :return: True if the clip was classified as coming from Tortoise and false if it was classified as real. """ classifier = AudioMiniEncoderWithClassifierHead(2, spec_dim=1, embedding_dim=512, depth=5, downsample_factor=4, resnet_blocks=2, attn_blocks=4, num_attn_heads=4, base_channels=32, dropout=0, kernel_size=5, distribute_zero_label=False) classifier.load_state_dict(torch.load(get_model_path('classifier.pth'), map_location=torch.device('cpu'))) clip = clip.cpu().unsqueeze(0) results = F.softmax(classifier(clip), dim=-1) return results[0][0] def pick_best_batch_size_for_gpu(): """ Tries to pick a batch size that will fit in your GPU. These sizes aren't guaranteed to work, but they should give you a good shot. """ if torch.cuda.is_available(): _, available = torch.cuda.mem_get_info() availableGb = available / (1024 ** 3) if availableGb > 14: return 16 elif availableGb > 10: return 8 elif availableGb > 7: return 4 if torch.backends.mps.is_available(): import psutil available = psutil.virtual_memory().total availableGb = available / (1024 ** 3) if availableGb > 14: return 16 elif availableGb > 10: return 8 elif availableGb > 7: return 4 return 1 class TextToSpeech: """ Main entry point into Tortoise. """ def __init__(self, autoregressive_batch_size=None, models_dir=MODELS_DIR, enable_redaction=True, kv_cache=False, use_deepspeed=False, half=False, device=None, tokenizer_vocab_file=None, tokenizer_basic=False): """ Constructor :param autoregressive_batch_size: Specifies how many samples to generate per batch. Lower this if you are seeing GPU OOM errors. Larger numbers generates slightly faster. :param models_dir: Where model weights are stored. This should only be specified if you are providing your own models, otherwise use the defaults. :param enable_redaction: When true, text enclosed in brackets are automatically redacted from the spoken output (but are still rendered by the model). This can be used for prompt engineering. Default is true. :param device: Device to use when running the model. If omitted, the device will be automatically chosen. """ self.models_dir = models_dir self.autoregressive_batch_size = pick_best_batch_size_for_gpu() if autoregressive_batch_size is None else autoregressive_batch_size self.enable_redaction = enable_redaction if device is None: self.device = torch.device('cuda' if torch.cuda.is_available() else'cpu') else: self.device = torch.device(device) if torch.backends.mps.is_available(): self.device = torch.device('mps') if self.enable_redaction: self.aligner = Wav2VecAlignment() self.tokenizer = VoiceBpeTokenizer( vocab_file=tokenizer_vocab_file, use_basic_cleaners=tokenizer_basic, ) self.half = half if os.path.exists(f'{models_dir}/autoregressive.ptt'): # Assume this is a traced directory. self.autoregressive = torch.jit.load(f'{models_dir}/autoregressive.ptt') else: self.autoregressive = UnifiedVoice(max_mel_tokens=604, max_text_tokens=402, max_conditioning_inputs=2, layers=30, model_dim=1024, heads=16, number_text_tokens=255, start_text_token=255, checkpointing=False, train_solo_embeddings=False).to(self.device).eval() self.autoregressive.load_state_dict(torch.load(get_model_path('autoregressive.pth', models_dir)), strict=False) self.autoregressive.post_init_gpt2_config(use_deepspeed=use_deepspeed, kv_cache=kv_cache, half=self.half) self.hifi_decoder = HifiganGenerator(in_channels=1024, out_channels = 1, resblock_type = "1", resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], resblock_kernel_sizes = [3, 7, 11], upsample_kernel_sizes = [16, 16, 4, 4], upsample_initial_channel = 512, upsample_factors = [8, 8, 2, 2], cond_channels=1024).to(self.device).eval() hifi_model = torch.load(get_model_path('hifidecoder.pth')) self.hifi_decoder.load_state_dict(hifi_model, strict=False) # Random latent generators (RLGs) are loaded lazily. self.rlg_auto = None def get_conditioning_latents(self, voice_samples, return_mels=False): """ Transforms one or more voice_samples into a tuple (autoregressive_conditioning_latent, diffusion_conditioning_latent). These are expressive learned latents that encode aspects of the provided clips like voice, intonation, and acoustic properties. :param voice_samples: List of 2 or more ~10 second reference clips, which should be torch tensors containing 22.05kHz waveform data. """ with torch.no_grad(): voice_samples = [v.to(self.device) for v in voice_samples] auto_conds = [] if not isinstance(voice_samples, list): voice_samples = [voice_samples] for vs in voice_samples: auto_conds.append(format_conditioning(vs, device=self.device)) auto_conds = torch.stack(auto_conds, dim=1) auto_latent = self.autoregressive.get_conditioning(auto_conds) if return_mels: return auto_latent else: return auto_latent def get_random_conditioning_latents(self): # Lazy-load the RLG models. if self.rlg_auto is None: self.rlg_auto = RandomLatentConverter(1024).eval() self.rlg_auto.load_state_dict(torch.load(get_model_path('rlg_auto.pth', self.models_dir), map_location=torch.device('cpu'))) with torch.no_grad(): return self.rlg_auto(torch.tensor([0.0])) def tts_with_preset(self, text, preset='fast', **kwargs): """ Calls TTS with one of a set of preset generation parameters. Options: 'ultra_fast': Produces speech at a speed which belies the name of this repo. (Not really, but it's definitely fastest). 'fast': Decent quality speech at a decent inference rate. A good choice for mass inference. 'standard': Very good quality. This is generally about as good as you are going to get. 'high_quality': Use if you want the absolute best. This is not really worth the compute, though. """ # Use generally found best tuning knobs for generation. settings = {'temperature': .8, 'length_penalty': 1.0, 'repetition_penalty': 2.0, 'top_p': .8, 'cond_free_k': 2.0, 'diffusion_temperature': 1.0} # Presets are defined here. presets = { 'ultra_fast': {'num_autoregressive_samples': 1, 'diffusion_iterations': 10}, 'fast': {'num_autoregressive_samples': 32, 'diffusion_iterations': 50}, 'standard': {'num_autoregressive_samples': 256, 'diffusion_iterations': 200}, 'high_quality': {'num_autoregressive_samples': 256, 'diffusion_iterations': 400}, } settings.update(presets[preset]) settings.update(kwargs) # allow overriding of preset settings with kwargs for audio_frame in self.tts(text, **settings): yield audio_frame # taken from here https://github.com/coqui-ai/TTS/blob/b4c552a112fd4c5f3477f439882eb43c2e2ce85f/TTS/tts/models/xtts.py#L600 def handle_chunks(self, wav_gen, wav_gen_prev, wav_overlap, overlap_len): """Handle chunk formatting in streaming mode""" wav_chunk = wav_gen[:-overlap_len] if wav_gen_prev is not None: wav_chunk = wav_gen[(wav_gen_prev.shape[0] - overlap_len) : -overlap_len] if wav_overlap is not None: # cross fade the overlap section if overlap_len > len(wav_chunk): # wav_chunk is smaller than overlap_len, pass on last wav_gen if wav_gen_prev is not None: wav_chunk = wav_gen[(wav_gen_prev.shape[0] - overlap_len):] else: # not expecting will hit here as problem happens on last chunk wav_chunk = wav_gen[-overlap_len:] return wav_chunk, wav_gen, None else: crossfade_wav = wav_chunk[:overlap_len] crossfade_wav = crossfade_wav * torch.linspace(0.0, 1.0, overlap_len).to(crossfade_wav.device) wav_chunk[:overlap_len] = wav_overlap * torch.linspace(1.0, 0.0, overlap_len).to(wav_overlap.device) wav_chunk[:overlap_len] += crossfade_wav wav_overlap = wav_gen[-overlap_len:] wav_gen_prev = wav_gen return wav_chunk, wav_gen_prev, wav_overlap def tts_stream(self, text, voice_samples=None, conditioning_latents=None, k=1, verbose=True, use_deterministic_seed=None, return_deterministic_state=False, overlap_wav_len=1024, stream_chunk_size=40, # autoregressive generation parameters follow num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500, # CVVP parameters follow cvvp_amount=.0, # diffusion generation parameters follow diffusion_iterations=100, cond_free=True, cond_free_k=2, diffusion_temperature=1.0, **hf_generate_kwargs): """ Produces an audio clip of the given text being spoken with the given reference voice. :param text: Text to be spoken. :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data. :param conditioning_latents: A tuple of (autoregressive_conditioning_latent, diffusion_conditioning_latent), which can be provided in lieu of voice_samples. This is ignored unless voice_samples=None. Conditioning latents can be retrieved via get_conditioning_latents(). :param k: The number of returned clips. The most likely (as determined by Tortoises' CLVP model) clips are returned. :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true. ~~AUTOREGRESSIVE KNOBS~~ :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP. As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great". :param temperature: The softmax temperature of the autoregressive model. :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings causes the model to produce more terse outputs. :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc. :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs. :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second. ~~DIFFUSION KNOBS~~ :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better, however. :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and dramatically improves realism. :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf]. As cond_free_k increases, the output becomes dominated by the conditioning-free signal. Formula is: output=cond_present_output*(cond_free_k+1)-cond_absenct_output*cond_free_k :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared. ~~OTHER STUFF~~ :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation here: https://huggingface.co/docs/transformers/internal/generation_utils :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length. Sample rate is 24kHz. """ deterministic_seed = self.deterministic_state(seed=use_deterministic_seed) text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device) text_tokens = F.pad(text_tokens, (0, 1)) # This may not be necessary. assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.' if voice_samples is not None: auto_conditioning = self.get_conditioning_latents(voice_samples, return_mels=False) else: auto_conditioning = self.get_random_conditioning_latents() auto_conditioning = auto_conditioning.to(self.device) with torch.no_grad(): calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output" if verbose: print("Generating autoregressive samples..") with torch.autocast( device_type="cuda" , dtype=torch.float16, enabled=self.half ): fake_inputs = self.autoregressive.compute_embeddings( auto_conditioning, text_tokens, ) gpt_generator = self.autoregressive.get_generator( fake_inputs=fake_inputs, top_k=50, top_p=top_p, temperature=temperature, do_sample=True, num_beams=1, num_return_sequences=1, length_penalty=float(length_penalty), repetition_penalty=float(repetition_penalty), output_attentions=False, output_hidden_states=True, **hf_generate_kwargs, ) all_latents = [] codes_ = [] wav_gen_prev = None wav_overlap = None is_end = False first_buffer = 60 while not is_end: try: with torch.autocast( device_type="cuda", dtype=torch.float16, enabled=self.half ): codes, latent = next(gpt_generator) all_latents += [latent] codes_ += [codes] except StopIteration: is_end = True if is_end or (stream_chunk_size > 0 and len(codes_) >= max(stream_chunk_size, first_buffer)): first_buffer = 0 gpt_latents = torch.cat(all_latents, dim=0)[None, :] wav_gen = self.hifi_decoder.inference(gpt_latents.to(self.device), auto_conditioning) wav_gen = wav_gen.squeeze() wav_chunk, wav_gen_prev, wav_overlap = self.handle_chunks( wav_gen.squeeze(), wav_gen_prev, wav_overlap, overlap_wav_len ) codes_ = [] yield wav_chunk def tts(self, text, voice_samples=None, k=1, verbose=True, use_deterministic_seed=None, # autoregressive generation parameters follow num_autoregressive_samples=512, temperature=.8, length_penalty=1, repetition_penalty=2.0, top_p=.8, max_mel_tokens=500, # CVVP parameters follow cvvp_amount=.0, **hf_generate_kwargs): """ Produces an audio clip of the given text being spoken with the given reference voice. :param text: Text to be spoken. :param voice_samples: List of 2 or more ~10 second reference clips which should be torch tensors containing 22.05kHz waveform data. :param conditioning_latents: A tuple of (autoregressive_conditioning_latent, diffusion_conditioning_latent), which can be provided in lieu of voice_samples. This is ignored unless voice_samples=None. Conditioning latents can be retrieved via get_conditioning_latents(). :param k: The number of returned clips. The most likely (as determined by Tortoises' CLVP model) clips are returned. :param verbose: Whether or not to print log messages indicating the progress of creating a clip. Default=true. ~~AUTOREGRESSIVE KNOBS~~ :param num_autoregressive_samples: Number of samples taken from the autoregressive model, all of which are filtered using CLVP. As Tortoise is a probabilistic model, more samples means a higher probability of creating something "great". :param temperature: The softmax temperature of the autoregressive model. :param length_penalty: A length penalty applied to the autoregressive decoder. Higher settings causes the model to produce more terse outputs. :param repetition_penalty: A penalty that prevents the autoregressive decoder from repeating itself during decoding. Can be used to reduce the incidence of long silences or "uhhhhhhs", etc. :param top_p: P value used in nucleus sampling. (0,1]. Lower values mean the decoder produces more "likely" (aka boring) outputs. :param max_mel_tokens: Restricts the output length. (0,600] integer. Each unit is 1/20 of a second. ~~DIFFUSION KNOBS~~ :param diffusion_iterations: Number of diffusion steps to perform. [0,4000]. More steps means the network has more chances to iteratively refine the output, which should theoretically mean a higher quality output. Generally a value above 250 is not noticeably better, however. :param cond_free: Whether or not to perform conditioning-free diffusion. Conditioning-free diffusion performs two forward passes for each diffusion step: one with the outputs of the autoregressive model and one with no conditioning priors. The output of the two is blended according to the cond_free_k value below. Conditioning-free diffusion is the real deal, and dramatically improves realism. :param cond_free_k: Knob that determines how to balance the conditioning free signal with the conditioning-present signal. [0,inf]. As cond_free_k increases, the output becomes dominated by the conditioning-free signal. Formula is: output=cond_present_output*(cond_free_k+1)-cond_absenct_output*cond_free_k :param diffusion_temperature: Controls the variance of the noise fed into the diffusion model. [0,1]. Values at 0 are the "mean" prediction of the diffusion network and will sound bland and smeared. ~~OTHER STUFF~~ :param hf_generate_kwargs: The huggingface Transformers generate API is used for the autoregressive transformer. Extra keyword args fed to this function get forwarded directly to that API. Documentation here: https://huggingface.co/docs/transformers/internal/generation_utils :return: Generated audio clip(s) as a torch tensor. Shape 1,S if k=1 else, (k,1,S) where S is the sample length. Sample rate is 24kHz. """ deterministic_seed = self.deterministic_state(seed=use_deterministic_seed) text_tokens = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).to(self.device) text_tokens = F.pad(text_tokens, (0, 1)) # This may not be necessary. assert text_tokens.shape[-1] < 400, 'Too much text provided. Break the text up into separate segments and re-try inference.' if voice_samples is not None: auto_conditioning = self.get_conditioning_latents(voice_samples, return_mels=False) else: auto_conditioning = self.get_random_conditioning_latents() auto_conditioning = auto_conditioning.to(self.device) with torch.no_grad(): calm_token = 83 # This is the token for coding silence, which is fixed in place with "fix_autoregressive_output" if verbose: print("Generating autoregressive samples..") with torch.autocast( device_type="cuda" , dtype=torch.float16, enabled=self.half ): codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens, top_k=50, top_p=top_p, temperature=temperature, do_sample=True, num_beams=1, num_return_sequences=1, length_penalty=float(length_penalty), repetition_penalty=float(repetition_penalty), output_attentions=False, output_hidden_states=True, **hf_generate_kwargs) gpt_latents = self.autoregressive(auto_conditioning.repeat(k, 1), text_tokens.repeat(k, 1), torch.tensor([text_tokens.shape[-1]], device=text_tokens.device), codes, torch.tensor([codes.shape[-1]*self.autoregressive.mel_length_compression], device=text_tokens.device), return_latent=True, clip_inputs=False) if verbose: print("generating audio..") wav_gen = self.hifi_decoder.inference(gpt_latents.to(self.device), auto_conditioning) return wav_gen def deterministic_state(self, seed=None): """ Sets the random seeds that tortoise uses to the current time() and returns that seed so results can be reproduced. """ seed = int(time()) if seed is None else seed torch.manual_seed(seed) random.seed(seed) # Can't currently set this because of CUBLAS. TODO: potentially enable it if necessary. # torch.use_deterministic_algorithms(True) return seed ================================================ FILE: tortoise/data/got.txt ================================================ Chapter One Bran The morning had dawned clear and cold, with a crispness that hinted at the end of summer. They set forth at daybreak to see a man beheaded, twenty in all, and Bran rode among them, nervous with excitement. This was the first time he had been deemed old enough to go with his lord father and his brothers to see the king's justice done. It was the ninth year of summer, and the seventh of Bran's life. The man had been taken outside a small holdfast in the hills. Robb thought he was a wildling, his sword sworn to Mance Rayder, the King-beyond-the-Wall. It made Bran's skin prickle to think of it. He remembered the hearth tales Old Nan told them. The wildlings were cruel men, she said, slavers and slayers and thieves. They consorted with giants and ghouls, stole girl children in the dead of night, and drank blood from polished horns. And their women lay with the Others in the Long Night to sire terrible half-human children. But the man they found bound hand and foot to the holdfast wall awaiting the king's justice was old and scrawny, not much taller than Robb. He had lost both ears and a finger to frostbite, and he dressed all in black, the same as a brother of the Night's Watch, except that his furs were ragged and greasy. The breath of man and horse mingled, steaming, in the cold morning air as his lord father had the man cut down from the wall and dragged before them. Robb and Jon sat tall and still on their horses, with Bran between them on his pony, trying to seem older than seven, trying to pretend that he'd seen all this before. A faint wind blew through the holdfast gate. Over their heads flapped the banner of the Starks of Winterfell: a grey direwolf racing across an ice-white field. Bran's father sat solemnly on his horse, long brown hair stirring in the wind. His closely trimmed beard was shot with white, making him look older than his thirty-five years. He had a grim cast to his grey eyes this day, and he seemed not at all the man who would sit before the fire in the evening and talk softly of the age of heroes and the children of the forest. He had taken off Father's face, Bran thought, and donned the face of Lord Stark of Winterfell. There were questions asked and answers given there in the chill of morning, but afterward Bran could not recall much of what had been said. Finally his lord father gave a command, and two of his guardsmen dragged the ragged man to the ironwood stump in the center of the square. They forced his head down onto the hard black wood. Lord Eddard Stark dismounted and his ward Theon Greyjoy brought forth the sword. "Ice," that sword was called. It was as wide across as a man's hand, and taller even than Robb. The blade was Valyrian steel, spell-forged and dark as smoke. Nothing held an edge like Valyrian steel. His father peeled off his gloves and handed them to Jory Cassel, the captain of his household guard. He took hold of Ice with both hands and said, "In the name of Robert of the House Baratheon, the First of his Name, King of the Andals and the Rhoynar and the First Men, Lord of the Seven Kingdoms and Protector of the Realm, by the word of Eddard of the House Stark, Lord of Winterfell and Warden of the North, I do sentence you to die." He lifted the greatsword high above his head. Bran's bastard brother Jon Snow moved closer. "Keep the pony well in hand," he whispered. "And don't look away. Father will know if you do." Bran kept his pony well in hand, and did not look away. His father took off the man's head with a single sure stroke. Blood sprayed out across the snow, as red as surnmerwine. One of the horses reared and had to be restrained to keep from bolting. Bran could not take his eyes off the blood. The snows around the stump drank it eagerly, reddening as he watched. The head bounced off a thick root and rolled. It came up near Greyjoy's feet. Theon was a lean, dark youth of nineteen who found everything amusing. He laughed, put his boot on the head, and kicked it away. "Ass," Jon muttered, low enough so Greyjoy did not hear. He put a hand on Bran's shoulder, and Bran looked over at his bastard brother. "You did well," Jon told him solemnly. Jon was fourteen, an old hand at justice. It seemed colder on the long ride back to Winterfell, though the wind had died by then and the sun was higher in the sky. Bran rode with his brothers, well ahead of the main party, his pony struggling hard to keep up with their horses. "The deserter died bravely," Robb said. He was big and broad and growing every day, with his mother's coloring, the fair skin, red-brown hair, and blue eyes of the Tullys of Riverrun. "He had courage, at the least." "No," Jon Snow said quietly. "It was not courage. This one was dead of fear. You could see it in his eyes, Stark." Jon's eyes were a grey so dark they seemed almost black, but there was little they did not see. He was of an age with Robb, but they did not look alike. Jon was slender where Robb was muscular, dark where Robb was fair, graceful and quick where his half brother was strong and fast. Robb was not impressed. "The Others take his eyes," he swore. "He died well. Race you to the bridge?" "Done," Jon said, kicking his horse forward. Robb cursed and followed, and they galloped off down the trail, Robb laughing and hooting, Jon silent and intent. The hooves of their horses kicked up showers of snow as they went. Bran did not try to follow. His pony could not keep up. He had seen the ragged man's eyes, and he was thinking of them now. After a while, the sound of Robb's laughter receded, and the woods grew silent again. So deep in thought was he that he never heard the rest of the party until his father moved up to ride beside him. "Are you well, Bran?" he asked, not unkindly. "Yes, Father," Bran told him. He looked up. Wrapped in his furs and leathers, mounted on his great warhorse, his lord father loomed over him like a giant. "Robb says the man died bravely, but Jon says he was afraid." "What do you think?" his father asked. Bran thought about it. "Can a man still be brave if he's afraid?" "That is the only time a man can be brave," his father told him. "Do you understand why I did it?" "He was a wildling," Bran said. "They carry off women and sell them to the Others." His lord father smiled. "Old Nan has been telling you stories again. In truth, the man was an oathbreaker, a deserter from the Night's Watch. No man is more dangerous. The deserter knows his life is forfeit if he is taken, so he will not flinch from any crime, no matter how vile. But you mistake me. The question was not why the man had to die, but why I must do it." Bran had no answer for that. "King Robert has a headsman," he said, uncertainly. "He does," his father admitted. "As did the Targaryen kings before him. Yet our way is the older way. The blood of the First Men still flows in the veins of the Starks, and we hold to the belief that the man who passes the sentence should swing the sword. If you would take a man's life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die. "One day, Bran, you will be Robb's bannerman, holding a keep of your own for your brother and your king, and justice will fall to you. When that day comes, you must take no pleasure in the task, but neither must you look away. A ruler who hides behind paid executioners soon forgets what death is." That was when Jon reappeared on the crest of the hill before them. He waved and shouted down at them. "Father, Bran, come quickly, see what Robb has found!" Then he was gone again. Jory rode up beside them. "Trouble, my lord?" "Beyond a doubt," his lord father said. "Come, let us see what mischief my sons have rooted out now." He sent his horse into a trot. Jory and Bran and the rest came after. They found Robb on the riverbank north of the bridge, with Jon still mounted beside him. The late summer snows had been heavy this moonturn. Robb stood knee-deep in white, his hood pulled back so the sun shone in his hair. He was cradling something in his arm, while the boys talked in hushed, excited voices. The riders picked their way carefully through the drifts, groping for solid footing on the hidden, uneven ground . Jory Cassel and Theon Greyjoy were the first to reach the boys. Greyjoy was laughing and joking as he rode. Bran heard the breath go out of him. "Gods!" he exclaimed, struggling to keep control of his horse as he reached for his sword. Jory's sword was already out. "Robb, get away from it!" he called as his horse reared under him. Robb grinned and looked up from the bundle in his arms. "She can't hurt you," he said. "She's dead, Jory." Bran was afire with curiosity by then. He would have spurred the pony faster, but his father made them dismount beside the bridge and approach on foot. Bran jumped off and ran. By then Jon, Jory, and Theon Greyjoy had all dismounted as well. "What in the seven hells is it?" Greyjoy was saying. "A wolf," Robb told him. "A freak," Greyjoy said. "Look at the size of it." Bran's heart was thumping in his chest as he pushed through a waist-high drift to his brothers' side. Half-buried in bloodstained snow, a huge dark shape slumped in death. Ice had formed in its shaggy grey fur, and the faint smell of corruption clung to it like a woman's perfume. Bran glimpsed blind eyes crawling with maggots, a wide mouth full of yellowed teeth. But it was the size of it that made him gasp. It was bigger than his pony, twice the size of the largest hound in his father's kennel. "It's no freak," Jon said calmly. "That's a direwolf. They grow larger than the other kind." Theon Greyjoy said, "There's not been a direwolf sighted south of the Wall in two hundred years." "I see one now," Jon replied. Bran tore his eyes away from the monster. That was when he noticed the bundle in Robb's arms. He gave a cry of delight and moved closer. The pup was a tiny ball of grey-black fur, its eyes still closed. It nuzzled blindly against Robb's chest as he cradled it, searching for milk among his leathers, making a sad little whimpery sound. Bran reached out hesitantly. "Go on," Robb told him. "You can touch him." Bran gave the pup a quick nervous stroke, then turned as Jon said, "Here you go." His half brother put a second pup into his arms. "There are five of them." Bran sat down in the snow and hugged the wolf pup to his face. Its fur was soft and warm against his cheek. "Direwolves loose in the realm, after so many years," muttered Hullen, the master of horse. "I like it not." "It is a sign," Jory said. Father frowned. "This is only a dead animal, Jory," he said. Yet he seemed troubled. Snow crunched under his boots as he moved around the body. "Do we know what killed her?" "There's something in the throat," Robb told him, proud to have found the answer before his father even asked. "There, just under the jaw." His father knelt and groped under the beast's head with his hand. He gave a yank and held it up for all to see. A foot of shattered antler, tines snapped off, all wet with blood. A sudden silence descended over the party. The men looked at the antler uneasily, and no one dared to speak. Even Bran could sense their fear, though he did not understand. His father tossed the antler to the side and cleansed his hands in the snow. "I'm surprised she lived long enough to whelp," he said. His voice broke the spell. "Maybe she didn't," Jory said. "I've heard tales . . . maybe the bitch was already dead when the pups came." "Born with the dead," another man put in. "Worse luck." "No matter," said Hullen. "They be dead soon enough too." Bran gave a wordless cry of dismay. "The sooner the better," Theon Greyjoy agreed. He drew his sword. "Give the beast here, Bran." The little thing squirmed against him, as if it heard and understood. "No!" Bran cried out fiercely. "It's mine." "Put away your sword, Greyjoy," Robb said. For a moment he sounded as commanding as their father, like the lord he would someday be. "We will keep these pups." "You cannot do that, boy," said Harwin, who was Hullen's son. "It be a mercy to kill them," Hullen said. Bran looked to his lord father for rescue, but got only a frown, a furrowed brow. "Hullen speaks truly, son. Better a swift death than a hard one from cold and starvation." "No!" He could feel tears welling in his eyes, and he looked away. He did not want to cry in front of his father. Robb resisted stubbornly. "Ser Rodrik's red bitch whelped again last week," he said. "It was a small litter, only two live pups. She'll have milk enough." "She'll rip them apart when they try to nurse." "Lord Stark," Jon said. It was strange to hear him call Father that, so formal. Bran looked at him with desperate hope. "There are five pups," he told Father. "Three male, two female." "What of it, Jon?" "You have five trueborn children," Jon said. "Three sons, two daughters. The direwolf is the sigil of your House. Your children were meant to have these pups, my lord." Bran saw his father's face change, saw the other men exchange glances. He loved Jon with all his heart at that moment. Even at seven, Bran understood what his brother had done. The count had come right only because Jon had omitted himself. He had included the girls, included even Rickon, the baby, but not the bastard who bore the surname Snow, the name that custom decreed be given to all those in the north unlucky enough to be born with no name of their own. Their father understood as well. "You want no pup for yourself, Jon?" he asked softly. "The direwolf graces the banners of House Stark," Jon pointed out. "I am no Stark, Father." Their lord father regarded Jon thoughtfully. Robb rushed into the silence he left. "I will nurse him myself, Father," he promised. "I will soak a towel with warm milk, and give him suck from that." "Me too!" Bran echoed. The lord weighed his sons long and carefully with his eyes. "Easy to say, and harder to do. I will not have you wasting the servants' time with this. If you want these pups, you will feed them yourselves. Is that understood?" Bran nodded eagerly. The pup squirmed in his grasp, licked at his face with a warm tongue. "You must train them as well," their father said. "You must train them. The kennelmaster will have nothing to do with these monsters, I promise you that. And the gods help you if you neglect them, or brutalize them, or train them badly. These are not dogs to beg for treats and slink off at a kick. A direwolf will rip a man's arm off his shoulder as easily as a dog will kill a rat. Are you sure you want this?" "Yes, Father," Bran said. "Yes," Robb agreed. "The pups may die anyway, despite all you do." "They won't die," Robb said. "We won't let them die." "Keep them, then. Jory, Desmond, gather up the other pups. It's time we were back to Winterfell." It was not until they were mounted and on their way that Bran allowed himself to taste the sweet air of victory. By then, his pup was snuggled inside his leathers, warm against him, safe for the long ride home. Bran was wondering what to name him. Halfway across the bridge, Jon pulled up suddenly. "What is it, Jon?" their lord father asked. "Can't you hear it?" Bran could hear the wind in the trees, the clatter of their hooves on the ironwood planks, the whimpering of his hungry pup, but Jon was listening to something else. "There," Jon said. He swung his horse around and galloped back across the bridge. They watched him dismount where the direwolf lay dead in the snow, watched him kneel. A moment later he was riding back to them, smiling. "He must have crawled away from the others," Jon said. "Or been driven away," their father said, looking at the sixth pup. His fur was white, where the rest of the litter was grey. His eyes were as red as the blood of the ragged man who had died that morning. Bran thought it curious that this pup alone would have opened his eyes while the others were still blind. "An albino," Theon Greyjoy said with wry amusement. "This one will die even faster than the others." Jon Snow gave his father's ward a long, chilling look. "I think not, Greyjoy," he said. "This one belongs to me." ================================================ FILE: tortoise/data/layman.txt ================================================ ================================================ FILE: tortoise/data/riding_hood.txt ================================================ Once upon a time there lived in a certain village a little country girl, the prettiest creature who was ever seen. Her mother was excessively fond of her; and her grandmother doted on her still more. This good woman had a little red riding hood made for her. It suited the girl so extremely well that everybody called her Little Red Riding Hood. One day her mother, having made some cakes, said to her, "Go, my dear, and see how your grandmother is doing, for I hear she has been very ill. Take her a cake, and this little pot of butter." Little Red Riding Hood set out immediately to go to her grandmother, who lived in another village. As she was going through the wood, she met with a wolf, who had a very great mind to eat her up, but he dared not, because of some woodcutters working nearby in the forest. He asked her where she was going. The poor child, who did not know that it was dangerous to stay and talk to a wolf, said to him, "I am going to see my grandmother and carry her a cake and a little pot of butter from my mother." "Does she live far off?" said the wolf "Oh I say," answered Little Red Riding Hood; "it is beyond that mill you see there, at the first house in the village." "Well," said the wolf, "and I'll go and see her too. I'll go this way and go you that, and we shall see who will be there first." The wolf ran as fast as he could, taking the shortest path, and the little girl took a roundabout way, entertaining herself by gathering nuts, running after butterflies, and gathering bouquets of little flowers. It was not long before the wolf arrived at the old woman's house. He knocked at the door: tap, tap. "Who's there?" "Your grandchild, Little Red Riding Hood," replied the wolf, counterfeiting her voice; "who has brought you a cake and a little pot of butter sent you by mother." The good grandmother, who was in bed, because she was somewhat ill, cried out, "Pull the bobbin, and the latch will go up." The wolf pulled the bobbin, and the door opened, and then he immediately fell upon the good woman and ate her up in a moment, for it been more than three days since he had eaten. He then shut the door and got into the grandmother's bed, expecting Little Red Riding Hood, who came some time afterwards and knocked at the door: tap, tap. "Who's there?" Little Red Riding Hood, hearing the big voice of the wolf, was at first afraid; but believing her grandmother had a cold and was hoarse, answered, "It is your grandchild Little Red Riding Hood, who has brought you a cake and a little pot of butter mother sends you." The wolf cried out to her, softening his voice as much as he could, "Pull the bobbin, and the latch will go up." Little Red Riding Hood pulled the bobbin, and the door opened. The wolf, seeing her come in, said to her, hiding himself under the bedclothes, "Put the cake and the little pot of butter upon the stool, and come get into bed with me." Little Red Riding Hood took off her clothes and got into bed. She was greatly amazed to see how her grandmother looked in her nightclothes, and said to her, "Grandmother, what big arms you have!" "All the better to hug you with, my dear." "Grandmother, what big legs you have!" "All the better to run with, my child." "Grandmother, what big ears you have!" "All the better to hear with, my child." "Grandmother, what big eyes you have!" "All the better to see with, my child." "Grandmother, what big teeth you have got!" "All the better to eat you up with." And, saying these words, this wicked wolf fell upon Little Red Riding Hood, and ate her all up. ================================================ FILE: tortoise/data/seal_copypasta.txt ================================================ What the fuck did you just fucking say about me, you little bitch? I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al kayda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire U S armed forces. You are nothing to me but just another target. I will wipe you the fuck out with precision the likes of which has never been seen before on this Earth, mark my fucking words. You think you can get away with saying that shit to me over the Internet? Think again, fucker. As we speak I am contacting my secret network of spies across the U S A and your IP is being traced right now so you better prepare for the storm, maggot. The storm that wipes out the pathetic little thing you call your life. You're fucking dead, kid. I can be anywhere, anytime, and I can kill you in over seven hundred ways, and that's just with my bare hands. Not only am I extensively trained in unarmed combat, but I have access to the entire arsenal of the United States Marine Corps and I will use it to its full extent to wipe your miserable ass off the face of the continent, you little shit. If only you could have known what unholy retribution your little "clever" comment was about to bring down upon you, maybe you would have held your fucking tongue. But you couldn't, you didn't, and now you're paying the price, you goddamn idiot. I will shit fury all over you and you will drown in it. You're fucking dead, kiddo. ================================================ FILE: tortoise/data/tokenizer.json ================================================ {"version":"1.0","truncation":null,"padding":null,"added_tokens":[{"id":0,"special":true,"content":"[STOP]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":1,"special":true,"content":"[UNK]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false},{"id":2,"special":true,"content":"[SPACE]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false}],"normalizer":null,"pre_tokenizer":{"type":"Whitespace"},"post_processor":null,"decoder":null,"model":{"type":"BPE","dropout":null,"unk_token":"[UNK]","continuing_subword_prefix":null,"end_of_word_suffix":null,"fuse_unk":false,"vocab":{"[STOP]":0,"[UNK]":1,"[SPACE]":2,"!":3,"'":4,"(":5,")":6,",":7,"-":8,".":9,"/":10,":":11,";":12,"?":13,"a":14,"b":15,"c":16,"d":17,"e":18,"f":19,"g":20,"h":21,"i":22,"j":23,"k":24,"l":25,"m":26,"n":27,"o":28,"p":29,"q":30,"r":31,"s":32,"t":33,"u":34,"v":35,"w":36,"x":37,"y":38,"z":39,"th":40,"in":41,"the":42,"an":43,"er":44,"ou":45,"re":46,"on":47,"at":48,"ed":49,"en":50,"to":51,"ing":52,"and":53,"is":54,"as":55,"al":56,"or":57,"of":58,"ar":59,"it":60,"es":61,"he":62,"st":63,"le":64,"om":65,"se":66,"be":67,"ad":68,"ow":69,"ly":70,"ch":71,"wh":72,"that":73,"you":74,"li":75,"ve":76,"ac":77,"ti":78,"ld":79,"me":80,"was":81,"gh":82,"id":83,"ll":84,"wi":85,"ent":86,"for":87,"ay":88,"ro":89,"ver":90,"ic":91,"her":92,"ke":93,"his":94,"no":95,"ut":96,"un":97,"ir":98,"lo":99,"we":100,"ri":101,"ha":102,"with":103,"ght":104,"out":105,"im":106,"ion":107,"all":108,"ab":109,"one":110,"ne":111,"ge":112,"ould":113,"ter":114,"mo":115,"had":116,"ce":117,"she":118,"go":119,"sh":120,"ur":121,"am":122,"so":123,"pe":124,"my":125,"de":126,"are":127,"but":128,"ome":129,"fr":130,"ther":131,"fe":132,"su":133,"do":134,"con":135,"te":136,"ain":137,"ere":138,"po":139,"if":140,"they":141,"us":142,"ag":143,"tr":144,"now":145,"oun":146,"this":147,"have":148,"not":149,"sa":150,"il":151,"up":152,"thing":153,"from":154,"ap":155,"him":156,"ack":157,"ation":158,"ant":159,"our":160,"op":161,"like":162,"ust":163,"ess":164,"bo":165,"ok":166,"ul":167,"ind":168,"ex":169,"com":170,"some":171,"there":172,"ers":173,"co":174,"res":175,"man":176,"ard":177,"pl":178,"wor":179,"way":180,"tion":181,"fo":182,"ca":183,"were":184,"by":185,"ate":186,"pro":187,"ted":188,"ound":189,"own":190,"would":191,"ts":192,"what":193,"qu":194,"ally":195,"ight":196,"ck":197,"gr":198,"when":199,"ven":200,"can":201,"ough":202,"ine":203,"end":204,"per":205,"ous":206,"od":207,"ide":208,"know":209,"ty":210,"very":211,"si":212,"ak":213,"who":214,"about":215,"ill":216,"them":217,"est":218,"red":219,"ye":220,"could":221,"ong":222,"your":223,"their":224,"em":225,"just":226,"other":227,"into":228,"any":229,"whi":230,"um":231,"tw":232,"ast":233,"der":234,"did":235,"ie":236,"been":237,"ace":238,"ink":239,"ity":240,"back":241,"ting":242,"br":243,"more":244,"ake":245,"pp":246,"then":247,"sp":248,"el":249,"use":250,"bl":251,"said":252,"over":253,"get":254},"merges":["t h","i n","th e","a n","e r","o u","r e","o n","a t","e d","e n","t o","in g","an d","i s","a s","a l","o r","o f","a r","i t","e s","h e","s t","l e","o m","s e","b e","a d","o w","l y","c h","w h","th at","y ou","l i","v e","a c","t i","l d","m e","w as","g h","i d","l l","w i","en t","f or","a y","r o","v er","i c","h er","k e","h is","n o","u t","u n","i r","l o","w e","r i","h a","wi th","gh t","ou t","i m","i on","al l","a b","on e","n e","g e","ou ld","t er","m o","h ad","c e","s he","g o","s h","u r","a m","s o","p e","m y","d e","a re","b ut","om e","f r","the r","f e","s u","d o","c on","t e","a in","er e","p o","i f","the y","u s","a g","t r","n ow","ou n","th is","ha ve","no t","s a","i l","u p","th ing","fr om","a p","h im","ac k","at ion","an t","ou r","o p","li ke","u st","es s","b o","o k","u l","in d","e x","c om","s ome","the re","er s","c o","re s","m an","ar d","p l","w or","w ay","ti on","f o","c a","w ere","b y","at e","p ro","t ed","oun d","ow n","w ould","t s","wh at","q u","al ly","i ght","c k","g r","wh en","v en","c an","ou gh","in e","en d","p er","ou s","o d","id e","k now","t y","ver y","s i","a k","wh o","ab out","i ll","the m","es t","re d","y e","c ould","on g","you r","the ir","e m","j ust","o ther","in to","an y","wh i","u m","t w","as t","d er","d id","i e","be en","ac e","in k","it y","b ack","t ing","b r","mo re","a ke","p p","the n","s p","e l","u se","b l","sa id","o ver","ge t"]}} ================================================ FILE: tortoise/do_tts.py ================================================ import argparse import os import torch import torchaudio from api import TextToSpeech, MODELS_DIR from utils.audio import load_voices if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--text', type=str, help='Text to speak.', default="The expressiveness of autoregressive transformers is literally nuts! I absolutely adore them.") parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) ' 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='random') parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='fast') parser.add_argument('--use_deepspeed', type=str, help='Use deepspeed for speed bump.', default=False) parser.add_argument('--kv_cache', type=bool, help='If you disable this please wait for a long a time to get the output', default=True) parser.add_argument('--half', type=bool, help="float16(half) precision inference if True it's faster and take less vram and ram", default=True) parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/') parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this' 'should only be specified if you have custom checkpoints.', default=MODELS_DIR) parser.add_argument('--candidates', type=int, help='How many output candidates to produce per-voice.', default=3) parser.add_argument('--seed', type=int, help='Random seed which can be used to reproduce results.', default=None) parser.add_argument('--produce_debug_state', type=bool, help='Whether or not to produce debug_state.pth, which can aid in reproducing problems. Defaults to true.', default=True) parser.add_argument('--cvvp_amount', type=float, help='How much the CVVP model should influence the output.' 'Increasing this can in some cases reduce the likelihood of multiple speakers. Defaults to 0 (disabled)', default=.0) args = parser.parse_args() if torch.backends.mps.is_available(): args.use_deepspeed = False os.makedirs(args.output_path, exist_ok=True) tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed, kv_cache=args.kv_cache, half=args.half) selected_voices = args.voice.split(',') for k, selected_voice in enumerate(selected_voices): if '&' in selected_voice: voice_sel = selected_voice.split('&') else: voice_sel = [selected_voice] voice_samples, conditioning_latents = load_voices(voice_sel) gen, dbg_state = tts.tts_with_preset(args.text, k=args.candidates, voice_samples=voice_samples, conditioning_latents=conditioning_latents, preset=args.preset, use_deterministic_seed=args.seed, return_deterministic_state=True, cvvp_amount=args.cvvp_amount) if isinstance(gen, list): for j, g in enumerate(gen): torchaudio.save(os.path.join(args.output_path, f'{selected_voice}_{k}_{j}.wav'), g.squeeze(0).cpu(), 24000) else: torchaudio.save(os.path.join(args.output_path, f'{selected_voice}_{k}.wav'), gen.squeeze(0).cpu(), 24000) if args.produce_debug_state: os.makedirs('debug_states', exist_ok=True) torch.save(dbg_state, f'debug_states/do_tts_debug_{selected_voice}.pth') ================================================ FILE: tortoise/eval.py ================================================ import argparse import os import torchaudio from api import TextToSpeech from tortoise.utils.audio import load_audio if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--eval_path', type=str, help='Path to TSV test file', default="D:\\tmp\\tortoise-tts-eval\\test.tsv") parser.add_argument('--output_path', type=str, help='Where to put results', default="D:\\tmp\\tortoise-tts-eval\\baseline") parser.add_argument('--preset', type=str, help='Rendering preset.', default="standard") args = parser.parse_args() os.makedirs(args.output_path, exist_ok=True) tts = TextToSpeech() with open(args.eval_path, 'r', encoding='utf-8') as f: lines = f.readlines() for line in lines: text, real = line.strip().split('\t') conds = [load_audio(real, 22050)] gen = tts.tts_with_preset(text, voice_samples=conds, conditioning_latents=None, preset=args.preset) torchaudio.save(os.path.join(args.output_path, os.path.basename(real)), gen.squeeze(0).cpu(), 24000) ================================================ FILE: tortoise/get_conditioning_latents.py ================================================ import argparse import os import torch from api import TextToSpeech from tortoise.utils.audio import load_audio, get_voices """ Dumps the conditioning latents for the specified voice to disk. These are expressive latents which can be used for other ML models, or can be augmented manually and fed back into Tortoise to affect vocal qualities. """ if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--voice', type=str, help='Selects the voice to convert to conditioning latents', default='pat2') parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='../results/conditioning_latents') args = parser.parse_args() os.makedirs(args.output_path, exist_ok=True) tts = TextToSpeech() voices = get_voices() selected_voices = args.voice.split(',') for voice in selected_voices: cond_paths = voices[voice] conds = [] for cond_path in cond_paths: c = load_audio(cond_path, 22050) conds.append(c) conditioning_latents = tts.get_conditioning_latents(conds) torch.save(conditioning_latents, os.path.join(args.output_path, f'{voice}.pth')) ================================================ FILE: tortoise/is_this_from_tortoise.py ================================================ import argparse from api import classify_audio_clip from tortoise.utils.audio import load_audio if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--clip', type=str, help='Path to an audio clip to classify.', default="../examples/favorite_riding_hood.mp3") args = parser.parse_args() clip = load_audio(args.clip, 24000) clip = clip[:, :220000] prob = classify_audio_clip(clip) print(f"This classifier thinks there is a {prob*100}% chance that this clip was generated from Tortoise.") ================================================ FILE: tortoise/models/__init__.py ================================================ ================================================ FILE: tortoise/models/arch_util.py ================================================ import os import functools import math import torch import torch.nn as nn import torch.nn.functional as F import torchaudio from tortoise.models.xtransformers import ContinuousTransformerWrapper, RelativePositionBias def zero_module(module): """ Zero out the parameters of a module and return it. """ for p in module.parameters(): p.detach().zero_() return module class GroupNorm32(nn.GroupNorm): def forward(self, x): return super().forward(x.float()).type(x.dtype) def normalization(channels): """ Make a standard normalization layer. :param channels: number of input channels. :return: an nn.Module for normalization. """ groups = 32 if channels <= 16: groups = 8 elif channels <= 64: groups = 16 while channels % groups != 0: groups = int(groups / 2) assert groups > 2 return GroupNorm32(groups, channels) class QKVAttentionLegacy(nn.Module): """ A module which performs QKV attention. Matches legacy QKVAttention + input/output heads shaping """ def __init__(self, n_heads): super().__init__() self.n_heads = n_heads def forward(self, qkv, mask=None, rel_pos=None): """ Apply QKV attention. :param qkv: an [N x (H * 3 * C) x T] tensor of Qs, Ks, and Vs. :return: an [N x (H * C) x T] tensor after attention. """ bs, width, length = qkv.shape assert width % (3 * self.n_heads) == 0 ch = width // (3 * self.n_heads) q, k, v = qkv.reshape(bs * self.n_heads, ch * 3, length).split(ch, dim=1) scale = 1 / math.sqrt(math.sqrt(ch)) weight = torch.einsum( "bct,bcs->bts", q * scale, k * scale ) # More stable with f16 than dividing afterwards if rel_pos is not None: weight = rel_pos(weight.reshape(bs, self.n_heads, weight.shape[-2], weight.shape[-1])).reshape(bs * self.n_heads, weight.shape[-2], weight.shape[-1]) weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype) if mask is not None: # The proper way to do this is to mask before the softmax using -inf, but that doesn't work properly on CPUs. mask = mask.repeat(self.n_heads, 1).unsqueeze(1) weight = weight * mask a = torch.einsum("bts,bcs->bct", weight, v) return a.reshape(bs, -1, length) class AttentionBlock(nn.Module): """ An attention block that allows spatial positions to attend to each other. Originally ported from here, but adapted to the N-d case. https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66. """ def __init__( self, channels, num_heads=1, num_head_channels=-1, do_checkpoint=True, relative_pos_embeddings=False, ): super().__init__() self.channels = channels self.do_checkpoint = do_checkpoint if num_head_channels == -1: self.num_heads = num_heads else: assert ( channels % num_head_channels == 0 ), f"q,k,v channels {channels} is not divisible by num_head_channels {num_head_channels}" self.num_heads = channels // num_head_channels self.norm = normalization(channels) self.qkv = nn.Conv1d(channels, channels * 3, 1) # split heads before split qkv self.attention = QKVAttentionLegacy(self.num_heads) self.proj_out = zero_module(nn.Conv1d(channels, channels, 1)) if relative_pos_embeddings: self.relative_pos_embeddings = RelativePositionBias(scale=(channels // self.num_heads) ** .5, causal=False, heads=num_heads, num_buckets=32, max_distance=64) else: self.relative_pos_embeddings = None def forward(self, x, mask=None): b, c, *spatial = x.shape x = x.reshape(b, c, -1) qkv = self.qkv(self.norm(x)) h = self.attention(qkv, mask, self.relative_pos_embeddings) h = self.proj_out(h) return (x + h).reshape(b, c, *spatial) class Upsample(nn.Module): """ An upsampling layer with an optional convolution. :param channels: channels in the inputs and outputs. :param use_conv: a bool determining if a convolution is applied. """ def __init__(self, channels, use_conv, out_channels=None, factor=4): super().__init__() self.channels = channels self.out_channels = out_channels or channels self.use_conv = use_conv self.factor = factor if use_conv: ksize = 5 pad = 2 self.conv = nn.Conv1d(self.channels, self.out_channels, ksize, padding=pad) def forward(self, x): assert x.shape[1] == self.channels x = F.interpolate(x, scale_factor=self.factor, mode="nearest") if self.use_conv: x = self.conv(x) return x class Downsample(nn.Module): """ A downsampling layer with an optional convolution. :param channels: channels in the inputs and outputs. :param use_conv: a bool determining if a convolution is applied. """ def __init__(self, channels, use_conv, out_channels=None, factor=4, ksize=5, pad=2): super().__init__() self.channels = channels self.out_channels = out_channels or channels self.use_conv = use_conv stride = factor if use_conv: self.op = nn.Conv1d( self.channels, self.out_channels, ksize, stride=stride, padding=pad ) else: assert self.channels == self.out_channels self.op = nn.AvgPool1d(kernel_size=stride, stride=stride) def forward(self, x): assert x.shape[1] == self.channels return self.op(x) class ResBlock(nn.Module): def __init__( self, channels, dropout, out_channels=None, use_conv=False, use_scale_shift_norm=False, up=False, down=False, kernel_size=3, ): super().__init__() self.channels = channels self.dropout = dropout self.out_channels = out_channels or channels self.use_conv = use_conv self.use_scale_shift_norm = use_scale_shift_norm padding = 1 if kernel_size == 3 else 2 self.in_layers = nn.Sequential( normalization(channels), nn.SiLU(), nn.Conv1d(channels, self.out_channels, kernel_size, padding=padding), ) self.updown = up or down if up: self.h_upd = Upsample(channels, False) self.x_upd = Upsample(channels, False) elif down: self.h_upd = Downsample(channels, False) self.x_upd = Downsample(channels, False) else: self.h_upd = self.x_upd = nn.Identity() self.out_layers = nn.Sequential( normalization(self.out_channels), nn.SiLU(), nn.Dropout(p=dropout), zero_module( nn.Conv1d(self.out_channels, self.out_channels, kernel_size, padding=padding) ), ) if self.out_channels == channels: self.skip_connection = nn.Identity() elif use_conv: self.skip_connection = nn.Conv1d( channels, self.out_channels, kernel_size, padding=padding ) else: self.skip_connection = nn.Conv1d(channels, self.out_channels, 1) def forward(self, x): if self.updown: in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1] h = in_rest(x) h = self.h_upd(h) x = self.x_upd(x) h = in_conv(h) else: h = self.in_layers(x) h = self.out_layers(h) return self.skip_connection(x) + h class AudioMiniEncoder(nn.Module): def __init__(self, spec_dim, embedding_dim, base_channels=128, depth=2, resnet_blocks=2, attn_blocks=4, num_attn_heads=4, dropout=0, downsample_factor=2, kernel_size=3): super().__init__() self.init = nn.Sequential( nn.Conv1d(spec_dim, base_channels, 3, padding=1) ) ch = base_channels res = [] for l in range(depth): for r in range(resnet_blocks): res.append(ResBlock(ch, dropout, kernel_size=kernel_size)) res.append(Downsample(ch, use_conv=True, out_channels=ch*2, factor=downsample_factor)) ch *= 2 self.res = nn.Sequential(*res) self.final = nn.Sequential( normalization(ch), nn.SiLU(), nn.Conv1d(ch, embedding_dim, 1) ) attn = [] for a in range(attn_blocks): attn.append(AttentionBlock(embedding_dim, num_attn_heads,)) self.attn = nn.Sequential(*attn) self.dim = embedding_dim def forward(self, x): h = self.init(x) h = self.res(h) h = self.final(h) h = self.attn(h) return h[:, :, 0] DEFAULT_MEL_NORM_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../data/mel_norms.pth') class TorchMelSpectrogram(nn.Module): def __init__(self, filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80, mel_fmin=0, mel_fmax=8000, sampling_rate=22050, normalize=False, mel_norm_file=DEFAULT_MEL_NORM_FILE): super().__init__() # These are the default tacotron values for the MEL spectrogram. self.filter_length = filter_length self.hop_length = hop_length self.win_length = win_length self.n_mel_channels = n_mel_channels self.mel_fmin = mel_fmin self.mel_fmax = mel_fmax self.sampling_rate = sampling_rate self.mel_stft = torchaudio.transforms.MelSpectrogram(n_fft=self.filter_length, hop_length=self.hop_length, win_length=self.win_length, power=2, normalized=normalize, sample_rate=self.sampling_rate, f_min=self.mel_fmin, f_max=self.mel_fmax, n_mels=self.n_mel_channels, norm="slaney") self.mel_norm_file = mel_norm_file if self.mel_norm_file is not None: self.mel_norms = torch.load(self.mel_norm_file) else: self.mel_norms = None def forward(self, inp): if len(inp.shape) == 3: # Automatically squeeze out the channels dimension if it is present (assuming mono-audio) inp = inp.squeeze(1) assert len(inp.shape) == 2 if torch.backends.mps.is_available(): inp = inp.to('cpu') self.mel_stft = self.mel_stft.to(inp.device) mel = self.mel_stft(inp) # Perform dynamic range compression mel = torch.log(torch.clamp(mel, min=1e-5)) if self.mel_norms is not None: self.mel_norms = self.mel_norms.to(mel.device) mel = mel / self.mel_norms.unsqueeze(0).unsqueeze(-1) return mel class CheckpointedLayer(nn.Module): """ Wraps a module. When forward() is called, passes kwargs that require_grad through torch.checkpoint() and bypasses checkpoint for all other args. """ def __init__(self, wrap): super().__init__() self.wrap = wrap def forward(self, x, *args, **kwargs): for k, v in kwargs.items(): assert not (isinstance(v, torch.Tensor) and v.requires_grad) # This would screw up checkpointing. partial = functools.partial(self.wrap, **kwargs) return partial(x, *args) class CheckpointedXTransformerEncoder(nn.Module): """ Wraps a ContinuousTransformerWrapper and applies CheckpointedLayer to each layer and permutes from channels-mid to channels-last that XTransformer expects. """ def __init__(self, needs_permute=True, exit_permute=True, checkpoint=True, **xtransformer_kwargs): super().__init__() self.transformer = ContinuousTransformerWrapper(**xtransformer_kwargs) self.needs_permute = needs_permute self.exit_permute = exit_permute if not checkpoint: return for i in range(len(self.transformer.attn_layers.layers)): n, b, r = self.transformer.attn_layers.layers[i] self.transformer.attn_layers.layers[i] = nn.ModuleList([n, CheckpointedLayer(b), r]) def forward(self, x, **kwargs): if self.needs_permute: x = x.permute(0,2,1) h = self.transformer(x, **kwargs) if self.exit_permute: h = h.permute(0,2,1) return h ================================================ FILE: tortoise/models/autoregressive.py ================================================ import functools import torch import torch.nn as nn import torch.nn.functional as F from transformers import GPT2Config, GPT2PreTrainedModel, LogitsProcessorList from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions from transformers.utils.model_parallel_utils import get_device_map, assert_device_map from tortoise.models.arch_util import AttentionBlock from tortoise.utils.typical_sampling import TypicalLogitsWarper def null_position_embeddings(range, dim): return torch.zeros((range.shape[0], range.shape[1], dim), device=range.device) class ResBlock(nn.Module): """ Basic residual convolutional block that uses GroupNorm. """ def __init__(self, chan): super().__init__() self.net = nn.Sequential( nn.Conv1d(chan, chan, kernel_size=3, padding=1), nn.GroupNorm(chan//8, chan), nn.ReLU(), nn.Conv1d(chan, chan, kernel_size=3, padding=1), nn.GroupNorm(chan//8, chan) ) def forward(self, x): return F.relu(self.net(x) + x) class GPT2InferenceModel(GPT2PreTrainedModel): def __init__(self, config, gpt, text_pos_emb, embeddings, norm, linear, kv_cache=False): super().__init__(config) self.transformer = gpt self.text_pos_embedding = text_pos_emb self.embeddings = embeddings self.final_norm = norm self.lm_head = nn.Sequential(norm, linear) self.kv_cache = kv_cache # Model parallel self.model_parallel = False self.device_map = None self.cached_mel_emb = None def parallelize(self, device_map=None): self.device_map = ( get_device_map(len(self.transformer.h), range(max(1, torch.cuda.device_count()))) if device_map is None else device_map ) assert_device_map(self.device_map, len(self.transformer.h)) self.transformer.parallelize(self.device_map) self.lm_head = self.lm_head.to(self.transformer.first_device) self.model_parallel = True def deparallelize(self): self.transformer.deparallelize() self.transformer = self.transformer.to("cpu") self.lm_head = self.lm_head.to("cpu") self.model_parallel = False torch.cuda.empty_cache() if torch.backends.mps.is_available(): torch.mps.empty_cache() def get_output_embeddings(self): return self.lm_head def set_output_embeddings(self, new_embeddings): self.lm_head = new_embeddings def store_mel_emb(self, mel_emb): self.cached_mel_emb = mel_emb def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs): token_type_ids = kwargs.get("token_type_ids", None) # usually None if not self.kv_cache: past_key_values = None # only last token for inputs_ids if past is defined in kwargs if past_key_values: input_ids = input_ids[:, -1].unsqueeze(-1) if token_type_ids is not None: token_type_ids = token_type_ids[:, -1].unsqueeze(-1) attention_mask = kwargs.get("attention_mask", None) position_ids = kwargs.get("position_ids", None) if attention_mask is not None and position_ids is None: # create position_ids on the fly for batch generation position_ids = attention_mask.long().cumsum(-1) - 1 position_ids.masked_fill_(attention_mask == 0, 1) if past_key_values: position_ids = position_ids[:, -1].unsqueeze(-1) else: position_ids = None return { "input_ids": input_ids, "past_key_values": past_key_values, "use_cache": kwargs.get("use_cache"), "position_ids": position_ids, "attention_mask": attention_mask, "token_type_ids": token_type_ids, } def forward( self, input_ids=None, past_key_values=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, ): assert self.cached_mel_emb is not None assert inputs_embeds is None # Not supported by this inference model. assert labels is None # Training not supported by this inference model. return_dict = ( return_dict if return_dict is not None else self.config.use_return_dict ) # Create embedding mel_len = self.cached_mel_emb.shape[1] if input_ids.shape[1] != 1: text_inputs = input_ids[:, mel_len:] text_emb = self.embeddings(text_inputs) text_emb = text_emb + self.text_pos_embedding(text_emb) if self.cached_mel_emb.shape[0] != text_emb.shape[0]: mel_emb = self.cached_mel_emb.repeat_interleave( text_emb.shape[0] // self.cached_mel_emb.shape[0], 0 ) else: # this outcome only occurs once per loop in most cases mel_emb = self.cached_mel_emb emb = torch.cat([mel_emb, text_emb], dim=1) else: emb = self.embeddings(input_ids) emb = emb + self.text_pos_embedding.get_fixed_embedding( attention_mask.shape[1] - mel_len, attention_mask.device ) transformer_outputs = self.transformer( inputs_embeds=emb, past_key_values=past_key_values, attention_mask=attention_mask, token_type_ids=token_type_ids, position_ids=position_ids, head_mask=head_mask, encoder_hidden_states=encoder_hidden_states, encoder_attention_mask=encoder_attention_mask, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict, ) hidden_states = transformer_outputs[0] # Set device for model parallelism if self.model_parallel: if torch.backends.mps.is_available(): self.to(self.transformer.first_device) else: torch.cuda.set_device(self.transformer.first_device) hidden_states = hidden_states.to(self.lm_head.weight.device) lm_logits = self.lm_head(hidden_states) if not return_dict: return (lm_logits,) + transformer_outputs[1:] return CausalLMOutputWithCrossAttentions( loss=None, logits=lm_logits, past_key_values=transformer_outputs.past_key_values, hidden_states=transformer_outputs.hidden_states, attentions=transformer_outputs.attentions, cross_attentions=transformer_outputs.cross_attentions, ) @staticmethod def _reorder_cache(past, beam_idx): """ This function is used to re-order the :obj:`past_key_values` cache if :meth:`~transformers.PreTrainedModel.beam_search` or :meth:`~transformers.PreTrainedModel.beam_sample` is called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step. """ return tuple( tuple( past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past ) for layer_past in past ) class ConditioningEncoder(nn.Module): def __init__(self, spec_dim, embedding_dim, attn_blocks=6, num_attn_heads=4, do_checkpointing=False, mean=False): super().__init__() attn = [] self.init = nn.Conv1d(spec_dim, embedding_dim, kernel_size=1) for a in range(attn_blocks): attn.append(AttentionBlock(embedding_dim, num_attn_heads)) self.attn = nn.Sequential(*attn) self.dim = embedding_dim self.do_checkpointing = do_checkpointing self.mean = mean def forward(self, x): h = self.init(x) h = self.attn(h) if self.mean: return h.mean(dim=2) else: return h[:, :, 0] class LearnedPositionEmbeddings(nn.Module): def __init__(self, seq_len, model_dim, init=.02): super().__init__() self.emb = nn.Embedding(seq_len, model_dim) # Initializing this way is standard for GPT-2 self.emb.weight.data.normal_(mean=0.0, std=init) def forward(self, x): sl = x.shape[1] return self.emb(torch.arange(0, sl, device=x.device)) def get_fixed_embedding(self, ind, dev): return self.emb(torch.tensor([ind], device=dev)).unsqueeze(0) def build_hf_gpt_transformer(layers, model_dim, heads, max_mel_seq_len, max_text_seq_len, checkpointing): """ GPT-2 implemented by the HuggingFace library. """ from transformers import GPT2Config, GPT2Model gpt_config = GPT2Config(vocab_size=256, # Unused. n_positions=max_mel_seq_len+max_text_seq_len, n_ctx=max_mel_seq_len+max_text_seq_len, n_embd=model_dim, n_layer=layers, n_head=heads, gradient_checkpointing=checkpointing, use_cache=not checkpointing) gpt = GPT2Model(gpt_config) # Override the built in positional embeddings del gpt.wpe gpt.wpe = functools.partial(null_position_embeddings, dim=model_dim) # Built-in token embeddings are unused. del gpt.wte return gpt, LearnedPositionEmbeddings(max_mel_seq_len, model_dim), LearnedPositionEmbeddings(max_text_seq_len, model_dim),\ None, None class MelEncoder(nn.Module): def __init__(self, channels, mel_channels=80, resblocks_per_reduction=2): super().__init__() self.channels = channels self.encoder = nn.Sequential(nn.Conv1d(mel_channels, channels//4, kernel_size=3, padding=1), nn.Sequential(*[ResBlock(channels//4) for _ in range(resblocks_per_reduction)]), nn.Conv1d(channels//4, channels//2, kernel_size=3, stride=2, padding=1), nn.GroupNorm(channels//16, channels//2), nn.ReLU(), nn.Sequential(*[ResBlock(channels//2) for _ in range(resblocks_per_reduction)]), nn.Conv1d(channels//2, channels, kernel_size=3, stride=2, padding=1), nn.GroupNorm(channels//8, channels), nn.ReLU(), nn.Sequential(*[ResBlock(channels) for _ in range(resblocks_per_reduction)]), ) self.reduction = 4 def forward(self, x): for e in self.encoder: x = e(x) return x.permute(0,2,1) class UnifiedVoice(nn.Module): def __init__(self, layers=8, model_dim=512, heads=8, max_text_tokens=120, max_mel_tokens=250, max_conditioning_inputs=1, mel_length_compression=1024, number_text_tokens=256, start_text_token=None, number_mel_codes=8194, start_mel_token=8192, stop_mel_token=8193, train_solo_embeddings=False, use_mel_codes_as_input=True, checkpointing=True, types=1): """ Args: layers: Number of layers in transformer stack. model_dim: Operating dimensions of the transformer heads: Number of transformer heads. Must be divisible by model_dim. Recommend model_dim//64 max_text_tokens: Maximum number of text tokens that will be encountered by model. max_mel_tokens: Maximum number of MEL tokens that will be encountered by model. max_conditioning_inputs: Maximum number of conditioning inputs provided to the model. If (1), conditioning input can be of format (b,80,s), otherwise (b,n,80,s). mel_length_compression: The factor between and . Used to compute MEL code padding given wav input length. number_text_tokens: start_text_token: stop_text_token: number_mel_codes: start_mel_token: stop_mel_token: train_solo_embeddings: use_mel_codes_as_input: checkpointing: """ super().__init__() self.number_text_tokens = number_text_tokens self.start_text_token = number_text_tokens * types if start_text_token is None else start_text_token self.stop_text_token = 0 self.number_mel_codes = number_mel_codes self.start_mel_token = start_mel_token self.stop_mel_token = stop_mel_token self.layers = layers self.heads = heads self.max_mel_tokens = max_mel_tokens self.max_text_tokens = max_text_tokens self.model_dim = model_dim self.max_conditioning_inputs = max_conditioning_inputs self.mel_length_compression = mel_length_compression self.conditioning_encoder = ConditioningEncoder(80, model_dim, num_attn_heads=heads) self.text_embedding = nn.Embedding(self.number_text_tokens*types+1, model_dim) if use_mel_codes_as_input: self.mel_embedding = nn.Embedding(self.number_mel_codes, model_dim) else: self.mel_embedding = MelEncoder(model_dim, resblocks_per_reduction=1) self.gpt, self.mel_pos_embedding, self.text_pos_embedding, self.mel_layer_pos_embedding, self.text_layer_pos_embedding = \ build_hf_gpt_transformer(layers, model_dim, heads, self.max_mel_tokens+2+self.max_conditioning_inputs, self.max_text_tokens+2, checkpointing) if train_solo_embeddings: self.mel_solo_embedding = nn.Parameter(torch.randn(1, 1, model_dim) * .02, requires_grad=True) self.text_solo_embedding = nn.Parameter(torch.randn(1, 1, model_dim) * .02, requires_grad=True) else: self.mel_solo_embedding = 0 self.text_solo_embedding = 0 self.final_norm = nn.LayerNorm(model_dim) self.text_head = nn.Linear(model_dim, self.number_text_tokens*types+1) self.mel_head = nn.Linear(model_dim, self.number_mel_codes) # Initialize the embeddings per the GPT-2 scheme embeddings = [self.text_embedding] if use_mel_codes_as_input: embeddings.append(self.mel_embedding) for module in embeddings: module.weight.data.normal_(mean=0.0, std=.02) def post_init_gpt2_config(self, use_deepspeed=False, kv_cache=False, half=False): seq_length = self.max_mel_tokens + self.max_text_tokens + 2 gpt_config = GPT2Config( vocab_size=self.max_mel_tokens, n_positions=seq_length, n_ctx=seq_length, n_embd=self.model_dim, n_layer=self.layers, n_head=self.heads, gradient_checkpointing=False, use_cache=True, ) self.inference_model = GPT2InferenceModel( gpt_config, self.gpt, self.mel_pos_embedding, self.mel_embedding, self.final_norm, self.mel_head, kv_cache=kv_cache, ) if use_deepspeed and half and torch.cuda.is_available(): import deepspeed self.ds_engine = deepspeed.init_inference(model=self.inference_model, mp_size=1, replace_with_kernel_inject=True, dtype=torch.float16) self.inference_model = self.ds_engine.module.eval() elif use_deepspeed and torch.cuda.is_available(): import deepspeed self.ds_engine = deepspeed.init_inference(model=self.inference_model, mp_size=1, replace_with_kernel_inject=True, dtype=torch.float32) self.inference_model = self.ds_engine.module.eval() else: self.inference_model = self.inference_model.eval() # self.inference_model = PrunedGPT2InferenceModel(gpt_config, self.gpt, self.mel_pos_embedding, self.mel_embedding, self.final_norm, self.mel_head) self.gpt.wte = self.mel_embedding def build_aligned_inputs_and_targets(self, input, start_token, stop_token): inp = F.pad(input, (1,0), value=start_token) tar = F.pad(input, (0,1), value=stop_token) return inp, tar def set_mel_padding(self, mel_input_tokens, wav_lengths): """ Given mel tokens that are derived from a padded audio clip and the actual lengths of each batch element in that audio clip, reformats the tokens with STOP_MEL_TOKEN in place of the zero padding. This is required preformatting to create a working TTS model. """ # Set padding areas within MEL (currently it is coded with the MEL code for ). mel_lengths = torch.div(wav_lengths, self.mel_length_compression, rounding_mode='trunc') for b in range(len(mel_lengths)): actual_end = mel_lengths[b] + 1 # Due to the convolutional nature of how these tokens are generated, it would be best if the model predicts a token past the actual last token. if actual_end < mel_input_tokens.shape[-1]: mel_input_tokens[b, actual_end:] = self.stop_mel_token return mel_input_tokens def get_logits(self, speech_conditioning_inputs, first_inputs, first_head, second_inputs=None, second_head=None, get_attns=False, return_latent=False): if second_inputs is not None: emb = torch.cat([speech_conditioning_inputs, first_inputs, second_inputs], dim=1) else: emb = torch.cat([speech_conditioning_inputs, first_inputs], dim=1) gpt_out = self.gpt(inputs_embeds=emb, return_dict=True, output_attentions=get_attns) if get_attns: return gpt_out.attentions enc = gpt_out.last_hidden_state[:, 1:] # The first logit is tied to the speech_conditioning_input enc = self.final_norm(enc) if return_latent: return enc[:, speech_conditioning_inputs.shape[1]:speech_conditioning_inputs.shape[1]+first_inputs.shape[1]], enc[:, -second_inputs.shape[1]:] first_logits = enc[:, :first_inputs.shape[1]] first_logits = first_head(first_logits) first_logits = first_logits.permute(0,2,1) if second_inputs is not None: second_logits = enc[:, -second_inputs.shape[1]:] second_logits = second_head(second_logits) second_logits = second_logits.permute(0,2,1) return first_logits, second_logits else: return first_logits def get_conditioning(self, speech_conditioning_input): speech_conditioning_input = speech_conditioning_input.unsqueeze(1) if len( speech_conditioning_input.shape) == 3 else speech_conditioning_input conds = [] for j in range(speech_conditioning_input.shape[1]): conds.append(self.conditioning_encoder(speech_conditioning_input[:, j])) conds = torch.stack(conds, dim=1) conds = conds.mean(dim=1) return conds def forward(self, speech_conditioning_latent, text_inputs, text_lengths, mel_codes, wav_lengths, types=None, text_first=True, raw_mels=None, return_attentions=False, return_latent=False, clip_inputs=True): """ Forward pass that uses both text and voice in either text conditioning mode or voice conditioning mode (actuated by `text_first`). speech_conditioning_input: MEL float tensor, (b,1024) text_inputs: long tensor, (b,t) text_lengths: long tensor, (b,) mel_inputs: long tensor, (b,m) wav_lengths: long tensor, (b,) raw_mels: MEL float tensor (b,80,s) If return_attentions is specified, only logits are returned. If return_latent is specified, loss & logits are not computed or returned. Only the predicted latents are returned. If clip_inputs is True, the inputs will be clipped to the smallest input size across each input modality. """ # Types are expressed by expanding the text embedding space. if types is not None: text_inputs = text_inputs * (1+types).unsqueeze(-1) if clip_inputs: # This model will receive micro-batches with a ton of padding for both the text and MELs. Ameliorate this by # chopping the inputs by the maximum actual length. max_text_len = text_lengths.max() text_inputs = text_inputs[:, :max_text_len] max_mel_len = wav_lengths.max() // self.mel_length_compression mel_codes = mel_codes[:, :max_mel_len] if raw_mels is not None: raw_mels = raw_mels[:, :, :max_mel_len*4] mel_codes = self.set_mel_padding(mel_codes, wav_lengths) text_inputs = F.pad(text_inputs, (0,1), value=self.stop_text_token) mel_codes = F.pad(mel_codes, (0,1), value=self.stop_mel_token) conds = speech_conditioning_latent.unsqueeze(1) text_inputs, text_targets = self.build_aligned_inputs_and_targets(text_inputs, self.start_text_token, self.stop_text_token) text_emb = self.text_embedding(text_inputs) + self.text_pos_embedding(text_inputs) mel_codes, mel_targets = self.build_aligned_inputs_and_targets(mel_codes, self.start_mel_token, self.stop_mel_token) if raw_mels is not None: mel_inp = F.pad(raw_mels, (0, 8)) else: mel_inp = mel_codes mel_emb = self.mel_embedding(mel_inp) mel_emb = mel_emb + self.mel_pos_embedding(mel_codes) if text_first: text_logits, mel_logits = self.get_logits(conds, text_emb, self.text_head, mel_emb, self.mel_head, get_attns=return_attentions, return_latent=return_latent) if return_latent: return mel_logits[:, :-2] # Despite the name, these are not logits. Strip off the two tokens added by this forward pass. else: mel_logits, text_logits = self.get_logits(conds, mel_emb, self.mel_head, text_emb, self.text_head, get_attns=return_attentions, return_latent=return_latent) if return_latent: return text_logits[:, :-2] # Despite the name, these are not logits. Strip off the two tokens added by this forward pass. if return_attentions: return mel_logits loss_text = F.cross_entropy(text_logits, text_targets.long()) loss_mel = F.cross_entropy(mel_logits, mel_targets.long()) return loss_text.mean(), loss_mel.mean(), mel_logits def compute_embeddings( self, cond_latents, text_inputs, ): text_inputs = F.pad(text_inputs, (0, 1), value=self.stop_text_token) text_inputs = F.pad(text_inputs, (1, 0), value=self.start_text_token) emb = self.text_embedding(text_inputs) + self.text_pos_embedding(text_inputs) conds = cond_latents.unsqueeze(1) emb = torch.cat([conds, emb], dim=1) self.inference_model.store_mel_emb(emb) gpt_inputs = torch.full( ( emb.shape[0], emb.shape[1] + 1, # +1 for the start_mel_token ), fill_value=1, dtype=torch.long, device=text_inputs.device, ) gpt_inputs[:, -1] = self.start_mel_token return gpt_inputs def inference_speech(self, speech_conditioning_latent, text_inputs, input_tokens=None, num_return_sequences=1, max_generate_length=None, typical_sampling=False, typical_mass=.9, **hf_generate_kwargs): text_inputs = F.pad(text_inputs, (0, 1), value=self.stop_text_token) text_inputs, _ = self.build_aligned_inputs_and_targets(text_inputs, self.start_text_token, self.stop_text_token) text_emb = self.text_embedding(text_inputs) + self.text_pos_embedding(text_inputs) conds = speech_conditioning_latent.unsqueeze(1) emb = torch.cat([conds, text_emb], dim=1) self.inference_model.store_mel_emb(emb) fake_inputs = torch.full((emb.shape[0], conds.shape[1] + emb.shape[1],), fill_value=1, dtype=torch.long, device=text_inputs.device) fake_inputs[:, -1] = self.start_mel_token trunc_index = fake_inputs.shape[1] if input_tokens is None: inputs = fake_inputs else: assert num_return_sequences % input_tokens.shape[0] == 0, "The number of return sequences must be divisible by the number of input sequences" fake_inputs = fake_inputs.repeat(num_return_sequences, 1) input_tokens = input_tokens.repeat(num_return_sequences // input_tokens.shape[0], 1) inputs = torch.cat([fake_inputs, input_tokens], dim=1) logits_processor = LogitsProcessorList([TypicalLogitsWarper(mass=typical_mass)]) if typical_sampling else LogitsProcessorList() max_length = trunc_index + self.max_mel_tokens - 1 if max_generate_length is None else trunc_index + max_generate_length gen = self.inference_model.generate(inputs, bos_token_id=self.start_mel_token, pad_token_id=self.stop_mel_token, eos_token_id=self.stop_mel_token, max_length=max_length, logits_processor=logits_processor, num_return_sequences=num_return_sequences, **hf_generate_kwargs) return gen[:, trunc_index:] def get_generator(self, fake_inputs, **hf_generate_kwargs): return self.inference_model.generate_stream( fake_inputs, bos_token_id=self.start_mel_token, pad_token_id=self.stop_mel_token, eos_token_id=self.stop_mel_token, max_length=500, do_stream=True, **hf_generate_kwargs, ) if __name__ == '__main__': gpt = UnifiedVoice(model_dim=256, heads=4, train_solo_embeddings=True, use_mel_codes_as_input=True, max_conditioning_inputs=4) l = gpt(torch.randn(2, 3, 80, 800), torch.randint(high=120, size=(2,120)), torch.tensor([32, 120]), torch.randint(high=8192, size=(2,250)), torch.tensor([250*256,195*256])) gpt.text_forward(torch.randn(2,80,800), torch.randint(high=50, size=(2,80)), torch.tensor([32, 80])) ================================================ FILE: tortoise/models/classifier.py ================================================ import torch import torch.nn as nn from tortoise.models.arch_util import Upsample, Downsample, normalization, zero_module, AttentionBlock class ResBlock(nn.Module): def __init__( self, channels, dropout, out_channels=None, use_conv=False, use_scale_shift_norm=False, dims=2, up=False, down=False, kernel_size=3, do_checkpoint=True, ): super().__init__() self.channels = channels self.dropout = dropout self.out_channels = out_channels or channels self.use_conv = use_conv self.use_scale_shift_norm = use_scale_shift_norm self.do_checkpoint = do_checkpoint padding = 1 if kernel_size == 3 else 2 self.in_layers = nn.Sequential( normalization(channels), nn.SiLU(), nn.Conv1d(channels, self.out_channels, kernel_size, padding=padding), ) self.updown = up or down if up: self.h_upd = Upsample(channels, False, dims) self.x_upd = Upsample(channels, False, dims) elif down: self.h_upd = Downsample(channels, False, dims) self.x_upd = Downsample(channels, False, dims) else: self.h_upd = self.x_upd = nn.Identity() self.out_layers = nn.Sequential( normalization(self.out_channels), nn.SiLU(), nn.Dropout(p=dropout), zero_module( nn.Conv1d(self.out_channels, self.out_channels, kernel_size, padding=padding) ), ) if self.out_channels == channels: self.skip_connection = nn.Identity() elif use_conv: self.skip_connection = nn.Conv1d( dims, channels, self.out_channels, kernel_size, padding=padding ) else: self.skip_connection = nn.Conv1d(dims, channels, self.out_channels, 1) def forward(self, x): if self.updown: in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1] h = in_rest(x) h = self.h_upd(h) x = self.x_upd(x) h = in_conv(h) else: h = self.in_layers(x) h = self.out_layers(h) return self.skip_connection(x) + h class AudioMiniEncoder(nn.Module): def __init__(self, spec_dim, embedding_dim, base_channels=128, depth=2, resnet_blocks=2, attn_blocks=4, num_attn_heads=4, dropout=0, downsample_factor=2, kernel_size=3): super().__init__() self.init = nn.Sequential( nn.Conv1d(spec_dim, base_channels, 3, padding=1) ) ch = base_channels res = [] self.layers = depth for l in range(depth): for r in range(resnet_blocks): res.append(ResBlock(ch, dropout, do_checkpoint=False, kernel_size=kernel_size)) res.append(Downsample(ch, use_conv=True, out_channels=ch*2, factor=downsample_factor)) ch *= 2 self.res = nn.Sequential(*res) self.final = nn.Sequential( normalization(ch), nn.SiLU(), nn.Conv1d(ch, embedding_dim, 1) ) attn = [] for a in range(attn_blocks): attn.append(AttentionBlock(embedding_dim, num_attn_heads, do_checkpoint=False)) self.attn = nn.Sequential(*attn) self.dim = embedding_dim def forward(self, x): h = self.init(x) h = self.res(h) h = self.final(h) for blk in self.attn: h = blk(h) return h[:, :, 0] class AudioMiniEncoderWithClassifierHead(nn.Module): def __init__(self, classes, distribute_zero_label=True, **kwargs): super().__init__() self.enc = AudioMiniEncoder(**kwargs) self.head = nn.Linear(self.enc.dim, classes) self.num_classes = classes self.distribute_zero_label = distribute_zero_label def forward(self, x, labels=None): h = self.enc(x) logits = self.head(h) if labels is None: return logits else: if self.distribute_zero_label: oh_labels = nn.functional.one_hot(labels, num_classes=self.num_classes) zeros_indices = (labels == 0).unsqueeze(-1) # Distribute 20% of the probability mass on all classes when zero is specified, to compensate for dataset noise. zero_extra_mass = torch.full_like(oh_labels, dtype=torch.float, fill_value=.2/(self.num_classes-1)) zero_extra_mass[:, 0] = -.2 zero_extra_mass = zero_extra_mass * zeros_indices oh_labels = oh_labels + zero_extra_mass else: oh_labels = labels loss = nn.functional.cross_entropy(logits, oh_labels) return loss ================================================ FILE: tortoise/models/clvp.py ================================================ import torch import torch.nn as nn import torch.nn.functional as F from torch import einsum from tortoise.models.arch_util import CheckpointedXTransformerEncoder from tortoise.models.transformer import Transformer from tortoise.models.xtransformers import Encoder def exists(val): return val is not None def masked_mean(t, mask, dim = 1): t = t.masked_fill(~mask[:, :, None], 0.) return t.sum(dim = 1) / mask.sum(dim = 1)[..., None] class CLVP(nn.Module): """ CLIP model retrofitted for performing contrastive evaluation between tokenized audio data and the corresponding transcribed text. Originally from https://github.com/lucidrains/DALLE-pytorch/blob/main/dalle_pytorch/dalle_pytorch.py """ def __init__( self, *, dim_text=512, dim_speech=512, dim_latent=512, num_text_tokens=256, text_enc_depth=6, text_seq_len=120, text_heads=8, num_speech_tokens=8192, speech_enc_depth=6, speech_heads=8, speech_seq_len=250, text_mask_percentage=0, voice_mask_percentage=0, wav_token_compression=1024, use_xformers=False, ): super().__init__() self.text_emb = nn.Embedding(num_text_tokens, dim_text) self.to_text_latent = nn.Linear(dim_text, dim_latent, bias=False) self.speech_emb = nn.Embedding(num_speech_tokens, dim_speech) self.to_speech_latent = nn.Linear(dim_speech, dim_latent, bias=False) if use_xformers: self.text_transformer = CheckpointedXTransformerEncoder( needs_permute=False, exit_permute=False, max_seq_len=-1, attn_layers=Encoder( dim=dim_text, depth=text_enc_depth, heads=text_heads, ff_dropout=.1, ff_mult=2, attn_dropout=.1, use_rmsnorm=True, ff_glu=True, rotary_pos_emb=True, )) self.speech_transformer = CheckpointedXTransformerEncoder( needs_permute=False, exit_permute=False, max_seq_len=-1, attn_layers=Encoder( dim=dim_speech, depth=speech_enc_depth, heads=speech_heads, ff_dropout=.1, ff_mult=2, attn_dropout=.1, use_rmsnorm=True, ff_glu=True, rotary_pos_emb=True, )) else: self.text_transformer = Transformer(causal=False, seq_len=text_seq_len, dim=dim_text, depth=text_enc_depth, heads=text_heads) self.speech_transformer = Transformer(causal=False, seq_len=speech_seq_len, dim=dim_speech, depth=speech_enc_depth, heads=speech_heads) self.temperature = nn.Parameter(torch.tensor(1.)) self.text_mask_percentage = text_mask_percentage self.voice_mask_percentage = voice_mask_percentage self.wav_token_compression = wav_token_compression self.xformers = use_xformers if not use_xformers: self.text_pos_emb = nn.Embedding(text_seq_len, dim_text) self.speech_pos_emb = nn.Embedding(num_speech_tokens, dim_speech) def forward( self, text, speech_tokens, return_loss=False ): b, device = text.shape[0], text.device if self.training: text_mask = torch.rand_like(text.float()) > self.text_mask_percentage voice_mask = torch.rand_like(speech_tokens.float()) > self.voice_mask_percentage else: text_mask = torch.ones_like(text.float()).bool() voice_mask = torch.ones_like(speech_tokens.float()).bool() text_emb = self.text_emb(text) speech_emb = self.speech_emb(speech_tokens) if not self.xformers: text_emb += self.text_pos_emb(torch.arange(text.shape[1], device=device)) speech_emb += self.speech_pos_emb(torch.arange(speech_emb.shape[1], device=device)) enc_text = self.text_transformer(text_emb, mask=text_mask) enc_speech = self.speech_transformer(speech_emb, mask=voice_mask) text_latents = masked_mean(enc_text, text_mask, dim=1) speech_latents = masked_mean(enc_speech, voice_mask, dim=1) text_latents = self.to_text_latent(text_latents) speech_latents = self.to_speech_latent(speech_latents) text_latents, speech_latents = map(lambda t: F.normalize(t, p=2, dim=-1), (text_latents, speech_latents)) temp = self.temperature.exp() if not return_loss: sim = einsum('n d, n d -> n', text_latents, speech_latents) * temp return sim sim = einsum('i d, j d -> i j', text_latents, speech_latents) * temp labels = torch.arange(b, device=device) loss = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2 return loss if __name__ == '__main__': clip = CLVP(text_mask_percentage=.2, voice_mask_percentage=.2) clip(torch.randint(0,256,(2,120)), torch.tensor([50,100]), torch.randint(0,8192,(2,250)), torch.tensor([101,102]), return_loss=True) nonloss = clip(torch.randint(0,256,(2,120)), torch.tensor([50,100]), torch.randint(0,8192,(2,250)), torch.tensor([101,102]), return_loss=False) print(nonloss.shape) ================================================ FILE: tortoise/models/cvvp.py ================================================ import torch import torch.nn as nn import torch.nn.functional as F from torch import einsum from tortoise.models.arch_util import AttentionBlock from tortoise.models.xtransformers import ContinuousTransformerWrapper, Encoder def exists(val): return val is not None def masked_mean(t, mask): t = t.masked_fill(~mask, 0.) return t.sum(dim=1) / mask.sum(dim=1) class CollapsingTransformer(nn.Module): def __init__(self, model_dim, output_dims, heads, dropout, depth, mask_percentage=0, **encoder_kwargs): super().__init__() self.transformer = ContinuousTransformerWrapper( max_seq_len=-1, use_pos_emb=False, attn_layers=Encoder( dim=model_dim, depth=depth, heads=heads, ff_dropout=dropout, ff_mult=1, attn_dropout=dropout, use_rmsnorm=True, ff_glu=True, rotary_pos_emb=True, **encoder_kwargs, )) self.pre_combiner = nn.Sequential(nn.Conv1d(model_dim, output_dims, 1), AttentionBlock( output_dims, num_heads=heads, do_checkpoint=False), nn.Conv1d(output_dims, output_dims, 1)) self.mask_percentage = mask_percentage def forward(self, x, **transformer_kwargs): h = self.transformer(x, **transformer_kwargs) h = h.permute(0, 2, 1) h = self.pre_combiner(h).permute(0, 2, 1) if self.training: mask = torch.rand_like(h.float()) > self.mask_percentage else: mask = torch.ones_like(h.float()).bool() return masked_mean(h, mask) class ConvFormatEmbedding(nn.Module): def __init__(self, *args, **kwargs): super().__init__() self.emb = nn.Embedding(*args, **kwargs) def forward(self, x): y = self.emb(x) return y.permute(0, 2, 1) class CVVP(nn.Module): def __init__( self, model_dim=512, transformer_heads=8, dropout=.1, conditioning_enc_depth=8, cond_mask_percentage=0, mel_channels=80, mel_codes=None, speech_enc_depth=8, speech_mask_percentage=0, latent_multiplier=1, ): super().__init__() latent_dim = latent_multiplier*model_dim self.temperature = nn.Parameter(torch.tensor(1.)) self.cond_emb = nn.Sequential(nn.Conv1d(mel_channels, model_dim//2, kernel_size=5, stride=2, padding=2), nn.Conv1d(model_dim//2, model_dim, kernel_size=3, stride=2, padding=1)) self.conditioning_transformer = CollapsingTransformer( model_dim, model_dim, transformer_heads, dropout, conditioning_enc_depth, cond_mask_percentage) self.to_conditioning_latent = nn.Linear( latent_dim, latent_dim, bias=False) if mel_codes is None: self.speech_emb = nn.Conv1d( mel_channels, model_dim, kernel_size=5, padding=2) else: self.speech_emb = ConvFormatEmbedding(mel_codes, model_dim) self.speech_transformer = CollapsingTransformer( model_dim, latent_dim, transformer_heads, dropout, speech_enc_depth, speech_mask_percentage) self.to_speech_latent = nn.Linear( latent_dim, latent_dim, bias=False) def get_grad_norm_parameter_groups(self): return { 'conditioning': list(self.conditioning_transformer.parameters()), 'speech': list(self.speech_transformer.parameters()), } def forward( self, mel_cond, mel_input, return_loss=False ): cond_emb = self.cond_emb(mel_cond).permute(0, 2, 1) enc_cond = self.conditioning_transformer(cond_emb) cond_latents = self.to_conditioning_latent(enc_cond) speech_emb = self.speech_emb(mel_input).permute(0, 2, 1) enc_speech = self.speech_transformer(speech_emb) speech_latents = self.to_speech_latent(enc_speech) cond_latents, speech_latents = map(lambda t: F.normalize( t, p=2, dim=-1), (cond_latents, speech_latents)) temp = self.temperature.exp() if not return_loss: sim = einsum('n d, n d -> n', cond_latents, speech_latents) * temp return sim sim = einsum('i d, j d -> i j', cond_latents, speech_latents) * temp labels = torch.arange( cond_latents.shape[0], device=mel_input.device) loss = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2 return loss if __name__ == '__main__': clvp = CVVP() clvp(torch.randn(2, 80, 100), torch.randn(2, 80, 95), return_loss=True) ================================================ FILE: tortoise/models/diffusion_decoder.py ================================================ import math import random from abc import abstractmethod import torch import torch.nn as nn import torch.nn.functional as F from torch import autocast from tortoise.models.arch_util import normalization, AttentionBlock def is_latent(t): return t.dtype == torch.float def is_sequence(t): return t.dtype == torch.long def timestep_embedding(timesteps, dim, max_period=10000): """ Create sinusoidal timestep embeddings. :param timesteps: a 1-D Tensor of N indices, one per batch element. These may be fractional. :param dim: the dimension of the output. :param max_period: controls the minimum frequency of the embeddings. :return: an [N x dim] Tensor of positional embeddings. """ half = dim // 2 freqs = torch.exp( -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half ).to(device=timesteps.device) args = timesteps[:, None].float() * freqs[None] embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1) if dim % 2: embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1) return embedding class TimestepBlock(nn.Module): @abstractmethod def forward(self, x, emb): """ Apply the module to `x` given `emb` timestep embeddings. """ class TimestepEmbedSequential(nn.Sequential, TimestepBlock): def forward(self, x, emb): for layer in self: if isinstance(layer, TimestepBlock): x = layer(x, emb) else: x = layer(x) return x class ResBlock(TimestepBlock): def __init__( self, channels, emb_channels, dropout, out_channels=None, dims=2, kernel_size=3, efficient_config=True, use_scale_shift_norm=False, ): super().__init__() self.channels = channels self.emb_channels = emb_channels self.dropout = dropout self.out_channels = out_channels or channels self.use_scale_shift_norm = use_scale_shift_norm padding = {1: 0, 3: 1, 5: 2}[kernel_size] eff_kernel = 1 if efficient_config else 3 eff_padding = 0 if efficient_config else 1 self.in_layers = nn.Sequential( normalization(channels), nn.SiLU(), nn.Conv1d(channels, self.out_channels, eff_kernel, padding=eff_padding), ) self.emb_layers = nn.Sequential( nn.SiLU(), nn.Linear( emb_channels, 2 * self.out_channels if use_scale_shift_norm else self.out_channels, ), ) self.out_layers = nn.Sequential( normalization(self.out_channels), nn.SiLU(), nn.Dropout(p=dropout), nn.Conv1d(self.out_channels, self.out_channels, kernel_size, padding=padding), ) if self.out_channels == channels: self.skip_connection = nn.Identity() else: self.skip_connection = nn.Conv1d(channels, self.out_channels, eff_kernel, padding=eff_padding) def forward(self, x, emb): h = self.in_layers(x) emb_out = self.emb_layers(emb).type(h.dtype) while len(emb_out.shape) < len(h.shape): emb_out = emb_out[..., None] if self.use_scale_shift_norm: out_norm, out_rest = self.out_layers[0], self.out_layers[1:] scale, shift = torch.chunk(emb_out, 2, dim=1) h = out_norm(h) * (1 + scale) + shift h = out_rest(h) else: h = h + emb_out h = self.out_layers(h) return self.skip_connection(x) + h class DiffusionLayer(TimestepBlock): def __init__(self, model_channels, dropout, num_heads): super().__init__() self.resblk = ResBlock(model_channels, model_channels, dropout, model_channels, dims=1, use_scale_shift_norm=True) self.attn = AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True) def forward(self, x, time_emb): y = self.resblk(x, time_emb) return self.attn(y) class DiffusionTts(nn.Module): def __init__( self, model_channels=512, num_layers=8, in_channels=100, in_latent_channels=512, in_tokens=8193, out_channels=200, # mean and variance dropout=0, use_fp16=False, num_heads=16, # Parameters for regularization. layer_drop=.1, unconditioned_percentage=.1, # This implements a mechanism similar to what is used in classifier-free training. ): super().__init__() self.in_channels = in_channels self.model_channels = model_channels self.out_channels = out_channels self.dropout = dropout self.num_heads = num_heads self.unconditioned_percentage = unconditioned_percentage self.enable_fp16 = use_fp16 self.layer_drop = layer_drop self.inp_block = nn.Conv1d(in_channels, model_channels, 3, 1, 1) self.time_embed = nn.Sequential( nn.Linear(model_channels, model_channels), nn.SiLU(), nn.Linear(model_channels, model_channels), ) # Either code_converter or latent_converter is used, depending on what type of conditioning data is fed. # This model is meant to be able to be trained on both for efficiency purposes - it is far less computationally # complex to generate tokens, while generating latents will normally mean propagating through a deep autoregressive # transformer network. self.code_embedding = nn.Embedding(in_tokens, model_channels) self.code_converter = nn.Sequential( AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True), AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True), AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True), ) self.code_norm = normalization(model_channels) self.latent_conditioner = nn.Sequential( nn.Conv1d(in_latent_channels, model_channels, 3, padding=1), AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True), AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True), AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True), AttentionBlock(model_channels, num_heads, relative_pos_embeddings=True), ) self.contextual_embedder = nn.Sequential(nn.Conv1d(in_channels,model_channels,3,padding=1,stride=2), nn.Conv1d(model_channels, model_channels*2,3,padding=1,stride=2), AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False), AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False), AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False), AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False), AttentionBlock(model_channels*2, num_heads, relative_pos_embeddings=True, do_checkpoint=False)) self.unconditioned_embedding = nn.Parameter(torch.randn(1,model_channels,1)) self.conditioning_timestep_integrator = TimestepEmbedSequential( DiffusionLayer(model_channels, dropout, num_heads), DiffusionLayer(model_channels, dropout, num_heads), DiffusionLayer(model_channels, dropout, num_heads), ) self.integrating_conv = nn.Conv1d(model_channels*2, model_channels, kernel_size=1) self.mel_head = nn.Conv1d(model_channels, in_channels, kernel_size=3, padding=1) self.layers = nn.ModuleList([DiffusionLayer(model_channels, dropout, num_heads) for _ in range(num_layers)] + [ResBlock(model_channels, model_channels, dropout, dims=1, use_scale_shift_norm=True) for _ in range(3)]) self.out = nn.Sequential( normalization(model_channels), nn.SiLU(), nn.Conv1d(model_channels, out_channels, 3, padding=1), ) def get_grad_norm_parameter_groups(self): groups = { 'minicoder': list(self.contextual_embedder.parameters()), 'layers': list(self.layers.parameters()), 'code_converters': list(self.code_embedding.parameters()) + list(self.code_converter.parameters()) + list(self.latent_conditioner.parameters()) + list(self.latent_conditioner.parameters()), 'timestep_integrator': list(self.conditioning_timestep_integrator.parameters()) + list(self.integrating_conv.parameters()), 'time_embed': list(self.time_embed.parameters()), } return groups def get_conditioning(self, conditioning_input): speech_conditioning_input = conditioning_input.unsqueeze(1) if len( conditioning_input.shape) == 3 else conditioning_input conds = [] for j in range(speech_conditioning_input.shape[1]): conds.append(self.contextual_embedder(speech_conditioning_input[:, j])) conds = torch.cat(conds, dim=-1) conds = conds.mean(dim=-1) return conds def timestep_independent(self, aligned_conditioning, conditioning_latent, expected_seq_len, return_code_pred): # Shuffle aligned_latent to BxCxS format if is_latent(aligned_conditioning): aligned_conditioning = aligned_conditioning.permute(0, 2, 1) cond_scale, cond_shift = torch.chunk(conditioning_latent, 2, dim=1) if is_latent(aligned_conditioning): code_emb = self.latent_conditioner(aligned_conditioning) else: code_emb = self.code_embedding(aligned_conditioning).permute(0, 2, 1) code_emb = self.code_converter(code_emb) code_emb = self.code_norm(code_emb) * (1 + cond_scale.unsqueeze(-1)) + cond_shift.unsqueeze(-1) unconditioned_batches = torch.zeros((code_emb.shape[0], 1, 1), device=code_emb.device) # Mask out the conditioning branch for whole batch elements, implementing something similar to classifier-free guidance. if self.training and self.unconditioned_percentage > 0: unconditioned_batches = torch.rand((code_emb.shape[0], 1, 1), device=code_emb.device) < self.unconditioned_percentage code_emb = torch.where(unconditioned_batches, self.unconditioned_embedding.repeat(aligned_conditioning.shape[0], 1, 1), code_emb) expanded_code_emb = F.interpolate(code_emb, size=expected_seq_len, mode='nearest') if not return_code_pred: return expanded_code_emb else: mel_pred = self.mel_head(expanded_code_emb) # Multiply mel_pred by !unconditioned_branches, which drops the gradient on unconditioned branches. This is because we don't want that gradient being used to train parameters through the codes_embedder as it unbalances contributions to that network from the MSE loss. mel_pred = mel_pred * unconditioned_batches.logical_not() return expanded_code_emb, mel_pred def forward(self, x, timesteps, aligned_conditioning=None, conditioning_latent=None, precomputed_aligned_embeddings=None, conditioning_free=False, return_code_pred=False): """ Apply the model to an input batch. :param x: an [N x C x ...] Tensor of inputs. :param timesteps: a 1-D batch of timesteps. :param aligned_conditioning: an aligned latent or sequence of tokens providing useful data about the sample to be produced. :param conditioning_latent: a pre-computed conditioning latent; see get_conditioning(). :param precomputed_aligned_embeddings: Embeddings returned from self.timestep_independent() :param conditioning_free: When set, all conditioning inputs (including tokens and conditioning_input) will not be considered. :return: an [N x C x ...] Tensor of outputs. """ assert precomputed_aligned_embeddings is not None or (aligned_conditioning is not None and conditioning_latent is not None) assert not (return_code_pred and precomputed_aligned_embeddings is not None) # These two are mutually exclusive. unused_params = [] if conditioning_free: code_emb = self.unconditioned_embedding.repeat(x.shape[0], 1, x.shape[-1]) unused_params.extend(list(self.code_converter.parameters()) + list(self.code_embedding.parameters())) unused_params.extend(list(self.latent_conditioner.parameters())) else: if precomputed_aligned_embeddings is not None: code_emb = precomputed_aligned_embeddings else: code_emb, mel_pred = self.timestep_independent(aligned_conditioning, conditioning_latent, x.shape[-1], True) if is_latent(aligned_conditioning): unused_params.extend(list(self.code_converter.parameters()) + list(self.code_embedding.parameters())) else: unused_params.extend(list(self.latent_conditioner.parameters())) unused_params.append(self.unconditioned_embedding) time_emb = self.time_embed(timestep_embedding(timesteps, self.model_channels)) code_emb = self.conditioning_timestep_integrator(code_emb, time_emb) x = self.inp_block(x) x = torch.cat([x, code_emb], dim=1) x = self.integrating_conv(x) for i, lyr in enumerate(self.layers): # Do layer drop where applicable. Do not drop first and last layers. if self.training and self.layer_drop > 0 and i != 0 and i != (len(self.layers)-1) and random.random() < self.layer_drop: unused_params.extend(list(lyr.parameters())) else: # First and last blocks will have autocast disabled for improved precision. if not torch.backends.mps.is_available(): with autocast(x.device.type, enabled=self.enable_fp16 and i != 0): x = lyr(x, time_emb) else: x = lyr(x, time_emb) x = x.float() out = self.out(x) # Involve probabilistic or possibly unused parameters in loss so we don't get DDP errors. extraneous_addition = 0 for p in unused_params: extraneous_addition = extraneous_addition + p.mean() out = out + extraneous_addition * 0 if return_code_pred: return out, mel_pred return out if __name__ == '__main__': clip = torch.randn(2, 100, 400) aligned_latent = torch.randn(2,388,512) aligned_sequence = torch.randint(0,8192,(2,100)) cond = torch.randn(2, 100, 400) ts = torch.LongTensor([600, 600]) model = DiffusionTts(512, layer_drop=.3, unconditioned_percentage=.5) # Test with latent aligned conditioning #o = model(clip, ts, aligned_latent, cond) # Test with sequence aligned conditioning o = model(clip, ts, aligned_sequence, cond) ================================================ FILE: tortoise/models/hifigan_decoder.py ================================================ # adopted from https://github.com/jik876/hifi-gan/blob/master/models.py import torch from torch import nn from torch.nn import Conv1d, ConvTranspose1d from torch.nn import functional as F from torch.nn.utils import remove_weight_norm, weight_norm LRELU_SLOPE = 0.1 def get_padding(k, d): return int((k * d - d) / 2) class ResBlock1(torch.nn.Module): """Residual Block Type 1. It has 3 convolutional layers in each convolutional block. Network:: x -> lrelu -> conv1_1 -> conv1_2 -> conv1_3 -> z -> lrelu -> conv2_1 -> conv2_2 -> conv2_3 -> o -> + -> o |--------------------------------------------------------------------------------------------------| Args: channels (int): number of hidden channels for the convolutional layers. kernel_size (int): size of the convolution filter in each layer. dilations (list): list of dilation value for each conv layer in a block. """ def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)): super().__init__() self.convs1 = nn.ModuleList( [ weight_norm( Conv1d( channels, channels, kernel_size, 1, dilation=dilation[0], padding=get_padding(kernel_size, dilation[0]), ) ), weight_norm( Conv1d( channels, channels, kernel_size, 1, dilation=dilation[1], padding=get_padding(kernel_size, dilation[1]), ) ), weight_norm( Conv1d( channels, channels, kernel_size, 1, dilation=dilation[2], padding=get_padding(kernel_size, dilation[2]), ) ), ] ) self.convs2 = nn.ModuleList( [ weight_norm( Conv1d(channels, channels, kernel_size, 1, dilation=1, padding=get_padding(kernel_size, 1)) ), weight_norm( Conv1d(channels, channels, kernel_size, 1, dilation=1, padding=get_padding(kernel_size, 1)) ), weight_norm( Conv1d(channels, channels, kernel_size, 1, dilation=1, padding=get_padding(kernel_size, 1)) ), ] ) def forward(self, x): """ Args: x (Tensor): input tensor. Returns: Tensor: output tensor. Shapes: x: [B, C, T] """ for c1, c2 in zip(self.convs1, self.convs2): xt = F.leaky_relu(x, LRELU_SLOPE) xt = c1(xt) xt = F.leaky_relu(xt, LRELU_SLOPE) xt = c2(xt) x = xt + x return x def remove_weight_norm(self): for l in self.convs1: remove_weight_norm(l) for l in self.convs2: remove_weight_norm(l) class ResBlock2(torch.nn.Module): """Residual Block Type 2. It has 1 convolutional layers in each convolutional block. Network:: x -> lrelu -> conv1-> -> z -> lrelu -> conv2-> o -> + -> o |---------------------------------------------------| Args: channels (int): number of hidden channels for the convolutional layers. kernel_size (int): size of the convolution filter in each layer. dilations (list): list of dilation value for each conv layer in a block. """ def __init__(self, channels, kernel_size=3, dilation=(1, 3)): super().__init__() self.convs = nn.ModuleList( [ weight_norm( Conv1d( channels, channels, kernel_size, 1, dilation=dilation[0], padding=get_padding(kernel_size, dilation[0]), ) ), weight_norm( Conv1d( channels, channels, kernel_size, 1, dilation=dilation[1], padding=get_padding(kernel_size, dilation[1]), ) ), ] ) def forward(self, x): for c in self.convs: xt = F.leaky_relu(x, LRELU_SLOPE) xt = c(xt) x = xt + x return x def remove_weight_norm(self): for l in self.convs: remove_weight_norm(l) class HifiganGenerator(torch.nn.Module): def __init__( self, in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_pre_weight_norm=True, conv_post_weight_norm=True, conv_post_bias=True, ): r"""HiFiGAN Generator with Multi-Receptive Field Fusion (MRF) Network: x -> lrelu -> upsampling_layer -> resblock1_k1x1 -> z1 -> + -> z_sum / #resblocks -> lrelu -> conv_post_7x1 -> tanh -> o .. -> zI ---| resblockN_kNx1 -> zN ---' Args: in_channels (int): number of input tensor channels. out_channels (int): number of output tensor channels. resblock_type (str): type of the `ResBlock`. '1' or '2'. resblock_dilation_sizes (List[List[int]]): list of dilation values in each layer of a `ResBlock`. resblock_kernel_sizes (List[int]): list of kernel sizes for each `ResBlock`. upsample_kernel_sizes (List[int]): list of kernel sizes for each transposed convolution. upsample_initial_channel (int): number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer. upsample_factors (List[int]): upsampling factors (stride) for each upsampling layer. inference_padding (int): constant padding applied to the input at inference time. Defaults to 5. """ super().__init__() self.inference_padding = inference_padding self.num_kernels = len(resblock_kernel_sizes) self.num_upsamples = len(upsample_factors) # initial upsampling layers self.conv_pre = weight_norm(Conv1d(in_channels, upsample_initial_channel, 7, 1, padding=3)) resblock = ResBlock1 if resblock_type == "1" else ResBlock2 # upsampling layers self.ups = nn.ModuleList() for i, (u, k) in enumerate(zip(upsample_factors, upsample_kernel_sizes)): self.ups.append( weight_norm( ConvTranspose1d( upsample_initial_channel // (2**i), upsample_initial_channel // (2 ** (i + 1)), k, u, padding=(k - u) // 2, ) ) ) # MRF blocks self.resblocks = nn.ModuleList() for i in range(len(self.ups)): ch = upsample_initial_channel // (2 ** (i + 1)) for _, (k, d) in enumerate(zip(resblock_kernel_sizes, resblock_dilation_sizes)): self.resblocks.append(resblock(ch, k, d)) # post convolution layer self.conv_post = weight_norm(Conv1d(ch, out_channels, 7, 1, padding=3, bias=conv_post_bias)) if cond_channels > 0: self.cond_layer = nn.Conv1d(cond_channels, upsample_initial_channel, 1) if not conv_pre_weight_norm: remove_weight_norm(self.conv_pre) if not conv_post_weight_norm: remove_weight_norm(self.conv_post) self.device = torch.device('cuda' if torch.cuda.is_available() else'cpu') if torch.backends.mps.is_available(): self.device = torch.device('mps') def forward(self, x, g=None): """ Args: x (Tensor): feature input tensor. g (Tensor): global conditioning input tensor. Returns: Tensor: output waveform. Shapes: x: [B, C, T] Tensor: [B, 1, T] """ o = self.conv_pre(x) if hasattr(self, "cond_layer"): o = o + self.cond_layer(g) for i in range(self.num_upsamples): o = F.leaky_relu(o, LRELU_SLOPE) o = self.ups[i](o) z_sum = None for j in range(self.num_kernels): if z_sum is None: z_sum = self.resblocks[i * self.num_kernels + j](o) else: z_sum += self.resblocks[i * self.num_kernels + j](o) o = z_sum / self.num_kernels o = F.leaky_relu(o) o = self.conv_post(o) o = torch.tanh(o) return o @torch.no_grad() def inference(self, c, g=None): """ Args: x (Tensor): conditioning input tensor. Returns: Tensor: output waveform. Shapes: x: [B, C, T] Tensor: [B, 1, T] """ # c = c.to(self.conv_pre.weight.device) # c = torch.nn.functional.pad(c, (self.inference_padding, self.inference_padding), "replicate") up_1 = torch.nn.functional.interpolate( c.transpose(1,2), scale_factor=[1024 / 256], mode="linear", ) up_2 = torch.nn.functional.interpolate( up_1, scale_factor=[24000 / 22050], mode="linear", ) g = g.unsqueeze(0) return self.forward(up_2.to(self.device), g.transpose(1,2)) def remove_weight_norm(self): print("Removing weight norm...") for l in self.ups: remove_weight_norm(l) for l in self.resblocks: l.remove_weight_norm() remove_weight_norm(self.conv_pre) remove_weight_norm(self.conv_post) ================================================ FILE: tortoise/models/random_latent_generator.py ================================================ import math import torch import torch.nn as nn import torch.nn.functional as F def fused_leaky_relu(input, bias=None, negative_slope=0.2, scale=2 ** 0.5): if bias is not None: rest_dim = [1] * (input.ndim - bias.ndim - 1) return ( F.leaky_relu( input + bias.view(1, bias.shape[0], *rest_dim), negative_slope=negative_slope ) * scale ) else: return F.leaky_relu(input, negative_slope=0.2) * scale class EqualLinear(nn.Module): def __init__( self, in_dim, out_dim, bias=True, bias_init=0, lr_mul=1 ): super().__init__() self.weight = nn.Parameter(torch.randn(out_dim, in_dim).div_(lr_mul)) if bias: self.bias = nn.Parameter(torch.zeros(out_dim).fill_(bias_init)) else: self.bias = None self.scale = (1 / math.sqrt(in_dim)) * lr_mul self.lr_mul = lr_mul def forward(self, input): out = F.linear(input, self.weight * self.scale) out = fused_leaky_relu(out, self.bias * self.lr_mul) return out class RandomLatentConverter(nn.Module): def __init__(self, channels): super().__init__() self.layers = nn.Sequential(*[EqualLinear(channels, channels, lr_mul=.1) for _ in range(5)], nn.Linear(channels, channels)) self.channels = channels def forward(self, ref): r = torch.randn(ref.shape[0], self.channels, device=ref.device) y = self.layers(r) return y if __name__ == '__main__': model = RandomLatentConverter(512) model(torch.randn(5,512)) ================================================ FILE: tortoise/models/stream_generator.py ================================================ # Adapted from: https://github.com/LowinLi/transformers-stream-generator from transformers import ( GenerationConfig, GenerationMixin, LogitsProcessorList, StoppingCriteriaList, DisjunctiveConstraint, BeamSearchScorer, PhrasalConstraint, ConstrainedBeamSearchScorer, PreTrainedModel, ) import numpy as np import random import warnings import inspect from transformers.generation.utils import GenerateOutput, SampleOutput, logger import torch from typing import Callable, List, Optional, Union from torch import nn import torch.distributed as dist import copy def setup_seed(seed): if seed == -1: return torch.manual_seed(seed) if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed) np.random.seed(seed) random.seed(seed) torch.backends.cudnn.deterministic = True class StreamGenerationConfig(GenerationConfig): def __init__(self, **kwargs): super().__init__(**kwargs) self.do_stream = kwargs.pop("do_stream", False) class NewGenerationMixin(GenerationMixin): @torch.no_grad() def generate( self, inputs: Optional[torch.Tensor] = None, generation_config: Optional[StreamGenerationConfig] = None, logits_processor: Optional[LogitsProcessorList] = None, stopping_criteria: Optional[StoppingCriteriaList] = None, prefix_allowed_tokens_fn: Optional[ Callable[[int, torch.Tensor], List[int]] ] = None, synced_gpus: Optional[bool] = False, seed=0, **kwargs, ) -> Union[GenerateOutput, torch.LongTensor]: r""" Generates sequences of token ids for models with a language modeling head. Most generation-controlling parameters are set in `generation_config` which, if not passed, will be set to the model's default generation configuration. You can override any `generation_config` by passing the corresponding parameters to generate(), e.g. `.generate(inputs, num_beams=4, do_sample=True)`. For an overview of generation strategies and code examples, check out the [following guide](./generation_strategies). Parameters: inputs (`torch.Tensor` of varying shape depending on the modality, *optional*): The sequence used as a prompt for the generation or as model inputs to the encoder. If `None` the method initializes it with `bos_token_id` and a batch size of 1. For decoder-only models `inputs` should of in the format of `input_ids`. For encoder-decoder models *inputs* can represent any of `input_ids`, `input_values`, `input_features`, or `pixel_values`. generation_config (`~generation.GenerationConfig`, *optional*): The generation configuration to be used as base parametrization for the generation call. `**kwargs` passed to generate matching the attributes of `generation_config` will override them. If `generation_config` is not provided, the default will be used, which had the following loading priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit [`~generation.GenerationConfig`]'s default values, whose documentation should be checked to parameterize generation. logits_processor (`LogitsProcessorList`, *optional*): Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users. stopping_criteria (`StoppingCriteriaList`, *optional*): Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users. prefix_allowed_tokens_fn (`Callable[[int, torch.Tensor], List[int]]`, *optional*): If provided, this function constraints the beam search to allowed tokens only at each step. If not provided no constraint is applied. This function takes 2 arguments: the batch ID `batch_id` and `input_ids`. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID `batch_id` and the previously generated tokens `inputs_ids`. This argument is useful for constrained generation conditioned on the prefix, as described in [Autoregressive Entity Retrieval](https://arxiv.org/abs/2010.00904). synced_gpus (`bool`, *optional*, defaults to `False`): Whether to continue running the while loop until max_length (needed for ZeRO stage 3) kwargs: Ad hoc parametrization of `generate_config` and/or additional model-specific kwargs that will be forwarded to the `forward` function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with *decoder_*. Return: [`~utils.ModelOutput`] or `torch.LongTensor`: A [`~utils.ModelOutput`] (if `return_dict_in_generate=True` or when `config.return_dict_in_generate=True`) or a `torch.FloatTensor`. If the model is *not* an encoder-decoder model (`model.config.is_encoder_decoder=False`), the possible [`~utils.ModelOutput`] types are: - [`~generation.GreedySearchDecoderOnlyOutput`], - [`~generation.SampleDecoderOnlyOutput`], - [`~generation.BeamSearchDecoderOnlyOutput`], - [`~generation.BeamSampleDecoderOnlyOutput`] If the model is an encoder-decoder model (`model.config.is_encoder_decoder=True`), the possible [`~utils.ModelOutput`] types are: - [`~generation.GreedySearchEncoderDecoderOutput`], - [`~generation.SampleEncoderDecoderOutput`], - [`~generation.BeamSearchEncoderDecoderOutput`], - [`~generation.BeamSampleEncoderDecoderOutput`] """ setup_seed(seed) # 1. Handle `generation_config` and kwargs that might update it, and validate the `.generate()` call self._validate_model_class() # priority: `generation_config` argument > `model.generation_config` (the default generation config) if generation_config is None: # legacy: users may modify the model configuration to control generation -- update the generation config # model attribute accordingly, if it was created from the model config if self.generation_config._from_model_config: new_generation_config = StreamGenerationConfig.from_model_config( self.config ) if new_generation_config != self.generation_config: warnings.warn( "You have modified the pretrained model configuration to control generation. This is a" " deprecated strategy to control generation and will be removed soon, in a future version." " Please use a generation configuration file (see" " https://huggingface.co/docs/transformers/main_classes/text_generation)" ) self.generation_config = new_generation_config generation_config = self.generation_config generation_config = copy.deepcopy(generation_config) model_kwargs = generation_config.update( **kwargs ) # All unused kwargs must be model kwargs # self._validate_model_kwargs(model_kwargs.copy()) # 2. Set generation parameters if not already defined logits_processor = ( logits_processor if logits_processor is not None else LogitsProcessorList() ) stopping_criteria = ( stopping_criteria if stopping_criteria is not None else StoppingCriteriaList() ) if ( generation_config.pad_token_id is None and generation_config.eos_token_id is not None ): if model_kwargs.get("attention_mask", None) is None: logger.warning( "The attention mask and the pad token id were not set. As a consequence, you may observe " "unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results." ) eos_token_id = generation_config.eos_token_id if isinstance(eos_token_id, list): eos_token_id = eos_token_id[0] logger.warning( f"Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation." ) generation_config.pad_token_id = eos_token_id # 3. Define model inputs # inputs_tensor has to be defined # model_input_name is defined if model-specific keyword input is passed # otherwise model_input_name is None # all model-specific keyword inputs are removed from `model_kwargs` inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs( inputs, generation_config.bos_token_id, model_kwargs ) batch_size = inputs_tensor.shape[0] # 4. Define other model kwargs model_kwargs["output_attentions"] = generation_config.output_attentions model_kwargs["output_hidden_states"] = generation_config.output_hidden_states model_kwargs["use_cache"] = generation_config.use_cache accepts_attention_mask = "attention_mask" in set( inspect.signature(self.forward).parameters.keys() ) requires_attention_mask = "encoder_outputs" not in model_kwargs if ( model_kwargs.get("attention_mask", None) is None and requires_attention_mask and accepts_attention_mask ): model_kwargs[ "attention_mask" ] = self._prepare_attention_mask_for_generation( inputs_tensor, generation_config.pad_token_id, generation_config.eos_token_id, ) # decoder-only models should use left-padding for generation if not self.config.is_encoder_decoder: if ( generation_config.pad_token_id is not None and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0 ): logger.warning( "A decoder-only architecture is being used, but right-padding was detected! For correct " "generation results, please set `padding_side='left'` when initializing the tokenizer." ) if self.config.is_encoder_decoder and "encoder_outputs" not in model_kwargs: # if model is encoder decoder encoder_outputs are created # and added to `model_kwargs` model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation( inputs_tensor, model_kwargs, model_input_name ) # 5. Prepare `input_ids` which will be used for auto-regressive generation if self.config.is_encoder_decoder: input_ids = self._prepare_decoder_input_ids_for_generation( batch_size, decoder_start_token_id=generation_config.decoder_start_token_id, bos_token_id=generation_config.bos_token_id, model_kwargs=model_kwargs, device=inputs_tensor.device, ) else: # if decoder-only then inputs_tensor has to be `input_ids` input_ids = inputs_tensor # 6. Prepare `max_length` depending on other stopping criteria. input_ids_seq_length = input_ids.shape[-1] has_default_max_length = ( kwargs.get("max_length") is None and generation_config.max_length is not None ) if has_default_max_length and generation_config.max_new_tokens is None: warnings.warn( "Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to" f" {generation_config.max_length} (`generation_config.max_length`). Controlling `max_length` via the" " config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we" " recommend using `max_new_tokens` to control the maximum length of the generation.", UserWarning, ) elif has_default_max_length and generation_config.max_new_tokens is not None: generation_config.max_length = ( generation_config.max_new_tokens + input_ids_seq_length ) elif ( not has_default_max_length and generation_config.max_new_tokens is not None ): raise ValueError( "Both `max_new_tokens` and `max_length` have been set but they serve the same purpose -- setting a" " limit to the generated output length. Remove one of those arguments. Please refer to the" " documentation for more information. " "(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)" ) if ( generation_config.min_length is not None and generation_config.min_length > generation_config.max_length ): raise ValueError( f"Unfeasible length constraints: the minimum length ({generation_config.min_length}) is larger than" f" the maximum length ({generation_config.max_length})" ) if input_ids_seq_length >= generation_config.max_length: input_ids_string = ( "decoder_input_ids" if self.config.is_encoder_decoder else "input_ids" ) logger.warning( f"Input length of {input_ids_string} is {input_ids_seq_length}, but `max_length` is set to" f" {generation_config.max_length}. This can lead to unexpected behavior. You should consider" " increasing `max_new_tokens`." ) # 7. determine generation mode is_constraint_gen_mode = ( generation_config.constraints is not None or generation_config.force_words_ids is not None ) is_contrastive_search_gen_mode = ( generation_config.top_k is not None and generation_config.top_k > 1 and generation_config.do_sample is False and generation_config.penalty_alpha is not None and generation_config.penalty_alpha > 0 ) is_greedy_gen_mode = ( (generation_config.num_beams == 1) and (generation_config.num_beam_groups == 1) and generation_config.do_sample is False and not is_constraint_gen_mode and not is_contrastive_search_gen_mode ) is_sample_gen_mode = ( (generation_config.num_beams == 1) and (generation_config.num_beam_groups == 1) and generation_config.do_sample is True and generation_config.do_stream is False and not is_constraint_gen_mode and not is_contrastive_search_gen_mode ) is_sample_gen_stream_mode = ( (generation_config.num_beams == 1) and (generation_config.num_beam_groups == 1) and generation_config.do_stream is True and not is_constraint_gen_mode and not is_contrastive_search_gen_mode ) is_beam_gen_mode = ( (generation_config.num_beams > 1) and (generation_config.num_beam_groups == 1) and generation_config.do_sample is False and not is_constraint_gen_mode and not is_contrastive_search_gen_mode ) is_beam_sample_gen_mode = ( (generation_config.num_beams > 1) and (generation_config.num_beam_groups == 1) and generation_config.do_sample is True and not is_constraint_gen_mode and not is_contrastive_search_gen_mode ) is_group_beam_gen_mode = ( (generation_config.num_beams > 1) and (generation_config.num_beam_groups > 1) and not is_constraint_gen_mode and not is_contrastive_search_gen_mode ) if generation_config.num_beam_groups > generation_config.num_beams: raise ValueError( "`num_beam_groups` has to be smaller or equal to `num_beams`" ) if is_group_beam_gen_mode and generation_config.do_sample is True: raise ValueError( "Diverse beam search cannot be used in sampling mode. Make sure that `do_sample` is set to `False`." ) if self.device.type != input_ids.device.type: warnings.warn( "You are calling .generate() with the `input_ids` being on a device type different" f" than your model's device. `input_ids` is on {input_ids.device.type}, whereas the model" f" is on {self.device.type}. You may experience unexpected behaviors or slower generation." " Please make sure that you have put `input_ids` to the" f" correct device by calling for example input_ids = input_ids.to('{self.device.type}') before" " running `.generate()`.", UserWarning, ) # 8. prepare distribution pre_processing samplers logits_processor = self._get_logits_processor( generation_config=generation_config, input_ids_seq_length=input_ids_seq_length, encoder_input_ids=inputs_tensor, prefix_allowed_tokens_fn=prefix_allowed_tokens_fn, logits_processor=logits_processor, ) # 9. prepare stopping criteria stopping_criteria = self._get_stopping_criteria( generation_config=generation_config, stopping_criteria=stopping_criteria ) # 10. go into different generation modes if is_greedy_gen_mode: if generation_config.num_return_sequences > 1: raise ValueError( f"num_return_sequences has to be 1, but is {generation_config.num_return_sequences} when doing" " greedy search." ) # 11. run greedy search return self.greedy_search( input_ids, logits_processor=logits_processor, stopping_criteria=stopping_criteria, pad_token_id=generation_config.pad_token_id, eos_token_id=generation_config.eos_token_id, output_scores=generation_config.output_scores, return_dict_in_generate=generation_config.return_dict_in_generate, synced_gpus=synced_gpus, **model_kwargs, ) elif is_contrastive_search_gen_mode: if generation_config.num_return_sequences > 1: raise ValueError( f"num_return_sequences has to be 1, but is {generation_config.num_return_sequences} when doing" " contrastive search." ) return self.contrastive_search( input_ids, top_k=generation_config.top_k, penalty_alpha=generation_config.penalty_alpha, logits_processor=logits_processor, stopping_criteria=stopping_criteria, pad_token_id=generation_config.pad_token_id, eos_token_id=generation_config.eos_token_id, output_scores=generation_config.output_scores, return_dict_in_generate=generation_config.return_dict_in_generate, synced_gpus=synced_gpus, **model_kwargs, ) elif is_sample_gen_mode: # 11. prepare logits warper logits_warper = self._get_logits_warper(generation_config) # 12. expand input_ids with `num_return_sequences` additional sequences per batch input_ids, model_kwargs = self._expand_inputs_for_generation( input_ids=input_ids, expand_size=generation_config.num_return_sequences, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs, ) # 13. run sample return self.sample( input_ids, logits_processor=logits_processor, logits_warper=logits_warper, stopping_criteria=stopping_criteria, pad_token_id=generation_config.pad_token_id, eos_token_id=generation_config.eos_token_id, output_scores=generation_config.output_scores, return_dict_in_generate=generation_config.return_dict_in_generate, synced_gpus=synced_gpus, **model_kwargs, ) elif is_sample_gen_stream_mode: # 11. prepare logits warper logits_warper = self._get_logits_warper(generation_config) # 12. expand input_ids with `num_return_sequences` additional sequences per batch input_ids, model_kwargs = self._expand_inputs_for_generation( input_ids=input_ids, expand_size=generation_config.num_return_sequences, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs, ) # 13. run sample return self.sample_stream( input_ids, logits_processor=logits_processor, logits_warper=logits_warper, stopping_criteria=stopping_criteria, pad_token_id=generation_config.pad_token_id, eos_token_id=generation_config.eos_token_id, output_scores=generation_config.output_scores, return_dict_in_generate=generation_config.return_dict_in_generate, synced_gpus=synced_gpus, **model_kwargs, ) elif is_beam_gen_mode: if generation_config.num_return_sequences > generation_config.num_beams: raise ValueError( "`num_return_sequences` has to be smaller or equal to `num_beams`." ) if stopping_criteria.max_length is None: raise ValueError( "`max_length` needs to be a stopping_criteria for now." ) # 11. prepare beam search scorer beam_scorer = BeamSearchScorer( batch_size=batch_size, num_beams=generation_config.num_beams, device=inputs_tensor.device, length_penalty=generation_config.length_penalty, do_early_stopping=generation_config.early_stopping, num_beam_hyps_to_keep=generation_config.num_return_sequences, ) # 12. interleave input_ids with `num_beams` additional sequences per batch input_ids, model_kwargs = self._expand_inputs_for_generation( input_ids=input_ids, expand_size=generation_config.num_beams, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs, ) # 13. run beam search return self.beam_search( input_ids, beam_scorer, logits_processor=logits_processor, stopping_criteria=stopping_criteria, pad_token_id=generation_config.pad_token_id, eos_token_id=generation_config.eos_token_id, output_scores=generation_config.output_scores, return_dict_in_generate=generation_config.return_dict_in_generate, synced_gpus=synced_gpus, **model_kwargs, ) elif is_beam_sample_gen_mode: # 11. prepare logits warper logits_warper = self._get_logits_warper(generation_config) if stopping_criteria.max_length is None: raise ValueError( "`max_length` needs to be a stopping_criteria for now." ) # 12. prepare beam search scorer beam_scorer = BeamSearchScorer( batch_size=batch_size * generation_config.num_return_sequences, num_beams=generation_config.num_beams, device=inputs_tensor.device, length_penalty=generation_config.length_penalty, do_early_stopping=generation_config.early_stopping, ) # 13. interleave input_ids with `num_beams` additional sequences per batch input_ids, model_kwargs = self._expand_inputs_for_generation( input_ids=input_ids, expand_size=generation_config.num_beams * generation_config.num_return_sequences, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs, ) # 14. run beam sample return self.beam_sample( input_ids, beam_scorer, logits_processor=logits_processor, logits_warper=logits_warper, stopping_criteria=stopping_criteria, pad_token_id=generation_config.pad_token_id, eos_token_id=generation_config.eos_token_id, output_scores=generation_config.output_scores, return_dict_in_generate=generation_config.return_dict_in_generate, synced_gpus=synced_gpus, **model_kwargs, ) elif is_group_beam_gen_mode: if generation_config.num_return_sequences > generation_config.num_beams: raise ValueError( "`num_return_sequences` has to be smaller or equal to `num_beams`." ) if generation_config.num_beams % generation_config.num_beam_groups != 0: raise ValueError( "`num_beams` should be divisible by `num_beam_groups` for group beam search." ) if stopping_criteria.max_length is None: raise ValueError( "`max_length` needs to be a stopping_criteria for now." ) has_default_typical_p = ( kwargs.get("typical_p") is None and generation_config.typical_p == 1.0 ) if not has_default_typical_p: raise ValueError( "Decoder argument `typical_p` is not supported with beam groups." ) # 11. prepare beam search scorer beam_scorer = BeamSearchScorer( batch_size=batch_size, num_beams=generation_config.num_beams, max_length=stopping_criteria.max_length, device=inputs_tensor.device, length_penalty=generation_config.length_penalty, do_early_stopping=generation_config.early_stopping, num_beam_hyps_to_keep=generation_config.num_return_sequences, num_beam_groups=generation_config.num_beam_groups, ) # 12. interleave input_ids with `num_beams` additional sequences per batch input_ids, model_kwargs = self._expand_inputs_for_generation( input_ids=input_ids, expand_size=generation_config.num_beams, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs, ) # 13. run beam search return self.group_beam_search( input_ids, beam_scorer, logits_processor=logits_processor, stopping_criteria=stopping_criteria, pad_token_id=generation_config.pad_token_id, eos_token_id=generation_config.eos_token_id, output_scores=generation_config.output_scores, return_dict_in_generate=generation_config.return_dict_in_generate, synced_gpus=synced_gpus, **model_kwargs, ) elif is_constraint_gen_mode: if generation_config.num_return_sequences > generation_config.num_beams: raise ValueError( "`num_return_sequences` has to be smaller or equal to `num_beams`." ) if stopping_criteria.max_length is None: raise ValueError( "`max_length` needs to be a stopping_criteria for now." ) if generation_config.num_beams <= 1: raise ValueError( "`num_beams` needs to be greater than 1 for constrained generation." ) if generation_config.do_sample: raise ValueError( "`do_sample` needs to be false for constrained generation." ) if ( generation_config.num_beam_groups is not None and generation_config.num_beam_groups > 1 ): raise ValueError( "`num_beam_groups` not supported yet for constrained generation." ) final_constraints = [] if generation_config.constraints is not None: final_constraints = generation_config.constraints if generation_config.force_words_ids is not None: def typeerror(): raise ValueError( "`force_words_ids` has to either be a `List[List[List[int]]]` or `List[List[int]]`" f"of positive integers, but is {generation_config.force_words_ids}." ) if ( not isinstance(generation_config.force_words_ids, list) or len(generation_config.force_words_ids) == 0 ): typeerror() for word_ids in generation_config.force_words_ids: if isinstance(word_ids[0], list): if not isinstance(word_ids, list) or len(word_ids) == 0: typeerror() if any( not isinstance(token_ids, list) for token_ids in word_ids ): typeerror() if any( any( (not isinstance(token_id, int) or token_id < 0) for token_id in token_ids ) for token_ids in word_ids ): typeerror() constraint = DisjunctiveConstraint(word_ids) else: if not isinstance(word_ids, list) or len(word_ids) == 0: typeerror() if any( (not isinstance(token_id, int) or token_id < 0) for token_id in word_ids ): typeerror() constraint = PhrasalConstraint(word_ids) final_constraints.append(constraint) # 11. prepare beam search scorer constrained_beam_scorer = ConstrainedBeamSearchScorer( constraints=final_constraints, batch_size=batch_size, num_beams=generation_config.num_beams, device=inputs_tensor.device, length_penalty=generation_config.length_penalty, do_early_stopping=generation_config.early_stopping, num_beam_hyps_to_keep=generation_config.num_return_sequences, ) # 12. interleave input_ids with `num_beams` additional sequences per batch input_ids, model_kwargs = self._expand_inputs_for_generation( input_ids=input_ids, expand_size=generation_config.num_beams, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs, ) # 13. run beam search return self.constrained_beam_search( input_ids, constrained_beam_scorer=constrained_beam_scorer, logits_processor=logits_processor, stopping_criteria=stopping_criteria, pad_token_id=generation_config.pad_token_id, eos_token_id=generation_config.eos_token_id, output_scores=generation_config.output_scores, return_dict_in_generate=generation_config.return_dict_in_generate, synced_gpus=synced_gpus, **model_kwargs, ) @torch.no_grad() def sample_stream( self, input_ids: torch.LongTensor, logits_processor: Optional[LogitsProcessorList] = None, stopping_criteria: Optional[StoppingCriteriaList] = None, logits_warper: Optional[LogitsProcessorList] = None, max_length: Optional[int] = None, pad_token_id: Optional[int] = None, eos_token_id: Optional[Union[int, List[int]]] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, output_scores: Optional[bool] = None, return_dict_in_generate: Optional[bool] = None, synced_gpus: Optional[bool] = False, **model_kwargs, ) -> Union[SampleOutput, torch.LongTensor]: r""" Generates sequences of token ids for models with a language modeling head using **multinomial sampling** and can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models. In most cases, you do not need to call [`~generation.GenerationMixin.sample`] directly. Use generate() instead. For an overview of generation strategies and code examples, check the [following guide](./generation_strategies). Parameters: input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): The sequence used as a prompt for the generation. logits_processor (`LogitsProcessorList`, *optional*): An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsProcessor`] used to modify the prediction scores of the language modeling head applied at each generation step. stopping_criteria (`StoppingCriteriaList`, *optional*): An instance of [`StoppingCriteriaList`]. List of instances of class derived from [`StoppingCriteria`] used to tell if the generation loop should stop. logits_warper (`LogitsProcessorList`, *optional*): An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsWarper`] used to warp the prediction score distribution of the language modeling head applied before multinomial sampling at each generation step. max_length (`int`, *optional*, defaults to 20): **DEPRECATED**. Use `logits_processor` or `stopping_criteria` directly to cap the number of generated tokens. The maximum length of the sequence to be generated. pad_token_id (`int`, *optional*): The id of the *padding* token. eos_token_id (`int`, *optional*): The id of the *end-of-sequence* token. output_attentions (`bool`, *optional*, defaults to `False`): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more details. output_hidden_states (`bool`, *optional*, defaults to `False`): Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more details. output_scores (`bool`, *optional*, defaults to `False`): Whether or not to return the prediction scores. See `scores` under returned tensors for more details. return_dict_in_generate (`bool`, *optional*, defaults to `False`): Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. synced_gpus (`bool`, *optional*, defaults to `False`): Whether to continue running the while loop until max_length (needed for ZeRO stage 3) model_kwargs: Additional model specific kwargs will be forwarded to the `forward` function of the model. If model is an encoder-decoder model the kwargs should include `encoder_outputs`. Return: [`~generation.SampleDecoderOnlyOutput`], [`~generation.SampleEncoderDecoderOutput`] or `torch.LongTensor`: A `torch.LongTensor` containing the generated tokens (default behaviour) or a [`~generation.SampleDecoderOnlyOutput`] if `model.config.is_encoder_decoder=False` and `return_dict_in_generate=True` or a [`~generation.SampleEncoderDecoderOutput`] if `model.config.is_encoder_decoder=True`. Examples: ```python >>> from transformers import ( ... AutoTokenizer, ... AutoModelForCausalLM, ... LogitsProcessorList, ... MinLengthLogitsProcessor, ... TopKLogitsWarper, ... TemperatureLogitsWarper, ... StoppingCriteriaList, ... MaxLengthCriteria, ... ) >>> import torch >>> tokenizer = AutoTokenizer.from_pretrained("gpt2") >>> model = AutoModelForCausalLM.from_pretrained("gpt2") >>> # set pad_token_id to eos_token_id because GPT2 does not have a EOS token >>> model.config.pad_token_id = model.config.eos_token_id >>> model.generation_config.pad_token_id = model.config.eos_token_id >>> input_prompt = "Today is a beautiful day, and" >>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids >>> # instantiate logits processors >>> logits_processor = LogitsProcessorList( ... [ ... MinLengthLogitsProcessor(15, eos_token_id=model.generation_config.eos_token_id), ... ] ... ) >>> # instantiate logits processors >>> logits_warper = LogitsProcessorList( ... [ ... TopKLogitsWarper(50), ... TemperatureLogitsWarper(0.7), ... ] ... ) >>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)]) >>> torch.manual_seed(0) # doctest: +IGNORE_RESULT >>> outputs = model.sample( ... input_ids, ... logits_processor=logits_processor, ... logits_warper=logits_warper, ... stopping_criteria=stopping_criteria, ... ) >>> tokenizer.batch_decode(outputs, skip_special_tokens=True) ['Today is a beautiful day, and a wonderful day.\n\nI was lucky enough to meet the'] ```""" # init values logits_processor = ( logits_processor if logits_processor is not None else LogitsProcessorList() ) stopping_criteria = ( stopping_criteria if stopping_criteria is not None else StoppingCriteriaList() ) if max_length is not None: warnings.warn( "`max_length` is deprecated in this function, use" " `stopping_criteria=StoppingCriteriaList(MaxLengthCriteria(max_length=max_length))` instead.", UserWarning, ) stopping_criteria = validate_stopping_criteria( stopping_criteria, max_length ) logits_warper = ( logits_warper if logits_warper is not None else LogitsProcessorList() ) pad_token_id = ( pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id ) eos_token_id = ( eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id ) if isinstance(eos_token_id, int): eos_token_id = [eos_token_id] output_scores = ( output_scores if output_scores is not None else self.generation_config.output_scores ) output_attentions = ( output_attentions if output_attentions is not None else self.generation_config.output_attentions ) output_hidden_states = ( output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states ) return_dict_in_generate = ( return_dict_in_generate if return_dict_in_generate is not None else self.generation_config.return_dict_in_generate ) # init attention / hidden states / scores tuples scores = () if (return_dict_in_generate and output_scores) else None decoder_attentions = ( () if (return_dict_in_generate and output_attentions) else None ) cross_attentions = ( () if (return_dict_in_generate and output_attentions) else None ) decoder_hidden_states = ( () if (return_dict_in_generate and output_hidden_states) else None ) # keep track of which sequences are already finished unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1) this_peer_finished = False # used by synced_gpus only # auto-regressive generation while True: if synced_gpus: # Under synced_gpus the `forward` call must continue until all gpus complete their sequence. # The following logic allows an early break if all peers finished generating their sequence this_peer_finished_flag = torch.tensor( 0.0 if this_peer_finished else 1.0 ).to(input_ids.device) # send 0.0 if we finished, 1.0 otherwise dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM) # did all peers finish? the reduced sum will be 0.0 then if this_peer_finished_flag.item() == 0.0: break # prepare model inputs model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) # forward pass to get next token outputs = self( **model_inputs, return_dict=True, output_attentions=output_attentions, output_hidden_states=output_hidden_states, ) if synced_gpus and this_peer_finished: continue # don't waste resources running the code we don't need next_token_logits = outputs.logits[:, -1, :] # pre-process distribution next_token_scores = logits_processor(input_ids, next_token_logits) next_token_scores = logits_warper(input_ids, next_token_scores) # Store scores, attentions and hidden_states when required if return_dict_in_generate: if output_scores: scores += (next_token_scores,) if output_attentions: decoder_attentions += ( (outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,) ) if self.config.is_encoder_decoder: cross_attentions += (outputs.cross_attentions,) if output_hidden_states: decoder_hidden_states += ( (outputs.decoder_hidden_states,) if self.config.is_encoder_decoder else (outputs.hidden_states,) ) # sample probs = nn.functional.softmax(next_token_scores, dim=-1) next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1) # finished sentences should have their next token be a padding token if eos_token_id is not None: if pad_token_id is None: raise ValueError( "If `eos_token_id` is defined, make sure that `pad_token_id` is defined." ) next_tokens = next_tokens * unfinished_sequences + pad_token_id * ( 1 - unfinished_sequences ) yield next_tokens, self.final_norm(outputs.hidden_states[-1][:, -1]) # update generated ids, model inputs, and length for next step input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1) model_kwargs = self._update_model_kwargs_for_generation( outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder ) # if eos_token was found in one sentence, set sentence to finished if eos_token_id is not None: unfinished_sequences = unfinished_sequences.mul( (sum(next_tokens != i for i in eos_token_id)).long() ) # stop when each sentence is finished, or if we exceed the maximum length if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores): if not synced_gpus: break else: this_peer_finished = True def init_stream_support(): """Overload PreTrainedModel for streaming.""" PreTrainedModel.generate_stream = NewGenerationMixin.generate PreTrainedModel.sample_stream = NewGenerationMixin.sample_stream if __name__ == "__main__": from transformers import PreTrainedModel from transformers import AutoTokenizer, AutoModelForCausalLM PreTrainedModel.generate = NewGenerationMixin.generate PreTrainedModel.sample_stream = NewGenerationMixin.sample_stream model = AutoModelForCausalLM.from_pretrained( "bigscience/bloom-560m", torch_dtype=torch.float16 ) tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m") model = model.to("cuda:0") model = model.eval() prompt_text = "hello? \n" input_ids = tokenizer( prompt_text, return_tensors="pt", add_special_tokens=False ).input_ids input_ids = input_ids.to("cuda:0") with torch.no_grad(): result = model.generate( input_ids, max_new_tokens=200, do_sample=True, top_k=30, top_p=0.85, temperature=0.35, repetition_penalty=1.2, early_stopping=True, seed=0, ) print(tokenizer.decode(result, skip_special_tokens=True)) generator = model.generate( input_ids, max_new_tokens=200, do_sample=True, top_k=30, top_p=0.85, temperature=0.35, repetition_penalty=1.2, early_stopping=True, seed=0, do_stream=True, ) stream_result = "" for x in generator: chunk = tokenizer.decode(x, skip_special_tokens=True) stream_result += chunk print(stream_result) ================================================ FILE: tortoise/models/transformer.py ================================================ from functools import partial import torch import torch.nn.functional as F from einops import rearrange from rotary_embedding_torch import RotaryEmbedding from torch import nn # helpers def exists(val): return val is not None def default(val, d): return val if exists(val) else d def cast_tuple(val, depth = 1): if isinstance(val, list): val = tuple(val) return val if isinstance(val, tuple) else (val,) * depth def max_neg_value(t): return -torch.finfo(t.dtype).max def stable_softmax(t, dim = -1, alpha = 32 ** 2): t = t / alpha t = t - torch.amax(t, dim = dim, keepdim = True).detach() return (t * alpha).softmax(dim = dim) def route_args(router, args, depth): routed_args = [(dict(), dict()) for _ in range(depth)] matched_keys = [key for key in args.keys() if key in router] for key in matched_keys: val = args[key] for depth, ((f_args, g_args), routes) in enumerate(zip(routed_args, router[key])): new_f_args, new_g_args = map(lambda route: ({key: val} if route else {}), routes) routed_args[depth] = ({**f_args, **new_f_args}, {**g_args, **new_g_args}) return routed_args # classes class SequentialSequence(nn.Module): def __init__(self, layers, args_route = {}, layer_dropout = 0.): super().__init__() assert all(len(route) == len(layers) for route in args_route.values()), 'each argument route map must have the same depth as the number of sequential layers' self.layers = layers self.args_route = args_route self.layer_dropout = layer_dropout def forward(self, x, **kwargs): args = route_args(self.args_route, kwargs, len(self.layers)) layers_and_args = list(zip(self.layers, args)) for (f, g), (f_args, g_args) in layers_and_args: x = x + f(x, **f_args) x = x + g(x, **g_args) return x class DivideMax(nn.Module): def __init__(self, dim): super().__init__() self.dim = dim def forward(self, x): maxes = x.amax(dim = self.dim, keepdim = True).detach() return x / maxes # https://arxiv.org/abs/2103.17239 class LayerScale(nn.Module): def __init__(self, dim, depth, fn): super().__init__() if depth <= 18: init_eps = 0.1 elif depth > 18 and depth <= 24: init_eps = 1e-5 else: init_eps = 1e-6 scale = torch.zeros(1, 1, dim).fill_(init_eps) self.scale = nn.Parameter(scale) self.fn = fn def forward(self, x, **kwargs): return self.fn(x, **kwargs) * self.scale # layer norm class PreNorm(nn.Module): def __init__(self, dim, fn, sandwich = False): super().__init__() self.norm = nn.LayerNorm(dim) self.norm_out = nn.LayerNorm(dim) if sandwich else nn.Identity() self.fn = fn def forward(self, x, **kwargs): x = self.norm(x) x = self.fn(x, **kwargs) return self.norm_out(x) # feed forward class GEGLU(nn.Module): def forward(self, x): x, gates = x.chunk(2, dim = -1) return x * F.gelu(gates) class FeedForward(nn.Module): def __init__(self, dim, dropout = 0., mult = 4.): super().__init__() self.net = nn.Sequential( nn.Linear(dim, dim * mult * 2), GEGLU(), nn.Dropout(dropout), nn.Linear(dim * mult, dim) ) def forward(self, x): return self.net(x) # Attention class Attention(nn.Module): def __init__(self, dim, seq_len, causal = True, heads = 8, dim_head = 64, dropout = 0.): super().__init__() inner_dim = dim_head * heads self.heads = heads self.seq_len = seq_len self.scale = dim_head ** -0.5 self.causal = causal self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False) self.to_out = nn.Sequential( nn.Linear(inner_dim, dim), nn.Dropout(dropout) ) def forward(self, x, mask = None): b, n, _, h, device = *x.shape, self.heads, x.device softmax = torch.softmax qkv = self.to_qkv(x).chunk(3, dim = -1) q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv) q = q * self.scale dots = torch.einsum('b h i d, b h j d -> b h i j', q, k) mask_value = max_neg_value(dots) if exists(mask): mask = rearrange(mask, 'b j -> b () () j') dots.masked_fill_(~mask, mask_value) del mask if self.causal: i, j = dots.shape[-2:] mask = torch.ones(i, j, device = device).triu_(j - i + 1).bool() dots.masked_fill_(mask, mask_value) attn = softmax(dots, dim=-1) out = torch.einsum('b h i j, b h j d -> b h i d', attn, v) out = rearrange(out, 'b h n d -> b n (h d)') out = self.to_out(out) return out # main transformer class class Transformer(nn.Module): def __init__( self, *, dim, depth, seq_len, causal = True, heads = 8, dim_head = 64, ff_mult = 4, attn_dropout = 0., ff_dropout = 0., sparse_attn = False, sandwich_norm = False, ): super().__init__() layers = nn.ModuleList([]) sparse_layer = cast_tuple(sparse_attn, depth) for ind, sparse_attn in zip(range(depth), sparse_layer): attn = Attention(dim, causal = causal, seq_len = seq_len, heads = heads, dim_head = dim_head, dropout = attn_dropout) ff = FeedForward(dim, mult = ff_mult, dropout = ff_dropout) layers.append(nn.ModuleList([ LayerScale(dim, ind + 1, PreNorm(dim, attn, sandwich = sandwich_norm)), LayerScale(dim, ind + 1, PreNorm(dim, ff, sandwich = sandwich_norm)) ])) execute_type = SequentialSequence route_attn = ((True, False),) * depth attn_route_map = {'mask': route_attn} self.layers = execute_type(layers, args_route = attn_route_map) def forward(self, x, **kwargs): return self.layers(x, **kwargs) ================================================ FILE: tortoise/models/vocoder.py ================================================ import torch import torch.nn as nn import torch.nn.functional as F MAX_WAV_VALUE = 32768.0 class KernelPredictor(torch.nn.Module): ''' Kernel predictor for the location-variable convolutions''' def __init__( self, cond_channels, conv_in_channels, conv_out_channels, conv_layers, conv_kernel_size=3, kpnet_hidden_channels=64, kpnet_conv_size=3, kpnet_dropout=0.0, kpnet_nonlinear_activation="LeakyReLU", kpnet_nonlinear_activation_params={"negative_slope": 0.1}, ): ''' Args: cond_channels (int): number of channel for the conditioning sequence, conv_in_channels (int): number of channel for the input sequence, conv_out_channels (int): number of channel for the output sequence, conv_layers (int): number of layers ''' super().__init__() self.conv_in_channels = conv_in_channels self.conv_out_channels = conv_out_channels self.conv_kernel_size = conv_kernel_size self.conv_layers = conv_layers kpnet_kernel_channels = conv_in_channels * conv_out_channels * conv_kernel_size * conv_layers # l_w kpnet_bias_channels = conv_out_channels * conv_layers # l_b self.input_conv = nn.Sequential( nn.utils.weight_norm(nn.Conv1d(cond_channels, kpnet_hidden_channels, 5, padding=2, bias=True)), getattr(nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params), ) self.residual_convs = nn.ModuleList() padding = (kpnet_conv_size - 1) // 2 for _ in range(3): self.residual_convs.append( nn.Sequential( nn.Dropout(kpnet_dropout), nn.utils.weight_norm( nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True)), getattr(nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params), nn.utils.weight_norm( nn.Conv1d(kpnet_hidden_channels, kpnet_hidden_channels, kpnet_conv_size, padding=padding, bias=True)), getattr(nn, kpnet_nonlinear_activation)(**kpnet_nonlinear_activation_params), ) ) self.kernel_conv = nn.utils.weight_norm( nn.Conv1d(kpnet_hidden_channels, kpnet_kernel_channels, kpnet_conv_size, padding=padding, bias=True)) self.bias_conv = nn.utils.weight_norm( nn.Conv1d(kpnet_hidden_channels, kpnet_bias_channels, kpnet_conv_size, padding=padding, bias=True)) def forward(self, c): ''' Args: c (Tensor): the conditioning sequence (batch, cond_channels, cond_length) ''' batch, _, cond_length = c.shape c = self.input_conv(c) for residual_conv in self.residual_convs: residual_conv.to(c.device) c = c + residual_conv(c) k = self.kernel_conv(c) b = self.bias_conv(c) kernels = k.contiguous().view( batch, self.conv_layers, self.conv_in_channels, self.conv_out_channels, self.conv_kernel_size, cond_length, ) bias = b.contiguous().view( batch, self.conv_layers, self.conv_out_channels, cond_length, ) return kernels, bias def remove_weight_norm(self): nn.utils.remove_weight_norm(self.input_conv[0]) nn.utils.remove_weight_norm(self.kernel_conv) nn.utils.remove_weight_norm(self.bias_conv) for block in self.residual_convs: nn.utils.remove_weight_norm(block[1]) nn.utils.remove_weight_norm(block[3]) class LVCBlock(torch.nn.Module): '''the location-variable convolutions''' def __init__( self, in_channels, cond_channels, stride, dilations=[1, 3, 9, 27], lReLU_slope=0.2, conv_kernel_size=3, cond_hop_length=256, kpnet_hidden_channels=64, kpnet_conv_size=3, kpnet_dropout=0.0, ): super().__init__() self.cond_hop_length = cond_hop_length self.conv_layers = len(dilations) self.conv_kernel_size = conv_kernel_size self.kernel_predictor = KernelPredictor( cond_channels=cond_channels, conv_in_channels=in_channels, conv_out_channels=2 * in_channels, conv_layers=len(dilations), conv_kernel_size=conv_kernel_size, kpnet_hidden_channels=kpnet_hidden_channels, kpnet_conv_size=kpnet_conv_size, kpnet_dropout=kpnet_dropout, kpnet_nonlinear_activation_params={"negative_slope": lReLU_slope} ) self.convt_pre = nn.Sequential( nn.LeakyReLU(lReLU_slope), nn.utils.weight_norm(nn.ConvTranspose1d(in_channels, in_channels, 2 * stride, stride=stride, padding=stride // 2 + stride % 2, output_padding=stride % 2)), ) self.conv_blocks = nn.ModuleList() for dilation in dilations: self.conv_blocks.append( nn.Sequential( nn.LeakyReLU(lReLU_slope), nn.utils.weight_norm(nn.Conv1d(in_channels, in_channels, conv_kernel_size, padding=dilation * (conv_kernel_size - 1) // 2, dilation=dilation)), nn.LeakyReLU(lReLU_slope), ) ) def forward(self, x, c): ''' forward propagation of the location-variable convolutions. Args: x (Tensor): the input sequence (batch, in_channels, in_length) c (Tensor): the conditioning sequence (batch, cond_channels, cond_length) Returns: Tensor: the output sequence (batch, in_channels, in_length) ''' _, in_channels, _ = x.shape # (B, c_g, L') x = self.convt_pre(x) # (B, c_g, stride * L') kernels, bias = self.kernel_predictor(c) for i, conv in enumerate(self.conv_blocks): output = conv(x) # (B, c_g, stride * L') k = kernels[:, i, :, :, :, :] # (B, 2 * c_g, c_g, kernel_size, cond_length) b = bias[:, i, :, :] # (B, 2 * c_g, cond_length) output = self.location_variable_convolution(output, k, b, hop_size=self.cond_hop_length) # (B, 2 * c_g, stride * L'): LVC x = x + torch.sigmoid(output[:, :in_channels, :]) * torch.tanh( output[:, in_channels:, :]) # (B, c_g, stride * L'): GAU return x def location_variable_convolution(self, x, kernel, bias, dilation=1, hop_size=256): ''' perform location-variable convolution operation on the input sequence (x) using the local convolution kernl. Time: 414 μs ± 309 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each), test on NVIDIA V100. Args: x (Tensor): the input sequence (batch, in_channels, in_length). kernel (Tensor): the local convolution kernel (batch, in_channel, out_channels, kernel_size, kernel_length) bias (Tensor): the bias for the local convolution (batch, out_channels, kernel_length) dilation (int): the dilation of convolution. hop_size (int): the hop_size of the conditioning sequence. Returns: (Tensor): the output sequence after performing local convolution. (batch, out_channels, in_length). ''' batch, _, in_length = x.shape batch, _, out_channels, kernel_size, kernel_length = kernel.shape assert in_length == (kernel_length * hop_size), "length of (x, kernel) is not matched" padding = dilation * int((kernel_size - 1) / 2) x = F.pad(x, (padding, padding), 'constant', 0) # (batch, in_channels, in_length + 2*padding) x = x.unfold(2, hop_size + 2 * padding, hop_size) # (batch, in_channels, kernel_length, hop_size + 2*padding) if hop_size < dilation: x = F.pad(x, (0, dilation), 'constant', 0) x = x.unfold(3, dilation, dilation) # (batch, in_channels, kernel_length, (hop_size + 2*padding)/dilation, dilation) x = x[:, :, :, :, :hop_size] x = x.transpose(3, 4) # (batch, in_channels, kernel_length, dilation, (hop_size + 2*padding)/dilation) x = x.unfold(4, kernel_size, 1) # (batch, in_channels, kernel_length, dilation, _, kernel_size) o = torch.einsum('bildsk,biokl->bolsd', x, kernel) o = o.to(memory_format=torch.channels_last_3d) bias = bias.unsqueeze(-1).unsqueeze(-1).to(memory_format=torch.channels_last_3d) o = o + bias o = o.contiguous().view(batch, out_channels, -1) return o def remove_weight_norm(self): self.kernel_predictor.remove_weight_norm() nn.utils.remove_weight_norm(self.convt_pre[1]) for block in self.conv_blocks: nn.utils.remove_weight_norm(block[1]) class UnivNetGenerator(nn.Module): """ UnivNet Generator Originally from https://github.com/mindslab-ai/univnet/blob/master/model/generator.py. """ def __init__(self, noise_dim=64, channel_size=32, dilations=[1,3,9,27], strides=[8,8,4], lReLU_slope=.2, kpnet_conv_size=3, # Below are MEL configurations options that this generator requires. hop_length=256, n_mel_channels=100): super(UnivNetGenerator, self).__init__() self.mel_channel = n_mel_channels self.noise_dim = noise_dim self.hop_length = hop_length channel_size = channel_size kpnet_conv_size = kpnet_conv_size self.res_stack = nn.ModuleList() hop_length = 1 for stride in strides: hop_length = stride * hop_length self.res_stack.append( LVCBlock( channel_size, n_mel_channels, stride=stride, dilations=dilations, lReLU_slope=lReLU_slope, cond_hop_length=hop_length, kpnet_conv_size=kpnet_conv_size ) ) self.conv_pre = \ nn.utils.weight_norm(nn.Conv1d(noise_dim, channel_size, 7, padding=3, padding_mode='reflect')) self.conv_post = nn.Sequential( nn.LeakyReLU(lReLU_slope), nn.utils.weight_norm(nn.Conv1d(channel_size, 1, 7, padding=3, padding_mode='reflect')), nn.Tanh(), ) def forward(self, c, z): ''' Args: c (Tensor): the conditioning sequence of mel-spectrogram (batch, mel_channels, in_length) z (Tensor): the noise sequence (batch, noise_dim, in_length) ''' z = self.conv_pre(z) # (B, c_g, L) for res_block in self.res_stack: res_block.to(z.device) z = res_block(z, c) # (B, c_g, L * s_0 * ... * s_i) z = self.conv_post(z) # (B, 1, L * 256) return z def eval(self, inference=False): super(UnivNetGenerator, self).eval() # don't remove weight norm while validation in training loop if inference: self.remove_weight_norm() def remove_weight_norm(self): nn.utils.remove_weight_norm(self.conv_pre) for layer in self.conv_post: if len(layer.state_dict()) != 0: nn.utils.remove_weight_norm(layer) for res_block in self.res_stack: res_block.remove_weight_norm() def inference(self, c, z=None): # pad input mel with zeros to cut artifact # see https://github.com/seungwonpark/melgan/issues/8 zero = torch.full((c.shape[0], self.mel_channel, 10), -11.5129).to(c.device) mel = torch.cat((c, zero), dim=2) if z is None: z = torch.randn(c.shape[0], self.noise_dim, mel.size(2)).to(mel.device) audio = self.forward(mel, z) audio = audio[:, :, :-(self.hop_length * 10)] audio = audio.clamp(min=-1, max=1) return audio if __name__ == '__main__': model = UnivNetGenerator() c = torch.randn(3, 100, 10) z = torch.randn(3, 64, 10) print(c.shape) y = model(c, z) print(y.shape) assert y.shape == torch.Size([3, 1, 2560]) pytorch_total_params = sum(p.numel() for p in model.parameters() if p.requires_grad) print(pytorch_total_params) ================================================ FILE: tortoise/models/xtransformers.py ================================================ import math from collections import namedtuple from functools import partial from inspect import isfunction import torch import torch.nn.functional as F from einops import rearrange, repeat from torch import nn, einsum DEFAULT_DIM_HEAD = 64 Intermediates = namedtuple('Intermediates', [ 'pre_softmax_attn', 'post_softmax_attn' ]) LayerIntermediates = namedtuple('Intermediates', [ 'hiddens', 'attn_intermediates', 'past_key_values', ]) # helpers def exists(val): return val is not None def default(val, d): if exists(val): return val return d() if isfunction(d) else d def cast_tuple(val, depth): return val if isinstance(val, tuple) else (val,) * depth class always(): def __init__(self, val): self.val = val def __call__(self, *args, **kwargs): return self.val class not_equals(): def __init__(self, val): self.val = val def __call__(self, x, *args, **kwargs): return x != self.val class equals(): def __init__(self, val): self.val = val def __call__(self, x, *args, **kwargs): return x == self.val def max_neg_value(tensor): return -torch.finfo(tensor.dtype).max def l2norm(t): return F.normalize(t, p=2, dim=-1) # init helpers def init_zero_(layer): nn.init.constant_(layer.weight, 0.) if exists(layer.bias): nn.init.constant_(layer.bias, 0.) # keyword argument helpers def pick_and_pop(keys, d): values = list(map(lambda key: d.pop(key), keys)) return dict(zip(keys, values)) def group_dict_by_key(cond, d): return_val = [dict(), dict()] for key in d.keys(): match = bool(cond(key)) ind = int(not match) return_val[ind][key] = d[key] return (*return_val,) def string_begins_with(prefix, str): return str.startswith(prefix) def group_by_key_prefix(prefix, d): return group_dict_by_key(partial(string_begins_with, prefix), d) def groupby_prefix_and_trim(prefix, d): kwargs_with_prefix, kwargs = group_dict_by_key(partial(string_begins_with, prefix), d) kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items()))) return kwargs_without_prefix, kwargs # activations class ReluSquared(nn.Module): def forward(self, x): return F.relu(x) ** 2 # positional embeddings class AbsolutePositionalEmbedding(nn.Module): def __init__(self, dim, max_seq_len): super().__init__() self.scale = dim ** -0.5 self.emb = nn.Embedding(max_seq_len, dim) def forward(self, x): n = torch.arange(x.shape[1], device=x.device) pos_emb = self.emb(n) pos_emb = rearrange(pos_emb, 'n d -> () n d') return pos_emb * self.scale class FixedPositionalEmbedding(nn.Module): def __init__(self, dim): super().__init__() inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2).float() / dim)) self.register_buffer('inv_freq', inv_freq) def forward(self, x, seq_dim=1, offset=0): t = torch.arange(x.shape[seq_dim], device=x.device).type_as(self.inv_freq) + offset sinusoid_inp = torch.einsum('i , j -> i j', t, self.inv_freq) emb = torch.cat((sinusoid_inp.sin(), sinusoid_inp.cos()), dim=-1) return rearrange(emb, 'n d -> () n d') class RelativePositionBias(nn.Module): def __init__(self, scale, causal=False, num_buckets=32, max_distance=128, heads=8): super().__init__() self.scale = scale self.causal = causal self.num_buckets = num_buckets self.max_distance = max_distance self.relative_attention_bias = nn.Embedding(num_buckets, heads) @staticmethod def _relative_position_bucket(relative_position, causal=True, num_buckets=32, max_distance=128): ret = 0 n = -relative_position if not causal: num_buckets //= 2 ret += (n < 0).long() * num_buckets n = torch.abs(n) else: n = torch.max(n, torch.zeros_like(n)) max_exact = num_buckets // 2 is_small = n < max_exact val_if_large = max_exact + ( torch.log(n.float() / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact) ).long() val_if_large = torch.min(val_if_large, torch.full_like(val_if_large, num_buckets - 1)) ret += torch.where(is_small, n, val_if_large) return ret def forward(self, qk_dots): i, j, device = *qk_dots.shape[-2:], qk_dots.device q_pos = torch.arange(i, dtype=torch.long, device=device) k_pos = torch.arange(j, dtype=torch.long, device=device) rel_pos = k_pos[None, :] - q_pos[:, None] rp_bucket = self._relative_position_bucket(rel_pos, causal=self.causal, num_buckets=self.num_buckets, max_distance=self.max_distance) values = self.relative_attention_bias(rp_bucket) bias = rearrange(values, 'i j h -> () h i j') return qk_dots + (bias * self.scale) class AlibiPositionalBias(nn.Module): def __init__(self, heads, **kwargs): super().__init__() self.heads = heads slopes = torch.Tensor(self._get_slopes(heads)) slopes = rearrange(slopes, 'h -> () h () ()') self.register_buffer('slopes', slopes, persistent=False) self.register_buffer('bias', None, persistent=False) @staticmethod def _get_slopes(heads): def get_slopes_power_of_2(n): start = (2 ** (-2 ** -(math.log2(n) - 3))) ratio = start return [start * ratio ** i for i in range(n)] if math.log2(heads).is_integer(): return get_slopes_power_of_2(heads) closest_power_of_2 = 2 ** math.floor(math.log2(heads)) return get_slopes_power_of_2(closest_power_of_2) + get_slopes_power_of_2(2 * closest_power_of_2)[0::2][ :heads - closest_power_of_2] def forward(self, qk_dots): h, i, j, device = *qk_dots.shape[-3:], qk_dots.device if exists(self.bias) and self.bias.shape[-1] >= j: return qk_dots + self.bias[..., :j] bias = torch.arange(j, device=device) bias = rearrange(bias, 'j -> () () () j') bias = bias * self.slopes num_heads_unalibied = h - bias.shape[1] bias = F.pad(bias, (0, 0, 0, 0, 0, num_heads_unalibied)) self.register_buffer('bias', bias, persistent=False) return qk_dots + self.bias class LearnedAlibiPositionalBias(AlibiPositionalBias): def __init__(self, heads, bidirectional=False): super().__init__(heads) los_slopes = torch.log(self.slopes) self.learned_logslopes = nn.Parameter(los_slopes) self.bidirectional = bidirectional if self.bidirectional: self.learned_logslopes_future = nn.Parameter(los_slopes) def forward(self, qk_dots): h, i, j, device = *qk_dots.shape[-3:], qk_dots.device def get_slopes(param): return F.pad(param.exp(), (0, 0, 0, 0, 0, h - param.shape[1])) if exists(self.bias) and self.bias.shape[-1] >= j: bias = self.bias[..., :i, :j] else: i_arange = torch.arange(i, device=device) j_arange = torch.arange(j, device=device) bias = rearrange(j_arange, 'j -> 1 1 1 j') - rearrange(i_arange, 'i -> 1 1 i 1') self.register_buffer('bias', bias, persistent=False) if self.bidirectional: past_slopes = get_slopes(self.learned_logslopes) future_slopes = get_slopes(self.learned_logslopes_future) bias = torch.tril(bias * past_slopes) + torch.triu(bias * future_slopes) else: slopes = get_slopes(self.learned_logslopes) bias = bias * slopes return qk_dots + bias class RotaryEmbedding(nn.Module): def __init__(self, dim): super().__init__() inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2).float() / dim)) self.register_buffer('inv_freq', inv_freq) def forward(self, max_seq_len, device): t = torch.arange(max_seq_len, device=device).type_as(self.inv_freq) freqs = torch.einsum('i , j -> i j', t, self.inv_freq) emb = torch.cat((freqs, freqs), dim=-1) return rearrange(emb, 'n d -> () () n d') def rotate_half(x): x = rearrange(x, '... (j d) -> ... j d', j=2) x1, x2 = x.unbind(dim=-2) return torch.cat((-x2, x1), dim=-1) def apply_rotary_pos_emb(t, freqs): seq_len = t.shape[-2] freqs = freqs[:, :, -seq_len:] return (t * freqs.cos()) + (rotate_half(t) * freqs.sin()) # norms class Scale(nn.Module): def __init__(self, value, fn): super().__init__() self.value = value self.fn = fn def forward(self, x, **kwargs): out = self.fn(x, **kwargs) scale_fn = lambda t: t * self.value if not isinstance(out, tuple): return scale_fn(out) return (scale_fn(out[0]), *out[1:]) class Rezero(nn.Module): def __init__(self, fn): super().__init__() self.fn = fn self.g = nn.Parameter(torch.zeros(1)) def forward(self, x, **kwargs): out = self.fn(x, **kwargs) rezero_fn = lambda t: t * self.g if not isinstance(out, tuple): return rezero_fn(out) return (rezero_fn(out[0]), *out[1:]) class ScaleNorm(nn.Module): def __init__(self, dim, eps=1e-5): super().__init__() self.scale = dim ** -0.5 self.eps = eps self.g = nn.Parameter(torch.ones(1)) def forward(self, x): norm = torch.norm(x, dim=-1, keepdim=True) * self.scale return x / norm.clamp(min=self.eps) * self.g class RMSNorm(nn.Module): def __init__(self, dim, eps=1e-8): super().__init__() self.scale = dim ** -0.5 self.eps = eps self.g = nn.Parameter(torch.ones(dim)) def forward(self, x): norm = torch.norm(x, dim=-1, keepdim=True) * self.scale return x / norm.clamp(min=self.eps) * self.g class RMSScaleShiftNorm(nn.Module): def __init__(self, dim, eps=1e-8): super().__init__() self.scale = dim ** -0.5 self.eps = eps self.g = nn.Parameter(torch.ones(dim)) self.scale_shift_process = nn.Linear(dim * 2, dim * 2) def forward(self, x, norm_scale_shift_inp): norm = torch.norm(x, dim=-1, keepdim=True) * self.scale norm = x / norm.clamp(min=self.eps) * self.g ss_emb = self.scale_shift_process(norm_scale_shift_inp) scale, shift = torch.chunk(ss_emb, 2, dim=1) h = norm * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1) return h # residual and residual gates class Residual(nn.Module): def __init__(self, dim, scale_residual=False): super().__init__() self.residual_scale = nn.Parameter(torch.ones(dim)) if scale_residual else None def forward(self, x, residual): if exists(self.residual_scale): residual = residual * self.residual_scale return x + residual class GRUGating(nn.Module): def __init__(self, dim, scale_residual=False): super().__init__() self.gru = nn.GRUCell(dim, dim) self.residual_scale = nn.Parameter(torch.ones(dim)) if scale_residual else None def forward(self, x, residual): if exists(self.residual_scale): residual = residual * self.residual_scale gated_output = self.gru( rearrange(x, 'b n d -> (b n) d'), rearrange(residual, 'b n d -> (b n) d') ) return gated_output.reshape_as(x) # token shifting def shift(t, amount, mask=None): if amount == 0: return t if exists(mask): t = t.masked_fill(~mask[..., None], 0.) return F.pad(t, (0, 0, amount, -amount), value=0.) class ShiftTokens(nn.Module): def __init__(self, shifts, fn): super().__init__() self.fn = fn self.shifts = tuple(shifts) def forward(self, x, **kwargs): mask = kwargs.get('mask', None) shifts = self.shifts segments = len(shifts) feats_per_shift = x.shape[-1] // segments splitted = x.split(feats_per_shift, dim=-1) segments_to_shift, rest = splitted[:segments], splitted[segments:] segments_to_shift = list(map(lambda args: shift(*args, mask=mask), zip(segments_to_shift, shifts))) x = torch.cat((*segments_to_shift, *rest), dim=-1) return self.fn(x, **kwargs) # feedforward class GLU(nn.Module): def __init__(self, dim_in, dim_out, activation): super().__init__() self.act = activation self.proj = nn.Linear(dim_in, dim_out * 2) def forward(self, x): x, gate = self.proj(x).chunk(2, dim=-1) return x * self.act(gate) class FeedForward(nn.Module): def __init__( self, dim, dim_out=None, mult=4, glu=False, relu_squared=False, post_act_ln=False, dropout=0., zero_init_output=False ): super().__init__() inner_dim = int(dim * mult) dim_out = default(dim_out, dim) activation = ReluSquared() if relu_squared else nn.GELU() project_in = nn.Sequential( nn.Linear(dim, inner_dim), activation ) if not glu else GLU(dim, inner_dim, activation) self.net = nn.Sequential( project_in, nn.LayerNorm(inner_dim) if post_act_ln else nn.Identity(), nn.Dropout(dropout), nn.Linear(inner_dim, dim_out) ) # init last linear layer to 0 if zero_init_output: init_zero_(self.net[-1]) def forward(self, x): return self.net(x) # attention. class Attention(nn.Module): def __init__( self, dim, dim_head=DEFAULT_DIM_HEAD, heads=8, causal=False, talking_heads=False, head_scale=False, collab_heads=False, collab_compression=.3, sparse_topk=None, use_entmax15=False, num_mem_kv=0, dropout=0., on_attn=False, gate_values=False, zero_init_output=False, max_attend_past=None, qk_norm=False, scale_init_value=None, rel_pos_bias=False, rel_pos_num_buckets=32, rel_pos_max_distance=128, ): super().__init__() self.scale = dim_head ** -0.5 self.heads = heads self.causal = causal self.max_attend_past = max_attend_past qk_dim = v_dim = dim_head * heads # collaborative heads self.collab_heads = collab_heads if self.collab_heads: qk_dim = int(collab_compression * qk_dim) self.collab_mixing = nn.Parameter(torch.randn(heads, qk_dim)) self.to_q = nn.Linear(dim, qk_dim, bias=False) self.to_k = nn.Linear(dim, qk_dim, bias=False) self.to_v = nn.Linear(dim, v_dim, bias=False) self.dropout = nn.Dropout(dropout) # add GLU gating for aggregated values, from alphafold2 self.to_v_gate = None if gate_values: self.to_v_gate = nn.Linear(dim, v_dim) nn.init.constant_(self.to_v_gate.weight, 0) nn.init.constant_(self.to_v_gate.bias, 1) # cosine sim attention self.qk_norm = qk_norm if qk_norm: scale_init_value = default(scale_init_value, -3) # if not provided, initialize as though it were sequence length of 1024 self.scale = nn.Parameter(torch.ones(1, heads, 1, 1) * scale_init_value) # talking heads self.talking_heads = talking_heads if talking_heads: self.pre_softmax_proj = nn.Parameter(torch.randn(heads, heads)) self.post_softmax_proj = nn.Parameter(torch.randn(heads, heads)) # head scaling self.head_scale = head_scale if head_scale: self.head_scale_params = nn.Parameter(torch.ones(1, heads, 1, 1)) # explicit topk sparse attention self.sparse_topk = sparse_topk # entmax self.attn_fn = F.softmax # add memory key / values self.num_mem_kv = num_mem_kv if num_mem_kv > 0: self.mem_k = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head)) self.mem_v = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head)) # attention on attention self.attn_on_attn = on_attn self.to_out = nn.Sequential(nn.Linear(v_dim, dim * 2), nn.GLU()) if on_attn else nn.Linear(v_dim, dim) self.rel_pos_bias = rel_pos_bias if rel_pos_bias: assert rel_pos_num_buckets <= rel_pos_max_distance, 'number of relative position buckets must be less than the relative position max distance' self.rel_pos = RelativePositionBias(scale=dim_head ** 0.5, causal=causal, heads=heads, num_buckets=rel_pos_num_buckets, max_distance=rel_pos_max_distance) # init output projection 0 if zero_init_output: init_zero_(self.to_out) def forward( self, x, context=None, mask=None, context_mask=None, attn_mask=None, sinusoidal_emb=None, rotary_pos_emb=None, prev_attn=None, mem=None, layer_past=None, ): b, n, _, h, talking_heads, collab_heads, head_scale, scale, device, has_context = *x.shape, self.heads, self.talking_heads, self.collab_heads, self.head_scale, self.scale, x.device, exists( context) kv_input = default(context, x) q_input = x k_input = kv_input v_input = kv_input if exists(mem): k_input = torch.cat((mem, k_input), dim=-2) v_input = torch.cat((mem, v_input), dim=-2) if exists(sinusoidal_emb): # in shortformer, the query would start at a position offset depending on the past cached memory offset = k_input.shape[-2] - q_input.shape[-2] q_input = q_input + sinusoidal_emb(q_input, offset=offset) k_input = k_input + sinusoidal_emb(k_input) q = self.to_q(q_input) k = self.to_k(k_input) v = self.to_v(v_input) if not collab_heads: q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=h), (q, k, v)) else: q = einsum('b i d, h d -> b h i d', q, self.collab_mixing) k = rearrange(k, 'b n d -> b () n d') v = rearrange(v, 'b n (h d) -> b h n d', h=h) if layer_past is not None: past_key, past_value = layer_past k = torch.cat([past_key, k], dim=-2) v = torch.cat([past_value, v], dim=-2) k_cache = k v_cache = v if exists(rotary_pos_emb) and not has_context: l = rotary_pos_emb.shape[-1] (ql, qr), (kl, kr), (vl, vr) = map(lambda t: (t[..., :l], t[..., l:]), (q, k, v)) ql, kl, vl = map(lambda t: apply_rotary_pos_emb(t, rotary_pos_emb), (ql, kl, vl)) q, k, v = map(lambda t: torch.cat(t, dim=-1), ((ql, qr), (kl, kr), (vl, vr))) input_mask = None if any(map(exists, (mask, context_mask))): q_mask = default(mask, lambda: torch.ones((b, n), device=device).bool()) k_mask = q_mask if not exists(context) else context_mask k_mask = default(k_mask, lambda: torch.ones((b, k.shape[-2]), device=device).bool()) q_mask = rearrange(q_mask, 'b i -> b () i ()') k_mask = rearrange(k_mask, 'b j -> b () () j') input_mask = q_mask * k_mask if self.num_mem_kv > 0: mem_k, mem_v = map(lambda t: repeat(t, 'h n d -> b h n d', b=b), (self.mem_k, self.mem_v)) k = torch.cat((mem_k, k), dim=-2) v = torch.cat((mem_v, v), dim=-2) if exists(input_mask): input_mask = F.pad(input_mask, (self.num_mem_kv, 0), value=True) if collab_heads: k = k.expand(-1, h, -1, -1) if self.qk_norm: q, k = map(l2norm, (q, k)) scale = 1 / (self.scale.exp().clamp(min=1e-2)) dots = einsum('b h i d, b h j d -> b h i j', q, k) * scale mask_value = max_neg_value(dots) if exists(prev_attn): dots = dots + prev_attn pre_softmax_attn = dots.clone() if talking_heads: dots = einsum('b h i j, h k -> b k i j', dots, self.pre_softmax_proj).contiguous() if self.rel_pos_bias: dots = self.rel_pos(dots) if exists(input_mask): dots.masked_fill_(~input_mask, mask_value) del input_mask if exists(attn_mask): assert 2 <= attn_mask.ndim <= 4, 'attention mask must have greater than 2 dimensions but less than or equal to 4' if attn_mask.ndim == 2: attn_mask = rearrange(attn_mask, 'i j -> () () i j') elif attn_mask.ndim == 3: attn_mask = rearrange(attn_mask, 'h i j -> () h i j') dots.masked_fill_(~attn_mask, mask_value) if exists(self.max_attend_past): i, j = dots.shape[-2:] range_q = torch.arange(j - i, j, device=device) range_k = torch.arange(j, device=device) dist = rearrange(range_q, 'i -> () () i ()') - rearrange(range_k, 'j -> () () () j') mask = dist > self.max_attend_past dots.masked_fill_(mask, mask_value) del mask if self.causal: i, j = dots.shape[-2:] r = torch.arange(i, device=device) mask = rearrange(r, 'i -> () () i ()') < rearrange(r, 'j -> () () () j') mask = F.pad(mask, (j - i, 0), value=False) dots.masked_fill_(mask, mask_value) del mask if exists(self.sparse_topk) and self.sparse_topk < dots.shape[-1]: top, _ = dots.topk(self.sparse_topk, dim=-1) vk = top[..., -1].unsqueeze(-1).expand_as(dots) mask = dots < vk dots.masked_fill_(mask, mask_value) del mask attn = self.attn_fn(dots, dim=-1) post_softmax_attn = attn.clone() attn = self.dropout(attn) if talking_heads: attn = einsum('b h i j, h k -> b k i j', attn, self.post_softmax_proj).contiguous() out = einsum('b h i j, b h j d -> b h i d', attn, v) if head_scale: out = out * self.head_scale_params out = rearrange(out, 'b h n d -> b n (h d)') if exists(self.to_v_gate): gates = self.to_v_gate(x) out = out * gates.sigmoid() intermediates = Intermediates( pre_softmax_attn=pre_softmax_attn, post_softmax_attn=post_softmax_attn ) return self.to_out(out), intermediates, k_cache, v_cache class AttentionLayers(nn.Module): def __init__( self, dim, depth, heads=8, causal=False, cross_attend=False, only_cross=False, use_scalenorm=False, use_rms_scaleshift_norm=False, use_rmsnorm=False, use_rezero=False, alibi_pos_bias=False, alibi_num_heads=None, alibi_learned=False, position_infused_attn=False, rotary_pos_emb=False, rotary_emb_dim=None, custom_layers=None, sandwich_coef=None, par_ratio=None, residual_attn=False, cross_residual_attn=False, macaron=False, pre_norm=True, gate_residual=False, scale_residual=False, shift_tokens=0, sandwich_norm=False, use_qk_norm_attn=False, qk_norm_attn_seq_len=None, zero_init_branch_output=False, **kwargs ): super().__init__() ff_kwargs, kwargs = groupby_prefix_and_trim('ff_', kwargs) attn_kwargs, _ = groupby_prefix_and_trim('attn_', kwargs) dim_head = attn_kwargs.get('dim_head', DEFAULT_DIM_HEAD) self.dim = dim self.depth = depth self.layers = nn.ModuleList([]) self.causal = causal rel_pos_bias = 'rel_pos_bias' in attn_kwargs self.has_pos_emb = position_infused_attn or rel_pos_bias or rotary_pos_emb self.pia_pos_emb = FixedPositionalEmbedding(dim) if position_infused_attn else None rotary_emb_dim = max(default(rotary_emb_dim, dim_head // 2), 32) self.rotary_pos_emb = RotaryEmbedding(rotary_emb_dim) if rotary_pos_emb else None assert not ( alibi_pos_bias and rel_pos_bias), 'you can only choose Alibi positional bias or T5 relative positional bias, not both' if alibi_pos_bias: alibi_num_heads = default(alibi_num_heads, heads) assert alibi_num_heads <= heads, 'number of ALiBi heads must be less than the total number of heads' alibi_pos_klass = LearnedAlibiPositionalBias if alibi_learned or not causal else AlibiPositionalBias self.rel_pos = alibi_pos_klass(heads=alibi_num_heads, bidirectional=not causal) else: self.rel_pos = None assert not (not pre_norm and sandwich_norm), 'sandwich norm cannot be used when not using prenorm' self.pre_norm = pre_norm self.sandwich_norm = sandwich_norm self.residual_attn = residual_attn self.cross_residual_attn = cross_residual_attn self.cross_attend = cross_attend norm_class = ScaleNorm if use_scalenorm else nn.LayerNorm norm_class = RMSNorm if use_rmsnorm else norm_class norm_class = RMSScaleShiftNorm if use_rms_scaleshift_norm else norm_class norm_fn = partial(norm_class, dim) norm_fn = nn.Identity if use_rezero else norm_fn branch_fn = Rezero if use_rezero else None if cross_attend and not only_cross: default_block = ('a', 'c', 'f') elif cross_attend and only_cross: default_block = ('c', 'f') else: default_block = ('a', 'f') if macaron: default_block = ('f',) + default_block # qk normalization if use_qk_norm_attn: attn_scale_init_value = -math.log(math.log2(qk_norm_attn_seq_len ** 2 - qk_norm_attn_seq_len)) if exists( qk_norm_attn_seq_len) else None attn_kwargs = {**attn_kwargs, 'qk_norm': True, 'scale_init_value': attn_scale_init_value} # zero init if zero_init_branch_output: attn_kwargs = {**attn_kwargs, 'zero_init_output': True} ff_kwargs = {**ff_kwargs, 'zero_init_output': True} # calculate layer block order if exists(custom_layers): layer_types = custom_layers elif exists(par_ratio): par_depth = depth * len(default_block) assert 1 < par_ratio <= par_depth, 'par ratio out of range' default_block = tuple(filter(not_equals('f'), default_block)) par_attn = par_depth // par_ratio depth_cut = par_depth * 2 // 3 # 2 / 3 attention layer cutoff suggested by PAR paper par_width = (depth_cut + depth_cut // par_attn) // par_attn assert len(default_block) <= par_width, 'default block is too large for par_ratio' par_block = default_block + ('f',) * (par_width - len(default_block)) par_head = par_block * par_attn layer_types = par_head + ('f',) * (par_depth - len(par_head)) elif exists(sandwich_coef): assert sandwich_coef > 0 and sandwich_coef <= depth, 'sandwich coefficient should be less than the depth' layer_types = ('a',) * sandwich_coef + default_block * (depth - sandwich_coef) + ('f',) * sandwich_coef else: layer_types = default_block * depth self.layer_types = layer_types self.num_attn_layers = len(list(filter(equals('a'), layer_types))) # calculate token shifting shift_tokens = cast_tuple(shift_tokens, len(layer_types)) # iterate and construct layers for ind, (layer_type, layer_shift_tokens) in enumerate(zip(self.layer_types, shift_tokens)): is_last_layer = ind == (len(self.layer_types) - 1) if layer_type == 'a': layer = Attention(dim, heads=heads, causal=causal, **attn_kwargs) elif layer_type == 'c': layer = Attention(dim, heads=heads, **attn_kwargs) elif layer_type == 'f': layer = FeedForward(dim, **ff_kwargs) layer = layer if not macaron else Scale(0.5, layer) else: raise Exception(f'invalid layer type {layer_type}') if layer_shift_tokens > 0: shift_range_upper = layer_shift_tokens + 1 shift_range_lower = -layer_shift_tokens if not causal else 0 layer = ShiftTokens(range(shift_range_lower, shift_range_upper), layer) if exists(branch_fn): layer = branch_fn(layer) residual_fn = GRUGating if gate_residual else Residual residual = residual_fn(dim, scale_residual=scale_residual) layer_uses_qk_norm = use_qk_norm_attn and layer_type in ('a', 'c') pre_branch_norm = norm_fn() if pre_norm and not layer_uses_qk_norm else None post_branch_norm = norm_fn() if sandwich_norm or layer_uses_qk_norm else None post_main_norm = norm_fn() if not pre_norm and not is_last_layer else None norms = nn.ModuleList([ pre_branch_norm, post_branch_norm, post_main_norm ]) self.layers.append(nn.ModuleList([ norms, layer, residual ])) def forward( self, x, context=None, full_context=None, # for passing a list of hidden states from an encoder mask=None, context_mask=None, attn_mask=None, mems=None, return_hiddens=False, norm_scale_shift_inp=None, past_key_values=None, expected_seq_len=None, ): assert not (self.cross_attend ^ (exists(context) or exists( full_context))), 'context must be passed in if cross_attend is set to True' assert context is None or full_context is None, 'only one of full_context or context can be provided' hiddens = [] intermediates = [] prev_attn = None prev_cross_attn = None mems = mems.copy() if exists(mems) else [None] * self.num_attn_layers norm_args = {} if exists(norm_scale_shift_inp): norm_args['norm_scale_shift_inp'] = norm_scale_shift_inp rotary_pos_emb = None if exists(self.rotary_pos_emb): if not self.training and self.causal: assert expected_seq_len is not None, "To decode a transformer with rotary embeddings, you must specify an `expected_seq_len`" elif expected_seq_len is None: expected_seq_len = 0 seq_len = x.shape[1] if past_key_values is not None: seq_len += past_key_values[0][0].shape[-2] max_rotary_emb_length = max(list(map(lambda m: (m.shape[1] if exists(m) else 0) + seq_len, mems)) + [expected_seq_len]) rotary_pos_emb = self.rotary_pos_emb(max_rotary_emb_length, x.device) present_key_values = [] cross_attn_count = 0 for ind, (layer_type, (norm, block, residual_fn)) in enumerate(zip(self.layer_types, self.layers)): if layer_type == 'a': layer_mem = mems.pop(0) if mems else None residual = x pre_branch_norm, post_branch_norm, post_main_norm = norm if exists(pre_branch_norm): x = pre_branch_norm(x, **norm_args) if layer_type == 'a' or layer_type == 'c': if past_key_values is not None: layer_kv = past_key_values.pop(0) layer_past = tuple(s.to(x.device) for s in layer_kv) else: layer_past = None if layer_type == 'a': out, inter, k, v = block(x, None, mask, None, attn_mask, self.pia_pos_emb, rotary_pos_emb, prev_attn, layer_mem, layer_past) elif layer_type == 'c': if exists(full_context): out, inter, k, v = block(x, full_context[cross_attn_count], mask, context_mask, None, None, None, prev_attn, None, layer_past) else: out, inter, k, v = block(x, context, mask, context_mask, None, None, None, prev_attn, None, layer_past) elif layer_type == 'f': out = block(x) if layer_type == 'a' or layer_type == 'c' and present_key_values is not None: present_key_values.append((k.detach(), v.detach())) if exists(post_branch_norm): out = post_branch_norm(out, **norm_args) x = residual_fn(out, residual) if layer_type in ('a', 'c'): intermediates.append(inter) if layer_type == 'a' and self.residual_attn: prev_attn = inter.pre_softmax_attn elif layer_type == 'c' and self.cross_residual_attn: prev_cross_attn = inter.pre_softmax_attn if exists(post_main_norm): x = post_main_norm(x, **norm_args) if layer_type == 'c': cross_attn_count += 1 if layer_type == 'f': hiddens.append(x) if return_hiddens: intermediates = LayerIntermediates( hiddens=hiddens, attn_intermediates=intermediates, past_key_values=present_key_values ) return x, intermediates return x class Encoder(AttentionLayers): def __init__(self, **kwargs): assert 'causal' not in kwargs, 'cannot set causality on encoder' super().__init__(causal=False, **kwargs) class Decoder(AttentionLayers): def __init__(self, **kwargs): assert 'causal' not in kwargs, 'cannot set causality on decoder' super().__init__(causal=True, **kwargs) class CrossAttender(AttentionLayers): def __init__(self, **kwargs): super().__init__(cross_attend=True, only_cross=True, **kwargs) class ViTransformerWrapper(nn.Module): def __init__( self, *, image_size, patch_size, attn_layers, num_classes=None, dropout=0., emb_dropout=0. ): super().__init__() assert isinstance(attn_layers, Encoder), 'attention layers must be an Encoder' assert image_size % patch_size == 0, 'image dimensions must be divisible by the patch size' dim = attn_layers.dim num_patches = (image_size // patch_size) ** 2 patch_dim = 3 * patch_size ** 2 self.patch_size = patch_size self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim)) self.patch_to_embedding = nn.Linear(patch_dim, dim) self.cls_token = nn.Parameter(torch.randn(1, 1, dim)) self.dropout = nn.Dropout(emb_dropout) self.attn_layers = attn_layers self.norm = nn.LayerNorm(dim) self.mlp_head = FeedForward(dim, dim_out=num_classes, dropout=dropout) if exists(num_classes) else None def forward( self, img, return_embeddings=False ): p = self.patch_size x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p) x = self.patch_to_embedding(x) b, n, _ = x.shape cls_tokens = repeat(self.cls_token, '() n d -> b n d', b=b) x = torch.cat((cls_tokens, x), dim=1) x = x + self.pos_embedding[:, :(n + 1)] x = self.dropout(x) x = self.attn_layers(x) x = self.norm(x) if not exists(self.mlp_head) or return_embeddings: return x return self.mlp_head(x[:, 0]) class TransformerWrapper(nn.Module): def __init__( self, *, num_tokens, max_seq_len, attn_layers, emb_dim=None, max_mem_len=0., shift_mem_down=0, emb_dropout=0., num_memory_tokens=None, tie_embedding=False, use_pos_emb=True ): super().__init__() assert isinstance(attn_layers, AttentionLayers), 'attention layers must be one of Encoder or Decoder' dim = attn_layers.dim emb_dim = default(emb_dim, dim) self.max_seq_len = max_seq_len self.max_mem_len = max_mem_len self.shift_mem_down = shift_mem_down self.token_emb = nn.Embedding(num_tokens, emb_dim) self.pos_emb = AbsolutePositionalEmbedding(emb_dim, max_seq_len) if ( use_pos_emb and not attn_layers.has_pos_emb) else always(0) self.emb_dropout = nn.Dropout(emb_dropout) self.project_emb = nn.Linear(emb_dim, dim) if emb_dim != dim else nn.Identity() self.attn_layers = attn_layers self.norm = nn.LayerNorm(dim) self.init_() self.to_logits = nn.Linear(dim, num_tokens) if not tie_embedding else lambda t: t @ self.token_emb.weight.t() # memory tokens (like [cls]) from Memory Transformers paper num_memory_tokens = default(num_memory_tokens, 0) self.num_memory_tokens = num_memory_tokens if num_memory_tokens > 0: self.memory_tokens = nn.Parameter(torch.randn(num_memory_tokens, dim)) def init_(self): nn.init.kaiming_normal_(self.token_emb.weight) def forward( self, x, return_embeddings=False, mask=None, return_hiddens=False, return_attn=False, mems=None, use_cache=False, **kwargs ): b, n, device, num_mem = *x.shape, x.device, self.num_memory_tokens x = self.token_emb(x) x = x + self.pos_emb(x) x = self.emb_dropout(x) x = self.project_emb(x) if num_mem > 0: mem = repeat(self.memory_tokens, 'n d -> b n d', b=b) x = torch.cat((mem, x), dim=1) # auto-handle masking after appending memory tokens if exists(mask): mask = F.pad(mask, (num_mem, 0), value=True) if self.shift_mem_down and exists(mems): mems_l, mems_r = mems[:self.shift_mem_down], mems[self.shift_mem_down:] mems = [*mems_r, *mems_l] x, intermediates = self.attn_layers(x, mask=mask, mems=mems, return_hiddens=True, **kwargs) x = self.norm(x) mem, x = x[:, :num_mem], x[:, num_mem:] out = self.to_logits(x) if not return_embeddings else x if return_hiddens: hiddens = intermediates.hiddens return out, hiddens res = [out] if return_attn: attn_maps = list(map(lambda t: t.post_softmax_attn, intermediates.attn_intermediates)) res.append(attn_maps) if use_cache: res.append(intermediates.past_key_values) if len(res) > 1: return tuple(res) return res[0] class ContinuousTransformerWrapper(nn.Module): def __init__( self, *, max_seq_len, attn_layers, dim_in=None, dim_out=None, emb_dim=None, emb_dropout=0., use_pos_emb=True ): super().__init__() assert isinstance(attn_layers, AttentionLayers), 'attention layers must be one of Encoder or Decoder' dim = attn_layers.dim self.max_seq_len = max_seq_len self.pos_emb = AbsolutePositionalEmbedding(dim, max_seq_len) if ( use_pos_emb and not attn_layers.has_pos_emb) else always(0) self.emb_dropout = nn.Dropout(emb_dropout) self.project_in = nn.Linear(dim_in, dim) if exists(dim_in) else nn.Identity() self.attn_layers = attn_layers self.norm = nn.LayerNorm(dim) self.project_out = nn.Linear(dim, dim_out) if exists(dim_out) else nn.Identity() def forward( self, x, return_embeddings=False, mask=None, return_attn=False, mems=None, use_cache=False, **kwargs ): b, n, _, device = *x.shape, x.device x = self.project_in(x) x = x + self.pos_emb(x) x = self.emb_dropout(x) x, intermediates = self.attn_layers(x, mask=mask, mems=mems, return_hiddens=True, **kwargs) x = self.norm(x) out = self.project_out(x) if not return_embeddings else x res = [out] if return_attn: attn_maps = list(map(lambda t: t.post_softmax_attn, intermediates.attn_intermediates)) res.append(attn_maps) if use_cache: res.append(intermediates.past_key_values) if len(res) > 1: return tuple(res) return res[0] ================================================ FILE: tortoise/read.py ================================================ import argparse import os from time import time import torch import torchaudio from api import TextToSpeech, MODELS_DIR from utils.audio import load_audio, load_voices from utils.text import split_and_recombine_text if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--textfile', type=str, help='A file containing the text to read.', default="tortoise/data/riding_hood.txt") parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) ' 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='pat') parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/longform/') parser.add_argument('--output_name', type=str, help='How to name the output file', default='combined.wav') parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard') parser.add_argument('--regenerate', type=str, help='Comma-separated list of clip numbers to re-generate, or nothing.', default=None) parser.add_argument('--candidates', type=int, help='How many output candidates to produce per-voice. Only the first candidate is actually used in the final product, the others can be used manually.', default=1) parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this' 'should only be specified if you have custom checkpoints.', default=MODELS_DIR) parser.add_argument('--seed', type=int, help='Random seed which can be used to reproduce results.', default=None) parser.add_argument('--produce_debug_state', type=bool, help='Whether or not to produce debug_state.pth, which can aid in reproducing problems. Defaults to true.', default=True) parser.add_argument('--use_deepspeed', type=bool, help='Use deepspeed for speed bump.', default=False) parser.add_argument('--kv_cache', type=bool, help='If you disable this please wait for a long a time to get the output', default=True) parser.add_argument('--half', type=bool, help="float16(half) precision inference if True it's faster and take less vram and ram", default=True) args = parser.parse_args() if torch.backends.mps.is_available(): args.use_deepspeed = False tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed, kv_cache=args.kv_cache, half=args.half) outpath = args.output_path outname = args.output_name selected_voices = args.voice.split(',') regenerate = args.regenerate if regenerate is not None: regenerate = [int(e) for e in regenerate.split(',')] # Process text with open(args.textfile, 'r', encoding='utf-8') as f: text = ' '.join([l for l in f.readlines()]) if '|' in text: print("Found the '|' character in your text, which I will use as a cue for where to split it up. If this was not" "your intent, please remove all '|' characters from the input.") texts = text.split('|') else: texts = split_and_recombine_text(text) seed = int(time()) if args.seed is None else args.seed for selected_voice in selected_voices: voice_outpath = os.path.join(outpath, selected_voice) os.makedirs(voice_outpath, exist_ok=True) if '&' in selected_voice: voice_sel = selected_voice.split('&') else: voice_sel = [selected_voice] voice_samples, conditioning_latents = load_voices(voice_sel) all_parts = [] for j, text in enumerate(texts): if regenerate is not None and j not in regenerate: all_parts.append(load_audio(os.path.join(voice_outpath, f'{j}.wav'), 24000)) continue gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, preset=args.preset, k=args.candidates, use_deterministic_seed=seed) if args.candidates == 1: audio_ = gen.squeeze(0).cpu() torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), audio_, 24000) else: candidate_dir = os.path.join(voice_outpath, str(j)) os.makedirs(candidate_dir, exist_ok=True) for k, g in enumerate(gen): torchaudio.save(os.path.join(candidate_dir, f'{k}.wav'), g.squeeze(0).cpu(), 24000) audio_ = gen[0].squeeze(0).cpu() all_parts.append(audio_) if args.candidates == 1: full_audio = torch.cat(all_parts, dim=-1) torchaudio.save(os.path.join(voice_outpath, f"{outname}.wav"), full_audio, 24000) if args.produce_debug_state: os.makedirs('debug_states', exist_ok=True) dbg_state = (seed, texts, voice_samples, conditioning_latents) torch.save(dbg_state, f'debug_states/read_debug_{selected_voice}.pth') # Combine each candidate's audio clips. if args.candidates > 1: audio_clips = [] for candidate in range(args.candidates): for line in range(len(texts)): wav_file = os.path.join(voice_outpath, str(line), f"{candidate}.wav") audio_clips.append(load_audio(wav_file, 24000)) audio_clips = torch.cat(audio_clips, dim=-1) torchaudio.save(os.path.join(voice_outpath, f"{outname}_{candidate:02d}.wav"), audio_clips, 24000) audio_clips = [] ================================================ FILE: tortoise/read_fast.py ================================================ import argparse import os from time import time import torch import torchaudio from api_fast import TextToSpeech, MODELS_DIR from utils.audio import load_audio, load_voices from utils.text import split_and_recombine_text if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--textfile', type=str, help='A file containing the text to read.', default="tortoise/data/riding_hood.txt") parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) ' 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='lj') parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/longform/') parser.add_argument('--output_name', type=str, help='How to name the output file', default='combined.wav') parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard') parser.add_argument('--regenerate', type=str, help='Comma-separated list of clip numbers to re-generate, or nothing.', default=None) parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this' 'should only be specified if you have custom checkpoints.', default=MODELS_DIR) parser.add_argument('--seed', type=int, help='Random seed which can be used to reproduce results.', default=None) parser.add_argument('--use_deepspeed', type=bool, help='Use deepspeed for speed bump.', default=False) parser.add_argument('--kv_cache', type=bool, help='If you disable this please wait for a long a time to get the output', default=True) parser.add_argument('--half', type=bool, help="float16(half) precision inference if True it's faster and take less vram and ram", default=True) args = parser.parse_args() if torch.backends.mps.is_available(): args.use_deepspeed = False tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed, kv_cache=args.kv_cache, half=args.half) outpath = args.output_path outname = args.output_name selected_voices = args.voice.split(',') regenerate = args.regenerate if regenerate is not None: regenerate = [int(e) for e in regenerate.split(',')] # Process text with open(args.textfile, 'r', encoding='utf-8') as f: text = ' '.join([l for l in f.readlines()]) if '|' in text: print("Found the '|' character in your text, which I will use as a cue for where to split it up. If this was not" "your intent, please remove all '|' characters from the input.") texts = text.split('|') else: texts = split_and_recombine_text(text) seed = int(time()) if args.seed is None else args.seed for selected_voice in selected_voices: voice_outpath = os.path.join(outpath, selected_voice) os.makedirs(voice_outpath, exist_ok=True) if '&' in selected_voice: voice_sel = selected_voice.split('&') else: voice_sel = [selected_voice] voice_samples, conditioning_latents = load_voices(voice_sel) all_parts = [] for j, text in enumerate(texts): if regenerate is not None and j not in regenerate: all_parts.append(load_audio(os.path.join(voice_outpath, f'{j}.wav'), 24000)) continue start_time = time() gen = tts.tts(text, voice_samples=voice_samples, use_deterministic_seed=seed) end_time = time() audio_ = gen.squeeze(0).cpu() print("Time taken to generate the audio: ", end_time - start_time, "seconds") print("RTF: ", (end_time - start_time) / (audio_.shape[1] / 24000)) torchaudio.save(os.path.join(voice_outpath, f'{j}.wav'), audio_, 24000) all_parts.append(audio_) full_audio = torch.cat(all_parts, dim=-1) torchaudio.save(os.path.join(voice_outpath, f"{outname}.wav"), full_audio, 24000) ================================================ FILE: tortoise/socket_client.py ================================================ import socket import sounddevice as sd import numpy as np def play_audio_stream(client_socket): buffer = b'' stream = sd.OutputStream(samplerate=24000, channels=1, dtype='float32') stream.start() try: while True: chunk = client_socket.recv(1024) if b"END_OF_AUDIO" in chunk: buffer += chunk.replace(b"END_OF_AUDIO", b"") if buffer: audio_array = np.frombuffer(buffer, dtype=np.float32) stream.write(audio_array) break buffer += chunk while len(buffer) >= 4096: audio_chunk = buffer[:4096] audio_array = np.frombuffer(audio_chunk, dtype=np.float32) stream.write(audio_array) buffer = buffer[4096:] finally: stream.stop() stream.close() def send_text_to_server(character_name, text, server_ip='localhost', server_port=5000): client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) client_socket.connect((server_ip, server_port)) try: data = f"{character_name}|{text}" client_socket.sendall(data.encode('utf-8')) play_audio_stream(client_socket) print("Audio playback finished.") finally: client_socket.close() if __name__ == "__main__": character_name ="deniro" text = "Hello This is just for a live speaking test" send_text_to_server(character_name, text) ================================================ FILE: tortoise/socket_server.py ================================================ import spacy import threading import socket from tortoise.api_fast import TextToSpeech from utils.audio import load_voices tts = TextToSpeech() nlp = spacy.load("en_core_web_sm") def generate_audio_stream(text, tts, voice_samples): print(f"Generating audio stream...: {text}") voice_samples, conditioning_latents = load_voices([voice_samples]) stream = tts.tts_stream( text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, verbose=True, stream_chunk_size=40 # Adjust chunk size as needed ) for audio_chunk in stream: yield audio_chunk def split_text(text, max_length=200): doc = nlp(text) chunks = [] chunk = [] length = 0 for sent in doc.sents: sent_length = len(sent.text) if length + sent_length > max_length: chunks.append(' '.join(chunk)) chunk = [] length = 0 chunk.append(sent.text) length += sent_length + 1 if chunk: chunks.append(' '.join(chunk)) return chunks def handle_client(client_socket, tts): try: while True: data = client_socket.recv(1024).decode('utf-8') if not data: break character_name, text = data.split('|', 1) text_chunks = split_text(text, max_length=200) print(text_chunks) for chunk in text_chunks: audio_stream = generate_audio_stream(chunk, tts, character_name) for audio_chunk in audio_stream: audio_data = audio_chunk.cpu().numpy().flatten() client_socket.sendall(audio_data.tobytes()) client_socket.sendall(b"END_OF_AUDIO") finally: client_socket.close() print("Client disconnected.") def start_server(): server = socket.socket(socket.AF_INET, socket.SOCK_STREAM) server.bind(('0.0.0.0', 5000)) server.listen(5) print("Server listening on port 5000") while True: client_socket, addr = server.accept() print(f"Accepted connection from {addr}") client_handler = threading.Thread(target=handle_client, args=(client_socket, tts)) client_handler.start() if __name__ == "__main__": start_server() ================================================ FILE: tortoise/tts_stream.py ================================================ import argparse import os from time import time import torch import torchaudio from api_fast import TextToSpeech, MODELS_DIR from utils.audio import load_audio, load_voices from utils.text import split_and_recombine_text import sounddevice as sd import queue import threading def play_audio(audio_queue): while True: chunk = audio_queue.get() if chunk is None: break sd.play(chunk.cpu().numpy(), samplerate=24000) sd.wait() if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--textfile', type=str, help='A file containing the text to read.', default="tortoise/data/riding_hood.txt") parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) ' 'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='lj') parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/longform/') parser.add_argument('--output_name', type=str, help='How to name the output file', default='combined.wav') parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='standard') parser.add_argument('--regenerate', type=str, help='Comma-separated list of clip numbers to re-generate, or nothing.', default=None) parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this' 'should only be specified if you have custom checkpoints.', default=MODELS_DIR) parser.add_argument('--seed', type=int, help='Random seed which can be used to reproduce results.', default=None) parser.add_argument('--use_deepspeed', type=bool, help='Use deepspeed for speed bump.', default=False) parser.add_argument('--kv_cache', type=bool, help='If you disable this please wait for a long a time to get the output', default=True) parser.add_argument('--half', type=bool, help="float16(half) precision inference if True it's faster and take less vram and ram", default=True) args = parser.parse_args() if torch.backends.mps.is_available(): args.use_deepspeed = False tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed, kv_cache=args.kv_cache, half=args.half) outpath = args.output_path outname = args.output_name selected_voices = args.voice.split(',') regenerate = args.regenerate if regenerate is not None: regenerate = [int(e) for e in regenerate.split(',')] # Process text with open(args.textfile, 'r', encoding='utf-8') as f: text = ' '.join([l for l in f.readlines()]) if '|' in text: print("Found the '|' character in your text, which I will use as a cue for where to split it up. If this was not" "your intent, please remove all '|' characters from the input.") texts = text.split('|') else: texts = split_and_recombine_text(text) audio_queue = queue.Queue() playback_thread = threading.Thread(target=play_audio, args=(audio_queue,)) playback_thread.start() seed = int(time()) if args.seed is None else args.seed for selected_voice in selected_voices: voice_outpath = os.path.join(outpath, selected_voice) os.makedirs(voice_outpath, exist_ok=True) if '&' in selected_voice: voice_sel = selected_voice.split('&') else: voice_sel = [selected_voice] voice_samples, conditioning_latents = load_voices(voice_sel) all_parts = [] for j, text in enumerate(texts): if regenerate is not None and j not in regenerate: all_parts.append(load_audio(os.path.join(voice_outpath, f'{j}.wav'), 24000)) continue start_time = time() audio_generator = tts.tts_stream(text, voice_samples=voice_samples, use_deterministic_seed=seed) for wav_chunk in audio_generator: audio_queue.put(wav_chunk) audio_queue.put(None) playback_thread.join() ================================================ FILE: tortoise/utils/__init__.py ================================================ ================================================ FILE: tortoise/utils/audio.py ================================================ import os from glob import glob import librosa import torch import torchaudio import numpy as np from scipy.io.wavfile import read from tortoise.utils.stft import STFT BUILTIN_VOICES_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../voices') def load_wav_to_torch(full_path): sampling_rate, data = read(full_path) if data.dtype == np.int32: norm_fix = 2 ** 31 elif data.dtype == np.int16: norm_fix = 2 ** 15 elif data.dtype == np.float16 or data.dtype == np.float32: norm_fix = 1. else: raise NotImplemented(f"Provided data dtype not supported: {data.dtype}") return (torch.FloatTensor(data.astype(np.float32)) / norm_fix, sampling_rate) def load_audio(audiopath, sampling_rate): extension = os.path.splitext(audiopath)[1].casefold() if extension == '.wav': audio, lsr = load_wav_to_torch(audiopath) elif extension == '.mp3': audio, lsr = librosa.load(audiopath, sr=sampling_rate) audio = torch.FloatTensor(audio) else: assert False, f"Unsupported audio format provided: {audiopath[-4:]}" # Remove any channel data. if len(audio.shape) > 1: if audio.shape[0] < 5: audio = audio[0] else: assert audio.shape[1] < 5 audio = audio[:, 0] if lsr != sampling_rate: audio = torchaudio.functional.resample(audio, lsr, sampling_rate) # Check some assumptions about audio range. This should be automatically fixed in load_wav_to_torch, but might not be in some edge cases, where we should squawk. # '2' is arbitrarily chosen since it seems like audio will often "overdrive" the [-1,1] bounds. if torch.any(audio > 2) or not torch.any(audio < 0): print(f"Error with {audiopath}. Max={audio.max()} min={audio.min()}") audio.clip_(-1, 1) return audio.unsqueeze(0) TACOTRON_MEL_MAX = 2.3143386840820312 TACOTRON_MEL_MIN = -11.512925148010254 def denormalize_tacotron_mel(norm_mel): return ((norm_mel+1)/2)*(TACOTRON_MEL_MAX-TACOTRON_MEL_MIN)+TACOTRON_MEL_MIN def normalize_tacotron_mel(mel): return 2 * ((mel - TACOTRON_MEL_MIN) / (TACOTRON_MEL_MAX - TACOTRON_MEL_MIN)) - 1 def dynamic_range_compression(x, C=1, clip_val=1e-5): """ PARAMS ------ C: compression factor """ return torch.log(torch.clamp(x, min=clip_val) * C) def dynamic_range_decompression(x, C=1): """ PARAMS ------ C: compression factor used to compress """ return torch.exp(x) / C def get_voices(extra_voice_dirs=[]): dirs = [BUILTIN_VOICES_DIR] + extra_voice_dirs voices = {} for d in dirs: subs = os.listdir(d) for sub in subs: subj = os.path.join(d, sub) if os.path.isdir(subj): voices[sub] = list(glob(f'{subj}/*.wav')) + list(glob(f'{subj}/*.mp3')) + list(glob(f'{subj}/*.pth')) return voices def save_pth(conds, save_path): torch.save(conds, save_path) def load_voice(voice, extra_voice_dirs=[]): if voice == 'random': return None, None voices = get_voices(extra_voice_dirs) paths = voices[voice] pth_files = [p for p in paths if p.endswith('.pth')] if len(paths) == 1 and paths[0].endswith('.pth'): return None, torch.load(paths[0]) else: conds = [] for cond_path in paths: if not cond_path.endswith('.pth'): c = load_audio(cond_path, 22050) conds.append(c) if not pth_files: pth_save_path = os.path.join(os.path.dirname(paths[0]), f"{voice}.pth") save_pth(conds, pth_save_path) return conds, None def load_voices(voices, extra_voice_dirs=[]): latents = [] clips = [] for voice in voices: if voice == 'random': if len(voices) > 1: print("Cannot combine a random voice with a non-random voice. Just using a random voice.") return None, None clip, latent = load_voice(voice, extra_voice_dirs) if latent is None: assert len(latents) == 0, "Can only combine raw audio voices or latent voices, not both. Do it yourself if you want this." clips.extend(clip) elif clip is None: assert len(clips) == 0, "Can only combine raw audio voices or latent voices, not both. Do it yourself if you want this." latents.append(latent) if len(latents) == 0: return clips, None else: latents_0 = torch.stack([l[0] for l in latents], dim=0).mean(dim=0) latents_1 = torch.stack([l[1] for l in latents], dim=0).mean(dim=0) latents = (latents_0,latents_1) return None, latents class TacotronSTFT(torch.nn.Module): def __init__(self, filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80, sampling_rate=22050, mel_fmin=0.0, mel_fmax=8000.0): super(TacotronSTFT, self).__init__() self.n_mel_channels = n_mel_channels self.sampling_rate = sampling_rate self.stft_fn = STFT(filter_length, hop_length, win_length) from librosa.filters import mel as librosa_mel_fn mel_basis = librosa_mel_fn( sr=sampling_rate, n_fft=filter_length, n_mels=n_mel_channels, fmin=mel_fmin, fmax=mel_fmax) mel_basis = torch.from_numpy(mel_basis).float() self.register_buffer('mel_basis', mel_basis) def spectral_normalize(self, magnitudes): output = dynamic_range_compression(magnitudes) return output def spectral_de_normalize(self, magnitudes): output = dynamic_range_decompression(magnitudes) return output def mel_spectrogram(self, y): """Computes mel-spectrograms from a batch of waves PARAMS ------ y: Variable(torch.FloatTensor) with shape (B, T) in range [-1, 1] RETURNS ------- mel_output: torch.FloatTensor of shape (B, n_mel_channels, T) """ assert(torch.min(y.data) >= -10) assert(torch.max(y.data) <= 10) y = torch.clip(y, min=-1, max=1) magnitudes, phases = self.stft_fn.transform(y) magnitudes = magnitudes.data mel_output = torch.matmul(self.mel_basis, magnitudes) mel_output = self.spectral_normalize(mel_output) return mel_output def wav_to_univnet_mel(wav, do_normalization=False, device='cuda' if not torch.backends.mps.is_available() else 'mps', stft=None): # Don't require stft to be passed, but use it if it is. if stft is None: stft = TacotronSTFT(1024, 256, 1024, 100, 24000, 0, 12000) stft = stft.to(device) mel = stft.mel_spectrogram(wav) if do_normalization: mel = normalize_tacotron_mel(mel) return mel ================================================ FILE: tortoise/utils/diffusion.py ================================================ """ This is an almost carbon copy of gaussian_diffusion.py from OpenAI's ImprovedDiffusion repo, which itself: This code started out as a PyTorch port of Ho et al's diffusion models: https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/diffusion_utils_2.py Docstrings have been added, as well as DDIM sampling and a new collection of beta schedules. """ import enum import math import numpy as np import torch import torch as th from tqdm import tqdm def normal_kl(mean1, logvar1, mean2, logvar2): """ Compute the KL divergence between two gaussians. Shapes are automatically broadcasted, so batches can be compared to scalars, among other use cases. """ tensor = None for obj in (mean1, logvar1, mean2, logvar2): if isinstance(obj, th.Tensor): tensor = obj break assert tensor is not None, "at least one argument must be a Tensor" # Force variances to be Tensors. Broadcasting helps convert scalars to # Tensors, but it does not work for th.exp(). logvar1, logvar2 = [ x if isinstance(x, th.Tensor) else th.tensor(x).to(tensor) for x in (logvar1, logvar2) ] return 0.5 * ( -1.0 + logvar2 - logvar1 + th.exp(logvar1 - logvar2) + ((mean1 - mean2) ** 2) * th.exp(-logvar2) ) def approx_standard_normal_cdf(x): """ A fast approximation of the cumulative distribution function of the standard normal. """ return 0.5 * (1.0 + th.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * th.pow(x, 3)))) def discretized_gaussian_log_likelihood(x, *, means, log_scales): """ Compute the log-likelihood of a Gaussian distribution discretizing to a given image. :param x: the target images. It is assumed that this was uint8 values, rescaled to the range [-1, 1]. :param means: the Gaussian mean Tensor. :param log_scales: the Gaussian log stddev Tensor. :return: a tensor like x of log probabilities (in nats). """ assert x.shape == means.shape == log_scales.shape centered_x = x - means inv_stdv = th.exp(-log_scales) plus_in = inv_stdv * (centered_x + 1.0 / 255.0) cdf_plus = approx_standard_normal_cdf(plus_in) min_in = inv_stdv * (centered_x - 1.0 / 255.0) cdf_min = approx_standard_normal_cdf(min_in) log_cdf_plus = th.log(cdf_plus.clamp(min=1e-12)) log_one_minus_cdf_min = th.log((1.0 - cdf_min).clamp(min=1e-12)) cdf_delta = cdf_plus - cdf_min log_probs = th.where( x < -0.999, log_cdf_plus, th.where(x > 0.999, log_one_minus_cdf_min, th.log(cdf_delta.clamp(min=1e-12))), ) assert log_probs.shape == x.shape return log_probs def mean_flat(tensor): """ Take the mean over all non-batch dimensions. """ return tensor.mean(dim=list(range(1, len(tensor.shape)))) def get_named_beta_schedule(schedule_name, num_diffusion_timesteps): """ Get a pre-defined beta schedule for the given name. The beta schedule library consists of beta schedules which remain similar in the limit of num_diffusion_timesteps. Beta schedules may be added, but should not be removed or changed once they are committed to maintain backwards compatibility. """ if schedule_name == "linear": # Linear schedule from Ho et al, extended to work for any number of # diffusion steps. scale = 1000 / num_diffusion_timesteps beta_start = scale * 0.0001 beta_end = scale * 0.02 return np.linspace( beta_start, beta_end, num_diffusion_timesteps, dtype=np.float64 ) elif schedule_name == "cosine": return betas_for_alpha_bar( num_diffusion_timesteps, lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2, ) else: raise NotImplementedError(f"unknown beta schedule: {schedule_name}") def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999): """ Create a beta schedule that discretizes the given alpha_t_bar function, which defines the cumulative product of (1-beta) over time from t = [0,1]. :param num_diffusion_timesteps: the number of betas to produce. :param alpha_bar: a lambda that takes an argument t from 0 to 1 and produces the cumulative product of (1-beta) up to that part of the diffusion process. :param max_beta: the maximum beta to use; use values lower than 1 to prevent singularities. """ betas = [] for i in range(num_diffusion_timesteps): t1 = i / num_diffusion_timesteps t2 = (i + 1) / num_diffusion_timesteps betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta)) return np.array(betas) class ModelMeanType(enum.Enum): """ Which type of output the model predicts. """ PREVIOUS_X = 'previous_x' # the model predicts x_{t-1} START_X = 'start_x' # the model predicts x_0 EPSILON = 'epsilon' # the model predicts epsilon class ModelVarType(enum.Enum): """ What is used as the model's output variance. The LEARNED_RANGE option has been added to allow the model to predict values between FIXED_SMALL and FIXED_LARGE, making its job easier. """ LEARNED = 'learned' FIXED_SMALL = 'fixed_small' FIXED_LARGE = 'fixed_large' LEARNED_RANGE = 'learned_range' class LossType(enum.Enum): MSE = 'mse' # use raw MSE loss (and KL when learning variances) RESCALED_MSE = 'rescaled_mse' # use raw MSE loss (with RESCALED_KL when learning variances) KL = 'kl' # use the variational lower-bound RESCALED_KL = 'rescaled_kl' # like KL, but rescale to estimate the full VLB def is_vb(self): return self == LossType.KL or self == LossType.RESCALED_KL class GaussianDiffusion: """ Utilities for training and sampling diffusion models. Ported directly from here, and then adapted over time to further experimentation. https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/diffusion_utils_2.py#L42 :param betas: a 1-D numpy array of betas for each diffusion timestep, starting at T and going to 1. :param model_mean_type: a ModelMeanType determining what the model outputs. :param model_var_type: a ModelVarType determining how variance is output. :param loss_type: a LossType determining the loss function to use. :param rescale_timesteps: if True, pass floating point timesteps into the model so that they are always scaled like in the original paper (0 to 1000). """ def __init__( self, *, betas, model_mean_type, model_var_type, loss_type, rescale_timesteps=False, conditioning_free=False, conditioning_free_k=1, ramp_conditioning_free=True, ): self.model_mean_type = ModelMeanType(model_mean_type) self.model_var_type = ModelVarType(model_var_type) self.loss_type = LossType(loss_type) self.rescale_timesteps = rescale_timesteps self.conditioning_free = conditioning_free self.conditioning_free_k = conditioning_free_k self.ramp_conditioning_free = ramp_conditioning_free # Use float64 for accuracy. betas = np.array(betas, dtype=np.float64) self.betas = betas assert len(betas.shape) == 1, "betas must be 1-D" assert (betas > 0).all() and (betas <= 1).all() self.num_timesteps = int(betas.shape[0]) alphas = 1.0 - betas self.alphas_cumprod = np.cumprod(alphas, axis=0) self.alphas_cumprod_prev = np.append(1.0, self.alphas_cumprod[:-1]) self.alphas_cumprod_next = np.append(self.alphas_cumprod[1:], 0.0) assert self.alphas_cumprod_prev.shape == (self.num_timesteps,) # calculations for diffusion q(x_t | x_{t-1}) and others self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod) self.sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - self.alphas_cumprod) self.log_one_minus_alphas_cumprod = np.log(1.0 - self.alphas_cumprod) self.sqrt_recip_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod) self.sqrt_recipm1_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod - 1) # calculations for posterior q(x_{t-1} | x_t, x_0) self.posterior_variance = ( betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod) ) # log calculation clipped because the posterior variance is 0 at the # beginning of the diffusion chain. self.posterior_log_variance_clipped = np.log( np.append(self.posterior_variance[1], self.posterior_variance[1:]) ) self.posterior_mean_coef1 = ( betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod) ) self.posterior_mean_coef2 = ( (1.0 - self.alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - self.alphas_cumprod) ) def q_mean_variance(self, x_start, t): """ Get the distribution q(x_t | x_0). :param x_start: the [N x C x ...] tensor of noiseless inputs. :param t: the number of diffusion steps (minus 1). Here, 0 means one step. :return: A tuple (mean, variance, log_variance), all of x_start's shape. """ mean = ( _extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start ) variance = _extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape) log_variance = _extract_into_tensor( self.log_one_minus_alphas_cumprod, t, x_start.shape ) return mean, variance, log_variance def q_sample(self, x_start, t, noise=None): """ Diffuse the data for a given number of diffusion steps. In other words, sample from q(x_t | x_0). :param x_start: the initial data batch. :param t: the number of diffusion steps (minus 1). Here, 0 means one step. :param noise: if specified, the split-out normal noise. :return: A noisy version of x_start. """ if noise is None: noise = th.randn_like(x_start) assert noise.shape == x_start.shape return ( _extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start + _extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise ) def q_posterior_mean_variance(self, x_start, x_t, t): """ Compute the mean and variance of the diffusion posterior: q(x_{t-1} | x_t, x_0) """ assert x_start.shape == x_t.shape posterior_mean = ( _extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start + _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t ) posterior_variance = _extract_into_tensor(self.posterior_variance, t, x_t.shape) posterior_log_variance_clipped = _extract_into_tensor( self.posterior_log_variance_clipped, t, x_t.shape ) assert ( posterior_mean.shape[0] == posterior_variance.shape[0] == posterior_log_variance_clipped.shape[0] == x_start.shape[0] ) return posterior_mean, posterior_variance, posterior_log_variance_clipped def p_mean_variance( self, model, x, t, clip_denoised=True, denoised_fn=None, model_kwargs=None ): """ Apply the model to get p(x_{t-1} | x_t), as well as a prediction of the initial x, x_0. :param model: the model, which takes a signal and a batch of timesteps as input. :param x: the [N x C x ...] tensor at time t. :param t: a 1-D Tensor of timesteps. :param clip_denoised: if True, clip the denoised signal into [-1, 1]. :param denoised_fn: if not None, a function which applies to the x_start prediction before it is used to sample. Applies before clip_denoised. :param model_kwargs: if not None, a dict of extra keyword arguments to pass to the model. This can be used for conditioning. :return: a dict with the following keys: - 'mean': the model mean output. - 'variance': the model variance output. - 'log_variance': the log of 'variance'. - 'pred_xstart': the prediction for x_0. """ if model_kwargs is None: model_kwargs = {} B, C = x.shape[:2] assert t.shape == (B,) model_output = model(x, self._scale_timesteps(t), **model_kwargs) if self.conditioning_free: model_output_no_conditioning = model(x, self._scale_timesteps(t), conditioning_free=True, **model_kwargs) if self.model_var_type in [ModelVarType.LEARNED, ModelVarType.LEARNED_RANGE]: assert model_output.shape == (B, C * 2, *x.shape[2:]) model_output, model_var_values = th.split(model_output, C, dim=1) if self.conditioning_free: model_output_no_conditioning, _ = th.split(model_output_no_conditioning, C, dim=1) if self.model_var_type == ModelVarType.LEARNED: model_log_variance = model_var_values model_variance = th.exp(model_log_variance) else: min_log = _extract_into_tensor( self.posterior_log_variance_clipped, t, x.shape ) max_log = _extract_into_tensor(np.log(self.betas), t, x.shape) # The model_var_values is [-1, 1] for [min_var, max_var]. frac = (model_var_values + 1) / 2 model_log_variance = frac * max_log + (1 - frac) * min_log model_variance = th.exp(model_log_variance) else: model_variance, model_log_variance = { # for fixedlarge, we set the initial (log-)variance like so # to get a better decoder log likelihood. ModelVarType.FIXED_LARGE: ( np.append(self.posterior_variance[1], self.betas[1:]), np.log(np.append(self.posterior_variance[1], self.betas[1:])), ), ModelVarType.FIXED_SMALL: ( self.posterior_variance, self.posterior_log_variance_clipped, ), }[self.model_var_type] model_variance = _extract_into_tensor(model_variance, t, x.shape) model_log_variance = _extract_into_tensor(model_log_variance, t, x.shape) if self.conditioning_free: if self.ramp_conditioning_free: assert t.shape[0] == 1 # This should only be used in inference. cfk = self.conditioning_free_k * (1 - self._scale_timesteps(t)[0].item() / self.num_timesteps) else: cfk = self.conditioning_free_k model_output = (1 + cfk) * model_output - cfk * model_output_no_conditioning def process_xstart(x): if denoised_fn is not None: x = denoised_fn(x) if clip_denoised: return x.clamp(-1, 1) return x if self.model_mean_type == ModelMeanType.PREVIOUS_X: pred_xstart = process_xstart( self._predict_xstart_from_xprev(x_t=x, t=t, xprev=model_output) ) model_mean = model_output elif self.model_mean_type in [ModelMeanType.START_X, ModelMeanType.EPSILON]: if self.model_mean_type == ModelMeanType.START_X: pred_xstart = process_xstart(model_output) else: pred_xstart = process_xstart( self._predict_xstart_from_eps(x_t=x, t=t, eps=model_output) ) model_mean, _, _ = self.q_posterior_mean_variance( x_start=pred_xstart, x_t=x, t=t ) else: raise NotImplementedError(self.model_mean_type) assert ( model_mean.shape == model_log_variance.shape == pred_xstart.shape == x.shape ) return { "mean": model_mean, "variance": model_variance, "log_variance": model_log_variance, "pred_xstart": pred_xstart, } def _predict_xstart_from_eps(self, x_t, t, eps): assert x_t.shape == eps.shape return ( _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t - _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * eps ) def _predict_xstart_from_xprev(self, x_t, t, xprev): assert x_t.shape == xprev.shape return ( # (xprev - coef2*x_t) / coef1 _extract_into_tensor(1.0 / self.posterior_mean_coef1, t, x_t.shape) * xprev - _extract_into_tensor( self.posterior_mean_coef2 / self.posterior_mean_coef1, t, x_t.shape ) * x_t ) def _predict_eps_from_xstart(self, x_t, t, pred_xstart): return ( _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t - pred_xstart ) / _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) def _scale_timesteps(self, t): if self.rescale_timesteps: return t.float() * (1000.0 / self.num_timesteps) return t def condition_mean(self, cond_fn, p_mean_var, x, t, model_kwargs=None): """ Compute the mean for the previous step, given a function cond_fn that computes the gradient of a conditional log probability with respect to x. In particular, cond_fn computes grad(log(p(y|x))), and we want to condition on y. This uses the conditioning strategy from Sohl-Dickstein et al. (2015). """ gradient = cond_fn(x, self._scale_timesteps(t), **model_kwargs) new_mean = ( p_mean_var["mean"].float() + p_mean_var["variance"] * gradient.float() ) return new_mean def condition_score(self, cond_fn, p_mean_var, x, t, model_kwargs=None): """ Compute what the p_mean_variance output would have been, should the model's score function be conditioned by cond_fn. See condition_mean() for details on cond_fn. Unlike condition_mean(), this instead uses the conditioning strategy from Song et al (2020). """ alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape) eps = self._predict_eps_from_xstart(x, t, p_mean_var["pred_xstart"]) eps = eps - (1 - alpha_bar).sqrt() * cond_fn( x, self._scale_timesteps(t), **model_kwargs ) out = p_mean_var.copy() out["pred_xstart"] = self._predict_xstart_from_eps(x, t, eps) out["mean"], _, _ = self.q_posterior_mean_variance( x_start=out["pred_xstart"], x_t=x, t=t ) return out def p_sample( self, model, x, t, clip_denoised=True, denoised_fn=None, cond_fn=None, model_kwargs=None, ): """ Sample x_{t-1} from the model at the given timestep. :param model: the model to sample from. :param x: the current tensor at x_{t-1}. :param t: the value of t, starting at 0 for the first diffusion step. :param clip_denoised: if True, clip the x_start prediction to [-1, 1]. :param denoised_fn: if not None, a function which applies to the x_start prediction before it is used to sample. :param cond_fn: if not None, this is a gradient function that acts similarly to the model. :param model_kwargs: if not None, a dict of extra keyword arguments to pass to the model. This can be used for conditioning. :return: a dict containing the following keys: - 'sample': a random sample from the model. - 'pred_xstart': a prediction of x_0. """ out = self.p_mean_variance( model, x, t, clip_denoised=clip_denoised, denoised_fn=denoised_fn, model_kwargs=model_kwargs, ) noise = th.randn_like(x) nonzero_mask = ( (t != 0).float().view(-1, *([1] * (len(x.shape) - 1))) ) # no noise when t == 0 if cond_fn is not None: out["mean"] = self.condition_mean( cond_fn, out, x, t, model_kwargs=model_kwargs ) sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise return {"sample": sample, "pred_xstart": out["pred_xstart"]} def p_sample_loop( self, model, shape, noise=None, clip_denoised=True, denoised_fn=None, cond_fn=None, model_kwargs=None, device=None, progress=False, ): """ Generate samples from the model. :param model: the model module. :param shape: the shape of the samples, (N, C, H, W). :param noise: if specified, the noise from the encoder to sample. Should be of the same shape as `shape`. :param clip_denoised: if True, clip x_start predictions to [-1, 1]. :param denoised_fn: if not None, a function which applies to the x_start prediction before it is used to sample. :param cond_fn: if not None, this is a gradient function that acts similarly to the model. :param model_kwargs: if not None, a dict of extra keyword arguments to pass to the model. This can be used for conditioning. :param device: if specified, the device to create the samples on. If not specified, use a model parameter's device. :param progress: if True, show a tqdm progress bar. :return: a non-differentiable batch of samples. """ final = None for sample in self.p_sample_loop_progressive( model, shape, noise=noise, clip_denoised=clip_denoised, denoised_fn=denoised_fn, cond_fn=cond_fn, model_kwargs=model_kwargs, device=device, progress=progress, ): final = sample return final["sample"] def p_sample_loop_progressive( self, model, shape, noise=None, clip_denoised=True, denoised_fn=None, cond_fn=None, model_kwargs=None, device=None, progress=False, ): """ Generate samples from the model and yield intermediate samples from each timestep of diffusion. Arguments are the same as p_sample_loop(). Returns a generator over dicts, where each dict is the return value of p_sample(). """ if device is None: device = next(model.parameters()).device assert isinstance(shape, (tuple, list)) if noise is not None: img = noise else: img = th.randn(*shape, device=device) indices = list(range(self.num_timesteps))[::-1] for i in tqdm(indices, disable=not progress): t = th.tensor([i] * shape[0], device=device) with th.no_grad(): out = self.p_sample( model, img, t, clip_denoised=clip_denoised, denoised_fn=denoised_fn, cond_fn=cond_fn, model_kwargs=model_kwargs, ) yield out img = out["sample"] def ddim_sample( self, model, x, t, clip_denoised=True, denoised_fn=None, cond_fn=None, model_kwargs=None, eta=0.0, ): """ Sample x_{t-1} from the model using DDIM. Same usage as p_sample(). """ out = self.p_mean_variance( model, x, t, clip_denoised=clip_denoised, denoised_fn=denoised_fn, model_kwargs=model_kwargs, ) if cond_fn is not None: out = self.condition_score(cond_fn, out, x, t, model_kwargs=model_kwargs) # Usually our model outputs epsilon, but we re-derive it # in case we used x_start or x_prev prediction. eps = self._predict_eps_from_xstart(x, t, out["pred_xstart"]) alpha_bar = _extract_into_tensor(self.alphas_cumprod, t, x.shape) alpha_bar_prev = _extract_into_tensor(self.alphas_cumprod_prev, t, x.shape) sigma = ( eta * th.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar)) * th.sqrt(1 - alpha_bar / alpha_bar_prev) ) # Equation 12. noise = th.randn_like(x) mean_pred = ( out["pred_xstart"] * th.sqrt(alpha_bar_prev) + th.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps ) nonzero_mask = ( (t != 0).float().view(-1, *([1] * (len(x.shape) - 1))) ) # no noise when t == 0 sample = mean_pred + nonzero_mask * sigma * noise return {"sample": sample, "pred_xstart": out["pred_xstart"]} def ddim_reverse_sample( self, model, x, t, clip_denoised=True, denoised_fn=None, model_kwargs=None, eta=0.0, ): """ Sample x_{t+1} from the model using DDIM reverse ODE. """ assert eta == 0.0, "Reverse ODE only for deterministic path" out = self.p_mean_variance( model, x, t, clip_denoised=clip_denoised, denoised_fn=denoised_fn, model_kwargs=model_kwargs, ) # Usually our model outputs epsilon, but we re-derive it # in case we used x_start or x_prev prediction. eps = ( _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x.shape) * x - out["pred_xstart"] ) / _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x.shape) alpha_bar_next = _extract_into_tensor(self.alphas_cumprod_next, t, x.shape) # Equation 12. reversed mean_pred = ( out["pred_xstart"] * th.sqrt(alpha_bar_next) + th.sqrt(1 - alpha_bar_next) * eps ) return {"sample": mean_pred, "pred_xstart": out["pred_xstart"]} def ddim_sample_loop( self, model, shape, noise=None, clip_denoised=True, denoised_fn=None, cond_fn=None, model_kwargs=None, device=None, progress=False, eta=0.0, ): """ Generate samples from the model using DDIM. Same usage as p_sample_loop(). """ final = None for sample in self.ddim_sample_loop_progressive( model, shape, noise=noise, clip_denoised=clip_denoised, denoised_fn=denoised_fn, cond_fn=cond_fn, model_kwargs=model_kwargs, device=device, progress=progress, eta=eta, ): final = sample return final["sample"] def ddim_sample_loop_progressive( self, model, shape, noise=None, clip_denoised=True, denoised_fn=None, cond_fn=None, model_kwargs=None, device=None, progress=False, eta=0.0, ): """ Use DDIM to sample from the model and yield intermediate samples from each timestep of DDIM. Same usage as p_sample_loop_progressive(). """ if device is None: device = next(model.parameters()).device assert isinstance(shape, (tuple, list)) if noise is not None: img = noise else: img = th.randn(*shape, device=device) indices = list(range(self.num_timesteps))[::-1] if progress: # Lazy import so that we don't depend on tqdm. from tqdm.auto import tqdm indices = tqdm(indices, disable=not progress) for i in indices: t = th.tensor([i] * shape[0], device=device) with th.no_grad(): out = self.ddim_sample( model, img, t, clip_denoised=clip_denoised, denoised_fn=denoised_fn, cond_fn=cond_fn, model_kwargs=model_kwargs, eta=eta, ) yield out img = out["sample"] def _vb_terms_bpd( self, model, x_start, x_t, t, clip_denoised=True, model_kwargs=None ): """ Get a term for the variational lower-bound. The resulting units are bits (rather than nats, as one might expect). This allows for comparison to other papers. :return: a dict with the following keys: - 'output': a shape [N] tensor of NLLs or KLs. - 'pred_xstart': the x_0 predictions. """ true_mean, _, true_log_variance_clipped = self.q_posterior_mean_variance( x_start=x_start, x_t=x_t, t=t ) out = self.p_mean_variance( model, x_t, t, clip_denoised=clip_denoised, model_kwargs=model_kwargs ) kl = normal_kl( true_mean, true_log_variance_clipped, out["mean"], out["log_variance"] ) kl = mean_flat(kl) / np.log(2.0) decoder_nll = -discretized_gaussian_log_likelihood( x_start, means=out["mean"], log_scales=0.5 * out["log_variance"] ) assert decoder_nll.shape == x_start.shape decoder_nll = mean_flat(decoder_nll) / np.log(2.0) # At the first timestep return the decoder NLL, # otherwise return KL(q(x_{t-1}|x_t,x_0) || p(x_{t-1}|x_t)) output = th.where((t == 0), decoder_nll, kl) return {"output": output, "pred_xstart": out["pred_xstart"]} def training_losses(self, model, x_start, t, model_kwargs=None, noise=None): """ Compute training losses for a single timestep. :param model: the model to evaluate loss on. :param x_start: the [N x C x ...] tensor of inputs. :param t: a batch of timestep indices. :param model_kwargs: if not None, a dict of extra keyword arguments to pass to the model. This can be used for conditioning. :param noise: if specified, the specific Gaussian noise to try to remove. :return: a dict with the key "loss" containing a tensor of shape [N]. Some mean or variance settings may also have other keys. """ if model_kwargs is None: model_kwargs = {} if noise is None: noise = th.randn_like(x_start) x_t = self.q_sample(x_start, t, noise=noise) terms = {} if self.loss_type == LossType.KL or self.loss_type == LossType.RESCALED_KL: # TODO: support multiple model outputs for this mode. terms["loss"] = self._vb_terms_bpd( model=model, x_start=x_start, x_t=x_t, t=t, clip_denoised=False, model_kwargs=model_kwargs, )["output"] if self.loss_type == LossType.RESCALED_KL: terms["loss"] *= self.num_timesteps elif self.loss_type == LossType.MSE or self.loss_type == LossType.RESCALED_MSE: model_outputs = model(x_t, self._scale_timesteps(t), **model_kwargs) if isinstance(model_outputs, tuple): model_output = model_outputs[0] terms['extra_outputs'] = model_outputs[1:] else: model_output = model_outputs if self.model_var_type in [ ModelVarType.LEARNED, ModelVarType.LEARNED_RANGE, ]: B, C = x_t.shape[:2] assert model_output.shape == (B, C * 2, *x_t.shape[2:]) model_output, model_var_values = th.split(model_output, C, dim=1) # Learn the variance using the variational bound, but don't let # it affect our mean prediction. frozen_out = th.cat([model_output.detach(), model_var_values], dim=1) terms["vb"] = self._vb_terms_bpd( model=lambda *args, r=frozen_out: r, x_start=x_start, x_t=x_t, t=t, clip_denoised=False, )["output"] if self.loss_type == LossType.RESCALED_MSE: # Divide by 1000 for equivalence with initial implementation. # Without a factor of 1/1000, the VB term hurts the MSE term. terms["vb"] *= self.num_timesteps / 1000.0 if self.model_mean_type == ModelMeanType.PREVIOUS_X: target = self.q_posterior_mean_variance( x_start=x_start, x_t=x_t, t=t )[0] x_start_pred = torch.zeros(x_start) # Not supported. elif self.model_mean_type == ModelMeanType.START_X: target = x_start x_start_pred = model_output elif self.model_mean_type == ModelMeanType.EPSILON: target = noise x_start_pred = self._predict_xstart_from_eps(x_t, t, model_output) else: raise NotImplementedError(self.model_mean_type) assert model_output.shape == target.shape == x_start.shape terms["mse"] = mean_flat((target - model_output) ** 2) terms["x_start_predicted"] = x_start_pred if "vb" in terms: terms["loss"] = terms["mse"] + terms["vb"] else: terms["loss"] = terms["mse"] else: raise NotImplementedError(self.loss_type) return terms def autoregressive_training_losses(self, model, x_start, t, model_output_keys, gd_out_key, model_kwargs=None, noise=None): """ Compute training losses for a single timestep. :param model: the model to evaluate loss on. :param x_start: the [N x C x ...] tensor of inputs. :param t: a batch of timestep indices. :param model_kwargs: if not None, a dict of extra keyword arguments to pass to the model. This can be used for conditioning. :param noise: if specified, the specific Gaussian noise to try to remove. :return: a dict with the key "loss" containing a tensor of shape [N]. Some mean or variance settings may also have other keys. """ if model_kwargs is None: model_kwargs = {} if noise is None: noise = th.randn_like(x_start) x_t = self.q_sample(x_start, t, noise=noise) terms = {} if self.loss_type == LossType.KL or self.loss_type == LossType.RESCALED_KL: assert False # not currently supported for this type of diffusion. elif self.loss_type == LossType.MSE or self.loss_type == LossType.RESCALED_MSE: model_outputs = model(x_t, x_start, self._scale_timesteps(t), **model_kwargs) terms.update({k: o for k, o in zip(model_output_keys, model_outputs)}) model_output = terms[gd_out_key] if self.model_var_type in [ ModelVarType.LEARNED, ModelVarType.LEARNED_RANGE, ]: B, C = x_t.shape[:2] assert model_output.shape == (B, C, 2, *x_t.shape[2:]) model_output, model_var_values = model_output[:, :, 0], model_output[:, :, 1] # Learn the variance using the variational bound, but don't let # it affect our mean prediction. frozen_out = th.cat([model_output.detach(), model_var_values], dim=1) terms["vb"] = self._vb_terms_bpd( model=lambda *args, r=frozen_out: r, x_start=x_start, x_t=x_t, t=t, clip_denoised=False, )["output"] if self.loss_type == LossType.RESCALED_MSE: # Divide by 1000 for equivalence with initial implementation. # Without a factor of 1/1000, the VB term hurts the MSE term. terms["vb"] *= self.num_timesteps / 1000.0 if self.model_mean_type == ModelMeanType.PREVIOUS_X: target = self.q_posterior_mean_variance( x_start=x_start, x_t=x_t, t=t )[0] x_start_pred = torch.zeros(x_start) # Not supported. elif self.model_mean_type == ModelMeanType.START_X: target = x_start x_start_pred = model_output elif self.model_mean_type == ModelMeanType.EPSILON: target = noise x_start_pred = self._predict_xstart_from_eps(x_t, t, model_output) else: raise NotImplementedError(self.model_mean_type) assert model_output.shape == target.shape == x_start.shape terms["mse"] = mean_flat((target - model_output) ** 2) terms["x_start_predicted"] = x_start_pred if "vb" in terms: terms["loss"] = terms["mse"] + terms["vb"] else: terms["loss"] = terms["mse"] else: raise NotImplementedError(self.loss_type) return terms def _prior_bpd(self, x_start): """ Get the prior KL term for the variational lower-bound, measured in bits-per-dim. This term can't be optimized, as it only depends on the encoder. :param x_start: the [N x C x ...] tensor of inputs. :return: a batch of [N] KL values (in bits), one per batch element. """ batch_size = x_start.shape[0] t = th.tensor([self.num_timesteps - 1] * batch_size, device=x_start.device) qt_mean, _, qt_log_variance = self.q_mean_variance(x_start, t) kl_prior = normal_kl( mean1=qt_mean, logvar1=qt_log_variance, mean2=0.0, logvar2=0.0 ) return mean_flat(kl_prior) / np.log(2.0) def calc_bpd_loop(self, model, x_start, clip_denoised=True, model_kwargs=None): """ Compute the entire variational lower-bound, measured in bits-per-dim, as well as other related quantities. :param model: the model to evaluate loss on. :param x_start: the [N x C x ...] tensor of inputs. :param clip_denoised: if True, clip denoised samples. :param model_kwargs: if not None, a dict of extra keyword arguments to pass to the model. This can be used for conditioning. :return: a dict containing the following keys: - total_bpd: the total variational lower-bound, per batch element. - prior_bpd: the prior term in the lower-bound. - vb: an [N x T] tensor of terms in the lower-bound. - xstart_mse: an [N x T] tensor of x_0 MSEs for each timestep. - mse: an [N x T] tensor of epsilon MSEs for each timestep. """ device = x_start.device batch_size = x_start.shape[0] vb = [] xstart_mse = [] mse = [] for t in list(range(self.num_timesteps))[::-1]: t_batch = th.tensor([t] * batch_size, device=device) noise = th.randn_like(x_start) x_t = self.q_sample(x_start=x_start, t=t_batch, noise=noise) # Calculate VLB term at the current timestep with th.no_grad(): out = self._vb_terms_bpd( model, x_start=x_start, x_t=x_t, t=t_batch, clip_denoised=clip_denoised, model_kwargs=model_kwargs, ) vb.append(out["output"]) xstart_mse.append(mean_flat((out["pred_xstart"] - x_start) ** 2)) eps = self._predict_eps_from_xstart(x_t, t_batch, out["pred_xstart"]) mse.append(mean_flat((eps - noise) ** 2)) vb = th.stack(vb, dim=1) xstart_mse = th.stack(xstart_mse, dim=1) mse = th.stack(mse, dim=1) prior_bpd = self._prior_bpd(x_start) total_bpd = vb.sum(dim=1) + prior_bpd return { "total_bpd": total_bpd, "prior_bpd": prior_bpd, "vb": vb, "xstart_mse": xstart_mse, "mse": mse, } def get_named_beta_schedule(schedule_name, num_diffusion_timesteps): """ Get a pre-defined beta schedule for the given name. The beta schedule library consists of beta schedules which remain similar in the limit of num_diffusion_timesteps. Beta schedules may be added, but should not be removed or changed once they are committed to maintain backwards compatibility. """ if schedule_name == "linear": # Linear schedule from Ho et al, extended to work for any number of # diffusion steps. scale = 1000 / num_diffusion_timesteps beta_start = scale * 0.0001 beta_end = scale * 0.02 return np.linspace( beta_start, beta_end, num_diffusion_timesteps, dtype=np.float64 ) elif schedule_name == "cosine": return betas_for_alpha_bar( num_diffusion_timesteps, lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2, ) else: raise NotImplementedError(f"unknown beta schedule: {schedule_name}") class SpacedDiffusion(GaussianDiffusion): """ A diffusion process which can skip steps in a base diffusion process. :param use_timesteps: a collection (sequence or set) of timesteps from the original diffusion process to retain. :param kwargs: the kwargs to create the base diffusion process. """ def __init__(self, use_timesteps, **kwargs): self.use_timesteps = set(use_timesteps) self.timestep_map = [] self.original_num_steps = len(kwargs["betas"]) base_diffusion = GaussianDiffusion(**kwargs) # pylint: disable=missing-kwoa last_alpha_cumprod = 1.0 new_betas = [] for i, alpha_cumprod in enumerate(base_diffusion.alphas_cumprod): if i in self.use_timesteps: new_betas.append(1 - alpha_cumprod / last_alpha_cumprod) last_alpha_cumprod = alpha_cumprod self.timestep_map.append(i) kwargs["betas"] = np.array(new_betas) super().__init__(**kwargs) def p_mean_variance( self, model, *args, **kwargs ): # pylint: disable=signature-differs return super().p_mean_variance(self._wrap_model(model), *args, **kwargs) def training_losses( self, model, *args, **kwargs ): # pylint: disable=signature-differs return super().training_losses(self._wrap_model(model), *args, **kwargs) def autoregressive_training_losses( self, model, *args, **kwargs ): # pylint: disable=signature-differs return super().autoregressive_training_losses(self._wrap_model(model, True), *args, **kwargs) def condition_mean(self, cond_fn, *args, **kwargs): return super().condition_mean(self._wrap_model(cond_fn), *args, **kwargs) def condition_score(self, cond_fn, *args, **kwargs): return super().condition_score(self._wrap_model(cond_fn), *args, **kwargs) def _wrap_model(self, model, autoregressive=False): if isinstance(model, _WrappedModel) or isinstance(model, _WrappedAutoregressiveModel): return model mod = _WrappedAutoregressiveModel if autoregressive else _WrappedModel return mod( model, self.timestep_map, self.rescale_timesteps, self.original_num_steps ) def _scale_timesteps(self, t): # Scaling is done by the wrapped model. return t def space_timesteps(num_timesteps, section_counts): """ Create a list of timesteps to use from an original diffusion process, given the number of timesteps we want to take from equally-sized portions of the original process. For example, if there's 300 timesteps and the section counts are [10,15,20] then the first 100 timesteps are strided to be 10 timesteps, the second 100 are strided to be 15 timesteps, and the final 100 are strided to be 20. If the stride is a string starting with "ddim", then the fixed striding from the DDIM paper is used, and only one section is allowed. :param num_timesteps: the number of diffusion steps in the original process to divide up. :param section_counts: either a list of numbers, or a string containing comma-separated numbers, indicating the step count per section. As a special case, use "ddimN" where N is a number of steps to use the striding from the DDIM paper. :return: a set of diffusion steps from the original process to use. """ if isinstance(section_counts, str): if section_counts.startswith("ddim"): desired_count = int(section_counts[len("ddim") :]) for i in range(1, num_timesteps): if len(range(0, num_timesteps, i)) == desired_count: return set(range(0, num_timesteps, i)) raise ValueError( f"cannot create exactly {num_timesteps} steps with an integer stride" ) section_counts = [int(x) for x in section_counts.split(",")] size_per = num_timesteps // len(section_counts) extra = num_timesteps % len(section_counts) start_idx = 0 all_steps = [] for i, section_count in enumerate(section_counts): size = size_per + (1 if i < extra else 0) if size < section_count: raise ValueError( f"cannot divide section of {size} steps into {section_count}" ) if section_count <= 1: frac_stride = 1 else: frac_stride = (size - 1) / (section_count - 1) cur_idx = 0.0 taken_steps = [] for _ in range(section_count): taken_steps.append(start_idx + round(cur_idx)) cur_idx += frac_stride all_steps += taken_steps start_idx += size return set(all_steps) class _WrappedModel: def __init__(self, model, timestep_map, rescale_timesteps, original_num_steps): self.model = model self.timestep_map = timestep_map self.rescale_timesteps = rescale_timesteps self.original_num_steps = original_num_steps def __call__(self, x, ts, **kwargs): map_tensor = th.tensor(self.timestep_map, device=ts.device, dtype=ts.dtype) new_ts = map_tensor[ts] if self.rescale_timesteps: new_ts = new_ts.float() * (1000.0 / self.original_num_steps) return self.model(x, new_ts, **kwargs) class _WrappedAutoregressiveModel: def __init__(self, model, timestep_map, rescale_timesteps, original_num_steps): self.model = model self.timestep_map = timestep_map self.rescale_timesteps = rescale_timesteps self.original_num_steps = original_num_steps def __call__(self, x, x0, ts, **kwargs): map_tensor = th.tensor(self.timestep_map, device=ts.device, dtype=ts.dtype) new_ts = map_tensor[ts] if self.rescale_timesteps: new_ts = new_ts.float() * (1000.0 / self.original_num_steps) return self.model(x, x0, new_ts, **kwargs) def _extract_into_tensor(arr, timesteps, broadcast_shape): """ Extract values from a 1-D numpy array for a batch of indices. :param arr: the 1-D numpy array. :param timesteps: a tensor of indices into the array to extract. :param broadcast_shape: a larger shape of K dimensions with the batch dimension equal to the length of timesteps. :return: a tensor of shape [batch_size, 1, ...] where the shape has K dims. """ res = th.from_numpy(arr.astype(np.float32)).to(device=timesteps.device)[timesteps] while len(res.shape) < len(broadcast_shape): res = res[..., None] return res.expand(broadcast_shape) ================================================ FILE: tortoise/utils/stft.py ================================================ """ BSD 3-Clause License Copyright (c) 2017, Prem Seetharaman All rights reserved. * Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. """ import torch import numpy as np import torch.nn.functional as F from torch.autograd import Variable from scipy.signal import get_window from librosa.util import pad_center, tiny import librosa.util as librosa_util def window_sumsquare(window, n_frames, hop_length=200, win_length=800, n_fft=800, dtype=np.float32, norm=None): """ # from librosa 0.6 Compute the sum-square envelope of a window function at a given hop length. This is used to estimate modulation effects induced by windowing observations in short-time fourier transforms. Parameters ---------- window : string, tuple, number, callable, or list-like Window specification, as in `get_window` n_frames : int > 0 The number of analysis frames hop_length : int > 0 The number of samples to advance between frames win_length : [optional] The length of the window function. By default, this matches `n_fft`. n_fft : int > 0 The length of each analysis frame. dtype : np.dtype The data type of the output Returns ------- wss : np.ndarray, shape=`(n_fft + hop_length * (n_frames - 1))` The sum-squared envelope of the window function """ if win_length is None: win_length = n_fft n = n_fft + hop_length * (n_frames - 1) x = np.zeros(n, dtype=dtype) # Compute the squared window at the desired length win_sq = get_window(window, win_length, fftbins=True) win_sq = librosa_util.normalize(win_sq, norm=norm)**2 win_sq = librosa_util.pad_center(win_sq, n_fft) # Fill the envelope for i in range(n_frames): sample = i * hop_length x[sample:min(n, sample + n_fft)] += win_sq[:max(0, min(n_fft, n - sample))] return x class STFT(torch.nn.Module): """adapted from Prem Seetharaman's https://github.com/pseeth/pytorch-stft""" def __init__(self, filter_length=800, hop_length=200, win_length=800, window='hann'): super(STFT, self).__init__() self.filter_length = filter_length self.hop_length = hop_length self.win_length = win_length self.window = window self.forward_transform = None scale = self.filter_length / self.hop_length fourier_basis = np.fft.fft(np.eye(self.filter_length)) cutoff = int((self.filter_length / 2 + 1)) fourier_basis = np.vstack([np.real(fourier_basis[:cutoff, :]), np.imag(fourier_basis[:cutoff, :])]) forward_basis = torch.FloatTensor(fourier_basis[:, None, :]) inverse_basis = torch.FloatTensor( np.linalg.pinv(scale * fourier_basis).T[:, None, :]) if window is not None: assert(filter_length >= win_length) # get window and zero center pad it to filter_length fft_window = get_window(window, win_length, fftbins=True) fft_window = pad_center(fft_window, size=filter_length) fft_window = torch.from_numpy(fft_window).float() # window the bases forward_basis *= fft_window inverse_basis *= fft_window self.register_buffer('forward_basis', forward_basis.float()) self.register_buffer('inverse_basis', inverse_basis.float()) def transform(self, input_data): num_batches = input_data.size(0) num_samples = input_data.size(1) self.num_samples = num_samples # similar to librosa, reflect-pad the input input_data = input_data.view(num_batches, 1, num_samples) input_data = F.pad( input_data.unsqueeze(1), (int(self.filter_length / 2), int(self.filter_length / 2), 0, 0), mode='reflect') input_data = input_data.squeeze(1) forward_transform = F.conv1d( input_data, Variable(self.forward_basis, requires_grad=False), stride=self.hop_length, padding=0) cutoff = int((self.filter_length / 2) + 1) real_part = forward_transform[:, :cutoff, :] imag_part = forward_transform[:, cutoff:, :] magnitude = torch.sqrt(real_part**2 + imag_part**2) phase = torch.autograd.Variable( torch.atan2(imag_part.data, real_part.data)) return magnitude, phase def inverse(self, magnitude, phase): recombine_magnitude_phase = torch.cat( [magnitude*torch.cos(phase), magnitude*torch.sin(phase)], dim=1) inverse_transform = F.conv_transpose1d( recombine_magnitude_phase, Variable(self.inverse_basis, requires_grad=False), stride=self.hop_length, padding=0) if self.window is not None: window_sum = window_sumsquare( self.window, magnitude.size(-1), hop_length=self.hop_length, win_length=self.win_length, n_fft=self.filter_length, dtype=np.float32) # remove modulation effects approx_nonzero_indices = torch.from_numpy( np.where(window_sum > tiny(window_sum))[0]) window_sum = torch.autograd.Variable( torch.from_numpy(window_sum), requires_grad=False) window_sum = window_sum.cuda() if magnitude.is_cuda else window_sum inverse_transform[:, :, approx_nonzero_indices] /= window_sum[approx_nonzero_indices] # scale by hop ratio inverse_transform *= float(self.filter_length) / self.hop_length inverse_transform = inverse_transform[:, :, int(self.filter_length/2):] inverse_transform = inverse_transform[:, :, :-int(self.filter_length/2):] return inverse_transform def forward(self, input_data): self.magnitude, self.phase = self.transform(input_data) reconstruction = self.inverse(self.magnitude, self.phase) return reconstruction ================================================ FILE: tortoise/utils/text.py ================================================ import re def split_and_recombine_text(text, desired_length=200, max_length=300): """Split text it into chunks of a desired length trying to keep sentences intact.""" # normalize text, remove redundant whitespace and convert non-ascii quotes to ascii text = re.sub(r'\n\n+', '\n', text) text = re.sub(r'\s+', ' ', text) text = re.sub(r'[“”]', '"', text) rv = [] in_quote = False current = "" split_pos = [] pos = -1 end_pos = len(text) - 1 def seek(delta): nonlocal pos, in_quote, current is_neg = delta < 0 for _ in range(abs(delta)): if is_neg: pos -= 1 current = current[:-1] else: pos += 1 current += text[pos] if text[pos] == '"': in_quote = not in_quote return text[pos] def peek(delta): p = pos + delta return text[p] if p < end_pos and p >= 0 else "" def commit(): nonlocal rv, current, split_pos rv.append(current) current = "" split_pos = [] while pos < end_pos: c = seek(1) # do we need to force a split? if len(current) >= max_length: if len(split_pos) > 0 and len(current) > (desired_length / 2): # we have at least one sentence and we are over half the desired length, seek back to the last split d = pos - split_pos[-1] seek(-d) else: # no full sentences, seek back until we are not in the middle of a word and split there while c not in '!?.\n ' and pos > 0 and len(current) > desired_length: c = seek(-1) commit() # check for sentence boundaries elif not in_quote and (c in '!?\n' or (c == '.' and peek(1) in '\n ')): # seek forward if we have consecutive boundary markers but still within the max length while pos < len(text) - 1 and len(current) < max_length and peek(1) in '!?.': c = seek(1) split_pos.append(pos) if len(current) >= desired_length: commit() # treat end of quote as a boundary if its followed by a space or newline elif in_quote and peek(1) == '"' and peek(2) in '\n ': seek(2) split_pos.append(pos) rv.append(current) # clean up, remove lines with only whitespace or punctuation rv = [s.strip() for s in rv] rv = [s for s in rv if len(s) > 0 and not re.match(r'^[\s\.,;:!?]*$', s)] return rv if __name__ == '__main__': import os import unittest class Test(unittest.TestCase): def test_split_and_recombine_text(self): text = """ This is a sample sentence. This is another sample sentence. This is a longer sample sentence that should force a split inthemiddlebutinotinthislongword. "Don't split my quote... please" """ self.assertEqual(split_and_recombine_text(text, desired_length=20, max_length=40), ['This is a sample sentence.', 'This is another sample sentence.', 'This is a longer sample sentence that', 'should force a split', 'inthemiddlebutinotinthislongword.', '"Don\'t split my quote... please"']) def test_split_and_recombine_text_2(self): text = """ When you are really angry sometimes you use consecutive exclamation marks!!!!!! Is this a good thing to do?!?!?! I don't know but we should handle this situation.......................... """ self.assertEqual(split_and_recombine_text(text, desired_length=30, max_length=50), ['When you are really angry sometimes you use', 'consecutive exclamation marks!!!!!!', 'Is this a good thing to do?!?!?!', 'I don\'t know but we should handle this situation.']) def test_split_and_recombine_text_3(self): text_src = os.path.join(os.path.dirname(__file__), '../data/riding_hood.txt') with open(text_src, 'r') as f: text = f.read() self.assertEqual( split_and_recombine_text(text), [ 'Once upon a time there lived in a certain village a little country girl, the prettiest creature who was ever seen. Her mother was excessively fond of her; and her grandmother doted on her still more. This good woman had a little red riding hood made for her.', 'It suited the girl so extremely well that everybody called her Little Red Riding Hood. One day her mother, having made some cakes, said to her, "Go, my dear, and see how your grandmother is doing, for I hear she has been very ill. Take her a cake, and this little pot of butter."', 'Little Red Riding Hood set out immediately to go to her grandmother, who lived in another village. As she was going through the wood, she met with a wolf, who had a very great mind to eat her up, but he dared not, because of some woodcutters working nearby in the forest.', 'He asked her where she was going. The poor child, who did not know that it was dangerous to stay and talk to a wolf, said to him, "I am going to see my grandmother and carry her a cake and a little pot of butter from my mother." "Does she live far off?" said the wolf "Oh I say,"', 'answered Little Red Riding Hood; "it is beyond that mill you see there, at the first house in the village." "Well," said the wolf, "and I\'ll go and see her too. I\'ll go this way and go you that, and we shall see who will be there first."', 'The wolf ran as fast as he could, taking the shortest path, and the little girl took a roundabout way, entertaining herself by gathering nuts, running after butterflies, and gathering bouquets of little flowers.', 'It was not long before the wolf arrived at the old woman\'s house. He knocked at the door: tap, tap. "Who\'s there?" "Your grandchild, Little Red Riding Hood," replied the wolf, counterfeiting her voice; "who has brought you a cake and a little pot of butter sent you by mother."', 'The good grandmother, who was in bed, because she was somewhat ill, cried out, "Pull the bobbin, and the latch will go up."', 'The wolf pulled the bobbin, and the door opened, and then he immediately fell upon the good woman and ate her up in a moment, for it been more than three days since he had eaten.', 'He then shut the door and got into the grandmother\'s bed, expecting Little Red Riding Hood, who came some time afterwards and knocked at the door: tap, tap. "Who\'s there?"', 'Little Red Riding Hood, hearing the big voice of the wolf, was at first afraid; but believing her grandmother had a cold and was hoarse, answered, "It is your grandchild Little Red Riding Hood, who has brought you a cake and a little pot of butter mother sends you."', 'The wolf cried out to her, softening his voice as much as he could, "Pull the bobbin, and the latch will go up." Little Red Riding Hood pulled the bobbin, and the door opened.', 'The wolf, seeing her come in, said to her, hiding himself under the bedclothes, "Put the cake and the little pot of butter upon the stool, and come get into bed with me." Little Red Riding Hood took off her clothes and got into bed.', 'She was greatly amazed to see how her grandmother looked in her nightclothes, and said to her, "Grandmother, what big arms you have!" "All the better to hug you with, my dear." "Grandmother, what big legs you have!" "All the better to run with, my child." "Grandmother, what big ears you have!"', '"All the better to hear with, my child." "Grandmother, what big eyes you have!" "All the better to see with, my child." "Grandmother, what big teeth you have got!" "All the better to eat you up with." And, saying these words, this wicked wolf fell upon Little Red Riding Hood, and ate her all up.', ] ) unittest.main() ================================================ FILE: tortoise/utils/tokenizer.py ================================================ import os import re import inflect import torch from tokenizers import Tokenizer # Regular expression matching whitespace: from unidecode import unidecode _whitespace_re = re.compile(r'\s+') # List of (regular expression, replacement) pairs for abbreviations: _abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [ ('mrs', 'misess'), ('mr', 'mister'), ('dr', 'doctor'), ('st', 'saint'), ('co', 'company'), ('jr', 'junior'), ('maj', 'major'), ('gen', 'general'), ('drs', 'doctors'), ('rev', 'reverend'), ('lt', 'lieutenant'), ('hon', 'honorable'), ('sgt', 'sergeant'), ('capt', 'captain'), ('esq', 'esquire'), ('ltd', 'limited'), ('col', 'colonel'), ('ft', 'fort'), ]] def expand_abbreviations(text): for regex, replacement in _abbreviations: text = re.sub(regex, replacement, text) return text _inflect = inflect.engine() _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') _number_re = re.compile(r'[0-9]+') def _remove_commas(m): return m.group(1).replace(',', '') def _expand_decimal_point(m): return m.group(1).replace('.', ' point ') def _expand_dollars(m): match = m.group(1) parts = match.split('.') if len(parts) > 2: return match + ' dollars' # Unexpected format dollars = int(parts[0]) if parts[0] else 0 cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 if dollars and cents: dollar_unit = 'dollar' if dollars == 1 else 'dollars' cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) elif dollars: dollar_unit = 'dollar' if dollars == 1 else 'dollars' return '%s %s' % (dollars, dollar_unit) elif cents: cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s' % (cents, cent_unit) else: return 'zero dollars' def _expand_ordinal(m): return _inflect.number_to_words(m.group(0)) def _expand_number(m): num = int(m.group(0)) if num > 1000 and num < 3000: if num == 2000: return 'two thousand' elif num > 2000 and num < 2010: return 'two thousand ' + _inflect.number_to_words(num % 100) elif num % 100 == 0: return _inflect.number_to_words(num // 100) + ' hundred' else: return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') else: return _inflect.number_to_words(num, andword='') def normalize_numbers(text): text = re.sub(_comma_number_re, _remove_commas, text) text = re.sub(_pounds_re, r'\1 pounds', text) text = re.sub(_dollars_re, _expand_dollars, text) text = re.sub(_decimal_number_re, _expand_decimal_point, text) text = re.sub(_ordinal_re, _expand_ordinal, text) text = re.sub(_number_re, _expand_number, text) return text def expand_numbers(text): return normalize_numbers(text) def lowercase(text): return text.lower() def collapse_whitespace(text): return re.sub(_whitespace_re, ' ', text) def convert_to_ascii(text): return unidecode(text) def basic_cleaners(text): '''Basic pipeline that lowercases and collapses whitespace without transliteration.''' text = lowercase(text) text = collapse_whitespace(text) return text def transliteration_cleaners(text): '''Pipeline for non-English text that transliterate to ASCII.''' text = convert_to_ascii(text) text = lowercase(text) text = collapse_whitespace(text) return text def english_cleaners(text): '''Pipeline for English text, including number and abbreviation expansion.''' text = convert_to_ascii(text) text = lowercase(text) text = expand_numbers(text) text = expand_abbreviations(text) text = collapse_whitespace(text) text = text.replace('"', '') return text def lev_distance(s1, s2): if len(s1) > len(s2): s1, s2 = s2, s1 distances = range(len(s1) + 1) for i2, c2 in enumerate(s2): distances_ = [i2 + 1] for i1, c1 in enumerate(s1): if c1 == c2: distances_.append(distances[i1]) else: distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1]))) distances = distances_ return distances[-1] DEFAULT_VOCAB_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../data/tokenizer.json') class VoiceBpeTokenizer: def __init__(self, vocab_file=None, use_basic_cleaners=False): self.tokenizer = Tokenizer.from_file( DEFAULT_VOCAB_FILE if vocab_file is None else vocab_file ) if use_basic_cleaners: self.preprocess_text = basic_cleaners else: self.preprocess_text = english_cleaners def encode(self, txt): txt = self.preprocess_text(txt) txt = txt.replace(' ', '[SPACE]') return self.tokenizer.encode(txt).ids def decode(self, seq): if isinstance(seq, torch.Tensor): seq = seq.cpu().numpy() txt = self.tokenizer.decode(seq, skip_special_tokens=False).replace(' ', '') txt = txt.replace('[SPACE]', ' ') txt = txt.replace('[STOP]', '') txt = txt.replace('[UNK]', '') return txt ================================================ FILE: tortoise/utils/typical_sampling.py ================================================ import torch from transformers import LogitsWarper class TypicalLogitsWarper(LogitsWarper): def __init__(self, mass: float = 0.9, filter_value: float = -float("Inf"), min_tokens_to_keep: int = 1): self.filter_value = filter_value self.mass = mass self.min_tokens_to_keep = min_tokens_to_keep def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor: # calculate entropy normalized = torch.nn.functional.log_softmax(scores, dim=-1) p = torch.exp(normalized) ent = -(normalized * p).nansum(-1, keepdim=True) # shift and sort shifted_scores = torch.abs((-normalized) - ent) sorted_scores, sorted_indices = torch.sort(shifted_scores, descending=False) sorted_logits = scores.gather(-1, sorted_indices) cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1) # Remove tokens with cumulative mass above the threshold last_ind = (cumulative_probs < self.mass).sum(dim=1) last_ind[last_ind < 0] = 0 sorted_indices_to_remove = sorted_scores > sorted_scores.gather(1, last_ind.view(-1, 1)) if self.min_tokens_to_keep > 1: # Keep at least min_tokens_to_keep (set to min_tokens_to_keep-1 because we add the first one below) sorted_indices_to_remove[..., : self.min_tokens_to_keep] = 0 indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove) scores = scores.masked_fill(indices_to_remove, self.filter_value) return scores ================================================ FILE: tortoise/utils/wav2vec_alignment.py ================================================ import re import torch import torchaudio from transformers import Wav2Vec2ForCTC, Wav2Vec2FeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor from tortoise.utils.audio import load_audio def max_alignment(s1, s2, skip_character='~', record=None): """ A clever function that aligns s1 to s2 as best it can. Wherever a character from s1 is not found in s2, a '~' is used to replace that character. Finally got to use my DP skills! """ if record is None: record = {} assert skip_character not in s1, f"Found the skip character {skip_character} in the provided string, {s1}" if len(s1) == 0: return '' if len(s2) == 0: return skip_character * len(s1) if s1 == s2: return s1 if s1[0] == s2[0]: return s1[0] + max_alignment(s1[1:], s2[1:], skip_character, record) take_s1_key = (len(s1), len(s2) - 1) if take_s1_key in record: take_s1, take_s1_score = record[take_s1_key] else: take_s1 = max_alignment(s1, s2[1:], skip_character, record) take_s1_score = len(take_s1.replace(skip_character, '')) record[take_s1_key] = (take_s1, take_s1_score) take_s2_key = (len(s1) - 1, len(s2)) if take_s2_key in record: take_s2, take_s2_score = record[take_s2_key] else: take_s2 = max_alignment(s1[1:], s2, skip_character, record) take_s2_score = len(take_s2.replace(skip_character, '')) record[take_s2_key] = (take_s2, take_s2_score) return take_s1 if take_s1_score > take_s2_score else skip_character + take_s2 class Wav2VecAlignment: """ Uses wav2vec2 to perform audio<->text alignment. """ def __init__(self, device='cuda' if not torch.backends.mps.is_available() else 'mps'): self.model = Wav2Vec2ForCTC.from_pretrained("jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli").cpu() self.feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(f"facebook/wav2vec2-large-960h") self.tokenizer = Wav2Vec2CTCTokenizer.from_pretrained('jbetker/tacotron-symbols') self.device = device def align(self, audio, expected_text, audio_sample_rate=24000): orig_len = audio.shape[-1] with torch.no_grad(): self.model = self.model.to(self.device) audio = audio.to(self.device) audio = torchaudio.functional.resample(audio, audio_sample_rate, 16000) clip_norm = (audio - audio.mean()) / torch.sqrt(audio.var() + 1e-7) logits = self.model(clip_norm).logits self.model = self.model.cpu() logits = logits[0] pred_string = self.tokenizer.decode(logits.argmax(-1).tolist()) fixed_expectation = max_alignment(expected_text.lower(), pred_string) w2v_compression = orig_len // logits.shape[0] expected_tokens = self.tokenizer.encode(fixed_expectation) expected_chars = list(fixed_expectation) if len(expected_tokens) == 1: return [0] # The alignment is simple; there is only one token. expected_tokens.pop(0) # The first token is a given. expected_chars.pop(0) alignments = [0] def pop_till_you_win(): if len(expected_tokens) == 0: return None popped = expected_tokens.pop(0) popped_char = expected_chars.pop(0) while popped_char == '~': alignments.append(-1) if len(expected_tokens) == 0: return None popped = expected_tokens.pop(0) popped_char = expected_chars.pop(0) return popped next_expected_token = pop_till_you_win() for i, logit in enumerate(logits): top = logit.argmax() if next_expected_token == top: alignments.append(i * w2v_compression) if len(expected_tokens) > 0: next_expected_token = pop_till_you_win() else: break pop_till_you_win() if not (len(expected_tokens) == 0 and len(alignments) == len(expected_text)): torch.save([audio, expected_text], 'alignment_debug.pth') assert False, "Something went wrong with the alignment algorithm. I've dumped a file, 'alignment_debug.pth' to" \ "your current working directory. Please report this along with the file so it can get fixed." # Now fix up alignments. Anything with -1 should be interpolated. alignments.append(orig_len) # This'll get removed but makes the algorithm below more readable. for i in range(len(alignments)): if alignments[i] == -1: for j in range(i+1, len(alignments)): if alignments[j] != -1: next_found_token = j break for j in range(i, next_found_token): gap = alignments[next_found_token] - alignments[i-1] alignments[j] = (j-i+1) * gap // (next_found_token-i+1) + alignments[i-1] return alignments[:-1] def redact(self, audio, expected_text, audio_sample_rate=24000): if '[' not in expected_text: return audio splitted = expected_text.split('[') fully_split = [splitted[0]] for spl in splitted[1:]: assert ']' in spl, 'Every "[" character must be paired with a "]" with no nesting.' fully_split.extend(spl.split(']')) # At this point, fully_split is a list of strings, with every other string being something that should be redacted. non_redacted_intervals = [] last_point = 0 for i in range(len(fully_split)): if i % 2 == 0 and fully_split[i] != "": # Check for empty string fixes index error end_interval = max(0, last_point + len(fully_split[i]) - 1) non_redacted_intervals.append((last_point, end_interval)) last_point += len(fully_split[i]) bare_text = ''.join(fully_split) alignments = self.align(audio, bare_text, audio_sample_rate) output_audio = [] for nri in non_redacted_intervals: start, stop = nri output_audio.append(audio[:, alignments[start]:alignments[stop]]) return torch.cat(output_audio, dim=-1) ================================================ FILE: tortoise_tts.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": { "id": "_pIZ3ZXNp7cf" }, "source": [ "Welcome to Tortoise! 🐢🐢🐢🐢\n", "\n", "Before you begin, I **strongly** recommend you turn on a GPU runtime.\n", "\n", "There's a reason this is called \"Tortoise\" - this model takes up to a minute to perform inference for a single sentence on a GPU. Expect waits on the order of hours on a CPU." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JrK20I32grP6" }, "outputs": [], "source": [ "#first follow the instructions in the README.md file under Local Installation\n", "!pip3 install -r requirements.txt\n", "# !python3 setup.py install" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "Gen09NM4hONQ" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/manmay/anaconda3/envs/tortoise/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2023-07-15 10:55:28,559] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown\n", "[2023-07-15 10:55:28,603] [WARNING] [config_utils.py:75:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead\n", "[2023-07-15 10:55:28,605] [INFO] [logging.py:93:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1\n", "WARNING! Setting BLOOMLayerPolicy._orig_layer_class to None due to Exception: module 'transformers.models' has no attribute 'bloom'\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Using /home/manmay/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...\n", "Detected CUDA files, patching ldflags\n", "Emitting ninja build file /home/manmay/.cache/torch_extensions/py39_cu117/transformer_inference/build.ninja...\n", "Building extension module transformer_inference...\n", "Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)\n", "Loading extension module transformer_inference...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ninja: no work to do.\n", "Time to load transformer_inference op: 0.9313881397247314 seconds\n", "[2023-07-15 10:55:34,938] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': , 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Using /home/manmay/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...\n", "No modifications detected for re-loaded extension module transformer_inference, skipping build step...\n", "Loading extension module transformer_inference...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Time to load transformer_inference op: 0.15709829330444336 seconds\n" ] } ], "source": [ "# Imports used through the rest of the notebook.\n", "import torch\n", "import torchaudio\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "\n", "import IPython\n", "\n", "from tortoise.api import TextToSpeech\n", "from tortoise.utils.audio import load_audio, load_voice, load_voices\n", "\n", "# This will download all the models used by Tortoise from the HF hub.\n", "# tts = TextToSpeech()\n", "# If you want to use deepspeed the pass use_deepspeed=True nearly 2x faster than normal\n", "tts = TextToSpeech(use_deepspeed=True, kv_cache=True)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "bt_aoxONjfL2" }, "outputs": [], "source": [ "# This is the text that will be spoken.\n", "text = \"Joining two modalities results in a surprising increase in generalization! What would happen if we combined them all?\"\n", "\n", "# Here's something for the poetically inclined.. (set text=)\n", "\"\"\"\n", "Then took the other, as just as fair,\n", "And having perhaps the better claim,\n", "Because it was grassy and wanted wear;\n", "Though as for that the passing there\n", "Had worn them really about the same,\"\"\"\n", "\n", "# Pick a \"preset mode\" to determine quality. Options: {\"ultra_fast\", \"fast\" (default), \"standard\", \"high_quality\"}. See docs in api.py\n", "preset = \"ultra_fast\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "SSleVnRAiEE2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0m\u001b[01;34mangie\u001b[0m/ \u001b[01;34mfreeman\u001b[0m/ \u001b[01;34mmyself\u001b[0m/ \u001b[01;34mtom\u001b[0m/ \u001b[01;34mtrain_grace\u001b[0m/\n", "\u001b[01;34mapplejack\u001b[0m/ \u001b[01;34mgeralt\u001b[0m/ \u001b[01;34mpat\u001b[0m/ \u001b[01;34mtrain_atkins\u001b[0m/ \u001b[01;34mtrain_kennard\u001b[0m/\n", "\u001b[01;34mcond_latent_example\u001b[0m/ \u001b[01;34mhalle\u001b[0m/ \u001b[01;34mpat2\u001b[0m/ \u001b[01;34mtrain_daws\u001b[0m/ \u001b[01;34mtrain_lescault\u001b[0m/\n", "\u001b[01;34mdaniel\u001b[0m/ \u001b[01;34mjlaw\u001b[0m/ \u001b[01;34mrainbow\u001b[0m/ \u001b[01;34mtrain_dotrice\u001b[0m/ \u001b[01;34mtrain_mouse\u001b[0m/\n", "\u001b[01;34mdeniro\u001b[0m/ \u001b[01;34mlj\u001b[0m/ \u001b[01;34msnakes\u001b[0m/ \u001b[01;34mtrain_dreams\u001b[0m/ \u001b[01;34mweaver\u001b[0m/\n", "\u001b[01;34memma\u001b[0m/ \u001b[01;34mmol\u001b[0m/ \u001b[01;34mtim_reynolds\u001b[0m/ \u001b[01;34mtrain_empire\u001b[0m/ \u001b[01;34mwilliam\u001b[0m/\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Tortoise will attempt to mimic voices you provide. It comes pre-packaged\n", "# with some voices you might recognize.\n", "\n", "# Let's list all the voices available. These are just some random clips I've gathered\n", "# from the internet as well as a few voices from the training dataset.\n", "# Feel free to add your own clips to the voices/ folder.\n", "%ls tortoise/voices\n", "\n", "IPython.display.Audio('tortoise/voices/tom/1.wav')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "KEXOKjIvn6NW" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating autoregressive samples..\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 0%| | 0/16 [00:00 6\u001b[0m gen \u001b[39m=\u001b[39m tts\u001b[39m.\u001b[39;49mtts_with_preset(text, voice_samples\u001b[39m=\u001b[39;49mvoice_samples, conditioning_latents\u001b[39m=\u001b[39;49mconditioning_latents, \n\u001b[1;32m 7\u001b[0m preset\u001b[39m=\u001b[39;49mpreset)\n\u001b[1;32m 8\u001b[0m torchaudio\u001b[39m.\u001b[39msave(\u001b[39m'\u001b[39m\u001b[39mgenerated.wav\u001b[39m\u001b[39m'\u001b[39m, gen\u001b[39m.\u001b[39msqueeze(\u001b[39m0\u001b[39m)\u001b[39m.\u001b[39mcpu(), \u001b[39m24000\u001b[39m)\n\u001b[1;32m 9\u001b[0m IPython\u001b[39m.\u001b[39mdisplay\u001b[39m.\u001b[39mAudio(\u001b[39m'\u001b[39m\u001b[39mgenerated.wav\u001b[39m\u001b[39m'\u001b[39m)\n", "File \u001b[0;32m/data/speech_synth/tortoise-tts/tortoise/api.py:329\u001b[0m, in \u001b[0;36mTextToSpeech.tts_with_preset\u001b[0;34m(self, text, preset, **kwargs)\u001b[0m\n\u001b[1;32m 327\u001b[0m settings\u001b[39m.\u001b[39mupdate(presets[preset])\n\u001b[1;32m 328\u001b[0m settings\u001b[39m.\u001b[39mupdate(kwargs) \u001b[39m# allow overriding of preset settings with kwargs\u001b[39;00m\n\u001b[0;32m--> 329\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mtts(text, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49msettings)\n", "File \u001b[0;32m/data/speech_synth/tortoise-tts/tortoise/api.py:412\u001b[0m, in \u001b[0;36mTextToSpeech.tts\u001b[0;34m(self, text, voice_samples, conditioning_latents, k, verbose, use_deterministic_seed, return_deterministic_state, num_autoregressive_samples, temperature, length_penalty, repetition_penalty, top_p, max_mel_tokens, cvvp_amount, diffusion_iterations, cond_free, cond_free_k, diffusion_temperature, **hf_generate_kwargs)\u001b[0m\n\u001b[1;32m 410\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m\"\u001b[39m\u001b[39mGenerating autoregressive samples..\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m 411\u001b[0m \u001b[39mfor\u001b[39;00m b \u001b[39min\u001b[39;00m tqdm(\u001b[39mrange\u001b[39m(num_batches), disable\u001b[39m=\u001b[39m\u001b[39mnot\u001b[39;00m verbose):\n\u001b[0;32m--> 412\u001b[0m codes \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mautoregressive\u001b[39m.\u001b[39;49minference_speech(auto_conditioning, text_tokens,\n\u001b[1;32m 413\u001b[0m do_sample\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m,\n\u001b[1;32m 414\u001b[0m top_p\u001b[39m=\u001b[39;49mtop_p,\n\u001b[1;32m 415\u001b[0m temperature\u001b[39m=\u001b[39;49mtemperature,\n\u001b[1;32m 416\u001b[0m num_return_sequences\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mautoregressive_batch_size,\n\u001b[1;32m 417\u001b[0m length_penalty\u001b[39m=\u001b[39;49mlength_penalty,\n\u001b[1;32m 418\u001b[0m repetition_penalty\u001b[39m=\u001b[39;49mrepetition_penalty,\n\u001b[1;32m 419\u001b[0m max_generate_length\u001b[39m=\u001b[39;49mmax_mel_tokens,\n\u001b[1;32m 420\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mhf_generate_kwargs)\n\u001b[1;32m 421\u001b[0m padding_needed \u001b[39m=\u001b[39m max_mel_tokens \u001b[39m-\u001b[39m codes\u001b[39m.\u001b[39mshape[\u001b[39m1\u001b[39m]\n\u001b[1;32m 422\u001b[0m codes \u001b[39m=\u001b[39m F\u001b[39m.\u001b[39mpad(codes, (\u001b[39m0\u001b[39m, padding_needed), value\u001b[39m=\u001b[39mstop_mel_token)\n", "File \u001b[0;32m/data/speech_synth/tortoise-tts/tortoise/models/autoregressive.py:490\u001b[0m, in \u001b[0;36mUnifiedVoice.inference_speech\u001b[0;34m(self, speech_conditioning_latent, text_inputs, input_tokens, num_return_sequences, max_generate_length, typical_sampling, typical_mass, **hf_generate_kwargs)\u001b[0m\n\u001b[1;32m 488\u001b[0m logits_processor \u001b[39m=\u001b[39m LogitsProcessorList([TypicalLogitsWarper(mass\u001b[39m=\u001b[39mtypical_mass)]) \u001b[39mif\u001b[39;00m typical_sampling \u001b[39melse\u001b[39;00m LogitsProcessorList()\n\u001b[1;32m 489\u001b[0m max_length \u001b[39m=\u001b[39m trunc_index \u001b[39m+\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mmax_mel_tokens \u001b[39m-\u001b[39m \u001b[39m1\u001b[39m \u001b[39mif\u001b[39;00m max_generate_length \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m \u001b[39melse\u001b[39;00m trunc_index \u001b[39m+\u001b[39m max_generate_length\n\u001b[0;32m--> 490\u001b[0m gen \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49minference_model\u001b[39m.\u001b[39;49mgenerate(inputs, bos_token_id\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mstart_mel_token, pad_token_id\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mstop_mel_token, eos_token_id\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mstop_mel_token,\n\u001b[1;32m 491\u001b[0m max_length\u001b[39m=\u001b[39;49mmax_length, logits_processor\u001b[39m=\u001b[39;49mlogits_processor,\n\u001b[1;32m 492\u001b[0m num_return_sequences\u001b[39m=\u001b[39;49mnum_return_sequences, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mhf_generate_kwargs)\n\u001b[1;32m 493\u001b[0m \u001b[39mreturn\u001b[39;00m gen[:, trunc_index:]\n", "File \u001b[0;32m~/anaconda3/envs/tortoise/lib/python3.9/site-packages/torch/utils/_contextlib.py:115\u001b[0m, in \u001b[0;36mcontext_decorator..decorate_context\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 112\u001b[0m \u001b[39m@functools\u001b[39m\u001b[39m.\u001b[39mwraps(func)\n\u001b[1;32m 113\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mdecorate_context\u001b[39m(\u001b[39m*\u001b[39margs, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs):\n\u001b[1;32m 114\u001b[0m \u001b[39mwith\u001b[39;00m ctx_factory():\n\u001b[0;32m--> 115\u001b[0m \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n", "File \u001b[0;32m~/anaconda3/envs/tortoise/lib/python3.9/site-packages/transformers/generation_utils.py:1310\u001b[0m, in \u001b[0;36mGenerationMixin.generate\u001b[0;34m(self, inputs, max_length, min_length, do_sample, early_stopping, num_beams, temperature, top_k, top_p, typical_p, repetition_penalty, bad_words_ids, force_words_ids, bos_token_id, pad_token_id, eos_token_id, length_penalty, no_repeat_ngram_size, encoder_no_repeat_ngram_size, num_return_sequences, max_time, max_new_tokens, decoder_start_token_id, use_cache, num_beam_groups, diversity_penalty, prefix_allowed_tokens_fn, logits_processor, renormalize_logits, stopping_criteria, constraints, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, forced_bos_token_id, forced_eos_token_id, remove_invalid_values, synced_gpus, exponential_decay_length_penalty, **model_kwargs)\u001b[0m\n\u001b[1;32m 1302\u001b[0m input_ids, model_kwargs \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_expand_inputs_for_generation(\n\u001b[1;32m 1303\u001b[0m input_ids,\n\u001b[1;32m 1304\u001b[0m expand_size\u001b[39m=\u001b[39mnum_return_sequences,\n\u001b[1;32m 1305\u001b[0m is_encoder_decoder\u001b[39m=\u001b[39m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mconfig\u001b[39m.\u001b[39mis_encoder_decoder,\n\u001b[1;32m 1306\u001b[0m \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mmodel_kwargs,\n\u001b[1;32m 1307\u001b[0m )\n\u001b[1;32m 1309\u001b[0m \u001b[39m# 12. run sample\u001b[39;00m\n\u001b[0;32m-> 1310\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49msample(\n\u001b[1;32m 1311\u001b[0m input_ids,\n\u001b[1;32m 1312\u001b[0m logits_processor\u001b[39m=\u001b[39;49mlogits_processor,\n\u001b[1;32m 1313\u001b[0m logits_warper\u001b[39m=\u001b[39;49mlogits_warper,\n\u001b[1;32m 1314\u001b[0m stopping_criteria\u001b[39m=\u001b[39;49mstopping_criteria,\n\u001b[1;32m 1315\u001b[0m pad_token_id\u001b[39m=\u001b[39;49mpad_token_id,\n\u001b[1;32m 1316\u001b[0m eos_token_id\u001b[39m=\u001b[39;49meos_token_id,\n\u001b[1;32m 1317\u001b[0m output_scores\u001b[39m=\u001b[39;49moutput_scores,\n\u001b[1;32m 1318\u001b[0m return_dict_in_generate\u001b[39m=\u001b[39;49mreturn_dict_in_generate,\n\u001b[1;32m 1319\u001b[0m synced_gpus\u001b[39m=\u001b[39;49msynced_gpus,\n\u001b[1;32m 1320\u001b[0m \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mmodel_kwargs,\n\u001b[1;32m 1321\u001b[0m )\n\u001b[1;32m 1323\u001b[0m \u001b[39melif\u001b[39;00m is_beam_gen_mode:\n\u001b[1;32m 1324\u001b[0m \u001b[39mif\u001b[39;00m num_return_sequences \u001b[39m>\u001b[39m num_beams:\n", "File \u001b[0;32m~/anaconda3/envs/tortoise/lib/python3.9/site-packages/transformers/generation_utils.py:1963\u001b[0m, in \u001b[0;36mGenerationMixin.sample\u001b[0;34m(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)\u001b[0m\n\u001b[1;32m 1961\u001b[0m \u001b[39m# sample\u001b[39;00m\n\u001b[1;32m 1962\u001b[0m probs \u001b[39m=\u001b[39m nn\u001b[39m.\u001b[39mfunctional\u001b[39m.\u001b[39msoftmax(next_token_scores, dim\u001b[39m=\u001b[39m\u001b[39m-\u001b[39m\u001b[39m1\u001b[39m)\n\u001b[0;32m-> 1963\u001b[0m next_tokens \u001b[39m=\u001b[39m torch\u001b[39m.\u001b[39;49mmultinomial(probs, num_samples\u001b[39m=\u001b[39;49m\u001b[39m1\u001b[39;49m)\u001b[39m.\u001b[39msqueeze(\u001b[39m1\u001b[39m)\n\u001b[1;32m 1965\u001b[0m \u001b[39m# finished sentences should have their next token be a padding token\u001b[39;00m\n\u001b[1;32m 1966\u001b[0m \u001b[39mif\u001b[39;00m eos_token_id \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n", "\u001b[0;31mKeyboardInterrupt\u001b[0m: " ] } ], "source": [ "# Pick one of the voices from the output above\n", "voice = 'tom'\n", "\n", "# Load it and send it through Tortoise.\n", "voice_samples, conditioning_latents = load_voice(voice)\n", "gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, \n", " preset=preset)\n", "torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)\n", "IPython.display.Audio('generated.wav')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "16Xs2SSC3BXa" }, "outputs": [], "source": [ "# Tortoise can also generate speech using a random voice. The voice changes each time you execute this!\n", "# (Note: random voices can be prone to strange utterances)\n", "gen = tts.tts_with_preset(text, voice_samples=None, conditioning_latents=None, preset=preset)\n", "torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)\n", "IPython.display.Audio('generated.wav')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fYTk8KUezUr5" }, "outputs": [], "source": [ "# You can also combine conditioning voices. Combining voices produces a new voice\n", "# with traits from all the parents.\n", "#\n", "# Lets see what it would sound like if Picard and Kirk had a kid with a penchant for philosophy:\n", "voice_samples, conditioning_latents = load_voices(['pat', 'william'])\n", "\n", "gen = tts.tts_with_preset(\"They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.\", \n", " voice_samples=None, conditioning_latents=None, preset=preset)\n", "torchaudio.save('captain_kirkard.wav', gen.squeeze(0).cpu(), 24000)\n", "IPython.display.Audio('captain_kirkard.wav')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t66yqWgu68KL" }, "outputs": [], "source": [ "del tts # Will break other cells, but necessary to conserve RAM if you want to run this cell.\n", "\n", "# Tortoise comes with some scripts that does a lot of the lifting for you. For example,\n", "# read.py will read a text file for you.\n", "!python3 tortoise/read.py --voice=train_atkins --textfile=tortoise/data/riding_hood.txt --preset=ultra_fast --output_path=.\n", "\n", "IPython.display.Audio('train_atkins/combined.wav')\n", "# This will take awhile.." ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "tortoise-tts.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 4 } ================================================ FILE: tortoise_v2_examples.html ================================================ TorToiSe - These words were never spoken.

Introduction 🐢

TorToiSe is a text-to-speech program built in April 2022 by jbetker@. TorToiSe is open source, with trained model weights available at https://github.com/neonbjb/tortoise-tts

This page demonstrates some of the results of TorToiSe.

Handpicked results 🐢

Following are several particularly good results generated by the model.

Short-form



















Long-form


Comparisons (with the LJSpeech voice): 🐢

LJSpeech is a popular dataset used to train small-scale TTS models. TorToiSe is a multi-voice model, following is how it renders the LJSpeech voice with and without fine-tuning, compared with results for the same text from the popular Tacotron2 model paired with the Waveglow vocoder.

Tacotron2+WaveglowTorToiSeTorToiSe Finetuned









NaturalVoice is a SOTA TTS engine developed by Microsoft Research Asia in May 2022. It features realistic prosody and end-to-end generation with no need for a vocoder. While not much has actually been released about this model other than five samples, those samples are quite good and I would consider this the most competitive TTS engine out there right now.

Natural VoiceTorToiSe Finetuned






It is important to note that it is not actually fair to compare any of these models: Tortoise is a multi-voice probabilistic model trained on millions of hours of speech with an exceptionally slow inference time. Tacotron and NaturalVoice are efficient, fast, single-voice models trained on 24 hours of speech. Unfortunately, there isn't much in the way of actually comparable research to Tortoise.

All Results 🐢

Following are all the results from which the hand-picked results were drawn from. Also included is the reference audio that the program is trying to mimic. This will give you a better sense of how TorToiSe really performs.

Short-form

textangiedanieldeniroemmafreemangeralthallejlawljmyselfpatsnakestomtrain_atkinstrain_dotricetrain_kennardweaverwilliam
reference clip
autoregressive_ml
bengio_it_needs_to_know_what_is_bad
dickinson_stop_for_death
espn_basketball
frost_oar_to_oar
frost_road_not_taken
gatsby_and_so_we_beat_on
harrypotter_differences_of_habit_and_language
i_am_a_language_model
melodie_kao
nyt_covid
real_courage_is_when_you_know_your_licked
rolling_stone_review
spacecraft_interview
tacotron2_sample1
tacotron2_sample2
tacotron2_sample3
tacotron2_sample4
watts_this_is_the_real_secret_of_life
wilde_nowadays_people_know_the_price

Long-form

Angelina:
Craig:
Deniro:
Emma:
Freeman:
Geralt:
Halle:
Jlaw:
LJ:
Myself:
Pat:
Snakes:
Tom:
Weaver:
William:

Prompt Engineering 🐢

Tortoise is capable of "prompt-engineering" in that tone and prosody is affected by the emotions inflected in the words fed to the program. For example, prompting the model with "[I am so angry,] I went to the park and threw a ball" will result in it outputting "I went to the park and threw the ball" with an angry tone.

Following are a few examples of different prompts. The effect is subtle, but is definitely there. Many voices are less effected by this.

Angry:
Sad:
Happy:
Scared:
================================================ FILE: voice_customization_guide.md ================================================ ## Voice Customization Guide Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips. These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clip is also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb. ### Provided voices This repo comes with several pre-packaged voices. Voices prepended with "train_" came from the training set and perform far better than the others. If your goal is high quality speech, I recommend you pick one of them. If you want to see what Tortoise can do for zero-shot mimicking, take a look at the others. ### Adding a new voice To add new voices to Tortoise, you will need to do the following: 1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are in the next section. 2. Cut your clips into ~10 second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing. 3. Save the clips as a WAV file with floating point format and a 22,050 sample rate. 4. Create a subdirectory in voices/ 5. Put your clips in that subdirectory. 6. Run tortoise utilities with --voice=. ### Picking good reference clips As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking good clips: 1. Avoid clips with background music, noise or reverb. These clips were removed from the training dataset. Tortoise is unlikely to do well with them. 2. Avoid speeches. These generally have distortion caused by the amplification system. 3. Avoid clips from phone calls. 4. Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them. 5. Try to find clips that are spoken in such a way as you wish your output to sound like. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book. 6. The text being spoken in the clips does not matter, but diverse text does seem to perform better.