Repository: CMsmartvoice/One-Shot-Voice-Cloning
Branch: master
Commit: fe785f7755b0
Files: 74
Total size: 51.0 MB
Directory structure:
gitextract_aqw3mgq5/
├── .gitignore
├── README-CN.md
├── README.md
├── TensorFlowTTS/
│ ├── LICENSE
│ ├── README.md
│ ├── setup.cfg
│ ├── setup.py
│ └── tensorflow_tts/
│ ├── __init__.py
│ ├── audio_process/
│ │ ├── __init__.py
│ │ ├── audio.py
│ │ └── audio_spec.py
│ ├── bin/
│ │ ├── __init__.py
│ │ └── preprocess_unetts.py
│ ├── configs/
│ │ ├── __init__.py
│ │ ├── mb_melgan.py
│ │ ├── melgan.py
│ │ └── unetts.py
│ ├── datasets/
│ │ ├── __init__.py
│ │ ├── abstract_dataset.py
│ │ ├── audio_dataset.py
│ │ └── mel_dataset.py
│ ├── inference/
│ │ ├── __init__.py
│ │ ├── auto_config.py
│ │ ├── auto_model.py
│ │ └── auto_processor.py
│ ├── losses/
│ │ ├── __init__.py
│ │ ├── spectrogram.py
│ │ └── stft.py
│ ├── models/
│ │ ├── __init__.py
│ │ ├── mb_melgan.py
│ │ ├── melgan.py
│ │ ├── moduls/
│ │ │ ├── __init__.py
│ │ │ ├── adain_en_de_code.py
│ │ │ ├── conditional.py
│ │ │ ├── core.py
│ │ │ └── core2.py
│ │ └── unetts.py
│ ├── optimizers/
│ │ ├── __init__.py
│ │ └── adamweightdecay.py
│ ├── processor/
│ │ ├── __init__.py
│ │ ├── base_processor.py
│ │ └── multispk_voiceclone.py
│ ├── trainers/
│ │ ├── __init__.py
│ │ └── base_trainer.py
│ └── utils/
│ ├── __init__.py
│ ├── cleaners.py
│ ├── decoder.py
│ ├── griffin_lim.py
│ ├── group_conv.py
│ ├── korean.py
│ ├── number_norm.py
│ ├── outliers.py
│ ├── strategy.py
│ ├── utils.py
│ └── weight_norm.py
├── UnetTTS_syn.py
├── models/
│ ├── acous12k.h5
│ ├── duration4k.h5
│ ├── unetts_mapper.json
│ └── vocoder800k.h5
├── notebook/
│ └── OneShotVoiceClone_Inference.ipynb
├── test_wavs/
│ ├── angry_dur_stat.npy
│ ├── happy_dur_stat.npy
│ ├── neutral_dur_stat.npy
│ ├── sad_dur_stat.npy
│ └── surprise_dur_stat.npy
└── train/
├── configs/
│ ├── multiband_melgan.yaml
│ ├── unetts_acous.yaml
│ ├── unetts_duration.yaml
│ └── unetts_preprocess.yaml
├── train_multiband_melgan.py
├── train_unetts_acous.py
├── train_unetts_duration.py
└── unetts_dataset.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
**/**/__pycache__
**/.ipynb_checkpoints
TensorFlowTTS/build
TensorFlowTTS/TensorFlowTTS.egg-info
TensorFlowTTS/.eggs
egs
================================================
FILE: README-CN.md
================================================
## Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning
[](http://choosealicense.com/licenses/mit/)
> Chinese | [English](README.md)
:exclamation: Inference code and pre-trained models are provided; you can generate speech for any text you want.
:star: The model is trained only on a neutral-emotion corpus, without any strongly emotional speech.
:star: Limited by the training corpus, ordinary speaker-embedding or unsupervised style-learning methods struggle to imitate unseen voices; style transfer outside the training-data distribution remains a major challenge.
:star: Relying on the Unet network and AdaIN layers, our method transfers strongly to unseen styles.
:sparkles: The [online notebook](https://colab.research.google.com/drive/1sEDvKTJCY7uosb7TvTqwyUdwNPiv3pBW#scrollTo=puzhCI99LY_a) is highly recommended for inference.
[Demo results](https://cmsmartvoice.github.io/Unet-TTS/)
[Paper link](https://arxiv.org/abs/2109.11115)

---
:star: Cloning TTS now needs only a single reference utterance as input; you no longer have to enter the reference speech's duration statistics manually.
:smile: We are preparing a training pipeline based on the Aishell3 data. Stay tuned!
It includes:
- [x] One-shot voice cloning inference
- [x] The duration statistics of the reference audio can be estimated by the trained Style_Encoder
- [ ] Multi-speaker TTS based on speaker embeddings, which provides a decent Content Encoder
- [ ] Unet-TTS training
- [ ] C++ inference
---
### Install Requirements
- Only Linux is supported for now
- Install the appropriate TensorFlow and tensorflow-addons versions according to CUDA version.
- The default is TensorFlow 2.6 and tensorflow-addons 0.14.0.
```shell
cd One-Shot-Voice-Cloning/TensorFlowTTS
pip install .
# or: python setup.py install
```
### Usage
Option 1: Modify the reference speech to be cloned in the UnetTTS_syn.py file. (See this file for more details.)
```shell
cd One-Shot-Voice-Cloning
CUDA_VISIBLE_DEVICES=0 python UnetTTS_syn.py
```
Option 2: Notebook
**Note**: Please add the One-Shot-Voice-Cloning directory to the system path; otherwise the UnetTTS class cannot be imported from the UnetTTS_syn.py file.
```python
import sys
sys.path.append("<your repository's parent directory>/One-Shot-Voice-Cloning")
from UnetTTS_syn import UnetTTS
from tensorflow_tts.audio_process import preprocess_wav
"""初始化模型"""
models_and_params = {"duration_param": "train/configs/unetts_duration.yaml",
"duration_model": "models/duration4k.h5",
"acous_param": "train/configs/unetts_acous.yaml",
"acous_model": "models/acous12k.h5",
"vocoder_param": "train/configs/multiband_melgan.yaml",
"vocoder_model": "models/vocoder800k.h5"}
feats_yaml = "train/configs/unetts_preprocess.yaml"
text2id_mapper = "models/unetts_mapper.json"
Tts_handel = UnetTTS(models_and_params, text2id_mapper, feats_yaml)
"""根据目标语音,生成任意文本的克隆语音"""
wav_fpath = "./reference_speech.wav"
ref_audio = preprocess_wav(wav_fpath, source_sr=16000, normalize=True, trim_silence=True, is_sil_pad=True,
vad_window_length=30,
vad_moving_average_width=1,
vad_max_silence_length=1)
# 文本中插入#3标识,可以当作标点符号,合成语音中会产生停顿
text = "一句话#3风格迁移#3语音合成系统"
syn_audio, _, _ = Tts_handel.one_shot_TTS(text, ref_audio)
```
### Reference
https://github.com/TensorSpeech/TensorFlowTTS
https://github.com/CorentinJ/Real-Time-Voice-Cloning
================================================
FILE: README.md
================================================
## Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning
[](http://choosealicense.com/licenses/mit/)
> English | [中文](README-CN.md)
:exclamation: We now provide inference code and pre-trained models. You can generate speech for any text you want.
:star: The model is trained only on a neutral-emotion corpus and does not use any strongly emotional speech.
:star: Out-of-domain style transfer remains very challenging: limited by the training corpus, speaker-embedding and unsupervised style-learning methods (like GST) struggle to imitate unseen data.
:star: With the help of the Unet network and AdaIN layers, our proposed algorithm has powerful speaker and style transfer capabilities.
[Demo results](https://cmsmartvoice.github.io/Unet-TTS/)
[Paper link](https://arxiv.org/abs/2109.11115)
:sparkles: The [Colab notebook](https://colab.research.google.com/drive/1sEDvKTJCY7uosb7TvTqwyUdwNPiv3pBW?usp=sharing) is highly recommended for testing.

---
:star: Now you only need a reference speech for one-shot voice cloning; you no longer need to enter the duration statistics manually.
:smile: The authors are preparing a simple, clear, and well-documented training process for Unet-TTS based on Aishell3.
It contains:
- [x] One-shot Voice cloning inference
- [x] The duration statistics of the reference speech can be estimated automatically using the Style_Encoder.
- [ ] Multi-speaker TTS with speaker_embedding-Instance-Normalization, which provides a pre-trained Content Encoder.
- [ ] Unet-TTS training
- [ ] C++ inference
Stay tuned!
---
### Install Requirements
- Only Linux is supported
- Install the appropriate TensorFlow and tensorflow-addons versions according to CUDA version.
- The default is TensorFlow 2.6 and tensorflow-addons 0.14.0.
```shell
cd One-Shot-Voice-Cloning/TensorFlowTTS
pip install .
# or: python setup.py install
```
### Usage
Option 1: Modify the reference audio file to be cloned in the UnetTTS_syn.py file. (See this file for more details)
```shell
cd One-Shot-Voice-Cloning
CUDA_VISIBLE_DEVICES=0 python UnetTTS_syn.py
```
Option 2: Notebook
**Note**: Please add the One-Shot-Voice-Cloning path to the system path. Otherwise the required class UnetTTS cannot be imported from the UnetTTS_syn.py file.
```python
import sys
sys.path.append("<your repository's parent directory>/One-Shot-Voice-Cloning")
from UnetTTS_syn import UnetTTS
from tensorflow_tts.audio_process import preprocess_wav
"""Inint models"""
models_and_params = {"duration_param": "train/configs/unetts_duration.yaml",
"duration_model": "models/duration4k.h5",
"acous_param": "train/configs/unetts_acous.yaml",
"acous_model": "models/acous12k.h5",
"vocoder_param": "train/configs/multiband_melgan.yaml",
"vocoder_model": "models/vocoder800k.h5"}
feats_yaml = "train/configs/unetts_preprocess.yaml"
text2id_mapper = "models/unetts_mapper.json"
Tts_handel = UnetTTS(models_and_params, text2id_mapper, feats_yaml)
"""Synthesize arbitrary text cloning voice using a reference speech"""
wav_fpath = "./reference_speech.wav"
ref_audio = preprocess_wav(wav_fpath, source_sr=16000, normalize=True, trim_silence=True, is_sil_pad=True,
vad_window_length=30,
vad_moving_average_width=1,
vad_max_silence_length=1)
# Inserting #3 marks into text is regarded as punctuation, and synthetic speech can produce pause.
text = "一句话#3风格迁移#3语音合成系统"
syn_audio, _, _ = Tts_handel.one_shot_TTS(text, ref_audio)
```
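To listen to the result, you can write `syn_audio` to disk, for example with `soundfile`. A minimal sketch, assuming the pipeline's default 16 kHz rate (the `preprocess_wav` default shown above):
```python
import soundfile as sf

# 16000 Hz is an assumption based on preprocess_wav's default sampling rate.
sf.write("cloned_speech.wav", syn_audio, 16000, "PCM_16")
```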
### Reference
https://github.com/TensorSpeech/TensorFlowTTS
https://github.com/CorentinJ/Real-Time-Voice-Cloning
================================================
FILE: TensorFlowTTS/LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: TensorFlowTTS/README.md
================================================
<h2 align="center">
<p> :yum: TensorFlowTTS
<p align="center">
<a href="https://github.com/tensorspeech/TensorFlowTTS/actions">
<img alt="Build" src="https://github.com/tensorspeech/TensorFlowTTS/workflows/CI/badge.svg?branch=master">
</a>
<a href="https://github.com/tensorspeech/TensorFlowTTS/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/tensorspeech/TensorflowTTS?color=red">
</a>
<a href="https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing">
<img alt="Colab" src="https://colab.research.google.com/assets/colab-badge.svg">
</a>
</p>
</h2>
<h2 align="center">
<p>Real-Time State-of-the-art Speech Synthesis for Tensorflow 2
</h2>
:zany_face: TensorFlowTTS provides real-time, state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multiband-MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training/inference and optimize further using [fake-quantize aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), making TTS models run faster than real-time and deployable on mobile devices or embedded systems.
## What's new
- 2020/08/23 **(NEW!)** Add Parallel WaveGAN tensorflow implementation. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
- 2020/08/23 **(NEW!)** Add MBMelGAN G + ParallelWaveGAN G example. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/multiband_pwgan)
- 2020/08/20 **(NEW!)** Add C++ inference code. Thank [@ZDisket](https://github.com/ZDisket). See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/cppwin)
- 2020/08/18 **(NEW!)** Update [new base processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/base_processor.py). Add [AutoProcessor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/inference/auto_processor.py) and [pretrained processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/pretrained/) json file.
- 2020/08/14 **(NEW!)** Support Chinese TTS. Pls see the [colab](https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing). Thank [@azraelkuan](https://github.com/azraelkuan).
- 2020/08/05 **(NEW!)** Support Korean TTS. Pls see the [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing). Thank [@crux153](https://github.com/crux153).
- 2020/07/17 Support MultiGPU for all Trainer.
- 2020/07/05 Support Convert Tacotron-2, FastSpeech to Tflite. Pls see the [colab](https://colab.research.google.com/drive/1HudLLpT9CQdh2k04c06bHUwLubhGTWxA?usp=sharing). Thank @jaeyoo from the TFlite team for his support.
- 2020/06/20 [FastSpeech2](https://arxiv.org/abs/2006.04558) implementation with Tensorflow is supported.
- 2020/06/07 [Multi-band MelGAN (MB MelGAN)](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/multiband_melgan/) implementation with Tensorflow is supported.
## Features
- High performance on Speech Synthesis.
- Be able to fine-tune on other languages.
- Fast, Scalable, and Reliable.
- Suitable for deployment.
- Easy to implement a new model, based on abstract classes.
- Mixed precision to speed up training where possible.
- Support both Single/Multi GPU in base trainer class.
- TFlite conversion for all supported models.
- Android example.
- Support many languages (currently Chinese, Korean, and English).
- Support C++ inference.
- Support converting weights for some models from PyTorch to TensorFlow to speed things up.
## Requirements
This repository is tested on Ubuntu 18.04 with:
- Python 3.7+
- Cuda 10.1
- CuDNN 7.6.5
- Tensorflow 2.2/2.3
- [Tensorflow Addons](https://github.com/tensorflow/addons) >= 0.10.0
Other TensorFlow versions should work but have not been tested yet. This repo will try to work with the latest stable TensorFlow version. **We recommend installing TensorFlow 2.3.0 for training in case you want to use MultiGPU.**
## Installation
### With pip
```bash
$ pip install TensorFlowTTS
```
### From source
Examples are included in the repository but are not shipped with the framework. Therefore, to run the latest version of the examples, you need to install from source as below.
```bash
$ git clone https://github.com/TensorSpeech/TensorFlowTTS.git
$ cd TensorFlowTTS
$ pip install .
```
If you want to upgrade the repository and its dependencies:
```bash
$ git pull
$ pip install --upgrade .
```
# Supported Model architectures
TensorFlowTTS currently provides the following architectures:
1. **MelGAN** released with the paper [MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis](https://arxiv.org/abs/1910.06711) by Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, Aaron Courville.
2. **Tacotron-2** released with the paper [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884) by Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu.
3. **FastSpeech** released with the paper [FastSpeech: Fast, Robust, and Controllable Text to Speech](https://arxiv.org/abs/1905.09263) by Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
4. **Multi-band MelGAN** released with the paper [Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech](https://arxiv.org/abs/2005.05106) by Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie.
5. **FastSpeech2** released with the paper [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558) by Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
6. **Parallel WaveGAN** released with the paper [Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480) by Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim.
We are also implementing some techniques to improve quality and convergence speed from the following papers:
1. **Guided Attention Loss** released with the paper [Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention](https://arxiv.org/abs/1710.08969) by Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara.
# Audio Samples
Here are audio samples on the validation set: [tacotron-2](https://drive.google.com/open?id=1kaPXRdLg9gZrll9KtvH3-feOBMM8sn3_), [fastspeech](https://drive.google.com/open?id=1f69ujszFeGnIy7PMwc8AkUckhIaT2OD0), [melgan](https://drive.google.com/open?id=1mBwGVchwtNkgFsURl7g4nMiqx4gquAC2), [melgan.stft](https://drive.google.com/open?id=1xUkDjbciupEkM3N4obiJAYySTo6J9z6b), [fastspeech2](https://drive.google.com/drive/u/1/folders/1NG7oOfNuXSh7WyAoM1hI8P5BxDALY_mU), [multiband_melgan](https://drive.google.com/drive/folders/1DCV3sa6VTyoJzZmKATYvYVDUAFXlQ_Zp)
# Tutorial End-to-End
## Prepare Dataset
Prepare a dataset in the following format:
```
|- [NAME_DATASET]/
| |- metadata.csv
| |- wav/
| |- file1.wav
| |- ...
```
Where `metadata.csv` has the following format: `id|transcription`. This is an ljspeech-like format; you can skip the preprocessing steps if your dataset is in a different format.
Note that `NAME_DATASET` should be one of `ljspeech`, `kss`, `baker`, or `libritts`.
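For reference, a minimal `metadata.csv` in this `id|transcription` format might look like this (contents illustrative):
```
file1|Printing, in the only sense with which we are at present concerned.
file2|The quick brown fox jumps over the lazy dog.
```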
## Preprocessing
The preprocessing has two steps:
1. Preprocess audio features
- Convert characters to IDs
- Compute mel spectrograms
- Normalize mel spectrograms to [-1, 1] range
- Split the dataset into train and validation
- Compute the mean and standard deviation of multiple features from the **training** split
2. Standardize mel spectrogram based on computed statistics
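Step 2 is a per-dimension z-score using the training-split statistics. A minimal sketch, assuming `stats.npy` stores the mel mean and scale as two stacked arrays:
```python
import numpy as np

mean, scale = np.load("./dump_ljspeech/stats.npy")  # assumed layout: [mean, scale]
raw_feats = np.load("./dump_ljspeech/train/raw-feats/LJ001-0001-raw-feats.npy")
norm_feats = (raw_feats - mean) / scale  # standardize each mel dimension
```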
To reproduce the steps above:
```
tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker/libritts]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
```
Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar), and [`libritts`](http://www.openslr.org/60/) for the dataset argument. In the future, we intend to support more datasets.
**Note**: To run `libritts` preprocessing, please first read the instructions in [examples/fastspeech2_libritts](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts). The data needs to be reformatted before running preprocessing.
After preprocessing, the structure of the project folder should be:
```
|- [NAME_DATASET]/
| |- metadata.csv
| |- wav/
| |- file1.wav
| |- ...
|- dump_[ljspeech/kss/baker/libritts]/
| |- train/
| |- ids/
| |- LJ001-0001-ids.npy
| |- ...
| |- raw-feats/
| |- LJ001-0001-raw-feats.npy
| |- ...
| |- raw-f0/
| |- LJ001-0001-raw-f0.npy
| |- ...
| |- raw-energies/
| |- LJ001-0001-raw-energy.npy
| |- ...
| |- norm-feats/
| |- LJ001-0001-norm-feats.npy
| |- ...
| |- wavs/
| |- LJ001-0001-wave.npy
| |- ...
| |- valid/
| |- ids/
| |- LJ001-0009-ids.npy
| |- ...
| |- raw-feats/
| |- LJ001-0009-raw-feats.npy
| |- ...
| |- raw-f0/
| |- LJ001-0001-raw-f0.npy
| |- ...
| |- raw-energies/
| |- LJ001-0001-raw-energy.npy
| |- ...
| |- norm-feats/
| |- LJ001-0009-norm-feats.npy
| |- ...
| |- wavs/
| |- LJ001-0009-wave.npy
| |- ...
| |- stats.npy
| |- stats_f0.npy
| |- stats_energy.npy
| |- train_utt_ids.npy
| |- valid_utt_ids.npy
|- examples/
| |- melgan/
| |- fastspeech/
| |- tacotron2/
| ...
```
- `stats.npy` contains the mean and std from the training split mel spectrograms
- `stats_energy.npy` contains the mean and std of energy values from the training split
- `stats_f0.npy` contains the mean and std of F0 values in the training split
- `train_utt_ids.npy` / `valid_utt_ids.npy` contain the training and validation utterance IDs, respectively
We use a suffix (`ids`, `raw-feats`, `raw-energy`, `raw-f0`, `norm-feats`, and `wave`) for each input type.
**IMPORTANT NOTES**:
- This preprocessing step is based on [ESPnet](https://github.com/espnet/espnet) so you can combine all models here with other models from ESPnet repository.
- Regardless of how your dataset is formatted, the final structure of the `dump` folder **SHOULD** follow the above structure to be able to use the training script, or you can modify it by yourself 😄.
## Training models
To learn how to train a model from scratch or fine-tune on other datasets/languages, please see the details in the examples directory.
- For Tacotron-2 tutorial, pls see [examples/tacotron2](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/tacotron2)
- For FastSpeech tutorial, pls see [examples/fastspeech](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/fastspeech)
- For FastSpeech2 tutorial, pls see [examples/fastspeech2](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/fastspeech2)
- For FastSpeech2 + MFA tutorial, pls see [examples/fastspeech2_libritts](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts)
- For MelGAN tutorial, pls see [examples/melgan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/melgan)
- For MelGAN + STFT Loss tutorial, pls see [examples/melgan.stft](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/melgan.stft)
- For Multiband-MelGAN tutorial, pls see [examples/multiband_melgan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_melgan)
- For Parallel WaveGAN tutorial, pls see [examples/parallel_wavegan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
- For Multiband-MelGAN Generator + Parallel WaveGAN Discriminator tutorial, pls see [examples/multiband_pwgan](https://github.com/tensorspeech/TensorFlowTTS/tree/master/examples/multiband_pwgan)
# Abstract Class Explanation
## Abstract DataLoader Tensorflow-based dataset
A detailed implementation of the abstract dataset class is in [tensorflow_tts/dataset/abstract_dataset](https://github.com/tensorspeech/TensorFlowTTS/blob/master/tensorflow_tts/datasets/abstract_dataset.py). There are some functions you need to override and understand:
1. **get_args**: This function returns the arguments for the **generator** function; normally these are the utt_ids.
2. **generator**: This function takes its inputs from the **get_args** function and yields the inputs for the model. **Note that every generator function returns a dictionary whose keys exactly match the model's parameters, because base_trainer uses model(\*\*batch) for the forward step.**
3. **get_output_dtypes**: This function returns the dtype of each element yielded by the **generator** function.
4. **get_len_dataset**: Returns the length of the dataset; normally len(utt_ids).
**IMPORTANT NOTES**:
- A pipeline for creating a dataset should be: cache -> shuffle -> map_fn -> get_batch -> prefetch.
- If you shuffle before caching, the dataset won't reshuffle when you re-iterate over it.
- You should apply map_fn so that every element returned by the **generator** function has the same length before batching and feeding it into the model.
Some examples that use this **abstract_dataset** are [tacotron_dataset.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/tacotron2/tacotron_dataset.py), [fastspeech_dataset.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/fastspeech/fastspeech_dataset.py), [melgan_dataset.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/melgan/audio_mel_dataset.py), [fastspeech2_dataset.py](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/fastspeech2/fastspeech2_dataset.py); a minimal sketch of the interface follows.
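The sketch below illustrates that four-function interface serving dummy mel features. It is illustrative only: the class name, keys, and shapes are made up, and in practice you would subclass `AbstractDataset` and reuse its dataset-creation pipeline.
```python
import numpy as np
import tensorflow as tf

class DummyMelDataset:  # in practice: subclass tensorflow_tts.datasets.abstract_dataset.AbstractDataset
    def __init__(self, utt_ids, n_mels=80):
        self.utt_ids = utt_ids
        self.n_mels = n_mels

    def get_args(self):
        # Arguments handed to generator(); normally the utterance IDs.
        return [self.utt_ids]

    def generator(self, utt_ids):
        for utt_id in utt_ids:
            mel = np.random.randn(120, self.n_mels).astype(np.float32)
            # Keys must exactly match the model's parameters: base_trainer calls model(**batch).
            yield {"utt_ids": utt_id, "mel_gts": mel}

    def get_output_dtypes(self):
        return {"utt_ids": tf.string, "mel_gts": tf.float32}

    def get_len_dataset(self):
        return len(self.utt_ids)
```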
## Abstract Trainer Class
A detailed implementation of the base_trainer is in [tensorflow_tts/trainer/base_trainer.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/tensorflow_tts/trainers/base_trainer.py). It includes [Seq2SeqBasedTrainer](https://github.com/tensorspeech/TensorFlowTTS/blob/master/tensorflow_tts/trainers/base_trainer.py#L265) and [GanBasedTrainer](https://github.com/tensorspeech/TensorFlowTTS/blob/master/tensorflow_tts/trainers/base_trainer.py#L149), both inheriting from [BasedTrainer](https://github.com/tensorspeech/TensorFlowTTS/blob/master/tensorflow_tts/trainers/base_trainer.py#L16). All trainers support both single and multi GPU. There are some functions you **MUST** override when implementing a new trainer:
- **compile**: This function defines the models and losses.
- **generate_and_save_intermediate_result**: This function saves intermediate results such as alignment plots, generated audio, and mel-spectrogram plots.
- **compute_per_example_losses**: This function computes the per-example losses for the model; note that every loss element **MUST** have shape [batch_size].
All models in this repo are trained with **GanBasedTrainer** (see [train_melgan.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/melgan/train_melgan.py), [train_melgan_stft.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/melgan.stft/train_melgan_stft.py), [train_multiband_melgan.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/multiband_melgan/train_multiband_melgan.py)) or **Seq2SeqBasedTrainer** (see [train_tacotron2.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/tacotron2/train_tacotron2.py), [train_fastspeech.py](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/fastspeech/train_fastspeech.py)).
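As rough orientation, the sketch below shows the three required overrides for a hypothetical mel-regression trainer. The method names come from the list above; everything else (the `compile` signature, the batch keys, the returned loss dict) is an assumption, not the repo's actual API.
```python
import tensorflow as tf

class SketchTrainer:  # in practice: subclass Seq2SeqBasedTrainer or GanBasedTrainer
    def compile(self, model, optimizer):
        # Define the model, optimizer, and unreduced losses.
        self.model = model
        self.optimizer = optimizer
        self.mae = tf.keras.losses.MeanAbsoluteError(
            reduction=tf.keras.losses.Reduction.NONE  # keep per-example values
        )

    def compute_per_example_losses(self, batch, outputs):
        # Every loss element must have shape [batch_size].
        per_example_loss = tf.reduce_mean(self.mae(batch["mel_gts"], outputs), axis=-1)
        return {"mel_loss": per_example_loss}

    def generate_and_save_intermediate_result(self, batch):
        # e.g. plot predicted mel-spectrograms or save generated audio.
        pass
```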
# End-to-End Examples
You can learn how to run inference for each model in the [notebooks](https://github.com/tensorspeech/TensorFlowTTS/tree/master/notebooks), or see a [colab](https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing) (for English) or [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing) (for Korean). Here is example code for end-to-end inference with FastSpeech and MelGAN.
```python
import numpy as np
import soundfile as sf
import yaml
import tensorflow as tf
from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor
# initialize fastspeech model.
fs_config = AutoConfig.from_pretrained('./examples/fastspeech/conf/fastspeech.v1.yaml')
fastspeech = TFAutoModel.from_pretrained(
    config=fs_config,
    pretrained_path="./examples/fastspeech/pretrained/model-195000.h5"
)

# initialize melgan model
melgan_config = AutoConfig.from_pretrained('./examples/melgan/conf/melgan.v1.yaml')
melgan = TFAutoModel.from_pretrained(
    config=melgan_config,
    pretrained_path="./examples/melgan/checkpoint/generator-1500000.h5"
)

# inference
processor = AutoProcessor.from_pretrained(pretrained_path="./test/files/ljspeech_mapper.json")
ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
ids = tf.expand_dims(ids, 0)

# fastspeech inference
masked_mel_before, masked_mel_after, duration_outputs = fastspeech.inference(
    ids,
    speaker_ids=tf.zeros(shape=[tf.shape(ids)[0]], dtype=tf.int32),
    speed_ratios=tf.constant([1.0], dtype=tf.float32)
)
# melgan inference
audio_before = melgan.inference(masked_mel_before)[0, :, 0]
audio_after = melgan.inference(masked_mel_after)[0, :, 0]
# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")
```
# Contact
[Minh Nguyen Quan Anh](https://github.com/tensorspeech): nguyenquananhminh@gmail.com, [erogol](https://github.com/erogol): erengolge@gmail.com, [Kuan Chen](https://github.com/azraelkuan): azraelkuan@gmail.com, [Dawid Kobus](https://github.com/machineko): machineko@protonmail.com, [Takuya Ebata](https://github.com/MokkeMeguru): meguru.mokke@gmail.com, [Trinh Le Quang](https://github.com/l4zyf9x): trinhle.cse@gmail.com, [Yunchao He](https://github.com/candlewill): yunchaohe@gmail.com, [Alejandro Miguel Velasquez](https://github.com/ZDisket): xml506ok@gmail.com
# License
Overall, almost all models here are licensed under [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0) for all countries in the world, except in **Viet Nam**, where this framework cannot be used in production in any way without permission from TensorFlowTTS's authors. There is one exception: Tacotron-2 can be used for any purpose. If you are Vietnamese and want to use this framework for production, you **must** contact us in advance.
# Acknowledgement
We want to thank [Tomoki Hayashi](https://github.com/kan-bayashi), who discussed MelGAN, Multi-band MelGAN, FastSpeech, and Tacotron with us at length. This framework is based on his great open-source [ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) project.
================================================
FILE: TensorFlowTTS/setup.cfg
================================================
[aliases]
test=pytest
[tool:pytest]
addopts = --verbose --durations=0
testpaths = test
[flake8]
ignore = H102,W504,H238,D104,H306,H405,D205
# 120 is a workaround, 79 is good
max-line-length = 120
================================================
FILE: TensorFlowTTS/setup.py
================================================
"""Setup Tensorflow TTS libarary."""
import os
import sys
from distutils.version import LooseVersion
import pip
from setuptools import find_packages, setup
if LooseVersion(sys.version) < LooseVersion("3.6"):
    raise RuntimeError(
        "Tensorflow TTS requires python >= 3.6, "
        "but your Python version is {}".format(sys.version)
    )

if LooseVersion(pip.__version__) < LooseVersion("19"):
    raise RuntimeError(
        "pip>=19.0.0 is required, but your pip version is {}. "
        'Try again after "pip install -U pip"'.format(pip.__version__)
    )
# TODO(@dathudeptrai) update requirement if needed.
requirements = {
    "install": [
        "tensorflow-gpu==2.6.0",
        "tensorflow-addons==0.14.0",
        "keras==2.6.0",
        "setuptools>=38.5.1",
        "librosa>=0.7.0",
        "soundfile>=0.10.2",
        "matplotlib>=3.1.0",
        "PyYAML>=3.12",
        "tqdm>=4.26.1",
        "h5py>=2.10.0",
        "unidecode>=1.1.1",
        "inflect>=4.1.0",
        "scikit-learn>=0.22.0",
        "pyworld>=0.2.10",
        "numba<=0.48",  # Fix No module named "numba.decorators"
        "jamo>=0.4.1",
        "pypinyin",
        "g2pM",
        "textgrid",
        "click",
        "g2p_en",
        "dataclasses",
        "pysptk",
        "webrtcvad",
    ],
    "setup": ["numpy", "pytest-runner"],
    "test": [
        "pytest>=3.3.0",
        "hacking>=1.1.0",
    ],
}

# TODO(@dathudeptrai) update console_scripts.
entry_points = {
    "console_scripts": [
        "tensorflow-tts-preprocess-unetts-duration=tensorflow_tts.bin.preprocess_unetts:preprocess_duration",
        "tensorflow-tts-preprocess-unetts-acous=tensorflow_tts.bin.preprocess_unetts:preprocess_acous",
        "tensorflow-tts-preprocess-unetts-vocoder=tensorflow_tts.bin.preprocess_unetts:preprocess_vocoder",
    ]
}

install_requires = requirements["install"]
setup_requires = requirements["setup"]
tests_require = requirements["test"]
extras_require = {
    k: v for k, v in requirements.items() if k not in ["install", "setup"]
}

dirname = os.path.dirname(__file__)
setup(
    name="TensorFlowTTS",
    version="0.0.0",
    url="https://github.com/tensorspeech/TensorFlowTTS",
    author="Minh Nguyen Quan Anh, Eren Gölge, Kuan Chen, Dawid Kobus, Takuya Ebata, Trinh Le Quang, Yunchao He, Alejandro Miguel Velasquez",
    author_email="nguyenquananhminh@gmail.com",
    description="TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2",
    long_description=open(os.path.join(dirname, "README.md"), encoding="utf-8").read(),
    long_description_content_type="text/markdown",
    license="Apache-2.0",
    packages=find_packages(include=["tensorflow_tts*"]),
    install_requires=install_requires,
    setup_requires=setup_requires,
    tests_require=tests_require,
    extras_require=extras_require,
    entry_points=entry_points,
    classifiers=[
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Intended Audience :: Science/Research",
        "Operating System :: POSIX :: Linux",
        "License :: OSI Approved :: Apache Software License",
        "Topic :: Software Development :: Libraries :: Python Modules",
    ],
)
================================================
FILE: TensorFlowTTS/tensorflow_tts/__init__.py
================================================
__version__ = "0.0"
================================================
FILE: TensorFlowTTS/tensorflow_tts/audio_process/__init__.py
================================================
from tensorflow_tts.audio_process.audio import preprocess_wav, melbasis_make, mel_make
from tensorflow_tts.audio_process import audio_spec
================================================
FILE: TensorFlowTTS/tensorflow_tts/audio_process/audio.py
================================================
import struct
from pathlib import Path
from typing import Optional, Union
import librosa
import numpy as np
from scipy.ndimage.morphology import binary_dilation
try:
    import webrtcvad
except ImportError:
    print("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
    webrtcvad = None
# ## Voice Activation Detection
# # Window size of the VAD. Must be either 10, 20 or 30 milliseconds.
# # This sets the granularity of the VAD. Should not need to be changed.
# vad_window_length = 30 # In milliseconds
# # Number of frames to average together when performing the moving average smoothing.
# # The larger this value, the larger the VAD variations must be to not get smoothed out.
# vad_moving_average_width = 8
# # Maximum number of consecutive silent frames a segment can have.
# vad_max_silence_length = 6
int16_max = (2 ** 15) - 1
sampling_rate = 16000
def preprocess_wav(fpath_or_wav: Union[str, Path, np.ndarray],
                   source_sr: Optional[int] = None,
                   normalize: Optional[bool] = True,
                   trim_silence: Optional[bool] = True,
                   is_sil_pad: Optional[bool] = True,
                   vad_window_length=30,
                   vad_moving_average_width=8,
                   vad_max_silence_length=6):
    """
    Applies the preprocessing operations used in training the Speaker Encoder to a waveform
    either on disk or in memory. The waveform will be resampled to match the data hyperparameters.

    :param fpath_or_wav: either a filepath to an audio file (many extensions are supported, not
        just .wav), or the waveform as a numpy array of floats.
    :param source_sr: if passing an audio waveform, the sampling rate of the waveform before
        preprocessing. After preprocessing, the waveform's sampling rate will match the data
        hyperparameters. If passing a filepath, the sampling rate will be automatically detected and
        this argument will be ignored.
    """
    # Load the wav from disk if needed
    if isinstance(fpath_or_wav, str) or isinstance(fpath_or_wav, Path):
        wav, source_sr = librosa.load(str(fpath_or_wav), sr=None)
    else:
        wav = fpath_or_wav

    # Resample the wav if needed
    if source_sr is not None and source_sr != sampling_rate:
        wav = librosa.resample(wav, source_sr, sampling_rate)

    # Apply the preprocessing: normalize volume and shorten long silences
    if normalize:
        wav = normalize_volume(wav)
    if trim_silence:
        wav = trim_long_silences(wav, vad_window_length, vad_moving_average_width, vad_max_silence_length)
    if is_sil_pad:
        wav = sil_pad(wav)
    return wav
def normalize_volume(wav, ratio=0.6):
    return wav / np.max(np.abs(wav)) * ratio


def sil_pad(wav, pad_length=100):
    pad_length = int(sampling_rate / 1000 * pad_length)
    return np.pad(wav, (pad_length, pad_length))
def trim_long_silences(wav, vad_window_length, vad_moving_average_width, vad_max_silence_length):
    """
    Ensures that segments without voice in the waveform remain no longer than a
    threshold determined by the VAD parameters in params.py.

    :param wav: the raw waveform as a numpy array of floats
    :return: the same waveform with silences trimmed away (length <= original wav length)
    """
    # Compute the voice detection window size
    samples_per_window = (vad_window_length * sampling_rate) // 1000

    # Trim the end of the audio to have a multiple of the window size
    wav = wav[:len(wav) - (len(wav) % samples_per_window)]

    # Convert the float waveform to 16-bit mono PCM
    pcm_wave = struct.pack("%dh" % len(wav), *(np.round(wav * int16_max)).astype(np.int16))

    # Perform voice activation detection
    voice_flags = []
    vad = webrtcvad.Vad(mode=3)
    for window_start in range(0, len(wav), samples_per_window):
        window_end = window_start + samples_per_window
        voice_flags.append(vad.is_speech(pcm_wave[window_start * 2:window_end * 2],
                                         sample_rate=sampling_rate))
    voice_flags = np.array(voice_flags)

    # Smooth the voice detection with a moving average
    def moving_average(array, width):
        array_padded = np.concatenate((np.zeros((width - 1) // 2), array, np.zeros(width // 2)))
        ret = np.cumsum(array_padded, dtype=float)
        ret[width:] = ret[width:] - ret[:-width]
        return ret[width - 1:] / width

    audio_mask = moving_average(voice_flags, vad_moving_average_width)
    audio_mask = np.round(audio_mask).astype(bool)  # np.bool is removed in recent NumPy

    # Dilate the voiced regions
    audio_mask = binary_dilation(audio_mask, np.ones(vad_max_silence_length + 1))
    audio_mask = np.repeat(audio_mask, samples_per_window)
    return wav[audio_mask]
def melbasis_make(sr=16000, n_fft=1024, n_mels=80, fmin=80, fmax=7600):
    return librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)


def mel_make(filepath: str, sr=16000, n_fft=1024, framesize=256, mel_basis=None, fn=None):
    if fn is None:
        audio, _ = librosa.load(filepath, sr=sr)
    else:
        audio = fn(filepath, trim_silence=False, is_sil_pad=False)
    D = librosa.stft(audio, n_fft=n_fft, hop_length=framesize)
    S, _ = librosa.magphase(D)
    if mel_basis is not None:  # truth-testing a NumPy array raises ValueError
        mel = np.log10(np.maximum(np.dot(mel_basis, S), 1e-10)).T
        return audio, mel
    else:
        return audio, S
================================================
FILE: TensorFlowTTS/tensorflow_tts/audio_process/audio_spec.py
================================================
import librosa
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal
import soundfile as sf
def preemphasis(wav, k, preemphasize=True):
    if preemphasize:
        return signal.lfilter([1, -k], [1], wav)
    return wav


def inv_preemphasis(wav, k, inv_preemphasize=True):
    if inv_preemphasize:
        return signal.lfilter([1], [1, -k], wav)
    return wav
class AudioMelSpec():
    '''
    Audio to Mel_Spec
    '''
    def __init__(
        self, sample_rate=16000, n_fft=800, num_mels=80, hop_size=200, win_size=800,
        fmin=55, fmax=7600, min_level_db=-100, ref_level_db=20, max_abs_value=4.,
        preemphasis=0.97, preemphasize=True,
        signal_normalization=True, allow_clipping_in_normalization=True, symmetric_mels=True,
        power=1.5, griffin_lim_iters=60,
        rescale=True, rescaling_max=0.9
    ):
        self.sample_rate = sample_rate
        self.n_fft = n_fft
        self.num_mels = num_mels
        self.hop_size = hop_size
        self.win_size = win_size
        self.fmin = fmin
        self.fmax = fmax
        self.min_level_db = min_level_db
        self.ref_level_db = ref_level_db
        self.max_abs_value = max_abs_value
        self.preemphasis = preemphasis
        self.preemphasize = preemphasize
        self.signal_normalization = signal_normalization
        self.symmetric_mels = symmetric_mels
        self.allow_clipping_in_normalization = allow_clipping_in_normalization
        self.power = power
        self.griffin_lim_iters = griffin_lim_iters
        self.rescale = rescale
        self.rescaling_max = rescaling_max
        self._mel_basis_create()
    def _mel_basis_create(self):
        self._mel_basis = librosa.filters.mel(self.sample_rate, self.n_fft, self.num_mels, self.fmin, self.fmax)
        self._inv_mel_basis = np.linalg.pinv(self._mel_basis)

    def _stft(self, y):
        return librosa.stft(y=y, n_fft=self.n_fft, hop_length=self.hop_size, win_length=self.win_size)

    def _istft(self, y):
        return librosa.istft(y, hop_length=self.hop_size, win_length=self.win_size)

    def _linear_to_mel(self, spectogram):
        return np.dot(self._mel_basis, spectogram)

    def _mel_to_linear(self, mel_spectrogram):
        return np.maximum(1e-10, np.dot(self._inv_mel_basis, mel_spectrogram))

    def _amp_to_db(self, x):
        min_level = np.exp(self.min_level_db / 20 * np.log(10))
        return 20 * np.log10(np.maximum(min_level, x))

    def _db_to_amp(self, x):
        return np.power(10.0, (x) * 0.05)

    def _normalize(self, S):
        if self.allow_clipping_in_normalization:
            if self.symmetric_mels:
                return np.clip((2 * self.max_abs_value) * ((S - self.min_level_db) / (-self.min_level_db)) - self.max_abs_value,
                               -self.max_abs_value, self.max_abs_value)
            else:
                return np.clip(self.max_abs_value * ((S - self.min_level_db) / (-self.min_level_db)), 0, self.max_abs_value)
        assert S.max() <= 0 and S.min() - self.min_level_db >= 0
        if self.symmetric_mels:
            return (2 * self.max_abs_value) * ((S - self.min_level_db) / (-self.min_level_db)) - self.max_abs_value
        else:
            return self.max_abs_value * ((S - self.min_level_db) / (-self.min_level_db))

    def _denormalize(self, D):
        if self.allow_clipping_in_normalization:
            if self.symmetric_mels:
                return (((np.clip(D, -self.max_abs_value,
                                  self.max_abs_value) + self.max_abs_value) * -self.min_level_db / (2 * self.max_abs_value))
                        + self.min_level_db)
            else:
                return ((np.clip(D, 0, self.max_abs_value) * -self.min_level_db / self.max_abs_value) + self.min_level_db)
        if self.symmetric_mels:
            return (((D + self.max_abs_value) * -self.min_level_db / (2 * self.max_abs_value)) + self.min_level_db)
        else:
            return ((D * -self.min_level_db / self.max_abs_value) + self.min_level_db)

    def _griffin_lim(self, S):
        """librosa implementation of Griffin-Lim
        Based on https://github.com/librosa/librosa/issues/434
        """
        angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
        S_complex = np.abs(S).astype(complex)  # np.complex is removed in recent NumPy
        y = self._istft(S_complex * angles)
        for i in range(self.griffin_lim_iters):
            angles = np.exp(1j * np.angle(self._stft(y)))
            y = self._istft(S_complex * angles)
        return y
    def load_wav(self, wav_fpath):
        wav, _ = librosa.load(wav_fpath, sr=self.sample_rate)
        if self.rescale:
            wav = wav / np.abs(wav).max() * self.rescaling_max
        return wav

    def save_wav(self, wav, fpath):
        if self.rescale:
            wav = wav / np.abs(wav).max() * self.rescaling_max
        sf.write(fpath, wav, self.sample_rate, subtype="PCM_16")

    def melspectrogram(self, wav):
        D = self._stft(preemphasis(wav, self.preemphasis, self.preemphasize))
        S = self._amp_to_db(self._linear_to_mel(np.abs(D))) - self.ref_level_db
        if self.signal_normalization:
            return self._normalize(S.T)
        return S.T

    def inv_mel_spectrogram(self, mel_spectrogram):
        """Converts mel spectrogram to waveform using librosa"""
        if self.signal_normalization:
            D = self._denormalize(mel_spectrogram.T)
        else:
            D = mel_spectrogram.T
        S = self._mel_to_linear(self._db_to_amp(D + self.ref_level_db))  # Convert back to linear
        return inv_preemphasis(self._griffin_lim(S ** self.power), self.preemphasis, self.preemphasize)
    def compare_plot(self, targets, preds, filepath=None, frame_real_len=None, text=None):
        if frame_real_len:
            targets = targets[:frame_real_len]
            preds = preds[:frame_real_len]
        fig = plt.figure(figsize=(14, 10))
        if text:
            fig.text(0.4, 0.48, text, horizontalalignment="center", fontsize=16)
        ax1 = fig.add_subplot(211)
        ax2 = fig.add_subplot(212)
        im = ax1.imshow(targets.T, aspect='auto', origin="lower", interpolation="none")
        ax1.set_title("Target Mel-Spectrogram")
        fig.colorbar(mappable=im, shrink=0.65, ax=ax1)
        im = ax2.imshow(preds.T, aspect='auto', origin="lower", interpolation="none")
        ax2.set_title("Pred Mel-Spectrogram")
        fig.colorbar(mappable=im, shrink=0.65, ax=ax2)
        plt.tight_layout()
        if filepath is None:
            plt.show()
        else:
            plt.savefig(filepath)
        plt.close()

    def melspec_plot(self, mels):
        plt.figure(figsize=(10, 6))
        plt.imshow(mels.T, aspect='auto', origin="lower", interpolation="none")
        plt.colorbar()
        plt.show()
class AudioSpec():
    ''' # TODO
    Now just for sqrt(sp) from world
    '''
    def __init__(self, sr, nfft, mel_dim=80, f0_min=71, f0_max=7800,
                 min_level_db=-120., ref_level_db=-5., max_abs_value=4.,
                 is_norm=True, is_symmetric=True, is_clipping_in_normalization=False):
        self.sr = sr
        self.nfft = nfft
        self.mel_dim = mel_dim
        self.f0_min = f0_min
        self.f0_max = f0_max
        self.min_level_db = min_level_db
        self.min_level_amp = np.exp((self.min_level_db + 0.1) / 20 * np.log(10))
        # For sp from world, self.ref_level_db should be less than zero;
        # otherwise, is_clipping_in_normalization should be true
        self.ref_level_db = ref_level_db
        self.max_abs_value = max_abs_value
        self.is_norm = is_norm
        self.is_symmetric = is_symmetric
        self.is_clipping_in_normalization = is_clipping_in_normalization
        if self.ref_level_db > 0.:
            try:
                assert self.is_norm and self.is_clipping_in_normalization
            except AssertionError:
                self.is_clipping_in_normalization = True
        self._mel_basis_create()

    def _mel_basis_create(self):
        self._mel_basis = librosa.filters.mel(self.sr, self.nfft, self.mel_dim, self.f0_min, self.f0_max)
        self._inv_mel_basis = np.linalg.pinv(self._mel_basis)
    def _normalize(self, log_sepc, is_symmetric, is_clipping_in_normalization):
        if is_clipping_in_normalization:
            if is_symmetric:
                return np.clip((2 * self.max_abs_value) * ((log_sepc - self.min_level_db) / (-self.min_level_db)) - self.max_abs_value,
                               -self.max_abs_value, self.max_abs_value)
            else:
                return np.clip(self.max_abs_value * ((log_sepc - self.min_level_db) / (-self.min_level_db)), 0, self.max_abs_value)
        assert log_sepc.max() <= 0 and log_sepc.min() >= self.min_level_db
        if is_symmetric:
            return (2 * self.max_abs_value) * ((log_sepc - self.min_level_db) / (-self.min_level_db)) - self.max_abs_value
        else:
            return self.max_abs_value * ((log_sepc - self.min_level_db) / (-self.min_level_db))

    def _denormalize(self, log_sepc, is_symmetric, is_clipping_in_normalization):
        if is_clipping_in_normalization:
            if is_symmetric:
                return (((np.clip(log_sepc, -self.max_abs_value,
                                  self.max_abs_value) + self.max_abs_value) * -self.min_level_db / (2 * self.max_abs_value))
                        + self.min_level_db)
            else:
                return ((np.clip(log_sepc, 0, self.max_abs_value) * -self.min_level_db / self.max_abs_value) + self.min_level_db)
        if is_symmetric:
            return (((log_sepc + self.max_abs_value) * -self.min_level_db / (2 * self.max_abs_value)) + self.min_level_db)
        else:
            return ((log_sepc * -self.min_level_db / self.max_abs_value) + self.min_level_db)

    def ampspec2logspec(self, amp_spec):
        mel_spec = np.dot(amp_spec, self._mel_basis.T)
        log_sepc = 20 * np.log10(np.maximum(self.min_level_amp, mel_spec)) - self.ref_level_db
        if self.is_norm:
            log_sepc = self._normalize(log_sepc, self.is_symmetric, self.is_clipping_in_normalization)
        return log_sepc

    def logspec2ampspec(self, log_spec):
        if self.is_norm:
            log_spec = self._denormalize(log_spec, self.is_symmetric, self.is_clipping_in_normalization)
        log_spec += self.ref_level_db
        amp_spec = np.maximum(self.min_level_amp**2, np.dot(np.power(10.0, log_spec * 0.05), self._inv_mel_basis.T))
        return amp_spec
class VariableNormProcess():
'''
Variable, like duration, f0 and bap from world
'''
def __init__(self, var_min, var_max, max_abs_value=4.0, is_symmetric=True):
self.var_min = var_min
self.var_max = var_max
self.scale = var_max - var_min
self.max_abs_value = max_abs_value
self.is_symmetric = is_symmetric
assert self.scale > 0
def normalize(self, var):
if self.is_symmetric:
return np.clip((2 * self.max_abs_value) * ((var - self.var_min) / self.scale) - self.max_abs_value,
-self.max_abs_value, self.max_abs_value)
else:
return np.clip(self.max_abs_value * ((var - self.var_min) / self.scale), 0, self.max_abs_value)
def denormalize(self, nvar):
if self.is_symmetric:
return (((np.clip(nvar, -self.max_abs_value, self.max_abs_value)
+ self.max_abs_value) * self.scale / (2 * self.max_abs_value))
+ self.var_min)
else:
return ((np.clip(nvar, 0, self.max_abs_value) * self.scale / self.max_abs_value) + self.var_min)
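# --- Usage sketch (not part of the original file) ---
# A minimal round trip through VariableNormProcess, assuming phone durations
# in the range [1, 20] frames. Values outside [var_min, var_max] are clipped,
# so normalize/denormalize is only an exact inverse inside that range.
def _demo_variable_norm():
    proc = VariableNormProcess(var_min=1.0, var_max=20.0, max_abs_value=4.0, is_symmetric=True)
    durs = np.array([2.0, 5.0, 12.0, 20.0])
    ndurs = proc.normalize(durs)        # mapped into [-4, 4]
    restored = proc.denormalize(ndurs)  # back to the original scale
    assert np.allclose(durs, restored)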
================================================
FILE: TensorFlowTTS/tensorflow_tts/bin/__init__.py
================================================
================================================
FILE: TensorFlowTTS/tensorflow_tts/bin/preprocess_unetts.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Perform preprocessing, with raw feature extraction and normalization of train/valid split."""
import argparse
import logging
import os
import yaml
import numpy as np
from functools import partial
from multiprocessing import Pool
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from tensorflow_tts.processor.multispk_voiceclone import MultiSPKVoiceCloneProcessor
from tensorflow_tts.processor.multispk_voiceclone import AISHELL_CHN_SYMBOLS
from tensorflow_tts.audio_process.audio_spec import AudioMelSpec
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import random
import tensorflow as tf
SEED = 2021
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
_feats_handle = None
def parse_and_config():
"""Parse arguments and set configuration parameters."""
parser = argparse.ArgumentParser(
description="Preprocess audio and text features "
"(See detail in tensorflow_tts/bin/preprocess_dataset.py)."
)
parser.add_argument(
"--rootdir",
default=None,
type=str,
required=True,
help="Directory containing the dataset files.",
)
parser.add_argument(
"--outdir",
default=None,
type=str,
required=True,
help="Output directory where features will be saved.",
)
parser.add_argument(
"--dataset",
type=str,
default="multispk_voiceclone",
choices=["multispk_voiceclone"],
help="Dataset to preprocess.",
)
parser.add_argument(
"--during_train",
type=int,
default=0,
choices=[0, 1],
help="0-False, 1-True: trainging during model"
)
parser.add_argument(
"--all_train",
type=int,
default=0,
choices=[0, 1],
help="0-False, 1-True: trainging f0 model"
)
parser.add_argument(
"--mfaed_txt",
type=str,
default=None,
required=True,
help="mfa results txt"
)
parser.add_argument(
"--wavs_dir",
type=str,
default=None,
required=True,
help="wav dir"
)
parser.add_argument(
"--spkinfo_dir",
type=str,
default=None,
required=True,
help="spkinfo dir"
)
parser.add_argument(
"--embed_dir",
type=str,
default=None,
required=True,
help="embed dir"
)
parser.add_argument(
"--unseen_dir",
type=str,
default=None,
required=True,
help="unseen speaker dir"
)
parser.add_argument(
"--config", type=str, required=True, help="YAML format configuration file."
)
parser.add_argument(
"--n_cpus",
type=int,
default=4,
required=False,
help="Number of CPUs to use in parallel.",
)
parser.add_argument(
"--test_size",
type=float,
default=0.05,
required=False,
help="Proportion of files to use as test dataset.",
)
parser.add_argument(
"--verbose",
type=int,
default=0,
choices=[0, 1, 2],
help="Logging level. 0: DEBUG, 1: INFO and WARNING, 2: INFO, WARNING, and ERROR",
)
args = parser.parse_args()
# set logger
FORMAT = "%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
log_level = {0: logging.DEBUG, 1: logging.WARNING, 2: logging.ERROR}
logging.basicConfig(level=log_level[args.verbose], format=FORMAT)
# load config
config = yaml.load(open(args.config), Loader=yaml.Loader)
config.update(vars(args))
# config checks
assert config["format"] == "npy", "'npy' is the only supported format."
return config
'''
###############################################################################
############################# Duration #######################################
###############################################################################
'''
def preprocess_duration():
"""Run preprocessing process and compute statistics for normalizing."""
config = parse_and_config()
dataset_processor = {
"multispk_voiceclone": MultiSPKVoiceCloneProcessor,
}
dataset_symbol = {
"multispk_voiceclone": AISHELL_CHN_SYMBOLS,
}
dataset_cleaner = {
"multispk_voiceclone": None,
}
logging.info(f"Selected '{config['dataset']}' processor.")
processor = dataset_processor[config["dataset"]](
config["rootdir"],
symbols = dataset_symbol[config["dataset"]],
cleaner_names = dataset_cleaner[config["dataset"]],
during_train = True if config["during_train"] else False,
mfaed_txt = config["mfaed_txt"],
wavs_dir = config["wavs_dir"],
embed_dir = config["embed_dir"],
spkinfo_dir = config["spkinfo_dir"],
unseen_dir = config["unseen_dir"]
)
# check output directories
build_dir = lambda x: [
os.makedirs(os.path.join(config["outdir"], x, y), exist_ok=True)
for y in ["ids", "raw-durations", "stat-durations"]
]
build_dir("train")
build_dir("valid")
# save pretrained-processor to feature dir
processor._save_mapper(
os.path.join(config["outdir"], f"{config['dataset']}_mapper.json"),
extra_attrs_to_save={"pinyin_dict": processor.pinyin_dict}
if config["dataset"] == "multispk_voiceclone" else {},
)
# build train test split
_Y = [i[0] for i in processor.items]
train_split, valid_split = train_test_split(
processor.items,
test_size=config["test_size"],
random_state=42,
shuffle=True,
stratify=_Y
)
logging.info(f"Training items: {len(train_split)}")
logging.info(f"Validation items: {len(valid_split)}")
train_utt_ids = [x[1] for x in train_split]
valid_utt_ids = [x[1] for x in valid_split]
# save train and valid utt_ids to track later
np.save(os.path.join(config["outdir"], "train_utt_ids.npy"), train_utt_ids, allow_pickle=False)
np.save(os.path.join(config["outdir"], "valid_utt_ids.npy"), valid_utt_ids, allow_pickle=False)
config["none_pinyin_symnum"] = processor.none_pinyin_symnum
# define map iterator
def iterator_data(items_list):
for item in items_list:
yield processor.get_one_sample(item)
train_iterator_data = iterator_data(train_split)
valid_iterator_data = iterator_data(valid_split)
p = Pool(config["n_cpus"])
# preprocess train files and get statistics for normalizing
partial_fn = partial(gen_duration_features, config=config)
train_map = p.imap(
partial_fn,
tqdm(train_iterator_data, total=len(train_split), desc="[Preprocessing train]"),
chunksize=10,
)
for item in train_map:
save_duration_to_file(item, "train", config)
# preprocess valid files
partial_fn = partial(gen_duration_features, config=config)
valid_map = p.imap(
partial_fn,
tqdm(valid_iterator_data, total=len(valid_split), desc="[Preprocessing valid]"),
chunksize=10,
)
for item in valid_map:
save_duration_to_file(item, "valid", config)
"""
sample = {
"speaker_name": spkname,
"filename" : filename,
"wav_path" : wav_path,
"text_ids" : text_ids,
"durs" : durs,
"embed_path" : embed_path,
"rate" : self.target_rate,
}
"""
def gen_duration_features(item, config):
text_ids = item["text_ids"]
durs = item["durs"]
assert len(text_ids) == len(durs)
none_phnum = config["none_pinyin_symnum"]
shengmu = []
yunmu = []
is_shengmu = True
for t_id, dur in zip(text_ids, durs):
if t_id < none_phnum:
continue
if is_shengmu:
shengmu.append(dur)
is_shengmu = False
else:
yunmu.append(dur)
is_shengmu = True
assert len(shengmu) == len(yunmu)
dur_stats = np.array([np.mean(shengmu), np.std(shengmu), np.mean(yunmu), np.std(yunmu)])
item["text_ids"] = np.array(text_ids)
item["durs"] = np.array(durs)
item["dur_stats"] = dur_stats
return item
def save_duration_to_file(features, subdir, config):
filename = features["filename"]
if config["format"] == "npy":
save_list = [
(features["text_ids"], "ids", "ids", np.int32),
(features["durs"], "raw-durations", "raw-durations", np.float32),
(features["dur_stats"], "stat-durations", "stat-durations", np.float32),
]
for item, name_dir, name_file, fmt in save_list:
np.save(
os.path.join(
config["outdir"], subdir, name_dir, f"{filename}-{name_file}.npy"
),
item.astype(fmt),
allow_pickle=False,
)
else:
raise ValueError("'npy' is the only supported format.")
'''
###############################################################################
################################ Acous ########################################
###############################################################################
'''
def preprocess_acous():
"""Run preprocessing process and compute statistics for normalizing."""
config = parse_and_config()
dataset_processor = {
"multispk_voiceclone": MultiSPKVoiceCloneProcessor,
}
dataset_symbol = {
"multispk_voiceclone": AISHELL_CHN_SYMBOLS,
}
dataset_cleaner = {
"multispk_voiceclone": None,
}
logging.info(f"Selected '{config['dataset']}' processor.")
processor = dataset_processor[config["dataset"]](
config["rootdir"],
symbols = dataset_symbol[config["dataset"]],
cleaner_names = dataset_cleaner[config["dataset"]],
all_train = True if config["all_train"] else False,
mfaed_txt = config["mfaed_txt"],
wavs_dir = config["wavs_dir"],
embed_dir = config["embed_dir"],
spkinfo_dir = config["spkinfo_dir"],
unseen_dir = config["unseen_dir"]
)
# check output directories
build_dir = lambda x: [
os.makedirs(os.path.join(config["outdir"], x, y), exist_ok=True)
for y in ["ids", "raw-durations",
"raw-mels", "embeds"]
]
build_dir("train")
build_dir("valid")
# save pretrained-processor to feature dir
processor._save_mapper(
os.path.join(config["outdir"], f"{config['dataset']}_mapper.json"),
extra_attrs_to_save={"pinyin_dict": processor.pinyin_dict}
if config["dataset"] == "multispk_voiceclone" else {},
)
# build train test split
_Y = [i[0] for i in processor.items]
train_split, valid_split = train_test_split(
processor.items,
test_size=config["test_size"],
random_state=42,
shuffle=True,
stratify=_Y
)
logging.info(f"Training items: {len(train_split)}")
logging.info(f"Validation items: {len(valid_split)}")
train_utt_ids = [x[1] for x in train_split]
valid_utt_ids = [x[1] for x in valid_split]
# save train and valid utt_ids to track later
np.save(os.path.join(config["outdir"], "train_utt_ids.npy"), train_utt_ids, allow_pickle=False)
np.save(os.path.join(config["outdir"], "valid_utt_ids.npy"), valid_utt_ids, allow_pickle=False)
# config["none_pinyin_symnum"] = processor.none_pinyin_symnum
# define map iterator
def iterator_data(items_list):
for item in items_list:
yield processor.get_one_sample(item)
train_iterator_data = iterator_data(train_split)
valid_iterator_data = iterator_data(valid_split)
p = Pool(config["n_cpus"])
# preprocess train files and get statistics for normalizing
partial_fn = partial(gen_acous_features, config=config)
train_map = p.imap_unordered(
partial_fn,
tqdm(train_iterator_data, total=len(train_split), desc="[Preprocessing train]"),
chunksize=10,
)
for item in train_map:
save_acous_to_file(item, "train", config)
# preprocess valid files
partial_fn = partial(gen_acous_features, config=config)
valid_map = p.imap_unordered(
partial_fn,
tqdm(valid_iterator_data, total=len(valid_split), desc="[Preprocessing valid]"),
chunksize=10,
)
for item in valid_map:
save_acous_to_file(item, "valid", config)
"""
sample = {
"speaker_name": spkname,
"filename" : filename,
"wav_path" : wav_path,
"text_ids" : text_ids,
"durs" : durs,
"embed_path" : embed_path,
"rate" : self.target_rate,
}
"""
def gen_acous_features(item, config):
text_ids = item["text_ids"]
durs = item["durs"]
assert len(text_ids) == len(durs)
global _feats_handle
if _feats_handle is None:
_feats_handle = AudioMelSpec(**config["feat_params"])
audio = _feats_handle.load_wav(item["wav_path"])
mel = _feats_handle.melspectrogram(audio)
assert len(mel) == sum(durs)
item["text_ids"] = np.array(text_ids)
item["durs"] = np.array(durs)
item["mels"] = mel
item["embeds"] = np.load(item["embed_path"])
return item
def save_acous_to_file(features, subdir, config):
filename = features["filename"]
if config["format"] == "npy":
save_list = [
(features["text_ids"], "ids", "ids", np.int32),
(features["durs"], "raw-durations", "raw-durations", np.int32),
(features["mels"], "raw-mels", "raw-mels", np.float32),
(features["embeds"], "embeds", "embeds", np.float32),
]
for item, name_dir, name_file, fmt in save_list:
np.save(
os.path.join(
config["outdir"], subdir, name_dir, f"{filename}-{name_file}.npy"
),
item.astype(fmt),
allow_pickle=False,
)
else:
raise ValueError("'npy' is the only supported format.")
'''
###############################################################################
################################ Vocoder ######################################
###############################################################################
'''
def preprocess_vocoder():
"""Run preprocessing process and compute statistics for normalizing."""
config = parse_and_config()
dataset_processor = {
"multispk_voiceclone": MultiSPKVoiceCloneProcessor,
}
dataset_symbol = {
"multispk_voiceclone": AISHELL_CHN_SYMBOLS,
}
dataset_cleaner = {
"multispk_voiceclone": None,
}
logging.info(f"Selected '{config['dataset']}' processor.")
processor = dataset_processor[config["dataset"]](
config["rootdir"],
symbols = dataset_symbol[config["dataset"]],
cleaner_names = dataset_cleaner[config["dataset"]],
during_train = True if config["during_train"] else False,
mfaed_txt = config["mfaed_txt"],
wavs_dir = config["wavs_dir"],
embed_dir = config["embed_dir"],
spkinfo_dir = config["spkinfo_dir"]
)
# check output directories
build_dir = lambda x: [
os.makedirs(os.path.join(config["outdir"], x, y), exist_ok=True)
for y in ["norm-feats", "wavs"]
]
build_dir("train")
build_dir("valid")
# save pretrained-processor to feature dir
processor._save_mapper(
os.path.join(config["outdir"], f"{config['dataset']}_mapper.json"),
extra_attrs_to_save={"pinyin_dict": processor.pinyin_dict}
if config["dataset"] == "multispk_voiceclone" else {},
)
# build train test split
_Y = [i[0] for i in processor.items]
train_split, valid_split = train_test_split(
processor.items,
test_size=config["test_size"],
random_state=42,
shuffle=True,
stratify=_Y
)
logging.info(f"Training items: {len(train_split)}")
logging.info(f"Validation items: {len(valid_split)}")
train_utt_ids = [x[1] for x in train_split]
valid_utt_ids = [x[1] for x in valid_split]
# save train and valid utt_ids to track later
np.save(os.path.join(config["outdir"], "train_utt_ids.npy"), train_utt_ids, allow_pickle=False)
np.save(os.path.join(config["outdir"], "valid_utt_ids.npy"), valid_utt_ids, allow_pickle=False)
# config["none_pinyin_symnum"] = processor.none_pinyin_symnum
# define map iterator
def iterator_data(items_list):
for item in items_list:
yield processor.get_one_sample(item)
train_iterator_data = iterator_data(train_split)
valid_iterator_data = iterator_data(valid_split)
p = Pool(config["n_cpus"])
# preprocess train files and get statistics for normalizing
partial_fn = partial(gen_vocoder, config=config)
train_map = p.imap_unordered(
partial_fn,
tqdm(train_iterator_data, total=len(train_split), desc="[Preprocessing train]"),
chunksize=10,
)
for item in train_map:
save_vocoder_to_file(item, "train", config)
# preprocess valid files
partial_fn = partial(gen_vocoder, config=config)
valid_map = p.imap_unordered(
partial_fn,
tqdm(valid_iterator_data, total=len(valid_split), desc="[Preprocessing valid]"),
chunksize=10,
)
for item in valid_map:
save_vocoder_to_file(item, "valid", config)
"""
sample = {
"speaker_name": spkname,
"filename" : filename,
"wav_path" : wav_path,
"text_ids" : text_ids,
"durs" : durs,
"embed_path" : embed_path,
"rate" : self.target_rate,
}
"""
def gen_vocoder(item, config):
global _feats_handle
if _feats_handle is None:
_feats_handle = AudioMelSpec(**config["feat_params"])
audio = _feats_handle.load_wav(item["wav_path"])
mel = _feats_handle.melspectrogram(audio)
# check audio and feature length
audio = np.pad(audio, (0, _feats_handle.n_fft), mode="edge")
audio = audio[: len(mel) * _feats_handle.hop_size]
assert len(mel) * _feats_handle.hop_size == len(audio)
item["audio"] = audio
item["mels"] = mel
return item
def save_vocoder_to_file(features, subdir, config):
filename = features["filename"]
if config["format"] == "npy":
save_list = [
(features["audio"], "wavs", "wave", np.float32),
(features["mels"], "norm-feats", "norm-feats", np.float32),
]
for item, name_dir, name_file, fmt in save_list:
np.save(
os.path.join(
config["outdir"], subdir, name_dir, f"{filename}-{name_file}.npy"
),
item.astype(fmt),
allow_pickle=False,
)
else:
raise ValueError("'npy' is the only supported format.")
================================================
FILE: TensorFlowTTS/tensorflow_tts/configs/__init__.py
================================================
from tensorflow_tts.configs.melgan import (
MelGANDiscriminatorConfig,
MelGANGeneratorConfig,
)
from tensorflow_tts.configs.mb_melgan import (
MultiBandMelGANDiscriminatorConfig,
MultiBandMelGANGeneratorConfig,
)
from tensorflow_tts.configs.unetts import UNETTSDurationConfig, UNETTSAcousConfig
================================================
FILE: TensorFlowTTS/tensorflow_tts/configs/mb_melgan.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Multi-band MelGAN Config object."""
from tensorflow_tts.configs import MelGANDiscriminatorConfig, MelGANGeneratorConfig
class MultiBandMelGANGeneratorConfig(MelGANGeneratorConfig):
"""Initialize Multi-band MelGAN Generator Config."""
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.subbands = kwargs.pop("subbands", 4)
self.taps = kwargs.pop("taps", 62)
self.cutoff_ratio = kwargs.pop("cutoff_ratio", 0.142)
self.beta = kwargs.pop("beta", 9.0)
class MultiBandMelGANDiscriminatorConfig(MelGANDiscriminatorConfig):
"""Initialize Multi-band MelGAN Discriminator Config."""
def __init__(self, **kwargs):
super().__init__(**kwargs)
================================================
FILE: TensorFlowTTS/tensorflow_tts/configs/melgan.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""MelGAN Config object."""
class MelGANGeneratorConfig(object):
"""Initialize MelGAN Generator Config."""
def __init__(
self,
out_channels=1,
kernel_size=7,
filters=512,
use_bias=True,
upsample_scales=[8, 8, 2, 2],
stack_kernel_size=3,
stacks=3,
nonlinear_activation="LeakyReLU",
nonlinear_activation_params={"alpha": 0.2},
padding_type="REFLECT",
use_final_nolinear_activation=True,
is_weight_norm=True,
initializer_seed=42,
**kwargs
):
"""Init parameters for MelGAN Generator model."""
self.out_channels = out_channels
self.kernel_size = kernel_size
self.filters = filters
self.use_bias = use_bias
self.upsample_scales = upsample_scales
self.stack_kernel_size = stack_kernel_size
self.stacks = stacks
self.nonlinear_activation = nonlinear_activation
self.nonlinear_activation_params = nonlinear_activation_params
self.padding_type = padding_type
self.use_final_nolinear_activation = use_final_nolinear_activation
self.is_weight_norm = is_weight_norm
self.initializer_seed = initializer_seed
class MelGANDiscriminatorConfig(object):
"""Initialize MelGAN Discriminator Config."""
def __init__(
self,
out_channels=1,
scales=3,
downsample_pooling="AveragePooling1D",
downsample_pooling_params={"pool_size": 4, "strides": 2,},
kernel_sizes=[5, 3],
filters=16,
max_downsample_filters=1024,
use_bias=True,
downsample_scales=[4, 4, 4, 4],
nonlinear_activation="LeakyReLU",
nonlinear_activation_params={"alpha": 0.2},
padding_type="REFLECT",
is_weight_norm=True,
initializer_seed=42,
**kwargs
):
"""Init parameters for MelGAN Discriminator model."""
self.out_channels = out_channels
self.scales = scales
self.downsample_pooling = downsample_pooling
self.downsample_pooling_params = downsample_pooling_params
self.kernel_sizes = kernel_sizes
self.filters = filters
self.max_downsample_filters = max_downsample_filters
self.use_bias = use_bias
self.downsample_scales = downsample_scales
self.nonlinear_activation = nonlinear_activation
self.nonlinear_activation_params = nonlinear_activation_params
self.padding_type = padding_type
self.is_weight_norm = is_weight_norm
self.initializer_seed = initializer_seed
================================================
FILE: TensorFlowTTS/tensorflow_tts/configs/unetts.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""UnetTTS Config object."""
import collections
from tensorflow_tts.processor.multispk_voiceclone import AISHELL_CHN_SYMBOLS as aishell_symbols
SelfAttentionParams = collections.namedtuple(
"SelfAttentionParams",
[
"hidden_size",
"num_hidden_layers",
"num_attention_heads",
"attention_head_size",
"intermediate_size",
"intermediate_kernel_size",
"hidden_act",
"output_attentions",
"output_hidden_states",
"initializer_range",
"hidden_dropout_prob",
"attention_probs_dropout_prob",
"layer_norm_eps",
],
)
SelfAttentionConditionalParams = collections.namedtuple(
    "SelfAttentionConditionalParams",
[
"hidden_size",
"num_hidden_layers",
"num_attention_heads",
"attention_head_size",
"intermediate_size",
"intermediate_kernel_size",
"hidden_act",
"output_attentions",
"output_hidden_states",
"initializer_range",
"hidden_dropout_prob",
"attention_probs_dropout_prob",
"layer_norm_eps",
"conditional_norm_type",
],
)
class UNETTSDurationConfig(object):
"""Initialize UNETTSDuration Config."""
def __init__(
self,
dataset = 'multispk_voiceclone',
vocab_size = len(aishell_symbols),
encoder_hidden_size = 384,
encoder_num_hidden_layers = 4,
encoder_num_attention_heads = 2,
encoder_attention_head_size = 192,
encoder_intermediate_size = 1024,
encoder_intermediate_kernel_size = 3,
encoder_hidden_act = "mish",
output_attentions = True,
output_hidden_states = True,
hidden_dropout_prob = 0.1,
attention_probs_dropout_prob = 0.1,
initializer_range = 0.02,
layer_norm_eps = 1e-5,
num_duration_conv_layers = 2,
duration_predictor_filters = 256,
duration_predictor_kernel_sizes = 3,
duration_predictor_dropout_probs = 0.1,
**kwargs
):
"""Init parameters for UNETTSDuration model."""
if dataset == "multispk_voiceclone":
self.vocab_size = len(aishell_symbols)
else:
raise ValueError("No such dataset: {}".format(dataset))
self.initializer_range = initializer_range
# self.max_position_embeddings = max_position_embeddings
self.layer_norm_eps = layer_norm_eps
# encoder params
self.encoder_self_attention_params = SelfAttentionParams(
hidden_size = encoder_hidden_size,
num_hidden_layers = encoder_num_hidden_layers,
num_attention_heads = encoder_num_attention_heads,
attention_head_size = encoder_attention_head_size,
hidden_act = encoder_hidden_act,
intermediate_size = encoder_intermediate_size,
intermediate_kernel_size = encoder_intermediate_kernel_size,
output_attentions = output_attentions,
output_hidden_states = output_hidden_states,
initializer_range = initializer_range,
hidden_dropout_prob = hidden_dropout_prob,
attention_probs_dropout_prob = attention_probs_dropout_prob,
layer_norm_eps = layer_norm_eps,
)
self.duration_predictor_dropout_probs = duration_predictor_dropout_probs
self.num_duration_conv_layers = num_duration_conv_layers
self.duration_predictor_filters = duration_predictor_filters
self.duration_predictor_kernel_sizes = duration_predictor_kernel_sizes
class UNETTSAcousConfig(object):
"""Initialize UNETTSAcou Config."""
def __init__(
self,
dataset = 'multispk_voiceclone',
vocab_size = len(aishell_symbols),
encoder_hidden_size = 384,
encoder_num_hidden_layers = 4,
encoder_num_attention_heads = 2,
encoder_attention_head_size = 192,
encoder_intermediate_size = 1024,
encoder_intermediate_kernel_size = 3,
encoder_hidden_act = "mish",
output_attentions = True,
output_hidden_states = True,
hidden_dropout_prob = 0.1,
attention_probs_dropout_prob = 0.1,
initializer_range = 0.02,
layer_norm_eps = 1e-5,
addfeatures_num = 3,
isaddur = True,
num_mels = 80,
content_latent_dim = 132,
n_conv_blocks = 6,
adain_filter_size = 256,
enc_kernel_size = 5,
dec_kernel_size = 5,
gen_kernel_size = 5,
decoder_hidden_size = 384,
decoder_num_hidden_layers = 4,
decoder_num_attention_heads = 2,
decoder_attention_head_size = 192,
decoder_intermediate_size = 1024,
decoder_intermediate_kernel_size = 3,
decoder_hidden_act = "mish",
decoder_conditional_norm_type = "Layer",
decoder_is_conditional = True,
num_variant_conv_layers = 2,
variant_predictor_dropout_probs = 0.1,
variant_predictor_filters = 256,
variant_predictor_kernel_sizes = 3,
n_conv_postnet = 5,
postnet_conv_filters = 512,
postnet_conv_kernel_sizes = 5,
postnet_dropout_rate = 0.1,
**kwargs
):
"""Init parameters for UNETTSAcou model."""
if dataset == "multispk_voiceclone":
self.vocab_size = len(aishell_symbols)
else:
raise ValueError("No such dataset: {}".format(dataset))
self.initializer_range = initializer_range
# self.max_position_embeddings = max_position_embeddings
self.layer_norm_eps = layer_norm_eps
self.num_mels = num_mels
# encoder params
self.encoder_self_attention_params = SelfAttentionParams(
hidden_size = encoder_hidden_size,
num_hidden_layers = encoder_num_hidden_layers,
num_attention_heads = encoder_num_attention_heads,
attention_head_size = encoder_attention_head_size,
hidden_act = encoder_hidden_act,
intermediate_size = encoder_intermediate_size,
intermediate_kernel_size = encoder_intermediate_kernel_size,
output_attentions = output_attentions,
output_hidden_states = output_hidden_states,
initializer_range = initializer_range,
hidden_dropout_prob = hidden_dropout_prob,
attention_probs_dropout_prob = attention_probs_dropout_prob,
layer_norm_eps = layer_norm_eps,
)
self.content_latent_dim = content_latent_dim
self.n_conv_blocks = n_conv_blocks
self.adain_filter_size = adain_filter_size
self.enc_kernel_size = enc_kernel_size
self.dec_kernel_size = dec_kernel_size
self.gen_kernel_size = gen_kernel_size
self.decoder_is_conditional = decoder_is_conditional
self.decoder_self_attention_conditional_params = SelfAttentionConditionalParams(
hidden_size = decoder_hidden_size,
num_hidden_layers = decoder_num_hidden_layers,
num_attention_heads = decoder_num_attention_heads,
attention_head_size = decoder_attention_head_size,
hidden_act = decoder_hidden_act,
intermediate_size = decoder_intermediate_size,
intermediate_kernel_size = decoder_intermediate_kernel_size,
output_attentions = output_attentions,
output_hidden_states = output_hidden_states,
initializer_range = initializer_range,
hidden_dropout_prob = hidden_dropout_prob,
attention_probs_dropout_prob = attention_probs_dropout_prob,
layer_norm_eps = layer_norm_eps,
conditional_norm_type = decoder_conditional_norm_type,
)
self.decoder_self_attention_params = SelfAttentionParams(
hidden_size = decoder_hidden_size,
num_hidden_layers = decoder_num_hidden_layers,
num_attention_heads = decoder_num_attention_heads,
attention_head_size = decoder_attention_head_size,
hidden_act = decoder_hidden_act,
intermediate_size = decoder_intermediate_size,
intermediate_kernel_size = decoder_intermediate_kernel_size,
output_attentions = output_attentions,
output_hidden_states = output_hidden_states,
initializer_range = initializer_range,
hidden_dropout_prob = hidden_dropout_prob,
attention_probs_dropout_prob = attention_probs_dropout_prob,
layer_norm_eps = layer_norm_eps,
)
self.num_variant_conv_layers = num_variant_conv_layers
self.variant_predictor_dropout_probs = variant_predictor_dropout_probs
self.variant_predictor_filters = variant_predictor_filters
self.variant_predictor_kernel_sizes = variant_predictor_kernel_sizes
# postnet
self.n_conv_postnet = n_conv_postnet
self.postnet_conv_filters = postnet_conv_filters
self.postnet_conv_kernel_sizes = postnet_conv_kernel_sizes
self.postnet_dropout_rate = postnet_dropout_rate
self.addfeatures_num = addfeatures_num
self.isaddur = isaddur
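# --- Usage sketch (not part of the original file) ---
# Building a duration-model config directly from keyword arguments; in normal
# use AutoConfig.from_pretrained() builds it from the unetts_duration_params
# section of a YAML file instead.
def _demo_duration_config():
    config = UNETTSDurationConfig(
        dataset="multispk_voiceclone",
        encoder_hidden_size=384,
        encoder_num_hidden_layers=4,
    )
    # Encoder hyper-parameters are grouped into a SelfAttentionParams namedtuple.
    print(config.encoder_self_attention_params.hidden_size)  # 384
    print(config.vocab_size)                                 # len(AISHELL_CHN_SYMBOLS)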
================================================
FILE: TensorFlowTTS/tensorflow_tts/datasets/__init__.py
================================================
from tensorflow_tts.datasets.abstract_dataset import AbstractDataset
from tensorflow_tts.datasets.audio_dataset import AudioDataset
from tensorflow_tts.datasets.mel_dataset import MelDataset
================================================
FILE: TensorFlowTTS/tensorflow_tts/datasets/abstract_dataset.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Abstract Dataset modules."""
import abc
import tensorflow as tf
class AbstractDataset(metaclass=abc.ABCMeta):
"""Abstract Dataset module for Dataset Loader."""
@abc.abstractmethod
def get_args(self):
"""Return args for generator function."""
pass
@abc.abstractmethod
def generator(self):
"""Generator function, should have args from get_args function."""
pass
@abc.abstractmethod
def get_output_dtypes(self):
"""Return output dtypes for each element from generator."""
pass
@abc.abstractmethod
def get_len_dataset(self):
"""Return number of samples on dataset."""
pass
def create(
self,
allow_cache=False,
batch_size=1,
is_shuffle=False,
map_fn=None,
reshuffle_each_iteration=True,
):
"""Create tf.dataset function."""
output_types = self.get_output_dtypes()
datasets = tf.data.Dataset.from_generator(
self.generator, output_types=output_types, args=(self.get_args())
)
if allow_cache:
datasets = datasets.cache()
if is_shuffle:
datasets = datasets.shuffle(
self.get_len_dataset(),
reshuffle_each_iteration=reshuffle_each_iteration,
)
if batch_size > 1 and map_fn is None:
raise ValueError("map function must define when batch_size > 1.")
if map_fn is not None:
datasets = datasets.map(map_fn, tf.data.experimental.AUTOTUNE)
datasets = datasets.batch(batch_size)
datasets = datasets.prefetch(tf.data.experimental.AUTOTUNE)
return datasets
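# --- Usage sketch (not part of the original file) ---
# A minimal concrete subclass, assuming the data is a small in-memory list;
# the real loaders (AudioDataset, MelDataset) read dumped .npy files instead.
class _ToyDataset(AbstractDataset):
    def __init__(self, values):
        self.values = values
    def get_args(self):
        return [list(range(len(self.values)))]
    def generator(self, indices):
        for i in indices:
            yield {"values": self.values[int(i)]}
    def get_output_dtypes(self):
        return {"values": tf.float32}
    def get_len_dataset(self):
        return len(self.values)
# e.g. ds = _ToyDataset([1.0, 2.0, 3.0]).create(batch_size=1)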
================================================
FILE: TensorFlowTTS/tensorflow_tts/datasets/audio_dataset.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Audio modules."""
import logging
import os
import numpy as np
import tensorflow as tf
from tensorflow_tts.datasets.abstract_dataset import AbstractDataset
from tensorflow_tts.utils import find_files
class AudioDataset(AbstractDataset):
"""Tensorflow compatible audio dataset."""
def __init__(
self,
root_dir,
audio_query="*-wave.npy",
audio_load_fn=np.load,
audio_length_threshold=0,
):
"""Initialize dataset.
Args:
root_dir (str): Root directory including dumped files.
audio_query (str): Query to find feature files in root_dir.
audio_load_fn (func): Function to load feature file.
audio_length_threshold (int): Threshold to remove short feature files.
"""
        # find all of the audio files.
audio_files = sorted(find_files(root_dir, audio_query))
audio_lengths = [audio_load_fn(f).shape[0] for f in audio_files]
# assert the number of files
        assert len(audio_files) != 0, f"No audio files found in {root_dir}."
if ".npy" in audio_query:
suffix = audio_query[1:]
utt_ids = [os.path.basename(f).replace(suffix, "") for f in audio_files]
# set global params
self.utt_ids = utt_ids
self.audio_files = audio_files
self.audio_lengths = audio_lengths
self.audio_load_fn = audio_load_fn
self.audio_length_threshold = audio_length_threshold
def get_args(self):
return [self.utt_ids]
def generator(self, utt_ids):
for i, utt_id in enumerate(utt_ids):
audio_file = self.audio_files[i]
audio = self.audio_load_fn(audio_file)
audio_length = self.audio_lengths[i]
items = {"utt_ids": utt_id, "audios": audio, "audio_lengths": audio_length}
yield items
def get_output_dtypes(self):
output_types = {
"utt_ids": tf.string,
"audios": tf.float32,
"audio_lengths": tf.float32,
}
return output_types
def create(
self,
allow_cache=False,
batch_size=1,
is_shuffle=False,
map_fn=None,
reshuffle_each_iteration=True,
):
"""Create tf.dataset function."""
output_types = self.get_output_dtypes()
datasets = tf.data.Dataset.from_generator(
self.generator, output_types=output_types, args=(self.get_args())
)
datasets = datasets.filter(
lambda x: x["audio_lengths"] > self.audio_length_threshold
)
if allow_cache:
datasets = datasets.cache()
if is_shuffle:
datasets = datasets.shuffle(
self.get_len_dataset(),
reshuffle_each_iteration=reshuffle_each_iteration,
)
# define padded shapes
padded_shapes = {
"utt_ids": [],
"audios": [None],
"audio_lengths": [],
}
datasets = datasets.padded_batch(batch_size, padded_shapes=padded_shapes)
datasets = datasets.prefetch(tf.data.experimental.AUTOTUNE)
return datasets
def get_len_dataset(self):
return len(self.utt_ids)
def __name__(self):
return "AudioDataset"
================================================
FILE: TensorFlowTTS/tensorflow_tts/datasets/mel_dataset.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Dataset modules."""
import logging
import os
import numpy as np
import tensorflow as tf
from tensorflow_tts.datasets.abstract_dataset import AbstractDataset
from tensorflow_tts.utils import find_files
class MelDataset(AbstractDataset):
"""Tensorflow compatible mel dataset."""
def __init__(
self,
root_dir,
mel_query="*-raw-feats.h5",
mel_load_fn=np.load,
mel_length_threshold=0,
):
"""Initialize dataset.
Args:
root_dir (str): Root directory including dumped files.
mel_query (str): Query to find feature files in root_dir.
mel_load_fn (func): Function to load feature file.
mel_length_threshold (int): Threshold to remove short feature files.
"""
        # find all of the mel files.
mel_files = sorted(find_files(root_dir, mel_query))
mel_lengths = [mel_load_fn(f).shape[0] for f in mel_files]
# assert the number of files
        assert len(mel_files) != 0, f"No mel files found in {root_dir}."
if ".npy" in mel_query:
suffix = mel_query[1:]
utt_ids = [os.path.basename(f).replace(suffix, "") for f in mel_files]
# set global params
self.utt_ids = utt_ids
self.mel_files = mel_files
self.mel_lengths = mel_lengths
self.mel_load_fn = mel_load_fn
self.mel_length_threshold = mel_length_threshold
def get_args(self):
return [self.utt_ids]
def generator(self, utt_ids):
for i, utt_id in enumerate(utt_ids):
mel_file = self.mel_files[i]
mel = self.mel_load_fn(mel_file)
mel_length = self.mel_lengths[i]
items = {"utt_ids": utt_id, "mels": mel, "mel_lengths": mel_length}
yield items
def get_output_dtypes(self):
output_types = {
"utt_ids": tf.string,
"mels": tf.float32,
"mel_lengths": tf.int32,
}
return output_types
def create(
self,
allow_cache=False,
batch_size=1,
is_shuffle=False,
map_fn=None,
reshuffle_each_iteration=True,
):
"""Create tf.dataset function."""
output_types = self.get_output_dtypes()
datasets = tf.data.Dataset.from_generator(
self.generator, output_types=output_types, args=(self.get_args())
)
datasets = datasets.filter(
lambda x: x["mel_lengths"] > self.mel_length_threshold
)
if allow_cache:
datasets = datasets.cache()
if is_shuffle:
datasets = datasets.shuffle(
self.get_len_dataset(),
reshuffle_each_iteration=reshuffle_each_iteration,
)
# define padded shapes
padded_shapes = {
"utt_ids": [],
"mels": [None, 80],
"mel_lengths": [],
}
datasets = datasets.padded_batch(batch_size, padded_shapes=padded_shapes)
datasets = datasets.prefetch(tf.data.experimental.AUTOTUNE)
return datasets
def get_len_dataset(self):
return len(self.utt_ids)
def __name__(self):
return "MelDataset"
================================================
FILE: TensorFlowTTS/tensorflow_tts/inference/__init__.py
================================================
from tensorflow_tts.inference.auto_model import TFAutoModel
from tensorflow_tts.inference.auto_config import AutoConfig
from tensorflow_tts.inference.auto_processor import AutoProcessor
================================================
FILE: TensorFlowTTS/tensorflow_tts/inference/auto_config.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 The HuggingFace Inc. team and Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tensorflow Auto Config modules."""
import logging
import yaml
from collections import OrderedDict
from tensorflow_tts.configs import (
MelGANGeneratorConfig,
MultiBandMelGANGeneratorConfig,
UNETTSDurationConfig,
UNETTSAcousConfig,
)
CONFIG_MAPPING = OrderedDict(
[
("multiband_melgan_generator", MultiBandMelGANGeneratorConfig),
("melgan_generator", MelGANGeneratorConfig),
("unetts_duration", UNETTSDurationConfig),
("unetts_acous", UNETTSAcousConfig),
]
)
class AutoConfig:
def __init__(self):
raise EnvironmentError(
"AutoConfig is designed to be instantiated "
"using the `AutoConfig.from_pretrained(pretrained_path)` method."
)
@classmethod
def from_pretrained(cls, pretrained_path, **kwargs):
with open(pretrained_path) as f:
config = yaml.load(f, Loader=yaml.Loader)
try:
model_type = config["model_type"]
config_class = CONFIG_MAPPING[model_type]
config_class = config_class(**config[model_type + "_params"], **kwargs)
return config_class
except Exception:
raise ValueError(
"Unrecognized config in {}. "
"Should have a `model_type` key in its config.yaml, or contain one of the following strings "
"in its name: {}".format(
pretrained_path, ", ".join(CONFIG_MAPPING.keys())
)
)
================================================
FILE: TensorFlowTTS/tensorflow_tts/inference/auto_model.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 The HuggingFace Inc. team and Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tensorflow Auto Model modules."""
import logging
import warnings
from collections import OrderedDict
from tensorflow_tts.configs import (
MelGANGeneratorConfig,
MultiBandMelGANGeneratorConfig,
UNETTSDurationConfig,
UNETTSAcousConfig,
)
from tensorflow_tts.models import (
TFMelGANGenerator,
TFMBMelGANGenerator,
TFUNETTSDuration,
TFUNETTSAcous,
)
TF_MODEL_MAPPING = OrderedDict(
[
(MultiBandMelGANGeneratorConfig, TFMBMelGANGenerator),
(MelGANGeneratorConfig, TFMelGANGenerator),
(UNETTSDurationConfig, TFUNETTSDuration),
(UNETTSAcousConfig, TFUNETTSAcous),
]
)
class TFAutoModel(object):
"""General model class for inferencing."""
def __init__(self):
raise EnvironmentError("Cannot be instantiated using `__init__()`")
@classmethod
def from_pretrained(cls, config, pretrained_path=None, **kwargs):
is_build = kwargs.pop("is_build", True)
for config_class, model_class in TF_MODEL_MAPPING.items():
if isinstance(config, config_class) and str(config_class.__name__) in str(
config
):
model = model_class(config=config, **kwargs)
if is_build:
model._build()
if pretrained_path is not None and ".h5" in pretrained_path:
model.load_weights(pretrained_path)
return model
raise ValueError(
"Unrecognized configuration class {} for this kind of TFAutoModel: {}.\n"
"Model type should be one of {}.".format(
config.__class__,
cls.__name__,
", ".join(c.__name__ for c in TF_MODEL_MAPPING.keys()),
)
)
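# --- Usage sketch (not part of the original file) ---
# Loading the pretrained duration model shipped in models/; the YAML path is
# an assumption (any config whose model_type is "unetts_duration" works).
def _demo_auto_model():
    from tensorflow_tts.inference import AutoConfig, TFAutoModel
    config = AutoConfig.from_pretrained("train/configs/unetts_duration.yaml")
    model = TFAutoModel.from_pretrained(config=config, pretrained_path="models/duration4k.h5")
    return model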
================================================
FILE: TensorFlowTTS/tensorflow_tts/inference/auto_processor.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 The TensorFlowTTS Team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tensorflow Auto Processor modules."""
import logging
import json
from collections import OrderedDict
from tensorflow_tts.processor import (
MultiSPKVoiceCloneProcessor,
)
CONFIG_MAPPING = OrderedDict(
[
("MultiSPKVoiceCloneProcessor", MultiSPKVoiceCloneProcessor),
]
)
class AutoProcessor:
def __init__(self):
raise EnvironmentError(
"AutoProcessor is designed to be instantiated "
"using the `AutoProcessor.from_pretrained(pretrained_path)` method."
)
@classmethod
def from_pretrained(cls, pretrained_path, **kwargs):
with open(pretrained_path, "r") as f:
config = json.load(f)
try:
processor_name = config["processor_name"]
processor_class = CONFIG_MAPPING[processor_name]
processor_class = processor_class(
data_dir=None, loaded_mapper_path=pretrained_path
)
return processor_class
except Exception:
raise ValueError(
"Unrecognized processor in {}. "
"Should have a `processor_name` key in its config.json, or contain one of the following strings "
"in its name: {}".format(
pretrained_path, ", ".join(CONFIG_MAPPING.keys())
)
)
================================================
FILE: TensorFlowTTS/tensorflow_tts/losses/__init__.py
================================================
from tensorflow_tts.losses.spectrogram import TFMelSpectrogram
from tensorflow_tts.losses.stft import TFMultiResolutionSTFT
================================================
FILE: TensorFlowTTS/tensorflow_tts/losses/spectrogram.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Spectrogram-based loss modules."""
import tensorflow as tf
class TFMelSpectrogram(tf.keras.layers.Layer):
"""Mel Spectrogram loss."""
def __init__(
self,
n_mels=80,
f_min=80.0,
f_max=7600,
frame_length=1024,
frame_step=256,
fft_length=1024,
sample_rate=16000,
**kwargs
):
"""Initialize."""
super().__init__(**kwargs)
self.frame_length = frame_length
self.frame_step = frame_step
self.fft_length = fft_length
self.linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
n_mels, fft_length // 2 + 1, sample_rate, f_min, f_max
)
def _calculate_log_mels_spectrogram(self, signals):
"""Calculate forward propagation.
Args:
signals (Tensor): signal (B, T).
Returns:
Tensor: Mel spectrogram (B, T', 80)
"""
stfts = tf.signal.stft(
signals,
frame_length=self.frame_length,
frame_step=self.frame_step,
fft_length=self.fft_length,
)
linear_spectrograms = tf.abs(stfts)
mel_spectrograms = tf.tensordot(
linear_spectrograms, self.linear_to_mel_weight_matrix, 1
)
mel_spectrograms.set_shape(
linear_spectrograms.shape[:-1].concatenate(
self.linear_to_mel_weight_matrix.shape[-1:]
)
)
log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6) # prevent nan.
return log_mel_spectrograms
def call(self, y, x):
"""Calculate forward propagation.
Args:
y (Tensor): Groundtruth signal (B, T).
x (Tensor): Predicted signal (B, T).
Returns:
Tensor: Mean absolute Error Spectrogram Loss.
"""
y_mels = self._calculate_log_mels_spectrogram(y)
x_mels = self._calculate_log_mels_spectrogram(x)
return tf.reduce_mean(
tf.abs(y_mels - x_mels), axis=list(range(1, len(x_mels.shape)))
)
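# --- Usage sketch (not part of the original file) ---
# Comparing two random one-second, 16 kHz batches; the layer returns one loss
# value per batch element (time and mel axes are reduced inside call()).
def _demo_mel_loss():
    y = tf.random.normal([2, 16000])  # ground-truth signals
    x = tf.random.normal([2, 16000])  # predicted signals
    loss = TFMelSpectrogram()(y, x)
    print(loss.shape)                 # (2,)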
================================================
FILE: TensorFlowTTS/tensorflow_tts/losses/stft.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""STFT-based loss modules."""
import tensorflow as tf
class TFSpectralConvergence(tf.keras.layers.Layer):
"""Spectral convergence loss."""
def __init__(self):
"""Initialize."""
super().__init__()
def call(self, y_mag, x_mag):
"""Calculate forward propagation.
Args:
y_mag (Tensor): Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
x_mag (Tensor): Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
Returns:
Tensor: Spectral convergence loss value.
"""
return tf.norm(y_mag - x_mag, ord="fro", axis=(-2, -1)) / tf.norm(
y_mag, ord="fro", axis=(-2, -1)
)
class TFLogSTFTMagnitude(tf.keras.layers.Layer):
"""Log STFT magnitude loss module."""
def __init__(self):
"""Initialize."""
super().__init__()
def call(self, y_mag, x_mag):
"""Calculate forward propagation.
Args:
y_mag (Tensor): Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
x_mag (Tensor): Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
Returns:
            Tensor: Log STFT magnitude loss value.
"""
return tf.abs(tf.math.log(y_mag) - tf.math.log(x_mag))
class TFSTFT(tf.keras.layers.Layer):
"""STFT loss module."""
def __init__(self, frame_length=600, frame_step=120, fft_length=1024):
"""Initialize."""
super().__init__()
self.frame_length = frame_length
self.frame_step = frame_step
self.fft_length = fft_length
        self.spectral_convergence_loss = TFSpectralConvergence()
self.log_stft_magnitude_loss = TFLogSTFTMagnitude()
def call(self, y, x):
"""Calculate forward propagation.
Args:
y (Tensor): Groundtruth signal (B, T).
x (Tensor): Predicted signal (B, T).
Returns:
Tensor: Spectral convergence loss value (pre-reduce).
Tensor: Log STFT magnitude loss value (pre-reduce).
"""
x_mag = tf.abs(
tf.signal.stft(
signals=x,
frame_length=self.frame_length,
frame_step=self.frame_step,
fft_length=self.fft_length,
)
)
y_mag = tf.abs(
tf.signal.stft(
signals=y,
frame_length=self.frame_length,
frame_step=self.frame_step,
fft_length=self.fft_length,
)
)
# add small number to prevent nan value.
# compatible with pytorch version.
x_mag = tf.clip_by_value(tf.math.sqrt(x_mag ** 2 + 1e-7), 1e-7, 1e3)
y_mag = tf.clip_by_value(tf.math.sqrt(y_mag ** 2 + 1e-7), 1e-7, 1e3)
        sc_loss = self.spectral_convergence_loss(y_mag, x_mag)
mag_loss = self.log_stft_magnitude_loss(y_mag, x_mag)
return sc_loss, mag_loss
class TFMultiResolutionSTFT(tf.keras.layers.Layer):
"""Multi resolution STFT loss module."""
def __init__(
self,
fft_lengths=[1024, 2048, 512],
frame_lengths=[600, 1200, 240],
frame_steps=[120, 240, 50],
):
"""Initialize Multi resolution STFT loss module.
Args:
            fft_lengths (list): List of FFT sizes.
            frame_lengths (list): List of window lengths.
            frame_steps (list): List of hop sizes.
"""
super().__init__()
assert len(frame_lengths) == len(frame_steps) == len(fft_lengths)
self.stft_losses = []
for frame_length, frame_step, fft_length in zip(
frame_lengths, frame_steps, fft_lengths
):
self.stft_losses.append(TFSTFT(frame_length, frame_step, fft_length))
def call(self, y, x):
"""Calculate forward propagation.
Args:
y (Tensor): Groundtruth signal (B, T).
x (Tensor): Predicted signal (B, T).
Returns:
Tensor: Multi resolution spectral convergence loss value.
Tensor: Multi resolution log STFT magnitude loss value.
"""
sc_loss = 0.0
mag_loss = 0.0
for f in self.stft_losses:
sc_l, mag_l = f(y, x)
sc_loss += tf.reduce_mean(sc_l, axis=list(range(1, len(sc_l.shape))))
mag_loss += tf.reduce_mean(mag_l, axis=list(range(1, len(mag_l.shape))))
sc_loss /= len(self.stft_losses)
mag_loss /= len(self.stft_losses)
return sc_loss, mag_loss
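# --- Usage sketch (not part of the original file) ---
# The multi-resolution loss averages the per-resolution STFT losses; like the
# mel loss above, it returns one value per batch element.
def _demo_multires_stft():
    y = tf.random.normal([2, 16000])
    x = tf.random.normal([2, 16000])
    sc_loss, mag_loss = TFMultiResolutionSTFT()(y, x)
    print(sc_loss.shape, mag_loss.shape)  # (2,) (2,)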
================================================
FILE: TensorFlowTTS/tensorflow_tts/models/__init__.py
================================================
from tensorflow_tts.models.melgan import (
TFMelGANDiscriminator,
TFMelGANGenerator,
TFMelGANMultiScaleDiscriminator,
)
from tensorflow_tts.models.mb_melgan import TFPQMF
from tensorflow_tts.models.mb_melgan import TFMBMelGANGenerator
from tensorflow_tts.models.unetts import TFUNETTSDuration, TFUNETTSAcous, TFUNETTSContentPretrain
================================================
FILE: TensorFlowTTS/tensorflow_tts/models/mb_melgan.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 The Multi-band MelGAN Authors , Minh Nguyen (@dathudeptrai) and Tomoki Hayashi (@kan-bayashi)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
#
# Compatible with https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/parallel_wavegan/layers/pqmf.py.
"""Multi-band MelGAN Modules."""
import numpy as np
import tensorflow as tf
from scipy.signal import kaiser
from tensorflow_tts.models import TFMelGANGenerator
def design_prototype_filter(taps=62, cutoff_ratio=0.15, beta=9.0):
"""Design prototype filter for PQMF.
This method is based on `A Kaiser window approach for the design of prototype
filters of cosine modulated filterbanks`_.
Args:
taps (int): The number of filter taps.
cutoff_ratio (float): Cut-off frequency ratio.
beta (float): Beta coefficient for kaiser window.
Returns:
        ndarray: Impulse response of prototype filter (taps + 1,).
.. _`A Kaiser window approach for the design of prototype filters of cosine modulated filterbanks`:
https://ieeexplore.ieee.org/abstract/document/681427
"""
# check the arguments are valid
    assert taps % 2 == 0, "The number of taps must be an even number."
assert 0.0 < cutoff_ratio < 1.0, "Cutoff ratio must be > 0.0 and < 1.0."
# make initial filter
omega_c = np.pi * cutoff_ratio
with np.errstate(invalid="ignore"):
h_i = np.sin(omega_c * (np.arange(taps + 1) - 0.5 * taps)) / (
np.pi * (np.arange(taps + 1) - 0.5 * taps)
)
# fix nan due to indeterminate form
h_i[taps // 2] = np.cos(0) * cutoff_ratio
# apply kaiser window
w = kaiser(taps + 1, beta)
h = h_i * w
return h
class TFPQMF(tf.keras.layers.Layer):
"""PQMF module."""
def __init__(self, config, **kwargs):
"""Initilize PQMF module.
Args:
config (class): MultiBandMelGANGeneratorConfig
"""
super().__init__(**kwargs)
subbands = config.subbands
taps = config.taps
cutoff_ratio = config.cutoff_ratio
beta = config.beta
# define filter coefficient
h_proto = design_prototype_filter(taps, cutoff_ratio, beta)
h_analysis = np.zeros((subbands, len(h_proto)))
h_synthesis = np.zeros((subbands, len(h_proto)))
for k in range(subbands):
h_analysis[k] = (
2
* h_proto
* np.cos(
(2 * k + 1)
* (np.pi / (2 * subbands))
* (np.arange(taps + 1) - (taps / 2))
+ (-1) ** k * np.pi / 4
)
)
h_synthesis[k] = (
2
* h_proto
* np.cos(
(2 * k + 1)
* (np.pi / (2 * subbands))
* (np.arange(taps + 1) - (taps / 2))
- (-1) ** k * np.pi / 4
)
)
# [subbands, 1, taps + 1] == [filter_width, in_channels, out_channels]
analysis_filter = np.expand_dims(h_analysis, 1)
analysis_filter = np.transpose(analysis_filter, (2, 1, 0))
synthesis_filter = np.expand_dims(h_synthesis, 0)
synthesis_filter = np.transpose(synthesis_filter, (2, 1, 0))
# filter for downsampling & upsampling
updown_filter = np.zeros((subbands, subbands, subbands), dtype=np.float32)
for k in range(subbands):
updown_filter[0, k, k] = 1.0
self.subbands = subbands
self.taps = taps
self.analysis_filter = analysis_filter.astype(np.float32)
self.synthesis_filter = synthesis_filter.astype(np.float32)
self.updown_filter = updown_filter.astype(np.float32)
@tf.function(
experimental_relax_shapes=True,
input_signature=[tf.TensorSpec(shape=[None, None, 1], dtype=tf.float32)],
)
def analysis(self, x):
"""Analysis with PQMF.
Args:
x (Tensor): Input tensor (B, T, 1).
Returns:
Tensor: Output tensor (B, T // subbands, subbands).
"""
x = tf.pad(x, [[0, 0], [self.taps // 2, self.taps // 2], [0, 0]])
x = tf.nn.conv1d(x, self.analysis_filter, stride=1, padding="VALID")
x = tf.nn.conv1d(x, self.updown_filter, stride=self.subbands, padding="VALID")
return x
@tf.function(
experimental_relax_shapes=True,
input_signature=[tf.TensorSpec(shape=[None, None, None], dtype=tf.float32)],
)
def synthesis(self, x):
"""Synthesis with PQMF.
Args:
x (Tensor): Input tensor (B, T // subbands, subbands).
Returns:
Tensor: Output tensor (B, T, 1).
"""
x = tf.nn.conv1d_transpose(
x,
self.updown_filter * self.subbands,
strides=self.subbands,
output_shape=(
tf.shape(x)[0],
tf.shape(x)[1] * self.subbands,
self.subbands,
),
)
x = tf.pad(x, [[0, 0], [self.taps // 2, self.taps // 2], [0, 0]])
return tf.nn.conv1d(x, self.synthesis_filter, stride=1, padding="VALID")
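# Illustrative round-trip sketch (not from the original file); the config
# stand-in below carries only the four fields TFPQMF reads, with the common
# 4-band multi-band MelGAN values assumed:
#   import types
#   cfg = types.SimpleNamespace(subbands=4, taps=62, cutoff_ratio=0.15, beta=9.0)
#   pqmf = TFPQMF(config=cfg)
#   wav = tf.random.normal([1, 8000, 1])
#   sub = pqmf.analysis(wav)    # (1, 2000, 4): four critically sampled bands
#   rec = pqmf.synthesis(sub)   # (1, 8000, 1): near-perfect reconstruction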
class TFMBMelGANGenerator(TFMelGANGenerator):
"""Tensorflow MBMelGAN generator module."""
def __init__(self, config, **kwargs):
super().__init__(config, **kwargs)
self.pqmf = TFPQMF(config=config, name="pqmf")
def call(self, mels, **kwargs):
"""Calculate forward propagation.
Args:
            mels (Tensor): Input tensor (B, T, channels).
        Returns:
            Tensor: Output tensor (B, T * prod(upsample_scales), out_channels).
"""
return self.inference(mels)
@tf.function(
input_signature=[
tf.TensorSpec(shape=[None, None, 80], dtype=tf.float32, name="mels")
]
)
def inference(self, mels):
mb_audios = self.melgan(mels)
return self.pqmf.synthesis(mb_audios)
@tf.function(
input_signature=[
tf.TensorSpec(shape=[1, None, 80], dtype=tf.float32, name="mels")
]
)
def inference_tflite(self, mels):
mb_audios = self.melgan(mels)
return self.pqmf.synthesis(mb_audios)
================================================
FILE: TensorFlowTTS/tensorflow_tts/models/melgan.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 The MelGAN Authors and Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""MelGAN Modules."""
import numpy as np
import tensorflow as tf
from tensorflow_tts.utils import GroupConv1D, WeightNormalization
def get_initializer(initializer_seed=42):
"""Creates a `tf.initializers.glorot_normal` with the given seed.
Args:
initializer_seed: int, initializer seed.
Returns:
GlorotNormal initializer with seed = `initializer_seed`.
"""
return tf.keras.initializers.GlorotNormal(seed=initializer_seed)
class TFReflectionPad1d(tf.keras.layers.Layer):
"""Tensorflow ReflectionPad1d module."""
def __init__(self, padding_size, padding_type="REFLECT", **kwargs):
"""Initialize TFReflectionPad1d module.
Args:
padding_size (int)
padding_type (str) ("CONSTANT", "REFLECT", or "SYMMETRIC". Default is "REFLECT")
"""
super().__init__(**kwargs)
self.padding_size = padding_size
self.padding_type = padding_type
def call(self, x):
"""Calculate forward propagation.
Args:
x (Tensor): Input tensor (B, T, C).
Returns:
Tensor: Padded tensor (B, T + 2 * padding_size, C).
"""
return tf.pad(
x,
[[0, 0], [self.padding_size, self.padding_size], [0, 0]],
self.padding_type,
)
class TFConvTranspose1d(tf.keras.layers.Layer):
"""Tensorflow ConvTranspose1d module."""
def __init__(
self,
filters,
kernel_size,
strides,
padding,
is_weight_norm,
initializer_seed,
**kwargs
):
"""Initialize TFConvTranspose1d( module.
Args:
filters (int): Number of filters.
kernel_size (int): kernel size.
strides (int): Stride width.
padding (str): Padding type ("same" or "valid").
"""
super().__init__(**kwargs)
self.conv1d_transpose = tf.keras.layers.Conv2DTranspose(
filters=filters,
kernel_size=(kernel_size, 1),
strides=(strides, 1),
padding="same",
kernel_initializer=get_initializer(initializer_seed),
)
if is_weight_norm:
self.conv1d_transpose = WeightNormalization(self.conv1d_transpose)
def call(self, x):
"""Calculate forward propagation.
Args:
x (Tensor): Input tensor (B, T, C).
Returns:
Tensor: Output tensor (B, T', C').
"""
x = tf.expand_dims(x, 2)
x = self.conv1d_transpose(x)
x = tf.squeeze(x, 2)
return x
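# Shape walk-through of the emulation above (descriptive note): older TF 2.x
# releases had no tf.keras Conv1DTranspose, so the layer routes through 2-D:
#   [B, T, C] --expand_dims--> [B, T, 1, C]
#             --Conv2DTranspose(kernel=(k, 1), strides=(s, 1))--> [B, T*s, 1, filters]
#             --squeeze--> [B, T*s, filters]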
class TFResidualStack(tf.keras.layers.Layer):
"""Tensorflow ResidualStack module."""
def __init__(
self,
kernel_size,
filters,
dilation_rate,
use_bias,
nonlinear_activation,
nonlinear_activation_params,
is_weight_norm,
initializer_seed,
**kwargs
):
"""Initialize TFResidualStack module.
Args:
kernel_size (int): Kernel size.
filters (int): Number of filters.
dilation_rate (int): Dilation rate.
use_bias (bool): Whether to add bias parameter in convolution layers.
nonlinear_activation (str): Activation function module name.
nonlinear_activation_params (dict): Hyperparameters for activation function.
"""
super().__init__(**kwargs)
self.blocks = [
getattr(tf.keras.layers, nonlinear_activation)(
**nonlinear_activation_params
),
TFReflectionPad1d((kernel_size - 1) // 2 * dilation_rate),
tf.keras.layers.Conv1D(
filters=filters,
kernel_size=kernel_size,
dilation_rate=dilation_rate,
use_bias=use_bias,
kernel_initializer=get_initializer(initializer_seed),
),
getattr(tf.keras.layers, nonlinear_activation)(
**nonlinear_activation_params
),
tf.keras.layers.Conv1D(
filters=filters,
kernel_size=1,
use_bias=use_bias,
kernel_initializer=get_initializer(initializer_seed),
),
]
self.shortcut = tf.keras.layers.Conv1D(
filters=filters,
kernel_size=1,
use_bias=use_bias,
kernel_initializer=get_initializer(initializer_seed),
name="shortcut",
)
# apply weightnorm
if is_weight_norm:
self._apply_weightnorm(self.blocks)
self.shortcut = WeightNormalization(self.shortcut)
def call(self, x):
"""Calculate forward propagation.
Args:
x (Tensor): Input tensor (B, T, C).
Returns:
Tensor: Output tensor (B, T, C).
"""
_x = tf.identity(x)
for layer in self.blocks:
_x = layer(_x)
shortcut = self.shortcut(x)
return shortcut + _x
def _apply_weightnorm(self, list_layers):
"""Try apply weightnorm for all layer in list_layers."""
for i in range(len(list_layers)):
try:
layer_name = list_layers[i].name.lower()
if "conv1d" in layer_name or "dense" in layer_name:
list_layers[i] = WeightNormalization(list_layers[i])
except Exception:
pass
class TFMelGANGenerator(tf.keras.Model):
"""Tensorflow MelGAN generator module."""
def __init__(self, config, **kwargs):
"""Initialize TFMelGANGenerator module.
Args:
config: config object of Melgan generator.
"""
super().__init__(**kwargs)
        # check hyperparameters are valid
assert config.filters >= np.prod(config.upsample_scales)
assert config.filters % (2 ** len(config.upsample_scales)) == 0
# add initial layer
layers = []
layers += [
TFReflectionPad1d(
(config.kernel_size - 1) // 2,
padding_type=config.padding_type,
name="first_reflect_padding",
),
tf.keras.layers.Conv1D(
filters=config.filters,
kernel_size=config.kernel_size,
use_bias=config.use_bias,
kernel_initializer=get_initializer(config.initializer_seed),
),
]
for i, upsample_scale in enumerate(config.upsample_scales):
# add upsampling layer
layers += [
getattr(tf.keras.layers, config.nonlinear_activation)(
**config.nonlinear_activation_params
),
TFConvTranspose1d(
filters=config.filters // (2 ** (i + 1)),
kernel_size=upsample_scale * 2,
strides=upsample_scale,
padding="same",
is_weight_norm=config.is_weight_norm,
initializer_seed=config.initializer_seed,
name="conv_transpose_._{}".format(i),
),
]
            # add residual stack layers
for j in range(config.stacks):
layers += [
TFResidualStack(
kernel_size=config.stack_kernel_size,
filters=config.filters // (2 ** (i + 1)),
dilation_rate=config.stack_kernel_size ** j,
use_bias=config.use_bias,
nonlinear_activation=config.nonlinear_activation,
nonlinear_activation_params=config.nonlinear_activation_params,
is_weight_norm=config.is_weight_norm,
initializer_seed=config.initializer_seed,
name="residual_stack_._{}._._{}".format(i, j),
)
]
# add final layer
layers += [
getattr(tf.keras.layers, config.nonlinear_activation)(
**config.nonlinear_activation_params
),
TFReflectionPad1d(
(config.kernel_size - 1) // 2,
padding_type=config.padding_type,
name="last_reflect_padding",
),
tf.keras.layers.Conv1D(
filters=config.out_channels,
kernel_size=config.kernel_size,
use_bias=config.use_bias,
kernel_initializer=get_initializer(config.initializer_seed),
),
]
if config.use_final_nolinear_activation:
layers += [tf.keras.layers.Activation("tanh")]
if config.is_weight_norm is True:
self._apply_weightnorm(layers)
self.melgan = tf.keras.models.Sequential(layers)
def call(self, mels, **kwargs):
"""Calculate forward propagation.
Args:
            mels (Tensor): Input tensor (B, T, channels).
        Returns:
            Tensor: Output tensor (B, T * prod(upsample_scales), out_channels).
"""
return self.inference(mels)
@tf.function(
input_signature=[
tf.TensorSpec(shape=[None, None, 80], dtype=tf.float32, name="mels")
]
)
def inference(self, mels):
return self.melgan(mels)
@tf.function(
input_signature=[
tf.TensorSpec(shape=[1, None, 80], dtype=tf.float32, name="mels")
]
)
def inference_tflite(self, mels):
return self.melgan(mels)
def _apply_weightnorm(self, list_layers):
"""Try apply weightnorm for all layer in list_layers."""
for i in range(len(list_layers)):
try:
layer_name = list_layers[i].name.lower()
if "conv1d" in layer_name or "dense" in layer_name:
list_layers[i] = WeightNormalization(list_layers[i])
except Exception:
pass
def _build(self):
"""Build model by passing fake input."""
fake_mels = tf.random.uniform(shape=[1, 100, 80], dtype=tf.float32)
self(fake_mels)
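# Output-length arithmetic for the generator (illustrative): every
# TFConvTranspose1d multiplies the time axis by its upsample_scale, so one mel
# frame becomes prod(upsample_scales) samples. Assuming the typical scales
# [8, 8, 2, 2] from the config:
#   mels = tf.random.uniform([1, 100, 80])
#   audio = generator.inference(mels)   # [1, 100 * 8*8*2*2, 1] == [1, 25600, 1]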
class TFMelGANDiscriminator(tf.keras.layers.Layer):
"""Tensorflow MelGAN generator module."""
def __init__(
self,
out_channels=1,
kernel_sizes=[5, 3],
filters=16,
max_downsample_filters=1024,
use_bias=True,
downsample_scales=[4, 4, 4, 4],
nonlinear_activation="LeakyReLU",
nonlinear_activation_params={"alpha": 0.2},
padding_type="REFLECT",
is_weight_norm=True,
initializer_seed=0.02,
**kwargs
):
"""Initilize MelGAN discriminator module.
Args:
out_channels (int): Number of output channels.
kernel_sizes (list): List of two kernel sizes. The prod will be used for the first conv layer,
and the first and the second kernel sizes will be used for the last two layers.
For example if kernel_sizes = [5, 3], the first layer kernel size will be 5 * 3 = 15.
the last two layers' kernel size will be 5 and 3, respectively.
filters (int): Initial number of filters for conv layer.
max_downsample_filters (int): Maximum number of filters for downsampling layers.
use_bias (bool): Whether to add bias parameter in convolution layers.
downsample_scales (list): List of downsampling scales.
nonlinear_activation (str): Activation function module name.
nonlinear_activation_params (dict): Hyperparameters for activation function.
padding_type (str): Padding type (support only "REFLECT", "CONSTANT", "SYMMETRIC")
"""
super().__init__(**kwargs)
discriminator = []
# check kernel_size is valid
assert len(kernel_sizes) == 2
assert kernel_sizes[0] % 2 == 1
assert kernel_sizes[1] % 2 == 1
# add first layer
discriminator = [
TFReflectionPad1d(
(np.prod(kernel_sizes) - 1) // 2, padding_type=padding_type
),
tf.keras.layers.Conv1D(
filters=filters,
kernel_size=int(np.prod(kernel_sizes)),
use_bias=use_bias,
kernel_initializer=get_initializer(initializer_seed),
),
getattr(tf.keras.layers, nonlinear_activation)(
**nonlinear_activation_params
),
]
# add downsample layers
in_chs = filters
with tf.keras.utils.CustomObjectScope({"GroupConv1D": GroupConv1D}):
for downsample_scale in downsample_scales:
out_chs = min(in_chs * downsample_scale, max_downsample_filters)
discriminator += [
GroupConv1D(
filters=out_chs,
kernel_size=downsample_scale * 10 + 1,
strides=downsample_scale,
padding="same",
use_bias=use_bias,
groups=in_chs // 4,
kernel_initializer=get_initializer(initializer_seed),
)
]
discriminator += [
getattr(tf.keras.layers, nonlinear_activation)(
**nonlinear_activation_params
)
]
in_chs = out_chs
# add final layers
out_chs = min(in_chs * 2, max_downsample_filters)
discriminator += [
tf.keras.layers.Conv1D(
filters=out_chs,
kernel_size=kernel_sizes[0],
padding="same",
use_bias=use_bias,
kernel_initializer=get_initializer(initializer_seed),
)
]
discriminator += [
getattr(tf.keras.layers, nonlinear_activation)(
**nonlinear_activation_params
)
]
discriminator += [
tf.keras.layers.Conv1D(
filters=out_channels,
kernel_size=kernel_sizes[1],
padding="same",
use_bias=use_bias,
kernel_initializer=get_initializer(initializer_seed),
)
]
if is_weight_norm is True:
self._apply_weightnorm(discriminator)
        self.discriminator = discriminator
def call(self, x, **kwargs):
"""Calculate forward propagation.
Args:
x (Tensor): Input noise signal (B, T, 1).
Returns:
List: List of output tensors of each layer.
"""
outs = []
        for f in self.discriminator:
x = f(x)
outs += [x]
return outs
def _apply_weightnorm(self, list_layers):
"""Try apply weightnorm for all layer in list_layers."""
for i in range(len(list_layers)):
try:
layer_name = list_layers[i].name.lower()
if "conv1d" in layer_name or "dense" in layer_name:
list_layers[i] = WeightNormalization(list_layers[i])
except Exception:
pass
class TFMelGANMultiScaleDiscriminator(tf.keras.Model):
"""MelGAN multi-scale discriminator module."""
def __init__(self, config, **kwargs):
"""Initilize MelGAN multi-scale discriminator module.
Args:
config: config object for melgan discriminator
"""
super().__init__(**kwargs)
self.discriminator = []
# add discriminator
for i in range(config.scales):
self.discriminator += [
TFMelGANDiscriminator(
out_channels=config.out_channels,
kernel_sizes=config.kernel_sizes,
filters=config.filters,
max_downsample_filters=config.max_downsample_filters,
use_bias=config.use_bias,
downsample_scales=config.downsample_scales,
nonlinear_activation=config.nonlinear_activation,
nonlinear_activation_params=config.nonlinear_activation_params,
padding_type=config.padding_type,
is_weight_norm=config.is_weight_norm,
initializer_seed=config.initializer_seed,
name="melgan_discriminator_scale_._{}".format(i),
)
]
self.pooling = getattr(tf.keras.layers, config.downsample_pooling)(
**config.downsample_pooling_params
)
def call(self, x, **kwargs):
"""Calculate forward propagation.
Args:
x (Tensor): Input noise signal (B, T, 1).
Returns:
List: List of list of each discriminator outputs, which consists of each layer output tensors.
"""
outs = []
for f in self.discriminator:
outs += [f(x)]
x = self.pooling(x)
return outs
================================================
FILE: TensorFlowTTS/tensorflow_tts/models/moduls/__init__.py
================================================
from tensorflow_tts.models.moduls import (
core, core2, conditional, adain_en_de_code
)
================================================
FILE: TensorFlowTTS/tensorflow_tts/models/moduls/adain_en_de_code.py
================================================
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow_tts.models.moduls.conditional import MaskInstanceNormalization
def get_initializer(initializer_range=0.02):
"""Creates a `tf.initializers.truncated_normal` with the given range.
Args:
initializer_range: float, initializer range for stddev.
Returns:
TruncatedNormal initializer with stddev = `initializer_range`.
"""
return tf.keras.initializers.TruncatedNormal(stddev=initializer_range)
class ConvModul(tf.keras.layers.Layer):
def __init__(self, hidden_size, kernel_size, initializer_range, layer_norm_eps=1e-5, **kwargs):
super().__init__(**kwargs)
self.conv_0 = tf.keras.layers.Conv1D(
filters = hidden_size,
kernel_size = kernel_size,
kernel_initializer = get_initializer(initializer_range),
padding = 'same',
)
self.conv_1 = tf.keras.layers.Conv1D(
filters = hidden_size,
kernel_size = kernel_size,
kernel_initializer = get_initializer(initializer_range),
padding = 'same',
)
self.atc = tf.keras.layers.Activation(tf.nn.relu)
self.batch_norm = tf.keras.layers.BatchNormalization(epsilon=layer_norm_eps) # TODO
def call(self, x):
y = self.conv_0(x)
y = self.batch_norm(y)
y = self.atc(y)
y = self.conv_1(y)
return y
class EncConvBlock(tf.keras.layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.conv = ConvModul(
config.adain_filter_size,
config.enc_kernel_size,
config.initializer_range,
config.layer_norm_eps)
def call(self, x):
return x + self.conv(x)
class DecConvBlock(tf.keras.layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.dec_conv = ConvModul(
config.adain_filter_size,
config.dec_kernel_size,
config.initializer_range,
config.layer_norm_eps)
self.gen_conv = ConvModul(
config.adain_filter_size,
config.gen_kernel_size,
config.initializer_range,
config.layer_norm_eps)
def call(self, x):
y = self.dec_conv(x)
y = y + self.gen_conv(y)
return x + y
class AadINEncoder(tf.keras.Model):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.config = config
self.in_hidden_size = config.adain_filter_size # 256
self.out_hidden_size = config.content_latent_dim # content_latent_dim
self.n_conv_blocks = config.n_conv_blocks
self.in_conv = tf.keras.layers.Conv1D(
filters = self.in_hidden_size,
kernel_size = 1,
kernel_initializer = get_initializer(config.initializer_range),
padding = 'same',
)
self.out_conv = tf.keras.layers.Conv1D(
filters = self.out_hidden_size,
kernel_size = 1,
kernel_initializer = get_initializer(config.initializer_range),
padding = 'same',
)
self.inorm = MaskInstanceNormalization(config.layer_norm_eps)
self.conv_blocks = [
EncConvBlock(config) for _ in range(self.n_conv_blocks)
]
def call(self, x, mask):
means = []
stds = []
y = self.in_conv(x) # 80 -> 256
for block in self.conv_blocks:
y = block(y)
y, mean, std = self.inorm(y, mask, return_mean_std=True)
means.append(mean)
stds.append(std)
y = self.out_conv(y) # 256 -> 128 + 4
# TODO sigmoid
means.reverse()
stds.reverse()
return y, means, stds
class AdaINDecoder(tf.keras.Model):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.config = config
self.in_hidden_size = config.adain_filter_size # 256
self.out_hidden_size = config.num_mels # 80
self.n_conv_blocks = config.n_conv_blocks
self.in_conv = tf.keras.layers.Conv1D(
filters = self.in_hidden_size,
kernel_size = 1,
kernel_initializer = get_initializer(config.initializer_range),
padding = 'same',
)
self.out_conv = tf.keras.layers.Conv1D(
filters = self.out_hidden_size,
kernel_size = 1,
kernel_initializer = get_initializer(config.initializer_range),
padding = 'same',
)
self.inorm = MaskInstanceNormalization(config.layer_norm_eps)
self.conv_blocks = [
DecConvBlock(config) for _ in range(self.n_conv_blocks)
]
def call(self, enc, cond, mask):
_, means, stds = cond
# TODO
# y, means, stds = cond
# _, mean, std = self.inorm(y, mask, return_mean_std=True)
# enc = self.inorm(enc, mask)
# enc = enc * tf.expand_dims(std, 1) + tf.expand_dims(mean, 1)
y = self.in_conv(enc) # 132 -> 256
for block, mean, std in zip(self.conv_blocks, means, stds):
y = self.inorm(y, mask)
y = y * tf.expand_dims(std, 1) + tf.expand_dims(mean, 1)
y = block(y)
y = self.out_conv(y) # 256 -> 80
return y
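# Shape sketch for the encoder/decoder pair above (illustrative; the config
# stand-in is an assumption, not the project's real hyperparameters):
#   import types
#   cfg = types.SimpleNamespace(
#       adain_filter_size=256, content_latent_dim=132, n_conv_blocks=6,
#       enc_kernel_size=5, dec_kernel_size=5, gen_kernel_size=5,
#       initializer_range=0.02, layer_norm_eps=1e-5, num_mels=80)
#   enc, dec = AadINEncoder(cfg), AdaINDecoder(cfg)
#   mel = tf.random.normal([2, 120, 80])
#   mask = tf.ones([2, 120], tf.int32)
#   content, means, stds = enc(mel, mask)            # content: [2, 120, 132]
#   recon = dec(content, (None, means, stds), mask)  # recon:   [2, 120, 80]
# The encoder reverses its per-block (mean, std) lists, so the decoder
# re-injects them U-Net style: last-captured statistics first.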
================================================
FILE: TensorFlowTTS/tensorflow_tts/models/moduls/conditional.py
================================================
import tensorflow as tf
import tensorflow_addons as tfa
import numpy as np
def get_initializer(initializer_range=0.02):
"""Creates a `tf.initializers.truncated_normal` with the given range.
Args:
initializer_range: float, initializer range for stddev.
Returns:
TruncatedNormal initializer with stddev = `initializer_range`.
"""
return tf.keras.initializers.TruncatedNormal(stddev=initializer_range)
class MaskInstanceNormalization(tf.keras.layers.Layer):
def __init__(self, layer_norm_eps, **kwargs):
super().__init__(**kwargs)
self.layer_norm_eps = layer_norm_eps
def _cal_mean_std(self, inputs, mask):
        expand_mask = tf.cast(tf.expand_dims(mask, axis=2), inputs.dtype)
        sums = tf.math.reduce_sum(tf.cast(mask, inputs.dtype), axis=-1, keepdims=True)
        mean = tf.math.reduce_sum(inputs * expand_mask, axis=1) / sums
        std = tf.math.sqrt(
            tf.math.reduce_sum(
                tf.math.pow(inputs - tf.expand_dims(mean, 1), 2) * expand_mask, axis=1
            ) / sums + self.layer_norm_eps
        )
        return mean, std, expand_mask
def call(self, inputs, mask, return_mean_std=False):
'''
inputs: [B, T, hidden_size]
mask: [B, T]
'''
        mean, std, expand_mask = self._cal_mean_std(inputs, mask)
        outputs = (inputs - tf.expand_dims(mean, 1)) / tf.expand_dims(std, 1) * expand_mask
if return_mean_std:
return outputs, mean, std
else:
return outputs
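# Behaviour sketch (illustrative): statistics come only from unmasked frames,
# so padding cannot leak into the normalization:
#   inorm = MaskInstanceNormalization(layer_norm_eps=1e-5)
#   x = tf.random.normal([1, 10, 4])
#   mask = tf.sequence_mask([5], 10)   # only the first 5 frames are valid
#   y, mean, std = inorm(x, mask, return_mean_std=True)
#   # mean/std match tf.nn.moments(x[:, :5], axes=[1]) up to layer_norm_eps,
#   # and y is exactly zero on the 5 padded frames.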
# TODO
class ConditionalNormalization(tf.keras.layers.Layer):
def __init__(self, config, **kwargs):
super().__init__(**kwargs)
self.config = config
self.scale = tf.keras.layers.Dense(
config.hidden_size,
use_bias = False,
kernel_initializer = get_initializer(config.initializer_range),
name = "Scale",
)
self.mean = tf.keras.layers.Dense(
config.hidden_size,
use_bias = False,
kernel_initializer = get_initializer(config.initializer_range),
name = "Mean",
)
if config.conditional_norm_type == "Layer":
self.norm_layer = tf.keras.layers.LayerNormalization(
center = False,
scale = False,
epsilon = config.layer_norm_eps,
name = "LayerNorm",
)
elif config.conditional_norm_type == "Instance":
# self.norm_layer = tfa.layers.InstanceNormalization(
# center = False,
# scale = False,
# epsilon = config.layer_norm_eps,
# name = "InstanceNorm",
# )
self.norm_layer = MaskInstanceNormalization(config.layer_norm_eps)
else:
print(f"Not support norm type {config.conditional_norm_type} !")
exit(0)
def call(self, inputs, conds, mask):
'''
inputs: [B, T, hidden_size]
conds: [B, 1, C']
mask: [B, T]
'''
if self.config.conditional_norm_type == "Layer":
tmp = self.norm_layer(inputs)
elif self.config.conditional_norm_type == "Instance":
tmp = self.norm_layer(inputs, mask)
scale = self.scale(conds)
mean = self.mean(conds)
return tmp * scale + mean
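# Usage sketch (illustrative; the config stand-in is an assumption): the layer
# whitens the content with masked instance norm, then re-styles it with a scale
# and mean predicted from a per-utterance condition vector (AdaIN-style):
#   import types
#   cfg = types.SimpleNamespace(hidden_size=64, initializer_range=0.02,
#                               layer_norm_eps=1e-5,
#                               conditional_norm_type="Instance")
#   cn = ConditionalNormalization(cfg)
#   h = tf.random.normal([2, 50, 64])    # content features
#   cond = tf.random.normal([2, 1, 16])  # one style vector per utterance
#   out = cn(h, cond, tf.ones([2, 50]))  # -> [2, 50, 64]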
================================================
FILE: TensorFlowTTS/tensorflow_tts/models/moduls/core.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 The FastSpeech Authors, The HuggingFace Inc. team and Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tensorflow Model modules for FastSpeech."""
import numpy as np
import tensorflow as tf
import scipy.stats
def get_initializer(initializer_range=0.02):
"""Creates a `tf.initializers.truncated_normal` with the given range.
Args:
initializer_range: float, initializer range for stddev.
Returns:
TruncatedNormal initializer with stddev = `initializer_range`.
"""
return tf.keras.initializers.TruncatedNormal(stddev=initializer_range)
def gelu(x):
"""Gaussian Error Linear unit."""
cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))
return x * cdf
def gelu_new(x):
"""Smoother gaussian Error Linear Unit."""
cdf = 0.5 * (1.0 + tf.tanh((np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
return x * cdf
def swish(x):
"""Swish activation function."""
return x * tf.sigmoid(x)
def mish(x):
return x * tf.math.tanh(tf.math.softplus(x))
ACT2FN = {
"identity": tf.keras.layers.Activation("linear"),
"tanh": tf.keras.layers.Activation("tanh"),
"gelu": tf.keras.layers.Activation(gelu),
"relu": tf.keras.activations.relu,
"swish": tf.keras.layers.Activation(swish),
"gelu_new": tf.keras.layers.Activation(gelu_new),
"mish": tf.keras.layers.Activation(mish),
}
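# Descriptive note: gelu is the exact erf form, while gelu_new is the tanh
# approximation popularized by BERT/GPT; they agree closely, e.g. both map
# an input of 1.0 to ~0.841.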
class TFEmbedding(tf.keras.layers.Embedding):
"""Faster version of embedding."""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def call(self, inputs):
inputs = tf.cast(inputs, tf.int32)
outputs = tf.gather(self.embeddings, inputs)
return outputs
class TFFastSpeechEmbeddings(tf.keras.layers.Layer):
"""Construct charactor/phoneme/positional/speaker embeddings."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.vocab_size = config.vocab_size
self.hidden_size = config.encoder_self_attention_params.hidden_size
self.initializer_range = config.initializer_range
self.config = config
def build(self, input_shape):
"""Build shared charactor/phoneme embedding layers."""
with tf.name_scope("charactor_embeddings"):
self.charactor_embeddings = self.add_weight(
"weight",
shape=[self.vocab_size, self.hidden_size],
initializer=get_initializer(self.initializer_range),
)
super().build(input_shape)
def call(self, input_ids):
return tf.gather(self.charactor_embeddings, input_ids)
class TFFastSpeechSelfAttention(tf.keras.layers.Layer):
"""Self attention module for fastspeech."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
if config.hidden_size % config.num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (config.hidden_size, config.num_attention_heads)
)
self.output_attentions = config.output_attentions
self.num_attention_heads = config.num_attention_heads
self.all_head_size = self.num_attention_heads * config.attention_head_size
self.query = tf.keras.layers.Dense(
self.all_head_size,
kernel_initializer=get_initializer(config.initializer_range),
name="query",
)
self.key = tf.keras.layers.Dense(
self.all_head_size,
kernel_initializer=get_initializer(config.initializer_range),
name="key",
)
self.value = tf.keras.layers.Dense(
self.all_head_size,
kernel_initializer=get_initializer(config.initializer_range),
name="value",
)
self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)
self.config = config
# TODO
# self.half_win = config.local_attention_halfwin_size
# self.frames_max = 100
# self.local_maxs = self._local_attention_mask()
# self.local_ones = tf.ones([self.frames_max, self.frames_max], tf.float32)
def transpose_for_scores(self, x, batch_size):
"""Transpose to calculate attention scores."""
x = tf.reshape(
x,
(batch_size, -1, self.num_attention_heads, self.config.attention_head_size),
)
return tf.transpose(x, perm=[0, 2, 1, 3])
# def _local_attention_mask(self, frames_num):
# xv, yv = tf.meshgrid(tf.range(frames_num), tf.range(frames_num), indexing="ij")
# f32_matrix = tf.cast(yv - xv, tf.float32)
# val = f32_matrix[0][self.half_win]
# local1 = tf.math.greater_equal(f32_matrix, -val)
# local2 = tf.math.less_equal(f32_matrix, val)
# return tf.cast(tf.logical_and(local1, local2), tf.float32)
def call(self, inputs, training=False):
"""Call logic."""
hidden_states, attention_mask = inputs
batch_size = tf.shape(hidden_states)[0]
mixed_query_layer = self.query(hidden_states)
mixed_key_layer = self.key(hidden_states)
mixed_value_layer = self.value(hidden_states)
query_layer = self.transpose_for_scores(mixed_query_layer, batch_size)
key_layer = self.transpose_for_scores(mixed_key_layer, batch_size)
value_layer = self.transpose_for_scores(mixed_value_layer, batch_size)
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
dk = tf.cast(tf.shape(key_layer)[-1], attention_scores.dtype) # scale attention_scores
attention_scores = attention_scores / tf.math.sqrt(dk)
if attention_mask is not None:
# extended_attention_masks for self attention encoder.
extended_attention_mask = attention_mask[:, tf.newaxis, tf.newaxis, :]
extended_attention_mask = tf.cast(extended_attention_mask, attention_scores.dtype)
extended_attention_mask = (1.0 - extended_attention_mask) * -1e9
attention_scores = attention_scores + extended_attention_mask
# TODO
# frames_num = tf.shape(attention_mask)[-1]
# local_attention_mask = tf.cond(tf.greater(frames_num, self.half_win + 1),
# lambda: self._local_attention_mask(frames_num),
# lambda: tf.ones([frames_num, frames_num], tf.float32))
# local_attention_mask = (1.0 - local_attention_mask) * -1e9
# attention_scores = attention_scores + local_attention_mask
# Normalize the attention scores to probabilities.
attention_probs = tf.nn.softmax(attention_scores, axis=-1)
attention_probs = self.dropout(attention_probs, training=training)
context_layer = tf.matmul(attention_probs, value_layer)
context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])
context_layer = tf.reshape(context_layer, (batch_size, -1, self.all_head_size))
outputs = (
(context_layer, attention_probs)
if self.output_attentions
else (context_layer,)
)
return outputs
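# Masking sketch (illustrative): adding -1e9 at padded positions drives their
# softmax weight to ~0, e.g.
#   scores = tf.constant([[2.0, 1.0, 3.0]])      # 3rd position is padding
#   mask = tf.constant([[1.0, 1.0, 0.0]])
#   tf.nn.softmax(scores + (1.0 - mask) * -1e9)  # -> ~[[0.73, 0.27, 0.00]]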
class TFFastSpeechSelfOutput(tf.keras.layers.Layer):
"""Fastspeech output of self attention module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.dense = tf.keras.layers.Dense(
config.hidden_size,
kernel_initializer=get_initializer(config.initializer_range),
name="dense",
)
self.LayerNorm = tf.keras.layers.LayerNormalization(
epsilon=config.layer_norm_eps, name="LayerNorm"
)
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
def call(self, inputs, training=False):
"""Call logic."""
hidden_states, input_tensor = inputs
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states, training=training)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
class TFFastSpeechAttention(tf.keras.layers.Layer):
"""Fastspeech attention module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.self_attention = TFFastSpeechSelfAttention(config, name="self")
self.dense_output = TFFastSpeechSelfOutput(config, name="output")
def call(self, inputs, training=False):
input_tensor, attention_mask = inputs
self_outputs = self.self_attention(
[input_tensor, attention_mask], training=training
)
attention_output = self.dense_output(
[self_outputs[0], input_tensor], training=training
)
masked_attention_output = attention_output * tf.cast(
tf.expand_dims(attention_mask, 2), dtype=attention_output.dtype
)
outputs = (masked_attention_output,) + self_outputs[
1:
] # add attentions if we output them
return outputs
class TFFastSpeechIntermediate(tf.keras.layers.Layer):
"""Intermediate representation module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.conv1d_1 = tf.keras.layers.Conv1D(
config.intermediate_size,
kernel_size=config.intermediate_kernel_size,
kernel_initializer=get_initializer(config.initializer_range),
padding="same",
name="conv1d_1",
)
self.conv1d_2 = tf.keras.layers.Conv1D(
config.hidden_size,
kernel_size=config.intermediate_kernel_size,
kernel_initializer=get_initializer(config.initializer_range),
padding="same",
name="conv1d_2",
)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
def call(self, inputs):
"""Call logic."""
hidden_states, attention_mask = inputs
hidden_states = self.conv1d_1(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
hidden_states = self.conv1d_2(hidden_states)
masked_hidden_states = hidden_states * tf.cast(
tf.expand_dims(attention_mask, 2), dtype=hidden_states.dtype
)
return masked_hidden_states
class TFFastSpeechOutput(tf.keras.layers.Layer):
"""Output module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.LayerNorm = tf.keras.layers.LayerNormalization(
epsilon=config.layer_norm_eps, name="LayerNorm"
)
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
def call(self, inputs, training=False):
"""Call logic."""
hidden_states, input_tensor = inputs
hidden_states = self.dropout(hidden_states, training=training)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
class TFFastSpeechLayer(tf.keras.layers.Layer):
"""Fastspeech module (FFT module on the paper)."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.attention = TFFastSpeechAttention(config, name="attention")
self.intermediate = TFFastSpeechIntermediate(config, name="intermediate")
self.bert_output = TFFastSpeechOutput(config, name="output")
def call(self, inputs, training=False):
"""Call logic."""
hidden_states, attention_mask = inputs
attention_outputs = self.attention(
[hidden_states, attention_mask], training=training
)
attention_output = attention_outputs[0]
intermediate_output = self.intermediate(
[attention_output, attention_mask], training=training
)
layer_output = self.bert_output(
[intermediate_output, attention_output], training=training
)
masked_layer_output = layer_output * tf.cast(
tf.expand_dims(attention_mask, 2), dtype=layer_output.dtype
)
outputs = (masked_layer_output,) + attention_outputs[
1:
] # add attentions if we output them
return outputs
class TFFastSpeechEncoder(tf.keras.layers.Layer):
"""Fast Speech encoder module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.output_attentions = config.output_attentions
self.output_hidden_states = config.output_hidden_states
self.layer = [
TFFastSpeechLayer(config, name="layer_._{}".format(i))
for i in range(config.num_hidden_layers)
]
def call(self, inputs, training=False):
"""Call logic."""
hidden_states, attention_mask = inputs
all_hidden_states = ()
all_attentions = ()
for _, layer_module in enumerate(self.layer):
if self.output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
layer_outputs = layer_module(
[hidden_states, attention_mask], training=training
)
hidden_states = layer_outputs[0]
if self.output_attentions:
all_attentions = all_attentions + (layer_outputs[1],)
# Add last layer
if self.output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
outputs = (hidden_states,)
if self.output_hidden_states:
outputs = outputs + (all_hidden_states,)
if self.output_attentions:
outputs = outputs + (all_attentions,)
return outputs # outputs, (hidden states), (attentions)
class TFFastSpeechDecoder(TFFastSpeechEncoder):
"""Fast Speech decoder module."""
def __init__(self, config, **kwargs):
self.is_compatible_encoder = kwargs.pop("is_compatible_encoder", True)
super().__init__(config, **kwargs)
self.config = config
if self.is_compatible_encoder is False:
self.project_compatible_decoder = tf.keras.layers.Dense(
units=config.hidden_size, name="project_compatible_decoder"
)
def call(self, inputs, training=False):
hidden_states, encoder_mask = inputs
if self.is_compatible_encoder is False:
hidden_states = self.project_compatible_decoder(hidden_states)
return super().call([hidden_states, encoder_mask], training=training)
class TFTacotronPostnet(tf.keras.layers.Layer):
"""Tacotron-2 postnet."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.conv_batch_norm = []
for i in range(config.n_conv_postnet):
conv = tf.keras.layers.Conv1D(
filters=config.postnet_conv_filters
if i < config.n_conv_postnet - 1
else config.num_mels,
kernel_size=config.postnet_conv_kernel_sizes,
padding="same",
name="conv_._{}".format(i),
)
batch_norm = tf.keras.layers.BatchNormalization(
axis=-1, name="batch_norm_._{}".format(i)
)
self.conv_batch_norm.append((conv, batch_norm))
self.dropout = tf.keras.layers.Dropout(
rate=config.postnet_dropout_rate, name="dropout"
)
self.activation = [tf.nn.tanh] * (config.n_conv_postnet - 1) + [tf.identity]
def call(self, inputs, training=False):
"""Call logic."""
outputs, mask = inputs
extended_mask = tf.cast(tf.expand_dims(mask, axis=2), outputs.dtype)
for i, (conv, bn) in enumerate(self.conv_batch_norm):
outputs = conv(outputs)
outputs = bn(outputs)
outputs = self.activation[i](outputs)
outputs = self.dropout(outputs, training=training)
return outputs * extended_mask
# TODO Drop infer training=False
class TFFastSpeechVariantPredictor(tf.keras.layers.Layer):
"""FastSpeech variant predictor module."""
def __init__(self, config, sub_name="f0", is_sigmod=False, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.is_sigmod = is_sigmod
self.conv_layers = []
for i in range(config.num_variant_conv_layers):
self.conv_layers.append(
tf.keras.layers.Conv1D(
config.variant_predictor_filters,
config.variant_predictor_kernel_sizes,
padding="same",
name="{}_conv_._{}".format(sub_name, i),
)
)
self.conv_layers.append(tf.keras.layers.Activation(tf.nn.relu))
self.conv_layers.append(
tf.keras.layers.LayerNormalization(
epsilon=config.layer_norm_eps, name="{}_LayerNorm_._{}".format(sub_name, i)
)
)
self.conv_layers.append(
tf.keras.layers.Dropout(config.variant_predictor_dropout_probs)
)
self.conv_layers_sequence = tf.keras.Sequential(self.conv_layers, name=sub_name)
self.output_layer = tf.keras.layers.Dense(1)
if self.is_sigmod:
self.sigmod_layer = tf.keras.layers.Activation(tf.nn.sigmoid)
def call(self, inputs, training=False):
"""Call logic."""
encoder_hidden_states, attention_mask = inputs
attention_mask = tf.cast(tf.expand_dims(attention_mask, 2), encoder_hidden_states.dtype)
# mask encoder hidden states
masked_encoder_hidden_states = encoder_hidden_states * attention_mask
        # pass through the conv layers
outputs = self.conv_layers_sequence(masked_encoder_hidden_states)
outputs = self.output_layer(outputs)
if self.is_sigmod:
outputs = self.sigmod_layer(outputs)
masked_outputs = outputs * attention_mask
return tf.squeeze(masked_outputs, -1)
class TFFastSpeechDurationPredictor(tf.keras.layers.Layer):
"""FastSpeech duration predictor module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.conv_layers = []
for i in range(config.num_duration_conv_layers):
self.conv_layers.append(
tf.keras.layers.Conv1D(
config.duration_predictor_filters,
config.duration_predictor_kernel_sizes,
padding="same",
name="conv_._{}".format(i),
)
)
self.conv_layers.append(tf.keras.layers.Activation(tf.nn.relu))
self.conv_layers.append(
tf.keras.layers.LayerNormalization(
epsilon=config.layer_norm_eps, name="LayerNorm_._{}".format(i)
)
)
self.conv_layers.append(
tf.keras.layers.Dropout(config.duration_predictor_dropout_probs)
)
self.conv_layers_sequence = tf.keras.Sequential(self.conv_layers)
self.output_layer = tf.keras.layers.Dense(1)
def call(self, inputs, training=False):
"""Call logic."""
encoder_hidden_states, attention_mask = inputs
attention_mask = tf.cast(tf.expand_dims(attention_mask, 2), encoder_hidden_states.dtype)
# mask encoder hidden states
masked_encoder_hidden_states = encoder_hidden_states * attention_mask
        # pass through the conv layers
outputs = self.conv_layers_sequence(masked_encoder_hidden_states)
outputs = self.output_layer(outputs)
masked_outputs = outputs * attention_mask
# return tf.squeeze(tf.nn.relu(masked_outputs), -1) # make sure positive value.
return tf.squeeze(masked_outputs, -1)
class TFFastSpeechLengthRegulator(tf.keras.layers.Layer):
"""FastSpeech lengthregulator module."""
def __init__(self, config, **kwargs):
"""Init variables."""
self.enable_tflite_convertible = kwargs.pop("enable_tflite_convertible", False)
super().__init__(**kwargs)
self.config = config
self.addfeatures_num = 0
if config.addfeatures_num > 0:
self._compute_coarse_coding_features()
self.addfeatures_num = config.addfeatures_num
if config.isaddur:
self.addfeatures_num += 1
def _compute_coarse_coding_features(self):
npoints = 600
x1 = np.linspace(-1.5, 1.5, npoints)
x2 = np.linspace(-1.0, 2.0, npoints)
x3 = np.linspace(-0.5, 2.5, npoints)
x4 = np.linspace(0.0, 3.0, npoints)
mu1 = 0.0
mu2 = 0.5
mu3 = 1.0
mu4 = 1.5
sigma = 0.4
self.cc_features0 = tf.convert_to_tensor(scipy.stats.norm.pdf(x1, mu1, sigma), tf.float32)
self.cc_features1 = tf.convert_to_tensor(scipy.stats.norm.pdf(x2, mu2, sigma), tf.float32)
self.cc_features2 = tf.convert_to_tensor(scipy.stats.norm.pdf(x3, mu3, sigma), tf.float32)
self.cc_features3 = tf.convert_to_tensor(scipy.stats.norm.pdf(x4, mu4, sigma), tf.float32)
def call(self, inputs, training=False):
"""Call logic.
Args:
1. encoder_hidden_states, Tensor (float32) shape [batch_size, length, hidden_size]
2. durations_gt, Tensor (float32/int32) shape [batch_size, length]
"""
encoder_hidden_states, durations_gt = inputs
outputs, encoder_masks = self._length_regulator(
encoder_hidden_states, durations_gt
)
return outputs, encoder_masks
def _length_regulator(self, encoder_hidden_states, durations_gt):
"""Length regulator logic."""
sum_durations = tf.reduce_sum(durations_gt, axis=-1) # [batch_size]
max_durations = tf.reduce_max(sum_durations)
input_shape = tf.shape(encoder_hidden_states)
batch_size = input_shape[0]
hidden_size = input_shape[-1]
# initialize output hidden states and encoder masking.
# TODO add tflite_infer for coarse_coding
if self.enable_tflite_convertible:
# There is only 1 batch in inference, so we don't have to use
# `tf.While` op with 3-D output tensor.
repeats = durations_gt[0]
real_length = tf.reduce_sum(repeats)
pad_size = max_durations - real_length
# masks : [max_durations]
masks = tf.sequence_mask([real_length], max_durations, dtype=tf.int32)
repeat_encoder_hidden_states = tf.repeat(
encoder_hidden_states[0], repeats=repeats, axis=0
)
repeat_encoder_hidden_states = tf.expand_dims(
tf.pad(repeat_encoder_hidden_states, [[0, pad_size], [0, 0]]), 0
) # [1, max_durations, hidden_size]
outputs = repeat_encoder_hidden_states
encoder_masks = masks
else:
outputs = tf.zeros(shape=[0, max_durations, hidden_size + self.addfeatures_num], dtype=encoder_hidden_states.dtype)
# outputs = tf.zeros(shape=[0, max_durations, hidden_size], dtype=encoder_hidden_states.dtype)
encoder_masks = tf.zeros(shape=[0, max_durations], dtype=tf.int32)
def condition(
i,
batch_size,
outputs,
encoder_masks,
encoder_hidden_states,
durations_gt,
max_durations,
):
return tf.less(i, batch_size)
def body(
i,
batch_size,
outputs,
encoder_masks,
encoder_hidden_states,
durations_gt,
max_durations,
):
############################### ori ##################################
# repeats = durations_gt[i]
# real_length = tf.reduce_sum(repeats)
# pad_size = max_durations - real_length
# masks = tf.sequence_mask([real_length], max_durations, dtype=tf.int32)
# repeat_encoder_hidden_states = tf.repeat(
# encoder_hidden_states[i], repeats=repeats, axis=0
# )
# repeat_encoder_hidden_states = tf.expand_dims(
# tf.pad(repeat_encoder_hidden_states, [[0, pad_size], [0, 0]]), 0
# ) # [1, max_durations, hidden_size]
# outputs = tf.concat([outputs, repeat_encoder_hidden_states], axis=0)
# encoder_masks = tf.concat([encoder_masks, masks], axis=0)
############################### add duration info ##################################
repeats = durations_gt[i]
real_length = tf.reduce_sum(repeats)
pad_size = max_durations - real_length
masks = tf.sequence_mask([real_length], max_durations, dtype=tf.int32)
repeat_encoder_hidden_states = tf.repeat(
encoder_hidden_states[i], repeats=repeats, axis=0
)
if self.addfeatures_num > 0:
# duration sum per phone
durdur = tf.repeat(repeats[:, tf.newaxis], repeats=repeats, axis=0)
durdur = tf.cast(durdur, encoder_hidden_states.dtype)
# acc duration
maskbool = tf.sequence_mask(repeats, tf.reduce_max(repeats))
durindex = tf.cumsum(tf.cast(maskbool, encoder_hidden_states.dtype), -1)
durindex = tf.boolean_mask(durindex, maskbool)[:, tf.newaxis]
# duration/(sum)
durindex = (durindex - 1) / durdur
# coarse_coding
indexs = tf.cast(durindex*100, tf.int32)
cc0 = tf.gather(self.cc_features0, 400+indexs)
cc1 = tf.gather(self.cc_features1, 300+indexs)
cc2 = tf.gather(self.cc_features2, 200+indexs)
cc3 = tf.gather(self.cc_features3, 100+indexs)
ccc = tf.concat([cc0, cc1, cc2, cc3], axis=-1)
if self.config.isaddur:
repeat_encoder_hidden_states = tf.concat([repeat_encoder_hidden_states, durdur], -1)
repeat_encoder_hidden_states = tf.concat([repeat_encoder_hidden_states, ccc], -1)
repeat_encoder_hidden_states = tf.expand_dims(
tf.pad(repeat_encoder_hidden_states, [[0, pad_size], [0, 0]]), 0
) # [1, max_durations, hidden_size]
outputs = tf.concat([outputs, repeat_encoder_hidden_states], axis=0)
encoder_masks = tf.concat([encoder_masks, masks], axis=0)
return [
i + 1,
batch_size,
outputs,
encoder_masks,
encoder_hidden_states,
durations_gt,
max_durations,
]
# initialize iteration i.
i = tf.constant(0, dtype=tf.int32)
_, _, outputs, encoder_masks, _, _, _, = tf.while_loop(
condition,
body,
[
i,
batch_size,
outputs,
encoder_masks,
encoder_hidden_states,
durations_gt,
max_durations,
],
shape_invariants=[
i.get_shape(),
batch_size.get_shape(),
tf.TensorShape(
[
None,
None,
self.config.content_latent_dim,
]
),
tf.TensorShape([None, None]),
encoder_hidden_states.get_shape(),
durations_gt.get_shape(),
max_durations.get_shape(),
],
)
return outputs, encoder_masks
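# Walk-through of the coarse-coding branch above (illustrative) for a single
# utterance with per-phone durations [2, 3]:
#   repeats = tf.constant([2, 3])
#   durdur = tf.repeat(repeats[:, tf.newaxis], repeats=repeats, axis=0)
#   # -> [[2], [2], [3], [3], [3]]: each frame carries its phone's total length
#   maskbool = tf.sequence_mask(repeats, tf.reduce_max(repeats))
#   durindex = tf.cumsum(tf.cast(maskbool, tf.float32), -1)
#   durindex = tf.boolean_mask(durindex, maskbool)[:, tf.newaxis]
#   rel = (durindex - 1) / tf.cast(durdur, tf.float32)
#   # -> [0, 1/2, 0, 1/3, 2/3]: each frame's relative position inside its
#   # phone, which then indexes the four shifted Gaussian tables cc_features0..3.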
================================================
FILE: TensorFlowTTS/tensorflow_tts/models/moduls/core2.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2020 The FastSpeech Authors, The HuggingFace Inc. team and Minh Nguyen (@dathudeptrai)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tensorflow Model modules for FastSpeech."""
import tensorflow as tf
from tensorflow_tts.models.moduls.core import *
from tensorflow_tts.models.moduls.conditional import ConditionalNormalization
class TFFastSpeechConditionalSelfOutput(tf.keras.layers.Layer):
"""Fastspeech output of self attention module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.dense = tf.keras.layers.Dense(
config.hidden_size,
kernel_initializer=get_initializer(config.initializer_range),
name="dense",
)
self.normlayer = ConditionalNormalization(config)
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
def call(self, inputs, training=False):
'''
hidden_states: [B, T, C]
input_tensor: [B, T, C]
        conds: [B, 1, C']
        attention_mask: [B, T]
'''
hidden_states, input_tensor, conds, attention_mask = inputs
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states, training=training)
hidden_states = self.normlayer(hidden_states + input_tensor, conds, attention_mask)
return hidden_states
class TFFastSpeechConditionalAttention(tf.keras.layers.Layer):
"""Fastspeech attention module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.self_attention = TFFastSpeechSelfAttention(config, name="self")
self.dense_output = TFFastSpeechConditionalSelfOutput(config, name="output")
def call(self, inputs, training=False):
'''
input_tensor: [B, T, C]
conds: [B, 1, C']
attention_mask: [B, T]
'''
input_tensor, conds, attention_mask = inputs
self_outputs = self.self_attention(
[input_tensor, attention_mask], training=training
)
attention_output = self.dense_output(
[self_outputs[0], input_tensor, conds, attention_mask], training=training
)
masked_attention_output = attention_output * tf.cast(
tf.expand_dims(attention_mask, 2), dtype=attention_output.dtype
)
outputs = (masked_attention_output,) + self_outputs[
1:
] # add attentions if we output them
return outputs
class TFFastSpeechConditionalOutput(tf.keras.layers.Layer):
"""Output module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.normlayer = ConditionalNormalization(config)
self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
def call(self, inputs, training=False):
'''
hidden_states: [B, T, C]
input_tensor: [B, T, C]
        conds: [B, 1, C']
        attention_mask: [B, T]
'''
hidden_states, input_tensor, conds, attention_mask = inputs
hidden_states = self.dropout(hidden_states, training=training)
hidden_states = self.normlayer(hidden_states + input_tensor, conds, attention_mask)
return hidden_states
class TFFastSpeechConditionalLayer(tf.keras.layers.Layer):
"""Fastspeech module (FFT module on the paper)."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.attention = TFFastSpeechConditionalAttention(config, name="attention")
self.intermediate = TFFastSpeechIntermediate(config, name="intermediate")
self.bert_output = TFFastSpeechConditionalOutput(config, name="output")
def call(self, inputs, training=False):
'''
hidden_states: [B, T, C]
conds: [B, 1, C']
attention_mask: [B, T]
'''
hidden_states, conds, attention_mask = inputs
attention_outputs = self.attention(
[hidden_states, conds, attention_mask], training=training
)
attention_output = attention_outputs[0]
intermediate_output = self.intermediate(
[attention_output, attention_mask], training=training
)
layer_output = self.bert_output(
[intermediate_output, attention_output, conds, attention_mask], training=training
)
masked_layer_output = layer_output * tf.cast(
tf.expand_dims(attention_mask, 2), dtype=layer_output.dtype
)
outputs = (masked_layer_output,) + attention_outputs[
1:
] # add attentions if we output them
return outputs
class TFFastSpeechConditionalEncoder(tf.keras.layers.Layer):
"""Fast Speech encoder module."""
def __init__(self, config, **kwargs):
"""Init variables."""
super().__init__(**kwargs)
self.output_attentions = config.output_attentions
self.output_hidden_states = config.output_hidden_states
self.layer = [
TFFastSpeechConditionalLayer(config, name="layer_._{}".format(i))
for i in range(config.num_hidden_layers)
]
def call(self, inputs, training=False):
'''
hidden_states: [B, T, C]
conds: [B, 1, C']
attention_mask: [B, T]
'''
hidden_states, conds, attention_mask = inputs
all_hidden_states = ()
all_attentions = ()
for _, layer_module in enumerate(self.layer):
if self.output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
layer_outputs = layer_module(
[hidden_states, conds, attention_mask], training=training
)
hidden_states = layer_outputs[0]
if self.output_attentions:
all_attentions = all_attentions + (layer_outputs[1],)
# Add last layer
if self.output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
outputs = (hidden_states,)
if self.output_hidden_states:
outputs = outputs + (all_hidden_states,)
if self.output_attentions:
outputs = outputs + (all_attentions,)
return outputs # outputs, (hidden states), (attentions)
class TFFastSpeechConditionalDecoder(TFFastSpeechConditionalEncoder):
"""Fast Speech decoder module."""
def __init__(self, config, **kwargs):
self.is_compatible_encoder = kwargs.pop("is_compatible_encoder", True)
super().__init__(config, **kwargs)
self.config = config
if self.is_compatible_encoder is False:
self.project_compatible_decoder = tf.keras.layers.Dense(
units=config.hidden_size, name="project_compatible_decoder"
)
def call(self, inputs, training=False):
'''
hidden_states: [B, T, C]
conds: [B, 1, C']
encoder_mask: [B, T]
'''
hidden_states, conds, encoder_mask = inputs
if self.is_compatible_encoder is False:
hidden_states = self.project_compatible_decoder(hidden_states)
return super().call([hidden_states, conds, encoder_mask], training=training)
================================================
FILE: TensorFlowTTS/tensorflow_tts/models/unetts.py
================================================
import tensorflow as tf
import numpy as np
from tensorflow_tts.models.moduls.core import (
TFFastSpeechEmbeddings,
TFFastSpeechEncoder,
TFFastSpeechDecoder,
TFTacotronPostnet,
TFFastSpeechLengthRegulator,
TFFastSpeechVariantPredictor,
TFFastSpeechDurationPredictor
)
from tensorflow_tts.models.moduls.core2 import TFFastSpeechConditionalDecoder
from tensorflow_tts.models.moduls.adain_en_de_code import (
AadINEncoder, AdaINDecoder
)
'''
###############################################################################
############################# Duration #######################################
###############################################################################
'''
class TFUNETTSDuration(tf.keras.Model):
def __init__(self, config, **kwargs):
"""Init layers for UNETTSDuration."""
self.enable_tflite_convertible = kwargs.pop("enable_tflite_convertible", False)
super().__init__(**kwargs)
self.embeddings = TFFastSpeechEmbeddings(config, name="embeddings")
self.encoder = TFFastSpeechEncoder(
config.encoder_self_attention_params, name = "encoder"
)
self.duration_predictor = TFFastSpeechDurationPredictor(
config, name = "duration_predictor"
)
self.duration_stat_cal = tf.keras.layers.Dense(4, use_bias=False,
kernel_initializer=tf.constant_initializer(
[[0.97, 0.01, 0.01, 0.01],
[0.01, 0.97, 0.01, 0.01],
[0.01, 0.01, 0.97, 0.01],
[0.01, 0.01, 0.01, 0.97]]
),
kernel_constraint=tf.keras.constraints.NonNeg(),
name="duration_stat_cal")
self.setup_inference_fn()
def _build(self):
"""Dummy input for building model."""
# fake inputs
char_ids = tf.convert_to_tensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]], tf.int32)
duration_stat = tf.convert_to_tensor([[1., 1., 1., 1.]], tf.float32)
self(char_ids, duration_stat)
def call(
self, char_ids, duration_stat, training=False, **kwargs,
):
"""Call logic."""
attention_mask = tf.math.not_equal(char_ids, 0)
        sheng_mask = char_ids < 27  # ids below 27: shengmu (initial) tokens
        yun_mask = char_ids > 26    # ids 27 and above: yunmu (final) tokens
duration_stat = self.duration_stat_cal(duration_stat)
sheng_mean, sheng_std, yun_mean, yun_std = \
duration_stat[:,0][:, None], duration_stat[:,1][:, None], duration_stat[:,2][:, None], duration_stat[:,3][:, None]
embedding_output = self.embeddings(char_ids)
encoder_output = self.encoder([embedding_output, attention_mask], training=training)
last_encoder_hidden_states = encoder_output[0]
duration_outputs = self.duration_predictor([last_encoder_hidden_states, attention_mask])
sheng_outputs = duration_outputs * sheng_std + sheng_mean
sheng_outputs = sheng_outputs * tf.cast(sheng_mask, tf.float32)
yun_outputs = duration_outputs * yun_std + yun_mean
yun_outputs = yun_outputs * tf.cast(yun_mask, tf.float32)
duration_outp
================================================
SYMBOL INDEX (510 symbols across 40 files)
================================================
FILE: TensorFlowTTS/tensorflow_tts/audio_process/audio.py
function preprocess_wav (line 29) | def preprocess_wav(fpath_or_wav: Union[str, Path, np.ndarray],
function normalize_volume (line 70) | def normalize_volume(wav, ratio=0.6):
function sil_pad (line 73) | def sil_pad(wav, pad_length=100):
function trim_long_silences (line 77) | def trim_long_silences(wav, vad_window_length, vad_moving_average_width,...
function melbasis_make (line 119) | def melbasis_make(sr=16000, n_fft=1024, n_mels=80, fmin=80, fmax=7600):
function mel_make (line 122) | def mel_make(filepath: str, sr=16000, n_fft=1024, framesize=256, mel_bas...
FILE: TensorFlowTTS/tensorflow_tts/audio_process/audio_spec.py
function preemphasis (line 7) | def preemphasis(wav, k, preemphasize=True):
function inv_preemphasis (line 12) | def inv_preemphasis(wav, k, inv_preemphasize=True):
class AudioMelSpec (line 17) | class AudioMelSpec():
method __init__ (line 21) | def __init__(
method _mel_basis_create (line 54) | def _mel_basis_create(self):
method _stft (line 58) | def _stft(self, y):
method _istft (line 61) | def _istft(self, y):
method _linear_to_mel (line 64) | def _linear_to_mel(self, spectogram):
method _mel_to_linear (line 67) | def _mel_to_linear(self, mel_spectrogram):
method _amp_to_db (line 70) | def _amp_to_db(self, x):
method _db_to_amp (line 74) | def _db_to_amp(self, x):
method _normalize (line 77) | def _normalize(self, S):
method _denormalize (line 91) | def _denormalize(self, D):
method _griffin_lim (line 105) | def _griffin_lim(self, S):
method load_wav (line 117) | def load_wav(self, wav_fpath):
method save_wav (line 123) | def save_wav(self, wav, fpath):
method melspectrogram (line 128) | def melspectrogram(self, wav):
method inv_mel_spectrogram (line 136) | def inv_mel_spectrogram(self, mel_spectrogram):
method compare_plot (line 147) | def compare_plot(self, targets, preds, filepath=None, frame_real_len=N...
method melspec_plot (line 175) | def melspec_plot(self, mels):
class AudioSpec (line 181) | class AudioSpec():
method __init__ (line 185) | def __init__(self, sr, nfft, mel_dim=80, f0_min=71, f0_max=7800,
method _mel_basis_create (line 214) | def _mel_basis_create(self):
method _normalize (line 218) | def _normalize(self, log_sepc, is_symmetric, is_clipping_in_normalizat...
method _denormalize (line 232) | def _denormalize(self, log_sepc, is_symmetric, is_clipping_in_normaliz...
method ampspec2logspec (line 246) | def ampspec2logspec(self, amp_spec):
method logspec2ampspec (line 255) | def logspec2ampspec(self, log_spec):
class VariableNormProcess (line 264) | class VariableNormProcess():
method __init__ (line 268) | def __init__(self, var_min, var_max, max_abs_value=4.0, is_symmetric=T...
method normalize (line 277) | def normalize(self, var):
method denormalize (line 284) | def denormalize(self, nvar):
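
The VariableNormProcess signature above suggests a symmetric min-max normalizer that maps a variable's [var_min, var_max] range onto [-max_abs_value, max_abs_value], in the Tacotron style. A plausible self-contained sketch, consistent with the listed signature but not the file's verbatim code:

import numpy as np

class SymmetricNorm:
    def __init__(self, var_min, var_max, max_abs_value=4.0):
        self.var_min, self.var_max, self.m = var_min, var_max, max_abs_value

    def normalize(self, var):
        # scale to [0, 1], then to [-m, m], clipping out-of-range values
        scaled = (var - self.var_min) / (self.var_max - self.var_min)
        return np.clip(2.0 * self.m * scaled - self.m, -self.m, self.m)

    def denormalize(self, nvar):
        scaled = (np.clip(nvar, -self.m, self.m) + self.m) / (2.0 * self.m)
        return scaled * (self.var_max - self.var_min) + self.var_min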
FILE: TensorFlowTTS/tensorflow_tts/bin/preprocess_unetts.py
function parse_and_config (line 46) | def parse_and_config():
function preprocess_duration (line 166) | def preprocess_duration():
function gen_duration_features (line 273) | def gen_duration_features(item, config):
function save_duration_to_file (line 303) | def save_duration_to_file(features, subdir, config):
function preprocess_acous (line 329) | def preprocess_acous():
function gen_acous_features (line 437) | def gen_acous_features(item, config):
function save_acous_to_file (line 458) | def save_acous_to_file(features, subdir, config):
function preprocess_vocoder (line 484) | def preprocess_vocoder():
function gen_vocoder (line 590) | def gen_vocoder(item, config):
function save_vocoder_to_file (line 608) | def save_vocoder_to_file(features, subdir, config):
FILE: TensorFlowTTS/tensorflow_tts/configs/mb_melgan.py
class MultiBandMelGANGeneratorConfig (line 20) | class MultiBandMelGANGeneratorConfig(MelGANGeneratorConfig):
method __init__ (line 23) | def __init__(self, **kwargs):
class MultiBandMelGANDiscriminatorConfig (line 31) | class MultiBandMelGANDiscriminatorConfig(MelGANDiscriminatorConfig):
method __init__ (line 34) | def __init__(self, **kwargs):
FILE: TensorFlowTTS/tensorflow_tts/configs/melgan.py
class MelGANGeneratorConfig (line 18) | class MelGANGeneratorConfig(object):
method __init__ (line 21) | def __init__(
class MelGANDiscriminatorConfig (line 54) | class MelGANDiscriminatorConfig(object):
method __init__ (line 57) | def __init__(
FILE: TensorFlowTTS/tensorflow_tts/configs/unetts.py
class UNETTSDurationConfig (line 61) | class UNETTSDurationConfig(object):
method __init__ (line 64) | def __init__(
class UNETTSAcousConfig (line 118) | class UNETTSAcousConfig(object):
method __init__ (line 121) | def __init__(
FILE: TensorFlowTTS/tensorflow_tts/datasets/abstract_dataset.py
class AbstractDataset (line 22) | class AbstractDataset(metaclass=abc.ABCMeta):
method get_args (line 26) | def get_args(self):
method generator (line 31) | def generator(self):
method get_output_dtypes (line 36) | def get_output_dtypes(self):
method get_len_dataset (line 41) | def get_len_dataset(self):
method create (line 45) | def create(
FILE: TensorFlowTTS/tensorflow_tts/datasets/audio_dataset.py
class AudioDataset (line 27) | class AudioDataset(AbstractDataset):
method __init__ (line 30) | def __init__(
method get_args (line 65) | def get_args(self):
method generator (line 68) | def generator(self, utt_ids):
method get_output_dtypes (line 78) | def get_output_dtypes(self):
method create (line 86) | def create(
method get_len_dataset (line 124) | def get_len_dataset(self):
method __name__ (line 127) | def __name__(self):
FILE: TensorFlowTTS/tensorflow_tts/datasets/mel_dataset.py
class MelDataset (line 27) | class MelDataset(AbstractDataset):
method __init__ (line 30) | def __init__(
method get_args (line 64) | def get_args(self):
method generator (line 67) | def generator(self, utt_ids):
method get_output_dtypes (line 77) | def get_output_dtypes(self):
method create (line 85) | def create(
method get_len_dataset (line 123) | def get_len_dataset(self):
method __name__ (line 126) | def __name__(self):
FILE: TensorFlowTTS/tensorflow_tts/inference/auto_config.py
class AutoConfig (line 38) | class AutoConfig:
method __init__ (line 39) | def __init__(self):
method from_pretrained (line 46) | def from_pretrained(cls, pretrained_path, **kwargs):
FILE: TensorFlowTTS/tensorflow_tts/inference/auto_model.py
class TFAutoModel (line 46) | class TFAutoModel(object):
method __init__ (line 49) | def __init__(self):
method from_pretrained (line 53) | def from_pretrained(cls, config, pretrained_path=None, **kwargs):
FILE: TensorFlowTTS/tensorflow_tts/inference/auto_processor.py
class AutoProcessor (line 32) | class AutoProcessor:
method __init__ (line 33) | def __init__(self):
method from_pretrained (line 40) | def from_pretrained(cls, pretrained_path, **kwargs):
FILE: TensorFlowTTS/tensorflow_tts/losses/spectrogram.py
class TFMelSpectrogram (line 20) | class TFMelSpectrogram(tf.keras.layers.Layer):
method __init__ (line 23) | def __init__(
method _calculate_log_mels_spectrogram (line 44) | def _calculate_log_mels_spectrogram(self, signals):
method call (line 69) | def call(self, y, x):
FILE: TensorFlowTTS/tensorflow_tts/losses/stft.py
class TFSpectralConvergence (line 20) | class TFSpectralConvergence(tf.keras.layers.Layer):
method __init__ (line 23) | def __init__(self):
method call (line 27) | def call(self, y_mag, x_mag):
class TFLogSTFTMagnitude (line 40) | class TFLogSTFTMagnitude(tf.keras.layers.Layer):
method __init__ (line 43) | def __init__(self):
method call (line 47) | def call(self, y_mag, x_mag):
class TFSTFT (line 58) | class TFSTFT(tf.keras.layers.Layer):
method __init__ (line 61) | def __init__(self, frame_length=600, frame_step=120, fft_length=1024):
method call (line 70) | def call(self, y, x):
class TFMultiResolutionSTFT (line 107) | class TFMultiResolutionSTFT(tf.keras.layers.Layer):
method __init__ (line 110) | def __init__(
method call (line 130) | def call(self, y, x):
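
The loss classes above compose the two usual STFT terms, spectral convergence and log-magnitude distance, evaluated at several FFT resolutions. A single-resolution sketch using tf.signal.stft, as a hedged illustration of the idea rather than the file's exact code:

import tensorflow as tf

def stft_loss(y_true, y_pred, frame_length=600, frame_step=120, fft_length=1024):
    # y_true: reference waveform, y_pred: generated waveform, both [batch, samples]
    t_mag = tf.abs(tf.signal.stft(y_true, frame_length, frame_step, fft_length))
    p_mag = tf.abs(tf.signal.stft(y_pred, frame_length, frame_step, fft_length))
    # spectral convergence (norms taken over all elements for simplicity)
    sc_loss = tf.norm(t_mag - p_mag) / (tf.norm(t_mag) + 1e-9)
    # log STFT magnitude distance
    mag_loss = tf.reduce_mean(tf.abs(tf.math.log(t_mag + 1e-9) -
                                     tf.math.log(p_mag + 1e-9)))
    return sc_loss + mag_loss

A multi-resolution version simply averages this loss over several (frame_length, frame_step, fft_length) triples.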
FILE: TensorFlowTTS/tensorflow_tts/models/mb_melgan.py
function design_prototype_filter (line 27) | def design_prototype_filter(taps=62, cutoff_ratio=0.15, beta=9.0):
class TFPQMF (line 60) | class TFPQMF(tf.keras.layers.Layer):
method __init__ (line 63) | def __init__(self, config, **kwargs):
method analysis (line 122) | def analysis(self, x):
method synthesis (line 138) | def synthesis(self, x):
class TFMBMelGANGenerator (line 159) | class TFMBMelGANGenerator(TFMelGANGenerator):
method __init__ (line 162) | def __init__(self, config, **kwargs):
method call (line 166) | def call(self, mels, **kwargs):
method inference (line 180) | def inference(self, mels):
method inference_tflite (line 189) | def inference_tflite(self, mels):
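
design_prototype_filter is the standard PQMF building block: a Kaiser-windowed lowpass prototype that TFPQMF then cosine-modulates into the analysis/synthesis filter bank. Roughly equivalent behavior can be sketched with SciPy; this illustrates the technique and is not the file's verbatim code:

import numpy as np
from scipy.signal import firwin

def prototype_filter(taps=62, cutoff_ratio=0.15, beta=9.0):
    # FIR lowpass with a Kaiser window; cutoff is relative to Nyquist,
    # matching the defaults listed in the index above.
    return firwin(taps + 1, cutoff_ratio, window=("kaiser", beta))

h = prototype_filter()
print(h.shape)  # (63,) -> cosine-modulated into per-subband filters by TFPQMF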
FILE: TensorFlowTTS/tensorflow_tts/models/melgan.py
function get_initializer (line 23) | def get_initializer(initializer_seed=42):
class TFReflectionPad1d (line 33) | class TFReflectionPad1d(tf.keras.layers.Layer):
method __init__ (line 36) | def __init__(self, padding_size, padding_type="REFLECT", **kwargs):
method call (line 47) | def call(self, x):
class TFConvTranspose1d (line 61) | class TFConvTranspose1d(tf.keras.layers.Layer):
method __init__ (line 64) | def __init__(
method call (line 92) | def call(self, x):
class TFResidualStack (line 105) | class TFResidualStack(tf.keras.layers.Layer):
method __init__ (line 108) | def __init__(
method call (line 165) | def call(self, x):
method _apply_weightnorm (line 178) | def _apply_weightnorm(self, list_layers):
class TFMelGANGenerator (line 189) | class TFMelGANGenerator(tf.keras.Model):
method __init__ (line 192) | def __init__(self, config, **kwargs):
method call (line 276) | def call(self, mels, **kwargs):
method inference (line 290) | def inference(self, mels):
method inference_tflite (line 298) | def inference_tflite(self, mels):
method _apply_weightnorm (line 301) | def _apply_weightnorm(self, list_layers):
method _build (line 311) | def _build(self):
class TFMelGANDiscriminator (line 317) | class TFMelGANDiscriminator(tf.keras.layers.Layer):
method __init__ (line 320) | def __init__(
method call (line 428) | def call(self, x, **kwargs):
method _apply_weightnorm (line 441) | def _apply_weightnorm(self, list_layers):
class TFMelGANMultiScaleDiscriminator (line 452) | class TFMelGANMultiScaleDiscriminator(tf.keras.Model):
method __init__ (line 455) | def __init__(self, config, **kwargs):
method call (line 485) | def call(self, x, **kwargs):
FILE: TensorFlowTTS/tensorflow_tts/models/moduls/adain_en_de_code.py
function get_initializer (line 5) | def get_initializer(initializer_range=0.02):
class ConvModul (line 17) | class ConvModul(tf.keras.layers.Layer):
method __init__ (line 18) | def __init__(self, hidden_size, kernel_size, initializer_range, layer_...
method call (line 39) | def call(self, x):
class EncConvBlock (line 46) | class EncConvBlock(tf.keras.layers.Layer):
method __init__ (line 47) | def __init__(self, config, **kwargs):
method call (line 56) | def call(self, x):
class DecConvBlock (line 59) | class DecConvBlock(tf.keras.layers.Layer):
method __init__ (line 60) | def __init__(self, config, **kwargs):
method call (line 75) | def call(self, x):
class AadINEncoder (line 80) | class AadINEncoder(tf.keras.Model):
method __init__ (line 81) | def __init__(self, config, **kwargs):
method call (line 109) | def call(self, x, mask):
class AdaINDecoder (line 131) | class AdaINDecoder(tf.keras.Model):
method __init__ (line 132) | def __init__(self, config, **kwargs):
method call (line 160) | def call(self, enc, cond, mask):
FILE: TensorFlowTTS/tensorflow_tts/models/moduls/conditional.py
function get_initializer (line 5) | def get_initializer(initializer_range=0.02):
class MaskInstanceNormalization (line 17) | class MaskInstanceNormalization(tf.keras.layers.Layer):
method __init__ (line 18) | def __init__(self, layer_norm_eps, **kwargs):
method _cal_mean_std (line 22) | def _cal_mean_std(self, inputs, mask):
method call (line 36) | def call(self, inputs, mask, return_mean_std=False):
class ConditionalNormalization (line 51) | class ConditionalNormalization(tf.keras.layers.Layer):
method __init__ (line 52) | def __init__(self, config, **kwargs):
method call (line 90) | def call(self, inputs, conds, mask):
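
MaskInstanceNormalization and ConditionalNormalization implement the AdaIN-style conditioning at the heart of the Unet-TTS style transfer: content features are normalized per instance, then re-scaled with statistics derived from the reference speech. The core idea in a few lines (a hedged sketch; the real layers also honor padding masks):

import tensorflow as tf

def adain(content, style_mean, style_std, eps=1e-5):
    # content: [batch, time, channels]; style statistics broadcast over time
    mean, var = tf.nn.moments(content, axes=[1], keepdims=True)
    normalized = (content - mean) / tf.sqrt(var + eps)
    return normalized * style_std + style_mean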
FILE: TensorFlowTTS/tensorflow_tts/models/moduls/core.py
function get_initializer (line 22) | def get_initializer(initializer_range=0.02):
function gelu (line 35) | def gelu(x):
function gelu_new (line 41) | def gelu_new(x):
function swish (line 47) | def swish(x):
function mish (line 52) | def mish(x):
class TFEmbedding (line 67) | class TFEmbedding(tf.keras.layers.Embedding):
method __init__ (line 69) | def __init__(self, *args, **kwargs):
method call (line 72) | def call(self, inputs):
class TFFastSpeechEmbeddings (line 78) | class TFFastSpeechEmbeddings(tf.keras.layers.Layer):
method __init__ (line 81) | def __init__(self, config, **kwargs):
method build (line 89) | def build(self, input_shape):
method call (line 99) | def call(self, input_ids):
class TFFastSpeechSelfAttention (line 103) | class TFFastSpeechSelfAttention(tf.keras.layers.Layer):
method __init__ (line 106) | def __init__(self, config, **kwargs):
method transpose_for_scores (line 143) | def transpose_for_scores(self, x, batch_size):
method call (line 162) | def call(self, inputs, training=False):
class TFFastSpeechSelfOutput (line 210) | class TFFastSpeechSelfOutput(tf.keras.layers.Layer):
method __init__ (line 213) | def __init__(self, config, **kwargs):
method call (line 226) | def call(self, inputs, training=False):
class TFFastSpeechAttention (line 236) | class TFFastSpeechAttention(tf.keras.layers.Layer):
method __init__ (line 239) | def __init__(self, config, **kwargs):
method call (line 245) | def call(self, inputs, training=False):
class TFFastSpeechIntermediate (line 263) | class TFFastSpeechIntermediate(tf.keras.layers.Layer):
method __init__ (line 266) | def __init__(self, config, **kwargs):
method call (line 288) | def call(self, inputs):
class TFFastSpeechOutput (line 302) | class TFFastSpeechOutput(tf.keras.layers.Layer):
method __init__ (line 305) | def __init__(self, config, **kwargs):
method call (line 313) | def call(self, inputs, training=False):
class TFFastSpeechLayer (line 322) | class TFFastSpeechLayer(tf.keras.layers.Layer):
method __init__ (line 325) | def __init__(self, config, **kwargs):
method call (line 332) | def call(self, inputs, training=False):
class TFFastSpeechEncoder (line 355) | class TFFastSpeechEncoder(tf.keras.layers.Layer):
method __init__ (line 358) | def __init__(self, config, **kwargs):
method call (line 368) | def call(self, inputs, training=False):
class TFFastSpeechDecoder (line 398) | class TFFastSpeechDecoder(TFFastSpeechEncoder):
method __init__ (line 401) | def __init__(self, config, **kwargs):
method call (line 412) | def call(self, inputs, training=False):
class TFTacotronPostnet (line 421) | class TFTacotronPostnet(tf.keras.layers.Layer):
method __init__ (line 424) | def __init__(self, config, **kwargs):
method call (line 446) | def call(self, inputs, training=False):
class TFFastSpeechVariantPredictor (line 458) | class TFFastSpeechVariantPredictor(tf.keras.layers.Layer):
method __init__ (line 461) | def __init__(self, config, sub_name="f0", is_sigmod=False, **kwargs):
method call (line 493) | def call(self, inputs, training=False):
class TFFastSpeechDurationPredictor (line 512) | class TFFastSpeechDurationPredictor(tf.keras.layers.Layer):
method __init__ (line 515) | def __init__(self, config, **kwargs):
method call (line 545) | def call(self, inputs, training=False):
class TFFastSpeechLengthRegulator (line 560) | class TFFastSpeechLengthRegulator(tf.keras.layers.Layer):
method __init__ (line 563) | def __init__(self, config, **kwargs):
method _compute_coarse_coding_features (line 577) | def _compute_coarse_coding_features(self):
method call (line 597) | def call(self, inputs, training=False):
method _length_regulator (line 609) | def _length_regulator(self, encoder_hidden_states, durations_gt):
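
TFFastSpeechLengthRegulator._length_regulator follows the FastSpeech recipe: expand phoneme-level encoder states to frame level by repeating each state according to its duration. The essence in NumPy, as an illustrative sketch rather than the file's exact implementation:

import numpy as np

def length_regulate(hidden, durations):
    # hidden: [num_phonemes, hidden_size]; durations: integer frame counts
    return np.repeat(hidden, durations, axis=0)   # -> [sum(durations), hidden_size]

frames = length_regulate(np.random.randn(3, 4), np.array([2, 1, 3]))
print(frames.shape)  # (6, 4)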
FILE: TensorFlowTTS/tensorflow_tts/models/moduls/core2.py
class TFFastSpeechConditionalSelfOutput (line 21) | class TFFastSpeechConditionalSelfOutput(tf.keras.layers.Layer):
method __init__ (line 24) | def __init__(self, config, **kwargs):
method call (line 37) | def call(self, inputs, training=False):
class TFFastSpeechConditionalAttention (line 52) | class TFFastSpeechConditionalAttention(tf.keras.layers.Layer):
method __init__ (line 55) | def __init__(self, config, **kwargs):
method call (line 61) | def call(self, inputs, training=False):
class TFFastSpeechConditionalOutput (line 83) | class TFFastSpeechConditionalOutput(tf.keras.layers.Layer):
method __init__ (line 86) | def __init__(self, config, **kwargs):
method call (line 92) | def call(self, inputs, training=False):
class TFFastSpeechConditionalLayer (line 106) | class TFFastSpeechConditionalLayer(tf.keras.layers.Layer):
method __init__ (line 109) | def __init__(self, config, **kwargs):
method call (line 116) | def call(self, inputs, training=False):
class TFFastSpeechConditionalEncoder (line 142) | class TFFastSpeechConditionalEncoder(tf.keras.layers.Layer):
method __init__ (line 145) | def __init__(self, config, **kwargs):
method call (line 155) | def call(self, inputs, training=False):
class TFFastSpeechConditionalDecoder (line 189) | class TFFastSpeechConditionalDecoder(TFFastSpeechConditionalEncoder):
method __init__ (line 192) | def __init__(self, config, **kwargs):
method call (line 203) | def call(self, inputs, training=False):
FILE: TensorFlowTTS/tensorflow_tts/models/unetts.py
class TFUNETTSDuration (line 24) | class TFUNETTSDuration(tf.keras.Model):
method __init__ (line 25) | def __init__(self, config, **kwargs):
method _build (line 52) | def _build(self):
method call (line 59) | def call(
method _inference (line 90) | def _inference(self, char_ids, duration_stat, **kwargs):
method setup_inference_fn (line 119) | def setup_inference_fn(self):
class ContentEncoder (line 144) | class ContentEncoder(tf.keras.Model):
method __init__ (line 145) | def __init__(self, config, **kwargs):
method call (line 162) | def call(self, char_ids, duration_gts, training=False):
class TFUNETTSAcous (line 176) | class TFUNETTSAcous(tf.keras.Model):
method __init__ (line 179) | def __init__(self, config, **kwargs):
method _build (line 198) | def _build(self):
method text_encoder_weight_load (line 206) | def text_encoder_weight_load(self, content_encoder_path):
method freezen_encoder (line 209) | def freezen_encoder(self):
method call (line 212) | def call(
method _inference (line 227) | def _inference(self, char_ids, duration_gts, mel_src, **kwargs):
method extract_dur_pos_embed (line 238) | def extract_dur_pos_embed(self, mel_src):
method setup_inference_fn (line 243) | def setup_inference_fn(self):
class TFUNETTSContentPretrain (line 264) | class TFUNETTSContentPretrain(tf.keras.Model):
method __init__ (line 267) | def __init__(self, config, **kwargs):
method _build (line 297) | def _build(self):
method content_encoder_weight_save (line 305) | def content_encoder_weight_save(self, path):
method call (line 308) | def call(
method _inference (line 340) | def _inference(self, char_ids, duration_gts, embed, **kwargs):
method setup_inference_fn (line 370) | def setup_inference_fn(self):
FILE: TensorFlowTTS/tensorflow_tts/optimizers/adamweightdecay.py
class WarmUp (line 23) | class WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):
method __init__ (line 26) | def __init__(
method __call__ (line 41) | def __call__(self, step):
method get_config (line 58) | def get_config(self):
class AdamWeightDecay (line 68) | class AdamWeightDecay(tf.keras.optimizers.Adam):
method __init__ (line 79) | def __init__(
method from_config (line 100) | def from_config(cls, config):
method _prepare_local (line 107) | def _prepare_local(self, var_device, var_dtype, apply_state):
method _decay_weights_op (line 113) | def _decay_weights_op(self, var, learning_rate, apply_state):
method apply_gradients (line 122) | def apply_gradients(self, grads_and_vars, clip_norm=0.5, **kwargs):
method _get_lr (line 127) | def _get_lr(self, var_device, var_dtype, apply_state):
method _resource_apply_dense (line 140) | def _resource_apply_dense(self, grad, var, apply_state=None):
method _resource_apply_sparse (line 148) | def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
method get_config (line 156) | def get_config(self):
method _do_use_weight_decay (line 163) | def _do_use_weight_decay(self, param_name):
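
WarmUp wraps a decay schedule with the usual polynomial warm-up: the learning rate ramps from zero to the target over warmup_steps, after which the wrapped schedule takes over. A minimal sketch of that pattern, assumed from the class and constructor names above:

import tensorflow as tf

class WarmUpSketch(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, initial_learning_rate, decay_schedule_fn, warmup_steps, power=1.0):
        super().__init__()
        self.initial_learning_rate = initial_learning_rate
        self.decay_schedule_fn = decay_schedule_fn
        self.warmup_steps = warmup_steps
        self.power = power

    def __call__(self, step):
        step_f = tf.cast(step, tf.float32)
        warmup_f = tf.cast(self.warmup_steps, tf.float32)
        # polynomial ramp during warm-up, wrapped schedule afterwards
        warmup_lr = self.initial_learning_rate * tf.pow(step_f / warmup_f, self.power)
        return tf.cond(step_f < warmup_f,
                       lambda: warmup_lr,
                       lambda: self.decay_schedule_fn(step))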
FILE: TensorFlowTTS/tensorflow_tts/processor/base_processor.py
class DataProcessorError (line 25) | class DataProcessorError(Exception):
class BaseProcessor (line 30) | class BaseProcessor(abc.ABC):
method __post_init__ (line 49) | def __post_init__(self):
method __getattr__ (line 79) | def __getattr__(self, name: str) -> Union[str, int]:
method create_speaker_map (line 84) | def create_speaker_map(self):
method get_speaker_id (line 95) | def get_speaker_id(self, name: str) -> int:
method get_speaker_name (line 98) | def get_speaker_name(self, speaker_id: int) -> str:
method create_symbols (line 101) | def create_symbols(self):
method create_items (line 105) | def create_items(self):
method add_symbol (line 126) | def add_symbol(self, symbol: Union[str, list]):
method get_one_sample (line 142) | def get_one_sample(self, item):
method text_to_sequence (line 162) | def text_to_sequence(self, text: str):
method setup_eos_token (line 166) | def setup_eos_token(self):
method convert_symbols_to_ids (line 170) | def convert_symbols_to_ids(self, symbols: Union[str, list]):
method _load_mapper (line 186) | def _load_mapper(self, loaded_path: str = None):
method _save_mapper (line 208) | def _save_mapper(self, saved_path: str = None, extra_attrs_to_save: di...
FILE: TensorFlowTTS/tensorflow_tts/processor/multispk_voiceclone.py
function is_zh (line 535) | def is_zh(word):
function is_en (line 540) | def is_en(word):
class MyConverter (line 545) | class MyConverter(NeutralToneWith5Mixin, DefaultConverter):
class MultiSPKVoiceCloneProcessor (line 550) | class MultiSPKVoiceCloneProcessor(BaseProcessor):
method __post_init__ (line 566) | def __post_init__(self):
method setup_eos_token (line 577) | def setup_eos_token(self):
method create_speaker_info (line 580) | def create_speaker_info(self):
method create_unseen_speaker (line 594) | def create_unseen_speaker(self):
method create_items (line 609) | def create_items(self):
method get_phoneme_from_char_and_pinyin (line 652) | def get_phoneme_from_char_and_pinyin(self, txt, pinyin):
method get_one_sample (line 749) | def get_one_sample(self, item):
method get_pinyin_parser (line 764) | def get_pinyin_parser(self):
method text_to_sequence (line 769) | def text_to_sequence(self, text, inference=False):
method create_speaker_map (line 805) | def create_speaker_map(self):
FILE: TensorFlowTTS/tensorflow_tts/trainers/base_trainer.py
class BasedTrainer (line 32) | class BasedTrainer(metaclass=abc.ABCMeta):
method __init__ (line 35) | def __init__(self, steps, epochs, config):
method init_train_eval_metrics (line 47) | def init_train_eval_metrics(self, list_metrics_name):
method reset_states_train (line 59) | def reset_states_train(self):
method reset_states_eval (line 64) | def reset_states_eval(self):
method update_train_metrics (line 69) | def update_train_metrics(self, dict_metrics_losses):
method update_eval_metrics (line 73) | def update_eval_metrics(self, dict_metrics_losses):
method set_train_data_loader (line 77) | def set_train_data_loader(self, train_dataset):
method get_train_data_loader (line 81) | def get_train_data_loader(self):
method set_eval_data_loader (line 85) | def set_eval_data_loader(self, eval_dataset):
method get_eval_data_loader (line 89) | def get_eval_data_loader(self):
method compile (line 94) | def compile(self):
method create_checkpoint_manager (line 98) | def create_checkpoint_manager(self, saved_path=None, max_to_keep=10):
method run (line 102) | def run(self):
method save_checkpoint (line 117) | def save_checkpoint(self):
method load_checkpoint (line 122) | def load_checkpoint(self, pretrained_path):
method _train_epoch (line 126) | def _train_epoch(self):
method _eval_epoch (line 150) | def _eval_epoch(self):
method _train_step (line 155) | def _train_step(self, batch):
method _check_log_interval (line 160) | def _check_log_interval(self):
method fit (line 165) | def fit(self):
method _check_eval_interval (line 168) | def _check_eval_interval(self):
method _check_save_interval (line 173) | def _check_save_interval(self):
method generate_and_save_intermediate_result (line 179) | def generate_and_save_intermediate_result(self, batch):
method _write_to_tensorboard (line 183) | def _write_to_tensorboard(self, list_metrics, stage="train"):
class GanBasedTrainer (line 191) | class GanBasedTrainer(BasedTrainer):
method __init__ (line 194) | def __init__(
method init_train_eval_metrics (line 217) | def init_train_eval_metrics(self, list_metrics_name):
method get_n_gpus (line 221) | def get_n_gpus(self):
method _get_train_element_signature (line 224) | def _get_train_element_signature(self):
method _get_eval_element_signature (line 227) | def _get_eval_element_signature(self):
method set_gen_model (line 230) | def set_gen_model(self, generator_model):
method get_gen_model (line 234) | def get_gen_model(self):
method set_dis_model (line 238) | def set_dis_model(self, discriminator_model):
method get_dis_model (line 242) | def get_dis_model(self):
method set_gen_optimizer (line 246) | def set_gen_optimizer(self, generator_optimizer):
method get_gen_optimizer (line 254) | def get_gen_optimizer(self):
method set_dis_optimizer (line 258) | def set_dis_optimizer(self, discriminator_optimizer):
method get_dis_optimizer (line 266) | def get_dis_optimizer(self):
method compile (line 270) | def compile(self, gen_model, dis_model, gen_optimizer, dis_optimizer):
method _train_step (line 276) | def _train_step(self, batch):
method _one_step_forward (line 299) | def _one_step_forward(self, batch):
method compute_per_example_generator_losses (line 308) | def compute_per_example_generator_losses(self, batch, outputs):
method compute_per_example_discriminator_losses (line 326) | def compute_per_example_discriminator_losses(self, batch, gen_outputs):
method _one_step_forward_per_replica (line 343) | def _one_step_forward_per_replica(self, batch):
method _eval_epoch (line 423) | def _eval_epoch(self):
method _one_step_evaluate_per_replica (line 455) | def _one_step_evaluate_per_replica(self, batch):
method _one_step_evaluate (line 478) | def _one_step_evaluate(self, batch):
method _one_step_predict_per_replica (line 481) | def _one_step_predict_per_replica(self, batch):
method _one_step_predict (line 485) | def _one_step_predict(self, batch):
method generate_and_save_intermediate_result (line 490) | def generate_and_save_intermediate_result(self, batch):
method create_checkpoint_manager (line 493) | def create_checkpoint_manager(self, saved_path=None, max_to_keep=10):
method save_checkpoint (line 511) | def save_checkpoint(self):
method load_checkpoint (line 523) | def load_checkpoint(self, pretrained_path):
method _check_train_finish (line 554) | def _check_train_finish(self):
method _check_log_interval (line 568) | def _check_log_interval(self):
method fit (line 580) | def fit(self, train_data_loader, valid_data_loader, saved_path, resume...
class Seq2SeqBasedTrainer (line 597) | class Seq2SeqBasedTrainer(BasedTrainer, metaclass=abc.ABCMeta):
method __init__ (line 600) | def __init__(
method init_train_eval_metrics (line 620) | def init_train_eval_metrics(self, list_metrics_name):
method set_model (line 624) | def set_model(self, model):
method get_model (line 628) | def get_model(self):
method set_optimizer (line 632) | def set_optimizer(self, optimizer):
method get_optimizer (line 640) | def get_optimizer(self):
method get_n_gpus (line 644) | def get_n_gpus(self):
method compile (line 647) | def compile(self, model, optimizer):
method _get_train_element_signature (line 651) | def _get_train_element_signature(self):
method _get_eval_element_signature (line 654) | def _get_eval_element_signature(self):
method _train_step (line 657) | def _train_step(self, batch):
method _one_step_forward (line 680) | def _one_step_forward(self, batch):
method _one_step_forward_per_replica (line 688) | def _one_step_forward_per_replica(self, batch):
method compute_per_example_losses (line 724) | def compute_per_example_losses(self, batch, outputs):
method _eval_epoch (line 741) | def _eval_epoch(self):
method _one_step_evaluate_per_replica (line 773) | def _one_step_evaluate_per_replica(self, batch):
method _one_step_evaluate (line 779) | def _one_step_evaluate(self, batch):
method _one_step_predict_per_replica (line 782) | def _one_step_predict_per_replica(self, batch):
method _one_step_predict (line 786) | def _one_step_predict(self, batch):
method generate_and_save_intermediate_result (line 791) | def generate_and_save_intermediate_result(self, batch):
method create_checkpoint_manager (line 794) | def create_checkpoint_manager(self, saved_path=None, max_to_keep=10):
method save_checkpoint (line 809) | def save_checkpoint(self):
method load_checkpoint (line 816) | def load_checkpoint(self, pretrained_path):
method _check_train_finish (line 828) | def _check_train_finish(self):
method _check_log_interval (line 833) | def _check_log_interval(self):
method fit (line 845) | def fit(self, train_data_loader, valid_data_loader, saved_path, resume...
class StreamBasedTrainer (line 862) | class StreamBasedTrainer(Seq2SeqBasedTrainer):
method __init__ (line 863) | def __init__(
method _one_step_evaluate_per_replica (line 868) | def _one_step_evaluate_per_replica(self, batch):
FILE: TensorFlowTTS/tensorflow_tts/utils/cleaners.py
function expand_abbreviations (line 57) | def expand_abbreviations(text):
function expand_numbers (line 63) | def expand_numbers(text):
function lowercase (line 67) | def lowercase(text):
function collapse_whitespace (line 71) | def collapse_whitespace(text):
function convert_to_ascii (line 75) | def convert_to_ascii(text):
function basic_cleaners (line 79) | def basic_cleaners(text):
function transliteration_cleaners (line 86) | def transliteration_cleaners(text):
function english_cleaners (line 94) | def english_cleaners(text):
function korean_cleaners (line 104) | def korean_cleaners(text):
FILE: TensorFlowTTS/tensorflow_tts/utils/decoder.py
function dynamic_decode (line 28) | def dynamic_decode(
FILE: TensorFlowTTS/tensorflow_tts/utils/griffin_lim.py
function griffin_lim_lb (line 26) | def griffin_lim_lb(
class TFGriffinLim (line 64) | class TFGriffinLim(tf.keras.layers.Layer):
method __init__ (line 67) | def __init__(self, stats_path, dataset_config, normalized: bool = True):
method save_wav (line 88) | def save_wav(self, gl_tf, output_dir, wav_name):
method call (line 117) | def call(self, mel_spec, n_iter=32):
FILE: TensorFlowTTS/tensorflow_tts/utils/group_conv.py
class Convolution (line 14) | class Convolution(object):
method __init__ (line 29) | def __init__(
method _build_op (line 97) | def _build_op(self, _, padding):
method __call__ (line 107) | def __call__(self, inp, filter):
class Conv (line 111) | class Conv(Layer):
method __init__ (line 172) | def __init__(
method build (line 238) | def build(self, input_shape):
method call (line 293) | def call(self, inputs):
method compute_output_shape (line 326) | def compute_output_shape(self, input_shape):
method get_config (line 357) | def get_config(self):
method _compute_causal_padding (line 379) | def _compute_causal_padding(self):
method _get_channel_axis (line 388) | def _get_channel_axis(self):
method _get_input_channel (line 394) | def _get_input_channel(self, input_shape):
method _get_padding_op (line 403) | def _get_padding_op(self):
method _recreate_conv_op (line 412) | def _recreate_conv_op(self, inputs):
class GroupConv1D (line 433) | class GroupConv1D(Conv):
method __init__ (line 517) | def __init__(
FILE: TensorFlowTTS/tensorflow_tts/utils/korean.py
function is_lead (line 284) | def is_lead(char):
function is_vowel (line 288) | def is_vowel(char):
function is_tail (line 292) | def is_tail(char):
function get_mode (line 296) | def get_mode(char):
function _get_text_from_candidates (line 307) | def _get_text_from_candidates(candidates):
function jamo_to_korean (line 316) | def jamo_to_korean(text):
function compare_sentence_with_jamo (line 345) | def compare_sentence_with_jamo(text1, text2):
function tokenize (line 349) | def tokenize(text, as_id=False):
function tokenizer_fn (line 362) | def tokenizer_fn(iterator):
function normalize (line 366) | def normalize(text):
function normalize_with_dictionary (line 382) | def normalize_with_dictionary(text, dic):
function normalize_english (line 390) | def normalize_english(text):
function normalize_upper (line 402) | def normalize_upper(text):
function normalize_quote (line 411) | def normalize_quote(text):
function normalize_number (line 428) | def normalize_number(text):
function number_to_korean (line 458) | def number_to_korean(num_str, is_count=False):
FILE: TensorFlowTTS/tensorflow_tts/utils/number_norm.py
function _remove_commas (line 37) | def _remove_commas(m):
function _expand_decimal_point (line 41) | def _expand_decimal_point(m):
function _expand_dollars (line 45) | def _expand_dollars(m):
function _expand_ordinal (line 66) | def _expand_ordinal(m):
function _expand_number (line 70) | def _expand_number(m):
function normalize_numbers (line 87) | def normalize_numbers(text):
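
These helpers expand digits, ordinals, and dollar amounts into words before phonemization. The backbone of this kind of Keith Ito-style cleaner looks like the following minimal sketch, which handles only bare integers; inflect is the library such cleaners conventionally use:

import re
import inflect

_inflect = inflect.engine()
_number_re = re.compile(r"[0-9]+")

def normalize_numbers(text):
    # spell out integers so the TTS front end never sees raw digits
    return _number_re.sub(lambda m: _inflect.number_to_words(m.group(0)), text)

print(normalize_numbers("room 42"))  # "room forty-two"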
FILE: TensorFlowTTS/tensorflow_tts/utils/outliers.py
function is_outlier (line 19) | def is_outlier(x, p25, p75):
function remove_outlier (line 26) | def remove_outlier(x, p_bottom: int = 25, p_top: int = 75):
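
is_outlier and remove_outlier operate on the quartiles passed in, which points to the standard Tukey rule: values beyond 1.5 x IQR from the 25th/75th percentiles are treated as outliers (typically applied to F0 tracks). A common variant, sketched; the file may replace outliers rather than drop them:

import numpy as np

def remove_outlier(x, p_bottom=25, p_top=75):
    p25, p75 = np.percentile(x, [p_bottom, p_top])
    iqr = p75 - p25
    keep = (x > p25 - 1.5 * iqr) & (x < p75 + 1.5 * iqr)
    return x[keep]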
FILE: TensorFlowTTS/tensorflow_tts/utils/strategy.py
function return_strategy (line 19) | def return_strategy():
function calculate_3d_loss (line 29) | def calculate_3d_loss(y_gt, y_pred, loss_fn):
function calculate_2d_loss (line 54) | def calculate_2d_loss(y_gt, y_pred, loss_fn):
function calculate_loss_norm_lens (line 79) | def calculate_loss_norm_lens(y_gt, y_pred, loss_fn, norm_lens):
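
calculate_3d_loss and calculate_2d_loss guard against the off-by-a-few frame mismatches that arise between predicted and ground-truth mel lengths. One defensible way to implement that guard, shown here as an assumption (the file may pad instead of trim):

import tensorflow as tf

def calculate_3d_loss(y_gt, y_pred, loss_fn):
    # trim both [batch, time, dim] tensors to the shorter time axis
    min_len = tf.minimum(tf.shape(y_gt)[1], tf.shape(y_pred)[1])
    return loss_fn(y_gt[:, :min_len, :], y_pred[:, :min_len, :])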
FILE: TensorFlowTTS/tensorflow_tts/utils/utils.py
function find_files (line 11) | def find_files(root_dir, query="*.wav", include_root_dir=True):
FILE: TensorFlowTTS/tensorflow_tts/utils/weight_norm.py
class WeightNormalization (line 22) | class WeightNormalization(tf.keras.layers.Wrapper):
method __init__ (line 48) | def __init__(self, layer, data_init=True, **kwargs):
method _compute_weights (line 86) | def _compute_weights(self):
method _init_norm (line 96) | def _init_norm(self):
method _data_dep_init (line 103) | def _data_dep_init(self, inputs):
method build (line 131) | def build(self, input_shape=None):
method call (line 168) | def call(self, inputs):
method compute_output_shape (line 183) | def compute_output_shape(self, input_shape):
FILE: UnetTTS_syn.py
class UnetTTS (line 13) | class UnetTTS():
method __init__ (line 14) | def __init__(self, models_and_params, text2id_mapper, feats_yaml):
method one_shot_TTS (line 26) | def one_shot_TTS(self, text, src_audio, duration_stats=None, is_wrap_t...
method __init_models (line 66) | def __init_models(self):
method _stats_duration (line 87) | def _stats_duration(self, dur_pos_embed):
method mel_feats_extractor (line 113) | def mel_feats_extractor(self, audio):
method txt2ids (line 116) | def txt2ids(self, input_text):
method infer_duration_stats (line 122) | def infer_duration_stats(self, mel_src):
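
UnetTTS.one_shot_TTS is the whole inference entry point: text plus one reference waveform in, cloned speech out. A hypothetical end-to-end call, inferred from the constructor and method signatures above and the repository layout; the exact structure of models_and_params and the return values live in UnetTTS_syn.py and the notebook:

from UnetTTS_syn import UnetTTS

# Hypothetical key names -- consult the notebook for the real structure.
models_and_params = {
    "duration": "models/duration4k.h5",
    "acous": "models/acous12k.h5",
    "vocoder": "models/vocoder800k.h5",
}

tts = UnetTTS(models_and_params,
              text2id_mapper="models/unetts_mapper.json",
              feats_yaml="train/configs/unetts_preprocess.yaml")

syn_audio, *rest = tts.one_shot_TTS("今天天气真好", "path/to/reference.wav")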
FILE: train/train_multiband_melgan.py
class MultiBandMelganTrainer (line 48) | class MultiBandMelganTrainer(MelganTrainer):
method __init__ (line 51) | def __init__(
method compile (line 96) | def compile(self, gen_model, dis_model, gen_optimizer, dis_optimizer, ...
method compute_per_example_generator_losses (line 109) | def compute_per_example_generator_losses(self, batch, outputs):
method compute_per_example_discriminator_losses (line 179) | def compute_per_example_discriminator_losses(self, batch, gen_outputs):
method generate_and_save_intermediate_result (line 200) | def generate_and_save_intermediate_result(self, batch):
function main (line 258) | def main():
FILE: train/train_unetts_acous.py
class UNETTSAcousTrainer (line 45) | class UNETTSAcousTrainer(Seq2SeqBasedTrainer):
method __init__ (line 48) | def __init__(
method compile (line 76) | def compile(self, model, optimizer):
method compute_per_example_losses (line 88) | def compute_per_example_losses(self, batch, outputs):
method generate_and_save_intermediate_result (line 118) | def generate_and_save_intermediate_result(self, batch):
class UNETTSContentPreTrainer (line 155) | class UNETTSContentPreTrainer(Seq2SeqBasedTrainer):
method __init__ (line 158) | def __init__(
method compile (line 186) | def compile(self, model, optimizer):
method compute_per_example_losses (line 198) | def compute_per_example_losses(self, batch, outputs):
method generate_and_save_intermediate_result (line 228) | def generate_and_save_intermediate_result(self, batch):
function main (line 268) | def main():
FILE: train/train_unetts_duration.py
class UNETTSDurationTrainer (line 45) | class UNETTSDurationTrainer(Seq2SeqBasedTrainer):
method __init__ (line 48) | def __init__(
method compile (line 73) | def compile(self, model, optimizer):
method compute_per_example_losses (line 86) | def compute_per_example_losses(self, batch, outputs):
method generate_and_save_intermediate_result (line 112) | def generate_and_save_intermediate_result(self, batch):
function main (line 142) | def main():
FILE: train/unetts_dataset.py
class UNETTSDurationDataset (line 28) | class UNETTSDurationDataset(AbstractDataset):
method __init__ (line 31) | def __init__(
method get_args (line 73) | def get_args(self):
method generator (line 76) | def generator(self, utt_ids):
method _load_data (line 92) | def _load_data(self, items):
method create (line 107) | def create(
method get_output_dtypes (line 148) | def get_output_dtypes(self):
method get_len_dataset (line 157) | def get_len_dataset(self):
method __name__ (line 160) | def __name__(self):
class UNETTSAcousDataset (line 163) | class UNETTSAcousDataset(AbstractDataset):
method __init__ (line 166) | def __init__(
method get_args (line 216) | def get_args(self):
method generator (line 219) | def generator(self, utt_ids):
method _load_data (line 237) | def _load_data(self, items):
method create (line 255) | def create(
method get_output_dtypes (line 302) | def get_output_dtypes(self):
method get_len_dataset (line 312) | def get_len_dataset(self):
method __name__ (line 315) | def __name__(self):