## Abstract
Expressive text-to-speech (TTS) aims to synthesize speech in different speaking styles according to the user's demands. Currently, there are two common ways to control speaking style: (1) Pre-defining a group of speaking styles and using a categorical index to denote each one. This limits the diversity of expressiveness, as such models can only generate the pre-defined styles. (2) Using reference speech as the style input, which has the drawback that the extracted style information is neither intuitive nor interpretable.
In this study, we attempt to use natural language as a style prompt to control the style of the synthetic speech, e.g., "Sigh tone in full of sad mood with some helpless feeling".
Since no existing TTS corpus is suitable for benchmarking this novel task, we first construct a speech corpus whose samples are annotated not only with content transcriptions but also with style descriptions in natural language.
We then propose an expressive TTS model, named InstructTTS, which is novel in the following respects:
(1) We take full advantage of self-supervised learning and cross-modal metric learning, and propose a novel three-stage training procedure to obtain a robust sentence embedding model, which can effectively capture semantic information from the style prompts and control the speaking style in the generated speech.
(2) We propose to model acoustic features in discrete latent space and train a novel discrete diffusion probabilistic model to generate vector-quantized (VQ) acoustic tokens rather than the commonly-used mel spectrogram.
(3) We jointly apply mutual information (MI) estimation and minimization during acoustic model training to minimize style-speaker and style-content MI, avoiding possible content and speaker information leakage from the style prompt.
Extensive objective and subjective evaluation has been conducted to verify the effectiveness and expressiveness of InstructTTS. Experimental results show that InstructTTS can synthesize high-fidelity and natural speech with style prompts controlling the speaking style.
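The cross-modal metric learning mentioned in aspect (1) can be illustrated with a small sketch. The abstract does not state which loss is used, so the contrastive (InfoNCE-style) objective below is an assumption, and the function names, temperature, and toy embeddings are all hypothetical. It only shows the general idea: a style prompt embedding should be closer to the embedding of its paired speech clip than to the embeddings of other clips.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(prompt_embs, speech_embs, temperature=0.1):
    """Mean cross-entropy of matching prompt i to speech clip i.

    Assumption: an InfoNCE-style loss, not necessarily the one used
    by InstructTTS. Pair i in both lists is treated as the positive
    pair; every other clip in the batch acts as a negative.
    """
    n = len(prompt_embs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine(prompt_embs[i], speech_embs[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Hypothetical 2-D embeddings, hand-written for illustration only.
prompts = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # clip i matches prompt i
swapped = [[0.1, 0.9], [0.9, 0.1]]  # clip i matches the other prompt

loss_aligned = info_nce(prompts, aligned)
loss_swapped = info_nce(prompts, swapped)
```

In this toy setting the loss is lower when each prompt is paired with the clip whose embedding points in the same direction, which is the property a metric-learning stage is meant to encourage.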
## Introduction
This is a [demo](http://dongchaoyang.top/InstructTTS//) for our paper **_InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt_**. Below, we show some samples generated by our proposed method. For more generated speech, please refer to https://github.com/yangdongchao/InstructTTS/tree/gh-pages
## Some speech synthesized by InstructTTS
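| Style prompt | Content text |
| ----------- | ----------- |
| 急切,激动,大声的表达自己观点 (Urgent and agitated, loudly expressing one's own opinion) | 哥,我早说过大嫂掉下去不关我的事,何况她根本就没怀孕! (Brother, I told you long ago that my sister-in-law's fall had nothing to do with me; besides, she was never pregnant at all!) |
| 因他人不好的行为而气愤不已,发出严声质问 (Furious at someone else's bad behaviour, questioning them sternly) | 就算你再怎么恨周傲于都好,你为什么要拿孩子撒气 (No matter how much you hate Zhou Aoyu, why take it out on the child?) |
| 伤心难过,又无能为力,很悲观的情绪 (Sad and sorrowful, yet helpless; a very pessimistic mood) | 因为他要做一个很长很长的梦 (Because he is going to have a very, very long dream) |
| 语速略快,声音高昂,充满兴奋与轻快 (Slightly fast speaking rate, high-pitched voice, full of excitement and cheer) | 窗外,熟悉的鸟儿们欢天喜地的歌唱 (Outside the window, the familiar birds are singing joyfully) |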
## Links
[[Paper](https://arxiv.org/abs/2301.13662)] [[Bibtex]()] [[Demo GitHub](http://dongchaoyang.top/PromptLM-TTS)] [[TencentAILab](https://ai.tencent.com/ailab/zh/index)] [[CUHK]()] [[code]()]
================================================
FILE: readme.md
================================================
# The demo page of InstructTTS
Paper: https://arxiv.org/abs/2301.13662
Demo: http://dongchaoyang.top/InstructTTS/
### InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt
## Introduction
For the first time, we study the modelling of expressive TTS with style prompts in natural language, which raises the following research problems: (1) how to train a language model that can capture semantic information from the natural language prompt and control the speaking style in the generated speech; (2) how to design an acoustic model that effectively handles the challenging one-to-many learning problem of expressive TTS. In this paper, we address these two challenges.
The main contributions of this study are summarized as follows:
(1) For the first time, we study the modelling of expressive TTS with natural language prompts, which brings us a step closer to achieving user-controllable expressive TTS.
(2) We introduce a novel three-stage training strategy to obtain a robust sentence embedding model, which can effectively capture semantic information from the style prompts.
(3) Inspired by the success of large-scale language models, e.g., GPT-3 and ChatGPT \cite{brown2020language}, we propose to model acoustic features in discrete latent space and cast speech synthesis as a language modeling task. Specifically, we train a novel discrete diffusion model to generate vector-quantized (VQ) acoustic features rather than to predict the commonly used mel-spectrogram.
(4) We explore modelling two types of VQ acoustic features: mel-spectrogram-based VQ features and waveform-based VQ features. We show that both types can be effectively modeled by our proposed discrete diffusion model. Note that our waveform-based modelling method needs only one-stage training and is non-autoregressive, which differs substantially from the concurrent works AudioLM \cite{borsos2022audiolm}, VALL-E \cite{wang2023neural} and MusicLM \cite{borsos2023musiclm}.
(5) We jointly apply mutual information (MI) estimation and minimization during acoustic model training to minimize style-speaker and style-content MI, which avoids possible content and speaker information leakage from the style prompt.
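To make contribution (3) more concrete, the sketch below shows what a forward (noising) step over discrete VQ token ids can look like. The README does not specify the transition used by InstructTTS, so the mask-and-uniform-replace corruption here is an assumption based on one common choice for discrete diffusion; the `MASK` id, the probabilities, and the codebook size are hypothetical. The learned reverse (denoising) model, which is the part that is actually trained, is not shown.

```python
import random

MASK = -1  # hypothetical id reserved for the special [MASK] token


def corrupt(tokens, keep_prob, mask_prob, codebook_size, rng):
    """Apply one forward noising step independently to each token.

    Each unmasked token is kept with probability keep_prob, replaced
    by MASK with probability mask_prob, and otherwise resampled
    uniformly from the codebook. Masked tokens stay masked.
    """
    noisy = []
    for tok in tokens:
        if tok == MASK:
            noisy.append(MASK)  # absorbing state
            continue
        r = rng.random()
        if r < keep_prob:
            noisy.append(tok)
        elif r < keep_prob + mask_prob:
            noisy.append(MASK)
        else:
            noisy.append(rng.randrange(codebook_size))
    return noisy


rng = random.Random(0)
codebook_size = 256  # assumed size, for illustration only

# A toy sequence of VQ acoustic token ids standing in for an utterance.
clean = [rng.randrange(codebook_size) for _ in range(16)]

# Repeatedly corrupting the sequence moves it towards pure noise.
x = clean
for _ in range(50):
    x = corrupt(x, keep_prob=0.9, mask_prob=0.08,
                codebook_size=codebook_size, rng=rng)
```

Because every token is corrupted independently, a denoising model can predict all positions in parallel at each reverse step, which is consistent with the non-autoregressive property stated in contribution (4).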