Repository: hccho2/Tacotron2-Wavenet-Korean-TTS
Branch: master
Commit: 9215afde67a2
Files: 36
Total size: 254.3 KB
Directory structure:
gitextract_8q9e32ds/
├── LICENSE
├── ReadMe.md
├── datasets/
│ ├── __init__.py
│ ├── datafeeder_tacotron2.py
│ ├── datafeeder_wavenet.py
│ ├── moon/
│ │ └── moon-recognition-All.json
│ ├── moon.py
│ ├── son/
│ │ └── son-recognition-All.json
│ └── son.py
├── generate.py
├── hparams.py
├── preprocess.py
├── synthesizer.py
├── tacotron2/
│ ├── __init__.py
│ ├── helpers.py
│ ├── modules.py
│ ├── rnn_wrappers.py
│ └── tacotron2.py
├── text/
│ ├── __init__.py
│ ├── cleaners.py
│ ├── en_numbers.py
│ ├── english.py
│ ├── ko_dictionary.py
│ ├── korean.py
│ └── symbols.py
├── train_tacotron2.py
├── train_vocoder.py
├── utils/
│ ├── __init__.py
│ ├── audio.py
│ ├── infolog.py
│ └── plot.py
├── wavenet/
│ ├── __init__.py
│ ├── mixture.py
│ ├── model.py
│ └── ops.py
└── 명령어모음.txt
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2018 Heecheol Cho
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: ReadMe.md
================================================
# Multi-Speaker Tacotron2 + Wavenet Vocoder + Korean TTS
This project combines the Tacotron2 model with a Wavenet Vocoder to implement Korean TTS.
The Tacotron2 model has been extended to a multi-speaker model.
Based on
- https://github.com/keithito/tacotron
- https://github.com/carpedm20/multi-speaker-tacotron-tensorflow
- https://github.com/Rayhane-mamah/Tacotron-2
- https://github.com/hccho2/Tacotron-Wavenet-Vocoder
## Tacotron 2
- For background on the Tacotron model, see the earlier [repo](https://github.com/hccho2/Tacotron-Wavenet-Vocoder).
- [Tacotron2](https://arxiv.org/abs/1712.05884) changes the model architecture and introduces Location Sensitive Attention, a Stop Token, and Wavenet as the vocoder.
- The best-known Tacotron2 implementation is [Rayhane-mamah](https://github.com/Rayhane-mamah/Tacotron-2). It, in turn, builds on code from [keithito](https://github.com/keithito/tacotron) and [r9y9](https://github.com/r9y9/wavenet_vocoder).
## This Project
* The goal of this project is Korean TTS with the Tacotron2 model.
* The [Rayhane-mamah](https://github.com/Rayhane-mamah/Tacotron-2) implementation uses many customized layers, which seemed overly complex to me, so I cut down on the custom layers and relied more on the layers built into Tensorflow.
* Train samples (teacher forcing) start producing intelligible speech from around step 2000; test samples (free running) from around step 3000.
## Step-by-Step Execution
### Execution Order
- Data generation: for generating the Korean data, see the earlier [repo](https://github.com/hccho2/Tacotron-Wavenet-Vocoder).
- Point 'data_paths' below at the generated data.
- After tacotron training, test with synthesizer.py.
- After wavenet training, test with generate.py (you can test with a mel spectrogram that tacotron did not produce, or with one that tacotron did produce).
- After both models are trained, feed the mel spectrogram generated by tacotron into wavenet as the local condition and test; a combined example follows this list.
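Putting the steps together, a minimal end-to-end session might look like this (the load_path, mel path, and train directory are the example values used elsewhere in this ReadMe; substitute your own, and replace "..." with your input text):
```
python train_tacotron2.py
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "..."
python train_vocoder.py
python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10
```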
### Tacotron2 Training
- Set '--data_paths' inside train_tacotron2.py, then train. data_paths can point to several data directories.
```
parser.add_argument('--data_paths', default='.\\data\\moon,.\\data\\son')
```
- To resume a previous training run, set '--load_path'.
```
parser.add_argument('--load_path', default='logdir-tacotron2/moon+son_2019-02-27_00-21-42')
```
- model_type can be set to 'single' or 'multi-speaker'. With a single speaker, set model_type = 'single' in hparams and pass just one directory to '--data_paths' in train_tacotron2.py.
```
parser.add_argument('--data_paths', default='D:\\Tacotron2\\data\\moon')
```
- Since the hyperparameters are all set in hparams.py and the arguments in train_tacotron2.py, launching training is simple:
> python train_tacotron2.py
- After training, generate audio as follows. '--num_speakers' and '--speaker_id' must be set correctly.
> python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다."
### Wavenet Vocoder Training
- Set '--data_dir' inside train_vocoder.py, then train.
- If training runs out of memory or is too slow, reduce the sample_size hyperparameter; reducing batch_size also helps. A sketch of such a change follows the code block below.
```
DATA_DIRECTORY = 'D:\\Tacotron2\\data\\moon,D:\\Tacotron2\\data\\son'
parser.add_argument('--data_dir', type=str, default=DATA_DIRECTORY, help='The directory containing data')
```
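For reference, the kind of change meant above is a one-line edit in hparams.py. The values below are illustrative only, not recommendations; check the defaults in your copy of hparams.py (and note that the wavenet batch_size may be set as a train_vocoder.py argument instead):
```
sample_size = 9000,  # fewer raw-audio samples per training window -> less memory per batch
batch_size = 4,      # a smaller batch also reduces memory, at the cost of slower training
```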
- To resume a previous training run, set '--logdir'.
```
LOGDIR = './/logdir-wavenet//train//2018-12-21T22-58-10'
parser.add_argument('--logdir', type=str, default=LOGDIR)
```
- After wavenet training, feed a mel spectrogram generated by tacotron (an npy file) in as the local condition to get the final TTS output.
> python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10
### Result
- Tacotron batch_size = 32, Wavenet batch_size = 8, on a GTX 1080ti.
- Tacotron was trained for 100K steps, Wavenet for 177K.
- The samples directory contains the generated wav files.
- There are samples generated with Griffin-Lim and samples generated with the Wavenet Vocoder.
- The Wavenet-generated audio still contains noise due to insufficient training.
================================================
FILE: datasets/__init__.py
================================================
# -*- coding: utf-8 -*-
from .datafeeder_wavenet import DataFeederWavenet
================================================
FILE: datasets/datafeeder_tacotron2.py
================================================
# coding: utf-8
import os
import time
import pprint
import random
import threading
import traceback
import numpy as np
from glob import glob
import tensorflow as tf
from collections import defaultdict
import text
from utils.infolog import log
from utils import parallel_run, remove_file
from utils.audio import frames_to_hours
_pad = 0
_stop_token_pad = 1
def get_frame(path):
data = np.load(path)
n_frame = data["linear"].shape[0]
n_token = len(data["tokens"])
return (path, n_frame, n_token)
def get_path_dict(data_dirs, hparams, config,data_type, n_test=None,rng=np.random.RandomState(123)):
# Load metadata:
path_dict = {}
for data_dir in data_dirs: # ['datasets/moon\\data']
paths = glob("{}/*.npz".format(data_dir)) # ['datasets/moon\\data\\001.0000.npz', 'datasets/moon\\data\\001.0001.npz', 'datasets/moon\\data\\001.0002.npz', ...]
if data_type == 'train':
rng.shuffle(paths) # ['datasets/moon\\data\\012.0287.npz', 'datasets/moon\\data\\004.0215.npz', 'datasets/moon\\data\\003.0149.npz', ...]
if not config.skip_path_filter:
items = parallel_run( get_frame, paths, desc="filter_by_min_max_frame_batch", parallel=True) # [('datasets/moon\\data\\012.0287.npz', 130, 21), ('datasets/moon\\data\\003.0149.npz', 209, 37), ...]
min_n_frame = hparams.min_n_frame # 5*30
max_n_frame = hparams.max_n_frame - 1 # 5*200 - 5
# The next step drops a lot of the data; items with short texts are filtered out.
new_items = [(path, n) for path, n, n_tokens in items if min_n_frame <= n <= max_n_frame and n_tokens >= hparams.min_tokens] # [('datasets/moon\\data\\004.0383.npz', 297), ('datasets/moon\\data\\003.0533.npz', 394),...]
new_paths = [path for path, n in new_items]
new_n_frames = [n for path, n in new_items]
hours = frames_to_hours(new_n_frames,hparams)
log(' [{}] Loaded metadata for {} examples ({:.2f} hours)'.format(data_dir, len(new_n_frames), hours))
log(' [{}] Max length: {}'.format(data_dir, max(new_n_frames)))
log(' [{}] Min length: {}'.format(data_dir, min(new_n_frames)))
else:
new_paths = paths
# Split into train data and test data.
if data_type == 'train':
new_paths = new_paths[:-n_test] # everything except the last n_test (batch_size) items
elif data_type == 'test':
new_paths = new_paths[-n_test:] # the last n_test items
else:
raise Exception(" [!] Unknown data_type: {}".format(data_type))
path_dict[data_dir] = new_paths # ['datasets/moon\\data\\001.0621.npz', 'datasets/moon\\data\\003.0229.npz', ...]
return path_dict
# run -> _enqueue_next_group -> _get_next_example
class DataFeederTacotron2(threading.Thread):
'''Feeds batches of data into a queue on a background thread.'''
def __init__(self, coordinator, data_dirs,hparams, config, batches_per_group, data_type, batch_size): #batches_per_group = 32 or 8, data_type: 'train' or 'test'
super(DataFeederTacotron2, self).__init__()
self._coord = coordinator
self._hp = hparams
self._cleaner_names = [x.strip() for x in hparams.cleaners.split(',')]
self._step = 0
self._offset = defaultdict(lambda: 2)
self._batches_per_group = batches_per_group
self.rng = np.random.RandomState(config.random_seed) # random number generator
self.data_type = data_type
self.batch_size = batch_size
self.min_tokens = hparams.min_tokens # 30
self.min_n_frame = hparams.min_n_frame # 5*30
self.max_n_frame = hparams.max_n_frame - 1 # 5*200 - 5
self.skip_path_filter = config.skip_path_filter
# Load metadata:
self.path_dict = get_path_dict(data_dirs, self._hp, config, self.data_type,n_test=self.batch_size, rng=self.rng) # data_dirs: ['datasets/moon\\data']
self.data_dirs = list(self.path_dict.keys()) # ['datasets/moon\\data']
self.data_dir_to_id = {data_dir: idx for idx, data_dir in enumerate(self.data_dirs)} # {'datasets/moon\\data': 0}
data_weight = {data_dir: 1. for data_dir in self.data_dirs} # {'datasets/moon\\data': 1.0}
if self._hp.main_data_greedy_factor > 0 and any(main_data in data_dir for data_dir in self.data_dirs for main_data in self._hp.main_data): # 'main_data': ['']
for main_data in self._hp.main_data:
for data_dir in self.data_dirs:
if main_data in data_dir:
data_weight[data_dir] += self._hp.main_data_greedy_factor
weight_Z = sum(data_weight.values()) # 1
self.data_ratio = { data_dir: weight / weight_Z for data_dir, weight in data_weight.items()} # normalize so that the data weights sum to 1
log("="*40)
log('Data Amount:')
log(pprint.pformat(self.data_ratio, indent=4))
log("="*40)
#audio_paths = [path.replace("/data/", "/audio/").replace(".npz", ".wav") for path in self.data_paths]
#duration = get_durations(audio_paths, print_detail=False)
# Create placeholders for inputs and targets. Don't specify batch size because we want to
# be able to feed different sized batches at eval time.
self._placeholders = [
tf.placeholder(tf.int32, [None, None], 'inputs'),
tf.placeholder(tf.int32, [None], 'input_lengths'),
tf.placeholder(tf.float32, [None], 'loss_coeff'),
tf.placeholder(tf.float32, [None, None, hparams.num_mels], 'mel_targets'),
tf.placeholder(tf.float32, [None, None, hparams.num_freq], 'linear_targets'),
tf.placeholder(tf.float32, [None, None], 'stop_token_targets')
]
# Create queue for buffering data:
dtypes = [tf.int32, tf.int32, tf.float32, tf.float32, tf.float32, tf.float32]
self.is_multi_speaker = len(self.data_dirs) > 1
if self.is_multi_speaker:
self._placeholders.append( tf.placeholder(tf.int32, [None], 'speaker_id'),)
dtypes.append(tf.int32)
num_worker = 8 if self.data_type == 'train' else 1
queue = tf.FIFOQueue(num_worker, dtypes, name='input_queue')
self._enqueue_op = queue.enqueue(self._placeholders)
if self.is_multi_speaker:
self.inputs, self.input_lengths, self.loss_coeff, self.mel_targets, self.linear_targets,self.stop_token_targets, self.speaker_id = queue.dequeue()
else:
self.inputs, self.input_lengths, self.loss_coeff, self.mel_targets, self.linear_targets,self.stop_token_targets = queue.dequeue()
self.inputs.set_shape(self._placeholders[0].shape)
self.input_lengths.set_shape(self._placeholders[1].shape)
self.loss_coeff.set_shape(self._placeholders[2].shape)
self.mel_targets.set_shape(self._placeholders[3].shape)
self.linear_targets.set_shape(self._placeholders[4].shape)
self.stop_token_targets.set_shape(self._placeholders[5].shape)
if self.is_multi_speaker:
self.speaker_id.set_shape(self._placeholders[6].shape)
else:
self.speaker_id = None
if self.data_type == 'test':
examples = []
while True:
for data_dir in self.data_dirs:
examples.append(self._get_next_example(data_dir))
#print(data_dir, text.sequence_to_text(examples[-1][0], False, True))
if len(examples) >= self.batch_size:
break
if len(examples) >= self.batch_size:
break
# During testing, the same examples are reused over and over.
self.static_batches = [examples for _ in range(self._batches_per_group)] # [examples, examples, ..., examples] <--- each example holds 2 items of data.
else:
self.static_batches = None
def start_in_session(self, session, start_step):
self._step = start_step
self._session = session
self.start()
def run(self):
try:
while not self._coord.should_stop():
self._enqueue_next_group()
except Exception as e:
traceback.print_exc()
self._coord.request_stop(e)
def _enqueue_next_group(self):
start = time.time()
# Read a group of examples:
n = self.batch_size # 32
r = self._hp.reduction_factor # 4 or 5; also used earlier to compute min_n_frame and max_n_frame
if self.static_batches is not None: # 'test' uses static_batches, which were already built in __init__.
batches = self.static_batches
else: # 'train'
examples = []
for data_dir in self.data_dirs:
if self._hp.initial_data_greedy:
if self._step < self._hp.initial_phase_step and any("krbook" in data_dir for data_dir in self.data_dirs):
data_dir = [data_dir for data_dir in self.data_dirs if "krbook" in data_dir][0]
if self._step < self._hp.initial_phase_step: # 'initial_phase_step': 8000
example = [self._get_next_example(data_dir) for _ in range(int(n * self._batches_per_group // len(self.data_dirs)))] # builds _batches_per_group (8 or 32) batches' worth of data; each batch has size 2 or 32
else:
example = [self._get_next_example(data_dir) for _ in range(int(n * self._batches_per_group * self.data_ratio[data_dir]))]
examples.extend(example)
examples.sort(key=lambda x: x[-1]) # the last element is the key, i.e., sort by len(linear_target)
batches = [examples[i:i+n] for i in range(0, len(examples), n)]
self.rng.shuffle(batches)
log('Generated %d batches of size %d in %.03f sec' % (len(batches), n, time.time() - start))
for batch in batches: # batches is a group of batches.
# Feed the batch data (built for test or train mode) into the placeholders.
feed_dict = dict(zip(self._placeholders, _prepare_batch(batch, r, self.rng, self.data_type))) # _prepare_batch pads the batch data to a common length; the return order matches the placeholder order
self._session.run(self._enqueue_op, feed_dict=feed_dict)
self._step += 1
def _get_next_example(self, data_dir):
'''Reads and processes one npz file. Loads a single example (input, mel_target, linear_target, cost) from disk.'''
data_paths = self.path_dict[data_dir]
while True:
if self._offset[data_dir] >= len(data_paths):
self._offset[data_dir] = 0
if self.data_type == 'train':
self.rng.shuffle(data_paths)
data_path = data_paths[self._offset[data_dir]] # pick one npz file
self._offset[data_dir] += 1
try:
if os.path.exists(data_path):
data = np.load(data_path) # data contains "linear", "mel", "tokens", "loss_coeff"
else:
continue
except:
remove_file(data_path)
continue
if not self.skip_path_filter:
break
if self.min_n_frame <= data["linear"].shape[0] <= self.max_n_frame and len(data["tokens"]) > self.min_tokens:
break
input_data = data['tokens'] # 1-dim
mel_target = data['mel']
if 'loss_coeff' in data:
loss_coeff = data['loss_coeff']
else:
loss_coeff = 1
linear_target = data['linear']
stop_token_target = np.asarray([0.] * len(mel_target)) # mel_target is [T, 80], so the length differs per example; [0, ..., 0] of that length
# If this is not multi-speaker, speaker_id would not need to be passed, but the current implementation is a bit tangled, so it is passed unconditionally.
if self.is_multi_speaker:
return (input_data, loss_coeff, mel_target, linear_target,stop_token_target, self.data_dir_to_id[data_dir], len(linear_target))
else:
return (input_data, loss_coeff, mel_target, linear_target,stop_token_target, len(linear_target))
def _prepare_batch(batch, reduction_factor, rng, data_type=None):
# (input_data, loss_coeff, mel_target, linear_target,stop_token_target, speaker_id, len(linear_target))
if data_type == 'train':
rng.shuffle(batch)
# batch data: (input_data, loss_coeff, mel_target, linear_target, self.data_dir_to_id[data_dir], len(linear_target))
inputs = _prepare_inputs([x[0] for x in batch]) # pad to the length of the longest example in the batch.
input_lengths = np.asarray([len(x[0]) for x in batch], dtype=np.int32) # batch_size, [37, 37, 32, 32, 38,..., 39, 36, 30]
loss_coeff = np.asarray([x[1] for x in batch], dtype=np.float32) # batch_size, [1,1,1,,..., 1,1,1]
mel_targets = _prepare_targets([x[2] for x in batch], reduction_factor) # ---> (32, 175, 80); the max length is rounded up to a multiple of reduction_factor
linear_targets = _prepare_targets([x[3] for x in batch], reduction_factor) # ---> (32, 175, 1025); the max length is rounded up to a multiple of reduction_factor
stop_token_targets = _prepare_stop_token_targets([x[4] for x in batch], reduction_factor)
if len(batch[0]) == 7: # the is_multi_speaker = True case
speaker_id = np.asarray([x[5] for x in batch], dtype=np.int32) # collect the speaker_ids into an array
return (inputs, input_lengths, loss_coeff,mel_targets, linear_targets,stop_token_targets, speaker_id)
else:
return (inputs, input_lengths, loss_coeff, mel_targets, linear_targets,stop_token_targets) # ('inputs' 'input_lengths' 'loss_coeff' 'mel_targets' 'linear_targets')
def _prepare_inputs(inputs): # inputs: a list of batch_size elements
max_len = max((len(x) for x in inputs))
return np.stack([_pad_input(x, max_len) for x in inputs]) # (batch_size, max_len)
"""
When batch_size = 2,
[[13, 26, 13, 41, 13, 21, 13, 41, 13, 21, 13, 41, 9, 41, 13, 40,79, 14, 34, 13, 33, 79, 20, 32, 13, 35, 45, 2, 34, 42, 13, 39,7, 29, 11, 25, 1],
[ 6, 29, 79, 14, 26, 14, 34, 5, 29, 79, 2, 30, 45, 2, 28, 14,21, 79, 13, 27, 7, 25, 9, 34, 45, 13, 40, 79, 4, 29, 2, 29,13, 26, 1, 0, 0]]
"""
def _prepare_targets(targets, alignment):
# targets: shape of list [ (162,80) , (172, 80), ...]
max_len = max((len(t) for t in targets)) + 1
return np.stack([_pad_target(t, _round_up(max_len, alignment)) for t in targets])
def _prepare_stop_token_targets(targets, alignment):
max_len = max((len(t) for t in targets)) + 1
return np.stack([_pad_stop_token_target(t, _round_up(max_len, alignment)) for t in targets])
def _pad_input(x, length):
return np.pad(x, (0, length - x.shape[0]), mode='constant', constant_values=_pad)
def _pad_target(t, length):
# t: 2 dim array. ( xx, num_mels) ==> (length,num_mels)
return np.pad(t, [(0, length - t.shape[0]), (0,0)], mode='constant', constant_values=_pad) # (169, 80) ==> (length, 80)
###
def _pad_stop_token_target(t, length):
return np.pad(t, (0, length - t.shape[0]), mode='constant', constant_values=_stop_token_pad)
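# Padding stop-token targets with 1 (_stop_token_pad) rather than 0 marks every frame past the end of the utterance as "stop", which is the signal the stop-token predictor is trained against.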
def _round_up(x, multiple):
remainder = x % multiple
return x if remainder == 0 else x + multiple - remainder
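# e.g., _round_up(173, 5) == 175, so padded target lengths become multiples of the reduction factor.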
if __name__ == '__main__':
from hparams import hparams
import argparse
from utils import str2bool
parser = argparse.ArgumentParser()
parser.add_argument('--random_seed', type=int, default=123)
parser.add_argument('--batch_size', type=int, default=4)
parser.add_argument('--skip_path_filter', type=str2bool, default=True, help='Use only for debugging')
config = parser.parse_args()
coord = tf.train.Coordinator()
data_dirs=['D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon']
mydatafeed = DataFeederTacotron2(coord, data_dirs, hparams, config, 32,data_type='train', batch_size=config.batch_size)
with tf.Session() as sess:
try:
sess.run(tf.global_variables_initializer())
step = 0
mydatafeed.start_in_session(sess,step)
while not coord.should_stop():
a,b,c,d=sess.run([mydatafeed.inputs, mydatafeed.input_lengths, mydatafeed.mel_targets,mydatafeed.stop_token_targets])
print(a.shape,c.shape,d.shape)
print(step,b)
print('stop token:', d[0])
print('-'*10)
a,b,c=sess.run([mydatafeed.inputs, mydatafeed.input_lengths, mydatafeed.mel_targets])
print(a.shape,c.shape)
print(step,b)
print('='*10)
step = step +1
if step > 3:
raise Exception('End xxx')
except Exception as e:
print('finally')
print(e)
coord.request_stop(e)
================================================
FILE: datasets/datafeeder_wavenet.py
================================================
# -*- coding: utf-8 -*-
import sys
sys.path.append("../")
import tensorflow as tf
import threading
import random
import numpy as np
import os
from utils import audio
from hparams import hparams
from glob import glob
from collections import defaultdict
def get_path_dict(data_dirs, min_length):
path_dict = {}
for data_dir in data_dirs:
if not hparams.skip_path_filter:
with open(os.path.join(data_dir,'train.txt'), 'r', encoding='utf-8') as f:
lines = f.readlines()
new_paths = []
for line in lines:
line = line.strip().split("|")
if int(line[3]) > min_length:
new_paths.append(line[6])
path_dict[data_dir] = new_paths
else:
new_paths = glob("{}/*.npz".format(data_dir))
new_paths = [os.path.basename(p) for p in new_paths]
path_dict[data_dir] = new_paths
return path_dict
def assert_ready_for_upsampling(x, c,hop_size):
assert len(x) % len(c) == 0 and len(x) // len(c) == hop_size
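# e.g., with hop_size = 256, a local condition of 100 mel frames must pair with exactly 25600 audio samples.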
def ensure_divisible(length, divisible_by=256, lower=True):
if length % divisible_by == 0:
return length
if lower:
return length - length % divisible_by
else:
return length + (divisible_by - length % divisible_by)
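# e.g., ensure_divisible(1000, 256, lower=True) == 768 and ensure_divisible(1000, 256, lower=False) == 1024.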
class DataFeederWavenet(threading.Thread):
def __init__(self,coord,data_dirs,batch_size, gc_enable=False,test_mode=False, queue_size=8):
super(DataFeederWavenet, self).__init__()
self.data_dirs = data_dirs
self.coord = coord
self.batch_size = batch_size
self.hop_size = audio.get_hop_size(hparams)
self.sample_size = ensure_divisible(hparams.sample_size,self.hop_size, True)
self.max_frames = self.sample_size // self.hop_size # number of frames needed to cover sample_size samples.
self.queue_size = queue_size
self.gc_enable = gc_enable
self.skip_path_filter = hparams.skip_path_filter
self.test_mode = test_mode
if test_mode:
assert batch_size==1
self.rng = np.random.RandomState(123)
self._offset = defaultdict(lambda: 2) # keys not seen before default to 2.
self.data_dir_to_id = {data_dir: idx for idx, data_dir in enumerate(self.data_dirs)} # data_dir <---> speaker_id mapping
self.path_dict = get_path_dict(self.data_dirs,self.sample_size)# discards anything shorter than the receptive_field and returns the rest.
self._placeholders = [
tf.placeholder(tf.float32, shape=[None,None,1],name='input_wav'),
tf.placeholder(tf.float32, shape=[None,None,hparams.num_mels],name='local_condition')
]
dtypes = [tf.float32, tf.float32]
if self.gc_enable:
self._placeholders.append(tf.placeholder(tf.int32, shape=[None],name='speaker_id'))
dtypes.append(tf.int32)
queue = tf.FIFOQueue(self.queue_size, dtypes, name='input_queue')
self.enqueue = queue.enqueue(self._placeholders)
if self.gc_enable:
self.inputs_wav, self.local_condition, self.speaker_id = queue.dequeue()
else:
self.inputs_wav, self.local_condition = queue.dequeue()
self.inputs_wav.set_shape(self._placeholders[0].shape)
self.local_condition.set_shape(self._placeholders[1].shape)
if self.gc_enable:
self.speaker_id.set_shape(self._placeholders[2].shape)
def run(self):
try:
while not self.coord.should_stop():
self.make_batches()
except Exception as e:
self.coord.request_stop(e)
def start_in_session(self, session,start_step):
self._step = start_step
self.sess = session
self.start()
def make_batches(self):
examples = []
n = self.batch_size
for data_dir in self.data_dirs:
example = [self._get_next_example(data_dir) for _ in range(int(n * 32 // len(self.data_dirs)))]
examples.extend(example)
self.rng.shuffle(examples)
batches = [examples[i:i+n] for i in range(0, len(examples), n)]
for batch in batches: # produce as many batch-size chunks of data as needed.
feed_dict = dict(zip(self._placeholders, _prepare_batch(batch)))
self.sess.run(self.enqueue, feed_dict=feed_dict)
self._step += 1
def _get_next_example(self, data_dir):
'''Reads and processes one npz file. Loads a single example (input_wav, local_condition, speaker_id) from disk.'''
data_paths = self.path_dict[data_dir]
while True:
if self._offset[data_dir] >= len(data_paths):
self._offset[data_dir] = 0
self.rng.shuffle(data_paths)
data_path = os.path.join(data_dir,data_paths[self._offset[data_dir]]) # pick one npz file
self._offset[data_dir] += 1
if os.path.exists(data_path):
data = np.load(data_path) # data contains 'audio', 'mel', 'linear', 'time_steps', 'mel_frames', 'text', 'token'
else:
continue
if not self.skip_path_filter:
# Already filtered once in get_path_dict, so no need to check again here.
break
# Not filtered in get_path_dict, so a check is needed here.
if data['time_steps'] > self.sample_size or self.test_mode:
break
input_wav = data['audio']
local_condition = data['mel']
input_wav = input_wav.reshape(-1, 1)
assert_ready_for_upsampling(input_wav, local_condition,self.hop_size)
if self.test_mode==False: # test mode uses the whole clip; train mode takes only sample_size samples
s = np.random.randint(0, len(local_condition) - self.max_frames+1) # hccho
ts = s * self.hop_size
input_wav = input_wav[ts:ts + self.hop_size * self.max_frames, :]
local_condition = local_condition[s:s + self.max_frames, :]
if self.gc_enable:
return (input_wav,local_condition, self.data_dir_to_id[data_dir])
else: return (input_wav,local_condition)
def _prepare_batch(batch):
input_wavs = [x[0] for x in batch]
local_conditions = [x[1] for x in batch]
if len(batch[0])==3:
speaker_ids = [x[2] for x in batch]
return (input_wavs,local_conditions,speaker_ids)
else:
return (input_wavs,local_conditions)
if __name__ == '__main__':
coord = tf.train.Coordinator()
data_dirs=['D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon','D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\son']
mydatafeed = DataFeederWavenet(coord,data_dirs,batch_size=5, gc_enable=True, queue_size=8) # note: DataFeederWavenet takes no receptive_field argument
with tf.Session() as sess:
try:
sess.run(tf.global_variables_initializer())
step = 0
mydatafeed.start_in_session(sess,step)
while not coord.should_stop():
a,b,c=sess.run([mydatafeed.inputs_wav, mydatafeed.local_condition, mydatafeed.speaker_id])
print(a.shape,b.shape,c.shape)
print(step, c)
a,b,c=sess.run([mydatafeed.inputs_wav, mydatafeed.local_condition, mydatafeed.speaker_id])
print(a.shape,b.shape,c.shape)
print(step, c)
step = step +1
except Exception as e:
print('finally')
coord.request_stop(e)
================================================
FILE: datasets/moon/moon-recognition-All.json
================================================
{
"./datasets/moon/audio/003.0000.wav": "존경하는 독일 국민 여러분",
"./datasets/moon/audio/003.0001.wav": "고국에 계신 국민 여러분",
"./datasets/moon/audio/003.0002.wav": "하울젠 쾨르버재단 이사님과",
"./datasets/moon/audio/003.0003.wav": "모드로",
"./datasets/moon/audio/003.0004.wav": "전 동독 총리님을 비롯한",
"./datasets/moon/audio/003.0005.wav": "내외 귀빈 여러분",
"./datasets/moon/audio/003.0006.wav": "먼저 냉전과 분단을 넘어",
"./datasets/moon/audio/003.0007.wav": "통일을 이루고",
"./datasets/moon/audio/003.0008.wav": "그 힘으로 유럽통합과 국제평화를 선도하고 있는",
"./datasets/moon/audio/003.0009.wav": "독일과",
"./datasets/moon/audio/003.0010.wav": "독일 국민에게",
"./datasets/moon/audio/003.0011.wav": "무한한 경의를 표합니다",
"./datasets/moon/audio/003.0012.wav": "오늘 이 자리를 마련해 주신",
"./datasets/moon/audio/003.0013.wav": "독일 정부와 쾨르버 재단에도",
"./datasets/moon/audio/003.0014.wav": "감사드립니다",
"./datasets/moon/audio/003.0015.wav": "아울러 얼마 전 별세하신",
"./datasets/moon/audio/003.0016.wav": "고",
"./datasets/moon/audio/003.0017.wav": "헬무트 콜 총리의 가족과",
"./datasets/moon/audio/003.0018.wav": "독일 국민들에게 깊은 애도와",
"./datasets/moon/audio/003.0019.wav": "위로의 마음을 전합니다",
"./datasets/moon/audio/003.0020.wav": "대한민국은",
"./datasets/moon/audio/003.0021.wav": "냉전시기",
"./datasets/moon/audio/003.0022.wav": "어려운 환경 속에서도",
"./datasets/moon/audio/003.0023.wav": "적극적이고",
"./datasets/moon/audio/003.0024.wav": "능동적인 외교로",
"./datasets/moon/audio/003.0025.wav": "독일 통일과 유럽통합을 주도한",
"./datasets/moon/audio/003.0026.wav": "헬무트",
"./datasets/moon/audio/003.0027.wav": "콜 총리의 위대한 업적을 기억할 것입니다",
"./datasets/moon/audio/003.0028.wav": "친애하는 내외 귀빈 여러분",
"./datasets/moon/audio/003.0029.wav": "이곳 베를린은",
"./datasets/moon/audio/003.0030.wav": "지금으로부터 17년 전",
"./datasets/moon/audio/003.0031.wav": "한국의 김대중 대통령이",
"./datasets/moon/audio/003.0032.wav": "남북 화해·협력의 기틀을 마련한",
"./datasets/moon/audio/003.0033.wav": "베를린 선언을 발표한 곳입니다",
"./datasets/moon/audio/003.0034.wav": "여기 알테스 슈타트하우스는",
"./datasets/moon/audio/003.0035.wav": "독일 통일조약 협상이 이뤄졌던",
"./datasets/moon/audio/003.0036.wav": "역사적 현장입니다",
"./datasets/moon/audio/003.0037.wav": "나는 오늘",
"./datasets/moon/audio/003.0038.wav": "베를린의 교훈이 살아있는 이 자리에서",
"./datasets/moon/audio/003.0039.wav": "대한민국 새 정부의 한반도 평화 구상을",
"./datasets/moon/audio/003.0040.wav": "말씀드리고자 합니다",
"./datasets/moon/audio/003.0041.wav": "내외 귀빈 여러분",
"./datasets/moon/audio/003.0042.wav": "독일 통일의 경험은",
"./datasets/moon/audio/003.0043.wav": "지구상",
"./datasets/moon/audio/003.0044.wav": "마지막 분단국가로 남은 우리에게",
"./datasets/moon/audio/003.0045.wav": "통일에 대한 희망과 함께",
"./datasets/moon/audio/003.0046.wav": "우리가 나아가야 할 방향을 말해주고 있습니다",
"./datasets/moon/audio/003.0047.wav": "그것은 우선",
"./datasets/moon/audio/003.0048.wav": "통일에 이르는",
"./datasets/moon/audio/003.0049.wav": "과정의 중요성입니다",
"./datasets/moon/audio/006.0000.wav": "존경하고 사랑하는 국민 여러분",
"./datasets/moon/audio/006.0001.wav": "감사합니다",
"./datasets/moon/audio/006.0002.wav": "국민 여러분의",
"./datasets/moon/audio/006.0003.wav": "위대한 선택에",
"./datasets/moon/audio/006.0004.wav": "머리 숙여",
"./datasets/moon/audio/006.0005.wav": "깊이",
"./datasets/moon/audio/006.0006.wav": "감사드립니다",
"./datasets/moon/audio/006.0007.wav": "저는 오늘",
"./datasets/moon/audio/006.0008.wav": "대한민국",
"./datasets/moon/audio/006.0009.wav": "제19대 대통령으로서",
"./datasets/moon/audio/006.0010.wav": "새로운 대한민국을 향해",
"./datasets/moon/audio/006.0011.wav": "첫걸음을 내딛습니다",
"./datasets/moon/audio/006.0012.wav": "지금 제 두 어깨는",
"./datasets/moon/audio/006.0013.wav": "국민 여러분으로부터",
"./datasets/moon/audio/006.0014.wav": "부여받은",
"./datasets/moon/audio/006.0015.wav": "막중한 소명감으로",
"./datasets/moon/audio/006.0016.wav": "무겁습니다",
"./datasets/moon/audio/006.0017.wav": "지금 제 가슴은",
"./datasets/moon/audio/006.0018.wav": "한 번도 경험하지 못한",
"./datasets/moon/audio/006.0019.wav": "나라를 만들겠다는 열정으로 뜨겁습니다",
"./datasets/moon/audio/006.0020.wav": "그리고 지금 제 머리는",
"./datasets/moon/audio/006.0021.wav": "통합과 공존의",
"./datasets/moon/audio/006.0022.wav": "새로운 세상을 열어갈",
"./datasets/moon/audio/006.0023.wav": "청사진으로",
"./datasets/moon/audio/006.0024.wav": "가득 차 있습니다",
"./datasets/moon/audio/006.0025.wav": "우리가 만들어가려는 새로운 대한민국은",
"./datasets/moon/audio/006.0026.wav": "숱한 좌절과 패배에도 불구하고",
"./datasets/moon/audio/006.0027.wav": "우리의 선대들이",
"./datasets/moon/audio/006.0028.wav": "일관되게 추구했던 나라입니다",
"./datasets/moon/audio/006.0029.wav": "또 많은 희생과 헌신을 감내하며",
"./datasets/moon/audio/006.0030.wav": "우리 젊은이들이",
"./datasets/moon/audio/006.0031.wav": "그토록 이루고 싶어했던",
"./datasets/moon/audio/006.0032.wav": "나라입니다",
"./datasets/moon/audio/006.0033.wav": "그런 대한민국을 만들기 위해 저는",
"./datasets/moon/audio/006.0034.wav": "역사와 국민 앞에",
"./datasets/moon/audio/006.0035.wav": "두렵지만",
"./datasets/moon/audio/006.0036.wav": "겸허한 마음으로",
"./datasets/moon/audio/006.0037.wav": "대한민국",
"./datasets/moon/audio/006.0038.wav": "제19대",
"./datasets/moon/audio/006.0039.wav": "대통령으로서의",
"./datasets/moon/audio/006.0040.wav": "책임과 소명을 다할 것임을 천명합니다",
"./datasets/moon/audio/006.0041.wav": "함께 선거를 치른 후보들께",
"./datasets/moon/audio/006.0042.wav": "감사의 말씀과 함께",
"./datasets/moon/audio/006.0043.wav": "심심한",
"./datasets/moon/audio/006.0044.wav": "위로를 전합니다",
"./datasets/moon/audio/006.0045.wav": "이번 선거에서는",
"./datasets/moon/audio/006.0046.wav": "승자도",
"./datasets/moon/audio/006.0047.wav": "패자도 없습니다",
"./datasets/moon/audio/006.0048.wav": "우리는",
"./datasets/moon/audio/006.0062.wav": "정치적 격변기를 보냈습니다",
"./datasets/moon/audio/006.0063.wav": "정치는 혼란스러웠지만",
"./datasets/moon/audio/006.0065.wav": "현직 대통령의 탄핵과 구속 앞에서도",
"./datasets/moon/audio/006.0067.wav": "대한민국의 앞길을 열어주셨습니다",
"./datasets/moon/audio/006.0068.wav": "우리 국민들은 좌절하지 않고",
"./datasets/moon/audio/006.0093.wav": "2017년5월10일",
"./datasets/moon/audio/006.0098.wav": "존경하고 사랑하는 국민 여러분",
"./datasets/moon/audio/006.0104.wav": "바로 그 질문에서 새로 시작하겠습니다",
"./datasets/moon/audio/006.0108.wav": "구시대의 잘못된 관행과",
"./datasets/moon/audio/006.0115.wav": "광화문 대통령 시대를 열겠습니다",
"./datasets/moon/audio/006.0116.wav": "참모들과 머리와 어깨를 맞대고"
}
================================================
FILE: datasets/moon.py
================================================
# -*- coding: utf-8 -*-
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import numpy as np
import os,json
from utils import audio
from text import text_to_sequence
def build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda x: x):
"""
Preprocesses the speech dataset from a given input path to the given output directory
Args:
- hparams: hyper parameters
- in_dir: input directory that contains the files to preprocess
- out_dir: output directory of npz files
- num_workers: Optional, number of worker processes to parallelize across
- tqdm: Optional, provides a nice progress bar
Returns:
- A list of tuples describing the train examples; these should be written to train.txt
"""
executor = ProcessPoolExecutor(max_workers=num_workers)
futures = []
index = 1
path = os.path.join(in_dir, 'moon-recognition-All.json')
with open(path,encoding='utf-8') as f:
content = f.read()
data = json.loads(content)
for key, text in data.items():
wav_path = key.strip().split('/')
wav_path = os.path.join(in_dir, 'audio', '%s' % wav_path[-1])
# In case of test file
if not os.path.exists(wav_path):
continue
futures.append(executor.submit(partial(_process_utterance, out_dir, wav_path, text,hparams)))
index += 1
return [future.result() for future in tqdm(futures) if future.result() is not None]
# result = []
# for future in tqdm(futures):
# if future.result() is not None:
# result.append(future.result())
#
# return result
def _process_utterance(out_dir, wav_path, text, hparams):
"""
Preprocesses a single utterance wav/text pair.
This writes the mel scale spectrogram to disk and returns a tuple to write
to the train.txt file.
Args:
- out_dir: output directory for the npz file
- wav_path: path to the audio file containing the speech input
- text: text spoken in the input audio file
- hparams: hyper parameters
Returns:
- A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text, npz_filename)
"""
try:
# Load the audio as numpy array
wav = audio.load_wav(wav_path, sr=hparams.sample_rate)
except FileNotFoundError: #catch missing wav exception
print('file {} present in csv metadata is not present in wav folder. skipping!'.format(
wav_path))
return None
#rescale wav
if hparams.rescaling: # hparams.rescale = True
wav = wav / np.abs(wav).max() * hparams.rescaling_max
#M-AILABS extra silence specific
if hparams.trim_silence: # hparams.trim_silence = True
wav = audio.trim_silence(wav, hparams) # Trim leading and trailing silence
#Mu-law quantize; the default input_type is 'raw'
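# (Reference: mu-law companding maps x in [-1, 1] to sign(x) * ln(1 + mu*|x|) / ln(1 + mu); here mu is derived from hparams.quantize_channels.)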
if hparams.input_type=='mulaw-quantize':
#[0, quantize_channels)
out = audio.mulaw_quantize(wav, hparams.quantize_channels)
#Trim silences
start, end = audio.start_and_end_indices(out, hparams.silence_threshold)
wav = wav[start: end]
out = out[start: end]
constant_values = audio.mulaw_quantize(0, hparams.quantize_channels) # was a NameError: mulaw_quantize lives in the audio module
out_dtype = np.int16
elif hparams.input_type=='mulaw':
#[-1, 1]
out = audio.mulaw(wav, hparams.quantize_channels)
constant_values = audio.mulaw(0., hparams.quantize_channels)
out_dtype = np.float32
else: # raw
#[-1, 1]
out = wav
constant_values = 0.
out_dtype = np.float32
# Compute the mel scale spectrogram from the wav
mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
mel_frames = mel_spectrogram.shape[1]
if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length: # hparams.max_mel_frames = 1000, hparams.clip_mels_length = True
return None
#Compute the linear scale spectrogram from the wav
linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32)
linear_frames = linear_spectrogram.shape[1]
#sanity check
assert linear_frames == mel_frames
if hparams.use_lws: # hparams.use_lws = False
#Ensure time resolution adjustment between audio and mel-spectrogram
fft_size = hparams.fft_size if hparams.win_size is None else hparams.win_size
l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams))
#Zero pad audio signal
out = np.pad(out, (l, r), mode='constant', constant_values=constant_values)
else:
#Ensure time resolution adjustment between audio and mel-spectrogram
pad = audio.librosa_pad_lr(wav, hparams.fft_size, audio.get_hop_size(hparams))
#Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency)
out = np.pad(out, pad, mode='reflect')
assert len(out) >= mel_frames * audio.get_hop_size(hparams)
#time resolution adjustment
#ensure length of raw audio is multiple of hop size so that we can use
#transposed convolution to upsample
out = out[:mel_frames * audio.get_hop_size(hparams)]
assert len(out) % audio.get_hop_size(hparams) == 0
time_steps = len(out)
# Write the spectrogram and audio to disk
wav_id = os.path.splitext(os.path.basename(wav_path))[0]
# Write the spectrograms to disk:
audio_filename = '{}-audio.npy'.format(wav_id)
mel_filename = '{}-mel.npy'.format(wav_id)
linear_filename = '{}-linear.npy'.format(wav_id)
npz_filename = '{}.npz'.format(wav_id)
npz_flag=True
if npz_flag:
# Use the same keys as the Tacotron code, for compatibility.
data = {
'audio': out.astype(out_dtype),
'mel': mel_spectrogram.T,
'linear': linear_spectrogram.T,
'time_steps': time_steps,
'mel_frames': mel_frames,
'text': text,
'tokens': text_to_sequence(text), # a trailing "1", corresponding to eos(~), is appended at the end.
'loss_coeff': 1 # For Tacotron
}
np.savez(os.path.join(out_dir,npz_filename ), **data, allow_pickle=False)
else:
np.save(os.path.join(out_dir, audio_filename), out.astype(out_dtype), allow_pickle=False)
np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)
np.save(os.path.join(out_dir, linear_filename), linear_spectrogram.T, allow_pickle=False)
# Return a tuple describing this training example
return (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text,npz_filename)
================================================
FILE: datasets/son/son-recognition-All.json
================================================
{
"./datasets/son/audio/NB10584578.0000.wav": "오늘부터 뉴스룸 2부에서는 그날의 주요사항을 한마디의 단어로 축약해서 앵커브리핑으로 풀어보겠습니다",
"./datasets/son/audio/NB10584578.0001.wav": "오늘 뉴스룸이 주목한다 던어는 저돌입니다",
"./datasets/son/audio/NB10584578.0002.wav": "돼지 저 자에 갑자기 돌 이 두 글자를 사용하는 이 단어는 흔히 추진력이 강하다는 의미로 쓰이죠",
"./datasets/son/audio/NB10584578.0003.wav": "난파 직전의 새정치연합을 책임지게 된 문희상 비대위원장이 이런 말을 했습니다",
"./datasets/son/audio/NB10584578.0004.wav": "난 그냥 산 돼지처럼 돌파하는 스타일이다",
"./datasets/son/audio/NB10584578.0005.wav": "이렇게 얘기했습니다",
"./datasets/son/audio/NB10584578.0006.wav": "몸이 좋지 않다면서 만남을 주저했던 김무성 새누리당 대표를 찾아 가서 만난 것도 바로 이런 적어도 저돌성이 없었다면 어려웠을지도 모르겠습니다 그렇다면",
"./datasets/son/audio/NB10584578.0007.wav": "문 비대위원장이 저돌적으로 돌파해야 할 과제는 무엇인가",
"./datasets/son/audio/NB10584578.0008.wav": "첫 번째는 계파주의 청산입니다",
"./datasets/son/audio/NB10584578.0009.wav": "지난 이천십이년 대선에서 민주통합당의 패배한 이후에 대선평가 위원장을 맡았던 한상진 서울대 명예교수가",
"./datasets/son/audio/NB10584578.0010.wav": "이런 보고서를 냈습니다",
"./datasets/son/audio/NB10584578.0011.wav": "계파정치 청산은 민주당의 미래를 위한 최우선 과제다",
"./datasets/son/audio/NB10584578.0012.wav": "아 이렇게 얘기했는데요 그러나 아시는 것처럼이 보고서는",
"./datasets/son/audio/NB10584578.0013.wav": "갖가지 반발 끝에 결국 채택되지 못했습니다",
"./datasets/son/audio/NB10584578.0014.wav": "아마 여당에서 한상진 교수 좋아하는 사람 별로 없을 겁니다",
"./datasets/son/audio/NB10584578.0015.wav": "문희상 당시 비대위원장이 공교롭게도 계파와 패권주의 청산을 내세웠던 바로 그 시기에 비대위원장 이었죠",
"./datasets/son/audio/NB10584578.0016.wav": "계파 청산에 관한 문 비대위원장은 어떻게 보면 실패했다고 봐야만 합니다",
"./datasets/son/audio/NB10584578.0017.wav": "권한은 공유하되 책임은 당 대표가 혼자지는 이런 기형적 구조가",
"./datasets/son/audio/NB10584578.0018.wav": "아 결국",
"./datasets/son/audio/NB10584578.0019.wav": "최근 사년 동안에 임기 2년에 야당 지도부 교체 숫자를",
"./datasets/son/audio/NB10584578.0020.wav": "늘려서 무료 열번이나 교체가 되었습니다",
"./datasets/son/audio/NB10584578.0021.wav": "같은 기간에 새누리당은 단 네명의 지도부가 바뀌었습니다",
"./datasets/son/audio/NB10584578.0022.wav": "실패가 구조화된 당의 체질을 바꾸지 않고서는 누가 리더가 되어도 쉽지 않다는 것을 상징적으로 내보여주는 숫자이기도 합니다",
"./datasets/son/audio/NB10584578.0023.wav": "자 두 번째 과제는 바로 이겁니다 수사권 기소권 문제로 교착상태에 빠지는 세월호 특별법 지금도 끝이 보이지 않는데요",
"./datasets/son/audio/NB10584578.0024.wav": "어떠한 추가 협상도",
"./datasets/son/audio/NB10584578.0025.wav": "불가하다 이렇게 못박은 청와대와",
"./datasets/son/audio/NB10584578.0026.wav": "여당을 어떻게 변화시킬 것인지 또한",
"./datasets/son/audio/NB10584578.0027.wav": "수사권과 기소권을 주장하는 유족들의 요구를 어떻게 담아낼 것인지",
"./datasets/son/audio/NB10584578.0028.wav": "겉은 장비 속은 조조라고 불리우는 의회주의자 문희상 비대위원장과 새정치연합이 저돌적으로 말 그대로 저돌적으로 풀어 가야 할",
"./datasets/son/audio/NB10584578.0029.wav": "과제인지도 모르겠습니다",
"./datasets/son/audio/NB10584578.0030.wav": "세월호 참사는 오늘로 백육십일째를 맞았습니다",
"./datasets/son/audio/NB10584578.0031.wav": "쓸쓸한 팽목항에는",
"./datasets/son/audio/NB10584578.0032.wav": "자원봉사자마저 하나둘 철수하고 있고",
"./datasets/son/audio/NB10584578.0033.wav": "슬픈 이천십사년은 오늘로 이제 딱",
"./datasets/son/audio/NB10584578.0034.wav": "백일이 남았습니다",
"./datasets/son/audio/NB10584578.0035.wav": "잠시 후에 문희상 비대위원장을 스튜디오에서 만나겠습니다",
"./datasets/son/audio/NB10585784.0001.wav": "자 이어서 앵커 브리핑 순서입니다 오늘 뉴스 룸이 주목한 단어는 덫입니다",
"./datasets/son/audio/NB10585784.0002.wav": "어 잔꾀를 부리다 자신이 놓은 덫에 스스로 걸리고 만 꼴이다",
"./datasets/son/audio/NB10585784.0003.wav": "국회 선진화법 개정을 추진하고 있는 새누리당을 향해서",
"./datasets/son/audio/NB10585784.0004.wav": "새정치민주연합에 박수현 의원이 이런 말을 했군요",
"./datasets/son/audio/NB10585784.0005.wav": "이 말을 이해하기 위해서는 지난 이천십이년에 국회로 한 걸음",
"./datasets/son/audio/NB10585784.0006.wav": "돌아가 봐야만 합니다",
"./datasets/son/audio/NB10585784.0007.wav": "기대보다는 걱정이 앞서는 것이",
"./datasets/son/audio/NB10585784.0008.wav": "솔직한 내 심정입니다",
"./datasets/son/audio/NB10585784.0009.wav": "이제 개정안이 통과된 이상 우리 여야가",
"./datasets/son/audio/NB10585784.0010.wav": "대화와 타협을 통해서",
"./datasets/son/audio/NB10585784.0011.wav": "국민들에게 신뢰받는 선진 국회를 만들어 가기를 간절히 바랍니다",
"./datasets/son/audio/NB10585784.0015.wav": "예 이렇게 세번 두들기고 법안은 통과가 되는데요",
"./datasets/son/audio/NB10585784.0016.wav": "국회선진화법은 재적의원 중에 과반이 아닌 오분의 삼이상이 찬성해야 만",
"./datasets/son/audio/NB10585784.0017.wav": "안건을 올릴 수 있도록 만든 법이죠"
}
================================================
FILE: datasets/son.py
================================================
# -*- coding: utf-8 -*-
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import numpy as np
import os,json
from utils import audio
from text import text_to_sequence
def build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda x: x):
"""
Preprocesses the speech dataset from a given input path to the given output directory
Args:
- hparams: hyper parameters
- in_dir: input directory that contains the files to preprocess
- out_dir: output directory of npz files
- num_workers: Optional, number of worker processes to parallelize across
- tqdm: Optional, provides a nice progress bar
Returns:
- A list of tuples describing the train examples; these should be written to train.txt
"""
executor = ProcessPoolExecutor(max_workers=num_workers)
futures = []
index = 1
path = os.path.join(in_dir, 'son-recognition-All.json')
with open(path,encoding='utf-8') as f:
content = f.read()
data = json.loads(content)
for key, text in data.items():
wav_path = key.strip().split('/')
wav_path = os.path.join(in_dir, 'audio', '%s' % wav_path[-1])
# In case of test file
if not os.path.exists(wav_path):
continue
futures.append(executor.submit(partial(_process_utterance, out_dir, wav_path, text,hparams)))
index += 1
return [future.result() for future in tqdm(futures) if future.result() is not None]
def _process_utterance(out_dir, wav_path, text, hparams):
"""
Preprocesses a single utterance wav/text pair.
This writes the mel scale spectrogram to disk and returns a tuple to write
to the train.txt file.
Args:
- out_dir: output directory for the npz file
- wav_path: path to the audio file containing the speech input
- text: text spoken in the input audio file
- hparams: hyper parameters
Returns:
- A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text, npz_filename)
"""
try:
# Load the audio as numpy array
wav = audio.load_wav(wav_path, sr=hparams.sample_rate)
except FileNotFoundError: #catch missing wav exception
print('file {} present in csv metadata is not present in wav folder. skipping!'.format(wav_path))
return None
#rescale wav
if hparams.rescaling: # hparams.rescale = True
wav = wav / np.abs(wav).max() * hparams.rescaling_max
#M-AILABS extra silence specific
if hparams.trim_silence: # hparams.trim_silence = True
wav = audio.trim_silence(wav, hparams) # Trim leading and trailing silence
#Mu-law quantize; the default input_type is 'raw'
if hparams.input_type=='mulaw-quantize':
#[0, quantize_channels)
out = audio.mulaw_quantize(wav, hparams.quantize_channels)
#Trim silences
start, end = audio.start_and_end_indices(out, hparams.silence_threshold)
wav = wav[start: end]
out = out[start: end]
constant_values = audio.mulaw_quantize(0, hparams.quantize_channels) # was a NameError: mulaw_quantize lives in the audio module
out_dtype = np.int16
elif hparams.input_type=='mulaw':
#[-1, 1]
out = audio.mulaw(wav, hparams.quantize_channels)
constant_values = audio.mulaw(0., hparams.quantize_channels)
out_dtype = np.float32
else: # raw
#[-1, 1]
out = wav
constant_values = 0.
out_dtype = np.float32
# Compute the mel scale spectrogram from the wav
mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
mel_frames = mel_spectrogram.shape[1]
if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length: # hparams.max_mel_frames = 1000, hparams.clip_mels_length = True
return None
#Compute the linear scale spectrogram from the wav
linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32)
linear_frames = linear_spectrogram.shape[1]
#sanity check
assert linear_frames == mel_frames
if hparams.use_lws: # hparams.use_lws = False
#Ensure time resolution adjustment between audio and mel-spectrogram
fft_size = hparams.fft_size if hparams.win_size is None else hparams.win_size
l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams))
#Zero pad audio signal
out = np.pad(out, (l, r), mode='constant', constant_values=constant_values)
else:
#Ensure time resolution adjustment between audio and mel-spectrogram
pad = audio.librosa_pad_lr(wav, hparams.fft_size, audio.get_hop_size(hparams))
#Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency)
out = np.pad(out, pad, mode='reflect')
assert len(out) >= mel_frames * audio.get_hop_size(hparams)
#time resolution adjustment
#ensure length of raw audio is multiple of hop size so that we can use
#transposed convolution to upsample
out = out[:mel_frames * audio.get_hop_size(hparams)]
assert len(out) % audio.get_hop_size(hparams) == 0
time_steps = len(out)
# Write the spectrogram and audio to disk
wav_id = os.path.splitext(os.path.basename(wav_path))[0]
# Write the spectrograms to disk:
audio_filename = '{}-audio.npy'.format(wav_id)
mel_filename = '{}-mel.npy'.format(wav_id)
linear_filename = '{}-linear.npy'.format(wav_id)
npz_filename = '{}.npz'.format(wav_id)
npz_flag=True
if npz_flag:
# Use the same keys as the Tacotron code, for compatibility.
data = {
'audio': out.astype(out_dtype),
'mel': mel_spectrogram.T,
'linear': linear_spectrogram.T,
'time_steps': time_steps,
'mel_frames': mel_frames,
'text': text,
'tokens': text_to_sequence(text), # a trailing "1", corresponding to eos(~), is appended at the end.
'loss_coeff': 1 # For Tacotron
}
np.savez(os.path.join(out_dir,npz_filename ), **data, allow_pickle=False)
else:
np.save(os.path.join(out_dir, audio_filename), out.astype(out_dtype), allow_pickle=False)
np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)
np.save(os.path.join(out_dir, linear_filename), linear_spectrogram.T, allow_pickle=False)
# Return a tuple describing this training example
return (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text,npz_filename)
================================================
FILE: generate.py
================================================
# coding: utf-8
"""
Since sample_rate = 16000, 48000 samples correspond to 3 seconds of audio.
> python generate.py --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2018-12-21T22-58-10
> python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2018-12-21T22-58-10 <----scalar_input = True
> python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10
python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-tacotron/generate/mel-2018-12-25_22-27-50-0.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10
gc_id = 0(moon), 1(son)
python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-22T23-08-16
python generate.py --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-22T23-08-16
"""
import argparse
from datetime import datetime
import json
import os,time
import librosa
import numpy as np
import tensorflow as tf
from wavenet import WaveNetModel, mu_law_decode, mu_law_encode
from hparams import hparams
from utils import load_hparams,load
from utils import audio
from utils import plot
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
def _interp(feats, in_range):
#rescales from [-max, max] (or [0, max]) to [0, 1]
return (feats - in_range[0]) / (in_range[1] - in_range[0])
def get_arguments():
def _str_to_bool(s):
"""Convert string to bool (in argparse context)."""
if s.lower() not in ['true', 'false']:
raise ValueError('Argument needs to be a boolean, got {}'.format(s))
return {'true': True, 'false': False}[s.lower()]
def _ensure_positive_float(f):
"""Ensure argument is a positive float."""
if float(f) < 0:
raise argparse.ArgumentTypeError('Argument must be greater than zero')
return float(f)
parser = argparse.ArgumentParser(description='WaveNet generation script')
parser.add_argument('checkpoint_dir', type=str, help='Which model checkpoint to generate from')
TEMPERATURE = 1.0
parser.add_argument('--temperature', type=_ensure_positive_float, default=TEMPERATURE,help='Sampling temperature')
LOGDIR = './logdir-wavenet'
parser.add_argument('--logdir',type=str,default=LOGDIR,help='Directory in which to store the logging information for TensorBoard.')
parser.add_argument('--wav_out_path',type=str,default=None,help='Path to output wav file')
BATCH_SIZE = 1
parser.add_argument('--batch_size', type=int, default=BATCH_SIZE,help='batch size')
parser.add_argument('--wav_seed',type=str,default=None,help='The wav file to start generation from')
parser.add_argument('--mel',type=str,default=None,help='mel input')
parser.add_argument('--gc_cardinality',type=int,default=None,help='Number of categories upon which we globally condition.')
parser.add_argument('--gc_id',type=int,default=None,help='ID of category to generate, if globally conditioned.')
arguments = parser.parse_args()
if hparams.gc_channels is not None:
if arguments.gc_cardinality is None:
raise ValueError("Globally conditioning but gc_cardinality not specified. Use --gc_cardinality=377 for full VCTK corpus.")
if arguments.gc_id is None:
raise ValueError("Globally conditioning, but global condition was not specified. Use --gc_id to specify global condition.")
return arguments
# def write_wav(waveform, sample_rate, filename):
# y = np.array(waveform)
# librosa.output.write_wav(filename, y, sample_rate)
# print('Updated wav file at {}'.format(filename))
def create_seed(filename,sample_rate,quantization_channels,window_size,scalar_input):
# Only the front part of the seed is used.
seed_audio, _ = librosa.load(filename, sr=sample_rate, mono=True)
seed_audio = audio.trim_silence(seed_audio, hparams)
if scalar_input:
if len(seed_audio) < window_size:
return seed_audio
else: return seed_audio[:window_size]
else:
quantized = mu_law_encode(seed_audio, quantization_channels)
# If it is shorter than window_size it is returned as-is, but shouldn't it at least be padded???
cut_index = tf.cond(tf.size(quantized) < tf.constant(window_size), lambda: tf.size(quantized), lambda: tf.constant(window_size))
return quantized[:cut_index]
def main():
config = get_arguments()
started_datestring = "{0:%Y-%m-%dT%H-%M-%S}".format(datetime.now())
logdir = os.path.join(config.logdir, 'generate', started_datestring)
if not os.path.exists(logdir):
os.makedirs(logdir)
load_hparams(hparams, config.checkpoint_dir)
with tf.device('/cpu:0'): # CPU is faster here. Setting this to GPU raises an error, and omitting tf.device is slower.
sess = tf.Session()
scalar_input = hparams.scalar_input
net = WaveNetModel(
batch_size=config.batch_size,
dilations=hparams.dilations,
filter_width=hparams.filter_width,
residual_channels=hparams.residual_channels,
dilation_channels=hparams.dilation_channels,
quantization_channels=hparams.quantization_channels,
out_channels =hparams.out_channels,
skip_channels=hparams.skip_channels,
use_biases=hparams.use_biases,
scalar_input=hparams.scalar_input,
global_condition_channels=hparams.gc_channels,
global_condition_cardinality=config.gc_cardinality,
local_condition_channels=hparams.num_mels,
upsample_factor=hparams.upsample_factor,
legacy = hparams.legacy,
residual_legacy = hparams.residual_legacy,
train_mode=False) # during training, global_condition_cardinality was determined by the AudioReader; here it must be supplied explicitly
if scalar_input:
samples = tf.placeholder(tf.float32,shape=[net.batch_size,None])
else:
samples = tf.placeholder(tf.int32,shape=[net.batch_size,None]) # samples: the mu_law_encode output, before one-hot conversion. (batch_size, length)
# The local condition should be (N, T, num_mels), but it is fed in one step at a time, so it is (N, 1, num_mels) --> squeezed to (N, num_mels).
upsampled_local_condition = tf.placeholder(tf.float32,shape=[net.batch_size,hparams.num_mels])
next_sample = net.predict_proba_incremental(samples,upsampled_local_condition, [config.gc_id]*net.batch_size) # applies the Fast Wavenet Generation Algorithm (arXiv:1611.09482)
# Build the upsampled local condition data to feed into the upsampled_local_condition placeholder.
mel_input = np.load(config.mel)
sample_size = mel_input.shape[0] * hparams.hop_size
mel_input = np.tile(mel_input,(config.batch_size,1,1))
with tf.variable_scope('wavenet',reuse=tf.AUTO_REUSE):
upsampled_local_condition_data = net.create_upsample(mel_input,upsample_type=hparams.upsample_type)
var_list = [var for var in tf.global_variables() if 'queue' not in var.name ]
saver = tf.train.Saver(var_list)
print('Restoring model from {}'.format(config.checkpoint_dir))
load(saver, sess, config.checkpoint_dir)
sess.run(net.queue_initializer) # without this, the queues still hold the values restored from the checkpoint.
quantization_channels = hparams.quantization_channels
if config.wav_seed:
# If wav_seed is shorter than receptive_field, shouldn't it be padded? It is returned as-is, so a seed that is too short causes an error.
seed = create_seed(config.wav_seed,hparams.sample_rate,quantization_channels,net.receptive_field,scalar_input) # --> mu-law encoded.
if scalar_input:
waveform = seed.tolist()
else:
waveform = sess.run(seed).tolist() # [116, 114, 120, 121, 127, ...]
print('Priming generation...')
for i, x in enumerate(waveform[-net.receptive_field: -1]): # the very last sample is fed in during the first iteration of the loop below.
if i % 100 == 0:
print('Priming sample {}/{}'.format(i,net.receptive_field), end='\r')
sess.run(next_sample, feed_dict={samples: np.array([x]*net.batch_size).reshape(net.batch_size,1), upsampled_local_condition: np.zeros([net.batch_size,hparams.num_mels])})
print('Done.')
waveform = np.array([waveform[-net.receptive_field:]]*net.batch_size)
else:
# Silence with a single random sample at the end.
if scalar_input:
waveform = [0.0] * (net.receptive_field - 1)
waveform = np.array(waveform*net.batch_size).reshape(net.batch_size,-1)
waveform = np.concatenate([waveform,2*np.random.rand(net.batch_size).reshape(net.batch_size,-1)-1],axis=-1) # append a random number in [-1, 1] at the end.
# waveform: shape (batch_size, net.receptive_field)
else:
waveform = [quantization_channels / 2] * (net.receptive_field - 1) # build receptive_field - 1 samples, then append one random sample below.
waveform = np.array(waveform*net.batch_size).reshape(net.batch_size,-1)
waveform = np.concatenate([waveform,np.random.randint(quantization_channels,size=net.batch_size).reshape(net.batch_size,-1)],axis=-1) # before one-hot conversion. (batch_size, 5117)
start_time = time.time()
upsampled_local_condition_data = sess.run(upsampled_local_condition_data)
last_sample_timestamp = datetime.now()
for step in range(sample_size): # loop sample_size times to generate the desired output length
window = waveform[:,-1:] # feed only the most recent sample into the samples placeholder. window: shape (N,1)
# Run the WaveNet to predict the next sample.
# in non-fast mode, window would be the full history: [128.0, 128.0, ..., 128.0, 178, 185]
# in fast mode, window is a single sample.
prediction = sess.run(next_sample, feed_dict={samples: window,upsampled_local_condition: upsampled_local_condition_data[:,step,:]}) # samples are mu-law encoded; converted to one-hot internally. --> (batch_size,256)
if scalar_input:
sample = prediction # sampled from the logistic distribution, so there is randomness.
else:
# Scale prediction distribution using temperature.
# when config.temperature == 1, the next step merely divides each element by the sum; softmax has already been applied, so the sum is 1 and the values are unchanged.
# when config.temperature != 1, the log of each element is divided by the temperature, then the result is rescaled so that it sums to 1.
np.seterr(divide='ignore')
scaled_prediction = np.log(prediction) / config.temperature # no change when config.temperature == 1.
scaled_prediction = (scaled_prediction - np.logaddexp.reduce(scaled_prediction,axis=-1,keepdims=True)) # np.log(np.sum(np.exp(scaled_prediction)))
scaled_prediction = np.exp(scaled_prediction)
np.seterr(divide='warn')
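# Worked example of temperature scaling (a sketch): with p = [0.7, 0.2, 0.1],
# and since exp(log(p)/T) == p**(1/T):
#   T = 0.5 -> p**2 renormalized  ~= [0.907, 0.074, 0.019]  (sharper, closer to argmax)
#   T = 2.0 -> sqrt(p) renormalized ~= [0.523, 0.279, 0.198] (flatter, more random)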
# Prediction distribution at temperature=1.0 should be unchanged after
# scaling.
if config.temperature == 1.0:
np.testing.assert_allclose( prediction, scaled_prediction, atol=1e-5, err_msg='Prediction scaling at temperature=1.0 is not working as intended.')
# since we sample rather than take the argmax, the same input can produce different outputs.
sample = [[np.random.choice(np.arange(quantization_channels), p=p)] for p in scaled_prediction] # choose one sample per batch
waveform = np.concatenate([waveform,sample],axis=-1) #window.shape: (N,1)
# Show progress only once per second.
current_sample_timestamp = datetime.now()
time_since_print = current_sample_timestamp - last_sample_timestamp
if time_since_print.total_seconds() > 1.:
duration = time.time() - start_time
print('Sample {:<3d}/{:<3d}, ({:.3f} sec/step)'.format(step + 1, sample_size, duration / (step + 1)), end='\r')
last_sample_timestamp = current_sample_timestamp
# Introduce a newline to clear the carriage return from the progress.
print()
# Save the result as a wav file.
if hparams.input_type == 'raw':
out = waveform[:,net.receptive_field:]
elif hparams.input_type == 'mulaw':
decode = mu_law_decode(samples, quantization_channels,quantization=False)
out = sess.run(decode, feed_dict={samples: waveform[:,net.receptive_field:]})
else: # 'mulaw-quantize'
decode = mu_law_decode(samples, quantization_channels,quantization=True)
out = sess.run(decode, feed_dict={samples: waveform[:,net.receptive_field:]})
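# mu-law expansion used above (a sketch of the formula, with mu = quantization_channels - 1):
#   x = sign(y) * ((1 + mu)**abs(y) - 1) / mu, for y in [-1, 1]
# quantization=True treats the inputs as integer bins and maps them back into [-1, 1] first.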
# save wav
for i in range(net.batch_size):
config.wav_out_path= logdir + '/test-{}.wav'.format(i)
mel_path = config.wav_out_path.replace(".wav", ".png")
gen_mel_spectrogram = audio.melspectrogram(out[i], hparams).astype(np.float32).T
audio.save_wav(out[i], config.wav_out_path, hparams.sample_rate) # save_wav modifies out[i] in place.
plot.plot_spectrogram(gen_mel_spectrogram, mel_path, title='generated mel spectrogram',target_spectrogram=mel_input[i])
print('Finished generating.')
if __name__ == '__main__':
s = time.time()
main()
print(time.time()-s,'sec')
================================================
FILE: hparams.py
================================================
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np
hparams = tf.contrib.training.HParams(
name = "Tacotron-2",
# tacotron hyper parameter
cleaners = 'korean_cleaners', # 'korean_cleaners' or 'english_cleaners'
skip_path_filter = False, # whether to filter unusable items out of the npz files. Data shorter than receptive_field must be filtered out, so this should be done.
use_lws = False,
# Audio
sample_rate = 24000,
# shift can be specified by either hop_size (takes priority) or frame_shift_ms
hop_size = 300, # frame_shift_ms = 12.5ms
fft_size=2048, # n_fft. Often 1024 elsewhere, but tacotron uses 2048
win_size = 1200, # 50ms
num_mels=80,
#Spectrogram pre-emphasis (lfilter: reduces spectrogram noise and helps model confidence levels. Also allows for better G&L phase reconstruction)
preemphasize = True, #whether to apply the pre-emphasis filter
preemphasis = 0.97,
min_level_db = -100,
ref_level_db = 20,
signal_normalization = True, #Whether to normalize mel spectrograms to some predefined range (following below parameters)
allow_clipping_in_normalization = True, #Only relevant if mel_normalization = True
symmetric_mels = True, #Whether to scale the data to be symmetric around 0. (Also multiplies the output range by 2, faster and cleaner convergence)
max_abs_value = 4., #max absolute value of data. If symmetric, data will be [-max, max] else [0, max] (Must not be too big to avoid gradient explosion, not too small for fast convergence)
rescaling=True,
rescaling_max=0.999,
trim_silence = True, #Whether to clip silence in Audio (at beginning and end of audio only, not the middle)
#M-AILABS (and other datasets) trim params (these parameters are usually correct for any data, but definitely must be tuned for specific speakers)
trim_fft_size = 512,
trim_hop_size = 128,
trim_top_db = 23,
clip_mels_length = True, #For cases of OOM (Not really recommended, only use if facing unsolvable OOM errors, also consider clipping your samples to smaller chunks)
max_mel_frames = 1000, #Only relevant when clip_mels_length = True, please only use after trying output_per_steps=3 and still getting OOM errors.
l2_regularization_strength = 0, # Coefficient in the L2 regularization.
sample_size = 9000, # Concatenate and cut audio samples to this many samples
silence_threshold = 0, # Volume threshold below which to trim the start and the end from the training set samples. e.g. 2
filter_width = 3,
gc_channels = 32, # dimension of the global condition vector. Setting this tells the model to apply global conditioning.
input_type="raw", # 'mulaw-quantize', 'mulaw', 'raw'; 'mulaw' and 'raw' are the two scalar-input types
scalar_input = True, # must be consistent with input_type.
dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
residual_channels = 128,
dilation_channels = 256,
quantization_channels = 256,
out_channels = 30, # a multiple of 3, because discretized_mix_logistic_loss is applied
skip_channels = 128,
use_biases = True,
upsample_type = 'SubPixel', # 'SubPixel', None
upsample_factor=[12,25], # np.prod(upsample_factor) must equal hop_size
# wavenet training hp
wavenet_batch_size = 2, # 16 --> OOM. The wavenet batch_size must be fixed.
store_metadata = False,
num_steps = 1000000, # Number of training steps
#Learning rate schedule
wavenet_learning_rate = 1e-3, #wavenet initial learning rate
wavenet_decay_rate = 0.5, #Only used with 'exponential' scheme. Defines the decay rate.
wavenet_decay_steps = 300000, #Only used with 'exponential' scheme. Defines the decay steps.
#Regularization parameters
wavenet_clip_gradients = True, #Whether to clip the gradients during wavenet training.
# when summing the residual outputs,
legacy = True, #Whether to use legacy mode: Multiply all skip outputs but the first one with sqrt(0.5) (True for more early training stability, especially for large models)
# inside a residual block: x = (x + residual) * np.sqrt(0.5)
residual_legacy = True, #Whether to scale residual blocks outputs by a factor of sqrt(0.5) (True for input variance preservation early in training and better overall stability)
wavenet_dropout = 0.05,
optimizer = 'adam',
momentum = 0.9, # momentum used by the sgd or rmsprop optimizer. Ignored by the adam optimizer.
max_checkpoints = 3, # maximum number of checkpoints that will be kept alive
####################################
####################################
####################################
# TACOTRON HYPERPARAMETERS
# Training
adam_beta1 = 0.9,
adam_beta2 = 0.999,
#Learning rate schedule
tacotron_decay_learning_rate = True, #boolean, determines if the learning rate will follow an exponential decay
tacotron_start_decay = 40000, #Step at which learning decay starts
tacotron_decay_steps = 18000, #Determines the learning rate decay slope (UNDER TEST)
tacotron_decay_rate = 0.5, #learning rate decay rate (UNDER TEST)
tacotron_initial_learning_rate = 1e-3, #starting learning rate
tacotron_final_learning_rate = 1e-4, #minimal learning rate
initial_data_greedy = True,
initial_phase_step = 8000, # before this step, the same number of examples is drawn from each directory in data_dirs; after it, sampling is weighted, i.e. governed by 'main_data_greedy_factor' below.
main_data_greedy_factor = 0,
main_data = [''], # data in the directories listed here gets an extra weight of 'main_data_greedy_factor'.
prioritize_loss = False,
# Model
model_type = 'multi-speaker', # [single, multi-speaker]
speaker_embedding_size = 16,
embedding_size = 512, # embedding dim for jamo such as 'ᄀ', 'ᄂ', 'ᅡ'
dropout_prob = 0.5,
reduction_factor = 2, # a smaller reduction_factor needs more decoder iterations and therefore more memory.
# Encoder
enc_conv_num_layers = 3,
enc_conv_kernel_size = 5,
enc_conv_channels = 512,
tacotron_zoneout_rate = 0.1,
encoder_lstm_units = 256,
attention_type = 'bah_mon_norm', # 'loc_sen', 'bah_mon_norm'
attention_size = 128,
#Attention mechanism
smoothing = False, #Whether to smooth the attention normalization function
attention_dim = 128, #dimension of attention space
attention_filters = 32, #number of attention convolution filters
attention_kernel = (31, ), #kernel size of attention convolution
cumulative_weights = True, #Whether to cumulate (sum) all previous attention weights or simply feed previous weights (Recommended: True)
#Attention synthesis constraints
#"Monotonic" constraint forces the model to only look at the forwards attention_win_size steps.
#"Window" allows the model to look at attention_win_size neighbors, both forward and backward steps.
synthesis_constraint = False, #Whether to use attention windows constraints in synthesis only (Useful for long utterances synthesis)
synthesis_constraint_type = 'window', #can be in ('window', 'monotonic').
attention_win_size = 7, #Size of the window on each side; the current step does not count. If mode is window and attention_win_size is odd, the 1 extra step is given to the backward part of the window.
#Loss params
mask_encoder = True, #whether to mask encoder padding while computing location sensitive attention. Set to True for better prosody but slower convergence.
#Decoder
prenet_layers = [256, 256], #number of layers and number of units of prenet
decoder_layers = 2, #number of decoder lstm layers
decoder_lstm_units = 1024, #number of decoder lstm units on each layer
dec_prenet_sizes = [256, 256], #number of layers and number of units of prenet
#Residual postnet
postnet_num_layers = 5, #number of postnet convolutional layers
postnet_kernel_size = (5, ), #size of postnet convolution filters for each layer
postnet_channels = 512, #number of postnet convolution filters for each layer
# for the linear spectrogram
post_bank_size = 8,
post_bank_channel_size = 128,
post_maxpool_width = 2,
post_highway_depth = 4,
post_rnn_size = 128,
post_proj_sizes = [256, 80], # num_mels=80
post_proj_width = 3,
tacotron_reg_weight = 1e-6, #regularization weight (for L2 regularization)
inference_prenet_dropout = True,
# Eval
min_tokens = 30, #originally 50; 30 is good for Korean. A text must contain at least this many tokens to be used for training
min_n_frame = 30*5, # min_n_frame = reduction_factor * min_iters; set min_n_frame as a multiple of reduction_factor.
max_n_frame = 200*5,
skip_inadequate = False,
griffin_lim_iters = 60,
power = 1.5,
)
if hparams.use_lws:
# Does not work if fft_size is not multiple of hop_size!!
# sample_rate = 20480, hop_size = 256 = 12.5ms. fft_size determines the window size; 2048 samples is 2048/20480 = 0.1 s = 100 ms
hparams.sample_rate = 20480
# shift can be specified by either hop_size (takes priority) or frame_shift_ms
hparams.hop_size = 256 # frame_shift_ms = 12.5ms
hparams.frame_shift_ms=None # hop_size = sample_rate * frame_shift_ms / 1000
hparams.fft_size=2048 # often 1024 elsewhere, but tacotron uses 2048 ==> output size = 1025
hparams.win_size = None # 256x4 --> 50ms
else:
# define the following consistently from the parameters defined above.
hparams.num_freq = int(hparams.fft_size/2 + 1)
hparams.frame_shift_ms = hparams.hop_size * 1000.0/ hparams.sample_rate # hop_size= sample_rate * frame_shift_ms / 1000
hparams.frame_length_ms = hparams.win_size * 1000.0/ hparams.sample_rate
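# A minimal consistency check for the derived values (a sketch; left commented out so
# alternative hparams combinations are not blocked at import time):
# assert int(np.prod(hparams.upsample_factor)) == hparams.hop_size # 12 * 25 == 300
# assert hparams.num_freq == hparams.fft_size // 2 + 1 # 1025 frequency bins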
def hparams_debug_string():
values = hparams.values()
hp = [' %s: %s' % (name, values[name]) for name in sorted(values)]
return 'Hyperparameters:\n' + '\n'.join(hp)
================================================
FILE: preprocess.py
================================================
# coding: utf-8
"""
python preprocess.py --num_workers 10 --name son --in_dir D:\hccho\multi-speaker-tacotron-tensorflow-master\datasets\son --out_dir .\data\son
python preprocess.py --num_workers 10 --name moon --in_dir D:\hccho\multi-speaker-tacotron-tensorflow-master\datasets\moon --out_dir .\data\moon
==> npz files bundling 'audio', 'mel', 'linear', 'time_steps', 'mel_frames', 'text', 'tokens', 'loss_coeff' are created in out_dir.
"""
import argparse
import os
from multiprocessing import cpu_count
from tqdm import tqdm
import importlib
from hparams import hparams, hparams_debug_string
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
def preprocess(mod, in_dir, out_dir, num_workers):
os.makedirs(out_dir, exist_ok=True)
metadata = mod.build_from_path(hparams, in_dir, out_dir,num_workers=num_workers, tqdm=tqdm)
write_metadata(metadata, out_dir)
def write_metadata(metadata, out_dir):
with open(os.path.join(out_dir, 'train.txt'), 'w', encoding='utf-8') as f:
for m in metadata:
f.write('|'.join([str(x) for x in m]) + '\n')
mel_frames = sum([int(m[4]) for m in metadata])
timesteps = sum([int(m[3]) for m in metadata])
sr = hparams.sample_rate
hours = timesteps / sr / 3600
print('Wrote {} utterances, {} mel frames, {} audio timesteps, ({:.2f} hours)'.format(len(metadata), mel_frames, timesteps, hours))
print('Max input length (text chars): {}'.format(max(len(m[5]) for m in metadata)))
print('Max mel frames length: {}'.format(max(int(m[4]) for m in metadata)))
print('Max audio timesteps length: {}'.format(max(m[3] for m in metadata)))
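# Format of one train.txt line (a sketch inferred from the indices used above, not a spec):
# '|'-joined fields, where field 3 is audio timesteps, field 4 is mel frames, field 5 is text, e.g.
# audio-xxx.npy|mel-xxx.npy|linear-xxx.npy|48000|160|안녕하세요|... (48000 samples / hop_size 300 = 160 frames)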
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--name', type=str, default=None)
parser.add_argument('--in_dir', type=str, default=None)
parser.add_argument('--out_dir', type=str, default=None)
parser.add_argument('--num_workers', type=str, default=None)
parser.add_argument('--hparams', type=str, default=None)
args = parser.parse_args()
if args.hparams is not None:
hparams.parse(args.hparams)
print(hparams_debug_string())
name = args.name
in_dir = args.in_dir
out_dir = args.out_dir
num_workers = args.num_workers
num_workers = cpu_count() if num_workers is None else int(num_workers) # cpu_count() = number of CPUs
print("Sampling frequency: {}".format(hparams.sample_rate))
assert name in ["cmu_arctic", "ljspeech", "son", "moon"]
mod = importlib.import_module('datasets.{}'.format(name))
preprocess(mod, in_dir, out_dir, num_workers)
================================================
FILE: synthesizer.py
================================================
# coding: utf-8
"""
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "그런데 청년은 이렇게 말합니다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "이런 논란은 타코트론 논문 이후에 사라졌습니다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "이런 논란은 타코트론 논문 이후에 사라졌습니다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오는 6월6일은 제64회 현충일입니다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "오는 6월6일은 제64회 현충일입니다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다"
"""
import io
import os
import re
import librosa
import argparse
import numpy as np
from glob import glob
from tqdm import tqdm
import tensorflow as tf
from functools import partial
from hparams import hparams
from tacotron2 import create_model, get_most_recent_checkpoint
from utils.audio import save_wav, inv_linear_spectrogram, inv_preemphasis, inv_spectrogram_tensorflow
from utils import plot, PARAMS_NAME, load_json, load_hparams, add_prefix, add_postfix, get_time, parallel_run, makedirs, str2bool
from text.korean import tokenize
from text import text_to_sequence, sequence_to_text
from datasets.datafeeder_tacotron2 import _prepare_inputs
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
tf.logging.set_verbosity(tf.logging.ERROR)
class Synthesizer(object):
def close(self):
tf.reset_default_graph()
self.sess.close()
def load(self, checkpoint_path, num_speakers=2, checkpoint_step=None, inference_prenet_dropout=True,model_name='tacotron'):
self.num_speakers = num_speakers
if os.path.isdir(checkpoint_path):
load_path = checkpoint_path
checkpoint_path = get_most_recent_checkpoint(checkpoint_path, checkpoint_step)
else:
load_path = os.path.dirname(checkpoint_path)
print('Constructing model: %s' % model_name)
inputs = tf.placeholder(tf.int32, [None, None], 'inputs')
input_lengths = tf.placeholder(tf.int32, [None], 'input_lengths')
batch_size = tf.shape(inputs)[0]
speaker_id = tf.placeholder_with_default(
tf.zeros([batch_size], dtype=tf.int32), [None], 'speaker_id')
load_hparams(hparams, load_path)
hparams.inference_prenet_dropout = inference_prenet_dropout
with tf.variable_scope('model') as scope:
self.model = create_model(hparams)
self.model.initialize(inputs=inputs, input_lengths=input_lengths, num_speakers=self.num_speakers, speaker_id=speaker_id,is_training=False)
self.wav_output = inv_spectrogram_tensorflow(self.model.linear_outputs,hparams)
print('Loading checkpoint: %s' % checkpoint_path)
sess_config = tf.ConfigProto(
allow_soft_placement=True,
intra_op_parallelism_threads=1,
inter_op_parallelism_threads=2)
sess_config.gpu_options.allow_growth = True
self.sess = tf.Session(config=sess_config)
self.sess.run(tf.global_variables_initializer())
saver = tf.train.Saver()
saver.restore(self.sess, checkpoint_path)
def synthesize(self,
texts=None, tokens=None,
base_path=None, paths=None, speaker_ids=None,
start_of_sentence=None, end_of_sentence=True,
pre_word_num=0, post_word_num=0,
pre_surplus_idx=0, post_surplus_idx=1,
use_short_concat=False,
base_alignment_path=None,
librosa_trim=False,
attention_trim=True,
isKorean=True):
# Possible inputs:
# 1) text=text
# 2) text=texts
# 3) tokens=tokens, texts=texts # use texts as guide
if type(texts) == str:
texts = [texts]
if texts is not None and tokens is None:
sequences = np.array([text_to_sequence(text) for text in texts])
sequences = _prepare_inputs(sequences)
elif tokens is not None:
sequences = tokens
#sequences = np.pad(sequences,[(0,0),(0,5)],'constant',constant_values=(0)) # case by case ---> overfitting?
if paths is None:
paths = [None] * len(sequences)
if texts is None:
texts = [None] * len(sequences)
time_str = get_time()
def plot_and_save_parallel(wavs, alignments,mels):
items = list(enumerate(zip(wavs, alignments, paths, texts, sequences,mels)))
fn = partial(
plot_graph_and_save_audio,
base_path=base_path,
start_of_sentence=start_of_sentence, end_of_sentence=end_of_sentence,
pre_word_num=pre_word_num, post_word_num=post_word_num,
pre_surplus_idx=pre_surplus_idx, post_surplus_idx=post_surplus_idx,
use_short_concat=use_short_concat,
librosa_trim=librosa_trim,
attention_trim=attention_trim,
time_str=time_str,
isKorean=isKorean)
return parallel_run(fn, items,desc="plot_graph_and_save_audio", parallel=False)
#input_lengths = np.argmax(np.array(sequences) == 1, 1)+1
input_lengths = [np.argmax(a==1)+1 for a in sequences]
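# Worked example (assuming token id 1 is the end-of-sentence symbol and 0 is padding):
# for a padded sequence [12, 7, 34, 1, 0, 0], np.argmax(a == 1) returns 3, so the
# input length is 3 + 1 = 4, i.e. the EOS token is counted as part of the input.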
fetches = [
#self.wav_output,
self.model.linear_outputs,
self.model.alignments, # batch_size, text length(encoder), target length(decoder)
self.model.mel_outputs,
]
feed_dict = { self.model.inputs: sequences, self.model.input_lengths: input_lengths, }
if speaker_ids is not None:
if type(speaker_ids) == dict:
speaker_embed_table = self.sess.run(self.model.speaker_embed_table)
speaker_embed = [speaker_ids[speaker_id] * speaker_embed_table[speaker_id] for speaker_id in speaker_ids]
feed_dict.update({ self.model.speaker_embed_table: np.tile() })
else:
feed_dict[self.model.speaker_id] = speaker_ids
wavs, alignments,mels = self.sess.run(fetches, feed_dict=feed_dict)
results = plot_and_save_parallel(wavs, alignments,mels=mels)
return results
def plot_graph_and_save_audio(args,
base_path=None,
start_of_sentence=None, end_of_sentence=None,
pre_word_num=0, post_word_num=0,
pre_surplus_idx=0, post_surplus_idx=1,
use_short_concat=False,
save_alignment=False,
librosa_trim=False, attention_trim=False,
time_str=None, isKorean=True):
idx, (wav, alignment, path, text, sequence,mel) = args
if base_path:
plot_path = "{}/{}.png".format(base_path, get_time())
elif path:
plot_path = path.rsplit('.', 1)[0] + ".png"
else:
plot_path = None
if plot_path:
plot.plot_alignment(alignment, plot_path, text=text, isKorean=isKorean)
if use_short_concat:
wav = short_concat(
wav, alignment, text,
start_of_sentence, end_of_sentence,
pre_word_num, post_word_num,
pre_surplus_idx, post_surplus_idx)
if attention_trim and end_of_sentence:
# if the attention has reached the end of the text, drop everything generated after that point.
end_idx_counter = 0
attention_argmax = alignment.argmax(0) # alignment: text length(encoder), target length(decoder) ==> target length(decoder)
end_idx = min(len(sequence) - 1, max(attention_argmax))
max_counter = min((attention_argmax == end_idx).sum(), 5)
for jdx, attend_idx in enumerate(attention_argmax):
if len(attention_argmax) > jdx + 1:
if attend_idx == end_idx:
end_idx_counter += 1
if attend_idx == end_idx and attention_argmax[jdx + 1] > end_idx:
break
if end_idx_counter >= max_counter:
break
else:
break
spec_end_idx = hparams.reduction_factor * jdx + 3
wav = wav[:spec_end_idx]
mel = mel[:spec_end_idx]
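# Why reduction_factor * jdx (a sketch): each decoder step emits reduction_factor
# spectrogram frames, so decoder step jdx maps to frame index reduction_factor * jdx;
# the +3 keeps a small safety margin so the audio is not cut off too abruptly.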
audio_out = inv_linear_spectrogram(wav.T,hparams)
if librosa_trim and end_of_sentence:
yt, index = librosa.effects.trim(audio_out, frame_length=5120, hop_length=256, top_db=50)
audio_out = audio_out[:index[-1]]
mel = mel[:index[-1]//hparams.hop_size]
if save_alignment:
alignment_path = "{}/{}.npy".format(base_path, idx)
np.save(alignment_path, alignment, allow_pickle=False)
if path or base_path:
if path:
current_path = add_postfix(path, idx)
elif base_path:
current_path = plot_path.replace(".png", ".wav")
save_wav(audio_out, current_path,hparams.sample_rate)
#hccho
mel_path = current_path.replace(".wav",".npy")
np.save(mel_path,mel)
return True
else:
io_out = io.BytesIO()
save_wav(audio_out, io_out,hparams.sample_rate)
result = io_out.getvalue()
return result
def get_most_recent_checkpoint(checkpoint_dir, checkpoint_step=None):
if checkpoint_step is None:
checkpoint_paths = [path for path in glob("{}/*.ckpt-*.data-*".format(checkpoint_dir))]
idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths]
max_idx = max(idxes)
else:
max_idx = checkpoint_step
latest_checkpoint = os.path.join(checkpoint_dir, "model.ckpt-{}".format(max_idx))
print(" [*] Found latest checkpoint: {}".format(latest_checkpoint))
return latest_checkpoint
def short_concat(
wav, alignment, text,
start_of_sentence, end_of_sentence,
pre_word_num, post_word_num,
pre_surplus_idx, post_surplus_idx):
# np.array(list(decomposed_text))[attention_argmax]
attention_argmax = alignment.argmax(0)
if not start_of_sentence and pre_word_num > 0:
surplus_decomposed_text = decompose_ko_text("".join(text.split()[0]))
start_idx = len(surplus_decomposed_text) + 1
for idx, attend_idx in enumerate(attention_argmax):
if attend_idx == start_idx and attention_argmax[idx - 1] < start_idx:
break
wav_start_idx = hparams.reduction_factor * idx - 1 - pre_surplus_idx
else:
wav_start_idx = 0
if not end_of_sentence and post_word_num > 0:
surplus_decomposed_text = decompose_ko_text("".join(text.split()[-1]))
end_idx = len(decomposed_text.replace(surplus_decomposed_text, '')) - 1
for idx, attend_idx in enumerate(attention_argmax):
if attend_idx == end_idx and attention_argmax[idx + 1] > end_idx:
break
wav_end_idx = hparams.reduction_factor * idx + 1 + post_surplus_idx
else:
if True: # attention based split
if end_of_sentence:
end_idx = min(len(decomposed_text) - 1, max(attention_argmax))
else:
surplus_decomposed_text = decompose_ko_text("".join(text.split()[-1]))
end_idx = len(decomposed_text.replace(surplus_decomposed_text, '')) - 1
while True:
if end_idx in attention_argmax:
break
end_idx -= 1
end_idx_counter = 0
for idx, attend_idx in enumerate(attention_argmax):
if len(attention_argmax) > idx + 1:
if attend_idx == end_idx:
end_idx_counter += 1
if attend_idx == end_idx and attention_argmax[idx + 1] > end_idx:
break
if end_idx_counter > 5:
break
else:
break
wav_end_idx = hparams.reduction_factor * idx + 1 + post_surplus_idx
else:
wav_end_idx = None
wav = wav[wav_start_idx:wav_end_idx]
if end_of_sentence:
wav = np.lib.pad(wav, ((0, 20), (0, 0)), 'constant', constant_values=0)
else:
wav = np.lib.pad(wav, ((0, 10), (0, 0)), 'constant', constant_values=0)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--load_path', required=True)
parser.add_argument('--sample_path', default="logdir-tacotron2/generate")
parser.add_argument('--text', required=True)
parser.add_argument('--num_speakers', default=1, type=int)
parser.add_argument('--speaker_id', default=0, type=int)
parser.add_argument('--checkpoint_step', default=None, type=int)
parser.add_argument('--is_korean', default=True, type=str2bool)
parser.add_argument('--base_alignment_path', default=None)
config = parser.parse_args()
makedirs(config.sample_path)
synthesizer = Synthesizer()
synthesizer.load(config.load_path, config.num_speakers, config.checkpoint_step,inference_prenet_dropout=False)
audio = synthesizer.synthesize(texts=[config.text],base_path=config.sample_path,speaker_ids=[config.speaker_id],
attention_trim=True,base_alignment_path=config.base_alignment_path,isKorean=config.is_korean)[0]
================================================
FILE: tacotron2/__init__.py
================================================
# coding: utf-8
import os
from glob import glob
from .tacotron2 import Tacotron2
def create_model(hparams):
return Tacotron2(hparams)
def get_most_recent_checkpoint(checkpoint_dir):
checkpoint_paths = [path for path in glob("{}/*.ckpt-*.data-*".format(checkpoint_dir))]
idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths]
max_idx = max(idxes)
latest_checkpoint = os.path.join(checkpoint_dir, "model.ckpt-{}".format(max_idx))
#latest_checkpoint=checkpoint_paths[0]
print(" [*] Found latest checkpoint: {}".format(latest_checkpoint))
return latest_checkpoint
================================================
FILE: tacotron2/helpers.py
================================================
# coding: utf-8
# Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py
import numpy as np
import tensorflow as tf
from tensorflow.contrib.seq2seq import Helper
# Adapted from tf.contrib.seq2seq.GreedyEmbeddingHelper
class TacoTestHelper(Helper):
def __init__(self, batch_size, output_dim, r):
with tf.name_scope('TacoTestHelper'):
self._batch_size = batch_size
self._output_dim = output_dim
self._end_token = tf.tile([0.0], [output_dim * r]) # [0.0,0.0,...]
self._reduction_factor = r
@property
def batch_size(self):
return self._batch_size
@property
def sample_ids_dtype(self):
return tf.int32
@property
def sample_ids_shape(self):
return tf.TensorShape([])
def initialize(self, name=None):
return (tf.tile([False], [self._batch_size]), _go_frames(self._batch_size, self._output_dim))
def sample(self, time, outputs, state, name=None):
return tf.tile([0], [self._batch_size]) # Return all 0; we ignore them
def next_inputs(self, time, outputs, state, sample_ids, name=None):
'''Stop on EOS. Otherwise, pass the last output as the next input and pass through state.'''
with tf.name_scope('TacoTestHelper'):
stop_token_preds = tf.nn.sigmoid(outputs[:,-self._reduction_factor:])
finished = tf.reduce_any(tf.cast(tf.round(stop_token_preds), tf.bool),axis=1)
# Feed last output frame as next input. outputs is [N, output_dim * r]
next_inputs = outputs[:, -(self._output_dim+self._reduction_factor):-self._reduction_factor] # excluding the stop-token part
return (finished, next_inputs, state)
class TacoTrainingHelper(Helper):
def __init__(self, targets, output_dim, r):
# inputs is [N, T_in], targets is [N, T_out, D]
# output_dim = hp.num_mels = 80
# r = hp.reduction_factor = 4 or 5
with tf.name_scope('TacoTrainingHelper'):
self._batch_size = tf.shape(targets)[0]
self._output_dim = output_dim
# Feed every r-th target frame as input
self._targets = targets[:, r-1::r, :]
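# Teacher forcing with a reduction factor (a sketch): with r = 2, targets[:, 1::2, :]
# picks frames 1, 3, 5, ..., i.e. the last frame of each r-frame group, which is
# exactly the frame the decoder would feed back to itself at inference time.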
# Use full length for every target because we don't want to mask the padding frames
num_steps = tf.shape(self._targets)[1]
self._lengths = tf.tile([num_steps], [self._batch_size])
@property
def batch_size(self):
return self._batch_size
@property
def sample_ids_dtype(self):
return tf.int32
@property
def sample_ids_shape(self):
return tf.TensorShape([])
def initialize(self, name=None):
return (tf.tile([False], [self._batch_size]), _go_frames(self._batch_size, self._output_dim))
def sample(self, time, outputs, state, name=None):
return tf.tile([0], [self._batch_size]) # Return all 0; we ignore them
def next_inputs(self, time, outputs, state, sample_ids, name=None): # must build and return the decoder input for the given time step.
with tf.name_scope(name or 'TacoTrainingHelper'):
finished = (time + 1 >= self._lengths)
next_inputs = self._targets[:, time, :]
return (finished, next_inputs, state)
def _go_frames(batch_size, output_dim):
'''Returns all-zero <GO> frames for a given batch size and output dimension'''
return tf.tile([[0.0]], [batch_size, output_dim])
================================================
FILE: tacotron2/modules.py
================================================
# coding: utf-8
# Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py
import tensorflow as tf
from tensorflow.contrib.rnn import GRUCell
from tensorflow.python.layers import core
from tensorflow.contrib.seq2seq.python.ops.attention_wrapper import _bahdanau_score, _BaseAttentionMechanism, BahdanauAttention, AttentionWrapper, AttentionWrapperState
def prenet(inputs, is_training, layer_sizes, drop_prob, scope=None):
x = inputs # 3-D array (batch, seq_length, embedding_dim) ==> (batch, seq_length, 256) ==> (batch, seq_length, 128)
#drop_rate = drop_prob if is_training else 0.0
#print('drop_rate',drop_rate)
with tf.variable_scope(scope or 'prenet'):
for i, size in enumerate(layer_sizes): # [f(256), f(256)]
dense = tf.layers.dense(x, units=size, activation=tf.nn.relu, name='projection_%d' % (i+1))
# the Tacotron 2 paper applies dropout in both training and inference
x = tf.layers.dropout(dense, rate=drop_prob,training=True, name='dropout_%d' % (i+1))
return x
def cbhg(inputs, input_lengths, is_training, bank_size, bank_channel_size, maxpool_width, highway_depth,
rnn_size, proj_sizes, proj_width, scope,before_highway=None, encoder_rnn_init_state=None):
# inputs: (N,T_in, 128), bank_size: 16
batch_size = tf.shape(inputs)[0]
with tf.variable_scope(scope):
with tf.variable_scope('conv_bank'):
# Convolution bank: concatenate on the last axis
# to stack channels from all convolutions
conv_fn = lambda k: conv1d(inputs, k, bank_channel_size, tf.nn.relu, is_training, 'conv1d_%d' % k) # bank_channel_size =128
conv_outputs = tf.concat( [conv_fn(k) for k in range(1, bank_size+1)], axis=-1,) # ==> (N,T_in,128*bank_size)
# Maxpooling:
maxpool_output = tf.layers.max_pooling1d(conv_outputs,pool_size=maxpool_width,strides=1,padding='same') # maxpool_width = 2
# Two projection layers:
proj_out = maxpool_output
for idx, proj_size in enumerate(proj_sizes): # [f(128), f(128)], post: [f(256), f(80)]
activation_fn = None if idx == len(proj_sizes) - 1 else tf.nn.relu
proj_out = conv1d(proj_out, proj_width, proj_size, activation_fn,is_training, 'proj_{}'.format(idx + 1)) # proj_width = 3
# Residual connection:
if before_highway is not None: # multi-speaker mode
expanded_before_highway = tf.expand_dims(before_highway, [1])
tiled_before_highway = tf.tile(expanded_before_highway, [1, tf.shape(proj_out)[1], 1])
highway_input = proj_out + inputs + tiled_before_highway
else: # single-speaker mode
highway_input = proj_out + inputs
# Handle dimensionality mismatch:
if highway_input.shape[2] != rnn_size: # rnn_size = 128
highway_input = tf.layers.dense(highway_input, rnn_size,name='highway_projection')
# 4-layer HighwayNet:
for idx in range(highway_depth):
highway_input = highwaynet(highway_input, 'highway_%d' % (idx+1))
rnn_input = highway_input
# Bidirectional RNN
if encoder_rnn_init_state is not None:
initial_state_fw, initial_state_bw = tf.split(encoder_rnn_init_state, 2, 1)
else: # single-speaker mode
initial_state_fw, initial_state_bw = None, None
cell_fw, cell_bw = GRUCell(rnn_size), GRUCell(rnn_size)
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,rnn_input,sequence_length=input_lengths,
initial_state_fw=initial_state_fw,initial_state_bw=initial_state_bw,dtype=tf.float32)
return tf.concat(outputs, axis=2) # Concat forward and backward
def batch_tile(tensor, batch_size):
expanded_tensor = tf.expand_dims(tensor, [0])
return tf.tile(expanded_tensor, \
[batch_size] + [1 for _ in tensor.get_shape()])
def highwaynet(inputs, scope):
highway_dim = int(inputs.get_shape()[-1])
with tf.variable_scope(scope):
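# Highway layer (a sketch): y = H(x) * T(x) + x * (1 - T(x)), where T is the transform
# gate. The T bias is initialized to -1.0 so sigmoid(-1) ~= 0.27, biasing the layer
# toward the identity (carry) path early in training, as in Srivastava et al. (2015).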
H = tf.layers.dense(inputs,units=highway_dim, activation=tf.nn.relu,name='H_projection')
T = tf.layers.dense(inputs,units=highway_dim, activation=tf.nn.sigmoid,name='T_projection',bias_initializer=tf.constant_initializer(-1.0))
return H * T + inputs * (1.0 - T)
def conv1d(inputs, kernel_size, channels, activation, is_training, scope):
with tf.variable_scope(scope):
# with strides=1 and padding='same', the output length is preserved regardless of kernel_size.
conv1d_output = tf.layers.conv1d(inputs,filters=channels,kernel_size=kernel_size,activation=activation,padding='same') # because padding is 'same', outputs of different kernel sizes can be concatenated.
return tf.layers.batch_normalization(conv1d_output, training=is_training)
================================================
FILE: tacotron2/rnn_wrappers.py
================================================
# coding: utf-8
import numpy as np
import tensorflow as tf
from tensorflow.contrib.rnn import RNNCell
from tensorflow.python.ops import rnn_cell_impl
#from tensorflow.contrib.data.python.util import nest
from tensorflow.contrib.framework import nest
from tensorflow.contrib.seq2seq.python.ops.attention_wrapper import _bahdanau_score, _BaseAttentionMechanism, BahdanauAttention, \
AttentionWrapperState, AttentionMechanism, _BaseMonotonicAttentionMechanism,_maybe_mask_score,_prepare_memory,_monotonic_probability_fn
from tensorflow.python.ops import array_ops, math_ops, nn_ops, variable_scope
from tensorflow.python.layers.core import Dense
from .modules import prenet
import functools
_zero_state_tensors = rnn_cell_impl._zero_state_tensors
class ZoneoutLSTMCell(RNNCell):
'''Wrapper for tf LSTM to create Zoneout LSTM Cell
inspired by:
https://github.com/teganmaharaj/zoneout/blob/master/zoneout_tensorflow.py
Published by one of 'https://arxiv.org/pdf/1606.01305.pdf' paper writers.
Many thanks to @Ondal90 for pointing this out. You sir are a hero!
'''
def __init__(self, num_units, is_training, zoneout_factor_cell=0., zoneout_factor_output=0., state_is_tuple=True, name=None):
'''Initializer with possibility to set different zoneout values for cell/hidden states.
'''
zm = min(zoneout_factor_output, zoneout_factor_cell)
zs = max(zoneout_factor_output, zoneout_factor_cell)
if zm < 0. or zs > 1.:
raise ValueError('One/both provided Zoneout factors are not in [0, 1]')
self._cell = tf.nn.rnn_cell.LSTMCell(num_units, state_is_tuple=state_is_tuple, name=name)
self._zoneout_cell = zoneout_factor_cell
self._zoneout_outputs = zoneout_factor_output
self.is_training = is_training
self.state_is_tuple = state_is_tuple
@property
def state_size(self):
return self._cell.state_size
@property
def output_size(self):
return self._cell.output_size
def __call__(self, inputs, state, scope=None):
'''Runs vanilla LSTM Cell and applies zoneout.
'''
#Apply vanilla LSTM
output, new_state = self._cell(inputs, state, scope)
if self.state_is_tuple:
(prev_c, prev_h) = state
(new_c, new_h) = new_state
else:
num_proj = self._cell._num_units if self._cell._num_proj is None else self._cell._num_proj
prev_c = tf.slice(state, [0, 0], [-1, self._cell._num_units])
prev_h = tf.slice(state, [0, self._cell._num_units], [-1, num_proj])
new_c = tf.slice(new_state, [0, 0], [-1, self._cell._num_units])
new_h = tf.slice(new_state, [0, self._cell._num_units], [-1, num_proj])
#Apply zoneout
if self.is_training:
#nn.dropout takes keep_prob (probability to keep activations) not drop_prob (probability to mask activations)!
c = (1 - self._zoneout_cell) * tf.nn.dropout(new_c - prev_c, (1 - self._zoneout_cell)) + prev_c # tf.nn.dropout outputs the input element scaled up by 1 / keep_prob
h = (1 - self._zoneout_outputs) * tf.nn.dropout(new_h - prev_h, (1 - self._zoneout_outputs)) + prev_h
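# How the two lines above implement zoneout (a sketch): tf.nn.dropout scales surviving
# entries by 1/keep_prob, so (1 - z) * dropout(delta, keep_prob=1 - z) yields delta where
# kept and 0 where dropped; adding prev then gives "new state where kept, old state where
# zoned out". At inference the deterministic expected mix is used instead (else branch).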
else:
c = (1 - self._zoneout_cell) * new_c + self._zoneout_cell * prev_c
h = (1 - self._zoneout_outputs) * new_h + self._zoneout_outputs * prev_h
new_state = tf.nn.rnn_cell.LSTMStateTuple(c, h) if self.state_is_tuple else tf.concat(1, [c, h])
return output, new_state
class DecoderWrapper(RNNCell):
'''Runs RNN inputs through a prenet before sending them to the cell.'''
# simply applies a prenet to the inputs before passing them to the cell.
def __init__(self, cell, is_training, prenet_sizes, dropout_prob,inference_prenet_dropout=True):
super(DecoderWrapper, self).__init__()
self._is_training = is_training
self._cell = cell
self.prenet_sizes = prenet_sizes
if not is_training and not inference_prenet_dropout:
self.dropout_prob = 0.
else: self.dropout_prob = dropout_prob
@property
def state_size(self):
return self._cell.state_size
@property
def output_size(self):
return self._cell.output_size + self._cell.state_size.attention
def call(self, inputs, state):
prenet_out = prenet(inputs, self._is_training,self.prenet_sizes, self.dropout_prob, scope='decoder_prenet')
output, res_state = self._cell(prenet_out, state)
return tf.concat([output, res_state.attention], axis=-1), res_state
def zero_state(self, batch_size, dtype):
return self._cell.zero_state(batch_size, dtype)
class LocationSensitiveAttention(BahdanauAttention):
"""Impelements Bahdanau-style (cumulative) scoring function.
Usually referred to as "hybrid" attention (content-based + location-based)
Extends the additive attention described in:
"D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine transla-
tion by jointly learning to align and translate,” in Proceedings
of ICLR, 2015."
to use previous alignments as additional location features.
This attention is described in:
J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben-
gio, “Attention-based models for speech recognition,” in Ad-
vances in Neural Information Processing Systems, 2015, pp.
577–585.
"""
def __init__(self,
num_units,
memory,
hparams,
is_training,
mask_encoder=True,
memory_sequence_length=None,
smoothing=False,
cumulate_weights=True,
name='LocationSensitiveAttention'):
"""Construct the Attention mechanism.
Args:
num_units: The depth of the query mechanism.
memory: The memory to query; usually the output of an RNN encoder. This
tensor should be shaped `[batch_size, max_time, ...]`.
mask_encoder (optional): Boolean, whether to mask encoder paddings.
memory_sequence_length (optional): Sequence lengths for the batch entries
in memory. If provided, the memory tensor rows are masked with zeros
for values past the respective sequence lengths. Only relevant if mask_encoder = True.
smoothing (optional): Boolean. Determines which normalization function to use.
Default normalization function (probablity_fn) is softmax. If smoothing is
enabled, we replace softmax with:
a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j}))
Introduced in:
J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben-
gio, “Attention-based models for speech recognition,” in Ad-
vances in Neural Information Processing Systems, 2015, pp.
577–585.
This is mainly used if the model wants to attend to multiple input parts
at the same decoding step. We probably won't be using it since multiple sound
frames may depend on the same character/phone, probably not the other way around.
Note:
We still keep it implemented in case we want to test it. They used it in the
paper in the context of speech recognition, where one phoneme may depend on
multiple subsequent sound frames.
name: Name to use when creating ops.
"""
#Create normalization function
#Setting it to None defaults to using softmax
normalization_function = _smoothing_normalization if (smoothing == True) else None
memory_length = memory_sequence_length if (mask_encoder==True) else None
super(LocationSensitiveAttention, self).__init__(
num_units=num_units,
memory=memory,
memory_sequence_length=memory_length,
probability_fn=normalization_function,
name=name)
self.location_convolution = tf.layers.Conv1D(filters=hparams.attention_filters,
kernel_size=hparams.attention_kernel, padding='same', use_bias=True,
bias_initializer=tf.zeros_initializer(), name='location_features_convolution')
self.location_layer = tf.layers.Dense(units=num_units, use_bias=False,dtype=tf.float32, name='location_features_projection')
self._cumulate = cumulate_weights
self.synthesis_constraint = hparams.synthesis_constraint and not is_training
self.attention_win_size = tf.convert_to_tensor(hparams.attention_win_size, dtype=tf.int32)
self.constraint_type = hparams.synthesis_constraint_type
def __call__(self, query, state):
"""Score the query based on the keys and values.
Args:
query: Tensor of dtype matching `self.values` and shape
`[batch_size, query_depth]`.
state (previous alignments): Tensor of dtype matching `self.values` and shape
`[batch_size, alignments_size]`
(`alignments_size` is memory's `max_time`).
Returns:
alignments: Tensor of dtype matching `self.values` and shape
`[batch_size, alignments_size]` (`alignments_size` is memory's
`max_time`).
"""
previous_alignments = state
with variable_scope.variable_scope(None, "Location_Sensitive_Attention", [query]):
# processed_query shape [batch_size, query_depth] -> [batch_size, attention_dim]
processed_query = self.query_layer(query) if self.query_layer else query
# -> [batch_size, 1, attention_dim]
processed_query = tf.expand_dims(processed_query, 1)
# processed_location_features shape [batch_size, max_time, attention dimension]
# [batch_size, max_time] -> [batch_size, max_time, 1]
expanded_alignments = tf.expand_dims(previous_alignments, axis=2)
# location features [batch_size, max_time, filters]
f = self.location_convolution(expanded_alignments)
# Projected location features [batch_size, max_time, attention_dim]
processed_location_features = self.location_layer(f)
# energy shape [batch_size, max_time]
energy = _location_sensitive_score(processed_query, processed_location_features, self.keys)
if self.synthesis_constraint:
prev_max_attentions = tf.argmax(previous_alignments, -1, output_type=tf.int32)
Tx = tf.shape(energy)[-1]
# prev_max_attentions = tf.squeeze(prev_max_attentions, [-1])
if self.constraint_type == 'monotonic':
key_masks = tf.sequence_mask(prev_max_attentions, Tx)
reverse_masks = tf.sequence_mask(Tx - self.attention_win_size - prev_max_attentions, Tx)[:, ::-1]
else:
assert self.constraint_type == 'window'
key_masks = tf.sequence_mask(prev_max_attentions - (self.attention_win_size // 2 + (self.attention_win_size % 2 != 0)), Tx)
reverse_masks = tf.sequence_mask(Tx - (self.attention_win_size // 2) - prev_max_attentions, Tx)[:, ::-1]
masks = tf.logical_or(key_masks, reverse_masks)
paddings = tf.ones_like(energy) * (-2 ** 32 + 1) # [batch_size, max_time]
energy = tf.where(tf.equal(masks, False), energy, paddings)
# alignments shape = energy shape = [batch_size, max_time]
alignments = self._probability_fn(energy, previous_alignments)
# Cumulate alignments
if self._cumulate:
next_state = alignments + previous_alignments
else:
next_state = alignments
return alignments, next_state
def _location_sensitive_score(W_query, W_fil, W_keys):
"""Impelements Bahdanau-style (cumulative) scoring function.
This attention is described in:
J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben-
gio, “Attention-based models for speech recognition,” in Ad-
vances in Neural Information Processing Systems, 2015, pp.
577–585.
#############################################################################
hybrid attention (content-based + location-based)
f = F * α_{i-1}
energy = dot(v_a, tanh(W_keys(h_enc) + W_query(h_dec) + W_fil(f) + b_a))
#############################################################################
Args:
W_query: Tensor, shape '[batch_size, 1, attention_dim]' to compare to location features.
W_fil: processed previous alignments turned into location features, shape '[batch_size, max_time, attention_dim]'
W_keys: Tensor, shape '[batch_size, max_time, attention_dim]', typically the encoder outputs.
Returns:
A '[batch_size, max_time]' attention score (energy)
"""
# Get the number of hidden units from the trailing dimension of keys
dtype = W_query.dtype
num_units = W_keys.shape[-1].value or array_ops.shape(W_keys)[-1]
v_a = tf.get_variable(
'attention_variable_projection', shape=[num_units], dtype=dtype,
initializer=tf.contrib.layers.xavier_initializer())
b_a = tf.get_variable(
'attention_bias', shape=[num_units], dtype=dtype,
initializer=tf.zeros_initializer())
return tf.reduce_sum(v_a * tf.tanh(W_keys + W_query + W_fil + b_a), [2])
def _smoothing_normalization(e):
"""Applies a smoothing normalization function instead of softmax
Introduced in:
J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben-
gio, “Attention-based models for speech recognition,” in Ad-
vances in Neural Information Processing Systems, 2015, pp.
577–585.
############################################################################
Smoothing normalization function
a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j}))
############################################################################
Args:
e: matrix [batch_size, max_time(memory_time)]: expected to be energy (score)
values of an attention mechanism
Returns:
matrix [batch_size, max_time]: [0, 1] normalized alignments with possible
attendance to multiple memory time steps.
"""
return tf.nn.sigmoid(e) / tf.reduce_sum(tf.nn.sigmoid(e), axis=-1, keepdims=True)
class GmmAttention(AttentionMechanism):
def __init__(self,
num_mixtures,
memory,
memory_sequence_length=None,
check_inner_dims_defined=True,
score_mask_value=None,
name='GmmAttention'):
self.dtype = memory.dtype
self.num_mixtures = num_mixtures
self.query_layer = tf.layers.Dense(3 * num_mixtures, name='gmm_query_projection', use_bias=True, dtype=self.dtype)
with tf.name_scope(name, 'GmmAttentionMechanismInit'):
if score_mask_value is None:
score_mask_value = 0.
self._maybe_mask_score = functools.partial(
_maybe_mask_score,
memory_sequence_length=memory_sequence_length,
score_mask_value=score_mask_value)
self._value = _prepare_memory(
memory, memory_sequence_length, check_inner_dims_defined)
self._batch_size = (
self._value.shape[0].value or tf.shape(self._value)[0])
self._alignments_size = (
self._value.shape[1].value or tf.shape(self._value)[1])
@property
def values(self):
return self._value
@property
def batch_size(self):
return self._batch_size
@property
def alignments_size(self):
return self._alignments_size
@property
def state_size(self):
return self.num_mixtures
def initial_alignments(self, batch_size, dtype):
max_time = self._alignments_size
return _zero_state_tensors(max_time, batch_size, dtype)
def initial_state(self, batch_size, dtype):
state_size_ = self.state_size
return _zero_state_tensors(state_size_, batch_size, dtype)
def __call__(self, query, state):
with tf.variable_scope("GmmAttention"):
previous_kappa = state
params = self.query_layer(query) # query(dec_rnn_size=256) , params(num_mixtures(256)*3)
alpha_hat, beta_hat, kappa_hat = tf.split(params, num_or_size_splits=3, axis=1)
# [batch_size, num_mixtures, 1]
alpha = tf.expand_dims(tf.exp(alpha_hat), axis=2)
# softmax makes the alpha value more stable.
# alpha = tf.expand_dims(tf.nn.softmax(alpha_hat, axis=1), axis=2)
beta = tf.expand_dims(tf.exp(beta_hat), axis=2)
kappa = tf.expand_dims(previous_kappa + tf.exp(kappa_hat), axis=2)
# [1, 1, max_input_steps]
mu = tf.reshape(tf.cast(tf.range(self.alignments_size), dtype=tf.float32), shape=[1, 1, self.alignments_size]) # [[[0,1,2,...]]]
# [batch_size, max_input_steps]
phi = tf.reduce_sum(alpha * tf.exp(-beta * (kappa - mu) ** 2.), axis=1)
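# GMM attention (a sketch, following Graves 2013, "Generating Sequences With RNNs"):
# phi(i) = sum_k alpha_k * exp(-beta_k * (kappa_k - i)^2) over input positions i.
# kappa advances by exp(kappa_hat) >= 0 at every step, so the attention window moves
# monotonically forward over the encoder outputs.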
alignments = self._maybe_mask_score(phi)
state = tf.squeeze(kappa, axis=2)
return alignments, state
================================================
FILE: tacotron2/tacotron2.py
================================================
# coding: utf-8
# Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py
"""
Model fixes
1. Fixed incorrect dropout application in the prenet
2. Fixed the AttentionWrapper application order: keith ito's code is implemented correctly
3. Use normalize=True in BahdanauMonotonicAttention (applied 2018-09-11)
4. Pass memory_sequence_length into BahdanauMonotonicAttention
5. Fixed the input_lengths miscalculation in synthesizer.py: +1 is required.
"""
import numpy as np
import tensorflow as tf
from tensorflow.contrib.seq2seq import BasicDecoder, BahdanauAttention, BahdanauMonotonicAttention,LuongAttention
from tensorflow.contrib.rnn import GRUCell, MultiRNNCell, OutputProjectionWrapper, ResidualWrapper,LSTMStateTuple
from utils.infolog import log
from text.symbols import symbols
from .modules import *
from .helpers import TacoTestHelper, TacoTrainingHelper
from .rnn_wrappers import LocationSensitiveAttention,GmmAttention,ZoneoutLSTMCell,DecoderWrapper
class Tacotron2():
def __init__(self, hparams):
self._hparams = hparams
def initialize(self, inputs, input_lengths, num_speakers, speaker_id=None,mel_targets=None, linear_targets=None, is_training= False,loss_coeff=None,stop_token_targets=None):
with tf.variable_scope('Eembedding') as scope:
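# note: the scope name 'Eembedding' is misspelled, but it determines the checkpoint
# variable paths, so renaming it would break loading of previously trained models.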
hp = self._hparams
batch_size = tf.shape(inputs)[0]
# Character embeddings (hp.embedding_size = 512)
char_embed_table = tf.get_variable('inputs_embedding', [len(symbols), hp.embedding_size], dtype=tf.float32,initializer=tf.truncated_normal_initializer(stddev=0.5))
zero_pad = True
if zero_pad: # logic borrowed from a transformer implementation.
# the embedding for <PAD> (id 0) is fixed at zero and is not updated during training; i.e., the first row (<PAD>) of the variable created by get_variable above is never used.
char_embed_table = tf.concat((tf.zeros(shape=[1, hp.embedding_size]),char_embed_table[1:, :]), 0)
# [N, T_in, embedding_size]
char_embedded_inputs = tf.nn.embedding_lookup(char_embed_table, inputs)
self.num_speakers = num_speakers
if self.num_speakers > 1:
speaker_embed_table = tf.get_variable('speaker_embedding',[self.num_speakers, hp.speaker_embedding_size], dtype=tf.float32,initializer=tf.truncated_normal_initializer(stddev=0.5))
# [N, T_in, speaker_embedding_size]
speaker_embed = tf.nn.embedding_lookup(speaker_embed_table, speaker_id)
deep_dense = lambda x, dim,name: tf.layers.dense(x, dim, activation=tf.nn.softsign,name=name) # softsign: x / (abs(x) + 1)
encoder_rnn_init_state = deep_dense( speaker_embed, hp.encoder_lstm_units * 4,'encoder_init_dense') # hp.encoder_lstm_units = 256
decoder_rnn_init_states = [deep_dense(speaker_embed, hp.decoder_lstm_units*2,'decoder_init_dense_{}'.format(i)) for i in range(hp.decoder_layers)] # hp.decoder_lstm_units = 1024
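# Shape bookkeeping (a sketch): encoder_rnn_init_state has 4 * encoder_lstm_units
# (= 1024) features, later split into (c_fw, h_fw, c_bw, h_bw); each decoder init
# state has 2 * decoder_lstm_units (= 2048) features, later split into (c, h).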
speaker_embed = None
else:
# when self.num_speakers == 1
speaker_embed = None
encoder_rnn_init_state = None # init state of the bidirectional encoder RNN
attention_rnn_init_state = None
decoder_rnn_init_states = None
with tf.variable_scope('Encoder') as scope:
##############
# Encoder
##############
x = char_embedded_inputs
for i in range(hp.enc_conv_num_layers):
x = tf.layers.conv1d(x,filters=hp.enc_conv_channels,kernel_size=hp.enc_conv_kernel_size,padding='same',activation=tf.nn.relu,name='Encoder_{}'.format(i))
x = tf.layers.batch_normalization(x, training=is_training)
x = tf.layers.dropout(x, rate=hp.dropout_prob, training=is_training, name='dropout_{}'.format(i))
if encoder_rnn_init_state is not None:
initial_state_fw_c,initial_state_fw_h, initial_state_bw_c,initial_state_bw_h = tf.split(encoder_rnn_init_state, 4, 1)
initial_state_fw = LSTMStateTuple(initial_state_fw_c,initial_state_fw_h)
initial_state_bw = LSTMStateTuple(initial_state_bw_c,initial_state_bw_h)
else: # single mode
initial_state_fw, initial_state_bw = None, None
cell_fw= ZoneoutLSTMCell(hp.encoder_lstm_units, is_training,zoneout_factor_cell=hp.tacotron_zoneout_rate,zoneout_factor_output=hp.tacotron_zoneout_rate,name='encoder_fw_LSTM')
cell_bw= ZoneoutLSTMCell(hp.encoder_lstm_units, is_training,zoneout_factor_cell=hp.tacotron_zoneout_rate,zoneout_factor_output=hp.tacotron_zoneout_rate,name='encoder_fw_LSTM')
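# note: cell_bw reuses the name 'encoder_fw_LSTM'; the bidirectional fw/ and bw/ variable
# scopes still keep the two cells' weights separate, and renaming it would change the
# checkpoint variable paths.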
encoder_conv_output = x
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,encoder_conv_output,sequence_length=input_lengths,
initial_state_fw=initial_state_fw,initial_state_bw=initial_state_bw,dtype=tf.float32)
# encoder_outputs = [N,T,2*encoder_lstm_units] = [N,T,512]
encoder_outputs = tf.concat(outputs, axis=2) # Concat and return forward + backward outputs
with tf.variable_scope('Decoder') as scope:
##############
# Attention
##############
if hp.attention_type == 'bah_mon':
attention_mechanism = BahdanauMonotonicAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths,normalize=False)
elif hp.attention_type == 'bah_mon_norm': # added by hccho
attention_mechanism = BahdanauMonotonicAttention(hp.attention_size, encoder_outputs,memory_sequence_length = input_lengths, normalize=True)
elif hp.attention_type == 'loc_sen': # Location Sensitive Attention
attention_mechanism = LocationSensitiveAttention(hp.attention_size, encoder_outputs,hparams=hp, is_training=is_training,
mask_encoder=hp.mask_encoder,memory_sequence_length = input_lengths,smoothing=hp.smoothing,cumulate_weights=hp.cumulative_weights)
elif hp.attention_type == 'gmm': # GMM Attention
attention_mechanism = GmmAttention(hp.attention_size, memory=encoder_outputs,memory_sequence_length = input_lengths)
elif hp.attention_type == 'bah_norm':
attention_mechanism = BahdanauAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths, normalize=True)
elif hp.attention_type == 'luong_scaled':
attention_mechanism = LuongAttention( hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths, scale=True)
elif hp.attention_type == 'luong':
attention_mechanism = LuongAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths)
elif hp.attention_type == 'bah':
attention_mechanism = BahdanauAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths)
else:
raise Exception(" [!] Unkown attention type: {}".format(hp.attention_type))
decoder_lstm = [ZoneoutLSTMCell(hp.decoder_lstm_units, is_training,zoneout_factor_cell=hp.tacotron_zoneout_rate,
zoneout_factor_output=hp.tacotron_zoneout_rate,name='decoder_LSTM_{}'.format(i+1)) for i in range(hp.decoder_layers)]
decoder_lstm = tf.contrib.rnn.MultiRNNCell(decoder_lstm, state_is_tuple=True)
decoder_init_state = decoder_lstm.zero_state(batch_size=batch_size, dtype=tf.float32) # the zero state built here is patched below (multi-speaker) with the speaker-dependent initial states.
if hp.model_type == "multi-speaker":
decoder_init_state = list(decoder_init_state)
for idx, cell in enumerate(decoder_rnn_init_states):
shape1 = decoder_init_state[idx][0].get_shape().as_list()
shape2 = cell.get_shape().as_list()
if shape1[1]*2 != shape2[1]:
raise Exception(" [!] Shape {} and {} should be equal".format(shape1, shape2))
c,h = tf.split(cell,2,1)
decoder_init_state[idx] = LSTMStateTuple(c,h)
decoder_init_state = tuple(decoder_init_state)
attention_cell = AttentionWrapper(decoder_lstm,attention_mechanism, initial_cell_state=decoder_init_state,
alignment_history=True,output_attention=False) # note output_attention=False; attention_layer_size is not given, so attention == the context vector.
# attention_state_size = 256
# Decoder input -> prenet -> decoder_lstm -> concat[output, attention]
dec_prenet_outputs = DecoderWrapper(attention_cell , is_training, hp.dec_prenet_sizes, hp.dropout_prob,hp.inference_prenet_dropout)
dec_outputs_cell = OutputProjectionWrapper(dec_prenet_outputs,(hp.num_mels+1) * hp.reduction_factor)
if is_training:
helper = TacoTrainingHelper(mel_targets, hp.num_mels, hp.reduction_factor) # mel_targets provides both the teacher-forcing frames and the batch size
else:
helper = TacoTestHelper(batch_size, hp.num_mels, hp.reduction_factor)
decoder_init_state = dec_outputs_cell.zero_state(batch_size=batch_size, dtype=tf.float32)
(decoder_outputs, _), final_decoder_state, _ = \
tf.contrib.seq2seq.dynamic_decode(BasicDecoder(dec_outputs_cell, helper, decoder_init_state),maximum_iterations=int(hp.max_n_frame/hp.reduction_factor)) # max_iters=200
decoder_mel_outputs = tf.reshape(decoder_outputs[:,:,:hp.num_mels * hp.reduction_factor], [batch_size, -1, hp.num_mels]) # [N,iters,400] -> [N,5*iters,80]
stop_token_outputs = tf.reshape(decoder_outputs[:,:,hp.num_mels * hp.reduction_factor:], [batch_size, -1]) # [N,iters]
# Postnet
x = decoder_mel_outputs
for i in range(hp.postnet_num_layers):
activation = tf.nn.tanh if i != (hp.postnet_num_layers-1) else None
x = tf.layers.conv1d(x,filters=hp.postnet_channels,kernel_size=hp.postnet_kernel_size,padding='same',activation=activation,name='Postnet_{}'.format(i))
x = tf.layers.batch_normalization(x, training=is_training)
x = tf.layers.dropout(x, rate=hp.dropout_prob, training=is_training, name='Postnet_dropout_{}'.format(i))
residual = tf.layers.dense(x,hp.num_mels,name='residual_projection')
mel_outputs = decoder_mel_outputs + residual
# Add post-processing CBHG:
# mel_outputs: (N,T,num_mels)
post_outputs = cbhg(mel_outputs, None, is_training,hp.post_bank_size, hp.post_bank_channel_size, hp.post_maxpool_width, hp.post_highway_depth, hp.post_rnn_size,
hp.post_proj_sizes, hp.post_proj_width,scope='post_cbhg')
linear_outputs = tf.layers.dense(post_outputs, hp.num_freq,name='linear_spectogram_projection') # [N, T_out, F(1025)]
# Grab alignments from the final decoder state:
alignments = tf.transpose(final_decoder_state.alignment_history.stack(), [1, 2, 0]) # batch_size, text length(encoder), target length(decoder)
self.inputs = inputs
self.speaker_id = speaker_id
self.input_lengths = input_lengths
self.loss_coeff = loss_coeff
self.decoder_mel_outputs = decoder_mel_outputs
self.mel_outputs = mel_outputs
self.linear_outputs = linear_outputs
self.alignments = alignments
self.mel_targets = mel_targets
self.linear_targets = linear_targets
self.final_decoder_state = final_decoder_state
self.stop_token_targets = stop_token_targets
self.stop_token_outputs = stop_token_outputs
self.all_vars = tf.trainable_variables()
log('='*40)
log(' model_type: %s' % hp.model_type)
log('='*40)
log('Initialized Tacotron model. Dimensions: ')
log(' embedding: %d' % char_embedded_inputs.shape[-1])
log(' encoder conv out: %d' % encoder_conv_output.shape[-1])
log(' encoder out: %d' % encoder_outputs.shape[-1])
log(' attention out: %d' % attention_cell.output_size)
log(' decoder prenet lstm concat out : %d' % dec_prenet_outputs.output_size)
log(' decoder cell out: %d' % dec_outputs_cell.output_size)
log(' decoder out (%d frames): %d' % (hp.reduction_factor, decoder_outputs.shape[-1]))
log(' decoder mel out: %d' % decoder_mel_outputs.shape[-1])
log(' mel out: %d' % mel_outputs.shape[-1])
log(' postnet out: %d' % post_outputs.shape[-1])
log(' linear out: %d' % linear_outputs.shape[-1])
log(' Tacotron Parameters {:.3f} Million.'.format(np.sum([np.prod(v.get_shape().as_list()) for v in self.all_vars]) / 1000000))
def add_loss(self):
'''Adds loss to the model. Sets "loss" field. initialize must have been called.'''
with tf.variable_scope('loss') as scope:
hp = self._hparams
before = tf.squared_difference(self.mel_targets, self.decoder_mel_outputs)
after = tf.squared_difference(self.mel_targets, self.mel_outputs)
mel_loss = before+after
stop_token_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.stop_token_targets, logits=self.stop_token_outputs))
l1 = tf.abs(self.linear_targets - self.linear_outputs)
expanded_loss_coeff = tf.expand_dims(tf.expand_dims(self.loss_coeff, [-1]), [-1])
regularization_loss = tf.reduce_mean([tf.nn.l2_loss(v) for v in self.all_vars
if not('bias' in v.name or 'Bias' in v.name or 'projection' in v.name or 'inputs_embedding' in v.name or 'speaker_embedding' in v.name
or 'dense' in v.name or 'RNN' in v.name or 'LSTM' in v.name)]) * hp.tacotron_reg_weight
regularization_loss = 0 # regularization is disabled: this overrides the value computed above
if hp.prioritize_loss:
# Prioritize loss for frequencies.
upper_priority_freq = int(5000 / (hp.sample_rate * 0.5) * hp.num_freq)
lower_priority_freq = int(165 / (hp.sample_rate * 0.5) * hp.num_freq)
l1_priority= l1[:,:,lower_priority_freq:upper_priority_freq]
self.loss = tf.reduce_mean(mel_loss * expanded_loss_coeff) + \
0.5 * tf.reduce_mean(l1 * expanded_loss_coeff) + 0.5 * tf.reduce_mean(l1_priority * expanded_loss_coeff) + stop_token_loss + regularization_loss
self.linear_loss = tf.reduce_mean( 0.5 * (tf.reduce_mean(l1) + tf.reduce_mean(l1_priority)))
else:
self.loss = tf.reduce_mean(mel_loss * expanded_loss_coeff) + tf.reduce_mean(l1 * expanded_loss_coeff) + stop_token_loss + regularization_loss # this loss is not used; loss_without_coeff below is used instead
self.linear_loss = tf.reduce_mean(l1)
self.mel_loss = tf.reduce_mean(mel_loss)
self.loss_without_coeff = self.mel_loss + self.linear_loss + stop_token_loss + regularization_loss
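# Summary of the objective implemented above (descriptive comment, not original source):
#   mel_loss        = mean((mel_targets - decoder_mel_outputs)^2 + (mel_targets - mel_outputs)^2)  # before + after postnet
#   linear_loss     = mean(|linear_targets - linear_outputs|)  # L1, optionally with a prioritized frequency band
#   stop_token_loss = sigmoid cross-entropy on the stop tokens
#   loss_without_coeff = mel_loss + linear_loss + stop_token_loss + regularization (currently zeroed out)
# self.loss additionally weights the mel and linear terms by the per-example loss_coeff.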
def add_optimizer(self, global_step):
'''Adds optimizer. Sets "gradients" and "optimize" fields. add_loss must have been called.
Args:
global_step: int32 scalar Tensor representing current global step in training
'''
with tf.variable_scope('optimizer') as scope:
hp = self._hparams
if hp.tacotron_decay_learning_rate:
self.decay_steps = hp.tacotron_decay_steps
self.decay_rate = hp.tacotron_decay_rate
self.learning_rate = self._learning_rate_decay(hp.tacotron_initial_learning_rate, global_step)
else:
self.learning_rate = tf.convert_to_tensor(hp.tacotron_initial_learning_rate)
optimizer = tf.train.AdamOptimizer(self.learning_rate, hp.adam_beta1, hp.adam_beta2)
gradients, variables = zip(*optimizer.compute_gradients(self.loss))
self.gradients = gradients
clipped_gradients, _ = tf.clip_by_global_norm(gradients, 1.0)
# Add dependency on UPDATE_OPS; otherwise batchnorm won't work correctly. See:
# https://github.com/tensorflow/tensorflow/issues/1122
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
self.optimize = optimizer.apply_gradients(zip(clipped_gradients, variables),global_step=global_step)
def _learning_rate_decay(self, init_lr, global_step):
#################################################################
# Narrow Exponential Decay:
# Phase 1: lr = 1e-3
# We only start learning rate decay after 50k steps
# Phase 2: lr in (1e-5, 1e-3)
# decay reach minimal value at step 310k
# Phase 3: lr = 1e-5
# clip by minimal learning rate value (step > 310k)
#################################################################
hp = self._hparams
#Compute natural exponential decay
lr = tf.train.exponential_decay(init_lr,
global_step - hp.tacotron_start_decay, #lr = 1e-3 at step 50k
self.decay_steps,
self.decay_rate, #lr = 1e-5 around step 310k
name='lr_exponential_decay')
#clip learning rate by max and min values (initial and final values)
return tf.minimum(tf.maximum(lr, hp.tacotron_final_learning_rate), init_lr)
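# A minimal sketch (not part of the original file) of the decay schedule above,
# assuming hypothetical hparams consistent with the phase comments:
# tacotron_start_decay=50000, tacotron_decay_steps=260000, tacotron_decay_rate=1e-2,
# initial lr 1e-3, final lr 1e-5.
if __name__ == "__main__":
    def _lr_at(step, init_lr=1e-3, final_lr=1e-5, start=50000, decay_steps=260000, rate=1e-2):
        lr = init_lr * rate ** ((step - start) / decay_steps)  # tf.train.exponential_decay (staircase=False)
        return min(max(lr, final_lr), init_lr)  # clip to [final_lr, init_lr]
    print([_lr_at(s) for s in (0, 50000, 180000, 310000, 400000)])  # 1e-3, 1e-3, 1e-4, 1e-5, 1e-5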
================================================
FILE: text/__init__.py
================================================
# coding: utf-8
import re
import string
import numpy as np
from text import cleaners
from hparams import hparams
from text.symbols import symbols, en_symbols, PAD, EOS
from text.korean import jamo_to_korean
# Mappings from symbol to numeric ID and vice versa:
_symbol_to_id = {s: i for i, s in enumerate(symbols)} # 80 symbols
_id_to_symbol = {i: s for i, s in enumerate(symbols)}
isEn=False
# Regular expression matching text enclosed in curly braces:
_curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)')
punctuation_table = str.maketrans({key: None for key in string.punctuation})
def convert_to_en_symbols():
'''Converts built-in Korean symbols to English, to be used for English training
'''
global _symbol_to_id, _id_to_symbol, isEn
if not isEn:
print(" [!] Converting to english mode")
_symbol_to_id = {s: i for i, s in enumerate(en_symbols)}
_id_to_symbol = {i: s for i, s in enumerate(en_symbols)}
isEn=True
def remove_puncuations(text):
return text.translate(punctuation_table)
def text_to_sequence(text, as_token=False):
cleaner_names = [x.strip() for x in hparams.cleaners.split(',')]
if ('english_cleaners' in cleaner_names) and isEn==False:
convert_to_en_symbols()
return _text_to_sequence(text, cleaner_names, as_token)
def _text_to_sequence(text, cleaner_names, as_token):
'''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
The text can optionally have ARPAbet sequences enclosed in curly braces embedded
in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street."
Args:
text: string to convert to a sequence
cleaner_names: names of the cleaner functions to run the text through
Returns:
List of integers corresponding to the symbols in the text
'''
sequence = []
# Check for curly braces and treat their contents as ARPAbet:
while len(text):
m = _curly_re.match(text)
if not m:
sequence += _symbols_to_sequence(_clean_text(text, cleaner_names))
break
sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names))
sequence += _arpabet_to_sequence(m.group(2))
text = m.group(3)
# Append EOS token
sequence.append(_symbol_to_id[EOS]) # [14, 29, 45, 2, 27, 62, 20, 21, 4, 39, 45, 1]
if as_token:
return sequence_to_text(sequence, combine_jamo=True)
else:
return np.array(sequence, dtype=np.int32)
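# Hedged usage sketch, grounded in the id table documented in text/symbols.py
# and the example in the comment above:
#   text_to_sequence("존경하는")
#   -> np.int32 array [14, 29, 45, 2, 27, 62, 20, 21, 4, 39, 45, 1]   (the trailing 1 is EOS '~')
#   text_to_sequence("존경하는", as_token=True)
#   -> the jamo recombined into Hangul via sequence_to_text(..., combine_jamo=True)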
def sequence_to_text(sequence, skip_eos_and_pad=False, combine_jamo=False):
'''Converts a sequence of IDs back to a string'''
cleaner_names=[x.strip() for x in hparams.cleaners.split(',')]
if 'english_cleaners' in cleaner_names and isEn==False:
convert_to_en_symbols()
result = ''
for symbol_id in sequence:
if symbol_id in _id_to_symbol:
s = _id_to_symbol[symbol_id]
# Enclose ARPAbet back in curly braces:
if len(s) > 1 and s[0] == '@':
s = '{%s}' % s[1:]
if not skip_eos_and_pad or s not in [EOS, PAD]:
result += s
result = result.replace('}{', ' ')
if combine_jamo:
return jamo_to_korean(result)
else:
return result
def _clean_text(text, cleaner_names):
for name in cleaner_names:
cleaner = getattr(cleaners, name, None) # default None so the check below can fire instead of AttributeError
if not cleaner:
raise Exception('Unknown cleaner: %s' % name)
text = cleaner(text) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~']
return text
def _symbols_to_sequence(symbols):
return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)]
def _arpabet_to_sequence(text):
return _symbols_to_sequence(['@' + s for s in text.split()])
def _should_keep_symbol(s):
return s in _symbol_to_id and s != '_' and s != '~'
================================================
FILE: text/cleaners.py
================================================
# coding: utf-8
# Code based on https://github.com/keithito/tacotron/blob/master/text/cleaners.py
'''
Cleaners are transformations that run over the input text at both training and eval time.
Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
1. "english_cleaners" for English text
2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
the Unidecode library (https://pypi.python.org/pypi/Unidecode)
3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
the symbols in symbols.py to match your data).
'''
import re
from .korean import tokenize as ko_tokenize
# Added to support LJ_speech
from unidecode import unidecode
from .en_numbers import normalize_numbers as en_normalize_numbers
# Regular expression matching whitespace:
_whitespace_re = re.compile(r'\s+')
def korean_cleaners(text):
'''Pipeline for Korean text, including number and abbreviation expansion.'''
text = ko_tokenize(text) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~']
return text
# List of (regular expression, replacement) pairs for abbreviations:
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
('mrs', 'misess'),
('mr', 'mister'),
('dr', 'doctor'),
('st', 'saint'),
('co', 'company'),
('jr', 'junior'),
('maj', 'major'),
('gen', 'general'),
('drs', 'doctors'),
('rev', 'reverend'),
('lt', 'lieutenant'),
('hon', 'honorable'),
('sgt', 'sergeant'),
('capt', 'captain'),
('esq', 'esquire'),
('ltd', 'limited'),
('col', 'colonel'),
('ft', 'fort'),
]]
def expand_abbreviations(text):
for regex, replacement in _abbreviations:
text = re.sub(regex, replacement, text)
return text
def expand_numbers(text):
return en_normalize_numbers(text)
def lowercase(text):
return text.lower()
def collapse_whitespace(text):
return re.sub(_whitespace_re, ' ', text)
def convert_to_ascii(text):
'''Converts to ASCII; existed in keithito but was deleted in carpedm20'''
return unidecode(text)
def basic_cleaners(text):
'''Basic pipeline that lowercases and collapses whitespace without transliteration.'''
text = lowercase(text)
text = collapse_whitespace(text)
return text
def transliteration_cleaners(text):
'''Pipeline for non-English text that transliterates to ASCII.'''
text = convert_to_ascii(text)
text = lowercase(text)
text = collapse_whitespace(text)
return text
def english_cleaners(text):
'''Pipeline for English text, including number and abbreviation expansion.'''
text = convert_to_ascii(text)
text = lowercase(text)
text = expand_numbers(text)
text = expand_abbreviations(text)
text = collapse_whitespace(text)
return text
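# Hedged example of the full English pipeline above (my trace, not original source):
#   english_cleaners("Dr. Smith paid $3.50")
#   -> "doctor smith paid three dollars, fifty cents"
# (lowercase runs first, then "$3.50" -> "3 dollars, 50 cents" -> number words, then "dr." -> "doctor")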
================================================
FILE: text/en_numbers.py
================================================
import inflect
import re
_inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
_number_re = re.compile(r'[0-9]+')
def _remove_commas(m):
return m.group(1).replace(',', '')
def _expand_decimal_point(m):
return m.group(1).replace('.', ' point ')
def _expand_dollars(m):
match = m.group(1)
parts = match.split('.')
if len(parts) > 2:
return match + ' dollars' # Unexpected format
dollars = int(parts[0]) if parts[0] else 0
cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
if dollars and cents:
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
cent_unit = 'cent' if cents == 1 else 'cents'
return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
elif dollars:
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
return '%s %s' % (dollars, dollar_unit)
elif cents:
cent_unit = 'cent' if cents == 1 else 'cents'
return '%s %s' % (cents, cent_unit)
else:
return 'zero dollars'
def _expand_ordinal(m):
return _inflect.number_to_words(m.group(0))
def _expand_number(m):
num = int(m.group(0))
if num > 1000 and num < 3000:
if num == 2000:
return 'two thousand'
elif num > 2000 and num < 2010:
return 'two thousand ' + _inflect.number_to_words(num % 100)
elif num % 100 == 0:
return _inflect.number_to_words(num // 100) + ' hundred'
else:
return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
else:
return _inflect.number_to_words(num, andword='')
def normalize_numbers(text):
text = re.sub(_comma_number_re, _remove_commas, text)
text = re.sub(_pounds_re, r'\1 pounds', text)
text = re.sub(_dollars_re, _expand_dollars, text)
text = re.sub(_decimal_number_re, _expand_decimal_point, text)
text = re.sub(_ordinal_re, _expand_ordinal, text)
text = re.sub(_number_re, _expand_number, text)
return text
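# Hedged examples of normalize_numbers (my traces; the substitution order above matters:
# commas, pounds, dollars, decimals, ordinals, then plain numbers):
#   normalize_numbers("$3.50")   -> "three dollars, fifty cents"
#   normalize_numbers("in 2008") -> "in two thousand eight"
#   normalize_numbers("3.14")    -> "three point fourteen"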
================================================
FILE: text/english.py
================================================
# Code from https://github.com/keithito/tacotron/blob/master/util/numbers.py
import re
import inflect
_inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
_number_re = re.compile(r'[0-9]+')
def _remove_commas(m):
return m.group(1).replace(',', '')
def _expand_decimal_point(m):
return m.group(1).replace('.', ' point ')
def _expand_dollars(m):
match = m.group(1)
parts = match.split('.')
if len(parts) > 2:
return match + ' dollars' # Unexpected format
dollars = int(parts[0]) if parts[0] else 0
cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
if dollars and cents:
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
cent_unit = 'cent' if cents == 1 else 'cents'
return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
elif dollars:
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
return '%s %s' % (dollars, dollar_unit)
elif cents:
cent_unit = 'cent' if cents == 1 else 'cents'
return '%s %s' % (cents, cent_unit)
else:
return 'zero dollars'
def _expand_ordinal(m):
return _inflect.number_to_words(m.group(0))
def _expand_number(m):
num = int(m.group(0))
if num > 1000 and num < 3000:
if num == 2000:
return 'two thousand'
elif num > 2000 and num < 2010:
return 'two thousand ' + _inflect.number_to_words(num % 100)
elif num % 100 == 0:
return _inflect.number_to_words(num // 100) + ' hundred'
else:
return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
else:
return _inflect.number_to_words(num, andword='')
def normalize(text):
text = re.sub(_comma_number_re, _remove_commas, text)
text = re.sub(_pounds_re, r'\1 pounds', text)
text = re.sub(_dollars_re, _expand_dollars, text)
text = re.sub(_decimal_number_re, _expand_decimal_point, text)
text = re.sub(_ordinal_re, _expand_ordinal, text)
text = re.sub(_number_re, _expand_number, text)
return text
================================================
FILE: text/ko_dictionary.py
================================================
# coding: utf-8
etc_dictionary = {
'2 30대': '이삼십대',
'20~30대': '이삼십대',
'20, 30대': '이십대 삼십대',
'1+1': '원플러스원',
'3에서 6개월인': '3개월에서 육개월인',
}
english_dictionary = {
'Devsisters': '데브시스터즈',
'track': '트랙',
# krbook
'LA': '엘에이',
'LG': '엘지',
'KOREA': '코리아',
'JSA': '제이에스에이',
'PGA': '피지에이',
'GA': '지에이',
'idol': '아이돌',
'KTX': '케이티엑스',
'AC': '에이씨',
'DVD': '디비디',
'US': '유에스',
'CNN': '씨엔엔',
'LPGA': '엘피지에이',
'P': '피',
'L': '엘',
'T': '티',
'B': '비',
'C': '씨',
'BIFF': '비아이에프에프',
'GV': '지비',
# JTBC
'IT': '아이티',
'IQ': '아이큐',
'JTBC': '제이티비씨',
'trickle down effect': '트리클 다운 이펙트',
'trickle up effect': '트리클 업 이펙트',
'down': '다운',
'up': '업',
'FCK': '에프씨케이',
'AP': '에이피',
'WHERETHEWILDTHINGSARE': '',
'Rashomon Effect': '',
'O': '오',
'OO': '오오',
'B': '비',
'GDP': '지디피',
'CIPA': '씨아이피에이',
'YS': '와이에스',
'Y': '와이',
'S': '에스',
'JTBC': '제이티비씨',
'PC': '피씨',
'bill': '빌',
'Halmuny': '하모니', #####
'X': '엑스',
'SNS': '에스엔에스',
'ability': '어빌리티',
'shy': '',
'CCTV': '씨씨티비',
'IT': '아이티',
'the tenth man': '더 텐쓰 맨', ####
'L': '엘',
'PC': '피씨',
'YSDJJPMB': '', ########
'Content Attitude Timing': '컨텐트 애티튜드 타이밍',
'CAT': '캣',
'IS': '아이에스',
'SNS': '에스엔에스',
'K': '케이',
'Y': '와이',
'KDI': '케이디아이',
'DOC': '디오씨',
'CIA': '씨아이에이',
'PBS': '피비에스',
'D': '디',
'PPropertyPositionPowerPrisonP': '',  # value added: the bare string silently concatenated with the 'S' key below
'S': '에스',
'francisco': '프란시스코',
'I': '아이',
'III': '아이아이', ######
'No joke': '노 조크',
'BBK': '비비케이',
'LA': '엘에이',
'Don': '',
't worry be happy': ' 워리 비 해피',
'NO': '엔오', #####
'it was our sky': '잇 워즈 아워 스카이',
'it is our sky': '잇 이즈 아워 스카이', ####
'NEIS': '엔이아이에스', #####
'IMF': '아이엠에프',
'apology': '어폴로지',
'humble': '험블',
'M': '엠',
'Nowhere Man': '노웨어 맨',
'The Tenth Man': '더 텐쓰 맨',
'PBS': '피비에스',
'BBC': '비비씨',
'MRJ': '엠알제이',
'CCTV': '씨씨티비',
'Pick me up': '픽 미 업',
'DNA': '디엔에이',
'UN': '유엔',
'STOP': '스탑', #####
'PRESS': '프레스', #####
'not to be': '낫 투비',
'Denial': '디나이얼',
'G': '지',
'IMF': '아이엠에프',
'GDP': '지디피',
'JTBC': '제이티비씨',
'Time flies like an arrow': '타임 플라이즈 라이크 언 애로우',
'DDT': '디디티',
'AI': '에이아이',
'Z': '제트',
'OECD': '오이씨디',
'N': '앤',
'A': '에이',
'MB': '엠비',
'EH': '이에이치',
'IS': '아이에스',
'TV': '티비',
'MIT': '엠아이티',
'KBO': '케이비오',
'I love America': '아이 러브 아메리카',
'SF': '에스에프',
'Q': '큐',
'KFX': '케이에프엑스',
'PM': '피엠',
'Prime Minister': '프라임 미니스터',
'Swordline': '스워드라인',
'TBS': '티비에스',
'DDT': '디디티',
'CS': '씨에스',
'Reflecting Absence': '리플렉팅 앱센스',
'PBS': '피비에스',
'Drum being beaten by everyone': '드럼 빙 비튼 바이 에브리원',
'negative pressure': '네거티브 프레셔',
'F': '에프',
'KIA': '기아',
'FTA': '에프티에이',
'Que sais-je': '',
'UFC': '유에프씨',
'P': '피',
'DJ': '디제이',
'Chaebol': '채벌',
'BBC': '비비씨',
'OECD': '오이씨디',
'BC': '삐씨',
'C': '씨',
'B': '비',  # was '씨' (the 'C' sound), which overrode the correct 'B' entries above
'KY': '케이와이',
'K': '케이',
'CEO': '씨이오',
'YH': '와이에치',
'IS': '아이에스',
'who are you': '후 얼 유',
'Y': '와이',
'The Devils Advocate': '더 데빌즈 어드보카트',
'YS': '와이에스',
'so sorry': '쏘 쏘리',
'Santa': '산타',
'Big Endian': '빅 엔디안',
'Small Endian': '스몰 엔디안',
'Oh Captain My Captain': '오 캡틴 마이 캡틴',
'AIB': '에이아이비',
'K': '케이',
'PBS': '피비에스',
}
================================================
FILE: text/korean.py
================================================
# coding: utf-8
# Code based on
import re
import os
import ast
import json
from jamo import hangul_to_jamo, h2j, j2h
from .ko_dictionary import english_dictionary, etc_dictionary
PAD = '_'
EOS = '~'
PUNC = '!\'(),-.:;?'
SPACE = ' '
JAMO_LEADS = "".join([chr(_) for _ in range(0x1100, 0x1113)])
JAMO_VOWELS = "".join([chr(_) for _ in range(0x1161, 0x1176)])
JAMO_TAILS = "".join([chr(_) for _ in range(0x11A8, 0x11C3)])
VALID_CHARS = JAMO_LEADS + JAMO_VOWELS + JAMO_TAILS + PUNC + SPACE
ALL_SYMBOLS = PAD + EOS + VALID_CHARS
char_to_id = {c: i for i, c in enumerate(ALL_SYMBOLS)}
id_to_char = {i: c for i, c in enumerate(ALL_SYMBOLS)}
quote_checker = """([`"'"“‘])(.+?)([`"'"”’])"""
def is_lead(char):
return char in JAMO_LEADS
def is_vowel(char):
return char in JAMO_VOWELS
def is_tail(char):
return char in JAMO_TAILS
def get_mode(char):
if is_lead(char):
return 0
elif is_vowel(char):
return 1
elif is_tail(char):
return 2
else:
return -1
def _get_text_from_candidates(candidates):
if len(candidates) == 0:
return ""
elif len(candidates) == 1:
return _jamo_char_to_hcj(candidates[0])
else:
return j2h(**dict(zip(["lead", "vowel", "tail"], candidates)))
def jamo_to_korean(text):
text = h2j(text)
idx = 0
new_text = ""
candidates = []
while True:
if idx >= len(text):
new_text += _get_text_from_candidates(candidates)
break
char = text[idx]
mode = get_mode(char)
if mode == 0:
new_text += _get_text_from_candidates(candidates)
candidates = [char]
elif mode == -1:
new_text += _get_text_from_candidates(candidates)
new_text += char
candidates = []
else:
candidates.append(char)
idx += 1
return new_text
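# Hedged example (my trace): jamo_to_korean recombines runs of lead/vowel/tail jamo
# via j2h, e.g. jamo_to_korean("존경") -> "존경".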
num_to_kor = {
'0': '영',
'1': '일',
'2': '이',
'3': '삼',
'4': '사',
'5': '오',
'6': '육',
'7': '칠',
'8': '팔',
'9': '구',
}
unit_to_kor1 = {
'%': '퍼센트',
'cm': '센치미터',
'mm': '밀리미터',
'km': '킬로미터',
'kg': '킬로그람',
}
unit_to_kor2 = {
'm': '미터',
}
upper_to_kor = {
'A': '에이',
'B': '비',
'C': '씨',
'D': '디',
'E': '이',
'F': '에프',
'G': '지',
'H': '에이치',
'I': '아이',
'J': '제이',
'K': '케이',
'L': '엘',
'M': '엠',
'N': '엔',
'O': '오',
'P': '피',
'Q': '큐',
'R': '알',
'S': '에스',
'T': '티',
'U': '유',
'V': '브이',
'W': '더블유',
'X': '엑스',
'Y': '와이',
'Z': '지',
}
def compare_sentence_with_jamo(text1, text2):
return h2j(text1) != h2j(text2)
def tokenize(text, as_id=False):
# Use hangul_to_jamo from the jamo package to split a Hangul string into lead/vowel/tail jamo.
text = normalize(text)
tokens = list(hangul_to_jamo(text)) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~']
if as_id:
return [char_to_id[token] for token in tokens] + [char_to_id[EOS]]
else:
return [token for token in tokens] + [EOS]
def tokenizer_fn(iterator):
return (token for x in iterator for token in tokenize(x, as_id=False))
def normalize(text):
text = text.strip()
text = re.sub('\(\d+일\)', '', text)
text = re.sub('\([⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]+\)', '', text)
text = normalize_with_dictionary(text, etc_dictionary)
text = normalize_english(text)
text = re.sub('[a-zA-Z]+', normalize_upper, text)
text = normalize_quote(text)
text = normalize_number(text)
return text
def normalize_with_dictionary(text, dic):
if any(key in text for key in dic.keys()):
pattern = re.compile('|'.join(re.escape(key) for key in dic.keys()))
return pattern.sub(lambda x: dic[x.group()], text)
else:
return text
def normalize_english(text):
def fn(m):
word = m.group()
if word in english_dictionary:
return english_dictionary.get(word)
else:
return word
text = re.sub("([A-Za-z]+)", fn, text)
return text
def normalize_upper(text):
text = text.group(0)
if all([char.isupper() for char in text]):
return "".join(upper_to_kor[char] for char in text)
else:
return text
def normalize_quote(text):
def fn(found_text):
from nltk import sent_tokenize # NLTK doesn't play well with multiprocessing
found_text = found_text.group()
unquoted_text = found_text[1:-1]
sentences = sent_tokenize(unquoted_text)
return " ".join(["'{}'".format(sent) for sent in sentences])
return re.sub(quote_checker, fn, text)
number_checker = "([+-]?\d[\d,]*)[\.]?\d*"
count_checker = "(시|명|가지|살|마리|포기|송이|수|톨|통|점|개|벌|척|채|다발|그루|자루|줄|켤레|그릇|잔|마디|상자|사람|곡|병|판)"
def normalize_number(text):
text = normalize_with_dictionary(text, unit_to_kor1)
text = normalize_with_dictionary(text, unit_to_kor2)
text = re.sub(number_checker + count_checker,
lambda x: number_to_korean(x, True), text)
text = re.sub(number_checker,
lambda x: number_to_korean(x, False), text)
return text
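# Hedged examples of normalize_number (my traces): units are expanded first, then a
# number followed by a counter word uses native Korean numerals:
#   normalize_number("3개") -> "세개"
#   normalize_number("10%") -> "십퍼센트"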
num_to_kor1 = [""] + list("일이삼사오육칠팔구")
num_to_kor2 = [""] + list("만억조경해")
num_to_kor3 = [""] + list("십백천")
#count_to_kor1 = [""] + ["하나","둘","셋","넷","다섯","여섯","일곱","여덟","아홉"]
count_to_kor1 = [""] + ["한","두","세","네","다섯","여섯","일곱","여덟","아홉"]
count_tenth_dict = {
"십": "열",
"두십": "스물",
"세십": "서른",
"네십": "마흔",
"다섯십": "쉰",
"여섯십": "예순",
"일곱십": "일흔",
"여덟십": "여든",
"아홉십": "아흔",
}
def number_to_korean(num_str, is_count=False):
if is_count:
num_str, unit_str = num_str.group(1), num_str.group(2)
else:
num_str, unit_str = num_str.group(), ""
num_str = num_str.replace(',', '')
num = ast.literal_eval(num_str)
if num == 0:
return "영"
check_float = num_str.split('.')
if len(check_float) == 2:
digit_str, float_str = check_float
elif len(check_float) >= 3:
raise Exception(" [!] Wrong number format")
else:
digit_str, float_str = check_float[0], None
if is_count and float_str is not None:
raise Exception(" [!] `is_count` and float number does not fit each other")
digit = int(digit_str)
if digit_str.startswith("-"):
digit, digit_str = abs(digit), str(abs(digit))
kor = ""
size = len(str(digit))
tmp = []
for i, v in enumerate(digit_str, start=1):
v = int(v)
if v != 0:
if is_count:
tmp += count_to_kor1[v]
else:
tmp += num_to_kor1[v]
tmp += num_to_kor3[(size - i) % 4]
if (size - i) % 4 == 0 and len(tmp) != 0:
kor += "".join(tmp)
tmp = []
kor += num_to_kor2[int((size - i) / 4)]
if is_count:
if kor.startswith("한") and len(kor) > 1:
kor = kor[1:]
if any(word in kor for word in count_tenth_dict):
kor = re.sub(
'|'.join(count_tenth_dict.keys()),
lambda x: count_tenth_dict[x.group()], kor)
if not is_count and kor.startswith("일") and len(kor) > 1:
kor = kor[1:]
if float_str is not None:
kor += "쩜 "
kor += re.sub('\d', lambda x: num_to_kor[x.group()], float_str)
if num_str.startswith("+"):
kor = "플러스 " + kor
elif num_str.startswith("-"):
kor = "마이너스 " + kor
return kor + unit_str
if __name__ == "__main__":
def test_normalize(text):
print(text)
print(normalize(text))
print("="*30)
test_normalize("JTBC는 JTBCs를 DY는 A가 Absolute")
test_normalize("오늘(13일) 3,600마리 강아지가")
test_normalize("60.3%")
test_normalize('"저돌"(猪突) 입니다.')
test_normalize('비대위원장이 지난 1월 이런 말을 했습니다. “난 그냥 산돼지처럼 돌파하는 스타일이다”')
test_normalize("지금은 -12.35%였고 종류는 5가지와 19가지, 그리고 55가지였다")
test_normalize("JTBC는 TH와 K 양이 2017년 9월 12일 오후 12시에 24살이 된다")
print(list(hangul_to_jamo(list(hangul_to_jamo('비대위원장이 지난 1월 이런 말을 했습니다? “난 그냥 산돼지처럼 돌파하는 스타일이다”')))))
================================================
FILE: text/symbols.py
================================================
# coding: utf-8
'''
Defines the set of symbols used in text input to the model.
The default is a set of ASCII characters that works well for English or text that has been run
through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details.
'''
from jamo import h2j, j2h
from jamo.jamo import _jamo_char_to_hcj
from .korean import ALL_SYMBOLS, PAD, EOS
# For english
en_symbols = PAD+EOS+'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'(),-.:;? ' #<-For deployment(Because korean ALL_SYMBOLS follow this convention)
symbols = ALL_SYMBOLS # for korean
"""
Lead consonants (choseong) and tail consonants (jongseong) look identical but are different characters.
'_~ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ!'(),-.:;? '
'_': 0, '~': 1, 'ᄀ': 2, 'ᄁ': 3, 'ᄂ': 4, 'ᄃ': 5, 'ᄄ': 6, 'ᄅ': 7, 'ᄆ': 8, 'ᄇ': 9, 'ᄈ': 10,
'ᄉ': 11, 'ᄊ': 12, 'ᄋ': 13, 'ᄌ': 14, 'ᄍ': 15, 'ᄎ': 16, 'ᄏ': 17, 'ᄐ': 18, 'ᄑ': 19, 'ᄒ': 20,
'ᅡ': 21, 'ᅢ': 22, 'ᅣ': 23, 'ᅤ': 24, 'ᅥ': 25, 'ᅦ': 26, 'ᅧ': 27, 'ᅨ': 28, 'ᅩ': 29, 'ᅪ': 30,
'ᅫ': 31, 'ᅬ': 32, 'ᅭ': 33, 'ᅮ': 34, 'ᅯ': 35, 'ᅰ': 36, 'ᅱ': 37, 'ᅲ': 38, 'ᅳ': 39, 'ᅴ': 40,
'ᅵ': 41, 'ᆨ': 42, 'ᆩ': 43, 'ᆪ': 44, 'ᆫ': 45, 'ᆬ': 46, 'ᆭ': 47, 'ᆮ': 48, 'ᆯ': 49, 'ᆰ': 50,
'ᆱ': 51, 'ᆲ': 52, 'ᆳ': 53, 'ᆴ': 54, 'ᆵ': 55, 'ᆶ': 56, 'ᆷ': 57, 'ᆸ': 58, 'ᆹ': 59, 'ᆺ': 60,
'ᆻ': 61, 'ᆼ': 62, 'ᆽ': 63, 'ᆾ': 64, 'ᆿ': 65, 'ᇀ': 66, 'ᇁ': 67, 'ᇂ': 68, '!': 69, "'": 70,
'(': 71, ')': 72, ',': 73, '-': 74, '.': 75, ':': 76, ';': 77, '?': 78, ' ': 79
"""
================================================
FILE: train_tacotron2.py
================================================
# coding: utf-8
import os
import time
import math
import argparse
import traceback
import subprocess
import numpy as np
from jamo import h2j
import tensorflow as tf
from datetime import datetime
from functools import partial
from hparams import hparams, hparams_debug_string
from tacotron2 import create_model, get_most_recent_checkpoint
from utils import ValueWindow, prepare_dirs
from utils import infolog, warning, plot, load_hparams
from utils import get_git_revision_hash, get_git_diff, str2bool, parallel_run
from utils.audio import save_wav, inv_spectrogram
from text import sequence_to_text, text_to_sequence
from datasets.datafeeder_tacotron2 import DataFeederTacotron2
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
tf.logging.set_verbosity(tf.logging.ERROR)
log = infolog.log
def get_git_commit():
subprocess.check_output(['git', 'diff-index', '--quiet', 'HEAD']) # Verify client is clean
commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()[:10]
log('Git commit: %s' % commit)
return commit
def add_stats(model, model2=None, scope_name='train'):
with tf.variable_scope(scope_name) as scope:
summaries = [
tf.summary.scalar('loss_mel', model.mel_loss),
tf.summary.scalar('loss_linear', model.linear_loss),
tf.summary.scalar('loss', model.loss_without_coeff),
]
if scope_name == 'train':
gradient_norms = [tf.norm(grad) for grad in model.gradients if grad is not None]
summaries.extend([
tf.summary.scalar('learning_rate', model.learning_rate),
tf.summary.scalar('max_gradient_norm', tf.reduce_max(gradient_norms)),
])
if model2 is not None:
with tf.variable_scope('gap_test-train') as scope:
summaries.extend([
tf.summary.scalar('loss_mel',
model.mel_loss - model2.mel_loss),
tf.summary.scalar('loss_linear',
model.linear_loss - model2.linear_loss),
tf.summary.scalar('loss',
model.loss_without_coeff - model2.loss_without_coeff),
])
return tf.summary.merge(summaries)
def save_and_plot_fn(args, log_dir, step, loss, prefix):
idx, (seq, spec, align) = args
audio_path = os.path.join(log_dir, '{}-step-{:09d}-audio{:03d}.wav'.format(prefix, step, idx))
align_path = os.path.join(log_dir, '{}-step-{:09d}-align{:03d}.png'.format(prefix, step, idx))
waveform = inv_spectrogram(spec.T,hparams)
save_wav(waveform, audio_path,hparams.sample_rate)
info_text = 'step={:d}, loss={:.5f}'.format(step, loss)
if 'korean_cleaners' in [x.strip() for x in hparams.cleaners.split(',')]:
log('Training korean : Use jamo')
plot.plot_alignment( align, align_path, info=info_text, text=sequence_to_text(seq,skip_eos_and_pad=True, combine_jamo=True), isKorean=True)
else:
log('Training non-korean : X use jamo')
plot.plot_alignment(align, align_path, info=info_text,text=sequence_to_text(seq,skip_eos_and_pad=True, combine_jamo=False), isKorean=False)
def save_and_plot(sequences, spectrograms,alignments, log_dir, step, loss, prefix):
fn = partial(save_and_plot_fn,log_dir=log_dir, step=step, loss=loss, prefix=prefix)
items = list(enumerate(zip(sequences, spectrograms, alignments)))
parallel_run(fn, items, parallel=False)
log('Test finished for step {}.'.format(step))
def train(log_dir, config):
config.data_paths = config.data_paths # ['datasets/moon']
data_dirs = config.data_paths # ['datasets/moon\\data']
num_speakers = len(data_dirs)
config.num_test = config.num_test_per_speaker * num_speakers # 2*1
if num_speakers > 1 and hparams.model_type not in ["multi-speaker", "simple"]:
raise Exception("[!] Unkown model_type for multi-speaker: {}".format(config.model_type))
commit = get_git_commit() if config.git else 'None'
checkpoint_path = os.path.join(log_dir, 'model.ckpt') # 'logdir-tacotron\\moon_2018-08-28_13-06-42\\model.ckpt'
#log(' [*] git rev-parse HEAD:\n%s' % get_git_revision_hash()) # hccho: commented out
log('='*50)
#log(' [*] git diff:\n%s' % get_git_diff())
log('='*50)
log(' [*] Checkpoint path: %s' % checkpoint_path)
log(' [*] Loading training data from: %s' % data_dirs)
log(' [*] Using model: %s' % config.model_dir) # 'logdir-tacotron\\moon_2018-08-28_13-06-42'
log(hparams_debug_string())
# Set up DataFeeder:
coord = tf.train.Coordinator()
with tf.variable_scope('datafeeder') as scope:
# The DataFeeder's 6 placeholders: train_feeder.inputs, train_feeder.input_lengths, train_feeder.loss_coeff, train_feeder.mel_targets, train_feeder.linear_targets, train_feeder.speaker_id
train_feeder = DataFeederTacotron2(coord, data_dirs, hparams, config, 32,data_type='train', batch_size=config.batch_size)
test_feeder = DataFeederTacotron2(coord, data_dirs, hparams, config, 8, data_type='test', batch_size=config.num_test)
# Set up model:
global_step = tf.Variable(0, name='global_step', trainable=False)
with tf.variable_scope('model') as scope:
model = create_model(hparams)
model.initialize(inputs=train_feeder.inputs, input_lengths=train_feeder.input_lengths,num_speakers=num_speakers,speaker_id=train_feeder.speaker_id,
mel_targets=train_feeder.mel_targets, linear_targets=train_feeder.linear_targets,is_training=True,
loss_coeff=train_feeder.loss_coeff,stop_token_targets=train_feeder.stop_token_targets)
model.add_loss()
model.add_optimizer(global_step)
train_stats = add_stats(model, scope_name='train') # legacy
with tf.variable_scope('model', reuse=True) as scope:
test_model = create_model(hparams)
test_model.initialize(inputs=test_feeder.inputs, input_lengths=test_feeder.input_lengths,num_speakers=num_speakers,speaker_id=test_feeder.speaker_id,
mel_targets=test_feeder.mel_targets, linear_targets=test_feeder.linear_targets,is_training=False,
loss_coeff=test_feeder.loss_coeff,stop_token_targets=test_feeder.stop_token_targets)
test_model.add_loss()
# Bookkeeping:
step = 0
time_window = ValueWindow(100)
loss_window = ValueWindow(100)
saver = tf.train.Saver(max_to_keep=None, keep_checkpoint_every_n_hours=2)
sess_config = tf.ConfigProto(log_device_placement=False,allow_soft_placement=True)
sess_config.gpu_options.allow_growth=True
# Train!
#with tf.Session(config=sess_config) as sess:
with tf.Session() as sess:
try:
summary_writer = tf.summary.FileWriter(log_dir, sess.graph)
sess.run(tf.global_variables_initializer())
if config.load_path:
# Restore from a checkpoint if the user requested it.
restore_path = get_most_recent_checkpoint(config.model_dir)
saver.restore(sess, restore_path)
log('Resuming from checkpoint: %s at commit: %s' % (restore_path, commit), slack=True)
elif config.initialize_path:
restore_path = get_most_recent_checkpoint(config.initialize_path)
saver.restore(sess, restore_path)
log('Initialized from checkpoint: %s at commit: %s' % (restore_path, commit), slack=True)
zero_step_assign = tf.assign(global_step, 0)
sess.run(zero_step_assign)
start_step = sess.run(global_step)
log('='*50)
log(' [*] Global step is reset to {}'.format(start_step))
log('='*50)
else:
log('Starting new training run at commit: %s' % commit, slack=True)
start_step = sess.run(global_step)
train_feeder.start_in_session(sess, start_step)
test_feeder.start_in_session(sess, start_step)
while not coord.should_stop():
start_time = time.time()
step, loss, opt = sess.run([global_step, model.loss_without_coeff, model.optimize])
time_window.append(time.time() - start_time)
loss_window.append(loss)
message = 'Step %-7d [%.03f sec/step, loss=%.05f, avg_loss=%.05f]' % (step, time_window.average, loss, loss_window.average)
log(message, slack=(step % config.checkpoint_interval == 0))
if loss > 100 or math.isnan(loss):
log('Loss exploded to %.05f at step %d!' % (loss, step), slack=True)
raise Exception('Loss Exploded')
if step % config.summary_interval == 0:
log('Writing summary at step: %d' % step)
summary_writer.add_summary(sess.run( train_stats), step)
if step % config.checkpoint_interval == 0:
log('Saving checkpoint to: %s-%d' % (checkpoint_path, step))
saver.save(sess, checkpoint_path, global_step=step)
if step % config.test_interval == 0:
log('Saving audio and alignment...')
num_test = config.num_test
fetches = [
model.inputs[:num_test],
model.linear_outputs[:num_test],
model.alignments[:num_test],
test_model.inputs[:num_test],
test_model.linear_outputs[:num_test],
test_model.alignments[:num_test],
]
sequences, spectrograms, alignments, test_sequences, test_spectrograms, test_alignments = sess.run(fetches)
# librosa requires ffmpeg.
save_and_plot(sequences[:1], spectrograms[:1], alignments[:1], log_dir, step, loss, "train") # spectrograms: (num_test,200,1025), alignments: (num_test,encoder_length,decoder_length)
save_and_plot(test_sequences, test_spectrograms, test_alignments, log_dir, step, loss, "test")
except Exception as e:
log('Exiting due to exception: %s' % e, slack=True)
traceback.print_exc()
coord.request_stop(e)
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--log_dir', default='logdir-tacotron2')
parser.add_argument('--data_paths', default='D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon,D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\son')
#parser.add_argument('--data_paths', default='D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\small1,D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\small2')
#parser.add_argument('--load_path', default=None) # takes precedence over 'initialize_path' below
parser.add_argument('--load_path', default='logdir-tacotron2/moon+son_2019-03-01_10-35-44')
parser.add_argument('--initialize_path', default=None) # restore the model from a checkpoint, but start global_step from 0
parser.add_argument('--batch_size', type=int, default=32)
parser.add_argument('--num_test_per_speaker', type=int, default=2)
parser.add_argument('--random_seed', type=int, default=123)
parser.add_argument('--summary_interval', type=int, default=100)
parser.add_argument('--test_interval', type=int, default=500) # 500
parser.add_argument('--checkpoint_interval', type=int, default=2000) # 2000
parser.add_argument('--skip_path_filter', type=str2bool, default=False, help='Use only for debugging')
parser.add_argument('--slack_url', help='Slack webhook URL to get periodic reports.')
parser.add_argument('--git', action='store_true', help='If set, verify that the client is clean.') # The store_true option automatically creates a default value of False.
config = parser.parse_args()
config.data_paths = config.data_paths.split(",")
setattr(hparams, "num_speakers", len(config.data_paths))
prepare_dirs(config, hparams)
log_path = os.path.join(config.model_dir, 'train.log')
infolog.init(log_path, config.model_dir, config.slack_url)
tf.set_random_seed(config.random_seed)
print(config.data_paths)
if config.load_path is not None and config.initialize_path is not None:
raise Exception(" [!] Only one of load_path and initialize_path should be set")
train(config.model_dir, config)
if __name__ == '__main__':
main()
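# Hedged usage sketch (paths are placeholders, not verified defaults):
#   python train_tacotron2.py --data_paths=datasets/moon,datasets/son --batch_size=32
# Passing several comma-separated data paths trains the multi-speaker model;
# num_speakers is set from the number of paths.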
================================================
FILE: train_vocoder.py
================================================
# coding: utf-8
"""
- Training data is given as one directory per speaker, and a speaker id is assigned per directory.
- Speaker ids are not inferred from file names.
"""
from __future__ import print_function
import argparse
import numpy as np
import os
import time
import traceback
from glob import glob
import tensorflow as tf
from tensorflow.python.client import timeline
from datetime import datetime
from wavenet import WaveNetModel,mu_law_decode
from datasets import DataFeederWavenet
from hparams import hparams
from utils import validate_directories,load,save,infolog,get_tensors_in_checkpoint_file,build_tensors_in_checkpoint_file,plot,audio
tf.logging.set_verbosity(tf.logging.ERROR)
EPSILON = 0.001
log = infolog.log
def eval_step(sess,logdir,step,waveform,upsampled_local_condition_data,speaker_id_data,mel_input_data,samples,speaker_id,upsampled_local_condition,next_sample,temperature=1.0):
waveform = waveform[:,:1]
sample_size = upsampled_local_condition_data.shape[1]
last_sample_timestamp = datetime.now()
start_time = time.time()
for step2 in range(sample_size): # loop sample_size times to generate the desired output length
window = waveform[:,-1:] # feed only the last sample into samples. window: shape (N,1)
prediction = sess.run(next_sample, feed_dict={samples: window,upsampled_local_condition: upsampled_local_condition_data[:,step2,:],speaker_id: speaker_id_data })
if hparams.scalar_input:
sample = prediction # sampled from a logistic distribution, so there is randomness.
else:
# Scale prediction distribution using temperature.
# If config.temperature == 1, the following just divides each element by their sum; softmax has already been applied, so the sum is 1 and the values are unchanged.
# If config.temperature != 1, divide the log of each element by the temperature, then rescale so the result sums to 1.
np.seterr(divide='ignore')
scaled_prediction = np.log(prediction) / temperature # no change when temperature == 1
scaled_prediction = (scaled_prediction - np.logaddexp.reduce(scaled_prediction,axis=-1,keepdims=True)) # np.log(np.sum(np.exp(scaled_prediction)))
scaled_prediction = np.exp(scaled_prediction)
np.seterr(divide='warn')
# Prediction distribution at temperature=1.0 should be unchanged after
# scaling.
if temperature == 1.0:
np.testing.assert_allclose( prediction, scaled_prediction, atol=1e-5, err_msg='Prediction scaling at temperature=1.0 is not working as intended.')
# Because we sample instead of taking the argmax, the same input can produce different outputs.
sample = [[np.random.choice(np.arange(hparams.quantization_channels), p=p)] for p in scaled_prediction] # choose one sample per batch
waveform = np.concatenate([waveform,sample],axis=-1) #window.shape: (N,1)
# Show progress only once per second.
current_sample_timestamp = datetime.now()
time_since_print = current_sample_timestamp - last_sample_timestamp
if time_since_print.total_seconds() > 1.:
duration = time.time() - start_time
print('Sample {:<3d}/{:<3d}, ({:.3f} sec/step)'.format(step2 + 1, sample_size, duration), end='\r')
last_sample_timestamp = current_sample_timestamp
print('\n')
# Save the result as a wav file.
if hparams.input_type == 'raw':
out = waveform[:,1:]
elif hparams.input_type == 'mulaw':
decode = mu_law_decode(samples, hparams.quantization_channels,quantization=False)
out = sess.run(decode, feed_dict={samples: waveform[:,1:]})
else: # 'mulaw-quantize'
decode = mu_law_decode(samples, hparams.quantization_channels,quantization=True)
out = sess.run(decode, feed_dict={samples: waveform[:,1:]})
# save wav
for i in range(1):
wav_out_path= logdir + '/test-{}-{}.wav'.format(step,i)
mel_path = wav_out_path.replace(".wav", ".png")
gen_mel_spectrogram = audio.melspectrogram(out[i], hparams).astype(np.float32).T
audio.save_wav(out[i], wav_out_path, hparams.sample_rate) # save_wav mutates out[i] internally.
plot.plot_spectrogram(gen_mel_spectrogram, mel_path, title='generated mel spectrogram{}'.format(step),target_spectrogram=mel_input_data[i])
def create_network(hp,batch_size,num_speakers,is_training):
net = WaveNetModel(
batch_size=batch_size,
dilations=hp.dilations,
filter_width=hp.filter_width,
residual_channels=hp.residual_channels,
dilation_channels=hp.dilation_channels,
quantization_channels=hp.quantization_channels,
out_channels =hp.out_channels,
skip_channels=hp.skip_channels,
use_biases=hp.use_biases, # True
scalar_input=hp.scalar_input,
global_condition_channels=hp.gc_channels,
global_condition_cardinality=num_speakers,
local_condition_channels=hp.num_mels,
upsample_factor=hp.upsample_factor,
legacy = hp.legacy,
residual_legacy = hp.residual_legacy,
drop_rate = hp.wavenet_dropout,
train_mode=is_training)
return net
def main():
def _str_to_bool(s):
"""Convert string to bool (in argparse context)."""
if s.lower() not in ['true', 'false']:
raise ValueError('Argument needs to be a boolean, got {}'.format(s))
return {'true': True, 'false': False}[s.lower()]
parser = argparse.ArgumentParser(description='WaveNet example network')
DATA_DIRECTORY = 'D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon,D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\son'
#DATA_DIRECTORY = 'D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon'
parser.add_argument('--data_dir', type=str, default=DATA_DIRECTORY, help='The directory containing the VCTK corpus.')
#LOGDIR = None
LOGDIR = './/logdir-wavenet//train//2019-03-27T20-27-18'
parser.add_argument('--logdir', type=str, default=LOGDIR,help='Directory in which to store the logging information for TensorBoard. If the model already exists, it will restore the state and will continue training. Cannot use with --logdir_root and --restore_from.')
parser.add_argument('--logdir_root', type=str, default=None,help='Root directory to place the logging output and generated model. These are stored under the dated subdirectory of --logdir_root. Cannot use with --logdir.')
parser.add_argument('--restore_from', type=str, default=None,help='Directory in which to restore the model from. This creates the new model under the dated directory in --logdir_root. Cannot use with --logdir.')
CHECKPOINT_EVERY = 1000 # checkpoint save interval
parser.add_argument('--checkpoint_every', type=int, default=CHECKPOINT_EVERY,help='How many steps to save each checkpoint after. Default: ' + str(CHECKPOINT_EVERY) + '.')
parser.add_argument('--eval_every', type=int, default=2,help='Steps between eval on test data')
config = parser.parse_args() # options that can be passed on the command line
config.data_dir = config.data_dir.split(",")
try:
directories = validate_directories(config,hparams)
except ValueError as e:
print("Some arguments are wrong:")
print(str(e))
return
logdir = directories['logdir']
restore_from = directories['restore_from']
# Even if we restored the model, we will treat it as new training
# if the trained model is written into an arbitrary location.
is_overwritten_training = logdir != restore_from
log_path = os.path.join(logdir, 'train.log')
infolog.init(log_path, logdir)
global_step = tf.Variable(0, name='global_step', trainable=False)
if hparams.l2_regularization_strength == 0:
hparams.l2_regularization_strength = None
# Create coordinator.
coord = tf.train.Coordinator()
num_speakers = len(config.data_dir)
# Load raw waveform from VCTK corpus.
with tf.name_scope('create_inputs'):
# Allow silence trimming to be skipped by specifying a threshold near
# zero.
silence_threshold = hparams.silence_threshold if hparams.silence_threshold > EPSILON else None
gc_enable = True # before: num_speakers > 1; now: always True
# The AudioReader slices wav files into inputs: the leading receptive_field samples are padded or taken from the previous chunk, and each slice has size (receptive_field + sample_size).
reader = DataFeederWavenet(coord,config.data_dir,batch_size=hparams.wavenet_batch_size,gc_enable= gc_enable,test_mode=False)
# Create a DataFeederWavenet for testing; it fetches exactly one file.
reader_test = DataFeederWavenet(coord,config.data_dir,batch_size=1,gc_enable= gc_enable,test_mode=True,queue_size=1)
audio_batch, lc_batch, gc_id_batch = reader.inputs_wav, reader.local_condition, reader.speaker_id
# Create train network.
net = create_network(hparams,hparams.wavenet_batch_size,num_speakers,is_training=True)
net.add_loss(input_batch=audio_batch,local_condition=lc_batch, global_condition_batch=gc_id_batch, l2_regularization_strength=hparams.l2_regularization_strength,upsample_type=hparams.upsample_type)
net.add_optimizer(hparams,global_step)
run_metadata = tf.RunMetadata()
# Set up session
sess = tf.Session(config=tf.ConfigProto(log_device_placement=False)) # log_device_placement=False --> automatic cpu/gpu placement.
init = tf.global_variables_initializer()
sess.run(init)
# Saver for storing checkpoints of the model.
saver = tf.train.Saver(var_list=tf.global_variables(), max_to_keep=hparams.max_checkpoints) # maximum number of checkpoints to keep
try:
start_step = load(saver, sess, restore_from) # checkpoint load
if is_overwritten_training or start_step is None:
# The first training step will be saved_global_step + 1,
# therefore we put -1 here for new or overwritten trainings.
zero_step_assign = tf.assign(global_step, 0)
sess.run(zero_step_assign)
start_step=0
except:
print("Something went wrong while restoring checkpoint. We will terminate training to avoid accidentally overwriting the previous model.")
raise
###########
reader.start_in_session(sess,start_step)
reader_test.start_in_session(sess,start_step)
################### Create test network. <---- built after the session restore because of queue creation
net_test = create_network(hparams,1,num_speakers,is_training=False)
if hparams.scalar_input:
samples = tf.placeholder(tf.float32,shape=[net_test.batch_size,None])
waveform = 2*np.random.rand(net_test.batch_size).reshape(net_test.batch_size,-1)-1
else:
samples = tf.placeholder(tf.int32,shape=[net_test.batch_size,None]) # samples: mu-law encoded values, before one-hot conversion. shape (batch_size, length)
waveform = np.random.randint(hparams.quantization_channels,size=net_test.batch_size).reshape(net_test.batch_size,-1)
upsampled_local_condition = tf.placeholder(tf.float32,shape=[net_test.batch_size,hparams.num_mels])
speaker_id = tf.placeholder(tf.int32,shape=[net_test.batch_size])
next_sample = net_test.predict_proba_incremental(samples,upsampled_local_condition,speaker_id) # applies the Fast Wavenet Generation Algorithm (arXiv:1611.09482)
sess.run(net_test.queue_initializer)
# There are exactly 3 placeholders for testing: samples, speaker_id, upsampled_local_condition
# Grab one mel-spectrogram for testing. If it were not fixed here, the feeder thread would keep reading new data. reader_test's role ends here.
mel_input_test, speaker_id_test = sess.run([reader_test.local_condition,reader_test.speaker_id])
with tf.variable_scope('wavenet',reuse=tf.AUTO_REUSE):
upsampled_local_condition_data = net_test.create_upsample(mel_input_test,upsample_type=hparams.upsample_type)
upsampled_local_condition_data_ = sess.run(upsampled_local_condition_data) # upsampled_local_condition_data_ is fed to the placeholder upsampled_local_condition via feed_dict.
######################################################
start_step = sess.run(global_step)
step = last_saved_step = start_step
try:
while not coord.should_stop():
start_time = time.time()
if hparams.store_metadata and step % 50 == 0:
# Slow run that stores extra information for debugging.
log('Storing metadata')
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
step, loss_value, _ = sess.run([global_step, net.loss, net.optimize],options=run_options,run_metadata=run_metadata)
tl = timeline.Timeline(run_metadata.step_stats)
timeline_path = os.path.join(logdir, 'timeline.trace')
with open(timeline_path, 'w') as f:
f.write(tl.generate_chrome_trace_format(show_memory=True))
else:
step, loss_value, _ = sess.run([global_step,net.loss, net.optimize])
duration = time.time() - start_time
log('step {:d} - loss = {:.3f}, ({:.3f} sec/step)'.format(step, loss_value, duration))
if step % config.checkpoint_every == 0:
save(saver, sess, logdir, step)
last_saved_step = step
if step % config.eval_every == 0: # config.eval_every
eval_step(sess,logdir,step,waveform,upsampled_local_condition_data_,speaker_id_test,mel_input_test,samples,speaker_id,upsampled_local_condition,next_sample)
if step >= hparams.num_steps:
# An error message is printed, but stopping here is intended.
raise Exception('End xxx~~~yyy')
except Exception as e:
print('finally')
log('Exiting due to exception: %s' % e, slack=True)
#if step > last_saved_step:
# save(saver, sess, logdir, step)
traceback.print_exc()
coord.request_stop(e)
if __name__ == '__main__':
main()
traceback.print_exc()
print('Done')
================================================
FILE: utils/__init__.py
================================================
# -*- coding: utf-8 -*-
import re,json,sys,os
import tensorflow as tf
from tqdm import tqdm
from contextlib import closing
from multiprocessing import Pool
from collections import namedtuple
from datetime import datetime, timedelta
from shutil import copyfile as copy_file
from tensorflow.python import pywrap_tensorflow
PARAMS_NAME = "params.json"
STARTED_DATESTRING = "{0:%Y-%m-%dT%H-%M-%S}".format(datetime.now())
LOGDIR_ROOT_Wavenet = './logdir-wavenet'
class ValueWindow():
def __init__(self, window_size=100):
self._window_size = window_size
self._values = []
def append(self, x):
self._values = self._values[-(self._window_size - 1):] + [x]
@property
def sum(self):
return sum(self._values)
@property
def count(self):
return len(self._values)
@property
def average(self):
return self.sum / max(1, self.count)
def reset(self):
self._values = []
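# Hedged usage sketch: ValueWindow keeps a running average over the last
# window_size values (used as time_window/loss_window in the train scripts above).
#   w = ValueWindow(window_size=3)
#   for x in (1, 2, 3, 4): w.append(x)
#   w.average  # -> 3.0, the mean of the retained values [2, 3, 4]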
def prepare_dirs(config, hparams):
if hasattr(config, "data_paths"):
config.datasets = [os.path.basename(data_path) for data_path in config.data_paths]
dataset_desc = "+".join(config.datasets)
if config.load_path:
config.model_dir = config.load_path
else:
config.model_name = "{}_{}".format(dataset_desc, get_time())
config.model_dir = os.path.join(config.log_dir, config.model_name)
for path in [config.log_dir, config.model_dir]:
if not os.path.exists(path):
os.makedirs(path)
if config.load_path:
load_hparams(hparams, config.model_dir)
else:
setattr(hparams, "num_speakers", len(config.datasets))
save_hparams(config.model_dir, hparams)
copy_file("hparams.py", os.path.join(config.model_dir, "hparams.py"))
def save(saver, sess, logdir, step):
model_name = 'model.ckpt'
checkpoint_path = os.path.join(logdir, model_name)
print('Storing checkpoint to {} ...'.format(logdir), end="")
sys.stdout.flush()
if not os.path.exists(logdir):
os.makedirs(logdir)
saver.save(sess, checkpoint_path, global_step=step)
print(' Done.')
def load(saver, sess, logdir):
print("Trying to restore saved checkpoints from {} ...".format(logdir),end="")
ckpt = tf.train.get_checkpoint_state(logdir)
#c
SYMBOL INDEX (266 symbols across 27 files)
FILE: datasets/datafeeder_tacotron2.py
function get_frame (line 22) | def get_frame(path):
function get_path_dict (line 28) | def get_path_dict(data_dirs, hparams, config,data_type, n_test=None,rng=...
class DataFeederTacotron2 (line 73) | class DataFeederTacotron2(threading.Thread):
method __init__ (line 76) | def __init__(self, coordinator, data_dirs,hparams, config, batches_per...
method start_in_session (line 180) | def start_in_session(self, session, start_step):
method run (line 186) | def run(self):
method _enqueue_next_group (line 195) | def _enqueue_next_group(self):
method _get_next_example (line 229) | def _get_next_example(self, data_dir):
function _prepare_batch (line 276) | def _prepare_batch(batch, reduction_factor, rng, data_type=None):
function _prepare_inputs (line 298) | def _prepare_inputs(inputs): # inputs: batch 길이 만큼의 list
function _prepare_targets (line 307) | def _prepare_targets(targets, alignment):
function _prepare_stop_token_targets (line 312) | def _prepare_stop_token_targets(targets, alignment):
function _pad_input (line 317) | def _pad_input(x, length):
function _pad_target (line 321) | def _pad_target(t, length):
function _pad_stop_token_target (line 326) | def _pad_stop_token_target(t, length):
function _round_up (line 329) | def _round_up(x, multiple):
FILE: datasets/datafeeder_wavenet.py
function get_path_dict (line 16) | def get_path_dict(data_dirs, min_length):
function assert_ready_for_upsampling (line 38) | def assert_ready_for_upsampling(x, c,hop_size):
function ensure_divisible (line 41) | def ensure_divisible(length, divisible_by=256, lower=True):
class DataFeederWavenet (line 50) | class DataFeederWavenet(threading.Thread):
method __init__ (line 51) | def __init__(self,coord,data_dirs,batch_size, gc_enable=False,test_mod...
method run (line 96) | def run(self):
method start_in_session (line 102) | def start_in_session(self, session,start_step):
method make_batches (line 107) | def make_batches(self):
method _get_next_example (line 122) | def _get_next_example(self, data_dir):
function _prepare_batch (line 163) | def _prepare_batch(batch):
FILE: datasets/moon.py
function build_from_path (line 11) | def build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda...
function _process_utterance (line 52) | def _process_utterance(out_dir, wav_path, text, hparams):
FILE: datasets/son.py
function build_from_path (line 11) | def build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda...
function _process_utterance (line 47) | def _process_utterance(out_dir, wav_path, text, hparams):
FILE: generate.py
function _interp (line 37) | def _interp(feats, in_range):
function get_arguments (line 42) | def get_arguments():
function create_seed (line 92) | def create_seed(filename,sample_rate,quantization_channels,window_size,s...
function main (line 110) | def main():
FILE: hparams.py
function hparams_debug_string (line 233) | def hparams_debug_string():
FILE: preprocess.py
function preprocess (line 19) | def preprocess(mod, in_dir, out_root,num_workers):
function write_metadata (line 25) | def write_metadata(metadata, out_dir):
FILE: synthesizer.py
class Synthesizer (line 39) | class Synthesizer(object):
method close (line 40) | def close(self):
method load (line 44) | def load(self, checkpoint_path, num_speakers=2, checkpoint_step=None, ...
method synthesize (line 83) | def synthesize(self,
function plot_graph_and_save_audio (line 161) | def plot_graph_and_save_audio(args,
function get_most_recent_checkpoint (line 246) | def get_most_recent_checkpoint(checkpoint_dir, checkpoint_step=None):
function short_concat (line 258) | def short_concat(
FILE: tacotron2/__init__.py
function create_model (line 7) | def create_model(hparams):
function get_most_recent_checkpoint (line 11) | def get_most_recent_checkpoint(checkpoint_dir):
FILE: tacotron2/helpers.py
class TacoTestHelper (line 10) | class TacoTestHelper(Helper):
method __init__ (line 11) | def __init__(self, batch_size, output_dim, r):
method batch_size (line 18) | def batch_size(self):
method sample_ids_dtype (line 22) | def sample_ids_dtype(self):
method sample_ids_shape (line 26) | def sample_ids_shape(self):
method initialize (line 29) | def initialize(self, name=None):
method sample (line 32) | def sample(self, time, outputs, state, name=None):
method next_inputs (line 35) | def next_inputs(self, time, outputs, state, sample_ids, name=None):
class TacoTrainingHelper (line 46) | class TacoTrainingHelper(Helper):
method __init__ (line 47) | def __init__(self, targets, output_dim, r):
method batch_size (line 63) | def batch_size(self):
method sample_ids_dtype (line 67) | def sample_ids_dtype(self):
method sample_ids_shape (line 71) | def sample_ids_shape(self):
method initialize (line 75) | def initialize(self, name=None):
method sample (line 78) | def sample(self, time, outputs, state, name=None):
method next_inputs (line 81) | def next_inputs(self, time, outputs, state, sample_ids, name=None): #...
function _go_frames (line 90) | def _go_frames(batch_size, output_dim):
FILE: tacotron2/modules.py
function prenet (line 10) | def prenet(inputs, is_training, layer_sizes, drop_prob, scope=None):
function cbhg (line 21) | def cbhg(inputs, input_lengths, is_training, bank_size, bank_channel_siz...
function batch_tile (line 73) | def batch_tile(tensor, batch_size):
function highwaynet (line 79) | def highwaynet(inputs, scope):
function conv1d (line 88) | def conv1d(inputs, kernel_size, channels, activation, is_training, scope):
FILE: tacotron2/rnn_wrappers.py
class ZoneoutLSTMCell (line 17) | class ZoneoutLSTMCell(RNNCell):
method __init__ (line 27) | def __init__(self, num_units, is_training, zoneout_factor_cell=0., zon...
method state_size (line 43) | def state_size(self):
method output_size (line 47) | def output_size(self):
method __call__ (line 50) | def __call__(self, inputs, state, scope=None):
class DecoderWrapper (line 80) | class DecoderWrapper(RNNCell):
method __init__ (line 83) | def __init__(self, cell, is_training, prenet_sizes, dropout_prob,infer...
method state_size (line 96) | def state_size(self):
method output_size (line 100) | def output_size(self):
method call (line 103) | def call(self, inputs, state):
method zero_state (line 110) | def zero_state(self, batch_size, dtype):
class LocationSensitiveAttention (line 115) | class LocationSensitiveAttention(BahdanauAttention):
method __init__ (line 131) | def __init__(self,
method __call__ (line 188) | def __call__(self, query, state):
function _location_sensitive_score (line 248) | def _location_sensitive_score(W_query, W_fil, W_keys):
function _smoothing_normalization (line 283) | def _smoothing_normalization(e):
class GmmAttention (line 304) | class GmmAttention(AttentionMechanism):
method __init__ (line 305) | def __init__(self,
method values (line 332) | def values(self):
method batch_size (line 336) | def batch_size(self):
method alignments_size (line 340) | def alignments_size(self):
method state_size (line 344) | def state_size(self):
method initial_alignments (line 347) | def initial_alignments(self, batch_size, dtype):
method initial_state (line 351) | def initial_state(self, batch_size, dtype):
method __call__ (line 355) | def __call__(self, query, state):
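The ZoneoutLSTMCell listed above applies zoneout (Krueger et al., 2016): during training, each hidden/cell unit keeps its previous value with some probability instead of having its activation dropped. A minimal NumPy sketch of that update rule, with illustrative names rather than the repo's exact variables:

import numpy as np

def zoneout(prev_state, new_state, rate, is_training, rng=np.random):
    if is_training:
        # Keep each unit's previous value with probability `rate`.
        keep_prev = rng.binomial(1, rate, size=prev_state.shape)
        return keep_prev * prev_state + (1 - keep_prev) * new_state
    # At inference, use the expected value of the stochastic update.
    return rate * prev_state + (1 - rate) * new_state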
FILE: tacotron2/tacotron2.py
class Tacotron2 (line 31) | class Tacotron2():
method __init__ (line 32) | def __init__(self, hparams):
method initialize (line 36) | def initialize(self, inputs, input_lengths, num_speakers, speaker_id=N...
method add_loss (line 234) | def add_loss(self):
method add_optimizer (line 272) | def add_optimizer(self, global_step):
method _learning_rate_decay (line 302) | def _learning_rate_decay(self, init_lr, global_step):
FILE: text/__init__.py
function convert_to_en_symbols (line 24) | def convert_to_en_symbols():
function remove_puncuations (line 35) | def remove_puncuations(text):
function text_to_sequence (line 38) | def text_to_sequence(text, as_token=False):
function _text_to_sequence (line 44) | def _text_to_sequence(text, cleaner_names, as_token):
function sequence_to_text (line 78) | def sequence_to_text(sequence, skip_eos_and_pad=False, combine_jamo=False):
function _clean_text (line 104) | def _clean_text(text, cleaner_names):
function _symbols_to_sequence (line 114) | def _symbols_to_sequence(symbols):
function _arpabet_to_sequence (line 118) | def _arpabet_to_sequence(text):
function _should_keep_symbol (line 122) | def _should_keep_symbol(s):
FILE: text/cleaners.py
function korean_cleaners (line 27) | def korean_cleaners(text):
function expand_abbreviations (line 56) | def expand_abbreviations(text):
function expand_numbers (line 62) | def expand_numbers(text):
function lowercase (line 66) | def lowercase(text):
function collapse_whitespace (line 70) | def collapse_whitespace(text):
function convert_to_ascii (line 73) | def convert_to_ascii(text):
function basic_cleaners (line 78) | def basic_cleaners(text):
function transliteration_cleaners (line 85) | def transliteration_cleaners(text):
function english_cleaners (line 93) | def english_cleaners(text):
FILE: text/en_numbers.py
function _remove_commas (line 14) | def _remove_commas(m):
function _expand_decimal_point (line 18) | def _expand_decimal_point(m):
function _expand_dollars (line 22) | def _expand_dollars(m):
function _expand_ordinal (line 43) | def _expand_ordinal(m):
function _expand_number (line 47) | def _expand_number(m):
function normalize_numbers (line 62) | def normalize_numbers(text):
FILE: text/english.py
function _remove_commas (line 14) | def _remove_commas(m):
function _expand_decimal_point (line 18) | def _expand_decimal_point(m):
function _expand_dollars (line 22) | def _expand_dollars(m):
function _expand_ordinal (line 43) | def _expand_ordinal(m):
function _expand_number (line 47) | def _expand_number(m):
function normalize (line 62) | def normalize(text):
FILE: text/korean.py
function is_lead (line 29) | def is_lead(char):
function is_vowel (line 32) | def is_vowel(char):
function is_tail (line 35) | def is_tail(char):
function get_mode (line 38) | def get_mode(char):
function _get_text_from_candidates (line 48) | def _get_text_from_candidates(candidates):
function jamo_to_korean (line 56) | def jamo_to_korean(text):
function compare_sentence_with_jamo (line 137) | def compare_sentence_with_jamo(text1, text2):
function tokenize (line 140) | def tokenize(text, as_id=False):
function tokenizer_fn (line 150) | def tokenizer_fn(iterator):
function normalize (line 153) | def normalize(text):
function normalize_with_dictionary (line 168) | def normalize_with_dictionary(text, dic):
function normalize_english (line 175) | def normalize_english(text):
function normalize_upper (line 186) | def normalize_upper(text):
function normalize_quote (line 194) | def normalize_quote(text):
function normalize_number (line 209) | def normalize_number(text):
function number_to_korean (line 239) | def number_to_korean(num_str, is_count=False):
function test_normalize (line 311) | def test_normalize(text):
FILE: train_tacotron2.py
function get_git_commit (line 32) | def get_git_commit():
function add_stats (line 39) | def add_stats(model, model2=None, scope_name='train'):
function save_and_plot_fn (line 69) | def save_and_plot_fn(args, log_dir, step, loss, prefix):
function save_and_plot (line 86) | def save_and_plot(sequences, spectrograms,alignments, log_dir, step, los...
function train (line 95) | def train(log_dir, config):
function main (line 239) | def main():
FILE: train_vocoder.py
function eval_step (line 28) | def eval_step(sess,logdir,step,waveform,upsampled_local_condition_data,s...
function create_network (line 94) | def create_network(hp,batch_size,num_speakers,is_training):
function main (line 116) | def main():
FILE: utils/__init__.py
class ValueWindow (line 17) | class ValueWindow():
method __init__ (line 18) | def __init__(self, window_size=100):
method append (line 22) | def append(self, x):
method sum (line 26) | def sum(self):
method count (line 30) | def count(self):
method average (line 34) | def average(self):
method reset (line 37) | def reset(self):
function prepare_dirs (line 39) | def prepare_dirs(config, hparams):
function save (line 62) | def save(saver, sess, logdir, step):
function load (line 75) | def load(saver, sess, logdir):
function get_default_logdir (line 93) | def get_default_logdir(logdir_root):
function validate_directories (line 100) | def validate_directories(args,hparams):
function save_hparams (line 143) | def save_hparams(model_dir, hparams):
function write_json (line 152) | def write_json(path, data):
function load_hparams (line 156) | def load_hparams(hparams, load_path, skip_list=[]):
function load_json (line 173) | def load_json(path, as_class=False, encoding='euc-kr'):
function get_most_recent_checkpoint (line 186) | def get_most_recent_checkpoint(checkpoint_dir):
function add_prefix (line 197) | def add_prefix(path, prefix):
function add_postfix (line 201) | def add_postfix(path, postfix):
function remove_postfix (line 205) | def remove_postfix(path):
function get_time (line 209) | def get_time():
function parallel_run (line 212) | def parallel_run(fn, items, desc="", parallel=True):
function makedirs (line 227) | def makedirs(path):
function str2bool (line 232) | def str2bool(v):
function remove_file (line 235) | def remove_file(path):
function get_git_revision_hash (line 240) | def get_git_revision_hash():
function get_git_diff (line 242) | def get_git_diff():
function warning (line 245) | def warning(msg):
function get_tensors_in_checkpoint_file (line 251) | def get_tensors_in_checkpoint_file(file_name,all_tensors=True,tensor_nam...
function build_tensors_in_checkpoint_file (line 269) | def build_tensors_in_checkpoint_file(loaded_tensors):
FILE: utils/audio.py
function load_wav (line 11) | def load_wav(path, sr):
function save_wav (line 14) | def save_wav(wav, path, sr):
function save_wavenet_wav (line 19) | def save_wavenet_wav(wav, path, sr):
function preemphasis (line 22) | def preemphasis(wav, k, preemphasize=True):
function inv_preemphasis (line 27) | def inv_preemphasis(wav, k, inv_preemphasize=True):
function start_and_end_indices (line 33) | def start_and_end_indices(quantized, silence_threshold=2):
function trim_silence (line 46) | def trim_silence(wav, hparams):
function get_hop_size (line 54) | def get_hop_size(hparams):
function linearspectrogram (line 61) | def linearspectrogram(wav, hparams):
function melspectrogram (line 69) | def melspectrogram(wav, hparams):
function inv_linear_spectrogram (line 77) | def inv_linear_spectrogram(linear_spectrogram, hparams):
function inv_mel_spectrogram (line 95) | def inv_mel_spectrogram(mel_spectrogram, hparams):
function inv_spectrogram_tensorflow (line 112) | def inv_spectrogram_tensorflow(spectrogram,hparams):
function inv_spectrogram (line 117) | def inv_spectrogram(spectrogram,hparams):
function _lws_processor (line 123) | def _lws_processor(hparams):
function _griffin_lim (line 127) | def _griffin_lim(S, hparams):
function _stft (line 139) | def _stft(y, hparams):
function _istft (line 145) | def _istft(y, hparams):
function num_frames (line 150) | def num_frames(length, fsize, fshift):
function pad_lr (line 161) | def pad_lr(x, fsize, fshift):
function librosa_pad_lr (line 171) | def librosa_pad_lr(x, fsize, fshift):
function _linear_to_mel (line 181) | def _linear_to_mel(spectogram, hparams):
function _mel_to_linear (line 187) | def _mel_to_linear(mel_spectrogram, hparams):
function _build_mel_basis (line 193) | def _build_mel_basis(hparams):
function _amp_to_db (line 201) | def _amp_to_db(x, hparams):
function _db_to_amp (line 205) | def _db_to_amp(x):
function _normalize (line 208) | def _normalize(S, hparams):
function _denormalize (line 222) | def _denormalize(D, hparams):
function mulaw (line 244) | def mulaw(x, mu=256):
function inv_mulaw (line 265) | def inv_mulaw(y, mu=256):
function mulaw_quantize (line 283) | def mulaw_quantize(x, mu=256):
function inv_mulaw_quantize (line 317) | def inv_mulaw_quantize(y, mu=256):
function _sign (line 343) | def _sign(x):
function _log1p (line 350) | def _log1p(x):
function _abs (line 357) | def _abs(x):
function _asint (line 364) | def _asint(x):
function _asfloat (line 371) | def _asfloat(x):
function frames_to_hours (line 377) | def frames_to_hours(n_frames,hparams):
function get_duration (line 380) | def get_duration(audio,hparams):
function _db_to_amp_tensorflow (line 383) | def _db_to_amp_tensorflow(x):
function _denormalize_tensorflow (line 386) | def _denormalize_tensorflow(S,hparams):
function _griffin_lim_tensorflow (line 389) | def _griffin_lim_tensorflow(S,hparams):
function _istft_tensorflow (line 400) | def _istft_tensorflow(stfts,hparams):
function _stft_tensorflow (line 404) | def _stft_tensorflow(signals,hparams):
function _stft_parameters (line 408) | def _stft_parameters(hparams):
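The _griffin_lim entry above is Griffin-Lim phase reconstruction: starting from random phase, it alternates between the time domain and the fixed magnitude spectrogram until the phase becomes consistent. An illustrative librosa-based sketch; the n_fft, hop_length, and iteration count here are assumptions, not this repo's hparams values:

import numpy as np
import librosa

def griffin_lim(S, n_fft=2048, hop_length=256, n_iters=60):
    # S: magnitude spectrogram of shape (1 + n_fft // 2, frames).
    angles = np.exp(2j * np.pi * np.random.rand(*S.shape))  # random initial phase
    for _ in range(n_iters):
        y = librosa.istft(S * angles, hop_length=hop_length)
        angles = np.exp(1j * np.angle(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)))
    return librosa.istft(S * angles, hop_length=hop_length)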
FILE: utils/infolog.py
function init (line 14) | def init(filename, run_name, slack_url=None):
function log (line 25) | def log(msg, slack=False):
function _close_logfile (line 33) | def _close_logfile():
function _send_slack (line 40) | def _send_slack(msg):
FILE: utils/plot.py
function plot (line 25) | def plot(alignment, info, text, isKorean=True):
function plot_alignment (line 64) | def plot_alignment(
function plot_spectrogram (line 79) | def plot_spectrogram(pred_spectrogram, path, title=None, split_title=Fal...
FILE: wavenet/mixture.py
function log_sum_exp (line 12) | def log_sum_exp(x):
function log_prob_from_logits (line 20) | def log_prob_from_logits(x):
function discretized_mix_logistic_loss (line 27) | def discretized_mix_logistic_loss(y_hat, y, num_class=256, log_scale_min...
function sample_from_discretized_mix_logistic (line 84) | def sample_from_discretized_mix_logistic(y, log_scale_min=float(np.log(1...
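The log_sum_exp entry above is the standard numerically stable log-sum-exp used inside the discretized-mixture loss: subtracting the maximum before exponentiating prevents overflow. A NumPy version for reference (the repo's version operates on TensorFlow tensors):

import numpy as np

def log_sum_exp(x, axis=-1):
    m = np.max(x, axis=axis, keepdims=True)  # subtract the max so exp() cannot overflow
    return np.squeeze(m, axis) + np.log(np.sum(np.exp(x - m), axis=axis))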
FILE: wavenet/model.py
class WaveNetModel (line 7) | class WaveNetModel(object):
method __init__ (line 8) | def __init__(self,batch_size,dilations,filter_width,residual_channels,...
method calculate_receptive_field (line 35) | def calculate_receptive_field(filter_width, dilations):
method _create_causal_layer (line 40) | def _create_causal_layer(self, input_batch):
method _create_queue (line 49) | def _create_queue(self):
method _create_dilation_layer (line 60) | def _create_dilation_layer(self, input_batch, layer_index, dilation,lo...
method create_upsample (line 123) | def create_upsample(self, local_condition_batch,upsample_type='SubPixe...
method _create_network (line 147) | def _create_network(self, input_batch,local_condition_batch, global_co...
method _one_hot (line 192) | def _one_hot(self, input_batch):
method _embed_gc (line 204) | def _embed_gc(self, global_condition): # global_condition = global_co...
method predict_proba_incremental (line 238) | def predict_proba_incremental(self, waveform,upsampled_local_condition...
method add_loss (line 270) | def add_loss(self, input_batch,local_condition=None, global_condition_...
method add_optimizer (line 338) | def add_optimizer(self, hparams,global_step):
FILE: wavenet/ops.py
function create_adam_optimizer (line 4) | def create_adam_optimizer(learning_rate, momentum):
function create_sgd_optimizer (line 9) | def create_sgd_optimizer(learning_rate, momentum):
function create_rmsprop_optimizer (line 14) | def create_rmsprop_optimizer(learning_rate, momentum):
function mu_law_encode (line 23) | def mu_law_encode(audio, quantization_channels):
function mu_law_decode (line 37) | def mu_law_decode(output, quantization_channels, quantization=True):
class SubPixelConvolution (line 51) | class SubPixelConvolution(tf.layers.Conv2D):
method __init__ (line 57) | def __init__(self, filters, kernel_size, padding, strides, NN_init, NN...
method build (line 82) | def build(self, input_shape):
method call (line 108) | def call(self, inputs):
method PS (line 116) | def PS(self, inputs):
method _phase_shift (line 138) | def _phase_shift(self, inputs, batch_size, H, W, r1, r2):
method _init_kernel (line 154) | def _init_kernel(self, kernel_size, strides, filters):
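The mu_law_encode / mu_law_decode entries above implement mu-law companding: a float signal in [-1, 1] is compressed with f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu) and quantized to integer levels {0, ..., mu}. A minimal NumPy sketch of the standard transform; the repo's versions operate on TensorFlow tensors and may differ in details such as the quantization flag:

import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    # audio is assumed to be float in [-1, 1].
    mu = quantization_channels - 1
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    signal = np.sign(audio) * magnitude                      # companded, still in [-1, 1]
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)    # quantize to {0, ..., mu}

def mu_law_decode(output, quantization_channels=256):
    mu = quantization_channels - 1
    signal = 2 * (output.astype(np.float32) / mu) - 1        # back to [-1, 1]
    magnitude = (1 / mu) * ((1 + mu) ** np.abs(signal) - 1)  # invert the companding
    return np.sign(signal) * magnitude

x = np.linspace(-1, 1, 5)
assert np.allclose(mu_law_decode(mu_law_encode(x)), x, atol=1e-2)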