Repository: hccho2/Tacotron2-Wavenet-Korean-TTS
Branch: master
Commit: 9215afde67a2
Files: 36
Total size: 254.3 KB

Directory structure:
gitextract_8q9e32ds/
├── LICENSE
├── ReadMe.md
├── datasets/
│   ├── __init__.py
│   ├── datafeeder_tacotron2.py
│   ├── datafeeder_wavenet.py
│   ├── moon/
│   │   └── moon-recognition-All.json
│   ├── moon.py
│   ├── son/
│   │   └── son-recognition-All.json
│   └── son.py
├── generate.py
├── hparams.py
├── preprocess.py
├── synthesizer.py
├── tacotron2/
│   ├── __init__.py
│   ├── helpers.py
│   ├── modules.py
│   ├── rnn_wrappers.py
│   └── tacotron2.py
├── text/
│   ├── __init__.py
│   ├── cleaners.py
│   ├── en_numbers.py
│   ├── english.py
│   ├── ko_dictionary.py
│   ├── korean.py
│   └── symbols.py
├── train_tacotron2.py
├── train_vocoder.py
├── utils/
│   ├── __init__.py
│   ├── audio.py
│   ├── infolog.py
│   └── plot.py
├── wavenet/
│   ├── __init__.py
│   ├── mixture.py
│   ├── model.py
│   └── ops.py
└── 명령어모음.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Heecheol Cho

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: ReadMe.md
================================================
# Multi-Speaker Tacotron2 + Wavenet Vocoder + Korean TTS

This project implements Korean TTS by combining a Tacotron2 model with a Wavenet vocoder. The Tacotron2 model has been extended to a multi-speaker model.

Based on
- https://github.com/keithito/tacotron
- https://github.com/carpedm20/multi-speaker-tacotron-tensorflow
- https://github.com/Rayhane-mamah/Tacotron-2
- https://github.com/hccho2/Tacotron-Wavenet-Vocoder

## Tacotron 2
- For an explanation of the Tacotron model, see the earlier [repo](https://github.com/hccho2/Tacotron-Wavenet-Vocoder).
- [Tacotron2](https://arxiv.org/abs/1712.05884) changes the model architecture and introduces Location Sensitive Attention, a Stop Token, and Wavenet as the vocoder.
- The best-known Tacotron2 implementation is [Rayhane-mamah](https://github.com/Rayhane-mamah/Tacotron-2), which itself builds on the code of [keithito](https://github.com/keithito/tacotron) and [r9y9](https://github.com/r9y9/wavenet_vocoder).

## This Project
* The goal is Korean TTS with a Tacotron2 model.
* The [Rayhane-mamah](https://github.com/Rayhane-mamah/Tacotron-2) implementation uses many customized layers, which seemed overly complex, so this project reduces the number of custom layers and relies more on layers built into Tensorflow.
* Teacher-forced train samples become intelligible from around step 2,000, and free-running (non-teacher-forced) test samples from around step 3,000.
## Step-by-Step Execution

### Order of Execution
- Data generation: for generating the Korean data, see the earlier [repo](https://github.com/hccho2/Tacotron-Wavenet-Vocoder).
- Point 'data_paths' below at the generated data.
- After training tacotron, test with synthesizer.py.
- After training wavenet, test with generate.py (you can test with a mel spectrogram that tacotron did not produce, or with one that tacotron produced).
- After both models are trained, test by feeding the mel spectrogram produced by tacotron into wavenet as the local condition.

### Tacotron2 Training
- Set '--data_paths' inside train_tacotron2.py and then train. data_paths can list multiple data directories.
```
parser.add_argument('--data_paths', default='.\\data\\moon,.\\data\\son')
```
- To resume training, set '--load_path'.
```
parser.add_argument('--load_path', default='logdir-tacotron2/moon+son_2019-02-27_00-21-42')
```
- model_type can be 'single' or 'multi-speaker'. For a single speaker, set model_type = 'single' in hparams and pass only one directory to '--data_paths' in train_tacotron2.py.
```
parser.add_argument('--data_paths', default='D:\\Tacotron2\\data\\moon')
```
- Since the hyperparameters are all set in hparams.py and the arguments in train_tacotron2.py, running training is simple:
> python train_tacotron2.py
- To synthesize speech after training, run the following. '--num_speakers' and '--speaker_id' must be set correctly.
> python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다."

### Wavenet Vocoder Training
- Set '--data_dir' inside train_vocoder.py and then train.
- If training runs out of memory or is too slow, reduce the sample_size hyperparameter; you can of course also reduce batch_size.
```
DATA_DIRECTORY = 'D:\\Tacotron2\\data\\moon,D:\\Tacotron2\\data\\son'
parser.add_argument('--data_dir', type=str, default=DATA_DIRECTORY, help='The directory containing data')
```
- To resume training, set '--logdir'.
```
LOGDIR = './/logdir-wavenet//train//2018-12-21T22-58-10'
parser.add_argument('--logdir', type=str, default=LOGDIR)
```
- After training wavenet, feed a mel spectrogram (npy file) produced by tacotron as the local condition to obtain the final TTS output.
> python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10

### Result
- Tacotron batch_size = 32, Wavenet batch_size = 8, on a GTX 1080ti.
- Tacotron was trained for about 100K steps, Wavenet for about 177K steps.
- The samples directory contains the generated wav files.
- There are samples generated with Griffin-Lim and samples generated with the Wavenet vocoder.
- The Wavenet-generated audio still contains noise due to insufficient training.
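A quick way to sanity-check the pipeline (this snippet is not part of the repository, and the file paths are placeholders for your own outputs): load one preprocessed .npz example and one tacotron-generated mel .npy, and estimate how many samples generate.py will synthesize from it (mel frames × hop_size).
```
import numpy as np
from hparams import hparams
from utils import audio

# one example written by preprocess.py (keys: audio, mel, linear, tokens, loss_coeff, ...)
data = np.load('./data/moon/003.0000.npz')
print(data['tokens'].shape, data['mel'].shape, data['linear'].shape)

# a tacotron-generated mel spectrogram, i.e. the --mel input of generate.py
mel = np.load('./logdir-wavenet/mel-moon.npy')
hop_size = audio.get_hop_size(hparams)
print('samples:', mel.shape[0] * hop_size,
      'seconds:', mel.shape[0] * hop_size / hparams.sample_rate)
```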
================================================ FILE: datasets/__init__.py ================================================ # -*- coding: utf-8 -*- from .datafeeder_wavenet import DataFeederWavenet ================================================ FILE: datasets/datafeeder_tacotron2.py ================================================ # coding: utf-8 import os import time import pprint import random import threading import traceback import numpy as np from glob import glob import tensorflow as tf from collections import defaultdict import text from utils.infolog import log from utils import parallel_run, remove_file from utils.audio import frames_to_hours _pad = 0 _stop_token_pad = 1 def get_frame(path): data = np.load(path) n_frame = data["linear"].shape[0] n_token = len(data["tokens"]) return (path, n_frame, n_token) def get_path_dict(data_dirs, hparams, config,data_type, n_test=None,rng=np.random.RandomState(123)): # Load metadata: path_dict = {} for data_dir in data_dirs: # ['datasets/moon\\data'] paths = glob("{}/*.npz".format(data_dir)) # ['datasets/moon\\data\\001.0000.npz', 'datasets/moon\\data\\001.0001.npz', 'datasets/moon\\data\\001.0002.npz', ...] if data_type == 'train': rng.shuffle(paths) # ['datasets/moon\\data\\012.0287.npz', 'datasets/moon\\data\\004.0215.npz', 'datasets/moon\\data\\003.0149.npz', ...] if not config.skip_path_filter: items = parallel_run( get_frame, paths, desc="filter_by_min_max_frame_batch", parallel=True) # [('datasets/moon\\data\\012.0287.npz', 130, 21), ('datasets/moon\\data\\003.0149.npz', 209, 37), ...] min_n_frame = hparams.min_n_frame # 5*30 max_n_frame = hparams.max_n_frame - 1 # 5*200 - 5 # 다음 단계에서 data가 많이 떨어져 나감. 글자수가 짧은 것들이 탈락됨. new_items = [(path, n) for path, n, n_tokens in items if min_n_frame <= n <= max_n_frame and n_tokens >= hparams.min_tokens] # [('datasets/moon\\data\\004.0383.npz', 297), ('datasets/moon\\data\\003.0533.npz', 394),...] new_paths = [path for path, n in new_items] new_n_frames = [n for path, n in new_items] hours = frames_to_hours(new_n_frames,hparams) log(' [{}] Loaded metadata for {} examples ({:.2f} hours)'.format(data_dir, len(new_n_frames), hours)) log(' [{}] Max length: {}'.format(data_dir, max(new_n_frames))) log(' [{}] Min length: {}'.format(data_dir, min(new_n_frames))) else: new_paths = paths # train용 data와 test용 data로 나눈다. if data_type == 'train': new_paths = new_paths[:-n_test] # 끝에 있는 n_test(batch_size)를 제외한 모두 elif data_type == 'test': new_paths = new_paths[-n_test:] # 끝에 있는 n_test else: raise Exception(" [!] Unkown data_type: {}".format(data_type)) path_dict[data_dir] = new_paths # ['datasets/moon\\data\\001.0621.npz', 'datasets/moon\\data\\003.0229.npz', ...] 
return path_dict # run -> _enqueue_next_group -> _get_next_example class DataFeederTacotron2(threading.Thread): '''Feeds batches of data into a queue on a background thread.''' def __init__(self, coordinator, data_dirs,hparams, config, batches_per_group, data_type, batch_size): #batches_per_group = 32 or 8, data_type: 'train' or 'test' super(DataFeederTacotron2, self).__init__() self._coord = coordinator self._hp = hparams self._cleaner_names = [x.strip() for x in hparams.cleaners.split(',')] self._step = 0 self._offset = defaultdict(lambda: 2) self._batches_per_group = batches_per_group self.rng = np.random.RandomState(config.random_seed) # random number generator self.data_type = data_type self.batch_size = batch_size self.min_tokens = hparams.min_tokens # 30 self.min_n_frame = hparams.min_n_frame # 5*30 self.max_n_frame = hparams.max_n_frame - 1 # 5*200 - 5 self.skip_path_filter = config.skip_path_filter # Load metadata: self.path_dict = get_path_dict(data_dirs, self._hp, config, self.data_type,n_test=self.batch_size, rng=self.rng) # data_dirs: ['datasets/moon\\data'] self.data_dirs = list(self.path_dict.keys()) # ['datasets/moon\\data'] self.data_dir_to_id = {data_dir: idx for idx, data_dir in enumerate(self.data_dirs)} # {'datasets/moon\\data': 0} data_weight = {data_dir: 1. for data_dir in self.data_dirs} # {'datasets/moon\\data': 1.0} if self._hp.main_data_greedy_factor > 0 and any(main_data in data_dir for data_dir in self.data_dirs for main_data in self._hp.main_data): # 'main_data': [''] for main_data in self._hp.main_data: for data_dir in self.data_dirs: if main_data in data_dir: data_weight[data_dir] += self._hp.main_data_greedy_factor weight_Z = sum(data_weight.values()) # 1 self.data_ratio = { data_dir: weight / weight_Z for data_dir, weight in data_weight.items()} # 각 data들의 weight sum이 1이 되도록... log("="*40) log('Data Amount:') log(pprint.pformat(self.data_ratio, indent=4)) log("="*40) #audio_paths = [path.replace("/data/", "/audio/").replace(".npz", ".wav") for path in self.data_paths] #duration = get_durations(audio_paths, print_detail=False) # Create placeholders for inputs and targets. Don't specify batch size because we want to # be able to feed different sized batches at eval time. 
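        # Six tensors are fed per batch: token ids ('inputs'), token lengths ('input_lengths'),
        # per-example loss coefficients ('loss_coeff'), mel targets, linear targets and
        # stop-token targets. For multi-speaker training a seventh tensor ('speaker_id') is
        # appended below. The background thread pushes batches into the FIFOQueue via
        # _enqueue_op while the training graph dequeues them.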
self._placeholders = [ tf.placeholder(tf.int32, [None, None], 'inputs'), tf.placeholder(tf.int32, [None], 'input_lengths'), tf.placeholder(tf.float32, [None], 'loss_coeff'), tf.placeholder(tf.float32, [None, None, hparams.num_mels], 'mel_targets'), tf.placeholder(tf.float32, [None, None, hparams.num_freq], 'linear_targets'), tf.placeholder(tf.float32, [None, None], 'stop_token_targets') ] # Create queue for buffering data: dtypes = [tf.int32, tf.int32, tf.float32, tf.float32, tf.float32, tf.float32] self.is_multi_speaker = len(self.data_dirs) > 1 if self.is_multi_speaker: self._placeholders.append( tf.placeholder(tf.int32, [None], 'speaker_id'),) dtypes.append(tf.int32) num_worker = 8 if self.data_type == 'train' else 1 queue = tf.FIFOQueue(num_worker, dtypes, name='input_queue') self._enqueue_op = queue.enqueue(self._placeholders) if self.is_multi_speaker: self.inputs, self.input_lengths, self.loss_coeff, self.mel_targets, self.linear_targets,self.stop_token_targets, self.speaker_id = queue.dequeue() else: self.inputs, self.input_lengths, self.loss_coeff, self.mel_targets, self.linear_targets,self.stop_token_targets = queue.dequeue() self.inputs.set_shape(self._placeholders[0].shape) self.input_lengths.set_shape(self._placeholders[1].shape) self.loss_coeff.set_shape(self._placeholders[2].shape) self.mel_targets.set_shape(self._placeholders[3].shape) self.linear_targets.set_shape(self._placeholders[4].shape) self.stop_token_targets.set_shape(self._placeholders[5].shape) if self.is_multi_speaker: self.speaker_id.set_shape(self._placeholders[6].shape) else: self.speaker_id = None if self.data_type == 'test': examples = [] while True: for data_dir in self.data_dirs: examples.append(self._get_next_example(data_dir)) #print(data_dir, text.sequence_to_text(examples[-1][0], False, True)) if len(examples) >= self.batch_size: break if len(examples) >= self.batch_size: break # test 할 때는 같은 examples로 계속 반복 self.static_batches = [examples for _ in range(self._batches_per_group)] # [examples, examples,...,examples] <--- 각 example은 2개의 data를 가지고 있다. else: self.static_batches = None def start_in_session(self, session, start_step): self._step = start_step self._session = session self.start() def run(self): try: while not self._coord.should_stop(): self._enqueue_next_group() except Exception as e: traceback.print_exc() self._coord.request_stop(e) def _enqueue_next_group(self): start = time.time() # Read a group of examples: n = self.batch_size # 32 r = self._hp.reduction_factor # 4 or 5 min_n_frame,max_n_frame 계산에 사용되었던... if self.static_batches is not None: # 'test'에서는 static_batches를 사용한다. static_batches는 init에서 이미 만들어 놓았다. batches = self.static_batches else: # 'train' examples = [] for data_dir in self.data_dirs: if self._hp.initial_data_greedy: if self._step < self._hp.initial_phase_step and any("krbook" in data_dir for data_dir in self.data_dirs): data_dir = [data_dir for data_dir in self.data_dirs if "krbook" in data_dir][0] if self._step < self._hp.initial_phase_step: # 'initial_phase_step': 8000 example = [self._get_next_example(data_dir) for _ in range(int(n * self._batches_per_group // len(self.data_dirs)))] # _batches_per_group 8,또는 32 만큼의 batch data를 만드낟. 
각각의 batch size는 2, 또는 32 else: example = [self._get_next_example(data_dir) for _ in range(int(n * self._batches_per_group * self.data_ratio[data_dir]))] examples.extend(example) examples.sort(key=lambda x: x[-1]) # 제일 마지막 기준이니까, len(linear_target) 기준으로 정렬 batches = [examples[i:i+n] for i in range(0, len(examples), n)] self.rng.shuffle(batches) log('Generated %d batches of size %d in %.03f sec' % (len(batches), n, time.time() - start)) for batch in batches: # batches는 batch의 묶음이다. # test 또는 train mode에 맞게 만든 batches의 batch data를 placeholder에 넘겨준다. feed_dict = dict(zip(self._placeholders, _prepare_batch(batch, r, self.rng, self.data_type))) # _prepare_batch에서 batch data의 길이를 맞춘다. return 순서 = placeholder순서 self._session.run(self._enqueue_op, feed_dict=feed_dict) self._step += 1 def _get_next_example(self, data_dir): '''npz 1개를 읽어 처리한다. Loads a single example (input, mel_target, linear_target, cost) from disk''' data_paths = self.path_dict[data_dir] while True: if self._offset[data_dir] >= len(data_paths): self._offset[data_dir] = 0 if self.data_type == 'train': self.rng.shuffle(data_paths) data_path = data_paths[self._offset[data_dir]] # npz파일 1개 선택 self._offset[data_dir] += 1 try: if os.path.exists(data_path): data = np.load(data_path) # data속에는 "linear","mel","tokens","loss_coeff" else: continue except: remove_file(data_path) continue if not self.skip_path_filter: break if self.min_n_frame <= data["linear"].shape[0] <= self.max_n_frame and len(data["tokens"]) > self.min_tokens: break input_data = data['tokens'] # 1-dim mel_target = data['mel'] if 'loss_coeff' in data: loss_coeff = data['loss_coeff'] else: loss_coeff = 1 linear_target = data['linear'] stop_token_target = np.asarray([0.] * len(mel_target)) # mel_target은 [xx,80]으로 data마다 len이 다르다. len에 따라 [0,...,0] # multi-speaker가 아니면, speaker_id는 넘길 필요 없지만, 현재 구현이 좀 꼬여 있다. 그래서 무조건 넘긴다. if self.is_multi_speaker: return (input_data, loss_coeff, mel_target, linear_target,stop_token_target, self.data_dir_to_id[data_dir], len(linear_target)) else: return (input_data, loss_coeff, mel_target, linear_target,stop_token_target, len(linear_target)) def _prepare_batch(batch, reduction_factor, rng, data_type=None): # (input_data, loss_coeff, mel_target, linear_target,stop_token_target, speaker_id, len(linear_target)) if data_type == 'train': rng.shuffle(batch) # batch data: (input_data, loss_coeff, mel_target, linear_target, self.data_dir_to_id[data_dir], len(linear_target)) inputs = _prepare_inputs([x[0] for x in batch]) # batch에 있는 data들 중, 가장 긴 data의 길이에 맞게 padding한다. 
input_lengths = np.asarray([len(x[0]) for x in batch], dtype=np.int32) # batch_size, [37, 37, 32, 32, 38,..., 39, 36, 30] loss_coeff = np.asarray([x[1] for x in batch], dtype=np.float32) # batch_size, [1,1,1,,..., 1,1,1] mel_targets = _prepare_targets([x[2] for x in batch], reduction_factor) # ---> (32, 175, 80) max length는 reduction_factor의 배수가 되도록 linear_targets = _prepare_targets([x[3] for x in batch], reduction_factor) # ---> (32, 175, 1025) max length는 reduction_factor의 배수가 되도록 stop_token_targets = _prepare_stop_token_targets([x[4] for x in batch], reduction_factor) if len(batch[0]) == 7: # is_multi_speaker = True인 경우 speaker_id = np.asarray([x[5] for x in batch], dtype=np.int32) # speaker_id로 list 만들기 return (inputs, input_lengths, loss_coeff,mel_targets, linear_targets,stop_token_targets, speaker_id) else: return (inputs, input_lengths, loss_coeff, mel_targets, linear_targets,stop_token_targets) # ('inputs' 'input_lengths' 'loss_coeff' 'mel_targets' 'linear_targets') def _prepare_inputs(inputs): # inputs: batch 길이 만큼의 list max_len = max((len(x) for x in inputs)) return np.stack([_pad_input(x, max_len) for x in inputs]) # (batch_size, max_len) """ batch_size = 2 일 떼, [[13, 26, 13, 41, 13, 21, 13, 41, 13, 21, 13, 41, 9, 41, 13, 40,79, 14, 34, 13, 33, 79, 20, 32, 13, 35, 45, 2, 34, 42, 13, 39,7, 29, 11, 25, 1], [ 6, 29, 79, 14, 26, 14, 34, 5, 29, 79, 2, 30, 45, 2, 28, 14,21, 79, 13, 27, 7, 25, 9, 34, 45, 13, 40, 79, 4, 29, 2, 29,13, 26, 1, 0, 0]] """ def _prepare_targets(targets, alignment): # targets: shape of list [ (162,80) , (172, 80), ...] max_len = max((len(t) for t in targets)) + 1 return np.stack([_pad_target(t, _round_up(max_len, alignment)) for t in targets]) def _prepare_stop_token_targets(targets, alignment): max_len = max((len(t) for t in targets)) + 1 return np.stack([_pad_stop_token_target(t, _round_up(max_len, alignment)) for t in targets]) def _pad_input(x, length): return np.pad(x, (0, length - x.shape[0]), mode='constant', constant_values=_pad) def _pad_target(t, length): # t: 2 dim array. 
( xx, num_mels) ==> (length,num_mels) return np.pad(t, [(0, length - t.shape[0]), (0,0)], mode='constant', constant_values=_pad) # (169, 80) ==> (length, 80) ### def _pad_stop_token_target(t, length): return np.pad(t, (0, length - t.shape[0]), mode='constant', constant_values=_stop_token_pad) def _round_up(x, multiple): remainder = x % multiple return x if remainder == 0 else x + multiple - remainder if __name__ == '__main__': from hparams import hparams import argparse from utils import str2bool parser = argparse.ArgumentParser() parser.add_argument('--random_seed', type=int, default=123) parser.add_argument('--batch_size', type=int, default=4) parser.add_argument('--skip_path_filter', type=str2bool, default=True, help='Use only for debugging') config = parser.parse_args() coord = tf.train.Coordinator() data_dirs=['D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon'] mydatafeed = DataFeederTacotron2(coord, data_dirs, hparams, config, 32,data_type='train', batch_size=config.batch_size) with tf.Session() as sess: try: sess.run(tf.global_variables_initializer()) step = 0 mydatafeed.start_in_session(sess,step) while not coord.should_stop(): a,b,c,d=sess.run([mydatafeed.inputs, mydatafeed.input_lengths, mydatafeed.mel_targets,mydatafeed.stop_token_targets]) print(a.shape,c.shape,d.shape) print(step,b) print('stop token:', d[0]) print('-'*10) a,b,c=sess.run([mydatafeed.inputs, mydatafeed.input_lengths, mydatafeed.mel_targets]) print(a.shape,c.shape) print(step,b) print('='*10) step = step +1 if step > 3: raise Exception('End xxx') except Exception as e: print('finally') print(e) coord.request_stop(e) ================================================ FILE: datasets/datafeeder_wavenet.py ================================================ # -*- coding: utf-8 -*- import sys sys.path.append("../") import tensorflow as tf import threading import random import numpy as np import os from utils import audio from hparams import hparams from glob import glob from collections import defaultdict def get_path_dict(data_dirs, min_length): path_dict = {} for data_dir in data_dirs: if not hparams.skip_path_filter: with open(os.path.join(data_dir,'train.txt'), 'r', encoding='utf-8') as f: lines = f.readlines() new_paths = [] for line in lines: line = line.strip().split("|") if int(line[3]) > min_length: new_paths.append(line[6]) path_dict[data_dir] = new_paths else: new_paths = glob("{}/*.npz".format(data_dir)) new_paths = [os.path.basename(p) for p in new_paths] path_dict[data_dir] = new_paths return path_dict def assert_ready_for_upsampling(x, c,hop_size): assert len(x) % len(c) == 0 and len(x) // len(c) == hop_size def ensure_divisible(length, divisible_by=256, lower=True): if length % divisible_by == 0: return length if lower: return length - length % divisible_by else: return length + (divisible_by - length % divisible_by) class DataFeederWavenet(threading.Thread): def __init__(self,coord,data_dirs,batch_size, gc_enable=False,test_mode=False, queue_size=8): super(DataFeederWavenet, self).__init__() self.data_dirs = data_dirs self.coord = coord self.batch_size = batch_size self.hop_size = audio.get_hop_size(hparams) self.sample_size = ensure_divisible(hparams.sample_size,self.hop_size, True) self.max_frames = self.sample_size // self.hop_size # sample_size 크기를 확보하기 위해. 
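        # ensure_divisible above rounds sample_size down to a multiple of hop_size, so each
        # training example pairs exactly max_frames mel frames with max_frames * hop_size
        # audio samples. Preprocessing already trims the audio to mel_frames * hop_size
        # samples, which assert_ready_for_upsampling verifies before the random crop below.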
self.queue_size = queue_size self.gc_enable = gc_enable self.skip_path_filter = hparams.skip_path_filter self.test_mode = test_mode if test_mode: assert batch_size==1 self.rng = np.random.RandomState(123) self._offset = defaultdict(lambda: 2) # key에 없는 값이 들어어면 2가 할당된다. self.data_dir_to_id = {data_dir: idx for idx, data_dir in enumerate(self.data_dirs)} # data_dir <---> speaker_id 매핑 self.path_dict = get_path_dict(self.data_dirs,self.sample_size)# receptive_field 보다 작은 것을 버리고, 나머지만 돌려준다. self._placeholders = [ tf.placeholder(tf.float32, shape=[None,None,1],name='input_wav'), tf.placeholder(tf.float32, shape=[None,None,hparams.num_mels],name='local_condition') ] dtypes = [tf.float32, tf.float32] if self.gc_enable: self._placeholders.append(tf.placeholder(tf.int32, shape=[None],name='speaker_id')) dtypes.append(tf.int32) queue = tf.FIFOQueue(self.queue_size, dtypes, name='input_queue') self.enqueue = queue.enqueue(self._placeholders) if self.gc_enable: self.inputs_wav, self.local_condition, self.speaker_id = queue.dequeue() else: self.inputs_wav, self.local_condition = queue.dequeue() self.inputs_wav.set_shape(self._placeholders[0].shape) self.local_condition.set_shape(self._placeholders[1].shape) if self.gc_enable: self.speaker_id.set_shape(self._placeholders[2].shape) def run(self): try: while not self.coord.should_stop(): self.make_batches() except Exception as e: self.coord.request_stop(e) def start_in_session(self, session,start_step): self._step = start_step self.sess = session self.start() def make_batches(self): examples = [] n = self.batch_size for data_dir in self.data_dirs: example = [self._get_next_example(data_dir) for _ in range(int(n * 32 // len(self.data_dirs)))] examples.extend(example) self.rng.shuffle(examples) batches = [examples[i:i+n] for i in range(0, len(examples), n)] for batch in batches: # batch size만큼의 data를 원하는 만큼 만든다. feed_dict = dict(zip(self._placeholders, _prepare_batch(batch))) self.sess.run(self.enqueue, feed_dict=feed_dict) self._step += 1 def _get_next_example(self, data_dir): '''npz 1개를 읽어 처리한다. Loads a single example (input_wav, local_condition,speaker_id ) from disk''' data_paths = self.path_dict[data_dir] while True: if self._offset[data_dir] >= len(data_paths): self._offset[data_dir] = 0 self.rng.shuffle(data_paths) data_path = os.path.join(data_dir,data_paths[self._offset[data_dir]]) # npz파일 1개 선택 self._offset[data_dir] += 1 if os.path.exists(data_path): data = np.load(data_path) # data속에는 'audio', 'mel', 'linear', 'time_steps', 'mel_frames', 'text', 'token' else: continue if not self.skip_path_filter: # 이경우는 get_path_dict함수에서 한번 걸러졌기 때문에, 여기서 다시 확인할 필요 없음. break # get_path_dict함수에서 걸러지지 않앗기 때문에 확인이 필요함. if data['time_steps'] > self.sample_size or self.test_mode: break input_wav = data['audio'] local_condition = data['mel'] input_wav = input_wav.reshape(-1, 1) assert_ready_for_upsampling(input_wav, local_condition,self.hop_size) if self.test_mode==False: # test_mode에서는 전체. 
train_mode에서는 sample_size 만큼만 s = np.random.randint(0, len(local_condition) - self.max_frames+1) # hccho ts = s * self.hop_size input_wav = input_wav[ts:ts + self.hop_size * self.max_frames, :] local_condition = local_condition[s:s + self.max_frames, :] if self.gc_enable: return (input_wav,local_condition, self.data_dir_to_id[data_dir]) else: return (input_wav,local_condition) def _prepare_batch(batch): input_wavs = [x[0] for x in batch] local_conditions = [x[1] for x in batch] if len(batch[0])==3: speaker_ids = [x[2] for x in batch] return (input_wavs,local_conditions,speaker_ids) else: return (input_wavs,local_conditions) if __name__ == '__main__': coord = tf.train.Coordinator() data_dirs=['D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon','D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\son'] mydatafeed = DataFeederWavenet(coord,data_dirs,batch_size=5,receptive_field=1200, gc_enable=True, queue_size=8) with tf.Session() as sess: try: sess.run(tf.global_variables_initializer()) step = 0 mydatafeed.start_in_session(sess,step) while not coord.should_stop(): a,b,c=sess.run([mydatafeed.inputs_wav, mydatafeed.local_condition, mydatafeed.speaker_id]) print(a.shape,b.shape,c.shape) print(step, c) a,b,c=sess.run([mydatafeed.inputs_wav, mydatafeed.local_condition, mydatafeed.speaker_id]) print(a.shape,b.shape,c.shape) print(step, c) step = step +1 except Exception as e: print('finally') coord.request_stop(e) ================================================ FILE: datasets/moon/moon-recognition-All.json ================================================ { "./datasets/moon/audio/003.0000.wav": "존경하는 독일 국민 여러분", "./datasets/moon/audio/003.0001.wav": "고국에 계신 국민 여러분", "./datasets/moon/audio/003.0002.wav": "하울젠 쾨르버재단 이사님과", "./datasets/moon/audio/003.0003.wav": "모드로", "./datasets/moon/audio/003.0004.wav": "전 동독 총리님을 비롯한", "./datasets/moon/audio/003.0005.wav": "내외 귀빈 여러분", "./datasets/moon/audio/003.0006.wav": "먼저 냉전과 분단을 넘어", "./datasets/moon/audio/003.0007.wav": "통일을 이루고", "./datasets/moon/audio/003.0008.wav": "그 힘으로 유럽통합과 국제평화를 선도하고 있는", "./datasets/moon/audio/003.0009.wav": "독일과", "./datasets/moon/audio/003.0010.wav": "독일 국민에게", "./datasets/moon/audio/003.0011.wav": "무한한 경의를 표합니다", "./datasets/moon/audio/003.0012.wav": "오늘 이 자리를 마련해 주신", "./datasets/moon/audio/003.0013.wav": "독일 정부와 쾨르버 재단에도", "./datasets/moon/audio/003.0014.wav": "감사드립니다", "./datasets/moon/audio/003.0015.wav": "아울러 얼마 전 별세하신", "./datasets/moon/audio/003.0016.wav": "고", "./datasets/moon/audio/003.0017.wav": "헬무트 콜 총리의 가족과", "./datasets/moon/audio/003.0018.wav": "독일 국민들에게 깊은 애도와", "./datasets/moon/audio/003.0019.wav": "위로의 마음을 전합니다", "./datasets/moon/audio/003.0020.wav": "대한민국은", "./datasets/moon/audio/003.0021.wav": "냉전시기", "./datasets/moon/audio/003.0022.wav": "어려운 환경 속에서도", "./datasets/moon/audio/003.0023.wav": "적극적이고", "./datasets/moon/audio/003.0024.wav": "능동적인 외교로", "./datasets/moon/audio/003.0025.wav": "독일 통일과 유럽통합을 주도한", "./datasets/moon/audio/003.0026.wav": "헬무트", "./datasets/moon/audio/003.0027.wav": "콜 총리의 위대한 업적을 기억할 것입니다", "./datasets/moon/audio/003.0028.wav": "친애하는 내외 귀빈 여러분", "./datasets/moon/audio/003.0029.wav": "이곳 베를린은", "./datasets/moon/audio/003.0030.wav": "지금으로부터 17년 전", "./datasets/moon/audio/003.0031.wav": "한국의 김대중 대통령이", "./datasets/moon/audio/003.0032.wav": "남북 화해·협력의 기틀을 마련한", "./datasets/moon/audio/003.0033.wav": "베를린 선언을 발표한 곳입니다", "./datasets/moon/audio/003.0034.wav": "여기 알테스 슈타트하우스는", "./datasets/moon/audio/003.0035.wav": "독일 통일조약 협상이 이뤄졌던", "./datasets/moon/audio/003.0036.wav": "역사적 현장입니다", 
"./datasets/moon/audio/003.0037.wav": "나는 오늘", "./datasets/moon/audio/003.0038.wav": "베를린의 교훈이 살아있는 이 자리에서", "./datasets/moon/audio/003.0039.wav": "대한민국 새 정부의 한반도 평화 구상을", "./datasets/moon/audio/003.0040.wav": "말씀드리고자 합니다", "./datasets/moon/audio/003.0041.wav": "내외 귀빈 여러분", "./datasets/moon/audio/003.0042.wav": "독일 통일의 경험은", "./datasets/moon/audio/003.0043.wav": "지구상", "./datasets/moon/audio/003.0044.wav": "마지막 분단국가로 남은 우리에게", "./datasets/moon/audio/003.0045.wav": "통일에 대한 희망과 함께", "./datasets/moon/audio/003.0046.wav": "우리가 나아가야 할 방향을 말해주고 있습니다", "./datasets/moon/audio/003.0047.wav": "그것은 우선", "./datasets/moon/audio/003.0048.wav": "통일에 이르는", "./datasets/moon/audio/003.0049.wav": "과정의 중요성입니다", "./datasets/moon/audio/006.0000.wav": "존경하고 사랑하는 국민 여러분", "./datasets/moon/audio/006.0001.wav": "감사합니다", "./datasets/moon/audio/006.0002.wav": "국민 여러분의", "./datasets/moon/audio/006.0003.wav": "위대한 선택에", "./datasets/moon/audio/006.0004.wav": "머리 숙여", "./datasets/moon/audio/006.0005.wav": "깊이", "./datasets/moon/audio/006.0006.wav": "감사드립니다", "./datasets/moon/audio/006.0007.wav": "저는 오늘", "./datasets/moon/audio/006.0008.wav": "대한민국", "./datasets/moon/audio/006.0009.wav": "제19대 대통령으로서", "./datasets/moon/audio/006.0010.wav": "새로운 대한민국을 향해", "./datasets/moon/audio/006.0011.wav": "첫걸음을 내딛습니다", "./datasets/moon/audio/006.0012.wav": "지금 제 두 어깨는", "./datasets/moon/audio/006.0013.wav": "국민 여러분으로부터", "./datasets/moon/audio/006.0014.wav": "부여받은", "./datasets/moon/audio/006.0015.wav": "막중한 소명감으로", "./datasets/moon/audio/006.0016.wav": "무겁습니다", "./datasets/moon/audio/006.0017.wav": "지금 제 가슴은", "./datasets/moon/audio/006.0018.wav": "한 번도 경험하지 못한", "./datasets/moon/audio/006.0019.wav": "나라를 만들겠다는 열정으로 뜨겁습니다", "./datasets/moon/audio/006.0020.wav": "그리고 지금 제 머리는", "./datasets/moon/audio/006.0021.wav": "통합과 공존의", "./datasets/moon/audio/006.0022.wav": "새로운 세상을 열어갈", "./datasets/moon/audio/006.0023.wav": "청사진으로", "./datasets/moon/audio/006.0024.wav": "가득 차 있습니다", "./datasets/moon/audio/006.0025.wav": "우리가 만들어가려는 새로운 대한민국은", "./datasets/moon/audio/006.0026.wav": "숱한 좌절과 패배에도 불구하고", "./datasets/moon/audio/006.0027.wav": "우리의 선대들이", "./datasets/moon/audio/006.0028.wav": "일관되게 추구했던 나라입니다", "./datasets/moon/audio/006.0029.wav": "또 많은 희생과 헌신을 감내하며", "./datasets/moon/audio/006.0030.wav": "우리 젊은이들이", "./datasets/moon/audio/006.0031.wav": "그토록 이루고 싶어했던", "./datasets/moon/audio/006.0032.wav": "나라입니다", "./datasets/moon/audio/006.0033.wav": "그런 대한민국을 만들기 위해 저는", "./datasets/moon/audio/006.0034.wav": "역사와 국민 앞에", "./datasets/moon/audio/006.0035.wav": "두렵지만", "./datasets/moon/audio/006.0036.wav": "겸허한 마음으로", "./datasets/moon/audio/006.0037.wav": "대한민국", "./datasets/moon/audio/006.0038.wav": "제19대", "./datasets/moon/audio/006.0039.wav": "대통령으로서의", "./datasets/moon/audio/006.0040.wav": "책임과 소명을 다할 것임을 천명합니다", "./datasets/moon/audio/006.0041.wav": "함께 선거를 치른 후보들께", "./datasets/moon/audio/006.0042.wav": "감사의 말씀과 함께", "./datasets/moon/audio/006.0043.wav": "심심한", "./datasets/moon/audio/006.0044.wav": "위로를 전합니다", "./datasets/moon/audio/006.0045.wav": "이번 선거에서는", "./datasets/moon/audio/006.0046.wav": "승자도", "./datasets/moon/audio/006.0047.wav": "패자도 없습니다", "./datasets/moon/audio/006.0048.wav": "우리는", "./datasets/moon/audio/006.0062.wav": "정치적 격변기를 보냈습니다", "./datasets/moon/audio/006.0063.wav": "정치는 혼란스러웠지만", "./datasets/moon/audio/006.0065.wav": "현직 대통령의 탄핵과 구속 앞에서도", "./datasets/moon/audio/006.0067.wav": "대한민국의 앞길을 열어주셨습니다", "./datasets/moon/audio/006.0068.wav": "우리 국민들은 좌절하지 않고", "./datasets/moon/audio/006.0093.wav": "2017년5월10일", 
"./datasets/moon/audio/006.0098.wav": "존경하고 사랑하는 국민 여러분", "./datasets/moon/audio/006.0104.wav": "바로 그 질문에서 새로 시작하겠습니다", "./datasets/moon/audio/006.0108.wav": "구시대의 잘못된 관행과", "./datasets/moon/audio/006.0115.wav": "광화문 대통령 시대를 열겠습니다", "./datasets/moon/audio/006.0116.wav": "참모들과 머리와 어깨를 맞대고" } ================================================ FILE: datasets/moon.py ================================================ # -*- coding: utf-8 -*- from concurrent.futures import ProcessPoolExecutor from functools import partial import numpy as np import os,json from utils import audio from text import text_to_sequence def build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda x: x): """ Preprocesses the speech dataset from a gven input path to given output directories Args: - hparams: hyper parameters - input_dir: input directory that contains the files to prerocess - out_dir: output directory of npz files - n_jobs: Optional, number of worker process to parallelize across - tqdm: Optional, provides a nice progress bar Returns: - A list of tuple describing the train examples. this should be written to train.txt """ executor = ProcessPoolExecutor(max_workers=num_workers) futures = [] index = 1 path = os.path.join(in_dir, 'moon-recognition-All.json') with open(path,encoding='utf-8') as f: content = f.read() data = json.loads(content) for key, text in data.items(): wav_path = key.strip().split('/') wav_path = os.path.join(in_dir, 'audio', '%s' % wav_path[-1]) # In case of test file if not os.path.exists(wav_path): continue futures.append(executor.submit(partial(_process_utterance, out_dir, wav_path, text,hparams))) index += 1 return [future.result() for future in tqdm(futures) if future.result() is not None] # result = [] # for future in tqdm(futures): # if future.result() is not None: # result.append(future.result()) # # return result def _process_utterance(out_dir, wav_path, text, hparams): """ Preprocesses a single utterance wav/text pair this writes the mel scale spectogram to disk and return a tuple to write to the train.txt file Args: - mel_dir: the directory to write the mel spectograms into - linear_dir: the directory to write the linear spectrograms into - wav_dir: the directory to write the preprocessed wav into - index: the numeric index to use in the spectogram filename - wav_path: path to the audio file containing the speech input - text: text spoken in the input audio file - hparams: hyper parameters Returns: - A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, linear_frames, text) """ try: # Load the audio as numpy array wav = audio.load_wav(wav_path, sr=hparams.sample_rate) except FileNotFoundError: #catch missing wav exception print('file {} present in csv metadata is not present in wav folder. 
skipping!'.format( wav_path)) return None #rescale wav if hparams.rescaling: # hparams.rescale = True wav = wav / np.abs(wav).max() * hparams.rescaling_max #M-AILABS extra silence specific if hparams.trim_silence: # hparams.trim_silence = True wav = audio.trim_silence(wav, hparams) # Trim leading and trailing silence #Mu-law quantize, default 값은 'raw' if hparams.input_type=='mulaw-quantize': #[0, quantize_channels) out = audio.mulaw_quantize(wav, hparams.quantize_channels) #Trim silences start, end = audio.start_and_end_indices(out, hparams.silence_threshold) wav = wav[start: end] out = out[start: end] constant_values = mulaw_quantize(0, hparams.quantize_channels) out_dtype = np.int16 elif hparams.input_type=='mulaw': #[-1, 1] out = audio.mulaw(wav, hparams.quantize_channels) constant_values = audio.mulaw(0., hparams.quantize_channels) out_dtype = np.float32 else: # raw #[-1, 1] out = wav constant_values = 0. out_dtype = np.float32 # Compute the mel scale spectrogram from the wav mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32) mel_frames = mel_spectrogram.shape[1] if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length: # hparams.max_mel_frames = 1000, hparams.clip_mels_length = True return None #Compute the linear scale spectrogram from the wav linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32) linear_frames = linear_spectrogram.shape[1] #sanity check assert linear_frames == mel_frames if hparams.use_lws: # hparams.use_lws = False #Ensure time resolution adjustement between audio and mel-spectrogram fft_size = hparams.fft_size if hparams.win_size is None else hparams.win_size l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams)) #Zero pad audio signal out = np.pad(out, (l, r), mode='constant', constant_values=constant_values) else: #Ensure time resolution adjustement between audio and mel-spectrogram pad = audio.librosa_pad_lr(wav, hparams.fft_size, audio.get_hop_size(hparams)) #Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency) out = np.pad(out, pad, mode='reflect') assert len(out) >= mel_frames * audio.get_hop_size(hparams) #time resolution adjustement #ensure length of raw audio is multiple of hop size so that we can use #transposed convolution to upsample out = out[:mel_frames * audio.get_hop_size(hparams)] assert len(out) % audio.get_hop_size(hparams) == 0 time_steps = len(out) # Write the spectrogram and audio to disk wav_id = os.path.splitext(os.path.basename(wav_path))[0] # Write the spectrograms to disk: audio_filename = '{}-audio.npy'.format(wav_id) mel_filename = '{}-mel.npy'.format(wav_id) linear_filename = '{}-linear.npy'.format(wav_id) npz_filename = '{}.npz'.format(wav_id) npz_flag=True if npz_flag: # Tacotron 코드와 맞추기 위해, 같은 key를 사용한다. data = { 'audio': out.astype(out_dtype), 'mel': mel_spectrogram.T, 'linear': linear_spectrogram.T, 'time_steps': time_steps, 'mel_frames': mel_frames, 'text': text, 'tokens': text_to_sequence(text), # eos(~)에 해당하는 "1"이 끝에 붙는다. 
'loss_coeff': 1 # For Tacotron } np.savez(os.path.join(out_dir,npz_filename ), **data, allow_pickle=False) else: np.save(os.path.join(out_dir, audio_filename), out.astype(out_dtype), allow_pickle=False) np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False) np.save(os.path.join(out_dir, linear_filename), linear_spectrogram.T, allow_pickle=False) # Return a tuple describing this training example return (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text,npz_filename) ================================================ FILE: datasets/son/son-recognition-All.json ================================================ { "./datasets/son/audio/NB10584578.0000.wav": "오늘부터 뉴스룸 2부에서는 그날의 주요사항을 한마디의 단어로 축약해서 앵커브리핑으로 풀어보겠습니다", "./datasets/son/audio/NB10584578.0001.wav": "오늘 뉴스룸이 주목한다 던어는 저돌입니다", "./datasets/son/audio/NB10584578.0002.wav": "돼지 저 자에 갑자기 돌 이 두 글자를 사용하는 이 단어는 흔히 추진력이 강하다는 의미로 쓰이죠", "./datasets/son/audio/NB10584578.0003.wav": "난파 직전의 새정치연합을 책임지게 된 문희상 비대위원장이 이런 말을 했습니다", "./datasets/son/audio/NB10584578.0004.wav": "난 그냥 산 돼지처럼 돌파하는 스타일이다", "./datasets/son/audio/NB10584578.0005.wav": "이렇게 얘기했습니다", "./datasets/son/audio/NB10584578.0006.wav": "몸이 좋지 않다면서 만남을 주저했던 김무성 새누리당 대표를 찾아 가서 만난 것도 바로 이런 적어도 저돌성이 없었다면 어려웠을지도 모르겠습니다 그렇다면", "./datasets/son/audio/NB10584578.0007.wav": "문 비대위원장이 저돌적으로 돌파해야 할 과제는 무엇인가", "./datasets/son/audio/NB10584578.0008.wav": "첫 번째는 계파주의 청산입니다", "./datasets/son/audio/NB10584578.0009.wav": "지난 이천십이년 대선에서 민주통합당의 패배한 이후에 대선평가 위원장을 맡았던 한상진 서울대 명예교수가", "./datasets/son/audio/NB10584578.0010.wav": "이런 보고서를 냈습니다", "./datasets/son/audio/NB10584578.0011.wav": "계파정치 청산은 민주당의 미래를 위한 최우선 과제다", "./datasets/son/audio/NB10584578.0012.wav": "아 이렇게 얘기했는데요 그러나 아시는 것처럼이 보고서는", "./datasets/son/audio/NB10584578.0013.wav": "갖가지 반발 끝에 결국 채택되지 못했습니다", "./datasets/son/audio/NB10584578.0014.wav": "아마 여당에서 한상진 교수 좋아하는 사람 별로 없을 겁니다", "./datasets/son/audio/NB10584578.0015.wav": "문희상 당시 비대위원장이 공교롭게도 계파와 패권주의 청산을 내세웠던 바로 그 시기에 비대위원장 이었죠", "./datasets/son/audio/NB10584578.0016.wav": "계파 청산에 관한 문 비대위원장은 어떻게 보면 실패했다고 봐야만 합니다", "./datasets/son/audio/NB10584578.0017.wav": "권한은 공유하되 책임은 당 대표가 혼자지는 이런 기형적 구조가", "./datasets/son/audio/NB10584578.0018.wav": "아 결국", "./datasets/son/audio/NB10584578.0019.wav": "최근 사년 동안에 임기 2년에 야당 지도부 교체 숫자를", "./datasets/son/audio/NB10584578.0020.wav": "늘려서 무료 열번이나 교체가 되었습니다", "./datasets/son/audio/NB10584578.0021.wav": "같은 기간에 새누리당은 단 네명의 지도부가 바뀌었습니다", "./datasets/son/audio/NB10584578.0022.wav": "실패가 구조화된 당의 체질을 바꾸지 않고서는 누가 리더가 되어도 쉽지 않다는 것을 상징적으로 내보여주는 숫자이기도 합니다", "./datasets/son/audio/NB10584578.0023.wav": "자 두 번째 과제는 바로 이겁니다 수사권 기소권 문제로 교착상태에 빠지는 세월호 특별법 지금도 끝이 보이지 않는데요", "./datasets/son/audio/NB10584578.0024.wav": "어떠한 추가 협상도", "./datasets/son/audio/NB10584578.0025.wav": "불가하다 이렇게 못박은 청와대와", "./datasets/son/audio/NB10584578.0026.wav": "여당을 어떻게 변화시킬 것인지 또한", "./datasets/son/audio/NB10584578.0027.wav": "수사권과 기소권을 주장하는 유족들의 요구를 어떻게 담아낼 것인지", "./datasets/son/audio/NB10584578.0028.wav": "겉은 장비 속은 조조라고 불리우는 의회주의자 문희상 비대위원장과 새정치연합이 저돌적으로 말 그대로 저돌적으로 풀어 가야 할", "./datasets/son/audio/NB10584578.0029.wav": "과제인지도 모르겠습니다", "./datasets/son/audio/NB10584578.0030.wav": "세월호 참사는 오늘로 백육십일째를 맞았습니다", "./datasets/son/audio/NB10584578.0031.wav": "쓸쓸한 팽목항에는", "./datasets/son/audio/NB10584578.0032.wav": "자원봉사자마저 하나둘 철수하고 있고", "./datasets/son/audio/NB10584578.0033.wav": "슬픈 이천십사년은 오늘로 이제 딱", "./datasets/son/audio/NB10584578.0034.wav": "백일이 남았습니다", "./datasets/son/audio/NB10584578.0035.wav": "잠시 후에 문희상 비대위원장을 스튜디오에서 만나겠습니다", 
"./datasets/son/audio/NB10585784.0001.wav": "자 이어서 앵커 브리핑 순서입니다 오늘 뉴스 룸이 주목한 단어는 덫입니다", "./datasets/son/audio/NB10585784.0002.wav": "어 잔꾀를 부리다 자신이 놓은 덫에 스스로 걸리고 만 꼴이다", "./datasets/son/audio/NB10585784.0003.wav": "국회 선진화법 개정을 추진하고 있는 새누리당을 향해서", "./datasets/son/audio/NB10585784.0004.wav": "새정치민주연합에 박수현 의원이 이런 말을 했군요", "./datasets/son/audio/NB10585784.0005.wav": "이 말을 이해하기 위해서는 지난 이천십이년에 국회로 한 걸음", "./datasets/son/audio/NB10585784.0006.wav": "돌아가 봐야만 합니다", "./datasets/son/audio/NB10585784.0007.wav": "기대보다는 걱정이 앞서는 것이", "./datasets/son/audio/NB10585784.0008.wav": "솔직한 내 심정입니다", "./datasets/son/audio/NB10585784.0009.wav": "이제 개정안이 통과된 이상 우리 여야가", "./datasets/son/audio/NB10585784.0010.wav": "대화와 타협을 통해서", "./datasets/son/audio/NB10585784.0011.wav": "국민들에게 신뢰받는 선진 국회를 만들어 가기를 간절히 바랍니다", "./datasets/son/audio/NB10585784.0015.wav": "예 이렇게 세번 두들기고 법안은 통과가 되는데요", "./datasets/son/audio/NB10585784.0016.wav": "국회선진화법은 재적의원 중에 과반이 아닌 오분의 삼이상이 찬성해야 만", "./datasets/son/audio/NB10585784.0017.wav": "안건을 올릴 수 있도록 만든 법이죠" } ================================================ FILE: datasets/son.py ================================================ # -*- coding: utf-8 -*- from concurrent.futures import ProcessPoolExecutor from functools import partial import numpy as np import os,json from utils import audio from text import text_to_sequence def build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda x: x): """ Preprocesses the speech dataset from a gven input path to given output directories Args: - hparams: hyper parameters - input_dir: input directory that contains the files to prerocess - out_dir: output directory of npz files - n_jobs: Optional, number of worker process to parallelize across - tqdm: Optional, provides a nice progress bar Returns: - A list of tuple describing the train examples. this should be written to train.txt """ executor = ProcessPoolExecutor(max_workers=num_workers) futures = [] index = 1 path = os.path.join(in_dir, 'son-recognition-All.json') with open(path,encoding='utf-8') as f: content = f.read() data = json.loads(content) for key, text in data.items(): wav_path = key.strip().split('/') wav_path = os.path.join(in_dir, 'audio', '%s' % wav_path[-1]) # In case of test file if not os.path.exists(wav_path): continue futures.append(executor.submit(partial(_process_utterance, out_dir, wav_path, text,hparams))) index += 1 return [future.result() for future in tqdm(futures) if future.result() is not None] def _process_utterance(out_dir, wav_path, text, hparams): """ Preprocesses a single utterance wav/text pair this writes the mel scale spectogram to disk and return a tuple to write to the train.txt file Args: - mel_dir: the directory to write the mel spectograms into - linear_dir: the directory to write the linear spectrograms into - wav_dir: the directory to write the preprocessed wav into - index: the numeric index to use in the spectogram filename - wav_path: path to the audio file containing the speech input - text: text spoken in the input audio file - hparams: hyper parameters Returns: - A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, linear_frames, text) """ try: # Load the audio as numpy array wav = audio.load_wav(wav_path, sr=hparams.sample_rate) except FileNotFoundError: #catch missing wav exception print('file {} present in csv metadata is not present in wav folder. 
skipping!'.format(wav_path)) return None #rescale wav if hparams.rescaling: # hparams.rescale = True wav = wav / np.abs(wav).max() * hparams.rescaling_max #M-AILABS extra silence specific if hparams.trim_silence: # hparams.trim_silence = True wav = audio.trim_silence(wav, hparams) # Trim leading and trailing silence #Mu-law quantize, default 값은 'raw' if hparams.input_type=='mulaw-quantize': #[0, quantize_channels) out = audio.mulaw_quantize(wav, hparams.quantize_channels) #Trim silences start, end = audio.start_and_end_indices(out, hparams.silence_threshold) wav = wav[start: end] out = out[start: end] constant_values = mulaw_quantize(0, hparams.quantize_channels) out_dtype = np.int16 elif hparams.input_type=='mulaw': #[-1, 1] out = audio.mulaw(wav, hparams.quantize_channels) constant_values = audio.mulaw(0., hparams.quantize_channels) out_dtype = np.float32 else: # raw #[-1, 1] out = wav constant_values = 0. out_dtype = np.float32 # Compute the mel scale spectrogram from the wav mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32) mel_frames = mel_spectrogram.shape[1] if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length: # hparams.max_mel_frames = 1000, hparams.clip_mels_length = True return None #Compute the linear scale spectrogram from the wav linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32) linear_frames = linear_spectrogram.shape[1] #sanity check assert linear_frames == mel_frames if hparams.use_lws: # hparams.use_lws = False #Ensure time resolution adjustement between audio and mel-spectrogram fft_size = hparams.fft_size if hparams.win_size is None else hparams.win_size l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams)) #Zero pad audio signal out = np.pad(out, (l, r), mode='constant', constant_values=constant_values) else: #Ensure time resolution adjustement between audio and mel-spectrogram pad = audio.librosa_pad_lr(wav, hparams.fft_size, audio.get_hop_size(hparams)) #Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency) out = np.pad(out, pad, mode='reflect') assert len(out) >= mel_frames * audio.get_hop_size(hparams) #time resolution adjustement #ensure length of raw audio is multiple of hop size so that we can use #transposed convolution to upsample out = out[:mel_frames * audio.get_hop_size(hparams)] assert len(out) % audio.get_hop_size(hparams) == 0 time_steps = len(out) # Write the spectrogram and audio to disk wav_id = os.path.splitext(os.path.basename(wav_path))[0] # Write the spectrograms to disk: audio_filename = '{}-audio.npy'.format(wav_id) mel_filename = '{}-mel.npy'.format(wav_id) linear_filename = '{}-linear.npy'.format(wav_id) npz_filename = '{}.npz'.format(wav_id) npz_flag=True if npz_flag: # Tacotron 코드와 맞추기 위해, 같은 key를 사용한다. data = { 'audio': out.astype(out_dtype), 'mel': mel_spectrogram.T, 'linear': linear_spectrogram.T, 'time_steps': time_steps, 'mel_frames': mel_frames, 'text': text, 'tokens': text_to_sequence(text), # eos(~)에 해당하는 "1"이 끝에 붙는다. 
'loss_coeff': 1 # For Tacotron } np.savez(os.path.join(out_dir,npz_filename ), **data, allow_pickle=False) else: np.save(os.path.join(out_dir, audio_filename), out.astype(out_dtype), allow_pickle=False) np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False) np.save(os.path.join(out_dir, linear_filename), linear_spectrogram.T, allow_pickle=False) # Return a tuple describing this training example return (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text,npz_filename) ================================================ FILE: generate.py ================================================ # coding: utf-8 """ sample_rate = 16000이므로, samples 48000이면 3초 길이가 된다. > python generate.py --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2018-12-21T22-58-10 > python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2018-12-21T22-58-10 <----scalar_input = True > python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10 python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-tacotron/generate/mel-2018-12-25_22-27-50-0.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10 gc_id = 0(moon), 1(son) python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-22T23-08-16 python generate.py --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-22T23-08-16 """ import argparse from datetime import datetime import json import os,time import librosa import numpy as np import tensorflow as tf from wavenet import WaveNetModel, mu_law_decode, mu_law_encode from hparams import hparams from utils import load_hparams,load from utils import audio from utils import plot import warnings warnings.simplefilter(action='ignore', category=FutureWarning) def _interp(feats, in_range): #rescales from [-max, max] (or [0, max]) to [0, 1] return (feats - in_range[0]) / (in_range[1] - in_range[0]) def get_arguments(): def _str_to_bool(s): """Convert string to bool (in argparse context).""" if s.lower() not in ['true', 'false']: raise ValueError('Argument needs to be a boolean, got {}'.format(s)) return {'true': True, 'false': False}[s.lower()] def _ensure_positive_float(f): """Ensure argument is a positive float.""" if float(f) < 0: raise argparse.ArgumentTypeError('Argument must be greater than zero') return float(f) parser = argparse.ArgumentParser(description='WaveNet generation script') parser.add_argument('checkpoint_dir', type=str, help='Which model checkpoint to generate from') TEMPERATURE = 1.0 parser.add_argument('--temperature', type=_ensure_positive_float, default=TEMPERATURE,help='Sampling temperature') LOGDIR = './logdir-wavenet' parser.add_argument('--logdir',type=str,default=LOGDIR,help='Directory in which to store the logging information for TensorBoard.') parser.add_argument('--wav_out_path',type=str,default=None,help='Path to output wav file') BATCH_SIZE = 1 parser.add_argument('--batch_size', type=int, default=BATCH_SIZE,help='batch size') parser.add_argument('--wav_seed',type=str,default=None,help='The wav file to start generation from') parser.add_argument('--mel',type=str,default=None,help='mel input') parser.add_argument('--gc_cardinality',type=int,default=None,help='Number of categories upon which we globally condition.') 
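    # For the 2-speaker moon/son setup used in this repo, --gc_cardinality is 2 and --gc_id
    # (added just below) selects the speaker embedding (0: moon, 1: son), matching the
    # example commands in the module docstring above.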
parser.add_argument('--gc_id',type=int,default=None,help='ID of category to generate, if globally conditioned.') arguments = parser.parse_args() if hparams.gc_channels is not None: if arguments.gc_cardinality is None: raise ValueError("Globally conditioning but gc_cardinality not specified. Use --gc_cardinality=377 for full VCTK corpus.") if arguments.gc_id is None: raise ValueError("Globally conditioning, but global condition was not specified. Use --gc_id to specify global condition.") return arguments # def write_wav(waveform, sample_rate, filename): # y = np.array(waveform) # librosa.output.write_wav(filename, y, sample_rate) # print('Updated wav file at {}'.format(filename)) def create_seed(filename,sample_rate,quantization_channels,window_size,scalar_input): # seed의 앞부분만 사용한다. seed_audio, _ = librosa.load(filename, sr=sample_rate, mono=True) seed_audio = audio.trim_silence(seed_audio, hparams) if scalar_input: if len(seed_audio) < window_size: return seed_audio else: return seed_audio[:window_size] else: quantized = mu_law_encode(seed_audio, quantization_channels) # 짧으면 짧은 대로 return하는데, padding이라도 해야되지 않나??? cut_index = tf.cond(tf.size(quantized) < tf.constant(window_size), lambda: tf.size(quantized), lambda: tf.constant(window_size)) return quantized[:cut_index] def main(): config = get_arguments() started_datestring = "{0:%Y-%m-%dT%H-%M-%S}".format(datetime.now()) logdir = os.path.join(config.logdir, 'generate', started_datestring) if not os.path.exists(logdir): os.makedirs(logdir) load_hparams(hparams, config.checkpoint_dir) with tf.device('/cpu:0'): # cpu가 더 빠르다. gpu로 설정하면 Error. tf.device 없이 하면 더 느려진다. sess = tf.Session() scalar_input = hparams.scalar_input net = WaveNetModel( batch_size=config.batch_size, dilations=hparams.dilations, filter_width=hparams.filter_width, residual_channels=hparams.residual_channels, dilation_channels=hparams.dilation_channels, quantization_channels=hparams.quantization_channels, out_channels =hparams.out_channels, skip_channels=hparams.skip_channels, use_biases=hparams.use_biases, scalar_input=hparams.scalar_input, global_condition_channels=hparams.gc_channels, global_condition_cardinality=config.gc_cardinality, local_condition_channels=hparams.num_mels, upsample_factor=hparams.upsample_factor, legacy = hparams.legacy, residual_legacy = hparams.residual_legacy, train_mode=False) # train 단계에서는 global_condition_cardinality를 AudioReader에서 파악했지만, 여기서는 넣어주어야 함 if scalar_input: samples = tf.placeholder(tf.float32,shape=[net.batch_size,None]) else: samples = tf.placeholder(tf.int32,shape=[net.batch_size,None]) # samples: mu_law_encode로 변환된 것. one-hot으로 변환되기 전. (batch_size, 길이) # local condition이 (N,T,num_mels) 여야 하지만, 길이 1까지로 들어가야하기 때무넹, (N,1,num_mels) --> squeeze하면 (N,num_mels) upsampled_local_condition = tf.placeholder(tf.float32,shape=[net.batch_size,hparams.num_mels]) next_sample = net.predict_proba_incremental(samples,upsampled_local_condition, [config.gc_id]*net.batch_size) # Fast Wavenet Generation Algorithm-1611.09482 algorithm 적용 # making local condition data. placeholder - upsampled_local_condition 넣어줄 upsampled local condition data를 만들어 보자. 
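    # The mel file is a (frames, num_mels) array; it is tiled to batch_size and upsampled by
    # hop_size with net.create_upsample so that one conditioning vector is available for every
    # output sample. The generated waveform therefore has frames * hop_size samples
    # (sample_size below).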
mel_input = np.load(config.mel) sample_size = mel_input.shape[0] * hparams.hop_size mel_input = np.tile(mel_input,(config.batch_size,1,1)) with tf.variable_scope('wavenet',reuse=tf.AUTO_REUSE): upsampled_local_condition_data = net.create_upsample(mel_input,upsample_type=hparams.upsample_type) var_list = [var for var in tf.global_variables() if 'queue' not in var.name ] saver = tf.train.Saver(var_list) print('Restoring model from {}'.format(config.checkpoint_dir)) load(saver, sess, config.checkpoint_dir) sess.run(net.queue_initializer) # 이 부분이 없으면, checkpoint에서 복원된 값들이 들어 있다. quantization_channels = hparams.quantization_channels if config.wav_seed: # wav_seed의 길이가 receptive_field보다 작으면, padding이라도 해야 되는 거 아닌가? 그냥 짧으면 짧은 대로 return함 --> 그래서 너무 짧으면 error seed = create_seed(config.wav_seed,hparams.sample_rate,quantization_channels,net.receptive_field,scalar_input) # --> mu_law encode 된 것. if scalar_input: waveform = seed.tolist() else: waveform = sess.run(seed).tolist() # [116, 114, 120, 121, 127, ...] print('Priming generation...') for i, x in enumerate(waveform[-net.receptive_field: -1]): # 제일 마지막 1개는 아래의 for loop의 첫 loop에서 넣어준다. if i % 100 == 0: print('Priming sample {}/{}'.format(i,net.receptive_field), end='\r') sess.run(next_sample, feed_dict={samples: np.array([x]*net.batch_size).reshape(net.batch_size,1), upsampled_local_condition: np.zeros([net.batch_size,hparams.num_mels])}) print('Done.') waveform = np.array([waveform[-net.receptive_field:]]*net.batch_size) else: # Silence with a single random sample at the end. if scalar_input: waveform = [0.0] * (net.receptive_field - 1) waveform = np.array(waveform*net.batch_size).reshape(net.batch_size,-1) waveform = np.concatenate([waveform,2*np.random.rand(net.batch_size).reshape(net.batch_size,-1)-1],axis=-1) # -1~1사이의 random number를 만들어 끝에 붙힌다. # wavefor: shape(batch_size,net.receptive_field ) else: waveform = [quantization_channels / 2] * (net.receptive_field - 1) # 필요한 receptive_field 크기보다 1개 작게 만든 후, 아래에서 random하게 1개를 덧붙힌다. waveform = np.array(waveform*net.batch_size).reshape(net.batch_size,-1) waveform = np.concatenate([waveform,np.random.randint(quantization_channels,size=net.batch_size).reshape(net.batch_size,-1)],axis=-1) # one hot 변환 전. (batch_size, 5117) start_time = time.time() upsampled_local_condition_data = sess.run(upsampled_local_condition_data) last_sample_timestamp = datetime.now() for step in range(sample_size): # 원하는 길이를 구하기 위해 loop sample_size window = waveform[:,-1:] # 제일 끝에 있는 1개만 samples에 넣어 준다. window: shape(N,1) # Run the WaveNet to predict the next sample. # fast가 아닌경우. window: [128.0, 128.0, ..., 128.0, 178, 185] # fast인 경우, window는 숫자 1개. prediction = sess.run(next_sample, feed_dict={samples: window,upsampled_local_condition: upsampled_local_condition_data[:,step,:]}) # samples는 mu law encoding된 것. 계산 과정에서 one hot으로 변환된다. --> (batch_size,256) if scalar_input: sample = prediction # logistic distribution으로부터 sampling 되었기 때문에, randomness가 있다. else: # Scale prediction distribution using temperature. # 다음 과정은 config.temperature==1이면 각 원소를 합으로 나누어주는 것에 불과. 이미 softmax를 적용한 겂이므로, 합이 1이된다. 그래서 값의 변화가 없다. # config.temperature가 1이 아니며, 각 원소의 log취한 값을 나눈 후, 합이 1이 되도록 rescaling하는 것이 된다. np.seterr(divide='ignore') scaled_prediction = np.log(prediction) / config.temperature # config.temperature인 경우는 값의 변화가 없다. 
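            # Subtracting the log-sum-exp below renormalizes so that exp(scaled_prediction)
            # sums to 1 again; with temperature 1.0 this reproduces the original softmax
            # distribution, which the assert_allclose check further down verifies.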
scaled_prediction = (scaled_prediction - np.logaddexp.reduce(scaled_prediction,axis=-1,keepdims=True)) # np.log(np.sum(np.exp(scaled_prediction))) scaled_prediction = np.exp(scaled_prediction) np.seterr(divide='warn') # Prediction distribution at temperature=1.0 should be unchanged after # scaling. if config.temperature == 1.0: np.testing.assert_allclose( prediction, scaled_prediction, atol=1e-5, err_msg='Prediction scaling at temperature=1.0 is not working as intended.') # argmax로 선택하지 않기 때문에, 같은 입력이 들어가도 달라질 수 있다. sample = [[np.random.choice(np.arange(quantization_channels), p=p)] for p in scaled_prediction] # choose one sample per batch waveform = np.concatenate([waveform,sample],axis=-1) #window.shape: (N,1) # Show progress only once per second. current_sample_timestamp = datetime.now() time_since_print = current_sample_timestamp - last_sample_timestamp if time_since_print.total_seconds() > 1.: duration = time.time() - start_time print('Sample {:3 OOM. wavenet은 batch_size가 고정되어야 한다. store_metadata = False, num_steps = 1000000, # Number of training steps #Learning rate schedule wavenet_learning_rate = 1e-3, #wavenet initial learning rate wavenet_decay_rate = 0.5, #Only used with 'exponential' scheme. Defines the decay rate. wavenet_decay_steps = 300000, #Only used with 'exponential' scheme. Defines the decay steps. #Regularization parameters wavenet_clip_gradients = True, #Whether the clip the gradients during wavenet training. # residual 결과를 sum할 때, legacy = True, #Whether to use legacy mode: Multiply all skip outputs but the first one with sqrt(0.5) (True for more early training stability, especially for large models) # residual block내에서 x = (x + residual) * np.sqrt(0.5) residual_legacy = True, #Whether to scale residual blocks outputs by a factor of sqrt(0.5) (True for input variance preservation early in training and better overall stability) wavenet_dropout = 0.05, optimizer = 'adam', momentum = 0.9, # 'Specify the momentum to be used by sgd or rmsprop optimizer. Ignored by the adam optimizer. max_checkpoints = 3, # 'Maximum amount of checkpoints that will be kept alive. Default: ' #################################### #################################### #################################### # TACOTRON HYPERPARAMETERS # Training adam_beta1 = 0.9, adam_beta2 = 0.999, #Learning rate schedule tacotron_decay_learning_rate = True, #boolean, determines if the learning rate will follow an exponential decay tacotron_start_decay = 40000, #Step at which learning decay starts tacotron_decay_steps = 18000, #Determines the learning rate decay slope (UNDER TEST) tacotron_decay_rate = 0.5, #learning rate decay rate (UNDER TEST) tacotron_initial_learning_rate = 1e-3, #starting learning rate tacotron_final_learning_rate = 1e-4, #minimal learning rate initial_data_greedy = True, initial_phase_step = 8000, # 여기서 지정한 step 이전에는 data_dirs의 각각의 디렉토리에 대하여 같은 수의 example을 만들고, 이후, weght 비듈에 따라 ... 즉, 아래의 'main_data_greedy_factor'의 영향을 받는다. main_data_greedy_factor = 0, main_data = [''], # 이곳에 있는 directory 속에 있는 data는 가중치를 'main_data_greedy_factor' 만큼 더 준다. prioritize_loss = False, # Model model_type = 'multi-speaker', # [single, multi-speaker] speaker_embedding_size = 16, embedding_size = 512, # 'ᄀ', 'ᄂ', 'ᅡ' 에 대한 embedding dim dropout_prob = 0.5, reduction_factor = 2, # reduction_factor가 적으면 더 많은 iteration이 필요하므로, 더 많은 메모리가 필요하다. 
# Encoder enc_conv_num_layers = 3, enc_conv_kernel_size = 5, enc_conv_channels = 512, tacotron_zoneout_rate = 0.1, encoder_lstm_units = 256, attention_type = 'bah_mon_norm', # 'loc_sen', 'bah_mon_norm' attention_size = 128, #Attention mechanism smoothing = False, #Whether to smooth the attention normalization function attention_dim = 128, #dimension of attention space attention_filters = 32, #number of attention convolution filters attention_kernel = (31, ), #kernel size of attention convolution cumulative_weights = True, #Whether to cumulate (sum) all previous attention weights or simply feed previous weights (Recommended: True) #Attention synthesis constraints #"Monotonic" constraint forces the model to only look at the forwards attention_win_size steps. #"Window" allows the model to look at attention_win_size neighbors, both forward and backward steps. synthesis_constraint = False, #Whether to use attention windows constraints in synthesis only (Useful for long utterances synthesis) synthesis_constraint_type = 'window', #can be in ('window', 'monotonic'). attention_win_size = 7, #Side of the window. Current step does not count. If mode is window and attention_win_size is not pair, the 1 extra is provided to backward part of the window. #Loss params mask_encoder = True, #whether to mask encoder padding while computing location sensitive attention. Set to True for better prosody but slower convergence. #Decoder prenet_layers = [256, 256], #number of layers and number of units of prenet decoder_layers = 2, #number of decoder lstm layers decoder_lstm_units = 1024, #number of decoder lstm units on each layer dec_prenet_sizes = [256, 256], #number of layers and number of units of prenet #Residual postnet postnet_num_layers = 5, #number of postnet convolutional layers postnet_kernel_size = (5, ), #size of postnet convolution filters for each layer postnet_channels = 512, #number of postnet convolution filters for each layer # for linear mel spectrogrma post_bank_size = 8, post_bank_channel_size = 128, post_maxpool_width = 2, post_highway_depth = 4, post_rnn_size = 128, post_proj_sizes = [256, 80], # num_mels=80 post_proj_width = 3, tacotron_reg_weight = 1e-6, #regularization weight (for L2 regularization) inference_prenet_dropout = True, # Eval min_tokens = 30, #originally 50, 30 is good for korean, text를 token으로 쪼갰을 때, 최소 길이 이상되어야 train에 사용 min_n_frame = 30*5, # min_n_frame = reduction_factor * min_iters, reduction_factor와 곱해서 min_n_frame을 설정한다. max_n_frame = 200*5, skip_inadequate = False, griffin_lim_iters = 60, power = 1.5, ) if hparams.use_lws: # Does not work if fft_size is not multiple of hop_size!! # sample size = 20480, hop_size=256=12.5ms. fft_size는 window_size를 결정하는데, 2048을 시간으로 환산하면 2048/20480 = 0.1초=100ms hparams.sample_rate = 20480 # # shift can be specified by either hop_size(우선) or frame_shift_ms hparams.hop_size = 256 # frame_shift_ms = 12.5ms hparams.frame_shift_ms=None # hop_size= sample_rate * frame_shift_ms / 1000 hparams.fft_size=2048 # 주로 1024로 되어있는데, tacotron에서 2048사용==> output size = 1025 hparams.win_size = None # 256x4 --> 50ms else: # 미리 정의되 parameter들로 부터 consistant하게 정의해 준다. 
    hparams.num_freq = int(hparams.fft_size/2 + 1)
    hparams.frame_shift_ms = hparams.hop_size * 1000.0 / hparams.sample_rate  # hop_size = sample_rate * frame_shift_ms / 1000
    hparams.frame_length_ms = hparams.win_size * 1000.0 / hparams.sample_rate

def hparams_debug_string():
    values = hparams.values()
    hp = [' %s: %s' % (name, values[name]) for name in sorted(values)]
    return 'Hyperparameters:\n' + '\n'.join(hp)

================================================
FILE: preprocess.py
================================================
# coding: utf-8
"""
python preprocess.py --num_workers 10 --name son --in_dir D:\hccho\multi-speaker-tacotron-tensorflow-master\datasets\son --out_dir .\data\son
python preprocess.py --num_workers 10 --name moon --in_dir D:\hccho\multi-speaker-tacotron-tensorflow-master\datasets\moon --out_dir .\data\moon
==> An npz file bundling 'audio', 'mel', 'linear', 'time_steps', 'mel_frames', 'text', 'tokens' and 'loss_coeff' is created in out_dir.
"""
import argparse
import os
from multiprocessing import cpu_count
from tqdm import tqdm
import importlib
from hparams import hparams, hparams_debug_string
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def preprocess(mod, in_dir, out_dir, num_workers):
    os.makedirs(out_dir, exist_ok=True)
    metadata = mod.build_from_path(hparams, in_dir, out_dir, num_workers=num_workers, tqdm=tqdm)
    write_metadata(metadata, out_dir)

def write_metadata(metadata, out_dir):
    with open(os.path.join(out_dir, 'train.txt'), 'w', encoding='utf-8') as f:
        for m in metadata:
            f.write('|'.join([str(x) for x in m]) + '\n')
    mel_frames = sum([int(m[4]) for m in metadata])
    timesteps = sum([int(m[3]) for m in metadata])
    sr = hparams.sample_rate
    hours = timesteps / sr / 3600
    print('Write {} utterances, {} mel frames, {} audio timesteps, ({:.2f} hours)'.format(len(metadata), mel_frames, timesteps, hours))
    print('Max input length (text chars): {}'.format(max(len(m[5]) for m in metadata)))
    print('Max mel frames length: {}'.format(max(int(m[4]) for m in metadata)))
    print('Max audio timesteps length: {}'.format(max(m[3] for m in metadata)))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--name', type=str, default=None)
    parser.add_argument('--in_dir', type=str, default=None)
    parser.add_argument('--out_dir', type=str, default=None)
    parser.add_argument('--num_workers', type=str, default=None)
    parser.add_argument('--hparams', type=str, default=None)
    args = parser.parse_args()

    if args.hparams is not None:
        hparams.parse(args.hparams)
    print(hparams_debug_string())

    name = args.name
    in_dir = args.in_dir
    out_dir = args.out_dir
    num_workers = args.num_workers
    num_workers = cpu_count() if num_workers is None else int(num_workers)  # cpu_count() = number of processes

    print("Sampling frequency: {}".format(hparams.sample_rate))

    assert name in ["cmu_arctic", "ljspeech", "son", "moon"]
    mod = importlib.import_module('datasets.{}'.format(name))
    preprocess(mod, in_dir, out_dir, num_workers)

================================================
FILE: synthesizer.py
================================================
# coding: utf-8
"""
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "그런데 청년은 이렇게 말합니다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "이런 논란은 타코트론 논문 이후에 사라졌습니다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "이런 논란은 타코트론 논문 이후에 사라졌습니다"
python
synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오는 6월6일은 제64회 현충일입니다" python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "오는 6월6일은 제64회 현충일입니다" python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다" python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다" """ import io import os import re import librosa import argparse import numpy as np from glob import glob from tqdm import tqdm import tensorflow as tf from functools import partial from hparams import hparams from tacotron2 import create_model, get_most_recent_checkpoint from utils.audio import save_wav, inv_linear_spectrogram, inv_preemphasis, inv_spectrogram_tensorflow from utils import plot, PARAMS_NAME, load_json, load_hparams, add_prefix, add_postfix, get_time, parallel_run, makedirs, str2bool from text.korean import tokenize from text import text_to_sequence, sequence_to_text from datasets.datafeeder_tacotron2 import _prepare_inputs import warnings warnings.simplefilter(action='ignore', category=FutureWarning) tf.logging.set_verbosity(tf.logging.ERROR) class Synthesizer(object): def close(self): tf.reset_default_graph() self.sess.close() def load(self, checkpoint_path, num_speakers=2, checkpoint_step=None, inference_prenet_dropout=True,model_name='tacotron'): self.num_speakers = num_speakers if os.path.isdir(checkpoint_path): load_path = checkpoint_path checkpoint_path = get_most_recent_checkpoint(checkpoint_path, checkpoint_step) else: load_path = os.path.dirname(checkpoint_path) print('Constructing model: %s' % model_name) inputs = tf.placeholder(tf.int32, [None, None], 'inputs') input_lengths = tf.placeholder(tf.int32, [None], 'input_lengths') batch_size = tf.shape(inputs)[0] speaker_id = tf.placeholder_with_default( tf.zeros([batch_size], dtype=tf.int32), [None], 'speaker_id') load_hparams(hparams, load_path) hparams.inference_prenet_dropout = inference_prenet_dropout with tf.variable_scope('model') as scope: self.model = create_model(hparams) self.model.initialize(inputs=inputs, input_lengths=input_lengths, num_speakers=self.num_speakers, speaker_id=speaker_id,is_training=False) self.wav_output = inv_spectrogram_tensorflow(self.model.linear_outputs,hparams) print('Loading checkpoint: %s' % checkpoint_path) sess_config = tf.ConfigProto( allow_soft_placement=True, intra_op_parallelism_threads=1, inter_op_parallelism_threads=2) sess_config.gpu_options.allow_growth = True self.sess = tf.Session(config=sess_config) self.sess.run(tf.global_variables_initializer()) saver = tf.train.Saver() saver.restore(self.sess, checkpoint_path) def synthesize(self, texts=None, tokens=None, base_path=None, paths=None, speaker_ids=None, start_of_sentence=None, end_of_sentence=True, pre_word_num=0, post_word_num=0, pre_surplus_idx=0, post_surplus_idx=1, use_short_concat=False, base_alignment_path=None, librosa_trim=False, attention_trim=True, isKorean=True): # Possible inputs: # 1) text=text # 2) text=texts # 3) tokens=tokens, texts=texts # use texts as guide if type(texts) == str: texts = [texts] if texts is not None and tokens is None: sequences = np.array([text_to_sequence(text) for text in texts]) sequences = _prepare_inputs(sequences) elif tokens is not None: sequences = tokens #sequences = 
np.pad(sequences,[(0,0),(0,5)],'constant',constant_values=(0)) # case by case ---> overfitting? if paths is None: paths = [None] * len(sequences) if texts is None: texts = [None] * len(sequences) time_str = get_time() def plot_and_save_parallel(wavs, alignments,mels): items = list(enumerate(zip(wavs, alignments, paths, texts, sequences,mels))) fn = partial( plot_graph_and_save_audio, base_path=base_path, start_of_sentence=start_of_sentence, end_of_sentence=end_of_sentence, pre_word_num=pre_word_num, post_word_num=post_word_num, pre_surplus_idx=pre_surplus_idx, post_surplus_idx=post_surplus_idx, use_short_concat=use_short_concat, librosa_trim=librosa_trim, attention_trim=attention_trim, time_str=time_str, isKorean=isKorean) return parallel_run(fn, items,desc="plot_graph_and_save_audio", parallel=False) #input_lengths = np.argmax(np.array(sequences) == 1, 1)+1 input_lengths = [np.argmax(a==1)+1 for a in sequences] fetches = [ #self.wav_output, self.model.linear_outputs, self.model.alignments, # # batch_size, text length(encoder), target length(decoder) self.model.mel_outputs, ] feed_dict = { self.model.inputs: sequences, self.model.input_lengths: input_lengths, } if speaker_ids is not None: if type(speaker_ids) == dict: speaker_embed_table = sess.run( self.model.speaker_embed_table) speaker_embed = [speaker_ids[speaker_id] * speaker_embed_table[speaker_id] for speaker_id in speaker_ids] feed_dict.update({ self.model.speaker_embed_table: np.tile() }) else: feed_dict[self.model.speaker_id] = speaker_ids wavs, alignments,mels = self.sess.run(fetches, feed_dict=feed_dict) results = plot_and_save_parallel(wavs, alignments,mels=mels) return results def plot_graph_and_save_audio(args, base_path=None, start_of_sentence=None, end_of_sentence=None, pre_word_num=0, post_word_num=0, pre_surplus_idx=0, post_surplus_idx=1, use_short_concat=False, save_alignment=False, librosa_trim=False, attention_trim=False, time_str=None, isKorean=True): idx, (wav, alignment, path, text, sequence,mel) = args if base_path: plot_path = "{}/{}.png".format(base_path, get_time()) elif path: plot_path = path.rsplit('.', 1)[0] + ".png" else: plot_path = None if plot_path: plot.plot_alignment(alignment, plot_path, text=text, isKorean=isKorean) if use_short_concat: wav = short_concat( wav, alignment, text, start_of_sentence, end_of_sentence, pre_word_num, post_word_num, pre_surplus_idx, post_surplus_idx) if attention_trim and end_of_sentence: # attention이 text의 마지막까지 왔다면, 그 뒷부분은 버린다. 
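# Illustrative sketch (not from the original code): `alignment` has shape
# (text length, decoder length), so alignment.argmax(0) gives, for every decoder frame,
# the input token it attends to most strongly. Once that index stays on the last token for
# a few frames, synthesis is treated as finished and wav/mel are cut near
# reduction_factor * frame index (the loop below, plus a small margin). Toy alignment with
# 3 tokens and 4 frames:
_demo_align = np.array([[.9, .1, .0, .0], [.1, .8, .1, .0], [.0, .1, .9, 1.]])
_demo_attended = _demo_align.argmax(0)                         # -> [0, 1, 2, 2]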
end_idx_counter = 0 attention_argmax = alignment.argmax(0) # alignment: text length(encoder), target length(decoder) ==> target length(decoder) end_idx = min(len(sequence) - 1, max(attention_argmax)) max_counter = min((attention_argmax == end_idx).sum(), 5) for jdx, attend_idx in enumerate(attention_argmax): if len(attention_argmax) > jdx + 1: if attend_idx == end_idx: end_idx_counter += 1 if attend_idx == end_idx and attention_argmax[jdx + 1] > end_idx: break if end_idx_counter >= max_counter: break else: break spec_end_idx = hparams.reduction_factor * jdx + 3 wav = wav[:spec_end_idx] mel = mel[:spec_end_idx] audio_out = inv_linear_spectrogram(wav.T,hparams) if librosa_trim and end_of_sentence: yt, index = librosa.effects.trim(audio_out, frame_length=5120, hop_length=256, top_db=50) audio_out = audio_out[:index[-1]] mel = mel[:index[-1]//hparams.hop_size] if save_alignment: alignment_path = "{}/{}.npy".format(base_path, idx) np.save(alignment_path, alignment, allow_pickle=False) if path or base_path: if path: current_path = add_postfix(path, idx) elif base_path: current_path = plot_path.replace(".png", ".wav") save_wav(audio_out, current_path,hparams.sample_rate) #hccho mel_path = current_path.replace(".wav",".npy") np.save(mel_path,mel) return True else: io_out = io.BytesIO() save_wav(audio_out, io_out,hparams.sample_rate) result = io_out.getvalue() return result def get_most_recent_checkpoint(checkpoint_dir, checkpoint_step=None): if checkpoint_step is None: checkpoint_paths = [path for path in glob("{}/*.ckpt-*.data-*".format(checkpoint_dir))] idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths] max_idx = max(idxes) else: max_idx = checkpoint_step lastest_checkpoint = os.path.join(checkpoint_dir, "model.ckpt-{}".format(max_idx)) print(" [*] Found lastest checkpoint: {}".format(lastest_checkpoint)) return lastest_checkpoint def short_concat( wav, alignment, text, start_of_sentence, end_of_sentence, pre_word_num, post_word_num, pre_surplus_idx, post_surplus_idx): # np.array(list(decomposed_text))[attention_argmax] attention_argmax = alignment.argmax(0) if not start_of_sentence and pre_word_num > 0: surplus_decomposed_text = decompose_ko_text("".join(text.split()[0])) start_idx = len(surplus_decomposed_text) + 1 for idx, attend_idx in enumerate(attention_argmax): if attend_idx == start_idx and attention_argmax[idx - 1] < start_idx: break wav_start_idx = hparams.reduction_factor * idx - 1 - pre_surplus_idx else: wav_start_idx = 0 if not end_of_sentence and post_word_num > 0: surplus_decomposed_text = decompose_ko_text("".join(text.split()[-1])) end_idx = len(decomposed_text.replace(surplus_decomposed_text, '')) - 1 for idx, attend_idx in enumerate(attention_argmax): if attend_idx == end_idx and attention_argmax[idx + 1] > end_idx: break wav_end_idx = hparams.reduction_factor * idx + 1 + post_surplus_idx else: if True: # attention based split if end_of_sentence: end_idx = min(len(decomposed_text) - 1, max(attention_argmax)) else: surplus_decomposed_text = decompose_ko_text("".join(text.split()[-1])) end_idx = len(decomposed_text.replace(surplus_decomposed_text, '')) - 1 while True: if end_idx in attention_argmax: break end_idx -= 1 end_idx_counter = 0 for idx, attend_idx in enumerate(attention_argmax): if len(attention_argmax) > idx + 1: if attend_idx == end_idx: end_idx_counter += 1 if attend_idx == end_idx and attention_argmax[idx + 1] > end_idx: break if end_idx_counter > 5: break else: break wav_end_idx = hparams.reduction_factor * idx + 1 + 
post_surplus_idx else: wav_end_idx = None wav = wav[wav_start_idx:wav_end_idx] if end_of_sentence: wav = np.lib.pad(wav, ((0, 20), (0, 0)), 'constant', constant_values=0) else: wav = np.lib.pad(wav, ((0, 10), (0, 0)), 'constant', constant_values=0) if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument('--load_path', required=True) parser.add_argument('--sample_path', default="logdir-tacotron2/generate") parser.add_argument('--text', required=True) parser.add_argument('--num_speakers', default=1, type=int) parser.add_argument('--speaker_id', default=0, type=int) parser.add_argument('--checkpoint_step', default=None, type=int) parser.add_argument('--is_korean', default=True, type=str2bool) parser.add_argument('--base_alignment_path', default=None) config = parser.parse_args() makedirs(config.sample_path) synthesizer = Synthesizer() synthesizer.load(config.load_path, config.num_speakers, config.checkpoint_step,inference_prenet_dropout=False) audio = synthesizer.synthesize(texts=[config.text],base_path=config.sample_path,speaker_ids=[config.speaker_id], attention_trim=True,base_alignment_path=config.base_alignment_path,isKorean=config.is_korean)[0] ================================================ FILE: tacotron2/__init__.py ================================================ # coding: utf-8 import os from glob import glob from .tacotron2 import Tacotron2 def create_model(hparams): return Tacotron2(hparams) def get_most_recent_checkpoint(checkpoint_dir): checkpoint_paths = [path for path in glob("{}/*.ckpt-*.data-*".format(checkpoint_dir))] idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths] max_idx = max(idxes) lastest_checkpoint = os.path.join(checkpoint_dir, "model.ckpt-{}".format(max_idx)) #latest_checkpoint=checkpoint_paths[0] print(" [*] Found lastest checkpoint: {}".format(lastest_checkpoint)) return lastest_checkpoint ================================================ FILE: tacotron2/helpers.py ================================================ # coding: utf-8 # Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py import numpy as np import tensorflow as tf from tensorflow.contrib.seq2seq import Helper # Adapted from tf.contrib.seq2seq.GreedyEmbeddingHelper class TacoTestHelper(Helper): def __init__(self, batch_size, output_dim, r): with tf.name_scope('TacoTestHelper'): self._batch_size = batch_size self._output_dim = output_dim self._end_token = tf.tile([0.0], [output_dim * r]) # [0.0,0.0,...] self._reduction_factor = r @property def batch_size(self): return self._batch_size @property def sample_ids_dtype(self): return tf.int32 @property def sample_ids_shape(self): return tf.TensorShape([]) def initialize(self, name=None): return (tf.tile([False], [self._batch_size]), _go_frames(self._batch_size, self._output_dim)) def sample(self, time, outputs, state, name=None): return tf.tile([0], [self._batch_size]) # Return all 0; we ignore them def next_inputs(self, time, outputs, state, sample_ids, name=None): '''Stop on EOS. Otherwise, pass the last output as the next input and pass through state.''' with tf.name_scope('TacoTestHelper'): stop_token_preds = tf.nn.sigmoid(outputs[:,-self._reduction_factor:]) finished = tf.reduce_any(tf.cast(tf.round(stop_token_preds), tf.bool),axis=1) # Feed last output frame as next input. 
outputs is [N, output_dim * r] next_inputs = outputs[:, -(self._output_dim+self._reduction_factor):-self._reduction_factor] # stop token 부분을 제외 return (finished, next_inputs, state) class TacoTrainingHelper(Helper): def __init__(self, targets, output_dim, r): # inputs is [N, T_in], targets is [N, T_out, D] # output_dim = hp.num_mels = 80 # r = hp.reduction_factor = 4 or 5 with tf.name_scope('TacoTrainingHelper'): self._batch_size = tf.shape(targets)[0] self._output_dim = output_dim # Feed every r-th target frame as input self._targets = targets[:, r-1::r, :] # Use full length for every target because we don't want to mask the padding frames num_steps = tf.shape(self._targets)[1] self._lengths = tf.tile([num_steps], [self._batch_size]) @property def batch_size(self): return self._batch_size @property def sample_ids_dtype(self): return tf.int32 @property def sample_ids_shape(self): return tf.TensorShape([]) def initialize(self, name=None): return (tf.tile([False], [self._batch_size]), _go_frames(self._batch_size, self._output_dim)) def sample(self, time, outputs, state, name=None): return tf.tile([0], [self._batch_size]) # Return all 0; we ignore them def next_inputs(self, time, outputs, state, sample_ids, name=None): # time에 해당하는 input을 만들어 return해야 한다. with tf.name_scope(name or 'TacoTrainingHelper'): finished = (time + 1 >= self._lengths) next_inputs = self._targets[:, time, :] return (finished, next_inputs, state) def _go_frames(batch_size, output_dim): '''Returns all-zero frames for a given batch size and output dimension''' return tf.tile([[0.0]], [batch_size, output_dim]) ================================================ FILE: tacotron2/modules.py ================================================ # coding: utf-8 # Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py import tensorflow as tf from tensorflow.contrib.rnn import GRUCell from tensorflow.python.layers import core from tensorflow.contrib.seq2seq.python.ops.attention_wrapper import _bahdanau_score, _BaseAttentionMechanism, BahdanauAttention, AttentionWrapper, AttentionWrapperState def prenet(inputs, is_training, layer_sizes, drop_prob, scope=None): x = inputs # 3차원 array(batch,seq_length,embedding_dim) ==> (batch,seq_length,256) ==> (batch,seq_length,128) #drop_rate = drop_prob if is_training else 0.0 #print('drop_rate',drop_rate) with tf.variable_scope(scope or 'prenet'): for i, size in enumerate(layer_sizes): # [f(256), f(256)] dense = tf.layers.dense(x, units=size, activation=tf.nn.relu, name='projection_%d' % (i+1)) # Tacotron2 논문에서는 training, inference 모두에 dropout 적용 x = tf.layers.dropout(dense, rate=drop_prob,training=True, name='dropout_%d' % (i+1)) # Tacotron2에서는 training, inference 모두에 dropout 적용 return x def cbhg(inputs, input_lengths, is_training, bank_size, bank_channel_size, maxpool_width, highway_depth, rnn_size, proj_sizes, proj_width, scope,before_highway=None, encoder_rnn_init_state=None): # inputs: (N,T_in, 128), bank_size: 16 batch_size = tf.shape(inputs)[0] with tf.variable_scope(scope): with tf.variable_scope('conv_bank'): # Convolution bank: concatenate on the last axis # to stack channels from all convolutions conv_fn = lambda k: conv1d(inputs, k, bank_channel_size, tf.nn.relu, is_training, 'conv1d_%d' % k) # bank_channel_size =128 conv_outputs = tf.concat( [conv_fn(k) for k in range(1, bank_size+1)], axis=-1,) # ==> (N,T_in,128*bank_size) # Maxpooling: maxpool_output = tf.layers.max_pooling1d(conv_outputs,pool_size=maxpool_width,strides=1,padding='same') # maxpool_width = 2 # 
Two projection layers: proj_out = maxpool_output for idx, proj_size in enumerate(proj_sizes): # [f(128), f(128)], post: [f(256), f(80)] activation_fn = None if idx == len(proj_sizes) - 1 else tf.nn.relu proj_out = conv1d(proj_out, proj_width, proj_size, activation_fn,is_training, 'proj_{}'.format(idx + 1)) # proj_width = 3 # Residual connection: if before_highway is not None: # multi-sperker mode expanded_before_highway = tf.expand_dims(before_highway, [1]) tiled_before_highway = tf.tile(expanded_before_highway, [1, tf.shape(proj_out)[1], 1]) highway_input = proj_out + inputs + tiled_before_highway else: # single model highway_input = proj_out + inputs # Handle dimensionality mismatch: if highway_input.shape[2] != rnn_size: # rnn_size = 128 highway_input = tf.layers.dense(highway_input, rnn_size,name='highway_projection') # 4-layer HighwayNet: for idx in range(highway_depth): highway_input = highwaynet(highway_input, 'highway_%d' % (idx+1)) rnn_input = highway_input # Bidirectional RNN if encoder_rnn_init_state is not None: initial_state_fw, initial_state_bw = tf.split(encoder_rnn_init_state, 2, 1) else: # single mode initial_state_fw, initial_state_bw = None, None cell_fw, cell_bw = GRUCell(rnn_size), GRUCell(rnn_size) outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,rnn_input,sequence_length=input_lengths, initial_state_fw=initial_state_fw,initial_state_bw=initial_state_bw,dtype=tf.float32) return tf.concat(outputs, axis=2) # Concat forward and backward def batch_tile(tensor, batch_size): expaneded_tensor = tf.expand_dims(tensor, [0]) return tf.tile(expaneded_tensor, \ [batch_size] + [1 for _ in tensor.get_shape()]) def highwaynet(inputs, scope): highway_dim = int(inputs.get_shape()[-1]) with tf.variable_scope(scope): H = tf.layers.dense(inputs,units=highway_dim, activation=tf.nn.relu,name='H_projection') T = tf.layers.dense(inputs,units=highway_dim, activation=tf.nn.sigmoid,name='T_projection',bias_initializer=tf.constant_initializer(-1.0)) return H * T + inputs * (1.0 - T) def conv1d(inputs, kernel_size, channels, activation, is_training, scope): with tf.variable_scope(scope): # strides=1, padding = same 이므로, kernel_size에 상관없이 크기가 유지된다. conv1d_output = tf.layers.conv1d(inputs,filters=channels,kernel_size=kernel_size,activation=activation,padding='same') # padding이 same이라 kenel size가 달라도 concat된다. return tf.layers.batch_normalization(conv1d_output, training=is_training) ================================================ FILE: tacotron2/rnn_wrappers.py ================================================ # coding: utf-8 import numpy as np import tensorflow as tf from tensorflow.contrib.rnn import RNNCell from tensorflow.python.ops import rnn_cell_impl #from tensorflow.contrib.data.python.util import nest from tensorflow.contrib.framework import nest from tensorflow.contrib.seq2seq.python.ops.attention_wrapper import _bahdanau_score, _BaseAttentionMechanism, BahdanauAttention, \ AttentionWrapperState, AttentionMechanism, _BaseMonotonicAttentionMechanism,_maybe_mask_score,_prepare_memory,_monotonic_probability_fn from tensorflow.python.ops import array_ops, math_ops, nn_ops, variable_scope from tensorflow.python.layers.core import Dense from .modules import prenet import functools _zero_state_tensors = rnn_cell_impl._zero_state_tensors class ZoneoutLSTMCell(RNNCell): '''Wrapper for tf LSTM to create Zoneout LSTM Cell inspired by: https://github.com/teganmaharaj/zoneout/blob/master/zoneout_tensorflow.py Published by one of 'https://arxiv.org/pdf/1606.01305.pdf' paper writers. 
Many thanks to @Ondal90 for pointing this out. You sir are a hero! ''' def __init__(self, num_units, is_training, zoneout_factor_cell=0., zoneout_factor_output=0., state_is_tuple=True, name=None): '''Initializer with possibility to set different zoneout values for cell/hidden states. ''' zm = min(zoneout_factor_output, zoneout_factor_cell) zs = max(zoneout_factor_output, zoneout_factor_cell) if zm < 0. or zs > 1.: raise ValueError('One/both provided Zoneout factors are not in [0, 1]') self._cell = tf.nn.rnn_cell.LSTMCell(num_units, state_is_tuple=state_is_tuple, name=name) self._zoneout_cell = zoneout_factor_cell self._zoneout_outputs = zoneout_factor_output self.is_training = is_training self.state_is_tuple = state_is_tuple @property def state_size(self): return self._cell.state_size @property def output_size(self): return self._cell.output_size def __call__(self, inputs, state, scope=None): '''Runs vanilla LSTM Cell and applies zoneout. ''' #Apply vanilla LSTM output, new_state = self._cell(inputs, state, scope) if self.state_is_tuple: (prev_c, prev_h) = state (new_c, new_h) = new_state else: num_proj = self._cell._num_units if self._cell._num_proj is None else self._cell._num_proj prev_c = tf.slice(state, [0, 0], [-1, self._cell._num_units]) prev_h = tf.slice(state, [0, self._cell._num_units], [-1, num_proj]) new_c = tf.slice(new_state, [0, 0], [-1, self._cell._num_units]) new_h = tf.slice(new_state, [0, self._cell._num_units], [-1, num_proj]) #Apply zoneout if self.is_training: #nn.dropout takes keep_prob (probability to keep activations) not drop_prob (probability to mask activations)! c = (1 - self._zoneout_cell) * tf.nn.dropout(new_c - prev_c, (1 - self._zoneout_cell)) + prev_c # tf.nn.dropout outputs the input element scaled up by 1 / keep_prob h = (1 - self._zoneout_outputs) * tf.nn.dropout(new_h - prev_h, (1 - self._zoneout_outputs)) + prev_h else: c = (1 - self._zoneout_cell) * new_c + self._zoneout_cell * prev_c h = (1 - self._zoneout_outputs) * new_h + self._zoneout_outputs * prev_h new_state = tf.nn.rnn_cell.LSTMStateTuple(c, h) if self.state_is_tuple else tf.concat(1, [c, h]) return output, new_state class DecoderWrapper(RNNCell): '''Runs RNN inputs through a prenet before sending them to the cell.''' # input에 prenet을 먼저 적용하는 것 뿐이다. def __init__(self, cell, is_training, prenet_sizes, dropout_prob,inference_prenet_dropout=True): super(DecoderWrapper, self).__init__() self._is_training = is_training self._cell = cell self.prenet_sizes = prenet_sizes if not is_training and not inference_prenet_dropout: self.dropout_prob = 0. else: self.dropout_prob = dropout_prob @property def state_size(self): return self._cell.state_size @property def output_size(self): return self._cell.output_size + self._cell.state_size.attention def call(self, inputs, state): prenet_out = prenet(inputs, self._is_training,self.prenet_sizes, self.dropout_prob, scope='decoder_prenet') output, res_state = self._cell(prenet_out, state) return tf.concat([output, res_state.attention], axis=-1), res_state def zero_state(self, batch_size, dtype): return self._cell.zero_state(batch_size, dtype) class LocationSensitiveAttention(BahdanauAttention): """Impelements Bahdanau-style (cumulative) scoring function. Usually referred to as "hybrid" attention (content-based + location-based) Extends the additive attention described in: "D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine transla- tion by jointly learning to align and translate,” in Proceedings of ICLR, 2015." 
to use previous alignments as additional location features. This attention is described in: J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben- gio, “Attention-based models for speech recognition,” in Ad- vances in Neural Information Processing Systems, 2015, pp. 577–585. """ def __init__(self, num_units, memory, hparams, is_training, mask_encoder=True, memory_sequence_length=None, smoothing=False, cumulate_weights=True, name='LocationSensitiveAttention'): """Construct the Attention mechanism. Args: num_units: The depth of the query mechanism. memory: The memory to query; usually the output of an RNN encoder. This tensor should be shaped `[batch_size, max_time, ...]`. mask_encoder (optional): Boolean, whether to mask encoder paddings. memory_sequence_length (optional): Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths. Only relevant if mask_encoder = True. smoothing (optional): Boolean. Determines which normalization function to use. Default normalization function (probablity_fn) is softmax. If smoothing is enabled, we replace softmax with: a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j})) Introduced in: J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben- gio, “Attention-based models for speech recognition,” in Ad- vances in Neural Information Processing Systems, 2015, pp. 577–585. This is mainly used if the model wants to attend to multiple input parts at the same decoding step. We probably won't be using it since multiple sound frames may depend on the same character/phone, probably not the way around. Note: We still keep it implemented in case we want to test it. They used it in the paper in the context of speech recognition, where one phoneme may depend on multiple subsequent sound frames. name: Name to use when creating ops. """ #Create normalization function #Setting it to None defaults in using softmax normalization_function = _smoothing_normalization if (smoothing == True) else None memory_length = memory_sequence_length if (mask_encoder==True) else None super(LocationSensitiveAttention, self).__init__( num_units=num_units, memory=memory, memory_sequence_length=memory_length, probability_fn=normalization_function, name=name) self.location_convolution = tf.layers.Conv1D(filters=hparams.attention_filters, kernel_size=hparams.attention_kernel, padding='same', use_bias=True, bias_initializer=tf.zeros_initializer(), name='location_features_convolution') self.location_layer = tf.layers.Dense(units=num_units, use_bias=False,dtype=tf.float32, name='location_features_projection') self._cumulate = cumulate_weights self.synthesis_constraint = hparams.synthesis_constraint and not is_training self.attention_win_size = tf.convert_to_tensor(hparams.attention_win_size, dtype=tf.int32) self.constraint_type = hparams.synthesis_constraint_type def __call__(self, query, state): """Score the query based on the keys and values. Args: query: Tensor of dtype matching `self.values` and shape `[batch_size, query_depth]`. state (previous alignments): Tensor of dtype matching `self.values` and shape `[batch_size, alignments_size]` (`alignments_size` is memory's `max_time`). Returns: alignments: Tensor of dtype matching `self.values` and shape `[batch_size, alignments_size]` (`alignments_size` is memory's `max_time`). 
""" previous_alignments = state with variable_scope.variable_scope(None, "Location_Sensitive_Attention", [query]): # processed_query shape [batch_size, query_depth] -> [batch_size, attention_dim] processed_query = self.query_layer(query) if self.query_layer else query # -> [batch_size, 1, attention_dim] processed_query = tf.expand_dims(processed_query, 1) # processed_location_features shape [batch_size, max_time, attention dimension] # [batch_size, max_time] -> [batch_size, max_time, 1] expanded_alignments = tf.expand_dims(previous_alignments, axis=2) # location features [batch_size, max_time, filters] f = self.location_convolution(expanded_alignments) # Projected location features [batch_size, max_time, attention_dim] processed_location_features = self.location_layer(f) # energy shape [batch_size, max_time] energy = _location_sensitive_score(processed_query, processed_location_features, self.keys) if self.synthesis_constraint: prev_max_attentions = tf.argmax(previous_alignments, -1, output_type=tf.int32) Tx = tf.shape(energy)[-1] # prev_max_attentions = tf.squeeze(prev_max_attentions, [-1]) if self.constraint_type == 'monotonic': key_masks = tf.sequence_mask(prev_max_attentions, Tx) reverse_masks = tf.sequence_mask(Tx - self.attention_win_size - prev_max_attentions, Tx)[:, ::-1] else: assert self.constraint_type == 'window' key_masks = tf.sequence_mask(prev_max_attentions - (self.attention_win_size // 2 + (self.attention_win_size % 2 != 0)), Tx) reverse_masks = tf.sequence_mask(Tx - (self.attention_win_size // 2) - prev_max_attentions, Tx)[:, ::-1] masks = tf.logical_or(key_masks, reverse_masks) paddings = tf.ones_like(energy) * (-2 ** 32 + 1) # (N, Ty/r, Tx) energy = tf.where(tf.equal(masks, False), energy, paddings) # alignments shape = energy shape = [batch_size, max_time] alignments = self._probability_fn(energy, previous_alignments) # Cumulate alignments if self._cumulate: next_state = alignments + previous_alignments else: next_state = alignments return alignments, next_state def _location_sensitive_score(W_query, W_fil, W_keys): """Impelements Bahdanau-style (cumulative) scoring function. This attention is described in: J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben- gio, “Attention-based models for speech recognition,” in Ad- vances in Neural Information Processing Systems, 2015, pp. 577–585. ############################################################################# hybrid attention (content-based + location-based) f = F * α_{i-1} energy = dot(v_a, tanh(W_keys(h_enc) + W_query(h_dec) + W_fil(f) + b_a)) ############################################################################# Args: W_query: Tensor, shape '[batch_size, 1, attention_dim]' to compare to location features. W_location: processed previous alignments into location features, shape '[batch_size, max_time, attention_dim]' W_keys: Tensor, shape '[batch_size, max_time, attention_dim]', typically the encoder outputs. 
Returns: A '[batch_size, max_time]' attention score (energy) """ # Get the number of hidden units from the trailing dimension of keys dtype = W_query.dtype num_units = W_keys.shape[-1].value or array_ops.shape(W_keys)[-1] v_a = tf.get_variable( 'attention_variable_projection', shape=[num_units], dtype=dtype, initializer=tf.contrib.layers.xavier_initializer()) b_a = tf.get_variable( 'attention_bias', shape=[num_units], dtype=dtype, initializer=tf.zeros_initializer()) return tf.reduce_sum(v_a * tf.tanh(W_keys + W_query + W_fil + b_a), [2]) def _smoothing_normalization(e): """Applies a smoothing normalization function instead of softmax Introduced in: J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben- gio, “Attention-based models for speech recognition,” in Ad- vances in Neural Information Processing Systems, 2015, pp. 577–585. ############################################################################ Smoothing normalization function a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j})) ############################################################################ Args: e: matrix [batch_size, max_time(memory_time)]: expected to be energy (score) values of an attention mechanism Returns: matrix [batch_size, max_time]: [0, 1] normalized alignments with possible attendance to multiple memory time steps. """ return tf.nn.sigmoid(e) / tf.reduce_sum(tf.nn.sigmoid(e), axis=-1, keepdims=True) class GmmAttention(AttentionMechanism): def __init__(self, num_mixtures, memory, memory_sequence_length=None, check_inner_dims_defined=True, score_mask_value=None, name='GmmAttention'): self.dtype = memory.dtype self.num_mixtures = num_mixtures self.query_layer = tf.layers.Dense(3 * num_mixtures, name='gmm_query_projection', use_bias=True, dtype=self.dtype) with tf.name_scope(name, 'GmmAttentionMechanismInit'): if score_mask_value is None: score_mask_value = 0. self._maybe_mask_score = functools.partial( _maybe_mask_score, memory_sequence_length=memory_sequence_length, score_mask_value=score_mask_value) self._value = _prepare_memory( memory, memory_sequence_length, check_inner_dims_defined) self._batch_size = ( self._value.shape[0].value or tf.shape(self._value)[0]) self._alignments_size = ( self._value.shape[1].value or tf.shape(self._value)[1]) @property def values(self): return self._value @property def batch_size(self): return self._batch_size @property def alignments_size(self): return self._alignments_size @property def state_size(self): return self.num_mixtures def initial_alignments(self, batch_size, dtype): max_time = self._alignments_size return _zero_state_tensors(max_time, batch_size, dtype) def initial_state(self, batch_size, dtype): state_size_ = self.state_size return _zero_state_tensors(state_size_, batch_size, dtype) def __call__(self, query, state): with tf.variable_scope("GmmAttention"): previous_kappa = state params = self.query_layer(query) # query(dec_rnn_size=256) , params(num_mixtures(256)*3) alpha_hat, beta_hat, kappa_hat = tf.split(params, num_or_size_splits=3, axis=1) # [batch_size, num_mixtures, 1] alpha = tf.expand_dims(tf.exp(alpha_hat), axis=2) # softmax makes the alpha value more stable. 
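# Illustrative sketch (not from the original code): this is Graves-style GMM attention.
# For every decoder step the alignment over encoder positions j is
#   phi(j) = sum_k alpha_k * exp(-beta_k * (kappa_k - j)^2),
# and kappa is previous_kappa + exp(kappa_hat), so the window centres can only move
# forward, which keeps the attention monotonic. One-component toy window:
_demo_j = np.arange(5, dtype=np.float32)                       # encoder positions 0..4
_demo_phi = 1.0 * np.exp(-0.5 * (2.0 - _demo_j) ** 2)          # bump centred on position 2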
# alpha = tf.expand_dims(tf.nn.softmax(alpha_hat, axis=1), axis=2) beta = tf.expand_dims(tf.exp(beta_hat), axis=2) kappa = tf.expand_dims(previous_kappa + tf.exp(kappa_hat), axis=2) # [1, 1, max_input_steps] mu = tf.reshape(tf.cast(tf.range(self.alignments_size), dtype=tf.float32), shape=[1, 1, self.alignments_size]) # [[[0,1,2,...]]] # [batch_size, max_input_steps] phi = tf.reduce_sum(alpha * tf.exp(-beta * (kappa - mu) ** 2.), axis=1) alignments = self._maybe_mask_score(phi) state = tf.squeeze(kappa, axis=2) return alignments, state ================================================ FILE: tacotron2/tacotron2.py ================================================ # coding: utf-8 # Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py """ 모델 수정 1. prenet에서 dropout 적용 오류 수정 2. AttentionWrapper 적용 순서 오류 수정: keith ito 코드는 잘 구현되어 있음 3. BahdanauMonotonicAttention에서 normalize=True적용(2018년9월11일 적용) 4. BahdanauMonotonicAttention에서 memory_sequence_length 입력 5. synhesizer.py input_lengths 계산오류. +1 해야 함. """ import numpy as np import tensorflow as tf from tensorflow.contrib.seq2seq import BasicDecoder, BahdanauAttention, BahdanauMonotonicAttention,LuongAttention from tensorflow.contrib.rnn import GRUCell, MultiRNNCell, OutputProjectionWrapper, ResidualWrapper,LSTMStateTuple from utils.infolog import log from text.symbols import symbols from .modules import * from .helpers import TacoTestHelper, TacoTrainingHelper from .rnn_wrappers import LocationSensitiveAttention,GmmAttention,ZoneoutLSTMCell,DecoderWrapper class Tacotron2(): def __init__(self, hparams): self._hparams = hparams def initialize(self, inputs, input_lengths, num_speakers, speaker_id=None,mel_targets=None, linear_targets=None, is_training= False,loss_coeff=None,stop_token_targets=None): with tf.variable_scope('Eembedding') as scope: hp = self._hparams batch_size = tf.shape(inputs)[0] # Embeddings(256) char_embed_table = tf.get_variable('inputs_embedding', [len(symbols), hp.embedding_size], dtype=tf.float32,initializer=tf.truncated_normal_initializer(stddev=0.5)) zero_pad = True if zero_pad: # transformer에 구현되어 있는 거 보고, 가져온 로직. # 0 은 embedding이 0으로 고정되고, train으로 변하지 않는다. 
즉, 위의 get_variable에서 잡았던 변수의 첫번째 행()에 대응되는 것은 사용되지 않는 것이다) char_embed_table = tf.concat((tf.zeros(shape=[1, hp.embedding_size]),char_embed_table[1:, :]), 0) # [N, T_in, embedding_size] char_embedded_inputs = tf.nn.embedding_lookup(char_embed_table, inputs) self.num_speakers = num_speakers if self.num_speakers > 1: speaker_embed_table = tf.get_variable('speaker_embedding',[self.num_speakers, hp.speaker_embedding_size], dtype=tf.float32,initializer=tf.truncated_normal_initializer(stddev=0.5)) # [N, T_in, speaker_embedding_size] speaker_embed = tf.nn.embedding_lookup(speaker_embed_table, speaker_id) deep_dense = lambda x, dim,name: tf.layers.dense(x, dim, activation=tf.nn.softsign,name=name) # softsign: x / (abs(x) + 1) encoder_rnn_init_state = deep_dense( speaker_embed, hp.encoder_lstm_units * 4,'encoder_init_dense') # hp.encoder_lstm_units = 256 decoder_rnn_init_states = [deep_dense(speaker_embed, hp.decoder_lstm_units*2,'decoder_init_dense_{}'.format(i)) for i in range(hp.decoder_layers)] # hp.decoder_lstm_units = 1024 speaker_embed = None else: # self.num_speakers =1인 경우 speaker_embed = None encoder_rnn_init_state = None # bidirectional GRU의 init state attention_rnn_init_state = None decoder_rnn_init_states = None with tf.variable_scope('Encoder') as scope: ############## # Encoder ############## x = char_embedded_inputs for i in range(hp.enc_conv_num_layers): x = tf.layers.conv1d(x,filters=hp.enc_conv_channels,kernel_size=hp.enc_conv_kernel_size,padding='same',activation=tf.nn.relu,name='Encoder_{}'.format(i)) x = tf.layers.batch_normalization(x, training=is_training) x = tf.layers.dropout(x, rate=hp.dropout_prob, training=is_training, name='dropout_{}'.format(i)) if encoder_rnn_init_state is not None: initial_state_fw_c,initial_state_fw_h, initial_state_bw_c,initial_state_bw_h = tf.split(encoder_rnn_init_state, 4, 1) initial_state_fw = LSTMStateTuple(initial_state_fw_c,initial_state_fw_h) initial_state_bw = LSTMStateTuple(initial_state_bw_c,initial_state_bw_h) else: # single mode initial_state_fw, initial_state_bw = None, None cell_fw= ZoneoutLSTMCell(hp.encoder_lstm_units, is_training,zoneout_factor_cell=hp.tacotron_zoneout_rate,zoneout_factor_output=hp.tacotron_zoneout_rate,name='encoder_fw_LSTM') cell_bw= ZoneoutLSTMCell(hp.encoder_lstm_units, is_training,zoneout_factor_cell=hp.tacotron_zoneout_rate,zoneout_factor_output=hp.tacotron_zoneout_rate,name='encoder_fw_LSTM') encoder_conv_output = x outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,encoder_conv_output,sequence_length=input_lengths, initial_state_fw=initial_state_fw,initial_state_bw=initial_state_bw,dtype=tf.float32) # envoder_outpust = [N,T,2*encoder_lstm_units] = [N,T,512] encoder_outputs = tf.concat(outputs, axis=2) # Concat and return forward + backward outputs with tf.variable_scope('Decoder') as scope: ############## # Attention ############## if hp.attention_type == 'bah_mon': attention_mechanism = BahdanauMonotonicAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths,normalize=False) elif hp.attention_type == 'bah_mon_norm': # hccho 추가 attention_mechanism = BahdanauMonotonicAttention(hp.attention_size, encoder_outputs,memory_sequence_length = input_lengths, normalize=True) elif hp.attention_type == 'loc_sen': # Location Sensitivity Attention attention_mechanism = LocationSensitiveAttention(hp.attention_size, encoder_outputs,hparams=hp, is_training=is_training, mask_encoder=hp.mask_encoder,memory_sequence_length = 
input_lengths,smoothing=hp.smoothing,cumulate_weights=hp.cumulative_weights) elif hp.attention_type == 'gmm': # GMM Attention attention_mechanism = GmmAttention(hp.attention_size, memory=encoder_outputs,memory_sequence_length = input_lengths) elif hp.attention_type == 'bah_norm': attention_mechanism = BahdanauAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths, normalize=True) elif hp.attention_type == 'luong_scaled': attention_mechanism = LuongAttention( hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths, scale=True) elif hp.attention_type == 'luong': attention_mechanism = LuongAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths) elif hp.attention_type == 'bah': attention_mechanism = BahdanauAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths) else: raise Exception(" [!] Unkown attention type: {}".format(hp.attention_type)) decoder_lstm = [ZoneoutLSTMCell(hp.decoder_lstm_units, is_training,zoneout_factor_cell=hp.tacotron_zoneout_rate, zoneout_factor_output=hp.tacotron_zoneout_rate,name='decoder_LSTM_{}'.format(i+1)) for i in range(hp.decoder_layers)] decoder_lstm = tf.contrib.rnn.MultiRNNCell(decoder_lstm, state_is_tuple=True) decoder_init_state = decoder_lstm.zero_state(batch_size=batch_size, dtype=tf.float32) # 여기서 zero_state를 부르면, 위의 AttentionWrapper에서 이미 넣은 준 값도 포함되어 있다. if hp.model_type == "multi-speaker": decoder_init_state = list(decoder_init_state) for idx, cell in enumerate(decoder_rnn_init_states): shape1 = decoder_init_state[idx][0].get_shape().as_list() shape2 = cell.get_shape().as_list() if shape1[1]*2 != shape2[1]: raise Exception(" [!] Shape {} and {} should be equal".format(shape1, shape2)) c,h = tf.split(cell,2,1) decoder_init_state[idx] = LSTMStateTuple(c,h) decoder_init_state = tuple(decoder_init_state) attention_cell = AttentionWrapper(decoder_lstm,attention_mechanism, initial_cell_state=decoder_init_state, alignment_history=True,output_attention=False) # output_attention=False 에 주목, attention_layer_size에 값을 넣지 않았다. 그래서 attention = contex vector가 된다. 
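# Illustrative sketch (not from the original code): with output_attention=False the
# AttentionWrapper returns the raw decoder-LSTM output, DecoderWrapper concatenates the
# context vector onto it, and OutputProjectionWrapper (below) maps that to
# (num_mels + 1) * reduction_factor values per decoder step: reduction_factor mel frames
# plus reduction_factor stop-token logits, which are split apart again further down.
# Toy shape check with made-up sizes (num_mels=80, r=2, 2 utterances, 7 decoder steps):
_demo_dec = np.zeros([2, 7, (80 + 1) * 2])                     # [N, iters, (num_mels+1)*r]
_demo_mel_out = _demo_dec[:, :, :80 * 2].reshape(2, -1, 80)    # -> [N, iters*r, num_mels]
_demo_stop = _demo_dec[:, :, 80 * 2:].reshape(2, -1)           # -> [N, iters*r]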
# attention_state_size = 256 # Decoder input -> prenet -> decoder_lstm -> concat[output, attention] dec_prenet_outputs = DecoderWrapper(attention_cell , is_training, hp.dec_prenet_sizes, hp.dropout_prob,hp.inference_prenet_dropout) dec_outputs_cell = OutputProjectionWrapper(dec_prenet_outputs,(hp.num_mels+1) * hp.reduction_factor) if is_training: helper = TacoTrainingHelper(mel_targets, hp.num_mels, hp.reduction_factor) # inputs은 batch_size 계산에만 사용됨 else: helper = TacoTestHelper(batch_size, hp.num_mels, hp.reduction_factor) decoder_init_state = dec_outputs_cell.zero_state(batch_size=batch_size, dtype=tf.float32) (decoder_outputs, _), final_decoder_state, _ = \ tf.contrib.seq2seq.dynamic_decode(BasicDecoder(dec_outputs_cell, helper, decoder_init_state),maximum_iterations=int(hp.max_n_frame/hp.reduction_factor)) # max_iters=200 decoder_mel_outputs = tf.reshape(decoder_outputs[:,:,:hp.num_mels * hp.reduction_factor], [batch_size, -1, hp.num_mels]) # [N,iters,400] -> [N,5*iters,80] stop_token_outputs = tf.reshape(decoder_outputs[:,:,hp.num_mels * hp.reduction_factor:], [batch_size, -1]) # [N,iters] # Postnet x = decoder_mel_outputs for i in range(hp.postnet_num_layers): activation = tf.nn.tanh if i != (hp.postnet_num_layers-1) else None x = tf.layers.conv1d(x,filters=hp.postnet_channels,kernel_size=hp.postnet_kernel_size,padding='same',activation=activation,name='Postnet_{}'.format(i)) x = tf.layers.batch_normalization(x, training=is_training) x = tf.layers.dropout(x, rate=hp.dropout_prob, training=is_training, name='Postnet_dropout_{}'.format(i)) residual = tf.layers.dense(x,hp.num_mels,name='residual_projection') mel_outputs = decoder_mel_outputs + residual # Add post-processing CBHG: # mel_outputs: (N,T,num_mels) post_outputs = cbhg(mel_outputs, None, is_training,hp.post_bank_size, hp.post_bank_channel_size, hp.post_maxpool_width, hp.post_highway_depth, hp.post_rnn_size, hp.post_proj_sizes, hp.post_proj_width,scope='post_cbhg') linear_outputs = tf.layers.dense(post_outputs, hp.num_freq,name='linear_spectogram_projection') # [N, T_out, F(1025)] # Grab alignments from the final decoder state: alignments = tf.transpose(final_decoder_state.alignment_history.stack(), [1, 2, 0]) # batch_size, text length(encoder), target length(decoder) self.inputs = inputs self.speaker_id = speaker_id self.input_lengths = input_lengths self.loss_coeff = loss_coeff self.decoder_mel_outputs = decoder_mel_outputs self.mel_outputs = mel_outputs self.linear_outputs = linear_outputs self.alignments = alignments self.mel_targets = mel_targets self.linear_targets = linear_targets self.final_decoder_state = final_decoder_state self.stop_token_targets = stop_token_targets self.stop_token_outputs = stop_token_outputs self.all_vars = tf.trainable_variables() log('='*40) log(' model_type: %s' % hp.model_type) log('='*40) log('Initialized Tacotron model. 
Dimensions: ') log(' embedding: %d' % char_embedded_inputs.shape[-1]) log(' encoder conv out: %d' % encoder_conv_output.shape[-1]) log(' encoder out: %d' % encoder_outputs.shape[-1]) log(' attention out: %d' % attention_cell.output_size) log(' decoder prenet lstm concat out : %d' % dec_prenet_outputs.output_size) log(' decoder cell out: %d' % dec_outputs_cell.output_size) log(' decoder out (%d frames): %d' % (hp.reduction_factor, decoder_outputs.shape[-1])) log(' decoder mel out: %d' % decoder_mel_outputs.shape[-1]) log(' mel out: %d' % mel_outputs.shape[-1]) log(' postnet out: %d' % post_outputs.shape[-1]) log(' linear out: %d' % linear_outputs.shape[-1]) log(' Tacotron Parameters {:.3f} Million.'.format(np.sum([np.prod(v.get_shape().as_list()) for v in self.all_vars]) / 1000000)) def add_loss(self): '''Adds loss to the model. Sets "loss" field. initialize must have been called.''' with tf.variable_scope('loss') as scope: hp = self._hparams before = tf.squared_difference(self.mel_targets, self.decoder_mel_outputs) after = tf.squared_difference(self.mel_targets, self.mel_outputs) mel_loss = before+after stop_token_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.stop_token_targets, logits=self.stop_token_outputs)) l1 = tf.abs(self.linear_targets - self.linear_outputs) expanded_loss_coeff = tf.expand_dims(tf.expand_dims(self.loss_coeff, [-1]), [-1]) regularization_loss = tf.reduce_mean([tf.nn.l2_loss(v) for v in self.all_vars if not('bias' in v.name or 'Bias' in v.name or 'projection' in v.name or 'inputs_embedding' in v.name or 'speaker_embedding' in v.name or 'dense' in v.name or 'RNN' in v.name or 'LSTM' in v.name)]) * hp.tacotron_reg_weight regularization_loss = 0 if hp.prioritize_loss: # Prioritize loss for frequencies. upper_priority_freq = int(5000 / (hp.sample_rate * 0.5) * hp.num_freq) lower_priority_freq = int(165 / (hp.sample_rate * 0.5) * hp.num_freq) l1_priority= l1[:,:,lower_priority_freq:upper_priority_freq] self.loss = tf.reduce_mean(mel_loss * expanded_loss_coeff) + \ 0.5 * tf.reduce_mean(l1 * expanded_loss_coeff) + 0.5 * tf.reduce_mean(l1_priority * expanded_loss_coeff) + stop_token_loss + regularization_loss self.linear_loss = tf.reduce_mean( 0.5 * (tf.reduce_mean(l1) + tf.reduce_mean(l1_priority))) else: self.loss = tf.reduce_mean(mel_loss * expanded_loss_coeff) + tf.reduce_mean(l1 * expanded_loss_coeff) + stop_token_loss + regularization_loss # 이 loss는 사용하지 않고, 아래의 loss_without_coeff를 사용함 self.linear_loss = tf.reduce_mean(l1) self.mel_loss = tf.reduce_mean(mel_loss) self.loss_without_coeff = self.mel_loss + self.linear_loss + stop_token_loss + regularization_loss def add_optimizer(self, global_step): '''Adds optimizer. Sets "gradients" and "optimize" fields. add_loss must have been called. Args: global_step: int32 scalar Tensor representing current global step in training ''' with tf.variable_scope('optimizer') as scope: hp = self._hparams if hp.tacotron_decay_learning_rate: self.decay_steps = hp.tacotron_decay_steps self.decay_rate = hp.tacotron_decay_rate self.learning_rate = self._learning_rate_decay(hp.tacotron_initial_learning_rate, global_step) else: self.learning_rate = tf.convert_to_tensor(hp.tacotron_initial_learning_rate) optimizer = tf.train.AdamOptimizer(self.learning_rate, hp.adam_beta1, hp.adam_beta2) gradients, variables = zip(*optimizer.compute_gradients(self.loss)) self.gradients = gradients clipped_gradients, _ = tf.clip_by_global_norm(gradients, 1.0) # Add dependency on UPDATE_OPS; otherwise batchnorm won't work correctly. 
See: # https://github.com/tensorflow/tensorflow/issues/1122 with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): self.optimize = optimizer.apply_gradients(zip(clipped_gradients, variables),global_step=global_step) def _learning_rate_decay(self, init_lr, global_step): ################################################################# # Narrow Exponential Decay: # Phase 1: lr = 1e-3 # We only start learning rate decay after 50k steps # Phase 2: lr in ]1e-5, 1e-3[ # decay reach minimal value at step 310k # Phase 3: lr = 1e-5 # clip by minimal learning rate value (step > 310k) ################################################################# hp = self._hparams #Compute natural exponential decay lr = tf.train.exponential_decay(init_lr, global_step - hp.tacotron_start_decay, #lr = 1e-3 at step 50k self.decay_steps, self.decay_rate, #lr = 1e-5 around step 310k name='lr_exponential_decay') #clip learning rate by max and min values (initial and final values) return tf.minimum(tf.maximum(lr, hp.tacotron_final_learning_rate), init_lr) ================================================ FILE: text/__init__.py ================================================ # coding: utf-8 import re import string import numpy as np from text import cleaners from hparams import hparams from text.symbols import symbols, en_symbols, PAD, EOS from text.korean import jamo_to_korean # Mappings from symbol to numeric ID and vice versa: _symbol_to_id = {s: i for i, s in enumerate(symbols)} # 80개 _id_to_symbol = {i: s for i, s in enumerate(symbols)} isEn=False # Regular expression matching text enclosed in curly braces: _curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)') puncuation_table = str.maketrans({key: None for key in string.punctuation}) def convert_to_en_symbols(): '''Converts built-in korean symbols to english, to be used for english training ''' global _symbol_to_id, _id_to_symbol, isEn if not isEn: print(" [!] Converting to english mode") _symbol_to_id = {s: i for i, s in enumerate(en_symbols)} _id_to_symbol = {i: s for i, s in enumerate(en_symbols)} isEn=True def remove_puncuations(text): return text.translate(puncuation_table) def text_to_sequence(text, as_token=False): cleaner_names = [x.strip() for x in hparams.cleaners.split(',')] if ('english_cleaners' in cleaner_names) and isEn==False: convert_to_en_symbols() return _text_to_sequence(text, cleaner_names, as_token) def _text_to_sequence(text, cleaner_names, as_token): '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. The text can optionally have ARPAbet sequences enclosed in curly braces embedded in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street." 
Args: text: string to convert to a sequence cleaner_names: names of the cleaner functions to run the text through Returns: List of integers corresponding to the symbols in the text ''' sequence = [] # Check for curly braces and treat their contents as ARPAbet: while len(text): m = _curly_re.match(text) if not m: sequence += _symbols_to_sequence(_clean_text(text, cleaner_names)) break sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names)) sequence += _arpabet_to_sequence(m.group(2)) text = m.group(3) # Append EOS token sequence.append(_symbol_to_id[EOS]) # [14, 29, 45, 2, 27, 62, 20, 21, 4, 39, 45, 1] if as_token: return sequence_to_text(sequence, combine_jamo=True) else: return np.array(sequence, dtype=np.int32) def sequence_to_text(sequence, skip_eos_and_pad=False, combine_jamo=False): '''Converts a sequence of IDs back to a string''' cleaner_names=[x.strip() for x in hparams.cleaners.split(',')] if 'english_cleaners' in cleaner_names and isEn==False: convert_to_en_symbols() result = '' for symbol_id in sequence: if symbol_id in _id_to_symbol: s = _id_to_symbol[symbol_id] # Enclose ARPAbet back in curly braces: if len(s) > 1 and s[0] == '@': s = '{%s}' % s[1:] if not skip_eos_and_pad or s not in [EOS, PAD]: result += s result = result.replace('}{', ' ') if combine_jamo: return jamo_to_korean(result) else: return result def _clean_text(text, cleaner_names): for name in cleaner_names: cleaner = getattr(cleaners, name) if not cleaner: raise Exception('Unknown cleaner: %s' % name) text = cleaner(text) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~'] return text def _symbols_to_sequence(symbols): return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)] def _arpabet_to_sequence(text): return _symbols_to_sequence(['@' + s for s in text.split()]) def _should_keep_symbol(s): return s in _symbol_to_id and s is not '_' and s is not '~' ================================================ FILE: text/cleaners.py ================================================ # coding: utf-8 # Code based on https://github.com/keithito/tacotron/blob/master/text/cleaners.py ''' Cleaners are transformations that run over the input text at both training and eval time. Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners" hyperparameter. Some cleaners are English-specific. You'll typically want to use: 1. "english_cleaners" for English text 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using the Unidecode library (https://pypi.python.org/pypi/Unidecode) 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update the symbols in symbols.py to match your data). ''' import re from .korean import tokenize as ko_tokenize # Added to support LJ_speech from unidecode import unidecode from .en_numbers import normalize_numbers as en_normalize_numbers # Regular expression matching whitespace: _whitespace_re = re.compile(r'\s+') def korean_cleaners(text): '''Pipeline for Korean text, including number and abbreviation expansion.''' text = ko_tokenize(text) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~'] return text # List of (regular expression, replacement) pairs for abbreviations: _abbreviations = [(re.compile('\\b%s\\.' 
% x[0], re.IGNORECASE), x[1]) for x in [ ('mrs', 'misess'), ('mr', 'mister'), ('dr', 'doctor'), ('st', 'saint'), ('co', 'company'), ('jr', 'junior'), ('maj', 'major'), ('gen', 'general'), ('drs', 'doctors'), ('rev', 'reverend'), ('lt', 'lieutenant'), ('hon', 'honorable'), ('sgt', 'sergeant'), ('capt', 'captain'), ('esq', 'esquire'), ('ltd', 'limited'), ('col', 'colonel'), ('ft', 'fort'), ]] def expand_abbreviations(text): for regex, replacement in _abbreviations: text = re.sub(regex, replacement, text) return text def expand_numbers(text): return en_normalize_numbers(text) def lowercase(text): return text.lower() def collapse_whitespace(text): return re.sub(_whitespace_re, ' ', text) def convert_to_ascii(text): '''Converts to ascii, existed in keithito but deleted in carpedm20''' return unidecode(text) def basic_cleaners(text): '''Basic pipeline that lowercases and collapses whitespace without transliteration.''' text = lowercase(text) text = collapse_whitespace(text) return text def transliteration_cleaners(text): '''Pipeline for non-English text that transliterates to ASCII.''' text = convert_to_ascii(text) text = lowercase(text) text = collapse_whitespace(text) return text def english_cleaners(text): '''Pipeline for English text, including number and abbreviation expansion.''' text = convert_to_ascii(text) text = lowercase(text) text = expand_numbers(text) text = expand_abbreviations(text) text = collapse_whitespace(text) return text ================================================ FILE: text/en_numbers.py ================================================ import inflect import re _inflect = inflect.engine() _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') _number_re = re.compile(r'[0-9]+') def _remove_commas(m): return m.group(1).replace(',', '') def _expand_decimal_point(m): return m.group(1).replace('.', ' point ') def _expand_dollars(m): match = m.group(1) parts = match.split('.') if len(parts) > 2: return match + ' dollars' # Unexpected format dollars = int(parts[0]) if parts[0] else 0 cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 if dollars and cents: dollar_unit = 'dollar' if dollars == 1 else 'dollars' cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) elif dollars: dollar_unit = 'dollar' if dollars == 1 else 'dollars' return '%s %s' % (dollars, dollar_unit) elif cents: cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s' % (cents, cent_unit) else: return 'zero dollars' def _expand_ordinal(m): return _inflect.number_to_words(m.group(0)) def _expand_number(m): num = int(m.group(0)) if num > 1000 and num < 3000: if num == 2000: return 'two thousand' elif num > 2000 and num < 2010: return 'two thousand ' + _inflect.number_to_words(num % 100) elif num % 100 == 0: return _inflect.number_to_words(num // 100) + ' hundred' else: return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') else: return _inflect.number_to_words(num, andword='') def normalize_numbers(text): text = re.sub(_comma_number_re, _remove_commas, text) text = re.sub(_pounds_re, r'\1 pounds', text) text = re.sub(_dollars_re, _expand_dollars, text) text = re.sub(_decimal_number_re, _expand_decimal_point, text) text = re.sub(_ordinal_re, _expand_ordinal, text) text = re.sub(_number_re, _expand_number, text) 
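# Note on ordering (illustrative): pounds/dollars are substituted before plain decimals and
# bare numbers, so e.g. "$2.50" is read as "2 dollars, 50 cents" rather than as a decimal.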
return text ================================================ FILE: text/english.py ================================================ # Code from https://github.com/keithito/tacotron/blob/master/util/numbers.py import inflect _inflect = inflect.engine() _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') _number_re = re.compile(r'[0-9]+') def _remove_commas(m): return m.group(1).replace(',', '') def _expand_decimal_point(m): return m.group(1).replace('.', ' point ') def _expand_dollars(m): match = m.group(1) parts = match.split('.') if len(parts) > 2: return match + ' dollars' # Unexpected format dollars = int(parts[0]) if parts[0] else 0 cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 if dollars and cents: dollar_unit = 'dollar' if dollars == 1 else 'dollars' cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) elif dollars: dollar_unit = 'dollar' if dollars == 1 else 'dollars' return '%s %s' % (dollars, dollar_unit) elif cents: cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s' % (cents, cent_unit) else: return 'zero dollars' def _expand_ordinal(m): return _inflect.number_to_words(m.group(0)) def _expand_number(m): num = int(m.group(0)) if num > 1000 and num < 3000: if num == 2000: return 'two thousand' elif num > 2000 and num < 2010: return 'two thousand ' + _inflect.number_to_words(num % 100) elif num % 100 == 0: return _inflect.number_to_words(num // 100) + ' hundred' else: return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') else: return _inflect.number_to_words(num, andword='') def normalize(text): text = re.sub(_comma_number_re, _remove_commas, text) text = re.sub(_pounds_re, r'\1 pounds', text) text = re.sub(_dollars_re, _expand_dollars, text) text = re.sub(_decimal_number_re, _expand_decimal_point, text) text = re.sub(_ordinal_re, _expand_ordinal, text) text = re.sub(_number_re, _expand_number, text) return text ================================================ FILE: text/ko_dictionary.py ================================================ # coding: utf-8 etc_dictionary = { '2 30대': '이삼십대', '20~30대': '이삼십대', '20, 30대': '이십대 삼십대', '1+1': '원플러스원', '3에서 6개월인': '3개월에서 육개월인', } english_dictionary = { 'Devsisters': '데브시스터즈', 'track': '트랙', # krbook 'LA': '엘에이', 'LG': '엘지', 'KOREA': '코리아', 'JSA': '제이에스에이', 'PGA': '피지에이', 'GA': '지에이', 'idol': '아이돌', 'KTX': '케이티엑스', 'AC': '에이씨', 'DVD': '디비디', 'US': '유에스', 'CNN': '씨엔엔', 'LPGA': '엘피지에이', 'P': '피', 'L': '엘', 'T': '티', 'B': '비', 'C': '씨', 'BIFF': '비아이에프에프', 'GV': '지비', # JTBC 'IT': '아이티', 'IQ': '아이큐', 'JTBC': '제이티비씨', 'trickle down effect': '트리클 다운 이펙트', 'trickle up effect': '트리클 업 이펙트', 'down': '다운', 'up': '업', 'FCK': '에프씨케이', 'AP': '에이피', 'WHERETHEWILDTHINGSARE': '', 'Rashomon Effect': '', 'O': '오', 'OO': '오오', 'B': '비', 'GDP': '지디피', 'CIPA': '씨아이피에이', 'YS': '와이에스', 'Y': '와이', 'S': '에스', 'JTBC': '제이티비씨', 'PC': '피씨', 'bill': '빌', 'Halmuny': '하모니', ##### 'X': '엑스', 'SNS': '에스엔에스', 'ability': '어빌리티', 'shy': '', 'CCTV': '씨씨티비', 'IT': '아이티', 'the tenth man': '더 텐쓰 맨', #### 'L': '엘', 'PC': '피씨', 'YSDJJPMB': '', ######## 'Content Attitude Timing': '컨텐트 애티튜드 타이밍', 'CAT': '캣', 'IS': '아이에스', 'SNS': '에스엔에스', 'K': '케이', 'Y': '와이', 'KDI': '케이디아이', 'DOC': '디오씨', 'CIA': '씨아이에이', 'PBS': '피비에스', 'D': '디', 'PPropertyPositionPowerPrisonP' 'S': '에스', 
'francisco': '프란시스코', 'I': '아이', 'III': '아이아이', ###### 'No joke': '노 조크', 'BBK': '비비케이', 'LA': '엘에이', 'Don': '', 't worry be happy': ' 워리 비 해피', 'NO': '엔오', ##### 'it was our sky': '잇 워즈 아워 스카이', 'it is our sky': '잇 이즈 아워 스카이', #### 'NEIS': '엔이아이에스', ##### 'IMF': '아이엠에프', 'apology': '어폴로지', 'humble': '험블', 'M': '엠', 'Nowhere Man': '노웨어 맨', 'The Tenth Man': '더 텐쓰 맨', 'PBS': '피비에스', 'BBC': '비비씨', 'MRJ': '엠알제이', 'CCTV': '씨씨티비', 'Pick me up': '픽 미 업', 'DNA': '디엔에이', 'UN': '유엔', 'STOP': '스탑', ##### 'PRESS': '프레스', ##### 'not to be': '낫 투비', 'Denial': '디나이얼', 'G': '지', 'IMF': '아이엠에프', 'GDP': '지디피', 'JTBC': '제이티비씨', 'Time flies like an arrow': '타임 플라이즈 라이크 언 애로우', 'DDT': '디디티', 'AI': '에이아이', 'Z': '제트', 'OECD': '오이씨디', 'N': '앤', 'A': '에이', 'MB': '엠비', 'EH': '이에이치', 'IS': '아이에스', 'TV': '티비', 'MIT': '엠아이티', 'KBO': '케이비오', 'I love America': '아이 러브 아메리카', 'SF': '에스에프', 'Q': '큐', 'KFX': '케이에프엑스', 'PM': '피엠', 'Prime Minister': '프라임 미니스터', 'Swordline': '스워드라인', 'TBS': '티비에스', 'DDT': '디디티', 'CS': '씨에스', 'Reflecting Absence': '리플렉팅 앱센스', 'PBS': '피비에스', 'Drum being beaten by everyone': '드럼 빙 비튼 바이 에브리원', 'negative pressure': '네거티브 프레셔', 'F': '에프', 'KIA': '기아', 'FTA': '에프티에이', 'Que sais-je': '', 'UFC': '유에프씨', 'P': '피', 'DJ': '디제이', 'Chaebol': '채벌', 'BBC': '비비씨', 'OECD': '오이씨디', 'BC': '삐씨', 'C': '씨', 'B': '씨', 'KY': '케이와이', 'K': '케이', 'CEO': '씨이오', 'YH': '와이에치', 'IS': '아이에스', 'who are you': '후 얼 유', 'Y': '와이', 'The Devils Advocate': '더 데빌즈 어드보카트', 'YS': '와이에스', 'so sorry': '쏘 쏘리', 'Santa': '산타', 'Big Endian': '빅 엔디안', 'Small Endian': '스몰 엔디안', 'Oh Captain My Captain': '오 캡틴 마이 캡틴', 'AIB': '에이아이비', 'K': '케이', 'PBS': '피비에스', } ================================================ FILE: text/korean.py ================================================ # coding: utf-8 # Code based on import re import os import ast import json from jamo import hangul_to_jamo, h2j, j2h from .ko_dictionary import english_dictionary, etc_dictionary PAD = '_' EOS = '~' PUNC = '!\'(),-.:;?' 
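# The jamo symbol set defined below is built from the Unicode Hangul Jamo block:
# leads U+1100-U+1112, vowels U+1161-U+1175, tails U+11A8-U+11C2
# (lead and tail consonants look alike but are distinct code points).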
SPACE = ' ' JAMO_LEADS = "".join([chr(_) for _ in range(0x1100, 0x1113)]) JAMO_VOWELS = "".join([chr(_) for _ in range(0x1161, 0x1176)]) JAMO_TAILS = "".join([chr(_) for _ in range(0x11A8, 0x11C3)]) VALID_CHARS = JAMO_LEADS + JAMO_VOWELS + JAMO_TAILS + PUNC + SPACE ALL_SYMBOLS = PAD + EOS + VALID_CHARS char_to_id = {c: i for i, c in enumerate(ALL_SYMBOLS)} id_to_char = {i: c for i, c in enumerate(ALL_SYMBOLS)} quote_checker = """([`"'"“‘])(.+?)([`"'"”’])""" def is_lead(char): return char in JAMO_LEADS def is_vowel(char): return char in JAMO_VOWELS def is_tail(char): return char in JAMO_TAILS def get_mode(char): if is_lead(char): return 0 elif is_vowel(char): return 1 elif is_tail(char): return 2 else: return -1 def _get_text_from_candidates(candidates): if len(candidates) == 0: return "" elif len(candidates) == 1: return _jamo_char_to_hcj(candidates[0]) else: return j2h(**dict(zip(["lead", "vowel", "tail"], candidates))) def jamo_to_korean(text): text = h2j(text) idx = 0 new_text = "" candidates = [] while True: if idx >= len(text): new_text += _get_text_from_candidates(candidates) break char = text[idx] mode = get_mode(char) if mode == 0: new_text += _get_text_from_candidates(candidates) candidates = [char] elif mode == -1: new_text += _get_text_from_candidates(candidates) new_text += char candidates = [] else: candidates.append(char) idx += 1 return new_text num_to_kor = { '0': '영', '1': '일', '2': '이', '3': '삼', '4': '사', '5': '오', '6': '육', '7': '칠', '8': '팔', '9': '구', } unit_to_kor1 = { '%': '퍼센트', 'cm': '센치미터', 'mm': '밀리미터', 'km': '킬로미터', 'kg': '킬로그람', } unit_to_kor2 = { 'm': '미터', } upper_to_kor = { 'A': '에이', 'B': '비', 'C': '씨', 'D': '디', 'E': '이', 'F': '에프', 'G': '지', 'H': '에이치', 'I': '아이', 'J': '제이', 'K': '케이', 'L': '엘', 'M': '엠', 'N': '엔', 'O': '오', 'P': '피', 'Q': '큐', 'R': '알', 'S': '에스', 'T': '티', 'U': '유', 'V': '브이', 'W': '더블유', 'X': '엑스', 'Y': '와이', 'Z': '지', } def compare_sentence_with_jamo(text1, text2): return h2j(text1) != h2j(text) def tokenize(text, as_id=False): # jamo package에 있는 hangul_to_jamo를 이용하여 한글 string을 초성/중성/종성으로 나눈다. 
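# (i.e. normalize the text first, then split the Hangul string into lead/vowel/tail jamo with
#  hangul_to_jamo from the `jamo` package; an EOS symbol is appended at the end.)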
text = normalize(text) tokens = list(hangul_to_jamo(text)) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~'] if as_id: return [char_to_id[token] for token in tokens] + [char_to_id[EOS]] else: return [token for token in tokens] + [EOS] def tokenizer_fn(iterator): return (token for x in iterator for token in tokenize(x, as_id=False)) def normalize(text): text = text.strip() text = re.sub('\(\d+일\)', '', text) text = re.sub('\([⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]+\)', '', text) text = normalize_with_dictionary(text, etc_dictionary) text = normalize_english(text) text = re.sub('[a-zA-Z]+', normalize_upper, text) text = normalize_quote(text) text = normalize_number(text) return text def normalize_with_dictionary(text, dic): if any(key in text for key in dic.keys()): pattern = re.compile('|'.join(re.escape(key) for key in dic.keys())) return pattern.sub(lambda x: dic[x.group()], text) else: return text def normalize_english(text): def fn(m): word = m.group() if word in english_dictionary: return english_dictionary.get(word) else: return word text = re.sub("([A-Za-z]+)", fn, text) return text def normalize_upper(text): text = text.group(0) if all([char.isupper() for char in text]): return "".join(upper_to_kor[char] for char in text) else: return text def normalize_quote(text): def fn(found_text): from nltk import sent_tokenize # NLTK doesn't along with multiprocessing found_text = found_text.group() unquoted_text = found_text[1:-1] sentences = sent_tokenize(unquoted_text) return " ".join(["'{}'".format(sent) for sent in sentences]) return re.sub(quote_checker, fn, text) number_checker = "([+-]?\d[\d,]*)[\.]?\d*" count_checker = "(시|명|가지|살|마리|포기|송이|수|톨|통|점|개|벌|척|채|다발|그루|자루|줄|켤레|그릇|잔|마디|상자|사람|곡|병|판)" def normalize_number(text): text = normalize_with_dictionary(text, unit_to_kor1) text = normalize_with_dictionary(text, unit_to_kor2) text = re.sub(number_checker + count_checker, lambda x: number_to_korean(x, True), text) text = re.sub(number_checker, lambda x: number_to_korean(x, False), text) return text num_to_kor1 = [""] + list("일이삼사오육칠팔구") num_to_kor2 = [""] + list("만억조경해") num_to_kor3 = [""] + list("십백천") #count_to_kor1 = [""] + ["하나","둘","셋","넷","다섯","여섯","일곱","여덟","아홉"] count_to_kor1 = [""] + ["한","두","세","네","다섯","여섯","일곱","여덟","아홉"] count_tenth_dict = { "십": "열", "두십": "스물", "세십": "서른", "네십": "마흔", "다섯십": "쉰", "여섯십": "예순", "일곱십": "일흔", "여덟십": "여든", "아홉십": "아흔", } def number_to_korean(num_str, is_count=False): if is_count: num_str, unit_str = num_str.group(1), num_str.group(2) else: num_str, unit_str = num_str.group(), "" num_str = num_str.replace(',', '') num = ast.literal_eval(num_str) if num == 0: return "영" check_float = num_str.split('.') if len(check_float) == 2: digit_str, float_str = check_float elif len(check_float) >= 3: raise Exception(" [!] Wrong number format") else: digit_str, float_str = check_float[0], None if is_count and float_str is not None: raise Exception(" [!] 
`is_count` and float number does not fit each other") digit = int(digit_str) if digit_str.startswith("-"): digit, digit_str = abs(digit), str(abs(digit)) kor = "" size = len(str(digit)) tmp = [] for i, v in enumerate(digit_str, start=1): v = int(v) if v != 0: if is_count: tmp += count_to_kor1[v] else: tmp += num_to_kor1[v] tmp += num_to_kor3[(size - i) % 4] if (size - i) % 4 == 0 and len(tmp) != 0: kor += "".join(tmp) tmp = [] kor += num_to_kor2[int((size - i) / 4)] if is_count: if kor.startswith("한") and len(kor) > 1: kor = kor[1:] if any(word in kor for word in count_tenth_dict): kor = re.sub( '|'.join(count_tenth_dict.keys()), lambda x: count_tenth_dict[x.group()], kor) if not is_count and kor.startswith("일") and len(kor) > 1: kor = kor[1:] if float_str is not None: kor += "쩜 " kor += re.sub('\d', lambda x: num_to_kor[x.group()], float_str) if num_str.startswith("+"): kor = "플러스 " + kor elif num_str.startswith("-"): kor = "마이너스 " + kor return kor + unit_str if __name__ == "__main__": def test_normalize(text): print(text) print(normalize(text)) print("="*30) test_normalize("JTBC는 JTBCs를 DY는 A가 Absolute") test_normalize("오늘(13일) 3,600마리 강아지가") test_normalize("60.3%") test_normalize('"저돌"(猪突) 입니다.') test_normalize('비대위원장이 지난 1월 이런 말을 했습니다. “난 그냥 산돼지처럼 돌파하는 스타일이다”') test_normalize("지금은 -12.35%였고 종류는 5가지와 19가지, 그리고 55가지였다") test_normalize("JTBC는 TH와 K 양이 2017년 9월 12일 오후 12시에 24살이 된다") print(list(hangul_to_jamo(list(hangul_to_jamo('비대위원장이 지난 1월 이런 말을 했습니다? “난 그냥 산돼지처럼 돌파하는 스타일이다”'))))) ================================================ FILE: text/symbols.py ================================================ # coding: utf-8 ''' Defines the set of symbols used in text input to the model. The default is a set of ASCII characters that works well for English or text that has been run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. ''' from jamo import h2j, j2h from jamo.jamo import _jamo_char_to_hcj from .korean import ALL_SYMBOLS, PAD, EOS # For english en_symbols = PAD+EOS+'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'(),-.:;? ' #<-For deployment(Because korean ALL_SYMBOLS follow this convention) symbols = ALL_SYMBOLS # for korean """ 초성과 종성은 같아보이지만, 다른 character이다. '_~ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ!'(),-.:;? 
' '_': 0, '~': 1, 'ᄀ': 2, 'ᄁ': 3, 'ᄂ': 4, 'ᄃ': 5, 'ᄄ': 6, 'ᄅ': 7, 'ᄆ': 8, 'ᄇ': 9, 'ᄈ': 10, 'ᄉ': 11, 'ᄊ': 12, 'ᄋ': 13, 'ᄌ': 14, 'ᄍ': 15, 'ᄎ': 16, 'ᄏ': 17, 'ᄐ': 18, 'ᄑ': 19, 'ᄒ': 20, 'ᅡ': 21, 'ᅢ': 22, 'ᅣ': 23, 'ᅤ': 24, 'ᅥ': 25, 'ᅦ': 26, 'ᅧ': 27, 'ᅨ': 28, 'ᅩ': 29, 'ᅪ': 30, 'ᅫ': 31, 'ᅬ': 32, 'ᅭ': 33, 'ᅮ': 34, 'ᅯ': 35, 'ᅰ': 36, 'ᅱ': 37, 'ᅲ': 38, 'ᅳ': 39, 'ᅴ': 40, 'ᅵ': 41, 'ᆨ': 42, 'ᆩ': 43, 'ᆪ': 44, 'ᆫ': 45, 'ᆬ': 46, 'ᆭ': 47, 'ᆮ': 48, 'ᆯ': 49, 'ᆰ': 50, 'ᆱ': 51, 'ᆲ': 52, 'ᆳ': 53, 'ᆴ': 54, 'ᆵ': 55, 'ᆶ': 56, 'ᆷ': 57, 'ᆸ': 58, 'ᆹ': 59, 'ᆺ': 60, 'ᆻ': 61, 'ᆼ': 62, 'ᆽ': 63, 'ᆾ': 64, 'ᆿ': 65, 'ᇀ': 66, 'ᇁ': 67, 'ᇂ': 68, '!': 69, "'": 70, '(': 71, ')': 72, ',': 73, '-': 74, '.': 75, ':': 76, ';': 77, '?': 78, ' ': 79 """ ================================================ FILE: train_tacotron2.py ================================================ # coding: utf-8 import os import time import math import argparse import traceback import subprocess import numpy as np from jamo import h2j import tensorflow as tf from datetime import datetime from functools import partial from hparams import hparams, hparams_debug_string from tacotron2 import create_model, get_most_recent_checkpoint from utils import ValueWindow, prepare_dirs from utils import infolog, warning, plot, load_hparams from utils import get_git_revision_hash, get_git_diff, str2bool, parallel_run from utils.audio import save_wav, inv_spectrogram from text import sequence_to_text, text_to_sequence from datasets.datafeeder_tacotron2 import DataFeederTacotron2 import warnings warnings.simplefilter(action='ignore', category=FutureWarning) tf.logging.set_verbosity(tf.logging.ERROR) log = infolog.log def get_git_commit(): subprocess.check_output(['git', 'diff-index', '--quiet', 'HEAD']) # Verify client is clean commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()[:10] log('Git commit: %s' % commit) return commit def add_stats(model, model2=None, scope_name='train'): with tf.variable_scope(scope_name) as scope: summaries = [ tf.summary.scalar('loss_mel', model.mel_loss), tf.summary.scalar('loss_linear', model.linear_loss), tf.summary.scalar('loss', model.loss_without_coeff), ] if scope_name == 'train': gradient_norms = [tf.norm(grad) for grad in model.gradients if grad is not None] summaries.extend([ tf.summary.scalar('learning_rate', model.learning_rate), tf.summary.scalar('max_gradient_norm', tf.reduce_max(gradient_norms)), ]) if model2 is not None: with tf.variable_scope('gap_test-train') as scope: summaries.extend([ tf.summary.scalar('loss_mel', model.mel_loss - model2.mel_loss), tf.summary.scalar('loss_linear', model.linear_loss - model2.linear_loss), tf.summary.scalar('loss', model.loss_without_coeff - model2.loss_without_coeff), ]) return tf.summary.merge(summaries) def save_and_plot_fn(args, log_dir, step, loss, prefix): idx, (seq, spec, align) = args audio_path = os.path.join(log_dir, '{}-step-{:09d}-audio{:03d}.wav'.format(prefix, step, idx)) align_path = os.path.join(log_dir, '{}-step-{:09d}-align{:03d}.png'.format(prefix, step, idx)) waveform = inv_spectrogram(spec.T,hparams) save_wav(waveform, audio_path,hparams.sample_rate) info_text = 'step={:d}, loss={:.5f}'.format(step, loss) if 'korean_cleaners' in [x.strip() for x in hparams.cleaners.split(',')]: log('Training korean : Use jamo') plot.plot_alignment( align, align_path, info=info_text, text=sequence_to_text(seq,skip_eos_and_pad=True, combine_jamo=True), isKorean=True) else: log('Training non-korean : X use jamo') plot.plot_alignment(align, align_path, 
info=info_text,text=sequence_to_text(seq,skip_eos_and_pad=True, combine_jamo=False), isKorean=False) def save_and_plot(sequences, spectrograms,alignments, log_dir, step, loss, prefix): fn = partial(save_and_plot_fn,log_dir=log_dir, step=step, loss=loss, prefix=prefix) items = list(enumerate(zip(sequences, spectrograms, alignments))) parallel_run(fn, items, parallel=False) log('Test finished for step {}.'.format(step)) def train(log_dir, config): config.data_paths = config.data_paths # ['datasets/moon'] data_dirs = config.data_paths # ['datasets/moon\\data'] num_speakers = len(data_dirs) config.num_test = config.num_test_per_speaker * num_speakers # 2*1 if num_speakers > 1 and hparams.model_type not in ["multi-speaker", "simple"]: raise Exception("[!] Unkown model_type for multi-speaker: {}".format(config.model_type)) commit = get_git_commit() if config.git else 'None' checkpoint_path = os.path.join(log_dir, 'model.ckpt') # 'logdir-tacotron\\moon_2018-08-28_13-06-42\\model.ckpt' #log(' [*] git recv-parse HEAD:\n%s' % get_git_revision_hash()) # hccho: 주석 처리 log('='*50) #log(' [*] dit diff:\n%s' % get_git_diff()) log('='*50) log(' [*] Checkpoint path: %s' % checkpoint_path) log(' [*] Loading training data from: %s' % data_dirs) log(' [*] Using model: %s' % config.model_dir) # 'logdir-tacotron\\moon_2018-08-28_13-06-42' log(hparams_debug_string()) # Set up DataFeeder: coord = tf.train.Coordinator() with tf.variable_scope('datafeeder') as scope: # DataFeeder의 6개 placeholder: train_feeder.inputs, train_feeder.input_lengths, train_feeder.loss_coeff, train_feeder.mel_targets, train_feeder.linear_targets, train_feeder.speaker_id train_feeder = DataFeederTacotron2(coord, data_dirs, hparams, config, 32,data_type='train', batch_size=config.batch_size) test_feeder = DataFeederTacotron2(coord, data_dirs, hparams, config, 8, data_type='test', batch_size=config.num_test) # Set up model: global_step = tf.Variable(0, name='global_step', trainable=False) with tf.variable_scope('model') as scope: model = create_model(hparams) model.initialize(inputs=train_feeder.inputs, input_lengths=train_feeder.input_lengths,num_speakers=num_speakers,speaker_id=train_feeder.speaker_id, mel_targets=train_feeder.mel_targets, linear_targets=train_feeder.linear_targets,is_training=True, loss_coeff=train_feeder.loss_coeff,stop_token_targets=train_feeder.stop_token_targets) model.add_loss() model.add_optimizer(global_step) train_stats = add_stats(model, scope_name='train') # legacy with tf.variable_scope('model', reuse=True) as scope: test_model = create_model(hparams) test_model.initialize(inputs=test_feeder.inputs, input_lengths=test_feeder.input_lengths,num_speakers=num_speakers,speaker_id=test_feeder.speaker_id, mel_targets=test_feeder.mel_targets, linear_targets=test_feeder.linear_targets,is_training=False, loss_coeff=test_feeder.loss_coeff,stop_token_targets=test_feeder.stop_token_targets) test_model.add_loss() # Bookkeeping: step = 0 time_window = ValueWindow(100) loss_window = ValueWindow(100) saver = tf.train.Saver(max_to_keep=None, keep_checkpoint_every_n_hours=2) sess_config = tf.ConfigProto(log_device_placement=False,allow_soft_placement=True) sess_config.gpu_options.allow_growth=True # Train! #with tf.Session(config=sess_config) as sess: with tf.Session() as sess: try: summary_writer = tf.summary.FileWriter(log_dir, sess.graph) sess.run(tf.global_variables_initializer()) if config.load_path: # Restore from a checkpoint if the user requested it. 
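# (load_path resumes training and keeps the stored global_step; initialize_path below restores
#  only the weights and resets global_step to 0.)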
restore_path = get_most_recent_checkpoint(config.model_dir) saver.restore(sess, restore_path) log('Resuming from checkpoint: %s at commit: %s' % (restore_path, commit), slack=True) elif config.initialize_path: restore_path = get_most_recent_checkpoint(config.initialize_path) saver.restore(sess, restore_path) log('Initialized from checkpoint: %s at commit: %s' % (restore_path, commit), slack=True) zero_step_assign = tf.assign(global_step, 0) sess.run(zero_step_assign) start_step = sess.run(global_step) log('='*50) log(' [*] Global step is reset to {}'.format(start_step)) log('='*50) else: log('Starting new training run at commit: %s' % commit, slack=True) start_step = sess.run(global_step) train_feeder.start_in_session(sess, start_step) test_feeder.start_in_session(sess, start_step) while not coord.should_stop(): start_time = time.time() step, loss, opt = sess.run([global_step, model.loss_without_coeff, model.optimize]) time_window.append(time.time() - start_time) loss_window.append(loss) message = 'Step %-7d [%.03f sec/step, loss=%.05f, avg_loss=%.05f]' % (step, time_window.average, loss, loss_window.average) log(message, slack=(step % config.checkpoint_interval == 0)) if loss > 100 or math.isnan(loss): log('Loss exploded to %.05f at step %d!' % (loss, step), slack=True) raise Exception('Loss Exploded') if step % config.summary_interval == 0: log('Writing summary at step: %d' % step) summary_writer.add_summary(sess.run( train_stats), step) if step % config.checkpoint_interval == 0: log('Saving checkpoint to: %s-%d' % (checkpoint_path, step)) saver.save(sess, checkpoint_path, global_step=step) if step % config.test_interval == 0: log('Saving audio and alignment...') num_test = config.num_test fetches = [ model.inputs[:num_test], model.linear_outputs[:num_test], model.alignments[:num_test], test_model.inputs[:num_test], test_model.linear_outputs[:num_test], test_model.alignments[:num_test], ] sequences, spectrograms, alignments, test_sequences, test_spectrograms, test_alignments = sess.run(fetches) #librosa는 ffmpeg가 있어야 한다. 
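# (i.e. librosa needs ffmpeg to be available for this step.)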
save_and_plot(sequences[:1], spectrograms[:1], alignments[:1], log_dir, step, loss, "train") # spectrograms: (num_test,200,1025), alignments: (num_test,encoder_length,decoder_length) save_and_plot(test_sequences, test_spectrograms, test_alignments, log_dir, step, loss, "test") except Exception as e: log('Exiting due to exception: %s' % e, slack=True) traceback.print_exc() coord.request_stop(e) def main(): parser = argparse.ArgumentParser() parser.add_argument('--log_dir', default='logdir-tacotron2') parser.add_argument('--data_paths', default='D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon,D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\son') #parser.add_argument('--data_paths', default='D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\small1,D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\small2') #parser.add_argument('--load_path', default=None) # 아래의 'initialize_path'보다 우선 적용 parser.add_argument('--load_path', default='logdir-tacotron2/moon+son_2019-03-01_10-35-44') parser.add_argument('--initialize_path', default=None) # ckpt로 부터 model을 restore하지만, global step은 0에서 시작 parser.add_argument('--batch_size', type=int, default=32) parser.add_argument('--num_test_per_speaker', type=int, default=2) parser.add_argument('--random_seed', type=int, default=123) parser.add_argument('--summary_interval', type=int, default=100) parser.add_argument('--test_interval', type=int, default=500) # 500 parser.add_argument('--checkpoint_interval', type=int, default=2000) # 2000 parser.add_argument('--skip_path_filter', type=str2bool, default=False, help='Use only for debugging') parser.add_argument('--slack_url', help='Slack webhook URL to get periodic reports.') parser.add_argument('--git', action='store_true', help='If set, verify that the client is clean.') # The store_true option automatically creates a default value of False. config = parser.parse_args() config.data_paths = config.data_paths.split(",") setattr(hparams, "num_speakers", len(config.data_paths)) prepare_dirs(config, hparams) log_path = os.path.join(config.model_dir, 'train.log') infolog.init(log_path, config.model_dir, config.slack_url) tf.set_random_seed(config.random_seed) print(config.data_paths) if config.load_path is not None and config.initialize_path is not None: raise Exception(" [!] Only one of load_path and initialize_path should be set") train(config.model_dir, config) if __name__ == '__main__': main() ================================================ FILE: train_vocoder.py ================================================ # coding: utf-8 """ - train data를 speaker를 분리된 디렉토리로 받아서, speaker id를 디렉토리별로 부과. - file name에서 speaker id를 추론하는 방식이 아님. 
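- Training data is given as one directory per speaker; the speaker id is assigned per directory,
  not inferred from the file name.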
""" from __future__ import print_function import argparse import numpy as np import os import time import traceback from glob import glob import tensorflow as tf from tensorflow.python.client import timeline from datetime import datetime from wavenet import WaveNetModel,mu_law_decode from datasets import DataFeederWavenet from hparams import hparams from utils import validate_directories,load,save,infolog,get_tensors_in_checkpoint_file,build_tensors_in_checkpoint_file,plot,audio tf.logging.set_verbosity(tf.logging.ERROR) EPSILON = 0.001 log = infolog.log def eval_step(sess,logdir,step,waveform,upsampled_local_condition_data,speaker_id_data,mel_input_data,samples,speaker_id,upsampled_local_condition,next_sample,temperature=1.0): waveform = waveform[:,:1] sample_size = upsampled_local_condition_data.shape[1] last_sample_timestamp = datetime.now() start_time = time.time() for step2 in range(sample_size): # 원하는 길이를 구하기 위해 loop sample_size window = waveform[:,-1:] # 제일 끝에 있는 1개만 samples에 넣어 준다. window: shape(N,1) prediction = sess.run(next_sample, feed_dict={samples: window,upsampled_local_condition: upsampled_local_condition_data[:,step2,:],speaker_id: speaker_id_data }) if hparams.scalar_input: sample = prediction # logistic distribution으로부터 sampling 되었기 때문에, randomness가 있다. else: # Scale prediction distribution using temperature. # 다음 과정은 config.temperature==1이면 각 원소를 합으로 나누어주는 것에 불과. 이미 softmax를 적용한 겂이므로, 합이 1이된다. 그래서 값의 변화가 없다. # config.temperature가 1이 아니며, 각 원소의 log취한 값을 나눈 후, 합이 1이 되도록 rescaling하는 것이 된다. np.seterr(divide='ignore') scaled_prediction = np.log(prediction) / temperature # config.temperature인 경우는 값의 변화가 없다. scaled_prediction = (scaled_prediction - np.logaddexp.reduce(scaled_prediction,axis=-1,keepdims=True)) # np.log(np.sum(np.exp(scaled_prediction))) scaled_prediction = np.exp(scaled_prediction) np.seterr(divide='warn') # Prediction distribution at temperature=1.0 should be unchanged after # scaling. if temperature == 1.0: np.testing.assert_allclose( prediction, scaled_prediction, atol=1e-5, err_msg='Prediction scaling at temperature=1.0 is not working as intended.') # argmax로 선택하지 않기 때문에, 같은 입력이 들어가도 달라질 수 있다. sample = [[np.random.choice(np.arange(hparams.quantization_channels), p=p)] for p in scaled_prediction] # choose one sample per batch waveform = np.concatenate([waveform,sample],axis=-1) #window.shape: (N,1) # Show progress only once per second. current_sample_timestamp = datetime.now() time_since_print = current_sample_timestamp - last_sample_timestamp if time_since_print.total_seconds() > 1.: duration = time.time() - start_time print('Sample {:3 EPSILON else None gc_enable = True # Before: num_speakers > 1 After: 항상 True # AudioReader에서 wav 파일을 잘라 input값을 만든다. receptive_field길이만큼을 앞부분에 pad하거나 앞조각에서 가져온다. (receptive_field+ sample_size)크기로 자른다. reader = DataFeederWavenet(coord,config.data_dir,batch_size=hparams.wavenet_batch_size,gc_enable= gc_enable,test_mode=False) # test를 위한 DataFeederWavenet를 하나 만들자. 여기서는 딱 1개의 파일만 가져온다. reader_test = DataFeederWavenet(coord,config.data_dir,batch_size=1,gc_enable= gc_enable,test_mode=True,queue_size=1) audio_batch, lc_batch, gc_id_batch = reader.inputs_wav, reader.local_condition, reader.speaker_id # Create train network. 
net = create_network(hparams,hparams.wavenet_batch_size,num_speakers,is_training=True) net.add_loss(input_batch=audio_batch,local_condition=lc_batch, global_condition_batch=gc_id_batch, l2_regularization_strength=hparams.l2_regularization_strength,upsample_type=hparams.upsample_type) net.add_optimizer(hparams,global_step) run_metadata = tf.RunMetadata() # Set up session sess = tf.Session(config=tf.ConfigProto(log_device_placement=False)) # log_device_placement=False --> cpu/gpu 자동 배치. init = tf.global_variables_initializer() sess.run(init) # Saver for storing checkpoints of the model. saver = tf.train.Saver(var_list=tf.global_variables(), max_to_keep=hparams.max_checkpoints) # 최대 checkpoint 저장 갯수 지정 try: start_step = load(saver, sess, restore_from) # checkpoint load if is_overwritten_training or start_step is None: # The first training step will be saved_global_step + 1, # therefore we put -1 here for new or overwritten trainings. zero_step_assign = tf.assign(global_step, 0) sess.run(zero_step_assign) start_step=0 except: print("Something went wrong while restoring checkpoint. We will terminate training to avoid accidentally overwriting the previous model.") raise ########### reader.start_in_session(sess,start_step) reader_test.start_in_session(sess,start_step) ################### Create test network. <---- Queue 생성 때문에, sess restore후 test network 생성 net_test = create_network(hparams,1,num_speakers,is_training=False) if hparams.scalar_input: samples = tf.placeholder(tf.float32,shape=[net_test.batch_size,None]) waveform = 2*np.random.rand(net_test.batch_size).reshape(net_test.batch_size,-1)-1 else: samples = tf.placeholder(tf.int32,shape=[net_test.batch_size,None]) # samples: mu_law_encode로 변환된 것. one-hot으로 변환되기 전. (batch_size, 길이) waveform = np.random.randint(hparams.quantization_channels,size=net_test.batch_size).reshape(net_test.batch_size,-1) upsampled_local_condition = tf.placeholder(tf.float32,shape=[net_test.batch_size,hparams.num_mels]) speaker_id = tf.placeholder(tf.int32,shape=[net_test.batch_size]) next_sample = net_test.predict_proba_incremental(samples,upsampled_local_condition,speaker_id) # Fast Wavenet Generation Algorithm-1611.09482 algorithm 적용 sess.run(net_test.queue_initializer) # test를 위한 placeholder는 모두 3개: samples,speaker_id,upsampled_local_condition # test용 mel-spectrogram을 하나 뽑자. 그것을 고정하지 않으면, thread가 계속 돌아가면서 data를 읽어온다. reader_test의 역할은 여기서 끝난다. mel_input_test, speaker_id_test = sess.run([reader_test.local_condition,reader_test.speaker_id]) with tf.variable_scope('wavenet',reuse=tf.AUTO_REUSE): upsampled_local_condition_data = net_test.create_upsample(mel_input_test,upsample_type=hparams.upsample_type) upsampled_local_condition_data_ = sess.run(upsampled_local_condition_data) # upsampled_local_condition_data_ 을 feed_dict로 placehoder인 upsampled_local_condition에 넣어준다. ###################################################### start_step = sess.run(global_step) step = last_saved_step = start_step try: while not coord.should_stop(): start_time = time.time() if hparams.store_metadata and step % 50 == 0: # Slow run that stores extra information for debugging. 
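# (When store_metadata is enabled, every 50th step runs with FULL_TRACE and writes a
#  Chrome-trace timeline to <logdir>/timeline.trace for inspection in chrome://tracing.)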
log('Storing metadata') run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) step, loss_value, _ = sess.run([global_step, net.loss, net.optimize],options=run_options,run_metadata=run_metadata) tl = timeline.Timeline(run_metadata.step_stats) timeline_path = os.path.join(logdir, 'timeline.trace') with open(timeline_path, 'w') as f: f.write(tl.generate_chrome_trace_format(show_memory=True)) else: step, loss_value, _ = sess.run([global_step,net.loss, net.optimize]) duration = time.time() - start_time log('step {:d} - loss = {:.3f}, ({:.3f} sec/step)'.format(step, loss_value, duration)) if step % config.checkpoint_every == 0: save(saver, sess, logdir, step) last_saved_step = step if step % config.eval_every == 0: # config.eval_every eval_step(sess,logdir,step,waveform,upsampled_local_condition_data_,speaker_id_test,mel_input_test,samples,speaker_id,upsampled_local_condition,next_sample) if step >= hparams.num_steps: # error message가 나오지만, 여기서 멈춘 것은 맞다. raise Exception('End xxx~~~yyy') except Exception as e: print('finally') log('Exiting due to exception: %s' % e, slack=True) #if step > last_saved_step: # save(saver, sess, logdir, step) traceback.print_exc() coord.request_stop(e) if __name__ == '__main__': main() traceback.print_exc() print('Done') ================================================ FILE: utils/__init__.py ================================================ # -*- coding: utf-8 -*- import re,json,sys,os import tensorflow as tf from tqdm import tqdm from contextlib import closing from multiprocessing import Pool from collections import namedtuple from datetime import datetime, timedelta from shutil import copyfile as copy_file from tensorflow.python import pywrap_tensorflow PARAMS_NAME = "params.json" STARTED_DATESTRING = "{0:%Y-%m-%dT%H-%M-%S}".format(datetime.now()) LOGDIR_ROOT_Wavenet = './logdir-wavenet' class ValueWindow(): def __init__(self, window_size=100): self._window_size = window_size self._values = [] def append(self, x): self._values = self._values[-(self._window_size - 1):] + [x] @property def sum(self): return sum(self._values) @property def count(self): return len(self._values) @property def average(self): return self.sum / max(1, self.count) def reset(self): self._values = [] def prepare_dirs(config, hparams): if hasattr(config, "data_paths"): config.datasets = [os.path.basename(data_path) for data_path in config.data_paths] dataset_desc = "+".join(config.datasets) if config.load_path: config.model_dir = config.load_path else: config.model_name = "{}_{}".format(dataset_desc, get_time()) config.model_dir = os.path.join(config.log_dir, config.model_name) for path in [config.log_dir, config.model_dir]: if not os.path.exists(path): os.makedirs(path) if config.load_path: load_hparams(hparams, config.model_dir) else: setattr(hparams, "num_speakers", len(config.datasets)) save_hparams(config.model_dir, hparams) copy_file("hparams.py", os.path.join(config.model_dir, "hparams.py")) def save(saver, sess, logdir, step): model_name = 'model.ckpt' checkpoint_path = os.path.join(logdir, model_name) print('Storing checkpoint to {} ...'.format(logdir), end="") sys.stdout.flush() if not os.path.exists(logdir): os.makedirs(logdir) saver.save(sess, checkpoint_path, global_step=step) print(' Done.') def load(saver, sess, logdir): print("Trying to restore saved checkpoints from {} ...".format(logdir),end="") ckpt = tf.train.get_checkpoint_state(logdir) #ckpt = get_most_recent_checkpoint(logdir) if ckpt: print(" Checkpoint found: {}".format(ckpt.model_checkpoint_path)) 
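# The global step is recovered from the checkpoint file name, e.g. '.../model.ckpt-155000' -> 155000.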
global_step = int(ckpt.model_checkpoint_path.split('/')[-1].split('-')[-1]) print(" Global step was: {}".format(global_step)) print(" Restoring...", end="") saver.restore(sess, ckpt.model_checkpoint_path) print(" Done.") return global_step else: print(" No checkpoint found.") return None def get_default_logdir(logdir_root): logdir = os.path.join(logdir_root, 'train', STARTED_DATESTRING) if not os.path.exists(logdir): os.makedirs(logdir) return logdir def validate_directories(args,hparams): """Validate and arrange directory related arguments.""" # Validation if args.logdir and args.logdir_root: raise ValueError("--logdir and --logdir_root cannot be specified at the same time.") if args.logdir and args.restore_from: raise ValueError( "--logdir and --restore_from cannot be specified at the same " "time. This is to keep your previous model from unexpected " "overwrites.\n" "Use --logdir_root to specify the root of the directory which " "will be automatically created with current date and time, or use " "only --logdir to just continue the training from the last " "checkpoint.") # Arrangement logdir_root = args.logdir_root if logdir_root is None: logdir_root = LOGDIR_ROOT_Wavenet logdir = args.logdir if logdir is None: logdir = get_default_logdir(logdir_root) print('Using default logdir: {}'.format(logdir)) save_hparams(logdir, hparams) copy_file("hparams.py", os.path.join(logdir, "hparams.py")) else: load_hparams(hparams, logdir) restore_from = args.restore_from if restore_from is None: # args.logdir and args.restore_from are exclusive, # so it is guaranteed the logdir here is newly created. restore_from = logdir return { 'logdir': logdir, 'logdir_root': args.logdir_root, 'restore_from': restore_from } def save_hparams(model_dir, hparams): param_path = os.path.join(model_dir, PARAMS_NAME) info = eval(hparams.to_json(),{'false': False, 'true': True, 'null': None}) write_json(param_path, info) print(" [*] MODEL dir: {}".format(model_dir)) print(" [*] PARAM path: {}".format(param_path)) def write_json(path, data): with open(path, 'w',encoding='utf-8') as f: json.dump(data, f, indent=4, sort_keys=True, ensure_ascii=False) def load_hparams(hparams, load_path, skip_list=[]): # log dir에 있는 hypermarameter 정보를 이용해서, hparams.py의 정보를 update한다. path = os.path.join(load_path, PARAMS_NAME) new_hparams = load_json(path) hparams_keys = vars(hparams).keys() for key, value in new_hparams.items(): if key in skip_list or key not in hparams_keys: print("Skip {} because it not exists".format(key)) #json에 있지만, hparams에 없다는 의미 continue if key not in ['xxxxx',]: # update 하지 말아야 할 것을 지정할 수 있다. 
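# (keys listed above are excluded from being overwritten by the values loaded from params.json)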
original_value = getattr(hparams, key) if original_value != value: print("UPDATE {}: {} -> {}".format(key, getattr(hparams, key), value)) setattr(hparams, key, value) def load_json(path, as_class=False, encoding='euc-kr'): with open(path,encoding=encoding) as f: content = f.read() content = re.sub(",\s*}", "}", content) content = re.sub(",\s*]", "]", content) if as_class: data = json.loads(content, object_hook=\ lambda data: namedtuple('Data', data.keys())(*data.values())) else: data = json.loads(content) return data def get_most_recent_checkpoint(checkpoint_dir): checkpoint_paths = [path for path in glob("{}/*.ckpt-*.data-*".format(checkpoint_dir))] idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths] max_idx = max(idxes) lastest_checkpoint = os.path.join(checkpoint_dir, "model.ckpt-{}".format(max_idx)) #latest_checkpoint=checkpoint_paths[0] print(" [*] Found lastest checkpoint: {}".format(lastest_checkpoint)) return lastest_checkpoint def add_prefix(path, prefix): dir_path, filename = os.path.dirname(path), os.path.basename(path) return "{}/{}.{}".format(dir_path, prefix, filename) def add_postfix(path, postfix): path_without_ext, ext = path.rsplit('.', 1) return "{}.{}.{}".format(path_without_ext, postfix, ext) def remove_postfix(path): items = path.rsplit('.', 2) return items[0] + "." + items[2] def get_time(): return datetime.now().strftime("%Y-%m-%d_%H-%M-%S") def parallel_run(fn, items, desc="", parallel=True): results = [] if parallel: with closing(Pool(10)) as pool: for out in tqdm(pool.imap_unordered(fn, items), total=len(items), desc=desc): if out is not None: results.append(out) else: for item in tqdm(items, total=len(items), desc=desc): out = fn(item) if out is not None: results.append(out) return results def makedirs(path): if not os.path.exists(path): print(" [*] Make directories : {}".format(path)) os.makedirs(path) def str2bool(v): return v.lower() in ('true', '1') def remove_file(path): if os.path.exists(path): print(" [*] Removed: {}".format(path)) os.remove(path) def get_git_revision_hash(): return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode("utf-8") def get_git_diff(): return subprocess.check_output(['git', 'diff']).decode("utf-8") def warning(msg): print("="*40) print(" [!] {}".format(msg)) print("="*40) print() def get_tensors_in_checkpoint_file(file_name,all_tensors=True,tensor_name=None): # checkpoint 파일로 부터 복구 # e.g file_name: 'D:\\hccho\\Tacotron-2-hccho\\model.ckpt-155000' varlist=[] var_value =[] reader = pywrap_tensorflow.NewCheckpointReader(file_name) trainable_variables_names = [v.name[:-2] for v in tf.trainable_variables()] # 끝부분의 ':0' 제외 if all_tensors: var_to_shape_map = reader.get_variable_to_shape_map() for key in sorted(var_to_shape_map): if key in trainable_variables_names: # hccho varlist.append(key) var_value.append(reader.get_tensor(key)) else: varlist.append(tensor_name) var_value.append(reader.get_tensor(tensor_name)) return (varlist, var_value) def build_tensors_in_checkpoint_file(loaded_tensors): # 현재 tensor graph에 있는 tensor중에서 loaded_tensors에 있는 tensor name을 가져온다. full_var_list = list() # Loop all loaded tensors for i, tensor_name in enumerate(loaded_tensors[0]): # Extract tensor try: tensor_aux = tf.get_default_graph().get_tensor_by_name(tensor_name+":0") except: print('Not found: '+tensor_name) full_var_list.append(tensor_aux) return full_var_list """ # restore egample 모델을 변형했을 때, 기존 ckpt로부터 중복되는 trainable_varaibles 복구. 
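# (Restore example: after modifying the model, recover only the trainable variables that still
#  exist in an older checkpoint, then re-save everything with a fresh Saver.)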
CHECKPOINT_NAME = 'D:\\hccho\\Tacotron-2-hccho\\ver1\\logdir-wavenet\\train\\2019-03-22T23-08-16\\model.ckpt-155000' restored_vars = get_tensors_in_checkpoint_file(file_name=CHECKPOINT_NAME) tensors_to_load = build_tensors_in_checkpoint_file(restored_vars) loader = tf.train.Saver(tensors_to_load) loader.restore(sess, CHECKPOINT_NAME) new_saver = tf.train.Saver(var_list=tf.global_variables(), max_to_keep=hparams.max_checkpoints) # 최대 checkpoint 저장 갯수 지정 save(new_saver, sess, logdir, 0) exit() """ ================================================ FILE: utils/audio.py ================================================ # coding: utf-8 import librosa import librosa.filters import numpy as np import tensorflow as tf from scipy import signal from scipy.io import wavfile from tensorflow.contrib.training.python.training.hparam import HParams def load_wav(path, sr): return librosa.core.load(path, sr=sr)[0] def save_wav(wav, path, sr): wav *= 32767 / max(0.01, np.max(np.abs(wav))) #proposed by @dsmiller --> libosa type error(bug) 극복 wavfile.write(path, sr, wav.astype(np.int16)) def save_wavenet_wav(wav, path, sr): librosa.output.write_wav(path, wav, sr=sr) def preemphasis(wav, k, preemphasize=True): if preemphasize: return signal.lfilter([1, -k], [1], wav) return wav def inv_preemphasis(wav, k, inv_preemphasize=True): if inv_preemphasize: return signal.lfilter([1], [1, -k], wav) return wav #From https://github.com/r9y9/wavenet_vocoder/blob/master/audio.py def start_and_end_indices(quantized, silence_threshold=2): for start in range(quantized.size): if abs(quantized[start] - 127) > silence_threshold: break for end in range(quantized.size - 1, 1, -1): if abs(quantized[end] - 127) > silence_threshold: break assert abs(quantized[start] - 127) > silence_threshold assert abs(quantized[end] - 127) > silence_threshold return start, end def trim_silence(wav, hparams): '''Trim leading and trailing silence Useful for M-AILABS dataset if we choose to trim the extra 0.5 silence at beginning and end. ''' #Thanks @begeekmyfriend and @lautjy for pointing out the params contradiction. These params are separate and tunable per dataset. return librosa.effects.trim(wav, top_db= hparams.trim_top_db, frame_length=hparams.trim_fft_size, hop_length=hparams.trim_hop_size)[0] def get_hop_size(hparams): hop_size = hparams.hop_size if hop_size is None: assert hparams.frame_shift_ms is not None hop_size = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate) return hop_size def linearspectrogram(wav, hparams): D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams) S = _amp_to_db(np.abs(D), hparams) - hparams.ref_level_db if hparams.signal_normalization: # Tacotron에서 항상적용했다. 
return _normalize(S, hparams) return S def melspectrogram(wav, hparams): D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams) S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db if hparams.signal_normalization: return _normalize(S, hparams) return S def inv_linear_spectrogram(linear_spectrogram, hparams): '''Converts linear spectrogram to waveform using librosa''' if hparams.signal_normalization: D = _denormalize(linear_spectrogram, hparams) else: D = linear_spectrogram S = _db_to_amp(D + hparams.ref_level_db) #Convert back to linear if hparams.use_lws: processor = _lws_processor(hparams) D = processor.run_lws(S.astype(np.float64).T ** hparams.power) y = processor.istft(D).astype(np.float32) return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize) else: return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize) def inv_mel_spectrogram(mel_spectrogram, hparams): '''Converts mel spectrogram to waveform using librosa''' if hparams.signal_normalization: D = _denormalize(mel_spectrogram, hparams) else: D = mel_spectrogram S = _mel_to_linear(_db_to_amp(D + hparams.ref_level_db), hparams) # Convert back to linear if hparams.use_lws: processor = _lws_processor(hparams) D = processor.run_lws(S.astype(np.float64).T ** hparams.power) y = processor.istft(D).astype(np.float32) return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize) else: return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize) def inv_spectrogram_tensorflow(spectrogram,hparams): S = _db_to_amp_tensorflow(_denormalize_tensorflow(spectrogram,hparams) + hparams.ref_level_db) return _griffin_lim_tensorflow(tf.pow(S, hparams.power),hparams) def inv_spectrogram(spectrogram,hparams): S = _db_to_amp(_denormalize(spectrogram,hparams) + hparams.ref_level_db) # Convert back to linear. spectrogram: (num_freq,length) return inv_preemphasis(_griffin_lim(S ** hparams.power,hparams),hparams.preemphasis, hparams.preemphasize) # Reconstruct phase def _lws_processor(hparams): import lws return lws.lws(hparams.fft_size, get_hop_size(hparams), fftsize=hparams.win_size, mode="speech") def _griffin_lim(S, hparams): '''librosa implementation of Griffin-Lim Based on https://github.com/librosa/librosa/issues/434 ''' angles = np.exp(2j * np.pi * np.random.rand(*S.shape)) S_complex = np.abs(S).astype(np.complex) y = _istft(S_complex * angles, hparams) for i in range(hparams.griffin_lim_iters): angles = np.exp(1j * np.angle(_stft(y, hparams))) y = _istft(S_complex * angles, hparams) return y def _stft(y, hparams): if hparams.use_lws: return _lws_processor(hparams).stft(y).T else: return librosa.stft(y=y, n_fft=hparams.fft_size, hop_length=get_hop_size(hparams), win_length=hparams.win_size) def _istft(y, hparams): return librosa.istft(y, hop_length=get_hop_size(hparams), win_length=hparams.win_size) ########################################################## #Those are only correct when using lws!!! (This was messing with Wavenet quality for a long time!) 
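# Illustrative check of the two helpers below (assuming fsize=1024, fshift=256, len(x)=4096):
# pad = 768, num_frames = (4096 + 2*768 - 1024)//256 + 1 = 19, and pad_lr returns (768, 768)
# because r = 18*256 + 1024 - (4096 + 2*768) = 0.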
def num_frames(length, fsize, fshift): """Compute number of time frames of spectrogram """ pad = (fsize - fshift) if length % fshift == 0: M = (length + pad * 2 - fsize) // fshift + 1 else: M = (length + pad * 2 - fsize) // fshift + 2 return M def pad_lr(x, fsize, fshift): """Compute left and right padding """ M = num_frames(len(x), fsize, fshift) pad = (fsize - fshift) T = len(x) + 2 * pad r = (M - 1) * fshift + fsize - T return pad, pad + r ########################################################## #Librosa correct padding def librosa_pad_lr(x, fsize, fshift): '''compute right padding (final frame) ''' return int(fsize // 2) # Conversions _mel_basis = None _inv_mel_basis = None def _linear_to_mel(spectogram, hparams): global _mel_basis if _mel_basis is None: _mel_basis = _build_mel_basis(hparams) return np.dot(_mel_basis, spectogram) def _mel_to_linear(mel_spectrogram, hparams): global _inv_mel_basis if _inv_mel_basis is None: _inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams)) return np.maximum(1e-10, np.dot(_inv_mel_basis, mel_spectrogram)) def _build_mel_basis(hparams): #assert hparams.fmax <= hparams.sample_rate // 2 #fmin: Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To test depending on dataset. Pitch info: male~[65, 260], female~[100, 525]) #fmax: 7600, To be increased/reduced depending on data. #return librosa.filters.mel(hparams.sample_rate, hparams.fft_size, n_mels=hparams.num_mels,fmin=hparams.fmin, fmax=hparams.fmax) return librosa.filters.mel(hparams.sample_rate, hparams.fft_size, n_mels=hparams.num_mels) # fmin=0, fmax= sample_rate/2.0 def _amp_to_db(x, hparams): min_level = np.exp(hparams.min_level_db / 20 * np.log(10)) # min_level_db = -100 return 20 * np.log10(np.maximum(min_level, x)) def _db_to_amp(x): return np.power(10.0, (x) * 0.05) def _normalize(S, hparams): if hparams.allow_clipping_in_normalization: if hparams.symmetric_mels: return np.clip((2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value, -hparams.max_abs_value, hparams.max_abs_value) else: return np.clip(hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)), 0, hparams.max_abs_value) assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0 if hparams.symmetric_mels: return (2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value else: return hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)) def _denormalize(D, hparams): if hparams.allow_clipping_in_normalization: if hparams.symmetric_mels: return (((np.clip(D, -hparams.max_abs_value, hparams.max_abs_value) + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db) else: return ((np.clip(D, 0, hparams.max_abs_value) * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db) if hparams.symmetric_mels: return (((D + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db) else: return ((D * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db) # 김태훈 구현. 이 차이 때문에 호환이 되지 않는다. 
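# (The commented-out version below is the Taehoon Kim (carpedm20) style normalization;
#  this difference is why the two normalization schemes are not interchangeable.)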
# def _normalize(S,hparams): # return np.clip((S - hparams.min_level_db) / -hparams.min_level_db, 0, 1) # min_level_db = -100 # # def _denormalize(S,hparams): # return (np.clip(S, 0, 1) * -hparams.min_level_db) + hparams.min_level_db #From https://github.com/r9y9/nnmnkwii/blob/master/nnmnkwii/preprocessing/generic.py def mulaw(x, mu=256): """Mu-Law companding Method described in paper [1]_. .. math:: f(x) = sign(x) ln (1 + mu |x|) / ln (1 + mu) Args: x (array-like): Input signal. Each value of input signal must be in range of [-1, 1]. mu (number): Compression parameter ``μ``. Returns: array-like: Compressed signal ([-1, 1]) See also: :func:`nnmnkwii.preprocessing.inv_mulaw` :func:`nnmnkwii.preprocessing.mulaw_quantize` :func:`nnmnkwii.preprocessing.inv_mulaw_quantize` .. [1] Brokish, Charles W., and Michele Lewis. "A-law and mu-law companding implementations using the tms320c54x." SPRA163 (1997). """ return _sign(x) * _log1p(mu * _abs(x)) / _log1p(mu) def inv_mulaw(y, mu=256): """Inverse of mu-law companding (mu-law expansion) .. math:: f^{-1}(x) = sign(y) (1 / mu) (1 + mu)^{|y|} - 1) Args: y (array-like): Compressed signal. Each value of input signal must be in range of [-1, 1]. mu (number): Compression parameter ``μ``. Returns: array-like: Uncomprresed signal (-1 <= x <= 1) See also: :func:`nnmnkwii.preprocessing.inv_mulaw` :func:`nnmnkwii.preprocessing.mulaw_quantize` :func:`nnmnkwii.preprocessing.inv_mulaw_quantize` """ return _sign(y) * (1.0 / mu) * ((1.0 + mu)**_abs(y) - 1.0) def mulaw_quantize(x, mu=256): """Mu-Law companding + quantize Args: x (array-like): Input signal. Each value of input signal must be in range of [-1, 1]. mu (number): Compression parameter ``μ``. Returns: array-like: Quantized signal (dtype=int) - y ∈ [0, mu] if x ∈ [-1, 1] - y ∈ [0, mu) if x ∈ [-1, 1) .. note:: If you want to get quantized values of range [0, mu) (not [0, mu]), then you need to provide input signal of range [-1, 1). Examples: >>> from scipy.io import wavfile >>> import pysptk >>> import numpy as np >>> from nnmnkwii import preprocessing as P >>> fs, x = wavfile.read(pysptk.util.example_audio_file()) >>> x = (x / 32768.0).astype(np.float32) >>> y = P.mulaw_quantize(x) >>> print(y.min(), y.max(), y.dtype) 15 246 int64 See also: :func:`nnmnkwii.preprocessing.mulaw` :func:`nnmnkwii.preprocessing.inv_mulaw` :func:`nnmnkwii.preprocessing.inv_mulaw_quantize` """ mu = mu-1 y = mulaw(x, mu) # scale [-1, 1] to [0, mu] return _asint((y + 1) / 2 * mu) def inv_mulaw_quantize(y, mu=256): """Inverse of mu-law companding + quantize Args: y (array-like): Quantized signal (∈ [0, mu]). mu (number): Compression parameter ``μ``. 
Returns: array-like: Uncompressed signal ([-1, 1]) Examples: >>> from scipy.io import wavfile >>> import pysptk >>> import numpy as np >>> from nnmnkwii import preprocessing as P >>> fs, x = wavfile.read(pysptk.util.example_audio_file()) >>> x = (x / 32768.0).astype(np.float32) >>> x_hat = P.inv_mulaw_quantize(P.mulaw_quantize(x)) >>> x_hat = (x_hat * 32768).astype(np.int16) See also: :func:`nnmnkwii.preprocessing.mulaw` :func:`nnmnkwii.preprocessing.inv_mulaw` :func:`nnmnkwii.preprocessing.mulaw_quantize` """ # [0, m) to [-1, 1] mu = mu-1 y = 2 * _asfloat(y) / mu - 1 return inv_mulaw(y, mu) def _sign(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return np.sign(x) if (isnumpy or isscalar) else tf.sign(x) def _log1p(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return np.log1p(x) if (isnumpy or isscalar) else tf.log1p(x) def _abs(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return np.abs(x) if (isnumpy or isscalar) else tf.abs(x) def _asint(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return x.astype(np.int) if isnumpy else int(x) if isscalar else tf.cast(x, tf.int32) def _asfloat(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return x.astype(np.float32) if isnumpy else float(x) if isscalar else tf.cast(x, tf.float32) def frames_to_hours(n_frames,hparams): return sum((n_frame for n_frame in n_frames)) * hparams.frame_shift_ms / (3600 * 1000) def get_duration(audio,hparams): return librosa.core.get_duration(audio, sr=hparams.sample_rate) def _db_to_amp_tensorflow(x): return tf.pow(tf.ones(tf.shape(x)) * 10.0, x * 0.05) def _denormalize_tensorflow(S,hparams): return (tf.clip_by_value(S, 0, 1) * -hparams.min_level_db) + hparams.min_level_db def _griffin_lim_tensorflow(S,hparams): with tf.variable_scope('griffinlim'): S = tf.expand_dims(S, 0) S_complex = tf.identity(tf.cast(S, dtype=tf.complex64)) y = _istft_tensorflow(S_complex,hparams) for i in range(hparams.griffin_lim_iters): est = _stft_tensorflow(y,hparams) angles = est / tf.cast(tf.maximum(1e-8, tf.abs(est)), tf.complex64) y = _istft_tensorflow(S_complex * angles,hparams) return tf.squeeze(y, 0) def _istft_tensorflow(stfts,hparams): n_fft, hop_length, win_length = _stft_parameters(hparams) return tf.contrib.signal.inverse_stft(stfts, win_length, hop_length, n_fft) def _stft_tensorflow(signals,hparams): n_fft, hop_length, win_length = _stft_parameters(hparams) return tf.contrib.signal.stft(signals, win_length, hop_length, n_fft, pad_end=False) def _stft_parameters(hparams): n_fft = (hparams.num_freq - 1) * 2 # hparams.num_freq = 1025 hop_length = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate) # hparams.frame_shift_ms = 12.5 win_length = int(hparams.frame_length_ms / 1000 * hparams.sample_rate) # hparams.frame_length_ms = 50 return n_fft, hop_length, win_length ================================================ FILE: utils/infolog.py ================================================ import atexit from datetime import datetime import json from threading import Thread from urllib.request import Request, urlopen _format = '%Y-%m-%d %H:%M:%S.%f' _file = None _run_name = None _slack_url = None def init(filename, run_name, slack_url=None): global _file, _run_name, _slack_url 
    _close_logfile()
    _file = open(filename, 'a')
    _file.write('\n-----------------------------------------------------------------\n')
    _file.write('Starting new training run\n')
    _file.write('-----------------------------------------------------------------\n')
    _run_name = run_name
    _slack_url = slack_url


def log(msg, slack=False):
    print(msg)
    if _file is not None:
        _file.write('[%s] %s\n' % (datetime.now().strftime(_format)[:-3], msg))
    if slack and _slack_url is not None:
        Thread(target=_send_slack, args=(msg,)).start()


def _close_logfile():
    global _file
    if _file is not None:
        _file.close()
        _file = None


def _send_slack(msg):
    req = Request(_slack_url)
    req.add_header('Content-Type', 'application/json')
    urlopen(req, json.dumps({
        'username': 'tacotron',
        'icon_emoji': ':taco:',
        'text': '*%s*: %s' % (_run_name, msg)
    }).encode())


atexit.register(_close_logfile)

================================================ FILE: utils/plot.py ================================================
# coding: utf-8
import os
import matplotlib
import matplotlib.font_manager as font_manager
from jamo import h2j, j2hcj
import numpy as np

matplotlib.use('Agg')  # works around the Korean font rendering issue

#matplotlib.rc('font', family="NanumBarunGothic")
#font_manager._rebuild()   # <---- only needs to be run once

font_fname = './/utils//NanumBarunGothic.ttf'
font_name = font_manager.FontProperties(fname=font_fname).get_name()
matplotlib.rc('font', family=font_name)  # use the family name read from the bundled TTF

import matplotlib.pyplot as plt

from text import PAD, EOS
from utils import add_postfix
from text.korean import normalize


def plot(alignment, info, text, isKorean=True):
    char_len, audio_len = alignment.shape  # e.g. 145, 200
    fig, ax = plt.subplots(figsize=(char_len/5, 5))
    im = ax.imshow(
        alignment.T,
        aspect='auto',
        origin='lower',
        interpolation='none')

    xlabel = 'Encoder timestep'
    ylabel = 'Decoder timestep'
    if info is not None:
        xlabel += '\n{}'.format(info)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

    if text:
        if isKorean:
            jamo_text = j2hcj(h2j(normalize(text)))
        else:
            jamo_text = text
        pad = [PAD] * (char_len - len(jamo_text) - 1)
        A = [tok for tok in jamo_text] + [EOS] + pad
        A = [x if x != ' ' else '' for x in A]  # workaround: a space label stops the following labels from being drawn
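        # A now holds one tick label per encoder timestep: the jamo sequence, an EOS marker, then PAD
        # tokens up to char_len, so it lines up with the x-axis of the alignment plot below.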
plt.xticks(range(char_len), A) if text is not None: while True: if text[-1] in [EOS, PAD]: text = text[:-1] else: break plt.title(text) plt.tight_layout() def plot_alignment( alignment, path, info=None, text=None, isKorean=True): if text: # text = '대체 투입되었던 구급대원이' tmp_alignment = alignment[:len(h2j(text)) + 2] # '대체 투입되었던 구급대원이' 푼 후, 길이 측정 <--- padding제거 효과 plot(tmp_alignment, info, text, isKorean) plt.savefig(path, format='png') else: plot(alignment, info, text, isKorean) plt.savefig(path, format='png') print(" [*] Plot saved: {}".format(path)) def plot_spectrogram(pred_spectrogram, path, title=None, split_title=False, target_spectrogram=None, max_len=None, auto_aspect=False): if max_len is not None: target_spectrogram = target_spectrogram[:max_len] pred_spectrogram = pred_spectrogram[:max_len] if split_title: title = split_title_line(title) fig = plt.figure(figsize=(10, 8)) # Set common labels fig.text(0.5, 0.18, title, horizontalalignment='center', fontsize=16) #target spectrogram subplot if target_spectrogram is not None: ax1 = fig.add_subplot(311) ax2 = fig.add_subplot(312) if auto_aspect: im = ax1.imshow(np.rot90(target_spectrogram), aspect='auto', interpolation='none') else: im = ax1.imshow(np.rot90(target_spectrogram), interpolation='none') ax1.set_title('Target Mel-Spectrogram') fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1) ax2.set_title('Predicted Mel-Spectrogram') else: ax2 = fig.add_subplot(211) if auto_aspect: im = ax2.imshow(np.rot90(pred_spectrogram), aspect='auto', interpolation='none') else: im = ax2.imshow(np.rot90(pred_spectrogram), interpolation='none') fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax2) # 'horizontal' 'vertical' plt.tight_layout() plt.savefig(path, format='png') plt.close() ================================================ FILE: wavenet/__init__.py ================================================ # coding: utf-8 from .model import WaveNetModel from .ops import (mu_law_encode, mu_law_decode,optimizer_factory) ================================================ FILE: wavenet/mixture.py ================================================ # coding:utf-8 """ the code is adapted from: https://github.com/Rayhane-mamah/Tacotron-2/blob/master/wavenet_vocoder/models/mixture.py https://github.com/openai/pixel-cnn/blob/master/pixel_cnn_pp/nn.py https://github.com/r9y9/wavenet_vocoder/blob/master/wavenet_vocoder/mixture.py https://github.com/azraelkuan/tensorflow_wavenet_vocoder/tree/dev """ import tensorflow as tf import numpy as np def log_sum_exp(x): """ numerically stable log_sum_exp implementation that prevents overflow """ axis = len(x.get_shape()) - 1 m = tf.reduce_max(x, axis) m2 = tf.reduce_max(x, axis, keepdims=True) return m + tf.log(tf.reduce_sum(tf.exp(x - m2), axis)) def log_prob_from_logits(x): """ numerically stable log_softmax implementation that prevents overflow """ axis = len(x.get_shape()) - 1 m = tf.reduce_max(x, axis, keepdims=True) return x - m - tf.log(tf.reduce_sum(tf.exp(x - m), axis, keepdims=True)) # https://github.com/Rayhane-mamah/Tacotron-2/issues/155 <--- 설명 있음 def discretized_mix_logistic_loss(y_hat, y, num_class=256, log_scale_min=float(np.log(1e-14)), reduce=True): """ Discretized mixture of logistic distributions loss y_hat: Predicted output B x T x C y: Target B x T x 1 (-1~1) num_class: Number of classes log_scale_min: Log scale minimum value reduce: If True, the losses are averaged or summed for each minibatch :return: loss """ y_hat_shape = y_hat.get_shape().as_list() assert 
len(y_hat_shape) == 3 assert y_hat_shape[2] % 3 == 0 nr_mix = y_hat_shape[2] // 3 # 30 --> 10 # unpack parameters logit_probs = y_hat[:, :, :nr_mix] means = y_hat[:, :, nr_mix:2 * nr_mix] log_scales = tf.maximum(y_hat[:, :, nr_mix * 2:nr_mix * 3], log_scale_min) # B x T x 1 => B x T x nr_mix y = tf.tile(y, [1, 1, nr_mix]) centered_y = y - means inv_stdv = tf.exp(-log_scales) plus_in = inv_stdv * (centered_y + 1. / (num_class - 1)) cdf_plus = tf.nn.sigmoid(plus_in) min_in = inv_stdv * (centered_y - 1. / (num_class - 1)) cdf_min = tf.nn.sigmoid(min_in) log_cdf_plus = plus_in - tf.nn.softplus(plus_in) # log probability for edge case of 0 (before scaling) equivalent tf.log(cdf_plus) log_one_minus_cdf_min = -tf.nn.softplus(min_in) # log probability for edge case of 255 (before scaling) equivalent tf.log(1-cdf_min) cdf_delta = cdf_plus - cdf_min # probability for all other cases mid_in = inv_stdv * centered_y #log probability in the center of the bin, to be used in extreme cases #(not actually used in this code) log_pdf_mid = mid_in - log_scales - 2. * tf.nn.softplus(mid_in) # mid 값을 pdf에 직접 넣고 계산하면 나온다. log_probs = tf.where(y < -0.999, log_cdf_plus, tf.where(y > 0.999, log_one_minus_cdf_min, tf.where(cdf_delta > 1e-5, tf.log(tf.maximum(cdf_delta, 1e-12)),log_pdf_mid - np.log((num_class - 1) / 2)))) log_probs = log_probs + tf.nn.log_softmax(logit_probs, -1) # log_probs = log_probs + log_prob_from_logits(logit_probs) if reduce: return -tf.reduce_sum(log_sum_exp(log_probs)) else: return -log_sum_exp(log_probs) def sample_from_discretized_mix_logistic(y, log_scale_min=float(np.log(1e-14))): """ :param y: B x T x C :param log_scale_min: :return: [-1, 1] """ # 아래 코드에서 2번의 uniform random sampling이 있는데, 한번은 Gumbel distribution으로 부터 sampling을 위한 것이고, 또 한번은 logistic distribution을 위한 것이다. y_shape = y.get_shape().as_list() assert len(y_shape) == 3 assert y_shape[2] % 3 == 0 nr_mix = y_shape[2] // 3 logit_probs = y[:, :, :nr_mix] # u: random_uniform --> -log(-log(u)): standard Gumbel random sample # category 결정을 위해 logit_probs(softmax 취하기 전의 값) + ( -log(-log(u)) ) ---> argmax를 취하면 category가 결정된다. sel = tf.one_hot(tf.argmax(logit_probs - tf.log(-tf.log(tf.random_uniform(tf.shape(logit_probs), minval=1e-5, maxval=1. - 1e-5))), 2), depth=nr_mix, dtype=tf.float32) means = tf.reduce_sum(y[:, :, nr_mix:nr_mix * 2] * sel, axis=2) log_scales = tf.maximum(tf.reduce_sum(y[:, :, nr_mix * 2:nr_mix * 3] * sel, axis=2), log_scale_min) # output audio를 만들기 위해 logistic distribution으로 부터 sampling u = tf.random_uniform(tf.shape(means), minval=1e-5, maxval=1. - 1e-5) x = means + tf.exp(log_scales) * (tf.log(u) - tf.log(1. - u)) # u을 logistic distribution의 cdf의 역함수에 대입. x = tf.minimum(tf.maximum(x, -1.), 1.) 
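    # x = mu + s * (log(u) - log(1 - u)) evaluates the inverse CDF (quantile function) of the logistic
    # distribution at the uniform sample u; the clamp above then keeps the generated audio sample in [-1, 1].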
return x ================================================ FILE: wavenet/model.py ================================================ # coding: utf-8 import numpy as np import tensorflow as tf from .ops import mu_law_encode,optimizer_factory,SubPixelConvolution from .mixture import discretized_mix_logistic_loss, sample_from_discretized_mix_logistic class WaveNetModel(object): def __init__(self,batch_size,dilations,filter_width,residual_channels,dilation_channels,skip_channels,quantization_channels=2**8,out_channels=30, use_biases=False,scalar_input=False,global_condition_channels=None, global_condition_cardinality=None,local_condition_channels=80,upsample_factor=None,legacy=True,residual_legacy=True,train_mode=True,drop_rate=0.0): self.batch_size = batch_size self.dilations = dilations self.filter_width = filter_width self.residual_channels = residual_channels self.dilation_channels = dilation_channels self.quantization_channels = quantization_channels self.use_biases = use_biases self.skip_channels = skip_channels self.scalar_input = scalar_input self.global_condition_channels = global_condition_channels self.global_condition_cardinality = global_condition_cardinality self.local_condition_channels=local_condition_channels self.upsample_factor=upsample_factor self.train_mode = train_mode self.out_channels = out_channels self.legacy=legacy self.residual_legacy=residual_legacy self.drop_rate = drop_rate self.ema = tf.train.ExponentialMovingAverage(decay=0.9999) self.receptive_field = WaveNetModel.calculate_receptive_field(self.filter_width, self.dilations) @staticmethod def calculate_receptive_field(filter_width, dilations): # causal 때문에 length (T-1) + (여기서 계산되는 receptive_field만큼의 padding) --> 최종 output의 길이가 T가 된다. receptive_field = (filter_width - 1) * sum(dilations) + 1 # 마지막 +1은 causal condition 때문에 1개 자른 것의 때문에 길이가 T-1인 되기 때문에 +1을 통해서 입력과 같은 길이 T가 된다. return receptive_field def _create_causal_layer(self, input_batch): with tf.name_scope('causal_layer'): if self.scalar_input: return tf.layers.conv1d(input_batch,filters=self.residual_channels,kernel_size=1,padding='valid',dilation_rate=1,use_bias=True) else: return tf.layers.conv1d(input_batch,filters=self.residual_channels,kernel_size=1,padding='valid',dilation_rate=1,use_bias=True) def _create_queue(self): # first layer(causal layer)나 local condition은 kernel_size = 1이므로, Queue가 필요없다. with tf.variable_scope('queue'): self.dilation_queue=[] for i,d in enumerate(self.dilations): q = tf.Variable(initial_value=tf.zeros(shape=[self.batch_size,d*(self.filter_width-1)+1,self.residual_channels], dtype=tf.float32), name='dilation_queue'.format(i), trainable=False) self.dilation_queue.append(q) # restore했을 때, Dilation_Queue,Causal_Queue는 0으로 initialization해야 한다. self.queue_initializer= tf.variables_initializer(self.dilation_queue) def _create_dilation_layer(self, input_batch, layer_index, dilation,local_condition_batch,global_condition_batch): # input_batch는 train mode에서는 길이 줄어드는 것을 대비하여 padding이 되어 있다. 
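        # One residual block of the WaveNet stack:
        #  - training: left-pad by (filter_width - 1) * dilation so the dilated causal conv preserves the time length;
        #  - incremental generation: keep the last (filter_width - 1) * dilation + 1 inputs in a per-layer queue
        #    and compute the convolution as a single matmul over the queued samples;
        #  - gated activation tanh(conv_filter) * sigmoid(conv_gate), with optional global (speaker) and
        #    local (mel) conditioning added through 1x1 convs;
        #  - two 1x1 convs produce the residual output (input to the next block) and the skip contribution.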
with tf.variable_scope('dilation_layer'): residual = input_batch if self.train_mode: # padding padding = (self.filter_width - 1)*dilation input_batch = tf.pad(input_batch, tf.constant([(0, 0), (padding, 0), (0, 0)])) else: self.dilation_queue[layer_index] = tf.scatter_update(self.dilation_queue[layer_index],tf.range(self.batch_size),tf.concat([self.dilation_queue[layer_index][:,1:,:],input_batch],axis=1) ) input_batch = self.dilation_queue[layer_index] input_batch = tf.layers.dropout(input_batch,rate=self.drop_rate,training=self.train_mode) dilation_layer = tf.layers.Conv1D(filters=self.dilation_channels*2,kernel_size=self.filter_width,dilation_rate=dilation,padding='valid',use_bias=self.use_biases,name='conv_filter_gate') if self.train_mode: conv = dilation_layer(input_batch) conv_filter, conv_gate = tf.split(conv,2,axis=-1) else: dilation_layer.build((self.batch_size,1,input_batch.shape.as_list()[-1])) # shape의 마지막만 중요함. kernel을 잡는데 마지막 차원만 사용됨 linearized_weights = tf.reshape(dilation_layer.kernel,(-1,self.dilation_channels*2)) input_batch = input_batch[:, 0::dilation, :] temp = tf.matmul(tf.reshape(input_batch,(self.batch_size,-1)), linearized_weights) if self.use_biases: temp = tf.nn.bias_add(temp, dilation_layer.bias) conv_filter, conv_gate = tf.split(tf.expand_dims(temp,1),2,axis=-1) if global_condition_batch is not None: conv_filter += tf.layers.conv1d(global_condition_batch,filters=self.dilation_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="gc_filter") conv_gate += tf.layers.conv1d(global_condition_batch,filters=self.dilation_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="gc_gate") if local_condition_batch is not None: local_filter = tf.layers.conv1d(local_condition_batch,filters=self.dilation_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="lc_filter") local_gate = tf.layers.conv1d(local_condition_batch,filters=self.dilation_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="lc_gate") conv_filter += local_filter conv_gate += local_gate out = tf.tanh(conv_filter) * tf.sigmoid(conv_gate) # The 1x1 conv to produce the residual output == FC transformed = tf.layers.conv1d(out,filters=self.residual_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="dense") # The 1x1 conv to produce the skip output skip_contribution = tf.layers.conv1d(out,filters=self.skip_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="skip") # residual + transformed: 다음 단계의 입력으로 들어감 if self.residual_legacy: out = (residual + transformed) * np.sqrt(0.5) else: out = residual + transformed return skip_contribution, out # skip_contribution: 결과값으로 쌓임. def create_upsample(self, local_condition_batch,upsample_type='SubPixel'): local_condition_batch = tf.expand_dims(local_condition_batch, [3]) # local condition batch N H W C freq_axis_kernel_size = self.filter_width # Rayhane-mamah 코드에서는 hyper parameter로 받음. frame(num_mels)에 적용되는 kernel_size임 for i in range(len(self.upsample_factor)): if upsample_type =='SubPixel': # NN_init, NN_scaler <---- hyper parameter이지만, 여기서는 True, 0.3으로 고정 # kernel_size: (3, hparams.freq_axis_kernel_size) 이렇게 되어 있는데, 왜 3인지 모르겠음. upsample_factor[i]로 대체. 
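                # Each pass through this loop stretches the mel spectrogram along the time axis by
                # upsample_factor[i]; the product of all factors is expected to match the audio hop size,
                # so the upsampled local condition provides one mel-derived vector per waveform sample.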
# freq_axis_kernel_size는 hparams에 3으로 되어 있는데, 여기서는 filter_width로 처리 <---- frame(num_mels)에 적용되는 kernel_size임 subpixel_layer = SubPixelConvolution(filters=1, kernel_size=(self.upsample_factor[i],freq_axis_kernel_size),padding='same', strides=(self.upsample_factor[i],1), NN_init=True, NN_scaler=0.3,up_layers=len(self.upsample_factor), name='SubPixelConvolution_layer_{}'.format(i)) local_condition_batch = subpixel_layer(local_condition_batch) else: local_condition_batch = tf.layers.conv2d_transpose(local_condition_batch,filters=1, kernel_size=(self.upsample_factor[i], freq_axis_kernel_size), strides=(self.upsample_factor[i],1),padding='same',use_bias=False,name='upsample_2D_{}'.format(i)) local_condition_batch = tf.nn.relu(local_condition_batch) # for debugging #local_condition_batch = tf.Print(local_condition_batch,[tf.shape(local_condition_batch),"xx{}".format(i)]) local_condition_batch = tf.squeeze(local_condition_batch, [3]) return local_condition_batch def _create_network(self, input_batch,local_condition_batch, global_condition_batch): '''Construct the WaveNet network.''' # global_condition_batch: (batch_size, 1, self.global_condition_channels) <--- 가운데 1은 크기 1짜리 data FC대신에 conv1d를 적용하기 위해 강제로 넣었다고 봐야 한다. if self.train_mode==False: self._create_queue() current_layer = input_batch # causal cut으로 길이 1이 줄어든 상태 # Pre-process the input with a regular convolution current_layer = self._create_causal_layer(current_layer) # 여전 모델에서는 길이가 줄었지만, 수정 후에는 길이 불변 # Add all defined dilation layers. outputs = None with tf.variable_scope('dilated_stack'): for layer_index, dilation in enumerate(self.dilations): # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512] with tf.variable_scope('layer{}'.format(layer_index)): output, current_layer = self._create_dilation_layer(current_layer, layer_index, dilation,local_condition_batch,global_condition_batch) if outputs is None: outputs = output else: outputs = outputs + output if self.legacy: outputs = outputs * np.sqrt(0.5) with tf.name_scope('postprocessing'): # Perform (+) -> ReLU -> 1x1 conv -> ReLU -> 1x1 conv to # postprocess the output. transformed1 = tf.nn.relu(outputs) conv1 = tf.layers.conv1d(transformed1,filters=self.skip_channels,kernel_size=1,padding="same",use_bias=self.use_biases) transformed2 = tf.nn.relu(conv1) if self.scalar_input: conv2 = tf.layers.conv1d(transformed2,filters=self.out_channels,kernel_size=1,padding="same",use_bias=self.use_biases) else: conv2 = tf.layers.conv1d(transformed2,filters=self.quantization_channels,kernel_size=1,padding="same",use_bias=self.use_biases) return conv2 def _one_hot(self, input_batch): '''One-hot encodes the waveform amplitudes. This allows the definition of the network as a categorical distribution over a finite set of possible amplitudes. ''' with tf.name_scope('one_hot_encode'): encoded = tf.one_hot(input_batch, depth=self.quantization_channels, dtype=tf.float32) # (1, ?, 1) --> (1, ?, 1, 256) shape = [self.batch_size, -1, self.quantization_channels] encoded = tf.reshape(encoded, shape) # (1, ?, 1, 256) --> (1, ?, 256) return encoded def _embed_gc(self, global_condition): # global_condition = global_condition_batch <---- data '''Returns embedding for global condition. :param global_condition: Either ID of global condition for tf.nn.embedding_lookup or actual embedding. The latter is experimental. 
:return: Embedding or None ''' # global_condition: (N,) # self.global_condition_cardinality가 None이 아니며, global_condition 은 gc id이면 되고, 그렇지 않으면, global_condition은 embedding vector가 넘어와야 한다. embedding = None if self.global_condition_cardinality is not None: # Only lookup the embedding if the global condition is presented # as an integer of mutually-exclusive categories ... embedding_table = tf.get_variable('gc_embedding', [self.global_condition_cardinality, self.global_condition_channels], dtype=tf.float32,initializer=tf.contrib.layers.xavier_initializer(uniform=False)) # (2, 32) embedding = tf.nn.embedding_lookup(embedding_table,global_condition) elif global_condition is not None: # ... else the global_condition (if any) is already provided # as an embedding. # In this case, the number of global_embedding channels must be # equal to the the last dimension of the global_condition tensor. gc_batch_rank = len(global_condition.get_shape()) dims_match = (global_condition.get_shape()[gc_batch_rank - 1] == self.global_condition_channels) if not dims_match: raise ValueError('Shape of global_condition {} does not match global_condition_channels {}.'.format(global_condition.get_shape(), self.global_condition_channels)) embedding = global_condition if embedding is not None: embedding = tf.reshape(embedding,[self.batch_size, 1, self.global_condition_channels]) return embedding def predict_proba_incremental(self, waveform,upsampled_local_condition=None, global_condition=None,name='wavenet'): """ local_condition: upsampled local condition """ with tf.variable_scope(name,reuse=tf.AUTO_REUSE): if self.scalar_input: encoded = tf.reshape(waveform , [self.batch_size, -1, 1]) # (N,1,1) else: encoded = tf.one_hot(waveform, self.quantization_channels) encoded = tf.reshape(encoded, [self.batch_size,-1, self.quantization_channels]) # encoded shape=(N,1, 256) gc_embedding = self._embed_gc(global_condition) # --> shape=(1, 1, 32) # local condition if upsampled_local_condition is not None: upsampled_local_condition = tf.reshape(upsampled_local_condition , [self.batch_size, -1, self.local_condition_channels]) raw_output = self._create_network(encoded,upsampled_local_condition,gc_embedding) # 이것이 fast generation algorithm의 핵심 --> (batch_size, 1, 256) if self.scalar_input: out = tf.reshape(raw_output, [self.batch_size, -1, self.out_channels]) proba = sample_from_discretized_mix_logistic(out) else: out = tf.reshape(raw_output, [self.batch_size, self.quantization_channels]) proba = tf.cast(tf.nn.softmax(tf.cast(out, tf.float64)), tf.float32) return proba def add_loss(self, input_batch,local_condition=None, global_condition_batch=None, l2_regularization_strength=None,upsample_type=None, name='wavenet'): '''Creates a WaveNet network and returns the autoencoding loss. The variables are all scoped to the given name. ''' with tf.variable_scope(name): # We mu-law encode and quantize the input audioform. # quantization_channels 크기의 one hot encoding을 적용한 예정. 16bit= 65536개였다면, quantization_channels로 줄이는 효과가 있다. # mu law encoding은 bit를 단순히 줄이는 것보다 advanced된 방식으로 줄인다. 
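            # mu-law companding: f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), then the [-1, 1] result is
            # mapped to integer classes 0 .. quantization_channels-1; small amplitudes get proportionally
            # more levels than a plain linear 8-bit quantization would give them.
            # Illustrative values (mu = 255): x = -1.0 -> class 0, x = 0.0 -> class 128, x = 1.0 -> class 255.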
# input_batch: (batch_size,?,1) <-- 마지막 1은 channel 1을 의미 encoded_input = mu_law_encode(input_batch, self.quantization_channels) # "quantization_channels": 256 ---> (batch_size, ?, 1) gc_embedding = self._embed_gc(global_condition_batch) # (self.batch_size, 1, self.global_condition_channels) <--- 가운데 1은 강제로 reshape encoded = self._one_hot(encoded_input) # (1, ?, quantization_channels=256) if self.scalar_input: network_input = tf.reshape( tf.cast(input_batch, tf.float32), [self.batch_size, -1, 1]) else: network_input = encoded # Cut off the last sample of network input to preserve causality. network_input_width = tf.shape(network_input)[1] - 1 if self.scalar_input: input = tf.slice(network_input, [0, 0, 0], [-1, network_input_width,1]) else: input = tf.slice(network_input, [0, 0, 0], [-1, network_input_width, self.quantization_channels]) # local condition if local_condition is not None: local_condition = self.create_upsample(local_condition,upsample_type) local_condition = tf.slice(local_condition, [0, 0, 0], [-1, network_input_width,self.local_condition_channels]) raw_output = self._create_network(input,local_condition, gc_embedding) # (batch_size, ?, quantization_channels=256) , (batch_size, 1, self.global_condition_channels) with tf.name_scope('loss'): # Cut off the samples corresponding to the receptive field # for the first predicted sample. # scalar input인 경우에도 target은 mu-law companding된 것이 된다. target_output = tf.slice(network_input , [0, 1, 0],[-1, -1, -1]) # [-1,-1,-1] --> 나머지 모두 if self.scalar_input: loss = discretized_mix_logistic_loss(raw_output, target_output,num_class=2**16, reduce=False) reduced_loss = tf.reduce_mean(loss) else: # 3 dim array의 loss를 계산학 위해, 2 dim으로 변환한다. batch와 time 부분을 합쳐서 2dim으로 변환 target_output = tf.reshape(target_output, [-1, self.quantization_channels]) prediction = tf.reshape(raw_output, [-1, self.quantization_channels]) loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=target_output) reduced_loss = tf.reduce_mean(loss) tf.summary.scalar('loss', reduced_loss) if l2_regularization_strength is None: self.loss = reduced_loss else: # L2 regularization for all trainable parameters l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if not('bias' in v.name)]) # Add the regularization term to the loss total_loss = (reduced_loss + l2_regularization_strength * l2_loss) tf.summary.scalar('l2_loss', l2_loss) tf.summary.scalar('total_loss', total_loss) self.loss = total_loss def add_optimizer(self, hparams,global_step): '''Adds optimizer to the graph. Supposes that initialize function has already been called. ''' with tf.variable_scope('optimizer'): hp = hparams learning_rate = tf.train.exponential_decay(hp.wavenet_learning_rate, global_step,hp.wavenet_decay_steps,hp.wavenet_decay_rate) #Adam optimization self.learning_rate = learning_rate optimizer = tf.train.AdamOptimizer(learning_rate) gradients, variables = zip(*optimizer.compute_gradients(self.loss)) # len(tf.trainable_variables()) = len(variables) self.gradients = gradients #Gradients clipping if hp.wavenet_clip_gradients: # Rayhane-mamah는 tf.clip_by_norm -> tf.clip_by_value 두 단계를 적용. 
여기서는 tf.clip_by_global_norm clipped_gradients, _ = tf.clip_by_global_norm(gradients, 1) # tf.clip_by_global_norm vs tf.clip_by_norm else: clipped_gradients = gradients with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): adam_optimize = optimizer.apply_gradients(zip(clipped_gradients, variables),global_step=global_step) #Add exponential moving average #https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage #Use adam optimization process as a dependency with tf.control_dependencies([adam_optimize]): #Create the shadow variables and add ops to maintain moving averages #Also updates moving averages after each update step #This is the optimize call instead of traditional adam_optimize one. assert tuple(tf.trainable_variables()) == variables #Verify all trainable variables are being averaged self.optimize = self.ema.apply(variables) ================================================ FILE: wavenet/ops.py ================================================ # coding: utf-8 import tensorflow as tf import numpy as np def create_adam_optimizer(learning_rate, momentum): return tf.train.AdamOptimizer(learning_rate=learning_rate, epsilon=1e-4) def create_sgd_optimizer(learning_rate, momentum): return tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=momentum) def create_rmsprop_optimizer(learning_rate, momentum): return tf.train.RMSPropOptimizer(learning_rate=learning_rate, momentum=momentum, epsilon=1e-5) optimizer_factory = {'adam': create_adam_optimizer, 'sgd': create_sgd_optimizer, 'rmsprop': create_rmsprop_optimizer} def mu_law_encode(audio, quantization_channels): '''Quantizes waveform amplitudes.''' with tf.name_scope('encode'): mu = tf.to_float(quantization_channels - 1) # Perform mu-law companding transformation (ITU-T, 1988). # Minimum operation is here to deal with rare large amplitudes caused # by resampling. safe_audio_abs = tf.minimum(tf.abs(audio), 1.0) magnitude = tf.log1p(mu * safe_audio_abs) / tf.log1p(mu) # tf.log1p(x) = log(1+x) signal = tf.sign(audio) * magnitude # Quantize signal to the specified number of levels. return tf.to_int32((signal + 1) / 2 * mu + 0.5) def mu_law_decode(output, quantization_channels, quantization=True): '''Recovers waveform from quantized values.''' with tf.name_scope('decode'): mu = quantization_channels - 1 # Map values back to [-1, 1]. if quantization: signal = 2 * (tf.to_float(output) / mu) - 1 else: signal = output # Perform inverse of mu-law transformation. magnitude = (1 / mu) * ((1 + mu)**abs(signal) - 1) return tf.sign(signal) * magnitude class SubPixelConvolution(tf.layers.Conv2D): '''Sub-Pixel Convolutions are vanilla convolutions followed by Periodic Shuffle. They serve the purpose of upsampling (like deconvolutions) but are faster and less prone to checkerboard artifact with the right initialization. In contrast to ResizeConvolutions, SubPixel have the same computation speed (when using same n° of params), but a larger receptive fields as they operate on low resolution. ''' def __init__(self, filters, kernel_size, padding, strides, NN_init, NN_scaler, up_layers, name=None, **kwargs): #Output channels = filters * H_upsample * W_upsample conv_filters = filters * strides[0] * strides[1] #Create initial kernel self.NN_init = NN_init self.up_layers = up_layers self.NN_scaler = NN_scaler init_kernel = tf.constant_initializer(self._init_kernel(kernel_size, strides, conv_filters), dtype=tf.float32) if NN_init else None #Build convolution component and save Shuffle parameters. 
        super(SubPixelConvolution, self).__init__(
            filters=conv_filters,
            kernel_size=kernel_size,
            strides=(1, 1),
            padding=padding,
            kernel_initializer=init_kernel,
            bias_initializer=tf.zeros_initializer(),
            data_format='channels_last',
            name=name,
            **kwargs)

        self.out_filters = filters
        self.shuffle_strides = strides
        self.scope = name if name is not None else 'SubPixelConvolution'  # fall back to a default scope name

    def build(self, input_shape):
        '''Build SubPixel initial weights (ICNR: avoid checkerboard artifacts).

        To ensure a checkerboard-free SubPixel Conv, the initial weights must make the subpixel conv
        equivalent to conv -> NN resize. To do that, we replace the initial kernel with the special
        kernel W_n == W_0 for all n <= out_channels. In other words, we want our initial kernel to
        extract feature maps and then apply nearest-neighbor upsampling. NN upsampling is guaranteed
        to happen when we force all output channels to be equal (neighbor pixels are duplicated).
        We can think of this as limiting our initial subpixel conv to a low-resolution conv (1 channel)
        followed by a duplication (made by PS).
        Ref: https://arxiv.org/pdf/1707.02937.pdf
        '''
        #Initialize layer
        super(SubPixelConvolution, self).build(input_shape)

        if not self.NN_init:
            #If no NN init is used, ensure all channel-wise parameters are equal.
            self.built = False

            #Get W_0 which is the first filter of the first output channels
            W_0 = tf.expand_dims(self.kernel[:, :, :, 0], axis=3)  #[H_k, W_k, in_c, 1]

            #Tile W_0 across all output channels and replace original kernel
            self.kernel = tf.tile(W_0, [1, 1, 1, self.filters])  #[H_k, W_k, in_c, out_c]

            self.built = True

    def call(self, inputs):
        with tf.variable_scope(self.scope) as scope:
            #Inputs are supposed [batch_size, freq, time_steps, channels]
            convolved = super(SubPixelConvolution, self).call(inputs)  #[batch_size, up_freq, up_time_steps, channels]
            return self.PS(convolved)

    def PS(self, inputs):
        #Get different shapes
        #[batch_size, H, W, C(out_c * r1 * r2)]
        batch_size = tf.shape(inputs)[0]
        H = tf.shape(inputs)[1]
        W = tf.shape(inputs)[2]
        C = inputs.shape[-1]
        r1, r2 = self.shuffle_strides  #supposing strides = (freq_stride, time_stride)
        out_c = self.out_filters  #number of filters as output of the convolution (usually 1 for this model)
        assert C == r1 * r2 * out_c

        #Split and shuffle (output) channels separately.
(Split-Concat block) Xc = tf.split(inputs, out_c, axis=3) # out_c x [batch_size, H, W, C/out_c] outputs = tf.concat([self._phase_shift(x, batch_size, H, W, r1, r2) for x in Xc], 3) #[batch_size, r1 * H, r2 * W, out_c] with tf.control_dependencies([tf.assert_equal(out_c, tf.shape(outputs)[-1]), tf.assert_equal(H * r1, tf.shape(outputs)[1])]): outputs = tf.identity(outputs, name='SubPixelConv_output_check') return tf.reshape(outputs, [tf.shape(outputs)[0], r1 * H, tf.shape(outputs)[2], out_c]) def _phase_shift(self, inputs, batch_size, H, W, r1, r2): #Do a periodic shuffle on each output channel separately x = tf.reshape(inputs, [batch_size, H, W, r1, r2]) #[batch_size, H, W, r1, r2] #Width dim shuffle x = tf.transpose(x, [4, 2, 3, 1, 0]) #[r2, W, r1, H, batch_size] x = tf.batch_to_space_nd(x, [r2], [[0, 0]]) #[1, r2*W, r1, H, batch_size] x = tf.squeeze(x, [0]) #[r2*W, r1, H, batch_size] #Height dim shuffle x = tf.transpose(x, [1, 2, 0, 3]) #[r1, H, r2*W, batch_size] x = tf.batch_to_space_nd(x, [r1], [[0, 0]]) #[1, r1*H, r2*W, batch_size] x = tf.transpose(x, [3, 1, 2, 0]) #[batch_size, r1*H, r2*W, 1] return x def _init_kernel(self, kernel_size, strides, filters): '''Nearest Neighbor Upsample (Checkerboard free) init kernel size ''' overlap = kernel_size[1] // strides[1] init_kernel = np.zeros(kernel_size, dtype=np.float32) i = kernel_size[1] // 2 j = [kernel_size[0] // 2 - 1, kernel_size[0] // 2] if kernel_size[0] % 2 == 0 else [kernel_size[0] // 2] for j_i in j: init_kernel[j_i,i] = 1. / max(overlap, 1.) if kernel_size[1] % 2 == 0 else 1. init_kernel = np.tile(np.expand_dims(init_kernel, 2), [1, 1, 1, filters]) return init_kernel * (self.NN_scaler)**(1/self.up_layers) ================================================ FILE: 명령어모음.txt ================================================ python preprocess.py --num_workers 10 --name son --in_dir .\datasets\son --out_dir .\data\son python preprocess.py --num_workers 10 --name moon --in_dir .\datasets\moon --out_dir .\data\moon python train_tacotron2.py python train_vocoder.py python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다" python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다" python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-27T20-27-18 python generate.py --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-27T20-27-18 python generate.py --mel ./logdir-wavenet/moon-Aust.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-27T20-27-18 python generate.py --mel ./logdir-wavenet/son-Aust.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-27T20-27-18