Repository: hccho2/Tacotron2-Wavenet-Korean-TTS
Branch: master
Commit: 9215afde67a2
Files: 36
Total size: 254.3 KB

Directory structure:
gitextract_8q9e32ds/
├── LICENSE
├── ReadMe.md
├── datasets/
│   ├── __init__.py
│   ├── datafeeder_tacotron2.py
│   ├── datafeeder_wavenet.py
│   ├── moon/
│   │   └── moon-recognition-All.json
│   ├── moon.py
│   ├── son/
│   │   └── son-recognition-All.json
│   └── son.py
├── generate.py
├── hparams.py
├── preprocess.py
├── synthesizer.py
├── tacotron2/
│   ├── __init__.py
│   ├── helpers.py
│   ├── modules.py
│   ├── rnn_wrappers.py
│   └── tacotron2.py
├── text/
│   ├── __init__.py
│   ├── cleaners.py
│   ├── en_numbers.py
│   ├── english.py
│   ├── ko_dictionary.py
│   ├── korean.py
│   └── symbols.py
├── train_tacotron2.py
├── train_vocoder.py
├── utils/
│   ├── __init__.py
│   ├── audio.py
│   ├── infolog.py
│   └── plot.py
├── wavenet/
│   ├── __init__.py
│   ├── mixture.py
│   ├── model.py
│   └── ops.py
└── 명령어모음.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Heecheol Cho

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: ReadMe.md
================================================
# Multi-Speaker Tacotron2 + Wavenet Vocoder + Korean TTS

This project implements Korean TTS by combining a Tacotron2 model with a Wavenet vocoder. The Tacotron2 model has been extended to a multi-speaker model.

Based on
- https://github.com/keithito/tacotron
- https://github.com/carpedm20/multi-speaker-tacotron-tensorflow
- https://github.com/Rayhane-mamah/Tacotron-2
- https://github.com/hccho2/Tacotron-Wavenet-Vocoder

## Tacotron 2
- For an explanation of the Tacotron model, see the earlier [repo](https://github.com/hccho2/Tacotron-Wavenet-Vocoder).
- [Tacotron2](https://arxiv.org/abs/1712.05884) changes the model architecture and introduces Location Sensitive Attention, a Stop Token, and Wavenet as the vocoder.
- The best-known Tacotron2 implementation is [Rayhane-mamah](https://github.com/Rayhane-mamah/Tacotron-2), which itself builds on the code of [keithito](https://github.com/keithito/tacotron) and [r9y9](https://github.com/r9y9/wavenet_vocoder).

## This Project
* The goal is Korean TTS with a Tacotron2 model.
* The [Rayhane-mamah](https://github.com/Rayhane-mamah/Tacotron-2) implementation uses many customized layers, which seemed overly complex, so this project reduces the number of custom layers and relies more on layers built into Tensorflow.
* Teacher-forced train samples become intelligible from around step 2,000, and free-running (non-teacher-forced) test samples from around step 3,000.
## Step-by-Step Execution

### Order of Execution
- Data generation: for generating the Korean data, see the earlier [repo](https://github.com/hccho2/Tacotron-Wavenet-Vocoder).
- Point 'data_paths' below at the generated data.
- After training tacotron, test with synthesizer.py.
- After training wavenet, test with generate.py (you can test with a mel spectrogram that tacotron did not produce, or with one that tacotron produced).
- After both models are trained, test by feeding the mel spectrogram produced by tacotron into wavenet as the local condition.

### Tacotron2 Training
- Set '--data_paths' inside train_tacotron2.py and then train. data_paths can list multiple data directories.
```
parser.add_argument('--data_paths', default='.\\data\\moon,.\\data\\son')
```
- To resume training, set '--load_path'.
```
parser.add_argument('--load_path', default='logdir-tacotron2/moon+son_2019-02-27_00-21-42')
```
- model_type can be 'single' or 'multi-speaker'. For a single speaker, set model_type = 'single' in hparams and pass only one directory to '--data_paths' in train_tacotron2.py.
```
parser.add_argument('--data_paths', default='D:\\Tacotron2\\data\\moon')
```
- Since the hyperparameters are all set in hparams.py and the arguments in train_tacotron2.py, running training is simple:
> python train_tacotron2.py
- To synthesize speech after training, run the following. '--num_speakers' and '--speaker_id' must be set correctly.
> python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다."

### Wavenet Vocoder Training
- Set '--data_dir' inside train_vocoder.py and then train.
- If training runs out of memory or is too slow, reduce the sample_size hyperparameter; you can of course also reduce batch_size.
```
DATA_DIRECTORY = 'D:\\Tacotron2\\data\\moon,D:\\Tacotron2\\data\\son'
parser.add_argument('--data_dir', type=str, default=DATA_DIRECTORY, help='The directory containing data')
```
- To resume training, set '--logdir'.
```
LOGDIR = './/logdir-wavenet//train//2018-12-21T22-58-10'
parser.add_argument('--logdir', type=str, default=LOGDIR)
```
- After training wavenet, feed a mel spectrogram (npy file) produced by tacotron as the local condition to obtain the final TTS output.
> python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10

### Result
- Tacotron batch_size = 32, Wavenet batch_size = 8, on a GTX 1080ti.
- Tacotron was trained for about 100K steps, Wavenet for about 177K steps.
- The samples directory contains the generated wav files.
- There are samples generated with Griffin-Lim and samples generated with the Wavenet vocoder.
- The Wavenet-generated audio still contains noise due to insufficient training.
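A quick way to sanity-check the pipeline (this snippet is not part of the repository, and the file paths are placeholders for your own outputs): load one preprocessed .npz example and one tacotron-generated mel .npy, and estimate how many samples generate.py will synthesize from it (mel frames × hop_size).
```
import numpy as np
from hparams import hparams
from utils import audio

# one example written by preprocess.py (keys: audio, mel, linear, tokens, loss_coeff, ...)
data = np.load('./data/moon/003.0000.npz')
print(data['tokens'].shape, data['mel'].shape, data['linear'].shape)

# a tacotron-generated mel spectrogram, i.e. the --mel input of generate.py
mel = np.load('./logdir-wavenet/mel-moon.npy')
hop_size = audio.get_hop_size(hparams)
print('samples:', mel.shape[0] * hop_size,
      'seconds:', mel.shape[0] * hop_size / hparams.sample_rate)
```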
================================================ FILE: datasets/__init__.py ================================================ # -*- coding: utf-8 -*- from .datafeeder_wavenet import DataFeederWavenet ================================================ FILE: datasets/datafeeder_tacotron2.py ================================================ # coding: utf-8 import os import time import pprint import random import threading import traceback import numpy as np from glob import glob import tensorflow as tf from collections import defaultdict import text from utils.infolog import log from utils import parallel_run, remove_file from utils.audio import frames_to_hours _pad = 0 _stop_token_pad = 1 def get_frame(path): data = np.load(path) n_frame = data["linear"].shape[0] n_token = len(data["tokens"]) return (path, n_frame, n_token) def get_path_dict(data_dirs, hparams, config,data_type, n_test=None,rng=np.random.RandomState(123)): # Load metadata: path_dict = {} for data_dir in data_dirs: # ['datasets/moon\\data'] paths = glob("{}/*.npz".format(data_dir)) # ['datasets/moon\\data\\001.0000.npz', 'datasets/moon\\data\\001.0001.npz', 'datasets/moon\\data\\001.0002.npz', ...] if data_type == 'train': rng.shuffle(paths) # ['datasets/moon\\data\\012.0287.npz', 'datasets/moon\\data\\004.0215.npz', 'datasets/moon\\data\\003.0149.npz', ...] if not config.skip_path_filter: items = parallel_run( get_frame, paths, desc="filter_by_min_max_frame_batch", parallel=True) # [('datasets/moon\\data\\012.0287.npz', 130, 21), ('datasets/moon\\data\\003.0149.npz', 209, 37), ...] min_n_frame = hparams.min_n_frame # 5*30 max_n_frame = hparams.max_n_frame - 1 # 5*200 - 5 # 다음 단계에서 data가 많이 떨어져 나감. 글자수가 짧은 것들이 탈락됨. new_items = [(path, n) for path, n, n_tokens in items if min_n_frame <= n <= max_n_frame and n_tokens >= hparams.min_tokens] # [('datasets/moon\\data\\004.0383.npz', 297), ('datasets/moon\\data\\003.0533.npz', 394),...] new_paths = [path for path, n in new_items] new_n_frames = [n for path, n in new_items] hours = frames_to_hours(new_n_frames,hparams) log(' [{}] Loaded metadata for {} examples ({:.2f} hours)'.format(data_dir, len(new_n_frames), hours)) log(' [{}] Max length: {}'.format(data_dir, max(new_n_frames))) log(' [{}] Min length: {}'.format(data_dir, min(new_n_frames))) else: new_paths = paths # train용 data와 test용 data로 나눈다. if data_type == 'train': new_paths = new_paths[:-n_test] # 끝에 있는 n_test(batch_size)를 제외한 모두 elif data_type == 'test': new_paths = new_paths[-n_test:] # 끝에 있는 n_test else: raise Exception(" [!] Unkown data_type: {}".format(data_type)) path_dict[data_dir] = new_paths # ['datasets/moon\\data\\001.0621.npz', 'datasets/moon\\data\\003.0229.npz', ...] 
return path_dict # run -> _enqueue_next_group -> _get_next_example class DataFeederTacotron2(threading.Thread): '''Feeds batches of data into a queue on a background thread.''' def __init__(self, coordinator, data_dirs,hparams, config, batches_per_group, data_type, batch_size): #batches_per_group = 32 or 8, data_type: 'train' or 'test' super(DataFeederTacotron2, self).__init__() self._coord = coordinator self._hp = hparams self._cleaner_names = [x.strip() for x in hparams.cleaners.split(',')] self._step = 0 self._offset = defaultdict(lambda: 2) self._batches_per_group = batches_per_group self.rng = np.random.RandomState(config.random_seed) # random number generator self.data_type = data_type self.batch_size = batch_size self.min_tokens = hparams.min_tokens # 30 self.min_n_frame = hparams.min_n_frame # 5*30 self.max_n_frame = hparams.max_n_frame - 1 # 5*200 - 5 self.skip_path_filter = config.skip_path_filter # Load metadata: self.path_dict = get_path_dict(data_dirs, self._hp, config, self.data_type,n_test=self.batch_size, rng=self.rng) # data_dirs: ['datasets/moon\\data'] self.data_dirs = list(self.path_dict.keys()) # ['datasets/moon\\data'] self.data_dir_to_id = {data_dir: idx for idx, data_dir in enumerate(self.data_dirs)} # {'datasets/moon\\data': 0} data_weight = {data_dir: 1. for data_dir in self.data_dirs} # {'datasets/moon\\data': 1.0} if self._hp.main_data_greedy_factor > 0 and any(main_data in data_dir for data_dir in self.data_dirs for main_data in self._hp.main_data): # 'main_data': [''] for main_data in self._hp.main_data: for data_dir in self.data_dirs: if main_data in data_dir: data_weight[data_dir] += self._hp.main_data_greedy_factor weight_Z = sum(data_weight.values()) # 1 self.data_ratio = { data_dir: weight / weight_Z for data_dir, weight in data_weight.items()} # 각 data들의 weight sum이 1이 되도록... log("="*40) log('Data Amount:') log(pprint.pformat(self.data_ratio, indent=4)) log("="*40) #audio_paths = [path.replace("/data/", "/audio/").replace(".npz", ".wav") for path in self.data_paths] #duration = get_durations(audio_paths, print_detail=False) # Create placeholders for inputs and targets. Don't specify batch size because we want to # be able to feed different sized batches at eval time. 
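        # Six tensors are fed per batch: token ids ('inputs'), token lengths ('input_lengths'),
        # per-example loss coefficients ('loss_coeff'), mel targets, linear targets and
        # stop-token targets. For multi-speaker training a seventh tensor ('speaker_id') is
        # appended below. The background thread pushes batches into the FIFOQueue via
        # _enqueue_op while the training graph dequeues them.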
self._placeholders = [ tf.placeholder(tf.int32, [None, None], 'inputs'), tf.placeholder(tf.int32, [None], 'input_lengths'), tf.placeholder(tf.float32, [None], 'loss_coeff'), tf.placeholder(tf.float32, [None, None, hparams.num_mels], 'mel_targets'), tf.placeholder(tf.float32, [None, None, hparams.num_freq], 'linear_targets'), tf.placeholder(tf.float32, [None, None], 'stop_token_targets') ] # Create queue for buffering data: dtypes = [tf.int32, tf.int32, tf.float32, tf.float32, tf.float32, tf.float32] self.is_multi_speaker = len(self.data_dirs) > 1 if self.is_multi_speaker: self._placeholders.append( tf.placeholder(tf.int32, [None], 'speaker_id'),) dtypes.append(tf.int32) num_worker = 8 if self.data_type == 'train' else 1 queue = tf.FIFOQueue(num_worker, dtypes, name='input_queue') self._enqueue_op = queue.enqueue(self._placeholders) if self.is_multi_speaker: self.inputs, self.input_lengths, self.loss_coeff, self.mel_targets, self.linear_targets,self.stop_token_targets, self.speaker_id = queue.dequeue() else: self.inputs, self.input_lengths, self.loss_coeff, self.mel_targets, self.linear_targets,self.stop_token_targets = queue.dequeue() self.inputs.set_shape(self._placeholders[0].shape) self.input_lengths.set_shape(self._placeholders[1].shape) self.loss_coeff.set_shape(self._placeholders[2].shape) self.mel_targets.set_shape(self._placeholders[3].shape) self.linear_targets.set_shape(self._placeholders[4].shape) self.stop_token_targets.set_shape(self._placeholders[5].shape) if self.is_multi_speaker: self.speaker_id.set_shape(self._placeholders[6].shape) else: self.speaker_id = None if self.data_type == 'test': examples = [] while True: for data_dir in self.data_dirs: examples.append(self._get_next_example(data_dir)) #print(data_dir, text.sequence_to_text(examples[-1][0], False, True)) if len(examples) >= self.batch_size: break if len(examples) >= self.batch_size: break # test 할 때는 같은 examples로 계속 반복 self.static_batches = [examples for _ in range(self._batches_per_group)] # [examples, examples,...,examples] <--- 각 example은 2개의 data를 가지고 있다. else: self.static_batches = None def start_in_session(self, session, start_step): self._step = start_step self._session = session self.start() def run(self): try: while not self._coord.should_stop(): self._enqueue_next_group() except Exception as e: traceback.print_exc() self._coord.request_stop(e) def _enqueue_next_group(self): start = time.time() # Read a group of examples: n = self.batch_size # 32 r = self._hp.reduction_factor # 4 or 5 min_n_frame,max_n_frame 계산에 사용되었던... if self.static_batches is not None: # 'test'에서는 static_batches를 사용한다. static_batches는 init에서 이미 만들어 놓았다. batches = self.static_batches else: # 'train' examples = [] for data_dir in self.data_dirs: if self._hp.initial_data_greedy: if self._step < self._hp.initial_phase_step and any("krbook" in data_dir for data_dir in self.data_dirs): data_dir = [data_dir for data_dir in self.data_dirs if "krbook" in data_dir][0] if self._step < self._hp.initial_phase_step: # 'initial_phase_step': 8000 example = [self._get_next_example(data_dir) for _ in range(int(n * self._batches_per_group // len(self.data_dirs)))] # _batches_per_group 8,또는 32 만큼의 batch data를 만드낟. 
각각의 batch size는 2, 또는 32 else: example = [self._get_next_example(data_dir) for _ in range(int(n * self._batches_per_group * self.data_ratio[data_dir]))] examples.extend(example) examples.sort(key=lambda x: x[-1]) # 제일 마지막 기준이니까, len(linear_target) 기준으로 정렬 batches = [examples[i:i+n] for i in range(0, len(examples), n)] self.rng.shuffle(batches) log('Generated %d batches of size %d in %.03f sec' % (len(batches), n, time.time() - start)) for batch in batches: # batches는 batch의 묶음이다. # test 또는 train mode에 맞게 만든 batches의 batch data를 placeholder에 넘겨준다. feed_dict = dict(zip(self._placeholders, _prepare_batch(batch, r, self.rng, self.data_type))) # _prepare_batch에서 batch data의 길이를 맞춘다. return 순서 = placeholder순서 self._session.run(self._enqueue_op, feed_dict=feed_dict) self._step += 1 def _get_next_example(self, data_dir): '''npz 1개를 읽어 처리한다. Loads a single example (input, mel_target, linear_target, cost) from disk''' data_paths = self.path_dict[data_dir] while True: if self._offset[data_dir] >= len(data_paths): self._offset[data_dir] = 0 if self.data_type == 'train': self.rng.shuffle(data_paths) data_path = data_paths[self._offset[data_dir]] # npz파일 1개 선택 self._offset[data_dir] += 1 try: if os.path.exists(data_path): data = np.load(data_path) # data속에는 "linear","mel","tokens","loss_coeff" else: continue except: remove_file(data_path) continue if not self.skip_path_filter: break if self.min_n_frame <= data["linear"].shape[0] <= self.max_n_frame and len(data["tokens"]) > self.min_tokens: break input_data = data['tokens'] # 1-dim mel_target = data['mel'] if 'loss_coeff' in data: loss_coeff = data['loss_coeff'] else: loss_coeff = 1 linear_target = data['linear'] stop_token_target = np.asarray([0.] * len(mel_target)) # mel_target은 [xx,80]으로 data마다 len이 다르다. len에 따라 [0,...,0] # multi-speaker가 아니면, speaker_id는 넘길 필요 없지만, 현재 구현이 좀 꼬여 있다. 그래서 무조건 넘긴다. if self.is_multi_speaker: return (input_data, loss_coeff, mel_target, linear_target,stop_token_target, self.data_dir_to_id[data_dir], len(linear_target)) else: return (input_data, loss_coeff, mel_target, linear_target,stop_token_target, len(linear_target)) def _prepare_batch(batch, reduction_factor, rng, data_type=None): # (input_data, loss_coeff, mel_target, linear_target,stop_token_target, speaker_id, len(linear_target)) if data_type == 'train': rng.shuffle(batch) # batch data: (input_data, loss_coeff, mel_target, linear_target, self.data_dir_to_id[data_dir], len(linear_target)) inputs = _prepare_inputs([x[0] for x in batch]) # batch에 있는 data들 중, 가장 긴 data의 길이에 맞게 padding한다. 
input_lengths = np.asarray([len(x[0]) for x in batch], dtype=np.int32) # batch_size, [37, 37, 32, 32, 38,..., 39, 36, 30] loss_coeff = np.asarray([x[1] for x in batch], dtype=np.float32) # batch_size, [1,1,1,,..., 1,1,1] mel_targets = _prepare_targets([x[2] for x in batch], reduction_factor) # ---> (32, 175, 80) max length는 reduction_factor의 배수가 되도록 linear_targets = _prepare_targets([x[3] for x in batch], reduction_factor) # ---> (32, 175, 1025) max length는 reduction_factor의 배수가 되도록 stop_token_targets = _prepare_stop_token_targets([x[4] for x in batch], reduction_factor) if len(batch[0]) == 7: # is_multi_speaker = True인 경우 speaker_id = np.asarray([x[5] for x in batch], dtype=np.int32) # speaker_id로 list 만들기 return (inputs, input_lengths, loss_coeff,mel_targets, linear_targets,stop_token_targets, speaker_id) else: return (inputs, input_lengths, loss_coeff, mel_targets, linear_targets,stop_token_targets) # ('inputs' 'input_lengths' 'loss_coeff' 'mel_targets' 'linear_targets') def _prepare_inputs(inputs): # inputs: batch 길이 만큼의 list max_len = max((len(x) for x in inputs)) return np.stack([_pad_input(x, max_len) for x in inputs]) # (batch_size, max_len) """ batch_size = 2 일 떼, [[13, 26, 13, 41, 13, 21, 13, 41, 13, 21, 13, 41, 9, 41, 13, 40,79, 14, 34, 13, 33, 79, 20, 32, 13, 35, 45, 2, 34, 42, 13, 39,7, 29, 11, 25, 1], [ 6, 29, 79, 14, 26, 14, 34, 5, 29, 79, 2, 30, 45, 2, 28, 14,21, 79, 13, 27, 7, 25, 9, 34, 45, 13, 40, 79, 4, 29, 2, 29,13, 26, 1, 0, 0]] """ def _prepare_targets(targets, alignment): # targets: shape of list [ (162,80) , (172, 80), ...] max_len = max((len(t) for t in targets)) + 1 return np.stack([_pad_target(t, _round_up(max_len, alignment)) for t in targets]) def _prepare_stop_token_targets(targets, alignment): max_len = max((len(t) for t in targets)) + 1 return np.stack([_pad_stop_token_target(t, _round_up(max_len, alignment)) for t in targets]) def _pad_input(x, length): return np.pad(x, (0, length - x.shape[0]), mode='constant', constant_values=_pad) def _pad_target(t, length): # t: 2 dim array. 
( xx, num_mels) ==> (length,num_mels) return np.pad(t, [(0, length - t.shape[0]), (0,0)], mode='constant', constant_values=_pad) # (169, 80) ==> (length, 80) ### def _pad_stop_token_target(t, length): return np.pad(t, (0, length - t.shape[0]), mode='constant', constant_values=_stop_token_pad) def _round_up(x, multiple): remainder = x % multiple return x if remainder == 0 else x + multiple - remainder if __name__ == '__main__': from hparams import hparams import argparse from utils import str2bool parser = argparse.ArgumentParser() parser.add_argument('--random_seed', type=int, default=123) parser.add_argument('--batch_size', type=int, default=4) parser.add_argument('--skip_path_filter', type=str2bool, default=True, help='Use only for debugging') config = parser.parse_args() coord = tf.train.Coordinator() data_dirs=['D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon'] mydatafeed = DataFeederTacotron2(coord, data_dirs, hparams, config, 32,data_type='train', batch_size=config.batch_size) with tf.Session() as sess: try: sess.run(tf.global_variables_initializer()) step = 0 mydatafeed.start_in_session(sess,step) while not coord.should_stop(): a,b,c,d=sess.run([mydatafeed.inputs, mydatafeed.input_lengths, mydatafeed.mel_targets,mydatafeed.stop_token_targets]) print(a.shape,c.shape,d.shape) print(step,b) print('stop token:', d[0]) print('-'*10) a,b,c=sess.run([mydatafeed.inputs, mydatafeed.input_lengths, mydatafeed.mel_targets]) print(a.shape,c.shape) print(step,b) print('='*10) step = step +1 if step > 3: raise Exception('End xxx') except Exception as e: print('finally') print(e) coord.request_stop(e) ================================================ FILE: datasets/datafeeder_wavenet.py ================================================ # -*- coding: utf-8 -*- import sys sys.path.append("../") import tensorflow as tf import threading import random import numpy as np import os from utils import audio from hparams import hparams from glob import glob from collections import defaultdict def get_path_dict(data_dirs, min_length): path_dict = {} for data_dir in data_dirs: if not hparams.skip_path_filter: with open(os.path.join(data_dir,'train.txt'), 'r', encoding='utf-8') as f: lines = f.readlines() new_paths = [] for line in lines: line = line.strip().split("|") if int(line[3]) > min_length: new_paths.append(line[6]) path_dict[data_dir] = new_paths else: new_paths = glob("{}/*.npz".format(data_dir)) new_paths = [os.path.basename(p) for p in new_paths] path_dict[data_dir] = new_paths return path_dict def assert_ready_for_upsampling(x, c,hop_size): assert len(x) % len(c) == 0 and len(x) // len(c) == hop_size def ensure_divisible(length, divisible_by=256, lower=True): if length % divisible_by == 0: return length if lower: return length - length % divisible_by else: return length + (divisible_by - length % divisible_by) class DataFeederWavenet(threading.Thread): def __init__(self,coord,data_dirs,batch_size, gc_enable=False,test_mode=False, queue_size=8): super(DataFeederWavenet, self).__init__() self.data_dirs = data_dirs self.coord = coord self.batch_size = batch_size self.hop_size = audio.get_hop_size(hparams) self.sample_size = ensure_divisible(hparams.sample_size,self.hop_size, True) self.max_frames = self.sample_size // self.hop_size # sample_size 크기를 확보하기 위해. 
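        # ensure_divisible above rounds sample_size down to a multiple of hop_size, so each
        # training example pairs exactly max_frames mel frames with max_frames * hop_size
        # audio samples. Preprocessing already trims the audio to mel_frames * hop_size
        # samples, which assert_ready_for_upsampling verifies before the random crop below.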
self.queue_size = queue_size self.gc_enable = gc_enable self.skip_path_filter = hparams.skip_path_filter self.test_mode = test_mode if test_mode: assert batch_size==1 self.rng = np.random.RandomState(123) self._offset = defaultdict(lambda: 2) # key에 없는 값이 들어어면 2가 할당된다. self.data_dir_to_id = {data_dir: idx for idx, data_dir in enumerate(self.data_dirs)} # data_dir <---> speaker_id 매핑 self.path_dict = get_path_dict(self.data_dirs,self.sample_size)# receptive_field 보다 작은 것을 버리고, 나머지만 돌려준다. self._placeholders = [ tf.placeholder(tf.float32, shape=[None,None,1],name='input_wav'), tf.placeholder(tf.float32, shape=[None,None,hparams.num_mels],name='local_condition') ] dtypes = [tf.float32, tf.float32] if self.gc_enable: self._placeholders.append(tf.placeholder(tf.int32, shape=[None],name='speaker_id')) dtypes.append(tf.int32) queue = tf.FIFOQueue(self.queue_size, dtypes, name='input_queue') self.enqueue = queue.enqueue(self._placeholders) if self.gc_enable: self.inputs_wav, self.local_condition, self.speaker_id = queue.dequeue() else: self.inputs_wav, self.local_condition = queue.dequeue() self.inputs_wav.set_shape(self._placeholders[0].shape) self.local_condition.set_shape(self._placeholders[1].shape) if self.gc_enable: self.speaker_id.set_shape(self._placeholders[2].shape) def run(self): try: while not self.coord.should_stop(): self.make_batches() except Exception as e: self.coord.request_stop(e) def start_in_session(self, session,start_step): self._step = start_step self.sess = session self.start() def make_batches(self): examples = [] n = self.batch_size for data_dir in self.data_dirs: example = [self._get_next_example(data_dir) for _ in range(int(n * 32 // len(self.data_dirs)))] examples.extend(example) self.rng.shuffle(examples) batches = [examples[i:i+n] for i in range(0, len(examples), n)] for batch in batches: # batch size만큼의 data를 원하는 만큼 만든다. feed_dict = dict(zip(self._placeholders, _prepare_batch(batch))) self.sess.run(self.enqueue, feed_dict=feed_dict) self._step += 1 def _get_next_example(self, data_dir): '''npz 1개를 읽어 처리한다. Loads a single example (input_wav, local_condition,speaker_id ) from disk''' data_paths = self.path_dict[data_dir] while True: if self._offset[data_dir] >= len(data_paths): self._offset[data_dir] = 0 self.rng.shuffle(data_paths) data_path = os.path.join(data_dir,data_paths[self._offset[data_dir]]) # npz파일 1개 선택 self._offset[data_dir] += 1 if os.path.exists(data_path): data = np.load(data_path) # data속에는 'audio', 'mel', 'linear', 'time_steps', 'mel_frames', 'text', 'token' else: continue if not self.skip_path_filter: # 이경우는 get_path_dict함수에서 한번 걸러졌기 때문에, 여기서 다시 확인할 필요 없음. break # get_path_dict함수에서 걸러지지 않앗기 때문에 확인이 필요함. if data['time_steps'] > self.sample_size or self.test_mode: break input_wav = data['audio'] local_condition = data['mel'] input_wav = input_wav.reshape(-1, 1) assert_ready_for_upsampling(input_wav, local_condition,self.hop_size) if self.test_mode==False: # test_mode에서는 전체. 
train_mode에서는 sample_size 만큼만 s = np.random.randint(0, len(local_condition) - self.max_frames+1) # hccho ts = s * self.hop_size input_wav = input_wav[ts:ts + self.hop_size * self.max_frames, :] local_condition = local_condition[s:s + self.max_frames, :] if self.gc_enable: return (input_wav,local_condition, self.data_dir_to_id[data_dir]) else: return (input_wav,local_condition) def _prepare_batch(batch): input_wavs = [x[0] for x in batch] local_conditions = [x[1] for x in batch] if len(batch[0])==3: speaker_ids = [x[2] for x in batch] return (input_wavs,local_conditions,speaker_ids) else: return (input_wavs,local_conditions) if __name__ == '__main__': coord = tf.train.Coordinator() data_dirs=['D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon','D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\son'] mydatafeed = DataFeederWavenet(coord,data_dirs,batch_size=5,receptive_field=1200, gc_enable=True, queue_size=8) with tf.Session() as sess: try: sess.run(tf.global_variables_initializer()) step = 0 mydatafeed.start_in_session(sess,step) while not coord.should_stop(): a,b,c=sess.run([mydatafeed.inputs_wav, mydatafeed.local_condition, mydatafeed.speaker_id]) print(a.shape,b.shape,c.shape) print(step, c) a,b,c=sess.run([mydatafeed.inputs_wav, mydatafeed.local_condition, mydatafeed.speaker_id]) print(a.shape,b.shape,c.shape) print(step, c) step = step +1 except Exception as e: print('finally') coord.request_stop(e) ================================================ FILE: datasets/moon/moon-recognition-All.json ================================================ { "./datasets/moon/audio/003.0000.wav": "존경하는 독일 국민 여러분", "./datasets/moon/audio/003.0001.wav": "고국에 계신 국민 여러분", "./datasets/moon/audio/003.0002.wav": "하울젠 쾨르버재단 이사님과", "./datasets/moon/audio/003.0003.wav": "모드로", "./datasets/moon/audio/003.0004.wav": "전 동독 총리님을 비롯한", "./datasets/moon/audio/003.0005.wav": "내외 귀빈 여러분", "./datasets/moon/audio/003.0006.wav": "먼저 냉전과 분단을 넘어", "./datasets/moon/audio/003.0007.wav": "통일을 이루고", "./datasets/moon/audio/003.0008.wav": "그 힘으로 유럽통합과 국제평화를 선도하고 있는", "./datasets/moon/audio/003.0009.wav": "독일과", "./datasets/moon/audio/003.0010.wav": "독일 국민에게", "./datasets/moon/audio/003.0011.wav": "무한한 경의를 표합니다", "./datasets/moon/audio/003.0012.wav": "오늘 이 자리를 마련해 주신", "./datasets/moon/audio/003.0013.wav": "독일 정부와 쾨르버 재단에도", "./datasets/moon/audio/003.0014.wav": "감사드립니다", "./datasets/moon/audio/003.0015.wav": "아울러 얼마 전 별세하신", "./datasets/moon/audio/003.0016.wav": "고", "./datasets/moon/audio/003.0017.wav": "헬무트 콜 총리의 가족과", "./datasets/moon/audio/003.0018.wav": "독일 국민들에게 깊은 애도와", "./datasets/moon/audio/003.0019.wav": "위로의 마음을 전합니다", "./datasets/moon/audio/003.0020.wav": "대한민국은", "./datasets/moon/audio/003.0021.wav": "냉전시기", "./datasets/moon/audio/003.0022.wav": "어려운 환경 속에서도", "./datasets/moon/audio/003.0023.wav": "적극적이고", "./datasets/moon/audio/003.0024.wav": "능동적인 외교로", "./datasets/moon/audio/003.0025.wav": "독일 통일과 유럽통합을 주도한", "./datasets/moon/audio/003.0026.wav": "헬무트", "./datasets/moon/audio/003.0027.wav": "콜 총리의 위대한 업적을 기억할 것입니다", "./datasets/moon/audio/003.0028.wav": "친애하는 내외 귀빈 여러분", "./datasets/moon/audio/003.0029.wav": "이곳 베를린은", "./datasets/moon/audio/003.0030.wav": "지금으로부터 17년 전", "./datasets/moon/audio/003.0031.wav": "한국의 김대중 대통령이", "./datasets/moon/audio/003.0032.wav": "남북 화해·협력의 기틀을 마련한", "./datasets/moon/audio/003.0033.wav": "베를린 선언을 발표한 곳입니다", "./datasets/moon/audio/003.0034.wav": "여기 알테스 슈타트하우스는", "./datasets/moon/audio/003.0035.wav": "독일 통일조약 협상이 이뤄졌던", "./datasets/moon/audio/003.0036.wav": "역사적 현장입니다", 
"./datasets/moon/audio/003.0037.wav": "나는 오늘", "./datasets/moon/audio/003.0038.wav": "베를린의 교훈이 살아있는 이 자리에서", "./datasets/moon/audio/003.0039.wav": "대한민국 새 정부의 한반도 평화 구상을", "./datasets/moon/audio/003.0040.wav": "말씀드리고자 합니다", "./datasets/moon/audio/003.0041.wav": "내외 귀빈 여러분", "./datasets/moon/audio/003.0042.wav": "독일 통일의 경험은", "./datasets/moon/audio/003.0043.wav": "지구상", "./datasets/moon/audio/003.0044.wav": "마지막 분단국가로 남은 우리에게", "./datasets/moon/audio/003.0045.wav": "통일에 대한 희망과 함께", "./datasets/moon/audio/003.0046.wav": "우리가 나아가야 할 방향을 말해주고 있습니다", "./datasets/moon/audio/003.0047.wav": "그것은 우선", "./datasets/moon/audio/003.0048.wav": "통일에 이르는", "./datasets/moon/audio/003.0049.wav": "과정의 중요성입니다", "./datasets/moon/audio/006.0000.wav": "존경하고 사랑하는 국민 여러분", "./datasets/moon/audio/006.0001.wav": "감사합니다", "./datasets/moon/audio/006.0002.wav": "국민 여러분의", "./datasets/moon/audio/006.0003.wav": "위대한 선택에", "./datasets/moon/audio/006.0004.wav": "머리 숙여", "./datasets/moon/audio/006.0005.wav": "깊이", "./datasets/moon/audio/006.0006.wav": "감사드립니다", "./datasets/moon/audio/006.0007.wav": "저는 오늘", "./datasets/moon/audio/006.0008.wav": "대한민국", "./datasets/moon/audio/006.0009.wav": "제19대 대통령으로서", "./datasets/moon/audio/006.0010.wav": "새로운 대한민국을 향해", "./datasets/moon/audio/006.0011.wav": "첫걸음을 내딛습니다", "./datasets/moon/audio/006.0012.wav": "지금 제 두 어깨는", "./datasets/moon/audio/006.0013.wav": "국민 여러분으로부터", "./datasets/moon/audio/006.0014.wav": "부여받은", "./datasets/moon/audio/006.0015.wav": "막중한 소명감으로", "./datasets/moon/audio/006.0016.wav": "무겁습니다", "./datasets/moon/audio/006.0017.wav": "지금 제 가슴은", "./datasets/moon/audio/006.0018.wav": "한 번도 경험하지 못한", "./datasets/moon/audio/006.0019.wav": "나라를 만들겠다는 열정으로 뜨겁습니다", "./datasets/moon/audio/006.0020.wav": "그리고 지금 제 머리는", "./datasets/moon/audio/006.0021.wav": "통합과 공존의", "./datasets/moon/audio/006.0022.wav": "새로운 세상을 열어갈", "./datasets/moon/audio/006.0023.wav": "청사진으로", "./datasets/moon/audio/006.0024.wav": "가득 차 있습니다", "./datasets/moon/audio/006.0025.wav": "우리가 만들어가려는 새로운 대한민국은", "./datasets/moon/audio/006.0026.wav": "숱한 좌절과 패배에도 불구하고", "./datasets/moon/audio/006.0027.wav": "우리의 선대들이", "./datasets/moon/audio/006.0028.wav": "일관되게 추구했던 나라입니다", "./datasets/moon/audio/006.0029.wav": "또 많은 희생과 헌신을 감내하며", "./datasets/moon/audio/006.0030.wav": "우리 젊은이들이", "./datasets/moon/audio/006.0031.wav": "그토록 이루고 싶어했던", "./datasets/moon/audio/006.0032.wav": "나라입니다", "./datasets/moon/audio/006.0033.wav": "그런 대한민국을 만들기 위해 저는", "./datasets/moon/audio/006.0034.wav": "역사와 국민 앞에", "./datasets/moon/audio/006.0035.wav": "두렵지만", "./datasets/moon/audio/006.0036.wav": "겸허한 마음으로", "./datasets/moon/audio/006.0037.wav": "대한민국", "./datasets/moon/audio/006.0038.wav": "제19대", "./datasets/moon/audio/006.0039.wav": "대통령으로서의", "./datasets/moon/audio/006.0040.wav": "책임과 소명을 다할 것임을 천명합니다", "./datasets/moon/audio/006.0041.wav": "함께 선거를 치른 후보들께", "./datasets/moon/audio/006.0042.wav": "감사의 말씀과 함께", "./datasets/moon/audio/006.0043.wav": "심심한", "./datasets/moon/audio/006.0044.wav": "위로를 전합니다", "./datasets/moon/audio/006.0045.wav": "이번 선거에서는", "./datasets/moon/audio/006.0046.wav": "승자도", "./datasets/moon/audio/006.0047.wav": "패자도 없습니다", "./datasets/moon/audio/006.0048.wav": "우리는", "./datasets/moon/audio/006.0062.wav": "정치적 격변기를 보냈습니다", "./datasets/moon/audio/006.0063.wav": "정치는 혼란스러웠지만", "./datasets/moon/audio/006.0065.wav": "현직 대통령의 탄핵과 구속 앞에서도", "./datasets/moon/audio/006.0067.wav": "대한민국의 앞길을 열어주셨습니다", "./datasets/moon/audio/006.0068.wav": "우리 국민들은 좌절하지 않고", "./datasets/moon/audio/006.0093.wav": "2017년5월10일", 
"./datasets/moon/audio/006.0098.wav": "존경하고 사랑하는 국민 여러분", "./datasets/moon/audio/006.0104.wav": "바로 그 질문에서 새로 시작하겠습니다", "./datasets/moon/audio/006.0108.wav": "구시대의 잘못된 관행과", "./datasets/moon/audio/006.0115.wav": "광화문 대통령 시대를 열겠습니다", "./datasets/moon/audio/006.0116.wav": "참모들과 머리와 어깨를 맞대고" } ================================================ FILE: datasets/moon.py ================================================ # -*- coding: utf-8 -*- from concurrent.futures import ProcessPoolExecutor from functools import partial import numpy as np import os,json from utils import audio from text import text_to_sequence def build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda x: x): """ Preprocesses the speech dataset from a gven input path to given output directories Args: - hparams: hyper parameters - input_dir: input directory that contains the files to prerocess - out_dir: output directory of npz files - n_jobs: Optional, number of worker process to parallelize across - tqdm: Optional, provides a nice progress bar Returns: - A list of tuple describing the train examples. this should be written to train.txt """ executor = ProcessPoolExecutor(max_workers=num_workers) futures = [] index = 1 path = os.path.join(in_dir, 'moon-recognition-All.json') with open(path,encoding='utf-8') as f: content = f.read() data = json.loads(content) for key, text in data.items(): wav_path = key.strip().split('/') wav_path = os.path.join(in_dir, 'audio', '%s' % wav_path[-1]) # In case of test file if not os.path.exists(wav_path): continue futures.append(executor.submit(partial(_process_utterance, out_dir, wav_path, text,hparams))) index += 1 return [future.result() for future in tqdm(futures) if future.result() is not None] # result = [] # for future in tqdm(futures): # if future.result() is not None: # result.append(future.result()) # # return result def _process_utterance(out_dir, wav_path, text, hparams): """ Preprocesses a single utterance wav/text pair this writes the mel scale spectogram to disk and return a tuple to write to the train.txt file Args: - mel_dir: the directory to write the mel spectograms into - linear_dir: the directory to write the linear spectrograms into - wav_dir: the directory to write the preprocessed wav into - index: the numeric index to use in the spectogram filename - wav_path: path to the audio file containing the speech input - text: text spoken in the input audio file - hparams: hyper parameters Returns: - A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, linear_frames, text) """ try: # Load the audio as numpy array wav = audio.load_wav(wav_path, sr=hparams.sample_rate) except FileNotFoundError: #catch missing wav exception print('file {} present in csv metadata is not present in wav folder. 
skipping!'.format( wav_path)) return None #rescale wav if hparams.rescaling: # hparams.rescale = True wav = wav / np.abs(wav).max() * hparams.rescaling_max #M-AILABS extra silence specific if hparams.trim_silence: # hparams.trim_silence = True wav = audio.trim_silence(wav, hparams) # Trim leading and trailing silence #Mu-law quantize, default 값은 'raw' if hparams.input_type=='mulaw-quantize': #[0, quantize_channels) out = audio.mulaw_quantize(wav, hparams.quantize_channels) #Trim silences start, end = audio.start_and_end_indices(out, hparams.silence_threshold) wav = wav[start: end] out = out[start: end] constant_values = mulaw_quantize(0, hparams.quantize_channels) out_dtype = np.int16 elif hparams.input_type=='mulaw': #[-1, 1] out = audio.mulaw(wav, hparams.quantize_channels) constant_values = audio.mulaw(0., hparams.quantize_channels) out_dtype = np.float32 else: # raw #[-1, 1] out = wav constant_values = 0. out_dtype = np.float32 # Compute the mel scale spectrogram from the wav mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32) mel_frames = mel_spectrogram.shape[1] if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length: # hparams.max_mel_frames = 1000, hparams.clip_mels_length = True return None #Compute the linear scale spectrogram from the wav linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32) linear_frames = linear_spectrogram.shape[1] #sanity check assert linear_frames == mel_frames if hparams.use_lws: # hparams.use_lws = False #Ensure time resolution adjustement between audio and mel-spectrogram fft_size = hparams.fft_size if hparams.win_size is None else hparams.win_size l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams)) #Zero pad audio signal out = np.pad(out, (l, r), mode='constant', constant_values=constant_values) else: #Ensure time resolution adjustement between audio and mel-spectrogram pad = audio.librosa_pad_lr(wav, hparams.fft_size, audio.get_hop_size(hparams)) #Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency) out = np.pad(out, pad, mode='reflect') assert len(out) >= mel_frames * audio.get_hop_size(hparams) #time resolution adjustement #ensure length of raw audio is multiple of hop size so that we can use #transposed convolution to upsample out = out[:mel_frames * audio.get_hop_size(hparams)] assert len(out) % audio.get_hop_size(hparams) == 0 time_steps = len(out) # Write the spectrogram and audio to disk wav_id = os.path.splitext(os.path.basename(wav_path))[0] # Write the spectrograms to disk: audio_filename = '{}-audio.npy'.format(wav_id) mel_filename = '{}-mel.npy'.format(wav_id) linear_filename = '{}-linear.npy'.format(wav_id) npz_filename = '{}.npz'.format(wav_id) npz_flag=True if npz_flag: # Tacotron 코드와 맞추기 위해, 같은 key를 사용한다. data = { 'audio': out.astype(out_dtype), 'mel': mel_spectrogram.T, 'linear': linear_spectrogram.T, 'time_steps': time_steps, 'mel_frames': mel_frames, 'text': text, 'tokens': text_to_sequence(text), # eos(~)에 해당하는 "1"이 끝에 붙는다. 
'loss_coeff': 1 # For Tacotron } np.savez(os.path.join(out_dir,npz_filename ), **data, allow_pickle=False) else: np.save(os.path.join(out_dir, audio_filename), out.astype(out_dtype), allow_pickle=False) np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False) np.save(os.path.join(out_dir, linear_filename), linear_spectrogram.T, allow_pickle=False) # Return a tuple describing this training example return (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text,npz_filename) ================================================ FILE: datasets/son/son-recognition-All.json ================================================ { "./datasets/son/audio/NB10584578.0000.wav": "오늘부터 뉴스룸 2부에서는 그날의 주요사항을 한마디의 단어로 축약해서 앵커브리핑으로 풀어보겠습니다", "./datasets/son/audio/NB10584578.0001.wav": "오늘 뉴스룸이 주목한다 던어는 저돌입니다", "./datasets/son/audio/NB10584578.0002.wav": "돼지 저 자에 갑자기 돌 이 두 글자를 사용하는 이 단어는 흔히 추진력이 강하다는 의미로 쓰이죠", "./datasets/son/audio/NB10584578.0003.wav": "난파 직전의 새정치연합을 책임지게 된 문희상 비대위원장이 이런 말을 했습니다", "./datasets/son/audio/NB10584578.0004.wav": "난 그냥 산 돼지처럼 돌파하는 스타일이다", "./datasets/son/audio/NB10584578.0005.wav": "이렇게 얘기했습니다", "./datasets/son/audio/NB10584578.0006.wav": "몸이 좋지 않다면서 만남을 주저했던 김무성 새누리당 대표를 찾아 가서 만난 것도 바로 이런 적어도 저돌성이 없었다면 어려웠을지도 모르겠습니다 그렇다면", "./datasets/son/audio/NB10584578.0007.wav": "문 비대위원장이 저돌적으로 돌파해야 할 과제는 무엇인가", "./datasets/son/audio/NB10584578.0008.wav": "첫 번째는 계파주의 청산입니다", "./datasets/son/audio/NB10584578.0009.wav": "지난 이천십이년 대선에서 민주통합당의 패배한 이후에 대선평가 위원장을 맡았던 한상진 서울대 명예교수가", "./datasets/son/audio/NB10584578.0010.wav": "이런 보고서를 냈습니다", "./datasets/son/audio/NB10584578.0011.wav": "계파정치 청산은 민주당의 미래를 위한 최우선 과제다", "./datasets/son/audio/NB10584578.0012.wav": "아 이렇게 얘기했는데요 그러나 아시는 것처럼이 보고서는", "./datasets/son/audio/NB10584578.0013.wav": "갖가지 반발 끝에 결국 채택되지 못했습니다", "./datasets/son/audio/NB10584578.0014.wav": "아마 여당에서 한상진 교수 좋아하는 사람 별로 없을 겁니다", "./datasets/son/audio/NB10584578.0015.wav": "문희상 당시 비대위원장이 공교롭게도 계파와 패권주의 청산을 내세웠던 바로 그 시기에 비대위원장 이었죠", "./datasets/son/audio/NB10584578.0016.wav": "계파 청산에 관한 문 비대위원장은 어떻게 보면 실패했다고 봐야만 합니다", "./datasets/son/audio/NB10584578.0017.wav": "권한은 공유하되 책임은 당 대표가 혼자지는 이런 기형적 구조가", "./datasets/son/audio/NB10584578.0018.wav": "아 결국", "./datasets/son/audio/NB10584578.0019.wav": "최근 사년 동안에 임기 2년에 야당 지도부 교체 숫자를", "./datasets/son/audio/NB10584578.0020.wav": "늘려서 무료 열번이나 교체가 되었습니다", "./datasets/son/audio/NB10584578.0021.wav": "같은 기간에 새누리당은 단 네명의 지도부가 바뀌었습니다", "./datasets/son/audio/NB10584578.0022.wav": "실패가 구조화된 당의 체질을 바꾸지 않고서는 누가 리더가 되어도 쉽지 않다는 것을 상징적으로 내보여주는 숫자이기도 합니다", "./datasets/son/audio/NB10584578.0023.wav": "자 두 번째 과제는 바로 이겁니다 수사권 기소권 문제로 교착상태에 빠지는 세월호 특별법 지금도 끝이 보이지 않는데요", "./datasets/son/audio/NB10584578.0024.wav": "어떠한 추가 협상도", "./datasets/son/audio/NB10584578.0025.wav": "불가하다 이렇게 못박은 청와대와", "./datasets/son/audio/NB10584578.0026.wav": "여당을 어떻게 변화시킬 것인지 또한", "./datasets/son/audio/NB10584578.0027.wav": "수사권과 기소권을 주장하는 유족들의 요구를 어떻게 담아낼 것인지", "./datasets/son/audio/NB10584578.0028.wav": "겉은 장비 속은 조조라고 불리우는 의회주의자 문희상 비대위원장과 새정치연합이 저돌적으로 말 그대로 저돌적으로 풀어 가야 할", "./datasets/son/audio/NB10584578.0029.wav": "과제인지도 모르겠습니다", "./datasets/son/audio/NB10584578.0030.wav": "세월호 참사는 오늘로 백육십일째를 맞았습니다", "./datasets/son/audio/NB10584578.0031.wav": "쓸쓸한 팽목항에는", "./datasets/son/audio/NB10584578.0032.wav": "자원봉사자마저 하나둘 철수하고 있고", "./datasets/son/audio/NB10584578.0033.wav": "슬픈 이천십사년은 오늘로 이제 딱", "./datasets/son/audio/NB10584578.0034.wav": "백일이 남았습니다", "./datasets/son/audio/NB10584578.0035.wav": "잠시 후에 문희상 비대위원장을 스튜디오에서 만나겠습니다", 
"./datasets/son/audio/NB10585784.0001.wav": "자 이어서 앵커 브리핑 순서입니다 오늘 뉴스 룸이 주목한 단어는 덫입니다", "./datasets/son/audio/NB10585784.0002.wav": "어 잔꾀를 부리다 자신이 놓은 덫에 스스로 걸리고 만 꼴이다", "./datasets/son/audio/NB10585784.0003.wav": "국회 선진화법 개정을 추진하고 있는 새누리당을 향해서", "./datasets/son/audio/NB10585784.0004.wav": "새정치민주연합에 박수현 의원이 이런 말을 했군요", "./datasets/son/audio/NB10585784.0005.wav": "이 말을 이해하기 위해서는 지난 이천십이년에 국회로 한 걸음", "./datasets/son/audio/NB10585784.0006.wav": "돌아가 봐야만 합니다", "./datasets/son/audio/NB10585784.0007.wav": "기대보다는 걱정이 앞서는 것이", "./datasets/son/audio/NB10585784.0008.wav": "솔직한 내 심정입니다", "./datasets/son/audio/NB10585784.0009.wav": "이제 개정안이 통과된 이상 우리 여야가", "./datasets/son/audio/NB10585784.0010.wav": "대화와 타협을 통해서", "./datasets/son/audio/NB10585784.0011.wav": "국민들에게 신뢰받는 선진 국회를 만들어 가기를 간절히 바랍니다", "./datasets/son/audio/NB10585784.0015.wav": "예 이렇게 세번 두들기고 법안은 통과가 되는데요", "./datasets/son/audio/NB10585784.0016.wav": "국회선진화법은 재적의원 중에 과반이 아닌 오분의 삼이상이 찬성해야 만", "./datasets/son/audio/NB10585784.0017.wav": "안건을 올릴 수 있도록 만든 법이죠" } ================================================ FILE: datasets/son.py ================================================ # -*- coding: utf-8 -*- from concurrent.futures import ProcessPoolExecutor from functools import partial import numpy as np import os,json from utils import audio from text import text_to_sequence def build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda x: x): """ Preprocesses the speech dataset from a gven input path to given output directories Args: - hparams: hyper parameters - input_dir: input directory that contains the files to prerocess - out_dir: output directory of npz files - n_jobs: Optional, number of worker process to parallelize across - tqdm: Optional, provides a nice progress bar Returns: - A list of tuple describing the train examples. this should be written to train.txt """ executor = ProcessPoolExecutor(max_workers=num_workers) futures = [] index = 1 path = os.path.join(in_dir, 'son-recognition-All.json') with open(path,encoding='utf-8') as f: content = f.read() data = json.loads(content) for key, text in data.items(): wav_path = key.strip().split('/') wav_path = os.path.join(in_dir, 'audio', '%s' % wav_path[-1]) # In case of test file if not os.path.exists(wav_path): continue futures.append(executor.submit(partial(_process_utterance, out_dir, wav_path, text,hparams))) index += 1 return [future.result() for future in tqdm(futures) if future.result() is not None] def _process_utterance(out_dir, wav_path, text, hparams): """ Preprocesses a single utterance wav/text pair this writes the mel scale spectogram to disk and return a tuple to write to the train.txt file Args: - mel_dir: the directory to write the mel spectograms into - linear_dir: the directory to write the linear spectrograms into - wav_dir: the directory to write the preprocessed wav into - index: the numeric index to use in the spectogram filename - wav_path: path to the audio file containing the speech input - text: text spoken in the input audio file - hparams: hyper parameters Returns: - A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, linear_frames, text) """ try: # Load the audio as numpy array wav = audio.load_wav(wav_path, sr=hparams.sample_rate) except FileNotFoundError: #catch missing wav exception print('file {} present in csv metadata is not present in wav folder. 
skipping!'.format(wav_path)) return None #rescale wav if hparams.rescaling: # hparams.rescale = True wav = wav / np.abs(wav).max() * hparams.rescaling_max #M-AILABS extra silence specific if hparams.trim_silence: # hparams.trim_silence = True wav = audio.trim_silence(wav, hparams) # Trim leading and trailing silence #Mu-law quantize, default 값은 'raw' if hparams.input_type=='mulaw-quantize': #[0, quantize_channels) out = audio.mulaw_quantize(wav, hparams.quantize_channels) #Trim silences start, end = audio.start_and_end_indices(out, hparams.silence_threshold) wav = wav[start: end] out = out[start: end] constant_values = mulaw_quantize(0, hparams.quantize_channels) out_dtype = np.int16 elif hparams.input_type=='mulaw': #[-1, 1] out = audio.mulaw(wav, hparams.quantize_channels) constant_values = audio.mulaw(0., hparams.quantize_channels) out_dtype = np.float32 else: # raw #[-1, 1] out = wav constant_values = 0. out_dtype = np.float32 # Compute the mel scale spectrogram from the wav mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32) mel_frames = mel_spectrogram.shape[1] if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length: # hparams.max_mel_frames = 1000, hparams.clip_mels_length = True return None #Compute the linear scale spectrogram from the wav linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32) linear_frames = linear_spectrogram.shape[1] #sanity check assert linear_frames == mel_frames if hparams.use_lws: # hparams.use_lws = False #Ensure time resolution adjustement between audio and mel-spectrogram fft_size = hparams.fft_size if hparams.win_size is None else hparams.win_size l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams)) #Zero pad audio signal out = np.pad(out, (l, r), mode='constant', constant_values=constant_values) else: #Ensure time resolution adjustement between audio and mel-spectrogram pad = audio.librosa_pad_lr(wav, hparams.fft_size, audio.get_hop_size(hparams)) #Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency) out = np.pad(out, pad, mode='reflect') assert len(out) >= mel_frames * audio.get_hop_size(hparams) #time resolution adjustement #ensure length of raw audio is multiple of hop size so that we can use #transposed convolution to upsample out = out[:mel_frames * audio.get_hop_size(hparams)] assert len(out) % audio.get_hop_size(hparams) == 0 time_steps = len(out) # Write the spectrogram and audio to disk wav_id = os.path.splitext(os.path.basename(wav_path))[0] # Write the spectrograms to disk: audio_filename = '{}-audio.npy'.format(wav_id) mel_filename = '{}-mel.npy'.format(wav_id) linear_filename = '{}-linear.npy'.format(wav_id) npz_filename = '{}.npz'.format(wav_id) npz_flag=True if npz_flag: # Tacotron 코드와 맞추기 위해, 같은 key를 사용한다. data = { 'audio': out.astype(out_dtype), 'mel': mel_spectrogram.T, 'linear': linear_spectrogram.T, 'time_steps': time_steps, 'mel_frames': mel_frames, 'text': text, 'tokens': text_to_sequence(text), # eos(~)에 해당하는 "1"이 끝에 붙는다. 
'loss_coeff': 1 # For Tacotron } np.savez(os.path.join(out_dir,npz_filename ), **data, allow_pickle=False) else: np.save(os.path.join(out_dir, audio_filename), out.astype(out_dtype), allow_pickle=False) np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False) np.save(os.path.join(out_dir, linear_filename), linear_spectrogram.T, allow_pickle=False) # Return a tuple describing this training example return (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text,npz_filename) ================================================ FILE: generate.py ================================================ # coding: utf-8 """ sample_rate = 16000이므로, samples 48000이면 3초 길이가 된다. > python generate.py --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2018-12-21T22-58-10 > python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2018-12-21T22-58-10 <----scalar_input = True > python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10 python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-tacotron/generate/mel-2018-12-25_22-27-50-0.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10 gc_id = 0(moon), 1(son) python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-22T23-08-16 python generate.py --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-22T23-08-16 """ import argparse from datetime import datetime import json import os,time import librosa import numpy as np import tensorflow as tf from wavenet import WaveNetModel, mu_law_decode, mu_law_encode from hparams import hparams from utils import load_hparams,load from utils import audio from utils import plot import warnings warnings.simplefilter(action='ignore', category=FutureWarning) def _interp(feats, in_range): #rescales from [-max, max] (or [0, max]) to [0, 1] return (feats - in_range[0]) / (in_range[1] - in_range[0]) def get_arguments(): def _str_to_bool(s): """Convert string to bool (in argparse context).""" if s.lower() not in ['true', 'false']: raise ValueError('Argument needs to be a boolean, got {}'.format(s)) return {'true': True, 'false': False}[s.lower()] def _ensure_positive_float(f): """Ensure argument is a positive float.""" if float(f) < 0: raise argparse.ArgumentTypeError('Argument must be greater than zero') return float(f) parser = argparse.ArgumentParser(description='WaveNet generation script') parser.add_argument('checkpoint_dir', type=str, help='Which model checkpoint to generate from') TEMPERATURE = 1.0 parser.add_argument('--temperature', type=_ensure_positive_float, default=TEMPERATURE,help='Sampling temperature') LOGDIR = './logdir-wavenet' parser.add_argument('--logdir',type=str,default=LOGDIR,help='Directory in which to store the logging information for TensorBoard.') parser.add_argument('--wav_out_path',type=str,default=None,help='Path to output wav file') BATCH_SIZE = 1 parser.add_argument('--batch_size', type=int, default=BATCH_SIZE,help='batch size') parser.add_argument('--wav_seed',type=str,default=None,help='The wav file to start generation from') parser.add_argument('--mel',type=str,default=None,help='mel input') parser.add_argument('--gc_cardinality',type=int,default=None,help='Number of categories upon which we globally condition.') 
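    # For the 2-speaker moon/son setup used in this repo, --gc_cardinality is 2 and --gc_id
    # (added just below) selects the speaker embedding (0: moon, 1: son), matching the
    # example commands in the module docstring above.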
parser.add_argument('--gc_id',type=int,default=None,help='ID of category to generate, if globally conditioned.') arguments = parser.parse_args() if hparams.gc_channels is not None: if arguments.gc_cardinality is None: raise ValueError("Globally conditioning but gc_cardinality not specified. Use --gc_cardinality=377 for full VCTK corpus.") if arguments.gc_id is None: raise ValueError("Globally conditioning, but global condition was not specified. Use --gc_id to specify global condition.") return arguments # def write_wav(waveform, sample_rate, filename): # y = np.array(waveform) # librosa.output.write_wav(filename, y, sample_rate) # print('Updated wav file at {}'.format(filename)) def create_seed(filename,sample_rate,quantization_channels,window_size,scalar_input): # seed의 앞부분만 사용한다. seed_audio, _ = librosa.load(filename, sr=sample_rate, mono=True) seed_audio = audio.trim_silence(seed_audio, hparams) if scalar_input: if len(seed_audio) < window_size: return seed_audio else: return seed_audio[:window_size] else: quantized = mu_law_encode(seed_audio, quantization_channels) # 짧으면 짧은 대로 return하는데, padding이라도 해야되지 않나??? cut_index = tf.cond(tf.size(quantized) < tf.constant(window_size), lambda: tf.size(quantized), lambda: tf.constant(window_size)) return quantized[:cut_index] def main(): config = get_arguments() started_datestring = "{0:%Y-%m-%dT%H-%M-%S}".format(datetime.now()) logdir = os.path.join(config.logdir, 'generate', started_datestring) if not os.path.exists(logdir): os.makedirs(logdir) load_hparams(hparams, config.checkpoint_dir) with tf.device('/cpu:0'): # cpu가 더 빠르다. gpu로 설정하면 Error. tf.device 없이 하면 더 느려진다. sess = tf.Session() scalar_input = hparams.scalar_input net = WaveNetModel( batch_size=config.batch_size, dilations=hparams.dilations, filter_width=hparams.filter_width, residual_channels=hparams.residual_channels, dilation_channels=hparams.dilation_channels, quantization_channels=hparams.quantization_channels, out_channels =hparams.out_channels, skip_channels=hparams.skip_channels, use_biases=hparams.use_biases, scalar_input=hparams.scalar_input, global_condition_channels=hparams.gc_channels, global_condition_cardinality=config.gc_cardinality, local_condition_channels=hparams.num_mels, upsample_factor=hparams.upsample_factor, legacy = hparams.legacy, residual_legacy = hparams.residual_legacy, train_mode=False) # train 단계에서는 global_condition_cardinality를 AudioReader에서 파악했지만, 여기서는 넣어주어야 함 if scalar_input: samples = tf.placeholder(tf.float32,shape=[net.batch_size,None]) else: samples = tf.placeholder(tf.int32,shape=[net.batch_size,None]) # samples: mu_law_encode로 변환된 것. one-hot으로 변환되기 전. (batch_size, 길이) # local condition이 (N,T,num_mels) 여야 하지만, 길이 1까지로 들어가야하기 때무넹, (N,1,num_mels) --> squeeze하면 (N,num_mels) upsampled_local_condition = tf.placeholder(tf.float32,shape=[net.batch_size,hparams.num_mels]) next_sample = net.predict_proba_incremental(samples,upsampled_local_condition, [config.gc_id]*net.batch_size) # Fast Wavenet Generation Algorithm-1611.09482 algorithm 적용 # making local condition data. placeholder - upsampled_local_condition 넣어줄 upsampled local condition data를 만들어 보자. 
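    # The mel file is a (frames, num_mels) array; it is tiled to batch_size and upsampled by
    # hop_size with net.create_upsample so that one conditioning vector is available for every
    # output sample. The generated waveform therefore has frames * hop_size samples
    # (sample_size below).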
mel_input = np.load(config.mel) sample_size = mel_input.shape[0] * hparams.hop_size mel_input = np.tile(mel_input,(config.batch_size,1,1)) with tf.variable_scope('wavenet',reuse=tf.AUTO_REUSE): upsampled_local_condition_data = net.create_upsample(mel_input,upsample_type=hparams.upsample_type) var_list = [var for var in tf.global_variables() if 'queue' not in var.name ] saver = tf.train.Saver(var_list) print('Restoring model from {}'.format(config.checkpoint_dir)) load(saver, sess, config.checkpoint_dir) sess.run(net.queue_initializer) # 이 부분이 없으면, checkpoint에서 복원된 값들이 들어 있다. quantization_channels = hparams.quantization_channels if config.wav_seed: # wav_seed의 길이가 receptive_field보다 작으면, padding이라도 해야 되는 거 아닌가? 그냥 짧으면 짧은 대로 return함 --> 그래서 너무 짧으면 error seed = create_seed(config.wav_seed,hparams.sample_rate,quantization_channels,net.receptive_field,scalar_input) # --> mu_law encode 된 것. if scalar_input: waveform = seed.tolist() else: waveform = sess.run(seed).tolist() # [116, 114, 120, 121, 127, ...] print('Priming generation...') for i, x in enumerate(waveform[-net.receptive_field: -1]): # 제일 마지막 1개는 아래의 for loop의 첫 loop에서 넣어준다. if i % 100 == 0: print('Priming sample {}/{}'.format(i,net.receptive_field), end='\r') sess.run(next_sample, feed_dict={samples: np.array([x]*net.batch_size).reshape(net.batch_size,1), upsampled_local_condition: np.zeros([net.batch_size,hparams.num_mels])}) print('Done.') waveform = np.array([waveform[-net.receptive_field:]]*net.batch_size) else: # Silence with a single random sample at the end. if scalar_input: waveform = [0.0] * (net.receptive_field - 1) waveform = np.array(waveform*net.batch_size).reshape(net.batch_size,-1) waveform = np.concatenate([waveform,2*np.random.rand(net.batch_size).reshape(net.batch_size,-1)-1],axis=-1) # -1~1사이의 random number를 만들어 끝에 붙힌다. # wavefor: shape(batch_size,net.receptive_field ) else: waveform = [quantization_channels / 2] * (net.receptive_field - 1) # 필요한 receptive_field 크기보다 1개 작게 만든 후, 아래에서 random하게 1개를 덧붙힌다. waveform = np.array(waveform*net.batch_size).reshape(net.batch_size,-1) waveform = np.concatenate([waveform,np.random.randint(quantization_channels,size=net.batch_size).reshape(net.batch_size,-1)],axis=-1) # one hot 변환 전. (batch_size, 5117) start_time = time.time() upsampled_local_condition_data = sess.run(upsampled_local_condition_data) last_sample_timestamp = datetime.now() for step in range(sample_size): # 원하는 길이를 구하기 위해 loop sample_size window = waveform[:,-1:] # 제일 끝에 있는 1개만 samples에 넣어 준다. window: shape(N,1) # Run the WaveNet to predict the next sample. # fast가 아닌경우. window: [128.0, 128.0, ..., 128.0, 178, 185] # fast인 경우, window는 숫자 1개. prediction = sess.run(next_sample, feed_dict={samples: window,upsampled_local_condition: upsampled_local_condition_data[:,step,:]}) # samples는 mu law encoding된 것. 계산 과정에서 one hot으로 변환된다. --> (batch_size,256) if scalar_input: sample = prediction # logistic distribution으로부터 sampling 되었기 때문에, randomness가 있다. else: # Scale prediction distribution using temperature. # 다음 과정은 config.temperature==1이면 각 원소를 합으로 나누어주는 것에 불과. 이미 softmax를 적용한 겂이므로, 합이 1이된다. 그래서 값의 변화가 없다. # config.temperature가 1이 아니며, 각 원소의 log취한 값을 나눈 후, 합이 1이 되도록 rescaling하는 것이 된다. np.seterr(divide='ignore') scaled_prediction = np.log(prediction) / config.temperature # config.temperature인 경우는 값의 변화가 없다. 
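            # Subtracting the log-sum-exp below renormalizes so that exp(scaled_prediction)
            # sums to 1 again; with temperature 1.0 this reproduces the original softmax
            # distribution, which the assert_allclose check further down verifies.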
scaled_prediction = (scaled_prediction - np.logaddexp.reduce(scaled_prediction,axis=-1,keepdims=True)) # np.log(np.sum(np.exp(scaled_prediction))) scaled_prediction = np.exp(scaled_prediction) np.seterr(divide='warn') # Prediction distribution at temperature=1.0 should be unchanged after # scaling. if config.temperature == 1.0: np.testing.assert_allclose( prediction, scaled_prediction, atol=1e-5, err_msg='Prediction scaling at temperature=1.0 is not working as intended.') # argmax로 선택하지 않기 때문에, 같은 입력이 들어가도 달라질 수 있다. sample = [[np.random.choice(np.arange(quantization_channels), p=p)] for p in scaled_prediction] # choose one sample per batch waveform = np.concatenate([waveform,sample],axis=-1) #window.shape: (N,1) # Show progress only once per second. current_sample_timestamp = datetime.now() time_since_print = current_sample_timestamp - last_sample_timestamp if time_since_print.total_seconds() > 1.: duration = time.time() - start_time print('Sample {:3 OOM. wavenet은 batch_size가 고정되어야 한다. store_metadata = False, num_steps = 1000000, # Number of training steps #Learning rate schedule wavenet_learning_rate = 1e-3, #wavenet initial learning rate wavenet_decay_rate = 0.5, #Only used with 'exponential' scheme. Defines the decay rate. wavenet_decay_steps = 300000, #Only used with 'exponential' scheme. Defines the decay steps. #Regularization parameters wavenet_clip_gradients = True, #Whether the clip the gradients during wavenet training. # residual 결과를 sum할 때, legacy = True, #Whether to use legacy mode: Multiply all skip outputs but the first one with sqrt(0.5) (True for more early training stability, especially for large models) # residual block내에서 x = (x + residual) * np.sqrt(0.5) residual_legacy = True, #Whether to scale residual blocks outputs by a factor of sqrt(0.5) (True for input variance preservation early in training and better overall stability) wavenet_dropout = 0.05, optimizer = 'adam', momentum = 0.9, # 'Specify the momentum to be used by sgd or rmsprop optimizer. Ignored by the adam optimizer. max_checkpoints = 3, # 'Maximum amount of checkpoints that will be kept alive. Default: ' #################################### #################################### #################################### # TACOTRON HYPERPARAMETERS # Training adam_beta1 = 0.9, adam_beta2 = 0.999, #Learning rate schedule tacotron_decay_learning_rate = True, #boolean, determines if the learning rate will follow an exponential decay tacotron_start_decay = 40000, #Step at which learning decay starts tacotron_decay_steps = 18000, #Determines the learning rate decay slope (UNDER TEST) tacotron_decay_rate = 0.5, #learning rate decay rate (UNDER TEST) tacotron_initial_learning_rate = 1e-3, #starting learning rate tacotron_final_learning_rate = 1e-4, #minimal learning rate initial_data_greedy = True, initial_phase_step = 8000, # 여기서 지정한 step 이전에는 data_dirs의 각각의 디렉토리에 대하여 같은 수의 example을 만들고, 이후, weght 비듈에 따라 ... 즉, 아래의 'main_data_greedy_factor'의 영향을 받는다. main_data_greedy_factor = 0, main_data = [''], # 이곳에 있는 directory 속에 있는 data는 가중치를 'main_data_greedy_factor' 만큼 더 준다. prioritize_loss = False, # Model model_type = 'multi-speaker', # [single, multi-speaker] speaker_embedding_size = 16, embedding_size = 512, # 'ᄀ', 'ᄂ', 'ᅡ' 에 대한 embedding dim dropout_prob = 0.5, reduction_factor = 2, # reduction_factor가 적으면 더 많은 iteration이 필요하므로, 더 많은 메모리가 필요하다. 
# Encoder enc_conv_num_layers = 3, enc_conv_kernel_size = 5, enc_conv_channels = 512, tacotron_zoneout_rate = 0.1, encoder_lstm_units = 256, attention_type = 'bah_mon_norm', # 'loc_sen', 'bah_mon_norm' attention_size = 128, #Attention mechanism smoothing = False, #Whether to smooth the attention normalization function attention_dim = 128, #dimension of attention space attention_filters = 32, #number of attention convolution filters attention_kernel = (31, ), #kernel size of attention convolution cumulative_weights = True, #Whether to cumulate (sum) all previous attention weights or simply feed previous weights (Recommended: True) #Attention synthesis constraints #"Monotonic" constraint forces the model to only look at the forwards attention_win_size steps. #"Window" allows the model to look at attention_win_size neighbors, both forward and backward steps. synthesis_constraint = False, #Whether to use attention windows constraints in synthesis only (Useful for long utterances synthesis) synthesis_constraint_type = 'window', #can be in ('window', 'monotonic'). attention_win_size = 7, #Side of the window. Current step does not count. If mode is window and attention_win_size is not pair, the 1 extra is provided to backward part of the window. #Loss params mask_encoder = True, #whether to mask encoder padding while computing location sensitive attention. Set to True for better prosody but slower convergence. #Decoder prenet_layers = [256, 256], #number of layers and number of units of prenet decoder_layers = 2, #number of decoder lstm layers decoder_lstm_units = 1024, #number of decoder lstm units on each layer dec_prenet_sizes = [256, 256], #number of layers and number of units of prenet #Residual postnet postnet_num_layers = 5, #number of postnet convolutional layers postnet_kernel_size = (5, ), #size of postnet convolution filters for each layer postnet_channels = 512, #number of postnet convolution filters for each layer # for linear mel spectrogrma post_bank_size = 8, post_bank_channel_size = 128, post_maxpool_width = 2, post_highway_depth = 4, post_rnn_size = 128, post_proj_sizes = [256, 80], # num_mels=80 post_proj_width = 3, tacotron_reg_weight = 1e-6, #regularization weight (for L2 regularization) inference_prenet_dropout = True, # Eval min_tokens = 30, #originally 50, 30 is good for korean, text를 token으로 쪼갰을 때, 최소 길이 이상되어야 train에 사용 min_n_frame = 30*5, # min_n_frame = reduction_factor * min_iters, reduction_factor와 곱해서 min_n_frame을 설정한다. max_n_frame = 200*5, skip_inadequate = False, griffin_lim_iters = 60, power = 1.5, ) if hparams.use_lws: # Does not work if fft_size is not multiple of hop_size!! # sample size = 20480, hop_size=256=12.5ms. fft_size는 window_size를 결정하는데, 2048을 시간으로 환산하면 2048/20480 = 0.1초=100ms hparams.sample_rate = 20480 # # shift can be specified by either hop_size(우선) or frame_shift_ms hparams.hop_size = 256 # frame_shift_ms = 12.5ms hparams.frame_shift_ms=None # hop_size= sample_rate * frame_shift_ms / 1000 hparams.fft_size=2048 # 주로 1024로 되어있는데, tacotron에서 2048사용==> output size = 1025 hparams.win_size = None # 256x4 --> 50ms else: # 미리 정의되 parameter들로 부터 consistant하게 정의해 준다. 
    hparams.num_freq = int(hparams.fft_size/2 + 1)
    hparams.frame_shift_ms = hparams.hop_size * 1000.0 / hparams.sample_rate  # hop_size = sample_rate * frame_shift_ms / 1000
    hparams.frame_length_ms = hparams.win_size * 1000.0 / hparams.sample_rate

def hparams_debug_string():
    values = hparams.values()
    hp = [' %s: %s' % (name, values[name]) for name in sorted(values)]
    return 'Hyperparameters:\n' + '\n'.join(hp)

================================================
FILE: preprocess.py
================================================
# coding: utf-8
"""
python preprocess.py --num_workers 10 --name son --in_dir D:\hccho\multi-speaker-tacotron-tensorflow-master\datasets\son --out_dir .\data\son
python preprocess.py --num_workers 10 --name moon --in_dir D:\hccho\multi-speaker-tacotron-tensorflow-master\datasets\moon --out_dir .\data\moon
==> An npz file bundling 'audio', 'mel', 'linear', 'time_steps', 'mel_frames', 'text', 'tokens' and 'loss_coeff' is created in out_dir.
"""
import argparse
import os
from multiprocessing import cpu_count
from tqdm import tqdm
import importlib
from hparams import hparams, hparams_debug_string
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def preprocess(mod, in_dir, out_dir, num_workers):
    os.makedirs(out_dir, exist_ok=True)
    metadata = mod.build_from_path(hparams, in_dir, out_dir, num_workers=num_workers, tqdm=tqdm)
    write_metadata(metadata, out_dir)

def write_metadata(metadata, out_dir):
    with open(os.path.join(out_dir, 'train.txt'), 'w', encoding='utf-8') as f:
        for m in metadata:
            f.write('|'.join([str(x) for x in m]) + '\n')
    mel_frames = sum([int(m[4]) for m in metadata])
    timesteps = sum([int(m[3]) for m in metadata])
    sr = hparams.sample_rate
    hours = timesteps / sr / 3600
    print('Write {} utterances, {} mel frames, {} audio timesteps, ({:.2f} hours)'.format(len(metadata), mel_frames, timesteps, hours))
    print('Max input length (text chars): {}'.format(max(len(m[5]) for m in metadata)))
    print('Max mel frames length: {}'.format(max(int(m[4]) for m in metadata)))
    print('Max audio timesteps length: {}'.format(max(m[3] for m in metadata)))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--name', type=str, default=None)
    parser.add_argument('--in_dir', type=str, default=None)
    parser.add_argument('--out_dir', type=str, default=None)
    parser.add_argument('--num_workers', type=str, default=None)
    parser.add_argument('--hparams', type=str, default=None)
    args = parser.parse_args()

    if args.hparams is not None:
        hparams.parse(args.hparams)
    print(hparams_debug_string())

    name = args.name
    in_dir = args.in_dir
    out_dir = args.out_dir
    num_workers = args.num_workers
    num_workers = cpu_count() if num_workers is None else int(num_workers)  # cpu_count() = number of processes

    print("Sampling frequency: {}".format(hparams.sample_rate))

    assert name in ["cmu_arctic", "ljspeech", "son", "moon"]
    mod = importlib.import_module('datasets.{}'.format(name))
    preprocess(mod, in_dir, out_dir, num_workers)

================================================
FILE: synthesizer.py
================================================
# coding: utf-8
"""
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "그런데 청년은 이렇게 말합니다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "이런 논란은 타코트론 논문 이후에 사라졌습니다"
python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "이런 논란은 타코트론 논문 이후에 사라졌습니다"
python
synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오는 6월6일은 제64회 현충일입니다" python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "오는 6월6일은 제64회 현충일입니다" python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다" python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다" """ import io import os import re import librosa import argparse import numpy as np from glob import glob from tqdm import tqdm import tensorflow as tf from functools import partial from hparams import hparams from tacotron2 import create_model, get_most_recent_checkpoint from utils.audio import save_wav, inv_linear_spectrogram, inv_preemphasis, inv_spectrogram_tensorflow from utils import plot, PARAMS_NAME, load_json, load_hparams, add_prefix, add_postfix, get_time, parallel_run, makedirs, str2bool from text.korean import tokenize from text import text_to_sequence, sequence_to_text from datasets.datafeeder_tacotron2 import _prepare_inputs import warnings warnings.simplefilter(action='ignore', category=FutureWarning) tf.logging.set_verbosity(tf.logging.ERROR) class Synthesizer(object): def close(self): tf.reset_default_graph() self.sess.close() def load(self, checkpoint_path, num_speakers=2, checkpoint_step=None, inference_prenet_dropout=True,model_name='tacotron'): self.num_speakers = num_speakers if os.path.isdir(checkpoint_path): load_path = checkpoint_path checkpoint_path = get_most_recent_checkpoint(checkpoint_path, checkpoint_step) else: load_path = os.path.dirname(checkpoint_path) print('Constructing model: %s' % model_name) inputs = tf.placeholder(tf.int32, [None, None], 'inputs') input_lengths = tf.placeholder(tf.int32, [None], 'input_lengths') batch_size = tf.shape(inputs)[0] speaker_id = tf.placeholder_with_default( tf.zeros([batch_size], dtype=tf.int32), [None], 'speaker_id') load_hparams(hparams, load_path) hparams.inference_prenet_dropout = inference_prenet_dropout with tf.variable_scope('model') as scope: self.model = create_model(hparams) self.model.initialize(inputs=inputs, input_lengths=input_lengths, num_speakers=self.num_speakers, speaker_id=speaker_id,is_training=False) self.wav_output = inv_spectrogram_tensorflow(self.model.linear_outputs,hparams) print('Loading checkpoint: %s' % checkpoint_path) sess_config = tf.ConfigProto( allow_soft_placement=True, intra_op_parallelism_threads=1, inter_op_parallelism_threads=2) sess_config.gpu_options.allow_growth = True self.sess = tf.Session(config=sess_config) self.sess.run(tf.global_variables_initializer()) saver = tf.train.Saver() saver.restore(self.sess, checkpoint_path) def synthesize(self, texts=None, tokens=None, base_path=None, paths=None, speaker_ids=None, start_of_sentence=None, end_of_sentence=True, pre_word_num=0, post_word_num=0, pre_surplus_idx=0, post_surplus_idx=1, use_short_concat=False, base_alignment_path=None, librosa_trim=False, attention_trim=True, isKorean=True): # Possible inputs: # 1) text=text # 2) text=texts # 3) tokens=tokens, texts=texts # use texts as guide if type(texts) == str: texts = [texts] if texts is not None and tokens is None: sequences = np.array([text_to_sequence(text) for text in texts]) sequences = _prepare_inputs(sequences) elif tokens is not None: sequences = tokens #sequences = 
np.pad(sequences,[(0,0),(0,5)],'constant',constant_values=(0)) # case by case ---> overfitting? if paths is None: paths = [None] * len(sequences) if texts is None: texts = [None] * len(sequences) time_str = get_time() def plot_and_save_parallel(wavs, alignments,mels): items = list(enumerate(zip(wavs, alignments, paths, texts, sequences,mels))) fn = partial( plot_graph_and_save_audio, base_path=base_path, start_of_sentence=start_of_sentence, end_of_sentence=end_of_sentence, pre_word_num=pre_word_num, post_word_num=post_word_num, pre_surplus_idx=pre_surplus_idx, post_surplus_idx=post_surplus_idx, use_short_concat=use_short_concat, librosa_trim=librosa_trim, attention_trim=attention_trim, time_str=time_str, isKorean=isKorean) return parallel_run(fn, items,desc="plot_graph_and_save_audio", parallel=False) #input_lengths = np.argmax(np.array(sequences) == 1, 1)+1 input_lengths = [np.argmax(a==1)+1 for a in sequences] fetches = [ #self.wav_output, self.model.linear_outputs, self.model.alignments, # # batch_size, text length(encoder), target length(decoder) self.model.mel_outputs, ] feed_dict = { self.model.inputs: sequences, self.model.input_lengths: input_lengths, } if speaker_ids is not None: if type(speaker_ids) == dict: speaker_embed_table = sess.run( self.model.speaker_embed_table) speaker_embed = [speaker_ids[speaker_id] * speaker_embed_table[speaker_id] for speaker_id in speaker_ids] feed_dict.update({ self.model.speaker_embed_table: np.tile() }) else: feed_dict[self.model.speaker_id] = speaker_ids wavs, alignments,mels = self.sess.run(fetches, feed_dict=feed_dict) results = plot_and_save_parallel(wavs, alignments,mels=mels) return results def plot_graph_and_save_audio(args, base_path=None, start_of_sentence=None, end_of_sentence=None, pre_word_num=0, post_word_num=0, pre_surplus_idx=0, post_surplus_idx=1, use_short_concat=False, save_alignment=False, librosa_trim=False, attention_trim=False, time_str=None, isKorean=True): idx, (wav, alignment, path, text, sequence,mel) = args if base_path: plot_path = "{}/{}.png".format(base_path, get_time()) elif path: plot_path = path.rsplit('.', 1)[0] + ".png" else: plot_path = None if plot_path: plot.plot_alignment(alignment, plot_path, text=text, isKorean=isKorean) if use_short_concat: wav = short_concat( wav, alignment, text, start_of_sentence, end_of_sentence, pre_word_num, post_word_num, pre_surplus_idx, post_surplus_idx) if attention_trim and end_of_sentence: # attention이 text의 마지막까지 왔다면, 그 뒷부분은 버린다. 
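# Illustrative sketch (not from the original code): `alignment` has shape
# (text length, decoder length), so alignment.argmax(0) gives, for every decoder frame,
# the input token it attends to most strongly. Once that index stays on the last token for
# a few frames, synthesis is treated as finished and wav/mel are cut near
# reduction_factor * frame index (the loop below, plus a small margin). Toy alignment with
# 3 tokens and 4 frames:
_demo_align = np.array([[.9, .1, .0, .0], [.1, .8, .1, .0], [.0, .1, .9, 1.]])
_demo_attended = _demo_align.argmax(0)                         # -> [0, 1, 2, 2]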
end_idx_counter = 0 attention_argmax = alignment.argmax(0) # alignment: text length(encoder), target length(decoder) ==> target length(decoder) end_idx = min(len(sequence) - 1, max(attention_argmax)) max_counter = min((attention_argmax == end_idx).sum(), 5) for jdx, attend_idx in enumerate(attention_argmax): if len(attention_argmax) > jdx + 1: if attend_idx == end_idx: end_idx_counter += 1 if attend_idx == end_idx and attention_argmax[jdx + 1] > end_idx: break if end_idx_counter >= max_counter: break else: break spec_end_idx = hparams.reduction_factor * jdx + 3 wav = wav[:spec_end_idx] mel = mel[:spec_end_idx] audio_out = inv_linear_spectrogram(wav.T,hparams) if librosa_trim and end_of_sentence: yt, index = librosa.effects.trim(audio_out, frame_length=5120, hop_length=256, top_db=50) audio_out = audio_out[:index[-1]] mel = mel[:index[-1]//hparams.hop_size] if save_alignment: alignment_path = "{}/{}.npy".format(base_path, idx) np.save(alignment_path, alignment, allow_pickle=False) if path or base_path: if path: current_path = add_postfix(path, idx) elif base_path: current_path = plot_path.replace(".png", ".wav") save_wav(audio_out, current_path,hparams.sample_rate) #hccho mel_path = current_path.replace(".wav",".npy") np.save(mel_path,mel) return True else: io_out = io.BytesIO() save_wav(audio_out, io_out,hparams.sample_rate) result = io_out.getvalue() return result def get_most_recent_checkpoint(checkpoint_dir, checkpoint_step=None): if checkpoint_step is None: checkpoint_paths = [path for path in glob("{}/*.ckpt-*.data-*".format(checkpoint_dir))] idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths] max_idx = max(idxes) else: max_idx = checkpoint_step lastest_checkpoint = os.path.join(checkpoint_dir, "model.ckpt-{}".format(max_idx)) print(" [*] Found lastest checkpoint: {}".format(lastest_checkpoint)) return lastest_checkpoint def short_concat( wav, alignment, text, start_of_sentence, end_of_sentence, pre_word_num, post_word_num, pre_surplus_idx, post_surplus_idx): # np.array(list(decomposed_text))[attention_argmax] attention_argmax = alignment.argmax(0) if not start_of_sentence and pre_word_num > 0: surplus_decomposed_text = decompose_ko_text("".join(text.split()[0])) start_idx = len(surplus_decomposed_text) + 1 for idx, attend_idx in enumerate(attention_argmax): if attend_idx == start_idx and attention_argmax[idx - 1] < start_idx: break wav_start_idx = hparams.reduction_factor * idx - 1 - pre_surplus_idx else: wav_start_idx = 0 if not end_of_sentence and post_word_num > 0: surplus_decomposed_text = decompose_ko_text("".join(text.split()[-1])) end_idx = len(decomposed_text.replace(surplus_decomposed_text, '')) - 1 for idx, attend_idx in enumerate(attention_argmax): if attend_idx == end_idx and attention_argmax[idx + 1] > end_idx: break wav_end_idx = hparams.reduction_factor * idx + 1 + post_surplus_idx else: if True: # attention based split if end_of_sentence: end_idx = min(len(decomposed_text) - 1, max(attention_argmax)) else: surplus_decomposed_text = decompose_ko_text("".join(text.split()[-1])) end_idx = len(decomposed_text.replace(surplus_decomposed_text, '')) - 1 while True: if end_idx in attention_argmax: break end_idx -= 1 end_idx_counter = 0 for idx, attend_idx in enumerate(attention_argmax): if len(attention_argmax) > idx + 1: if attend_idx == end_idx: end_idx_counter += 1 if attend_idx == end_idx and attention_argmax[idx + 1] > end_idx: break if end_idx_counter > 5: break else: break wav_end_idx = hparams.reduction_factor * idx + 1 + 
post_surplus_idx else: wav_end_idx = None wav = wav[wav_start_idx:wav_end_idx] if end_of_sentence: wav = np.lib.pad(wav, ((0, 20), (0, 0)), 'constant', constant_values=0) else: wav = np.lib.pad(wav, ((0, 10), (0, 0)), 'constant', constant_values=0) if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument('--load_path', required=True) parser.add_argument('--sample_path', default="logdir-tacotron2/generate") parser.add_argument('--text', required=True) parser.add_argument('--num_speakers', default=1, type=int) parser.add_argument('--speaker_id', default=0, type=int) parser.add_argument('--checkpoint_step', default=None, type=int) parser.add_argument('--is_korean', default=True, type=str2bool) parser.add_argument('--base_alignment_path', default=None) config = parser.parse_args() makedirs(config.sample_path) synthesizer = Synthesizer() synthesizer.load(config.load_path, config.num_speakers, config.checkpoint_step,inference_prenet_dropout=False) audio = synthesizer.synthesize(texts=[config.text],base_path=config.sample_path,speaker_ids=[config.speaker_id], attention_trim=True,base_alignment_path=config.base_alignment_path,isKorean=config.is_korean)[0] ================================================ FILE: tacotron2/__init__.py ================================================ # coding: utf-8 import os from glob import glob from .tacotron2 import Tacotron2 def create_model(hparams): return Tacotron2(hparams) def get_most_recent_checkpoint(checkpoint_dir): checkpoint_paths = [path for path in glob("{}/*.ckpt-*.data-*".format(checkpoint_dir))] idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths] max_idx = max(idxes) lastest_checkpoint = os.path.join(checkpoint_dir, "model.ckpt-{}".format(max_idx)) #latest_checkpoint=checkpoint_paths[0] print(" [*] Found lastest checkpoint: {}".format(lastest_checkpoint)) return lastest_checkpoint ================================================ FILE: tacotron2/helpers.py ================================================ # coding: utf-8 # Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py import numpy as np import tensorflow as tf from tensorflow.contrib.seq2seq import Helper # Adapted from tf.contrib.seq2seq.GreedyEmbeddingHelper class TacoTestHelper(Helper): def __init__(self, batch_size, output_dim, r): with tf.name_scope('TacoTestHelper'): self._batch_size = batch_size self._output_dim = output_dim self._end_token = tf.tile([0.0], [output_dim * r]) # [0.0,0.0,...] self._reduction_factor = r @property def batch_size(self): return self._batch_size @property def sample_ids_dtype(self): return tf.int32 @property def sample_ids_shape(self): return tf.TensorShape([]) def initialize(self, name=None): return (tf.tile([False], [self._batch_size]), _go_frames(self._batch_size, self._output_dim)) def sample(self, time, outputs, state, name=None): return tf.tile([0], [self._batch_size]) # Return all 0; we ignore them def next_inputs(self, time, outputs, state, sample_ids, name=None): '''Stop on EOS. Otherwise, pass the last output as the next input and pass through state.''' with tf.name_scope('TacoTestHelper'): stop_token_preds = tf.nn.sigmoid(outputs[:,-self._reduction_factor:]) finished = tf.reduce_any(tf.cast(tf.round(stop_token_preds), tf.bool),axis=1) # Feed last output frame as next input. 
outputs is [N, output_dim * r] next_inputs = outputs[:, -(self._output_dim+self._reduction_factor):-self._reduction_factor] # stop token 부분을 제외 return (finished, next_inputs, state) class TacoTrainingHelper(Helper): def __init__(self, targets, output_dim, r): # inputs is [N, T_in], targets is [N, T_out, D] # output_dim = hp.num_mels = 80 # r = hp.reduction_factor = 4 or 5 with tf.name_scope('TacoTrainingHelper'): self._batch_size = tf.shape(targets)[0] self._output_dim = output_dim # Feed every r-th target frame as input self._targets = targets[:, r-1::r, :] # Use full length for every target because we don't want to mask the padding frames num_steps = tf.shape(self._targets)[1] self._lengths = tf.tile([num_steps], [self._batch_size]) @property def batch_size(self): return self._batch_size @property def sample_ids_dtype(self): return tf.int32 @property def sample_ids_shape(self): return tf.TensorShape([]) def initialize(self, name=None): return (tf.tile([False], [self._batch_size]), _go_frames(self._batch_size, self._output_dim)) def sample(self, time, outputs, state, name=None): return tf.tile([0], [self._batch_size]) # Return all 0; we ignore them def next_inputs(self, time, outputs, state, sample_ids, name=None): # time에 해당하는 input을 만들어 return해야 한다. with tf.name_scope(name or 'TacoTrainingHelper'): finished = (time + 1 >= self._lengths) next_inputs = self._targets[:, time, :] return (finished, next_inputs, state) def _go_frames(batch_size, output_dim): '''Returns all-zero frames for a given batch size and output dimension''' return tf.tile([[0.0]], [batch_size, output_dim]) ================================================ FILE: tacotron2/modules.py ================================================ # coding: utf-8 # Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py import tensorflow as tf from tensorflow.contrib.rnn import GRUCell from tensorflow.python.layers import core from tensorflow.contrib.seq2seq.python.ops.attention_wrapper import _bahdanau_score, _BaseAttentionMechanism, BahdanauAttention, AttentionWrapper, AttentionWrapperState def prenet(inputs, is_training, layer_sizes, drop_prob, scope=None): x = inputs # 3차원 array(batch,seq_length,embedding_dim) ==> (batch,seq_length,256) ==> (batch,seq_length,128) #drop_rate = drop_prob if is_training else 0.0 #print('drop_rate',drop_rate) with tf.variable_scope(scope or 'prenet'): for i, size in enumerate(layer_sizes): # [f(256), f(256)] dense = tf.layers.dense(x, units=size, activation=tf.nn.relu, name='projection_%d' % (i+1)) # Tacotron2 논문에서는 training, inference 모두에 dropout 적용 x = tf.layers.dropout(dense, rate=drop_prob,training=True, name='dropout_%d' % (i+1)) # Tacotron2에서는 training, inference 모두에 dropout 적용 return x def cbhg(inputs, input_lengths, is_training, bank_size, bank_channel_size, maxpool_width, highway_depth, rnn_size, proj_sizes, proj_width, scope,before_highway=None, encoder_rnn_init_state=None): # inputs: (N,T_in, 128), bank_size: 16 batch_size = tf.shape(inputs)[0] with tf.variable_scope(scope): with tf.variable_scope('conv_bank'): # Convolution bank: concatenate on the last axis # to stack channels from all convolutions conv_fn = lambda k: conv1d(inputs, k, bank_channel_size, tf.nn.relu, is_training, 'conv1d_%d' % k) # bank_channel_size =128 conv_outputs = tf.concat( [conv_fn(k) for k in range(1, bank_size+1)], axis=-1,) # ==> (N,T_in,128*bank_size) # Maxpooling: maxpool_output = tf.layers.max_pooling1d(conv_outputs,pool_size=maxpool_width,strides=1,padding='same') # maxpool_width = 2 # 
Two projection layers: proj_out = maxpool_output for idx, proj_size in enumerate(proj_sizes): # [f(128), f(128)], post: [f(256), f(80)] activation_fn = None if idx == len(proj_sizes) - 1 else tf.nn.relu proj_out = conv1d(proj_out, proj_width, proj_size, activation_fn,is_training, 'proj_{}'.format(idx + 1)) # proj_width = 3 # Residual connection: if before_highway is not None: # multi-sperker mode expanded_before_highway = tf.expand_dims(before_highway, [1]) tiled_before_highway = tf.tile(expanded_before_highway, [1, tf.shape(proj_out)[1], 1]) highway_input = proj_out + inputs + tiled_before_highway else: # single model highway_input = proj_out + inputs # Handle dimensionality mismatch: if highway_input.shape[2] != rnn_size: # rnn_size = 128 highway_input = tf.layers.dense(highway_input, rnn_size,name='highway_projection') # 4-layer HighwayNet: for idx in range(highway_depth): highway_input = highwaynet(highway_input, 'highway_%d' % (idx+1)) rnn_input = highway_input # Bidirectional RNN if encoder_rnn_init_state is not None: initial_state_fw, initial_state_bw = tf.split(encoder_rnn_init_state, 2, 1) else: # single mode initial_state_fw, initial_state_bw = None, None cell_fw, cell_bw = GRUCell(rnn_size), GRUCell(rnn_size) outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,rnn_input,sequence_length=input_lengths, initial_state_fw=initial_state_fw,initial_state_bw=initial_state_bw,dtype=tf.float32) return tf.concat(outputs, axis=2) # Concat forward and backward def batch_tile(tensor, batch_size): expaneded_tensor = tf.expand_dims(tensor, [0]) return tf.tile(expaneded_tensor, \ [batch_size] + [1 for _ in tensor.get_shape()]) def highwaynet(inputs, scope): highway_dim = int(inputs.get_shape()[-1]) with tf.variable_scope(scope): H = tf.layers.dense(inputs,units=highway_dim, activation=tf.nn.relu,name='H_projection') T = tf.layers.dense(inputs,units=highway_dim, activation=tf.nn.sigmoid,name='T_projection',bias_initializer=tf.constant_initializer(-1.0)) return H * T + inputs * (1.0 - T) def conv1d(inputs, kernel_size, channels, activation, is_training, scope): with tf.variable_scope(scope): # strides=1, padding = same 이므로, kernel_size에 상관없이 크기가 유지된다. conv1d_output = tf.layers.conv1d(inputs,filters=channels,kernel_size=kernel_size,activation=activation,padding='same') # padding이 same이라 kenel size가 달라도 concat된다. return tf.layers.batch_normalization(conv1d_output, training=is_training) ================================================ FILE: tacotron2/rnn_wrappers.py ================================================ # coding: utf-8 import numpy as np import tensorflow as tf from tensorflow.contrib.rnn import RNNCell from tensorflow.python.ops import rnn_cell_impl #from tensorflow.contrib.data.python.util import nest from tensorflow.contrib.framework import nest from tensorflow.contrib.seq2seq.python.ops.attention_wrapper import _bahdanau_score, _BaseAttentionMechanism, BahdanauAttention, \ AttentionWrapperState, AttentionMechanism, _BaseMonotonicAttentionMechanism,_maybe_mask_score,_prepare_memory,_monotonic_probability_fn from tensorflow.python.ops import array_ops, math_ops, nn_ops, variable_scope from tensorflow.python.layers.core import Dense from .modules import prenet import functools _zero_state_tensors = rnn_cell_impl._zero_state_tensors class ZoneoutLSTMCell(RNNCell): '''Wrapper for tf LSTM to create Zoneout LSTM Cell inspired by: https://github.com/teganmaharaj/zoneout/blob/master/zoneout_tensorflow.py Published by one of 'https://arxiv.org/pdf/1606.01305.pdf' paper writers. 
Many thanks to @Ondal90 for pointing this out. You sir are a hero! ''' def __init__(self, num_units, is_training, zoneout_factor_cell=0., zoneout_factor_output=0., state_is_tuple=True, name=None): '''Initializer with possibility to set different zoneout values for cell/hidden states. ''' zm = min(zoneout_factor_output, zoneout_factor_cell) zs = max(zoneout_factor_output, zoneout_factor_cell) if zm < 0. or zs > 1.: raise ValueError('One/both provided Zoneout factors are not in [0, 1]') self._cell = tf.nn.rnn_cell.LSTMCell(num_units, state_is_tuple=state_is_tuple, name=name) self._zoneout_cell = zoneout_factor_cell self._zoneout_outputs = zoneout_factor_output self.is_training = is_training self.state_is_tuple = state_is_tuple @property def state_size(self): return self._cell.state_size @property def output_size(self): return self._cell.output_size def __call__(self, inputs, state, scope=None): '''Runs vanilla LSTM Cell and applies zoneout. ''' #Apply vanilla LSTM output, new_state = self._cell(inputs, state, scope) if self.state_is_tuple: (prev_c, prev_h) = state (new_c, new_h) = new_state else: num_proj = self._cell._num_units if self._cell._num_proj is None else self._cell._num_proj prev_c = tf.slice(state, [0, 0], [-1, self._cell._num_units]) prev_h = tf.slice(state, [0, self._cell._num_units], [-1, num_proj]) new_c = tf.slice(new_state, [0, 0], [-1, self._cell._num_units]) new_h = tf.slice(new_state, [0, self._cell._num_units], [-1, num_proj]) #Apply zoneout if self.is_training: #nn.dropout takes keep_prob (probability to keep activations) not drop_prob (probability to mask activations)! c = (1 - self._zoneout_cell) * tf.nn.dropout(new_c - prev_c, (1 - self._zoneout_cell)) + prev_c # tf.nn.dropout outputs the input element scaled up by 1 / keep_prob h = (1 - self._zoneout_outputs) * tf.nn.dropout(new_h - prev_h, (1 - self._zoneout_outputs)) + prev_h else: c = (1 - self._zoneout_cell) * new_c + self._zoneout_cell * prev_c h = (1 - self._zoneout_outputs) * new_h + self._zoneout_outputs * prev_h new_state = tf.nn.rnn_cell.LSTMStateTuple(c, h) if self.state_is_tuple else tf.concat(1, [c, h]) return output, new_state class DecoderWrapper(RNNCell): '''Runs RNN inputs through a prenet before sending them to the cell.''' # input에 prenet을 먼저 적용하는 것 뿐이다. def __init__(self, cell, is_training, prenet_sizes, dropout_prob,inference_prenet_dropout=True): super(DecoderWrapper, self).__init__() self._is_training = is_training self._cell = cell self.prenet_sizes = prenet_sizes if not is_training and not inference_prenet_dropout: self.dropout_prob = 0. else: self.dropout_prob = dropout_prob @property def state_size(self): return self._cell.state_size @property def output_size(self): return self._cell.output_size + self._cell.state_size.attention def call(self, inputs, state): prenet_out = prenet(inputs, self._is_training,self.prenet_sizes, self.dropout_prob, scope='decoder_prenet') output, res_state = self._cell(prenet_out, state) return tf.concat([output, res_state.attention], axis=-1), res_state def zero_state(self, batch_size, dtype): return self._cell.zero_state(batch_size, dtype) class LocationSensitiveAttention(BahdanauAttention): """Impelements Bahdanau-style (cumulative) scoring function. Usually referred to as "hybrid" attention (content-based + location-based) Extends the additive attention described in: "D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine transla- tion by jointly learning to align and translate,” in Proceedings of ICLR, 2015." 
to use previous alignments as additional location features. This attention is described in: J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben- gio, “Attention-based models for speech recognition,” in Ad- vances in Neural Information Processing Systems, 2015, pp. 577–585. """ def __init__(self, num_units, memory, hparams, is_training, mask_encoder=True, memory_sequence_length=None, smoothing=False, cumulate_weights=True, name='LocationSensitiveAttention'): """Construct the Attention mechanism. Args: num_units: The depth of the query mechanism. memory: The memory to query; usually the output of an RNN encoder. This tensor should be shaped `[batch_size, max_time, ...]`. mask_encoder (optional): Boolean, whether to mask encoder paddings. memory_sequence_length (optional): Sequence lengths for the batch entries in memory. If provided, the memory tensor rows are masked with zeros for values past the respective sequence lengths. Only relevant if mask_encoder = True. smoothing (optional): Boolean. Determines which normalization function to use. Default normalization function (probablity_fn) is softmax. If smoothing is enabled, we replace softmax with: a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j})) Introduced in: J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben- gio, “Attention-based models for speech recognition,” in Ad- vances in Neural Information Processing Systems, 2015, pp. 577–585. This is mainly used if the model wants to attend to multiple input parts at the same decoding step. We probably won't be using it since multiple sound frames may depend on the same character/phone, probably not the way around. Note: We still keep it implemented in case we want to test it. They used it in the paper in the context of speech recognition, where one phoneme may depend on multiple subsequent sound frames. name: Name to use when creating ops. """ #Create normalization function #Setting it to None defaults in using softmax normalization_function = _smoothing_normalization if (smoothing == True) else None memory_length = memory_sequence_length if (mask_encoder==True) else None super(LocationSensitiveAttention, self).__init__( num_units=num_units, memory=memory, memory_sequence_length=memory_length, probability_fn=normalization_function, name=name) self.location_convolution = tf.layers.Conv1D(filters=hparams.attention_filters, kernel_size=hparams.attention_kernel, padding='same', use_bias=True, bias_initializer=tf.zeros_initializer(), name='location_features_convolution') self.location_layer = tf.layers.Dense(units=num_units, use_bias=False,dtype=tf.float32, name='location_features_projection') self._cumulate = cumulate_weights self.synthesis_constraint = hparams.synthesis_constraint and not is_training self.attention_win_size = tf.convert_to_tensor(hparams.attention_win_size, dtype=tf.int32) self.constraint_type = hparams.synthesis_constraint_type def __call__(self, query, state): """Score the query based on the keys and values. Args: query: Tensor of dtype matching `self.values` and shape `[batch_size, query_depth]`. state (previous alignments): Tensor of dtype matching `self.values` and shape `[batch_size, alignments_size]` (`alignments_size` is memory's `max_time`). Returns: alignments: Tensor of dtype matching `self.values` and shape `[batch_size, alignments_size]` (`alignments_size` is memory's `max_time`). 
""" previous_alignments = state with variable_scope.variable_scope(None, "Location_Sensitive_Attention", [query]): # processed_query shape [batch_size, query_depth] -> [batch_size, attention_dim] processed_query = self.query_layer(query) if self.query_layer else query # -> [batch_size, 1, attention_dim] processed_query = tf.expand_dims(processed_query, 1) # processed_location_features shape [batch_size, max_time, attention dimension] # [batch_size, max_time] -> [batch_size, max_time, 1] expanded_alignments = tf.expand_dims(previous_alignments, axis=2) # location features [batch_size, max_time, filters] f = self.location_convolution(expanded_alignments) # Projected location features [batch_size, max_time, attention_dim] processed_location_features = self.location_layer(f) # energy shape [batch_size, max_time] energy = _location_sensitive_score(processed_query, processed_location_features, self.keys) if self.synthesis_constraint: prev_max_attentions = tf.argmax(previous_alignments, -1, output_type=tf.int32) Tx = tf.shape(energy)[-1] # prev_max_attentions = tf.squeeze(prev_max_attentions, [-1]) if self.constraint_type == 'monotonic': key_masks = tf.sequence_mask(prev_max_attentions, Tx) reverse_masks = tf.sequence_mask(Tx - self.attention_win_size - prev_max_attentions, Tx)[:, ::-1] else: assert self.constraint_type == 'window' key_masks = tf.sequence_mask(prev_max_attentions - (self.attention_win_size // 2 + (self.attention_win_size % 2 != 0)), Tx) reverse_masks = tf.sequence_mask(Tx - (self.attention_win_size // 2) - prev_max_attentions, Tx)[:, ::-1] masks = tf.logical_or(key_masks, reverse_masks) paddings = tf.ones_like(energy) * (-2 ** 32 + 1) # (N, Ty/r, Tx) energy = tf.where(tf.equal(masks, False), energy, paddings) # alignments shape = energy shape = [batch_size, max_time] alignments = self._probability_fn(energy, previous_alignments) # Cumulate alignments if self._cumulate: next_state = alignments + previous_alignments else: next_state = alignments return alignments, next_state def _location_sensitive_score(W_query, W_fil, W_keys): """Impelements Bahdanau-style (cumulative) scoring function. This attention is described in: J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben- gio, “Attention-based models for speech recognition,” in Ad- vances in Neural Information Processing Systems, 2015, pp. 577–585. ############################################################################# hybrid attention (content-based + location-based) f = F * α_{i-1} energy = dot(v_a, tanh(W_keys(h_enc) + W_query(h_dec) + W_fil(f) + b_a)) ############################################################################# Args: W_query: Tensor, shape '[batch_size, 1, attention_dim]' to compare to location features. W_location: processed previous alignments into location features, shape '[batch_size, max_time, attention_dim]' W_keys: Tensor, shape '[batch_size, max_time, attention_dim]', typically the encoder outputs. 
Returns: A '[batch_size, max_time]' attention score (energy) """ # Get the number of hidden units from the trailing dimension of keys dtype = W_query.dtype num_units = W_keys.shape[-1].value or array_ops.shape(W_keys)[-1] v_a = tf.get_variable( 'attention_variable_projection', shape=[num_units], dtype=dtype, initializer=tf.contrib.layers.xavier_initializer()) b_a = tf.get_variable( 'attention_bias', shape=[num_units], dtype=dtype, initializer=tf.zeros_initializer()) return tf.reduce_sum(v_a * tf.tanh(W_keys + W_query + W_fil + b_a), [2]) def _smoothing_normalization(e): """Applies a smoothing normalization function instead of softmax Introduced in: J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Ben- gio, “Attention-based models for speech recognition,” in Ad- vances in Neural Information Processing Systems, 2015, pp. 577–585. ############################################################################ Smoothing normalization function a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j})) ############################################################################ Args: e: matrix [batch_size, max_time(memory_time)]: expected to be energy (score) values of an attention mechanism Returns: matrix [batch_size, max_time]: [0, 1] normalized alignments with possible attendance to multiple memory time steps. """ return tf.nn.sigmoid(e) / tf.reduce_sum(tf.nn.sigmoid(e), axis=-1, keepdims=True) class GmmAttention(AttentionMechanism): def __init__(self, num_mixtures, memory, memory_sequence_length=None, check_inner_dims_defined=True, score_mask_value=None, name='GmmAttention'): self.dtype = memory.dtype self.num_mixtures = num_mixtures self.query_layer = tf.layers.Dense(3 * num_mixtures, name='gmm_query_projection', use_bias=True, dtype=self.dtype) with tf.name_scope(name, 'GmmAttentionMechanismInit'): if score_mask_value is None: score_mask_value = 0. self._maybe_mask_score = functools.partial( _maybe_mask_score, memory_sequence_length=memory_sequence_length, score_mask_value=score_mask_value) self._value = _prepare_memory( memory, memory_sequence_length, check_inner_dims_defined) self._batch_size = ( self._value.shape[0].value or tf.shape(self._value)[0]) self._alignments_size = ( self._value.shape[1].value or tf.shape(self._value)[1]) @property def values(self): return self._value @property def batch_size(self): return self._batch_size @property def alignments_size(self): return self._alignments_size @property def state_size(self): return self.num_mixtures def initial_alignments(self, batch_size, dtype): max_time = self._alignments_size return _zero_state_tensors(max_time, batch_size, dtype) def initial_state(self, batch_size, dtype): state_size_ = self.state_size return _zero_state_tensors(state_size_, batch_size, dtype) def __call__(self, query, state): with tf.variable_scope("GmmAttention"): previous_kappa = state params = self.query_layer(query) # query(dec_rnn_size=256) , params(num_mixtures(256)*3) alpha_hat, beta_hat, kappa_hat = tf.split(params, num_or_size_splits=3, axis=1) # [batch_size, num_mixtures, 1] alpha = tf.expand_dims(tf.exp(alpha_hat), axis=2) # softmax makes the alpha value more stable. 
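# Illustrative sketch (not from the original code): this is Graves-style GMM attention.
# For every decoder step the alignment over encoder positions j is
#   phi(j) = sum_k alpha_k * exp(-beta_k * (kappa_k - j)^2),
# and kappa is previous_kappa + exp(kappa_hat), so the window centres can only move
# forward, which keeps the attention monotonic. One-component toy window:
_demo_j = np.arange(5, dtype=np.float32)                       # encoder positions 0..4
_demo_phi = 1.0 * np.exp(-0.5 * (2.0 - _demo_j) ** 2)          # bump centred on position 2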
# alpha = tf.expand_dims(tf.nn.softmax(alpha_hat, axis=1), axis=2) beta = tf.expand_dims(tf.exp(beta_hat), axis=2) kappa = tf.expand_dims(previous_kappa + tf.exp(kappa_hat), axis=2) # [1, 1, max_input_steps] mu = tf.reshape(tf.cast(tf.range(self.alignments_size), dtype=tf.float32), shape=[1, 1, self.alignments_size]) # [[[0,1,2,...]]] # [batch_size, max_input_steps] phi = tf.reduce_sum(alpha * tf.exp(-beta * (kappa - mu) ** 2.), axis=1) alignments = self._maybe_mask_score(phi) state = tf.squeeze(kappa, axis=2) return alignments, state ================================================ FILE: tacotron2/tacotron2.py ================================================ # coding: utf-8 # Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py """ 모델 수정 1. prenet에서 dropout 적용 오류 수정 2. AttentionWrapper 적용 순서 오류 수정: keith ito 코드는 잘 구현되어 있음 3. BahdanauMonotonicAttention에서 normalize=True적용(2018년9월11일 적용) 4. BahdanauMonotonicAttention에서 memory_sequence_length 입력 5. synhesizer.py input_lengths 계산오류. +1 해야 함. """ import numpy as np import tensorflow as tf from tensorflow.contrib.seq2seq import BasicDecoder, BahdanauAttention, BahdanauMonotonicAttention,LuongAttention from tensorflow.contrib.rnn import GRUCell, MultiRNNCell, OutputProjectionWrapper, ResidualWrapper,LSTMStateTuple from utils.infolog import log from text.symbols import symbols from .modules import * from .helpers import TacoTestHelper, TacoTrainingHelper from .rnn_wrappers import LocationSensitiveAttention,GmmAttention,ZoneoutLSTMCell,DecoderWrapper class Tacotron2(): def __init__(self, hparams): self._hparams = hparams def initialize(self, inputs, input_lengths, num_speakers, speaker_id=None,mel_targets=None, linear_targets=None, is_training= False,loss_coeff=None,stop_token_targets=None): with tf.variable_scope('Eembedding') as scope: hp = self._hparams batch_size = tf.shape(inputs)[0] # Embeddings(256) char_embed_table = tf.get_variable('inputs_embedding', [len(symbols), hp.embedding_size], dtype=tf.float32,initializer=tf.truncated_normal_initializer(stddev=0.5)) zero_pad = True if zero_pad: # transformer에 구현되어 있는 거 보고, 가져온 로직. # 0 은 embedding이 0으로 고정되고, train으로 변하지 않는다. 
즉, 위의 get_variable에서 잡았던 변수의 첫번째 행()에 대응되는 것은 사용되지 않는 것이다) char_embed_table = tf.concat((tf.zeros(shape=[1, hp.embedding_size]),char_embed_table[1:, :]), 0) # [N, T_in, embedding_size] char_embedded_inputs = tf.nn.embedding_lookup(char_embed_table, inputs) self.num_speakers = num_speakers if self.num_speakers > 1: speaker_embed_table = tf.get_variable('speaker_embedding',[self.num_speakers, hp.speaker_embedding_size], dtype=tf.float32,initializer=tf.truncated_normal_initializer(stddev=0.5)) # [N, T_in, speaker_embedding_size] speaker_embed = tf.nn.embedding_lookup(speaker_embed_table, speaker_id) deep_dense = lambda x, dim,name: tf.layers.dense(x, dim, activation=tf.nn.softsign,name=name) # softsign: x / (abs(x) + 1) encoder_rnn_init_state = deep_dense( speaker_embed, hp.encoder_lstm_units * 4,'encoder_init_dense') # hp.encoder_lstm_units = 256 decoder_rnn_init_states = [deep_dense(speaker_embed, hp.decoder_lstm_units*2,'decoder_init_dense_{}'.format(i)) for i in range(hp.decoder_layers)] # hp.decoder_lstm_units = 1024 speaker_embed = None else: # self.num_speakers =1인 경우 speaker_embed = None encoder_rnn_init_state = None # bidirectional GRU의 init state attention_rnn_init_state = None decoder_rnn_init_states = None with tf.variable_scope('Encoder') as scope: ############## # Encoder ############## x = char_embedded_inputs for i in range(hp.enc_conv_num_layers): x = tf.layers.conv1d(x,filters=hp.enc_conv_channels,kernel_size=hp.enc_conv_kernel_size,padding='same',activation=tf.nn.relu,name='Encoder_{}'.format(i)) x = tf.layers.batch_normalization(x, training=is_training) x = tf.layers.dropout(x, rate=hp.dropout_prob, training=is_training, name='dropout_{}'.format(i)) if encoder_rnn_init_state is not None: initial_state_fw_c,initial_state_fw_h, initial_state_bw_c,initial_state_bw_h = tf.split(encoder_rnn_init_state, 4, 1) initial_state_fw = LSTMStateTuple(initial_state_fw_c,initial_state_fw_h) initial_state_bw = LSTMStateTuple(initial_state_bw_c,initial_state_bw_h) else: # single mode initial_state_fw, initial_state_bw = None, None cell_fw= ZoneoutLSTMCell(hp.encoder_lstm_units, is_training,zoneout_factor_cell=hp.tacotron_zoneout_rate,zoneout_factor_output=hp.tacotron_zoneout_rate,name='encoder_fw_LSTM') cell_bw= ZoneoutLSTMCell(hp.encoder_lstm_units, is_training,zoneout_factor_cell=hp.tacotron_zoneout_rate,zoneout_factor_output=hp.tacotron_zoneout_rate,name='encoder_fw_LSTM') encoder_conv_output = x outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,encoder_conv_output,sequence_length=input_lengths, initial_state_fw=initial_state_fw,initial_state_bw=initial_state_bw,dtype=tf.float32) # envoder_outpust = [N,T,2*encoder_lstm_units] = [N,T,512] encoder_outputs = tf.concat(outputs, axis=2) # Concat and return forward + backward outputs with tf.variable_scope('Decoder') as scope: ############## # Attention ############## if hp.attention_type == 'bah_mon': attention_mechanism = BahdanauMonotonicAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths,normalize=False) elif hp.attention_type == 'bah_mon_norm': # hccho 추가 attention_mechanism = BahdanauMonotonicAttention(hp.attention_size, encoder_outputs,memory_sequence_length = input_lengths, normalize=True) elif hp.attention_type == 'loc_sen': # Location Sensitivity Attention attention_mechanism = LocationSensitiveAttention(hp.attention_size, encoder_outputs,hparams=hp, is_training=is_training, mask_encoder=hp.mask_encoder,memory_sequence_length = 
input_lengths,smoothing=hp.smoothing,cumulate_weights=hp.cumulative_weights) elif hp.attention_type == 'gmm': # GMM Attention attention_mechanism = GmmAttention(hp.attention_size, memory=encoder_outputs,memory_sequence_length = input_lengths) elif hp.attention_type == 'bah_norm': attention_mechanism = BahdanauAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths, normalize=True) elif hp.attention_type == 'luong_scaled': attention_mechanism = LuongAttention( hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths, scale=True) elif hp.attention_type == 'luong': attention_mechanism = LuongAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths) elif hp.attention_type == 'bah': attention_mechanism = BahdanauAttention(hp.attention_size, encoder_outputs,memory_sequence_length=input_lengths) else: raise Exception(" [!] Unkown attention type: {}".format(hp.attention_type)) decoder_lstm = [ZoneoutLSTMCell(hp.decoder_lstm_units, is_training,zoneout_factor_cell=hp.tacotron_zoneout_rate, zoneout_factor_output=hp.tacotron_zoneout_rate,name='decoder_LSTM_{}'.format(i+1)) for i in range(hp.decoder_layers)] decoder_lstm = tf.contrib.rnn.MultiRNNCell(decoder_lstm, state_is_tuple=True) decoder_init_state = decoder_lstm.zero_state(batch_size=batch_size, dtype=tf.float32) # 여기서 zero_state를 부르면, 위의 AttentionWrapper에서 이미 넣은 준 값도 포함되어 있다. if hp.model_type == "multi-speaker": decoder_init_state = list(decoder_init_state) for idx, cell in enumerate(decoder_rnn_init_states): shape1 = decoder_init_state[idx][0].get_shape().as_list() shape2 = cell.get_shape().as_list() if shape1[1]*2 != shape2[1]: raise Exception(" [!] Shape {} and {} should be equal".format(shape1, shape2)) c,h = tf.split(cell,2,1) decoder_init_state[idx] = LSTMStateTuple(c,h) decoder_init_state = tuple(decoder_init_state) attention_cell = AttentionWrapper(decoder_lstm,attention_mechanism, initial_cell_state=decoder_init_state, alignment_history=True,output_attention=False) # output_attention=False 에 주목, attention_layer_size에 값을 넣지 않았다. 그래서 attention = contex vector가 된다. 
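# Illustrative sketch (not from the original code): with output_attention=False the
# AttentionWrapper returns the raw decoder-LSTM output, DecoderWrapper concatenates the
# context vector onto it, and OutputProjectionWrapper (below) maps that to
# (num_mels + 1) * reduction_factor values per decoder step: reduction_factor mel frames
# plus reduction_factor stop-token logits, which are split apart again further down.
# Toy shape check with made-up sizes (num_mels=80, r=2, 2 utterances, 7 decoder steps):
_demo_dec = np.zeros([2, 7, (80 + 1) * 2])                     # [N, iters, (num_mels+1)*r]
_demo_mel_out = _demo_dec[:, :, :80 * 2].reshape(2, -1, 80)    # -> [N, iters*r, num_mels]
_demo_stop = _demo_dec[:, :, 80 * 2:].reshape(2, -1)           # -> [N, iters*r]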
# attention_state_size = 256 # Decoder input -> prenet -> decoder_lstm -> concat[output, attention] dec_prenet_outputs = DecoderWrapper(attention_cell , is_training, hp.dec_prenet_sizes, hp.dropout_prob,hp.inference_prenet_dropout) dec_outputs_cell = OutputProjectionWrapper(dec_prenet_outputs,(hp.num_mels+1) * hp.reduction_factor) if is_training: helper = TacoTrainingHelper(mel_targets, hp.num_mels, hp.reduction_factor) # inputs은 batch_size 계산에만 사용됨 else: helper = TacoTestHelper(batch_size, hp.num_mels, hp.reduction_factor) decoder_init_state = dec_outputs_cell.zero_state(batch_size=batch_size, dtype=tf.float32) (decoder_outputs, _), final_decoder_state, _ = \ tf.contrib.seq2seq.dynamic_decode(BasicDecoder(dec_outputs_cell, helper, decoder_init_state),maximum_iterations=int(hp.max_n_frame/hp.reduction_factor)) # max_iters=200 decoder_mel_outputs = tf.reshape(decoder_outputs[:,:,:hp.num_mels * hp.reduction_factor], [batch_size, -1, hp.num_mels]) # [N,iters,400] -> [N,5*iters,80] stop_token_outputs = tf.reshape(decoder_outputs[:,:,hp.num_mels * hp.reduction_factor:], [batch_size, -1]) # [N,iters] # Postnet x = decoder_mel_outputs for i in range(hp.postnet_num_layers): activation = tf.nn.tanh if i != (hp.postnet_num_layers-1) else None x = tf.layers.conv1d(x,filters=hp.postnet_channels,kernel_size=hp.postnet_kernel_size,padding='same',activation=activation,name='Postnet_{}'.format(i)) x = tf.layers.batch_normalization(x, training=is_training) x = tf.layers.dropout(x, rate=hp.dropout_prob, training=is_training, name='Postnet_dropout_{}'.format(i)) residual = tf.layers.dense(x,hp.num_mels,name='residual_projection') mel_outputs = decoder_mel_outputs + residual # Add post-processing CBHG: # mel_outputs: (N,T,num_mels) post_outputs = cbhg(mel_outputs, None, is_training,hp.post_bank_size, hp.post_bank_channel_size, hp.post_maxpool_width, hp.post_highway_depth, hp.post_rnn_size, hp.post_proj_sizes, hp.post_proj_width,scope='post_cbhg') linear_outputs = tf.layers.dense(post_outputs, hp.num_freq,name='linear_spectogram_projection') # [N, T_out, F(1025)] # Grab alignments from the final decoder state: alignments = tf.transpose(final_decoder_state.alignment_history.stack(), [1, 2, 0]) # batch_size, text length(encoder), target length(decoder) self.inputs = inputs self.speaker_id = speaker_id self.input_lengths = input_lengths self.loss_coeff = loss_coeff self.decoder_mel_outputs = decoder_mel_outputs self.mel_outputs = mel_outputs self.linear_outputs = linear_outputs self.alignments = alignments self.mel_targets = mel_targets self.linear_targets = linear_targets self.final_decoder_state = final_decoder_state self.stop_token_targets = stop_token_targets self.stop_token_outputs = stop_token_outputs self.all_vars = tf.trainable_variables() log('='*40) log(' model_type: %s' % hp.model_type) log('='*40) log('Initialized Tacotron model. 
Dimensions: ') log(' embedding: %d' % char_embedded_inputs.shape[-1]) log(' encoder conv out: %d' % encoder_conv_output.shape[-1]) log(' encoder out: %d' % encoder_outputs.shape[-1]) log(' attention out: %d' % attention_cell.output_size) log(' decoder prenet lstm concat out : %d' % dec_prenet_outputs.output_size) log(' decoder cell out: %d' % dec_outputs_cell.output_size) log(' decoder out (%d frames): %d' % (hp.reduction_factor, decoder_outputs.shape[-1])) log(' decoder mel out: %d' % decoder_mel_outputs.shape[-1]) log(' mel out: %d' % mel_outputs.shape[-1]) log(' postnet out: %d' % post_outputs.shape[-1]) log(' linear out: %d' % linear_outputs.shape[-1]) log(' Tacotron Parameters {:.3f} Million.'.format(np.sum([np.prod(v.get_shape().as_list()) for v in self.all_vars]) / 1000000)) def add_loss(self): '''Adds loss to the model. Sets "loss" field. initialize must have been called.''' with tf.variable_scope('loss') as scope: hp = self._hparams before = tf.squared_difference(self.mel_targets, self.decoder_mel_outputs) after = tf.squared_difference(self.mel_targets, self.mel_outputs) mel_loss = before+after stop_token_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.stop_token_targets, logits=self.stop_token_outputs)) l1 = tf.abs(self.linear_targets - self.linear_outputs) expanded_loss_coeff = tf.expand_dims(tf.expand_dims(self.loss_coeff, [-1]), [-1]) regularization_loss = tf.reduce_mean([tf.nn.l2_loss(v) for v in self.all_vars if not('bias' in v.name or 'Bias' in v.name or 'projection' in v.name or 'inputs_embedding' in v.name or 'speaker_embedding' in v.name or 'dense' in v.name or 'RNN' in v.name or 'LSTM' in v.name)]) * hp.tacotron_reg_weight regularization_loss = 0 if hp.prioritize_loss: # Prioritize loss for frequencies. upper_priority_freq = int(5000 / (hp.sample_rate * 0.5) * hp.num_freq) lower_priority_freq = int(165 / (hp.sample_rate * 0.5) * hp.num_freq) l1_priority= l1[:,:,lower_priority_freq:upper_priority_freq] self.loss = tf.reduce_mean(mel_loss * expanded_loss_coeff) + \ 0.5 * tf.reduce_mean(l1 * expanded_loss_coeff) + 0.5 * tf.reduce_mean(l1_priority * expanded_loss_coeff) + stop_token_loss + regularization_loss self.linear_loss = tf.reduce_mean( 0.5 * (tf.reduce_mean(l1) + tf.reduce_mean(l1_priority))) else: self.loss = tf.reduce_mean(mel_loss * expanded_loss_coeff) + tf.reduce_mean(l1 * expanded_loss_coeff) + stop_token_loss + regularization_loss # 이 loss는 사용하지 않고, 아래의 loss_without_coeff를 사용함 self.linear_loss = tf.reduce_mean(l1) self.mel_loss = tf.reduce_mean(mel_loss) self.loss_without_coeff = self.mel_loss + self.linear_loss + stop_token_loss + regularization_loss def add_optimizer(self, global_step): '''Adds optimizer. Sets "gradients" and "optimize" fields. add_loss must have been called. Args: global_step: int32 scalar Tensor representing current global step in training ''' with tf.variable_scope('optimizer') as scope: hp = self._hparams if hp.tacotron_decay_learning_rate: self.decay_steps = hp.tacotron_decay_steps self.decay_rate = hp.tacotron_decay_rate self.learning_rate = self._learning_rate_decay(hp.tacotron_initial_learning_rate, global_step) else: self.learning_rate = tf.convert_to_tensor(hp.tacotron_initial_learning_rate) optimizer = tf.train.AdamOptimizer(self.learning_rate, hp.adam_beta1, hp.adam_beta2) gradients, variables = zip(*optimizer.compute_gradients(self.loss)) self.gradients = gradients clipped_gradients, _ = tf.clip_by_global_norm(gradients, 1.0) # Add dependency on UPDATE_OPS; otherwise batchnorm won't work correctly. 
See: # https://github.com/tensorflow/tensorflow/issues/1122 with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): self.optimize = optimizer.apply_gradients(zip(clipped_gradients, variables),global_step=global_step) def _learning_rate_decay(self, init_lr, global_step): ################################################################# # Narrow Exponential Decay: # Phase 1: lr = 1e-3 # We only start learning rate decay after 50k steps # Phase 2: lr in ]1e-5, 1e-3[ # decay reach minimal value at step 310k # Phase 3: lr = 1e-5 # clip by minimal learning rate value (step > 310k) ################################################################# hp = self._hparams #Compute natural exponential decay lr = tf.train.exponential_decay(init_lr, global_step - hp.tacotron_start_decay, #lr = 1e-3 at step 50k self.decay_steps, self.decay_rate, #lr = 1e-5 around step 310k name='lr_exponential_decay') #clip learning rate by max and min values (initial and final values) return tf.minimum(tf.maximum(lr, hp.tacotron_final_learning_rate), init_lr) ================================================ FILE: text/__init__.py ================================================ # coding: utf-8 import re import string import numpy as np from text import cleaners from hparams import hparams from text.symbols import symbols, en_symbols, PAD, EOS from text.korean import jamo_to_korean # Mappings from symbol to numeric ID and vice versa: _symbol_to_id = {s: i for i, s in enumerate(symbols)} # 80개 _id_to_symbol = {i: s for i, s in enumerate(symbols)} isEn=False # Regular expression matching text enclosed in curly braces: _curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)') puncuation_table = str.maketrans({key: None for key in string.punctuation}) def convert_to_en_symbols(): '''Converts built-in korean symbols to english, to be used for english training ''' global _symbol_to_id, _id_to_symbol, isEn if not isEn: print(" [!] Converting to english mode") _symbol_to_id = {s: i for i, s in enumerate(en_symbols)} _id_to_symbol = {i: s for i, s in enumerate(en_symbols)} isEn=True def remove_puncuations(text): return text.translate(puncuation_table) def text_to_sequence(text, as_token=False): cleaner_names = [x.strip() for x in hparams.cleaners.split(',')] if ('english_cleaners' in cleaner_names) and isEn==False: convert_to_en_symbols() return _text_to_sequence(text, cleaner_names, as_token) def _text_to_sequence(text, cleaner_names, as_token): '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. The text can optionally have ARPAbet sequences enclosed in curly braces embedded in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street." 
Args: text: string to convert to a sequence cleaner_names: names of the cleaner functions to run the text through Returns: List of integers corresponding to the symbols in the text ''' sequence = [] # Check for curly braces and treat their contents as ARPAbet: while len(text): m = _curly_re.match(text) if not m: sequence += _symbols_to_sequence(_clean_text(text, cleaner_names)) break sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names)) sequence += _arpabet_to_sequence(m.group(2)) text = m.group(3) # Append EOS token sequence.append(_symbol_to_id[EOS]) # [14, 29, 45, 2, 27, 62, 20, 21, 4, 39, 45, 1] if as_token: return sequence_to_text(sequence, combine_jamo=True) else: return np.array(sequence, dtype=np.int32) def sequence_to_text(sequence, skip_eos_and_pad=False, combine_jamo=False): '''Converts a sequence of IDs back to a string''' cleaner_names=[x.strip() for x in hparams.cleaners.split(',')] if 'english_cleaners' in cleaner_names and isEn==False: convert_to_en_symbols() result = '' for symbol_id in sequence: if symbol_id in _id_to_symbol: s = _id_to_symbol[symbol_id] # Enclose ARPAbet back in curly braces: if len(s) > 1 and s[0] == '@': s = '{%s}' % s[1:] if not skip_eos_and_pad or s not in [EOS, PAD]: result += s result = result.replace('}{', ' ') if combine_jamo: return jamo_to_korean(result) else: return result def _clean_text(text, cleaner_names): for name in cleaner_names: cleaner = getattr(cleaners, name) if not cleaner: raise Exception('Unknown cleaner: %s' % name) text = cleaner(text) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~'] return text def _symbols_to_sequence(symbols): return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)] def _arpabet_to_sequence(text): return _symbols_to_sequence(['@' + s for s in text.split()]) def _should_keep_symbol(s): return s in _symbol_to_id and s is not '_' and s is not '~' ================================================ FILE: text/cleaners.py ================================================ # coding: utf-8 # Code based on https://github.com/keithito/tacotron/blob/master/text/cleaners.py ''' Cleaners are transformations that run over the input text at both training and eval time. Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners" hyperparameter. Some cleaners are English-specific. You'll typically want to use: 1. "english_cleaners" for English text 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using the Unidecode library (https://pypi.python.org/pypi/Unidecode) 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update the symbols in symbols.py to match your data). ''' import re from .korean import tokenize as ko_tokenize # Added to support LJ_speech from unidecode import unidecode from .en_numbers import normalize_numbers as en_normalize_numbers # Regular expression matching whitespace: _whitespace_re = re.compile(r'\s+') def korean_cleaners(text): '''Pipeline for Korean text, including number and abbreviation expansion.''' text = ko_tokenize(text) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~'] return text # List of (regular expression, replacement) pairs for abbreviations: _abbreviations = [(re.compile('\\b%s\\.' 
% x[0], re.IGNORECASE), x[1]) for x in [ ('mrs', 'misess'), ('mr', 'mister'), ('dr', 'doctor'), ('st', 'saint'), ('co', 'company'), ('jr', 'junior'), ('maj', 'major'), ('gen', 'general'), ('drs', 'doctors'), ('rev', 'reverend'), ('lt', 'lieutenant'), ('hon', 'honorable'), ('sgt', 'sergeant'), ('capt', 'captain'), ('esq', 'esquire'), ('ltd', 'limited'), ('col', 'colonel'), ('ft', 'fort'), ]] def expand_abbreviations(text): for regex, replacement in _abbreviations: text = re.sub(regex, replacement, text) return text def expand_numbers(text): return en_normalize_numbers(text) def lowercase(text): return text.lower() def collapse_whitespace(text): return re.sub(_whitespace_re, ' ', text) def convert_to_ascii(text): '''Converts to ascii, existed in keithito but deleted in carpedm20''' return unidecode(text) def basic_cleaners(text): '''Basic pipeline that lowercases and collapses whitespace without transliteration.''' text = lowercase(text) text = collapse_whitespace(text) return text def transliteration_cleaners(text): '''Pipeline for non-English text that transliterates to ASCII.''' text = convert_to_ascii(text) text = lowercase(text) text = collapse_whitespace(text) return text def english_cleaners(text): '''Pipeline for English text, including number and abbreviation expansion.''' text = convert_to_ascii(text) text = lowercase(text) text = expand_numbers(text) text = expand_abbreviations(text) text = collapse_whitespace(text) return text ================================================ FILE: text/en_numbers.py ================================================ import inflect import re _inflect = inflect.engine() _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') _number_re = re.compile(r'[0-9]+') def _remove_commas(m): return m.group(1).replace(',', '') def _expand_decimal_point(m): return m.group(1).replace('.', ' point ') def _expand_dollars(m): match = m.group(1) parts = match.split('.') if len(parts) > 2: return match + ' dollars' # Unexpected format dollars = int(parts[0]) if parts[0] else 0 cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 if dollars and cents: dollar_unit = 'dollar' if dollars == 1 else 'dollars' cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) elif dollars: dollar_unit = 'dollar' if dollars == 1 else 'dollars' return '%s %s' % (dollars, dollar_unit) elif cents: cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s' % (cents, cent_unit) else: return 'zero dollars' def _expand_ordinal(m): return _inflect.number_to_words(m.group(0)) def _expand_number(m): num = int(m.group(0)) if num > 1000 and num < 3000: if num == 2000: return 'two thousand' elif num > 2000 and num < 2010: return 'two thousand ' + _inflect.number_to_words(num % 100) elif num % 100 == 0: return _inflect.number_to_words(num // 100) + ' hundred' else: return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') else: return _inflect.number_to_words(num, andword='') def normalize_numbers(text): text = re.sub(_comma_number_re, _remove_commas, text) text = re.sub(_pounds_re, r'\1 pounds', text) text = re.sub(_dollars_re, _expand_dollars, text) text = re.sub(_decimal_number_re, _expand_decimal_point, text) text = re.sub(_ordinal_re, _expand_ordinal, text) text = re.sub(_number_re, _expand_number, text) 
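# Note on ordering (illustrative): pounds/dollars are substituted before plain decimals and
# bare numbers, so e.g. "$2.50" is read as "2 dollars, 50 cents" rather than as a decimal.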
return text ================================================ FILE: text/english.py ================================================ # Code from https://github.com/keithito/tacotron/blob/master/util/numbers.py import inflect _inflect = inflect.engine() _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') _number_re = re.compile(r'[0-9]+') def _remove_commas(m): return m.group(1).replace(',', '') def _expand_decimal_point(m): return m.group(1).replace('.', ' point ') def _expand_dollars(m): match = m.group(1) parts = match.split('.') if len(parts) > 2: return match + ' dollars' # Unexpected format dollars = int(parts[0]) if parts[0] else 0 cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 if dollars and cents: dollar_unit = 'dollar' if dollars == 1 else 'dollars' cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) elif dollars: dollar_unit = 'dollar' if dollars == 1 else 'dollars' return '%s %s' % (dollars, dollar_unit) elif cents: cent_unit = 'cent' if cents == 1 else 'cents' return '%s %s' % (cents, cent_unit) else: return 'zero dollars' def _expand_ordinal(m): return _inflect.number_to_words(m.group(0)) def _expand_number(m): num = int(m.group(0)) if num > 1000 and num < 3000: if num == 2000: return 'two thousand' elif num > 2000 and num < 2010: return 'two thousand ' + _inflect.number_to_words(num % 100) elif num % 100 == 0: return _inflect.number_to_words(num // 100) + ' hundred' else: return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') else: return _inflect.number_to_words(num, andword='') def normalize(text): text = re.sub(_comma_number_re, _remove_commas, text) text = re.sub(_pounds_re, r'\1 pounds', text) text = re.sub(_dollars_re, _expand_dollars, text) text = re.sub(_decimal_number_re, _expand_decimal_point, text) text = re.sub(_ordinal_re, _expand_ordinal, text) text = re.sub(_number_re, _expand_number, text) return text ================================================ FILE: text/ko_dictionary.py ================================================ # coding: utf-8 etc_dictionary = { '2 30대': '이삼십대', '20~30대': '이삼십대', '20, 30대': '이십대 삼십대', '1+1': '원플러스원', '3에서 6개월인': '3개월에서 육개월인', } english_dictionary = { 'Devsisters': '데브시스터즈', 'track': '트랙', # krbook 'LA': '엘에이', 'LG': '엘지', 'KOREA': '코리아', 'JSA': '제이에스에이', 'PGA': '피지에이', 'GA': '지에이', 'idol': '아이돌', 'KTX': '케이티엑스', 'AC': '에이씨', 'DVD': '디비디', 'US': '유에스', 'CNN': '씨엔엔', 'LPGA': '엘피지에이', 'P': '피', 'L': '엘', 'T': '티', 'B': '비', 'C': '씨', 'BIFF': '비아이에프에프', 'GV': '지비', # JTBC 'IT': '아이티', 'IQ': '아이큐', 'JTBC': '제이티비씨', 'trickle down effect': '트리클 다운 이펙트', 'trickle up effect': '트리클 업 이펙트', 'down': '다운', 'up': '업', 'FCK': '에프씨케이', 'AP': '에이피', 'WHERETHEWILDTHINGSARE': '', 'Rashomon Effect': '', 'O': '오', 'OO': '오오', 'B': '비', 'GDP': '지디피', 'CIPA': '씨아이피에이', 'YS': '와이에스', 'Y': '와이', 'S': '에스', 'JTBC': '제이티비씨', 'PC': '피씨', 'bill': '빌', 'Halmuny': '하모니', ##### 'X': '엑스', 'SNS': '에스엔에스', 'ability': '어빌리티', 'shy': '', 'CCTV': '씨씨티비', 'IT': '아이티', 'the tenth man': '더 텐쓰 맨', #### 'L': '엘', 'PC': '피씨', 'YSDJJPMB': '', ######## 'Content Attitude Timing': '컨텐트 애티튜드 타이밍', 'CAT': '캣', 'IS': '아이에스', 'SNS': '에스엔에스', 'K': '케이', 'Y': '와이', 'KDI': '케이디아이', 'DOC': '디오씨', 'CIA': '씨아이에이', 'PBS': '피비에스', 'D': '디', 'PPropertyPositionPowerPrisonP' 'S': '에스', 
'francisco': '프란시스코', 'I': '아이', 'III': '아이아이', ###### 'No joke': '노 조크', 'BBK': '비비케이', 'LA': '엘에이', 'Don': '', 't worry be happy': ' 워리 비 해피', 'NO': '엔오', ##### 'it was our sky': '잇 워즈 아워 스카이', 'it is our sky': '잇 이즈 아워 스카이', #### 'NEIS': '엔이아이에스', ##### 'IMF': '아이엠에프', 'apology': '어폴로지', 'humble': '험블', 'M': '엠', 'Nowhere Man': '노웨어 맨', 'The Tenth Man': '더 텐쓰 맨', 'PBS': '피비에스', 'BBC': '비비씨', 'MRJ': '엠알제이', 'CCTV': '씨씨티비', 'Pick me up': '픽 미 업', 'DNA': '디엔에이', 'UN': '유엔', 'STOP': '스탑', ##### 'PRESS': '프레스', ##### 'not to be': '낫 투비', 'Denial': '디나이얼', 'G': '지', 'IMF': '아이엠에프', 'GDP': '지디피', 'JTBC': '제이티비씨', 'Time flies like an arrow': '타임 플라이즈 라이크 언 애로우', 'DDT': '디디티', 'AI': '에이아이', 'Z': '제트', 'OECD': '오이씨디', 'N': '앤', 'A': '에이', 'MB': '엠비', 'EH': '이에이치', 'IS': '아이에스', 'TV': '티비', 'MIT': '엠아이티', 'KBO': '케이비오', 'I love America': '아이 러브 아메리카', 'SF': '에스에프', 'Q': '큐', 'KFX': '케이에프엑스', 'PM': '피엠', 'Prime Minister': '프라임 미니스터', 'Swordline': '스워드라인', 'TBS': '티비에스', 'DDT': '디디티', 'CS': '씨에스', 'Reflecting Absence': '리플렉팅 앱센스', 'PBS': '피비에스', 'Drum being beaten by everyone': '드럼 빙 비튼 바이 에브리원', 'negative pressure': '네거티브 프레셔', 'F': '에프', 'KIA': '기아', 'FTA': '에프티에이', 'Que sais-je': '', 'UFC': '유에프씨', 'P': '피', 'DJ': '디제이', 'Chaebol': '채벌', 'BBC': '비비씨', 'OECD': '오이씨디', 'BC': '삐씨', 'C': '씨', 'B': '씨', 'KY': '케이와이', 'K': '케이', 'CEO': '씨이오', 'YH': '와이에치', 'IS': '아이에스', 'who are you': '후 얼 유', 'Y': '와이', 'The Devils Advocate': '더 데빌즈 어드보카트', 'YS': '와이에스', 'so sorry': '쏘 쏘리', 'Santa': '산타', 'Big Endian': '빅 엔디안', 'Small Endian': '스몰 엔디안', 'Oh Captain My Captain': '오 캡틴 마이 캡틴', 'AIB': '에이아이비', 'K': '케이', 'PBS': '피비에스', } ================================================ FILE: text/korean.py ================================================ # coding: utf-8 # Code based on import re import os import ast import json from jamo import hangul_to_jamo, h2j, j2h from .ko_dictionary import english_dictionary, etc_dictionary PAD = '_' EOS = '~' PUNC = '!\'(),-.:;?' 
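# The jamo symbol set defined below is built from the Unicode Hangul Jamo block:
# leads U+1100-U+1112, vowels U+1161-U+1175, tails U+11A8-U+11C2
# (lead and tail consonants look alike but are distinct code points).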
SPACE = ' ' JAMO_LEADS = "".join([chr(_) for _ in range(0x1100, 0x1113)]) JAMO_VOWELS = "".join([chr(_) for _ in range(0x1161, 0x1176)]) JAMO_TAILS = "".join([chr(_) for _ in range(0x11A8, 0x11C3)]) VALID_CHARS = JAMO_LEADS + JAMO_VOWELS + JAMO_TAILS + PUNC + SPACE ALL_SYMBOLS = PAD + EOS + VALID_CHARS char_to_id = {c: i for i, c in enumerate(ALL_SYMBOLS)} id_to_char = {i: c for i, c in enumerate(ALL_SYMBOLS)} quote_checker = """([`"'"“‘])(.+?)([`"'"”’])""" def is_lead(char): return char in JAMO_LEADS def is_vowel(char): return char in JAMO_VOWELS def is_tail(char): return char in JAMO_TAILS def get_mode(char): if is_lead(char): return 0 elif is_vowel(char): return 1 elif is_tail(char): return 2 else: return -1 def _get_text_from_candidates(candidates): if len(candidates) == 0: return "" elif len(candidates) == 1: return _jamo_char_to_hcj(candidates[0]) else: return j2h(**dict(zip(["lead", "vowel", "tail"], candidates))) def jamo_to_korean(text): text = h2j(text) idx = 0 new_text = "" candidates = [] while True: if idx >= len(text): new_text += _get_text_from_candidates(candidates) break char = text[idx] mode = get_mode(char) if mode == 0: new_text += _get_text_from_candidates(candidates) candidates = [char] elif mode == -1: new_text += _get_text_from_candidates(candidates) new_text += char candidates = [] else: candidates.append(char) idx += 1 return new_text num_to_kor = { '0': '영', '1': '일', '2': '이', '3': '삼', '4': '사', '5': '오', '6': '육', '7': '칠', '8': '팔', '9': '구', } unit_to_kor1 = { '%': '퍼센트', 'cm': '센치미터', 'mm': '밀리미터', 'km': '킬로미터', 'kg': '킬로그람', } unit_to_kor2 = { 'm': '미터', } upper_to_kor = { 'A': '에이', 'B': '비', 'C': '씨', 'D': '디', 'E': '이', 'F': '에프', 'G': '지', 'H': '에이치', 'I': '아이', 'J': '제이', 'K': '케이', 'L': '엘', 'M': '엠', 'N': '엔', 'O': '오', 'P': '피', 'Q': '큐', 'R': '알', 'S': '에스', 'T': '티', 'U': '유', 'V': '브이', 'W': '더블유', 'X': '엑스', 'Y': '와이', 'Z': '지', } def compare_sentence_with_jamo(text1, text2): return h2j(text1) != h2j(text) def tokenize(text, as_id=False): # jamo package에 있는 hangul_to_jamo를 이용하여 한글 string을 초성/중성/종성으로 나눈다. 
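# (i.e. normalize the text first, then split the Hangul string into lead/vowel/tail jamo with
#  hangul_to_jamo from the `jamo` package; an EOS symbol is appended at the end.)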
text = normalize(text) tokens = list(hangul_to_jamo(text)) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~'] if as_id: return [char_to_id[token] for token in tokens] + [char_to_id[EOS]] else: return [token for token in tokens] + [EOS] def tokenizer_fn(iterator): return (token for x in iterator for token in tokenize(x, as_id=False)) def normalize(text): text = text.strip() text = re.sub('\(\d+일\)', '', text) text = re.sub('\([⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]+\)', '', text) text = normalize_with_dictionary(text, etc_dictionary) text = normalize_english(text) text = re.sub('[a-zA-Z]+', normalize_upper, text) text = normalize_quote(text) text = normalize_number(text) return text def normalize_with_dictionary(text, dic): if any(key in text for key in dic.keys()): pattern = re.compile('|'.join(re.escape(key) for key in dic.keys())) return pattern.sub(lambda x: dic[x.group()], text) else: return text def normalize_english(text): def fn(m): word = m.group() if word in english_dictionary: return english_dictionary.get(word) else: return word text = re.sub("([A-Za-z]+)", fn, text) return text def normalize_upper(text): text = text.group(0) if all([char.isupper() for char in text]): return "".join(upper_to_kor[char] for char in text) else: return text def normalize_quote(text): def fn(found_text): from nltk import sent_tokenize # NLTK doesn't along with multiprocessing found_text = found_text.group() unquoted_text = found_text[1:-1] sentences = sent_tokenize(unquoted_text) return " ".join(["'{}'".format(sent) for sent in sentences]) return re.sub(quote_checker, fn, text) number_checker = "([+-]?\d[\d,]*)[\.]?\d*" count_checker = "(시|명|가지|살|마리|포기|송이|수|톨|통|점|개|벌|척|채|다발|그루|자루|줄|켤레|그릇|잔|마디|상자|사람|곡|병|판)" def normalize_number(text): text = normalize_with_dictionary(text, unit_to_kor1) text = normalize_with_dictionary(text, unit_to_kor2) text = re.sub(number_checker + count_checker, lambda x: number_to_korean(x, True), text) text = re.sub(number_checker, lambda x: number_to_korean(x, False), text) return text num_to_kor1 = [""] + list("일이삼사오육칠팔구") num_to_kor2 = [""] + list("만억조경해") num_to_kor3 = [""] + list("십백천") #count_to_kor1 = [""] + ["하나","둘","셋","넷","다섯","여섯","일곱","여덟","아홉"] count_to_kor1 = [""] + ["한","두","세","네","다섯","여섯","일곱","여덟","아홉"] count_tenth_dict = { "십": "열", "두십": "스물", "세십": "서른", "네십": "마흔", "다섯십": "쉰", "여섯십": "예순", "일곱십": "일흔", "여덟십": "여든", "아홉십": "아흔", } def number_to_korean(num_str, is_count=False): if is_count: num_str, unit_str = num_str.group(1), num_str.group(2) else: num_str, unit_str = num_str.group(), "" num_str = num_str.replace(',', '') num = ast.literal_eval(num_str) if num == 0: return "영" check_float = num_str.split('.') if len(check_float) == 2: digit_str, float_str = check_float elif len(check_float) >= 3: raise Exception(" [!] Wrong number format") else: digit_str, float_str = check_float[0], None if is_count and float_str is not None: raise Exception(" [!] 
`is_count` and float number does not fit each other") digit = int(digit_str) if digit_str.startswith("-"): digit, digit_str = abs(digit), str(abs(digit)) kor = "" size = len(str(digit)) tmp = [] for i, v in enumerate(digit_str, start=1): v = int(v) if v != 0: if is_count: tmp += count_to_kor1[v] else: tmp += num_to_kor1[v] tmp += num_to_kor3[(size - i) % 4] if (size - i) % 4 == 0 and len(tmp) != 0: kor += "".join(tmp) tmp = [] kor += num_to_kor2[int((size - i) / 4)] if is_count: if kor.startswith("한") and len(kor) > 1: kor = kor[1:] if any(word in kor for word in count_tenth_dict): kor = re.sub( '|'.join(count_tenth_dict.keys()), lambda x: count_tenth_dict[x.group()], kor) if not is_count and kor.startswith("일") and len(kor) > 1: kor = kor[1:] if float_str is not None: kor += "쩜 " kor += re.sub('\d', lambda x: num_to_kor[x.group()], float_str) if num_str.startswith("+"): kor = "플러스 " + kor elif num_str.startswith("-"): kor = "마이너스 " + kor return kor + unit_str if __name__ == "__main__": def test_normalize(text): print(text) print(normalize(text)) print("="*30) test_normalize("JTBC는 JTBCs를 DY는 A가 Absolute") test_normalize("오늘(13일) 3,600마리 강아지가") test_normalize("60.3%") test_normalize('"저돌"(猪突) 입니다.') test_normalize('비대위원장이 지난 1월 이런 말을 했습니다. “난 그냥 산돼지처럼 돌파하는 스타일이다”') test_normalize("지금은 -12.35%였고 종류는 5가지와 19가지, 그리고 55가지였다") test_normalize("JTBC는 TH와 K 양이 2017년 9월 12일 오후 12시에 24살이 된다") print(list(hangul_to_jamo(list(hangul_to_jamo('비대위원장이 지난 1월 이런 말을 했습니다? “난 그냥 산돼지처럼 돌파하는 스타일이다”'))))) ================================================ FILE: text/symbols.py ================================================ # coding: utf-8 ''' Defines the set of symbols used in text input to the model. The default is a set of ASCII characters that works well for English or text that has been run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. ''' from jamo import h2j, j2h from jamo.jamo import _jamo_char_to_hcj from .korean import ALL_SYMBOLS, PAD, EOS # For english en_symbols = PAD+EOS+'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'(),-.:;? ' #<-For deployment(Because korean ALL_SYMBOLS follow this convention) symbols = ALL_SYMBOLS # for korean """ 초성과 종성은 같아보이지만, 다른 character이다. '_~ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ!'(),-.:;? 
' '_': 0, '~': 1, 'ᄀ': 2, 'ᄁ': 3, 'ᄂ': 4, 'ᄃ': 5, 'ᄄ': 6, 'ᄅ': 7, 'ᄆ': 8, 'ᄇ': 9, 'ᄈ': 10, 'ᄉ': 11, 'ᄊ': 12, 'ᄋ': 13, 'ᄌ': 14, 'ᄍ': 15, 'ᄎ': 16, 'ᄏ': 17, 'ᄐ': 18, 'ᄑ': 19, 'ᄒ': 20, 'ᅡ': 21, 'ᅢ': 22, 'ᅣ': 23, 'ᅤ': 24, 'ᅥ': 25, 'ᅦ': 26, 'ᅧ': 27, 'ᅨ': 28, 'ᅩ': 29, 'ᅪ': 30, 'ᅫ': 31, 'ᅬ': 32, 'ᅭ': 33, 'ᅮ': 34, 'ᅯ': 35, 'ᅰ': 36, 'ᅱ': 37, 'ᅲ': 38, 'ᅳ': 39, 'ᅴ': 40, 'ᅵ': 41, 'ᆨ': 42, 'ᆩ': 43, 'ᆪ': 44, 'ᆫ': 45, 'ᆬ': 46, 'ᆭ': 47, 'ᆮ': 48, 'ᆯ': 49, 'ᆰ': 50, 'ᆱ': 51, 'ᆲ': 52, 'ᆳ': 53, 'ᆴ': 54, 'ᆵ': 55, 'ᆶ': 56, 'ᆷ': 57, 'ᆸ': 58, 'ᆹ': 59, 'ᆺ': 60, 'ᆻ': 61, 'ᆼ': 62, 'ᆽ': 63, 'ᆾ': 64, 'ᆿ': 65, 'ᇀ': 66, 'ᇁ': 67, 'ᇂ': 68, '!': 69, "'": 70, '(': 71, ')': 72, ',': 73, '-': 74, '.': 75, ':': 76, ';': 77, '?': 78, ' ': 79 """ ================================================ FILE: train_tacotron2.py ================================================ # coding: utf-8 import os import time import math import argparse import traceback import subprocess import numpy as np from jamo import h2j import tensorflow as tf from datetime import datetime from functools import partial from hparams import hparams, hparams_debug_string from tacotron2 import create_model, get_most_recent_checkpoint from utils import ValueWindow, prepare_dirs from utils import infolog, warning, plot, load_hparams from utils import get_git_revision_hash, get_git_diff, str2bool, parallel_run from utils.audio import save_wav, inv_spectrogram from text import sequence_to_text, text_to_sequence from datasets.datafeeder_tacotron2 import DataFeederTacotron2 import warnings warnings.simplefilter(action='ignore', category=FutureWarning) tf.logging.set_verbosity(tf.logging.ERROR) log = infolog.log def get_git_commit(): subprocess.check_output(['git', 'diff-index', '--quiet', 'HEAD']) # Verify client is clean commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()[:10] log('Git commit: %s' % commit) return commit def add_stats(model, model2=None, scope_name='train'): with tf.variable_scope(scope_name) as scope: summaries = [ tf.summary.scalar('loss_mel', model.mel_loss), tf.summary.scalar('loss_linear', model.linear_loss), tf.summary.scalar('loss', model.loss_without_coeff), ] if scope_name == 'train': gradient_norms = [tf.norm(grad) for grad in model.gradients if grad is not None] summaries.extend([ tf.summary.scalar('learning_rate', model.learning_rate), tf.summary.scalar('max_gradient_norm', tf.reduce_max(gradient_norms)), ]) if model2 is not None: with tf.variable_scope('gap_test-train') as scope: summaries.extend([ tf.summary.scalar('loss_mel', model.mel_loss - model2.mel_loss), tf.summary.scalar('loss_linear', model.linear_loss - model2.linear_loss), tf.summary.scalar('loss', model.loss_without_coeff - model2.loss_without_coeff), ]) return tf.summary.merge(summaries) def save_and_plot_fn(args, log_dir, step, loss, prefix): idx, (seq, spec, align) = args audio_path = os.path.join(log_dir, '{}-step-{:09d}-audio{:03d}.wav'.format(prefix, step, idx)) align_path = os.path.join(log_dir, '{}-step-{:09d}-align{:03d}.png'.format(prefix, step, idx)) waveform = inv_spectrogram(spec.T,hparams) save_wav(waveform, audio_path,hparams.sample_rate) info_text = 'step={:d}, loss={:.5f}'.format(step, loss) if 'korean_cleaners' in [x.strip() for x in hparams.cleaners.split(',')]: log('Training korean : Use jamo') plot.plot_alignment( align, align_path, info=info_text, text=sequence_to_text(seq,skip_eos_and_pad=True, combine_jamo=True), isKorean=True) else: log('Training non-korean : X use jamo') plot.plot_alignment(align, align_path, 
info=info_text,text=sequence_to_text(seq,skip_eos_and_pad=True, combine_jamo=False), isKorean=False) def save_and_plot(sequences, spectrograms,alignments, log_dir, step, loss, prefix): fn = partial(save_and_plot_fn,log_dir=log_dir, step=step, loss=loss, prefix=prefix) items = list(enumerate(zip(sequences, spectrograms, alignments))) parallel_run(fn, items, parallel=False) log('Test finished for step {}.'.format(step)) def train(log_dir, config): config.data_paths = config.data_paths # ['datasets/moon'] data_dirs = config.data_paths # ['datasets/moon\\data'] num_speakers = len(data_dirs) config.num_test = config.num_test_per_speaker * num_speakers # 2*1 if num_speakers > 1 and hparams.model_type not in ["multi-speaker", "simple"]: raise Exception("[!] Unkown model_type for multi-speaker: {}".format(config.model_type)) commit = get_git_commit() if config.git else 'None' checkpoint_path = os.path.join(log_dir, 'model.ckpt') # 'logdir-tacotron\\moon_2018-08-28_13-06-42\\model.ckpt' #log(' [*] git recv-parse HEAD:\n%s' % get_git_revision_hash()) # hccho: 주석 처리 log('='*50) #log(' [*] dit diff:\n%s' % get_git_diff()) log('='*50) log(' [*] Checkpoint path: %s' % checkpoint_path) log(' [*] Loading training data from: %s' % data_dirs) log(' [*] Using model: %s' % config.model_dir) # 'logdir-tacotron\\moon_2018-08-28_13-06-42' log(hparams_debug_string()) # Set up DataFeeder: coord = tf.train.Coordinator() with tf.variable_scope('datafeeder') as scope: # DataFeeder의 6개 placeholder: train_feeder.inputs, train_feeder.input_lengths, train_feeder.loss_coeff, train_feeder.mel_targets, train_feeder.linear_targets, train_feeder.speaker_id train_feeder = DataFeederTacotron2(coord, data_dirs, hparams, config, 32,data_type='train', batch_size=config.batch_size) test_feeder = DataFeederTacotron2(coord, data_dirs, hparams, config, 8, data_type='test', batch_size=config.num_test) # Set up model: global_step = tf.Variable(0, name='global_step', trainable=False) with tf.variable_scope('model') as scope: model = create_model(hparams) model.initialize(inputs=train_feeder.inputs, input_lengths=train_feeder.input_lengths,num_speakers=num_speakers,speaker_id=train_feeder.speaker_id, mel_targets=train_feeder.mel_targets, linear_targets=train_feeder.linear_targets,is_training=True, loss_coeff=train_feeder.loss_coeff,stop_token_targets=train_feeder.stop_token_targets) model.add_loss() model.add_optimizer(global_step) train_stats = add_stats(model, scope_name='train') # legacy with tf.variable_scope('model', reuse=True) as scope: test_model = create_model(hparams) test_model.initialize(inputs=test_feeder.inputs, input_lengths=test_feeder.input_lengths,num_speakers=num_speakers,speaker_id=test_feeder.speaker_id, mel_targets=test_feeder.mel_targets, linear_targets=test_feeder.linear_targets,is_training=False, loss_coeff=test_feeder.loss_coeff,stop_token_targets=test_feeder.stop_token_targets) test_model.add_loss() # Bookkeeping: step = 0 time_window = ValueWindow(100) loss_window = ValueWindow(100) saver = tf.train.Saver(max_to_keep=None, keep_checkpoint_every_n_hours=2) sess_config = tf.ConfigProto(log_device_placement=False,allow_soft_placement=True) sess_config.gpu_options.allow_growth=True # Train! #with tf.Session(config=sess_config) as sess: with tf.Session() as sess: try: summary_writer = tf.summary.FileWriter(log_dir, sess.graph) sess.run(tf.global_variables_initializer()) if config.load_path: # Restore from a checkpoint if the user requested it. 
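# (load_path resumes training and keeps the stored global_step; initialize_path below restores
#  only the weights and resets global_step to 0.)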
restore_path = get_most_recent_checkpoint(config.model_dir) saver.restore(sess, restore_path) log('Resuming from checkpoint: %s at commit: %s' % (restore_path, commit), slack=True) elif config.initialize_path: restore_path = get_most_recent_checkpoint(config.initialize_path) saver.restore(sess, restore_path) log('Initialized from checkpoint: %s at commit: %s' % (restore_path, commit), slack=True) zero_step_assign = tf.assign(global_step, 0) sess.run(zero_step_assign) start_step = sess.run(global_step) log('='*50) log(' [*] Global step is reset to {}'.format(start_step)) log('='*50) else: log('Starting new training run at commit: %s' % commit, slack=True) start_step = sess.run(global_step) train_feeder.start_in_session(sess, start_step) test_feeder.start_in_session(sess, start_step) while not coord.should_stop(): start_time = time.time() step, loss, opt = sess.run([global_step, model.loss_without_coeff, model.optimize]) time_window.append(time.time() - start_time) loss_window.append(loss) message = 'Step %-7d [%.03f sec/step, loss=%.05f, avg_loss=%.05f]' % (step, time_window.average, loss, loss_window.average) log(message, slack=(step % config.checkpoint_interval == 0)) if loss > 100 or math.isnan(loss): log('Loss exploded to %.05f at step %d!' % (loss, step), slack=True) raise Exception('Loss Exploded') if step % config.summary_interval == 0: log('Writing summary at step: %d' % step) summary_writer.add_summary(sess.run( train_stats), step) if step % config.checkpoint_interval == 0: log('Saving checkpoint to: %s-%d' % (checkpoint_path, step)) saver.save(sess, checkpoint_path, global_step=step) if step % config.test_interval == 0: log('Saving audio and alignment...') num_test = config.num_test fetches = [ model.inputs[:num_test], model.linear_outputs[:num_test], model.alignments[:num_test], test_model.inputs[:num_test], test_model.linear_outputs[:num_test], test_model.alignments[:num_test], ] sequences, spectrograms, alignments, test_sequences, test_spectrograms, test_alignments = sess.run(fetches) #librosa는 ffmpeg가 있어야 한다. 
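# (i.e. librosa needs ffmpeg to be available for this step.)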
save_and_plot(sequences[:1], spectrograms[:1], alignments[:1], log_dir, step, loss, "train") # spectrograms: (num_test,200,1025), alignments: (num_test,encoder_length,decoder_length) save_and_plot(test_sequences, test_spectrograms, test_alignments, log_dir, step, loss, "test") except Exception as e: log('Exiting due to exception: %s' % e, slack=True) traceback.print_exc() coord.request_stop(e) def main(): parser = argparse.ArgumentParser() parser.add_argument('--log_dir', default='logdir-tacotron2') parser.add_argument('--data_paths', default='D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\moon,D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\son') #parser.add_argument('--data_paths', default='D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\small1,D:\\hccho\\Tacotron-Wavenet-Vocoder-hccho\\data\\small2') #parser.add_argument('--load_path', default=None) # 아래의 'initialize_path'보다 우선 적용 parser.add_argument('--load_path', default='logdir-tacotron2/moon+son_2019-03-01_10-35-44') parser.add_argument('--initialize_path', default=None) # ckpt로 부터 model을 restore하지만, global step은 0에서 시작 parser.add_argument('--batch_size', type=int, default=32) parser.add_argument('--num_test_per_speaker', type=int, default=2) parser.add_argument('--random_seed', type=int, default=123) parser.add_argument('--summary_interval', type=int, default=100) parser.add_argument('--test_interval', type=int, default=500) # 500 parser.add_argument('--checkpoint_interval', type=int, default=2000) # 2000 parser.add_argument('--skip_path_filter', type=str2bool, default=False, help='Use only for debugging') parser.add_argument('--slack_url', help='Slack webhook URL to get periodic reports.') parser.add_argument('--git', action='store_true', help='If set, verify that the client is clean.') # The store_true option automatically creates a default value of False. config = parser.parse_args() config.data_paths = config.data_paths.split(",") setattr(hparams, "num_speakers", len(config.data_paths)) prepare_dirs(config, hparams) log_path = os.path.join(config.model_dir, 'train.log') infolog.init(log_path, config.model_dir, config.slack_url) tf.set_random_seed(config.random_seed) print(config.data_paths) if config.load_path is not None and config.initialize_path is not None: raise Exception(" [!] Only one of load_path and initialize_path should be set") train(config.model_dir, config) if __name__ == '__main__': main() ================================================ FILE: train_vocoder.py ================================================ # coding: utf-8 """ - train data를 speaker를 분리된 디렉토리로 받아서, speaker id를 디렉토리별로 부과. - file name에서 speaker id를 추론하는 방식이 아님. 
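- Training data is given as one directory per speaker; the speaker id is assigned per directory,
  not inferred from the file name.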
""" from __future__ import print_function import argparse import numpy as np import os import time import traceback from glob import glob import tensorflow as tf from tensorflow.python.client import timeline from datetime import datetime from wavenet import WaveNetModel,mu_law_decode from datasets import DataFeederWavenet from hparams import hparams from utils import validate_directories,load,save,infolog,get_tensors_in_checkpoint_file,build_tensors_in_checkpoint_file,plot,audio tf.logging.set_verbosity(tf.logging.ERROR) EPSILON = 0.001 log = infolog.log def eval_step(sess,logdir,step,waveform,upsampled_local_condition_data,speaker_id_data,mel_input_data,samples,speaker_id,upsampled_local_condition,next_sample,temperature=1.0): waveform = waveform[:,:1] sample_size = upsampled_local_condition_data.shape[1] last_sample_timestamp = datetime.now() start_time = time.time() for step2 in range(sample_size): # 원하는 길이를 구하기 위해 loop sample_size window = waveform[:,-1:] # 제일 끝에 있는 1개만 samples에 넣어 준다. window: shape(N,1) prediction = sess.run(next_sample, feed_dict={samples: window,upsampled_local_condition: upsampled_local_condition_data[:,step2,:],speaker_id: speaker_id_data }) if hparams.scalar_input: sample = prediction # logistic distribution으로부터 sampling 되었기 때문에, randomness가 있다. else: # Scale prediction distribution using temperature. # 다음 과정은 config.temperature==1이면 각 원소를 합으로 나누어주는 것에 불과. 이미 softmax를 적용한 겂이므로, 합이 1이된다. 그래서 값의 변화가 없다. # config.temperature가 1이 아니며, 각 원소의 log취한 값을 나눈 후, 합이 1이 되도록 rescaling하는 것이 된다. np.seterr(divide='ignore') scaled_prediction = np.log(prediction) / temperature # config.temperature인 경우는 값의 변화가 없다. scaled_prediction = (scaled_prediction - np.logaddexp.reduce(scaled_prediction,axis=-1,keepdims=True)) # np.log(np.sum(np.exp(scaled_prediction))) scaled_prediction = np.exp(scaled_prediction) np.seterr(divide='warn') # Prediction distribution at temperature=1.0 should be unchanged after # scaling. if temperature == 1.0: np.testing.assert_allclose( prediction, scaled_prediction, atol=1e-5, err_msg='Prediction scaling at temperature=1.0 is not working as intended.') # argmax로 선택하지 않기 때문에, 같은 입력이 들어가도 달라질 수 있다. sample = [[np.random.choice(np.arange(hparams.quantization_channels), p=p)] for p in scaled_prediction] # choose one sample per batch waveform = np.concatenate([waveform,sample],axis=-1) #window.shape: (N,1) # Show progress only once per second. current_sample_timestamp = datetime.now() time_since_print = current_sample_timestamp - last_sample_timestamp if time_since_print.total_seconds() > 1.: duration = time.time() - start_time print('Sample {:3 EPSILON else None gc_enable = True # Before: num_speakers > 1 After: 항상 True # AudioReader에서 wav 파일을 잘라 input값을 만든다. receptive_field길이만큼을 앞부분에 pad하거나 앞조각에서 가져온다. (receptive_field+ sample_size)크기로 자른다. reader = DataFeederWavenet(coord,config.data_dir,batch_size=hparams.wavenet_batch_size,gc_enable= gc_enable,test_mode=False) # test를 위한 DataFeederWavenet를 하나 만들자. 여기서는 딱 1개의 파일만 가져온다. reader_test = DataFeederWavenet(coord,config.data_dir,batch_size=1,gc_enable= gc_enable,test_mode=True,queue_size=1) audio_batch, lc_batch, gc_id_batch = reader.inputs_wav, reader.local_condition, reader.speaker_id # Create train network. 
net = create_network(hparams,hparams.wavenet_batch_size,num_speakers,is_training=True) net.add_loss(input_batch=audio_batch,local_condition=lc_batch, global_condition_batch=gc_id_batch, l2_regularization_strength=hparams.l2_regularization_strength,upsample_type=hparams.upsample_type) net.add_optimizer(hparams,global_step) run_metadata = tf.RunMetadata() # Set up session sess = tf.Session(config=tf.ConfigProto(log_device_placement=False)) # log_device_placement=False --> cpu/gpu 자동 배치. init = tf.global_variables_initializer() sess.run(init) # Saver for storing checkpoints of the model. saver = tf.train.Saver(var_list=tf.global_variables(), max_to_keep=hparams.max_checkpoints) # 최대 checkpoint 저장 갯수 지정 try: start_step = load(saver, sess, restore_from) # checkpoint load if is_overwritten_training or start_step is None: # The first training step will be saved_global_step + 1, # therefore we put -1 here for new or overwritten trainings. zero_step_assign = tf.assign(global_step, 0) sess.run(zero_step_assign) start_step=0 except: print("Something went wrong while restoring checkpoint. We will terminate training to avoid accidentally overwriting the previous model.") raise ########### reader.start_in_session(sess,start_step) reader_test.start_in_session(sess,start_step) ################### Create test network. <---- Queue 생성 때문에, sess restore후 test network 생성 net_test = create_network(hparams,1,num_speakers,is_training=False) if hparams.scalar_input: samples = tf.placeholder(tf.float32,shape=[net_test.batch_size,None]) waveform = 2*np.random.rand(net_test.batch_size).reshape(net_test.batch_size,-1)-1 else: samples = tf.placeholder(tf.int32,shape=[net_test.batch_size,None]) # samples: mu_law_encode로 변환된 것. one-hot으로 변환되기 전. (batch_size, 길이) waveform = np.random.randint(hparams.quantization_channels,size=net_test.batch_size).reshape(net_test.batch_size,-1) upsampled_local_condition = tf.placeholder(tf.float32,shape=[net_test.batch_size,hparams.num_mels]) speaker_id = tf.placeholder(tf.int32,shape=[net_test.batch_size]) next_sample = net_test.predict_proba_incremental(samples,upsampled_local_condition,speaker_id) # Fast Wavenet Generation Algorithm-1611.09482 algorithm 적용 sess.run(net_test.queue_initializer) # test를 위한 placeholder는 모두 3개: samples,speaker_id,upsampled_local_condition # test용 mel-spectrogram을 하나 뽑자. 그것을 고정하지 않으면, thread가 계속 돌아가면서 data를 읽어온다. reader_test의 역할은 여기서 끝난다. mel_input_test, speaker_id_test = sess.run([reader_test.local_condition,reader_test.speaker_id]) with tf.variable_scope('wavenet',reuse=tf.AUTO_REUSE): upsampled_local_condition_data = net_test.create_upsample(mel_input_test,upsample_type=hparams.upsample_type) upsampled_local_condition_data_ = sess.run(upsampled_local_condition_data) # upsampled_local_condition_data_ 을 feed_dict로 placehoder인 upsampled_local_condition에 넣어준다. ###################################################### start_step = sess.run(global_step) step = last_saved_step = start_step try: while not coord.should_stop(): start_time = time.time() if hparams.store_metadata and step % 50 == 0: # Slow run that stores extra information for debugging. 
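# (When store_metadata is enabled, every 50th step runs with FULL_TRACE and writes a
#  Chrome-trace timeline to <logdir>/timeline.trace for inspection in chrome://tracing.)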
log('Storing metadata') run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) step, loss_value, _ = sess.run([global_step, net.loss, net.optimize],options=run_options,run_metadata=run_metadata) tl = timeline.Timeline(run_metadata.step_stats) timeline_path = os.path.join(logdir, 'timeline.trace') with open(timeline_path, 'w') as f: f.write(tl.generate_chrome_trace_format(show_memory=True)) else: step, loss_value, _ = sess.run([global_step,net.loss, net.optimize]) duration = time.time() - start_time log('step {:d} - loss = {:.3f}, ({:.3f} sec/step)'.format(step, loss_value, duration)) if step % config.checkpoint_every == 0: save(saver, sess, logdir, step) last_saved_step = step if step % config.eval_every == 0: # config.eval_every eval_step(sess,logdir,step,waveform,upsampled_local_condition_data_,speaker_id_test,mel_input_test,samples,speaker_id,upsampled_local_condition,next_sample) if step >= hparams.num_steps: # error message가 나오지만, 여기서 멈춘 것은 맞다. raise Exception('End xxx~~~yyy') except Exception as e: print('finally') log('Exiting due to exception: %s' % e, slack=True) #if step > last_saved_step: # save(saver, sess, logdir, step) traceback.print_exc() coord.request_stop(e) if __name__ == '__main__': main() traceback.print_exc() print('Done') ================================================ FILE: utils/__init__.py ================================================ # -*- coding: utf-8 -*- import re,json,sys,os import tensorflow as tf from tqdm import tqdm from contextlib import closing from multiprocessing import Pool from collections import namedtuple from datetime import datetime, timedelta from shutil import copyfile as copy_file from tensorflow.python import pywrap_tensorflow PARAMS_NAME = "params.json" STARTED_DATESTRING = "{0:%Y-%m-%dT%H-%M-%S}".format(datetime.now()) LOGDIR_ROOT_Wavenet = './logdir-wavenet' class ValueWindow(): def __init__(self, window_size=100): self._window_size = window_size self._values = [] def append(self, x): self._values = self._values[-(self._window_size - 1):] + [x] @property def sum(self): return sum(self._values) @property def count(self): return len(self._values) @property def average(self): return self.sum / max(1, self.count) def reset(self): self._values = [] def prepare_dirs(config, hparams): if hasattr(config, "data_paths"): config.datasets = [os.path.basename(data_path) for data_path in config.data_paths] dataset_desc = "+".join(config.datasets) if config.load_path: config.model_dir = config.load_path else: config.model_name = "{}_{}".format(dataset_desc, get_time()) config.model_dir = os.path.join(config.log_dir, config.model_name) for path in [config.log_dir, config.model_dir]: if not os.path.exists(path): os.makedirs(path) if config.load_path: load_hparams(hparams, config.model_dir) else: setattr(hparams, "num_speakers", len(config.datasets)) save_hparams(config.model_dir, hparams) copy_file("hparams.py", os.path.join(config.model_dir, "hparams.py")) def save(saver, sess, logdir, step): model_name = 'model.ckpt' checkpoint_path = os.path.join(logdir, model_name) print('Storing checkpoint to {} ...'.format(logdir), end="") sys.stdout.flush() if not os.path.exists(logdir): os.makedirs(logdir) saver.save(sess, checkpoint_path, global_step=step) print(' Done.') def load(saver, sess, logdir): print("Trying to restore saved checkpoints from {} ...".format(logdir),end="") ckpt = tf.train.get_checkpoint_state(logdir) #ckpt = get_most_recent_checkpoint(logdir) if ckpt: print(" Checkpoint found: {}".format(ckpt.model_checkpoint_path)) 
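# The global step is recovered from the checkpoint file name, e.g. '.../model.ckpt-155000' -> 155000.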
global_step = int(ckpt.model_checkpoint_path.split('/')[-1].split('-')[-1]) print(" Global step was: {}".format(global_step)) print(" Restoring...", end="") saver.restore(sess, ckpt.model_checkpoint_path) print(" Done.") return global_step else: print(" No checkpoint found.") return None def get_default_logdir(logdir_root): logdir = os.path.join(logdir_root, 'train', STARTED_DATESTRING) if not os.path.exists(logdir): os.makedirs(logdir) return logdir def validate_directories(args,hparams): """Validate and arrange directory related arguments.""" # Validation if args.logdir and args.logdir_root: raise ValueError("--logdir and --logdir_root cannot be specified at the same time.") if args.logdir and args.restore_from: raise ValueError( "--logdir and --restore_from cannot be specified at the same " "time. This is to keep your previous model from unexpected " "overwrites.\n" "Use --logdir_root to specify the root of the directory which " "will be automatically created with current date and time, or use " "only --logdir to just continue the training from the last " "checkpoint.") # Arrangement logdir_root = args.logdir_root if logdir_root is None: logdir_root = LOGDIR_ROOT_Wavenet logdir = args.logdir if logdir is None: logdir = get_default_logdir(logdir_root) print('Using default logdir: {}'.format(logdir)) save_hparams(logdir, hparams) copy_file("hparams.py", os.path.join(logdir, "hparams.py")) else: load_hparams(hparams, logdir) restore_from = args.restore_from if restore_from is None: # args.logdir and args.restore_from are exclusive, # so it is guaranteed the logdir here is newly created. restore_from = logdir return { 'logdir': logdir, 'logdir_root': args.logdir_root, 'restore_from': restore_from } def save_hparams(model_dir, hparams): param_path = os.path.join(model_dir, PARAMS_NAME) info = eval(hparams.to_json(),{'false': False, 'true': True, 'null': None}) write_json(param_path, info) print(" [*] MODEL dir: {}".format(model_dir)) print(" [*] PARAM path: {}".format(param_path)) def write_json(path, data): with open(path, 'w',encoding='utf-8') as f: json.dump(data, f, indent=4, sort_keys=True, ensure_ascii=False) def load_hparams(hparams, load_path, skip_list=[]): # log dir에 있는 hypermarameter 정보를 이용해서, hparams.py의 정보를 update한다. path = os.path.join(load_path, PARAMS_NAME) new_hparams = load_json(path) hparams_keys = vars(hparams).keys() for key, value in new_hparams.items(): if key in skip_list or key not in hparams_keys: print("Skip {} because it not exists".format(key)) #json에 있지만, hparams에 없다는 의미 continue if key not in ['xxxxx',]: # update 하지 말아야 할 것을 지정할 수 있다. 
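# (keys listed above are excluded from being overwritten by the values loaded from params.json)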
original_value = getattr(hparams, key) if original_value != value: print("UPDATE {}: {} -> {}".format(key, getattr(hparams, key), value)) setattr(hparams, key, value) def load_json(path, as_class=False, encoding='euc-kr'): with open(path,encoding=encoding) as f: content = f.read() content = re.sub(",\s*}", "}", content) content = re.sub(",\s*]", "]", content) if as_class: data = json.loads(content, object_hook=\ lambda data: namedtuple('Data', data.keys())(*data.values())) else: data = json.loads(content) return data def get_most_recent_checkpoint(checkpoint_dir): checkpoint_paths = [path for path in glob("{}/*.ckpt-*.data-*".format(checkpoint_dir))] idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths] max_idx = max(idxes) lastest_checkpoint = os.path.join(checkpoint_dir, "model.ckpt-{}".format(max_idx)) #latest_checkpoint=checkpoint_paths[0] print(" [*] Found lastest checkpoint: {}".format(lastest_checkpoint)) return lastest_checkpoint def add_prefix(path, prefix): dir_path, filename = os.path.dirname(path), os.path.basename(path) return "{}/{}.{}".format(dir_path, prefix, filename) def add_postfix(path, postfix): path_without_ext, ext = path.rsplit('.', 1) return "{}.{}.{}".format(path_without_ext, postfix, ext) def remove_postfix(path): items = path.rsplit('.', 2) return items[0] + "." + items[2] def get_time(): return datetime.now().strftime("%Y-%m-%d_%H-%M-%S") def parallel_run(fn, items, desc="", parallel=True): results = [] if parallel: with closing(Pool(10)) as pool: for out in tqdm(pool.imap_unordered(fn, items), total=len(items), desc=desc): if out is not None: results.append(out) else: for item in tqdm(items, total=len(items), desc=desc): out = fn(item) if out is not None: results.append(out) return results def makedirs(path): if not os.path.exists(path): print(" [*] Make directories : {}".format(path)) os.makedirs(path) def str2bool(v): return v.lower() in ('true', '1') def remove_file(path): if os.path.exists(path): print(" [*] Removed: {}".format(path)) os.remove(path) def get_git_revision_hash(): return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode("utf-8") def get_git_diff(): return subprocess.check_output(['git', 'diff']).decode("utf-8") def warning(msg): print("="*40) print(" [!] {}".format(msg)) print("="*40) print() def get_tensors_in_checkpoint_file(file_name,all_tensors=True,tensor_name=None): # checkpoint 파일로 부터 복구 # e.g file_name: 'D:\\hccho\\Tacotron-2-hccho\\model.ckpt-155000' varlist=[] var_value =[] reader = pywrap_tensorflow.NewCheckpointReader(file_name) trainable_variables_names = [v.name[:-2] for v in tf.trainable_variables()] # 끝부분의 ':0' 제외 if all_tensors: var_to_shape_map = reader.get_variable_to_shape_map() for key in sorted(var_to_shape_map): if key in trainable_variables_names: # hccho varlist.append(key) var_value.append(reader.get_tensor(key)) else: varlist.append(tensor_name) var_value.append(reader.get_tensor(tensor_name)) return (varlist, var_value) def build_tensors_in_checkpoint_file(loaded_tensors): # 현재 tensor graph에 있는 tensor중에서 loaded_tensors에 있는 tensor name을 가져온다. full_var_list = list() # Loop all loaded tensors for i, tensor_name in enumerate(loaded_tensors[0]): # Extract tensor try: tensor_aux = tf.get_default_graph().get_tensor_by_name(tensor_name+":0") except: print('Not found: '+tensor_name) full_var_list.append(tensor_aux) return full_var_list """ # restore egample 모델을 변형했을 때, 기존 ckpt로부터 중복되는 trainable_varaibles 복구. 
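# (Restore example: after modifying the model, recover only the trainable variables that still
#  exist in an older checkpoint, then re-save everything with a fresh Saver.)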
CHECKPOINT_NAME = 'D:\\hccho\\Tacotron-2-hccho\\ver1\\logdir-wavenet\\train\\2019-03-22T23-08-16\\model.ckpt-155000' restored_vars = get_tensors_in_checkpoint_file(file_name=CHECKPOINT_NAME) tensors_to_load = build_tensors_in_checkpoint_file(restored_vars) loader = tf.train.Saver(tensors_to_load) loader.restore(sess, CHECKPOINT_NAME) new_saver = tf.train.Saver(var_list=tf.global_variables(), max_to_keep=hparams.max_checkpoints) # 최대 checkpoint 저장 갯수 지정 save(new_saver, sess, logdir, 0) exit() """ ================================================ FILE: utils/audio.py ================================================ # coding: utf-8 import librosa import librosa.filters import numpy as np import tensorflow as tf from scipy import signal from scipy.io import wavfile from tensorflow.contrib.training.python.training.hparam import HParams def load_wav(path, sr): return librosa.core.load(path, sr=sr)[0] def save_wav(wav, path, sr): wav *= 32767 / max(0.01, np.max(np.abs(wav))) #proposed by @dsmiller --> libosa type error(bug) 극복 wavfile.write(path, sr, wav.astype(np.int16)) def save_wavenet_wav(wav, path, sr): librosa.output.write_wav(path, wav, sr=sr) def preemphasis(wav, k, preemphasize=True): if preemphasize: return signal.lfilter([1, -k], [1], wav) return wav def inv_preemphasis(wav, k, inv_preemphasize=True): if inv_preemphasize: return signal.lfilter([1], [1, -k], wav) return wav #From https://github.com/r9y9/wavenet_vocoder/blob/master/audio.py def start_and_end_indices(quantized, silence_threshold=2): for start in range(quantized.size): if abs(quantized[start] - 127) > silence_threshold: break for end in range(quantized.size - 1, 1, -1): if abs(quantized[end] - 127) > silence_threshold: break assert abs(quantized[start] - 127) > silence_threshold assert abs(quantized[end] - 127) > silence_threshold return start, end def trim_silence(wav, hparams): '''Trim leading and trailing silence Useful for M-AILABS dataset if we choose to trim the extra 0.5 silence at beginning and end. ''' #Thanks @begeekmyfriend and @lautjy for pointing out the params contradiction. These params are separate and tunable per dataset. return librosa.effects.trim(wav, top_db= hparams.trim_top_db, frame_length=hparams.trim_fft_size, hop_length=hparams.trim_hop_size)[0] def get_hop_size(hparams): hop_size = hparams.hop_size if hop_size is None: assert hparams.frame_shift_ms is not None hop_size = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate) return hop_size def linearspectrogram(wav, hparams): D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams) S = _amp_to_db(np.abs(D), hparams) - hparams.ref_level_db if hparams.signal_normalization: # Tacotron에서 항상적용했다. 
return _normalize(S, hparams) return S def melspectrogram(wav, hparams): D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams) S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db if hparams.signal_normalization: return _normalize(S, hparams) return S def inv_linear_spectrogram(linear_spectrogram, hparams): '''Converts linear spectrogram to waveform using librosa''' if hparams.signal_normalization: D = _denormalize(linear_spectrogram, hparams) else: D = linear_spectrogram S = _db_to_amp(D + hparams.ref_level_db) #Convert back to linear if hparams.use_lws: processor = _lws_processor(hparams) D = processor.run_lws(S.astype(np.float64).T ** hparams.power) y = processor.istft(D).astype(np.float32) return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize) else: return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize) def inv_mel_spectrogram(mel_spectrogram, hparams): '''Converts mel spectrogram to waveform using librosa''' if hparams.signal_normalization: D = _denormalize(mel_spectrogram, hparams) else: D = mel_spectrogram S = _mel_to_linear(_db_to_amp(D + hparams.ref_level_db), hparams) # Convert back to linear if hparams.use_lws: processor = _lws_processor(hparams) D = processor.run_lws(S.astype(np.float64).T ** hparams.power) y = processor.istft(D).astype(np.float32) return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize) else: return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize) def inv_spectrogram_tensorflow(spectrogram,hparams): S = _db_to_amp_tensorflow(_denormalize_tensorflow(spectrogram,hparams) + hparams.ref_level_db) return _griffin_lim_tensorflow(tf.pow(S, hparams.power),hparams) def inv_spectrogram(spectrogram,hparams): S = _db_to_amp(_denormalize(spectrogram,hparams) + hparams.ref_level_db) # Convert back to linear. spectrogram: (num_freq,length) return inv_preemphasis(_griffin_lim(S ** hparams.power,hparams),hparams.preemphasis, hparams.preemphasize) # Reconstruct phase def _lws_processor(hparams): import lws return lws.lws(hparams.fft_size, get_hop_size(hparams), fftsize=hparams.win_size, mode="speech") def _griffin_lim(S, hparams): '''librosa implementation of Griffin-Lim Based on https://github.com/librosa/librosa/issues/434 ''' angles = np.exp(2j * np.pi * np.random.rand(*S.shape)) S_complex = np.abs(S).astype(np.complex) y = _istft(S_complex * angles, hparams) for i in range(hparams.griffin_lim_iters): angles = np.exp(1j * np.angle(_stft(y, hparams))) y = _istft(S_complex * angles, hparams) return y def _stft(y, hparams): if hparams.use_lws: return _lws_processor(hparams).stft(y).T else: return librosa.stft(y=y, n_fft=hparams.fft_size, hop_length=get_hop_size(hparams), win_length=hparams.win_size) def _istft(y, hparams): return librosa.istft(y, hop_length=get_hop_size(hparams), win_length=hparams.win_size) ########################################################## #Those are only correct when using lws!!! (This was messing with Wavenet quality for a long time!) 
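# Illustrative check of the two helpers below (assuming fsize=1024, fshift=256, len(x)=4096):
# pad = 768, num_frames = (4096 + 2*768 - 1024)//256 + 1 = 19, and pad_lr returns (768, 768)
# because r = 18*256 + 1024 - (4096 + 2*768) = 0.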
def num_frames(length, fsize, fshift): """Compute number of time frames of spectrogram """ pad = (fsize - fshift) if length % fshift == 0: M = (length + pad * 2 - fsize) // fshift + 1 else: M = (length + pad * 2 - fsize) // fshift + 2 return M def pad_lr(x, fsize, fshift): """Compute left and right padding """ M = num_frames(len(x), fsize, fshift) pad = (fsize - fshift) T = len(x) + 2 * pad r = (M - 1) * fshift + fsize - T return pad, pad + r ########################################################## #Librosa correct padding def librosa_pad_lr(x, fsize, fshift): '''compute right padding (final frame) ''' return int(fsize // 2) # Conversions _mel_basis = None _inv_mel_basis = None def _linear_to_mel(spectogram, hparams): global _mel_basis if _mel_basis is None: _mel_basis = _build_mel_basis(hparams) return np.dot(_mel_basis, spectogram) def _mel_to_linear(mel_spectrogram, hparams): global _inv_mel_basis if _inv_mel_basis is None: _inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams)) return np.maximum(1e-10, np.dot(_inv_mel_basis, mel_spectrogram)) def _build_mel_basis(hparams): #assert hparams.fmax <= hparams.sample_rate // 2 #fmin: Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To test depending on dataset. Pitch info: male~[65, 260], female~[100, 525]) #fmax: 7600, To be increased/reduced depending on data. #return librosa.filters.mel(hparams.sample_rate, hparams.fft_size, n_mels=hparams.num_mels,fmin=hparams.fmin, fmax=hparams.fmax) return librosa.filters.mel(hparams.sample_rate, hparams.fft_size, n_mels=hparams.num_mels) # fmin=0, fmax= sample_rate/2.0 def _amp_to_db(x, hparams): min_level = np.exp(hparams.min_level_db / 20 * np.log(10)) # min_level_db = -100 return 20 * np.log10(np.maximum(min_level, x)) def _db_to_amp(x): return np.power(10.0, (x) * 0.05) def _normalize(S, hparams): if hparams.allow_clipping_in_normalization: if hparams.symmetric_mels: return np.clip((2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value, -hparams.max_abs_value, hparams.max_abs_value) else: return np.clip(hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)), 0, hparams.max_abs_value) assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0 if hparams.symmetric_mels: return (2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value else: return hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)) def _denormalize(D, hparams): if hparams.allow_clipping_in_normalization: if hparams.symmetric_mels: return (((np.clip(D, -hparams.max_abs_value, hparams.max_abs_value) + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db) else: return ((np.clip(D, 0, hparams.max_abs_value) * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db) if hparams.symmetric_mels: return (((D + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db) else: return ((D * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db) # 김태훈 구현. 이 차이 때문에 호환이 되지 않는다. 
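# (The commented-out version below is the Taehoon Kim (carpedm20) style normalization;
#  this difference is why the two normalization schemes are not interchangeable.)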
# def _normalize(S,hparams): # return np.clip((S - hparams.min_level_db) / -hparams.min_level_db, 0, 1) # min_level_db = -100 # # def _denormalize(S,hparams): # return (np.clip(S, 0, 1) * -hparams.min_level_db) + hparams.min_level_db #From https://github.com/r9y9/nnmnkwii/blob/master/nnmnkwii/preprocessing/generic.py def mulaw(x, mu=256): """Mu-Law companding Method described in paper [1]_. .. math:: f(x) = sign(x) ln (1 + mu |x|) / ln (1 + mu) Args: x (array-like): Input signal. Each value of input signal must be in range of [-1, 1]. mu (number): Compression parameter ``μ``. Returns: array-like: Compressed signal ([-1, 1]) See also: :func:`nnmnkwii.preprocessing.inv_mulaw` :func:`nnmnkwii.preprocessing.mulaw_quantize` :func:`nnmnkwii.preprocessing.inv_mulaw_quantize` .. [1] Brokish, Charles W., and Michele Lewis. "A-law and mu-law companding implementations using the tms320c54x." SPRA163 (1997). """ return _sign(x) * _log1p(mu * _abs(x)) / _log1p(mu) def inv_mulaw(y, mu=256): """Inverse of mu-law companding (mu-law expansion) .. math:: f^{-1}(x) = sign(y) (1 / mu) (1 + mu)^{|y|} - 1) Args: y (array-like): Compressed signal. Each value of input signal must be in range of [-1, 1]. mu (number): Compression parameter ``μ``. Returns: array-like: Uncomprresed signal (-1 <= x <= 1) See also: :func:`nnmnkwii.preprocessing.inv_mulaw` :func:`nnmnkwii.preprocessing.mulaw_quantize` :func:`nnmnkwii.preprocessing.inv_mulaw_quantize` """ return _sign(y) * (1.0 / mu) * ((1.0 + mu)**_abs(y) - 1.0) def mulaw_quantize(x, mu=256): """Mu-Law companding + quantize Args: x (array-like): Input signal. Each value of input signal must be in range of [-1, 1]. mu (number): Compression parameter ``μ``. Returns: array-like: Quantized signal (dtype=int) - y ∈ [0, mu] if x ∈ [-1, 1] - y ∈ [0, mu) if x ∈ [-1, 1) .. note:: If you want to get quantized values of range [0, mu) (not [0, mu]), then you need to provide input signal of range [-1, 1). Examples: >>> from scipy.io import wavfile >>> import pysptk >>> import numpy as np >>> from nnmnkwii import preprocessing as P >>> fs, x = wavfile.read(pysptk.util.example_audio_file()) >>> x = (x / 32768.0).astype(np.float32) >>> y = P.mulaw_quantize(x) >>> print(y.min(), y.max(), y.dtype) 15 246 int64 See also: :func:`nnmnkwii.preprocessing.mulaw` :func:`nnmnkwii.preprocessing.inv_mulaw` :func:`nnmnkwii.preprocessing.inv_mulaw_quantize` """ mu = mu-1 y = mulaw(x, mu) # scale [-1, 1] to [0, mu] return _asint((y + 1) / 2 * mu) def inv_mulaw_quantize(y, mu=256): """Inverse of mu-law companding + quantize Args: y (array-like): Quantized signal (∈ [0, mu]). mu (number): Compression parameter ``μ``. 
Returns: array-like: Uncompressed signal ([-1, 1]) Examples: >>> from scipy.io import wavfile >>> import pysptk >>> import numpy as np >>> from nnmnkwii import preprocessing as P >>> fs, x = wavfile.read(pysptk.util.example_audio_file()) >>> x = (x / 32768.0).astype(np.float32) >>> x_hat = P.inv_mulaw_quantize(P.mulaw_quantize(x)) >>> x_hat = (x_hat * 32768).astype(np.int16) See also: :func:`nnmnkwii.preprocessing.mulaw` :func:`nnmnkwii.preprocessing.inv_mulaw` :func:`nnmnkwii.preprocessing.mulaw_quantize` """ # [0, m) to [-1, 1] mu = mu-1 y = 2 * _asfloat(y) / mu - 1 return inv_mulaw(y, mu) def _sign(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return np.sign(x) if (isnumpy or isscalar) else tf.sign(x) def _log1p(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return np.log1p(x) if (isnumpy or isscalar) else tf.log1p(x) def _abs(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return np.abs(x) if (isnumpy or isscalar) else tf.abs(x) def _asint(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return x.astype(np.int) if isnumpy else int(x) if isscalar else tf.cast(x, tf.int32) def _asfloat(x): #wrapper to support tensorflow tensors/numpy arrays isnumpy = isinstance(x, np.ndarray) isscalar = np.isscalar(x) return x.astype(np.float32) if isnumpy else float(x) if isscalar else tf.cast(x, tf.float32) def frames_to_hours(n_frames,hparams): return sum((n_frame for n_frame in n_frames)) * hparams.frame_shift_ms / (3600 * 1000) def get_duration(audio,hparams): return librosa.core.get_duration(audio, sr=hparams.sample_rate) def _db_to_amp_tensorflow(x): return tf.pow(tf.ones(tf.shape(x)) * 10.0, x * 0.05) def _denormalize_tensorflow(S,hparams): return (tf.clip_by_value(S, 0, 1) * -hparams.min_level_db) + hparams.min_level_db def _griffin_lim_tensorflow(S,hparams): with tf.variable_scope('griffinlim'): S = tf.expand_dims(S, 0) S_complex = tf.identity(tf.cast(S, dtype=tf.complex64)) y = _istft_tensorflow(S_complex,hparams) for i in range(hparams.griffin_lim_iters): est = _stft_tensorflow(y,hparams) angles = est / tf.cast(tf.maximum(1e-8, tf.abs(est)), tf.complex64) y = _istft_tensorflow(S_complex * angles,hparams) return tf.squeeze(y, 0) def _istft_tensorflow(stfts,hparams): n_fft, hop_length, win_length = _stft_parameters(hparams) return tf.contrib.signal.inverse_stft(stfts, win_length, hop_length, n_fft) def _stft_tensorflow(signals,hparams): n_fft, hop_length, win_length = _stft_parameters(hparams) return tf.contrib.signal.stft(signals, win_length, hop_length, n_fft, pad_end=False) def _stft_parameters(hparams): n_fft = (hparams.num_freq - 1) * 2 # hparams.num_freq = 1025 hop_length = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate) # hparams.frame_shift_ms = 12.5 win_length = int(hparams.frame_length_ms / 1000 * hparams.sample_rate) # hparams.frame_length_ms = 50 return n_fft, hop_length, win_length ================================================ FILE: utils/infolog.py ================================================ import atexit from datetime import datetime import json from threading import Thread from urllib.request import Request, urlopen _format = '%Y-%m-%d %H:%M:%S.%f' _file = None _run_name = None _slack_url = None def init(filename, run_name, slack_url=None): global _file, _run_name, _slack_url 
    _close_logfile()
    _file = open(filename, 'a')
    _file.write('\n-----------------------------------------------------------------\n')
    _file.write('Starting new training run\n')
    _file.write('-----------------------------------------------------------------\n')
    _run_name = run_name
    _slack_url = slack_url


def log(msg, slack=False):
    print(msg)
    if _file is not None:
        _file.write('[%s] %s\n' % (datetime.now().strftime(_format)[:-3], msg))
    if slack and _slack_url is not None:
        Thread(target=_send_slack, args=(msg,)).start()


def _close_logfile():
    global _file
    if _file is not None:
        _file.close()
        _file = None


def _send_slack(msg):
    req = Request(_slack_url)
    req.add_header('Content-Type', 'application/json')
    urlopen(req, json.dumps({
        'username': 'tacotron',
        'icon_emoji': ':taco:',
        'text': '*%s*: %s' % (_run_name, msg)
    }).encode())


atexit.register(_close_logfile)

================================================ FILE: utils/plot.py ================================================
# coding: utf-8
import os
import matplotlib
import matplotlib.font_manager as font_manager
from jamo import h2j, j2hcj
import numpy as np

matplotlib.use('Agg')  # works around the Korean font rendering issue

#matplotlib.rc('font', family="NanumBarunGothic")
#font_manager._rebuild()   # <---- only needs to be run once

font_fname = './/utils//NanumBarunGothic.ttf'
font_name = font_manager.FontProperties(fname=font_fname).get_name()
matplotlib.rc('font', family=font_name)  # use the family name read from the bundled TTF

import matplotlib.pyplot as plt

from text import PAD, EOS
from utils import add_postfix
from text.korean import normalize


def plot(alignment, info, text, isKorean=True):
    char_len, audio_len = alignment.shape  # e.g. 145, 200
    fig, ax = plt.subplots(figsize=(char_len/5, 5))
    im = ax.imshow(
        alignment.T,
        aspect='auto',
        origin='lower',
        interpolation='none')

    xlabel = 'Encoder timestep'
    ylabel = 'Decoder timestep'
    if info is not None:
        xlabel += '\n{}'.format(info)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

    if text:
        if isKorean:
            jamo_text = j2hcj(h2j(normalize(text)))
        else:
            jamo_text = text
        pad = [PAD] * (char_len - len(jamo_text) - 1)
        A = [tok for tok in jamo_text] + [EOS] + pad
        A = [x if x != ' ' else '' for x in A]  # workaround: a space label stops the following labels from being drawn
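        # A now holds one tick label per encoder timestep: the jamo sequence, an EOS marker, then PAD
        # tokens up to char_len, so it lines up with the x-axis of the alignment plot below.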
plt.xticks(range(char_len), A) if text is not None: while True: if text[-1] in [EOS, PAD]: text = text[:-1] else: break plt.title(text) plt.tight_layout() def plot_alignment( alignment, path, info=None, text=None, isKorean=True): if text: # text = '대체 투입되었던 구급대원이' tmp_alignment = alignment[:len(h2j(text)) + 2] # '대체 투입되었던 구급대원이' 푼 후, 길이 측정 <--- padding제거 효과 plot(tmp_alignment, info, text, isKorean) plt.savefig(path, format='png') else: plot(alignment, info, text, isKorean) plt.savefig(path, format='png') print(" [*] Plot saved: {}".format(path)) def plot_spectrogram(pred_spectrogram, path, title=None, split_title=False, target_spectrogram=None, max_len=None, auto_aspect=False): if max_len is not None: target_spectrogram = target_spectrogram[:max_len] pred_spectrogram = pred_spectrogram[:max_len] if split_title: title = split_title_line(title) fig = plt.figure(figsize=(10, 8)) # Set common labels fig.text(0.5, 0.18, title, horizontalalignment='center', fontsize=16) #target spectrogram subplot if target_spectrogram is not None: ax1 = fig.add_subplot(311) ax2 = fig.add_subplot(312) if auto_aspect: im = ax1.imshow(np.rot90(target_spectrogram), aspect='auto', interpolation='none') else: im = ax1.imshow(np.rot90(target_spectrogram), interpolation='none') ax1.set_title('Target Mel-Spectrogram') fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1) ax2.set_title('Predicted Mel-Spectrogram') else: ax2 = fig.add_subplot(211) if auto_aspect: im = ax2.imshow(np.rot90(pred_spectrogram), aspect='auto', interpolation='none') else: im = ax2.imshow(np.rot90(pred_spectrogram), interpolation='none') fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax2) # 'horizontal' 'vertical' plt.tight_layout() plt.savefig(path, format='png') plt.close() ================================================ FILE: wavenet/__init__.py ================================================ # coding: utf-8 from .model import WaveNetModel from .ops import (mu_law_encode, mu_law_decode,optimizer_factory) ================================================ FILE: wavenet/mixture.py ================================================ # coding:utf-8 """ the code is adapted from: https://github.com/Rayhane-mamah/Tacotron-2/blob/master/wavenet_vocoder/models/mixture.py https://github.com/openai/pixel-cnn/blob/master/pixel_cnn_pp/nn.py https://github.com/r9y9/wavenet_vocoder/blob/master/wavenet_vocoder/mixture.py https://github.com/azraelkuan/tensorflow_wavenet_vocoder/tree/dev """ import tensorflow as tf import numpy as np def log_sum_exp(x): """ numerically stable log_sum_exp implementation that prevents overflow """ axis = len(x.get_shape()) - 1 m = tf.reduce_max(x, axis) m2 = tf.reduce_max(x, axis, keepdims=True) return m + tf.log(tf.reduce_sum(tf.exp(x - m2), axis)) def log_prob_from_logits(x): """ numerically stable log_softmax implementation that prevents overflow """ axis = len(x.get_shape()) - 1 m = tf.reduce_max(x, axis, keepdims=True) return x - m - tf.log(tf.reduce_sum(tf.exp(x - m), axis, keepdims=True)) # https://github.com/Rayhane-mamah/Tacotron-2/issues/155 <--- 설명 있음 def discretized_mix_logistic_loss(y_hat, y, num_class=256, log_scale_min=float(np.log(1e-14)), reduce=True): """ Discretized mixture of logistic distributions loss y_hat: Predicted output B x T x C y: Target B x T x 1 (-1~1) num_class: Number of classes log_scale_min: Log scale minimum value reduce: If True, the losses are averaged or summed for each minibatch :return: loss """ y_hat_shape = y_hat.get_shape().as_list() assert 
len(y_hat_shape) == 3 assert y_hat_shape[2] % 3 == 0 nr_mix = y_hat_shape[2] // 3 # 30 --> 10 # unpack parameters logit_probs = y_hat[:, :, :nr_mix] means = y_hat[:, :, nr_mix:2 * nr_mix] log_scales = tf.maximum(y_hat[:, :, nr_mix * 2:nr_mix * 3], log_scale_min) # B x T x 1 => B x T x nr_mix y = tf.tile(y, [1, 1, nr_mix]) centered_y = y - means inv_stdv = tf.exp(-log_scales) plus_in = inv_stdv * (centered_y + 1. / (num_class - 1)) cdf_plus = tf.nn.sigmoid(plus_in) min_in = inv_stdv * (centered_y - 1. / (num_class - 1)) cdf_min = tf.nn.sigmoid(min_in) log_cdf_plus = plus_in - tf.nn.softplus(plus_in) # log probability for edge case of 0 (before scaling) equivalent tf.log(cdf_plus) log_one_minus_cdf_min = -tf.nn.softplus(min_in) # log probability for edge case of 255 (before scaling) equivalent tf.log(1-cdf_min) cdf_delta = cdf_plus - cdf_min # probability for all other cases mid_in = inv_stdv * centered_y #log probability in the center of the bin, to be used in extreme cases #(not actually used in this code) log_pdf_mid = mid_in - log_scales - 2. * tf.nn.softplus(mid_in) # mid 값을 pdf에 직접 넣고 계산하면 나온다. log_probs = tf.where(y < -0.999, log_cdf_plus, tf.where(y > 0.999, log_one_minus_cdf_min, tf.where(cdf_delta > 1e-5, tf.log(tf.maximum(cdf_delta, 1e-12)),log_pdf_mid - np.log((num_class - 1) / 2)))) log_probs = log_probs + tf.nn.log_softmax(logit_probs, -1) # log_probs = log_probs + log_prob_from_logits(logit_probs) if reduce: return -tf.reduce_sum(log_sum_exp(log_probs)) else: return -log_sum_exp(log_probs) def sample_from_discretized_mix_logistic(y, log_scale_min=float(np.log(1e-14))): """ :param y: B x T x C :param log_scale_min: :return: [-1, 1] """ # 아래 코드에서 2번의 uniform random sampling이 있는데, 한번은 Gumbel distribution으로 부터 sampling을 위한 것이고, 또 한번은 logistic distribution을 위한 것이다. y_shape = y.get_shape().as_list() assert len(y_shape) == 3 assert y_shape[2] % 3 == 0 nr_mix = y_shape[2] // 3 logit_probs = y[:, :, :nr_mix] # u: random_uniform --> -log(-log(u)): standard Gumbel random sample # category 결정을 위해 logit_probs(softmax 취하기 전의 값) + ( -log(-log(u)) ) ---> argmax를 취하면 category가 결정된다. sel = tf.one_hot(tf.argmax(logit_probs - tf.log(-tf.log(tf.random_uniform(tf.shape(logit_probs), minval=1e-5, maxval=1. - 1e-5))), 2), depth=nr_mix, dtype=tf.float32) means = tf.reduce_sum(y[:, :, nr_mix:nr_mix * 2] * sel, axis=2) log_scales = tf.maximum(tf.reduce_sum(y[:, :, nr_mix * 2:nr_mix * 3] * sel, axis=2), log_scale_min) # output audio를 만들기 위해 logistic distribution으로 부터 sampling u = tf.random_uniform(tf.shape(means), minval=1e-5, maxval=1. - 1e-5) x = means + tf.exp(log_scales) * (tf.log(u) - tf.log(1. - u)) # u을 logistic distribution의 cdf의 역함수에 대입. x = tf.minimum(tf.maximum(x, -1.), 1.) 
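    # x = mu + s * (log(u) - log(1 - u)) evaluates the inverse CDF (quantile function) of the logistic
    # distribution at the uniform sample u; the clamp above then keeps the generated audio sample in [-1, 1].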
return x ================================================ FILE: wavenet/model.py ================================================ # coding: utf-8 import numpy as np import tensorflow as tf from .ops import mu_law_encode,optimizer_factory,SubPixelConvolution from .mixture import discretized_mix_logistic_loss, sample_from_discretized_mix_logistic class WaveNetModel(object): def __init__(self,batch_size,dilations,filter_width,residual_channels,dilation_channels,skip_channels,quantization_channels=2**8,out_channels=30, use_biases=False,scalar_input=False,global_condition_channels=None, global_condition_cardinality=None,local_condition_channels=80,upsample_factor=None,legacy=True,residual_legacy=True,train_mode=True,drop_rate=0.0): self.batch_size = batch_size self.dilations = dilations self.filter_width = filter_width self.residual_channels = residual_channels self.dilation_channels = dilation_channels self.quantization_channels = quantization_channels self.use_biases = use_biases self.skip_channels = skip_channels self.scalar_input = scalar_input self.global_condition_channels = global_condition_channels self.global_condition_cardinality = global_condition_cardinality self.local_condition_channels=local_condition_channels self.upsample_factor=upsample_factor self.train_mode = train_mode self.out_channels = out_channels self.legacy=legacy self.residual_legacy=residual_legacy self.drop_rate = drop_rate self.ema = tf.train.ExponentialMovingAverage(decay=0.9999) self.receptive_field = WaveNetModel.calculate_receptive_field(self.filter_width, self.dilations) @staticmethod def calculate_receptive_field(filter_width, dilations): # causal 때문에 length (T-1) + (여기서 계산되는 receptive_field만큼의 padding) --> 최종 output의 길이가 T가 된다. receptive_field = (filter_width - 1) * sum(dilations) + 1 # 마지막 +1은 causal condition 때문에 1개 자른 것의 때문에 길이가 T-1인 되기 때문에 +1을 통해서 입력과 같은 길이 T가 된다. return receptive_field def _create_causal_layer(self, input_batch): with tf.name_scope('causal_layer'): if self.scalar_input: return tf.layers.conv1d(input_batch,filters=self.residual_channels,kernel_size=1,padding='valid',dilation_rate=1,use_bias=True) else: return tf.layers.conv1d(input_batch,filters=self.residual_channels,kernel_size=1,padding='valid',dilation_rate=1,use_bias=True) def _create_queue(self): # first layer(causal layer)나 local condition은 kernel_size = 1이므로, Queue가 필요없다. with tf.variable_scope('queue'): self.dilation_queue=[] for i,d in enumerate(self.dilations): q = tf.Variable(initial_value=tf.zeros(shape=[self.batch_size,d*(self.filter_width-1)+1,self.residual_channels], dtype=tf.float32), name='dilation_queue'.format(i), trainable=False) self.dilation_queue.append(q) # restore했을 때, Dilation_Queue,Causal_Queue는 0으로 initialization해야 한다. self.queue_initializer= tf.variables_initializer(self.dilation_queue) def _create_dilation_layer(self, input_batch, layer_index, dilation,local_condition_batch,global_condition_batch): # input_batch는 train mode에서는 길이 줄어드는 것을 대비하여 padding이 되어 있다. 
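        # One residual block of the WaveNet stack:
        #  - training: left-pad by (filter_width - 1) * dilation so the dilated causal conv preserves the time length;
        #  - incremental generation: keep the last (filter_width - 1) * dilation + 1 inputs in a per-layer queue
        #    and compute the convolution as a single matmul over the queued samples;
        #  - gated activation tanh(conv_filter) * sigmoid(conv_gate), with optional global (speaker) and
        #    local (mel) conditioning added through 1x1 convs;
        #  - two 1x1 convs produce the residual output (input to the next block) and the skip contribution.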
with tf.variable_scope('dilation_layer'): residual = input_batch if self.train_mode: # padding padding = (self.filter_width - 1)*dilation input_batch = tf.pad(input_batch, tf.constant([(0, 0), (padding, 0), (0, 0)])) else: self.dilation_queue[layer_index] = tf.scatter_update(self.dilation_queue[layer_index],tf.range(self.batch_size),tf.concat([self.dilation_queue[layer_index][:,1:,:],input_batch],axis=1) ) input_batch = self.dilation_queue[layer_index] input_batch = tf.layers.dropout(input_batch,rate=self.drop_rate,training=self.train_mode) dilation_layer = tf.layers.Conv1D(filters=self.dilation_channels*2,kernel_size=self.filter_width,dilation_rate=dilation,padding='valid',use_bias=self.use_biases,name='conv_filter_gate') if self.train_mode: conv = dilation_layer(input_batch) conv_filter, conv_gate = tf.split(conv,2,axis=-1) else: dilation_layer.build((self.batch_size,1,input_batch.shape.as_list()[-1])) # shape의 마지막만 중요함. kernel을 잡는데 마지막 차원만 사용됨 linearized_weights = tf.reshape(dilation_layer.kernel,(-1,self.dilation_channels*2)) input_batch = input_batch[:, 0::dilation, :] temp = tf.matmul(tf.reshape(input_batch,(self.batch_size,-1)), linearized_weights) if self.use_biases: temp = tf.nn.bias_add(temp, dilation_layer.bias) conv_filter, conv_gate = tf.split(tf.expand_dims(temp,1),2,axis=-1) if global_condition_batch is not None: conv_filter += tf.layers.conv1d(global_condition_batch,filters=self.dilation_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="gc_filter") conv_gate += tf.layers.conv1d(global_condition_batch,filters=self.dilation_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="gc_gate") if local_condition_batch is not None: local_filter = tf.layers.conv1d(local_condition_batch,filters=self.dilation_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="lc_filter") local_gate = tf.layers.conv1d(local_condition_batch,filters=self.dilation_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="lc_gate") conv_filter += local_filter conv_gate += local_gate out = tf.tanh(conv_filter) * tf.sigmoid(conv_gate) # The 1x1 conv to produce the residual output == FC transformed = tf.layers.conv1d(out,filters=self.residual_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="dense") # The 1x1 conv to produce the skip output skip_contribution = tf.layers.conv1d(out,filters=self.skip_channels,kernel_size=1,padding="same",use_bias=self.use_biases,name="skip") # residual + transformed: 다음 단계의 입력으로 들어감 if self.residual_legacy: out = (residual + transformed) * np.sqrt(0.5) else: out = residual + transformed return skip_contribution, out # skip_contribution: 결과값으로 쌓임. def create_upsample(self, local_condition_batch,upsample_type='SubPixel'): local_condition_batch = tf.expand_dims(local_condition_batch, [3]) # local condition batch N H W C freq_axis_kernel_size = self.filter_width # Rayhane-mamah 코드에서는 hyper parameter로 받음. frame(num_mels)에 적용되는 kernel_size임 for i in range(len(self.upsample_factor)): if upsample_type =='SubPixel': # NN_init, NN_scaler <---- hyper parameter이지만, 여기서는 True, 0.3으로 고정 # kernel_size: (3, hparams.freq_axis_kernel_size) 이렇게 되어 있는데, 왜 3인지 모르겠음. upsample_factor[i]로 대체. 
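                # Each pass through this loop stretches the mel spectrogram along the time axis by
                # upsample_factor[i]; the product of all factors is expected to match the audio hop size,
                # so the upsampled local condition provides one mel-derived vector per waveform sample.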
# freq_axis_kernel_size는 hparams에 3으로 되어 있는데, 여기서는 filter_width로 처리 <---- frame(num_mels)에 적용되는 kernel_size임 subpixel_layer = SubPixelConvolution(filters=1, kernel_size=(self.upsample_factor[i],freq_axis_kernel_size),padding='same', strides=(self.upsample_factor[i],1), NN_init=True, NN_scaler=0.3,up_layers=len(self.upsample_factor), name='SubPixelConvolution_layer_{}'.format(i)) local_condition_batch = subpixel_layer(local_condition_batch) else: local_condition_batch = tf.layers.conv2d_transpose(local_condition_batch,filters=1, kernel_size=(self.upsample_factor[i], freq_axis_kernel_size), strides=(self.upsample_factor[i],1),padding='same',use_bias=False,name='upsample_2D_{}'.format(i)) local_condition_batch = tf.nn.relu(local_condition_batch) # for debugging #local_condition_batch = tf.Print(local_condition_batch,[tf.shape(local_condition_batch),"xx{}".format(i)]) local_condition_batch = tf.squeeze(local_condition_batch, [3]) return local_condition_batch def _create_network(self, input_batch,local_condition_batch, global_condition_batch): '''Construct the WaveNet network.''' # global_condition_batch: (batch_size, 1, self.global_condition_channels) <--- 가운데 1은 크기 1짜리 data FC대신에 conv1d를 적용하기 위해 강제로 넣었다고 봐야 한다. if self.train_mode==False: self._create_queue() current_layer = input_batch # causal cut으로 길이 1이 줄어든 상태 # Pre-process the input with a regular convolution current_layer = self._create_causal_layer(current_layer) # 여전 모델에서는 길이가 줄었지만, 수정 후에는 길이 불변 # Add all defined dilation layers. outputs = None with tf.variable_scope('dilated_stack'): for layer_index, dilation in enumerate(self.dilations): # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512] with tf.variable_scope('layer{}'.format(layer_index)): output, current_layer = self._create_dilation_layer(current_layer, layer_index, dilation,local_condition_batch,global_condition_batch) if outputs is None: outputs = output else: outputs = outputs + output if self.legacy: outputs = outputs * np.sqrt(0.5) with tf.name_scope('postprocessing'): # Perform (+) -> ReLU -> 1x1 conv -> ReLU -> 1x1 conv to # postprocess the output. transformed1 = tf.nn.relu(outputs) conv1 = tf.layers.conv1d(transformed1,filters=self.skip_channels,kernel_size=1,padding="same",use_bias=self.use_biases) transformed2 = tf.nn.relu(conv1) if self.scalar_input: conv2 = tf.layers.conv1d(transformed2,filters=self.out_channels,kernel_size=1,padding="same",use_bias=self.use_biases) else: conv2 = tf.layers.conv1d(transformed2,filters=self.quantization_channels,kernel_size=1,padding="same",use_bias=self.use_biases) return conv2 def _one_hot(self, input_batch): '''One-hot encodes the waveform amplitudes. This allows the definition of the network as a categorical distribution over a finite set of possible amplitudes. ''' with tf.name_scope('one_hot_encode'): encoded = tf.one_hot(input_batch, depth=self.quantization_channels, dtype=tf.float32) # (1, ?, 1) --> (1, ?, 1, 256) shape = [self.batch_size, -1, self.quantization_channels] encoded = tf.reshape(encoded, shape) # (1, ?, 1, 256) --> (1, ?, 256) return encoded def _embed_gc(self, global_condition): # global_condition = global_condition_batch <---- data '''Returns embedding for global condition. :param global_condition: Either ID of global condition for tf.nn.embedding_lookup or actual embedding. The latter is experimental. 
:return: Embedding or None ''' # global_condition: (N,) # self.global_condition_cardinality가 None이 아니며, global_condition 은 gc id이면 되고, 그렇지 않으면, global_condition은 embedding vector가 넘어와야 한다. embedding = None if self.global_condition_cardinality is not None: # Only lookup the embedding if the global condition is presented # as an integer of mutually-exclusive categories ... embedding_table = tf.get_variable('gc_embedding', [self.global_condition_cardinality, self.global_condition_channels], dtype=tf.float32,initializer=tf.contrib.layers.xavier_initializer(uniform=False)) # (2, 32) embedding = tf.nn.embedding_lookup(embedding_table,global_condition) elif global_condition is not None: # ... else the global_condition (if any) is already provided # as an embedding. # In this case, the number of global_embedding channels must be # equal to the the last dimension of the global_condition tensor. gc_batch_rank = len(global_condition.get_shape()) dims_match = (global_condition.get_shape()[gc_batch_rank - 1] == self.global_condition_channels) if not dims_match: raise ValueError('Shape of global_condition {} does not match global_condition_channels {}.'.format(global_condition.get_shape(), self.global_condition_channels)) embedding = global_condition if embedding is not None: embedding = tf.reshape(embedding,[self.batch_size, 1, self.global_condition_channels]) return embedding def predict_proba_incremental(self, waveform,upsampled_local_condition=None, global_condition=None,name='wavenet'): """ local_condition: upsampled local condition """ with tf.variable_scope(name,reuse=tf.AUTO_REUSE): if self.scalar_input: encoded = tf.reshape(waveform , [self.batch_size, -1, 1]) # (N,1,1) else: encoded = tf.one_hot(waveform, self.quantization_channels) encoded = tf.reshape(encoded, [self.batch_size,-1, self.quantization_channels]) # encoded shape=(N,1, 256) gc_embedding = self._embed_gc(global_condition) # --> shape=(1, 1, 32) # local condition if upsampled_local_condition is not None: upsampled_local_condition = tf.reshape(upsampled_local_condition , [self.batch_size, -1, self.local_condition_channels]) raw_output = self._create_network(encoded,upsampled_local_condition,gc_embedding) # 이것이 fast generation algorithm의 핵심 --> (batch_size, 1, 256) if self.scalar_input: out = tf.reshape(raw_output, [self.batch_size, -1, self.out_channels]) proba = sample_from_discretized_mix_logistic(out) else: out = tf.reshape(raw_output, [self.batch_size, self.quantization_channels]) proba = tf.cast(tf.nn.softmax(tf.cast(out, tf.float64)), tf.float32) return proba def add_loss(self, input_batch,local_condition=None, global_condition_batch=None, l2_regularization_strength=None,upsample_type=None, name='wavenet'): '''Creates a WaveNet network and returns the autoencoding loss. The variables are all scoped to the given name. ''' with tf.variable_scope(name): # We mu-law encode and quantize the input audioform. # quantization_channels 크기의 one hot encoding을 적용한 예정. 16bit= 65536개였다면, quantization_channels로 줄이는 효과가 있다. # mu law encoding은 bit를 단순히 줄이는 것보다 advanced된 방식으로 줄인다. 
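            # mu-law companding: f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), then the [-1, 1] result is
            # mapped to integer classes 0 .. quantization_channels-1; small amplitudes get proportionally
            # more levels than a plain linear 8-bit quantization would give them.
            # Illustrative values (mu = 255): x = -1.0 -> class 0, x = 0.0 -> class 128, x = 1.0 -> class 255.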
# input_batch: (batch_size,?,1) <-- 마지막 1은 channel 1을 의미 encoded_input = mu_law_encode(input_batch, self.quantization_channels) # "quantization_channels": 256 ---> (batch_size, ?, 1) gc_embedding = self._embed_gc(global_condition_batch) # (self.batch_size, 1, self.global_condition_channels) <--- 가운데 1은 강제로 reshape encoded = self._one_hot(encoded_input) # (1, ?, quantization_channels=256) if self.scalar_input: network_input = tf.reshape( tf.cast(input_batch, tf.float32), [self.batch_size, -1, 1]) else: network_input = encoded # Cut off the last sample of network input to preserve causality. network_input_width = tf.shape(network_input)[1] - 1 if self.scalar_input: input = tf.slice(network_input, [0, 0, 0], [-1, network_input_width,1]) else: input = tf.slice(network_input, [0, 0, 0], [-1, network_input_width, self.quantization_channels]) # local condition if local_condition is not None: local_condition = self.create_upsample(local_condition,upsample_type) local_condition = tf.slice(local_condition, [0, 0, 0], [-1, network_input_width,self.local_condition_channels]) raw_output = self._create_network(input,local_condition, gc_embedding) # (batch_size, ?, quantization_channels=256) , (batch_size, 1, self.global_condition_channels) with tf.name_scope('loss'): # Cut off the samples corresponding to the receptive field # for the first predicted sample. # scalar input인 경우에도 target은 mu-law companding된 것이 된다. target_output = tf.slice(network_input , [0, 1, 0],[-1, -1, -1]) # [-1,-1,-1] --> 나머지 모두 if self.scalar_input: loss = discretized_mix_logistic_loss(raw_output, target_output,num_class=2**16, reduce=False) reduced_loss = tf.reduce_mean(loss) else: # 3 dim array의 loss를 계산학 위해, 2 dim으로 변환한다. batch와 time 부분을 합쳐서 2dim으로 변환 target_output = tf.reshape(target_output, [-1, self.quantization_channels]) prediction = tf.reshape(raw_output, [-1, self.quantization_channels]) loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=target_output) reduced_loss = tf.reduce_mean(loss) tf.summary.scalar('loss', reduced_loss) if l2_regularization_strength is None: self.loss = reduced_loss else: # L2 regularization for all trainable parameters l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if not('bias' in v.name)]) # Add the regularization term to the loss total_loss = (reduced_loss + l2_regularization_strength * l2_loss) tf.summary.scalar('l2_loss', l2_loss) tf.summary.scalar('total_loss', total_loss) self.loss = total_loss def add_optimizer(self, hparams,global_step): '''Adds optimizer to the graph. Supposes that initialize function has already been called. ''' with tf.variable_scope('optimizer'): hp = hparams learning_rate = tf.train.exponential_decay(hp.wavenet_learning_rate, global_step,hp.wavenet_decay_steps,hp.wavenet_decay_rate) #Adam optimization self.learning_rate = learning_rate optimizer = tf.train.AdamOptimizer(learning_rate) gradients, variables = zip(*optimizer.compute_gradients(self.loss)) # len(tf.trainable_variables()) = len(variables) self.gradients = gradients #Gradients clipping if hp.wavenet_clip_gradients: # Rayhane-mamah는 tf.clip_by_norm -> tf.clip_by_value 두 단계를 적용. 
여기서는 tf.clip_by_global_norm clipped_gradients, _ = tf.clip_by_global_norm(gradients, 1) # tf.clip_by_global_norm vs tf.clip_by_norm else: clipped_gradients = gradients with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): adam_optimize = optimizer.apply_gradients(zip(clipped_gradients, variables),global_step=global_step) #Add exponential moving average #https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage #Use adam optimization process as a dependency with tf.control_dependencies([adam_optimize]): #Create the shadow variables and add ops to maintain moving averages #Also updates moving averages after each update step #This is the optimize call instead of traditional adam_optimize one. assert tuple(tf.trainable_variables()) == variables #Verify all trainable variables are being averaged self.optimize = self.ema.apply(variables) ================================================ FILE: wavenet/ops.py ================================================ # coding: utf-8 import tensorflow as tf import numpy as np def create_adam_optimizer(learning_rate, momentum): return tf.train.AdamOptimizer(learning_rate=learning_rate, epsilon=1e-4) def create_sgd_optimizer(learning_rate, momentum): return tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=momentum) def create_rmsprop_optimizer(learning_rate, momentum): return tf.train.RMSPropOptimizer(learning_rate=learning_rate, momentum=momentum, epsilon=1e-5) optimizer_factory = {'adam': create_adam_optimizer, 'sgd': create_sgd_optimizer, 'rmsprop': create_rmsprop_optimizer} def mu_law_encode(audio, quantization_channels): '''Quantizes waveform amplitudes.''' with tf.name_scope('encode'): mu = tf.to_float(quantization_channels - 1) # Perform mu-law companding transformation (ITU-T, 1988). # Minimum operation is here to deal with rare large amplitudes caused # by resampling. safe_audio_abs = tf.minimum(tf.abs(audio), 1.0) magnitude = tf.log1p(mu * safe_audio_abs) / tf.log1p(mu) # tf.log1p(x) = log(1+x) signal = tf.sign(audio) * magnitude # Quantize signal to the specified number of levels. return tf.to_int32((signal + 1) / 2 * mu + 0.5) def mu_law_decode(output, quantization_channels, quantization=True): '''Recovers waveform from quantized values.''' with tf.name_scope('decode'): mu = quantization_channels - 1 # Map values back to [-1, 1]. if quantization: signal = 2 * (tf.to_float(output) / mu) - 1 else: signal = output # Perform inverse of mu-law transformation. magnitude = (1 / mu) * ((1 + mu)**abs(signal) - 1) return tf.sign(signal) * magnitude class SubPixelConvolution(tf.layers.Conv2D): '''Sub-Pixel Convolutions are vanilla convolutions followed by Periodic Shuffle. They serve the purpose of upsampling (like deconvolutions) but are faster and less prone to checkerboard artifact with the right initialization. In contrast to ResizeConvolutions, SubPixel have the same computation speed (when using same n° of params), but a larger receptive fields as they operate on low resolution. ''' def __init__(self, filters, kernel_size, padding, strides, NN_init, NN_scaler, up_layers, name=None, **kwargs): #Output channels = filters * H_upsample * W_upsample conv_filters = filters * strides[0] * strides[1] #Create initial kernel self.NN_init = NN_init self.up_layers = up_layers self.NN_scaler = NN_scaler init_kernel = tf.constant_initializer(self._init_kernel(kernel_size, strides, conv_filters), dtype=tf.float32) if NN_init else None #Build convolution component and save Shuffle parameters. 
        super(SubPixelConvolution, self).__init__(
            filters=conv_filters,
            kernel_size=kernel_size,
            strides=(1, 1),
            padding=padding,
            kernel_initializer=init_kernel,
            bias_initializer=tf.zeros_initializer(),
            data_format='channels_last',
            name=name,
            **kwargs)

        self.out_filters = filters
        self.shuffle_strides = strides
        self.scope = name if name is not None else 'SubPixelConvolution'  # fall back to a default scope name

    def build(self, input_shape):
        '''Build SubPixel initial weights (ICNR: avoid checkerboard artifacts).

        To ensure a checkerboard-free SubPixel Conv, the initial weights must make the subpixel conv
        equivalent to conv -> NN resize. To do that, we replace the initial kernel with the special
        kernel W_n == W_0 for all n <= out_channels. In other words, we want our initial kernel to
        extract feature maps and then apply nearest-neighbor upsampling. NN upsampling is guaranteed
        to happen when we force all output channels to be equal (neighbor pixels are duplicated).
        We can think of this as limiting our initial subpixel conv to a low-resolution conv (1 channel)
        followed by a duplication (made by PS).
        Ref: https://arxiv.org/pdf/1707.02937.pdf
        '''
        #Initialize layer
        super(SubPixelConvolution, self).build(input_shape)

        if not self.NN_init:
            #If no NN init is used, ensure all channel-wise parameters are equal.
            self.built = False

            #Get W_0 which is the first filter of the first output channels
            W_0 = tf.expand_dims(self.kernel[:, :, :, 0], axis=3)  #[H_k, W_k, in_c, 1]

            #Tile W_0 across all output channels and replace original kernel
            self.kernel = tf.tile(W_0, [1, 1, 1, self.filters])  #[H_k, W_k, in_c, out_c]

            self.built = True

    def call(self, inputs):
        with tf.variable_scope(self.scope) as scope:
            #Inputs are supposed [batch_size, freq, time_steps, channels]
            convolved = super(SubPixelConvolution, self).call(inputs)  #[batch_size, up_freq, up_time_steps, channels]
            return self.PS(convolved)

    def PS(self, inputs):
        #Get different shapes
        #[batch_size, H, W, C(out_c * r1 * r2)]
        batch_size = tf.shape(inputs)[0]
        H = tf.shape(inputs)[1]
        W = tf.shape(inputs)[2]
        C = inputs.shape[-1]
        r1, r2 = self.shuffle_strides  #supposing strides = (freq_stride, time_stride)
        out_c = self.out_filters  #number of filters as output of the convolution (usually 1 for this model)
        assert C == r1 * r2 * out_c

        #Split and shuffle (output) channels separately.
(Split-Concat block) Xc = tf.split(inputs, out_c, axis=3) # out_c x [batch_size, H, W, C/out_c] outputs = tf.concat([self._phase_shift(x, batch_size, H, W, r1, r2) for x in Xc], 3) #[batch_size, r1 * H, r2 * W, out_c] with tf.control_dependencies([tf.assert_equal(out_c, tf.shape(outputs)[-1]), tf.assert_equal(H * r1, tf.shape(outputs)[1])]): outputs = tf.identity(outputs, name='SubPixelConv_output_check') return tf.reshape(outputs, [tf.shape(outputs)[0], r1 * H, tf.shape(outputs)[2], out_c]) def _phase_shift(self, inputs, batch_size, H, W, r1, r2): #Do a periodic shuffle on each output channel separately x = tf.reshape(inputs, [batch_size, H, W, r1, r2]) #[batch_size, H, W, r1, r2] #Width dim shuffle x = tf.transpose(x, [4, 2, 3, 1, 0]) #[r2, W, r1, H, batch_size] x = tf.batch_to_space_nd(x, [r2], [[0, 0]]) #[1, r2*W, r1, H, batch_size] x = tf.squeeze(x, [0]) #[r2*W, r1, H, batch_size] #Height dim shuffle x = tf.transpose(x, [1, 2, 0, 3]) #[r1, H, r2*W, batch_size] x = tf.batch_to_space_nd(x, [r1], [[0, 0]]) #[1, r1*H, r2*W, batch_size] x = tf.transpose(x, [3, 1, 2, 0]) #[batch_size, r1*H, r2*W, 1] return x def _init_kernel(self, kernel_size, strides, filters): '''Nearest Neighbor Upsample (Checkerboard free) init kernel size ''' overlap = kernel_size[1] // strides[1] init_kernel = np.zeros(kernel_size, dtype=np.float32) i = kernel_size[1] // 2 j = [kernel_size[0] // 2 - 1, kernel_size[0] // 2] if kernel_size[0] % 2 == 0 else [kernel_size[0] // 2] for j_i in j: init_kernel[j_i,i] = 1. / max(overlap, 1.) if kernel_size[1] % 2 == 0 else 1. init_kernel = np.tile(np.expand_dims(init_kernel, 2), [1, 1, 1, filters]) return init_kernel * (self.NN_scaler)**(1/self.up_layers) ================================================ FILE: 명령어모음.txt ================================================ python preprocess.py --num_workers 10 --name son --in_dir .\datasets\son --out_dir .\data\son python preprocess.py --num_workers 10 --name moon --in_dir .\datasets\moon --out_dir .\data\moon python train_tacotron2.py python train_vocoder.py python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다" python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다" python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-27T20-27-18 python generate.py --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-27T20-27-18 python generate.py --mel ./logdir-wavenet/moon-Aust.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-27T20-27-18 python generate.py --mel ./logdir-wavenet/son-Aust.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-27T20-27-18