[
  {
    "path": "LICENSE",
    "content": "MIT License\r\n\r\nCopyright (c) 2018 Heecheol Cho\r\n\r\nPermission is hereby granted, free of charge, to any person obtaining a copy\r\nof this software and associated documentation files (the \"Software\"), to deal\r\nin the Software without restriction, including without limitation the rights\r\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\r\ncopies of the Software, and to permit persons to whom the Software is\r\nfurnished to do so, subject to the following conditions:\r\n\r\nThe above copyright notice and this permission notice shall be included in all\r\ncopies or substantial portions of the Software.\r\n\r\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\r\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\r\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\r\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\r\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\r\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\r\nSOFTWARE."
  },
  {
    "path": "ReadMe.md",
    "content": "# Multi-Speaker Tocotron2 + Wavenet Vocoder + Korean TTS\r\nTacotron2 모델과 Wavenet Vocoder를 결합하여  한국어 TTS구현하는 project입니다.\r\nTacotron2 모델을 Multi-Speaker모델로 확장했습니다.\r\n\r\nBased on \r\n- https://github.com/keithito/tacotron\r\n- https://github.com/carpedm20/multi-speaker-tacotron-tensorflow\r\n- https://github.com/Rayhane-mamah/Tacotron-2\r\n- https://github.com/hccho2/Tacotron-Wavenet-Vocoder\r\n\r\n\r\n## Tacotron 2\r\n- Tacotron 모델에 관한 설명은 이전 [repo](https://github.com/hccho2/Tacotron-Wavenet-Vocoder) 참고하시면 됩니다.\r\n- [Tacotron2](https://arxiv.org/abs/1712.05884)에서는 모델 구조도 바뀌었고, Location Sensitive Attention, Stop Token, Vocoder로 Wavenet을 제안하고 있다.\r\n- Tacotron2의 대표적인 구현은 [Rayhane-mamah](https://github.com/Rayhane-mamah/Tacotron-2)입니다. 이 역시, [keithito](https://github.com/keithito/tacotron), [r9y9](https://github.com/r9y9/wavenet_vocoder)의 코드를 기반으로 발전된 것이다.\r\n\r\n## This Project\r\n* Tacotron2 모델로 한국어 TTS를 만드는 것이 목표입니다.\r\n* [Rayhane-mamah](https://github.com/Rayhane-mamah/Tacotron-2)의 구현은 Customization된 Layer를 많이 사용했는데, 제가 보기에는 너무 복잡하게 한 것 같아, Cumomization Layer를 많이 줄이고, Tensorflow에 구현되어 있는 Layer를 많이 활용했습니다.\r\n* teacher forcing 방식의 train sample은 2000 step부터, free forcing 방식의 test sample은 3000 step부터 알아들을 수 있는 정도의 음성을 만들기 시작합니다.\r\n## 단계별 실행\r\n\r\n### 실행 순서\r\n- Data 생성: 한국어 data의 생성은 이전 [repo](https://github.com/hccho2/Tacotron-Wavenet-Vocoder) 참고하시면 됩니다.\r\n- 생성된 Data는 아래의 'data_paths'에 지정하면 된다.\r\n- tacotron training 후, synthesize.py로 test.\r\n- wavenet training 후, generate.py로 test(tacotron이 만들지 않은 mel spectrogram으로 test할 수도 있고, tacotron이 만든 mel spectrogram을 사용할 수도 있다.)\r\n- 2개 모델 모두 train 후, tacotron에서 생성한 mel spectrogram을 wavent에 local condition으로 넣어 test하면 된다.\r\n\r\n\r\n### Tacotron2 Training\r\n- train_tacotron2.py 내에서 '--data_paths'를 지정한 후, train할 수 있다. 
data_path는 여러개의 데이터 디렉토리를 지정할 수 있습니다.\r\n```\r\nparser.add_argument('--data_paths', default='.\\\\data\\\\moon,.\\\\data\\\\son')\r\n```\r\n- train을 이어서 계속하는 경우에는 '--load_path'를 지정해 주면 된다.\r\n```\r\nparser.add_argument('--load_path', default='logdir-tacotron2/moon+son_2019-02-27_00-21-42')\r\n```\r\n\r\n- model_type은 'single' 또는 ' multi-speaker'로 지정할 수 있다. speaker가 1명 일 때는, hparams의 model_type = 'single'로 하고 train_tacotron2.py 내에서 '--data_paths'를 1개만 넣어주면 된다.\r\n```\r\nparser.add_argument('--data_paths', default='D:\\\\Tacotron2\\\\data\\\\moon')\r\n```\r\n- 하이퍼파라메터를 hparmas.py에서 argument를 train_tacotron2.py에서 다 설정했기 때문에, train 실행은 다음과 같이 단순합니다.\r\n> python train_tacotron2.py\r\n- train 후, 음성을 생성하려면 다음과 같이 하면 된다. '--num_speaker', '--speaker_id'는 잘 지정되어야 한다.\r\n> python synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text \"오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다.\" \r\n\r\n\r\n\r\n### Wavenet Vocoder Training\r\n- train_vocoder.py 내에서 '--data_dir'를 지정한 후, train할 수 있다.\r\n- memory 부족으로 training 되지 않거나 너무 느리면, hyper paramerter 중 sample_size를 줄이면 된다. 물론 batch_size를 줄일 수도 있다.\r\n```\r\nDATA_DIRECTORY =  'D:\\\\Tacotron2\\\\data\\\\moon,D:\\\\Tacotron2\\\\data\\\\son'\r\nparser.add_argument('--data_dir', type=str, default=DATA_DIRECTORY, help='The directory containing data')\r\n```\r\n- train을 이어서 계속하는 경우에는 '--logdir'를 지정해 주면 된다.\r\n```\r\nLOGDIR = './/logdir-wavenet//train//2018-12-21T22-58-10'\r\nparser.add_argument('--logdir', type=str, default=LOGDIR)\r\n```\r\n- wavenet train 후, tacotron이 생성한 mel spectrogram(npy파일)을 local condition으로 넣어서 TTS의 최종 결과를 얻을 수 있다.\r\n> python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10\r\n\r\n### Result\r\n- Tacotron의 batch_size = 32, Wavenet의 batch_size=8. 
GTX 1080ti.\r\n- Tacotron은 step 100K, Wavenet은 177K 만큼 train.\r\n- samples 디렉토리에는 생성된 wav파일이 있다.\r\n- Griffin-Lim으로 생성된 것과 Wavenet Vocoder로 생성된 sample이 있다.\r\n- Wavenet으로 생성된 음성은 train 부족으로 잡음이 섞여있다.\r\n\r\n\r\n"
  },
  {
    "path": "datasets/__init__.py",
    "content": "# -*- coding: utf-8 -*-\n\nfrom .datafeeder_wavenet import DataFeederWavenet"
  },
  {
    "path": "datasets/datafeeder_tacotron2.py",
    "content": "# coding: utf-8\nimport os\nimport time\nimport pprint\nimport random\nimport threading\nimport traceback\nimport numpy as np\nfrom glob import glob\nimport tensorflow as tf\nfrom collections import defaultdict\n\nimport text\nfrom utils.infolog import log\nfrom utils import parallel_run, remove_file\nfrom utils.audio import frames_to_hours\n\n\n\n_pad = 0\n_stop_token_pad = 1\ndef get_frame(path):\n    data = np.load(path)\n    n_frame = data[\"linear\"].shape[0]\n    n_token = len(data[\"tokens\"])\n    return (path, n_frame, n_token)\n\ndef get_path_dict(data_dirs, hparams, config,data_type, n_test=None,rng=np.random.RandomState(123)):\n\n    # Load metadata:\n    path_dict = {}\n    for data_dir in data_dirs:  # ['datasets/moon\\\\data']\n        paths = glob(\"{}/*.npz\".format(data_dir)) # ['datasets/moon\\\\data\\\\001.0000.npz', 'datasets/moon\\\\data\\\\001.0001.npz', 'datasets/moon\\\\data\\\\001.0002.npz', ...]\n\n        if data_type == 'train':\n            rng.shuffle(paths)  # ['datasets/moon\\\\data\\\\012.0287.npz', 'datasets/moon\\\\data\\\\004.0215.npz', 'datasets/moon\\\\data\\\\003.0149.npz', ...]\n\n        if not config.skip_path_filter:\n            items = parallel_run( get_frame, paths, desc=\"filter_by_min_max_frame_batch\", parallel=True)  # [('datasets/moon\\\\data\\\\012.0287.npz', 130, 21), ('datasets/moon\\\\data\\\\003.0149.npz', 209, 37), ...]\n\n            min_n_frame = hparams.min_n_frame   # 5*30\n            max_n_frame =  hparams.max_n_frame - 1  # 5*200 - 5\n            \n            # 다음 단계에서 data가 많이 떨어져 나감. 
글자수가 짧은 것들이 탈락됨.\n            new_items = [(path, n) for path, n, n_tokens in items if min_n_frame <= n <= max_n_frame and n_tokens >= hparams.min_tokens] # [('datasets/moon\\\\data\\\\004.0383.npz', 297), ('datasets/moon\\\\data\\\\003.0533.npz', 394),...]\n\n\n            new_paths = [path for path, n in new_items]\n            new_n_frames = [n for path, n in new_items]\n\n            hours = frames_to_hours(new_n_frames,hparams)\n\n            log(' [{}] Loaded metadata for {} examples ({:.2f} hours)'.format(data_dir, len(new_n_frames), hours))\n            log(' [{}] Max length: {}'.format(data_dir, max(new_n_frames)))\n            log(' [{}] Min length: {}'.format(data_dir, min(new_n_frames)))\n        else:\n            new_paths = paths\n\n        # train용 data와 test용 data로 나눈다.\n        if data_type == 'train':\n            new_paths = new_paths[:-n_test] # 끝에 있는 n_test(batch_size)를 제외한 모두\n        elif data_type == 'test':\n            new_paths = new_paths[-n_test:] # 끝에 있는 n_test\n        else:\n            raise Exception(\" [!] 
Unkown data_type: {}\".format(data_type))\n\n        path_dict[data_dir] = new_paths  # ['datasets/moon\\\\data\\\\001.0621.npz', 'datasets/moon\\\\data\\\\003.0229.npz', ...]\n\n    return path_dict\n\n\n# run -> _enqueue_next_group -> _get_next_example\nclass DataFeederTacotron2(threading.Thread):\n    '''Feeds batches of data into a queue on a background thread.'''\n\n    def __init__(self, coordinator, data_dirs,hparams, config, batches_per_group, data_type, batch_size):  #batches_per_group = 32 or 8,  data_type: 'train' or 'test'\n        super(DataFeederTacotron2, self).__init__()\n\n        self._coord = coordinator\n        self._hp = hparams\n        self._cleaner_names = [x.strip() for x in hparams.cleaners.split(',')]\n        self._step = 0\n        self._offset = defaultdict(lambda: 2)\n        self._batches_per_group = batches_per_group\n\n        self.rng = np.random.RandomState(config.random_seed)   # random number generator\n        self.data_type = data_type\n        self.batch_size = batch_size\n\n        self.min_tokens = hparams.min_tokens  # 30\n        self.min_n_frame = hparams.min_n_frame   # 5*30\n        self.max_n_frame = hparams.max_n_frame - 1  # 5*200 - 5\n        self.skip_path_filter = config.skip_path_filter\n\n        # Load metadata:\n        self.path_dict = get_path_dict(data_dirs, self._hp, config, self.data_type,n_test=self.batch_size, rng=self.rng) # data_dirs: ['datasets/moon\\\\data']\n\n        self.data_dirs = list(self.path_dict.keys()) # ['datasets/moon\\\\data']\n        self.data_dir_to_id = {data_dir: idx for idx, data_dir in enumerate(self.data_dirs)}  # {'datasets/moon\\\\data': 0}\n\n        data_weight = {data_dir: 1. 
for data_dir in self.data_dirs} # {'datasets/moon\\\\data': 1.0}\n\n        if self._hp.main_data_greedy_factor > 0 and any(main_data in data_dir for data_dir in self.data_dirs for main_data in self._hp.main_data):   # 'main_data': ['']\n            for main_data in self._hp.main_data:\n                for data_dir in self.data_dirs:\n                    if main_data in data_dir:\n                        data_weight[data_dir] += self._hp.main_data_greedy_factor\n\n        weight_Z = sum(data_weight.values())  # 1\n        self.data_ratio = { data_dir: weight / weight_Z for data_dir, weight in data_weight.items()}  # normalize so the data weights sum to 1\n\n        log(\"=\"*40)\n        log('Data Amount:')\n        log(pprint.pformat(self.data_ratio, indent=4))\n        log(\"=\"*40)\n\n        #audio_paths = [path.replace(\"/data/\", \"/audio/\").replace(\".npz\", \".wav\") for path in self.data_paths]\n        #duration = get_durations(audio_paths, print_detail=False)\n\n        # Create placeholders for inputs and targets. 
Don't specify batch size because we want to\n        # be able to feed different sized batches at eval time.\n\n        self._placeholders = [\n            tf.placeholder(tf.int32, [None, None], 'inputs'),\n            tf.placeholder(tf.int32, [None], 'input_lengths'),\n            tf.placeholder(tf.float32, [None], 'loss_coeff'),\n            tf.placeholder(tf.float32, [None, None, hparams.num_mels], 'mel_targets'),\n            tf.placeholder(tf.float32, [None, None, hparams.num_freq], 'linear_targets'),\n            tf.placeholder(tf.float32, [None, None], 'stop_token_targets')\n        ]\n\n        # Create queue for buffering data:\n        dtypes = [tf.int32, tf.int32, tf.float32, tf.float32, tf.float32, tf.float32]\n\n        self.is_multi_speaker = len(self.data_dirs) > 1\n\n        if self.is_multi_speaker:\n            self._placeholders.append( tf.placeholder(tf.int32, [None], 'speaker_id'),)\n            dtypes.append(tf.int32)\n\n        num_worker = 8 if self.data_type == 'train' else 1\n        queue = tf.FIFOQueue(num_worker, dtypes, name='input_queue')\n\n        self._enqueue_op = queue.enqueue(self._placeholders)\n\n        if self.is_multi_speaker:\n            self.inputs, self.input_lengths, self.loss_coeff, self.mel_targets, self.linear_targets,self.stop_token_targets, self.speaker_id = queue.dequeue()\n        else:\n            self.inputs, self.input_lengths, self.loss_coeff, self.mel_targets, self.linear_targets,self.stop_token_targets = queue.dequeue()\n\n        self.inputs.set_shape(self._placeholders[0].shape)\n        self.input_lengths.set_shape(self._placeholders[1].shape)\n        self.loss_coeff.set_shape(self._placeholders[2].shape)\n        self.mel_targets.set_shape(self._placeholders[3].shape)\n        self.linear_targets.set_shape(self._placeholders[4].shape)\n        self.stop_token_targets.set_shape(self._placeholders[5].shape)\n\n        if self.is_multi_speaker:\n            
self.speaker_id.set_shape(self._placeholders[6].shape)\n        else:\n            self.speaker_id = None\n\n        if self.data_type == 'test':\n            examples = []\n            while True:\n                for data_dir in self.data_dirs:\n                    examples.append(self._get_next_example(data_dir))\n                    #print(data_dir, text.sequence_to_text(examples[-1][0], False, True))\n                    if len(examples) >= self.batch_size:\n                        break\n                if len(examples) >= self.batch_size:\n                    break\n            \n            # at test time, keep reusing the same examples\n            self.static_batches = [examples for _ in range(self._batches_per_group)]  # [examples, examples,...,examples] <--- each example contains 2 data items.\n\n        else:\n            self.static_batches = None\n\n    def start_in_session(self, session, start_step):\n        self._step = start_step\n        self._session = session\n        self.start()\n\n\n    def run(self):\n        try:\n            while not self._coord.should_stop():\n                self._enqueue_next_group()\n        except Exception as e:\n            traceback.print_exc()\n            self._coord.request_stop(e)\n\n\n    def _enqueue_next_group(self):\n        start = time.time()\n\n        # Read a group of examples:\n        n = self.batch_size   # 32\n        r = self._hp.reduction_factor  #  4 or 5; also used to compute min_n_frame and max_n_frame\n\n        if self.static_batches is not None:  # 'test' uses static_batches, which were already built in __init__.\n            batches = self.static_batches\n        else: # 'train'\n            examples = []\n            for data_dir in self.data_dirs:\n                if self._hp.initial_data_greedy:\n                    if self._step < self._hp.initial_phase_step and any(\"krbook\" in data_dir for data_dir in self.data_dirs):\n                        data_dir = [data_dir for data_dir in self.data_dirs if \"krbook\" in data_dir][0]\n\n                if self._step < self._hp.initial_phase_step:  # 'initial_phase_step': 8000\n                    example = [self._get_next_example(data_dir) for _ in range(int(n * self._batches_per_group // len(self.data_dirs)))]  # builds batch data for _batches_per_group (8 or 32) batches; each batch has size 2 or 32\n                else:\n                    example = [self._get_next_example(data_dir) for _ in range(int(n * self._batches_per_group * self.data_ratio[data_dir]))]\n                examples.extend(example)\n            examples.sort(key=lambda x: x[-1])  # the last element is len(linear_target), so this sorts by it\n\n            batches = [examples[i:i+n] for i in range(0, len(examples), n)]\n            self.rng.shuffle(batches)\n\n        log('Generated %d batches of size %d in %.03f sec' % (len(batches), n, time.time() - start))\n        for batch in batches:  # batches is a list of batches.\n            # feed each batch (built for test or train mode) into the placeholders.\n            feed_dict = dict(zip(self._placeholders, _prepare_batch(batch, r, self.rng, self.data_type)))   # _prepare_batch pads the batch to uniform lengths; return order = placeholder order\n            self._session.run(self._enqueue_op, feed_dict=feed_dict)\n            self._step += 1\n\n\n    def _get_next_example(self, data_dir):\n        '''Reads and processes a single npz file. 
Loads a single example (input, mel_target, linear_target, cost) from disk'''\n        data_paths = self.path_dict[data_dir]\n\n        while True:\n            if self._offset[data_dir] >= len(data_paths):\n                self._offset[data_dir] = 0\n\n                if self.data_type == 'train':\n                    self.rng.shuffle(data_paths)\n\n            data_path = data_paths[self._offset[data_dir]]  # pick one npz file\n            self._offset[data_dir] += 1\n\n            try:\n                if os.path.exists(data_path):\n                    data = np.load(data_path)  # data contains \"linear\",\"mel\",\"tokens\",\"loss_coeff\"\n                else:\n                    continue\n            except:\n                remove_file(data_path)\n                continue\n\n            if not self.skip_path_filter:\n                break\n\n            if self.min_n_frame <= data[\"linear\"].shape[0] <= self.max_n_frame and  len(data[\"tokens\"]) > self.min_tokens:\n                break\n\n        input_data = data['tokens']   # 1-dim\n        mel_target = data['mel']\n\n        if 'loss_coeff' in data:\n            loss_coeff = data['loss_coeff']\n        else:\n            loss_coeff = 1\n        linear_target = data['linear']\n        stop_token_target = np.asarray([0.] * len(mel_target))  # mel_target is [xx,80], so len differs per example; this is [0,...,0] of that len\n        \n        \n        # If this is not multi-speaker, speaker_id would not need to be passed, but the current implementation is a bit tangled, so it is always passed.\n        if self.is_multi_speaker:\n            return (input_data, loss_coeff, mel_target, linear_target,stop_token_target, self.data_dir_to_id[data_dir], len(linear_target))\n        else:\n            return (input_data, loss_coeff, mel_target, linear_target,stop_token_target, len(linear_target))\n\n\ndef _prepare_batch(batch, reduction_factor, rng, data_type=None):\n    # (input_data, loss_coeff, mel_target, linear_target,stop_token_target, speaker_id, len(linear_target))\n    \n    if data_type == 'train':\n        rng.shuffle(batch)\n\n    # batch data: (input_data, loss_coeff, mel_target, linear_target, self.data_dir_to_id[data_dir], len(linear_target))\n    inputs = _prepare_inputs([x[0] for x in batch])  # pad every item to the length of the longest one in the batch\n    input_lengths = np.asarray([len(x[0]) for x in batch], dtype=np.int32)  # batch_size, [37, 37, 32, 32, 38,..., 39, 36, 30]\n    loss_coeff = np.asarray([x[1] for x in batch], dtype=np.float32)   # batch_size, [1,1,1,..., 1,1,1]\n\n    mel_targets = _prepare_targets([x[2] for x in batch], reduction_factor)  # ---> (32, 175, 80); max length rounded up to a multiple of reduction_factor\n    linear_targets = _prepare_targets([x[3] for x in batch], reduction_factor)  # ---> (32, 175, 1025); max length rounded up to a multiple of reduction_factor\n    stop_token_targets = _prepare_stop_token_targets([x[4] for x in batch], reduction_factor)\n\n    if len(batch[0]) == 7:  # when is_multi_speaker = True\n        speaker_id = np.asarray([x[5] for x in batch], dtype=np.int32)   # build a list of speaker_ids\n        return (inputs, input_lengths, loss_coeff,mel_targets, linear_targets,stop_token_targets, speaker_id)\n    else:\n        return (inputs, input_lengths, loss_coeff, mel_targets, linear_targets,stop_token_targets)  # ('inputs' 'input_lengths' 'loss_coeff' 'mel_targets' 'linear_targets')\n\n\ndef _prepare_inputs(inputs):  # inputs: a list of batch_size items\n    max_len = max((len(x) for x in inputs))\n    return np.stack([_pad_input(x, max_len) for x in inputs])  # (batch_size, max_len)\n    \"\"\"\n    when batch_size = 2:\n    [[13, 26, 13, 41, 13, 21, 13, 41, 13, 21, 13, 41,  9, 41, 13, 40,79, 14, 34, 13, 33, 79, 20, 32, 13, 35, 45,  2, 34, 42, 13, 39,7, 29, 11, 25,  1],\n    [ 6, 29, 79, 14, 26, 14, 34,  5, 29, 79,  2, 30, 45,  2, 28, 14,21, 79, 13, 27,  7, 25,  9, 34, 45, 13, 40, 79,  4, 29,  2, 29,13, 26,  1,  0,  0]]    \n    \"\"\"\n\ndef _prepare_targets(targets, alignment):\n    # targets: shape of list [ (162,80) , (172, 80), ...] \n    max_len = max((len(t) for t in targets)) + 1\n    return np.stack([_pad_target(t, _round_up(max_len, alignment)) for t in targets])\n\ndef _prepare_stop_token_targets(targets, alignment):\n    max_len = max((len(t) for t in targets)) + 1\n    return np.stack([_pad_stop_token_target(t, _round_up(max_len, alignment)) for t in targets])\n\n\ndef _pad_input(x, length):\n    return np.pad(x, (0, length - x.shape[0]), mode='constant', constant_values=_pad)\n\n\ndef _pad_target(t, length):\n    # t: 2 dim array. 
( xx, num_mels) ==> (length,num_mels)\n    return np.pad(t, [(0, length - t.shape[0]), (0,0)], mode='constant', constant_values=_pad)  # (169, 80) ==> (length, 80)\n\n###\ndef _pad_stop_token_target(t, length):\n    return np.pad(t, (0, length - t.shape[0]), mode='constant', constant_values=_stop_token_pad)\n\ndef _round_up(x, multiple):\n    remainder = x % multiple\n    return x if remainder == 0 else x + multiple - remainder\n\n\n\nif __name__ == '__main__':\n    \n    from hparams import hparams\n    import argparse\n    from utils import str2bool\n    \n    parser = argparse.ArgumentParser()\n    parser.add_argument('--random_seed', type=int, default=123)\n    parser.add_argument('--batch_size', type=int, default=4)\n    parser.add_argument('--skip_path_filter', type=str2bool, default=True, help='Use only for debugging')\n    config = parser.parse_args()\n    \n    \n    coord = tf.train.Coordinator()\n    data_dirs=['D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\moon']\n    mydatafeed =  DataFeederTacotron2(coord, data_dirs, hparams, config, 32,data_type='train', batch_size=config.batch_size)\n\n    \n    \n    with tf.Session() as sess:\n        try:\n            sess.run(tf.global_variables_initializer())\n            step = 0\n            mydatafeed.start_in_session(sess,step) \n            \n            while not coord.should_stop():\n                a,b,c,d=sess.run([mydatafeed.inputs, mydatafeed.input_lengths, mydatafeed.mel_targets,mydatafeed.stop_token_targets])\n                \n                print(a.shape,c.shape,d.shape)\n                print(step,b)\n                print('stop token:', d[0])\n                print('-'*10)\n                a,b,c=sess.run([mydatafeed.inputs, mydatafeed.input_lengths, mydatafeed.mel_targets])\n                \n                print(a.shape,c.shape)\n                print(step,b)             \n                \n                print('='*10)\n                step =  step +1\n                \n          
      if step > 3:\n                    raise Exception('End xxx')\n                \n        \n        except Exception as e:\n            print('finally')\n            print(e)\n            coord.request_stop(e)\n\n"
  },
  {
    "path": "datasets/datafeeder_wavenet.py",
    "content": "# -*- coding: utf-8 -*-\nimport sys\nsys.path.append(\"../\")\n\nimport tensorflow as tf\nimport threading\nimport random\nimport numpy as np\nimport os\nfrom utils import audio\nfrom hparams import hparams\nfrom glob import glob\nfrom collections import defaultdict\n\n\ndef get_path_dict(data_dirs, min_length):\n    path_dict = {}\n    for data_dir in data_dirs:\n        \n        if not hparams.skip_path_filter:\n        \n            with open(os.path.join(data_dir,'train.txt'), 'r', encoding='utf-8') as f:\n                lines = f.readlines()\n                new_paths = []\n                for line in lines:\n                    line = line.strip().split(\"|\")\n                    if int(line[3]) > min_length:\n                        new_paths.append(line[6])\n            \n            path_dict[data_dir] = new_paths\n        else:\n            new_paths = glob(\"{}/*.npz\".format(data_dir))\n            \n            new_paths = [os.path.basename(p) for p in new_paths]\n            path_dict[data_dir] = new_paths\n    return path_dict\n\ndef assert_ready_for_upsampling(x, c,hop_size):\n    assert len(x) % len(c) == 0 and len(x) // len(c) == hop_size\n\ndef ensure_divisible(length, divisible_by=256, lower=True):\n    if length % divisible_by == 0:\n        return length\n    if lower:\n        return length - length % divisible_by\n    else:\n        return length + (divisible_by - length % divisible_by)\n\n\nclass DataFeederWavenet(threading.Thread):\n    def __init__(self,coord,data_dirs,batch_size, gc_enable=False,test_mode=False, queue_size=8):\n        super(DataFeederWavenet, self).__init__()    \n        self.data_dirs = data_dirs\n        self.coord = coord\n        self.batch_size = batch_size\n        self.hop_size = audio.get_hop_size(hparams)\n        self.sample_size = ensure_divisible(hparams.sample_size,self.hop_size, True)\n        self.max_frames = self.sample_size // self.hop_size  # sample_size 크기를 확보하기 위해.\n        
self.queue_size = queue_size\n        self.gc_enable = gc_enable\n        self.skip_path_filter = hparams.skip_path_filter\n        self.test_mode = test_mode\n        if test_mode:\n            assert batch_size==1\n        \n        self.rng = np.random.RandomState(123)\n        self._offset = defaultdict(lambda: 2)  # key에 없는 값이 들어어면 2가 할당된다.\n        \n        self.data_dir_to_id = {data_dir: idx for idx, data_dir in enumerate(self.data_dirs)}  # data_dir <---> speaker_id 매핑\n        self.path_dict = get_path_dict(self.data_dirs,self.sample_size)# receptive_field 보다 작은 것을 버리고, 나머지만 돌려준다.\n        \n        self._placeholders = [\n            tf.placeholder(tf.float32, shape=[None,None,1],name='input_wav'),\n            tf.placeholder(tf.float32, shape=[None,None,hparams.num_mels],name='local_condition')\n        ]    \n        dtypes = [tf.float32, tf.float32]\n    \n        if self.gc_enable:\n            self._placeholders.append(tf.placeholder(tf.int32, shape=[None],name='speaker_id'))\n            dtypes.append(tf.int32)\n \n        queue = tf.FIFOQueue(self.queue_size, dtypes, name='input_queue')\n        self.enqueue = queue.enqueue(self._placeholders)\n        \n        if self.gc_enable:\n            self.inputs_wav, self.local_condition, self.speaker_id = queue.dequeue()\n        else:\n            self.inputs_wav, self.local_condition = queue.dequeue()\n\n        self.inputs_wav.set_shape(self._placeholders[0].shape)\n        self.local_condition.set_shape(self._placeholders[1].shape)\n        if self.gc_enable:\n            self.speaker_id.set_shape(self._placeholders[2].shape)\n   \n            \n    def run(self):\n        try:\n            while not self.coord.should_stop():\n                self.make_batches()\n        except Exception as e:\n            self.coord.request_stop(e)       \n    def start_in_session(self, session,start_step):\n        self._step = start_step\n        self.sess = session\n        self.start()\n              \n    def 
make_batches(self):\n        examples = []\n        n = self.batch_size\n        for data_dir in self.data_dirs:\n            example = [self._get_next_example(data_dir) for _ in range(int(n * 32 // len(self.data_dirs)))]\n            examples.extend(example)\n        self.rng.shuffle(examples)\n        batches = [examples[i:i+n] for i in range(0, len(examples), n)]\n        \n        \n        for batch in batches: # feed each batch of batch_size examples into the queue.\n            feed_dict = dict(zip(self._placeholders, _prepare_batch(batch))) \n            self.sess.run(self.enqueue, feed_dict=feed_dict)\n            self._step += 1\n    \n    def _get_next_example(self, data_dir):\n        '''Reads and processes a single npz file. Loads a single example (input_wav, local_condition, speaker_id) from disk'''\n        data_paths = self.path_dict[data_dir]\n        \n        while True:\n            if self._offset[data_dir] >= len(data_paths):\n                self._offset[data_dir] = 0\n                self.rng.shuffle(data_paths)\n            \n            data_path = os.path.join(data_dir,data_paths[self._offset[data_dir]])  # pick one npz file\n            self._offset[data_dir] += 1\n            \n            if os.path.exists(data_path):\n                data = np.load(data_path)  # data contains 'audio', 'mel', 'linear', 'time_steps', 'mel_frames', 'text', 'token'\n            else:\n                continue       \n            \n            if not self.skip_path_filter:\n                # already filtered in get_path_dict, so no need to check again here.\n                break\n            \n            # not filtered in get_path_dict, so the check is needed here.\n            if data['time_steps'] > self.sample_size or self.test_mode:\n                break\n                 \n\n        input_wav = data['audio']\n        local_condition = data['mel']\n        input_wav = input_wav.reshape(-1, 1)\n        assert_ready_for_upsampling(input_wav, local_condition,self.hop_size)\n        \n        \n        if not self.test_mode:  # test_mode uses the whole clip; train mode uses only sample_size samples\n            s = np.random.randint(0, len(local_condition) - self.max_frames+1)  # hccho\n            ts = s * self.hop_size\n            input_wav = input_wav[ts:ts + self.hop_size * self.max_frames, :]\n            local_condition = local_condition[s:s + self.max_frames, :]        \n            \n        if self.gc_enable:\n            return (input_wav,local_condition, self.data_dir_to_id[data_dir])\n        else:\n            return (input_wav,local_condition)\n\n\ndef _prepare_batch(batch):\n    input_wavs = [x[0] for x in batch]\n    local_conditions = [x[1] for x in batch]\n    if len(batch[0])==3:\n        speaker_ids = [x[2] for x in batch]\n        return (input_wavs,local_conditions,speaker_ids)\n    else:\n        return (input_wavs,local_conditions)\n        \n        \nif __name__ == '__main__':\n    coord = tf.train.Coordinator()\n    data_dirs=['D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\moon','D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\son']\n    mydatafeed =  DataFeederWavenet(coord,data_dirs,batch_size=5, gc_enable=True, queue_size=8)\n    \n    \n    with tf.Session() as sess:\n        try:\n            sess.run(tf.global_variables_initializer())\n            step = 0\n            mydatafeed.start_in_session(sess,step) \n            \n            while not coord.should_stop():\n                a,b,c=sess.run([mydatafeed.inputs_wav, mydatafeed.local_condition, mydatafeed.speaker_id])\n                \n                print(a.shape,b.shape,c.shape)\n                print(step, c)\n                \n                a,b,c=sess.run([mydatafeed.inputs_wav, mydatafeed.local_condition, mydatafeed.speaker_id])\n                \n                print(a.shape,b.shape,c.shape)\n                print(step, c)               \n                \n                \n                step =  step +1\n                \n        \n        except Exception as e:\n            print(e)\n            coord.request_stop(e)\n    "
  },
  {
    "path": "datasets/moon/moon-recognition-All.json",
    "content": "{\r\n    \"./datasets/moon/audio/003.0000.wav\": \"존경하는 독일 국민 여러분\",\r\n    \"./datasets/moon/audio/003.0001.wav\": \"고국에 계신 국민 여러분\",\r\n    \"./datasets/moon/audio/003.0002.wav\": \"하울젠 쾨르버재단 이사님과\",\r\n    \"./datasets/moon/audio/003.0003.wav\": \"모드로\",\r\n    \"./datasets/moon/audio/003.0004.wav\": \"전 동독 총리님을 비롯한\",\r\n    \"./datasets/moon/audio/003.0005.wav\": \"내외 귀빈 여러분\",\r\n    \"./datasets/moon/audio/003.0006.wav\": \"먼저 냉전과 분단을 넘어\",\r\n    \"./datasets/moon/audio/003.0007.wav\": \"통일을 이루고\",\r\n    \"./datasets/moon/audio/003.0008.wav\": \"그 힘으로 유럽통합과 국제평화를 선도하고 있는\",\r\n    \"./datasets/moon/audio/003.0009.wav\": \"독일과\",\r\n    \"./datasets/moon/audio/003.0010.wav\": \"독일 국민에게\",\r\n    \"./datasets/moon/audio/003.0011.wav\": \"무한한 경의를 표합니다\",\r\n    \"./datasets/moon/audio/003.0012.wav\": \"오늘 이 자리를 마련해 주신\",\r\n    \"./datasets/moon/audio/003.0013.wav\": \"독일 정부와 쾨르버 재단에도\",\r\n    \"./datasets/moon/audio/003.0014.wav\": \"감사드립니다\",\r\n    \"./datasets/moon/audio/003.0015.wav\": \"아울러 얼마 전 별세하신\",\r\n    \"./datasets/moon/audio/003.0016.wav\": \"고\",\r\n    \"./datasets/moon/audio/003.0017.wav\": \"헬무트 콜 총리의 가족과\",\r\n    \"./datasets/moon/audio/003.0018.wav\": \"독일 국민들에게 깊은 애도와\",\r\n    \"./datasets/moon/audio/003.0019.wav\": \"위로의 마음을 전합니다\",\r\n    \"./datasets/moon/audio/003.0020.wav\": \"대한민국은\",\r\n    \"./datasets/moon/audio/003.0021.wav\": \"냉전시기\",\r\n    \"./datasets/moon/audio/003.0022.wav\": \"어려운 환경 속에서도\",\r\n    \"./datasets/moon/audio/003.0023.wav\": \"적극적이고\",\r\n    \"./datasets/moon/audio/003.0024.wav\": \"능동적인 외교로\",\r\n    \"./datasets/moon/audio/003.0025.wav\": \"독일 통일과 유럽통합을 주도한\",\r\n    \"./datasets/moon/audio/003.0026.wav\": \"헬무트\",\r\n    \"./datasets/moon/audio/003.0027.wav\": \"콜 총리의 위대한 업적을 기억할 것입니다\",\r\n    \"./datasets/moon/audio/003.0028.wav\": \"친애하는 내외 귀빈 여러분\",\r\n    \"./datasets/moon/audio/003.0029.wav\": \"이곳 베를린은\",\r\n    \"./datasets/moon/audio/003.0030.wav\": \"지금으로부터 17년 전\",\r\n    
\"./datasets/moon/audio/003.0031.wav\": \"한국의 김대중 대통령이\",\r\n    \"./datasets/moon/audio/003.0032.wav\": \"남북 화해·협력의 기틀을 마련한\",\r\n    \"./datasets/moon/audio/003.0033.wav\": \"베를린 선언을 발표한 곳입니다\",\r\n    \"./datasets/moon/audio/003.0034.wav\": \"여기 알테스 슈타트하우스는\",\r\n    \"./datasets/moon/audio/003.0035.wav\": \"독일 통일조약 협상이 이뤄졌던\",\r\n    \"./datasets/moon/audio/003.0036.wav\": \"역사적 현장입니다\",\r\n    \"./datasets/moon/audio/003.0037.wav\": \"나는 오늘\",\r\n    \"./datasets/moon/audio/003.0038.wav\": \"베를린의 교훈이 살아있는 이 자리에서\",\r\n    \"./datasets/moon/audio/003.0039.wav\": \"대한민국 새 정부의 한반도 평화 구상을\",\r\n    \"./datasets/moon/audio/003.0040.wav\": \"말씀드리고자 합니다\",\r\n    \"./datasets/moon/audio/003.0041.wav\": \"내외 귀빈 여러분\",\r\n    \"./datasets/moon/audio/003.0042.wav\": \"독일 통일의 경험은\",\r\n    \"./datasets/moon/audio/003.0043.wav\": \"지구상\",\r\n    \"./datasets/moon/audio/003.0044.wav\": \"마지막 분단국가로 남은 우리에게\",\r\n    \"./datasets/moon/audio/003.0045.wav\": \"통일에 대한 희망과 함께\",\r\n    \"./datasets/moon/audio/003.0046.wav\": \"우리가 나아가야 할 방향을 말해주고 있습니다\",\r\n    \"./datasets/moon/audio/003.0047.wav\": \"그것은 우선\",\r\n    \"./datasets/moon/audio/003.0048.wav\": \"통일에 이르는\",\r\n    \"./datasets/moon/audio/003.0049.wav\": \"과정의 중요성입니다\",\r\n    \"./datasets/moon/audio/006.0000.wav\": \"존경하고 사랑하는 국민 여러분\",\r\n    \"./datasets/moon/audio/006.0001.wav\": \"감사합니다\",\r\n    \"./datasets/moon/audio/006.0002.wav\": \"국민 여러분의\",\r\n    \"./datasets/moon/audio/006.0003.wav\": \"위대한 선택에\",\r\n    \"./datasets/moon/audio/006.0004.wav\": \"머리 숙여\",\r\n    \"./datasets/moon/audio/006.0005.wav\": \"깊이\",\r\n    \"./datasets/moon/audio/006.0006.wav\": \"감사드립니다\",\r\n    \"./datasets/moon/audio/006.0007.wav\": \"저는 오늘\",\r\n    \"./datasets/moon/audio/006.0008.wav\": \"대한민국\",\r\n    \"./datasets/moon/audio/006.0009.wav\": \"제19대 대통령으로서\",\r\n    \"./datasets/moon/audio/006.0010.wav\": \"새로운 대한민국을 향해\",\r\n    \"./datasets/moon/audio/006.0011.wav\": \"첫걸음을 내딛습니다\",\r\n    
\"./datasets/moon/audio/006.0012.wav\": \"지금 제 두 어깨는\",\r\n    \"./datasets/moon/audio/006.0013.wav\": \"국민 여러분으로부터\",\r\n    \"./datasets/moon/audio/006.0014.wav\": \"부여받은\",\r\n    \"./datasets/moon/audio/006.0015.wav\": \"막중한 소명감으로\",\r\n    \"./datasets/moon/audio/006.0016.wav\": \"무겁습니다\",\r\n    \"./datasets/moon/audio/006.0017.wav\": \"지금 제 가슴은\",\r\n    \"./datasets/moon/audio/006.0018.wav\": \"한 번도 경험하지 못한\",\r\n    \"./datasets/moon/audio/006.0019.wav\": \"나라를 만들겠다는 열정으로 뜨겁습니다\",\r\n    \"./datasets/moon/audio/006.0020.wav\": \"그리고 지금 제 머리는\",\r\n    \"./datasets/moon/audio/006.0021.wav\": \"통합과 공존의\",\r\n    \"./datasets/moon/audio/006.0022.wav\": \"새로운 세상을 열어갈\",\r\n    \"./datasets/moon/audio/006.0023.wav\": \"청사진으로\",\r\n    \"./datasets/moon/audio/006.0024.wav\": \"가득 차 있습니다\",\r\n    \"./datasets/moon/audio/006.0025.wav\": \"우리가 만들어가려는 새로운 대한민국은\",\r\n    \"./datasets/moon/audio/006.0026.wav\": \"숱한 좌절과 패배에도 불구하고\",\r\n    \"./datasets/moon/audio/006.0027.wav\": \"우리의 선대들이\",\r\n    \"./datasets/moon/audio/006.0028.wav\": \"일관되게 추구했던 나라입니다\",\r\n    \"./datasets/moon/audio/006.0029.wav\": \"또 많은 희생과 헌신을 감내하며\",\r\n    \"./datasets/moon/audio/006.0030.wav\": \"우리 젊은이들이\",\r\n    \"./datasets/moon/audio/006.0031.wav\": \"그토록 이루고 싶어했던\",\r\n    \"./datasets/moon/audio/006.0032.wav\": \"나라입니다\",\r\n    \"./datasets/moon/audio/006.0033.wav\": \"그런 대한민국을 만들기 위해 저는\",\r\n    \"./datasets/moon/audio/006.0034.wav\": \"역사와 국민 앞에\",\r\n    \"./datasets/moon/audio/006.0035.wav\": \"두렵지만\",\r\n    \"./datasets/moon/audio/006.0036.wav\": \"겸허한 마음으로\",\r\n    \"./datasets/moon/audio/006.0037.wav\": \"대한민국\",\r\n    \"./datasets/moon/audio/006.0038.wav\": \"제19대\",\r\n    \"./datasets/moon/audio/006.0039.wav\": \"대통령으로서의\",\r\n    \"./datasets/moon/audio/006.0040.wav\": \"책임과 소명을 다할 것임을 천명합니다\",\r\n    \"./datasets/moon/audio/006.0041.wav\": \"함께 선거를 치른 후보들께\",\r\n    \"./datasets/moon/audio/006.0042.wav\": \"감사의 말씀과 함께\",\r\n    
\"./datasets/moon/audio/006.0043.wav\": \"심심한\",\r\n    \"./datasets/moon/audio/006.0044.wav\": \"위로를 전합니다\",\r\n    \"./datasets/moon/audio/006.0045.wav\": \"이번 선거에서는\",\r\n    \"./datasets/moon/audio/006.0046.wav\": \"승자도\",\r\n    \"./datasets/moon/audio/006.0047.wav\": \"패자도 없습니다\",\r\n    \"./datasets/moon/audio/006.0048.wav\": \"우리는\",\r\n    \"./datasets/moon/audio/006.0062.wav\": \"정치적 격변기를 보냈습니다\",\r\n    \"./datasets/moon/audio/006.0063.wav\": \"정치는 혼란스러웠지만\",\r\n    \"./datasets/moon/audio/006.0065.wav\": \"현직 대통령의 탄핵과 구속 앞에서도\",\r\n    \"./datasets/moon/audio/006.0067.wav\": \"대한민국의 앞길을 열어주셨습니다\",\r\n    \"./datasets/moon/audio/006.0068.wav\": \"우리 국민들은 좌절하지 않고\",\r\n    \"./datasets/moon/audio/006.0093.wav\": \"2017년5월10일\",\r\n    \"./datasets/moon/audio/006.0098.wav\": \"존경하고 사랑하는 국민 여러분\",\r\n    \"./datasets/moon/audio/006.0104.wav\": \"바로 그 질문에서 새로 시작하겠습니다\",\r\n    \"./datasets/moon/audio/006.0108.wav\": \"구시대의 잘못된 관행과\",\r\n    \"./datasets/moon/audio/006.0115.wav\": \"광화문 대통령 시대를 열겠습니다\",\r\n    \"./datasets/moon/audio/006.0116.wav\": \"참모들과 머리와 어깨를 맞대고\"\r\n}"
  },
  {
    "path": "datasets/moon.py",
    "content": "# -*- coding: utf-8 -*-\r\n\r\nfrom concurrent.futures import ProcessPoolExecutor\r\nfrom functools import partial\r\nimport numpy as np\r\nimport os,json\r\nfrom utils import audio\r\nfrom text import text_to_sequence\r\n\r\n\r\ndef build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda x: x):\r\n    \"\"\"\r\n    Preprocesses the speech dataset from a gven input path to given output directories\r\n\r\n    Args:\r\n        - hparams: hyper parameters\r\n        - input_dir: input directory that contains the files to prerocess\r\n        - out_dir: output directory of npz files\r\n        - n_jobs: Optional, number of worker process to parallelize across\r\n        - tqdm: Optional, provides a nice progress bar\r\n\r\n    Returns:\r\n        - A list of tuple describing the train examples. this should be written to train.txt\r\n    \"\"\"\r\n\r\n    executor = ProcessPoolExecutor(max_workers=num_workers)\r\n    futures = []\r\n    index = 1\r\n\r\n    path = os.path.join(in_dir, 'moon-recognition-All.json')\r\n    \r\n    with open(path,encoding='utf-8') as f:\r\n        content = f.read()\r\n        data = json.loads(content)\r\n        for key, text in data.items():\r\n            wav_path = key.strip().split('/')\r\n            wav_path = os.path.join(in_dir, 'audio', '%s' % wav_path[-1])\r\n            # In case of test file\r\n            if not os.path.exists(wav_path):\r\n                continue\r\n            futures.append(executor.submit(partial(_process_utterance, out_dir, wav_path, text,hparams)))\r\n            index += 1\r\n\r\n    return [future.result() for future in tqdm(futures) if future.result() is not None]\r\n#     result = []\r\n#     for future in tqdm(futures):\r\n#         if future.result() is not None:\r\n#             result.append(future.result())\r\n#          \r\n#     return result\r\n\r\ndef _process_utterance(out_dir, wav_path, text, hparams):\r\n    \"\"\"\r\n    Preprocesses a single utterance 
wav/text pair\r\n\r\n    this writes the mel scale spectrogram to disk and returns a tuple to write\r\n    to the train.txt file\r\n\r\n    Args:\r\n        - out_dir: the directory to write the npz file into\r\n        - wav_path: path to the audio file containing the speech input\r\n        - text: text spoken in the input audio file\r\n        - hparams: hyper parameters\r\n\r\n    Returns:\r\n        - A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text, npz_filename)\r\n    \"\"\"\r\n    try:\r\n        # Load the audio as a numpy array\r\n        wav = audio.load_wav(wav_path, sr=hparams.sample_rate)\r\n    except FileNotFoundError: #catch missing wav exception\r\n        print('file {} present in json metadata is not present in wav folder. 
skipping!'.format(\r\n            wav_path))\r\n        return None\r\n\r\n    #rescale wav\r\n    if hparams.rescaling:   # hparams.rescale = True\r\n        wav = wav / np.abs(wav).max() * hparams.rescaling_max\r\n\r\n    #M-AILABS extra silence specific\r\n    if hparams.trim_silence:  # hparams.trim_silence = True\r\n        wav = audio.trim_silence(wav, hparams)   # Trim leading and trailing silence\r\n\r\n    #Mu-law quantize; the default input_type is 'raw'\r\n    if hparams.input_type=='mulaw-quantize':\r\n        #[0, quantize_channels)\r\n        out = audio.mulaw_quantize(wav, hparams.quantize_channels)\r\n\r\n        #Trim silences\r\n        start, end = audio.start_and_end_indices(out, hparams.silence_threshold)\r\n        wav = wav[start: end]\r\n        out = out[start: end]\r\n\r\n        constant_values = audio.mulaw_quantize(0, hparams.quantize_channels)\r\n        out_dtype = np.int16\r\n\r\n    elif hparams.input_type=='mulaw':\r\n        #[-1, 1]\r\n        out = audio.mulaw(wav, hparams.quantize_channels)\r\n        constant_values = audio.mulaw(0., hparams.quantize_channels)\r\n        out_dtype = np.float32\r\n\r\n    else:  # raw\r\n        #[-1, 1]\r\n        out = wav\r\n        constant_values = 0.\r\n        out_dtype = np.float32\r\n\r\n    # Compute the mel scale spectrogram from the wav\r\n    mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)\r\n    mel_frames = mel_spectrogram.shape[1]\r\n\r\n    if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length:   # hparams.max_mel_frames = 1000, hparams.clip_mels_length = True\r\n        return None\r\n\r\n    #Compute the linear scale spectrogram from the wav\r\n    linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32)\r\n    linear_frames = linear_spectrogram.shape[1]\r\n\r\n    #sanity check\r\n    assert linear_frames == mel_frames\r\n\r\n    if hparams.use_lws:    # hparams.use_lws = False\r\n        #Ensure time resolution adjustment between audio 
and mel-spectrogram\r\n        fft_size = hparams.fft_size if hparams.win_size is None else hparams.win_size\r\n        l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams))\r\n\r\n        #Zero pad audio signal\r\n        out = np.pad(out, (l, r), mode='constant', constant_values=constant_values)\r\n    else:\r\n        #Ensure time resolution adjustment between audio and mel-spectrogram\r\n        pad = audio.librosa_pad_lr(wav, hparams.fft_size, audio.get_hop_size(hparams))\r\n\r\n        #Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency)\r\n        out = np.pad(out, pad, mode='reflect')\r\n\r\n    assert len(out) >= mel_frames * audio.get_hop_size(hparams)\r\n\r\n    #time resolution adjustment\r\n    #ensure length of raw audio is multiple of hop size so that we can use\r\n    #transposed convolution to upsample\r\n    out = out[:mel_frames * audio.get_hop_size(hparams)]\r\n    assert len(out) % audio.get_hop_size(hparams) == 0\r\n    time_steps = len(out)\r\n\r\n    # Write the spectrogram and audio to disk\r\n    wav_id = os.path.splitext(os.path.basename(wav_path))[0]\r\n    \r\n    # Write the spectrograms to disk:\r\n    audio_filename = '{}-audio.npy'.format(wav_id)\r\n    mel_filename = '{}-mel.npy'.format(wav_id)\r\n    linear_filename = '{}-linear.npy'.format(wav_id)\r\n    npz_filename = '{}.npz'.format(wav_id)\r\n    npz_flag=True\r\n    if npz_flag:\r\n        # Use the same keys to stay consistent with the Tacotron code.\r\n        data = {\r\n            'audio': out.astype(out_dtype),\r\n            'mel': mel_spectrogram.T,  \r\n            'linear': linear_spectrogram.T,\r\n            'time_steps': time_steps,\r\n            'mel_frames': mel_frames,\r\n            'text': text,\r\n            'tokens': text_to_sequence(text),   # \"1\", which corresponds to eos(~), is appended at the end.\r\n            'loss_coeff': 1  # For Tacotron\r\n        }\r\n        \r\n        np.savez(os.path.join(out_dir,npz_filename ), **data, allow_pickle=False)\r\n    else:\r\n        np.save(os.path.join(out_dir, audio_filename), out.astype(out_dtype), allow_pickle=False)\r\n        np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)\r\n        np.save(os.path.join(out_dir, linear_filename), linear_spectrogram.T, allow_pickle=False)\r\n\r\n    # Return a tuple describing this training example\r\n    return (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text,npz_filename)"
  },
  {
    "path": "datasets/son/son-recognition-All.json",
    "content": "{\r\n    \"./datasets/son/audio/NB10584578.0000.wav\": \"오늘부터 뉴스룸 2부에서는 그날의 주요사항을 한마디의 단어로 축약해서 앵커브리핑으로 풀어보겠습니다\",\r\n    \"./datasets/son/audio/NB10584578.0001.wav\": \"오늘 뉴스룸이 주목한다 던어는 저돌입니다\",\r\n    \"./datasets/son/audio/NB10584578.0002.wav\": \"돼지 저 자에 갑자기 돌 이 두 글자를 사용하는 이 단어는 흔히 추진력이 강하다는 의미로 쓰이죠\",\r\n    \"./datasets/son/audio/NB10584578.0003.wav\": \"난파 직전의 새정치연합을 책임지게 된 문희상 비대위원장이 이런 말을 했습니다\",\r\n    \"./datasets/son/audio/NB10584578.0004.wav\": \"난 그냥 산 돼지처럼 돌파하는 스타일이다\",\r\n    \"./datasets/son/audio/NB10584578.0005.wav\": \"이렇게 얘기했습니다\",\r\n    \"./datasets/son/audio/NB10584578.0006.wav\": \"몸이 좋지 않다면서 만남을 주저했던 김무성 새누리당 대표를 찾아 가서 만난 것도 바로 이런 적어도 저돌성이 없었다면 어려웠을지도 모르겠습니다 그렇다면\",\r\n    \"./datasets/son/audio/NB10584578.0007.wav\": \"문 비대위원장이 저돌적으로 돌파해야 할 과제는 무엇인가\",\r\n    \"./datasets/son/audio/NB10584578.0008.wav\": \"첫 번째는 계파주의 청산입니다\",\r\n    \"./datasets/son/audio/NB10584578.0009.wav\": \"지난 이천십이년 대선에서 민주통합당의 패배한 이후에 대선평가 위원장을 맡았던 한상진 서울대 명예교수가\",\r\n    \"./datasets/son/audio/NB10584578.0010.wav\": \"이런 보고서를 냈습니다\",\r\n    \"./datasets/son/audio/NB10584578.0011.wav\": \"계파정치 청산은 민주당의 미래를 위한 최우선 과제다\",\r\n    \"./datasets/son/audio/NB10584578.0012.wav\": \"아 이렇게 얘기했는데요 그러나 아시는 것처럼이 보고서는\",\r\n    \"./datasets/son/audio/NB10584578.0013.wav\": \"갖가지 반발 끝에 결국 채택되지 못했습니다\",\r\n    \"./datasets/son/audio/NB10584578.0014.wav\": \"아마 여당에서 한상진 교수 좋아하는 사람 별로 없을 겁니다\",\r\n    \"./datasets/son/audio/NB10584578.0015.wav\": \"문희상 당시 비대위원장이 공교롭게도 계파와 패권주의 청산을 내세웠던 바로 그 시기에 비대위원장 이었죠\",\r\n    \"./datasets/son/audio/NB10584578.0016.wav\": \"계파 청산에 관한 문 비대위원장은 어떻게 보면 실패했다고 봐야만 합니다\",\r\n    \"./datasets/son/audio/NB10584578.0017.wav\": \"권한은 공유하되 책임은 당 대표가 혼자지는 이런 기형적 구조가\",\r\n    \"./datasets/son/audio/NB10584578.0018.wav\": \"아 결국\",\r\n    \"./datasets/son/audio/NB10584578.0019.wav\": \"최근 사년 동안에 임기 2년에 야당 지도부 교체 숫자를\",\r\n    \"./datasets/son/audio/NB10584578.0020.wav\": \"늘려서 무료 열번이나 교체가 되었습니다\",\r\n    
\"./datasets/son/audio/NB10584578.0021.wav\": \"같은 기간에 새누리당은 단 네명의 지도부가 바뀌었습니다\",\r\n    \"./datasets/son/audio/NB10584578.0022.wav\": \"실패가 구조화된 당의 체질을 바꾸지 않고서는 누가 리더가 되어도 쉽지 않다는 것을 상징적으로 내보여주는 숫자이기도 합니다\",\r\n    \"./datasets/son/audio/NB10584578.0023.wav\": \"자 두 번째 과제는 바로 이겁니다 수사권 기소권 문제로 교착상태에 빠지는 세월호 특별법 지금도 끝이 보이지 않는데요\",\r\n    \"./datasets/son/audio/NB10584578.0024.wav\": \"어떠한 추가 협상도\",\r\n    \"./datasets/son/audio/NB10584578.0025.wav\": \"불가하다 이렇게 못박은 청와대와\",\r\n    \"./datasets/son/audio/NB10584578.0026.wav\": \"여당을 어떻게 변화시킬 것인지 또한\",\r\n    \"./datasets/son/audio/NB10584578.0027.wav\": \"수사권과 기소권을 주장하는 유족들의 요구를 어떻게 담아낼 것인지\",\r\n    \"./datasets/son/audio/NB10584578.0028.wav\": \"겉은 장비 속은 조조라고 불리우는 의회주의자 문희상 비대위원장과 새정치연합이 저돌적으로 말 그대로 저돌적으로 풀어 가야 할\",\r\n    \"./datasets/son/audio/NB10584578.0029.wav\": \"과제인지도 모르겠습니다\",\r\n    \"./datasets/son/audio/NB10584578.0030.wav\": \"세월호 참사는 오늘로 백육십일째를 맞았습니다\",\r\n    \"./datasets/son/audio/NB10584578.0031.wav\": \"쓸쓸한 팽목항에는\",\r\n    \"./datasets/son/audio/NB10584578.0032.wav\": \"자원봉사자마저 하나둘 철수하고 있고\",\r\n    \"./datasets/son/audio/NB10584578.0033.wav\": \"슬픈 이천십사년은 오늘로 이제 딱\",\r\n    \"./datasets/son/audio/NB10584578.0034.wav\": \"백일이 남았습니다\",\r\n    \"./datasets/son/audio/NB10584578.0035.wav\": \"잠시 후에 문희상 비대위원장을 스튜디오에서 만나겠습니다\",\r\n    \"./datasets/son/audio/NB10585784.0001.wav\": \"자 이어서 앵커 브리핑 순서입니다 오늘 뉴스 룸이 주목한 단어는 덫입니다\",\r\n    \"./datasets/son/audio/NB10585784.0002.wav\": \"어 잔꾀를 부리다 자신이 놓은 덫에 스스로 걸리고 만 꼴이다\",\r\n    \"./datasets/son/audio/NB10585784.0003.wav\": \"국회 선진화법 개정을 추진하고 있는 새누리당을 향해서\",\r\n    \"./datasets/son/audio/NB10585784.0004.wav\": \"새정치민주연합에 박수현 의원이 이런 말을 했군요\",\r\n    \"./datasets/son/audio/NB10585784.0005.wav\": \"이 말을 이해하기 위해서는 지난 이천십이년에 국회로 한 걸음\",\r\n    \"./datasets/son/audio/NB10585784.0006.wav\": \"돌아가 봐야만 합니다\",\r\n    \"./datasets/son/audio/NB10585784.0007.wav\": \"기대보다는 걱정이 앞서는 것이\",\r\n    \"./datasets/son/audio/NB10585784.0008.wav\": \"솔직한 내 심정입니다\",\r\n    
\"./datasets/son/audio/NB10585784.0009.wav\": \"이제 개정안이 통과된 이상 우리 여야가\",\r\n    \"./datasets/son/audio/NB10585784.0010.wav\": \"대화와 타협을 통해서\",\r\n    \"./datasets/son/audio/NB10585784.0011.wav\": \"국민들에게 신뢰받는 선진 국회를 만들어 가기를 간절히 바랍니다\",\r\n    \"./datasets/son/audio/NB10585784.0015.wav\": \"예 이렇게 세번 두들기고 법안은 통과가 되는데요\",\r\n    \"./datasets/son/audio/NB10585784.0016.wav\": \"국회선진화법은 재적의원 중에 과반이 아닌 오분의 삼이상이 찬성해야 만\",\r\n    \"./datasets/son/audio/NB10585784.0017.wav\": \"안건을 올릴 수 있도록 만든 법이죠\"\r\n}"
  },
  {
    "path": "datasets/son.py",
    "content": "# -*- coding: utf-8 -*-\r\n\r\nfrom concurrent.futures import ProcessPoolExecutor\r\nfrom functools import partial\r\nimport numpy as np\r\nimport os,json\r\nfrom utils import audio\r\nfrom text import text_to_sequence\r\n\r\n\r\ndef build_from_path(hparams, in_dir, out_dir, num_workers=1, tqdm=lambda x: x):\r\n    \"\"\"\r\n    Preprocesses the speech dataset from a gven input path to given output directories\r\n\r\n    Args:\r\n        - hparams: hyper parameters\r\n        - input_dir: input directory that contains the files to prerocess\r\n        - out_dir: output directory of npz files\r\n        - n_jobs: Optional, number of worker process to parallelize across\r\n        - tqdm: Optional, provides a nice progress bar\r\n\r\n    Returns:\r\n        - A list of tuple describing the train examples. this should be written to train.txt\r\n    \"\"\"\r\n\r\n    executor = ProcessPoolExecutor(max_workers=num_workers)\r\n    futures = []\r\n    index = 1\r\n\r\n    path = os.path.join(in_dir, 'son-recognition-All.json')\r\n    \r\n    with open(path,encoding='utf-8') as f:\r\n        content = f.read()\r\n        data = json.loads(content)\r\n        for key, text in data.items():\r\n            wav_path = key.strip().split('/')\r\n            wav_path = os.path.join(in_dir, 'audio', '%s' % wav_path[-1])\r\n            # In case of test file\r\n            if not os.path.exists(wav_path):\r\n                continue\r\n            futures.append(executor.submit(partial(_process_utterance, out_dir, wav_path, text,hparams)))\r\n            index += 1\r\n\r\n    return [future.result() for future in tqdm(futures) if future.result() is not None]\r\n\r\n\r\ndef _process_utterance(out_dir, wav_path, text, hparams):\r\n    \"\"\"\r\n    Preprocesses a single utterance wav/text pair\r\n\r\n    this writes the mel scale spectogram to disk and return a tuple to write\r\n    to the train.txt file\r\n\r\n    Args:\r\n        - mel_dir: the directory to write 
the mel spectograms into\r\n        - linear_dir: the directory to write the linear spectrograms into\r\n        - wav_dir: the directory to write the preprocessed wav into\r\n        - index: the numeric index to use in the spectogram filename\r\n        - wav_path: path to the audio file containing the speech input\r\n        - text: text spoken in the input audio file\r\n        - hparams: hyper parameters\r\n\r\n    Returns:\r\n        - A tuple: (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, linear_frames, text)\r\n    \"\"\"\r\n    try:\r\n        # Load the audio as numpy array\r\n        wav = audio.load_wav(wav_path, sr=hparams.sample_rate)\r\n    except FileNotFoundError: #catch missing wav exception\r\n        print('file {} present in csv metadata is not present in wav folder. skipping!'.format(wav_path))\r\n        return None\r\n\r\n    #rescale wav\r\n    if hparams.rescaling:   # hparams.rescale = True\r\n        wav = wav / np.abs(wav).max() * hparams.rescaling_max\r\n\r\n    #M-AILABS extra silence specific\r\n    if hparams.trim_silence:  # hparams.trim_silence = True\r\n        wav = audio.trim_silence(wav, hparams)   # Trim leading and trailing silence\r\n\r\n    #Mu-law quantize, default 값은 'raw'\r\n    if hparams.input_type=='mulaw-quantize':\r\n        #[0, quantize_channels)\r\n        out = audio.mulaw_quantize(wav, hparams.quantize_channels)\r\n\r\n        #Trim silences\r\n        start, end = audio.start_and_end_indices(out, hparams.silence_threshold)\r\n        wav = wav[start: end]\r\n        out = out[start: end]\r\n\r\n        constant_values = mulaw_quantize(0, hparams.quantize_channels)\r\n        out_dtype = np.int16\r\n\r\n    elif hparams.input_type=='mulaw':\r\n        #[-1, 1]\r\n        out = audio.mulaw(wav, hparams.quantize_channels)\r\n        constant_values = audio.mulaw(0., hparams.quantize_channels)\r\n        out_dtype = np.float32\r\n\r\n    else:  # raw\r\n        #[-1, 1]\r\n        out = 
wav\r\n        constant_values = 0.\r\n        out_dtype = np.float32\r\n\r\n    # Compute the mel scale spectrogram from the wav\r\n    mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)\r\n    mel_frames = mel_spectrogram.shape[1]\r\n\r\n    if mel_frames > hparams.max_mel_frames and hparams.clip_mels_length:   # hparams.max_mel_frames = 1000, hparams.clip_mels_length = True\r\n        return None\r\n\r\n    #Compute the linear scale spectrogram from the wav\r\n    linear_spectrogram = audio.linearspectrogram(wav, hparams).astype(np.float32)\r\n    linear_frames = linear_spectrogram.shape[1]\r\n\r\n    #sanity check\r\n    assert linear_frames == mel_frames\r\n\r\n    if hparams.use_lws:    # hparams.use_lws = False\r\n        #Ensure time resolution adjustment between audio and mel-spectrogram\r\n        fft_size = hparams.fft_size if hparams.win_size is None else hparams.win_size\r\n        l, r = audio.pad_lr(wav, fft_size, audio.get_hop_size(hparams))\r\n\r\n        #Zero pad audio signal\r\n        out = np.pad(out, (l, r), mode='constant', constant_values=constant_values)\r\n    else:\r\n        #Ensure time resolution adjustment between audio and mel-spectrogram\r\n        pad = audio.librosa_pad_lr(wav, hparams.fft_size, audio.get_hop_size(hparams))\r\n\r\n        #Reflect pad audio signal (Just like it's done in Librosa to avoid frame inconsistency)\r\n        out = np.pad(out, pad, mode='reflect')\r\n\r\n    assert len(out) >= mel_frames * audio.get_hop_size(hparams)\r\n\r\n    #time resolution adjustment\r\n    #ensure length of raw audio is multiple of hop size so that we can use\r\n    #transposed convolution to upsample\r\n    out = out[:mel_frames * audio.get_hop_size(hparams)]\r\n    assert len(out) % audio.get_hop_size(hparams) == 0\r\n    time_steps = len(out)\r\n\r\n    # Write the spectrogram and audio to disk\r\n    wav_id = os.path.splitext(os.path.basename(wav_path))[0]\r\n    \r\n    # Write the spectrograms to 
disk:\r\n    audio_filename = '{}-audio.npy'.format(wav_id)\r\n    mel_filename = '{}-mel.npy'.format(wav_id)\r\n    linear_filename = '{}-linear.npy'.format(wav_id)\r\n    npz_filename = '{}.npz'.format(wav_id)\r\n    npz_flag=True\r\n    if npz_flag:\r\n        # Use the same keys to stay consistent with the Tacotron code.\r\n        data = {\r\n            'audio': out.astype(out_dtype),\r\n            'mel': mel_spectrogram.T,  \r\n            'linear': linear_spectrogram.T,\r\n            'time_steps': time_steps,\r\n            'mel_frames': mel_frames,\r\n            'text': text,\r\n            'tokens': text_to_sequence(text),   # \"1\", which corresponds to eos(~), is appended at the end.\r\n            'loss_coeff': 1  # For Tacotron\r\n        }\r\n        \r\n        np.savez(os.path.join(out_dir,npz_filename ), **data, allow_pickle=False)\r\n    else:\r\n        np.save(os.path.join(out_dir, audio_filename), out.astype(out_dtype), allow_pickle=False)\r\n        np.save(os.path.join(out_dir, mel_filename), mel_spectrogram.T, allow_pickle=False)\r\n        np.save(os.path.join(out_dir, linear_filename), linear_spectrogram.T, allow_pickle=False)\r\n\r\n    # Return a tuple describing this training example\r\n    return (audio_filename, mel_filename, linear_filename, time_steps, mel_frames, text,npz_filename)"
  },
  {
    "path": "generate.py",
    "content": "#  coding: utf-8\r\n\"\"\"\r\nsample_rate = 16000이므로, samples 48000이면 3초 길이가 된다.\r\n\r\n> python generate.py --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2018-12-21T22-58-10\r\n> python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2018-12-21T22-58-10   <----scalar_input = True\r\n> python generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10\r\npython generate.py --wav_seed ./logdir-wavenet/seed.wav --mel ./logdir-tacotron/generate/mel-2018-12-25_22-27-50-0.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10\r\n\r\n\r\ngc_id = 0(moon), 1(son)\r\npython generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-22T23-08-16\r\npython generate.py --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-22T23-08-16\r\n\r\n\r\n\"\"\"\r\nimport argparse\r\nfrom datetime import datetime\r\nimport json\r\nimport os,time\r\n\r\nimport librosa\r\nimport numpy as np\r\nimport tensorflow as tf\r\n\r\nfrom wavenet import WaveNetModel, mu_law_decode, mu_law_encode\r\nfrom hparams import hparams\r\nfrom utils import load_hparams,load\r\nfrom utils import audio\r\nfrom utils import plot\r\n\r\nimport warnings\r\nwarnings.simplefilter(action='ignore', category=FutureWarning)\r\n\r\n\r\n\r\ndef _interp(feats, in_range):\r\n    #rescales from [-max, max] (or [0, max]) to [0, 1]\r\n    return (feats - in_range[0]) / (in_range[1] - in_range[0])\r\n\r\n\r\ndef get_arguments():\r\n    def _str_to_bool(s):\r\n        \"\"\"Convert string to bool (in argparse context).\"\"\"\r\n        if s.lower() not in ['true', 'false']:\r\n            raise ValueError('Argument needs to be a boolean, got {}'.format(s))\r\n        return {'true': True, 'false': 
False}[s.lower()]\r\n\r\n    def _ensure_positive_float(f):\r\n        \"\"\"Ensure argument is a positive float.\"\"\"\r\n        if float(f) < 0:\r\n            raise argparse.ArgumentTypeError('Argument must be greater than zero')\r\n        return float(f)\r\n\r\n    parser = argparse.ArgumentParser(description='WaveNet generation script')\r\n    parser.add_argument('checkpoint_dir', type=str, help='Which model checkpoint to generate from')\r\n    \r\n    TEMPERATURE = 1.0\r\n    parser.add_argument('--temperature', type=_ensure_positive_float, default=TEMPERATURE,help='Sampling temperature')\r\n    \r\n    \r\n    LOGDIR = './logdir-wavenet'\r\n    parser.add_argument('--logdir',type=str,default=LOGDIR,help='Directory in which to store the logging information for TensorBoard.')\r\n    parser.add_argument('--wav_out_path',type=str,default=None,help='Path to output wav file')\r\n    \r\n    BATCH_SIZE = 1\r\n    parser.add_argument('--batch_size', type=int, default=BATCH_SIZE,help='batch size')\r\n    \r\n    \r\n    parser.add_argument('--wav_seed',type=str,default=None,help='The wav file to start generation from')\r\n    parser.add_argument('--mel',type=str,default=None,help='mel input')\r\n    parser.add_argument('--gc_cardinality',type=int,default=None,help='Number of categories upon which we globally condition.')\r\n    parser.add_argument('--gc_id',type=int,default=None,help='ID of category to generate, if globally conditioned.')\r\n    \r\n    arguments = parser.parse_args()\r\n    if hparams.gc_channels is not None:\r\n        if arguments.gc_cardinality is None:\r\n            raise ValueError(\"Globally conditioning but gc_cardinality not specified. Use --gc_cardinality=377 for full VCTK corpus.\")\r\n\r\n        if arguments.gc_id is None:\r\n            raise ValueError(\"Globally conditioning, but global condition was not specified. 
Use --gc_id to specify global condition.\")\r\n\r\n    return arguments\r\n\r\n\r\n# def write_wav(waveform, sample_rate, filename):\r\n#     y = np.array(waveform)\r\n#     librosa.output.write_wav(filename, y, sample_rate)\r\n#     print('Updated wav file at {}'.format(filename))\r\n\r\n\r\ndef create_seed(filename,sample_rate,quantization_channels,window_size,scalar_input):\r\n    # seed의 앞부분만 사용한다.\r\n    seed_audio, _ = librosa.load(filename, sr=sample_rate, mono=True)\r\n    seed_audio = audio.trim_silence(seed_audio, hparams)\r\n    if scalar_input:\r\n        if len(seed_audio) < window_size:\r\n            return seed_audio\r\n        else: return seed_audio[:window_size]\r\n    else:\r\n        quantized = mu_law_encode(seed_audio, quantization_channels)\r\n    \r\n    \r\n        # 짧으면 짧은 대로 return하는데, padding이라도 해야되지 않나???\r\n        cut_index = tf.cond(tf.size(quantized) < tf.constant(window_size), lambda: tf.size(quantized), lambda: tf.constant(window_size))\r\n    \r\n        return quantized[:cut_index]\r\n\r\n\r\ndef main():\r\n    config = get_arguments()\r\n    started_datestring = \"{0:%Y-%m-%dT%H-%M-%S}\".format(datetime.now())\r\n    logdir = os.path.join(config.logdir, 'generate', started_datestring)\r\n    \r\n    if not os.path.exists(logdir):\r\n        os.makedirs(logdir)\r\n\r\n    load_hparams(hparams, config.checkpoint_dir)\r\n\r\n\r\n    with tf.device('/cpu:0'):  # cpu가 더 빠르다. gpu로 설정하면 Error. 
it is slower without tf.device.\r\n\r\n        sess = tf.Session()\r\n        scalar_input = hparams.scalar_input\r\n        net = WaveNetModel(\r\n            batch_size=config.batch_size,\r\n            dilations=hparams.dilations,\r\n            filter_width=hparams.filter_width,\r\n            residual_channels=hparams.residual_channels,\r\n            dilation_channels=hparams.dilation_channels,\r\n            quantization_channels=hparams.quantization_channels,\r\n            out_channels=hparams.out_channels,\r\n            skip_channels=hparams.skip_channels,\r\n            use_biases=hparams.use_biases,\r\n            scalar_input=hparams.scalar_input,\r\n            global_condition_channels=hparams.gc_channels,\r\n            global_condition_cardinality=config.gc_cardinality,\r\n            local_condition_channels=hparams.num_mels,\r\n            upsample_factor=hparams.upsample_factor,\r\n            legacy=hparams.legacy,\r\n            residual_legacy=hparams.residual_legacy,\r\n            train_mode=False)   # during training, global_condition_cardinality was inferred by AudioReader; here it must be passed in explicitly\r\n\r\n        if scalar_input:\r\n            samples = tf.placeholder(tf.float32,shape=[net.batch_size,None])\r\n        else:\r\n            samples = tf.placeholder(tf.int32,shape=[net.batch_size,None])  # samples: mu_law_encode output, before one-hot conversion. (batch_size, length)\r\n\r\n        # The local condition would normally be (N,T,num_mels), but incremental generation feeds one step at a time, so (N,1,num_mels) --> squeezed to (N,num_mels)\r\n        upsampled_local_condition = tf.placeholder(tf.float32,shape=[net.batch_size,hparams.num_mels])\r\n\r\n        next_sample = net.predict_proba_incremental(samples,upsampled_local_condition, [config.gc_id]*net.batch_size)  # Fast Wavenet Generation Algorithm (arXiv:1611.09482)\r\n\r\n        # Build the upsampled local condition data to feed into the upsampled_local_condition placeholder.\r\n\r\n        mel_input = np.load(config.mel)\r\n        sample_size = mel_input.shape[0] * hparams.hop_size\r\n        mel_input = np.tile(mel_input,(config.batch_size,1,1))\r\n        with tf.variable_scope('wavenet',reuse=tf.AUTO_REUSE):\r\n            upsampled_local_condition_data = net.create_upsample(mel_input,upsample_type=hparams.upsample_type)\r\n\r\n        var_list = [var for var in tf.global_variables() if 'queue' not in var.name]\r\n        saver = tf.train.Saver(var_list)\r\n        print('Restoring model from {}'.format(config.checkpoint_dir))\r\n\r\n        load(saver, sess, config.checkpoint_dir)\r\n\r\n        sess.run(net.queue_initializer)  # without this, the queues still hold the values restored from the checkpoint\r\n\r\n        quantization_channels = hparams.quantization_channels\r\n        if config.wav_seed:\r\n            # If wav_seed is shorter than receptive_field it should arguably be padded; as it is, a short seed is returned as-is --> a seed that is too short causes an error\r\n            seed = create_seed(config.wav_seed,hparams.sample_rate,quantization_channels,net.receptive_field,scalar_input)  # --> mu-law encoded\r\n            if scalar_input:\r\n                waveform = seed.tolist()\r\n            else:\r\n                waveform = sess.run(seed).tolist()  # [116, 114, 120, 121, 127, ...]\r\n\r\n            print('Priming generation...')\r\n            for i, x in enumerate(waveform[-net.receptive_field: -1]):  # the very last sample is fed in the first iteration of the generation loop below\r\n                if i % 100 == 0:\r\n                    print('Priming sample {}/{}'.format(i,net.receptive_field), end='\\r')\r\n                sess.run(next_sample, feed_dict={samples: np.array([x]*net.batch_size).reshape(net.batch_size,1), upsampled_local_condition: np.zeros([net.batch_size,hparams.num_mels])})\r\n            print('Done.')\r\n            waveform = np.array([waveform[-net.receptive_field:]]*net.batch_size)\r\n        else:\r\n            # Silence with a single random sample at the end.\r\n            if scalar_input:\r\n                waveform = [0.0] * (net.receptive_field - 1)\r\n                waveform = np.array(waveform*net.batch_size).reshape(net.batch_size,-1)\r\n                waveform = np.concatenate([waveform,2*np.random.rand(net.batch_size).reshape(net.batch_size,-1)-1],axis=-1)  # append a random number in [-1, 1] at the end\r\n                # waveform: shape (batch_size, net.receptive_field)\r\n            else:\r\n                waveform = [quantization_channels / 2] * (net.receptive_field - 1)  # build one sample less than receptive_field, then append one random sample below\r\n                waveform = np.array(waveform*net.batch_size).reshape(net.batch_size,-1)\r\n                waveform = np.concatenate([waveform,np.random.randint(quantization_channels,size=net.batch_size).reshape(net.batch_size,-1)],axis=-1)  # before one-hot conversion. (batch_size, 5117)\r\n\r\n        start_time = time.time()\r\n        upsampled_local_condition_data = sess.run(upsampled_local_condition_data)\r\n        last_sample_timestamp = datetime.now()\r\n        for step in range(sample_size):  # loop sample_size times to generate the desired length\r\n\r\n            window = waveform[:,-1:]  # feed only the last sample into samples.  window: shape (N,1)\r\n\r\n            # Run the WaveNet to predict the next sample.\r\n            # In the non-fast mode, window: [128.0, 128.0, ..., 128.0, 178, 185]\r\n            # In fast mode, window is a single number.\r\n            prediction = sess.run(next_sample, feed_dict={samples: window,upsampled_local_condition: upsampled_local_condition_data[:,step,:]})  # samples is mu-law encoded; it is converted to one-hot internally --> (batch_size,256)\r\n\r\n            if scalar_input:\r\n                sample = prediction  # sampled from a logistic distribution, so there is randomness\r\n            else:\r\n                # Scale prediction distribution using temperature.\r\n                # When config.temperature == 1, this merely divides each element by the sum; softmax has already been applied, so the sum is 1 and the values are unchanged.\r\n                # When config.temperature is not 1, the log of each element is divided by the temperature and the result is rescaled to sum to 1.\r\n                np.seterr(divide='ignore')\r\n                scaled_prediction = np.log(prediction) / config.temperature  # no change when config.temperature == 1\r\n                scaled_prediction = (scaled_prediction - np.logaddexp.reduce(scaled_prediction,axis=-1,keepdims=True))  # np.log(np.sum(np.exp(scaled_prediction)))\r\n                scaled_prediction = np.exp(scaled_prediction)\r\n                np.seterr(divide='warn')\r\n\r\n                # Prediction distribution at temperature=1.0 should be unchanged after\r\n                # scaling.\r\n                if config.temperature == 1.0:\r\n                    np.testing.assert_allclose(prediction, scaled_prediction, atol=1e-5, err_msg='Prediction scaling at temperature=1.0 is not working as intended.')\r\n\r\n                # Since we do not pick the argmax, the same input can produce different outputs.\r\n                sample = [[np.random.choice(np.arange(quantization_channels), p=p)] for p in scaled_prediction]  # choose one sample per batch\r\n\r\n            waveform = np.concatenate([waveform,sample],axis=-1)  # window.shape: (N,1)\r\n\r\n            # Show progress only once per second.\r\n            current_sample_timestamp = datetime.now()\r\n            time_since_print = current_sample_timestamp - last_sample_timestamp\r\n            if time_since_print.total_seconds() > 1.:\r\n                duration = time.time() - start_time\r\n                print('Sample {:<3d}/{:<3d}, ({:.3f} sec/step)'.format(step + 1, sample_size, duration), end='\\r')\r\n                last_sample_timestamp = current_sample_timestamp\r\n\r\n        # Introduce a newline to clear the carriage return from the progress.\r\n        print()\r\n\r\n        # Save the result as a wav file.\r\n        if hparams.input_type == 'raw':\r\n            out = waveform[:,net.receptive_field:]\r\n        elif hparams.input_type == 'mulaw':\r\n            decode = mu_law_decode(samples, quantization_channels,quantization=False)\r\n            out = sess.run(decode, feed_dict={samples: waveform[:,net.receptive_field:]})\r\n        else:  # 'mulaw-quantize'\r\n            decode = mu_law_decode(samples, quantization_channels,quantization=True)\r\n            out = sess.run(decode, feed_dict={samples: waveform[:,net.receptive_field:]})\r\n\r\n        # save wav\r\n        for i in range(net.batch_size):\r\n            config.wav_out_path = logdir + '/test-{}.wav'.format(i)\r\n            mel_path = config.wav_out_path.replace(\".wav\", \".png\")\r\n\r\n            gen_mel_spectrogram = audio.melspectrogram(out[i], hparams).astype(np.float32).T\r\n            audio.save_wav(out[i], config.wav_out_path, hparams.sample_rate)  # save_wav modifies out[i] in place\r\n\r\n            plot.plot_spectrogram(gen_mel_spectrogram, mel_path, title='generated mel spectrogram', target_spectrogram=mel_input[i])\r\n        print('Finished generating.')\r\n\r\n\r\nif __name__ == '__main__':\r\n    s = time.time()\r\n    main()\r\n    print(time.time()-s,'sec')"
  },
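The sampling step in generate.py above rescales the WaveNet softmax output by a temperature before drawing the next sample (log, divide by temperature, renormalize with logaddexp). A standalone NumPy sketch of that rescaling — the example distribution is made up:

```python
import numpy as np

def scale_by_temperature(prediction, temperature):
    """Rescale a softmax distribution; temperature=1.0 leaves it unchanged."""
    np.seterr(divide='ignore')
    scaled = np.log(prediction) / temperature
    # Subtracting the log-sum-exp renormalizes so the probabilities sum to 1.
    scaled = scaled - np.logaddexp.reduce(scaled, axis=-1, keepdims=True)
    scaled = np.exp(scaled)
    np.seterr(divide='warn')
    return scaled

p = np.array([0.1, 0.2, 0.7])
assert np.allclose(scale_by_temperature(p, 1.0), p, atol=1e-5)  # unchanged at T=1
assert scale_by_temperature(p, 0.5)[2] > p[2]  # lower temperature sharpens the peak
```

Lowering the temperature concentrates probability mass on the most likely quantization bins, trading diversity for stability in the generated audio.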
  {
    "path": "hparams.py",
    "content": "# -*- coding: utf-8 -*-\r\n\r\nimport tensorflow as tf\r\nimport numpy as np\r\n\r\nhparams = tf.contrib.training.HParams(\r\n    name = \"Tacotron-2\",\r\n    \r\n    # tacotron hyper parameter\r\n    \r\n    cleaners = 'korean_cleaners',  # 'korean_cleaners'   or 'english_cleaners'\r\n    \r\n    \r\n    skip_path_filter = False, # npz파일에서 불필요한 것을 거르는 작업을 할지 말지 결정. receptive_field 보다 짧은 data를 걸러야 하기 때문에 해 줘야 한다.\r\n    use_lws = False,\r\n    \r\n    # Audio\r\n    sample_rate = 24000,  # \r\n    \r\n    # shift can be specified by either hop_size(우선) or frame_shift_ms\r\n    hop_size = 300,             # frame_shift_ms = 12.5ms\r\n    fft_size=2048,   # n_fft. 주로 1024로 되어있는데, tacotron에서 2048사용\r\n    win_size = 1200,   # 50ms\r\n    num_mels=80,\r\n\r\n    #Spectrogram Pre-Emphasis (Lfilter: Reduce spectrogram noise and helps model certitude levels. Also allows for better G&L phase reconstruction)\r\n    preemphasize = True, #whether to apply filter\r\n    preemphasis = 0.97,\r\n    min_level_db = -100,\r\n    ref_level_db = 20,\r\n    signal_normalization = True, #Whether to normalize mel spectrograms to some predefined range (following below parameters)\r\n    allow_clipping_in_normalization = True, #Only relevant if mel_normalization = True\r\n    symmetric_mels = True, #Whether to scale the data to be symmetric around 0. (Also multiplies the output range by 2, faster and cleaner convergence)\r\n    max_abs_value = 4., #max absolute value of data. 
If symmetric, data will be [-max, max] else [0, max] (Must not be too big to avoid gradient explosion, not too small for fast convergence)\r\n    \r\n        \r\n    rescaling=True,\r\n    rescaling_max=0.999, \r\n    \r\n    trim_silence = True, #Whether to clip silence in Audio (at beginning and end of audio only, not the middle)\r\n    #M-AILABS (and other datasets) trim params (there parameters are usually correct for any data, but definitely must be tuned for specific speakers)\r\n    trim_fft_size = 512, \r\n    trim_hop_size = 128,\r\n    trim_top_db = 23,\r\n    \r\n    \r\n    \r\n    \r\n    clip_mels_length = True, #For cases of OOM (Not really recommended, only use if facing unsolvable OOM errors, also consider clipping your samples to smaller chunks)   \r\n    max_mel_frames = 1000,  #Only relevant when clip_mels_length = True, please only use after trying output_per_steps=3 and still getting OOM errors.\r\n    \r\n    \r\n\r\n    l2_regularization_strength = 0,  # Coefficient in the L2 regularization.\r\n    sample_size = 9000,              # Concatenate and cut audio samples to this many samples\r\n    silence_threshold = 0,             # Volume threshold below which to trim the start and the end from the training set samples. e.g. 2\r\n\r\n    \r\n    filter_width = 3,\r\n    gc_channels = 32,                  # global_condition_vector의 차원. 
이것 지정함으로써, global conditioning을 모델에 반영하라는 의미가 된다.\r\n    \r\n    input_type=\"raw\",    # 'mulaw-quantize', 'mulaw', 'raw',   mulaw, raw 2가지는 scalar input\r\n    scalar_input = True,   # input_type과 맞아야 함.\r\n    \r\n    \r\n    dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,\r\n                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512],\r\n    residual_channels = 128,\r\n    dilation_channels = 256,\r\n    quantization_channels = 256,\r\n    out_channels = 30,  # discretized_mix_logistic_loss를 적용하기 때문에, 3의 배수\r\n    skip_channels = 128,\r\n    use_biases = True,\r\n    upsample_type = 'SubPixel',  # 'SubPixel', None   \r\n    upsample_factor=[12,25],   # np.prod(upsample_factor) must equal to hop_size\r\n\r\n\r\n\r\n    # wavenet training hp\r\n    wavenet_batch_size = 2,            # 16--> OOM. wavenet은 batch_size가 고정되어야 한다.\r\n    store_metadata = False,\r\n    num_steps = 1000000,                # Number of training steps\r\n\r\n    #Learning rate schedule\r\n    wavenet_learning_rate = 1e-3, #wavenet initial learning rate\r\n    wavenet_decay_rate = 0.5, #Only used with 'exponential' scheme. Defines the decay rate.\r\n    wavenet_decay_steps = 300000, #Only used with 'exponential' scheme. 
Defines the decay steps.\r\n\r\n    #Regularization parameters\r\n    wavenet_clip_gradients = True, #Whether the clip the gradients during wavenet training.\r\n\r\n    # residual 결과를 sum할 때, \r\n    legacy = True, #Whether to use legacy mode: Multiply all skip outputs but the first one with sqrt(0.5) (True for more early training stability, especially for large models)\r\n    \r\n    # residual block내에서  x = (x + residual) * np.sqrt(0.5)\r\n    residual_legacy = True, #Whether to scale residual blocks outputs by a factor of sqrt(0.5) (True for input variance preservation early in training and better overall stability)\r\n\r\n\r\n    wavenet_dropout = 0.05,\r\n    \r\n    optimizer = 'adam',\r\n    momentum = 0.9,                   # 'Specify the momentum to be used by sgd or rmsprop optimizer. Ignored by the adam optimizer.\r\n    max_checkpoints = 3,             # 'Maximum amount of checkpoints that will be kept alive. Default: '    \r\n\r\n\r\n    ####################################\r\n    ####################################\r\n    ####################################\r\n    # TACOTRON HYPERPARAMETERS\r\n    \r\n    # Training\r\n    adam_beta1 = 0.9,\r\n    adam_beta2 = 0.999,\r\n    \r\n    #Learning rate schedule\r\n    tacotron_decay_learning_rate = True, #boolean, determines if the learning rate will follow an exponential decay\r\n    tacotron_start_decay = 40000, #Step at which learning decay starts\r\n    tacotron_decay_steps = 18000, #Determines the learning rate decay slope (UNDER TEST)\r\n    tacotron_decay_rate = 0.5, #learning rate decay rate (UNDER TEST)\r\n    tacotron_initial_learning_rate = 1e-3, #starting learning rate\r\n    tacotron_final_learning_rate = 1e-4, #minimal learning rate\r\n    \r\n    \r\n    \r\n    \r\n    initial_data_greedy = True,\r\n    initial_phase_step = 8000,   # 여기서 지정한 step 이전에는 data_dirs의 각각의 디렉토리에 대하여 같은 수의 example을 만들고, 이후, weght 비듈에 따라 ... 
즉, 아래의 'main_data_greedy_factor'의 영향을 받는다.\r\n    main_data_greedy_factor = 0,\r\n    main_data = [''],    # 이곳에 있는 directory 속에 있는 data는 가중치를 'main_data_greedy_factor' 만큼 더 준다.\r\n    prioritize_loss = False,    \r\n    \r\n\r\n    # Model\r\n    model_type = 'multi-speaker', # [single, multi-speaker]\r\n    speaker_embedding_size  = 16, \r\n\r\n    embedding_size = 512,    # 'ᄀ', 'ᄂ', 'ᅡ' 에 대한 embedding dim\r\n    dropout_prob = 0.5,\r\n\r\n    reduction_factor = 2,  # reduction_factor가 적으면 더 많은 iteration이 필요하므로, 더 많은 메모리가 필요하다.\r\n    \r\n    # Encoder\r\n    enc_conv_num_layers = 3,\r\n    enc_conv_kernel_size = 5,\r\n    enc_conv_channels = 512,\r\n    tacotron_zoneout_rate = 0.1,\r\n    encoder_lstm_units = 256,\r\n\r\n\r\n    attention_type = 'bah_mon_norm',    # 'loc_sen', 'bah_mon_norm'\r\n    attention_size = 128,\r\n\r\n    #Attention mechanism\r\n    smoothing = False, #Whether to smooth the attention normalization function\r\n    attention_dim = 128, #dimension of attention space\r\n    attention_filters = 32, #number of attention convolution filters\r\n    attention_kernel = (31, ), #kernel size of attention convolution\r\n    cumulative_weights = True, #Whether to cumulate (sum) all previous attention weights or simply feed previous weights (Recommended: True)\r\n\r\n    #Attention synthesis constraints\r\n    #\"Monotonic\" constraint forces the model to only look at the forwards attention_win_size steps.\r\n    #\"Window\" allows the model to look at attention_win_size neighbors, both forward and backward steps.\r\n    synthesis_constraint = False,  #Whether to use attention windows constraints in synthesis only (Useful for long utterances synthesis)\r\n    synthesis_constraint_type = 'window', #can be in ('window', 'monotonic'). \r\n    attention_win_size = 7, #Side of the window. Current step does not count. 
If mode is window and attention_win_size is not pair, the 1 extra is provided to backward part of the window.\r\n\r\n    #Loss params\r\n    mask_encoder = True, #whether to mask encoder padding while computing location sensitive attention. Set to True for better prosody but slower convergence.\r\n\r\n\r\n    #Decoder\r\n    prenet_layers = [256, 256], #number of layers and number of units of prenet\r\n    decoder_layers = 2, #number of decoder lstm layers\r\n    decoder_lstm_units = 1024, #number of decoder lstm units on each layer\r\n\r\n    dec_prenet_sizes = [256, 256], #number of layers and number of units of prenet\r\n\r\n\r\n\r\n    #Residual postnet\r\n    postnet_num_layers = 5, #number of postnet convolutional layers\r\n    postnet_kernel_size = (5, ), #size of postnet convolution filters for each layer\r\n    postnet_channels = 512, #number of postnet convolution filters for each layer\r\n\r\n\r\n\r\n    # for linear mel spectrogrma\r\n    post_bank_size = 8,\r\n    post_bank_channel_size = 128,\r\n    post_maxpool_width = 2,\r\n    post_highway_depth = 4,\r\n    post_rnn_size = 128,\r\n    post_proj_sizes = [256, 80], # num_mels=80\r\n    post_proj_width = 3,\r\n\r\n\r\n\r\n    tacotron_reg_weight = 1e-6, #regularization weight (for L2 regularization)\r\n    inference_prenet_dropout = True,\r\n\r\n\r\n    # Eval\r\n    min_tokens = 30,  #originally 50, 30 is good for korean,  text를 token으로 쪼갰을 때, 최소 길이 이상되어야 train에 사용\r\n    min_n_frame = 30*5,  # min_n_frame = reduction_factor * min_iters, reduction_factor와 곱해서 min_n_frame을 설정한다.\r\n    max_n_frame = 200*5,\r\n    skip_inadequate = False,\r\n \r\n    griffin_lim_iters = 60,\r\n    power = 1.5, \r\n \r\n)\r\n\r\nif hparams.use_lws:\r\n    # Does not work if fft_size is not multiple of hop_size!!\r\n    # sample size = 20480, hop_size=256=12.5ms. 
fft_size는 window_size를 결정하는데, 2048을 시간으로 환산하면 2048/20480 = 0.1초=100ms\r\n    hparams.sample_rate = 20480  # \r\n    \r\n    # shift can be specified by either hop_size(우선) or frame_shift_ms\r\n    hparams.hop_size = 256             # frame_shift_ms = 12.5ms\r\n    hparams.frame_shift_ms=None      # hop_size=  sample_rate *  frame_shift_ms / 1000\r\n    hparams.fft_size=2048   # 주로 1024로 되어있는데, tacotron에서 2048사용==> output size = 1025\r\n    hparams.win_size = None # 256x4 --> 50ms\r\n    \r\n  \r\n    \r\nelse:\r\n    # 미리 정의되 parameter들로 부터 consistant하게 정의해 준다.\r\n    hparams.num_freq = int(hparams.fft_size/2 + 1)\r\n    hparams.frame_shift_ms = hparams.hop_size * 1000.0/ hparams.sample_rate      # hop_size=  sample_rate *  frame_shift_ms / 1000\r\n    hparams.frame_length_ms = hparams.win_size * 1000.0/ hparams.sample_rate \r\n\r\n\r\ndef hparams_debug_string():\r\n    values = hparams.values()\r\n    hp = ['  %s: %s' % (name, values[name]) for name in sorted(values)]\r\n    return 'Hyperparameters:\\n' + '\\n'.join(hp)\r\n"
  },
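The audio settings in hparams.py above are interdependent: hop_size fixes the frame shift in milliseconds, and the wavenet upsample_factor must multiply out to exactly one hop. A quick consistency check using the values from the file:

```python
import numpy as np

sample_rate = 24000         # from hparams.py
hop_size = 300              # frame shift in samples
win_size = 1200             # analysis window (50 ms)
fft_size = 2048
upsample_factor = [12, 25]  # wavenet mel-upsampling stages

assert hop_size * 1000.0 / sample_rate == 12.5   # frame_shift_ms
assert win_size * 1000.0 / sample_rate == 50.0   # frame_length_ms
assert np.prod(upsample_factor) == hop_size      # one mel frame -> one hop of samples
assert int(fft_size / 2 + 1) == 1025             # num_freq: linear-spectrogram bins
```

If hop_size changes, upsample_factor has to be refactored so its product still equals the new hop.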
  {
    "path": "preprocess.py",
    "content": "# coding: utf-8\n\"\"\"\npython preprocess.py --num_workers 10 --name son --in_dir D:\\hccho\\multi-speaker-tacotron-tensorflow-master\\datasets\\son --out_dir .\\data\\son\npython preprocess.py --num_workers 10 --name moon --in_dir D:\\hccho\\multi-speaker-tacotron-tensorflow-master\\datasets\\moon --out_dir .\\data\\moon\n ==> out_dir에  'audio', 'mel', 'linear', 'time_steps', 'mel_frames', 'text', 'tokens', 'loss_coeff'를 묶은 npz파일이 생성된다.\n \n \n \n\"\"\"\nimport argparse\nimport os\nfrom multiprocessing import cpu_count\nfrom tqdm import tqdm\nimport importlib\nfrom hparams import hparams, hparams_debug_string\nimport warnings\nwarnings.simplefilter(action='ignore', category=FutureWarning)\n\ndef preprocess(mod, in_dir, out_root,num_workers):\n    os.makedirs(out_dir, exist_ok=True)\n    metadata = mod.build_from_path(hparams, in_dir, out_dir,num_workers=num_workers, tqdm=tqdm)\n    write_metadata(metadata, out_dir)\n\n\ndef write_metadata(metadata, out_dir):\n    with open(os.path.join(out_dir, 'train.txt'), 'w', encoding='utf-8') as f:\n        for m in metadata:\n            f.write('|'.join([str(x) for x in m]) + '\\n')\n    mel_frames = sum([int(m[4]) for m in metadata])\n    timesteps = sum([int(m[3]) for m in metadata])\n    sr = hparams.sample_rate\n    hours = timesteps / sr / 3600\n    print('Write {} utterances, {} mel frames, {} audio timesteps, ({:.2f} hours)'.format(len(metadata), mel_frames, timesteps, hours))\n    print('Max input length (text chars): {}'.format(max(len(m[5]) for m in metadata)))\n    print('Max mel frames length: {}'.format(max(int(m[4]) for m in metadata)))\n    print('Max audio timesteps length: {}'.format(max(m[3] for m in metadata)))\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--name', type=str, default=None)\n    parser.add_argument('--in_dir', type=str, default=None)\n    parser.add_argument('--out_dir', type=str, default=None)\n    
parser.add_argument('--num_workers', type=str, default=None)\n    parser.add_argument('--hparams', type=str, default=None)\n    args = parser.parse_args()\n\n    if args.hparams is not None:\n        hparams.parse(args.hparams)\n    print(hparams_debug_string())\n\n    name = args.name\n    in_dir = args.in_dir\n    out_dir = args.out_dir\n    num_workers = args.num_workers\n    num_workers = cpu_count() if num_workers is None else int(num_workers)  # cpu_count() = process 갯수\n\n    print(\"Sampling frequency: {}\".format(hparams.sample_rate))\n\n    assert name in [\"cmu_arctic\", \"ljspeech\", \"son\", \"moon\"]\n    mod = importlib.import_module('datasets.{}'.format(name))\n    preprocess(mod, in_dir, out_dir, num_workers)\n"
  },
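write_metadata in preprocess.py above stores one pipe-joined row per utterance in train.txt and later reads fields back by index (m[3] = audio timesteps, m[4] = mel frames, m[5] = text). A toy round-trip with hypothetical values — the filenames and text below are made up, only the field order matches the script:

```python
# Hypothetical metadata row in the field order preprocess.py assumes:
# m[3] = audio timesteps, m[4] = mel frames, m[5] = text.
row = ('audio-0001.npz', 'mel-0001.npy', 'linear-0001.npy', 72000, 240, '안녕하세요')
line = '|'.join(str(x) for x in row)      # how write_metadata serializes a row
fields = line.split('|')

timesteps, mel_frames = int(fields[3]), int(fields[4])
assert timesteps // 300 == mel_frames     # consistent with hop_size = 300 in hparams.py
hours = timesteps / 24000 / 3600          # duration at sample_rate = 24000
assert hours > 0
```

The pipe delimiter works because none of the stored fields contain '|'; paths or text with a pipe would break the split.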
  {
    "path": "synthesizer.py",
    "content": "# coding: utf-8\r\n\r\n\"\"\"\r\npython synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text \"그런데 청년은 이렇게 말합니다\"\r\n\r\npython synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text \"이런 논란은 타코트론 논문 이후에 사라졌습니다\"\r\npython synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text \"이런 논란은 타코트론 논문 이후에 사라졌습니다\"\r\n\r\n\r\npython synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text \"오는 6월6일은 제64회 현충일입니다\"\r\npython synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text \"오는 6월6일은 제64회 현충일입니다\"\r\n\r\npython synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text \"오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다\"\r\npython synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text \"오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다\"\r\n\"\"\"\r\nimport io\r\nimport os\r\nimport re\r\nimport librosa\r\nimport argparse\r\nimport numpy as np\r\nfrom glob import glob\r\nfrom tqdm import tqdm\r\nimport tensorflow as tf\r\nfrom functools import partial\r\n\r\nfrom hparams import hparams\r\nfrom tacotron2 import create_model, get_most_recent_checkpoint\r\nfrom utils.audio import save_wav, inv_linear_spectrogram, inv_preemphasis, inv_spectrogram_tensorflow\r\nfrom utils import plot, PARAMS_NAME, load_json, load_hparams, add_prefix, add_postfix, get_time, parallel_run, makedirs, str2bool\r\n\r\nfrom text.korean import tokenize\r\nfrom text import text_to_sequence, sequence_to_text\r\nfrom datasets.datafeeder_tacotron2 import _prepare_inputs\r\nimport warnings\r\nwarnings.simplefilter(action='ignore', 
category=FutureWarning)\r\ntf.logging.set_verbosity(tf.logging.ERROR)\r\n\r\nclass Synthesizer(object):\r\n    def close(self):\r\n        tf.reset_default_graph()\r\n        self.sess.close()\r\n\r\n    def load(self, checkpoint_path, num_speakers=2, checkpoint_step=None, inference_prenet_dropout=True,model_name='tacotron'):\r\n        self.num_speakers = num_speakers\r\n\r\n        if os.path.isdir(checkpoint_path):\r\n            load_path = checkpoint_path\r\n            checkpoint_path = get_most_recent_checkpoint(checkpoint_path, checkpoint_step)\r\n        else:\r\n            load_path = os.path.dirname(checkpoint_path)\r\n\r\n        print('Constructing model: %s' % model_name)\r\n\r\n        inputs = tf.placeholder(tf.int32, [None, None], 'inputs')\r\n        input_lengths = tf.placeholder(tf.int32, [None], 'input_lengths')\r\n\r\n        batch_size = tf.shape(inputs)[0]\r\n        speaker_id = tf.placeholder_with_default(\r\n                tf.zeros([batch_size], dtype=tf.int32), [None], 'speaker_id')\r\n\r\n        load_hparams(hparams, load_path)\r\n        hparams.inference_prenet_dropout = inference_prenet_dropout\r\n        with tf.variable_scope('model') as scope:\r\n            self.model = create_model(hparams)\r\n\r\n            self.model.initialize(inputs=inputs, input_lengths=input_lengths, num_speakers=self.num_speakers, speaker_id=speaker_id,is_training=False)\r\n            self.wav_output = inv_spectrogram_tensorflow(self.model.linear_outputs,hparams)\r\n\r\n        print('Loading checkpoint: %s' % checkpoint_path)\r\n\r\n        sess_config = tf.ConfigProto(\r\n                allow_soft_placement=True,\r\n                intra_op_parallelism_threads=1,\r\n                inter_op_parallelism_threads=2)\r\n        sess_config.gpu_options.allow_growth = True\r\n\r\n        self.sess = tf.Session(config=sess_config)\r\n        self.sess.run(tf.global_variables_initializer())\r\n        saver = tf.train.Saver()\r\n        
saver.restore(self.sess, checkpoint_path)\r\n\r\n    def synthesize(self,\r\n            texts=None, tokens=None,\r\n            base_path=None, paths=None, speaker_ids=None,\r\n            start_of_sentence=None, end_of_sentence=True,\r\n            pre_word_num=0, post_word_num=0,\r\n            pre_surplus_idx=0, post_surplus_idx=1,\r\n            use_short_concat=False,\r\n            base_alignment_path=None,\r\n            librosa_trim=False,\r\n            attention_trim=True,\r\n            isKorean=True):\r\n        # Possible inputs:\r\n        # 1) text=text\r\n        # 2) text=texts\r\n        # 3) tokens=tokens, texts=texts # use texts as guide\r\n\r\n        if type(texts) == str:\r\n            texts = [texts]\r\n\r\n        if texts is not None and tokens is None:\r\n            sequences = np.array([text_to_sequence(text) for text in texts])\r\n            sequences = _prepare_inputs(sequences)\r\n        elif tokens is not None:\r\n            sequences = tokens\r\n\r\n        #sequences = np.pad(sequences,[(0,0),(0,5)],'constant',constant_values=(0))  # case by case ---> overfitting?\r\n        \r\n        if paths is None:\r\n            paths = [None] * len(sequences)\r\n        if texts is None:\r\n            texts = [None] * len(sequences)\r\n\r\n        time_str = get_time()\r\n        def plot_and_save_parallel(wavs, alignments,mels):\r\n\r\n            items = list(enumerate(zip(wavs, alignments, paths, texts, sequences,mels)))\r\n\r\n            fn = partial(\r\n                    plot_graph_and_save_audio,\r\n                    base_path=base_path,\r\n                    start_of_sentence=start_of_sentence, end_of_sentence=end_of_sentence,\r\n                    pre_word_num=pre_word_num, post_word_num=post_word_num,\r\n                    pre_surplus_idx=pre_surplus_idx, post_surplus_idx=post_surplus_idx,\r\n                    use_short_concat=use_short_concat,\r\n                    librosa_trim=librosa_trim,\r\n                   
 attention_trim=attention_trim,\r\n                    time_str=time_str,\r\n                    isKorean=isKorean)\r\n            return parallel_run(fn, items,desc=\"plot_graph_and_save_audio\", parallel=False)\r\n\r\n        #input_lengths = np.argmax(np.array(sequences) == 1, 1)+1\r\n        input_lengths = [np.argmax(a==1)+1 for a in sequences]\r\n\r\n        fetches = [\r\n                #self.wav_output,\r\n                self.model.linear_outputs,\r\n                self.model.alignments,   # batch_size, text length(encoder), target length(decoder)\r\n                self.model.mel_outputs,\r\n        ]\r\n\r\n        feed_dict = { self.model.inputs: sequences, self.model.input_lengths: input_lengths, }\r\n\r\n        if speaker_ids is not None:\r\n            if type(speaker_ids) == dict:\r\n                speaker_embed_table = self.sess.run(\r\n                        self.model.speaker_embed_table)\r\n\r\n                speaker_embed = [speaker_ids[speaker_id] * speaker_embed_table[speaker_id] for speaker_id in speaker_ids]\r\n                feed_dict.update({ self.model.speaker_embed_table: np.tile() })\r\n            else:\r\n                feed_dict[self.model.speaker_id] = speaker_ids\r\n\r\n        wavs, alignments,mels = self.sess.run(fetches, feed_dict=feed_dict)\r\n        results = plot_and_save_parallel(wavs, alignments,mels=mels)\r\n\r\n        return results\r\n\r\ndef plot_graph_and_save_audio(args,\r\n        base_path=None,\r\n        start_of_sentence=None, end_of_sentence=None,\r\n        pre_word_num=0, post_word_num=0,\r\n        pre_surplus_idx=0, post_surplus_idx=1,\r\n        use_short_concat=False,\r\n        save_alignment=False,\r\n        librosa_trim=False, attention_trim=False,\r\n        time_str=None, isKorean=True):\r\n\r\n    idx, (wav, alignment, path, text, sequence,mel) = args\r\n\r\n    if base_path:\r\n        plot_path = \"{}/{}.png\".format(base_path, get_time())\r\n    elif path:\r\n        plot_path = 
path.rsplit('.', 1)[0] + \".png\"\r\n    else:\r\n        plot_path = None\r\n\r\n\r\n    if plot_path:\r\n        plot.plot_alignment(alignment, plot_path, text=text, isKorean=isKorean)\r\n\r\n    if use_short_concat:\r\n        wav = short_concat(\r\n                wav, alignment, text,\r\n                start_of_sentence, end_of_sentence,\r\n                pre_word_num, post_word_num,\r\n                pre_surplus_idx, post_surplus_idx)\r\n\r\n    if attention_trim and end_of_sentence:\r\n        # attention이 text의 마지막까지 왔다면, 그 뒷부분은 버린다.\r\n        end_idx_counter = 0\r\n        attention_argmax = alignment.argmax(0)   # alignment: text length(encoder), target length(decoder)   ==> target length(decoder)\r\n        end_idx = min(len(sequence) - 1, max(attention_argmax))\r\n        max_counter = min((attention_argmax == end_idx).sum(), 5)\r\n\r\n        for jdx, attend_idx in enumerate(attention_argmax):\r\n            if len(attention_argmax) > jdx + 1:\r\n                if attend_idx == end_idx:\r\n                    end_idx_counter += 1\r\n\r\n                if attend_idx == end_idx and attention_argmax[jdx + 1] > end_idx:\r\n                    break\r\n\r\n                if end_idx_counter >= max_counter:\r\n                    break\r\n            else:\r\n                break\r\n\r\n        spec_end_idx = hparams.reduction_factor * jdx + 3\r\n        wav = wav[:spec_end_idx]\r\n        mel = mel[:spec_end_idx]\r\n\r\n    audio_out = inv_linear_spectrogram(wav.T,hparams)\r\n\r\n    if librosa_trim and end_of_sentence:\r\n        yt, index = librosa.effects.trim(audio_out, frame_length=5120, hop_length=256, top_db=50)\r\n        audio_out = audio_out[:index[-1]]\r\n        mel = mel[:index[-1]//hparams.hop_size]\r\n\r\n    if save_alignment:\r\n        alignment_path = \"{}/{}.npy\".format(base_path, idx)\r\n        np.save(alignment_path, alignment, allow_pickle=False)\r\n\r\n    \r\n    if path or base_path:\r\n        if path:\r\n            
current_path = add_postfix(path, idx)\r\n        elif base_path:\r\n            current_path = plot_path.replace(\".png\", \".wav\")\r\n\r\n        save_wav(audio_out, current_path,hparams.sample_rate)\r\n         \r\n        #hccho    \r\n        mel_path = current_path.replace(\".wav\",\".npy\")\r\n        np.save(mel_path,mel)\r\n               \r\n        return True\r\n    else:\r\n        io_out = io.BytesIO()\r\n        save_wav(audio_out, io_out,hparams.sample_rate)\r\n        result = io_out.getvalue()\r\n        return result\r\n\r\ndef get_most_recent_checkpoint(checkpoint_dir, checkpoint_step=None):\r\n    if checkpoint_step is None:\r\n        checkpoint_paths = [path for path in glob(\"{}/*.ckpt-*.data-*\".format(checkpoint_dir))]\r\n        idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths]\r\n\r\n        max_idx = max(idxes)\r\n    else:\r\n        max_idx = checkpoint_step\r\n    lastest_checkpoint = os.path.join(checkpoint_dir, \"model.ckpt-{}\".format(max_idx))\r\n    print(\" [*] Found lastest checkpoint: {}\".format(lastest_checkpoint))\r\n    return lastest_checkpoint\r\n\r\ndef short_concat(\r\n        wav, alignment, text,\r\n        start_of_sentence, end_of_sentence,\r\n        pre_word_num, post_word_num,\r\n        pre_surplus_idx, post_surplus_idx):\r\n\r\n    # np.array(list(decomposed_text))[attention_argmax]\r\n    attention_argmax = alignment.argmax(0)\r\n\r\n    if not start_of_sentence and pre_word_num > 0:\r\n        surplus_decomposed_text = decompose_ko_text(\"\".join(text.split()[0]))\r\n        start_idx = len(surplus_decomposed_text) + 1\r\n\r\n        for idx, attend_idx in enumerate(attention_argmax):\r\n            if attend_idx == start_idx and attention_argmax[idx - 1] < start_idx:\r\n                break\r\n\r\n        wav_start_idx = hparams.reduction_factor * idx - 1 - pre_surplus_idx\r\n    else:\r\n        wav_start_idx = 0\r\n\r\n    if not end_of_sentence and 
post_word_num > 0:\r\n        surplus_decomposed_text = decompose_ko_text(\"\".join(text.split()[-1]))\r\n        end_idx = len(decomposed_text.replace(surplus_decomposed_text, '')) - 1\r\n\r\n        for idx, attend_idx in enumerate(attention_argmax):\r\n            if attend_idx == end_idx and attention_argmax[idx + 1] > end_idx:\r\n                break\r\n\r\n        wav_end_idx = hparams.reduction_factor * idx + 1 + post_surplus_idx\r\n    else:\r\n        if True: # attention based split\r\n            if end_of_sentence:\r\n                end_idx = min(len(decomposed_text) - 1, max(attention_argmax))\r\n            else:\r\n                surplus_decomposed_text = decompose_ko_text(\"\".join(text.split()[-1]))\r\n                end_idx = len(decomposed_text.replace(surplus_decomposed_text, '')) - 1\r\n\r\n            while True:\r\n                if end_idx in attention_argmax:\r\n                    break\r\n                end_idx -= 1\r\n\r\n            end_idx_counter = 0\r\n            for idx, attend_idx in enumerate(attention_argmax):\r\n                if len(attention_argmax) > idx + 1:\r\n                    if attend_idx == end_idx:\r\n                        end_idx_counter += 1\r\n\r\n                    if attend_idx == end_idx and attention_argmax[idx + 1] > end_idx:\r\n                        break\r\n\r\n                    if end_idx_counter > 5:\r\n                        break\r\n                else:\r\n                    break\r\n\r\n            wav_end_idx = hparams.reduction_factor * idx + 1 + post_surplus_idx\r\n        else:\r\n            wav_end_idx = None\r\n\r\n    wav = wav[wav_start_idx:wav_end_idx]\r\n\r\n    if end_of_sentence:\r\n        wav = np.lib.pad(wav, ((0, 20), (0, 0)), 'constant', constant_values=0)\r\n    else:\r\n        wav = np.lib.pad(wav, ((0, 10), (0, 0)), 'constant', constant_values=0)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    parser = argparse.ArgumentParser()\r\n    
parser.add_argument('--load_path', required=True)\r\n    parser.add_argument('--sample_path', default=\"logdir-tacotron2/generate\")\r\n    parser.add_argument('--text', required=True)\r\n    parser.add_argument('--num_speakers', default=1, type=int)\r\n    parser.add_argument('--speaker_id', default=0, type=int)\r\n    parser.add_argument('--checkpoint_step', default=None, type=int)\r\n    parser.add_argument('--is_korean', default=True, type=str2bool)\r\n    parser.add_argument('--base_alignment_path', default=None)\r\n    config = parser.parse_args()\r\n\r\n    makedirs(config.sample_path)\r\n\r\n    synthesizer = Synthesizer()\r\n    synthesizer.load(config.load_path, config.num_speakers, config.checkpoint_step,inference_prenet_dropout=False)\r\n\r\n    audio = synthesizer.synthesize(texts=[config.text],base_path=config.sample_path,speaker_ids=[config.speaker_id],\r\n                                   attention_trim=True,base_alignment_path=config.base_alignment_path,isKorean=config.is_korean)[0]\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n"
  },
  {
    "path": "tacotron2/__init__.py",
    "content": "# coding: utf-8\r\nimport os\r\nfrom glob import glob\r\nfrom .tacotron2 import Tacotron2\r\n\r\n\r\ndef create_model(hparams):\r\n    return Tacotron2(hparams)\r\n\r\n\r\ndef get_most_recent_checkpoint(checkpoint_dir):\r\n    checkpoint_paths = [path for path in glob(\"{}/*.ckpt-*.data-*\".format(checkpoint_dir))]\r\n    idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths]\r\n\r\n    max_idx = max(idxes)\r\n    latest_checkpoint = os.path.join(checkpoint_dir, \"model.ckpt-{}\".format(max_idx))\r\n\r\n    print(\" [*] Found latest checkpoint: {}\".format(latest_checkpoint))\r\n    return latest_checkpoint\r\n"
  },
  {
    "path": "tacotron2/helpers.py",
    "content": "# coding: utf-8\r\n# Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py\r\n\r\nimport numpy as np\r\nimport tensorflow as tf\r\nfrom tensorflow.contrib.seq2seq import Helper\r\n\r\n\r\n# Adapted from tf.contrib.seq2seq.GreedyEmbeddingHelper\r\nclass TacoTestHelper(Helper):\r\n    def __init__(self, batch_size, output_dim, r):\r\n        with tf.name_scope('TacoTestHelper'):\r\n            self._batch_size = batch_size\r\n            self._output_dim = output_dim\r\n            self._end_token = tf.tile([0.0], [output_dim * r])  # [0.0,0.0,...]\r\n            self._reduction_factor = r\r\n    @property\r\n    def batch_size(self):\r\n        return self._batch_size\r\n    \r\n    @property\r\n    def sample_ids_dtype(self):\r\n        return tf.int32\r\n\r\n    @property\r\n    def sample_ids_shape(self):\r\n        return tf.TensorShape([])\r\n    \r\n    def initialize(self, name=None):\r\n        return (tf.tile([False], [self._batch_size]), _go_frames(self._batch_size, self._output_dim))\r\n\r\n    def sample(self, time, outputs, state, name=None):\r\n        return tf.tile([0], [self._batch_size])  # Return all 0; we ignore them\r\n\r\n    def next_inputs(self, time, outputs, state, sample_ids, name=None):\r\n        '''Stop on EOS. Otherwise, pass the last output as the next input and pass through state.'''\r\n        with tf.name_scope('TacoTestHelper'):\r\n            stop_token_preds = tf.nn.sigmoid(outputs[:,-self._reduction_factor:])\r\n            finished = tf.reduce_any(tf.cast(tf.round(stop_token_preds), tf.bool),axis=1)\r\n            # Feed last output frame as next input. 
outputs is [N, output_dim * r]\r\n            \r\n            next_inputs = outputs[:, -(self._output_dim+self._reduction_factor):-self._reduction_factor]  # exclude the stop-token portion\r\n            return (finished, next_inputs, state)\r\n\r\n\r\nclass TacoTrainingHelper(Helper):\r\n    def __init__(self, targets, output_dim, r):\r\n        # inputs is [N, T_in], targets is [N, T_out, D]\r\n        # output_dim = hp.num_mels = 80\r\n        # r = hp.reduction_factor = 4 or 5\r\n        with tf.name_scope('TacoTrainingHelper'):\r\n            self._batch_size = tf.shape(targets)[0]\r\n            self._output_dim = output_dim\r\n\r\n            # Feed every r-th target frame as input\r\n            self._targets = targets[:, r-1::r, :]\r\n\r\n            # Use full length for every target because we don't want to mask the padding frames\r\n            num_steps = tf.shape(self._targets)[1]\r\n            self._lengths = tf.tile([num_steps], [self._batch_size])\r\n\r\n    @property\r\n    def batch_size(self):\r\n        return self._batch_size\r\n\r\n    @property\r\n    def sample_ids_dtype(self):\r\n        return tf.int32\r\n\r\n    @property\r\n    def sample_ids_shape(self):\r\n        return tf.TensorShape([])\r\n\r\n\r\n    def initialize(self, name=None):\r\n        return (tf.tile([False], [self._batch_size]), _go_frames(self._batch_size, self._output_dim))\r\n\r\n    def sample(self, time, outputs, state, name=None):\r\n        return tf.tile([0], [self._batch_size])  # Return all 0; we ignore them\r\n\r\n    def next_inputs(self, time, outputs, state, sample_ids, name=None):  # build and return the decoder input for this time step\r\n        with tf.name_scope(name or 'TacoTrainingHelper'):\r\n            finished = (time + 1 >= self._lengths)\r\n\r\n            next_inputs = self._targets[:, time, :]\r\n            \r\n            return (finished, next_inputs, state)\r\n\r\n\r\ndef _go_frames(batch_size, output_dim):\r\n    '''Returns all-zero <GO> frames for a given batch size 
and output dimension'''\r\n    return tf.tile([[0.0]], [batch_size, output_dim])\r\n"
  },
  {
    "path": "tacotron2/modules.py",
    "content": "# coding: utf-8\r\n# Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py\r\n\r\nimport tensorflow as tf\r\nfrom tensorflow.contrib.rnn import GRUCell\r\nfrom tensorflow.python.layers import core\r\nfrom tensorflow.contrib.seq2seq.python.ops.attention_wrapper import _bahdanau_score, _BaseAttentionMechanism, BahdanauAttention, AttentionWrapper, AttentionWrapperState\r\n\r\n\r\ndef prenet(inputs, is_training, layer_sizes, drop_prob, scope=None):\r\n    x = inputs  # 3-D array (batch, seq_length, embedding_dim)   ==> (batch,seq_length,256)  ==> (batch,seq_length,128)\r\n    with tf.variable_scope(scope or 'prenet'):\r\n        for i, size in enumerate(layer_sizes):  # [f(256), f(256)]\r\n            dense = tf.layers.dense(x, units=size, activation=tf.nn.relu, name='projection_%d' % (i+1))\r\n            # The Tacotron 2 paper applies dropout in both training and inference\r\n            x = tf.layers.dropout(dense, rate=drop_prob,training=True, name='dropout_%d' % (i+1))\r\n    return x\r\n\r\ndef cbhg(inputs, input_lengths, is_training, bank_size, bank_channel_size, maxpool_width, highway_depth, \r\n         rnn_size, proj_sizes, proj_width, scope,before_highway=None, encoder_rnn_init_state=None):\r\n    # inputs: (N,T_in, 128), bank_size: 16\r\n    batch_size = tf.shape(inputs)[0]\r\n    with tf.variable_scope(scope):\r\n        with tf.variable_scope('conv_bank'):\r\n            # Convolution bank: concatenate on the last axis\r\n            # to stack channels from all convolutions\r\n            conv_fn = lambda k: conv1d(inputs, k, bank_channel_size, tf.nn.relu, is_training, 'conv1d_%d' % k)  # bank_channel_size =128\r\n\r\n            conv_outputs = tf.concat( [conv_fn(k) for k in range(1, bank_size+1)], axis=-1,)  # ==> (N,T_in,128*bank_size)\r\n\r\n        # Maxpooling:\r\n        maxpool_output 
= tf.layers.max_pooling1d(conv_outputs,pool_size=maxpool_width,strides=1,padding='same')  # maxpool_width = 2\r\n\r\n        # Two projection layers:\r\n        proj_out = maxpool_output\r\n        for idx, proj_size in enumerate(proj_sizes):   # [f(128), f(128)],  post: [f(256), f(80)]\r\n            activation_fn = None if idx == len(proj_sizes) - 1 else tf.nn.relu\r\n            proj_out = conv1d(proj_out, proj_width, proj_size, activation_fn,is_training, 'proj_{}'.format(idx + 1))  # proj_width = 3\r\n\r\n        # Residual connection:\r\n        if before_highway is not None: # multi-speaker mode\r\n            expanded_before_highway = tf.expand_dims(before_highway, [1])\r\n            tiled_before_highway = tf.tile(expanded_before_highway, [1, tf.shape(proj_out)[1], 1])\r\n\r\n            highway_input = proj_out + inputs + tiled_before_highway\r\n        else: # single-speaker mode\r\n            highway_input = proj_out + inputs\r\n\r\n        # Handle dimensionality mismatch:\r\n        if highway_input.shape[2] != rnn_size:  # rnn_size = 128\r\n            highway_input = tf.layers.dense(highway_input, rnn_size,name='highway_projection')\r\n\r\n        # 4-layer HighwayNet:\r\n        for idx in range(highway_depth):\r\n            highway_input = highwaynet(highway_input, 'highway_%d' % (idx+1))\r\n\r\n        rnn_input = highway_input\r\n\r\n        # Bidirectional RNN\r\n        if encoder_rnn_init_state is not None:\r\n            initial_state_fw, initial_state_bw = tf.split(encoder_rnn_init_state, 2, 1)\r\n        else:  # single-speaker mode\r\n            initial_state_fw, initial_state_bw = None, None\r\n\r\n        cell_fw, cell_bw = GRUCell(rnn_size), GRUCell(rnn_size)\r\n        outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,rnn_input,sequence_length=input_lengths,\r\n                                                          initial_state_fw=initial_state_fw,initial_state_bw=initial_state_bw,dtype=tf.float32)\r\n        return 
tf.concat(outputs, axis=2)    # Concat forward and backward\r\n\r\n\r\ndef batch_tile(tensor, batch_size):\r\n    expanded_tensor = tf.expand_dims(tensor, [0])\r\n    return tf.tile(expanded_tensor, \\\r\n            [batch_size] + [1 for _ in tensor.get_shape()])\r\n\r\n\r\ndef highwaynet(inputs, scope):\r\n    highway_dim = int(inputs.get_shape()[-1])\r\n\r\n    with tf.variable_scope(scope):\r\n        H = tf.layers.dense(inputs,units=highway_dim, activation=tf.nn.relu,name='H_projection')\r\n        T = tf.layers.dense(inputs,units=highway_dim, activation=tf.nn.sigmoid,name='T_projection',bias_initializer=tf.constant_initializer(-1.0))\r\n        return H * T + inputs * (1.0 - T)\r\n\r\n\r\ndef conv1d(inputs, kernel_size, channels, activation, is_training, scope):\r\n    with tf.variable_scope(scope):\r\n        # With strides=1 and padding='same', the time dimension is preserved regardless of kernel_size,\r\n        # so outputs from different kernel sizes can be concatenated.\r\n        conv1d_output = tf.layers.conv1d(inputs,filters=channels,kernel_size=kernel_size,activation=activation,padding='same')\r\n        return tf.layers.batch_normalization(conv1d_output, training=is_training)\r\n"
  },
  {
    "path": "tacotron2/rnn_wrappers.py",
    "content": "# coding: utf-8\r\nimport numpy as np\r\nimport tensorflow as tf\r\nfrom tensorflow.contrib.rnn import RNNCell\r\nfrom tensorflow.python.ops import rnn_cell_impl\r\n#from tensorflow.contrib.data.python.util import nest\r\nfrom tensorflow.contrib.framework import nest\r\nfrom tensorflow.contrib.seq2seq.python.ops.attention_wrapper import _bahdanau_score, _BaseAttentionMechanism, BahdanauAttention, \\\r\n                             AttentionWrapperState, AttentionMechanism, _BaseMonotonicAttentionMechanism,_maybe_mask_score,_prepare_memory,_monotonic_probability_fn\r\nfrom tensorflow.python.ops import array_ops, math_ops, nn_ops, variable_scope\r\nfrom tensorflow.python.layers.core import Dense\r\nfrom .modules import prenet\r\nimport functools\r\n_zero_state_tensors = rnn_cell_impl._zero_state_tensors\r\n\r\n\r\nclass ZoneoutLSTMCell(RNNCell):\r\n    '''Wrapper for tf LSTM to create Zoneout LSTM Cell\r\n\r\n    inspired by:\r\n    https://github.com/teganmaharaj/zoneout/blob/master/zoneout_tensorflow.py\r\n\r\n    Published by one of 'https://arxiv.org/pdf/1606.01305.pdf' paper writers.\r\n\r\n    Many thanks to @Ondal90 for pointing this out. You sir are a hero!\r\n    '''\r\n    def __init__(self, num_units, is_training, zoneout_factor_cell=0., zoneout_factor_output=0., state_is_tuple=True, name=None):\r\n        '''Initializer with possibility to set different zoneout values for cell/hidden states.\r\n        '''\r\n        zm = min(zoneout_factor_output, zoneout_factor_cell)\r\n        zs = max(zoneout_factor_output, zoneout_factor_cell)\r\n\r\n        if zm < 0. 
or zs > 1.:\r\n            raise ValueError('One/both provided Zoneout factors are not in [0, 1]')\r\n\r\n        self._cell = tf.nn.rnn_cell.LSTMCell(num_units, state_is_tuple=state_is_tuple, name=name)\r\n        self._zoneout_cell = zoneout_factor_cell\r\n        self._zoneout_outputs = zoneout_factor_output\r\n        self.is_training = is_training\r\n        self.state_is_tuple = state_is_tuple\r\n\r\n    @property\r\n    def state_size(self):\r\n        return self._cell.state_size\r\n\r\n    @property\r\n    def output_size(self):\r\n        return self._cell.output_size\r\n\r\n    def __call__(self, inputs, state, scope=None):\r\n        '''Runs vanilla LSTM Cell and applies zoneout.\r\n        '''\r\n        #Apply vanilla LSTM\r\n        output, new_state = self._cell(inputs, state, scope)\r\n\r\n        if self.state_is_tuple:\r\n            (prev_c, prev_h) = state\r\n            (new_c, new_h) = new_state\r\n        else:\r\n            num_proj = self._cell._num_units if self._cell._num_proj is None else self._cell._num_proj\r\n            prev_c = tf.slice(state, [0, 0], [-1, self._cell._num_units])\r\n            prev_h = tf.slice(state, [0, self._cell._num_units], [-1, num_proj])\r\n            new_c = tf.slice(new_state, [0, 0], [-1, self._cell._num_units])\r\n            new_h = tf.slice(new_state, [0, self._cell._num_units], [-1, num_proj])\r\n\r\n        #Apply zoneout\r\n        if self.is_training:\r\n            #nn.dropout takes keep_prob (probability to keep activations) not drop_prob (probability to mask activations)!\r\n            c = (1 - self._zoneout_cell) * tf.nn.dropout(new_c - prev_c, (1 - self._zoneout_cell)) + prev_c   # tf.nn.dropout outputs the input element scaled up by 1 / keep_prob\r\n            h = (1 - self._zoneout_outputs) * tf.nn.dropout(new_h - prev_h, (1 - self._zoneout_outputs)) + prev_h\r\n\r\n        else:\r\n            c = (1 - self._zoneout_cell) * new_c + self._zoneout_cell * prev_c\r\n            h = (1 - 
self._zoneout_outputs) * new_h + self._zoneout_outputs * prev_h\r\n\r\n        new_state = tf.nn.rnn_cell.LSTMStateTuple(c, h) if self.state_is_tuple else tf.concat([c, h], 1)  # tf.concat takes (values, axis) since TF 1.0\r\n\r\n        return output, new_state\r\n\r\nclass DecoderWrapper(RNNCell):\r\n    '''Runs RNN inputs through a prenet before sending them to the cell.'''\r\n    def __init__(self, cell, is_training, prenet_sizes, dropout_prob,inference_prenet_dropout=True):\r\n\r\n        super(DecoderWrapper, self).__init__()\r\n        self._is_training = is_training\r\n\r\n        self._cell = cell\r\n\r\n        self.prenet_sizes = prenet_sizes\r\n        if not is_training and not inference_prenet_dropout:\r\n            self.dropout_prob = 0.\r\n        else:\r\n            self.dropout_prob = dropout_prob\r\n\r\n    @property\r\n    def state_size(self):\r\n        return self._cell.state_size\r\n\r\n    @property\r\n    def output_size(self):\r\n        return self._cell.output_size + self._cell.state_size.attention\r\n\r\n    def call(self, inputs, state):\r\n        prenet_out = prenet(inputs, self._is_training,self.prenet_sizes, self.dropout_prob, scope='decoder_prenet')\r\n\r\n        output, res_state = self._cell(prenet_out, state)\r\n        \r\n        return tf.concat([output, res_state.attention], axis=-1), res_state\r\n\r\n    def zero_state(self, batch_size, dtype):\r\n        return self._cell.zero_state(batch_size, dtype)\r\n\r\n\r\n\r\nclass LocationSensitiveAttention(BahdanauAttention):\r\n    \"\"\"Implements Bahdanau-style (cumulative) scoring function.\r\n    Usually referred to as \"hybrid\" attention (content-based + location-based)\r\n    Extends the additive attention described in:\r\n    \"D. Bahdanau, K. Cho, and Y. 
Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of ICLR, 2015.\"\r\n    to use previous alignments as additional location features.\r\n\r\n    This attention is described in:\r\n    J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.\r\n    \"\"\"\r\n\r\n    def __init__(self,\r\n                 num_units,\r\n                 memory,\r\n                 hparams,\r\n                 is_training,\r\n                 mask_encoder=True,\r\n                 memory_sequence_length=None,\r\n                 smoothing=False,\r\n                 cumulate_weights=True,\r\n                 name='LocationSensitiveAttention'):\r\n        \"\"\"Construct the Attention mechanism.\r\n        Args:\r\n            num_units: The depth of the query mechanism.\r\n            memory: The memory to query; usually the output of an RNN encoder.  This\r\n                tensor should be shaped `[batch_size, max_time, ...]`.\r\n            mask_encoder (optional): Boolean, whether to mask encoder paddings.\r\n            memory_sequence_length (optional): Sequence lengths for the batch entries\r\n                in memory.  If provided, the memory tensor rows are masked with zeros\r\n                for values past the respective sequence lengths. Only relevant if mask_encoder = True.\r\n            smoothing (optional): Boolean. Determines which normalization function to use.\r\n                Default normalization function (probability_fn) is softmax. If smoothing is\r\n                enabled, we replace softmax with:\r\n                        a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j}))\r\n                Introduced in:\r\n                    J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.\r\n                This is mainly used if the model wants to attend to multiple input parts\r\n                at the same decoding step. We probably won't be using it since multiple sound\r\n                frames may depend on the same character/phone, probably not the other way around.\r\n                Note:\r\n                    We still keep it implemented in case we want to test it. They used it in the\r\n                    paper in the context of speech recognition, where one phoneme may depend on\r\n                    multiple subsequent sound frames.\r\n            name: Name to use when creating ops.\r\n        \"\"\"\r\n        #Create normalization function\r\n        #Setting it to None defaults to using softmax\r\n        normalization_function = _smoothing_normalization if (smoothing == True) else None\r\n        memory_length = memory_sequence_length if (mask_encoder==True) else None\r\n        super(LocationSensitiveAttention, self).__init__(\r\n                num_units=num_units,\r\n                memory=memory,\r\n                memory_sequence_length=memory_length,\r\n                probability_fn=normalization_function,\r\n                name=name)\r\n\r\n        self.location_convolution = tf.layers.Conv1D(filters=hparams.attention_filters,\r\n            kernel_size=hparams.attention_kernel, padding='same', use_bias=True,\r\n            bias_initializer=tf.zeros_initializer(), name='location_features_convolution')\r\n        self.location_layer = tf.layers.Dense(units=num_units, use_bias=False,dtype=tf.float32, name='location_features_projection')\r\n        self._cumulate = cumulate_weights\r\n        self.synthesis_constraint = hparams.synthesis_constraint and not is_training\r\n        self.attention_win_size = 
tf.convert_to_tensor(hparams.attention_win_size, dtype=tf.int32)\r\n        self.constraint_type = hparams.synthesis_constraint_type\r\n\r\n    def __call__(self, query, state):\r\n        \"\"\"Score the query based on the keys and values.\r\n        Args:\r\n            query: Tensor of dtype matching `self.values` and shape\r\n                `[batch_size, query_depth]`.\r\n            state (previous alignments): Tensor of dtype matching `self.values` and shape\r\n                `[batch_size, alignments_size]`\r\n                (`alignments_size` is memory's `max_time`).\r\n        Returns:\r\n            alignments: Tensor of dtype matching `self.values` and shape\r\n                `[batch_size, alignments_size]` (`alignments_size` is memory's\r\n                `max_time`).\r\n        \"\"\"\r\n        previous_alignments = state\r\n        with variable_scope.variable_scope(None, \"Location_Sensitive_Attention\", [query]):\r\n\r\n            # processed_query shape [batch_size, query_depth] -> [batch_size, attention_dim]\r\n            processed_query = self.query_layer(query) if self.query_layer else query\r\n            # -> [batch_size, 1, attention_dim]\r\n            processed_query = tf.expand_dims(processed_query, 1)\r\n\r\n            # processed_location_features shape [batch_size, max_time, attention dimension]\r\n            # [batch_size, max_time] -> [batch_size, max_time, 1]\r\n            expanded_alignments = tf.expand_dims(previous_alignments, axis=2)\r\n            # location features [batch_size, max_time, filters]\r\n            f = self.location_convolution(expanded_alignments)\r\n            # Projected location features [batch_size, max_time, attention_dim]\r\n            processed_location_features = self.location_layer(f)\r\n\r\n            # energy shape [batch_size, max_time]\r\n            energy = _location_sensitive_score(processed_query, processed_location_features, self.keys)\r\n\r\n        if self.synthesis_constraint:\r\n 
           prev_max_attentions = tf.argmax(previous_alignments, -1, output_type=tf.int32)\r\n            Tx = tf.shape(energy)[-1]\r\n            if self.constraint_type == 'monotonic':\r\n                key_masks = tf.sequence_mask(prev_max_attentions, Tx)\r\n                reverse_masks = tf.sequence_mask(Tx - self.attention_win_size - prev_max_attentions, Tx)[:, ::-1]\r\n            else:\r\n                assert self.constraint_type == 'window'\r\n                key_masks = tf.sequence_mask(prev_max_attentions - (self.attention_win_size // 2 + (self.attention_win_size % 2 != 0)), Tx)\r\n                reverse_masks = tf.sequence_mask(Tx - (self.attention_win_size // 2) - prev_max_attentions, Tx)[:, ::-1]\r\n            \r\n            masks = tf.logical_or(key_masks, reverse_masks)\r\n            paddings = tf.ones_like(energy) * (-2 ** 32 + 1)  # (N, Ty/r, Tx)\r\n            energy = tf.where(tf.equal(masks, False), energy, paddings)\r\n\r\n        # alignments shape = energy shape = [batch_size, max_time]\r\n        alignments = self._probability_fn(energy, previous_alignments)\r\n\r\n        # Cumulate alignments\r\n        if self._cumulate:\r\n            next_state = alignments + previous_alignments\r\n        else:\r\n            next_state = alignments\r\n\r\n        return alignments, next_state\r\n\r\n\r\ndef _location_sensitive_score(W_query, W_fil, W_keys):\r\n    \"\"\"Implements Bahdanau-style (cumulative) scoring function.\r\n    This attention is described in:\r\n        J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. 
Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.\r\n\r\n    #############################################################################\r\n              hybrid attention (content-based + location-based)\r\n                               f = F * α_{i-1}\r\n       energy = dot(v_a, tanh(W_keys(h_enc) + W_query(h_dec) + W_fil(f) + b_a))\r\n    #############################################################################\r\n\r\n    Args:\r\n        W_query: Tensor, shape '[batch_size, 1, attention_dim]' to compare to location features.\r\n        W_fil: processed previous alignments into location features, shape '[batch_size, max_time, attention_dim]'\r\n        W_keys: Tensor, shape '[batch_size, max_time, attention_dim]', typically the encoder outputs.\r\n    Returns:\r\n        A '[batch_size, max_time]' attention score (energy)\r\n    \"\"\"\r\n    # Get the number of hidden units from the trailing dimension of keys\r\n    dtype = W_query.dtype\r\n    num_units = W_keys.shape[-1].value or array_ops.shape(W_keys)[-1]\r\n\r\n    v_a = tf.get_variable(\r\n        'attention_variable_projection', shape=[num_units], dtype=dtype,\r\n        initializer=tf.contrib.layers.xavier_initializer())\r\n    b_a = tf.get_variable(\r\n        'attention_bias', shape=[num_units], dtype=dtype,\r\n        initializer=tf.zeros_initializer())\r\n\r\n    return tf.reduce_sum(v_a * tf.tanh(W_keys + W_query + W_fil + b_a), [2])\r\n\r\n\r\ndef _smoothing_normalization(e):\r\n    \"\"\"Applies a smoothing normalization function instead of softmax\r\n    Introduced in:\r\n        J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. 
Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.\r\n\r\n    ############################################################################\r\n                        Smoothing normalization function\r\n                a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j}))\r\n    ############################################################################\r\n\r\n    Args:\r\n        e: matrix [batch_size, max_time(memory_time)]: expected to be energy (score)\r\n            values of an attention mechanism\r\n    Returns:\r\n        matrix [batch_size, max_time]: [0, 1] normalized alignments with possible\r\n            attendance to multiple memory time steps.\r\n    \"\"\"\r\n    return tf.nn.sigmoid(e) / tf.reduce_sum(tf.nn.sigmoid(e), axis=-1, keepdims=True)\r\n\r\n\r\nclass GmmAttention(AttentionMechanism):\r\n    def __init__(self,\r\n                 num_mixtures,\r\n                 memory,\r\n                 memory_sequence_length=None,\r\n                 check_inner_dims_defined=True,\r\n                 score_mask_value=None,\r\n                 name='GmmAttention'):\r\n\r\n        self.dtype = memory.dtype\r\n        self.num_mixtures = num_mixtures\r\n        self.query_layer = tf.layers.Dense(3 * num_mixtures, name='gmm_query_projection', use_bias=True, dtype=self.dtype)\r\n\r\n        with tf.name_scope(name, 'GmmAttentionMechanismInit'):\r\n            if score_mask_value is None:\r\n                score_mask_value = 0.\r\n            self._maybe_mask_score = functools.partial(\r\n                _maybe_mask_score,\r\n                memory_sequence_length=memory_sequence_length,\r\n                score_mask_value=score_mask_value)\r\n            self._value = _prepare_memory(\r\n                memory, memory_sequence_length, check_inner_dims_defined)\r\n            self._batch_size = (\r\n                self._value.shape[0].value or 
tf.shape(self._value)[0])\r\n            self._alignments_size = (\r\n                    self._value.shape[1].value or tf.shape(self._value)[1])\r\n\r\n    @property\r\n    def values(self):\r\n        return self._value\r\n\r\n    @property\r\n    def batch_size(self):\r\n        return self._batch_size\r\n\r\n    @property\r\n    def alignments_size(self):\r\n        return self._alignments_size\r\n\r\n    @property\r\n    def state_size(self):\r\n        return self.num_mixtures\r\n\r\n    def initial_alignments(self, batch_size, dtype):\r\n        max_time = self._alignments_size\r\n        return _zero_state_tensors(max_time, batch_size, dtype)\r\n\r\n    def initial_state(self, batch_size, dtype):\r\n        state_size_ = self.state_size\r\n        return _zero_state_tensors(state_size_, batch_size, dtype)\r\n\r\n    def __call__(self, query, state):\r\n        with tf.variable_scope(\"GmmAttention\"):\r\n            previous_kappa = state\r\n            \r\n            params = self.query_layer(query)   # query(dec_rnn_size=256) , params(num_mixtures(256)*3)\r\n            alpha_hat, beta_hat, kappa_hat = tf.split(params, num_or_size_splits=3, axis=1)\r\n\r\n            # [batch_size, num_mixtures, 1]\r\n            alpha = tf.expand_dims(tf.exp(alpha_hat), axis=2)\r\n            # softmax makes the alpha value more stable.\r\n            # alpha = tf.expand_dims(tf.nn.softmax(alpha_hat, axis=1), axis=2)\r\n            beta = tf.expand_dims(tf.exp(beta_hat), axis=2)\r\n            kappa = tf.expand_dims(previous_kappa + tf.exp(kappa_hat), axis=2)\r\n\r\n            # [1, 1, max_input_steps]\r\n            mu = tf.reshape(tf.cast(tf.range(self.alignments_size), dtype=tf.float32), shape=[1, 1, self.alignments_size])  # [[[0,1,2,...]]]\r\n\r\n            # [batch_size, max_input_steps]\r\n            phi = tf.reduce_sum(alpha * tf.exp(-beta * (kappa - mu) ** 2.), axis=1)\r\n\r\n        alignments = self._maybe_mask_score(phi)\r\n        state = 
tf.squeeze(kappa, axis=2)\r\n\r\n        return alignments, state\r\n"
  },
  {
    "path": "tacotron2/tacotron2.py",
    "content": "# coding: utf-8\r\n\r\n# Code based on https://github.com/keithito/tacotron/blob/master/models/tacotron.py\r\n\r\n\"\"\"\r\nModel fixes\r\n1. Fixed incorrect dropout application in the prenet.\r\n2. Fixed the AttentionWrapper application order (keithito's code implements this correctly).\r\n3. Applied normalize=True to BahdanauMonotonicAttention (applied 2018-09-11).\r\n4. Passed memory_sequence_length to BahdanauMonotonicAttention.\r\n5. Fixed the input_lengths miscalculation in synthesizer.py (it must be incremented by 1).\r\n\"\"\"\r\n\r\n\r\nimport numpy as np\r\nimport tensorflow as tf\r\nfrom tensorflow.contrib.seq2seq import BasicDecoder, BahdanauAttention, BahdanauMonotonicAttention,LuongAttention\r\nfrom tensorflow.contrib.rnn import GRUCell, MultiRNNCell, OutputProjectionWrapper, ResidualWrapper,LSTMStateTuple\r\n\r\nfrom utils.infolog import log\r\nfrom text.symbols import symbols\r\n\r\nfrom .modules import *\r\nfrom .helpers import TacoTestHelper, TacoTrainingHelper\r\nfrom .rnn_wrappers import LocationSensitiveAttention,GmmAttention,ZoneoutLSTMCell,DecoderWrapper\r\n\r\n\r\nclass Tacotron2():\r\n    def __init__(self, hparams):\r\n        self._hparams = hparams\r\n\r\n\r\n    def initialize(self, inputs, input_lengths, num_speakers, speaker_id=None,mel_targets=None, linear_targets=None, is_training= False,loss_coeff=None,stop_token_targets=None):\r\n        \r\n\r\n        with tf.variable_scope('Eembedding') as scope:\r\n            hp = self._hparams\r\n            batch_size = tf.shape(inputs)[0]\r\n\r\n            # Embeddings(256)\r\n            char_embed_table = tf.get_variable('inputs_embedding', [len(symbols), hp.embedding_size], dtype=tf.float32,initializer=tf.truncated_normal_initializer(stddev=0.5))\r\n            \r\n            zero_pad = True\r\n            if zero_pad:    # logic borrowed from a transformer implementation\r\n                # The <PAD> token (index 0) gets a fixed all-zero embedding that is never updated during training;\r\n                # i.e., the first row of the variable created by get_variable above is never used.\r\n                char_embed_table = tf.concat((tf.zeros(shape=[1, hp.embedding_size]),char_embed_table[1:, :]), 0)\r\n            \r\n            \r\n            # [N, T_in, embedding_size]\r\n            char_embedded_inputs = tf.nn.embedding_lookup(char_embed_table, inputs)\r\n\r\n            self.num_speakers = num_speakers\r\n            if self.num_speakers > 1:\r\n                speaker_embed_table = tf.get_variable('speaker_embedding',[self.num_speakers, hp.speaker_embedding_size], dtype=tf.float32,initializer=tf.truncated_normal_initializer(stddev=0.5))\r\n                # [N, T_in, speaker_embedding_size]\r\n                speaker_embed = tf.nn.embedding_lookup(speaker_embed_table, speaker_id)\r\n                \r\n                deep_dense = lambda x, dim,name: tf.layers.dense(x, dim, activation=tf.nn.softsign,name=name)   # softsign: x / (abs(x) + 1)\r\n\r\n                encoder_rnn_init_state = deep_dense( speaker_embed, hp.encoder_lstm_units * 4,'encoder_init_dense')  # hp.encoder_lstm_units = 256\r\n\r\n                decoder_rnn_init_states = [deep_dense(speaker_embed, hp.decoder_lstm_units*2,'decoder_init_dense_{}'.format(i)) for i in range(hp.decoder_layers)]  # hp.decoder_lstm_units = 1024\r\n\r\n                speaker_embed = None\r\n            else:\r\n                # case: self.num_speakers == 1\r\n                speaker_embed = None\r\n                encoder_rnn_init_state = None   # init state of the bidirectional GRU\r\n                attention_rnn_init_state = None\r\n                decoder_rnn_init_states = None\r\n        \r\n        \r\n        with tf.variable_scope('Encoder') as scope:\r\n            ##############\r\n            # Encoder\r\n            ##############\r\n            x = char_embedded_inputs\r\n            for i in range(hp.enc_conv_num_layers):\r\n                x = 
tf.layers.conv1d(x, filters=hp.enc_conv_channels, kernel_size=hp.enc_conv_kernel_size, padding='same', activation=tf.nn.relu, name='Encoder_{}'.format(i))\r\n                x = tf.layers.batch_normalization(x, training=is_training)\r\n                x = tf.layers.dropout(x, rate=hp.dropout_prob, training=is_training, name='dropout_{}'.format(i))\r\n\r\n            if encoder_rnn_init_state is not None:\r\n                initial_state_fw_c, initial_state_fw_h, initial_state_bw_c, initial_state_bw_h = tf.split(encoder_rnn_init_state, 4, 1)\r\n                initial_state_fw = LSTMStateTuple(initial_state_fw_c, initial_state_fw_h)\r\n                initial_state_bw = LSTMStateTuple(initial_state_bw_c, initial_state_bw_h)\r\n            else:  # single-speaker mode\r\n                initial_state_fw, initial_state_bw = None, None\r\n\r\n            cell_fw = ZoneoutLSTMCell(hp.encoder_lstm_units, is_training, zoneout_factor_cell=hp.tacotron_zoneout_rate, zoneout_factor_output=hp.tacotron_zoneout_rate, name='encoder_fw_LSTM')\r\n            cell_bw = ZoneoutLSTMCell(hp.encoder_lstm_units, is_training, zoneout_factor_cell=hp.tacotron_zoneout_rate, zoneout_factor_output=hp.tacotron_zoneout_rate, name='encoder_bw_LSTM')\r\n            encoder_conv_output = x\r\n            outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, encoder_conv_output, sequence_length=input_lengths,\r\n                                                              initial_state_fw=initial_state_fw, initial_state_bw=initial_state_bw, dtype=tf.float32)\r\n\r\n            # encoder_outputs = [N,T,2*encoder_lstm_units] = [N,T,512]\r\n            encoder_outputs = tf.concat(outputs, axis=2)  # Concat and return forward + backward outputs\r\n\r\n        with tf.variable_scope('Decoder') as scope:\r\n\r\n            ##############\r\n            # Attention\r\n            ##############\r\n            if hp.attention_type == 
'bah_mon':\r\n                attention_mechanism = BahdanauMonotonicAttention(hp.attention_size, encoder_outputs, memory_sequence_length=input_lengths, normalize=False)\r\n            elif hp.attention_type == 'bah_mon_norm':  # added by hccho\r\n                attention_mechanism = BahdanauMonotonicAttention(hp.attention_size, encoder_outputs, memory_sequence_length=input_lengths, normalize=True)\r\n            elif hp.attention_type == 'loc_sen':  # Location Sensitive Attention\r\n                attention_mechanism = LocationSensitiveAttention(hp.attention_size, encoder_outputs, hparams=hp, is_training=is_training,\r\n                                    mask_encoder=hp.mask_encoder, memory_sequence_length=input_lengths, smoothing=hp.smoothing, cumulate_weights=hp.cumulative_weights)\r\n            elif hp.attention_type == 'gmm':  # GMM Attention\r\n                attention_mechanism = GmmAttention(hp.attention_size, memory=encoder_outputs, memory_sequence_length=input_lengths)\r\n            elif hp.attention_type == 'bah_norm':\r\n                attention_mechanism = BahdanauAttention(hp.attention_size, encoder_outputs, memory_sequence_length=input_lengths, normalize=True)\r\n            elif hp.attention_type == 'luong_scaled':\r\n                attention_mechanism = LuongAttention(hp.attention_size, encoder_outputs, memory_sequence_length=input_lengths, scale=True)\r\n            elif hp.attention_type == 'luong':\r\n                attention_mechanism = LuongAttention(hp.attention_size, encoder_outputs, memory_sequence_length=input_lengths)\r\n            elif hp.attention_type == 'bah':\r\n                attention_mechanism = BahdanauAttention(hp.attention_size, encoder_outputs, memory_sequence_length=input_lengths)\r\n            else:\r\n                raise Exception(\" [!] 
Unknown attention type: {}\".format(hp.attention_type))\r\n\r\n            decoder_lstm = [ZoneoutLSTMCell(hp.decoder_lstm_units, is_training, zoneout_factor_cell=hp.tacotron_zoneout_rate,\r\n                                            zoneout_factor_output=hp.tacotron_zoneout_rate, name='decoder_LSTM_{}'.format(i+1)) for i in range(hp.decoder_layers)]\r\n\r\n            decoder_lstm = tf.contrib.rnn.MultiRNNCell(decoder_lstm, state_is_tuple=True)\r\n            decoder_init_state = decoder_lstm.zero_state(batch_size=batch_size, dtype=tf.float32)  # The state returned by zero_state here also contains the values already supplied to the AttentionWrapper.\r\n\r\n            if hp.model_type == \"multi-speaker\":\r\n\r\n                decoder_init_state = list(decoder_init_state)\r\n\r\n                for idx, cell in enumerate(decoder_rnn_init_states):\r\n                    shape1 = decoder_init_state[idx][0].get_shape().as_list()\r\n                    shape2 = cell.get_shape().as_list()\r\n                    if shape1[1] * 2 != shape2[1]:\r\n                        raise Exception(\" [!] Shapes {} and {} are incompatible: dim 1 of the latter must be twice dim 1 of the former\".format(shape1, shape2))\r\n                    c, h = tf.split(cell, 2, 1)\r\n                    decoder_init_state[idx] = LSTMStateTuple(c, h)\r\n\r\n                decoder_init_state = tuple(decoder_init_state)\r\n\r\n            attention_cell = AttentionWrapper(decoder_lstm, attention_mechanism, initial_cell_state=decoder_init_state,\r\n                                              alignment_history=True, output_attention=False)  # Note output_attention=False; attention_layer_size is not set, so the attention output is the context vector itself.\r\n\r\n            # attention_state_size = 256\r\n            # Decoder input -> prenet -> decoder_lstm -> concat[output, attention]\r\n            dec_prenet_outputs = DecoderWrapper(attention_cell, is_training, hp.dec_prenet_sizes, hp.dropout_prob, hp.inference_prenet_dropout)\r\n\r\n            dec_outputs_cell = OutputProjectionWrapper(dec_prenet_outputs, (hp.num_mels + 1) * hp.reduction_factor)\r\n\r\n            if is_training:\r\n                helper = TacoTrainingHelper(mel_targets, hp.num_mels, hp.reduction_factor)  # inputs is only used to compute batch_size\r\n            else:\r\n                helper = TacoTestHelper(batch_size, hp.num_mels, hp.reduction_factor)\r\n\r\n            decoder_init_state = dec_outputs_cell.zero_state(batch_size=batch_size, dtype=tf.float32)\r\n            (decoder_outputs, _), final_decoder_state, _ = \\\r\n                    tf.contrib.seq2seq.dynamic_decode(BasicDecoder(dec_outputs_cell, helper, decoder_init_state), maximum_iterations=int(hp.max_n_frame/hp.reduction_factor))  # max_iters=200\r\n\r\n            decoder_mel_outputs = tf.reshape(decoder_outputs[:, :, :hp.num_mels * hp.reduction_factor], [batch_size, -1, hp.num_mels])   # [N,iters,400] -> [N,5*iters,80]\r\n            stop_token_outputs = tf.reshape(decoder_outputs[:, :, hp.num_mels * hp.reduction_factor:], [batch_size, -1])  # [N,iters]\r\n\r\n            # Postnet\r\n            x = decoder_mel_outputs\r\n            for i in range(hp.postnet_num_layers):\r\n                activation = tf.nn.tanh if i != (hp.postnet_num_layers-1) else None\r\n                x = tf.layers.conv1d(x, filters=hp.postnet_channels, kernel_size=hp.postnet_kernel_size, padding='same', activation=activation, name='Postnet_{}'.format(i))\r\n                x = tf.layers.batch_normalization(x, training=is_training)\r\n                x = tf.layers.dropout(x, rate=hp.dropout_prob, training=is_training, name='Postnet_dropout_{}'.format(i))\r\n 
\r\n            residual = tf.layers.dense(x,hp.num_mels,name='residual_projection')\r\n            mel_outputs = decoder_mel_outputs + residual\r\n\r\n            # Add post-processing CBHG:\r\n            # mel_outputs: (N,T,num_mels)\r\n            post_outputs = cbhg(mel_outputs, None, is_training,hp.post_bank_size, hp.post_bank_channel_size, hp.post_maxpool_width, hp.post_highway_depth, hp.post_rnn_size,\r\n                                hp.post_proj_sizes, hp.post_proj_width,scope='post_cbhg')\r\n \r\n \r\n            linear_outputs = tf.layers.dense(post_outputs, hp.num_freq,name='linear_spectogram_projection')    # [N, T_out, F(1025)]\r\n \r\n            # Grab alignments from the final decoder state:\r\n            alignments = tf.transpose(final_decoder_state.alignment_history.stack(), [1, 2, 0])  # batch_size, text length(encoder), target length(decoder)\r\n \r\n \r\n            self.inputs = inputs\r\n            self.speaker_id = speaker_id\r\n            self.input_lengths = input_lengths\r\n            self.loss_coeff = loss_coeff\r\n            self.decoder_mel_outputs = decoder_mel_outputs\r\n            self.mel_outputs = mel_outputs\r\n            self.linear_outputs = linear_outputs\r\n            self.alignments = alignments\r\n            self.mel_targets = mel_targets\r\n            self.linear_targets = linear_targets\r\n            self.final_decoder_state = final_decoder_state\r\n            self.stop_token_targets = stop_token_targets\r\n            self.stop_token_outputs = stop_token_outputs\r\n            self.all_vars = tf.trainable_variables()\r\n            log('='*40)\r\n            log(' model_type: %s' % hp.model_type)\r\n            log('='*40)\r\n \r\n            log('Initialized Tacotron model. 
Dimensions: ')\r\n            log('    embedding:                %d' % char_embedded_inputs.shape[-1])\r\n            log('    encoder conv out:               %d' % encoder_conv_output.shape[-1])\r\n            log('    encoder out:              %d' % encoder_outputs.shape[-1])\r\n            log('    attention out:            %d' % attention_cell.output_size)\r\n            log('    decoder prenet lstm concat out :        %d' % dec_prenet_outputs.output_size)\r\n            log('    decoder cell out:         %d' % dec_outputs_cell.output_size)\r\n            log('    decoder out (%d frames):  %d' % (hp.reduction_factor, decoder_outputs.shape[-1]))\r\n            log('    decoder mel out:    %d' % decoder_mel_outputs.shape[-1])\r\n            log('    mel out:    %d' % mel_outputs.shape[-1])\r\n            log('    postnet out:              %d' % post_outputs.shape[-1])\r\n            log('    linear out:               %d' % linear_outputs.shape[-1])\r\n            log('  Tacotron Parameters       {:.3f} Million.'.format(np.sum([np.prod(v.get_shape().as_list()) for v in self.all_vars]) / 1000000))\r\n\r\n    def add_loss(self):\r\n        '''Adds loss to the model. Sets \"loss\" field. 
initialize must have been called.'''\r\n        with tf.variable_scope('loss') as scope:\r\n            hp = self._hparams\r\n            before = tf.squared_difference(self.mel_targets, self.decoder_mel_outputs)\r\n            after = tf.squared_difference(self.mel_targets, self.mel_outputs)\r\n            mel_loss = before + after\r\n\r\n            stop_token_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.stop_token_targets, logits=self.stop_token_outputs))\r\n\r\n            l1 = tf.abs(self.linear_targets - self.linear_outputs)\r\n            expanded_loss_coeff = tf.expand_dims(tf.expand_dims(self.loss_coeff, [-1]), [-1])\r\n\r\n            regularization_loss = tf.reduce_mean([tf.nn.l2_loss(v) for v in self.all_vars\r\n                if not('bias' in v.name or 'Bias' in v.name or 'projection' in v.name or 'inputs_embedding' in v.name or 'speaker_embedding' in v.name\r\n                    or 'dense' in v.name or 'RNN' in v.name or 'LSTM' in v.name)]) * hp.tacotron_reg_weight\r\n\r\n            regularization_loss = 0  # regularization disabled: this overrides the value computed above\r\n            if hp.prioritize_loss:\r\n                # Prioritize loss for frequencies.\r\n                upper_priority_freq = int(5000 / (hp.sample_rate * 0.5) * hp.num_freq)\r\n                lower_priority_freq = int(165 / (hp.sample_rate * 0.5) * hp.num_freq)\r\n\r\n                l1_priority = l1[:, :, lower_priority_freq:upper_priority_freq]\r\n\r\n                self.loss = tf.reduce_mean(mel_loss * expanded_loss_coeff) + \\\r\n                        0.5 * tf.reduce_mean(l1 * expanded_loss_coeff) + 0.5 * tf.reduce_mean(l1_priority * expanded_loss_coeff) + stop_token_loss + regularization_loss\r\n                self.linear_loss = tf.reduce_mean(0.5 * (tf.reduce_mean(l1) + tf.reduce_mean(l1_priority)))\r\n            else:\r\n                self.loss = tf.reduce_mean(mel_loss * expanded_loss_coeff) + tf.reduce_mean(l1 * expanded_loss_coeff) + stop_token_loss + regularization_loss    # this loss is not used; loss_without_coeff below is used instead\r\n                self.linear_loss = tf.reduce_mean(l1)\r\n\r\n            self.mel_loss = tf.reduce_mean(mel_loss)\r\n            self.loss_without_coeff = self.mel_loss + self.linear_loss + stop_token_loss + regularization_loss\r\n\r\n    def add_optimizer(self, global_step):\r\n        '''Adds optimizer. Sets \"gradients\" and \"optimize\" fields. add_loss must have been called.\r\n\r\n        Args:\r\n            global_step: int32 scalar Tensor representing current global step in training\r\n        '''\r\n        with tf.variable_scope('optimizer') as scope:\r\n            hp = self._hparams\r\n\r\n            if hp.tacotron_decay_learning_rate:\r\n                self.decay_steps = hp.tacotron_decay_steps\r\n                self.decay_rate = hp.tacotron_decay_rate\r\n                self.learning_rate = self._learning_rate_decay(hp.tacotron_initial_learning_rate, global_step)\r\n            else:\r\n                self.learning_rate = tf.convert_to_tensor(hp.tacotron_initial_learning_rate)\r\n\r\n            optimizer = tf.train.AdamOptimizer(self.learning_rate, hp.adam_beta1, hp.adam_beta2)\r\n            gradients, variables = zip(*optimizer.compute_gradients(self.loss))\r\n            self.gradients = gradients\r\n            clipped_gradients, _ = tf.clip_by_global_norm(gradients, 1.0)\r\n\r\n            # Add dependency on UPDATE_OPS; otherwise batchnorm won't work correctly. 
See:\r\n            # https://github.com/tensorflow/tensorflow/issues/1122\r\n            with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):\r\n                self.optimize = optimizer.apply_gradients(zip(clipped_gradients, variables), global_step=global_step)\r\n\r\n    def _learning_rate_decay(self, init_lr, global_step):\r\n        #################################################################\r\n        # Narrow Exponential Decay:\r\n\r\n        # Phase 1: lr = 1e-3\r\n        # We only start learning rate decay after 50k steps\r\n\r\n        # Phase 2: lr in (1e-5, 1e-3)\r\n        # decay reaches its minimal value at step 310k\r\n\r\n        # Phase 3: lr = 1e-5\r\n        # clip by the minimal learning rate value (step > 310k)\r\n        #################################################################\r\n        hp = self._hparams\r\n\r\n        # Compute exponential decay\r\n        lr = tf.train.exponential_decay(init_lr,\r\n            global_step - hp.tacotron_start_decay,  # lr = 1e-3 at step 50k\r\n            self.decay_steps,\r\n            self.decay_rate,  # lr = 1e-5 around step 310k\r\n            name='lr_exponential_decay')\r\n\r\n        # Clip learning rate by max and min values (initial and final values)\r\n        return tf.minimum(tf.maximum(lr, hp.tacotron_final_learning_rate), init_lr)\r\n"
  },
  {
    "path": "text/__init__.py",
    "content": "# coding: utf-8\r\nimport re\r\nimport string\r\nimport numpy as np\r\n\r\nfrom text import cleaners\r\nfrom hparams import hparams\r\nfrom text.symbols import symbols, en_symbols, PAD, EOS\r\nfrom text.korean import jamo_to_korean\r\n\r\n\r\n# Mappings from symbol to numeric ID and vice versa:\r\n_symbol_to_id = {s: i for i, s in enumerate(symbols)}   # 80 symbols\r\n_id_to_symbol = {i: s for i, s in enumerate(symbols)}\r\nisEn = False\r\n\r\n\r\n# Regular expression matching text enclosed in curly braces:\r\n_curly_re = re.compile(r'(.*?)\\{(.+?)\\}(.*)')\r\n\r\npuncuation_table = str.maketrans({key: None for key in string.punctuation})\r\n\r\ndef convert_to_en_symbols():\r\n    '''Converts the built-in Korean symbols to English, to be used for English training.'''\r\n    global _symbol_to_id, _id_to_symbol, isEn\r\n    if not isEn:\r\n        print(\" [!] Converting to english mode\")\r\n    _symbol_to_id = {s: i for i, s in enumerate(en_symbols)}\r\n    _id_to_symbol = {i: s for i, s in enumerate(en_symbols)}\r\n    isEn = True\r\n\r\ndef remove_puncuations(text):\r\n    return text.translate(puncuation_table)\r\n\r\ndef text_to_sequence(text, as_token=False):\r\n    cleaner_names = [x.strip() for x in hparams.cleaners.split(',')]\r\n    if ('english_cleaners' in cleaner_names) and isEn==False:\r\n        convert_to_en_symbols()\r\n    return _text_to_sequence(text, cleaner_names, as_token)\r\n\r\ndef _text_to_sequence(text, cleaner_names, as_token):\r\n    '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.\r\n\r\n        The text can optionally have ARPAbet sequences enclosed in curly braces embedded\r\n        in it. 
For example, \"Turn left on {HH AW1 S S T AH0 N} Street.\"\r\n\r\n        Args:\r\n            text: string to convert to a sequence\r\n            cleaner_names: names of the cleaner functions to run the text through\r\n\r\n        Returns:\r\n            List of integers corresponding to the symbols in the text\r\n    '''\r\n    sequence = []\r\n\r\n    # Check for curly braces and treat their contents as ARPAbet:\r\n    while len(text):\r\n        m = _curly_re.match(text)\r\n        if not m:\r\n            sequence += _symbols_to_sequence(_clean_text(text, cleaner_names))\r\n            break\r\n        sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names))\r\n        sequence += _arpabet_to_sequence(m.group(2))\r\n        text = m.group(3)\r\n\r\n    # Append EOS token\r\n    sequence.append(_symbol_to_id[EOS])  # [14, 29, 45, 2, 27, 62, 20, 21, 4, 39, 45, 1]\r\n\r\n    if as_token:\r\n        return sequence_to_text(sequence, combine_jamo=True)\r\n    else:\r\n        return np.array(sequence, dtype=np.int32)\r\n\r\n\r\ndef sequence_to_text(sequence, skip_eos_and_pad=False, combine_jamo=False):\r\n    '''Converts a sequence of IDs back to a string'''\r\n    cleaner_names=[x.strip() for x in hparams.cleaners.split(',')]\r\n    if 'english_cleaners' in cleaner_names and isEn==False:\r\n        convert_to_en_symbols()\r\n        \r\n    result = ''\r\n    for symbol_id in sequence:\r\n        if symbol_id in _id_to_symbol:\r\n            s = _id_to_symbol[symbol_id]\r\n            # Enclose ARPAbet back in curly braces:\r\n            if len(s) > 1 and s[0] == '@':\r\n                s = '{%s}' % s[1:]\r\n\r\n            if not skip_eos_and_pad or s not in [EOS, PAD]:\r\n                result += s\r\n\r\n    result = result.replace('}{', ' ')\r\n\r\n    if combine_jamo:\r\n        return jamo_to_korean(result)\r\n    else:\r\n        return result\r\n\r\n\r\n\r\ndef _clean_text(text, cleaner_names):\r\n    \r\n    for name in 
cleaner_names:\r\n        cleaner = getattr(cleaners, name)\r\n        if not cleaner:\r\n            raise Exception('Unknown cleaner: %s' % name)\r\n        text = cleaner(text)  # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~']\r\n    return text\r\n\r\n\r\ndef _symbols_to_sequence(symbols):\r\n    return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)]\r\n\r\n\r\ndef _arpabet_to_sequence(text):\r\n    return _symbols_to_sequence(['@' + s for s in text.split()])\r\n\r\n\r\ndef _should_keep_symbol(s):\r\n    return s in _symbol_to_id and s != '_' and s != '~'\r\n"
  },
  {
    "path": "text/cleaners.py",
    "content": "# coding: utf-8\r\n\r\n# Code based on https://github.com/keithito/tacotron/blob/master/text/cleaners.py\r\n'''\r\nCleaners are transformations that run over the input text at both training and eval time.\r\n\r\nCleaners can be selected by passing a comma-delimited list of cleaner names as the \"cleaners\"\r\nhyperparameter. Some cleaners are English-specific. You'll typically want to use:\r\n    1. \"english_cleaners\" for English text\r\n    2. \"transliteration_cleaners\" for non-English text that can be transliterated to ASCII using\r\n         the Unidecode library (https://pypi.python.org/pypi/Unidecode)\r\n    3. \"basic_cleaners\" if you do not want to transliterate (in this case, you should also update\r\n         the symbols in symbols.py to match your data).\r\n'''\r\n\r\nimport re\r\nfrom .korean import tokenize as ko_tokenize\r\n\r\n# Added to support LJ_speech\r\nfrom unidecode import unidecode\r\nfrom .en_numbers import normalize_numbers as en_normalize_numbers\r\n\r\n# Regular expression matching whitespace:\r\n_whitespace_re = re.compile(r'\\s+')\r\n\r\n\r\ndef korean_cleaners(text):\r\n    '''Pipeline for Korean text, including number and abbreviation expansion.'''\r\n    text = ko_tokenize(text) # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~']\r\n    return text\r\n\r\n\r\n# List of (regular expression, replacement) pairs for abbreviations:\r\n_abbreviations = [(re.compile('\\\\b%s\\\\.' 
% x[0], re.IGNORECASE), x[1]) for x in [\r\n    ('mrs', 'misess'),\r\n    ('mr', 'mister'),\r\n    ('dr', 'doctor'),\r\n    ('st', 'saint'),\r\n    ('co', 'company'),\r\n    ('jr', 'junior'),\r\n    ('maj', 'major'),\r\n    ('gen', 'general'),\r\n    ('drs', 'doctors'),\r\n    ('rev', 'reverend'),\r\n    ('lt', 'lieutenant'),\r\n    ('hon', 'honorable'),\r\n    ('sgt', 'sergeant'),\r\n    ('capt', 'captain'),\r\n    ('esq', 'esquire'),\r\n    ('ltd', 'limited'),\r\n    ('col', 'colonel'),\r\n    ('ft', 'fort'),\r\n]]\r\n\r\n\r\ndef expand_abbreviations(text):\r\n    for regex, replacement in _abbreviations:\r\n        text = re.sub(regex, replacement, text)\r\n    return text\r\n\r\n\r\ndef expand_numbers(text):\r\n    return en_normalize_numbers(text)\r\n\r\n\r\ndef lowercase(text):\r\n    return text.lower()\r\n\r\n\r\ndef collapse_whitespace(text):\r\n    return re.sub(_whitespace_re, ' ', text)\r\n\r\ndef convert_to_ascii(text):\r\n    '''Converts to ASCII; present in keithito's repo but removed in carpedm20's.'''\r\n    return unidecode(text)\r\n\r\n\r\ndef basic_cleaners(text):\r\n    '''Basic pipeline that lowercases and collapses whitespace without transliteration.'''\r\n    text = lowercase(text)\r\n    text = collapse_whitespace(text)\r\n    return text\r\n\r\n\r\ndef transliteration_cleaners(text):\r\n    '''Pipeline for non-English text that transliterates to ASCII.'''\r\n    text = convert_to_ascii(text)\r\n    text = lowercase(text)\r\n    text = collapse_whitespace(text)\r\n    return text\r\n\r\n\r\ndef english_cleaners(text):\r\n    '''Pipeline for English text, including number and abbreviation expansion.'''\r\n    text = convert_to_ascii(text)\r\n    text = lowercase(text)\r\n    text = expand_numbers(text)\r\n    text = expand_abbreviations(text)\r\n    text = collapse_whitespace(text)\r\n    return text\r\n\r\n\r\n"
  },
  {
    "path": "text/en_numbers.py",
    "content": "import inflect\r\nimport re\r\n\r\n\r\n_inflect = inflect.engine()\r\n_comma_number_re = re.compile(r'([0-9][0-9\\,]+[0-9])')\r\n_decimal_number_re = re.compile(r'([0-9]+\\.[0-9]+)')\r\n_pounds_re = re.compile(r'£([0-9\\,]*[0-9]+)')\r\n_dollars_re = re.compile(r'\\$([0-9\\.\\,]*[0-9]+)')\r\n_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')\r\n_number_re = re.compile(r'[0-9]+')\r\n\r\n\r\ndef _remove_commas(m):\r\n  return m.group(1).replace(',', '')\r\n\r\n\r\ndef _expand_decimal_point(m):\r\n  return m.group(1).replace('.', ' point ')\r\n\r\n\r\ndef _expand_dollars(m):\r\n  match = m.group(1)\r\n  parts = match.split('.')\r\n  if len(parts) > 2:\r\n    return match + ' dollars'  # Unexpected format\r\n  dollars = int(parts[0]) if parts[0] else 0\r\n  cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0\r\n  if dollars and cents:\r\n    dollar_unit = 'dollar' if dollars == 1 else 'dollars'\r\n    cent_unit = 'cent' if cents == 1 else 'cents'\r\n    return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)\r\n  elif dollars:\r\n    dollar_unit = 'dollar' if dollars == 1 else 'dollars'\r\n    return '%s %s' % (dollars, dollar_unit)\r\n  elif cents:\r\n    cent_unit = 'cent' if cents == 1 else 'cents'\r\n    return '%s %s' % (cents, cent_unit)\r\n  else:\r\n    return 'zero dollars'\r\n\r\n\r\ndef _expand_ordinal(m):\r\n  return _inflect.number_to_words(m.group(0))\r\n\r\n\r\ndef _expand_number(m):\r\n  num = int(m.group(0))\r\n  if num > 1000 and num < 3000:\r\n    if num == 2000:\r\n      return 'two thousand'\r\n    elif num > 2000 and num < 2010:\r\n      return 'two thousand ' + _inflect.number_to_words(num % 100)\r\n    elif num % 100 == 0:\r\n      return _inflect.number_to_words(num // 100) + ' hundred'\r\n    else:\r\n      return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')\r\n  else:\r\n    return _inflect.number_to_words(num, andword='')\r\n\r\n\r\ndef normalize_numbers(text):\r\n  text = 
re.sub(_comma_number_re, _remove_commas, text)\r\n  text = re.sub(_pounds_re, r'\\1 pounds', text)\r\n  text = re.sub(_dollars_re, _expand_dollars, text)\r\n  text = re.sub(_decimal_number_re, _expand_decimal_point, text)\r\n  text = re.sub(_ordinal_re, _expand_ordinal, text)\r\n  text = re.sub(_number_re, _expand_number, text)\r\n  return text\r\n"
  },
  {
    "path": "text/english.py",
    "content": "# Code from https://github.com/keithito/tacotron/blob/master/util/numbers.py\r\nimport re\r\nimport inflect\r\n\r\n\r\n_inflect = inflect.engine()\r\n_comma_number_re = re.compile(r'([0-9][0-9\\,]+[0-9])')\r\n_decimal_number_re = re.compile(r'([0-9]+\\.[0-9]+)')\r\n_pounds_re = re.compile(r'£([0-9\\,]*[0-9]+)')\r\n_dollars_re = re.compile(r'\\$([0-9\\.\\,]*[0-9]+)')\r\n_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')\r\n_number_re = re.compile(r'[0-9]+')\r\n\r\n\r\ndef _remove_commas(m):\r\n    return m.group(1).replace(',', '')\r\n\r\n\r\ndef _expand_decimal_point(m):\r\n    return m.group(1).replace('.', ' point ')\r\n\r\n\r\ndef _expand_dollars(m):\r\n    match = m.group(1)\r\n    parts = match.split('.')\r\n    if len(parts) > 2:\r\n        return match + ' dollars'    # Unexpected format\r\n    dollars = int(parts[0]) if parts[0] else 0\r\n    cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0\r\n    if dollars and cents:\r\n        dollar_unit = 'dollar' if dollars == 1 else 'dollars'\r\n        cent_unit = 'cent' if cents == 1 else 'cents'\r\n        return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)\r\n    elif dollars:\r\n        dollar_unit = 'dollar' if dollars == 1 else 'dollars'\r\n        return '%s %s' % (dollars, dollar_unit)\r\n    elif cents:\r\n        cent_unit = 'cent' if cents == 1 else 'cents'\r\n        return '%s %s' % (cents, cent_unit)\r\n    else:\r\n        return 'zero dollars'\r\n\r\n\r\ndef _expand_ordinal(m):\r\n    return _inflect.number_to_words(m.group(0))\r\n\r\n\r\ndef _expand_number(m):\r\n    num = int(m.group(0))\r\n    if num > 1000 and num < 3000:\r\n        if num == 2000:\r\n            return 'two thousand'\r\n        elif num > 2000 and num < 2010:\r\n            return 'two thousand ' + _inflect.number_to_words(num % 100)\r\n        elif num % 100 == 0:\r\n            return _inflect.number_to_words(num // 100) + ' hundred'\r\n        else:\r\n            return _inflect.number_to_words(num, 
andword='', zero='oh', group=2).replace(', ', ' ')\r\n    else:\r\n        return _inflect.number_to_words(num, andword='')\r\n\r\n\r\ndef normalize(text):\r\n    text = re.sub(_comma_number_re, _remove_commas, text)\r\n    text = re.sub(_pounds_re, r'\\1 pounds', text)\r\n    text = re.sub(_dollars_re, _expand_dollars, text)\r\n    text = re.sub(_decimal_number_re, _expand_decimal_point, text)\r\n    text = re.sub(_ordinal_re, _expand_ordinal, text)\r\n    text = re.sub(_number_re, _expand_number, text)\r\n    return text\r\n"
  },
  {
    "path": "text/ko_dictionary.py",
    "content": "# coding: utf-8\r\n\r\netc_dictionary = {\r\n        '2 30대': '이삼십대',\r\n        '20~30대': '이삼십대',\r\n        '20, 30대': '이십대 삼십대',\r\n        '1+1': '원플러스원',\r\n        '3에서 6개월인': '3개월에서 육개월인',\r\n}\r\n\r\nenglish_dictionary = {\r\n        'Devsisters': '데브시스터즈',\r\n        'track': '트랙',\r\n\r\n        # krbook\r\n        'LA': '엘에이',\r\n        'LG': '엘지',\r\n        'KOREA': '코리아',\r\n        'JSA': '제이에스에이',\r\n        'PGA': '피지에이',\r\n        'GA': '지에이',\r\n        'idol': '아이돌',\r\n        'KTX': '케이티엑스',\r\n        'AC': '에이씨',\r\n        'DVD': '디비디',\r\n        'US': '유에스',\r\n        'CNN': '씨엔엔',\r\n        'LPGA': '엘피지에이',\r\n        'P': '피',\r\n        'L': '엘',\r\n        'T': '티',\r\n        'B': '비',\r\n        'C': '씨',\r\n        'BIFF': '비아이에프에프',\r\n        'GV': '지비',\r\n\r\n        # JTBC\r\n        'IT': '아이티',\r\n        'IQ': '아이큐',\r\n        'JTBC': '제이티비씨',\r\n        'trickle down effect': '트리클 다운 이펙트',\r\n        'trickle up effect': '트리클 업 이펙트',\r\n        'down': '다운',\r\n        'up': '업',\r\n        'FCK': '에프씨케이',\r\n        'AP': '에이피',\r\n        'WHERETHEWILDTHINGSARE': '',\r\n        'Rashomon Effect': '',\r\n        'O': '오',\r\n        'OO': '오오',\r\n        'B': '비',\r\n        'GDP': '지디피',\r\n        'CIPA': '씨아이피에이',\r\n        'YS': '와이에스',\r\n        'Y': '와이',\r\n        'S': '에스',\r\n        'JTBC': '제이티비씨',\r\n        'PC': '피씨',\r\n        'bill': '빌',\r\n        'Halmuny': '하모니', #####\r\n        'X': '엑스',\r\n        'SNS': '에스엔에스',\r\n        'ability': '어빌리티',\r\n        'shy': '',\r\n        'CCTV': '씨씨티비',\r\n        'IT': '아이티',\r\n        'the tenth man': '더 텐쓰 맨', ####\r\n        'L': '엘',\r\n        'PC': '피씨',\r\n        'YSDJJPMB': '', ########\r\n        'Content Attitude Timing': '컨텐트 애티튜드 타이밍',\r\n        'CAT': '캣',\r\n        'IS': '아이에스',\r\n        'SNS': '에스엔에스',\r\n        'K': '케이',\r\n        'Y': '와이',\r\n        'KDI': '케이디아이',\r\n        'DOC': '디오씨',\r\n        'CIA': 
'씨아이에이',\r\n        'PBS': '피비에스',\r\n        'D': '디',\r\n        'PPropertyPositionPowerPrisonPS': '에스',\r\n        'francisco': '프란시스코',\r\n        'I': '아이',\r\n        'III': '아이아이', ######\r\n        'No joke': '노 조크',\r\n        'BBK': '비비케이',\r\n        'LA': '엘에이',\r\n        'Don': '',\r\n        't worry be happy': ' 워리 비 해피',\r\n        'NO': '엔오', #####\r\n        'it was our sky': '잇 워즈 아워 스카이',\r\n        'it is our sky': '잇 이즈 아워 스카이', ####\r\n        'NEIS': '엔이아이에스', #####\r\n        'IMF': '아이엠에프',\r\n        'apology': '어폴로지',\r\n        'humble': '험블',\r\n        'M': '엠',\r\n        'Nowhere Man': '노웨어 맨',\r\n        'The Tenth Man': '더 텐쓰 맨',\r\n        'PBS': '피비에스',\r\n        'BBC': '비비씨',\r\n        'MRJ': '엠알제이',\r\n        'CCTV': '씨씨티비',\r\n        'Pick me up': '픽 미 업',\r\n        'DNA': '디엔에이',\r\n        'UN': '유엔',\r\n        'STOP': '스탑', #####\r\n        'PRESS': '프레스', #####\r\n        'not to be': '낫 투비',\r\n        'Denial': '디나이얼',\r\n        'G': '지',\r\n        'IMF': '아이엠에프',\r\n        'GDP': '지디피',\r\n        'JTBC': '제이티비씨',\r\n        'Time flies like an arrow': '타임 플라이즈 라이크 언 애로우',\r\n        'DDT': '디디티',\r\n        'AI': '에이아이',\r\n        'Z': '제트',\r\n        'OECD': '오이씨디',\r\n        'N': '앤',\r\n        'A': '에이',\r\n        'MB': '엠비',\r\n        'EH': '이에이치',\r\n        'IS': '아이에스',\r\n        'TV': '티비',\r\n        'MIT': '엠아이티',\r\n        'KBO': '케이비오',\r\n        'I love America': '아이 러브 아메리카',\r\n        'SF': '에스에프',\r\n        'Q': '큐',\r\n        'KFX': '케이에프엑스',\r\n        'PM': '피엠',\r\n        'Prime Minister': '프라임 미니스터',\r\n        'Swordline': '스워드라인',\r\n        'TBS': '티비에스',\r\n        'DDT': '디디티',\r\n        'CS': '씨에스',\r\n        'Reflecting Absence': '리플렉팅 앱센스',\r\n        'PBS': '피비에스',\r\n        'Drum being beaten by everyone': '드럼 빙 비튼 바이 에브리원',\r\n        'negative pressure': '네거티브 프레셔',\r\n        'F': '에프',\r\n        'KIA': '기아',\r\n        'FTA': '에프티에이',\r\n      
  'Que sais-je': '',\r\n        'UFC': '유에프씨',\r\n        'P': '피',\r\n        'DJ': '디제이',\r\n        'Chaebol': '채벌',\r\n        'BBC': '비비씨',\r\n        'OECD': '오이씨디',\r\n        'BC': '삐씨',\r\n        'C': '씨',\r\n        'B': '씨',\r\n        'KY': '케이와이',\r\n        'K': '케이',\r\n        'CEO': '씨이오',\r\n        'YH': '와이에치',\r\n        'IS': '아이에스',\r\n        'who are you': '후 얼 유',\r\n        'Y': '와이',\r\n        'The Devils Advocate': '더 데빌즈 어드보카트',\r\n        'YS': '와이에스',\r\n        'so sorry': '쏘 쏘리',\r\n        'Santa': '산타',\r\n        'Big Endian': '빅 엔디안',\r\n        'Small Endian': '스몰 엔디안',\r\n        'Oh Captain My Captain': '오 캡틴 마이 캡틴',\r\n        'AIB': '에이아이비',\r\n        'K': '케이',\r\n        'PBS': '피비에스',\r\n}\r\n"
  },
  {
    "path": "text/korean.py",
    "content": "﻿# coding: utf-8\r\n# Code based on \r\n\r\nimport re\r\nimport os\r\nimport ast\r\nimport json\r\nfrom jamo import hangul_to_jamo, h2j, j2h\r\n\r\nfrom .ko_dictionary import english_dictionary, etc_dictionary\r\n\r\nPAD = '_'\r\nEOS = '~'\r\nPUNC = '!\\'(),-.:;?'\r\nSPACE = ' '\r\n\r\nJAMO_LEADS = \"\".join([chr(_) for _ in range(0x1100, 0x1113)])\r\nJAMO_VOWELS = \"\".join([chr(_) for _ in range(0x1161, 0x1176)])\r\nJAMO_TAILS = \"\".join([chr(_) for _ in range(0x11A8, 0x11C3)])\r\n\r\nVALID_CHARS = JAMO_LEADS + JAMO_VOWELS + JAMO_TAILS + PUNC + SPACE\r\nALL_SYMBOLS = PAD + EOS + VALID_CHARS\r\n\r\nchar_to_id = {c: i for i, c in enumerate(ALL_SYMBOLS)}\r\nid_to_char = {i: c for i, c in enumerate(ALL_SYMBOLS)}\r\n\r\nquote_checker = \"\"\"([`\"'＂“‘])(.+?)([`\"'＂”’])\"\"\"\r\n\r\ndef is_lead(char):\r\n    return char in JAMO_LEADS\r\n\r\ndef is_vowel(char):\r\n    return char in JAMO_VOWELS\r\n\r\ndef is_tail(char):\r\n    return char in JAMO_TAILS\r\n\r\ndef get_mode(char):\r\n    if is_lead(char):\r\n        return 0\r\n    elif is_vowel(char):\r\n        return 1\r\n    elif is_tail(char):\r\n        return 2\r\n    else:\r\n        return -1\r\n\r\ndef _get_text_from_candidates(candidates):\r\n    if len(candidates) == 0:\r\n        return \"\"\r\n    elif len(candidates) == 1:\r\n        return _jamo_char_to_hcj(candidates[0])\r\n    else:\r\n        return j2h(**dict(zip([\"lead\", \"vowel\", \"tail\"], candidates)))\r\n\r\ndef jamo_to_korean(text):\r\n    text = h2j(text)\r\n\r\n    idx = 0\r\n    new_text = \"\"\r\n    candidates = []\r\n\r\n    while True:\r\n        if idx >= len(text):\r\n            new_text += _get_text_from_candidates(candidates)\r\n            break\r\n\r\n        char = text[idx]\r\n        mode = get_mode(char)\r\n\r\n        if mode == 0:\r\n            new_text += _get_text_from_candidates(candidates)\r\n            candidates = [char]\r\n        elif mode == -1:\r\n            new_text += 
_get_text_from_candidates(candidates)\r\n            new_text += char\r\n            candidates = []\r\n        else:\r\n            candidates.append(char)\r\n\r\n        idx += 1\r\n    return new_text\r\n\r\nnum_to_kor = {\r\n        '0': '영',\r\n        '1': '일',\r\n        '2': '이',\r\n        '3': '삼',\r\n        '4': '사',\r\n        '5': '오',\r\n        '6': '육',\r\n        '7': '칠',\r\n        '8': '팔',\r\n        '9': '구',\r\n}\r\n\r\nunit_to_kor1 = {\r\n        '%': '퍼센트',\r\n        'cm': '센치미터',\r\n        'mm': '밀리미터',\r\n        'km': '킬로미터',\r\n        'kg': '킬로그람',\r\n}\r\nunit_to_kor2 = {\r\n        'm': '미터',\r\n}\r\n\r\nupper_to_kor = {\r\n        'A': '에이',\r\n        'B': '비',\r\n        'C': '씨',\r\n        'D': '디',\r\n        'E': '이',\r\n        'F': '에프',\r\n        'G': '지',\r\n        'H': '에이치',\r\n        'I': '아이',\r\n        'J': '제이',\r\n        'K': '케이',\r\n        'L': '엘',\r\n        'M': '엠',\r\n        'N': '엔',\r\n        'O': '오',\r\n        'P': '피',\r\n        'Q': '큐',\r\n        'R': '알',\r\n        'S': '에스',\r\n        'T': '티',\r\n        'U': '유',\r\n        'V': '브이',\r\n        'W': '더블유',\r\n        'X': '엑스',\r\n        'Y': '와이',\r\n        'Z': '제트',\r\n}\r\n\r\ndef compare_sentence_with_jamo(text1, text2):\r\n    return h2j(text1) != h2j(text2)\r\n\r\ndef tokenize(text, as_id=False):\r\n    # jamo package에 있는 hangul_to_jamo를 이용하여 한글 string을 초성/중성/종성으로 나눈다.\r\n    text = normalize(text)\r\n    tokens = list(hangul_to_jamo(text)) # '존경하는'  --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ', '~']\r\n\r\n    if as_id:\r\n        return [char_to_id[token] for token in tokens] + [char_to_id[EOS]]\r\n    else:\r\n        return [token for token in tokens] + [EOS]\r\n\r\ndef tokenizer_fn(iterator):\r\n    return (token for x in iterator for token in tokenize(x, as_id=False))\r\n\r\ndef normalize(text):\r\n    text = text.strip()\r\n\r\n    text = re.sub('\\(\\d+일\\)', '', text)\r\n    text = 
re.sub('\\([⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]+\\)', '', text)\r\n\r\n    text = normalize_with_dictionary(text, etc_dictionary)\r\n    text = normalize_english(text)\r\n    text = re.sub('[a-zA-Z]+', normalize_upper, text)\r\n\r\n    text = normalize_quote(text)\r\n    text = normalize_number(text)\r\n\r\n    return text\r\n\r\ndef normalize_with_dictionary(text, dic):\r\n    if any(key in text for key in dic.keys()):\r\n        pattern = re.compile('|'.join(re.escape(key) for key in dic.keys()))\r\n        return pattern.sub(lambda x: dic[x.group()], text)\r\n    else:\r\n        return text\r\n\r\ndef normalize_english(text):\r\n    def fn(m):\r\n        word = m.group()\r\n        if word in english_dictionary:\r\n            return english_dictionary.get(word)\r\n        else:\r\n            return word\r\n\r\n    text = re.sub(\"([A-Za-z]+)\", fn, text)\r\n    return text\r\n\r\ndef normalize_upper(text):\r\n    text = text.group(0)\r\n\r\n    if all([char.isupper() for char in text]):\r\n        return \"\".join(upper_to_kor[char] for char in text)\r\n    else:\r\n        return text\r\n\r\ndef normalize_quote(text):\r\n    def fn(found_text):\r\n        from nltk import sent_tokenize # NLTK doesn't along with multiprocessing\r\n\r\n        found_text = found_text.group()\r\n        unquoted_text = found_text[1:-1]\r\n\r\n        sentences = sent_tokenize(unquoted_text)\r\n        return \" \".join([\"'{}'\".format(sent) for sent in sentences])\r\n\r\n    return re.sub(quote_checker, fn, text)\r\n\r\nnumber_checker = \"([+-]?\\d[\\d,]*)[\\.]?\\d*\"\r\ncount_checker = \"(시|명|가지|살|마리|포기|송이|수|톨|통|점|개|벌|척|채|다발|그루|자루|줄|켤레|그릇|잔|마디|상자|사람|곡|병|판)\"\r\n\r\ndef normalize_number(text):\r\n    text = normalize_with_dictionary(text, unit_to_kor1)\r\n    text = normalize_with_dictionary(text, unit_to_kor2)\r\n    text = re.sub(number_checker + count_checker,\r\n            lambda x: number_to_korean(x, True), text)\r\n    text = re.sub(number_checker,\r\n            lambda 
x: number_to_korean(x, False), text)\r\n    return text\r\n\r\nnum_to_kor1 = [\"\"] + list(\"일이삼사오육칠팔구\")\r\nnum_to_kor2 = [\"\"] + list(\"만억조경해\")\r\nnum_to_kor3 = [\"\"] + list(\"십백천\")\r\n\r\n#count_to_kor1 = [\"\"] + [\"하나\",\"둘\",\"셋\",\"넷\",\"다섯\",\"여섯\",\"일곱\",\"여덟\",\"아홉\"]\r\ncount_to_kor1 = [\"\"] + [\"한\",\"두\",\"세\",\"네\",\"다섯\",\"여섯\",\"일곱\",\"여덟\",\"아홉\"]\r\n\r\ncount_tenth_dict = {\r\n        \"십\": \"열\",\r\n        \"두십\": \"스물\",\r\n        \"세십\": \"서른\",\r\n        \"네십\": \"마흔\",\r\n        \"다섯십\": \"쉰\",\r\n        \"여섯십\": \"예순\",\r\n        \"일곱십\": \"일흔\",\r\n        \"여덟십\": \"여든\",\r\n        \"아홉십\": \"아흔\",\r\n}\r\n\r\n\r\n\r\ndef number_to_korean(num_str, is_count=False):\r\n    if is_count:\r\n        num_str, unit_str = num_str.group(1), num_str.group(2)\r\n    else:\r\n        num_str, unit_str = num_str.group(), \"\"\r\n    \r\n    num_str = num_str.replace(',', '')\r\n    num = ast.literal_eval(num_str)\r\n\r\n    if num == 0:\r\n        return \"영\"\r\n\r\n    check_float = num_str.split('.')\r\n    if len(check_float) == 2:\r\n        digit_str, float_str = check_float\r\n    elif len(check_float) >= 3:\r\n        raise Exception(\" [!] Wrong number format\")\r\n    else:\r\n        digit_str, float_str = check_float[0], None\r\n\r\n    if is_count and float_str is not None:\r\n        raise Exception(\" [!] 
`is_count` and float number does not fit each other\")\r\n\r\n    digit = int(digit_str)\r\n\r\n    if digit_str.startswith(\"-\"):\r\n        digit, digit_str = abs(digit), str(abs(digit))\r\n\r\n    kor = \"\"\r\n    size = len(str(digit))\r\n    tmp = []\r\n\r\n    for i, v in enumerate(digit_str, start=1):\r\n        v = int(v)\r\n\r\n        if v != 0:\r\n            if is_count:\r\n                tmp += count_to_kor1[v]\r\n            else:\r\n                tmp += num_to_kor1[v]\r\n\r\n            tmp += num_to_kor3[(size - i) % 4]\r\n\r\n        if (size - i) % 4 == 0 and len(tmp) != 0:\r\n            kor += \"\".join(tmp)\r\n            tmp = []\r\n            kor += num_to_kor2[int((size - i) / 4)]\r\n\r\n    if is_count:\r\n        if kor.startswith(\"한\") and len(kor) > 1:\r\n            kor = kor[1:]\r\n\r\n        if any(word in kor for word in count_tenth_dict):\r\n            kor = re.sub(\r\n                    '|'.join(count_tenth_dict.keys()),\r\n                    lambda x: count_tenth_dict[x.group()], kor)\r\n\r\n    if not is_count and kor.startswith(\"일\") and len(kor) > 1:\r\n        kor = kor[1:]\r\n\r\n    if float_str is not None:\r\n        kor += \"쩜 \"\r\n        kor += re.sub('\\d', lambda x: num_to_kor[x.group()], float_str)\r\n\r\n    if num_str.startswith(\"+\"):\r\n        kor = \"플러스 \" + kor\r\n    elif num_str.startswith(\"-\"):\r\n        kor = \"마이너스 \" + kor\r\n\r\n    return kor + unit_str\r\n\r\nif __name__ == \"__main__\":\r\n    def test_normalize(text):\r\n        print(text)\r\n        print(normalize(text))\r\n        print(\"=\"*30)\r\n\r\n    test_normalize(\"JTBC는 JTBCs를 DY는 A가 Absolute\")\r\n    test_normalize(\"오늘(13일) 3,600마리 강아지가\")\r\n    test_normalize(\"60.3%\")\r\n    test_normalize('\"저돌\"(猪突) 입니다.')\r\n    test_normalize('비대위원장이 지난 1월 이런 말을 했습니다. 
“난 그냥 산돼지처럼 돌파하는 스타일이다”')\r\n    test_normalize(\"지금은 -12.35%였고 종류는 5가지와 19가지, 그리고 55가지였다\")\r\n    test_normalize(\"JTBC는 TH와 K 양이 2017년 9월 12일 오후 12시에 24살이 된다\")\r\n    print(list(hangul_to_jamo(list(hangul_to_jamo('비대위원장이 지난 1월 이런 말을 했습니다? “난 그냥 산돼지처럼 돌파하는 스타일이다”')))))"
  },
  {
    "path": "text/symbols.py",
    "content": "# coding: utf-8\r\n'''\r\nDefines the set of symbols used in text input to the model.\r\n\r\nThe default is a set of ASCII characters that works well for English or text that has been run\r\nthrough Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details.\r\n'''\r\nfrom jamo import h2j, j2h\r\nfrom jamo.jamo import _jamo_char_to_hcj\r\n\r\nfrom .korean import ALL_SYMBOLS, PAD, EOS\r\n\r\n# For english\r\nen_symbols = PAD+EOS+'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\\'(),-.:;? '  #<-For deployment(Because korean ALL_SYMBOLS follow this convention)\r\n\r\nsymbols = ALL_SYMBOLS # for korean\r\n\r\n\"\"\"\r\n초성과 종성은 같아보이지만, 다른 character이다.\r\n\r\n'_~ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ!'(),-.:;? '\r\n\r\n'_': 0, '~': 1, 'ᄀ': 2, 'ᄁ': 3, 'ᄂ': 4, 'ᄃ': 5, 'ᄄ': 6, 'ᄅ': 7, 'ᄆ': 8, 'ᄇ': 9, 'ᄈ': 10, \r\n'ᄉ': 11, 'ᄊ': 12, 'ᄋ': 13, 'ᄌ': 14, 'ᄍ': 15, 'ᄎ': 16, 'ᄏ': 17, 'ᄐ': 18, 'ᄑ': 19, 'ᄒ': 20, \r\n'ᅡ': 21, 'ᅢ': 22, 'ᅣ': 23, 'ᅤ': 24, 'ᅥ': 25, 'ᅦ': 26, 'ᅧ': 27, 'ᅨ': 28, 'ᅩ': 29, 'ᅪ': 30, \r\n'ᅫ': 31, 'ᅬ': 32, 'ᅭ': 33, 'ᅮ': 34, 'ᅯ': 35, 'ᅰ': 36, 'ᅱ': 37, 'ᅲ': 38, 'ᅳ': 39, 'ᅴ': 40, \r\n'ᅵ': 41, 'ᆨ': 42, 'ᆩ': 43, 'ᆪ': 44, 'ᆫ': 45, 'ᆬ': 46, 'ᆭ': 47, 'ᆮ': 48, 'ᆯ': 49, 'ᆰ': 50, \r\n'ᆱ': 51, 'ᆲ': 52, 'ᆳ': 53, 'ᆴ': 54, 'ᆵ': 55, 'ᆶ': 56, 'ᆷ': 57, 'ᆸ': 58, 'ᆹ': 59, 'ᆺ': 60, \r\n'ᆻ': 61, 'ᆼ': 62, 'ᆽ': 63, 'ᆾ': 64, 'ᆿ': 65, 'ᇀ': 66, 'ᇁ': 67, 'ᇂ': 68, '!': 69, \"'\": 70, \r\n'(': 71, ')': 72, ',': 73, '-': 74, '.': 75, ':': 76, ';': 77, '?': 78, ' ': 79\r\n\"\"\""
  },
  {
    "path": "train_tacotron2.py",
    "content": "# coding: utf-8\r\nimport os\r\nimport time\r\nimport math\r\nimport argparse\r\nimport traceback\r\nimport subprocess\r\nimport numpy as np\r\nfrom jamo import h2j\r\nimport tensorflow as tf\r\nfrom datetime import datetime\r\nfrom functools import partial\r\n\r\nfrom hparams import hparams, hparams_debug_string\r\nfrom tacotron2 import create_model, get_most_recent_checkpoint\r\n\r\nfrom utils import ValueWindow, prepare_dirs\r\nfrom utils import infolog, warning, plot, load_hparams\r\nfrom utils import get_git_revision_hash, get_git_diff, str2bool, parallel_run\r\n\r\nfrom utils.audio import save_wav, inv_spectrogram\r\nfrom text import sequence_to_text, text_to_sequence\r\nfrom datasets.datafeeder_tacotron2 import DataFeederTacotron2\r\nimport warnings\r\nwarnings.simplefilter(action='ignore', category=FutureWarning)\r\n\r\ntf.logging.set_verbosity(tf.logging.ERROR)\r\nlog = infolog.log\r\n\r\n\r\n\r\ndef get_git_commit():\r\n    subprocess.check_output(['git', 'diff-index', '--quiet', 'HEAD'])     # Verify client is clean\r\n    commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()[:10]\r\n    log('Git commit: %s' % commit)\r\n    return commit\r\n\r\n\r\ndef add_stats(model, model2=None, scope_name='train'):\r\n    with tf.variable_scope(scope_name) as scope:\r\n        summaries = [\r\n                tf.summary.scalar('loss_mel', model.mel_loss),\r\n                tf.summary.scalar('loss_linear', model.linear_loss),\r\n                tf.summary.scalar('loss', model.loss_without_coeff),\r\n        ]\r\n\r\n        if scope_name == 'train':\r\n            gradient_norms = [tf.norm(grad) for grad in model.gradients if grad is not None]\r\n\r\n            summaries.extend([\r\n                    tf.summary.scalar('learning_rate', model.learning_rate),\r\n                    tf.summary.scalar('max_gradient_norm', tf.reduce_max(gradient_norms)),\r\n            ])\r\n\r\n    if model2 is not None:\r\n        with 
tf.variable_scope('gap_test-train') as scope:\r\n            summaries.extend([\r\n                    tf.summary.scalar('loss_mel',\r\n                            model.mel_loss - model2.mel_loss),\r\n                    tf.summary.scalar('loss_linear', \r\n                            model.linear_loss - model2.linear_loss),\r\n                    tf.summary.scalar('loss',\r\n                            model.loss_without_coeff - model2.loss_without_coeff),\r\n            ])\r\n\r\n    return tf.summary.merge(summaries)\r\n\r\n\r\ndef save_and_plot_fn(args, log_dir, step, loss, prefix):\r\n    idx, (seq, spec, align) = args\r\n\r\n    audio_path = os.path.join(log_dir, '{}-step-{:09d}-audio{:03d}.wav'.format(prefix, step, idx))\r\n    align_path = os.path.join(log_dir, '{}-step-{:09d}-align{:03d}.png'.format(prefix, step, idx))\r\n\r\n    waveform = inv_spectrogram(spec.T,hparams)\r\n    save_wav(waveform, audio_path,hparams.sample_rate)\r\n\r\n    info_text = 'step={:d}, loss={:.5f}'.format(step, loss)\r\n    if 'korean_cleaners' in [x.strip() for x in hparams.cleaners.split(',')]:\r\n        log('Training korean : Use jamo')\r\n        plot.plot_alignment( align, align_path, info=info_text, text=sequence_to_text(seq,skip_eos_and_pad=True, combine_jamo=True), isKorean=True)\r\n    else:\r\n        log('Training non-korean : X use jamo')\r\n        plot.plot_alignment(align, align_path, info=info_text,text=sequence_to_text(seq,skip_eos_and_pad=True, combine_jamo=False), isKorean=False) \r\n\r\ndef save_and_plot(sequences, spectrograms,alignments, log_dir, step, loss, prefix):\r\n\r\n    fn = partial(save_and_plot_fn,log_dir=log_dir, step=step, loss=loss, prefix=prefix)\r\n    items = list(enumerate(zip(sequences, spectrograms, alignments)))\r\n\r\n    parallel_run(fn, items, parallel=False)\r\n    log('Test finished for step {}.'.format(step))\r\n\r\n\r\ndef train(log_dir, config):\r\n    config.data_paths = config.data_paths  # ['datasets/moon']\r\n\r\n    
data_dirs = config.data_paths  # ['datasets/moon\\\\data']\r\n    num_speakers = len(data_dirs)\r\n    config.num_test = config.num_test_per_speaker * num_speakers  # 2*1\r\n\r\n    if num_speakers > 1 and hparams.model_type not in [\"multi-speaker\", \"simple\"]:\r\n        raise Exception(\"[!] Unknown model_type for multi-speaker: {}\".format(hparams.model_type))\r\n\r\n    commit = get_git_commit() if config.git else 'None'\r\n    checkpoint_path = os.path.join(log_dir, 'model.ckpt') # 'logdir-tacotron\\\\moon_2018-08-28_13-06-42\\\\model.ckpt'\r\n\r\n    #log(' [*] git rev-parse HEAD:\\n%s' % get_git_revision_hash())  # hccho: 주석 처리\r\n    log('='*50)\r\n    #log(' [*] git diff:\\n%s' % get_git_diff())\r\n    log('='*50)\r\n    log(' [*] Checkpoint path: %s' % checkpoint_path)\r\n    log(' [*] Loading training data from: %s' % data_dirs)\r\n    log(' [*] Using model: %s' % config.model_dir)  # 'logdir-tacotron\\\\moon_2018-08-28_13-06-42'\r\n    log(hparams_debug_string())\r\n\r\n    # Set up DataFeeder:\r\n    coord = tf.train.Coordinator()\r\n    with tf.variable_scope('datafeeder') as scope:\r\n        # DataFeeder의 6개 placeholder: train_feeder.inputs, train_feeder.input_lengths, train_feeder.loss_coeff, train_feeder.mel_targets, train_feeder.linear_targets, train_feeder.speaker_id\r\n        train_feeder = DataFeederTacotron2(coord, data_dirs, hparams, config, 32,data_type='train', batch_size=config.batch_size)\r\n        test_feeder = DataFeederTacotron2(coord, data_dirs, hparams, config, 8, data_type='test', batch_size=config.num_test)\r\n\r\n    # Set up model:\r\n\r\n    global_step = tf.Variable(0, name='global_step', trainable=False)\r\n\r\n    with tf.variable_scope('model') as scope:\r\n        model = create_model(hparams)\r\n        model.initialize(inputs=train_feeder.inputs, input_lengths=train_feeder.input_lengths,num_speakers=num_speakers,speaker_id=train_feeder.speaker_id,\r\n                         mel_targets=train_feeder.mel_targets, 
linear_targets=train_feeder.linear_targets,is_training=True,\r\n                         loss_coeff=train_feeder.loss_coeff,stop_token_targets=train_feeder.stop_token_targets)\r\n\r\n        model.add_loss()\r\n        model.add_optimizer(global_step)\r\n        train_stats = add_stats(model, scope_name='train') # legacy\r\n\r\n    with tf.variable_scope('model', reuse=True) as scope:\r\n        test_model = create_model(hparams)\r\n        test_model.initialize(inputs=test_feeder.inputs, input_lengths=test_feeder.input_lengths,num_speakers=num_speakers,speaker_id=test_feeder.speaker_id,\r\n                         mel_targets=test_feeder.mel_targets, linear_targets=test_feeder.linear_targets,is_training=False,\r\n                         loss_coeff=test_feeder.loss_coeff,stop_token_targets=test_feeder.stop_token_targets)\r\n        \r\n        test_model.add_loss()\r\n\r\n\r\n    # Bookkeeping:\r\n    step = 0\r\n    time_window = ValueWindow(100)\r\n    loss_window = ValueWindow(100)\r\n    saver = tf.train.Saver(max_to_keep=None, keep_checkpoint_every_n_hours=2)\r\n\r\n    sess_config = tf.ConfigProto(log_device_placement=False,allow_soft_placement=True)\r\n    sess_config.gpu_options.allow_growth=True\r\n\r\n    # Train!\r\n    #with tf.Session(config=sess_config) as sess:\r\n    with tf.Session() as sess:\r\n        try:\r\n            summary_writer = tf.summary.FileWriter(log_dir, sess.graph)\r\n            sess.run(tf.global_variables_initializer())\r\n\r\n            if config.load_path:\r\n                # Restore from a checkpoint if the user requested it.\r\n                restore_path = get_most_recent_checkpoint(config.model_dir)\r\n                saver.restore(sess, restore_path)\r\n                log('Resuming from checkpoint: %s at commit: %s' % (restore_path, commit), slack=True)\r\n            elif config.initialize_path:\r\n                restore_path = get_most_recent_checkpoint(config.initialize_path)\r\n                
saver.restore(sess, restore_path)\r\n                log('Initialized from checkpoint: %s at commit: %s' % (restore_path, commit), slack=True)\r\n\r\n                zero_step_assign = tf.assign(global_step, 0)\r\n                sess.run(zero_step_assign)\r\n\r\n                start_step = sess.run(global_step)\r\n                log('='*50)\r\n                log(' [*] Global step is reset to {}'.format(start_step))\r\n                log('='*50)\r\n            else:\r\n                log('Starting new training run at commit: %s' % commit, slack=True)\r\n\r\n            start_step = sess.run(global_step)\r\n\r\n            train_feeder.start_in_session(sess, start_step)\r\n            test_feeder.start_in_session(sess, start_step)\r\n\r\n            while not coord.should_stop():\r\n                start_time = time.time()\r\n                step, loss, opt = sess.run([global_step, model.loss_without_coeff, model.optimize])\r\n\r\n                time_window.append(time.time() - start_time)\r\n                loss_window.append(loss)\r\n\r\n                message = 'Step %-7d [%.03f sec/step, loss=%.05f, avg_loss=%.05f]' % (step, time_window.average, loss, loss_window.average)\r\n                log(message, slack=(step % config.checkpoint_interval == 0))\r\n\r\n                if loss > 100 or math.isnan(loss):\r\n                    log('Loss exploded to %.05f at step %d!' 
% (loss, step), slack=True)\r\n                    raise Exception('Loss Exploded')\r\n\r\n                if step % config.summary_interval == 0:\r\n                    log('Writing summary at step: %d' % step)\r\n\r\n\r\n                    summary_writer.add_summary(sess.run( train_stats), step)\r\n\r\n                if step % config.checkpoint_interval == 0:\r\n                    log('Saving checkpoint to: %s-%d' % (checkpoint_path, step))\r\n                    saver.save(sess, checkpoint_path, global_step=step)\r\n\r\n                if step % config.test_interval == 0:\r\n                    log('Saving audio and alignment...')\r\n                    num_test = config.num_test\r\n\r\n                    fetches = [\r\n                            model.inputs[:num_test],\r\n                            model.linear_outputs[:num_test],\r\n                            model.alignments[:num_test],\r\n                            test_model.inputs[:num_test],\r\n                            test_model.linear_outputs[:num_test],\r\n                            test_model.alignments[:num_test],\r\n                    ]\r\n\r\n\r\n                    sequences, spectrograms, alignments, test_sequences, test_spectrograms, test_alignments =  sess.run(fetches)\r\n\r\n\r\n                    #librosa는 ffmpeg가 있어야 한다.\r\n                    save_and_plot(sequences[:1], spectrograms[:1], alignments[:1], log_dir, step, loss, \"train\")  # spectrograms: (num_test,200,1025), alignments: (num_test,encoder_length,decoder_length)\r\n                    save_and_plot(test_sequences, test_spectrograms, test_alignments, log_dir, step, loss, \"test\")\r\n\r\n        except Exception as e:\r\n            log('Exiting due to exception: %s' % e, slack=True)\r\n            traceback.print_exc()\r\n            coord.request_stop(e)\r\n\r\n\r\ndef main():\r\n    parser = argparse.ArgumentParser()\r\n\r\n    parser.add_argument('--log_dir', default='logdir-tacotron2')\r\n    \r\n    
parser.add_argument('--data_paths', default='D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\moon,D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\son')\r\n    #parser.add_argument('--data_paths', default='D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\small1,D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\small2')\r\n    \r\n    \r\n    #parser.add_argument('--load_path', default=None)   # 아래의 'initialize_path'보다 우선 적용\r\n    parser.add_argument('--load_path', default='logdir-tacotron2/moon+son_2019-03-01_10-35-44')\r\n    \r\n    \r\n    parser.add_argument('--initialize_path', default=None)   # ckpt로 부터 model을 restore하지만, global step은 0에서 시작\r\n\r\n    parser.add_argument('--batch_size', type=int, default=32)\r\n    parser.add_argument('--num_test_per_speaker', type=int, default=2)\r\n    parser.add_argument('--random_seed', type=int, default=123)\r\n    parser.add_argument('--summary_interval', type=int, default=100)\r\n    \r\n    parser.add_argument('--test_interval', type=int, default=500)  # 500\r\n    \r\n    parser.add_argument('--checkpoint_interval', type=int, default=2000) # 2000\r\n    parser.add_argument('--skip_path_filter', type=str2bool, default=False, help='Use only for debugging')\r\n\r\n    parser.add_argument('--slack_url', help='Slack webhook URL to get periodic reports.')\r\n    parser.add_argument('--git', action='store_true', help='If set, verify that the client is clean.')  # The store_true option automatically creates a default value of False.\r\n\r\n    config = parser.parse_args()\r\n    config.data_paths = config.data_paths.split(\",\")\r\n    setattr(hparams, \"num_speakers\", len(config.data_paths))\r\n\r\n    prepare_dirs(config, hparams)\r\n\r\n    log_path = os.path.join(config.model_dir, 'train.log')\r\n    infolog.init(log_path, config.model_dir, config.slack_url)\r\n\r\n    tf.set_random_seed(config.random_seed)\r\n    print(config.data_paths)\r\n\r\n\r\n    if config.load_path is not None 
and config.initialize_path is not None:\r\n        raise Exception(\" [!] Only one of load_path and initialize_path should be set\")\r\n\r\n    train(config.model_dir, config)\r\n\r\n\r\nif __name__ == '__main__':\r\n    main()\r\n"
  },
  {
    "path": "train_vocoder.py",
    "content": "#  coding: utf-8\r\n\"\"\"\r\n- train data를 speaker를 분리된 디렉토리로 받아서, speaker id를 디렉토리별로 부과.\r\n- file name에서 speaker id를 추론하는 방식이 아님.\r\n\r\n\"\"\"\r\n\r\nfrom __future__ import print_function\r\n\r\nimport argparse\r\nimport numpy as np\r\nimport os\r\nimport time\r\nimport traceback\r\nfrom glob import glob\r\nimport tensorflow as tf\r\nfrom tensorflow.python.client import timeline\r\nfrom datetime import datetime\r\nfrom wavenet import WaveNetModel,mu_law_decode\r\nfrom datasets import DataFeederWavenet\r\nfrom hparams import hparams\r\nfrom utils import validate_directories,load,save,infolog,get_tensors_in_checkpoint_file,build_tensors_in_checkpoint_file,plot,audio\r\n\r\ntf.logging.set_verbosity(tf.logging.ERROR)\r\nEPSILON = 0.001\r\nlog = infolog.log\r\n\r\ndef eval_step(sess,logdir,step,waveform,upsampled_local_condition_data,speaker_id_data,mel_input_data,samples,speaker_id,upsampled_local_condition,next_sample,temperature=1.0):\r\n    waveform = waveform[:,:1]\r\n    \r\n    sample_size = upsampled_local_condition_data.shape[1]\r\n    last_sample_timestamp = datetime.now()\r\n    start_time = time.time()\r\n    for step2 in range(sample_size):  # 원하는 길이를 구하기 위해 loop sample_size\r\n        window = waveform[:,-1:]  # 제일 끝에 있는 1개만 samples에 넣어 준다.  window: shape(N,1)\r\n        \r\n\r\n        prediction = sess.run(next_sample, feed_dict={samples: window,upsampled_local_condition: upsampled_local_condition_data[:,step2,:],speaker_id: speaker_id_data })\r\n\r\n\r\n        if hparams.scalar_input:\r\n            sample = prediction  # logistic distribution으로부터 sampling 되었기 때문에, randomness가 있다.\r\n        else:\r\n            # Scale prediction distribution using temperature.\r\n            # 다음 과정은 config.temperature==1이면 각 원소를 합으로 나누어주는 것에 불과. 이미 softmax를 적용한 겂이므로, 합이 1이된다. 
그래서 값의 변화가 없다.\r\n            # config.temperature가 1이 아니면, 각 원소의 log취한 값을 나눈 후, 합이 1이 되도록 rescaling하는 것이 된다.\r\n            np.seterr(divide='ignore')\r\n            scaled_prediction = np.log(prediction) / temperature   # config.temperature==1인 경우는 값의 변화가 없다.\r\n            scaled_prediction = (scaled_prediction - np.logaddexp.reduce(scaled_prediction,axis=-1,keepdims=True))  # np.log(np.sum(np.exp(scaled_prediction)))\r\n            scaled_prediction = np.exp(scaled_prediction)\r\n            np.seterr(divide='warn')\r\n    \r\n            # Prediction distribution at temperature=1.0 should be unchanged after\r\n            # scaling.\r\n            if temperature == 1.0:\r\n                np.testing.assert_allclose( prediction, scaled_prediction, atol=1e-5, err_msg='Prediction scaling at temperature=1.0 is not working as intended.')\r\n            \r\n            # argmax로 선택하지 않기 때문에, 같은 입력이 들어가도 달라질 수 있다.\r\n            sample = [[np.random.choice(np.arange(hparams.quantization_channels), p=p)] for p in scaled_prediction]  # choose one sample per batch\r\n        \r\n        waveform = np.concatenate([waveform,sample],axis=-1)   #window.shape: (N,1)\r\n\r\n        # Show progress only once per second.\r\n        current_sample_timestamp = datetime.now()\r\n        time_since_print = current_sample_timestamp - last_sample_timestamp\r\n        if time_since_print.total_seconds() > 1.:\r\n            duration = time.time() - start_time\r\n            print('Sample {:<3d}/{:<3d}, ({:.3f} sec/step)'.format(step2 + 1, sample_size, duration), end='\\r')\r\n            last_sample_timestamp = current_sample_timestamp\r\n    \r\n    print('\\n')\r\n    # Save the result as a wav file.    
\r\n    if hparams.input_type == 'raw':\r\n        out = waveform[:,1:]\r\n    elif hparams.input_type == 'mulaw':\r\n        decode = mu_law_decode(samples, hparams.quantization_channels,quantization=False)\r\n        out = sess.run(decode, feed_dict={samples: waveform[:,1:]})\r\n    else:  # 'mulaw-quantize'\r\n        decode = mu_law_decode(samples, hparams.quantization_channels,quantization=True)\r\n        out = sess.run(decode, feed_dict={samples: waveform[:,1:]})          \r\n        \r\n        \r\n    # save wav\r\n    \r\n    for i in range(1):\r\n        wav_out_path= logdir + '/test-{}-{}.wav'.format(step,i)\r\n        mel_path =  wav_out_path.replace(\".wav\", \".png\")\r\n        \r\n        gen_mel_spectrogram = audio.melspectrogram(out[i], hparams).astype(np.float32).T\r\n        audio.save_wav(out[i], wav_out_path, hparams.sample_rate)  # save_wav 내에서 out[i]의 값이 바뀐다.\r\n        \r\n        plot.plot_spectrogram(gen_mel_spectrogram, mel_path, title='generated mel spectrogram{}'.format(step),target_spectrogram=mel_input_data[i])  \r\n\r\ndef create_network(hp,batch_size,num_speakers,is_training):\r\n    net = WaveNetModel(\r\n        batch_size=batch_size,\r\n        dilations=hp.dilations,\r\n        filter_width=hp.filter_width,\r\n        residual_channels=hp.residual_channels,\r\n        dilation_channels=hp.dilation_channels,\r\n        quantization_channels=hp.quantization_channels,\r\n        out_channels =hp.out_channels,\r\n        skip_channels=hp.skip_channels,\r\n        use_biases=hp.use_biases,  #  True\r\n        scalar_input=hp.scalar_input,\r\n        global_condition_channels=hp.gc_channels,\r\n        global_condition_cardinality=num_speakers,\r\n        local_condition_channels=hp.num_mels,\r\n        upsample_factor=hp.upsample_factor,\r\n        legacy = hp.legacy,\r\n        residual_legacy = hp.residual_legacy,\r\n        drop_rate = hp.wavenet_dropout,\r\n        train_mode=is_training)\r\n    \r\n    return net\r\ndef 
main():\r\n    def _str_to_bool(s):\r\n        \"\"\"Convert string to bool (in argparse context).\"\"\"\r\n        if s.lower() not in ['true', 'false']:\r\n            raise ValueError('Argument needs to be a boolean, got {}'.format(s))\r\n        return {'true': True, 'false': False}[s.lower()]\r\n    \r\n    \r\n    parser = argparse.ArgumentParser(description='WaveNet example network')\r\n    \r\n    DATA_DIRECTORY =  'D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\moon,D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\son'\r\n    #DATA_DIRECTORY =  'D:\\\\hccho\\\\Tacotron-Wavenet-Vocoder-hccho\\\\data\\\\moon'\r\n    parser.add_argument('--data_dir', type=str, default=DATA_DIRECTORY, help='The directory containing the VCTK corpus.')\r\n\r\n\r\n    #LOGDIR = None\r\n    LOGDIR = './/logdir-wavenet//train//2019-03-27T20-27-18'\r\n\r\n    parser.add_argument('--logdir', type=str, default=LOGDIR,help='Directory in which to store the logging information for TensorBoard. If the model already exists, it will restore the state and will continue training. Cannot use with --logdir_root and --restore_from.')\r\n    \r\n    \r\n    parser.add_argument('--logdir_root', type=str, default=None,help='Root directory to place the logging output and generated model. These are stored under the dated subdirectory of --logdir_root. Cannot use with --logdir.')\r\n    parser.add_argument('--restore_from', type=str, default=None,help='Directory in which to restore the model from. This creates the new model under the dated directory in --logdir_root. Cannot use with --logdir.')\r\n    \r\n    \r\n    CHECKPOINT_EVERY = 1000   # checkpoint 저장 주기\r\n    parser.add_argument('--checkpoint_every', type=int, default=CHECKPOINT_EVERY,help='How many steps to save each checkpoint after. 
Default: ' + str(CHECKPOINT_EVERY) + '.')\r\n    \r\n    \r\n    parser.add_argument('--eval_every', type=int, default=2,help='Steps between eval on test data')\r\n    \r\n   \r\n    \r\n    config = parser.parse_args()  # command 창에서 입력받을 수 있는 조건\r\n    config.data_dir = config.data_dir.split(\",\")\r\n    \r\n    try:\r\n        directories = validate_directories(config,hparams)\r\n    except ValueError as e:\r\n        print(\"Some arguments are wrong:\")\r\n        print(str(e))\r\n        return\r\n\r\n    logdir = directories['logdir']\r\n    restore_from = directories['restore_from']\r\n\r\n    # Even if we restored the model, we will treat it as new training\r\n    # if the trained model is written into an arbitrary location.\r\n    is_overwritten_training = logdir != restore_from\r\n\r\n\r\n    log_path = os.path.join(logdir, 'train.log')\r\n    infolog.init(log_path, logdir)\r\n\r\n\r\n    global_step = tf.Variable(0, name='global_step', trainable=False)\r\n\r\n    if hparams.l2_regularization_strength == 0:\r\n        hparams.l2_regularization_strength = None\r\n\r\n\r\n    # Create coordinator.\r\n    coord = tf.train.Coordinator()\r\n    num_speakers = len(config.data_dir)\r\n    # Load raw waveform from VCTK corpus.\r\n    with tf.name_scope('create_inputs'):\r\n        # Allow silence trimming to be skipped by specifying a threshold near\r\n        # zero.\r\n        silence_threshold = hparams.silence_threshold if hparams.silence_threshold > EPSILON else None\r\n        gc_enable = True  # Before: num_speakers > 1    After: 항상 True\r\n        \r\n        # AudioReader에서 wav 파일을 잘라 input값을 만든다. receptive_field길이만큼을 앞부분에 pad하거나 앞조각에서 가져온다. (receptive_field+ sample_size)크기로 자른다.\r\n        reader = DataFeederWavenet(coord,config.data_dir,batch_size=hparams.wavenet_batch_size,gc_enable= gc_enable,test_mode=False)\r\n        \r\n        # test를 위한 DataFeederWavenet를 하나 만들자. 
여기서는 딱 1개의 파일만 가져온다.\r\n        reader_test = DataFeederWavenet(coord,config.data_dir,batch_size=1,gc_enable= gc_enable,test_mode=True,queue_size=1)\r\n        \r\n        \r\n\r\n        audio_batch, lc_batch, gc_id_batch = reader.inputs_wav, reader.local_condition, reader.speaker_id\r\n\r\n\r\n    # Create train network.\r\n    net = create_network(hparams,hparams.wavenet_batch_size,num_speakers,is_training=True)\r\n    net.add_loss(input_batch=audio_batch,local_condition=lc_batch, global_condition_batch=gc_id_batch, l2_regularization_strength=hparams.l2_regularization_strength,upsample_type=hparams.upsample_type)\r\n    net.add_optimizer(hparams,global_step)\r\n\r\n\r\n\r\n    run_metadata = tf.RunMetadata()\r\n\r\n    # Set up session\r\n    sess = tf.Session(config=tf.ConfigProto(log_device_placement=False))  # log_device_placement=False --> cpu/gpu 자동 배치.\r\n    init = tf.global_variables_initializer()\r\n    sess.run(init)\r\n    \r\n    # Saver for storing checkpoints of the model.\r\n    saver = tf.train.Saver(var_list=tf.global_variables(), max_to_keep=hparams.max_checkpoints)  # 최대 checkpoint 저장 갯수 지정\r\n    \r\n    try:\r\n        start_step = load(saver, sess, restore_from)  # checkpoint load\r\n        if is_overwritten_training or start_step is None:\r\n            # The first training step will be saved_global_step + 1,\r\n            # therefore we put -1 here for new or overwritten trainings.\r\n            zero_step_assign = tf.assign(global_step, 0)\r\n            sess.run(zero_step_assign)\r\n            start_step=0\r\n    except:\r\n        print(\"Something went wrong while restoring checkpoint. We will terminate training to avoid accidentally overwriting the previous model.\")\r\n        raise\r\n\r\n\r\n    ###########\r\n\r\n    reader.start_in_session(sess,start_step)\r\n    reader_test.start_in_session(sess,start_step)\r\n    \r\n    ################### Create test network.  
<---- Queue 생성 때문에, sess restore후 test network 생성\r\n    net_test = create_network(hparams,1,num_speakers,is_training=False)\r\n  \r\n    if hparams.scalar_input:\r\n        samples = tf.placeholder(tf.float32,shape=[net_test.batch_size,None])\r\n        waveform = 2*np.random.rand(net_test.batch_size).reshape(net_test.batch_size,-1)-1\r\n        \r\n    else:\r\n        samples = tf.placeholder(tf.int32,shape=[net_test.batch_size,None])  # samples: mu_law_encode로 변환된 것. one-hot으로 변환되기 전. (batch_size, 길이)\r\n        waveform = np.random.randint(hparams.quantization_channels,size=net_test.batch_size).reshape(net_test.batch_size,-1)\r\n    upsampled_local_condition = tf.placeholder(tf.float32,shape=[net_test.batch_size,hparams.num_mels])  \r\n    \r\n        \r\n\r\n    speaker_id = tf.placeholder(tf.int32,shape=[net_test.batch_size])  \r\n    next_sample = net_test.predict_proba_incremental(samples,upsampled_local_condition,speaker_id)  # Fast Wavenet Generation Algorithm-1611.09482 algorithm 적용\r\n\r\n        \r\n    sess.run(net_test.queue_initializer)\r\n    \r\n\r\n\r\n\r\n    # test를 위한 placeholder는 모두 3개: samples,speaker_id,upsampled_local_condition\r\n    # test용 mel-spectrogram을 하나 뽑자. 그것을 고정하지 않으면, thread가 계속 돌아가면서 data를 읽어온다.  
reader_test의 역할은 여기서 끝난다.\r\n\r\n    mel_input_test, speaker_id_test = sess.run([reader_test.local_condition,reader_test.speaker_id])\r\n\r\n\r\n    with tf.variable_scope('wavenet',reuse=tf.AUTO_REUSE):\r\n        upsampled_local_condition_data = net_test.create_upsample(mel_input_test,upsample_type=hparams.upsample_type)\r\n        upsampled_local_condition_data_ = sess.run(upsampled_local_condition_data)  # upsampled_local_condition_data_ 을 feed_dict로 placehoder인 upsampled_local_condition에 넣어준다.\r\n\r\n    ######################################################\r\n    \r\n    \r\n    start_step = sess.run(global_step)\r\n    step = last_saved_step = start_step\r\n    try:        \r\n        \r\n        while not coord.should_stop():\r\n            \r\n            start_time = time.time()\r\n            if hparams.store_metadata and step % 50 == 0:\r\n                # Slow run that stores extra information for debugging.\r\n                log('Storing metadata')\r\n                run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)\r\n                step, loss_value, _ = sess.run([global_step, net.loss, net.optimize],options=run_options,run_metadata=run_metadata)\r\n\r\n                tl = timeline.Timeline(run_metadata.step_stats)\r\n                timeline_path = os.path.join(logdir, 'timeline.trace')\r\n                with open(timeline_path, 'w') as f:\r\n                    f.write(tl.generate_chrome_trace_format(show_memory=True))\r\n            else:\r\n                step, loss_value, _ = sess.run([global_step,net.loss, net.optimize])\r\n\r\n            duration = time.time() - start_time\r\n            log('step {:d} - loss = {:.3f}, ({:.3f} sec/step)'.format(step, loss_value, duration))\r\n            \r\n            \r\n            if step % config.checkpoint_every == 0:\r\n                save(saver, sess, logdir, step)\r\n                last_saved_step = step\r\n                \r\n                \r\n            if step % 
config.eval_every == 0:\r\n                eval_step(sess,logdir,step,waveform,upsampled_local_condition_data_,speaker_id_test,mel_input_test,samples,speaker_id,upsampled_local_condition,next_sample)\r\n            \r\n            if step >= hparams.num_steps:\r\n                # An error message will be printed, but stopping here is intentional.\r\n                raise Exception('Reached num_steps. Training finished.')\r\n            \r\n    except Exception as e:\r\n        log('Exiting due to exception: %s' % e, slack=True)\r\n        #if step > last_saved_step:\r\n        #    save(saver, sess, logdir, step)        \r\n        traceback.print_exc()\r\n        coord.request_stop(e)\r\n\r\n\r\nif __name__ == '__main__':\r\n    main()\r\n    print('Done')\r\n"
  },
  {
    "path": "utils/__init__.py",
"content": "# -*- coding: utf-8 -*-\r\nimport re, json, sys, os, subprocess\r\nimport tensorflow as tf\r\nfrom glob import glob\r\nfrom tqdm import tqdm\r\nfrom contextlib import closing\r\nfrom multiprocessing import Pool\r\nfrom collections import namedtuple\r\nfrom datetime import datetime, timedelta\r\nfrom shutil import copyfile as copy_file\r\nfrom tensorflow.python import pywrap_tensorflow\r\n\r\nPARAMS_NAME = \"params.json\"\r\nSTARTED_DATESTRING = \"{0:%Y-%m-%dT%H-%M-%S}\".format(datetime.now())\r\nLOGDIR_ROOT_Wavenet = './logdir-wavenet'\r\n\r\n\r\nclass ValueWindow():\r\n    def __init__(self, window_size=100):\r\n        self._window_size = window_size\r\n        self._values = []\r\n\r\n    def append(self, x):\r\n        self._values = self._values[-(self._window_size - 1):] + [x]\r\n\r\n    @property\r\n    def sum(self):\r\n        return sum(self._values)\r\n\r\n    @property\r\n    def count(self):\r\n        return len(self._values)\r\n\r\n    @property\r\n    def average(self):\r\n        return self.sum / max(1, self.count)\r\n\r\n    def reset(self):\r\n        self._values = []\r\n\r\ndef prepare_dirs(config, hparams):\r\n    if hasattr(config, \"data_paths\"):\r\n        config.datasets = [os.path.basename(data_path) for data_path in config.data_paths]\r\n        dataset_desc = \"+\".join(config.datasets)\r\n\r\n    if config.load_path:\r\n        config.model_dir = config.load_path\r\n    else:\r\n        config.model_name = \"{}_{}\".format(dataset_desc, get_time())\r\n        config.model_dir = os.path.join(config.log_dir, config.model_name)\r\n\r\n        for path in [config.log_dir, config.model_dir]:\r\n            if not os.path.exists(path):\r\n                os.makedirs(path)\r\n\r\n    if config.load_path:\r\n        load_hparams(hparams, config.model_dir)\r\n    else:\r\n        setattr(hparams, \"num_speakers\", len(config.datasets))\r\n\r\n        save_hparams(config.model_dir, hparams)\r\n        copy_file(\"hparams.py\", os.path.join(config.model_dir, 
\"hparams.py\"))\r\n\r\ndef save(saver, sess, logdir, step):\r\n    model_name = 'model.ckpt'\r\n    checkpoint_path = os.path.join(logdir, model_name)\r\n    print('Storing checkpoint to {} ...'.format(logdir), end=\"\")\r\n    sys.stdout.flush()\r\n\r\n    if not os.path.exists(logdir):\r\n        os.makedirs(logdir)\r\n\r\n    saver.save(sess, checkpoint_path, global_step=step)\r\n    print(' Done.')\r\n\r\n\r\ndef load(saver, sess, logdir):\r\n    print(\"Trying to restore saved checkpoints from {} ...\".format(logdir),end=\"\")\r\n\r\n    ckpt = tf.train.get_checkpoint_state(logdir)\r\n    #ckpt = get_most_recent_checkpoint(logdir)\r\n    if ckpt:\r\n        print(\"  Checkpoint found: {}\".format(ckpt.model_checkpoint_path))\r\n        global_step = int(ckpt.model_checkpoint_path.split('/')[-1].split('-')[-1])\r\n        print(\"  Global step was: {}\".format(global_step))\r\n        print(\"  Restoring...\", end=\"\")\r\n        saver.restore(sess, ckpt.model_checkpoint_path)\r\n        print(\" Done.\")\r\n        return global_step\r\n    else:\r\n        print(\" No checkpoint found.\")\r\n        return None\r\n\r\n\r\ndef get_default_logdir(logdir_root):\r\n    logdir = os.path.join(logdir_root, 'train', STARTED_DATESTRING)\r\n    if not os.path.exists(logdir):\r\n        os.makedirs(logdir)    \r\n    return logdir\r\n\r\n\r\ndef validate_directories(args,hparams):\r\n    \"\"\"Validate and arrange directory related arguments.\"\"\"\r\n\r\n    # Validation\r\n    if args.logdir and args.logdir_root:\r\n        raise ValueError(\"--logdir and --logdir_root cannot be specified at the same time.\")\r\n\r\n    if args.logdir and args.restore_from:\r\n        raise ValueError(\r\n            \"--logdir and --restore_from cannot be specified at the same \"\r\n            \"time. 
This is to keep your previous model from unexpected \"\r\n            \"overwrites.\\n\"\r\n            \"Use --logdir_root to specify the root of the directory which \"\r\n            \"will be automatically created with current date and time, or use \"\r\n            \"only --logdir to just continue the training from the last \"\r\n            \"checkpoint.\")\r\n\r\n    # Arrangement\r\n    logdir_root = args.logdir_root\r\n    if logdir_root is None:\r\n        logdir_root = LOGDIR_ROOT_Wavenet\r\n\r\n    logdir = args.logdir\r\n    if logdir is None:\r\n        logdir = get_default_logdir(logdir_root)\r\n        print('Using default logdir: {}'.format(logdir))\r\n        save_hparams(logdir, hparams)\r\n        copy_file(\"hparams.py\", os.path.join(logdir, \"hparams.py\"))\r\n    else:\r\n        load_hparams(hparams, logdir)\r\n\r\n    restore_from = args.restore_from\r\n    if restore_from is None:\r\n        # args.logdir and args.restore_from are exclusive,\r\n        # so it is guaranteed the logdir here is newly created.\r\n        restore_from = logdir\r\n\r\n    return {\r\n        'logdir': logdir,\r\n        'logdir_root': args.logdir_root,\r\n        'restore_from': restore_from\r\n    }\r\n\r\ndef save_hparams(model_dir, hparams):\r\n    param_path = os.path.join(model_dir, PARAMS_NAME)\r\n\r\n    info = json.loads(hparams.to_json())  # to_json() returns valid JSON, so json.loads is safer than eval.\r\n    write_json(param_path, info)\r\n\r\n    print(\" [*] MODEL dir: {}\".format(model_dir))\r\n    print(\" [*] PARAM path: {}\".format(param_path))\r\n\r\ndef write_json(path, data):\r\n    with open(path, 'w', encoding='utf-8') as f:\r\n        json.dump(data, f, indent=4, sort_keys=True, ensure_ascii=False)\r\n\r\ndef load_hparams(hparams, load_path, skip_list=[]):\r\n    # Update hparams with the hyperparameter values saved in the log dir.\r\n    path = os.path.join(load_path, PARAMS_NAME)\r\n\r\n    new_hparams = load_json(path)\r\n    hparams_keys = 
vars(hparams).keys()\r\n\r\n    for key, value in new_hparams.items():\r\n        if key in skip_list or key not in hparams_keys:\r\n            print(\"Skip {} because it does not exist\".format(key))  # the key is in the json file but not in hparams\r\n            continue\r\n\r\n        if key not in ['xxxxx',]:  # keys that must not be updated can be listed here\r\n            original_value = getattr(hparams, key)\r\n            if original_value != value:\r\n                print(\"UPDATE {}: {} -> {}\".format(key, original_value, value))\r\n                setattr(hparams, key, value)\r\n\r\ndef load_json(path, as_class=False, encoding='euc-kr'):\r\n    with open(path, encoding=encoding) as f:\r\n        content = f.read()\r\n        content = re.sub(\",\\s*}\", \"}\", content)  # drop trailing commas before } and ]\r\n        content = re.sub(\",\\s*]\", \"]\", content)\r\n\r\n        if as_class:\r\n            data = json.loads(content, object_hook=\\\r\n                    lambda data: namedtuple('Data', data.keys())(*data.values()))\r\n        else:\r\n            data = json.loads(content)\r\n\r\n    return data\r\n\r\ndef get_most_recent_checkpoint(checkpoint_dir):\r\n    checkpoint_paths = [path for path in glob(\"{}/*.ckpt-*.data-*\".format(checkpoint_dir))]\r\n    idxes = [int(os.path.basename(path).split('-')[1].split('.')[0]) for path in checkpoint_paths]\r\n\r\n    max_idx = max(idxes)\r\n    latest_checkpoint = os.path.join(checkpoint_dir, \"model.ckpt-{}\".format(max_idx))\r\n\r\n    print(\" [*] Found latest checkpoint: {}\".format(latest_checkpoint))\r\n    return latest_checkpoint\r\n\r\ndef add_prefix(path, prefix):\r\n    dir_path, filename = os.path.dirname(path), os.path.basename(path)\r\n    return \"{}/{}.{}\".format(dir_path, prefix, filename)\r\n\r\ndef add_postfix(path, postfix):\r\n    path_without_ext, ext = path.rsplit('.', 1)\r\n    return \"{}.{}.{}\".format(path_without_ext, postfix, ext)\r\n\r\ndef remove_postfix(path):\r\n    items = path.rsplit('.', 
2)\r\n    return items[0] + \".\" + items[2]\r\n\r\ndef get_time():\r\n    return datetime.now().strftime(\"%Y-%m-%d_%H-%M-%S\")\r\n\r\ndef parallel_run(fn, items, desc=\"\", parallel=True):\r\n    results = []\r\n\r\n    if parallel:\r\n        with closing(Pool(10)) as pool:\r\n            for out in tqdm(pool.imap_unordered(fn, items), total=len(items), desc=desc):\r\n                if out is not None:\r\n                    results.append(out)\r\n    else:\r\n        for item in tqdm(items, total=len(items), desc=desc):\r\n            out = fn(item)\r\n            if out is not None:\r\n                results.append(out)\r\n\r\n    return results\r\ndef makedirs(path):\r\n    if not os.path.exists(path):\r\n        print(\" [*] Make directories : {}\".format(path))\r\n        os.makedirs(path)\r\n        \r\ndef str2bool(v):\r\n    return v.lower() in ('true', '1')\r\n\r\ndef remove_file(path):\r\n    if os.path.exists(path):\r\n        print(\" [*] Removed: {}\".format(path))\r\n        os.remove(path)\r\n\r\ndef get_git_revision_hash():\r\n    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode(\"utf-8\")\r\ndef get_git_diff():\r\n    return subprocess.check_output(['git', 'diff']).decode(\"utf-8\")\r\n\r\ndef warning(msg):\r\n    print(\"=\"*40)\r\n    print(\" [!] 
{}\".format(msg))\r\n    print(\"=\"*40)\r\n    print()\r\n\r\ndef get_tensors_in_checkpoint_file(file_name, all_tensors=True, tensor_name=None):\r\n    # Recover variable names and values from a checkpoint file.\r\n    # e.g. file_name: 'D:\\\\hccho\\\\Tacotron-2-hccho\\\\model.ckpt-155000'\r\n    varlist = []\r\n    var_value = []\r\n    reader = pywrap_tensorflow.NewCheckpointReader(file_name)\r\n    trainable_variables_names = [v.name[:-2] for v in tf.trainable_variables()]  # strip the trailing ':0'\r\n    if all_tensors:\r\n        var_to_shape_map = reader.get_variable_to_shape_map()\r\n        for key in sorted(var_to_shape_map):\r\n            if key in trainable_variables_names:   # hccho\r\n                varlist.append(key)\r\n                var_value.append(reader.get_tensor(key))\r\n    else:\r\n        varlist.append(tensor_name)\r\n        var_value.append(reader.get_tensor(tensor_name))\r\n    return (varlist, var_value)\r\n\r\ndef build_tensors_in_checkpoint_file(loaded_tensors):\r\n    # From the current graph, collect the tensors whose names appear in loaded_tensors.\r\n    full_var_list = list()\r\n    # Loop over all loaded tensors\r\n    for i, tensor_name in enumerate(loaded_tensors[0]):\r\n        # Extract tensor\r\n        try:\r\n            tensor_aux = tf.get_default_graph().get_tensor_by_name(tensor_name+\":0\")\r\n        except:\r\n            print('Not found: '+tensor_name)\r\n            continue  # skip tensors that are not in the current graph\r\n        full_var_list.append(tensor_aux)\r\n    return full_var_list\r\n\"\"\"\r\n    # Restore example: after modifying the model, restore the trainable variables it shares with an existing ckpt.\r\n    CHECKPOINT_NAME = 'D:\\\\hccho\\\\Tacotron-2-hccho\\\\ver1\\\\logdir-wavenet\\\\train\\\\2019-03-22T23-08-16\\\\model.ckpt-155000'\r\n    restored_vars  = get_tensors_in_checkpoint_file(file_name=CHECKPOINT_NAME)\r\n    tensors_to_load = build_tensors_in_checkpoint_file(restored_vars)\r\n    loader = tf.train.Saver(tensors_to_load)\r\n    loader.restore(sess, CHECKPOINT_NAME)\r\n    \r\n    new_saver = tf.train.Saver(var_list=tf.global_variables(), 
max_to_keep=hparams.max_checkpoints)  # limit how many checkpoints are kept\r\n    save(new_saver, sess, logdir, 0)\r\n    exit()\r\n\r\n\"\"\"\r\n\r\n\r\n"
  },
  {
    "path": "utils/audio.py",
    "content": "# coding: utf-8\nimport librosa\nimport librosa.filters\nimport numpy as np\nimport tensorflow as tf\nfrom scipy import signal\nfrom scipy.io import wavfile\nfrom tensorflow.contrib.training.python.training.hparam import HParams\n\n\ndef load_wav(path, sr):\n    return librosa.core.load(path, sr=sr)[0]\n\ndef save_wav(wav, path, sr):\n    wav *= 32767 / max(0.01, np.max(np.abs(wav)))\n    #proposed by @dsmiller   --> libosa type error(bug) 극복\n    wavfile.write(path, sr, wav.astype(np.int16))\n\ndef save_wavenet_wav(wav, path, sr):\n    librosa.output.write_wav(path, wav, sr=sr)\n\ndef preemphasis(wav, k, preemphasize=True):\n    if preemphasize:\n        return signal.lfilter([1, -k], [1], wav)\n    return wav\n\ndef inv_preemphasis(wav, k, inv_preemphasize=True):\n    if inv_preemphasize:\n        return signal.lfilter([1], [1, -k], wav)\n    return wav\n\n#From https://github.com/r9y9/wavenet_vocoder/blob/master/audio.py\ndef start_and_end_indices(quantized, silence_threshold=2):\n    for start in range(quantized.size):\n        if abs(quantized[start] - 127) > silence_threshold:\n            break\n    for end in range(quantized.size - 1, 1, -1):\n        if abs(quantized[end] - 127) > silence_threshold:\n            break\n\n    assert abs(quantized[start] - 127) > silence_threshold\n    assert abs(quantized[end] - 127) > silence_threshold\n\n    return start, end\n\ndef trim_silence(wav, hparams):\n    '''Trim leading and trailing silence\n\n    Useful for M-AILABS dataset if we choose to trim the extra 0.5 silence at beginning and end.\n    '''\n    #Thanks @begeekmyfriend and @lautjy for pointing out the params contradiction. 
These params are separate and tunable per dataset.\n    return librosa.effects.trim(wav, top_db= hparams.trim_top_db, frame_length=hparams.trim_fft_size, hop_length=hparams.trim_hop_size)[0]\n\ndef get_hop_size(hparams):\n    hop_size = hparams.hop_size\n    if hop_size is None:\n        assert hparams.frame_shift_ms is not None\n        hop_size = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate)\n    return hop_size\n\ndef linearspectrogram(wav, hparams):\n    D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)\n    S = _amp_to_db(np.abs(D), hparams) - hparams.ref_level_db\n\n    if hparams.signal_normalization:  # Tacotron에서 항상적용했다.\n        return _normalize(S, hparams)\n    return S\n\ndef melspectrogram(wav, hparams):\n    D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)\n    S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db\n\n    if hparams.signal_normalization:\n        return _normalize(S, hparams)\n    return S\n\ndef inv_linear_spectrogram(linear_spectrogram, hparams):\n    '''Converts linear spectrogram to waveform using librosa'''\n    if hparams.signal_normalization:\n        D = _denormalize(linear_spectrogram, hparams)\n    else:\n        D = linear_spectrogram\n\n    S = _db_to_amp(D + hparams.ref_level_db) #Convert back to linear\n\n    if hparams.use_lws:\n        processor = _lws_processor(hparams)\n        D = processor.run_lws(S.astype(np.float64).T ** hparams.power)\n        y = processor.istft(D).astype(np.float32)\n        return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize)\n    else:\n        return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize)\n\n\ndef inv_mel_spectrogram(mel_spectrogram, hparams):\n    '''Converts mel spectrogram to waveform using librosa'''\n    if hparams.signal_normalization:\n        D = _denormalize(mel_spectrogram, hparams)\n    else:\n        D 
= mel_spectrogram\n\n    S = _mel_to_linear(_db_to_amp(D + hparams.ref_level_db), hparams)  # Convert back to linear\n\n    if hparams.use_lws:\n        processor = _lws_processor(hparams)\n        D = processor.run_lws(S.astype(np.float64).T ** hparams.power)\n        y = processor.istft(D).astype(np.float32)\n        return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize)\n    else:\n        return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize)\n\ndef inv_spectrogram_tensorflow(spectrogram,hparams):\n    S = _db_to_amp_tensorflow(_denormalize_tensorflow(spectrogram,hparams) + hparams.ref_level_db)\n    return _griffin_lim_tensorflow(tf.pow(S, hparams.power),hparams)\n\n\ndef inv_spectrogram(spectrogram,hparams):\n    S = _db_to_amp(_denormalize(spectrogram,hparams) + hparams.ref_level_db)    # Convert back to linear.  spectrogram: (num_freq,length)\n    return inv_preemphasis(_griffin_lim(S ** hparams.power,hparams),hparams.preemphasis, hparams.preemphasize)                 # Reconstruct phase\n\n\n\ndef _lws_processor(hparams):\n    import lws\n    return lws.lws(hparams.fft_size, get_hop_size(hparams), fftsize=hparams.win_size, mode=\"speech\")\n\ndef _griffin_lim(S, hparams):\n    '''librosa implementation of Griffin-Lim\n    Based on https://github.com/librosa/librosa/issues/434\n    '''\n    angles = np.exp(2j * np.pi * np.random.rand(*S.shape))\n    S_complex = np.abs(S).astype(np.complex)\n    y = _istft(S_complex * angles, hparams)\n    for i in range(hparams.griffin_lim_iters):\n        angles = np.exp(1j * np.angle(_stft(y, hparams)))\n        y = _istft(S_complex * angles, hparams)\n    return y\n\ndef _stft(y, hparams):\n    if hparams.use_lws:\n        return _lws_processor(hparams).stft(y).T\n    else:\n        return librosa.stft(y=y, n_fft=hparams.fft_size, hop_length=get_hop_size(hparams), win_length=hparams.win_size)\n\ndef _istft(y, hparams):\n    return librosa.istft(y, 
hop_length=get_hop_size(hparams), win_length=hparams.win_size)\n\n##########################################################\n#Those are only correct when using lws!!! (This was messing with Wavenet quality for a long time!)\ndef num_frames(length, fsize, fshift):\n    \"\"\"Compute number of time frames of spectrogram\n    \"\"\"\n    pad = (fsize - fshift)\n    if length % fshift == 0:\n        M = (length + pad * 2 - fsize) // fshift + 1\n    else:\n        M = (length + pad * 2 - fsize) // fshift + 2\n    return M\n\n\ndef pad_lr(x, fsize, fshift):\n    \"\"\"Compute left and right padding\n    \"\"\"\n    M = num_frames(len(x), fsize, fshift)\n    pad = (fsize - fshift)\n    T = len(x) + 2 * pad\n    r = (M - 1) * fshift + fsize - T\n    return pad, pad + r\n##########################################################\n#Librosa correct padding\ndef librosa_pad_lr(x, fsize, fshift):\n    '''compute right padding (final frame)\n    '''\n    return int(fsize // 2)\n\n\n# Conversions\n_mel_basis = None\n_inv_mel_basis = None\n\ndef _linear_to_mel(spectogram, hparams):\n    global _mel_basis\n    if _mel_basis is None:\n        _mel_basis = _build_mel_basis(hparams)\n    return np.dot(_mel_basis, spectogram)\n\ndef _mel_to_linear(mel_spectrogram, hparams):\n    global _inv_mel_basis\n    if _inv_mel_basis is None:\n        _inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams))\n    return np.maximum(1e-10, np.dot(_inv_mel_basis, mel_spectrogram))\n\ndef _build_mel_basis(hparams):\n    #assert hparams.fmax <= hparams.sample_rate // 2\n    \n    #fmin: Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To test depending on dataset. 
Pitch info: male~[65, 260], female~[100, 525])\n    #fmax: 7600, To be increased/reduced depending on data.\n    #return librosa.filters.mel(hparams.sample_rate, hparams.fft_size, n_mels=hparams.num_mels,fmin=hparams.fmin, fmax=hparams.fmax)\n    return librosa.filters.mel(hparams.sample_rate, hparams.fft_size, n_mels=hparams.num_mels)  # fmin=0, fmax= sample_rate/2.0\n\ndef _amp_to_db(x, hparams):\n    min_level = np.exp(hparams.min_level_db / 20 * np.log(10))  # min_level_db = -100\n    return 20 * np.log10(np.maximum(min_level, x))\n\ndef _db_to_amp(x):\n    return np.power(10.0, (x) * 0.05)\n\ndef _normalize(S, hparams):\n    if hparams.allow_clipping_in_normalization:\n        if hparams.symmetric_mels:\n            return np.clip((2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value,\n             -hparams.max_abs_value, hparams.max_abs_value)\n        else:\n            return np.clip(hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)), 0, hparams.max_abs_value)\n \n    assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0\n    if hparams.symmetric_mels:\n        return (2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value\n    else:\n        return hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db))\n \ndef _denormalize(D, hparams):\n    if hparams.allow_clipping_in_normalization:\n        if hparams.symmetric_mels:\n            return (((np.clip(D, -hparams.max_abs_value,\n                hparams.max_abs_value) + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value))\n                + hparams.min_level_db)\n        else:\n            return ((np.clip(D, 0, hparams.max_abs_value) * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)\n \n    if hparams.symmetric_mels:\n        return (((D + hparams.max_abs_value) * -hparams.min_level_db / (2 * 
hparams.max_abs_value)) + hparams.min_level_db)\n    else:\n        return ((D * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)\n\n# 김태훈 구현. 이 차이 때문에 호환이 되지 않는다.\n# def _normalize(S,hparams):\n#     return np.clip((S - hparams.min_level_db) / -hparams.min_level_db, 0, 1)  # min_level_db = -100\n# \n# def _denormalize(S,hparams):\n#     return (np.clip(S, 0, 1) * -hparams.min_level_db) + hparams.min_level_db\n\n#From https://github.com/r9y9/nnmnkwii/blob/master/nnmnkwii/preprocessing/generic.py\ndef mulaw(x, mu=256):\n    \"\"\"Mu-Law companding\n    Method described in paper [1]_.\n    .. math::\n        f(x) = sign(x) ln (1 + mu |x|) / ln (1 + mu)\n    Args:\n        x (array-like): Input signal. Each value of input signal must be in\n          range of [-1, 1].\n        mu (number): Compression parameter ``μ``.\n    Returns:\n        array-like: Compressed signal ([-1, 1])\n    See also:\n        :func:`nnmnkwii.preprocessing.inv_mulaw`\n        :func:`nnmnkwii.preprocessing.mulaw_quantize`\n        :func:`nnmnkwii.preprocessing.inv_mulaw_quantize`\n    .. [1] Brokish, Charles W., and Michele Lewis. \"A-law and mu-law companding\n        implementations using the tms320c54x.\" SPRA163 (1997).\n    \"\"\"\n    return _sign(x) * _log1p(mu * _abs(x)) / _log1p(mu)\n\n\ndef inv_mulaw(y, mu=256):\n    \"\"\"Inverse of mu-law companding (mu-law expansion)\n    .. math::\n        f^{-1}(x) = sign(y) (1 / mu) (1 + mu)^{|y|} - 1)\n    Args:\n        y (array-like): Compressed signal. 
Each value of input signal must be in\n          range of [-1, 1].\n        mu (number): Compression parameter ``μ``.\n    Returns:\n        array-like: Uncomprresed signal (-1 <= x <= 1)\n    See also:\n        :func:`nnmnkwii.preprocessing.inv_mulaw`\n        :func:`nnmnkwii.preprocessing.mulaw_quantize`\n        :func:`nnmnkwii.preprocessing.inv_mulaw_quantize`\n    \"\"\"\n    return _sign(y) * (1.0 / mu) * ((1.0 + mu)**_abs(y) - 1.0)\n\n\ndef mulaw_quantize(x, mu=256):\n    \"\"\"Mu-Law companding + quantize\n    Args:\n        x (array-like): Input signal. Each value of input signal must be in\n          range of [-1, 1].\n        mu (number): Compression parameter ``μ``.\n    Returns:\n        array-like: Quantized signal (dtype=int)\n          - y ∈ [0, mu] if x ∈ [-1, 1]\n          - y ∈ [0, mu) if x ∈ [-1, 1)\n    .. note::\n        If you want to get quantized values of range [0, mu) (not [0, mu]),\n        then you need to provide input signal of range [-1, 1).\n    Examples:\n        >>> from scipy.io import wavfile\n        >>> import pysptk\n        >>> import numpy as np\n        >>> from nnmnkwii import preprocessing as P\n        >>> fs, x = wavfile.read(pysptk.util.example_audio_file())\n        >>> x = (x / 32768.0).astype(np.float32)\n        >>> y = P.mulaw_quantize(x)\n        >>> print(y.min(), y.max(), y.dtype)\n        15 246 int64\n    See also:\n        :func:`nnmnkwii.preprocessing.mulaw`\n        :func:`nnmnkwii.preprocessing.inv_mulaw`\n        :func:`nnmnkwii.preprocessing.inv_mulaw_quantize`\n    \"\"\"\n    mu = mu-1\n    y = mulaw(x, mu)\n    # scale [-1, 1] to [0, mu]\n    return _asint((y + 1) / 2 * mu)\n\n\ndef inv_mulaw_quantize(y, mu=256):\n    \"\"\"Inverse of mu-law companding + quantize\n    Args:\n        y (array-like): Quantized signal (∈ [0, mu]).\n        mu (number): Compression parameter ``μ``.\n    Returns:\n        array-like: Uncompressed signal ([-1, 1])\n    Examples:\n        >>> from scipy.io import wavfile\n 
       >>> import pysptk\n        >>> import numpy as np\n        >>> from nnmnkwii import preprocessing as P\n        >>> fs, x = wavfile.read(pysptk.util.example_audio_file())\n        >>> x = (x / 32768.0).astype(np.float32)\n        >>> x_hat = P.inv_mulaw_quantize(P.mulaw_quantize(x))\n        >>> x_hat = (x_hat * 32768).astype(np.int16)\n    See also:\n        :func:`nnmnkwii.preprocessing.mulaw`\n        :func:`nnmnkwii.preprocessing.inv_mulaw`\n        :func:`nnmnkwii.preprocessing.mulaw_quantize`\n    \"\"\"\n    # [0, m) to [-1, 1]\n    mu = mu-1\n    y = 2 * _asfloat(y) / mu - 1\n    return inv_mulaw(y, mu)\n\ndef _sign(x):\n    #wrapper to support tensorflow tensors/numpy arrays\n    isnumpy = isinstance(x, np.ndarray)\n    isscalar = np.isscalar(x)\n    return np.sign(x) if (isnumpy or isscalar) else tf.sign(x)\n\n\ndef _log1p(x):\n    #wrapper to support tensorflow tensors/numpy arrays\n    isnumpy = isinstance(x, np.ndarray)\n    isscalar = np.isscalar(x)\n    return np.log1p(x) if (isnumpy or isscalar) else tf.log1p(x)\n\n\ndef _abs(x):\n    #wrapper to support tensorflow tensors/numpy arrays\n    isnumpy = isinstance(x, np.ndarray)\n    isscalar = np.isscalar(x)\n    return np.abs(x) if (isnumpy or isscalar) else tf.abs(x)\n\n\ndef _asint(x):\n    #wrapper to support tensorflow tensors/numpy arrays\n    isnumpy = isinstance(x, np.ndarray)\n    isscalar = np.isscalar(x)\n    return x.astype(np.int) if isnumpy else int(x) if isscalar else tf.cast(x, tf.int32)\n\n\ndef _asfloat(x):\n    #wrapper to support tensorflow tensors/numpy arrays\n    isnumpy = isinstance(x, np.ndarray)\n    isscalar = np.isscalar(x)\n    return x.astype(np.float32) if isnumpy else float(x) if isscalar else tf.cast(x, tf.float32)\n\ndef frames_to_hours(n_frames,hparams):\n    return sum((n_frame for n_frame in n_frames)) * hparams.frame_shift_ms / (3600 * 1000)\n\ndef get_duration(audio,hparams):\n    return librosa.core.get_duration(audio, sr=hparams.sample_rate)\n\ndef 
_db_to_amp_tensorflow(x):\n    return tf.pow(tf.ones(tf.shape(x)) * 10.0, x * 0.05)\n\ndef _denormalize_tensorflow(S,hparams):\n    return (tf.clip_by_value(S, 0, 1) * -hparams.min_level_db) + hparams.min_level_db\n\ndef _griffin_lim_tensorflow(S,hparams):\n    with tf.variable_scope('griffinlim'):\n        S = tf.expand_dims(S, 0)\n        S_complex = tf.identity(tf.cast(S, dtype=tf.complex64))\n        y = _istft_tensorflow(S_complex,hparams)\n        for i in range(hparams.griffin_lim_iters):\n            est = _stft_tensorflow(y,hparams)\n            angles = est / tf.cast(tf.maximum(1e-8, tf.abs(est)), tf.complex64)\n            y = _istft_tensorflow(S_complex * angles,hparams)\n        return tf.squeeze(y, 0)\n\ndef _istft_tensorflow(stfts,hparams):\n    n_fft, hop_length, win_length = _stft_parameters(hparams)\n    return tf.contrib.signal.inverse_stft(stfts, win_length, hop_length, n_fft)\n\ndef _stft_tensorflow(signals,hparams):\n    n_fft, hop_length, win_length = _stft_parameters(hparams)\n    return tf.contrib.signal.stft(signals, win_length, hop_length, n_fft, pad_end=False)\n\ndef _stft_parameters(hparams):\n    n_fft = (hparams.num_freq - 1) * 2  # hparams.num_freq = 1025\n    hop_length = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate)  # hparams.frame_shift_ms = 12.5\n    win_length = int(hparams.frame_length_ms / 1000 * hparams.sample_rate)  # hparams.frame_length_ms = 50\n    return n_fft, hop_length, win_length"
  },
  {
    "path": "utils/infolog.py",
    "content": "import atexit\nfrom datetime import datetime\nimport json\nfrom threading import Thread\nfrom urllib.request import Request, urlopen\n\n\n_format = '%Y-%m-%d %H:%M:%S.%f'\n_file = None\n_run_name = None\n_slack_url = None\n\n\ndef init(filename, run_name, slack_url=None):\n    global _file, _run_name, _slack_url\n    _close_logfile()\n    _file = open(filename, 'a')\n    _file.write('\\n-----------------------------------------------------------------\\n')\n    _file.write('Starting new training run\\n')\n    _file.write('-----------------------------------------------------------------\\n')\n    _run_name = run_name\n    _slack_url = slack_url\n\n\ndef log(msg, slack=False):\n    print(msg)\n    if _file is not None:\n        _file.write('[%s]    %s\\n' % (datetime.now().strftime(_format)[:-3], msg))\n    if slack and _slack_url is not None:\n        Thread(target=_send_slack, args=(msg,)).start()\n\n\ndef _close_logfile():\n    global _file\n    if _file is not None:\n        _file.close()\n        _file = None\n\n\ndef _send_slack(msg):\n    req = Request(_slack_url)\n    req.add_header('Content-Type', 'application/json')\n    urlopen(req, json.dumps({\n        'username': 'tacotron',\n        'icon_emoji': ':taco:',\n        'text': '*%s*: %s' % (_run_name, msg)\n    }).encode())\n\n\natexit.register(_close_logfile)\n"
  },
  {
    "path": "utils/plot.py",
    "content": "# coding: utf-8\r\nimport os \r\nimport matplotlib\r\nimport matplotlib.font_manager as font_manager\r\nfrom jamo import h2j, j2hcj\r\nimport numpy as np\r\nmatplotlib.use('Agg')\r\n\r\n# font 문제 해결\r\n#matplotlib.rc('font', family=\"NanumBarunGothic\")\r\n\r\n#font_manager._rebuild()  <---- 1번만 해주면 됨\r\n\r\nfont_fname = './/utils//NanumBarunGothic.ttf'\r\nfont_name = font_manager.FontProperties(fname=font_fname).get_name()\r\nmatplotlib.rc('font', family=\"NanumBarunGothic\")\r\n\r\n\r\nimport matplotlib.pyplot as plt\r\n\r\nfrom text import PAD, EOS\r\nfrom utils import add_postfix\r\nfrom text.korean import normalize\r\n\r\ndef plot(alignment, info, text, isKorean=True):\r\n    char_len, audio_len = alignment.shape # 145, 200\r\n\r\n    fig, ax = plt.subplots(figsize=(char_len/5, 5))\r\n    im = ax.imshow(\r\n            alignment.T,\r\n            aspect='auto',\r\n            origin='lower',\r\n            interpolation='none')\r\n\r\n    xlabel = 'Encoder timestep'\r\n    ylabel = 'Decoder timestep'\r\n\r\n    if info is not None:\r\n        xlabel += '\\n{}'.format(info)\r\n\r\n    plt.xlabel(xlabel)\r\n    plt.ylabel(ylabel)\r\n\r\n    if text:\r\n        if isKorean:\r\n            jamo_text = j2hcj(h2j(normalize(text)))\r\n        else:\r\n            jamo_text=text\r\n        pad = [PAD] * (char_len - len(jamo_text) - 1)\r\n        A = [tok for tok in jamo_text] + [EOS] + pad\r\n        A = [x if x != ' ' else '' for x in A]   # 공백이 있으면 그 뒤가 출력되지 않는 문제...\r\n        plt.xticks(range(char_len), A)\r\n\r\n    if text is not None:\r\n        while True:\r\n            if text[-1] in [EOS, PAD]:\r\n                text = text[:-1]\r\n            else:\r\n                break\r\n        plt.title(text)\r\n\r\n    plt.tight_layout()\r\n\r\ndef plot_alignment(\r\n        alignment, path, info=None, text=None, isKorean=True):\r\n\r\n    if text:  # text = '대체 투입되었던 구급대원이'\r\n        tmp_alignment = alignment[:len(h2j(text)) + 2]  # '대체 
투입되었던 구급대원이' 푼 후, 길이 측정  <--- padding제거 효과\r\n\r\n        plot(tmp_alignment, info, text, isKorean)\r\n        plt.savefig(path, format='png')\r\n    else:\r\n        plot(alignment, info, text, isKorean)\r\n        plt.savefig(path, format='png')\r\n\r\n    print(\" [*] Plot saved: {}\".format(path))\r\n    \r\n\r\ndef plot_spectrogram(pred_spectrogram, path, title=None, split_title=False, target_spectrogram=None, max_len=None, auto_aspect=False):\r\n    if max_len is not None:\r\n        target_spectrogram = target_spectrogram[:max_len]\r\n        pred_spectrogram = pred_spectrogram[:max_len]\r\n\r\n    if split_title:\r\n        title = split_title_line(title)\r\n\r\n    fig = plt.figure(figsize=(10, 8))\r\n    # Set common labels\r\n    fig.text(0.5, 0.18, title, horizontalalignment='center', fontsize=16)\r\n\r\n    #target spectrogram subplot\r\n    if target_spectrogram is not None:\r\n        ax1 = fig.add_subplot(311)\r\n        ax2 = fig.add_subplot(312)\r\n\r\n        if auto_aspect:\r\n            im = ax1.imshow(np.rot90(target_spectrogram), aspect='auto', interpolation='none')\r\n        else:\r\n            im = ax1.imshow(np.rot90(target_spectrogram), interpolation='none')\r\n        ax1.set_title('Target Mel-Spectrogram')\r\n        fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)\r\n        ax2.set_title('Predicted Mel-Spectrogram')\r\n    else:\r\n        ax2 = fig.add_subplot(211)\r\n\r\n    if auto_aspect:\r\n        im = ax2.imshow(np.rot90(pred_spectrogram), aspect='auto', interpolation='none')\r\n    else:\r\n        im = ax2.imshow(np.rot90(pred_spectrogram), interpolation='none')\r\n    fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax2)   # 'horizontal'   'vertical'\r\n\r\n    plt.tight_layout()\r\n    plt.savefig(path, format='png')\r\n    plt.close()"
  },
  {
    "path": "wavenet/__init__.py",
    "content": "#  coding: utf-8\r\nfrom .model import WaveNetModel\r\nfrom .ops import (mu_law_encode, mu_law_decode,optimizer_factory)\r\n"
  },
  {
    "path": "wavenet/mixture.py",
    "content": "# coding:utf-8\n\"\"\"\nthe code is adapted from:\nhttps://github.com/Rayhane-mamah/Tacotron-2/blob/master/wavenet_vocoder/models/mixture.py\nhttps://github.com/openai/pixel-cnn/blob/master/pixel_cnn_pp/nn.py\nhttps://github.com/r9y9/wavenet_vocoder/blob/master/wavenet_vocoder/mixture.py\nhttps://github.com/azraelkuan/tensorflow_wavenet_vocoder/tree/dev\n\"\"\"\nimport tensorflow as tf\nimport numpy as np\n\ndef log_sum_exp(x):\n    \"\"\" numerically stable log_sum_exp implementation that prevents overflow \"\"\"\n    axis = len(x.get_shape()) - 1\n    m = tf.reduce_max(x, axis)\n    m2 = tf.reduce_max(x, axis, keepdims=True)\n    return m + tf.log(tf.reduce_sum(tf.exp(x - m2), axis))\n\n\ndef log_prob_from_logits(x):\n    \"\"\" numerically stable log_softmax implementation that prevents overflow \"\"\"\n    axis = len(x.get_shape()) - 1\n    m = tf.reduce_max(x, axis, keepdims=True)\n    return x - m - tf.log(tf.reduce_sum(tf.exp(x - m), axis, keepdims=True))\n\n#  https://github.com/Rayhane-mamah/Tacotron-2/issues/155  <--- 설명 있음\ndef discretized_mix_logistic_loss(y_hat, y, num_class=256, log_scale_min=float(np.log(1e-14)), reduce=True):\n    \"\"\"\n    Discretized mixture of logistic distributions loss\n    y_hat: Predicted output B x T x C\n    y: Target   B x T x 1  (-1~1)\n    num_class: Number of classes\n    log_scale_min: Log scale minimum value\n    reduce: If True, the losses are averaged or summed for each minibatch\n    :return: loss\n    \"\"\"\n    y_hat_shape = y_hat.get_shape().as_list()\n\n    assert len(y_hat_shape) == 3\n    assert y_hat_shape[2] % 3 == 0\n\n    nr_mix = y_hat_shape[2] // 3   # 30 --> 10\n\n    # unpack parameters\n    logit_probs = y_hat[:, :, :nr_mix]\n    means = y_hat[:, :, nr_mix:2 * nr_mix]\n    log_scales = tf.maximum(y_hat[:, :, nr_mix * 2:nr_mix * 3], log_scale_min)\n\n    # B x T x 1 => B x T x nr_mix\n    y = tf.tile(y, [1, 1, nr_mix])\n\n    centered_y = y - means\n    inv_stdv = 
tf.exp(-log_scales)\n    plus_in = inv_stdv * (centered_y + 1. / (num_class - 1))\n    cdf_plus = tf.nn.sigmoid(plus_in)\n    min_in = inv_stdv * (centered_y - 1. / (num_class - 1))\n    cdf_min = tf.nn.sigmoid(min_in)\n\n    log_cdf_plus = plus_in - tf.nn.softplus(plus_in)  # log probability for edge case of 0 (before scaling)   equivalent tf.log(cdf_plus)\n\n    log_one_minus_cdf_min = -tf.nn.softplus(min_in)  # log probability for edge case of 255 (before scaling)  equivalent tf.log(1-cdf_min)\n\n    cdf_delta = cdf_plus - cdf_min  # probability for all other cases\n    \n  \n    mid_in = inv_stdv * centered_y\n    #log probability in the center of the bin, to be used in extreme cases\n    #(not actually used in this code) \n    log_pdf_mid = mid_in - log_scales - 2. * tf.nn.softplus(mid_in)  # mid 값을 pdf에 직접 넣고 계산하면 나온다.\n\n    log_probs = tf.where(y < -0.999, log_cdf_plus,\n                         tf.where(y > 0.999, log_one_minus_cdf_min,\n                                  tf.where(cdf_delta > 1e-5, tf.log(tf.maximum(cdf_delta, 1e-12)),log_pdf_mid - np.log((num_class - 1) / 2))))\n\n    log_probs = log_probs + tf.nn.log_softmax(logit_probs, -1)\n    # log_probs = log_probs + log_prob_from_logits(logit_probs)\n\n    if reduce:\n        return -tf.reduce_sum(log_sum_exp(log_probs))\n    else:\n        return -log_sum_exp(log_probs)\n\n\ndef sample_from_discretized_mix_logistic(y, log_scale_min=float(np.log(1e-14))):\n    \"\"\"\n\n    :param y: B x T x C\n    :param log_scale_min:\n    :return: [-1, 1]\n    \"\"\"\n    # 아래 코드에서 2번의 uniform random sampling이 있는데, 한번은 Gumbel distribution으로 부터 sampling을 위한 것이고, 또 한번은 logistic distribution을 위한 것이다.\n    \n    y_shape = y.get_shape().as_list()\n\n    assert len(y_shape) == 3\n    assert y_shape[2] % 3 == 0\n    nr_mix = y_shape[2] // 3\n\n    logit_probs = y[:, :, :nr_mix]\n\n    # u: random_uniform --> -log(-log(u)): standard Gumbel random sample\n    # category 결정을 위해 logit_probs(softmax 취하기 전의 값) + ( 
-log(-log(u)) )   ---> argmax를 취하면 category가 결정된다.\n    sel = tf.one_hot(tf.argmax(logit_probs - tf.log(-tf.log(tf.random_uniform(tf.shape(logit_probs), minval=1e-5, maxval=1. - 1e-5))), 2), depth=nr_mix, dtype=tf.float32)\n\n    means = tf.reduce_sum(y[:, :, nr_mix:nr_mix * 2] * sel, axis=2)\n\n    log_scales = tf.maximum(tf.reduce_sum(y[:, :, nr_mix * 2:nr_mix * 3] * sel, axis=2), log_scale_min)\n\n    # output audio를 만들기 위해 logistic distribution으로 부터 sampling\n    u = tf.random_uniform(tf.shape(means), minval=1e-5, maxval=1. - 1e-5)\n    x = means + tf.exp(log_scales) * (tf.log(u) - tf.log(1. - u))   # u을 logistic distribution의 cdf의 역함수에 대입.\n\n    x = tf.minimum(tf.maximum(x, -1.), 1.)\n    return x\n"
  },
  {
    "path": "wavenet/model.py",
    "content": "#  coding: utf-8\r\nimport numpy as np\r\nimport tensorflow as tf\r\n\r\nfrom .ops import mu_law_encode,optimizer_factory,SubPixelConvolution\r\nfrom .mixture import discretized_mix_logistic_loss, sample_from_discretized_mix_logistic\r\nclass WaveNetModel(object):\r\n    def __init__(self,batch_size,dilations,filter_width,residual_channels,dilation_channels,skip_channels,quantization_channels=2**8,out_channels=30,\r\n                 use_biases=False,scalar_input=False,global_condition_channels=None,\r\n                 global_condition_cardinality=None,local_condition_channels=80,upsample_factor=None,legacy=True,residual_legacy=True,train_mode=True,drop_rate=0.0):\r\n\r\n        self.batch_size = batch_size\r\n        self.dilations = dilations\r\n        self.filter_width = filter_width\r\n        self.residual_channels = residual_channels\r\n        self.dilation_channels = dilation_channels\r\n        self.quantization_channels = quantization_channels\r\n        self.use_biases = use_biases\r\n        self.skip_channels = skip_channels\r\n        self.scalar_input = scalar_input\r\n        self.global_condition_channels = global_condition_channels\r\n        self.global_condition_cardinality = global_condition_cardinality\r\n        self.local_condition_channels=local_condition_channels\r\n        self.upsample_factor=upsample_factor\r\n        self.train_mode = train_mode\r\n        self.out_channels = out_channels\r\n        self.legacy=legacy\r\n        self.residual_legacy=residual_legacy\r\n        self.drop_rate = drop_rate\r\n        self.ema = tf.train.ExponentialMovingAverage(decay=0.9999)\r\n        \r\n        self.receptive_field = WaveNetModel.calculate_receptive_field(self.filter_width, self.dilations)\r\n        \r\n    @staticmethod\r\n    def calculate_receptive_field(filter_width, dilations):\r\n        # causal 때문에 length (T-1) + (여기서 계산되는 receptive_field만큼의  padding)  --> 최종 output의 길이가 T가 된다.\r\n        receptive_field = 
(filter_width - 1) * sum(dilations) + 1  # 마지막 +1은 causal condition 때문에 1개 자른 것의 때문에 길이가 T-1인 되기 때문에 +1을 통해서 입력과 같은 길이 T가 된다.\r\n        return receptive_field\r\n\r\n    def _create_causal_layer(self, input_batch):\r\n        with tf.name_scope('causal_layer'):\r\n            \r\n            if self.scalar_input:\r\n                return tf.layers.conv1d(input_batch,filters=self.residual_channels,kernel_size=1,padding='valid',dilation_rate=1,use_bias=True)\r\n            else:\r\n                return tf.layers.conv1d(input_batch,filters=self.residual_channels,kernel_size=1,padding='valid',dilation_rate=1,use_bias=True)\r\n\r\n\r\n    def _create_queue(self):\r\n        # first layer(causal layer)나 local condition은 kernel_size = 1이므로, Queue가 필요없다.\r\n        with tf.variable_scope('queue'):\r\n            self.dilation_queue=[]\r\n            for i,d in enumerate(self.dilations):\r\n                q = tf.Variable(initial_value=tf.zeros(shape=[self.batch_size,d*(self.filter_width-1)+1,self.residual_channels], dtype=tf.float32), name='dilation_queue'.format(i), trainable=False)\r\n                self.dilation_queue.append(q)\r\n        \r\n        # restore했을 때, Dilation_Queue,Causal_Queue는 0으로 initialization해야 한다.\r\n        self.queue_initializer= tf.variables_initializer(self.dilation_queue)\r\n\r\n    def _create_dilation_layer(self, input_batch, layer_index, dilation,local_condition_batch,global_condition_batch):\r\n        # input_batch는 train mode에서는 길이 줄어드는 것을 대비하여 padding이 되어 있다.\r\n        with tf.variable_scope('dilation_layer'):\r\n            residual =  input_batch\r\n            if self.train_mode:\r\n                # padding\r\n                padding = (self.filter_width - 1)*dilation\r\n                input_batch = tf.pad(input_batch, tf.constant([(0, 0), (padding, 0), (0, 0)]))\r\n\r\n            else:\r\n                self.dilation_queue[layer_index] =  
tf.scatter_update(self.dilation_queue[layer_index],tf.range(self.batch_size),tf.concat([self.dilation_queue[layer_index][:,1:,:],input_batch],axis=1) )\r\n                input_batch =  self.dilation_queue[layer_index]\r\n\r\n\r\n            input_batch = tf.layers.dropout(input_batch,rate=self.drop_rate,training=self.train_mode)\r\n            \r\n            dilation_layer = tf.layers.Conv1D(filters=self.dilation_channels*2,kernel_size=self.filter_width,dilation_rate=dilation,padding='valid',use_bias=self.use_biases,name='conv_filter_gate')\r\n            \r\n            if self.train_mode:\r\n                conv = dilation_layer(input_batch)\r\n                conv_filter, conv_gate = tf.split(conv,2,axis=-1)\r\n                \r\n            else:\r\n                \r\n                dilation_layer.build((self.batch_size,1,input_batch.shape.as_list()[-1]))   # shape의 마지막만 중요함. kernel을 잡는데 마지막 차원만 사용됨\r\n                \r\n                linearized_weights = tf.reshape(dilation_layer.kernel,(-1,self.dilation_channels*2))\r\n                input_batch = input_batch[:, 0::dilation, :]\r\n                temp = tf.matmul(tf.reshape(input_batch,(self.batch_size,-1)), linearized_weights)\r\n                if self.use_biases:\r\n                    temp = tf.nn.bias_add(temp, dilation_layer.bias)                \r\n                \r\n                conv_filter, conv_gate = tf.split(tf.expand_dims(temp,1),2,axis=-1)\r\n                \r\n                            \r\n            if global_condition_batch is not None:\r\n                conv_filter += tf.layers.conv1d(global_condition_batch,filters=self.dilation_channels,kernel_size=1,padding=\"same\",use_bias=self.use_biases,name=\"gc_filter\")\r\n                conv_gate += tf.layers.conv1d(global_condition_batch,filters=self.dilation_channels,kernel_size=1,padding=\"same\",use_bias=self.use_biases,name=\"gc_gate\")\r\n    \r\n            if local_condition_batch is not None:\r\n                
local_filter = tf.layers.conv1d(local_condition_batch,filters=self.dilation_channels,kernel_size=1,padding=\"same\",use_bias=self.use_biases,name=\"lc_filter\")\r\n                local_gate = tf.layers.conv1d(local_condition_batch,filters=self.dilation_channels,kernel_size=1,padding=\"same\",use_bias=self.use_biases,name=\"lc_gate\")\r\n                \r\n                conv_filter += local_filter\r\n                conv_gate += local_gate            \r\n                \r\n                    \r\n            out = tf.tanh(conv_filter) * tf.sigmoid(conv_gate)\r\n    \r\n            # The 1x1 conv to produce the residual output  == FC\r\n            transformed = tf.layers.conv1d(out,filters=self.residual_channels,kernel_size=1,padding=\"same\",use_bias=self.use_biases,name=\"dense\")\r\n    \r\n            # The 1x1 conv to produce the skip output\r\n            skip_contribution = tf.layers.conv1d(out,filters=self.skip_channels,kernel_size=1,padding=\"same\",use_bias=self.use_biases,name=\"skip\")\r\n    \r\n\r\n            # residual + transformed: 다음 단계의 입력으로 들어감\r\n            if self.residual_legacy:\r\n                out = (residual + transformed) * np.sqrt(0.5)\r\n            else:\r\n                out = residual + transformed\r\n    \r\n            return skip_contribution, out   # skip_contribution: 결과값으로 쌓임. \r\n    def create_upsample(self, local_condition_batch,upsample_type='SubPixel'):\r\n        local_condition_batch = tf.expand_dims(local_condition_batch, [3])\r\n        # local condition batch N H W C\r\n        freq_axis_kernel_size = self.filter_width   # Rayhane-mamah 코드에서는 hyper parameter로 받음. frame(num_mels)에 적용되는 kernel_size임\r\n        for i in range(len(self.upsample_factor)):\r\n            if upsample_type =='SubPixel':\r\n                \r\n                # NN_init, NN_scaler <---- hyper parameter이지만, 여기서는 True, 0.3으로 고정\r\n                # kernel_size: (3, hparams.freq_axis_kernel_size) 이렇게 되어 있는데, 왜 3인지 모르겠음. 
upsample_factor[i]로 대체. \r\n                # freq_axis_kernel_size는 hparams에 3으로 되어 있는데, 여기서는 filter_width로 처리  <---- frame(num_mels)에 적용되는 kernel_size임\r\n                subpixel_layer = SubPixelConvolution(filters=1, kernel_size=(self.upsample_factor[i],freq_axis_kernel_size),padding='same', strides=(self.upsample_factor[i],1),\r\n                                      NN_init=True, NN_scaler=0.3,up_layers=len(self.upsample_factor), name='SubPixelConvolution_layer_{}'.format(i))\r\n                local_condition_batch = subpixel_layer(local_condition_batch)\r\n            else:\r\n                local_condition_batch = tf.layers.conv2d_transpose(local_condition_batch,filters=1, kernel_size=(self.upsample_factor[i], freq_axis_kernel_size),\r\n                                                   strides=(self.upsample_factor[i],1),padding='same',use_bias=False,name='upsample_2D_{}'.format(i))\r\n            \r\n            local_condition_batch = tf.nn.relu(local_condition_batch)\r\n            \r\n            # for debugging\r\n            #local_condition_batch = tf.Print(local_condition_batch,[tf.shape(local_condition_batch),\"xx{}\".format(i)])\r\n        local_condition_batch = tf.squeeze(local_condition_batch, [3])\r\n        \r\n        return local_condition_batch\r\n    def _create_network(self, input_batch,local_condition_batch, global_condition_batch):  \r\n        '''Construct the WaveNet network.'''\r\n        # global_condition_batch: (batch_size, 1, self.global_condition_channels)  <--- 가운데 1은 크기 1짜리 data FC대신에 conv1d를 적용하기 위해 강제로 넣었다고 봐야 한다.\r\n        \r\n        if self.train_mode==False:\r\n            self._create_queue()\r\n        \r\n        \r\n        current_layer = input_batch  # causal cut으로 길이 1이 줄어든 상태\r\n           \r\n\r\n        # Pre-process the input with a regular convolution\r\n        current_layer = self._create_causal_layer(current_layer)  # 여전 모델에서는 길이가 줄었지만, 수정 후에는 길이 불변\r\n\r\n        # Add all defined dilation 
layers.\r\n        outputs = None\r\n        with tf.variable_scope('dilated_stack'):\r\n            for layer_index, dilation in enumerate(self.dilations): # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512]\r\n                with tf.variable_scope('layer{}'.format(layer_index)):\r\n                    \r\n                    output, current_layer = self._create_dilation_layer(current_layer, layer_index, dilation,local_condition_batch,global_condition_batch)\r\n\r\n                    if outputs is None:\r\n                        outputs = output\r\n                    else:\r\n                        outputs = outputs + output\r\n                        \r\n                        if self.legacy:\r\n                            outputs = outputs * np.sqrt(0.5)\r\n                        \r\n        with tf.name_scope('postprocessing'):\r\n            # Perform (+) -> ReLU -> 1x1 conv -> ReLU -> 1x1 conv to\r\n            # postprocess the output.\r\n             \r\n            transformed1 = tf.nn.relu(outputs)\r\n            conv1 = tf.layers.conv1d(transformed1,filters=self.skip_channels,kernel_size=1,padding=\"same\",use_bias=self.use_biases)\r\n    \r\n            transformed2 = tf.nn.relu(conv1)\r\n            if self.scalar_input:\r\n                conv2 = tf.layers.conv1d(transformed2,filters=self.out_channels,kernel_size=1,padding=\"same\",use_bias=self.use_biases)\r\n            else:\r\n                conv2 = tf.layers.conv1d(transformed2,filters=self.quantization_channels,kernel_size=1,padding=\"same\",use_bias=self.use_biases)\r\n\r\n        return conv2\r\n\r\n    def _one_hot(self, input_batch):\r\n        '''One-hot encodes the waveform amplitudes.\r\n\r\n        This allows the definition of the network as a categorical distribution\r\n        over a finite set of possible amplitudes.\r\n        
'''\r\n        with tf.name_scope('one_hot_encode'):\r\n            encoded = tf.one_hot(input_batch, depth=self.quantization_channels, dtype=tf.float32)  # (1, ?, 1) --> (1, ?, 1, 256)\r\n            shape = [self.batch_size, -1, self.quantization_channels]\r\n            encoded = tf.reshape(encoded, shape)  # (1, ?, 1, 256) --> (1, ?, 256)\r\n        return encoded\r\n\r\n    def _embed_gc(self, global_condition):  # global_condition = global_condition_batch <---- data\r\n        '''Returns embedding for global condition.\r\n        :param global_condition: Either ID of global condition for\r\n               tf.nn.embedding_lookup or actual embedding. The latter is\r\n               experimental.\r\n        :return: Embedding or None\r\n        '''\r\n        # global_condition: (N,)\r\n        # self.global_condition_cardinality가 None이 아니며, global_condition 은 gc id이면 되고, 그렇지 않으면, global_condition은 embedding vector가 넘어와야 한다.\r\n        embedding = None\r\n        if self.global_condition_cardinality is not None:\r\n            # Only lookup the embedding if the global condition is presented\r\n            # as an integer of mutually-exclusive categories ...\r\n            embedding_table = tf.get_variable('gc_embedding', [self.global_condition_cardinality, self.global_condition_channels], dtype=tf.float32,initializer=tf.contrib.layers.xavier_initializer(uniform=False))   # (2, 32)\r\n            embedding = tf.nn.embedding_lookup(embedding_table,global_condition)\r\n        elif global_condition is not None:\r\n            # ... 
else the global_condition (if any) is already provided\r\n            # as an embedding.\r\n\r\n            # In this case, the number of global_embedding channels must be\r\n            # equal to the the last dimension of the global_condition tensor.\r\n            gc_batch_rank = len(global_condition.get_shape())\r\n            dims_match = (global_condition.get_shape()[gc_batch_rank - 1] == self.global_condition_channels)\r\n            if not dims_match:\r\n                raise ValueError('Shape of global_condition {} does not match global_condition_channels {}.'.format(global_condition.get_shape(),\r\n                                        self.global_condition_channels))\r\n            embedding = global_condition\r\n\r\n        if embedding is not None:\r\n            embedding = tf.reshape(embedding,[self.batch_size, 1, self.global_condition_channels])\r\n\r\n        return embedding\r\n\r\n\r\n    def predict_proba_incremental(self, waveform,upsampled_local_condition=None, global_condition=None,name='wavenet'):\r\n        \"\"\"\r\n        local_condition: upsampled local condition\r\n        \"\"\"\r\n\r\n\r\n        with tf.variable_scope(name,reuse=tf.AUTO_REUSE):\r\n            \r\n            if self.scalar_input:\r\n                encoded = tf.reshape(waveform , [self.batch_size, -1, 1])  # (N,1,1)\r\n            else:\r\n                encoded = tf.one_hot(waveform, self.quantization_channels)\r\n                encoded = tf.reshape(encoded, [self.batch_size,-1, self.quantization_channels])   # encoded shape=(N,1, 256)\r\n            \r\n            gc_embedding = self._embed_gc(global_condition)                   # --> shape=(1, 1, 32)\r\n            \r\n            \r\n            # local condition\r\n            if upsampled_local_condition is not None:\r\n                upsampled_local_condition = tf.reshape(upsampled_local_condition , [self.batch_size, -1, self.local_condition_channels])\r\n            \r\n            raw_output = 
self._create_network(encoded,upsampled_local_condition,gc_embedding)        # This is the core of the fast generation algorithm  --> (batch_size, 1, 256)\r\n            \r\n            if self.scalar_input:\r\n                out = tf.reshape(raw_output, [self.batch_size, -1, self.out_channels])\r\n                proba = sample_from_discretized_mix_logistic(out)\r\n            else:\r\n                out = tf.reshape(raw_output, [self.batch_size, self.quantization_channels])\r\n                proba = tf.cast(tf.nn.softmax(tf.cast(out, tf.float64)), tf.float32)\r\n\r\n            return proba\r\n\r\n    def add_loss(self, input_batch,local_condition=None, global_condition_batch=None, l2_regularization_strength=None,upsample_type=None, name='wavenet'):\r\n        '''Creates a WaveNet network and returns the autoencoding loss.\r\n\r\n        The variables are all scoped to the given name.\r\n        '''\r\n        with tf.variable_scope(name):\r\n            # We mu-law encode and quantize the input audio waveform.\r\n            # A one-hot encoding of size quantization_channels will be applied: if the input was 16-bit (65536 levels), this reduces it to quantization_channels levels.\r\n            # mu-law encoding reduces the bit depth in a more perceptually informed way than simple truncation.\r\n            # input_batch: (batch_size,?,1)  <-- the trailing 1 means a single channel\r\n            encoded_input = mu_law_encode(input_batch, self.quantization_channels)  # "quantization_channels": 256   ---> (batch_size, ?, 1)\r\n\r\n            gc_embedding = self._embed_gc(global_condition_batch) # (self.batch_size, 1, self.global_condition_channels) <--- the middle 1 comes from a forced reshape\r\n            encoded = self._one_hot(encoded_input)      #  (1, ?, quantization_channels=256)\r\n            if self.scalar_input:\r\n                network_input = tf.reshape( tf.cast(input_batch, tf.float32), [self.batch_size, -1, 1])\r\n            else:\r\n                network_input = encoded\r\n                \r\n            # Cut off the last sample of network input to preserve causality.\r\n            network_input_width = tf.shape(network_input)[1] - 1\r\n            if self.scalar_input:\r\n                input = tf.slice(network_input, [0, 0, 0], [-1, network_input_width,1])\r\n            else:\r\n                input = tf.slice(network_input, [0, 0, 0], [-1, network_input_width, self.quantization_channels])\r\n\r\n\r\n            # local condition\r\n            if local_condition is not None:\r\n                local_condition = self.create_upsample(local_condition,upsample_type)\r\n                local_condition = tf.slice(local_condition, [0, 0, 0], [-1, network_input_width,self.local_condition_channels])\r\n\r\n            raw_output = self._create_network(input,local_condition, gc_embedding)  # (batch_size, ?, quantization_channels=256) , (batch_size, 1, self.global_condition_channels)\r\n            \r\n            \r\n            with tf.name_scope('loss'):\r\n                # Cut off the samples corresponding to the receptive field\r\n                # for the first predicted sample.\r\n                \r\n                # Even for scalar input, the target is the mu-law companded signal.\r\n                target_output = tf.slice(network_input , [0, 1, 0],[-1, -1, -1])   # [-1,-1,-1] --> everything remaining\r\n                \r\n                if self.scalar_input:\r\n                    loss = discretized_mix_logistic_loss(raw_output, target_output,num_class=2**16, reduce=False)\r\n                    reduced_loss = tf.reduce_mean(loss)                    \r\n                else:\r\n                    # To compute the loss on the 3-dim array, flatten to 2 dims by merging the batch and time axes.\r\n                    target_output = tf.reshape(target_output, [-1, self.quantization_channels])\r\n                    prediction = tf.reshape(raw_output, [-1, self.quantization_channels])\r\n                    loss = tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=target_output)\r\n                    reduced_loss = tf.reduce_mean(loss)\r\n\r\n                tf.summary.scalar('loss', reduced_loss)\r\n\r\n                if l2_regularization_strength is None:\r\n                    self.loss = reduced_loss\r\n                else:\r\n                    # L2 regularization for all trainable parameters\r\n                    l2_loss = tf.add_n([tf.nn.l2_loss(v)  for v in tf.trainable_variables() if not('bias' in v.name)])\r\n\r\n                    # Add the regularization term to the loss\r\n                    total_loss = (reduced_loss + l2_regularization_strength * l2_loss)\r\n\r\n                    tf.summary.scalar('l2_loss', l2_loss)\r\n                    tf.summary.scalar('total_loss', total_loss)\r\n\r\n                    self.loss = total_loss\r\n\r\n    def add_optimizer(self, hparams,global_step):\r\n        '''Adds optimizer to the graph. 
Assumes that the initialize function has already been called.\r\n        '''\r\n        with tf.variable_scope('optimizer'):\r\n            hp = hparams\r\n\r\n            learning_rate = tf.train.exponential_decay(hp.wavenet_learning_rate, global_step,hp.wavenet_decay_steps,hp.wavenet_decay_rate)\r\n\r\n            #Adam optimization\r\n            self.learning_rate = learning_rate\r\n            optimizer = tf.train.AdamOptimizer(learning_rate)\r\n\r\n            gradients, variables = zip(*optimizer.compute_gradients(self.loss))   # len(tf.trainable_variables()) = len(variables)\r\n            self.gradients = gradients\r\n\r\n            #Gradients clipping\r\n            if hp.wavenet_clip_gradients:\r\n                # Rayhane-mamah applies tf.clip_by_norm -> tf.clip_by_value in two steps; here we use tf.clip_by_global_norm.\r\n                \r\n                clipped_gradients, _ = tf.clip_by_global_norm(gradients, 1)   # tf.clip_by_global_norm vs tf.clip_by_norm\r\n            else:\r\n                clipped_gradients = gradients\r\n\r\n            with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):\r\n                adam_optimize = optimizer.apply_gradients(zip(clipped_gradients, variables),global_step=global_step)        \r\n                \r\n        #Add exponential moving average\r\n        #https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage\r\n        #Use adam optimization process as a dependency\r\n        with tf.control_dependencies([adam_optimize]):\r\n            #Create the shadow variables and add ops to maintain moving averages\r\n            #Also updates moving averages after each update step\r\n            #This is the optimize call instead of traditional adam_optimize one.\r\n            assert tuple(tf.trainable_variables()) == variables #Verify all trainable variables are being averaged\r\n            self.optimize = self.ema.apply(variables)                             \r\n                "
  },
  {
    "path": "wavenet/ops.py",
    "content": "#  coding: utf-8\r\nimport tensorflow as tf\r\nimport numpy as np\r\ndef create_adam_optimizer(learning_rate, momentum):\r\n    return tf.train.AdamOptimizer(learning_rate=learning_rate,\r\n                                  epsilon=1e-4)\r\n\r\n\r\ndef create_sgd_optimizer(learning_rate, momentum):\r\n    return tf.train.MomentumOptimizer(learning_rate=learning_rate,\r\n                                      momentum=momentum)\r\n\r\n\r\ndef create_rmsprop_optimizer(learning_rate, momentum):\r\n    return tf.train.RMSPropOptimizer(learning_rate=learning_rate,\r\n                                     momentum=momentum,\r\n                                     epsilon=1e-5)\r\n\r\n\r\noptimizer_factory = {'adam': create_adam_optimizer,\r\n                     'sgd': create_sgd_optimizer,\r\n                     'rmsprop': create_rmsprop_optimizer}\r\ndef mu_law_encode(audio, quantization_channels):\r\n    '''Quantizes waveform amplitudes.'''\r\n    with tf.name_scope('encode'):\r\n        mu = tf.to_float(quantization_channels - 1)\r\n        # Perform mu-law companding transformation (ITU-T, 1988).\r\n        # Minimum operation is here to deal with rare large amplitudes caused\r\n        # by resampling.\r\n        safe_audio_abs = tf.minimum(tf.abs(audio), 1.0)\r\n        magnitude = tf.log1p(mu * safe_audio_abs) / tf.log1p(mu)  # tf.log1p(x) = log(1+x)\r\n        signal = tf.sign(audio) * magnitude\r\n        # Quantize signal to the specified number of levels.\r\n        return tf.to_int32((signal + 1) / 2 * mu + 0.5)\r\n\r\n\r\ndef mu_law_decode(output, quantization_channels, quantization=True):\r\n    '''Recovers waveform from quantized values.'''\r\n    with tf.name_scope('decode'):\r\n        mu = quantization_channels - 1\r\n        # Map values back to [-1, 1].\r\n        if quantization:\r\n            signal = 2 * (tf.to_float(output) / mu) - 1\r\n        else:\r\n            signal = output\r\n        # Perform inverse of mu-law 
transformation.\r\n        magnitude = (1 / mu) * ((1 + mu)**abs(signal) - 1)\r\n        return tf.sign(signal) * magnitude\r\n\r\nclass SubPixelConvolution(tf.layers.Conv2D):\r\n    '''Sub-Pixel Convolutions are vanilla convolutions followed by Periodic Shuffle.\r\n\r\n    They serve the purpose of upsampling (like deconvolutions) but are faster and less prone to checkerboard artifacts with the right initialization.\r\n    In contrast to ResizeConvolutions, SubPixel convolutions have the same computation speed (when using the same number of params) but a larger receptive field, as they operate on the low resolution input.\r\n    '''\r\n    def __init__(self, filters, kernel_size, padding, strides, NN_init, NN_scaler, up_layers, name=None, **kwargs):\r\n        #Output channels = filters * H_upsample * W_upsample\r\n        conv_filters = filters * strides[0] * strides[1]\r\n\r\n        #Create initial kernel\r\n        self.NN_init = NN_init\r\n        self.up_layers = up_layers\r\n        self.NN_scaler = NN_scaler\r\n        init_kernel = tf.constant_initializer(self._init_kernel(kernel_size, strides, conv_filters), dtype=tf.float32) if NN_init else None\r\n\r\n        #Build convolution component and save Shuffle parameters.\r\n        super(SubPixelConvolution, self).__init__(\r\n            filters=conv_filters,\r\n            kernel_size=kernel_size,\r\n            strides=(1, 1),\r\n            padding=padding,\r\n            kernel_initializer=init_kernel,\r\n            bias_initializer=tf.zeros_initializer(),\r\n            data_format='channels_last',\r\n            name=name, **kwargs)\r\n\r\n        self.out_filters = filters\r\n        self.shuffle_strides = strides\r\n        self.scope = name if name is not None else 'SubPixelConvolution'  # fall back to a default scope name when none is given\r\n\r\n    def build(self, input_shape):\r\n        '''Build SubPixel initial weights (ICNR: avoid checkerboard artifacts).\r\n\r\n        To ensure checkerboard free SubPixel Conv, initial weights must make the subpixel conv equivalent to conv->NN 
resize.\r\n        To do that, we replace initial kernel with the special kernel W_n == W_0 for all n <= out_channels.\r\n        In other words, we want our initial kernel to extract feature maps then apply Nearest neighbor upsampling.\r\n        NN upsampling is guaranteed to happen when we force all our output channels to be equal (neighbor pixels are duplicated).\r\n        We can think of this as limiting our initial subpixel conv to a low resolution conv (1 channel) followed by a duplication (made by PS).\r\n\r\n        Ref: https://arxiv.org/pdf/1707.02937.pdf\r\n        '''\r\n        #Initialize layer\r\n        super(SubPixelConvolution, self).build(input_shape)\r\n\r\n        if not self.NN_init:\r\n            #If no NN init is used, ensure all channel-wise parameters are equal.\r\n            self.built = False\r\n\r\n            #Get W_0 which is the first filter of the first output channel\r\n            W_0 = tf.expand_dims(self.kernel[:, :, :, 0], axis=3) #[H_k, W_k, in_c, 1]\r\n\r\n            #Tile W_0 across all output channels and replace original kernel\r\n            self.kernel = tf.tile(W_0, [1, 1, 1, self.filters]) #[H_k, W_k, in_c, out_c]\r\n\r\n        self.built = True\r\n\r\n    def call(self, inputs):\r\n        with tf.variable_scope(self.scope) as scope:\r\n            #Inputs are supposed [batch_size, freq, time_steps, channels]\r\n            convolved = super(SubPixelConvolution, self).call(inputs)\r\n\r\n            #[batch_size, up_freq, up_time_steps, channels]\r\n            return self.PS(convolved)\r\n\r\n    def PS(self, inputs):\r\n        #Get different shapes\r\n        #[batch_size, H, W, C(out_c * r1 * r2)]\r\n        batch_size = tf.shape(inputs)[0]\r\n        H = tf.shape(inputs)[1]\r\n        W = tf.shape(inputs)[2]\r\n        C = inputs.shape[-1]\r\n        r1, r2 = self.shuffle_strides #supposing strides = (freq_stride, time_stride)\r\n        out_c = self.out_filters #number of 
filters as output of the convolution (usually 1 for this model)\r\n\r\n        assert C == r1 * r2 * out_c\r\n\r\n        #Split and shuffle (output) channels separately. (Split-Concat block)\r\n        Xc = tf.split(inputs, out_c, axis=3) # out_c x [batch_size, H, W, C/out_c]\r\n        outputs = tf.concat([self._phase_shift(x, batch_size, H, W, r1, r2) for x in Xc], 3) #[batch_size, r1 * H, r2 * W, out_c]\r\n\r\n        with tf.control_dependencies([tf.assert_equal(out_c, tf.shape(outputs)[-1]),\r\n            tf.assert_equal(H * r1, tf.shape(outputs)[1])]):\r\n            outputs = tf.identity(outputs, name='SubPixelConv_output_check')\r\n\r\n        return tf.reshape(outputs, [tf.shape(outputs)[0], r1 * H, tf.shape(outputs)[2], out_c])\r\n\r\n    def _phase_shift(self, inputs, batch_size, H, W, r1, r2):\r\n        #Do a periodic shuffle on each output channel separately\r\n        x = tf.reshape(inputs, [batch_size, H, W, r1, r2]) #[batch_size, H, W, r1, r2]\r\n\r\n        #Width dim shuffle\r\n        x = tf.transpose(x, [4, 2, 3, 1, 0]) #[r2, W, r1, H, batch_size]\r\n        x = tf.batch_to_space_nd(x, [r2], [[0, 0]]) #[1, r2*W, r1, H, batch_size]\r\n        x = tf.squeeze(x, [0]) #[r2*W, r1, H, batch_size]\r\n\r\n        #Height dim shuffle\r\n        x = tf.transpose(x, [1, 2, 0, 3]) #[r1, H, r2*W, batch_size]\r\n        x = tf.batch_to_space_nd(x, [r1], [[0, 0]]) #[1, r1*H, r2*W, batch_size]\r\n        x = tf.transpose(x, [3, 1, 2, 0]) #[batch_size, r1*H, r2*W, 1]\r\n\r\n        return x\r\n\r\n    def _init_kernel(self, kernel_size, strides, filters):\r\n        '''Nearest Neighbor Upsample (Checkerboard free) init kernel size\r\n        '''\r\n        overlap = kernel_size[1] // strides[1]\r\n        init_kernel = np.zeros(kernel_size, dtype=np.float32)\r\n        i = kernel_size[1] // 2\r\n        j = [kernel_size[0] // 2 - 1, kernel_size[0] // 2] if kernel_size[0] % 2 == 0 else [kernel_size[0] // 2]\r\n        for j_i in j:\r\n            
init_kernel[j_i,i] = 1. / max(overlap, 1.) if kernel_size[1] % 2 == 0 else 1.\r\n\r\n        init_kernel = np.tile(np.expand_dims(init_kernel, 2), [1, 1, 1, filters])\r\n\r\n        return init_kernel * (self.NN_scaler)**(1/self.up_layers)\r\n"
  },
  {
    "path": "명령어모음.txt",
    "content": "python preprocess.py --num_workers 10 --name son --in_dir .\\datasets\\son --out_dir .\\data\\son\r\n\r\n\r\npython preprocess.py --num_workers 10 --name moon --in_dir .\\datasets\\moon --out_dir .\\data\\moon\r\n\r\n\r\npython train_tacotron2.py\r\n\r\n\r\npython train_vocoder.py\r\n\r\n\r\npython synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 0 --text \"오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다\"\r\npython synthesizer.py --load_path logdir-tacotron2/moon+son_2019-02-27_00-21-42 --num_speakers 2 --speaker_id 1 --text \"오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다\"\r\n\r\npython generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-27T20-27-18\r\npython generate.py --mel ./logdir-wavenet/mel-son.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-27T20-27-18\r\npython generate.py --mel ./logdir-wavenet/moon-Aust.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2019-03-27T20-27-18\r\npython generate.py --mel ./logdir-wavenet/son-Aust.npy --gc_cardinality 2 --gc_id 1 ./logdir-wavenet/train/2019-03-27T20-27-18"
  }
]