[
  {
    "path": "README.md",
    "content": "# Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations\nThis is the official implementation of the paper [Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations](https://arxiv.org/abs/1804.02812).\nYou can find the demo webpage [here](https://jjery2243542.github.io/voice_conversion_demo/), and the pretrained model [here](http://speech.ee.ntu.edu.tw/~jjery2243542/resource/model/is18/model.pkl).\n\n# Dependency\n- python 3.6+\n- pytorch 0.4.0\n- h5py 2.8\n- tensorboardX\nWe also use some preprocess script from [Kyubyong/tacotron](https://github.com/Kyubyong/tacotron).\n\n# Preprocess\nOur model is trained on [CSTR VCTK Corpus](https://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html).\n\n### Feature extraction\nWe use the code from [Kyubyong/tacotron](https://github.com/Kyubyong/tacotron) to extract feature. The default paprameters can be found at ```preprocess/tacotron/norm_utils.py```.\n\nThe configuration for preprocess is at ```preprocess/vctk.config```, where: \n- **data_root_dir**: the path of VCTK Corpus (VCTK-Corpus).\n- **h5py_path**: the path to store extracted features.\n- **index_path**: the path to store sampled segments.\n- **traini_proportion**: the proportion of training utterances. Default: 0.9.\n- **n_samples**: the number of sampled samples. Default: 500000.\n- **seg_len**: the length of sampled segments. Default: 128.\n- **speaker_used_path**: the path of used speaker list. Our speakers set used in the paper is [here](http://speech.ee.ntu.edu.tw/~jjery2243542/resource/model/is18/en_speaker_used.txt).\n\nOnce you edited the config file, you can run ```preprocess.sh``` to preprocess the dataset.\n\n# Training\nYou can start training by running ```main.py```. The arguments are listed below.\n- **--load_model**: whether to load the model from checkpoint.\n- **-flag**: flag of this training episode for tensorboard. Default: train.\n- **-hps_path**: the path of hyper-parameters set. You can find the default setting at ```vctk.json```.\n- **--load_model_path**: If **--load_model** is on, it will load the model parameters from this path.\n- **-dataset_path**: the path of processed features (.h5).\n- **-index_path**: the path of sampled segment indexes (.json).\n- **-output_model_path**: the path to store trained model. \n\n# Testing\nYou can inference by running ```python3 test.py```. The arguments are listed below.\n- **-hps**: the path of hyper-parameter set. Default: vctk.json\n- **-m**: the path of model checkpoint to load.\n- **-s**: the path of source .wav file.\n- **-t**: the index of target speaker id (integer). Same order as the speaker list (```en_speaker_used.txt```).\n- **-o**: output .wav path.\n- **-sr**: sample rate of the output .wav file. Default: 16000.\n- **--use_gen**: if the flag is on, inference will use generator. Default: True.\n\n# Reference\nPlease cite our paper if you find this repository useful.\n```\n@article{chou2018multi,\n  title={Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations},\n  author={Chou, Ju-chieh and Yeh, Cheng-chieh and Lee, Hung-yi and Lee, Lin-shan},\n  journal={arXiv preprint arXiv:1804.02812},\n  year={2018}\n}\n```\n\n# Contact\nIf you have any question about the paper or the code, feel free to email me at [jjery2243542@gmail.com](jjery2243542@gmail.com).\n"
  },
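  {
    "path": "examples/vctk.config.example",
    "content": "# An illustrative sketch of preprocess/vctk.config. All paths below are\n# placeholders, not the authors' actual locations; point them at your own\n# VCTK copy and output directories. Field names follow preprocess/preprocess.sh.\ndata_root_dir=/path/to/VCTK-Corpus\nh5py_path=/path/to/feature/vctk.h5\nindex_path=/path/to/feature/index.json\ntrain_proportion=0.9\nn_samples=500000\nseg_len=128\nspeaker_used_path=/path/to/en_speaker_used.txt\n"
  },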
  {
    "path": "convert.py",
    "content": "import torch\nfrom torch import optim\nfrom torch.autograd import Variable\nimport numpy as np\nimport pickle\nfrom utils import Hps\nfrom utils import DataLoader\nfrom utils import Logger\nfrom utils import myDataset\nfrom utils import Indexer\nfrom solver import Solver\nfrom preprocess.tacotron.norm_utils import spectrogram2wav\n#from preprocess.tacotron.audio import inv_spectrogram, save_wav\nfrom scipy.io.wavfile import write\n#from preprocess.tacotron.mcep import mc2wav\nimport h5py \nimport os \nimport soundfile as sf\n#import pysptk\n#import pyworld as pw\n\ndef sp2wav(sp): \n    #exp_sp = np.exp(sp)\n    exp_sp = sp\n    wav_data = spectrogram2wav(exp_sp)\n    return wav_data\n\ndef get_world_param(f_h5, src_speaker, utt_id, tar_speaker, tar_speaker_id, solver, dset='test', gen=True):\n    mc = f_h5[f'{dset}/{src_speaker}/{utt_id}/norm_mc'][()]\n    converted_mc = convert_mc(mc, tar_speaker_id, solver, gen=gen)\n    #converted_mc = mc\n    mc_mean = f_h5[f'train/{tar_speaker}'].attrs['mc_mean']\n    mc_std = f_h5[f'train/{tar_speaker}'].attrs['mc_std']\n    converted_mc = converted_mc * mc_std + mc_mean\n    log_f0 = f_h5[f'{dset}/{src_speaker}/{utt_id}/log_f0'][()]\n    src_mean = f_h5[f'train/{src_speaker}'].attrs['f0_mean']\n    src_std = f_h5[f'train/{src_speaker}'].attrs['f0_std']\n    tar_mean = f_h5[f'train/{tar_speaker}'].attrs['f0_mean']\n    tar_std = f_h5[f'train/{tar_speaker}'].attrs['f0_std']\n    index = np.where(log_f0 > 1e-10)[0]\n    log_f0[index] = (log_f0[index] - src_mean) * tar_std / src_std + tar_mean\n    log_f0[index] = np.exp(log_f0[index])\n    f0 = log_f0\n    ap = f_h5[f'{dset}/{src_speaker}/{utt_id}/ap'][()]\n    converted_mc = converted_mc[:ap.shape[0]]\n    sp = pysptk.conversion.mc2sp(converted_mc, alpha=0.41, fftlen=1024)\n    return f0, sp, ap\n\ndef synthesis(f0, sp, ap, sr=16000):\n    y = pw.synthesize(\n            f0.astype(np.float64),\n            sp.astype(np.float64),\n            ap.astype(np.float64), \n            sr, \n            pw.default_frame_period)\n    return y\n\ndef convert_sp(sp, c, solver, gen=True):\n    c_var = Variable(torch.from_numpy(np.array([c]))).cuda()\n    sp_tensor = torch.from_numpy(np.expand_dims(sp, axis=0))\n    sp_tensor = sp_tensor.type(torch.FloatTensor)\n    converted_sp = solver.test_step(sp_tensor, c_var, gen=gen)\n    converted_sp = converted_sp.squeeze(axis=0).transpose((1, 0))\n    return converted_sp\n\ndef convert_mc(mc, c, solver, gen=True):\n    c_var = Variable(torch.from_numpy(np.array([c]))).cuda()\n    mc_tensor = torch.from_numpy(np.expand_dims(mc, axis=0))\n    mc_tensor = mc_tensor.type(torch.FloatTensor)\n    converted_mc = solver.test_step(mc_tensor, c_var, gen=gen)\n    converted_mc = converted_mc.squeeze(axis=0).transpose((1, 0))\n    return converted_mc\n\ndef get_model(hps_path='./hps/vctk.json', model_path='/storage/model/voice_conversion/vctk/clf/model.pkl-109999'):\n    hps = Hps()\n    hps.load(hps_path)\n    hps_tuple = hps.get_tuple()\n    solver = Solver(hps_tuple, None)\n    solver.load_model(model_path)\n    return solver\n\ndef convert_all_sp(h5_path, src_speaker, tar_speaker, gen=True, \n        dset='test', speaker_used_path='/storage/feature/voice_conversion/vctk/dataset_used/en_speaker_used.txt',\n        root_dir='/storage/result/voice_conversion/vctk/p226_to_p225/',\n        model_path='/storage/model/voice_conversion/vctk/clf/wo_tanh/model_0.001.pkl-79999'):\n    # read speaker id file\n    with open(speaker_used_path) as f:\n        speakers = 
[line.strip() for line in f]\n        speaker2id = {speaker:i for i, speaker in enumerate(speakers)}\n    solver = get_model(hps_path='hps/vctk.json', \n            model_path=model_path)\n    with h5py.File(h5_path, 'r') as f_h5:\n        for utt_id in f_h5[f'{dset}/{src_speaker}']:\n            sp = f_h5[f'{dset}/{src_speaker}/{utt_id}/lin'][()]\n            converted_sp = convert_sp(sp, speaker2id[tar_speaker], solver, gen=gen)\n            wav_data = sp2wav(converted_sp)\n            wav_path = os.path.join(root_dir, f'{src_speaker}_{tar_speaker}_{utt_id}.wav')\n            sf.write(wav_path, wav_data, 16000, 'PCM_24')\n\ndef convert_all_mc(h5_path, src_speaker, tar_speaker, gen=False, \n        dset='test', speaker_used_path='/storage/feature/voice_conversion/vctk/mcep/en_speaker_used.txt',\n        root_dir='/storage/result/voice_conversion/vctk/p226_to_p225',\n        model_path='/storage/model/voice_conversion/vctk/clf/wo_tanh/model_0.001.pkl-79999'):\n    # read speaker id file\n    with open(speaker_used_path) as f:\n        speakers = [line.strip() for line in f]\n        speaker2id = {speaker:i for i, speaker in enumerate(speakers)}\n    solver = get_model(hps_path='hps/vctk.json', \n            model_path=model_path)\n    with h5py.File(h5_path, 'r') as f_h5:\n        for utt_id in f_h5[f'{dset}/{src_speaker}']:\n            f0, sp, ap = get_world_param(f_h5, src_speaker, utt_id, tar_speaker, tar_speaker_id=speaker2id[tar_speaker], solver=solver, dset='test', gen=gen)\n            wav_data = synthesis(f0, sp, ap)\n            wav_path = os.path.join(root_dir, f'{src_speaker}_{tar_speaker}_{utt_id}.wav')\n            sf.write(wav_path, wav_data, 16000, 'PCM_24')\n\nif __name__ == '__main__':\n    h5_path = '/storage/feature/voice_conversion/vctk/dataset_used/norm_vctk.h5'\n    root_dir = '/storage/result/voice_conversion/vctk/norm/clf_gen'\n    #h5_path = '/storage/feature/voice_conversion/LibriSpeech/libri.h5'\n    #h5_path = '/storage/feature/voice_conversion/vctk/mcep/trim_mc_vctk_backup.h5'\n    #convert_all_mc(h5_path, '226', '225', root_dir='./test_mc/', gen=False, \n    #        model_path='/storage/model/voice_conversion/vctk/mcep/clf/model.pkl-129999')\n    #convert_all_mc(h5_path, '225', '226', root_dir='./test_mc/', gen=False, \n    #        model_path='/storage/model/voice_conversion/vctk/mcep/clf/model.pkl-129999')\n    #convert_all_mc(h5_path, '225', '228', root_dir='./test_mc/', gen=False, \n    #        model_path='/storage/model/voice_conversion/vctk/mcep/clf/model.pkl-129999')\n    #model_path = '/storage/model/voice_conversion/vctk/mcep/clf/model.pkl-129999'\n    model_path = '/storage/model/voice_conversion/vctk/clf/norm/wo_tanh/model_0.001_no_ins.pkl-124000'\n    #model_path = '/storage/model/voice_conversion/librispeech/ls_1e-3.pkl-99999'\n    speakers = ['225', '226', '227']\n    for speaker_A in speakers:\n        for speaker_B in speakers:\n            if speaker_A == speaker_B:\n                continue\n            else:\n                dir_path = os.path.join(root_dir, f'p{speaker_A}_p{speaker_B}')\n                if not os.path.exists(dir_path):\n                    os.makedirs(dir_path)\n                convert_all_sp(h5_path,speaker_A,speaker_B,\n                        root_dir=dir_path, \n                        gen=True, model_path=model_path)\n    # diff accent\n    #dir_path = os.path.join(root_dir, '163_6925')\n    #if not os.path.exists(dir_path):\n    #    os.makedirs(dir_path)\n    #convert_all_sp(h5_path,'163','6925',root_dir=dir_path, 
\n    #        gen=False, model_path=model_path)\n    #dir_path = os.path.join(root_dir, '460_1363')\n    #if not os.path.exists(dir_path):\n    #    os.makedirs(dir_path)\n    #convert_all_sp(h5_path,'460','1363',root_dir=dir_path, \n    #        gen=False, model_path=model_path)\n    #convert_all_sp(h5_path,'363','256',root_dir=os.path.join(root_dir, 'p363_p256'), \n    #        gen=True, model_path=model_path)\n    #convert_all_sp(h5_path,'340','251',root_dir=os.path.join(root_dir, 'p340_p251'), \n    #        gen=True, model_path=model_path)\n    #convert_all_sp(h5_path,'285','251',root_dir=os.path.join(root_dir, 'p285_p251'), \n    #        gen=True, model_path=model_path)\n"
  },
  {
    "path": "main.py",
    "content": "import torch\nfrom torch import optim\nfrom torch.autograd import Variable\nimport numpy as np\nimport pickle\nfrom utils import Hps\nfrom utils import DataLoader\nfrom utils import Logger\nfrom utils import SingleDataset\nfrom solver import Solver\nimport argparse\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--load_model', default=False, action='store_true')\n    parser.add_argument('-flag', default='train')\n    parser.add_argument('-hps_path')\n    parser.add_argument('-load_model_path')\n    parser.add_argument('-dataset_path')\n    parser.add_argument('-index_path')\n    parser.add_argument('-output_model_path')\n    args = parser.parse_args()\n    hps = Hps()\n    hps.load(args.hps_path)\n    hps_tuple = hps.get_tuple()\n    dataset = SingleDataset(args.dataset_path, args.index_path, seg_len=hps_tuple.seg_len)\n    data_loader = DataLoader(dataset)\n\n    solver = Solver(hps_tuple, data_loader)\n    if args.load_model:\n        solver.load_model(args.load_model_path)\n\n    solver.train(args.output_model_path, args.flag, mode='pretrain_G')\n    solver.train(args.output_model_path, args.flag, mode='pretrain_D')\n    solver.train(args.output_model_path, args.flag, mode='train')\n    solver.train(args.output_model_path, args.flag, mode='patchGAN')\n"
  },
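  {
    "path": "examples/train_example.sh",
    "content": "# An illustrative training invocation, assuming features and segment indexes\n# produced by preprocess/preprocess.sh. All paths are placeholders, not the\n# authors' actual locations.\npython3 main.py \\\n    -hps_path vctk.json \\\n    -dataset_path /path/to/feature/vctk.h5 \\\n    -index_path /path/to/feature/index.json \\\n    -output_model_path /path/to/model/model.pkl \\\n    -flag train\n"
  },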
  {
    "path": "model.py",
    "content": "import numpy as np\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch\nfrom torch.autograd import Variable\n\ndef pad_layer(inp, layer, is_2d=False):\n    if type(layer.kernel_size) == tuple:\n        kernel_size = layer.kernel_size[0]\n    else:\n        kernel_size = layer.kernel_size\n    if not is_2d:\n        if kernel_size % 2 == 0:\n            pad = (kernel_size//2, kernel_size//2 - 1)\n        else:\n            pad = (kernel_size//2, kernel_size//2)\n    else:\n        if kernel_size % 2 == 0:\n            pad = (kernel_size//2, kernel_size//2 - 1, kernel_size//2, kernel_size//2 - 1)\n        else:\n            pad = (kernel_size//2, kernel_size//2, kernel_size//2, kernel_size//2)\n    # padding\n    inp = F.pad(inp, \n            pad=pad,\n            mode='reflect')\n    out = layer(inp)\n    return out\n\ndef pixel_shuffle_1d(inp, upscale_factor=2):\n    batch_size, channels, in_width = inp.size()\n    channels //= upscale_factor\n    \n    out_width = in_width * upscale_factor\n    inp_view = inp.contiguous().view(batch_size, channels, upscale_factor, in_width)\n    shuffle_out = inp_view.permute(0, 1, 3, 2).contiguous()\n    shuffle_out = shuffle_out.view(batch_size, channels, out_width)\n    return shuffle_out\n\ndef upsample(x, scale_factor=2):\n    x_up = F.upsample(x, scale_factor=2, mode='nearest')\n    return x_up\n\ndef RNN(inp, layer):\n    inp_permuted = inp.permute(2, 0, 1)\n    state_mul = (int(layer.bidirectional) + 1) * layer.num_layers\n    zero_state = Variable(torch.zeros(state_mul, inp.size(0), layer.hidden_size))\n    zero_state = zero_state.cuda() if torch.cuda.is_available() else zero_state\n    out_permuted, _ = layer(inp_permuted, zero_state)\n    out_rnn = out_permuted.permute(1, 2, 0)\n    return out_rnn\n\ndef linear(inp, layer):\n    batch_size = inp.size(0)\n    hidden_dim = inp.size(1)\n    seq_len = inp.size(2)\n    inp_permuted = inp.permute(0, 2, 1)\n    inp_expand = inp_permuted.contiguous().view(batch_size*seq_len, hidden_dim)\n    out_expand = layer(inp_expand)\n    out_permuted = out_expand.view(batch_size, seq_len, out_expand.size(1))\n    out = out_permuted.permute(0, 2, 1)\n    return out\n\ndef append_emb(emb, expand_size, output):\n    emb = emb.unsqueeze(dim=2)\n    emb_expand = emb.expand(emb.size(0), emb.size(1), expand_size)\n    output = torch.cat([output, emb_expand], dim=1)\n    return output\n\nclass PatchDiscriminator(nn.Module):\n    def __init__(self, n_class=33, ns=0.2, dp=0.1):\n        super(PatchDiscriminator, self).__init__()\n        self.ns = ns\n        self.conv1 = nn.Conv2d(1, 64, kernel_size=5, stride=2)\n        self.conv2 = nn.Conv2d(64, 128, kernel_size=5, stride=2)\n        self.conv3 = nn.Conv2d(128, 256, kernel_size=5, stride=2)\n        self.conv4 = nn.Conv2d(256, 512, kernel_size=5, stride=2)\n        self.conv5 = nn.Conv2d(512, 512, kernel_size=5, stride=2)\n        self.conv6 = nn.Conv2d(512, 32, kernel_size=1)\n        self.conv7 = nn.Conv2d(32, 1, kernel_size=(17, 4))\n        #self.conv_classify = nn.Conv2d(512, n_class, kernel_size=(17, 4))\n        self.conv_classify = nn.Conv2d(32, n_class, kernel_size=(17, 4))\n        self.drop1 = nn.Dropout2d(p=dp)\n        self.drop2 = nn.Dropout2d(p=dp)\n        self.drop3 = nn.Dropout2d(p=dp)\n        self.drop4 = nn.Dropout2d(p=dp)\n        self.drop5 = nn.Dropout2d(p=dp)\n        self.drop6 = nn.Dropout2d(p=dp)\n        self.ins_norm1 = nn.InstanceNorm2d(self.conv1.out_channels)\n        self.ins_norm2 = 
nn.InstanceNorm2d(self.conv2.out_channels)\n        self.ins_norm3 = nn.InstanceNorm2d(self.conv3.out_channels)\n        self.ins_norm4 = nn.InstanceNorm2d(self.conv4.out_channels)\n        self.ins_norm5 = nn.InstanceNorm2d(self.conv5.out_channels)\n        self.ins_norm6 = nn.InstanceNorm2d(self.conv6.out_channels)\n\n    def conv_block(self, x, conv_layer, after_layers):\n        out = pad_layer(x, conv_layer, is_2d=True)\n        out = F.leaky_relu(out, negative_slope=self.ns)\n        for layer in after_layers:\n            out = layer(out)\n        return out \n\n    def forward(self, x, classify=False):\n        x = torch.unsqueeze(x, dim=1)\n        out = self.conv_block(x, self.conv1, [self.ins_norm1, self.drop1])\n        out = self.conv_block(out, self.conv2, [self.ins_norm2, self.drop2])\n        out = self.conv_block(out, self.conv3, [self.ins_norm3, self.drop3])\n        out = self.conv_block(out, self.conv4, [self.ins_norm4, self.drop4])\n        out = self.conv_block(out, self.conv5, [self.ins_norm5, self.drop5])\n        out = self.conv_block(out, self.conv6, [self.ins_norm6, self.drop6])\n        # GAN output value\n        val = self.conv7(out)\n        val = val.view(val.size(0), -1)\n        mean_val = torch.mean(val, dim=1)\n        if classify:\n            # classify\n            logits = self.conv_classify(out)\n            logits = logits.view(logits.size(0), -1)\n            return mean_val, logits\n        else:\n            return mean_val\n\nclass SpeakerClassifier(nn.Module):\n    def __init__(self, c_in=512, c_h=512, n_class=8, dp=0.1, ns=0.01):\n        super(SpeakerClassifier, self).__init__()\n        self.dp, self.ns = dp, ns\n        self.conv1 = nn.Conv1d(c_in, c_h, kernel_size=5)\n        self.conv2 = nn.Conv1d(c_h, c_h, kernel_size=5)\n        self.conv3 = nn.Conv1d(c_h, c_h, kernel_size=5)\n        self.conv4 = nn.Conv1d(c_h, c_h, kernel_size=5)\n        self.conv5 = nn.Conv1d(c_h, c_h, kernel_size=5)\n        self.conv6 = nn.Conv1d(c_h, c_h, kernel_size=5)\n        self.conv7 = nn.Conv1d(c_h, c_h//2, kernel_size=3)\n        self.conv8 = nn.Conv1d(c_h//2, c_h//4, kernel_size=3)\n        self.conv9 = nn.Conv1d(c_h//4, n_class, kernel_size=16)\n        self.drop1 = nn.Dropout(p=dp)\n        self.drop2 = nn.Dropout(p=dp)\n        self.drop3 = nn.Dropout(p=dp)\n        self.drop4 = nn.Dropout(p=dp)\n        self.ins_norm1 = nn.InstanceNorm1d(c_h)\n        self.ins_norm2 = nn.InstanceNorm1d(c_h)\n        self.ins_norm3 = nn.InstanceNorm1d(c_h)\n        self.ins_norm4 = nn.InstanceNorm1d(c_h//4)\n\n    def conv_block(self, x, conv_layers, after_layers, res=True):\n        out = x\n        for layer in conv_layers:\n            out = pad_layer(out, layer)\n            out = F.leaky_relu(out, negative_slope=self.ns)\n        for layer in after_layers:\n            out = layer(out)\n        if res:\n            out = out + x\n        return out\n\n    def forward(self, x):\n        out = self.conv_block(x, [self.conv1, self.conv2], [self.ins_norm1, self.drop1], res=False)\n        out = self.conv_block(out, [self.conv3, self.conv4], [self.ins_norm2, self.drop2], res=True)\n        out = self.conv_block(out, [self.conv5, self.conv6], [self.ins_norm3, self.drop3], res=True)\n        out = self.conv_block(out, [self.conv7, self.conv8], [self.ins_norm4, self.drop4], res=False)\n        out = self.conv9(out)\n        out = out.view(out.size()[0], -1)\n        return out\n\nclass Decoder(nn.Module):\n    def __init__(self, c_in=512, c_out=513, c_h=512, c_a=8, 
emb_size=128, ns=0.2):\n        super(Decoder, self).__init__()\n        self.ns = ns\n        self.conv1 = nn.Conv1d(c_in, 2*c_h, kernel_size=3)\n        self.conv2 = nn.Conv1d(c_h, c_h, kernel_size=3)\n        self.conv3 = nn.Conv1d(c_h, 2*c_h, kernel_size=3)\n        self.conv4 = nn.Conv1d(c_h, c_h, kernel_size=3)\n        self.conv5 = nn.Conv1d(c_h, 2*c_h, kernel_size=3)\n        self.conv6 = nn.Conv1d(c_h, c_h, kernel_size=3)\n        self.dense1 = nn.Linear(c_h, c_h)\n        self.dense2 = nn.Linear(c_h, c_h)\n        self.dense3 = nn.Linear(c_h, c_h)\n        self.dense4 = nn.Linear(c_h, c_h)\n        self.RNN = nn.GRU(input_size=c_h, hidden_size=c_h//2, num_layers=1, bidirectional=True)\n        self.dense5 = nn.Linear(2*c_h + c_h, c_h)\n        self.linear = nn.Linear(c_h, c_out)\n        # normalization layer\n        self.ins_norm1 = nn.InstanceNorm1d(c_h)\n        self.ins_norm2 = nn.InstanceNorm1d(c_h)\n        self.ins_norm3 = nn.InstanceNorm1d(c_h)\n        self.ins_norm4 = nn.InstanceNorm1d(c_h)\n        self.ins_norm5 = nn.InstanceNorm1d(c_h)\n        # embedding layer\n        self.emb1 = nn.Embedding(c_a, c_h)\n        self.emb2 = nn.Embedding(c_a, c_h)\n        self.emb3 = nn.Embedding(c_a, c_h)\n        self.emb4 = nn.Embedding(c_a, c_h)\n        self.emb5 = nn.Embedding(c_a, c_h)\n\n    def conv_block(self, x, conv_layers, norm_layer, emb, res=True):\n        # first layer\n        x_add = x + emb.view(emb.size(0), emb.size(1), 1)\n        out = pad_layer(x_add, conv_layers[0])\n        out = F.leaky_relu(out, negative_slope=self.ns)\n        # upsample by pixelshuffle\n        out = pixel_shuffle_1d(out, upscale_factor=2)\n        out = out + emb.view(emb.size(0), emb.size(1), 1)\n        out = pad_layer(out, conv_layers[1])\n        out = F.leaky_relu(out, negative_slope=self.ns)\n        out = norm_layer(out)\n        if res:\n            x_up = upsample(x, scale_factor=2)\n            out = out + x_up\n        return out\n\n    def dense_block(self, x, emb, layers, norm_layer, res=True):\n        out = x\n        for layer in layers:\n            out = out + emb.view(emb.size(0), emb.size(1), 1)\n            out = linear(out, layer)\n            out = F.leaky_relu(out, negative_slope=self.ns)\n        out = norm_layer(out)\n        if res:\n            out = out + x\n        return out\n\n    def forward(self, x, c):\n        # conv layer\n        out = self.conv_block(x, [self.conv1, self.conv2], self.ins_norm1, self.emb1(c), res=True )\n        out = self.conv_block(out, [self.conv3, self.conv4], self.ins_norm2, self.emb2(c), res=True)\n        out = self.conv_block(out, [self.conv5, self.conv6], self.ins_norm3, self.emb3(c), res=True)\n        # dense layer\n        out = self.dense_block(out, self.emb4(c), [self.dense1, self.dense2], self.ins_norm4, res=True)\n        out = self.dense_block(out, self.emb4(c), [self.dense3, self.dense4], self.ins_norm5, res=True)\n        emb = self.emb5(c)\n        out_add = out + emb.view(emb.size(0), emb.size(1), 1)\n        # rnn layer\n        out_rnn = RNN(out_add, self.RNN)\n        out = torch.cat([out, out_rnn], dim=1)\n        out = append_emb(self.emb5(c), out.size(2), out)\n        out = linear(out, self.dense5)\n        out = F.leaky_relu(out, negative_slope=self.ns)\n        out = linear(out, self.linear)\n        #out = torch.tanh(out)\n        return out\n\nclass Encoder(nn.Module):\n    def __init__(self, c_in=513, c_h1=128, c_h2=512, c_h3=128, ns=0.2, dp=0.5):\n        super(Encoder, self).__init__()\n        
self.ns = ns\n        self.conv1s = nn.ModuleList(\n                [nn.Conv1d(c_in, c_h1, kernel_size=k) for k in range(1, 8)]\n            )\n        self.conv2 = nn.Conv1d(len(self.conv1s)*c_h1 + c_in, c_h2, kernel_size=1)\n        self.conv3 = nn.Conv1d(c_h2, c_h2, kernel_size=5)\n        self.conv4 = nn.Conv1d(c_h2, c_h2, kernel_size=5, stride=2)\n        self.conv5 = nn.Conv1d(c_h2, c_h2, kernel_size=5)\n        self.conv6 = nn.Conv1d(c_h2, c_h2, kernel_size=5, stride=2)\n        self.conv7 = nn.Conv1d(c_h2, c_h2, kernel_size=5)\n        self.conv8 = nn.Conv1d(c_h2, c_h2, kernel_size=5, stride=2)\n        self.dense1 = nn.Linear(c_h2, c_h2)\n        self.dense2 = nn.Linear(c_h2, c_h2)\n        self.dense3 = nn.Linear(c_h2, c_h2)\n        self.dense4 = nn.Linear(c_h2, c_h2)\n        self.RNN = nn.GRU(input_size=c_h2, hidden_size=c_h3, num_layers=1, bidirectional=True)\n        self.linear = nn.Linear(c_h2 + 2*c_h3, c_h2)\n        # normalization layer\n        self.ins_norm1 = nn.InstanceNorm1d(c_h2)\n        self.ins_norm2 = nn.InstanceNorm1d(c_h2)\n        self.ins_norm3 = nn.InstanceNorm1d(c_h2)\n        self.ins_norm4 = nn.InstanceNorm1d(c_h2)\n        self.ins_norm5 = nn.InstanceNorm1d(c_h2)\n        self.ins_norm6 = nn.InstanceNorm1d(c_h2)\n        # dropout layer\n        self.drop1 = nn.Dropout(p=dp)\n        self.drop2 = nn.Dropout(p=dp)\n        self.drop3 = nn.Dropout(p=dp)\n        self.drop4 = nn.Dropout(p=dp)\n        self.drop5 = nn.Dropout(p=dp)\n        self.drop6 = nn.Dropout(p=dp)\n\n    def conv_block(self, x, conv_layers, norm_layers, res=True):\n        out = x\n        for layer in conv_layers:\n            out = pad_layer(out, layer)\n            out = F.leaky_relu(out, negative_slope=self.ns)\n        for layer in norm_layers:\n            out = layer(out)\n        if res:\n            x_pad = F.pad(x, pad=(0, x.size(2) % 2), mode='reflect')\n            x_down = F.avg_pool1d(x_pad, kernel_size=2)\n            out = x_down + out \n        return out\n\n    def dense_block(self, x, layers, norm_layers, res=True):\n        out = x\n        for layer in layers:\n            out = linear(out, layer)\n            out = F.leaky_relu(out, negative_slope=self.ns)\n        for layer in norm_layers:\n            out = layer(out)\n        if res:\n            out = out + x\n        return out\n\n    def forward(self, x):\n        outs = []\n        for l in self.conv1s:\n            out = pad_layer(x, l)\n            outs.append(out)\n        out = torch.cat(outs + [x], dim=1)\n        out = F.leaky_relu(out, negative_slope=self.ns)\n        out = self.conv_block(out, [self.conv2], [self.ins_norm1, self.drop1], res=False)\n        out = self.conv_block(out, [self.conv3, self.conv4], [self.ins_norm2, self.drop2])\n        out = self.conv_block(out, [self.conv5, self.conv6], [self.ins_norm3, self.drop3])\n        out = self.conv_block(out, [self.conv7, self.conv8], [self.ins_norm4, self.drop4])\n        # dense layer\n        out = self.dense_block(out, [self.dense1, self.dense2], [self.ins_norm5, self.drop5], res=True)\n        out = self.dense_block(out, [self.dense3, self.dense4], [self.ins_norm6, self.drop6], res=True)\n        out_rnn = RNN(out, self.RNN)\n        out = torch.cat([out, out_rnn], dim=1)\n        out = linear(out, self.linear)\n        out = F.leaky_relu(out, negative_slope=self.ns)\n        return out\n"
  },
  {
    "path": "preprocess/make_dataset_vctk.py",
    "content": "import h5py\nimport numpy as np\nimport sys\nimport os\nimport glob\nimport re\nfrom collections import defaultdict\n#from tacotron.audio import load_wav, spectrogram, melspectrogram\nfrom tacotron.norm_utils import get_spectrograms \n\ndef read_speaker_info(path='/storage/datasets/VCTK/VCTK-Corpus/speaker-info.txt'):\n    accent2speaker = defaultdict(lambda: [])\n    with open(path) as f:\n        splited_lines = [line.strip().split() for line in f][1:]\n        speakers = [line[0] for line in splited_lines]\n        regions = [line[3] for line in splited_lines]\n        for speaker, region in zip(speakers, regions):\n            accent2speaker[region].append(speaker)\n    return accent2speaker\n\n\nif __name__ == '__main__':\n    if len(sys.argv) < 4:\n        print('usage: python3 make_dataset_vctk.py [data root directory (VCTK-Corpus)] [h5py path] '\n                '[training proportion]')\n        exit(0)\n\n    root_dir = sys.argv[1]\n    h5py_path = sys.argv[2]\n    proportion = float(sys.argv[3])\n\n    accent2speaker = read_speaker_info(os.path.join(root_dir, 'speaker-info.txt'))\n    filename_groups = defaultdict(lambda : [])\n    with h5py.File(h5py_path, 'w') as f_h5:\n        filenames = sorted(glob.glob(os.path.join(root_dir, 'wav48/*/*.wav')))\n        for filename in filenames:\n            # divide into groups\n            sub_filename = filename.strip().split('/')[-1]\n            # format: p{speaker}_{sid}.wav\n            speaker_id, utt_id = re.match(r'p(\\d+)_(\\d+)\\.wav', sub_filename).groups()\n            filename_groups[speaker_id].append(filename)\n        for speaker_id, filenames in filename_groups.items():\n            # only use the speakers who are English accent.\n            if speaker_id not in accent2speaker['English']:\n                continue\n            print('processing {}'.format(speaker_id))\n            train_size = int(len(filenames) * proportion)\n            for i, filename in enumerate(filenames):\n                sub_filename = filename.strip().split('/')[-1]\n                # format: p{speaker}_{sid}.wav\n                speaker_id, utt_id = re.match(r'p(\\d+)_(\\d+)\\.wav', sub_filename).groups()\n                _, lin_spec = get_spectrograms(filename)\n                if i < train_size:\n                    datatype = 'train'\n                else:\n                    datatype = 'test'\n                f_h5.create_dataset(f'{datatype}/{speaker_id}/{utt_id}', \\\n                    data=lin_spec, dtype=np.float32)\n"
  },
  {
    "path": "preprocess/make_single_samples.py",
    "content": "import sys\nimport h5py\nimport numpy as np\nimport json\nfrom collections import namedtuple\nimport random\n\nclass Sampler(object):\n    def __init__(self, h5_path, dset, seg_len, used_speaker_path):\n        self.dset = dset\n        self.f_h5 = h5py.File(h5_path, 'r')\n        self.seg_len = seg_len\n        self.utt2len = self.get_utt_len()\n        self.speakers = self.read_speakers(used_speaker_path)\n        self.n_speaker = len(self.speakers)\n        print(self.speakers)\n        self.speaker2utts = {speaker:list(self.f_h5[f'{dset}/{speaker}'].keys()) for speaker in self.speakers}\n        # remove too short utterence\n        self.rm_too_short_utt(limit=self.seg_len)\n        self.single_indexer = namedtuple('single_index', ['speaker', 'i', 't'])\n\n    def get_utt_len(self):\n        utt2len = {}\n        for dset in ['train', 'test']:\n            for speaker in self.f_h5[f'{dset}']:\n                for utt_id in self.f_h5[f'{dset}/{speaker}']:\n                    length = self.f_h5[f'{dset}/{speaker}/{utt_id}'][()].shape[0]\n                    utt2len[(speaker, utt_id)] = length\n        return utt2len\n\n    def rm_too_short_utt(self, limit):\n        for (speaker, utt_id), length in self.utt2len.items():\n            if speaker in self.speakers and length <= limit and utt_id in self.speaker2utts[speaker]:\n                self.speaker2utts[speaker].remove(utt_id)\n\n    def read_speakers(self, path):\n        with open(path) as f:\n            speakers = [line.strip() for line in f]\n            return speakers     \n\n    def sample_utt(self, speaker_id, n_samples=1):\n        # sample an utterence\n        dset = self.dset\n        utt_ids = random.sample(self.speaker2utts[speaker_id], n_samples)\n        lengths = [self.f_h5[f'{dset}/{speaker_id}/{utt_id}'].shape[0] for utt_id in utt_ids]\n        return [(utt_id, length) for utt_id, length in zip(utt_ids, lengths)]\n\n    def rand(self, l):\n        rand_idx = random.randint(0, len(l) - 1)\n        return l[rand_idx]\n\n    def sample_single(self):\n        seg_len = self.seg_len\n        speaker_idx, = random.sample(range(len(self.speakers)), 1)\n        speaker = self.speakers[speaker_idx]\n        (utt_id, utt_len), = self.sample_utt(speaker, 1)\n        t = random.randint(0, utt_len - seg_len)  \n        index_tuple = self.single_indexer(speaker=speaker_idx, i=f'{speaker}/{utt_id}', t=t)\n        return index_tuple\n\nif __name__ == '__main__':\n    if len(sys.argv) < 6:\n        print('usage: python3 make_single_samples.py [h5py path] [training sampled index path (.json)] '\n                '[n_samples] [segment length] [used speaker file path]')\n        exit(0)\n\n    h5py_path = sys.argv[1]\n    output_path = sys.argv[2]\n    n_samples = int(sys.argv[3])\n    segment_len = int(sys.argv[4])\n    speaker_path = sys.argv[5]\n\n    sampler = Sampler(h5_path=h5py_path, seg_len=segment_len, dset='train', used_speaker_path=speaker_path)\n    samples = [sampler.sample_single()._asdict() for _ in range(n_samples)]\n    with open(sys.argv[2], 'w') as f_json:\n        json.dump(samples, f_json, indent=4, separators=(',', ': '))\n"
  },
  {
    "path": "preprocess/preprocess.sh",
    "content": ". vctk.config\npython3 make_dataset_vctk.py $data_root_dir $h5py_path $train_proportion\npython3 make_single_samples.py $h5py_path $index_path $n_samples $seg_len $speaker_used_path\n"
  },
  {
    "path": "preprocess/tacotron/norm_utils.py",
    "content": "# -*- coding: utf-8 -*-\n# /usr/bin/python2\n'''\nBy kyubyong park. kbpark.linguist@gmail.com.\nhttps://www.github.com/kyubyong/dc_tts\n'''\nfrom __future__ import print_function, division\n\n#from hyperparams import Hyperparams as hp\nimport numpy as np\n#import tensorflow as tf\nimport librosa\nimport copy\nimport matplotlib\nmatplotlib.use('pdf')\nimport matplotlib.pyplot as plt\nfrom scipy import signal\nimport os\n\nclass hyperparams(object):\n    def __init__(self):\n        self.max_duration = 10.0\n\n        # signal processing\n        self.sr = 16000 # Sample rate.\n        self.n_fft = 1024 # fft points (samples)\n        self.frame_shift = 0.0125 # seconds\n        self.frame_length = 0.05 # seconds\n        self.hop_length = int(self.sr*self.frame_shift) # samples.\n        self.win_length = int(self.sr*self.frame_length) # samples.\n        self.n_mels = 80 # Number of Mel banks to generate\n        self.power = 1.2 # Exponent for amplifying the predicted magnitude\n        self.n_iter = 300 # Number of inversion iterations\n        self.preemphasis = .97 # or None\n        self.max_db = 100\n        self.ref_db = 20\n\nhp = hyperparams()\n\ndef get_spectrograms(fpath):\n    '''Returns normalized log(melspectrogram) and log(magnitude) from `sound_file`.\n    Args:\n      sound_file: A string. The full path of a sound file.\n\n    Returns:\n      mel: A 2d array of shape (T, n_mels) <- Transposed\n      mag: A 2d array of shape (T, 1+n_fft/2) <- Transposed\n    '''\n    # num = np.random.randn()\n    # if num < .2:\n    #     y, sr = librosa.load(fpath, sr=hp.sr)\n    # else:\n    #     if num < .4:\n    #         tempo = 1.1\n    #     elif num < .6:\n    #         tempo = 1.2\n    #     elif num < .8:\n    #         tempo = 0.9\n    #     else:\n    #         tempo = 0.8\n    #     cmd = \"ffmpeg -i {} -y ar {} -hide_banner -loglevel panic -ac 1 -filter:a atempo={} -vn temp.wav\".format(fpath, hp.sr, tempo)\n    #     os.system(cmd)\n    #     y, sr = librosa.load('temp.wav', sr=hp.sr)\n\n    # Loading sound file\n    y, sr = librosa.load(fpath, sr=hp.sr)\n\n\n    # Trimming\n    y, _ = librosa.effects.trim(y)\n\n    # Preemphasis\n    y = np.append(y[0], y[1:] - hp.preemphasis * y[:-1])\n\n    # stft\n    linear = librosa.stft(y=y,\n                          n_fft=hp.n_fft,\n                          hop_length=hp.hop_length,\n                          win_length=hp.win_length)\n\n    # magnitude spectrogram\n    mag = np.abs(linear)  # (1+n_fft//2, T)\n\n    # mel spectrogram\n    mel_basis = librosa.filters.mel(hp.sr, hp.n_fft, hp.n_mels)  # (n_mels, 1+n_fft//2)\n    mel = np.dot(mel_basis, mag)  # (n_mels, t)\n\n    # to decibel\n    mel = 20 * np.log10(np.maximum(1e-5, mel))\n    mag = 20 * np.log10(np.maximum(1e-5, mag))\n\n    # normalize\n    mel = np.clip((mel - hp.ref_db + hp.max_db) / hp.max_db, 1e-8, 1)\n    mag = np.clip((mag - hp.ref_db + hp.max_db) / hp.max_db, 1e-8, 1)\n\n    # Transpose\n    mel = mel.T.astype(np.float32)  # (T, n_mels)\n    mag = mag.T.astype(np.float32)  # (T, 1+n_fft//2)\n\n    return mel, mag\n\n\ndef spectrogram2wav(mag):\n    '''# Generate wave file from spectrogram'''\n    # transpose\n    mag = mag.T\n\n    # de-noramlize\n    mag = (np.clip(mag, 0, 1) * hp.max_db) - hp.max_db + hp.ref_db\n\n    # to amplitude\n    mag = np.power(10.0, mag * 0.05)\n\n    # wav reconstruction\n    wav = griffin_lim(mag)\n\n    # de-preemphasis\n    wav = signal.lfilter([1], [1, -hp.preemphasis], wav)\n\n    # trim\n    wav, _ = 
librosa.effects.trim(wav)\n\n    return wav.astype(np.float32)\n\n\ndef griffin_lim(spectrogram):\n    '''Applies Griffin-Lim's raw.\n    '''\n    X_best = copy.deepcopy(spectrogram)\n    for i in range(hp.n_iter):\n        X_t = invert_spectrogram(X_best)\n        est = librosa.stft(X_t, hp.n_fft, hp.hop_length, win_length=hp.win_length)\n        phase = est / np.maximum(1e-8, np.abs(est))\n        X_best = spectrogram * phase\n    X_t = invert_spectrogram(X_best)\n    y = np.real(X_t)\n\n    return y\n\n\ndef invert_spectrogram(spectrogram):\n    '''\n    spectrogram: [f, t]\n    '''\n    return librosa.istft(spectrogram, hp.hop_length, win_length=hp.win_length, window=\"hann\")\n\n\ndef plot_alignment(alignment, gs):\n    \"\"\"Plots the alignment\n    alignments: A list of (numpy) matrix of shape (encoder_steps, decoder_steps)\n    gs : (int) global step\n    \"\"\"\n    fig, ax = plt.subplots()\n    im = ax.imshow(alignment)\n\n    # cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])\n    fig.colorbar(im)\n    plt.title('{} Steps'.format(gs))\n    plt.savefig('{}/alignment_{}k.png'.format(hp.logdir, gs//1000), format='png')\n\ndef learning_rate_decay(init_lr, global_step, warmup_steps=4000.):\n    '''Noam scheme from tensor2tensor'''\n    step = tf.cast(global_step + 1, dtype=tf.float32)\n    return init_lr * warmup_steps ** 0.5 * tf.minimum(step * warmup_steps ** -1.5, step ** -0.5)\n\ndef load_spectrograms(fpath):\n    fname = os.path.basename(fpath)\n    mel, mag = get_spectrograms(fpath)\n    t = mel.shape[0]\n    num_paddings = hp.r - (t % hp.r) if t % hp.r != 0 else 0 # for reduction\n    mel = np.pad(mel, [[0, num_paddings], [0, 0]], mode=\"constant\")\n    mag = np.pad(mag, [[0, num_paddings], [0, 0]], mode=\"constant\")\n    return fname, mel.reshape((-1, hp.n_mels*hp.r)), mag\n"
  },
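  {
    "path": "examples/spectrogram_roundtrip.py",
    "content": "# A small sketch (assuming a local 16 kHz .wav file; the path below is a\n# placeholder) that round-trips the tacotron feature utilities used during\n# preprocessing: extract the normalized spectrograms, then invert the\n# magnitude spectrogram with Griffin-Lim. Run from the repository root.\nimport soundfile as sf\nfrom preprocess.tacotron.norm_utils import get_spectrograms, spectrogram2wav\n\nmel, mag = get_spectrograms('/path/to/example.wav')  # placeholder path\nprint('mel:', mel.shape, 'mag:', mag.shape)  # (T, 80) and (T, 513) for n_fft=1024\nwav = spectrogram2wav(mag)\nsf.write('reconstructed.wav', wav, 16000, 'PCM_24')\n"
  },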
  {
    "path": "solver.py",
    "content": "import torch\nfrom torch import optim\nfrom torch.autograd import Variable\nfrom torch import nn\nimport torch.nn.functional as F\nimport numpy as np\nimport pickle\nfrom model import Encoder\nfrom model import Decoder\nfrom model import SpeakerClassifier\nfrom model import PatchDiscriminator\nimport os\nfrom utils import Hps\nfrom utils import Logger\nfrom utils import DataLoader\nfrom utils import to_var\nfrom utils import reset_grad\nfrom utils import grad_clip\nfrom utils import cal_acc\nfrom utils import cc\nfrom utils import calculate_gradients_penalty\nfrom utils import gen_noise\n\nclass Solver(object):\n    def __init__(self, hps, data_loader, log_dir='./log/'):\n        self.hps = hps\n        self.data_loader = data_loader\n        self.model_kept = []\n        self.max_keep = 100\n        self.build_model()\n        self.logger = Logger(log_dir)\n\n    def build_model(self):\n        hps = self.hps\n        ns = self.hps.ns\n        emb_size = self.hps.emb_size\n        self.Encoder = cc(Encoder(ns=ns, dp=hps.enc_dp))\n        self.Decoder = cc(Decoder(ns=ns, c_a=hps.n_speakers, emb_size=emb_size))\n        self.Generator = cc(Decoder(ns=ns, c_a=hps.n_speakers, emb_size=emb_size))\n        self.SpeakerClassifier = cc(SpeakerClassifier(ns=ns, n_class=hps.n_speakers, dp=hps.dis_dp))\n        self.PatchDiscriminator = cc(nn.DataParallel(PatchDiscriminator(ns=ns, n_class=hps.n_speakers)))\n        betas = (0.5, 0.9)\n        params = list(self.Encoder.parameters()) + list(self.Decoder.parameters())\n        self.ae_opt = optim.Adam(params, lr=self.hps.lr, betas=betas)\n        self.clf_opt = optim.Adam(self.SpeakerClassifier.parameters(), lr=self.hps.lr, betas=betas)\n        self.gen_opt = optim.Adam(self.Generator.parameters(), lr=self.hps.lr, betas=betas)\n        self.patch_opt = optim.Adam(self.PatchDiscriminator.parameters(), lr=self.hps.lr, betas=betas)\n\n    def save_model(self, model_path, iteration, enc_only=True):\n        if not enc_only:\n            all_model = {\n                'encoder': self.Encoder.state_dict(),\n                'decoder': self.Decoder.state_dict(),\n                'generator': self.Generator.state_dict(),\n                'classifier': self.SpeakerClassifier.state_dict(),\n                'patch_discriminator': self.PatchDiscriminator.state_dict(),\n            }\n        else:\n            all_model = {\n                'encoder': self.Encoder.state_dict(),\n                'decoder': self.Decoder.state_dict(),\n                'generator': self.Generator.state_dict(),\n            }\n        new_model_path = '{}-{}'.format(model_path, iteration)\n        with open(new_model_path, 'wb') as f_out:\n            torch.save(all_model, f_out)\n        self.model_kept.append(new_model_path)\n\n        if len(self.model_kept) >= self.max_keep:\n            os.remove(self.model_kept[0])\n            self.model_kept.pop(0)\n\n    def load_model(self, model_path, enc_only=True):\n        print('load model from {}'.format(model_path))\n        with open(model_path, 'rb') as f_in:\n            all_model = torch.load(f_in)\n            self.Encoder.load_state_dict(all_model['encoder'])\n            self.Decoder.load_state_dict(all_model['decoder'])\n            self.Generator.load_state_dict(all_model['generator'])\n            if not enc_only:\n                self.SpeakerClassifier.load_state_dict(all_model['classifier'])\n                self.PatchDiscriminator.load_state_dict(all_model['patch_discriminator'])\n\n    def 
set_eval(self):\n        self.Encoder.eval()\n        self.Decoder.eval()\n        self.Generator.eval()\n        self.SpeakerClassifier.eval()\n        self.PatchDiscriminator.eval()\n\n    def test_step(self, x, c, gen=False):\n        self.set_eval()\n        x = to_var(x).permute(0, 2, 1)\n        enc = self.Encoder(x)\n        x_tilde = self.Decoder(enc, c)\n        if gen:\n            x_tilde += self.Generator(enc, c)\n        return x_tilde.data.cpu().numpy()\n\n    def permute_data(self, data):\n        C = to_var(data[0], requires_grad=False)\n        X = to_var(data[1]).permute(0, 2, 1)\n        return C, X\n\n    def sample_c(self, size):\n        n_speakers = self.hps.n_speakers\n        c_sample = Variable(\n                torch.multinomial(torch.ones(n_speakers), num_samples=size, replacement=True),  \n                requires_grad=False)\n        c_sample = c_sample.cuda() if torch.cuda.is_available() else c_sample\n        return c_sample\n\n    def encode_step(self, x):\n        enc = self.Encoder(x)\n        return enc\n\n    def decode_step(self, enc, c):\n        x_tilde = self.Decoder(enc, c)\n        return x_tilde\n\n    def patch_step(self, x, x_tilde, is_dis=True):\n        D_real, real_logits = self.PatchDiscriminator(x, classify=True)\n        D_fake, fake_logits = self.PatchDiscriminator(x_tilde, classify=True)\n        if is_dis:\n            w_dis = torch.mean(D_real - D_fake)\n            gp = calculate_gradients_penalty(self.PatchDiscriminator, x, x_tilde)\n            return w_dis, real_logits, gp\n        else:\n            return -torch.mean(D_fake), fake_logits\n\n    def gen_step(self, enc, c):\n        x_gen = self.Decoder(enc, c) + self.Generator(enc, c)\n        return x_gen \n\n    def clf_step(self, enc):\n        logits = self.SpeakerClassifier(enc)\n        return logits\n\n    def cal_loss(self, logits, y_true):\n        # calculate loss \n        criterion = nn.CrossEntropyLoss()\n        loss = criterion(logits, y_true)\n        return loss\n\n    def train(self, model_path, flag='train', mode='train'):\n        # load hyperparams\n        hps = self.hps\n        if mode == 'pretrain_G':\n            for iteration in range(hps.enc_pretrain_iters):\n                data = next(self.data_loader)\n                c, x = self.permute_data(data)\n                # encode\n                enc = self.encode_step(x)\n                x_tilde = self.decode_step(enc, c)\n                loss_rec = torch.mean(torch.abs(x_tilde - x))\n                reset_grad([self.Encoder, self.Decoder])\n                loss_rec.backward()\n                grad_clip([self.Encoder, self.Decoder], self.hps.max_grad_norm)\n                self.ae_opt.step()\n                # tb info\n                info = {\n                    f'{flag}/pre_loss_rec': loss_rec.item(),\n                }\n                slot_value = (iteration + 1, hps.enc_pretrain_iters) + tuple([value for value in info.values()])\n                log = 'pre_G:[%06d/%06d], loss_rec=%.3f'\n                print(log % slot_value)\n                if iteration % 100 == 0:\n                    for tag, value in info.items():\n                        self.logger.scalar_summary(tag, value, iteration + 1)\n        elif mode == 'pretrain_D':\n            for iteration in range(hps.dis_pretrain_iters):\n                data = next(self.data_loader)\n                c, x = self.permute_data(data)\n                # encode\n                enc = self.encode_step(x)\n                # classify speaker\n          
      logits = self.clf_step(enc)\n                loss_clf = self.cal_loss(logits, c)\n                # update\n                reset_grad([self.SpeakerClassifier])\n                loss_clf.backward()\n                grad_clip([self.SpeakerClassifier], self.hps.max_grad_norm)\n                self.clf_opt.step()\n                # calculate acc\n                acc = cal_acc(logits, c)\n                info = {\n                    f'{flag}/pre_loss_clf': loss_clf.item(),\n                    f'{flag}/pre_acc': acc,\n                }\n                slot_value = (iteration + 1, hps.dis_pretrain_iters) + tuple([value for value in info.values()])\n                log = 'pre_D:[%06d/%06d], loss_clf=%.2f, acc=%.2f'\n                print(log % slot_value)\n                if iteration % 100 == 0:\n                    for tag, value in info.items():\n                        self.logger.scalar_summary(tag, value, iteration + 1)\n        elif mode == 'patchGAN':\n            for iteration in range(hps.patch_iters):\n                #=======train D=========#\n                for step in range(hps.n_patch_steps):\n                    data = next(self.data_loader)\n                    c, x = self.permute_data(data)\n                    # encode\n                    enc = self.encode_step(x)\n                    # sample c\n                    c_prime = self.sample_c(x.size(0))\n                    # generator\n                    x_tilde = self.gen_step(enc, c_prime)\n                    # discriminator\n                    w_dis, real_logits, gp = self.patch_step(x, x_tilde, is_dis=True)\n                    # auxiliary classification loss\n                    loss_clf = self.cal_loss(real_logits, c)\n                    loss = -hps.beta_dis * w_dis + hps.beta_clf * loss_clf + hps.lambda_ * gp\n                    reset_grad([self.PatchDiscriminator])\n                    loss.backward()\n                    grad_clip([self.PatchDiscriminator], self.hps.max_grad_norm)\n                    self.patch_opt.step()\n                    # calculate acc\n                    acc = cal_acc(real_logits, c)\n                    info = {\n                        f'{flag}/w_dis': w_dis.item(),\n                        f'{flag}/gp': gp.item(),\n                        f'{flag}/real_loss_clf': loss_clf.item(),\n                        f'{flag}/real_acc': acc,\n                    }\n                    slot_value = (step, iteration+1, hps.patch_iters) + tuple([value for value in info.values()])\n                    log = 'patch_D-%d:[%06d/%06d], w_dis=%.2f, gp=%.2f, loss_clf=%.2f, acc=%.2f'\n                    print(log % slot_value)\n                    if iteration % 100 == 0:\n                        for tag, value in info.items():\n                            self.logger.scalar_summary(tag, value, iteration + 1)\n                #=======train G=========#\n                data = next(self.data_loader)\n                c, x = self.permute_data(data)\n                # encode\n                enc = self.encode_step(x)\n                # sample c\n                c_prime = self.sample_c(x.size(0))\n                # generator\n                x_tilde = self.gen_step(enc, c_prime)\n                # discriminator\n                loss_adv, fake_logits = self.patch_step(x, x_tilde, is_dis=False)\n                # auxiliary classification loss\n                loss_clf = self.cal_loss(fake_logits, c_prime)\n                loss = hps.beta_clf * loss_clf + hps.beta_gen * loss_adv\n                reset_grad([self.Generator])\n                loss.backward()\n                grad_clip([self.Generator], self.hps.max_grad_norm)\n                self.gen_opt.step()\n                # calculate acc\n                acc = cal_acc(fake_logits, c_prime)\n                info = {\n                    f'{flag}/loss_adv': loss_adv.item(),\n                    f'{flag}/fake_loss_clf': loss_clf.item(),\n                    f'{flag}/fake_acc': acc,\n                }\n                slot_value = (iteration+1, hps.patch_iters) + tuple([value for value in info.values()])\n                log = 'patch_G:[%06d/%06d], loss_adv=%.2f, loss_clf=%.2f, acc=%.2f'\n                print(log % slot_value)\n                if iteration % 100 == 0:\n                    for tag, value in info.items():\n                        self.logger.scalar_summary(tag, value, iteration + 1)\n                if iteration % 1000 == 0 or iteration + 1 == hps.patch_iters:\n                    self.save_model(model_path, iteration + hps.iters)\n        elif mode == 'train':\n            for iteration in range(hps.iters):\n                # calculate current alpha\n                if iteration < hps.lat_sched_iters:\n                    current_alpha = hps.alpha_enc * (iteration / hps.lat_sched_iters)\n                else:\n                    current_alpha = hps.alpha_enc\n                #==================train D==================#\n                for step in range(hps.n_latent_steps):\n                    data = next(self.data_loader)\n                    c, x = self.permute_data(data)\n                    # encode\n                    enc = self.encode_step(x)\n                    # classify speaker\n                    logits = self.clf_step(enc)\n                    loss_clf = self.cal_loss(logits, c)\n                    loss = hps.alpha_dis * loss_clf\n                    # update\n                    reset_grad([self.SpeakerClassifier])\n                    loss.backward()\n                    grad_clip([self.SpeakerClassifier], self.hps.max_grad_norm)\n                    self.clf_opt.step()\n                    # calculate acc\n                    acc = cal_acc(logits, c)\n                    info = {\n                        f'{flag}/D_loss_clf': loss_clf.item(),\n                        f'{flag}/D_acc': acc,\n                    }\n                    slot_value = (step, iteration + 1, hps.iters) + tuple([value for value in info.values()])\n                    log = 'D-%d:[%06d/%06d], loss_clf=%.2f, acc=%.2f'\n                    print(log % slot_value)\n                    if iteration % 100 == 0:\n                        for tag, value in info.items():\n                            self.logger.scalar_summary(tag, value, iteration + 1)\n                #==================train G==================#\n                data = next(self.data_loader)\n                c, x = self.permute_data(data)\n                # encode\n                enc = self.encode_step(x)\n                # decode\n                x_tilde = self.decode_step(enc, c)\n                loss_rec = torch.mean(torch.abs(x_tilde - x))\n                # classify speaker\n                logits = self.clf_step(enc)\n                acc = cal_acc(logits, c)\n                loss_clf = self.cal_loss(logits, c)\n                # maximize classification loss\n                loss = loss_rec - current_alpha * loss_clf\n                reset_grad([self.Encoder, self.Decoder])\n                loss.backward()\n                grad_clip([self.Encoder, self.Decoder], self.hps.max_grad_norm)\n                self.ae_opt.step()\n                info = {\n                    f'{flag}/loss_rec': loss_rec.item(),\n                    f'{flag}/G_loss_clf': loss_clf.item(),\n                    f'{flag}/alpha': current_alpha,\n                    f'{flag}/G_acc': acc,\n                }\n                slot_value = (iteration + 1, hps.iters) + tuple([value for value in info.values()])\n                log = 'G:[%06d/%06d], loss_rec=%.3f, loss_clf=%.2f, alpha=%.2e, acc=%.2f'\n                print(log % slot_value)\n                if iteration % 100 == 0:\n                    for tag, value in info.items():\n                        self.logger.scalar_summary(tag, value, iteration + 1)\n                if iteration % 1000 == 0 or iteration + 1 == hps.iters:\n                    self.save_model(model_path, iteration)\n\n"
  },
  {
    "path": "test.py",
    "content": "import torch\nfrom torch import optim\nfrom torch.autograd import Variable\nimport numpy as np\nimport pickle\nfrom utils import Hps\nfrom preprocess.tacotron.norm_utils import spectrogram2wav, get_spectrograms\nfrom scipy.io.wavfile import write\nimport glob\nimport os\nimport argparse\nfrom solver import Solver\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser.add_argument('-hps', help='The path of hyper-parameter set', default='vctk.json')\n    parser.add_argument('-model', '-m', help='The path of model checkpoint')\n    parser.add_argument('-source', '-s', help='The path of source .wav file')\n    parser.add_argument('-target', '-t', help='Target speaker id (integer). Same order as the speaker list when preprocessing (en_speaker_used.txt)')\n    parser.add_argument('-output', '-o', help='output .wav path')\n    parser.add_argument('-sample_rate', '-sr', default=16000, type=int)\n    parser.add_argument('--use_gen', default=True, action='store_true')\n\n    args = parser.parse_args()\n\n    hps = Hps()\n    hps.load(args.hps)\n    hps_tuple = hps.get_tuple()\n    solver = Solver(hps_tuple, None)\n    solver.load_model(args.model)\n    _, spec = get_spectrograms(args.source)\n    spec_expand = np.expand_dims(spec, axis=0)\n    spec_tensor = torch.from_numpy(spec_expand).type(torch.FloatTensor)\n    c = Variable(torch.from_numpy(np.array([int(args.target)]))).cuda()\n    result = solver.test_step(spec_tensor, c, gen=args.use_gen)\n    result = result.squeeze(axis=0).transpose((1, 0))\n    wav_data = spectrogram2wav(result)\n    write(args.output, rate=args.sample_rate, data=wav_data)\n"
  },
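  {
    "path": "examples/test_example.sh",
    "content": "# An illustrative inference invocation, assuming a trained checkpoint.\n# Paths are placeholders; -t is the target speaker index, in the same order\n# as en_speaker_used.txt.\npython3 test.py \\\n    -hps vctk.json \\\n    -m /path/to/model/model.pkl-79999 \\\n    -s /path/to/source.wav \\\n    -t 2 \\\n    -o converted.wav \\\n    -sr 16000\n"
  },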
  {
    "path": "utils.py",
    "content": "import json\nimport h5py\nimport pickle\nimport os\nfrom collections import defaultdict\nfrom collections import namedtuple\nimport numpy as np\nimport math\nimport argparse\nimport random\nimport time\nimport torch\nfrom torch.utils import data\nfrom tensorboardX import SummaryWriter\nfrom torch.autograd import Variable\n\ndef cc(net):\n    if torch.cuda.is_available():\n        return net.cuda()\n    else:\n        return net\n\ndef gen_noise(x_dim, y_dim):\n    x = torch.randn(x_dim, 1) \n    y = torch.randn(1, y_dim)\n    return x @ y\n\ndef to_var(x, requires_grad=True):\n    x = Variable(x, requires_grad=requires_grad)\n    return x.cuda() if torch.cuda.is_available() else x\n\ndef reset_grad(net_list):\n    for net in net_list:\n        net.zero_grad()\n\ndef grad_clip(net_list, max_grad_norm):\n    for net in net_list:\n        torch.nn.utils.clip_grad_norm_(net.parameters(), max_grad_norm)\n\ndef calculate_gradients_penalty(netD, real_data, fake_data):\n    alpha = torch.rand(real_data.size(0))\n    alpha = alpha.view(real_data.size(0), 1, 1)\n    alpha = alpha.cuda() if torch.cuda.is_available() else alpha\n    alpha = Variable(alpha)\n    interpolates = alpha * real_data + (1 - alpha) * fake_data\n\n    disc_interpolates = netD(interpolates)\n\n    use_cuda = torch.cuda.is_available()\n    grad_outputs = torch.ones(disc_interpolates.size()).cuda() if use_cuda else torch.ones(disc_interpolates.size())\n\n    gradients = torch.autograd.grad(\n        outputs=disc_interpolates,\n        inputs=interpolates,\n        grad_outputs=grad_outputs,\n        create_graph=True, retain_graph=True, only_inputs=True)[0]\n    gradients_penalty = (1. - torch.sqrt(1e-12 + torch.sum(gradients.view(gradients.size(0), -1)**2, dim=1))) ** 2\n    gradients_penalty = torch.mean(gradients_penalty)\n    return gradients_penalty\n\ndef cal_acc(logits, y_true):\n    _, ind = torch.max(logits, dim=1)\n    acc = torch.sum((ind == y_true).type(torch.FloatTensor)) / y_true.size(0)\n    return acc\n\nclass Hps(object):\n    def __init__(self):\n        self.hps = namedtuple('hps', [\n            'lr',\n            'alpha_dis',\n            'alpha_enc',\n            'beta_dis', \n            'beta_gen', \n            'beta_clf',\n            'lambda_',\n            'ns', \n            'enc_dp', \n            'dis_dp', \n            'max_grad_norm',\n            'seg_len',\n            'emb_size',\n            'n_speakers',\n            'n_latent_steps',\n            'n_patch_steps', \n            'batch_size',\n            'lat_sched_iters',\n            'enc_pretrain_iters',\n            'dis_pretrain_iters',\n            'patch_iters', \n            'iters',\n            ]\n        )\n        default = \\\n            [1e-4, 1, 1e-4, 0, 0, 0, 10, 0.01, 0.5, 0.1, 5, 128, 128, 8, 5, 0, 32, 50000, 5000, 5000, 30000, 60000]\n        self._hps = self.hps._make(default)\n\n    def get_tuple(self):\n        return self._hps\n\n    def load(self, path):\n        with open(path, 'r') as f_json:\n            hps_dict = json.load(f_json)\n        self._hps = self.hps(**hps_dict)\n\n    def dump(self, path):\n        with open(path, 'w') as f_json:\n            json.dump(self._hps._asdict(), f_json, indent=4, separators=(',', ': '))\n\nclass DataLoader(object):\n    def __init__(self, dataset, batch_size=16):\n        self.dataset = dataset\n        self.n_elements = len(self.dataset[0])\n        self.batch_size = batch_size\n        self.index = 0\n\n    def all(self, size=1000):\n        samples = 
[self.dataset[self.index + i] for i in range(size)]\n        batch = [[s for s in sample] for sample in zip(*samples)]\n        batch_tensor = [torch.from_numpy(np.array(data)) for data in batch]\n\n        if self.index + 2 * self.batch_size >= len(self.dataset):\n            self.index = 0\n        else:\n            self.index += self.batch_size\n        return tuple(batch_tensor)\n\n    def __iter__(self):\n        return self\n\n    def __next__(self):\n        samples = [self.dataset[self.index + i] for i in range(self.batch_size)]\n        batch = [[s for s in sample] for sample in zip(*samples)]\n        batch_tensor = [torch.from_numpy(np.array(data)) for data in batch]\n\n        if self.index + 2 * self.batch_size >= len(self.dataset):\n            self.index = 0\n        else:\n            self.index += self.batch_size\n        return tuple(batch_tensor)\n\nclass SingleDataset(data.Dataset):\n    def __init__(self, h5_path, index_path, dset='train', seg_len=128):\n        self.dataset = h5py.File(h5_path, 'r')\n        with open(index_path) as f_index:\n            self.indexes = json.load(f_index)\n        self.indexer = namedtuple('index', ['speaker', 'i', 't'])\n        self.seg_len = seg_len\n        self.dset = dset\n\n    def __getitem__(self, i):\n        index = self.indexes[i]\n        index = self.indexer(**index)\n        speaker = index.speaker\n        i, t = index.i, index.t\n        seg_len = self.seg_len\n        data = [speaker, self.dataset[f'{self.dset}/{i}'][t:t+seg_len]]\n        return tuple(data)\n\n    def __len__(self):\n        return len(self.indexes)\n\nclass Logger(object):\n    def __init__(self, log_dir='./log'):\n        self.writer = SummaryWriter(log_dir)\n\n    def scalar_summary(self, tag, value, step):\n        self.writer.add_scalar(tag, value, step)\n\n"
  },
  {
    "path": "vctk.json",
    "content": "{\n    \"lr\": 0.0001,\n    \"alpha_dis\": 1,\n    \"alpha_enc\": 0.01,\n    \"beta_dis\": 1,\n    \"beta_gen\": 1,\n    \"beta_clf\": 1,\n    \"lambda_\": 10,\n    \"ns\": 0.01,\n    \"enc_dp\": 0.5,\n    \"dis_dp\": 0.3,\n    \"max_grad_norm\": 5,\n    \"seg_len\": 128,\n    \"emb_size\": 512,\n    \"n_speakers\": 20,\n    \"n_latent_steps\": 5,\n    \"n_patch_steps\": 5,\n    \"batch_size\": 32,\n    \"lat_sched_iters\": 50000,\n    \"enc_pretrain_iters\": 8000,\n    \"dis_pretrain_iters\": 20000,\n    \"patch_iters\": 50000,\n    \"iters\": 80000\n}\n"
  }
]