[
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2020 Kun Ma\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# **CycleGAN-VC3-PyTorch**\n\n[![standard-readme compliant](https://img.shields.io/badge/readme%20style-standard-brightgreen.svg?style=flat-square)](https://github.com/jackaduma/CycleGAN-VC2)\n[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://paypal.me/jackaduma?locale.x=zh_XC)\n\n[**中文说明**](./README.zh-CN.md) | [**English**](./README.md)\n\n------\n\nThis code is a **PyTorch** implementation for paper: [CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion](https://arxiv.org/abs/2010.11672]), a nice work on **Voice-Conversion/Voice Cloning**.\n\n- [x] Dataset\n  - [ ] VC\n- [x] Usage\n  - [x] Training\n  - [x] Example \n- [ ] Demo\n- [x] Reference\n\n------\n\n## **CycleGAN-VC3**\n\n### [**Project Page**](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/index.html) \n\n\nNon-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, CycleGAN-VC [3] and CycleGAN-VC2 [2] have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for **mel-spectrogram conversion**, they are typically used for mel-cepstrum conversion even when comparative methods employ mel-spectrogram as a conversion target. To address this, we examined the applicability of CycleGAN-VC/VC2 to **mel-spectrogram conversion**. Through initial experiments, we discovered that their direct applications compromised the time-frequency structure that should be preserved during conversion. To remedy this, we propose CycleGAN-VC3, an improvement of CycleGAN-VC2 that incorporates **time-frequency adaptive normalization (TFAN)**. Using TFAN, we can adjust the scale and bias of the converted features while reflecting the time-frequency structure of the source mel-spectrogram. 
We evaluated CycleGAN-VC3 on inter-gender and intra-gender non-parallel VC. A subjective evaluation of naturalness and similarity showed that for every VC pair, CycleGAN-VC3 outperforms or is competitive with the two types of CycleGAN-VC2, one of which was applied to mel-cepstrum and the other to mel-spectrogram.\n\n![network comparison](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/images/comparison.png \"comparison between vc2 and vc3\")  _Figure 1. We developed time-frequency adaptive normalization (TFAN), which extends instance normalization [5] so that the affine parameters become element-dependent and are determined according to an entire input mel-spectrogram._\n\n------\n\n**This repository contains:** \n\n1. [TFAN module code](tfan_module.py), which implements the TFAN module.\n2. [model code](model.py), which implements the model network.\n3. [audio preprocessing script](preprocess_training.py) that you can use to create a cache for the [training data](data).\n4. [training script](train.py) to train the model.\n\n\n\n------\n\n## **Table of Contents**\n\n- [**CycleGAN-VC3-PyTorch**](#cyclegan-vc3-pytorch)\n  - [**CycleGAN-VC3**](#cyclegan-vc3)\n    - [**Project Page**](#project-page)\n  - [**Table of Contents**](#table-of-contents)\n  - [**Requirement**](#requirement)\n  - [**Usage**](#usage)\n  - [**Star-History**](#star-history)\n  - [**Reference**](#reference)\n  - [Donation](#donation)\n  - [**License**](#license)\n  \n------\n\n## **Requirement** \n\n```bash\npip install -r requirements.txt\n```\n## **Usage**\n\n\n------\n\n## **Star-History**\n\n![star-history](https://api.star-history.com/svg?repos=jackaduma/CycleGAN-VC3&type=Date \"star-history\")\n\n------\n\n## **Reference**\n1. **CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion.** [Paper](https://arxiv.org/abs/2010.11672), [Project](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/index.html)\n2. 
CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. [Paper](https://arxiv.org/abs/1904.04631), [Project](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.html)\n3. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks. [Paper](https://arxiv.org/abs/1711.11293), [Project](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc/)\n4. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. [Paper](https://arxiv.org/abs/1703.10593), [Project](https://junyanz.github.io/CycleGAN/), [Code](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix)\n5. Image-to-Image Translation with Conditional Adversarial Nets. [Paper](https://arxiv.org/abs/1611.07004), [Project](https://phillipi.github.io/pix2pix/), [Code](https://github.com/phillipi/pix2pix)\n\n\n------\n\n## Donation\nIf this project helps you save development time, you can buy me a cup of coffee :) \n\nAliPay(支付宝)\n<div align=\"center\">\n\t<img src=\"./misc/ali_pay.png\" alt=\"ali_pay\" width=\"400\" />\n</div>\n\nWechatPay(微信)\n<div align=\"center\">\n    <img src=\"./misc/wechat_pay.png\" alt=\"wechat_pay\" width=\"400\" />\n</div>\n\n[![paypal](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://paypal.me/jackaduma?locale.x=zh_XC)\n\n\n------\n\n## **License**\n\n[MIT](LICENSE) © Kun"
  },
  {
    "path": "datasets.py",
    "content": ""
  },
  {
    "path": "feature_utils.py",
    "content": "#!python\n# -*- coding: utf-8 -*-\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom librosa.filters import mel as librosa_mel_fn\n\n\nclass Audio2Mel(nn.Module):\n    def __init__(\n            self,\n            n_fft=1024,\n            hop_length=256,\n            win_length=1024,\n            sampling_rate=22050,\n            n_mel_channels=80,\n            mel_fmin=0.0,\n            mel_fmax=None,\n    ):\n        super().__init__()\n        ##############################################\n        # FFT Parameters                              #\n        ##############################################\n        window = torch.hann_window(win_length).float()\n        mel_basis = librosa_mel_fn(\n            sampling_rate, n_fft, n_mel_channels, mel_fmin, mel_fmax\n        )\n        mel_basis = torch.from_numpy(mel_basis).float()\n        self.register_buffer(\"mel_basis\", mel_basis)\n        self.register_buffer(\"window\", window)\n        self.n_fft = n_fft\n        self.hop_length = hop_length\n        self.win_length = win_length\n        self.sampling_rate = sampling_rate\n        self.n_mel_channels = n_mel_channels\n\n    def forward(self, audio):\n        p = (self.n_fft - self.hop_length) // 2\n        audio = F.pad(audio, (p, p), \"reflect\").squeeze(1)\n        fft = torch.stft(\n            audio,\n            n_fft=self.n_fft,\n            hop_length=self.hop_length,\n            win_length=self.win_length,\n            window=self.window,\n            center=False,\n        )\n        real_part, imag_part = fft.unbind(-1)\n        magnitude = torch.sqrt(real_part ** 2 + imag_part ** 2)\n        mel_output = torch.matmul(self.mel_basis, magnitude)\n        log_mel_spec = torch.log10(torch.clamp(mel_output, min=1e-5))\n        return log_mel_spec\n"
  },
  {
    "path": "melgan_vocoder.py",
    "content": "#!python\n# -*- coding: utf-8 -*-\nimport os\nimport yaml\nfrom pathlib import Path\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn.utils import weight_norm\n\nfrom feature_utils import Audio2Mel\n\n\ndef weights_init(m):\n    classname = m.__class__.__name__\n    if classname.find(\"Conv\") != -1:\n        m.weight.data.normal_(0.0, 0.02)\n    elif classname.find(\"BatchNorm2d\") != -1:\n        m.weight.data.normal_(1.0, 0.02)\n        m.bias.data.fill_(0)\n\n\ndef WNConv1d(*args, **kwargs):\n    return weight_norm(nn.Conv1d(*args, **kwargs))\n\n\ndef WNConvTranspose1d(*args, **kwargs):\n    return weight_norm(nn.ConvTranspose1d(*args, **kwargs))\n\n\nclass ResnetBlock(nn.Module):\n    def __init__(self, dim, dilation=1):\n        super().__init__()\n        self.block = nn.Sequential(\n            nn.LeakyReLU(0.2),\n            nn.ReflectionPad1d(dilation),\n            WNConv1d(dim, dim, kernel_size=3, dilation=dilation),\n            nn.LeakyReLU(0.2),\n            WNConv1d(dim, dim, kernel_size=1),\n        )\n        self.shortcut = WNConv1d(dim, dim, kernel_size=1)\n\n    def forward(self, x):\n        return self.shortcut(x) + self.block(x)\n\n\nclass Generator(nn.Module):\n    def __init__(self, input_size, ngf, n_residual_layers):\n        super().__init__()\n        ratios = [8, 8, 2, 2]\n        self.hop_length = np.prod(ratios)\n        mult = int(2 ** len(ratios))\n\n        model = [\n            nn.ReflectionPad1d(3),\n            WNConv1d(input_size, mult * ngf, kernel_size=7, padding=0),\n        ]\n\n        # Upsample to raw audio scale\n        for i, r in enumerate(ratios):\n            model += [\n                nn.LeakyReLU(0.2),\n                WNConvTranspose1d(\n                    mult * ngf,\n                    mult * ngf // 2,\n                    kernel_size=r * 2,\n                    stride=r,\n                    padding=r // 2 + r % 2,\n                    output_padding=r % 2,\n                ),\n      
      ]\n\n            for j in range(n_residual_layers):\n                model += [ResnetBlock(mult * ngf // 2, dilation=3 ** j)]\n\n            mult //= 2\n\n        model += [\n            nn.LeakyReLU(0.2),\n            nn.ReflectionPad1d(3),\n            WNConv1d(ngf, 1, kernel_size=7, padding=0),\n            nn.Tanh(),\n        ]\n\n        self.model = nn.Sequential(*model)\n        self.apply(weights_init)\n\n    def forward(self, x):\n        return self.model(x)\n\n\ndef get_default_device():\n    if torch.cuda.is_available():\n        return \"cuda\"\n    else:\n        return \"cpu\"\n\n\ndef load_model(mel2wav_path, device=get_default_device()):\n    \"\"\"\n    Args:\n        mel2wav_path (str or Path): path to the root folder of dumped text2mel\n        device (str or torch.device): device to load the model\n    \"\"\"\n    root = Path(mel2wav_path)\n    with open(root / \"args.yml\", \"r\") as f:\n        args = yaml.load(f, Loader=yaml.FullLoader)\n    netG = Generator(args.n_mel_channels, args.ngf, args.n_residual_layers).to(device)\n    netG.load_state_dict(torch.load(root / \"best_netG.pt\", map_location=device))\n    return netG\n\n\nclass MelVocoder:\n    def __init__(\n            self,\n            path,\n            device=get_default_device(),\n            github=False,\n            model_name=\"multi_speaker\",\n    ):\n        self.fft = Audio2Mel().to(device)\n        if github:\n            netG = Generator(80, 32, 3).to(device)\n            root = Path(os.path.dirname(__file__)).parent\n            netG.load_state_dict(\n                torch.load(root / f\"models/{model_name}.pt\", map_location=device)\n            )\n            self.mel2wav = netG\n        else:\n            self.mel2wav = load_model(path, device)\n        self.device = device\n\n    def __call__(self, audio):\n        \"\"\"\n        Performs audio to mel conversion (See Audio2Mel in mel2wav/modules.py)\n        Args:\n            audio (torch.tensor): PyTorch 
tensor containing audio (batch_size, timesteps)\n        Returns:\n            torch.tensor: log-mel-spectrogram computed on input audio (batch_size, 80, timesteps)\n        \"\"\"\n        return self.fft(audio.unsqueeze(1).to(self.device))\n\n    def inverse(self, mel):\n        \"\"\"\n        Performs mel2audio conversion\n        Args:\n            mel (torch.tensor): PyTorch tensor containing log-mel spectrograms (batch_size, 80, timesteps)\n        Returns:\n            torch.tensor:  Inverted raw audio (batch_size, timesteps)\n\n        \"\"\"\n        with torch.no_grad():\n            return self.mel2wav(mel.to(self.device)).squeeze(1)\n"
  },
  {
    "path": "model.py",
    "content": "#! python\n# -*- coding: utf-8 -*-\n# Author: kun\n# @Time: 2020-11-17 14:35\n\nimport torch.nn as nn\nimport torch\nfrom tfan_module import TFAN_1D, TFAN_2D\n\n\nclass GLU(nn.Module):\n    def __init__(self):\n        super(GLU, self).__init__()\n        # Custom Implementation because the Voice Conversion Cycle GAN\n        # paper assumes GLU won't reduce the dimension of tensor by 2.\n\n    def forward(self, input):\n        return input * torch.sigmoid(input)\n\n\nclass PixelShuffle(nn.Module):\n    def __init__(self, upscale_factor):\n        super(PixelShuffle, self).__init__()\n        # Custom Implementation because PyTorch PixelShuffle requires,\n        # 4D input. Whereas, in this case we have have 3D array\n        self.upscale_factor = upscale_factor\n\n    def forward(self, input):\n        n = input.shape[0]\n        c_out = input.shape[1] // 2\n        w_new = input.shape[2] * 2\n        return input.view(n, c_out, w_new)\n\n\n##########################################################################################\nclass ResidualLayer(nn.Module):\n    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):\n        super(ResidualLayer, self).__init__()\n\n        # self.residualLayer = nn.Sequential(nn.Conv1d(in_channels=in_channels,\n        #                                              out_channels=out_channels,\n        #                                              kernel_size=kernel_size,\n        #                                              stride=1,\n        #                                              padding=padding),\n        #                                    nn.InstanceNorm1d(\n        #                                        num_features=out_channels,\n        #                                        affine=True),\n        #                                    GLU(),\n        #                                    nn.Conv1d(in_channels=out_channels,\n        #                                   
           out_channels=in_channels,\n        #                                              kernel_size=kernel_size,\n        #                                              stride=1,\n        #                                              padding=padding),\n        #                                    nn.InstanceNorm1d(\n        #                                        num_features=in_channels,\n        #                                        affine=True)\n        #                                    )\n\n        self.conv1d_layer = nn.Sequential(nn.Conv1d(in_channels=in_channels,\n                                                    out_channels=out_channels,\n                                                    kernel_size=kernel_size,\n                                                    stride=1,\n                                                    padding=padding),\n                                          nn.InstanceNorm1d(num_features=out_channels,\n                                                            affine=True))\n\n        self.conv_layer_gates = nn.Sequential(nn.Conv1d(in_channels=in_channels,\n                                                        out_channels=out_channels,\n                                                        kernel_size=kernel_size,\n                                                        stride=1,\n                                                        padding=padding),\n                                              nn.InstanceNorm1d(num_features=out_channels,\n                                                                affine=True))\n\n        self.conv1d_out_layer = nn.Sequential(nn.Conv1d(in_channels=out_channels,\n                                                        out_channels=in_channels,\n                                                        kernel_size=kernel_size,\n                                                        stride=1,\n                                                        
padding=padding),\n                                              nn.InstanceNorm1d(num_features=in_channels,\n                                                                affine=True))\n\n    def forward(self, input):\n        h1_norm = self.conv1d_layer(input)\n        h1_gates_norm = self.conv_layer_gates(input)\n\n        # GLU\n        h1_glu = h1_norm * torch.sigmoid(h1_gates_norm)\n\n        h2_norm = self.conv1d_out_layer(h1_glu)\n        return input + h2_norm\n\n\n##########################################################################################\nclass downSample_Generator(nn.Module):\n    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):\n        super(downSample_Generator, self).__init__()\n\n        self.convLayer = nn.Sequential(nn.Conv2d(in_channels=in_channels,\n                                                 out_channels=out_channels,\n                                                 kernel_size=kernel_size,\n                                                 stride=stride,\n                                                 padding=padding),\n                                       nn.InstanceNorm2d(num_features=out_channels,\n                                                         affine=True))\n        self.convLayer_gates = nn.Sequential(nn.Conv2d(in_channels=in_channels,\n                                                       out_channels=out_channels,\n                                                       kernel_size=kernel_size,\n                                                       stride=stride,\n                                                       padding=padding),\n                                             nn.InstanceNorm2d(num_features=out_channels,\n                                                               affine=True))\n\n    def forward(self, input):\n        # GLU\n        return self.convLayer(input) * 
torch.sigmoid(self.convLayer_gates(input))\n\n\n##########################################################################################\nclass Generator(nn.Module):\n    def __init__(self):\n        super(Generator, self).__init__()\n\n        # 2D Conv Layer\n        self.conv1 = nn.Conv2d(in_channels=1,  # TODO 1 ?\n                               out_channels=128,\n                               kernel_size=(5, 15),\n                               stride=(1, 1),\n                               padding=(2, 7))\n\n        self.conv1_gates = nn.Conv2d(in_channels=1,  # TODO 1 ?\n                                     out_channels=128,\n                                     kernel_size=(5, 15),\n                                     stride=1,\n                                     padding=(2, 7))\n\n        # 2D Downsample Layer\n        self.downSample1 = downSample_Generator(in_channels=128,\n                                                out_channels=256,\n                                                kernel_size=5,\n                                                stride=2,\n                                                padding=2)\n\n        self.downSample2 = downSample_Generator(in_channels=256,\n                                                out_channels=256,\n                                                kernel_size=5,\n                                                stride=2,\n                                                padding=2)\n\n        # 2D -> 1D Conv\n        # self.conv2dto1dLayer = nn.Sequential(nn.Conv1d(in_channels=2304,\n        #                                                out_channels=256,\n        #                                                kernel_size=1,\n        #                                                stride=1,\n        #                                                padding=0),\n        #                                      nn.InstanceNorm1d(num_features=256,\n        #                                             
           affine=True))\n\n        self.conv2dto1dLayer = nn.Conv1d(in_channels=2304,\n                                         out_channels=256,\n                                         kernel_size=1,\n                                         stride=1,\n                                         padding=0)\n        self.conv2dto1dLayer_tfan = TFAN_1D(256)\n\n        # Residual Blocks\n        self.residualLayer1 = ResidualLayer(in_channels=256,\n                                            out_channels=512,\n                                            kernel_size=3,\n                                            stride=1,\n                                            padding=1)\n        self.residualLayer2 = ResidualLayer(in_channels=256,\n                                            out_channels=512,\n                                            kernel_size=3,\n                                            stride=1,\n                                            padding=1)\n        self.residualLayer3 = ResidualLayer(in_channels=256,\n                                            out_channels=512,\n                                            kernel_size=3,\n                                            stride=1,\n                                            padding=1)\n        self.residualLayer4 = ResidualLayer(in_channels=256,\n                                            out_channels=512,\n                                            kernel_size=3,\n                                            stride=1,\n                                            padding=1)\n        self.residualLayer5 = ResidualLayer(in_channels=256,\n                                            out_channels=512,\n                                            kernel_size=3,\n                                            stride=1,\n                                            padding=1)\n        self.residualLayer6 = ResidualLayer(in_channels=256,\n                                            out_channels=512,\n      
                                      kernel_size=3,\n                                            stride=1,\n                                            padding=1)\n\n        # 1D -> 2D Conv\n        # self.conv1dto2dLayer = nn.Sequential(nn.Conv1d(in_channels=256,\n        #                                                out_channels=2304,\n        #                                                kernel_size=1,\n        #                                                stride=1,\n        #                                                padding=0),\n        #                                      nn.InstanceNorm1d(num_features=2304,\n        #                                                        affine=True))\n\n        self.conv1dto2dLayer = nn.Conv1d(in_channels=256,\n                                         out_channels=2304,\n                                         kernel_size=1,\n                                         stride=1,\n                                         padding=0)\n        self.conv1dto2dLayer_tfan = TFAN_1D(2304)\n\n        # UpSample Layer\n        self.upSample1 = self.upSample(in_channels=256,\n                                       out_channels=1024,\n                                       kernel_size=5,\n                                       stride=1,\n                                       padding=2)\n\n        self.upSample1_tfan = TFAN_2D(1024 // 4)\n        self.glu = GLU()\n\n        self.upSample2 = self.upSample(in_channels=256,\n                                       out_channels=512,\n                                       kernel_size=5,\n                                       stride=1,\n                                       padding=2)\n        self.upSample2_tfan = TFAN_2D(512 // 4)\n\n        self.lastConvLayer = nn.Conv2d(in_channels=128,\n                                       out_channels=1,\n                                       kernel_size=(5, 15),\n                                       stride=(1, 1),\n              
                         padding=(2, 7))\n\n    def downSample(self, in_channels, out_channels, kernel_size, stride, padding):\n        self.ConvLayer = nn.Sequential(nn.Conv1d(in_channels=in_channels,\n                                                 out_channels=out_channels,\n                                                 kernel_size=kernel_size,\n                                                 stride=stride,\n                                                 padding=padding),\n                                       nn.InstanceNorm1d(\n                                           num_features=out_channels,\n                                           affine=True),\n                                       GLU())\n\n        return self.ConvLayer\n\n    # def upSample(self, in_channels, out_channels, kernel_size, stride, padding):\n    #     self.convLayer = nn.Sequential(nn.Conv2d(in_channels=in_channels,\n    #                                              out_channels=out_channels,\n    #                                              kernel_size=kernel_size,\n    #                                              stride=stride,\n    #                                              padding=padding),\n    #                                    nn.PixelShuffle(upscale_factor=2),\n    #                                    nn.InstanceNorm2d(\n    #                                        num_features=out_channels // 4,\n    #                                        affine=True),\n    #                                    GLU())\n    #     return self.convLayer\n\n    def upSample(self, in_channels, out_channels, kernel_size, stride, padding):\n        self.convLayer = nn.Sequential(nn.Conv2d(in_channels=in_channels,\n                                                 out_channels=out_channels,\n                                                 kernel_size=kernel_size,\n                                                 stride=stride,\n                                                 
padding=padding),\n                                       nn.PixelShuffle(upscale_factor=2))\n        return self.convLayer\n\n    def forward(self, input):\n        # print(\"Generator forward input: \", input.shape)\n        # Add a channel dimension for the 2D convolutions.\n        input = input.unsqueeze(1)\n        # print(\"Generator forward input: \", input.shape)\n        seg_1d = input  # for TFAN module\n\n        # GLU\n        conv1 = self.conv1(input) * torch.sigmoid(self.conv1_gates(input))\n        # print(\"Generator forward conv1: \", conv1.shape)\n\n        # DownSample\n        downsample1 = self.downSample1(conv1)\n        # print(\"Generator forward downsample1: \", downsample1.shape)\n        downsample2 = self.downSample2(downsample1)\n        # print(\"Generator forward downsample2: \", downsample2.shape)\n\n        # 2D -> 1D\n        # reshape\n        reshape2dto1d = downsample2.view(downsample2.size(0), 2304, 1, -1)\n        reshape2dto1d = reshape2dto1d.squeeze(2)\n        # print(\"Generator forward reshape2dto1d: \", reshape2dto1d.shape)\n        conv2dto1d_layer = self.conv2dto1dLayer(reshape2dto1d)\n        # print(\"Generator forward conv2dto1d_layer: \", conv2dto1d_layer.shape)\n\n        conv2dto1d_layer = self.conv2dto1dLayer_tfan(conv2dto1d_layer, seg_1d)\n\n        residual_layer_1 = self.residualLayer1(conv2dto1d_layer)\n        residual_layer_2 = self.residualLayer2(residual_layer_1)\n        residual_layer_3 = self.residualLayer3(residual_layer_2)\n        residual_layer_4 = self.residualLayer4(residual_layer_3)\n        residual_layer_5 = self.residualLayer5(residual_layer_4)\n        residual_layer_6 = self.residualLayer6(residual_layer_5)\n\n        # print(\"Generator forward residual_layer_6: \", residual_layer_6.shape)\n\n        # 1D -> 2D\n        conv1dto2d_layer = self.conv1dto2dLayer(residual_layer_6)\n        # print(\"Generator forward conv1dto2d_layer: \", conv1dto2d_layer.shape)\n\n        conv1dto2d_layer = self.conv1dto2dLayer_tfan(conv1dto2d_layer, seg_1d)\n\n        # 
reshape\n        reshape1dto2d = conv1dto2d_layer.unsqueeze(2)\n        reshape1dto2d = reshape1dto2d.view(reshape1dto2d.size(0), 256, 9, -1)\n        # print(\"Generator forward reshape1dto2d: \", reshape1dto2d.shape)\n\n        seg_2d = reshape1dto2d\n\n        # UpSample\n        upsample_layer_1 = self.upSample1(reshape1dto2d)\n        # print(\"Generator forward upsample_layer_1: \", upsample_layer_1.shape)\n        upsample_layer_1 = self.upSample1_tfan(upsample_layer_1, seg_2d)\n        upsample_layer_1 = self.glu(upsample_layer_1)\n\n        upsample_layer_2 = self.upSample2(upsample_layer_1)\n        # print(\"Generator forward upsample_layer_2: \", upsample_layer_2.shape)\n        upsample_layer_2 = self.upSample2_tfan(upsample_layer_2, seg_2d)\n        upsample_layer_2 = self.glu(upsample_layer_2)\n\n        output = self.lastConvLayer(upsample_layer_2)\n        # print(\"Generator forward output: \", output.shape)\n        output = output.squeeze(1)\n        # print(\"Generator forward output: \", output.shape)\n        return output\n\n\n##########################################################################################\n# Discriminator (PatchGAN)\nclass Discriminator(nn.Module):\n    def __init__(self):\n        super(Discriminator, self).__init__()\n\n        self.convLayer1 = nn.Sequential(nn.Conv2d(in_channels=1,\n                                                  out_channels=128,\n                                                  kernel_size=(3, 3),\n                                                  stride=(1, 1),\n                                                  padding=(1, 1)),\n                                        GLU())\n\n        # DownSample Layer\n        self.downSample1 = self.downSample(in_channels=128,\n                                           out_channels=256,\n                                           kernel_size=(3, 3),\n                                           stride=(2, 2),\n                                           
padding=1)\n\n        self.downSample2 = self.downSample(in_channels=256,\n                                           out_channels=512,\n                                           kernel_size=(3, 3),\n                                           stride=[2, 2],\n                                           padding=1)\n\n        self.downSample3 = self.downSample(in_channels=512,\n                                           out_channels=1024,\n                                           kernel_size=[3, 3],\n                                           stride=[2, 2],\n                                           padding=1)\n\n        self.downSample4 = self.downSample(in_channels=1024,\n                                           out_channels=1024,\n                                           kernel_size=[1, 10],  # [1, 5] for cyclegan-vc2\n                                           stride=(1, 1),\n                                           padding=(0, 2))\n\n        # Conv Layer\n        self.outputConvLayer = nn.Sequential(nn.Conv2d(in_channels=1024,\n                                                       out_channels=1,\n                                                       kernel_size=(1, 3),\n                                                       stride=[1, 1],\n                                                       padding=[0, 1]))\n\n    def downSample(self, in_channels, out_channels, kernel_size, stride, padding):\n        convLayer = nn.Sequential(nn.Conv2d(in_channels=in_channels,\n                                            out_channels=out_channels,\n                                            kernel_size=kernel_size,\n                                            stride=stride,\n                                            padding=padding),\n                                  nn.InstanceNorm2d(num_features=out_channels,\n                                                    affine=True),\n                                  GLU())\n        return convLayer\n\n    def 
forward(self, input):\n        # input has shape [batch_size, num_features, time]\n        # discriminator requires shape [batchSize, 1, num_features, time]\n        input = input.unsqueeze(1)\n        # print(\"Discriminator forward input: \", input.shape)\n        conv_layer_1 = self.convLayer1(input)\n        # print(\"Discriminator forward conv_layer_1: \", conv_layer_1.shape)\n\n        downsample1 = self.downSample1(conv_layer_1)\n        # print(\"Discriminator forward downsample1: \", downsample1.shape)\n        downsample2 = self.downSample2(downsample1)\n        # print(\"Discriminator forward downsample2: \", downsample2.shape)\n        downsample3 = self.downSample3(downsample2)\n        # print(\"Discriminator forward downsample3: \", downsample3.shape)\n\n        # downsample3 = downsample3.contiguous().permute(0, 2, 3, 1).contiguous()\n        # print(\"Discriminator forward downsample3: \", downsample3.shape)\n\n        output = torch.sigmoid(self.outputConvLayer(downsample3))\n        # print(\"Discriminator forward output: \", output.shape)\n        return output\n\n\nif __name__ == '__main__':\n    import sys\n    import numpy as np\n\n    args = sys.argv\n    print(args)\n    if len(args) > 1:\n        if args[1] == \"g\":\n            generator = Generator()\n            print(generator)\n        elif args[1] == \"d\":\n            discriminator = Discriminator()\n            print(discriminator)\n\n        sys.exit(0)\n\n    # Generator Dimensionality Testing\n    input = torch.randn(10, 36, 1100)  # (N, C_in, Width) For Conv1d\n    np.random.seed(0)\n    # print(np.random.randn(10))\n    input = np.random.randn(2, 80, 64)\n    input = torch.from_numpy(input).float()\n    print(\"Generator input: \", input.shape)\n    generator = Generator()\n    output = generator(input)\n    print(\"Generator output shape: \", output.shape)\n\n    # Discriminator Dimensionality Testing\n    # input = torch.randn(32, 1, 24, 128)  # (N, C_in, height, width) 
For Conv2d\n    discriminator = Discriminator()\n    output = discriminator(output)\n    print(\"Discriminator output shape \", output.shape)\n"
  },
  {
    "path": "tfan_module.py",
    "content": "#! python\n# -*- coding: utf-8 -*-\n# Author: kun\n# @Time: 2020-11-17 14:35\n\nimport re\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.nn.utils.spectral_norm as spectral_norm\n\n\n# Returns a function that creates a normalization function\ndef get_norm_layer(opt):\n    # helper function to get # output channels of the previous layer\n    def get_out_channel(layer):\n        if hasattr(layer, 'out_channels'):\n            return getattr(layer, 'out_channels')\n        return layer.weight.size(0)\n\n    # this function will be returned\n    def add_norm_layer(layer):\n        layer = spectral_norm(layer)\n\n        # remove bias in the previous layer, which is meaningless\n        # since it has no effect after normalization\n        if getattr(layer, 'bias', None) is not None:\n            delattr(layer, 'bias')\n            layer.register_parameter('bias', None)\n\n        norm_layer = nn.InstanceNorm2d(get_out_channel(layer), affine=False)\n\n        return nn.Sequential(layer, norm_layer)\n\n    return add_norm_layer\n\n\nclass TFAN_1D(nn.Module):\n    \"\"\"\n    as paper said, it has best performance when N=3, kernal_size in h is 5\n    \"\"\"\n\n    def __init__(self, norm_nc, ks=5, label_nc=128, N=3):\n        super().__init__()\n\n        self.param_free_norm = nn.InstanceNorm1d(norm_nc, affine=False)\n\n        self.repeat_N = N\n\n        # The dimension of the intermediate embedding space. Yes, hardcoded.\n        nhidden = 128\n\n        pw = ks // 2\n\n        self.mlp_shared = nn.Sequential(\n            nn.Conv1d(label_nc, nhidden, kernel_size=ks, padding=pw),\n            nn.ReLU()\n        )\n        self.mlp_gamma = nn.Conv1d(nhidden, norm_nc, kernel_size=ks, padding=pw)\n        self.mlp_beta = nn.Conv1d(nhidden, norm_nc, kernel_size=ks, padding=pw)\n\n    def forward(self, x, segmap):\n        # Part 1. 
generate parameter-free normalized activations\n        normalized = self.param_free_norm(x)\n\n        # Part 2. produce scaling and bias conditioned on semantic map\n        segmap = F.interpolate(segmap, size=x.size()[2:], mode='nearest')\n\n        # actv = self.mlp_shared(segmap)\n        # NOTE: the same mlp_shared block is applied repeat_N times, so\n        # label_nc must equal nhidden (128) whenever N > 1.\n        temp = segmap\n        for _ in range(self.repeat_N):\n            temp = self.mlp_shared(temp)\n        actv = temp\n\n        gamma = self.mlp_gamma(actv)\n        beta = self.mlp_beta(actv)\n\n        # apply scale and bias\n        out = normalized * (1 + gamma) + beta\n\n        return out\n\n\nclass TFAN_2D(nn.Module):\n    \"\"\"\n    According to the paper, performance is best when N=3 and the kernel size in h is 5.\n    \"\"\"\n\n    def __init__(self, norm_nc, ks=5, label_nc=128, N=3):\n        super().__init__()\n\n        self.param_free_norm = nn.InstanceNorm2d(norm_nc, affine=False)\n        self.repeat_N = N\n\n        # The dimension of the intermediate embedding space. Yes, hardcoded.\n        nhidden = 128\n\n        pw = ks // 2\n        self.mlp_shared = nn.Sequential(\n            nn.Conv2d(label_nc, nhidden, kernel_size=ks, padding=pw),\n            nn.ReLU()\n        )\n        self.mlp_gamma = nn.Conv2d(nhidden, norm_nc, kernel_size=ks, padding=pw)\n        self.mlp_beta = nn.Conv2d(nhidden, norm_nc, kernel_size=ks, padding=pw)\n\n    def forward(self, x, segmap):\n        # Part 1. generate parameter-free normalized activations\n        normalized = self.param_free_norm(x)\n\n        # Part 2. produce scaling and bias conditioned on semantic map\n        segmap = F.interpolate(segmap, size=x.size()[2:], mode='nearest')\n\n        # actv = self.mlp_shared(segmap)\n        # NOTE: the same mlp_shared block is applied repeat_N times, so\n        # label_nc must equal nhidden (128) whenever N > 1.\n        temp = segmap\n        for _ in range(self.repeat_N):\n            temp = self.mlp_shared(temp)\n        actv = temp\n\n        gamma = self.mlp_gamma(actv)\n        beta = self.mlp_beta(actv)\n\n        # apply scale and bias\n        out = normalized * (1 + gamma) + beta\n\n        return out\n"
  },
  {
    "path": "train.py",
    "content": ""
  }
]