Full Code of jackaduma/CycleGAN-VC3 for AI

Repository: jackaduma/CycleGAN-VC3
Branch: main
Commit: 4bdbddab205e
Files: 8
Total size: 36.8 KB

Directory structure:
gitextract_v07ir58a/
├── LICENSE
├── README.md
├── datasets.py
├── feature_utils.py
├── melgan_vocoder.py
├── model.py
├── tfan_module.py
└── train.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2020 Kun Ma

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# **CycleGAN-VC3-PyTorch**

[![standard-readme compliant](https://img.shields.io/badge/readme%20style-standard-brightgreen.svg?style=flat-square)](https://github.com/jackaduma/CycleGAN-VC2)
[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://paypal.me/jackaduma?locale.x=zh_XC)

[**中文说明**](./README.zh-CN.md) | [**English**](./README.md)

------

This code is a **PyTorch** implementation of the paper [CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion](https://arxiv.org/abs/2010.11672), a nice work on **Voice-Conversion/Voice Cloning**.

- [x] Dataset
  - [ ] VC
- [x] Usage
  - [x] Training
  - [x] Example 
- [ ] Demo
- [x] Reference

------

## **CycleGAN-VC3**

### [**Project Page**](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/index.html) 


Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, CycleGAN-VC [3] and CycleGAN-VC2 [2] have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for **mel-spectrogram conversion**, they are typically used for mel-cepstrum conversion even when comparative methods employ mel-spectrogram as a conversion target. To address this, we examined the applicability of CycleGAN-VC/VC2 to **mel-spectrogram conversion**. Through initial experiments, we discovered that their direct applications compromised the time-frequency structure that should be preserved during conversion. To remedy this, we propose CycleGAN-VC3, an improvement of CycleGAN-VC2 that incorporates **time-frequency adaptive normalization (TFAN)**. Using TFAN, we can adjust the scale and bias of the converted features while reflecting the time-frequency structure of the source mel-spectrogram. We evaluated CycleGAN-VC3 on inter-gender and intra-gender non-parallel VC. A subjective evaluation of naturalness and similarity showed that for every VC pair, CycleGAN-VC3 outperforms or is competitive with the two types of CycleGAN-VC2, one of which was applied to mel-cepstrum and the other to mel-spectrogram.

![network comparison](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/images/comparison.png "comparison between vc2 and vc3")  _Figure 1. We developed time-frequency adaptive normalization (TFAN), which extends instance normalization [5] so that the affine parameters become element-dependent and are determined according to an entire input mel-spectrogram._
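
As a rough illustration of the TFAN idea described above (not the code in `tfan_module.py`), the sketch below normalizes a feature map with instance normalization and then applies a scale and bias that are predicted, element by element, from the source mel-spectrogram. The class name `TFANSketch1d`, the layer sizes, and the single shared convolution are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TFANSketch1d(nn.Module):
    """Illustrative 1D TFAN-style layer (hypothetical, not the repository module)."""

    def __init__(self, feat_channels, mel_bins, hidden=128, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        # Instance normalization without its own learned per-channel affine terms.
        self.norm = nn.InstanceNorm1d(feat_channels, affine=False)
        # Small conv stack that reads the source mel-spectrogram.
        self.shared = nn.Sequential(
            nn.Conv1d(mel_bins, hidden, kernel_size, padding=pad),
            nn.ReLU(),
        )
        # Separate heads predict an element-wise scale and bias.
        self.to_gamma = nn.Conv1d(hidden, feat_channels, kernel_size, padding=pad)
        self.to_beta = nn.Conv1d(hidden, feat_channels, kernel_size, padding=pad)

    def forward(self, feat, mel):
        # feat: (batch, feat_channels, frames_f), mel: (batch, mel_bins, frames_m)
        # Resize the mel-spectrogram along time to match the feature map.
        mel = F.interpolate(mel, size=feat.shape[-1], mode="nearest")
        h = self.shared(mel)
        gamma = self.to_gamma(h)  # same shape as feat
        beta = self.to_beta(h)    # same shape as feat
        return gamma * self.norm(feat) + beta


torch.manual_seed(0)
layer = TFANSketch1d(feat_channels=256, mel_bins=80)
feat = torch.randn(2, 256, 16)  # intermediate generator features
mel = torch.randn(2, 80, 64)    # source mel-spectrogram
out = layer(feat, mel)
print(out.shape)
```

Unlike `nn.InstanceNorm1d(..., affine=True)`, which learns one scale and one bias per channel, `gamma` and `beta` here vary over both channel and time and depend on the input mel-spectrogram.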

------

**This repository contains:** 

1. [TFAN module code](tfan_module.py), which implements the TFAN module.
2. [model code](model.py), which implements the model network.
3. [audio preprocessing script](preprocess_training.py), which you can use to create a cache for the [training data](data).
4. [training script](train.py) to train the model.
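
The generator in `model.py` hard-codes 2304 input channels for its 2D-to-1D convolution. The sketch below shows where that number may come from, assuming the two stride-2 downsampling layers defined there (kernel size 5, padding 2, 256 output channels) and the standard convolution output-size formula. The helper names `conv_out_size` and `flattened_channels` are hypothetical and do not exist in the repository.

```python
def conv_out_size(size, kernel=5, stride=2, padding=2):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_channels(n_features, channels=256, num_downsamples=2):
    """Channel count after folding the frequency axis into the channel axis."""
    for _ in range(num_downsamples):
        n_features = conv_out_size(n_features)
    return channels * n_features


print(flattened_channels(36))  # 2304
print(flattened_channels(80))  # 5120
```

If this arithmetic matches the model, 2304 corresponds to an input with 36 feature bins (36 -> 18 -> 9, and 256 * 9 = 2304). An 80-bin mel-spectrogram, the default in `Audio2Mel`, would give 5120 instead, so the hard-coded value may need adjusting for 80-bin input.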



------

## **Table of Contents**

- [**CycleGAN-VC3-PyTorch**](#cyclegan-vc3-pytorch)
  - [**CycleGAN-VC3**](#cyclegan-vc3)
    - [**Project Page**](#project-page)
  - [**Table of Contents**](#table-of-contents)
  - [**Requirement**](#requirement)
  - [**Usage**](#usage)
  - [**Star-History**](#star-history)
  - [**Reference**](#reference)
  - [Donation](#donation)
  - [**License**](#license)
  
------

## **Requirement** 

```bash
pip install -r requirements.txt
```
## **Usage**


------

## **Star-History**

![star-history](https://api.star-history.com/svg?repos=jackaduma/CycleGAN-VC3&type=Date "star-history")

------

## **Reference**
1. **CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion.** [Paper](https://arxiv.org/abs/2010.11672), [Project](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/index.html)
2. CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion. [Paper](https://arxiv.org/abs/1904.04631), [Project](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.html)
3. Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks. [Paper](https://arxiv.org/abs/1711.11293), [Project](http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc/)
4. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. [Paper](https://arxiv.org/abs/1703.10593), [Project](https://junyanz.github.io/CycleGAN/), [Code](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix)
5. Image-to-Image Translation with Conditional Adversarial Nets. [Paper](https://arxiv.org/abs/1611.07004), [Project](https://phillipi.github.io/pix2pix/), [Code](https://github.com/phillipi/pix2pix)


------

## Donation
If this project helps you reduce development time, you can buy me a cup of coffee :)

AliPay(支付宝)
<div align="center">
	<img src="./misc/ali_pay.png" alt="ali_pay" width="400" />
</div>

WechatPay(微信)
<div align="center">
    <img src="./misc/wechat_pay.png" alt="wechat_pay" width="400" />
</div>

[![paypal](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://paypal.me/jackaduma?locale.x=zh_XC)


------

## **License**

[MIT](LICENSE) © Kun

================================================
FILE: datasets.py
================================================


================================================
FILE: feature_utils.py
================================================
#!python
# -*- coding: utf-8 -*-

import torch
import torch.nn as nn
import torch.nn.functional as F
from librosa.filters import mel as librosa_mel_fn


class Audio2Mel(nn.Module):
    def __init__(
            self,
            n_fft=1024,
            hop_length=256,
            win_length=1024,
            sampling_rate=22050,
            n_mel_channels=80,
            mel_fmin=0.0,
            mel_fmax=None,
    ):
        super().__init__()
        ##############################################
        # FFT Parameters                              #
        ##############################################
        window = torch.hann_window(win_length).float()
        # Keyword arguments: librosa >= 0.10 rejects positional ones here.
        mel_basis = librosa_mel_fn(
            sr=sampling_rate, n_fft=n_fft, n_mels=n_mel_channels,
            fmin=mel_fmin, fmax=mel_fmax,
        )
        mel_basis = torch.from_numpy(mel_basis).float()
        self.register_buffer("mel_basis", mel_basis)
        self.register_buffer("window", window)
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.win_length = win_length
        self.sampling_rate = sampling_rate
        self.n_mel_channels = n_mel_channels

    def forward(self, audio):
        p = (self.n_fft - self.hop_length) // 2
        audio = F.pad(audio, (p, p), "reflect").squeeze(1)
        # return_complex=True is required by recent PyTorch releases for real input.
        fft = torch.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length,
            window=self.window,
            center=False,
            return_complex=True,
        )
        # Magnitude of the complex STFT, shape (batch, n_fft // 2 + 1, frames).
        magnitude = fft.abs()
        mel_output = torch.matmul(self.mel_basis, magnitude)
        log_mel_spec = torch.log10(torch.clamp(mel_output, min=1e-5))
        return log_mel_spec


================================================
FILE: melgan_vocoder.py
================================================
#!python
# -*- coding: utf-8 -*-
import os
import yaml
from pathlib import Path

import numpy as np  # needed for np.prod in Generator.__init__
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

from feature_utils import Audio2Mel


def weights_init(m):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find("BatchNorm2d") != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)


def WNConv1d(*args, **kwargs):
    return weight_norm(nn.Conv1d(*args, **kwargs))


def WNConvTranspose1d(*args, **kwargs):
    return weight_norm(nn.ConvTranspose1d(*args, **kwargs))


class ResnetBlock(nn.Module):
    def __init__(self, dim, dilation=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.ReflectionPad1d(dilation),
            WNConv1d(dim, dim, kernel_size=3, dilation=dilation),
            nn.LeakyReLU(0.2),
            WNConv1d(dim, dim, kernel_size=1),
        )
        self.shortcut = WNConv1d(dim, dim, kernel_size=1)

    def forward(self, x):
        return self.shortcut(x) + self.block(x)


class Generator(nn.Module):
    def __init__(self, input_size, ngf, n_residual_layers):
        super().__init__()
        ratios = [8, 8, 2, 2]
        self.hop_length = np.prod(ratios)
        mult = int(2 ** len(ratios))

        model = [
            nn.ReflectionPad1d(3),
            WNConv1d(input_size, mult * ngf, kernel_size=7, padding=0),
        ]

        # Upsample to raw audio scale
        for i, r in enumerate(ratios):
            model += [
                nn.LeakyReLU(0.2),
                WNConvTranspose1d(
                    mult * ngf,
                    mult * ngf // 2,
                    kernel_size=r * 2,
                    stride=r,
                    padding=r // 2 + r % 2,
                    output_padding=r % 2,
                ),
            ]

            for j in range(n_residual_layers):
                model += [ResnetBlock(mult * ngf // 2, dilation=3 ** j)]

            mult //= 2

        model += [
            nn.LeakyReLU(0.2),
            nn.ReflectionPad1d(3),
            WNConv1d(ngf, 1, kernel_size=7, padding=0),
            nn.Tanh(),
        ]

        self.model = nn.Sequential(*model)
        self.apply(weights_init)

    def forward(self, x):
        return self.model(x)


def get_default_device():
    if torch.cuda.is_available():
        return "cuda"
    else:
        return "cpu"


def load_model(mel2wav_path, device=get_default_device()):
    """
    Args:
        mel2wav_path (str or Path): path to the root folder of the dumped mel2wav model (must contain args.yml and best_netG.pt)
        device (str or torch.device): device to load the model
    """
    root = Path(mel2wav_path)
    with open(root / "args.yml", "r") as f:
        args = yaml.load(f, Loader=yaml.FullLoader)
    netG = Generator(args.n_mel_channels, args.ngf, args.n_residual_layers).to(device)
    netG.load_state_dict(torch.load(root / "best_netG.pt", map_location=device))
    return netG


class MelVocoder:
    def __init__(
            self,
            path,
            device=get_default_device(),
            github=False,
            model_name="multi_speaker",
    ):
        self.fft = Audio2Mel().to(device)
        if github:
            netG = Generator(80, 32, 3).to(device)
            root = Path(os.path.dirname(__file__)).parent
            netG.load_state_dict(
                torch.load(root / f"models/{model_name}.pt", map_location=device)
            )
            self.mel2wav = netG
        else:
            self.mel2wav = load_model(path, device)
        self.device = device

    def __call__(self, audio):
        """
        Performs audio to mel conversion (see Audio2Mel in feature_utils.py)
        Args:
            audio (torch.tensor): PyTorch tensor containing audio (batch_size, timesteps)
        Returns:
            torch.tensor: log-mel-spectrogram computed on input audio (batch_size, 80, timesteps)
        """
        return self.fft(audio.unsqueeze(1).to(self.device))

    def inverse(self, mel):
        """
        Performs mel2audio conversion
        Args:
            mel (torch.tensor): PyTorch tensor containing log-mel spectrograms (batch_size, 80, timesteps)
        Returns:
            torch.tensor:  Inverted raw audio (batch_size, timesteps)

        """
        with torch.no_grad():
            return self.mel2wav(mel.to(self.device)).squeeze(1)


================================================
FILE: model.py
================================================
#! python
# -*- coding: utf-8 -*-
# Author: kun
# @Time: 2020-11-17 14:35

import torch.nn as nn
import torch
from tfan_module import TFAN_1D, TFAN_2D


class GLU(nn.Module):
    def __init__(self):
        super(GLU, self).__init__()
        # Custom Implementation because the Voice Conversion Cycle GAN
        # paper assumes GLU won't reduce the dimension of tensor by 2.

    def forward(self, input):
        return input * torch.sigmoid(input)


class PixelShuffle(nn.Module):
    def __init__(self, upscale_factor):
        super(PixelShuffle, self).__init__()
        # Custom implementation because PyTorch's PixelShuffle requires
        # 4D input, whereas here we have a 3D array.
        self.upscale_factor = upscale_factor

    def forward(self, input):
        n = input.shape[0]
        c_out = input.shape[1] // 2
        w_new = input.shape[2] * 2
        return input.view(n, c_out, w_new)


##########################################################################################
class ResidualLayer(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super(ResidualLayer, self).__init__()

        # self.residualLayer = nn.Sequential(nn.Conv1d(in_channels=in_channels,
        #                                              out_channels=out_channels,
        #                                              kernel_size=kernel_size,
        #                                              stride=1,
        #                                              padding=padding),
        #                                    nn.InstanceNorm1d(
        #                                        num_features=out_channels,
        #                                        affine=True),
        #                                    GLU(),
        #                                    nn.Conv1d(in_channels=out_channels,
        #                                              out_channels=in_channels,
        #                                              kernel_size=kernel_size,
        #                                              stride=1,
        #                                              padding=padding),
        #                                    nn.InstanceNorm1d(
        #                                        num_features=in_channels,
        #                                        affine=True)
        #                                    )

        self.conv1d_layer = nn.Sequential(nn.Conv1d(in_channels=in_channels,
                                                    out_channels=out_channels,
                                                    kernel_size=kernel_size,
                                                    stride=1,
                                                    padding=padding),
                                          nn.InstanceNorm1d(num_features=out_channels,
                                                            affine=True))

        self.conv_layer_gates = nn.Sequential(nn.Conv1d(in_channels=in_channels,
                                                        out_channels=out_channels,
                                                        kernel_size=kernel_size,
                                                        stride=1,
                                                        padding=padding),
                                              nn.InstanceNorm1d(num_features=out_channels,
                                                                affine=True))

        self.conv1d_out_layer = nn.Sequential(nn.Conv1d(in_channels=out_channels,
                                                        out_channels=in_channels,
                                                        kernel_size=kernel_size,
                                                        stride=1,
                                                        padding=padding),
                                              nn.InstanceNorm1d(num_features=in_channels,
                                                                affine=True))

    def forward(self, input):
        h1_norm = self.conv1d_layer(input)
        h1_gates_norm = self.conv_layer_gates(input)

        # GLU
        h1_glu = h1_norm * torch.sigmoid(h1_gates_norm)

        h2_norm = self.conv1d_out_layer(h1_glu)
        return input + h2_norm


##########################################################################################
class downSample_Generator(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super(downSample_Generator, self).__init__()

        self.convLayer = nn.Sequential(nn.Conv2d(in_channels=in_channels,
                                                 out_channels=out_channels,
                                                 kernel_size=kernel_size,
                                                 stride=stride,
                                                 padding=padding),
                                       nn.InstanceNorm2d(num_features=out_channels,
                                                         affine=True))
        self.convLayer_gates = nn.Sequential(nn.Conv2d(in_channels=in_channels,
                                                       out_channels=out_channels,
                                                       kernel_size=kernel_size,
                                                       stride=stride,
                                                       padding=padding),
                                             nn.InstanceNorm2d(num_features=out_channels,
                                                               affine=True))

    def forward(self, input):
        # GLU
        return self.convLayer(input) * torch.sigmoid(self.convLayer_gates(input))


##########################################################################################
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()

        # 2D Conv Layer
        self.conv1 = nn.Conv2d(in_channels=1,  # TODO 1 ?
                               out_channels=128,
                               kernel_size=(5, 15),
                               stride=(1, 1),
                               padding=(2, 7))

        self.conv1_gates = nn.Conv2d(in_channels=1,  # TODO 1 ?
                                     out_channels=128,
                                     kernel_size=(5, 15),
                                     stride=1,
                                     padding=(2, 7))

        # 2D Downsample Layer
        self.downSample1 = downSample_Generator(in_channels=128,
                                                out_channels=256,
                                                kernel_size=5,
                                                stride=2,
                                                padding=2)

        self.downSample2 = downSample_Generator(in_channels=256,
                                                out_channels=256,
                                                kernel_size=5,
                                                stride=2,
                                                padding=2)

        # 2D -> 1D Conv
        # self.conv2dto1dLayer = nn.Sequential(nn.Conv1d(in_channels=2304,
        #                                                out_channels=256,
        #                                                kernel_size=1,
        #                                                stride=1,
        #                                                padding=0),
        #                                      nn.InstanceNorm1d(num_features=256,
        #                                                        affine=True))

        self.conv2dto1dLayer = nn.Conv1d(in_channels=2304,
                                         out_channels=256,
                                         kernel_size=1,
                                         stride=1,
                                         padding=0)
        self.conv2dto1dLayer_tfan = TFAN_1D(256)

        # Residual Blocks
        self.residualLayer1 = ResidualLayer(in_channels=256,
                                            out_channels=512,
                                            kernel_size=3,
                                            stride=1,
                                            padding=1)
        self.residualLayer2 = ResidualLayer(in_channels=256,
                                            out_channels=512,
                                            kernel_size=3,
                                            stride=1,
                                            padding=1)
        self.residualLayer3 = ResidualLayer(in_channels=256,
                                            out_channels=512,
                                            kernel_size=3,
                                            stride=1,
                                            padding=1)
        self.residualLayer4 = ResidualLayer(in_channels=256,
                                            out_channels=512,
                                            kernel_size=3,
                                            stride=1,
                                            padding=1)
        self.residualLayer5 = ResidualLayer(in_channels=256,
                                            out_channels=512,
                                            kernel_size=3,
                                            stride=1,
                                            padding=1)
        self.residualLayer6 = ResidualLayer(in_channels=256,
                                            out_channels=512,
                                            kernel_size=3,
                                            stride=1,
                                            padding=1)

        # 1D -> 2D Conv
        # self.conv1dto2dLayer = nn.Sequential(nn.Conv1d(in_channels=256,
        #                                                out_channels=2304,
        #                                                kernel_size=1,
        #                                                stride=1,
        #                                                padding=0),
        #                                      nn.InstanceNorm1d(num_features=2304,
        #                                                        affine=True))

        self.conv1dto2dLayer = nn.Conv1d(in_channels=256,
                                         out_channels=2304,
                                         kernel_size=1,
                                         stride=1,
                                         padding=0)
        self.conv1dto2dLayer_tfan = TFAN_1D(2304)

        # UpSample Layer
        self.upSample1 = self.upSample(in_channels=256,
                                       out_channels=1024,
                                       kernel_size=5,
                                       stride=1,
                                       padding=2)

        self.upSample1_tfan = TFAN_2D(1024 // 4)
        self.glu = GLU()

        self.upSample2 = self.upSample(in_channels=256,
                                       out_channels=512,
                                       kernel_size=5,
                                       stride=1,
                                       padding=2)
        self.upSample2_tfan = TFAN_2D(512 // 4)

        self.lastConvLayer = nn.Conv2d(in_channels=128,
                                       out_channels=1,
                                       kernel_size=(5, 15),
                                       stride=(1, 1),
                                       padding=(2, 7))

    def downSample(self, in_channels, out_channels, kernel_size, stride, padding):
        self.ConvLayer = nn.Sequential(nn.Conv1d(in_channels=in_channels,
                                                 out_channels=out_channels,
                                                 kernel_size=kernel_size,
                                                 stride=stride,
                                                 padding=padding),
                                       nn.InstanceNorm1d(
                                           num_features=out_channels,
                                           affine=True),
                                       GLU())

        return self.ConvLayer

    # def upSample(self, in_channels, out_channels, kernel_size, stride, padding):
    #     self.convLayer = nn.Sequential(nn.Conv2d(in_channels=in_channels,
    #                                              out_channels=out_channels,
    #                                              kernel_size=kernel_size,
    #                                              stride=stride,
    #                                              padding=padding),
    #                                    nn.PixelShuffle(upscale_factor=2),
    #                                    nn.InstanceNorm2d(
    #                                        num_features=out_channels // 4,
    #                                        affine=True),
    #                                    GLU())
    #     return self.convLayer

    def upSample(self, in_channels, out_channels, kernel_size, stride, padding):
        self.convLayer = nn.Sequential(nn.Conv2d(in_channels=in_channels,
                                                 out_channels=out_channels,
                                                 kernel_size=kernel_size,
                                                 stride=stride,
                                                 padding=padding),
                                       nn.PixelShuffle(upscale_factor=2))
        return self.convLayer

    def forward(self, input):
        # input: [batch_size, num_features, time] -> add a channel axis
        # print("Generator forward input: ", input.shape)
        input = input.unsqueeze(1)
        # print("Generator forward input: ", input.shape)
        seg_1d = input  # for TFAN module

        # GLU
        conv1 = self.conv1(input) * torch.sigmoid(self.conv1_gates(input))
        # print("Generator forward conv1: ", conv1.shape)

        # DownSample
        downsample1 = self.downSample1(conv1)
        # print("Generator forward downsample1: ", downsample1.shape)
        downsample2 = self.downSample2(downsample1)
        # print("Generator forward downsample2: ", downsample2.shape)

        # 2D -> 1D
        # reshape
        reshape2dto1d = downsample2.view(downsample2.size(0), 2304, 1, -1)
        reshape2dto1d = reshape2dto1d.squeeze(2)
        # print("Generator forward reshape2dto1d: ", reshape2dto1d.shape)
        conv2dto1d_layer = self.conv2dto1dLayer(reshape2dto1d)
        # print("Generator forward conv2dto1d_layer: ", conv2dto1d_layer.shape)

        conv2dto1d_layer = self.conv2dto1dLayer_tfan(conv2dto1d_layer, seg_1d)

        residual_layer_1 = self.residualLayer1(conv2dto1d_layer)
        residual_layer_2 = self.residualLayer2(residual_layer_1)
        residual_layer_3 = self.residualLayer3(residual_layer_2)
        residual_layer_4 = self.residualLayer4(residual_layer_3)
        residual_layer_5 = self.residualLayer5(residual_layer_4)
        residual_layer_6 = self.residualLayer6(residual_layer_5)

        # print("Generator forward residual_layer_6: ", residual_layer_6.shape)

        # 1D -> 2D
        conv1dto2d_layer = self.conv1dto2dLayer(residual_layer_6)
        # print("Generator forward conv1dto2d_layer: ", conv1dto2d_layer.shape)

        conv1dto2d_layer = self.conv1dto2dLayer_tfan(conv1dto2d_layer, seg_1d)

        # reshape
        reshape1dto2d = conv1dto2d_layer.unsqueeze(2)
        reshape1dto2d = reshape1dto2d.view(reshape1dto2d.size(0), 256, 9, -1)
        # print("Generator forward reshape1dto2d: ", reshape1dto2d.shape)

        seg_2d = reshape1dto2d

        # UpSample
        upsample_layer_1 = self.upSample1(reshape1dto2d)
        # print("Generator forward upsample_layer_1: ", upsample_layer_1.shape)
        upsample_layer_1 = self.upSample1_tfan(upsample_layer_1, seg_2d)
        upsample_layer_1 = self.glu(upsample_layer_1)

        upsample_layer_2 = self.upSample2(upsample_layer_1)
        # print("Generator forward upsample_layer_2: ", upsample_layer_2.shape)
        upsample_layer_2 = self.upSample2_tfan(upsample_layer_2, seg_2d)
        upsample_layer_2 = self.glu(upsample_layer_2)

        output = self.lastConvLayer(upsample_layer_2)
        # print("Generator forward output: ", output.shape)
        output = output.squeeze(1)
        # print("Generator forward output: ", output.shape)
        return output


##########################################################################################
# Discriminator (PatchGAN)
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()

        self.convLayer1 = nn.Sequential(nn.Conv2d(in_channels=1,
                                                  out_channels=128,
                                                  kernel_size=(3, 3),
                                                  stride=(1, 1),
                                                  padding=(1, 1)),
                                        GLU())

        # DownSample Layer
        self.downSample1 = self.downSample(in_channels=128,
                                           out_channels=256,
                                           kernel_size=(3, 3),
                                           stride=(2, 2),
                                           padding=1)

        self.downSample2 = self.downSample(in_channels=256,
                                           out_channels=512,
                                           kernel_size=(3, 3),
                                           stride=[2, 2],
                                           padding=1)

        self.downSample3 = self.downSample(in_channels=512,
                                           out_channels=1024,
                                           kernel_size=[3, 3],
                                           stride=[2, 2],
                                           padding=1)

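        # NOTE: downSample4 is constructed here but never called in forward(),
        # which passes downsample3 straight to outputConvLayer.
        self.downSample4 = self.downSample(in_channels=1024,
                                           out_channels=1024,
                                           kernel_size=[1, 10],  # [1, 5] for cyclegan-vc2
                                           stride=(1, 1),
                                           padding=(0, 2))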

        # Conv Layer
        self.outputConvLayer = nn.Sequential(nn.Conv2d(in_channels=1024,
                                                       out_channels=1,
                                                       kernel_size=(1, 3),
                                                       stride=[1, 1],
                                                       padding=[0, 1]))

    def downSample(self, in_channels, out_channels, kernel_size, stride, padding):
        convLayer = nn.Sequential(nn.Conv2d(in_channels=in_channels,
                                            out_channels=out_channels,
                                            kernel_size=kernel_size,
                                            stride=stride,
                                            padding=padding),
                                  nn.InstanceNorm2d(num_features=out_channels,
                                                    affine=True),
                                  GLU())
        return convLayer

    def forward(self, input):
        # input has shape [batch_size, num_features, time]
        # discriminator requires shape [batchSize, 1, num_features, time]
        input = input.unsqueeze(1)
        # print("Discriminator forward input: ", input.shape)
        conv_layer_1 = self.convLayer1(input)
        # print("Discriminator forward conv_layer_1: ", conv_layer_1.shape)

        downsample1 = self.downSample1(conv_layer_1)
        # print("Discriminator forward downsample1: ", downsample1.shape)
        downsample2 = self.downSample2(downsample1)
        # print("Discriminator forward downsample2: ", downsample2.shape)
        downsample3 = self.downSample3(downsample2)
        # print("Discriminator forward downsample3: ", downsample3.shape)

        # downsample3 = downsample3.contiguous().permute(0, 2, 3, 1).contiguous()
        # print("Discriminator forward downsample3: ", downsample3.shape)

        output = torch.sigmoid(self.outputConvLayer(downsample3))
        # print("Discriminator forward output: ", output.shape)
        return output


if __name__ == '__main__':
    import sys
    import numpy as np

    args = sys.argv
    print(args)
    if len(args) > 1:
        if args[1] == "g":
            generator = Generator()
            print(generator)
        elif args[1] == "d":
            discriminator = Discriminator()
            print(discriminator)

        sys.exit(0)

    # Generator Dimensionality Testing
    np.random.seed(0)
    # print(np.random.randn(10))
    input = np.random.randn(2, 80, 64)  # (batch_size, num_features, time)
    input = torch.from_numpy(input).float()
    print("Generator input: ", input.shape)
    generator = Generator()
    output = generator(input)
    print("Generator output shape: ", output.shape)

    # Discriminator Dimensionality Testing
    # input = torch.randn(32, 1, 24, 128)  # (N, C_in, height, width) For Conv2d
    discriminator = Discriminator()
    output = discriminator(output)
    print("Discriminator output shape ", output.shape)
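

# The dimensionality test above prints the PatchGAN output shape at run time.
# As a cross-check, the same spatial size can be worked out by hand from the
# standard convolution output formula. The sketch below is not part of the
# repository: `conv_out` and `discriminator_patch_shape` are hypothetical helper
# names, and the layer list is copied from the layers that
# Discriminator.forward() calls. It assumes GLU and InstanceNorm2d leave the
# spatial size unchanged and that every convolution uses dilation 1.

```python
def conv_out(size, kernel, stride, padding):
    """Output length along one axis of a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


def discriminator_patch_shape(n_features, n_frames):
    """Spatial size (height, width) of the discriminator's patch map."""
    # ((kernel_h, kernel_w), (stride_h, stride_w), (pad_h, pad_w)) per layer.
    # downSample4 is left out because forward() does not call it.
    layers = [
        ((3, 3), (1, 1), (1, 1)),  # convLayer1
        ((3, 3), (2, 2), (1, 1)),  # downSample1
        ((3, 3), (2, 2), (1, 1)),  # downSample2
        ((3, 3), (2, 2), (1, 1)),  # downSample3
        ((1, 3), (1, 1), (0, 1)),  # outputConvLayer
    ]
    h, w = n_features, n_frames
    for (kh, kw), (sh, sw), (ph, pw) in layers:
        h = conv_out(h, kh, sh, ph)
        w = conv_out(w, kw, sw, pw)
    return h, w


# For an 80 x 64 input, each stride-2 layer halves both axes: 80x64 -> 40x32 -> 20x16 -> 10x8.
print(discriminator_patch_shape(80, 64))  # (10, 8)
```

# Under these assumptions an input of shape [batch_size, 80, 64] would give a
# discriminator output of shape [batch_size, 1, 10, 8], one score per patch.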


================================================
FILE: tfan_module.py
================================================
#! python
# -*- coding: utf-8 -*-
# Author: kun
# @Time: 2020-11-17 14:35

import re
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.spectral_norm as spectral_norm


# Returns a function that wraps a layer with spectral norm followed by
# a non-affine InstanceNorm2d. The `opt` argument is currently unused.
def get_norm_layer(opt):
    # helper function to get # output channels of the previous layer
    def get_out_channel(layer):
        if hasattr(layer, 'out_channels'):
            return getattr(layer, 'out_channels')
        return layer.weight.size(0)

    # this function will be returned
    def add_norm_layer(layer):
        layer = spectral_norm(layer)

        # remove bias in the previous layer, which is meaningless
        # since it has no effect after normalization
        if getattr(layer, 'bias', None) is not None:
            delattr(layer, 'bias')
            layer.register_parameter('bias', None)

        norm_layer = nn.InstanceNorm2d(get_out_channel(layer), affine=False)

        return nn.Sequential(layer, norm_layer)

    return add_norm_layer


class TFAN_1D(nn.Module):
    """
    According to the paper, performance is best with N=3 and a kernel size of 5 in h.
    """

    def __init__(self, norm_nc, ks=5, label_nc=128, N=3):
        super().__init__()

        self.param_free_norm = nn.InstanceNorm1d(norm_nc, affine=False)

        self.repeat_N = N

        # The dimension of the intermediate embedding space. Yes, hardcoded.
        nhidden = 128

        pw = ks // 2

        self.mlp_shared = nn.Sequential(
            nn.Conv1d(label_nc, nhidden, kernel_size=ks, padding=pw),
            nn.ReLU()
        )
        self.mlp_gamma = nn.Conv1d(nhidden, norm_nc, kernel_size=ks, padding=pw)
        self.mlp_beta = nn.Conv1d(nhidden, norm_nc, kernel_size=ks, padding=pw)

    def forward(self, x, segmap):
        # Part 1. generate parameter-free normalized activations
        normalized = self.param_free_norm(x)

        # Part 2. produce scaling and bias conditioned on semantic map
        segmap = F.interpolate(segmap, size=x.size()[2:], mode='nearest')

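        # actv = self.mlp_shared(segmap)
        # Apply the same shared conv + ReLU block N times (the weights are reused).
        # For N > 1 this requires label_nc == nhidden, because the block's
        # output is fed back in as its input.
        temp = segmap
        for i in range(self.repeat_N):
            temp = self.mlp_shared(temp)
        actv = temp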

        gamma = self.mlp_gamma(actv)
        beta = self.mlp_beta(actv)

        # apply scale and bias
        out = normalized * (1 + gamma) + beta

        return out
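

# forward() above feeds mlp_shared its own output repeat_N times, so the
# channel counts have to line up from the second pass onward. The sketch below
# traces that bookkeeping in plain Python without building any layers.
# `trace_shared_block_channels` is a hypothetical name used only for this
# illustration; it is not part of the repository.

```python
def trace_shared_block_channels(label_nc, nhidden, repeats):
    """Channel count after applying one (label_nc -> nhidden) conv `repeats` times."""
    channels = label_nc
    for step in range(repeats):
        if channels != label_nc:
            raise ValueError(
                f"pass {step + 1}: block expects {label_nc} channels "
                f"but receives {channels}"
            )
        channels = nhidden  # the conv always emits nhidden channels
    return channels


# The defaults used by TFAN_1D (label_nc=128, nhidden=128, N=3) chain cleanly.
print(trace_shared_block_channels(128, 128, 3))  # 128

# A mismatched label_nc only works for a single pass.
try:
    trace_shared_block_channels(64, 128, 3)
except ValueError as err:
    print("mismatch:", err)
```

# If a different label_nc were needed with N > 1, one option (an assumption,
# not something the source does) would be a separate first conv for the
# label_nc -> nhidden step.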


class TFAN_2D(nn.Module):
    """
    According to the paper, performance is best with N=3 and a kernel size of 5 in h.
    """

    def __init__(self, norm_nc, ks=5, label_nc=128, N=3):
        super().__init__()

        self.param_free_norm = nn.InstanceNorm2d(norm_nc, affine=False)
        self.repeat_N = N

        # The dimension of the intermediate embedding space. Yes, hardcoded.
        nhidden = 128

        pw = ks // 2
        self.mlp_shared = nn.Sequential(
            nn.Conv2d(label_nc, nhidden, kernel_size=ks, padding=pw),
            nn.ReLU()
        )
        self.mlp_gamma = nn.Conv2d(nhidden, norm_nc, kernel_size=ks, padding=pw)
        self.mlp_beta = nn.Conv2d(nhidden, norm_nc, kernel_size=ks, padding=pw)

    def forward(self, x, segmap):
        # Part 1. generate parameter-free normalized activations
        normalized = self.param_free_norm(x)

        # Part 2. produce scaling and bias conditioned on semantic map
        segmap = F.interpolate(segmap, size=x.size()[2:], mode='nearest')

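        # actv = self.mlp_shared(segmap)
        # Apply the same shared conv + ReLU block N times (the weights are reused).
        # For N > 1 this requires label_nc == nhidden, because the block's
        # output is fed back in as its input.
        temp = segmap
        for i in range(self.repeat_N):
            temp = self.mlp_shared(temp)
        actv = temp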

        gamma = self.mlp_gamma(actv)
        beta = self.mlp_beta(actv)

        # apply scale and bias
        out = normalized * (1 + gamma) + beta

        return out
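

# Both TFAN classes end with the same modulation step,
# out = normalized * (1 + gamma) + beta. The sketch below reproduces that
# arithmetic for a single channel in plain Python so it can be checked by hand.
# `instance_norm` and `tfan_modulate` are hypothetical helper names, and here
# gamma and beta are passed in as plain lists rather than predicted by
# convolutions. The normalization uses the biased variance and eps=1e-5, which
# are the defaults of nn.InstanceNorm1d.

```python
import math


def instance_norm(row, eps=1e-5):
    """Normalize one channel of one sample to zero mean and (near) unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def tfan_modulate(row, gamma, beta):
    """Apply out = normalized * (1 + gamma) + beta element by element."""
    normalized = instance_norm(row)
    return [n * (1 + g) + b for n, g, b in zip(normalized, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0]

# With gamma = 0 and beta = 0 the modulation is the identity on the normalized input.
print(tfan_modulate(row, [0.0] * 4, [0.0] * 4))

# gamma = 1 doubles the normalized values and beta = 0.5 shifts them.
print(tfan_modulate(row, [1.0] * 4, [0.5] * 4))
```

# The (1 + gamma) form means that a gamma prediction of zero leaves the
# normalized activations unscaled.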


================================================
FILE: train.py
================================================