Repository: svc-develop-team/so-vits-svc
Branch: 4.1-Stable
Commit: 730930d337d1
Files: 147
Total size: 995.4 KB
Directory structure:
gitextract_7qli04ux/
├── .gitattributes
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── ask_for_help.yaml
│   │   ├── ask_for_help_en_US.yaml
│   │   ├── bug_report.yaml
│   │   ├── bug_report_en_US.yaml
│   │   ├── config.yml
│   │   └── default.md
│   └── workflows/
│       ├── reviewdog.yml
│       └── ruff.yml
├── .gitignore
├── .ruff.toml
├── LICENSE
├── README.md
├── README_zh_CN.md
├── cluster/
│   ├── __init__.py
│   ├── kmeans.py
│   └── train_cluster.py
├── compress_model.py
├── configs/
│   └── diffusion.yaml
├── configs_template/
│   ├── config_template.json
│   ├── config_tiny_template.json
│   └── diffusion_template.yaml
├── data_utils.py
├── diffusion/
│   ├── __init__.py
│   ├── data_loaders.py
│   ├── diffusion.py
│   ├── diffusion_onnx.py
│   ├── dpm_solver_pytorch.py
│   ├── how to export onnx.md
│   ├── infer_gt_mel.py
│   ├── logger/
│   │   ├── __init__.py
│   │   ├── saver.py
│   │   └── utils.py
│   ├── onnx_export.py
│   ├── solver.py
│   ├── uni_pc.py
│   ├── unit2mel.py
│   ├── vocoder.py
│   └── wavenet.py
├── edgetts/
│   ├── tts.py
│   └── tts_voices.py
├── export_index_for_onnx.py
├── flask_api.py
├── flask_api_full_song.py
├── inference/
│   ├── __init__.py
│   ├── infer_tool.py
│   ├── infer_tool_grad.py
│   └── slicer.py
├── inference_main.py
├── models.py
├── modules/
│   ├── DSConv.py
│   ├── F0Predictor/
│   │   ├── CrepeF0Predictor.py
│   │   ├── DioF0Predictor.py
│   │   ├── F0Predictor.py
│   │   ├── FCPEF0Predictor.py
│   │   ├── HarvestF0Predictor.py
│   │   ├── PMF0Predictor.py
│   │   ├── RMVPEF0Predictor.py
│   │   ├── __init__.py
│   │   ├── crepe.py
│   │   ├── fcpe/
│   │   │   ├── __init__.py
│   │   │   ├── model.py
│   │   │   ├── nvSTFT.py
│   │   │   └── pcmer.py
│   │   └── rmvpe/
│   │       ├── __init__.py
│   │       ├── constants.py
│   │       ├── deepunet.py
│   │       ├── inference.py
│   │       ├── model.py
│   │       ├── seq.py
│   │       ├── spec.py
│   │       └── utils.py
│   ├── __init__.py
│   ├── attentions.py
│   ├── commons.py
│   ├── enhancer.py
│   ├── losses.py
│   ├── mel_processing.py
│   └── modules.py
├── onnx_export.py
├── onnx_export_old.py
├── onnxexport/
│   ├── model_onnx.py
│   └── model_onnx_speaker_mix.py
├── preprocess_flist_config.py
├── preprocess_hubert_f0.py
├── requirements.txt
├── requirements_onnx_encoder.txt
├── requirements_win.txt
├── resample.py
├── sovits4_for_colab.ipynb
├── spkmix.py
├── train.py
├── train_diff.py
├── train_index.py
├── utils.py
├── vdecoder/
│   ├── __init__.py
│   ├── hifigan/
│   │   ├── env.py
│   │   ├── models.py
│   │   ├── nvSTFT.py
│   │   └── utils.py
│   ├── hifiganwithsnake/
│   │   ├── alias/
│   │   │   ├── __init__.py
│   │   │   ├── act.py
│   │   │   ├── filter.py
│   │   │   └── resample.py
│   │   ├── env.py
│   │   ├── models.py
│   │   ├── nvSTFT.py
│   │   └── utils.py
│   └── nsf_hifigan/
│       ├── env.py
│       ├── models.py
│       ├── nvSTFT.py
│       └── utils.py
├── vencoder/
│   ├── CNHubertLarge.py
│   ├── ContentVec256L12_Onnx.py
│   ├── ContentVec256L9.py
│   ├── ContentVec256L9_Onnx.py
│   ├── ContentVec768L12.py
│   ├── ContentVec768L12_Onnx.py
│   ├── ContentVec768L9_Onnx.py
│   ├── DPHubert.py
│   ├── HubertSoft.py
│   ├── HubertSoft_Onnx.py
│   ├── WavLMBasePlus.py
│   ├── WhisperPPG.py
│   ├── WhisperPPGLarge.py
│   ├── __init__.py
│   ├── dphubert/
│   │   ├── __init__.py
│   │   ├── components.py
│   │   ├── hardconcrete.py
│   │   ├── model.py
│   │   ├── pruning_utils.py
│   │   └── utils/
│   │       ├── __init__.py
│   │       └── import_huggingface_wavlm.py
│   ├── encoder.py
│   ├── hubert/
│   │   ├── __init__.py
│   │   ├── hubert_model.py
│   │   └── hubert_model_onnx.py
│   ├── wavlm/
│   │   ├── WavLM.py
│   │   └── modules.py
│   └── whisper/
│       ├── __init__.py
│       ├── audio.py
│       ├── decoding.py
│       ├── model.py
│       ├── tokenizer.py
│       └── utils.py
├── wav_upload.py
└── webUI.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitattributes
================================================
* text=auto eol=lf
================================================
FILE: .github/ISSUE_TEMPLATE/ask_for_help.yaml
================================================
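name: Ask for help
description: Encountered an error that you cannot resolve yourself
title: '[Help]: '
labels: [ "help wanted" ]
body:
  - type: markdown
    attributes:
      value: |
        #### Before asking, please try to solve the problem yourself, for example by reading the [repo wiki](https://github.com/svc-develop-team/so-vits-svc/wiki), or by using ChatGPT or search engines (Google / Bing / New Bing / StackOverflow, etc.). Open an issue only if you really cannot solve it yourself, and before doing so, please read [How To Ask Questions The Smart Way](https://github.com/ryanhanwu/How-To-Ask-Questions-The-Smart-Way/blob/main/README-zh_CN.md).
        ---
        ### What kind of issue will be closed immediately
        1. Free riders (asking without making any effort yourself)
        2. Anything related to one-click packages / environment packages
        3. Incomplete information
        4. Trivial problems, such as failing to run because of a missing dependency
        5. The dataset used is unlicensed (game characters / anime characters are not counted in this category for now, but you should still be careful when training. If the official party can be contacted, you must contact them and verify first)
        ---
  - type: checkboxes
    id: Clause
    attributes:
      label: Please check the confirmation boxes below.
      options:
        - label: "I have carefully read [README.md](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/README_zh_CN.md) and the [Quick solution in the wiki](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution)."
          required: true
        - label: "I have tried troubleshooting through various search engines, and the question I want to ask is not a common one."
          required: true
        - label: "I am NOT using a one-click package / environment package provided by a third-party user."
          required: true
  - type: markdown
    attributes:
      value: |
        # Please fill in the following information according to your actual environment
  - type: input
    id: System
    attributes:
      label: OS version
      description: On Windows run `winver` | on Linux run `uname -a`
    validations:
      required: true
  - type: input
    id: GPU
    attributes:
      label: GPU model
      description: Run `nvidia-smi`
    validations:
      required: true
  - type: input
    id: PythonVersion
    attributes:
      label: Python version
      description: Run `python -V`
    validations:
      required: true
  - type: input
    id: PyTorchVersion
    attributes:
      label: PyTorch version
      description: Run `pip show torch`
    validations:
      required: true
  - type: dropdown
    id: Branch
    attributes:
      label: sovits branch
      options:
        - 4.0(Default)
        - 4.0-v2
        - 3.0-32k
        - 3.0-48k
    validations:
      required: true
  - type: input
    id: DatasetSource
    attributes:
      label: Dataset source (used to judge dataset quality)
      description: For example, UVR-processed VTuber livestream audio or studio recordings
    validations:
      required: true
  - type: input
    id: WhereOccurs
    attributes:
      label: The stage where the problem occurs or the command you executed
      description: For example, preprocessing, training, or `python preprocess_hubert_f0.py`
    validations:
      required: true
  - type: textarea
    id: Description
    attributes:
      label: Problem description
      description: Describe your problem here, the more detailed the better
    validations:
      required: true
  - type: textarea
    id: Log
    attributes:
      label: Log
      description: Paste all output from running the command until it finishes (including the command itself) to [pastebin.com](https://pastebin.com/) and put the paste link here. Short logs can also be pasted directly below.
      render: python
    validations:
      required: true
  - type: textarea
    id: ValidOneClick
    attributes:
      label: Take screenshots of the `so-vits-svc` and `logs/44k` folders and paste them here
    validations:
      required: true
  - type: textarea
    id: Supplementary
    attributes:
      label: Additional notes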
================================================
FILE: .github/ISSUE_TEMPLATE/ask_for_help_en_US.yaml
================================================
name: Ask for help
description: Encountered an error that you cannot resolve yourself
title: '[Help]: '
labels: [ "help wanted" ]
body:
  - type: markdown
    attributes:
      value: |
        #### Please try to solve the problem yourself before asking for help. First, read the *[repo wiki](https://github.com/svc-develop-team/so-vits-svc/wiki)*. Then try ChatGPT or search engines such as Google, Bing, New Bing, and StackOverflow. Raise an issue only if you really cannot solve it yourself, and before you do, please read *[How To Ask Questions The Smart Way](http://www.catb.org/~esr/faqs/smart-questions.html)*.
        ---
        ### What kind of issue will be closed immediately
        1. Free riders (asking without making any effort yourself)
        2. One-click packages / environment packages (not installed via `pip install -r requirements.txt`)
        3. Incomplete information
        4. Trivial issues such as a missing dependency package
        5. Using an unlicensed dataset (game characters / anime characters are not included in this category for now, but you still need to be careful. If the official party can be contacted, you must contact them and verify first.)
        ---
  - type: checkboxes
    id: Clause
    attributes:
      label: Please check the checkboxes below.
      options:
        - label: "I have read *[README.md](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/README.md)* and *[Quick solution in wiki](https://github.com/svc-develop-team/so-vits-svc/wiki/Quick-solution)* carefully."
          required: true
        - label: "I have tried troubleshooting through various search engines, and the question I want to ask is not a common one."
          required: true
        - label: "I am NOT using a one-click package / environment package."
          required: true
  - type: markdown
    attributes:
      value: |
        # Please fill in the following information according to your actual environment
  - type: input
    id: System
    attributes:
      label: OS version
      description: Windows run `winver` | Linux run `uname -a`
    validations:
      required: true
  - type: input
    id: GPU
    attributes:
      label: GPU
      description: Run `nvidia-smi`
    validations:
      required: true
  - type: input
    id: PythonVersion
    attributes:
      label: Python version
      description: Run `python -V`
    validations:
      required: true
  - type: input
    id: PyTorchVersion
    attributes:
      label: PyTorch version
      description: Run `pip show torch`
    validations:
      required: true
  - type: dropdown
    id: Branch
    attributes:
      label: Branch of sovits
      options:
        - 4.0(Default)
        - 4.0-v2
        - 3.0-32k
        - 3.0-48k
    validations:
      required: true
  - type: input
    id: DatasetSource
    attributes:
      label: Dataset source (used to judge the dataset quality)
      description: Such as UVR-processed streaming audio / recorded in a recording studio
    validations:
      required: true
  - type: input
    id: WhereOccurs
    attributes:
      label: Where the problem occurs or what command you executed
      description: Such as Preprocessing / Training / `python preprocess_hubert_f0.py`
    validations:
      required: true
  - type: textarea
    id: Description
    attributes:
      label: Problem description
      description: Describe your problem here, the more detailed the better.
    validations:
      required: true
  - type: textarea
    id: Log
    attributes:
      label: Log
      description: All output from running the command until it finishes (including the command itself). Short logs can be pasted directly below.
      render: python
    validations:
      required: true
  - type: textarea
    id: ValidOneClick
    attributes:
      label: Take screenshots of the `so-vits-svc` and `logs/44k` folders and paste them here
    validations:
      required: true
  - type: textarea
    id: Supplementary
    attributes:
      label: Supplementary description
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.yaml
================================================
name: Bug report
description: Encountered a bug?!
title: '[Bug]: '
labels: [ "bug?" ]
body:
  - type: markdown
    attributes:
      value: |
        # Please fill in the following information according to your actual environment
  - type: input
    id: System
    attributes:
      label: OS version
      description: On Windows run `winver` | on Linux run `uname -a`
    validations:
      required: true
  - type: input
    id: GPU
    attributes:
      label: GPU model
      description: Run `nvidia-smi`
    validations:
      required: true
  - type: input
    id: PythonVersion
    attributes:
      label: Python version
      description: Run `python -V`
    validations:
      required: true
  - type: input
    id: PyTorchVersion
    attributes:
      label: PyTorch version
      description: Run `pip show torch`
    validations:
      required: true
  - type: dropdown
    id: Branch
    attributes:
      label: sovits branch
      options:
        - 4.0(Default)
        - 4.0-v2
        - 3.0-32k
        - 3.0-48k
    validations:
      required: true
  - type: input
    id: DatasetSource
    attributes:
      label: Dataset source (used to judge dataset quality)
      description: For example, UVR-processed VTuber livestream audio or studio recordings
    validations:
      required: true
  - type: input
    id: WhereOccurs
    attributes:
      label: The stage where the problem occurs or the command you executed
      description: For example, preprocessing, training, or `python preprocess_hubert_f0.py`
    validations:
      required: true
  - type: textarea
    id: Description
    attributes:
      label: Situation description
      description: Describe the situation you encountered here, the more detailed the better
    validations:
      required: true
  - type: textarea
    id: Log
    attributes:
      label: Log
      description: Paste all output from running the command until it finishes (including the command itself) to [pastebin.com](https://pastebin.com/) and put the paste link here. Short logs can also be pasted directly below.
      render: python
    validations:
      required: true
  - type: textarea
    id: Supplementary
    attributes:
      label: Additional notes
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report_en_US.yaml
================================================
name: Bug report
description: Encountered a bug?!
title: '[Bug]: '
labels: [ "bug?" ]
body:
  - type: markdown
    attributes:
      value: |
        # Please fill in the following information according to your actual environment
  - type: input
    id: System
    attributes:
      label: OS version
      description: Windows run `winver` | Linux run `uname -a`
    validations:
      required: true
  - type: input
    id: GPU
    attributes:
      label: GPU
      description: Run `nvidia-smi`
    validations:
      required: true
  - type: input
    id: PythonVersion
    attributes:
      label: Python version
      description: Run `python -V`
    validations:
      required: true
  - type: input
    id: PyTorchVersion
    attributes:
      label: PyTorch version
      description: Run `pip show torch`
    validations:
      required: true
  - type: dropdown
    id: Branch
    attributes:
      label: Branch of sovits
      options:
        - 4.0(Default)
        - 4.0-v2
        - 3.0-32k
        - 3.0-48k
    validations:
      required: true
  - type: input
    id: DatasetSource
    attributes:
      label: Dataset source (used to judge the dataset quality)
      description: Such as UVR-processed streaming audio / recorded in a recording studio
    validations:
      required: true
  - type: input
    id: WhereOccurs
    attributes:
      label: Where the problem occurs or what command you executed
      description: Such as Preprocessing / Training / `python preprocess_hubert_f0.py`
    validations:
      required: true
  - type: textarea
    id: Description
    attributes:
      label: Situation description
      description: Describe your situation here, the more detailed the better.
    validations:
      required: true
  - type: textarea
    id: Log
    attributes:
      label: Log
      description: All output from running the command until it finishes (including the command itself). You can paste it to [pastebin.com](https://pastebin.com/) and put the short link here. Short logs can also be pasted directly below.
      render: python
    validations:
      required: true
  - type: textarea
    id: Supplementary
    attributes:
      label: Supplementary description
================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: false
contact_links:
  - name: Discussions
    url: https://github.com/svc-develop-team/so-vits-svc/discussions
    about: For simple questions or discussions, please go to Discussions or raise a low-priority Default issue
================================================
FILE: .github/ISSUE_TEMPLATE/default.md
================================================
---
name: Default issue
about: If none of the templates matches the type of issue you want to raise, you can use this one, but it may be handled with lower priority.
title: ''
labels: 'not urgent'
assignees: ''
---
================================================
FILE: .github/workflows/reviewdog.yml
================================================
name: Ruff Autofix
on: [pull_request]
jobs:
  ruff:
    permissions:
      checks: write
      contents: read
      pull-requests: write
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: chartboost/ruff-action@v1
        with:
          args: --fix -e
      - uses: reviewdog/action-suggester@v1
        with:
          tool_name: ruff
================================================
FILE: .github/workflows/ruff.yml
================================================
name: Ruff
on: [push, pull_request]
jobs:
  ruff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: chartboost/ruff-action@v1
================================================
FILE: .gitignore
================================================
# Created by https://www.toptal.com/developers/gitignore/api/python
# Edit at https://www.toptal.com/developers/gitignore?templates=python
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
checkpoints/
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
pytestdebug.log
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
doc/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# End of https://www.toptal.com/developers/gitignore/api/python
/shelf/
/workspace.xml
dataset
dataset_raw
raw
results
inference/chunks_temp.json
logs
hubert/checkpoint_best_legacy_500.pt
configs/config.json
filelists/test.txt
filelists/train.txt
filelists/val.txt
.idea/
.vscode/
.idea/modules.xml
.idea/so-vits-svc.iml
.idea/vcs.xml
.idea/inspectionProfiles/profiles_settings.xml
.idea/inspectionProfiles/Project_Default.xml
pretrain/
.vscode/launch.json
trained/**/
================================================
FILE: .ruff.toml
================================================
select = ["E", "F", "I"]
# Never enforce `E501` (line length violations) or `E741` (ambiguous variable names).
ignore = ["E501", "E741"]
================================================
FILE: LICENSE
================================================
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU Affero General Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
our General Public Licenses are intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
Developers that use our General Public Licenses protect your rights
with two steps: (1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.
A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate. Many developers of free software are heartened and
encouraged by the resulting cooperation. However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public.
The GNU Affero General Public License is designed specifically to
ensure that, in such cases, the modified source code becomes available
to the community. It requires the operator of a network server to
provide the source code of the modified version running there to the
users of that server. Therefore, public use of a modified version, on
a publicly accessible server, gives the public access to the source
code of the modified version.
An older license, called the Affero General Public License and
published by Affero, was designed to accomplish similar goals. This is
a different license, not a version of the Affero GPL, but Affero has
released a new version of the Affero GPL which permits relicensing under
this license.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU Affero General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Remote Network Interaction; Use with the GNU General Public License.
Notwithstanding any other provision of this License, if you modify the
Program, your modified version must prominently offer all users
interacting with it remotely through a computer network (if your version
supports such interaction) an opportunity to receive the Corresponding
Source of your version by providing access to the Corresponding Source
from a network server at no charge, through some standard or customary
means of facilitating copying of software. This Corresponding Source
shall include the Corresponding Source for any work covered by version 3
of the GNU General Public License that is incorporated pursuant to the
following paragraph.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the work with which it is combined will remain governed by version
3 of the GNU General Public License.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU Affero General Public License from time to time. Such new versions
will be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU Affero General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU Affero General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU Affero General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source. For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code. There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for the
specific requirements.
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU AGPL, see
<https://www.gnu.org/licenses/>.
================================================
FILE: README.md
================================================
<div align="center">
<img alt="LOGO" src="https://avatars.githubusercontent.com/u/127122328?s=400&u=5395a98a4f945a3a50cb0cc96c2747505d190dbc&v=4" width="300" height="300" />
# SoftVC VITS Singing Voice Conversion
[**English**](./README.md) | [**中文简体**](./README_zh_CN.md)
[Open in Colab](https://colab.research.google.com/github/svc-develop-team/so-vits-svc/blob/4.1-Stable/sovits4_for_colab.ipynb)
[License](https://github.com/svc-develop-team/so-vits-svc/blob/4.1-Stable/LICENSE)
This round of limited-time updates is coming to an end, and the repository will enter the Archived state. Please take note.
</div>
> ✨ A studio with a visual f0 editor, a speaker mix timeline editor, and other features (it uses the Onnx models): [MoeVoiceStudio](https://github.com/NaruseMioShirakana/MoeVoiceStudio)
> ✨ A fork with a greatly improved user interface: [34j/so-vits-svc-fork](https://github.com/34j/so-vits-svc-fork)
> ✨ A client that supports real-time conversion: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
**This project differs fundamentally from VITS, as it focuses on Singing Voice Conversion (SVC) rather than Text-to-Speech (TTS). In this project, TTS functionality is not supported, and VITS is incapable of performing SVC tasks. It's important to note that the models used in these two projects are not interchangeable or universally applicable.**
## Announcement
The purpose of this project was to enable developers to have their beloved anime characters perform singing tasks. The developers' intention was to focus solely on fictional characters and avoid any involvement of real individuals. Anything related to real individuals deviates from the developers' original intention.
## Disclaimer
This project is an open-source, offline endeavor, and all members of SvcDevelopTeam, as well as other developers and maintainers involved (hereinafter referred to as contributors), have no control over the project. The contributors have never provided any form of assistance to any organization or individual, including but not limited to dataset extraction, dataset processing, computing support, training support, inference, and so on. The contributors do not and cannot be aware of the purposes for which users utilize the project. Therefore, any AI models and synthesized audio produced through the training of this project are unrelated to the contributors. Any issues or consequences arising from their use are the sole responsibility of the user.
This project runs completely offline and does not collect any user information or gather user input data. Contributors to this project are therefore not aware of any user input or models, and are not responsible for any user input.
This project serves as a framework only and does not possess speech synthesis functionality by itself. All functionalities require users to train the models independently. Furthermore, this project does not come bundled with any models, and any secondary distributed projects are independent of the contributors of this project.
## 📏 Terms of Use
# Warning: Please ensure that you address any authorization issues related to the dataset on your own. You bear full responsibility for any problems arising from the usage of non-authorized datasets for training, as well as any resulting consequences. The repository and its maintainer, svc develop team, disclaim any association with or liability for the consequences.
1. This project is exclusively established for academic purposes, aiming to facilitate communication and learning. It is not intended for deployment in production environments.
2. Any sovits-based video posted to a video platform must clearly specify in the introduction the input source vocals and audio used for the voice changer conversion, e.g., if you use someone else's video/audio and convert it by separating the vocals as the input source, you must give a clear link to the original video or music; if you use your own vocals or a voice synthesized by another voice synthesis engine as the input source, you must also state this in your introduction.
3. You are solely responsible for any infringement issues caused by the input source and all consequences. When using other commercial vocal synthesis software as an input source, please ensure that you comply with the regulations of that software, noting that the regulations of many vocal synthesis engines explicitly state that they cannot be used to convert input sources!
4. Engaging in illegal activities, as well as religious and political activities, is strictly prohibited when using this project. The project developers vehemently oppose the aforementioned activities. If you disagree with this provision, the usage of the project is prohibited.
5. If you continue to use the program, you will be deemed to have agreed to the terms and conditions set forth in the README. The README has fulfilled its duty to advise you and is not responsible for any subsequent problems.
6. If you intend to employ this project for any other purposes, kindly contact and inform the maintainers of this repository in advance.
## 📝 Model Introduction
The singing voice conversion model uses SoftVC content encoder to extract speech features from the source audio. These feature vectors are directly fed into VITS without the need for conversion to a text-based intermediate representation. As a result, the pitch and intonations of the original audio are preserved. Meanwhile, the vocoder was replaced with [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of sound interruption.
### 🆕 4.1-Stable Version Update Content
- Feature input is changed to the 12th layer of the [Content Vec](https://github.com/auspicious3000/contentvec) Transformer output, while remaining compatible with the 4.0 branches.
- Updated shallow diffusion: you can use the shallow diffusion model to improve the sound quality.
- Added Whisper-PPG encoder support
- Added static/dynamic sound fusion
- Added loudness embedding
- Added feature retrieval functionality from [RVC](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI)
### 🆕 Questions about compatibility with the 4.0 model
- To support the 4.0 model and incorporate the speech encoder, you can make modifications to the `config.json` file. Add the `speech_encoder` field to the "model" section as shown below:
```
"model": {
.........
"ssl_dim": 256,
"n_speakers": 200,
"speech_encoder":"vec256l9"
}
```
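The same edit can be scripted instead of done by hand. The following is a minimal sketch, assuming the 4.0 config is a plain JSON file with a "model" section as shown above; the helper name `add_speech_encoder` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def add_speech_encoder(config_path, encoder="vec256l9"):
    """Add a `speech_encoder` field to the "model" section if it is missing.

    Hypothetical helper for illustration; returns True when the file changed.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    model = config.setdefault("model", {})
    # Leave an existing value untouched so newer configs are not overwritten.
    changed = "speech_encoder" not in model
    if changed:
        model["speech_encoder"] = encoder
        path.write_text(
            json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    return changed
```

Call it with the path to your own 4.0 config file, for example `add_speech_encoder("path/to/config.json")`. The path here is a placeholder.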
### 🆕 Shallow diffusion

## 💬 Python Version
Based on our testing, the project runs stably on `Python 3.8.9`.
## 📥 Pre-trained Model Files
#### **Required**
**You need to select one encoder from the list below**
##### **1. If contentvec is used as the speech encoder (recommended)**
`vec768l12` and `vec256l9` require the encoder
- ContentVec: [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr)
- Place it under the `pretrain` directory
Or download the following ContentVec, which is only 199MB in size but has the same effect:
- ContentVec: [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt)
- Change the file name to `checkpoint_best_legacy_500.pt` and place it in the `pretrain` directory
```shell
# contentvec
mkdir -p pretrain
# -O takes the full output path; combining -P with -O would ignore the -P directory
wget -O pretrain/checkpoint_best_legacy_500.pt https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt
# Alternatively, you can manually download it and place it in the pretrain directory
```
##### **2. If hubertsoft is used as the speech encoder**
- soft vc hubert: [hubert-soft-0d54a1f4.pt](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt)
- Place it under the `pretrain` directory
##### **3. If whisper-ppg is used as the encoder**
- Download the model from [medium.pt](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt); this model fits `whisper-ppg`
- Or download the model from [large-v2.pt](https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt); this model fits `whisper-ppg-large`
- Place it under the `pretrain` directory
##### **4. If cnhubertlarge is used as the encoder**
- Download the model from [chinese-hubert-large-fairseq-ckpt.pt](https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt)
- Place it under the `pretrain` directory
##### **5. If dphubert is used as the encoder**
- Download the model from [DPHuBERT-sp0.75.pth](https://huggingface.co/pyf98/DPHuBERT/resolve/main/DPHuBERT-sp0.75.pth)
- Place it under the `pretrain` directory
##### **6. If WavLM is used as the encoder**
- download model at [WavLM-Base+.pt](https://valle.blob.core.windows.net/share/wavlm/WavLM-Base+.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D), the model fits `wavlmbase+`
- Place it under the `pretrain` directory
##### **7. If OnnxHubert/ContentVec is used as the encoder**
- Download the model from [MoeSS-SUBModel](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel/tree/main)
- Place it under the `pretrain` directory
#### **List of Encoders**
- "vec768l12"
- "vec256l9"
- "vec256l9-onnx"
- "vec256l12-onnx"
- "vec768l9-onnx"
- "vec768l12-onnx"
- "hubertsoft-onnx"
- "hubertsoft"
- "whisper-ppg"
- "cnhubertlarge"
- "dphubert"
- "whisper-ppg-large"
- "wavlmbase+"
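
Before training, it can help to confirm that the checkpoint for the chosen encoder is actually in place. The sketch below maps encoder names to the file names that appear in the download links above. It assumes the files keep those names inside `pretrain/`, and it leaves out the `-onnx` variants because their file names are not given here. The mapping and the function name `missing_encoder_file` are illustrative, not part of the project.

```python
from pathlib import Path

# Encoder name -> checkpoint file name, taken from the download links above.
# Assumption: files keep their download names inside `pretrain/`.
ENCODER_FILES = {
    "vec768l12": "checkpoint_best_legacy_500.pt",
    "vec256l9": "checkpoint_best_legacy_500.pt",
    "hubertsoft": "hubert-soft-0d54a1f4.pt",
    "whisper-ppg": "medium.pt",
    "whisper-ppg-large": "large-v2.pt",
    "cnhubertlarge": "chinese-hubert-large-fairseq-ckpt.pt",
    "dphubert": "DPHuBERT-sp0.75.pth",
    "wavlmbase+": "WavLM-Base+.pt",
}


def missing_encoder_file(encoder, pretrain_dir="pretrain"):
    """Return the expected checkpoint path if it is absent, else None."""
    if encoder not in ENCODER_FILES:
        raise KeyError(f"no known checkpoint name for encoder {encoder!r}")
    path = Path(pretrain_dir) / ENCODER_FILES[encoder]
    return None if path.is_file() else path
```

For example, `missing_encoder_file("vec768l12")` returns the missing path when `pretrain/checkpoint_best_legacy_500.pt` has not been downloaded yet, and `None` once it exists.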
#### **Optional (Strongly recommended)**
- Pre-trained model files: `G_0.pth` `D_0.pth`
- Place them under the `logs/44k` directory
- Diffusion model pretraining base model file: `model_0.pt`
- Put it in the `logs/44k/diffusion` directory
Get the Sovits pre-trained model from svc-develop-team (TBD) or elsewhere.
The diffusion model is based on the [Diffusion-SVC](https://github.com/CNChTu/Diffusion-SVC) diffusion model, and the pre-trained diffusion model is interchangeable with DDSP-SVC's. You can get the pre-trained diffusion model from the [Diffusion-SVC](https://github.com/CNChTu/Diffusion-SVC) repo.
While the pretrained model typically does not pose copyright concerns, it is essential to remain vigilant. It is advisable to consult with the author beforehand or carefully review the description to ascertain the permissible usage of the model. This helps ensure compliance with any specified guidelines or restrictions regarding its utilization.
#### **Optional (Select as Required)**
##### NSF-HIFIGAN
If you are using the `NSF-HIFIGAN enhancer` or `shallow diffusion`, you will need to download the pre-trained NSF-HIFIGAN model.
- Pre-trained NSF-HIFIGAN Vocoder: [nsf_hifigan_20221211.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip)
- Unzip and place the four files under the `pretrain/nsf_hifigan` directory
```shell
# nsf_hifigan
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
# Alternatively, you can manually download and place it in the pretrain/nsf_hifigan directory
# URL: https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
```
##### RMVPE
If you are using the `rmvpe` F0 Predictor, you will need to download the pre-trained RMVPE model.
+ Download the model from [rmvpe.zip](https://github.com/yxlllc/RMVPE/releases/download/230917/rmvpe.zip); this weight is recommended.
+ Unzip `rmvpe.zip`, rename the `model.pt` file to `rmvpe.pt`, and place it under the `pretrain` directory.
- ~~download model at [rmvpe.pt](https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/rmvpe.pt)~~
- ~~Place it under the `pretrain` directory~~
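
The unzip-and-rename step for the recommended weight can be sketched in a few lines of Python. This assumes the archive contains a file named `model.pt`, as described above, possibly inside a subfolder. The function name `install_rmvpe` is hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def install_rmvpe(zip_path, pretrain_dir="pretrain"):
    """Extract `model.pt` from rmvpe.zip and store it as `pretrain/rmvpe.pt`."""
    target = Path(pretrain_dir) / "rmvpe.pt"
    target.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the weight is named model.pt, possibly inside a subfolder.
        members = [n for n in archive.namelist() if Path(n).name == "model.pt"]
        if not members:
            raise FileNotFoundError("model.pt not found in archive")
        with archive.open(members[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Calling `install_rmvpe("rmvpe.zip")` from the repository root would then leave the weight at `pretrain/rmvpe.pt`.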
##### FCPE (Preview version)
[FCPE (Fast Context-base Pitch Estimator)](https://github.com/CNChTu/MelPE) is a dedicated F0 predictor designed for real-time voice conversion. It will become the preferred F0 predictor for sovits real-time voice conversion in the future. (The paper is being written.)
If you are using the `fcpe` F0 Predictor, you will need to download the pre-trained FCPE model.
- download model at [fcpe.pt](https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/fcpe.pt)
- Place it under the `pretrain` directory
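Before preprocessing, it can help to confirm that the optional files listed above are where the scripts expect them. The following is a minimal sketch, not part of the repository: `OPTIONAL_ASSETS` and `missing_assets` are hypothetical names, and the check only looks at the paths named in this section (it does not verify file contents or the names of the four vocoder files).

```python
from pathlib import Path

# Paths taken from the instructions above; the mapping itself is illustrative.
OPTIONAL_ASSETS = {
    "nsf_hifigan": "pretrain/nsf_hifigan",  # directory holding the unzipped vocoder files
    "rmvpe": "pretrain/rmvpe.pt",
    "fcpe": "pretrain/fcpe.pt",
}


def missing_assets(features, root="."):
    """Return the expected paths that are absent for the requested features."""
    missing = []
    for name in features:
        path = Path(root) / OPTIONAL_ASSETS[name]
        if path.is_dir():
            # A directory counts as present only if it contains at least one entry.
            present = any(path.iterdir())
        else:
            present = path.is_file()
        if not present:
            missing.append(str(path))
    return missing
```

For example, `missing_assets(["rmvpe", "nsf_hifigan"])` run from the project root returns an empty list when both assets are in place.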
## 📊 Dataset Preparation
Simply place the dataset in the `dataset_raw` directory with the following file structure:
```
dataset_raw
├───speaker0
│ ├───xxx1-xxx1.wav
│ ├───...
│ └───Lxx-0xx8.wav
└───speaker1
├───xx2-0xxx2.wav
├───...
└───xxx7-xxx007.wav
```
There are no specific restrictions on the format of the name for each audio file (naming conventions such as `000001.wav` to `999999.wav` are also valid), but the file type must be `WAV`.
You can customize the speaker's name as shown below:
```
dataset_raw
└───suijiSUI
├───1.wav
├───...
└───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
```
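A quick way to verify the layout is to list each speaker folder and flag anything that is not a WAV file. This is a hedged sketch; `scan_dataset` is a hypothetical helper, not a script shipped with the project, and it only inspects file extensions.

```python
from pathlib import Path


def scan_dataset(root="dataset_raw"):
    """Map each speaker folder to (wav_count, sorted names of non-WAV files)."""
    report = {}
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            continue  # stray files at the top level are ignored here
        files = [p for p in speaker_dir.iterdir() if p.is_file()]
        wavs = [p for p in files if p.suffix.lower() == ".wav"]
        others = sorted(p.name for p in files if p.suffix.lower() != ".wav")
        report[speaker_dir.name] = (len(wavs), others)
    return report
```

Any speaker whose second tuple element is non-empty contains files that should be converted to WAV or removed.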
## 🛠️ Preprocessing
### 0. Slice audio
To avoid running out of video memory during training or preprocessing, limit the length of the audio clips. Clips of 5 to 15 seconds are recommended. Slightly longer clips are acceptable, but excessively long clips may cause errors such as `torch.cuda.OutOfMemoryError`.
To facilitate the slicing process, you can use [audio-slicer-GUI](https://github.com/flutydeer/audio-slicer) or [audio-slicer-CLI](https://github.com/openvpi/audio-slicer).
In general, only the `Minimum Interval` needs to be adjusted. For spoken audio, the default value usually suffices, while for singing audio, it can be adjusted to around `100` or even `50`, depending on the specific requirements.
After slicing, it is recommended to remove any audio clips that are excessively long or too short.
**If you are using the whisper-ppg encoder for training, the audio clips must be shorter than 30s.**
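Finding clips that are too long or too short can be automated. The sketch below uses only Python's standard `wave` module, which reads PCM WAV headers (it will raise on other encodings such as 32-bit float WAV). The function names and the 5 to 15 second defaults are illustrative; pass `max_s=30.0` to check the whisper-ppg limit instead.

```python
import wave
from pathlib import Path


def clip_seconds(path):
    """Duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clips_outside(root, min_s=5.0, max_s=15.0):
    """Return (path, seconds) for clips shorter than min_s or longer than max_s."""
    flagged = []
    for path in sorted(Path(root).rglob("*.wav")):
        seconds = clip_seconds(path)
        if seconds < min_s or seconds > max_s:
            flagged.append((str(path), round(seconds, 2)))
    return flagged
```

The returned list can be reviewed by hand before deleting anything.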
### 1. Resample to 44100Hz and mono
```shell
python resample.py
```
#### Cautions
Although this project provides the `resample.py` script for resampling, conversion to mono, and loudness matching, the default loudness matching targets 0 dB, which can damage the sound quality. Python's loudness matching package pyloudnorm does not apply a limiter, so the result can clip. It is therefore recommended to use professional audio software, such as `adobe audition`, for loudness matching. If you have already matched loudness with other software, add the parameter `--skip_loudnorm` to the run command:
```shell
python resample.py --skip_loudnorm
```
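To see why matching to a high loudness target without a limiter can clip, consider the arithmetic of a plain static gain. This is a simplified sketch under the assumption that normalization applies one constant gain to the whole file; the function names are hypothetical and the numbers are only an example.

```python
def peak_after_gain(peak_dbfs, measured_loudness_db, target_loudness_db):
    """Peak level (dBFS) after applying the gain that moves loudness to the target."""
    gain_db = target_loudness_db - measured_loudness_db
    return peak_dbfs + gain_db


def will_clip(peak_dbfs, measured_loudness_db, target_loudness_db):
    """True if the normalized peak would exceed full scale (0 dBFS)."""
    return peak_after_gain(peak_dbfs, measured_loudness_db, target_loudness_db) > 0.0
```

A clip measured at -20 dB loudness with a -3 dBFS peak needs +20 dB of gain to reach a 0 dB target, which would put its peak at +17 dBFS, far above full scale.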
### 2. Automatically split the dataset into training and validation sets, and generate configuration files.
```shell
python preprocess_flist_config.py --speech_encoder vec768l12
```
`speech_encoder` accepts the following options:
```
vec768l12
vec256l9
hubertsoft
whisper-ppg
cnhubertlarge
dphubert
whisper-ppg-large
wavlmbase+
```
If the speech_encoder argument is omitted, the default value is `vec768l12`
**Use loudness embedding**
Add `--vol_aug` if you want to enable loudness embedding:
```shell
python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug
```
After enabling loudness embedding, the trained model will match the loudness of the input source; otherwise, it will match the loudness of the training set.
#### You can modify some parameters in the generated config.json and diffusion.yaml
##### config.json
* `keep_ckpts`: The number of most recent checkpoints to keep during training. Set to `0` to keep them all. Default is `3`.
* `all_in_mem`: Load the entire dataset into RAM. It can be enabled when the disk IO of some platforms is too slow and the system memory is **much larger** than your dataset.
* `batch_size`: The amount of data loaded to the GPU for a single training step. Adjust it so that it fits within the GPU memory capacity.
* `vocoder_name`: Select a vocoder. The default is `nsf-hifigan`.
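These values can be edited by hand or with a small script. The sketch below is an illustration only: `set_key` and `update_config` are hypothetical helpers, and because this section does not state where each key sits inside config.json, the code searches the nested structure for the key name instead of assuming a location.

```python
import json


def set_key(node, key, value):
    """Set every occurrence of `key` in a nested structure; return the update count."""
    updated = 0
    if isinstance(node, dict):
        for k in node:
            if k == key:
                node[k] = value
                updated += 1
            else:
                updated += set_key(node[k], key, value)
    elif isinstance(node, list):
        for item in node:
            updated += set_key(item, key, value)
    return updated


def update_config(path, **overrides):
    """Apply overrides to an existing JSON config and write it back."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    for key, value in overrides.items():
        if set_key(cfg, key, value) == 0:
            raise KeyError(f"{key!r} not found in {path}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2, ensure_ascii=False)
    return cfg
```

For example, `update_config("configs/config.json", keep_ckpts=0, batch_size=4)` keeps every checkpoint and lowers the batch size; a misspelled key raises `KeyError` rather than silently adding a new entry.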
##### diffusion.yaml
* `cache_all_data`: Load the entire dataset into RAM. It can be enabled when the disk IO of some platforms is too slow and the system memory is **much larger** than your dataset.
* `duration`: The duration of the audio slices used during training. It can be adjusted according to the size of the video memory. **Note: this value must be less than the duration of the shortest audio clip in the training set!**
* `batch_size`: The amount of data loaded to the GPU for a single training step. Adjust it so that it fits within the video memory capacity.
* `timesteps`: The total number of steps in the diffusion model, which defaults to 1000.
* `k_step_max`: To save training time, only the first `k_step_max` diffusion steps can be trained. The value must be less than `timesteps`; `0` trains the entire diffusion model. **Note: if you do not train the entire diffusion model, you will not be able to use only_diffusion!**
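The constraints stated in the list above can be checked before starting a long training run. This is a minimal sketch with hypothetical function names; the values are passed in directly rather than read from diffusion.yaml, so it makes no assumption about that file's layout.

```python
def check_diffusion_settings(timesteps, k_step_max, duration, shortest_clip_s):
    """Return a list of human-readable problems; an empty list means the rules hold."""
    problems = []
    # k_step_max is either 0 (train the whole model) or strictly below timesteps.
    if k_step_max != 0 and not (0 < k_step_max < timesteps):
        problems.append("k_step_max must be 0 or a positive value smaller than timesteps")
    # duration must be shorter than the shortest clip in the training set.
    if duration >= shortest_clip_s:
        problems.append("duration must be shorter than the shortest training clip")
    return problems


def supports_only_diffusion(k_step_max):
    """only_diffusion needs the entire diffusion model, i.e. k_step_max == 0."""
    return k_step_max == 0
```

With `timesteps=1000`, `k_step_max=100`, `duration=2.0` and a shortest clip of 5 seconds, the list is empty, but `supports_only_diffusion(100)` is `False`.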
##### **List of Vocoders**
```
nsf-hifigan
nsf-snake-hifigan
```
### 3. Generate hubert and f0
```shell
python preprocess_hubert_f0.py --f0_predictor dio
```
f0_predictor has the following options
```
crepe
dio
pm
harvest
rmvpe
fcpe
```
If the training set is too noisy, it is recommended to use `crepe` to handle f0.
If the `f0_predictor` parameter is omitted, the default value is `rmvpe`.
If you want shallow diffusion (optional), you need to add the `--use_diff` parameter, for example:
```shell
python preprocess_hubert_f0.py --f0_predictor dio --use_diff
```
**Speed Up preprocess**
If your dataset is large, you can increase the `--num_processes` parameter, for example:
```shell
python preprocess_hubert_f0.py --f0_predictor dio --num_processes 8
```
If you have more than one GPU, the workers will be assigned to different GPUs.
After completing the above steps, the `dataset` directory will contain the preprocessed data, and the `dataset_raw` folder can be deleted.
## 🏋️ Training
### Sovits Model
```shell
python train.py -c configs/config.json -m 44k
```
### Diffusion Model (optional)
If the shallow diffusion function is needed, the diffusion model needs to be trained. The diffusion model training method is as follows:
```shell
python train_diff.py -c configs/diffusion.yaml
```
During training, the model files will be saved to `logs/44k`, and the diffusion model will be saved to `logs/44k/diffusion`
## 🤖 Inference
Use [inference_main.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/inference_main.py)
```shell
# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
```
Required parameters:
- `-m` | `--model_path`: path to the model.
- `-c` | `--config_path`: path to the configuration file.
- `-n` | `--clean_names`: a list of wav file names located in the `raw` folder.
- `-t` | `--trans`: pitch shift, supports positive and negative (semitone) values.
- `-s` | `--spk_list`: Select the speaker ID to use for conversion.
- `-cl` | `--clip`: Forced audio clipping. Set to 0 to disable it (default), or to a non-zero value (duration in seconds) to enable it.
Optional parameters (see the next section for details):
- `-lg` | `--linear_gradient`: The cross fade length of two audio slices in seconds. If there is a discontinuous voice after forced slicing, you can adjust this value. Otherwise, it is recommended to use the default value of 0.
- `-f0p` | `--f0_predictor`: Select an F0 predictor. Options are `crepe`, `pm`, `dio`, `harvest`, `rmvpe`, and `fcpe`; the default is `pm`. (Note: f0 mean pooling is enabled when using `crepe`.)
- `-a` | `--auto_predict_f0`: Automatic pitch prediction. Do not enable this when converting singing voices, as it can cause serious pitch issues.
- `-cm` | `--cluster_model_path`: Path to the cluster model or feature retrieval index. If left blank, it is set automatically to the default path of these models. If you have not trained a cluster model or feature retrieval index, any value will do.
- `-cr` | `--cluster_infer_ratio`: The proportion of the clustering scheme or feature retrieval, ranging from 0 to 1. If you have not trained a cluster model or feature retrieval index, leave it at the default of 0.
- `-eh` | `--enhance`: Whether to use the NSF_HIFIGAN enhancer. It can improve sound quality for some models trained on small datasets, but it degrades well-trained models, so it is disabled by default.
- `-shd` | `--shallow_diffusion`: Whether to use shallow diffusion, which can resolve some electronic-sounding artifacts. This option is disabled by default. When it is enabled, the NSF_HIFIGAN enhancer is disabled.
- `-usm` | `--use_spk_mix`: Whether to use dynamic timbre mixing.
- `-lea` | `--loudness_envelope_adjustment`: The ratio at which the input source's loudness envelope is fused with the output loudness envelope. The closer to 1, the more the output loudness envelope is used.
- `-fr` | `--feature_retrieval`: Whether to use feature retrieval. When it is enabled, the clustering model is disabled, and the `cm` and `cr` parameters become the index path and mixing ratio of feature retrieval.
Shallow diffusion settings:
- `-dm` | `--diffusion_model_path`: Diffusion model path
- `-dc` | `--diffusion_config_path`: Diffusion config file path
- `-ks` | `--k_step`: The number of shallow diffusion steps. The larger the value, the closer the result is to that of the diffusion model. The default is 100.
- `-od` | `--only_diffusion`: Whether to use only-diffusion mode, which does not load the sovits model and runs inference with the diffusion model alone.
- `-se` | `--second_encoding`: Apply an additional encoding to the original audio before shallow diffusion. The results vary: sometimes better, sometimes worse.
### Cautions
When running inference with the `whisper-ppg` speech encoder, you need to set `--clip` to 25 and `-lg` to 1. Otherwise inference will not work properly.
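When inference is scripted, the argument list can be assembled in one place so that the whisper-ppg rule above is not forgotten. The sketch below only builds the list and uses only the flags documented in this section; `build_inference_command` and its `speech_encoder` argument are hypothetical conveniences, not part of `inference_main.py`.

```python
import sys


def build_inference_command(model_path, config_path, clean_names, trans, speakers,
                            clip=0, linear_gradient=0, speech_encoder=None):
    """Build an argument list for inference_main.py from the documented flags."""
    if speech_encoder == "whisper-ppg":
        # The caution above: whisper-ppg needs --clip 25 and -lg 1.
        clip, linear_gradient = 25, 1
    return [
        sys.executable, "inference_main.py",
        "-m", model_path,
        "-c", config_path,
        "-n", *clean_names,
        "-t", str(trans),
        "-s", *speakers,
        "-cl", str(clip),
        "-lg", str(linear_gradient),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the project root. Passing a list avoids shell quoting problems with file names such as `君の知らない物語-src.wav`.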
## 🤔 Optional Settings
If you are satisfied with the previous results, or if you do not feel you understand what follows, you can skip it and it will have no effect on the use of the model. The impact of these optional settings mentioned is relatively small, and while they may have some impact on specific datasets, in most cases the difference may not be significant.
### Automatic f0 prediction
During the training of the 4.0 model, an f0 predictor is also trained, which enables automatic pitch prediction during voice conversion. If the results are not satisfactory, manual pitch prediction can be used instead. When converting singing voices, do not enable this feature, as it may cause significant pitch shifting.
- Enable `auto_predict_f0` in `inference_main.py` (the `-a` | `--auto_predict_f0` option).
### Cluster-based timbre leakage control
Introduction: The clustering scheme implemented in this model aims to reduce timbre leakage and make the trained model sound more like the target's timbre, although the effect may not be very pronounced. Relying solely on clustering, however, can reduce the model's clarity and make it sound less distinct. A fusion method is therefore adopted to control the balance between the clustering and non-clustering approaches. This allows manual adjustment of the trade-off between "sounding like the target's timbre" and "having clear enunciation" to find an optimal balance.
No changes are required in the existing steps. Simply train an additional clustering model, which incurs relatively low training costs.
- Training process:
- Train on a machine with good CPU performance. In our experience, it takes about 4 minutes to train each speaker on a Tencent Cloud machine with a 6-core CPU.
- Execute `python cluster/train_cluster.py`. The output model will be saved in `logs/44k/kmeans_10000.pt`.
- The clustering model can also be trained on the GPU by executing `python cluster/train_cluster.py --gpu`
- Inference process:
- Specify `cluster_model_path` in `inference_main.py`. If not specified, the default is `logs/44k/kmeans_10000.pt`.
- Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using clustering at all, `1` means only using clustering, and usually `0.5` is sufficient.
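The role of `cluster_infer_ratio` can be pictured as a linear blend between the original content features and their nearest cluster centres. The sketch below assumes a plain linear interpolation, which matches the behaviour described for `0` and `1` but is not confirmed here as the project's exact code; `fuse` is a hypothetical name and the inputs are plain lists of floats.

```python
def fuse(features, cluster_centers, ratio):
    """Blend each feature with its cluster centre: ratio=0 keeps the feature, ratio=1 the centre."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return [ratio * c + (1.0 - ratio) * f for f, c in zip(features, cluster_centers)]
```

With `ratio=0.5`, each output value lies halfway between the feature and its centre, which is the suggested starting point.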
### Feature retrieval
Introduction: Like the clustering scheme, feature retrieval reduces timbre leakage. Its enunciation is slightly better than clustering, but it reduces inference speed. The fusion method makes it possible to linearly control the balance between feature retrieval and non-feature retrieval and to fine-tune the desired proportion.
- Training process:
After generating hubert and f0, run:
```shell
python train_index.py -c configs/config.json
```
The output will be saved to `logs/44k/feature_and_index.pkl`
- Inference process:
- Specify `--feature_retrieval` first; the clustering mode then switches automatically to feature retrieval mode.
- Specify `cluster_model_path` in `inference_main.py`. If not specified, the default is `logs/44k/feature_and_index.pkl`.
- Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using feature retrieval at all, `1` means only using feature retrieval, and usually `0.5` is sufficient.
## 🗜️ Model compression
The generated model contains data that is needed only for further training. If you are sure the model is final and will not be trained further, it is safe to remove this data to get a smaller file (about 1/3 of the original size).
```shell
# Example
python compress_model.py -c="configs/config.json" -i="logs/44k/G_30400.pth" -o="logs/44k/release.pth"
```
## 👨🔧 Timbre mixing
### Static Tone Mixing
**See the gadget/lab feature in `webUI.py` for static timbre mixing.**
Introduction: This function combines multiple models into one model (a convex combination or linear combination of the parameters of multiple models) to create mixed voices that do not exist in reality.
**Note:**
1. This feature is only supported for single-speaker models
2. If you force a multi-speaker model, it is critical to make sure there are the same number of speakers in each model. This will ensure that sounds with the same SpeakerID can be mixed correctly.
3. Ensure that the `model` fields in config.json of all models to be mixed are the same
4. The mixed model can use the config.json file of any of the models being mixed. However, the clustering model will not work after mixing.
5. When batch uploading models, it is best to put the models into a folder and upload them together after selecting them
6. A mixing ratio between 0 and 100 is suggested. Other values are possible, but they may produce unpredictable results in linear combination mode
7. After mixing, the file named output.pth will be saved in the root directory of the project
8. Convex combination mode applies Softmax so that the mix ratios sum to 1, while linear combination mode does not
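The difference between the two modes in item 8 can be sketched with plain numbers. This is an illustration under stated assumptions, not the code in `webUI.py`: models are represented as dictionaries of floats rather than tensors, `mix_parameters` is a hypothetical name, and any rescaling of the ratios before Softmax is not covered.

```python
import math


def softmax(values):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mix_parameters(models, ratios, convex=True):
    """Weighted sum of same-named parameters across models."""
    # Convex mode normalizes the ratios with softmax; linear mode uses them as given.
    weights = softmax(ratios) if convex else list(ratios)
    mixed = {}
    for name in models[0]:
        mixed[name] = sum(w * m[name] for w, m in zip(weights, models))
    return mixed
```

With equal ratios, convex mode yields the average of the two models, whereas linear mode with ratios `[1, 1]` yields their sum, which illustrates why linear mode can produce unexpected results.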
### Dynamic timbre mixing
**Refer to the `spkmix.py` file for an introduction to dynamic timbre mixing**
Rules for writing role mix tracks:
Role ID: \[\[Start time 1, end time 1, start value 1, end value 1], [Start time 2, end time 2, start value 2, end value 2]]
Each start time must be the same as the end time of the previous segment. The first start time must be 0, and the last end time must be 1 (time ranges from 0 to 1).
All roles must be filled in. For unused roles, fill in \[\[0., 1., 0., 0.]]
The fusion value can be chosen freely; it changes linearly from the start value to the end value within the specified period. Internally, the linear combination is automatically guaranteed to sum to 1 (the convex combination condition), so it can be used safely.
Use the `--use_spk_mix` parameter during inference to enable dynamic timbre mixing
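The track rules above can be validated and evaluated with a few lines of code. The sketch below is not taken from `spkmix.py`: the function names are hypothetical, each segment is assumed to hold four numbers (start time, end time, start value, end value), and dividing by the sum is one assumed way to satisfy the "sums to 1" condition.

```python
def validate_track(segments):
    """Raise ValueError if a role's track breaks the rules described above."""
    if not segments:
        raise ValueError("every role needs at least one segment")
    if any(len(seg) != 4 for seg in segments):
        raise ValueError("each segment needs start time, end time, start value, end value")
    if segments[0][0] != 0 or segments[-1][1] != 1:
        raise ValueError("the first start time must be 0 and the last end time must be 1")
    for prev, cur in zip(segments, segments[1:]):
        if cur[0] != prev[1]:
            raise ValueError("each start time must equal the previous end time")


def value_at(segments, t):
    """Linearly interpolate a role's fusion value at time t (0 <= t <= 1)."""
    for start, end, v0, v1 in segments:
        if start <= t <= end:
            if end == start:
                return v0
            return v0 + (v1 - v0) * (t - start) / (end - start)
    raise ValueError("t is outside the track")


def mix_at(tracks, t):
    """Per-role weights at time t, normalized so that they sum to 1."""
    raw = {role: value_at(segments, t) for role, segments in tracks.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all roles are 0 at this time")
    return {role: value / total for role, value in raw.items()}
```

For instance, with one role fading from 1 to 0 over the first half and another rising from 0 to 1 over the whole track, `mix_at` at `t = 0.25` gives raw values 0.5 and 0.25, normalized to 2/3 and 1/3.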
## 📤 Exporting to Onnx
Use [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py)
- Create a folder named `checkpoints` and open it
- Create a folder in the `checkpoints` folder as your project folder, naming it after your project, for example `aziplayer`
- Rename your model as `model.pth`, the configuration file as `config.json`, and place them in the `aziplayer` folder you just created
- In [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py), change `"NyaruTaffy"` in `path = "NyaruTaffy"` to your project name, for example `path = "aziplayer"` (`onnx_export_speaker_mix` lets you mix speakers' voices)
- Run [onnx_export.py](https://github.com/svc-develop-team/so-vits-svc/blob/4.0/onnx_export.py)
- Wait for it to finish running. A `model.onnx` will be generated in your project folder, which is the exported model.
Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently, they cannot be exported on their own (Hubert in fairseq has many unsupported operators and constant-related operations that can cause errors, or wrong input/output shapes and results, when exported).
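The folder preparation steps above can be scripted. This sketch is an assumption-laden convenience, not part of the repository: `prepare_export_folder` is a hypothetical helper, and it copies the files instead of renaming them so the originals in `logs/44k` stay untouched.

```python
import shutil
from pathlib import Path


def prepare_export_folder(model_file, config_file, project, root="checkpoints"):
    """Create checkpoints/<project>/ containing model.pth and config.json."""
    target = Path(root) / project
    target.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(model_file, target / "model.pth")
    shutil.copyfile(config_file, target / "config.json")
    return target
```

For example, `prepare_export_folder("logs/44k/G_30400.pth", "configs/config.json", "aziplayer")` produces the layout that the export script expects for a project named `aziplayer`.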
## 📎 Reference
| URL | Designation | Title | Implementation Source |
| --- | ----------- | ----- | --------------------- |
|[2106.06103](https://arxiv.org/abs/2106.06103) | VITS (Synthesizer)| Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech | [jaywalnut310/vits](https://github.com/jaywalnut310/vits) |
|[2111.02392](https://arxiv.org/abs/2111.02392) | SoftVC (Speech Encoder)| A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion | [bshall/hubert](https://github.com/bshall/hubert) |
|[2204.09224](https://arxiv.org/abs/2204.09224) | ContentVec (Speech Encoder)| ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers | [auspicious3000/contentvec](https://github.com/auspicious3000/contentvec) |
|[2212.04356](https://arxiv.org/abs/2212.04356) | Whisper (Speech Encoder) | Robust Speech Recognition via Large-Scale Weak Supervision | [openai/whisper](https://github.com/openai/whisper) |
|[2110.13900](https://arxiv.org/abs/2110.13900) | WavLM (Speech Encoder) | WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | [microsoft/unilm/wavlm](https://github.com/microsoft/unilm/tree/master/wavlm) |
|[2305.17651](https://arxiv.org/abs/2305.17651) | DPHubert (Speech Encoder) | DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models | [pyf98/DPHuBERT](https://github.com/pyf98/DPHuBERT) |
|[DOI:10.21437/Interspeech.2017-68](http://dx.doi.org/10.21437/Interspeech.2017-68) | Harvest (F0 Predictor) | Harvest: A high-performance fundamental frequency estimator from speech signals | [mmorise/World/harvest](https://github.com/mmorise/World/blob/master/src/harvest.cpp) |
|[aes35-000039](https://www.aes.org/e-lib/online/browse.cfm?elib=15165) | Dio (F0 Predictor) | Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech | [mmorise/World/dio](https://github.com/mmorise/World/blob/master/src/dio.cpp) |
|[8461329](https://ieeexplore.ieee.org/document/8461329) | Crepe (F0 Predictor) | Crepe: A Convolutional Representation for Pitch Estimation | [maxrmorrison/torchcrepe](https://github.com/maxrmorrison/torchcrepe) |
|[DOI:10.1016/j.wocn.2018.07.001](https://doi.org/10.1016/j.wocn.2018.07.001) | Parselmouth (F0 Predictor) | Introducing Parselmouth: A Python interface to Praat | [YannickJadoul/Parselmouth](https://github.com/YannickJadoul/Parselmouth) |
|[2306.15412v2](https://arxiv.org/abs/2306.15412v2) | RMVPE (F0 Predictor) | RMVPE: A Robust Model for Vocal Pitch Estimation in Polyphonic Music | [Dream-High/RMVPE](https://github.com/Dream-High/RMVPE) |
|[2010.05646](https://arxiv.org/abs/2010.05646) | HIFIGAN (Vocoder) | HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | [jik876/hifi-gan](https://github.com/jik876/hifi-gan) |
|[1810.11946](https://arxiv.org/abs/1810.11946.pdf) | NSF (Vocoder) | Neural source-filter-based waveform model for statistical parametric speech synthesis | [openvpi/DiffSinger/modules/nsf_hifigan](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) |
|[2006.08195](https://arxiv.org/abs/2006.08195) | Snake (Vocoder) | Neural Networks Fail to Learn Periodic Functions and How to Fix It | [EdwardDixon/snake](https://github.com/EdwardDixon/snake) |
|[2105.02446v3](https://arxiv.org/abs/2105.02446v3) | Shallow Diffusion (PostProcessing)| DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism | [CNChTu/Diffusion-SVC](https://github.com/CNChTu/Diffusion-SVC) |
|[K-means](https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=01D65490BADCC216F350D06F84D721AD?doi=10.1.1.308.8619&rep=rep1&type=pdf) | Feature K-means Clustering (PreProcessing)| Some methods for classification and analysis of multivariate observations | This repo |
| | Feature TopK Retrieval (PreProcessing)| Retrieval based Voice Conversion | [RVC-Project/Retrieval-based-Voice-Conversion-WebUI](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) |
| | whisper ppg| whisper ppg | [PlayVoice/whisper_ppg](https://github.com/PlayVoice/whisper_ppg) |
| | bigvgan| bigvgan | [PlayVoice/so-vits-svc-5.0](https://github.com/PlayVoice/so-vits-svc-5.0/tree/bigvgan-mix-v2/vits_decoder/alias) |
## ☀️ Previous contributors
For some reason, the author deleted the original repository. Because of an oversight by the organization members, the contributor list was cleared when all files were re-uploaded directly to this repository at the start of its reconstruction. A list of previous contributors has therefore been added to README.md.
*Some members are not listed, in accordance with their personal wishes.*
<table>
<tr>
<td align="center"><a href="https://github.com/MistEO"><img src="https://avatars.githubusercontent.com/u/18511905?v=4" width="100px;" alt=""/><br /><sub><b>MistEO</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/XiaoMiku01"><img src="https://avatars.githubusercontent.com/u/54094119?v=4" width="100px;" alt=""/><br /><sub><b>XiaoMiku01</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/ForsakenRei"><img src="https://avatars.githubusercontent.com/u/23041178?v=4" width="100px;" alt=""/><br /><sub><b>しぐれ</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/TomoGaSukunai"><img src="https://avatars.githubusercontent.com/u/25863522?v=4" width="100px;" alt=""/><br /><sub><b>TomoGaSukunai</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Plachtaa"><img src="https://avatars.githubusercontent.com/u/112609742?v=4" width="100px;" alt=""/><br /><sub><b>Plachtaa</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/zdxiaoda"><img src="https://avatars.githubusercontent.com/u/45501959?v=4" width="100px;" alt=""/><br /><sub><b>zd小达</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Archivoice"><img src="https://avatars.githubusercontent.com/u/107520869?v=4" width="100px;" alt=""/><br /><sub><b>凍聲響世</b></sub></a><br /></td>
</tr>
</table>
## 📚 Some legal provisions for reference
#### Any country, region, organization, or individual using this project must comply with the following laws.
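#### Civil Code of the People's Republic of China (《民法典》)
##### Article 1019
No organization or individual may infringe upon another person's right of portrait by vilifying or defacing the portrait, or by forging it by means of information technology. Without the consent of the portrait-right holder, no one may make, use, or publish the holder's portrait, unless otherwise provided by law. Without the consent of the portrait-right holder, the rights holder of a portrait work may not use or publish the portrait by publishing, reproducing, distributing, leasing, or exhibiting it. The protection of a natural person's voice is governed, by reference, by the relevant provisions on the protection of the right of portrait.
##### Article 1024
[Right of reputation] A civil-law subject enjoys the right of reputation. No organization or individual may infringe upon another's right of reputation by insult, defamation, or similar means.
##### Article 1027
[Works infringing the right of reputation] Where a literary or artistic work published by an actor depicts real people and real events or a specific person, and contains insulting or defamatory content that infringes another's right of reputation, the victim has the right to request, in accordance with law, that the actor bear civil liability. Where a literary or artistic work published by an actor does not depict a specific person, and only some of its plot resembles that person's circumstances, the actor bears no civil liability.
#### [Constitution of the People's Republic of China](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)
#### [Criminal Law of the People's Republic of China](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%88%91%E6%B3%95)
#### [Civil Code of the People's Republic of China](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)
#### [Contract Law of the People's Republic of China](http://www.npc.gov.cn/zgrdw/npc/lfzt/rlyw/2016-07/01/content_1992739.htm)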
## 💪 Thanks to all contributors for their efforts
<a href="https://github.com/svc-develop-team/so-vits-svc/graphs/contributors" target="_blank">
<img src="https://contrib.rocks/image?repo=svc-develop-team/so-vits-svc" />
</a>
================================================
FILE: README_zh_CN.md
================================================
<div align="center">
<img alt="LOGO" src="https://avatars.githubusercontent.com/u/127122328?s=400&u=5395a98a4f945a3a50cb0cc96c2747505d190dbc&v=4" width="300" height="300" />
# SoftVC VITS Singing Voice Conversion
[**English**](./README.md) | [**中文简体**](./README_zh_CN.md)
[](https://colab.research.google.com/github/svc-develop-team/so-vits-svc/blob/4.1-Stable/sovits4_for_colab.ipynb)
[](https://github.com/svc-develop-team/so-vits-svc/blob/4.1-Stable/LICENSE)
This round of limited-time updates is coming to an end, and the repository will enter the Archived state. Please take note.
</div>
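#### ✨ An inference client with an F0 curve editor and a role mix timeline editor (a use for Onnx models): [MoeVoiceStudio](https://github.com/NaruseMioShirakana/MoeVoiceStudio)
#### ✨ A recommended fork with improved interaction: [34j/so-vits-svc-fork](https://github.com/34j/so-vits-svc-fork)
#### ✨ A client that supports real-time conversion: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)
**This project is fundamentally different from Vits. Vits is TTS, and this project is SVC. This project cannot perform TTS, and Vits cannot perform SVC. The models of the two projects are not interchangeable.**
## Important Notice
This project was developed to let the developers' favorite anime characters sing. Anything involving real people runs counter to the developers' intent.
## Disclaimer
This project is an open-source, offline project. All members of SvcDevelopTeam and all developers and maintainers of this project (hereinafter "contributors") have no control over it. The contributors have never provided any organization or individual with any form of assistance, including but not limited to dataset extraction, dataset processing, computing support, training support, or inference. The contributors do not know, and cannot know, the purposes for which users use the project. Therefore, all AI models trained with this project and all audio synthesized with it have nothing to do with the contributors. Users bear sole responsibility for any problems arising from them.
This project runs entirely offline and cannot collect any user information or obtain user input data. The contributors are therefore unaware of all user inputs and models and are not responsible for any user input.
This project is only a framework and has no speech synthesis capability of its own; all functionality requires users to train their own models. This project also ships no models, and any redistributed project has nothing to do with the contributors of this project.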
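## 📏 Terms of Use
# Warning: Resolve dataset authorization issues yourself. Training with unauthorized datasets is prohibited! You bear full responsibility for any problems and consequences arising from training with unauthorized datasets. The repository, its maintainers, and svc develop team bear no responsibility!
1. This project was established for academic exchange. It is intended for communication and learning only and is not prepared for production environments.
2. Any video based on sovits that is published on a video platform must clearly state in its description the input source vocals or audio used for the voice conversion. For example, if you use a video or audio published by someone else and convert the separated vocals, you must provide a clear link to the original video or music. If you use your own voice, or a voice synthesized by another singing voice synthesis engine, as the input source, you must also state this in the description.
3. You bear full responsibility for any infringement caused by the input source. When using other commercial singing voice synthesis software as the input source, make sure you comply with that software's terms of use. Note that the terms of many singing voice synthesis engines explicitly prohibit using their output as an input source for conversion!
4. Using this project for illegal activities or for religious or political activities is prohibited. The maintainers firmly oppose such use. If you do not agree to this clause, you may not use the project.
5. Continued use is deemed agreement to the terms stated in this repository's README. This README has fulfilled its duty to advise and is not responsible for any problems that may arise later.
6. If you use this project for any other plan, please contact and inform the author of this repository in advance. Thank you very much.
## 📝 Model Introduction
A singing voice conversion model. The SoftVC content encoder extracts speech features from the source audio, and these are fed into VITS together with F0 in place of the original text input to achieve singing voice conversion. The vocoder is also replaced with [NSF HiFiGAN](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) to solve the problem of sound interruption.
### 🆕 4.1-Stable Version Update Content
+ The feature input is changed to the output of the 12th Transformer layer of [Content Vec](https://github.com/auspicious3000/contentvec), and it is compatible with the 4.0 branch
+ Updated shallow diffusion; the shallow diffusion model can be used to improve sound quality
+ Added support for the whisper speech encoder
+ Added static/dynamic timbre mixing
+ Added loudness embedding
+ Added feature retrieval, from [RVC](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI)
### 🆕 About compatibility with 4.0 models
+ 4.0 models can be supported by modifying their config.json: add the `speech_encoder` field to the `model` section of config.json, as shown below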
```
"model": {
.........
"ssl_dim": 256,
"n_speakers": 200,
"speech_encoder":"vec256l9"
}
```
### 🆕 About shallow diffusion

## 💬 About the Python version
After testing, we believe that `Python 3.8.9` runs this project stably
## 📥 Pre-trained model files
#### **Required**
**Choose one of the following encoders**
##### **1. If using contentvec as the speech encoder (recommended)**
`vec768l12` and `vec256l9` require this encoder
+ contentvec: [checkpoint_best_legacy_500.pt](https://ibm.box.com/s/z1wgl1stco8ffooyatzdwsqn2psd9lrr)
+ Place it in the `pretrain` directory
Or download the following ContentVec, which is only 199MB in size but has the same effect:
+ contentvec: [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt)
+ Rename the file to `checkpoint_best_legacy_500.pt` and place it in the `pretrain` directory
```shell
# contentvec
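# -O takes the full output path; when -O is given, -P is not applied to it
wget -O pretrain/checkpoint_best_legacy_500.pt https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt
# Alternatively, you can manually download and place it in the pretrain directory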
```
##### **2. If using hubertsoft as the speech encoder**
+ soft vc hubert: [hubert-soft-0d54a1f4.pt](https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt)
+ Place it in the `pretrain` directory
##### **3. If using Whisper-ppg as the speech encoder**
+ Download the model [medium.pt](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt), which fits `whisper-ppg`
+ Download the model [large-v2.pt](https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt), which fits `whisper-ppg-large`
+ Place it in the `pretrain` directory
##### **4. If using cnhubertlarge as the speech encoder**
+ Download the model [chinese-hubert-large-fairseq-ckpt.pt](https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt)
+ Place it in the `pretrain` directory
##### **5. If using dphubert as the speech encoder**
+ Download the model [DPHuBERT-sp0.75.pth](https://huggingface.co/pyf98/DPHuBERT/resolve/main/DPHuBERT-sp0.75.pth)
+ Place it in the `pretrain` directory
##### **6. If using WavLM as the speech encoder**
+ Download the model [WavLM-Base+.pt](https://valle.blob.core.windows.net/share/wavlm/WavLM-Base+.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D), which fits `wavlmbase+`
+ Place it in the `pretrain` directory
##### **7. If using OnnxHubert/ContentVec as the speech encoder**
+ Download the model [MoeSS-SUBModel](https://huggingface.co/NaruseMioShirakana/MoeSS-SUBModel/tree/main)
+ Place it in the `pretrain` directory
#### **List of Encoders**
- "vec768l12"
- "vec256l9"
- "vec256l9-onnx"
- "vec256l12-onnx"
- "vec768l9-onnx"
- "vec768l12-onnx"
- "hubertsoft-onnx"
- "hubertsoft"
- "whisper-ppg"
- "cnhubertlarge"
- "dphubert"
- "whisper-ppg-large"
- "wavlmbase+"
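#### **Optional (strongly recommended)**
+ Pre-trained base model files: `G_0.pth` `D_0.pth`
+ Place them in the `logs/44k` directory
+ Pre-trained base model file for the diffusion model: `model_0.pt`
+ Place it in the `logs/44k/diffusion` directory
Get the Sovits base model from svc-develop-team (TBD) or anywhere else
The diffusion model uses the Diffusion Model from [Diffusion-SVC](https://github.com/CNChTu/Diffusion-SVC). Its base model is interchangeable with the diffusion base model of [Diffusion-SVC](https://github.com/CNChTu/Diffusion-SVC), so you can get the diffusion base model from [Diffusion-SVC](https://github.com/CNChTu/Diffusion-SVC)
Although base models generally do not raise copyright issues, please still pay attention, for example by asking the author beforehand or by checking whether the author has clearly stated the permitted uses in the model description
#### **Optional (select as required)**
##### NSF-HIFIGAN
If you use the `NSF-HIFIGAN enhancer` or `shallow diffusion`, you need to download the pre-trained NSF-HIFIGAN model; otherwise you can skip it
+ Pre-trained NSF-HIFIGAN vocoder: [nsf_hifigan_20221211.zip](https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip)
+ After unzipping, place the four files in the `pretrain/nsf_hifigan` directory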
```shell
# nsf_hifigan
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
# Alternatively, you can manually download and place it in the pretrain/nsf_hifigan directory
# URL: https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
```
##### RMVPE
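If you use the `rmvpe` F0 predictor, you need to download the pre-trained RMVPE model
+ Download the model [rmvpe.zip](https://github.com/yxlllc/RMVPE/releases/download/230917/rmvpe.zip). This weight is currently the top recommendation.
+ Unzip `rmvpe.zip`, rename the `model.pt` file inside to `rmvpe.pt`, and place it in the `pretrain` directory
+ ~~Download the model [rmvpe.pt](https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/rmvpe.pt)~~
+ ~~Place it in the `pretrain` directory~~
##### FCPE (Preview version)
> You're right, but [FCPE](https://github.com/CNChTu/MelPE) is a brand-new F0 predictor independently developed by svc-develop-team, and I forgot the rest
[FCPE (Fast Context-base Pitch Estimator)](https://github.com/CNChTu/MelPE) is a dedicated F0 predictor designed for real-time voice conversion. It will become the preferred F0 predictor for Sovits real-time voice conversion in the future. (A paper will follow.)
If you use the `fcpe` F0 predictor, you need to download the pre-trained FCPE model
+ Download the model [fcpe.pt](https://huggingface.co/datasets/ylzz1997/rmvpe_pretrain_model/resolve/main/fcpe.pt)
+ Place it in the `pretrain` directory
## 📊 Dataset Preparation
Simply place the dataset in the dataset_raw directory with the following file structure.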
```
dataset_raw
├───speaker0
│ ├───xxx1-xxx1.wav
│ ├───...
│ └───Lxx-0xx8.wav
└───speaker1
├───xx2-0xxx2.wav
├───...
└───xxx7-xxx007.wav
```
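There are no format restrictions on the name of each audio file (names such as `000001.wav`~`999999.wav` are also valid), but the file type must be `wav`.
You can customize the speaker name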
```
dataset_raw
└───suijiSUI
├───1.wav
├───...
└───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
```
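## 🛠️ Preprocessing
### 0. Slice audio
Slice the audio into clips of `5s - 15s`. Slightly longer clips are harmless, but clips that are far too long may exhaust video memory during training or even during preprocessing
You can use [audio-slicer-GUI](https://github.com/flutydeer/audio-slicer) or [audio-slicer-CLI](https://github.com/openvpi/audio-slicer)
In general, only the `Minimum Interval` needs adjusting. For ordinary speech material the default is usually fine; for singing material it can be lowered to `100` or even `50`
After slicing, manually delete clips that are too long or too short
**If you train with the Whisper-ppg speech encoder, all clips must be shorter than 30s**
### 1. Resample to 44100Hz mono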
```shell
python resample.py
```
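#### Note
Although this project provides the script resample.py for resampling, conversion to mono, and loudness matching, the default loudness matching targets 0db, which may damage the sound quality. Python's loudness matching package pyloudnorm cannot limit the level, which leads to clipping. It is therefore recommended to use professional audio software such as `adobe audition` for loudness matching. If you have already matched loudness with other software, add `--skip_loudnorm` to the command above to skip the loudness matching step. For example: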
```shell
python resample.py --skip_loudnorm
```
### 2. Automatically split the dataset into training and validation sets, and generate configuration files
```shell
python preprocess_flist_config.py --speech_encoder vec768l12
```
speech_encoder has the following options
```
vec768l12
vec256l9
hubertsoft
whisper-ppg
whisper-ppg-large
cnhubertlarge
dphubert
wavlmbase+
```
If the speech_encoder argument is omitted, the default value is vec768l12
**Use loudness embedding**
To use loudness embedding, add the `--vol_aug` parameter, for example:
```shell
python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug
```
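With this option, the trained model matches the loudness of the input source; otherwise it matches the loudness of the training set.
#### At this point, some parameters can be modified in the generated config.json and diffusion.yaml
##### config.json
* `keep_ckpts`: number of most recent checkpoints to keep during training; `0` keeps all, and by default only the last `3` are kept
* `all_in_mem`: load the whole dataset into memory. Enable it when disk IO on your platform is very slow and memory capacity is **much larger** than the dataset size
* `batch_size`: amount of data loaded onto the GPU per training step; set it to a size that fits within GPU memory
* `vocoder_name`: the vocoder to use; the default is `nsf-hifigan`.
##### diffusion.yaml
* `cache_all_data`: load the whole dataset into memory. Enable it when disk IO on your platform is very slow and memory capacity is **much larger** than the dataset size
* `duration`: duration of audio slices during training, adjustable according to GPU memory. **Note: this value must be less than the duration of the shortest audio clip in the training set!**
* `batch_size`: amount of data loaded onto the GPU per training step; set it to a size that fits within GPU memory
* `timesteps`: total number of diffusion steps; the default is 1000.
* `k_step_max`: training can be limited to `k_step_max` diffusion steps to save training time. This value must be less than `timesteps`; `0` trains the whole diffusion model. **Note: if the whole diffusion model is not trained, diffusion-only inference cannot be used!**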
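
The constraints on `duration`, `timesteps`, and `k_step_max` listed above can be expressed as a small check. This is an illustrative sketch only: the function names are hypothetical, the values are passed in by hand rather than read from `diffusion.yaml`, and the strict "less than" comparison follows the wording of this README.

```python
def check_diffusion_settings(timesteps, k_step_max, duration, shortest_clip_seconds):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    # 0 means "train the whole diffusion model"; otherwise stay below timesteps.
    if k_step_max < 0 or k_step_max >= timesteps:
        problems.append("k_step_max must be 0 or a positive value smaller than timesteps")
    # The training slice must fit inside the shortest clip of the training set.
    if duration >= shortest_clip_seconds:
        problems.append("duration must be smaller than the shortest training clip")
    return problems


def supports_diffusion_only(k_step_max):
    """Diffusion-only inference needs the whole diffusion model (k_step_max == 0)."""
    return k_step_max == 0


if __name__ == "__main__":
    print(check_diffusion_settings(timesteps=1000, k_step_max=0,
                                   duration=2, shortest_clip_seconds=5.0))
    print(supports_diffusion_only(0))
```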
##### **Vocoder list**
```
nsf-hifigan
nsf-snake-hifigan
```
### 3. Generate hubert and f0
```shell
python preprocess_hubert_f0.py --f0_predictor dio
```
`f0_predictor` accepts the following values:
```
crepe
dio
pm
harvest
rmvpe
fcpe
```
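If the training set is too noisy, use `crepe` to process f0.
If the `f0_predictor` argument is omitted, the default is `rmvpe`.
If you need shallow diffusion (optional), add the `--use_diff` argument, for example: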
```shell
python preprocess_hubert_f0.py --f0_predictor dio --use_diff
```
**Speed up preprocessing**
If your dataset is large, try adding the `--num_processes` argument:
```shell
python preprocess_hubert_f0.py --f0_predictor dio --use_diff --num_processes 8
```
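All workers are distributed automatically across multiple threads.
After completing the steps above, the `dataset` directory contains the preprocessed data, and the `dataset_raw` folder can be deleted.
## 🏋️ Training
### Main model training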
```shell
python train.py -c configs/config.json -m 44k
```
### Diffusion model (optional)
If you need shallow diffusion, you must train a diffusion model as follows:
```shell
python train_diff.py -c configs/diffusion.yaml
```
After training, model files are saved under `logs/44k`, and the diffusion model under `logs/44k/diffusion`.
## 🤖 Inference
Use [inference_main.py](inference_main.py)
```shell
# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"
```
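Required arguments:
+ `-m` | `--model_path`: path to the model
+ `-c` | `--config_path`: path to the configuration file
+ `-n` | `--clean_names`: list of wav file names, placed in the `raw` folder
+ `-t` | `--trans`: pitch shift in semitones; positive and negative values are supported
+ `-s` | `--spk_list`: name of the target speaker to synthesize
+ `-cl` | `--clip`: forced audio slicing, in seconds; the default `0` means automatic slicing

Optional arguments (some are described in detail in the next section):
+ `-lg` | `--linear_gradient`: crossfade length in seconds between two audio slices. Adjust this value if the vocals sound discontinuous after forced slicing; if they are continuous, the default value `0` is recommended
+ `-f0p` | `--f0_predictor`: F0 predictor to use; one of crepe, pm, dio, harvest, rmvpe, fcpe. The default is pm (note: crepe applies a mean filter to the original F0)
+ `-a` | `--auto_predict_f0`: automatically predict pitch for speech conversion. Do not enable this when converting singing voices, or the result will be severely out of tune
+ `-cm` | `--cluster_model_path`: path to the clustering model or feature retrieval index. If left empty, it is set automatically to the default path of the respective scheme; if neither clustering nor feature retrieval has been trained, any value will do
+ `-cr` | `--cluster_infer_ratio`: proportion of the clustering scheme or feature retrieval, in the range 0-1. If no clustering model or feature retrieval has been trained, leave it at the default `0`
+ `-eh` | `--enhance`: whether to use the NSF_HIFIGAN enhancer. This option improves audio quality somewhat for models trained on small datasets but has a negative effect on well-trained models. Off by default
+ `-shd` | `--shallow_diffusion`: whether to use shallow diffusion, which can resolve some electronic-sounding artifacts. Off by default. When this option is on, the NSF_HIFIGAN enhancer is disabled
+ `-usm` | `--use_spk_mix`: whether to use speaker mixing / dynamic voice mixing
+ `-lea` | `--loudness_envelope_adjustment`: blending ratio for replacing the output loudness envelope with the input source loudness envelope; the closer to 1, the more the output loudness envelope is used
+ `-fr` | `--feature_retrieval`: whether to use feature retrieval. If enabled, the clustering model is disabled, and the `cm` and `cr` arguments become the index path and mixing ratio for feature retrieval

Shallow diffusion settings:
+ `-dm` | `--diffusion_model_path`: path to the diffusion model
+ `-dc` | `--diffusion_config_path`: path to the diffusion model configuration file
+ `-ks` | `--k_step`: number of diffusion steps; the larger the value, the closer the result is to that of the diffusion model. The default is 100
+ `-od` | `--only_diffusion`: diffusion-only mode. This mode does not load the sovits model and runs inference with the diffusion model alone
+ `-se` | `--second_encoding`: secondary encoding. The original audio is encoded a second time before shallow diffusion. The effect is unpredictable: sometimes better, sometimes worse
### Attention!
If you run inference with the `whisper-ppg` speech encoder, set `--clip` to 25 and `-lg` to 1. Otherwise inference will not work properly.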
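## 🤔 Optional Settings
If the results so far are satisfactory, or if the following is unclear, everything below can be ignored without affecting use of the model. (These options have a relatively small effect; they may help on certain specific data, but in most cases the difference is hard to notice.)
### Automatic f0 prediction
Training a 4.0 model also trains an f0 predictor. For speech conversion, automatic pitch prediction can be enabled; if the results are poor, manual pitch can be used instead. Do not enable this feature when converting singing voices: the result will be severely out of tune.
+ Set `auto_predict_f0` to true in inference_main
### Cluster-based timbre leakage control
Introduction: The clustering scheme reduces timbre leakage, making the model's output closer to the target timbre (although the effect is not very pronounced). Used alone, however, clustering noticeably reduces articulation clarity (the speech becomes slurred). This model adopts a fusion approach that linearly controls the proportion of the clustering and non-clustering schemes, so the balance between "similar to the target timbre" and "clear articulation" can be adjusted manually to find a suitable trade-off.
Using clustering requires no changes to the earlier steps; only an additional clustering model needs to be trained. The effect is fairly limited, but so is the training cost.
+ Training:
  + Train on a machine with good CPU performance. In my experience, training takes about 4 minutes per speaker on a 6-core Tencent Cloud CPU
  + Run `python cluster/train_cluster.py`; the model is written to `logs/44k/kmeans_10000.pt`
  + The clustering model can now also be trained on a GPU: run `python cluster/train_cluster.py --gpu`
+ Inference:
  + In `inference_main.py`, set `cluster_model_path` to the model output file; if left empty, it defaults to `logs/44k/kmeans_10000.pt`
  + In `inference_main.py`, set `cluster_infer_ratio`: `0` means clustering is not used at all, `1` means only clustering is used; `0.5` is usually suitable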
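
The linear control described above can be pictured as a weighted average of a content feature vector and its nearest cluster centre. The sketch below illustrates that idea with plain Python lists; the helper name `blend_features` is hypothetical and the code is not taken from `inference_main.py`.

```python
def blend_features(original, cluster_center, ratio):
    """Linearly blend a feature vector with its cluster centre.

    ratio = 0 keeps the original feature, ratio = 1 uses only the centre.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in the range 0-1")
    if len(original) != len(cluster_center):
        raise ValueError("both vectors must have the same length")
    return [ratio * center + (1.0 - ratio) * value
            for value, center in zip(original, cluster_center)]


if __name__ == "__main__":
    feature = [0.0, 2.0, 4.0]
    center = [1.0, 1.0, 1.0]
    print(blend_features(feature, center, 0.5))
```

With `ratio = 0.5` each output value sits halfway between the original feature and the cluster centre, which matches the usual setting mentioned above.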
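### Feature retrieval
Introduction: Like the clustering scheme, feature retrieval reduces timbre leakage. Articulation is slightly better than with clustering, but inference is slower. It uses the same fusion approach, so the proportion of feature retrieval and non-feature retrieval can be controlled linearly.
+ Training:
  First, after generating hubert and f0, run: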
```shell
python train_index.py -c configs/config.json
```
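The model is written to `logs/44k/feature_and_index.pkl`
+ Inference:
  + First specify `--feature_retrieval`; the clustering scheme then switches automatically to the feature retrieval scheme
  + In `inference_main.py`, set `cluster_model_path` to the model output file; if left empty, it defaults to `logs/44k/feature_and_index.pkl`
  + In `inference_main.py`, set `cluster_infer_ratio`: `0` means feature retrieval is not used at all, `1` means only feature retrieval is used; `0.5` is usually suitable
## 🗜️ Model compression
The generated model contains information needed to continue training. If you are sure you will not train further, this information can be removed to obtain a final model about 1/3 of the size.
Use [compress_model.py](compress_model.py)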
```shell
# Example
python compress_model.py -c="configs/config.json" -i="logs/44k/G_30400.pth" -o="logs/44k/release.pth"
```
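## 👨‍🔧 Voice mixing
### Static voice mixing
**See the static voice mixing feature under Tools / Lab Features in `webUI.py`.**
Introduction: This feature combines multiple voice models into a single voice model (a convex or linear combination of the models' parameters), creating voices that do not exist in reality.
**Notes:**
1. This feature supports only single-speaker models
2. If you insist on using multi-speaker models, make sure all models have the same number of speakers, so that voices under the same SpeakerID can be mixed
3. Make sure the `model` field in the config.json of every model to be mixed is identical
4. The mixed model can use the config.json of any of the source models, but clustering models cannot be used with it
5. When uploading models in batches, it is best to put them in one folder, select them, and upload them together
6. A mixing ratio between 0 and 100 is recommended. Other values are allowed, but in linear combination mode the results are unpredictable
7. After mixing, the file is saved in the project root directory as output.pth
8. Convex combination mode applies Softmax to the mixing ratios so that they sum to 1; linear combination mode does not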
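
The difference between the two modes in note 8 can be illustrated on toy parameter lists. This is a sketch under stated assumptions, not the code in `webUI.py`: the function names are hypothetical, each "model" is just a flat list of floats, and any rescaling of the ratios before Softmax is omitted because the README does not describe it.

```python
import math


def softmax(values):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mix_parameters(model_params, ratios, convex=True):
    """Combine several parameter lists element-wise.

    convex=True  -> coefficients are softmax(ratios), so they sum to 1.
    convex=False -> the ratios are used directly as coefficients.
    """
    coeffs = softmax(ratios) if convex else list(ratios)
    size = len(model_params[0])
    return [sum(c * params[i] for c, params in zip(coeffs, model_params))
            for i in range(size)]


if __name__ == "__main__":
    model_a = [1.0, 2.0]
    model_b = [3.0, 6.0]
    print(mix_parameters([model_a, model_b], [50, 50]))
    print(mix_parameters([model_a, model_b], [1, 1], convex=False))
```

In convex mode the result always stays between the source parameters, whereas linear mode can scale them arbitrarily, which is why note 6 warns about unpredictable results.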
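### Dynamic voice mixing
**See the introduction to dynamic voice mixing in the `spkmix.py` file.**
Rules for writing speaker mixing tracks:
Speaker ID : \[\[start time 1, end time 1, start value 1, end value 1], [start time 2, end time 2, start value 2, end value 2]]
Each start time must equal the previous end time; the first start time must be 0 and the last end time must be 1 (times range from 0 to 1).
Every speaker must be filled in; for unused speakers, use \[\[0., 1., 0., 0.]].
Mixing values can be arbitrary. Within each time span the value changes linearly from the start value to the end value; the values are normalized internally so that the linear combination sums to 1 (the convex combination condition), so they can be used freely.
Enable dynamic voice mixing at inference time with the `--use_spk_mix` argument.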
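
The track rules above can be sketched as a validator plus a per-time weight computation. The function names below are hypothetical, and normalizing by dividing each value by the sum over all speakers is an assumption about how the internal normalization works; the actual logic lives in the repository and may differ.

```python
def validate_track(segments):
    """Check the timing rules for one speaker's track."""
    if segments[0][0] != 0.0:
        raise ValueError("the first start time must be 0")
    if segments[-1][1] != 1.0:
        raise ValueError("the last end time must be 1")
    for previous, current in zip(segments, segments[1:]):
        if current[0] != previous[1]:
            raise ValueError("each start time must equal the previous end time")


def value_at(segments, t):
    """Linearly interpolate the raw mixing value at relative time t in [0, 1]."""
    for start, end, v_start, v_end in segments:
        if start <= t <= end:
            if end == start:
                return v_start
            return v_start + (v_end - v_start) * (t - start) / (end - start)
    raise ValueError("t must lie in [0, 1]")


def mix_weights(tracks, t):
    """Normalize the raw values of all speakers at time t so they sum to 1."""
    for segments in tracks.values():
        validate_track(segments)
    raw = {spk: value_at(segments, t) for spk, segments in tracks.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("at least one speaker needs a non-zero value")
    return {spk: value / total for spk, value in raw.items()}


if __name__ == "__main__":
    tracks = {
        0: [[0.0, 0.5, 1.0, 1.0], [0.5, 1.0, 1.0, 0.0]],
        1: [[0.0, 1.0, 0.0, 1.0]],
        2: [[0.0, 1.0, 0.0, 0.0]],  # unused speaker
    }
    print(mix_weights(tracks, 0.25))
```

Here speaker 0 holds its value and then fades out, speaker 1 fades in over the whole clip, and speaker 2 stays unused.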
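## 📤 Onnx export
Use [onnx_export.py](onnx_export.py)
+ Create a folder named `checkpoints` and open it
+ Inside `checkpoints`, create a folder for your project named after the project, e.g. `aziplayer`
+ Rename your model to `model.pth` and your configuration file to `config.json`, and place them in the `aziplayer` folder you just created
+ In [onnx_export.py](onnx_export.py), change `"NyaruTaffy"` in `path = "NyaruTaffy"` to your project name, e.g. `path = "aziplayer"` (`onnx_export_speaker_mix` is the Onnx export that supports speaker mixing)
+ Run [onnx_export.py](onnx_export.py)
+ Wait for it to finish; a `model.onnx` file, which is the exported model, is generated in your project folder

Note: For the Hubert Onnx model, use the one provided by MoeSS. It cannot currently be exported by yourself (Hubert in fairseq has many operators that Onnx does not support and involves constants, so the export either fails or yields a model whose input/output shapes and results are wrong).
## 📎 References and papers
| URL | Name | Title | Source code |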
| --- | ----------- | ----- | --------------------- |
|[2106.06103](https://arxiv.org/abs/2106.06103) | VITS (Synthesizer)| Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech | [jaywalnut310/vits](https://github.com/jaywalnut310/vits) |
|[2111.02392](https://arxiv.org/abs/2111.02392) | SoftVC (Speech Encoder)| A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion | [bshall/hubert](https://github.com/bshall/hubert) |
|[2204.09224](https://arxiv.org/abs/2204.09224) | ContentVec (Speech Encoder)| ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers | [auspicious3000/contentvec](https://github.com/auspicious3000/contentvec) |
|[2212.04356](https://arxiv.org/abs/2212.04356) | Whisper (Speech Encoder) | Robust Speech Recognition via Large-Scale Weak Supervision | [openai/whisper](https://github.com/openai/whisper) |
|[2110.13900](https://arxiv.org/abs/2110.13900) | WavLM (Speech Encoder) | WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | [microsoft/unilm/wavlm](https://github.com/microsoft/unilm/tree/master/wavlm) |
|[2305.17651](https://arxiv.org/abs/2305.17651) | DPHubert (Speech Encoder) | DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models | [pyf98/DPHuBERT](https://github.com/pyf98/DPHuBERT) |
|[DOI:10.21437/Interspeech.2017-68](http://dx.doi.org/10.21437/Interspeech.2017-68) | Harvest (F0 Predictor) | Harvest: A high-performance fundamental frequency estimator from speech signals | [mmorise/World/harvest](https://github.com/mmorise/World/blob/master/src/harvest.cpp) |
|[aes35-000039](https://www.aes.org/e-lib/online/browse.cfm?elib=15165) | Dio (F0 Predictor) | Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech | [mmorise/World/dio](https://github.com/mmorise/World/blob/master/src/dio.cpp) |
|[8461329](https://ieeexplore.ieee.org/document/8461329) | Crepe (F0 Predictor) | Crepe: A Convolutional Representation for Pitch Estimation | [maxrmorrison/torchcrepe](https://github.com/maxrmorrison/torchcrepe) |
|[DOI:10.1016/j.wocn.2018.07.001](https://doi.org/10.1016/j.wocn.2018.07.001) | Parselmouth (F0 Predictor) | Introducing Parselmouth: A Python interface to Praat | [YannickJadoul/Parselmouth](https://github.com/YannickJadoul/Parselmouth) |
|[2306.15412v2](https://arxiv.org/abs/2306.15412v2) | RMVPE (F0 Predictor) | RMVPE: A Robust Model for Vocal Pitch Estimation in Polyphonic Music | [Dream-High/RMVPE](https://github.com/Dream-High/RMVPE) |
|[2010.05646](https://arxiv.org/abs/2010.05646) | HIFIGAN (Vocoder) | HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | [jik876/hifi-gan](https://github.com/jik876/hifi-gan) |
|[1810.11946](https://arxiv.org/abs/1810.11946.pdf) | NSF (Vocoder) | Neural source-filter-based waveform model for statistical parametric speech synthesis | [openvpi/DiffSinger/modules/nsf_hifigan](https://github.com/openvpi/DiffSinger/tree/refactor/modules/nsf_hifigan) |
|[2006.08195](https://arxiv.org/abs/2006.08195) | Snake (Vocoder) | Neural Networks Fail to Learn Periodic Functions and How to Fix It | [EdwardDixon/snake](https://github.com/EdwardDixon/snake) |
|[2105.02446v3](https://arxiv.org/abs/2105.02446v3) | Shallow Diffusion (PostProcessing)| DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism | [CNChTu/Diffusion-SVC](https://github.com/CNChTu/Diffusion-SVC) |
|[K-means](https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=01D65490BADCC216F350D06F84D721AD?doi=10.1.1.308.8619&rep=rep1&type=pdf) | Feature K-means Clustering (PreProcessing)| Some methods for classification and analysis of multivariate observations | This repository |
| | Feature TopK Retrieval (PreProcessing)| Retrieval based Voice Conversion | [RVC-Project/Retrieval-based-Voice-Conversion-WebUI](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) |
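## ☀️ Previous contributors
For certain reasons the original author deleted the repository. When this repository was first rebuilt, an oversight by organization members led to all files being re-uploaded directly, which wiped out the record of previous contributors. A list of previous contributors has therefore been added to the README.
*Some members are not listed, at their own request.*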
<table>
<tr>
<td align="center"><a href="https://github.com/MistEO"><img src="https://avatars.githubusercontent.com/u/18511905?v=4" width="100px;" alt=""/><br /><sub><b>MistEO</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/XiaoMiku01"><img src="https://avatars.githubusercontent.com/u/54094119?v=4" width="100px;" alt=""/><br /><sub><b>XiaoMiku01</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/ForsakenRei"><img src="https://avatars.githubusercontent.com/u/23041178?v=4" width="100px;" alt=""/><br /><sub><b>しぐれ</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/TomoGaSukunai"><img src="https://avatars.githubusercontent.com/u/25863522?v=4" width="100px;" alt=""/><br /><sub><b>TomoGaSukunai</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Plachtaa"><img src="https://avatars.githubusercontent.com/u/112609742?v=4" width="100px;" alt=""/><br /><sub><b>Plachtaa</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/zdxiaoda"><img src="https://avatars.githubusercontent.com/u/45501959?v=4" width="100px;" alt=""/><br /><sub><b>zd 小达</b></sub></a><br /></td>
<td align="center"><a href="https://github.com/Archivoice"><img src="https://avatars.githubusercontent.com/u/107520869?v=4" width="100px;" alt=""/><br /><sub><b>凍聲響世</b></sub></a><br /></td>
</tr>
</table>
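## 📚 Some legal provisions for reference
#### Any country, region, organization, or individual using this project must comply with the following laws
#### Civil Code
##### Article 1019
No organization or individual may infringe upon another person's portrait rights by defaming or defacing the portrait, or by forging it through information technology or other means. Without the consent of the portrait right holder, no one may make, use, or publish that person's portrait, unless otherwise provided by law. Without the consent of the portrait right holder, the holder of rights in a portrait work may not use or publish the portrait by means of publication, reproduction, distribution, rental, exhibition, or the like. The protection of a natural person's voice is governed, by reference, by the relevant provisions on the protection of portrait rights.
##### Article 1024
[Right to reputation] Civil subjects enjoy the right to reputation. No organization or individual may infringe upon another person's right to reputation by insult, defamation, or other means.
##### Article 1027
[Works infringing the right to reputation] Where a literary or artistic work published by an actor depicts real people and real events or a specific person, contains insulting or defamatory content, and infringes upon another person's right to reputation, the injured party has the right to request, in accordance with law, that the actor bear civil liability. Where a literary or artistic work published by an actor does not depict a specific person and only its plot resembles that person's circumstances, the actor bears no civil liability.
#### [Constitution of the People's Republic of China](http://www.gov.cn/guoqing/2018-03/22/content_5276318.htm)
#### [Criminal Law of the People's Republic of China](http://gongbao.court.gov.cn/Details/f8e30d0689b23f57bfc782d21035c3.html?sw=中华人民共和国刑法)
#### [Civil Code of the People's Republic of China](http://gongbao.court.gov.cn/Details/51eb6750b8361f79be8f90d09bc202.html)
#### [Contract Law of the People's Republic of China](http://www.npc.gov.cn/zgrdw/npc/lfzt/rlyw/2016-07/01/content_1992739.htm)
## 💪 Thanks to all contributors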
<a href="https://github.com/svc-develop-team/so-vits-svc/graphs/contributors" target="_blank">
<img src="https://contrib.rocks/image?repo=svc-develop-team/so-vits-svc" />
</a>
================================================
FILE: cluster/__init__.py
================================================
import torch
from sklearn.cluster import KMeans


def get_cluster_model(ckpt_path):
    checkpoint = torch.load(ckpt_path)
    kmeans_dict = {}
    for spk, ckpt in checkpoint.items():
        km = KMeans(ckpt["n_features_in_"])
        km.__dict__["n_features_in_"] = ckpt["n_features_in_"]
        km.__dict__["_n_threads"] = ckpt["_n_threads"]
        km.__dict__["cluster_centers_"] = ckpt["cluster_centers_"]
        kmeans_dict[spk] = km
    return kmeans_dict


def get_cluster_result(model, x, speaker):
    """
        x: np.array [t, 256]
        return cluster class result
    """
    return model[speaker].predict(x)


def get_cluster_center_result(model, x,speaker):
    """x: np.array [t, 256]"""
    predict = model[speaker].predict(x)
    return model[speaker].cluster_centers_[predict]


def get_center(model, x,speaker):
    return model[speaker].cluster_centers_[x]
================================================
FILE: cluster/kmeans.py
================================================
from time import time

import numpy as np
import pynvml
import torch
from torch.nn.functional import normalize


# device=torch.device("cuda:0")
def _kpp(data: torch.Tensor, k: int, sample_size: int = -1):
    """ Picks k points in the data based on the kmeans++ method.

    Parameters
    ----------
    data : torch.Tensor
        Expect a rank 1 or 2 array. Rank 1 is assumed to describe 1-D
        data, rank 2 multidimensional data, in which case one
        row is one observation.
    k : int
        Number of samples to generate.
    sample_size : int
        sample data to avoid memory overflow during calculation

    Returns
    -------
    init : ndarray
        A 'k' by 'N' containing the initial centroids.

    References
    ----------
    .. [1] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of
       careful seeding", Proceedings of the Eighteenth Annual ACM-SIAM Symposium
       on Discrete Algorithms, 2007.
    .. [2] scipy/cluster/vq.py: _kpp
    """
    batch_size=data.shape[0]
    if batch_size>sample_size:
        data = data[torch.randint(0, batch_size,[sample_size], device=data.device)]
    dims = data.shape[1] if len(data.shape) > 1 else 1
    init = torch.zeros((k, dims)).to(data.device)
    r = torch.distributions.uniform.Uniform(0, 1)
    for i in range(k):
        if i == 0:
            init[i, :] = data[torch.randint(data.shape[0], [1])]
        else:
            D2 = torch.cdist(init[:i, :][None, :], data[None, :], p=2)[0].amin(dim=0)
            probs = D2 / torch.sum(D2)
            cumprobs = torch.cumsum(probs, dim=0)
            init[i, :] = data[torch.searchsorted(cumprobs, r.sample([1]).to(data.device))]
    return init


class KMeansGPU:
    '''
    Kmeans clustering algorithm implemented with PyTorch

    Parameters:
      n_clusters: int,
        Number of clusters
      max_iter: int, default: 100
        Maximum number of iterations
      tol: float, default: 0.0001
        Tolerance
      verbose: int, default: 0
        Verbosity
      mode: {'euclidean', 'cosine'}, default: 'euclidean'
        Type of distance measure
      init_method: {'random', 'point', '++'}
        Type of initialization
      minibatch: {None, int}, default: None
        Batch size of MinibatchKmeans algorithm
        if None perform full KMeans algorithm

    Attributes:
      centroids: torch.Tensor, shape: [n_clusters, n_features]
        cluster centroids
    '''
    def __init__(self, n_clusters, max_iter=200, tol=1e-4, verbose=0, mode="euclidean",device=torch.device("cuda:0")):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.tol = tol
        self.verbose = verbose
        self.mode = mode
        self.device=device
        pynvml.nvmlInit()
        gpu_handle = pynvml.nvmlDeviceGetHandleByIndex(device.index)
        info = pynvml.nvmlDeviceGetMemoryInfo(gpu_handle)
        self.minibatch=int(33e6/self.n_clusters*info.free/ 1024 / 1024 / 1024)
        print("free_mem/GB:",info.free/ 1024 / 1024 / 1024,"minibatch:",self.minibatch)

    @staticmethod
    def cos_sim(a, b):
        """
          Compute cosine similarity of 2 sets of vectors

          Parameters:
          a: torch.Tensor, shape: [m, n_features]
          b: torch.Tensor, shape: [n, n_features]
        """
        return normalize(a, dim=-1) @ normalize(b, dim=-1).transpose(-2, -1)

    @staticmethod
    def euc_sim(a, b):
        """
          Compute euclidean similarity of 2 sets of vectors

          Parameters:
          a: torch.Tensor, shape: [m, n_features]
          b: torch.Tensor, shape: [n, n_features]
        """
        return 2 * a @ b.transpose(-2, -1) -(a**2).sum(dim=1)[..., :, None] - (b**2).sum(dim=1)[..., None, :]

    def max_sim(self, a, b):
        """
          Compute maximum similarity (or minimum distance) of each vector
          in a with all of the vectors in b

          Parameters:
          a: torch.Tensor, shape: [m, n_features]
          b: torch.Tensor, shape: [n, n_features]
        """
        if self.mode == 'cosine':
            sim_func = self.cos_sim
        elif self.mode == 'euclidean':
            sim_func = self.euc_sim
        sim = sim_func(a, b)
        max_sim_v, max_sim_i = sim.max(dim=-1)
        return max_sim_v, max_sim_i

    def fit_predict(self, X):
        """
          Combination of fit() and predict() methods.
          This is faster than calling fit() and predict() separately.

          Parameters:
          X: torch.Tensor, shape: [n_samples, n_features]
          centroids: {torch.Tensor, None}, default: None
            if given, centroids will be initialized with given tensor
            if None, centroids will be randomly chosen from X

          Return:
          labels: torch.Tensor, shape: [n_samples]

            mini_=33kk/k*remain
            mini=min(mini_,fea_shape)
            offset=log2(k/1000)*1.5
            kpp_all=min(mini_*10/offset,fea_shape)
            kpp_sample=min(mini_/12/offset,fea_shape)
        """
        assert isinstance(X, torch.Tensor), "input must be torch.Tensor"
        assert X.dtype in [torch.half, torch.float, torch.double], "input must be floating point"
        assert X.ndim == 2, "input must be a 2d tensor with shape: [n_samples, n_features] "
        # print("verbose:%s"%self.verbose)

        offset = np.power(1.5,np.log(self.n_clusters / 1000))/np.log(2)
        with torch.no_grad():
            batch_size= X.shape[0]
            # print(self.minibatch, int(self.minibatch * 10 / offset), batch_size)
            start_time = time()
            if (self.minibatch*10//offset< batch_size):
                x = X[torch.randint(0, batch_size,[int(self.minibatch*10/offset)])].to(self.device)
            else:
                x = X.to(self.device)
            # print(x.device)
            self.centroids = _kpp(x, self.n_clusters, min(int(self.minibatch/12/offset),batch_size))
            del x
            torch.cuda.empty_cache()
            # self.centroids = self.centroids.to(self.device)
            num_points_in_clusters = torch.ones(self.n_clusters, device=self.device, dtype=X.dtype)  # all ones
            closest = None#[3098036]#int64
            if(self.minibatch>=batch_size//2 and self.minibatch<batch_size):
                X = X[torch.randint(0, batch_size,[self.minibatch])].to(self.device)
            elif(self.minibatch>=batch_size):
                X=X.to(self.device)
            for i in range(self.max_iter):
                iter_time = time()
                if self.minibatch<batch_size//2:  # minibatch is too small to cache the data, so each iteration copies a fresh sample from host memory to GPU memory
                    x = X[torch.randint(0, batch_size, [self.minibatch])].to(self.device)
                else:  # otherwise everything is already cached on the device
                    x = X

                closest = self.max_sim(a=x, b=self.centroids)[1].to(torch.int16)#[3098036]#int64#0~999
                matched_clusters, counts = closest.unique(return_counts=True)#int64#1k
                expanded_closest = closest[None].expand(self.n_clusters, -1)#[1000, 3098036]#int16#0~999
                mask = (expanded_closest==torch.arange(self.n_clusters, device=self.device)[:, None]).to(X.dtype)  # the right-hand side of == is int64, length 1000
                c_grad = mask @ x / mask.sum(-1)[..., :, None]
                c_grad[c_grad!=c_grad] = 0 # remove NaNs
                error = (c_grad - self.centroids).pow(2).sum()
                if self.minibatch is not None:
                    lr = 1/num_points_in_clusters[:,None] * 0.9 + 0.1
                else:
                    lr = 1
                matched_clusters=matched_clusters.long()
                num_points_in_clusters[matched_clusters] += counts#IndexError: tensors used as indices must be long, byte or bool tensors
                self.centroids = self.centroids * (1-lr) + c_grad * lr
                if self.verbose >= 2:
                    print('iter:', i, 'error:', error.item(), 'time spent:', round(time()-iter_time, 4))
                if error <= self.tol:
                    break

            if self.verbose >= 1:
                print(f'used {i+1} iterations ({round(time()-start_time, 4)}s) to cluster {batch_size} items into {self.n_clusters} clusters')
        return closest
================================================
FILE: cluster/train_cluster.py
================================================
import argparse
import logging
import os
import time
from pathlib import Path

import numpy as np
import torch
import tqdm
from kmeans import KMeansGPU
from sklearn.cluster import KMeans, MiniBatchKMeans

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def train_cluster(in_dir, n_clusters, use_minibatch=True, verbose=False,use_gpu=False):  # GPU minibatch performs poorly; the library supports it, but it is not considered here
    if str(in_dir).endswith(".ipynb_checkpoints"):
        logger.info(f"Ignore {in_dir}")

    logger.info(f"Loading features from {in_dir}")
    features = []
    nums = 0
    for path in tqdm.tqdm(in_dir.glob("*.soft.pt")):
    # for name in os.listdir(in_dir):
    #     path="%s/%s"%(in_dir,name)
        features.append(torch.load(path,map_location="cpu").squeeze(0).numpy().T)
        # print(features[-1].shape)
    features = np.concatenate(features, axis=0)
    print(nums, features.nbytes/ 1024**2, "MB , shape:",features.shape, features.dtype)
    features = features.astype(np.float32)
    logger.info(f"Clustering features of shape: {features.shape}")
    t = time.time()
    if(use_gpu is False):
        if use_minibatch:
            kmeans = MiniBatchKMeans(n_clusters=n_clusters,verbose=verbose, batch_size=4096, max_iter=80).fit(features)
        else:
            kmeans = KMeans(n_clusters=n_clusters,verbose=verbose).fit(features)
    else:
        kmeans = KMeansGPU(n_clusters=n_clusters, mode='euclidean', verbose=2 if verbose else 0,max_iter=500,tol=1e-2)#
        features=torch.from_numpy(features)#.to(device)
        kmeans.fit_predict(features)#

    print(time.time()-t, "s")

    x = {
        "n_features_in_": kmeans.n_features_in_ if use_gpu is False else features.shape[1],
        "_n_threads": kmeans._n_threads if use_gpu is False else 4,
        "cluster_centers_": kmeans.cluster_centers_ if use_gpu is False else kmeans.centroids.cpu().numpy(),
    }
    print("end")

    return x


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset', type=Path, default="./dataset/44k",
                        help='path of training data directory')
    parser.add_argument('--output', type=Path, default="logs/44k",
                        help='path of model output directory')
    parser.add_argument('--gpu',action='store_true', default=False ,
                        help='to use GPU')

    args = parser.parse_args()

    checkpoint_dir = args.output
    dataset = args.dataset
    use_gpu = args.gpu
    n_clusters = 10000

    ckpt = {}
    for spk in os.listdir(dataset):
        if os.path.isdir(dataset/spk):
            print(f"train kmeans for {spk}...")
            in_dir = dataset/spk
            x = train_cluster(in_dir, n_clusters,use_minibatch=False,verbose=False,use_gpu=use_gpu)
            ckpt[spk] = x

    checkpoint_path = checkpoint_dir / f"kmeans_{n_clusters}.pt"
    checkpoint_path.parent.mkdir(exist_ok=True, parents=True)
    torch.save(
        ckpt,
        checkpoint_path,
    )
================================================
FILE: compress_model.py
================================================
from collections import OrderedDict

import torch

import utils
from models import SynthesizerTrn


def copyStateDict(state_dict):
    # Drop a leading "module" component (added by DataParallel-style wrappers) if present.
    if list(state_dict.keys())[0].startswith('module'):
        start_idx = 1
    else:
        start_idx = 0
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        # Re-join with '.' so that dotted key names are preserved.
        name = '.'.join(k.split('.')[start_idx:])
        new_state_dict[name] = v
    return new_state_dict


def removeOptimizer(config: str, input_model: str, ishalf: bool, output_model: str):
    hps = utils.get_hparams_from_file(config)

    net_g = SynthesizerTrn(hps.data.filter_length // 2 + 1,
                           hps.train.segment_size // hps.data.hop_length,
                           **hps.model)

    # A freshly initialized optimizer replaces the trained optimizer state.
    optim_g = torch.optim.AdamW(net_g.parameters(),
                                hps.train.learning_rate,
                                betas=hps.train.betas,
                                eps=hps.train.eps)

    state_dict_g = torch.load(input_model, map_location="cpu")
    new_dict_g = copyStateDict(state_dict_g)
    keys = []
    for k, v in new_dict_g['model'].items():
        # Weights whose name contains "enc_q" are left out of the released model.
        if "enc_q" in k: continue  # noqa: E701
        keys.append(k)

    new_dict_g = {k: new_dict_g['model'][k].half() for k in keys} if ishalf else {k: new_dict_g['model'][k] for k in keys}

    torch.save(
        {
            'model': new_dict_g,
            'iteration': 0,
            'optimizer': optim_g.state_dict(),
            'learning_rate': 0.0001
        }, output_model)


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("-c",
                        "--config",
                        type=str,
                        default='configs/config.json')
    parser.add_argument("-i", "--input", type=str)
    parser.add_argument("-o", "--output", type=str, default=None)
    parser.add_argument('-hf', '--half', action='store_true', default=False, help='Save as FP16')

    args = parser.parse_args()

    output = args.output

    if output is None:
        import os.path
        filename, ext = os.path.splitext(args.input)
        half = "_half" if args.half else ""
        output = filename + "_release" + half + ext

    removeOptimizer(args.config, args.input, args.half, output)
================================================
FILE: configs/diffusion.yaml
================================================
================================================
FILE: configs_template/config_template.json
================================================
{
"train": {
"log_interval": 200,
"eval_interval": 800,
"seed": 1234,
"epochs": 10000,
"learning_rate": 0.0001,
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"batch_size": 6,
"fp16_run": false,
"half_type": "fp16",
"lr_decay": 0.999875,
"segment_size": 10240,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0,
"use_sr": true,
"max_speclen": 512,
"port": "8001",
"keep_ckpts": 3,
"all_in_mem": false,
"vol_aug":false
},
"data": {
"training_files": "filelists/train.txt",
"validation_files": "filelists/val.txt",
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"filter_length": 2048,
"hop_length": 512,
"win_length": 2048,
"n_mel_channels": 80,
"mel_fmin": 0.0,
"mel_fmax": 22050,
"unit_interpolate_mode":"nearest"
},
"model": {
"inter_channels": 192,
"hidden_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 6,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [3,7,11],
"resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
"upsample_rates": [ 8, 8, 2, 2, 2],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [16,16, 4, 4, 4],
"n_layers_q": 3,
"n_layers_trans_flow": 3,
"n_flow_layer": 4,
"use_spectral_norm": false,
"gin_channels": 768,
"ssl_dim": 768,
"n_speakers": 200,
"vocoder_name":"nsf-hifigan",
"speech_encoder":"vec768l12",
"speaker_embedding":false,
"vol_embedding":false,
"use_depthwise_conv":false,
"flow_share_parameter": false,
"use_automatic_f0_prediction": true,
"use_transformer_flow": false
},
"spk": {
"nyaru": 0,
"huiyu": 1,
"nen": 2,
"paimon": 3,
"yunhao": 4
}
}
================================================
FILE: configs_template/config_tiny_template.json
================================================
{
"train": {
"log_interval": 200,
"eval_interval": 800,
"seed": 1234,
"epochs": 10000,
"learning_rate": 0.0001,
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"batch_size": 6,
"fp16_run": false,
"half_type": "fp16",
"lr_decay": 0.999875,
"segment_size": 10240,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0,
"use_sr": true,
"max_speclen": 512,
"port": "8001",
"keep_ckpts": 3,
"all_in_mem": false,
"vol_aug":false
},
"data": {
"training_files": "filelists/train.txt",
"validation_files": "filelists/val.txt",
"max_wav_value": 32768.0,
"sampling_rate": 44100,
"filter_length": 2048,
"hop_length": 512,
"win_length": 2048,
"n_mel_channels": 80,
"mel_fmin": 0.0,
"mel_fmax": 22050,
"unit_interpolate_mode":"nearest"
},
"model": {
"inter_channels": 192,
"hidden_channels": 192,
"filter_channels": 512,
"n_heads": 2,
"n_layers": 6,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [3,7,11],
"resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
"upsample_rates": [ 8, 8, 2, 2, 2],
"upsample_initial_channel": 400,
"upsample_kernel_sizes": [16,16, 4, 4, 4],
"n_layers_q": 3,
"n_layers_trans_flow": 3,
"n_flow_layer": 4,
"use_spectral_norm": false,
"gin_channels": 768,
"ssl_dim": 768,
"n_speakers": 200,
"vocoder_name":"nsf-hifigan",
"speech_encoder":"vec768l12",
"speaker_embedding":false,
"vol_embedding":false,
"use_depthwise_conv":true,
"flow_share_parameter": true,
"use_automatic_f0_prediction": true,
"use_transformer_flow": false
},
"spk": {
"nyaru": 0,
"huiyu": 1,
"nen": 2,
"paimon": 3,
"yunhao": 4
}
}
================================================
FILE: configs_template/diffusion_template.yaml
================================================
data:
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'vec768l12' # 'hubertsoft', 'vec256l9', 'vec768l12'
  cnhubertsoft_gate: 10
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 768 # 256 if using 'hubertsoft'
  training_files: "filelists/train.txt"
  validation_files: "filelists/val.txt"
  extensions: # List of extension included in the data collection
    - wav
  unit_interpolate_mode: "nearest"
model:
  type: 'Diffusion'
  n_layers: 20
  n_chans: 512
  n_hidden: 256
  use_pitch_aug: true
  timesteps : 1000
  k_step_max: 0 # must <= timesteps, If it is 0, train all
  n_spk: 1 # max number of different speakers
device: cuda
vocoder:
  type: 'nsf-hifigan'
  ckpt: 'pretrain/nsf_hifigan/model'
infer:
  speedup: 10
  method: 'dpm-solver++' # 'pndm' or 'dpm-solver' or 'ddim' or 'unipc' or 'dpm-solver++'
env:
  expdir: logs/44k/diffusion
  gpu_id: 0
train:
  num_workers: 4 # If your cpu and gpu are both very strong, set to 0 may be faster!
  amp_dtype: fp32 # fp32, fp16 or bf16 (fp16 or bf16 may be faster if it is supported by your gpu)
  batch_size: 48
  cache_all_data: true # Save Internal-Memory or Graphics-Memory if it is false, but may be slow
  cache_device: 'cpu' # Set to 'cuda' to cache the data into the Graphics-Memory, fastest speed for strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  interval_force_save: 5000
  lr: 0.0001
  decay_step: 100000
  gamma: 0.5
  weight_decay: 0
  save_opt: false
spk:
  'nyaru': 0
================================================
FILE: data_utils.py
================================================
import os
import random

import numpy as np
import torch
import torch.utils.data

import utils
from modules.mel_processing import spectrogram_torch
from utils import load_filepaths_and_text, load_wav_to_torch

# import h5py


"""Multi speaker version"""


class TextAudioSpeakerLoader(torch.utils.data.Dataset):
    """
        1) loads audio, speaker_id, text pairs
        2) normalizes text and converts them to sequences of integers
        3) computes spectrograms from audio files.
    """

    def __init__(self, audiopaths, hparams, all_in_mem: bool = False, vol_aug: bool = True):
        self.audiopaths = load_filepaths_and_text(audiopaths)
        self.hparams = hparams
        self.max_wav_value = hparams.data.max_wav_value
        self.sampling_rate = hparams.data.sampling_rate
        self.filter_length = hparams.data.filter_length
        self.hop_length = hparams.data.hop_length
        self.win_length = hparams.data.win_length
        self.unit_interpolate_mode = hparams.data.unit_interpolate_mode
        self.sampling_rate = hparams.data.sampling_rate
        self.use_sr = hparams.train.use_sr
        self.spec_len = hparams.train.max_speclen
        self.spk_map = hparams.spk
        self.vol_emb = hparams.model.vol_embedding
        self.vol_aug = hparams.train.vol_aug and vol_aug
        random.seed(1234)
        random.shuffle(self.audiopaths)

        self.all_in_mem = all_in_mem
        if self.all_in_mem:
            self.cache = [self.get_audio(p[0]) for p in self.audiopaths]

    def get_audio(self, filename):
        filename = filename.replace("\\", "/")
        audio, sampling_rate = load_wav_to_torch(filename)
        if sampling_rate != self.sampling_rate:
            raise ValueError(
                "Sample Rate not match. Expect {} but got {} from {}".format(
                    self.sampling_rate, sampling_rate, filename))
        audio_norm = audio / self.max_wav_value
        audio_norm = audio_norm.unsqueeze(0)
        spec_filename = filename.replace(".wav", ".spec.pt")

        # Ideally, all data generated after Mar 25 should have .spec.pt
        if os.path.exists(spec_filename):
            spec = torch.load(spec_filename)
        else:
            spec = spectrogram_torch(audio_norm, self.filter_length,
                                     self.sampling_rate, self.hop_length, self.win_length,
                                     center=False)
            spec = torch.squeeze(spec, 0)
            torch.save(spec, spec_filename)

        spk = filename.split("/")[-2]
        spk = torch.LongTensor([self.spk_map[spk]])

        f0, uv = np.load(filename + ".f0.npy",allow_pickle=True)

        f0 = torch.FloatTensor(np.array(f0,dtype=float))
        uv = torch.FloatTensor(np.array(uv,dtype=float))

        c = torch.load(filename+ ".soft.pt")
        c = utils.repeat_expand_2d(c.squeeze(0), f0.shape[0], mode=self.unit_interpolate_mode)
        if self.vol_emb:
            volume_path = filename + ".vol.npy"
            volume = np.load(volume_path)
            volume = torch.from_numpy(volume).float()
        else:
            volume = None

        lmin = min(c.size(-1), spec.size(-1))
        assert abs(c.size(-1) - spec.size(-1)) < 3, (c.size(-1), spec.size(-1), f0.shape, filename)
        assert abs(audio_norm.shape[1]-lmin * self.hop_length) < 3 * self.hop_length
        spec, c, f0, uv = spec[:, :lmin], c[:, :lmin], f0[:lmin], uv[:lmin]
        audio_norm = audio_norm[:, :lmin * self.hop_length]
        if volume is not None:
            volume = volume[:lmin]
        return c, f0, spec, audio_norm, spk, uv, volume

    def random_slice(self, c, f0, spec, audio_norm, spk, uv, volume):
        # if spec.shape[1] < 30:
        #     print("skip too short audio:", filename)
        #     return None

        if random.choice([True, False]) and self.vol_aug and volume is not None:
            max_amp = float(torch.max(torch.abs(audio_norm))) + 1e-5
            max_shift = min(1, np.log10(1/max_amp))
            log10_vol_shift = random.uniform(-1, max_shift)
            audio_norm = audio_norm * (10 ** log10_vol_shift)
            volume = volume * (10 ** log10_vol_shift)
            spec = spectrogram_torch(audio_norm,
                                     self.hparams.data.filter_length,
                                     self.hparams.data.sampling_rate,
                                     self.hparams.data.hop_length,
                                     self.hparams.data.win_length,
                                     center=False)[0]

        if spec.shape[1] > 800:
            start = random.randint(0, spec.shape[1]-800)
            end = start + 790
            spec, c, f0, uv = spec[:, start:end], c[:, start:end], f0[start:end], uv[start:end]
            audio_norm = audio_norm[:, start * self.hop_length : end * self.hop_length]
            if volume is not None:
                volume = volume[start:end]
        return c, f0, spec, audio_norm, spk, uv,volume

    def __getitem__(self, index):
        if self.all_in_mem:
            return self.random_slice(*self.cache[index])
        else:
            return self.random_slice(*self.get_audio(self.audiopaths[index][0]))

    def __len__(self):
return len(self.audiopaths)
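# Illustrative sketch (not part of the original file): the volume augmentation
# in random_slice draws a log10 gain uniformly between -1 and a cap chosen so
# that the scaled peak does not exceed 1. The helper name sample_volume_gain
# and the example peak of 0.5 are assumptions made for this sketch.

```python
import math
import random


def sample_volume_gain(max_amp, rng=random):
    """Draw a linear gain whose log10 is uniform in [-1, max_shift]."""
    max_amp = max_amp + 1e-5  # same epsilon as the loader, avoids log10(1/0)
    # Cap the shift so that gain * max_amp stays at or below 1 (no clipping).
    max_shift = min(1.0, math.log10(1.0 / max_amp))
    log10_vol_shift = rng.uniform(-1.0, max_shift)
    return 10 ** log10_vol_shift


rng = random.Random(1234)  # seeded only to make the sketch repeatable
peak = 0.5                 # hypothetical peak amplitude of a normalized clip
gains = [sample_volume_gain(peak, rng) for _ in range(1000)]
```

# In the loader the same gain multiplies both audio_norm and volume, and the
# spectrogram is then recomputed from the scaled audio.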
class TextAudioCollate:
def __call__(self, batch):
batch = [b for b in batch if b is not None]
input_lengths, ids_sorted_decreasing = torch.sort(
torch.LongTensor([x[0].shape[1] for x in batch]),
dim=0, descending=True)
max_c_len = max([x[0].size(1) for x in batch])
max_wav_len = max([x[3].size(1) for x in batch])
lengths = torch.LongTensor(len(batch))
c_padded = torch.FloatTensor(len(batch), batch[0][0].shape[0], max_c_len)
f0_padded = torch.FloatTensor(len(batch), max_c_len)
spec_padded = torch.FloatTensor(len(batch), batch[0][2].shape[0], max_c_len)
wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len)
spkids = torch.LongTensor(len(batch), 1)
uv_padded = torch.FloatTensor(len(batch), max_c_len)
volume_padded = torch.FloatTensor(len(batch), max_c_len)
c_padded.zero_()
spec_padded.zero_()
f0_padded.zero_()
wav_padded.zero_()
uv_padded.zero_()
volume_padded.zero_()
for i in range(len(ids_sorted_decreasing)):
row = batch[ids_sorted_decreasing[i]]
c = row[0]
c_padded[i, :, :c.size(1)] = c
lengths[i] = c.size(1)
f0 = row[1]
f0_padded[i, :f0.size(0)] = f0
spec = row[2]
spec_padded[i, :, :spec.size(1)] = spec
wav = row[3]
wav_padded[i, :, :wav.size(1)] = wav
spkids[i, 0] = row[4]
uv = row[5]
uv_padded[i, :uv.size(0)] = uv
volume = row[6]
if volume is not None:
volume_padded[i, :volume.size(0)] = volume
else :
volume_padded = None
return c_padded, f0_padded, spec_padded, wav_padded, spkids, lengths, uv_padded, volume_padded
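# Illustrative sketch (not part of the original file): TextAudioCollate sorts
# the batch by length in descending order and zero-pads every item to the
# longest one. The pure-Python helper below, pad_batch, is a hypothetical
# stand-in that shows the same idea on plain lists instead of tensors.

```python
def pad_batch(sequences, pad_value=0.0):
    """Sort sequences by length (longest first) and right-pad to equal length."""
    order = sorted(range(len(sequences)),
                   key=lambda i: len(sequences[i]), reverse=True)
    max_len = max(len(s) for s in sequences)
    padded, lengths = [], []
    for i in order:
        seq = list(sequences[i])
        lengths.append(len(seq))  # true length, kept for masking later
        padded.append(seq + [pad_value] * (max_len - len(seq)))
    return padded, lengths


padded, lengths = pad_batch([[1.0, 2.0], [3.0, 4.0, 5.0, 6.0], [7.0]])
```

# The returned lengths play the role of the `lengths` tensor above: they let
# downstream code ignore the padded positions.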
================================================
FILE: diffusion/__init__.py
================================================
================================================
FILE: diffusion/data_loaders.py
================================================
import os
import random
import librosa
import numpy as np
import torch
from torch.utils.data import Dataset
from tqdm import tqdm
from utils import repeat_expand_2d
def traverse_dir(
root_dir,
extensions,
amount=None,
str_include=None,
str_exclude=None,
is_pure=False,
is_sort=False,
is_ext=True):
file_list = []
cnt = 0
for root, _, files in os.walk(root_dir):
for file in files:
if any([file.endswith(f".{ext}") for ext in extensions]):
# path
mix_path = os.path.join(root, file)
pure_path = mix_path[len(root_dir)+1:] if is_pure else mix_path
# amount
if (amount is not None) and (cnt == amount):
if is_sort:
file_list.sort()
return file_list
# check string
if (str_include is not None) and (str_include not in pure_path):
continue
if (str_exclude is not None) and (str_exclude in pure_path):
continue
if not is_ext:
ext = pure_path.split('.')[-1]
pure_path = pure_path[:-(len(ext)+1)]
file_list.append(pure_path)
cnt += 1
if is_sort:
file_list.sort()
return file_list
def get_data_loaders(args, whole_audio=False):
data_train = AudioDataset(
filelists = args.data.training_files,
waveform_sec=args.data.duration,
hop_size=args.data.block_size,
sample_rate=args.data.sampling_rate,
load_all_data=args.train.cache_all_data,
whole_audio=whole_audio,
extensions=args.data.extensions,
n_spk=args.model.n_spk,
spk=args.spk,
device=args.train.cache_device,
fp16=args.train.cache_fp16,
unit_interpolate_mode = args.data.unit_interpolate_mode,
use_aug=True)
loader_train = torch.utils.data.DataLoader(
data_train ,
batch_size=args.train.batch_size if not whole_audio else 1,
shuffle=True,
num_workers=args.train.num_workers if args.train.cache_device=='cpu' else 0,
persistent_workers=(args.train.num_workers > 0) if args.train.cache_device=='cpu' else False,
pin_memory=True if args.train.cache_device=='cpu' else False
)
data_valid = AudioDataset(
filelists = args.data.validation_files,
waveform_sec=args.data.duration,
hop_size=args.data.block_size,
sample_rate=args.data.sampling_rate,
load_all_data=args.train.cache_all_data,
whole_audio=True,
spk=args.spk,
extensions=args.data.extensions,
unit_interpolate_mode = args.data.unit_interpolate_mode,
n_spk=args.model.n_spk)
loader_valid = torch.utils.data.DataLoader(
data_valid,
batch_size=1,
shuffle=False,
num_workers=0,
pin_memory=True
)
return loader_train, loader_valid
class AudioDataset(Dataset):
def __init__(
self,
filelists,
waveform_sec,
hop_size,
sample_rate,
spk,
load_all_data=True,
whole_audio=False,
extensions=['wav'],
n_spk=1,
device='cpu',
fp16=False,
use_aug=False,
unit_interpolate_mode = 'left'
):
super().__init__()
self.waveform_sec = waveform_sec
self.sample_rate = sample_rate
self.hop_size = hop_size
self.filelists = filelists
self.whole_audio = whole_audio
self.use_aug = use_aug
self.data_buffer={}
self.pitch_aug_dict = {}
self.unit_interpolate_mode = unit_interpolate_mode
# np.load(os.path.join(self.path_root, 'pitch_aug_dict.npy'), allow_pickle=True).item()
if load_all_data:
print('Load all the data filelists:', filelists)
else:
print('Load the f0, volume data filelists:', filelists)
with open(filelists,"r") as f:
self.paths = f.read().splitlines()
for name_ext in tqdm(self.paths, total=len(self.paths)):
path_audio = name_ext
duration = librosa.get_duration(filename = path_audio, sr = self.sample_rate)
path_f0 = name_ext + ".f0.npy"
f0,_ = np.load(path_f0,allow_pickle=True)
f0 = torch.from_numpy(np.array(f0,dtype=float)).float().unsqueeze(-1).to(device)
path_volume = name_ext + ".vol.npy"
volume = np.load(path_volume)
volume = torch.from_numpy(volume).float().unsqueeze(-1).to(device)
path_augvol = name_ext + ".aug_vol.npy"
aug_vol = np.load(path_augvol)
aug_vol = torch.from_numpy(aug_vol).float().unsqueeze(-1).to(device)
if n_spk is not None and n_spk > 1:
spk_name = name_ext.split("/")[-2]
spk_id = spk[spk_name] if spk_name in spk else 0
if spk_id < 0 or spk_id >= n_spk:
raise ValueError(' [x] Multi-speaker training error: spk_id must be an integer from 0 to n_spk-1 ')
else:
spk_id = 0
spk_id = torch.LongTensor(np.array([spk_id])).to(device)
if load_all_data:
'''
audio, sr = librosa.load(path_audio, sr=self.sample_rate)
if len(audio.shape) > 1:
audio = librosa.to_mono(audio)
audio = torch.from_numpy(audio).to(device)
'''
path_mel = name_ext + ".mel.npy"
mel = np.load(path_mel)
mel = torch.from_numpy(mel).to(device)
path_augmel = name_ext + ".aug_mel.npy"
aug_mel,keyshift = np.load(path_augmel, allow_pickle=True)
aug_mel = np.array(aug_mel,dtype=float)
aug_mel = torch.from_numpy(aug_mel).to(device)
self.pitch_aug_dict[name_ext] = keyshift
path_units = name_ext + ".soft.pt"
units = torch.load(path_units).to(device)
units = units[0]
units = repeat_expand_2d(units,f0.size(0),unit_interpolate_mode).transpose(0,1)
if fp16:
mel = mel.half()
aug_mel = aug_mel.half()
units = units.half()
self.data_buffer[name_ext] = {
'duration': duration,
'mel': mel,
'aug_mel': aug_mel,
'units': units,
'f0': f0,
'volume': volume,
'aug_vol': aug_vol,
'spk_id': spk_id
}
else:
path_augmel = name_ext + ".aug_mel.npy"
aug_mel,keyshift = np.load(path_augmel, allow_pickle=True)
self.pitch_aug_dict[name_ext] = keyshift
self.data_buffer[name_ext] = {
'duration': duration,
'f0': f0,
'volume': volume,
'aug_vol': aug_vol,
'spk_id': spk_id
}
def __getitem__(self, file_idx):
name_ext = self.paths[file_idx]
data_buffer = self.data_buffer[name_ext]
# check duration. if too short, then skip
if data_buffer['duration'] < (self.waveform_sec + 0.1):
return self.__getitem__( (file_idx + 1) % len(self.paths))
# get item
return self.get_data(name_ext, data_buffer)
def get_data(self, name_ext, data_buffer):
name = os.path.splitext(name_ext)[0]
frame_resolution = self.hop_size / self.sample_rate
duration = data_buffer['duration']
waveform_sec = duration if self.whole_audio else self.waveform_sec
# load audio
idx_from = 0 if self.whole_audio else random.uniform(0, duration - waveform_sec - 0.1)
start_frame = int(idx_from / frame_resolution)
units_frame_len = int(waveform_sec / frame_resolution)
aug_flag = random.choice([True, False]) and self.use_aug
'''
audio = data_buffer.get('audio')
if audio is None:
path_audio = os.path.join(self.path_root, 'audio', name) + '.wav'
audio, sr = librosa.load(
path_audio,
sr = self.sample_rate,
offset = start_frame * frame_resolution,
duration = waveform_sec)
if len(audio.shape) > 1:
audio = librosa.to_mono(audio)
# clip audio into N seconds
audio = audio[ : audio.shape[-1] // self.hop_size * self.hop_size]
audio = torch.from_numpy(audio).float()
else:
audio = audio[start_frame * self.hop_size : (start_frame + units_frame_len) * self.hop_size]
'''
# load mel
mel_key = 'aug_mel' if aug_flag else 'mel'
mel = data_buffer.get(mel_key)
if mel is None:
mel = name_ext + ".mel.npy"
mel = np.load(mel)
mel = mel[start_frame : start_frame + units_frame_len]
mel = torch.from_numpy(mel).float()
else:
mel = mel[start_frame : start_frame + units_frame_len]
# load f0
f0 = data_buffer.get('f0')
aug_shift = 0
if aug_flag:
aug_shift = self.pitch_aug_dict[name_ext]
f0_frames = 2 ** (aug_shift / 12) * f0[start_frame : start_frame + units_frame_len]
# load units
units = data_buffer.get('units')
if units is None:
path_units = name_ext + ".soft.pt"
units = torch.load(path_units)
units = units[0]
units = repeat_expand_2d(units,f0.size(0),self.unit_interpolate_mode).transpose(0,1)
units = units[start_frame : start_frame + units_frame_len]
# load volume
vol_key = 'aug_vol' if aug_flag else 'volume'
volume = data_buffer.get(vol_key)
volume_frames = volume[start_frame : start_frame + units_frame_len]
# load spk_id
spk_id = data_buffer.get('spk_id')
# load shift
aug_shift = torch.from_numpy(np.array([[aug_shift]])).float()
return dict(mel=mel, f0=f0_frames, volume=volume_frames, units=units, spk_id=spk_id, aug_shift=aug_shift, name=name, name_ext=name_ext)
def __len__(self):
return len(self.paths)
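# Illustrative sketch (not part of the original file): get_data converts a
# start time and a clip duration in seconds into frame indices using
# frame_resolution = hop_size / sample_rate, and scales f0 by 2 ** (shift / 12)
# when the pitch-augmented mel is used. The helper names and the example
# values (hop size 512, 44100 Hz) are assumptions made for this sketch.

```python
def crop_frames(idx_from_sec, waveform_sec, hop_size, sample_rate):
    """Return the [start, end) frame range for a crop given in seconds."""
    frame_resolution = hop_size / sample_rate  # seconds per frame
    start_frame = int(idx_from_sec / frame_resolution)
    n_frames = int(waveform_sec / frame_resolution)
    return start_frame, start_frame + n_frames


def pitch_shift_factor(keyshift_semitones):
    """Frequency ratio for a shift in semitones (12 semitones = one octave)."""
    return 2 ** (keyshift_semitones / 12)


start, end = crop_frames(1.0, 2.0, hop_size=512, sample_rate=44100)
```

# The same frame range is applied to mel, f0, units and volume, which is why
# all of them must share one frame rate.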
================================================
FILE: diffusion/diffusion.py
================================================
from collections import deque
from functools import partial
from inspect import isfunction
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from tqdm import tqdm
def exists(x):
return x is not None
def default(val, d):
if exists(val):
return val
return d() if isfunction(d) else d
def extract(a, t, x_shape):
b, *_ = t.shape
out = a.gather(-1, t)
return out.reshape(b, *((1,) * (len(x_shape) - 1)))
def noise_like(shape, device, repeat=False):
def repeat_noise():
return torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1)))
def noise():
return torch.randn(shape, device=device)
return repeat_noise() if repeat else noise()
def linear_beta_schedule(timesteps, max_beta=0.02):
"""
linear schedule
"""
betas = np.linspace(1e-4, max_beta, timesteps)
return betas
def cosine_beta_schedule(timesteps, s=0.008):
"""
cosine schedule
as proposed in https://openreview.net/forum?id=-NEXDKk8gZ
"""
steps = timesteps + 1
x = np.linspace(0, steps, steps)
alphas_cumprod = np.cos(((x / steps) + s) / (1 + s) * np.pi * 0.5) ** 2
alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
return np.clip(betas, a_min=0, a_max=0.999)
beta_schedule = {
"cosine": cosine_beta_schedule,
"linear": linear_beta_schedule,
}
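# Illustrative sketch (not part of the original file): the linear schedule
# spaces the betas evenly between 1e-4 and max_beta, and the diffusion class
# then works with the cumulative product of (1 - beta). The pure-Python
# helpers below are hypothetical equivalents of np.linspace and np.cumprod.

```python
def linear_betas(timesteps, min_beta=1e-4, max_beta=0.02):
    """Evenly spaced betas from min_beta to max_beta, both ends included."""
    if timesteps == 1:
        return [min_beta]
    step = (max_beta - min_beta) / (timesteps - 1)
    return [min_beta + i * step for i in range(timesteps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t), i.e. alphas_cumprod."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


betas = linear_betas(1000)
alphas_cumprod = cumulative_alphas(betas)
```

# alphas_cumprod decreases monotonically toward 0, so a sample at the last
# timestep is close to pure Gaussian noise.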
class GaussianDiffusion(nn.Module):
def __init__(self,
denoise_fn,
out_dims=128,
timesteps=1000,
k_step=1000,
max_beta=0.02,
spec_min=-12,
spec_max=2):
super().__init__()
self.denoise_fn = denoise_fn
self.out_dims = out_dims
betas = beta_schedule['linear'](timesteps, max_beta=max_beta)
alphas = 1. - betas
alphas_cumprod = np.cumprod(alphas, axis=0)
alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1])
timesteps, = betas.shape
self.num_timesteps = int(timesteps)
self.k_step = k_step if k_step>0 and k_step<timesteps else timesteps
self.noise_list = deque(maxlen=4)
to_torch = partial(torch.tensor, dtype=torch.float32)
self.register_buffer('betas', to_torch(betas))
self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev))
# calculations for diffusion q(x_t | x_{t-1}) and others
self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))
self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod)))
self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod)))
self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod)))
self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1)))
# calculations for posterior q(x_{t-1} | x_t, x_0)
posterior_variance = betas * (1. - alphas_cumprod_prev) / (1. - alphas_cumprod)
# above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)
self.register_buffer('posterior_variance', to_torch(posterior_variance))
# below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20))))
self.register_buffer('posterior_mean_coef1', to_torch(
betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod)))
self.register_buffer('posterior_mean_coef2', to_torch(
(1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod)))
self.register_buffer('spec_min', torch.FloatTensor([spec_min])[None, None, :out_dims])
self.register_buffer('spec_max', torch.FloatTensor([spec_max])[None, None, :out_dims])
def q_mean_variance(self, x_start, t):
mean = extract(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
variance = extract(1. - self.alphas_cumprod, t, x_start.shape)
log_variance = extract(self.log_one_minus_alphas_cumprod, t, x_start.shape)
return mean, variance, log_variance
def predict_start_from_noise(self, x_t, t, noise):
return (
extract(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -
extract(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise
)
def q_posterior(self, x_start, x_t, t):
posterior_mean = (
extract(self.posterior_mean_coef1, t, x_t.shape) * x_start +
extract(self.posterior_mean_coef2, t, x_t.shape) * x_t
)
posterior_variance = extract(self.posterior_variance, t, x_t.shape)
posterior_log_variance_clipped = extract(self.posterior_log_variance_clipped, t, x_t.shape)
return posterior_mean, posterior_variance, posterior_log_variance_clipped
def p_mean_variance(self, x, t, cond):
noise_pred = self.denoise_fn(x, t, cond=cond)
x_recon = self.predict_start_from_noise(x, t=t, noise=noise_pred)
x_recon.clamp_(-1., 1.)
model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
return model_mean, posterior_variance, posterior_log_variance
@torch.no_grad()
def p_sample_ddim(self, x, t, interval, cond):
"""
Use the DDIM method (deterministic update: no noise is added at each step).
"""
a_t = extract(self.alphas_cumprod, t, x.shape)
a_prev = extract(self.alphas_cumprod, torch.max(t - interval, torch.zeros_like(t)), x.shape)
noise_pred = self.denoise_fn(x, t, cond=cond)
x_prev = a_prev.sqrt() * (x / a_t.sqrt() + (((1 - a_prev) / a_prev).sqrt()-((1 - a_t) / a_t).sqrt()) * noise_pred)
return x_prev
@torch.no_grad()
def p_sample(self, x, t, cond, clip_denoised=True, repeat_noise=False):
b, *_, device = *x.shape, x.device
model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, cond=cond)
noise = noise_like(x.shape, device, repeat_noise)
# no noise when t == 0
nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
@torch.no_grad()
def p_sample_plms(self, x, t, interval, cond, clip_denoised=True, repeat_noise=False):
"""
Use the PLMS method from
[Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778).
"""
def get_x_pred(x, noise_t, t):
a_t = extract(self.alphas_cumprod, t, x.shape)
a_prev = extract(self.alphas_cumprod, torch.max(t - interval, torch.zeros_like(t)), x.shape)
a_t_sq, a_prev_sq = a_t.sqrt(), a_prev.sqrt()
x_delta = (a_prev - a_t) * ((1 / (a_t_sq * (a_t_sq + a_prev_sq))) * x - 1 / (
a_t_sq * (((1 - a_prev) * a_t).sqrt() + ((1 - a_t) * a_prev).sqrt())) * noise_t)
x_pred = x + x_delta
return x_pred
noise_list = self.noise_list
noise_pred = self.denoise_fn(x, t, cond=cond)
if len(noise_list) == 0:
x_pred = get_x_pred(x, noise_pred, t)
noise_pred_prev = self.denoise_fn(x_pred, max(t - interval, 0), cond=cond)
noise_pred_prime = (noise_pred + noise_pred_prev) / 2
elif len(noise_list) == 1:
noise_pred_prime = (3 * noise_pred - noise_list[-1]) / 2
elif len(noise_list) == 2:
noise_pred_prime = (23 * noise_pred - 16 * noise_list[-1] + 5 * noise_list[-2]) / 12
else:
noise_pred_prime = (55 * noise_pred - 59 * noise_list[-1] + 37 * noise_list[-2] - 9 * noise_list[-3]) / 24
x_prev = get_x_pred(x, noise_pred_prime, t)
noise_list.append(noise_pred)
return x_prev
def q_sample(self, x_start, t, noise=None):
noise = default(noise, lambda: torch.randn_like(x_start))
return (
extract(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
extract(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise
)
def p_losses(self, x_start, t, cond, noise=None, loss_type='l2'):
noise = default(noise, lambda: torch.randn_like(x_start))
x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
x_recon = self.denoise_fn(x_noisy, t, cond)
if loss_type == 'l1':
loss = (noise - x_recon).abs().mean()
elif loss_type == 'l2':
loss = F.mse_loss(noise, x_recon)
else:
raise NotImplementedError()
return loss
def forward(self,
condition,
gt_spec=None,
infer=True,
infer_speedup=10,
method='dpm-solver',
k_step=300,
use_tqdm=True):
"""
conditioning diffusion, use fastspeech2 encoder output as the condition
"""
cond = condition.transpose(1, 2)
b, device = condition.shape[0], condition.device
if not infer:
spec = self.norm_spec(gt_spec)
t = torch.randint(0, self.k_step, (b,), device=device).long()
norm_spec = spec.transpose(1, 2)[:, None, :, :] # [B, 1, M, T]
return self.p_losses(norm_spec, t, cond=cond)
else:
shape = (cond.shape[0], 1, self.out_dims, cond.shape[2])
if gt_spec is None:
t = self.k_step
x = torch.randn(shape, device=device)
else:
t = k_step
norm_spec = self.norm_spec(gt_spec)
norm_spec = norm_spec.transpose(1, 2)[:, None, :, :]
x = self.q_sample(x_start=norm_spec, t=torch.tensor([t - 1], device=device).long())
if method is not None and infer_speedup > 1:
if method == 'dpm-solver' or method == 'dpm-solver++':
from .dpm_solver_pytorch import (
DPM_Solver,
NoiseScheduleVP,
model_wrapper,
)
# 1. Define the noise schedule.
noise_schedule = NoiseScheduleVP(schedule='discrete', betas=self.betas[:t])
# 2. Convert your discrete-time `model` to the continuous-time
# noise prediction model. Here is an example for a diffusion model
# `model` with the noise prediction type ("noise") .
def my_wrapper(fn):
def wrapped(x, t, **kwargs):
ret = fn(x, t, **kwargs)
if use_tqdm:
self.bar.update(1)
return ret
return wrapped
model_fn = model_wrapper(
my_wrapper(self.denoise_fn),
noise_schedule,
model_type="noise", # or "x_start" or "v" or "score"
model_kwargs={"cond": cond}
)
# 3. Define dpm-solver and sample with the multistep DPM-Solver
# (the sample() call uses method="multistep" with order=2).
# You can adjust the `steps` to balance the computation
# costs and the sample quality.
if method == 'dpm-solver':
dpm_solver = DPM_Solver(model_fn, noise_schedule, algorithm_type="dpmsolver")
elif method == 'dpm-solver++':
dpm_solver = DPM_Solver(model_fn, noise_schedule, algorithm_type="dpmsolver++")
steps = t // infer_speedup
if use_tqdm:
self.bar = tqdm(desc="sample time step", total=steps)
x = dpm_solver.sample(
x,
steps=steps,
order=2,
skip_type="time_uniform",
method="multistep",
)
if use_tqdm:
self.bar.close()
elif method == 'pndm':
self.noise_list = deque(maxlen=4)
if use_tqdm:
for i in tqdm(
reversed(range(0, t, infer_speedup)), desc='sample time step',
total=t // infer_speedup,
):
x = self.p_sample_plms(
x, torch.full((b,), i, device=device, dtype=torch.long),
infer_speedup, cond=cond
)
else:
for i in reversed(range(0, t, infer_speedup)):
x = self.p_sample_plms(
x, torch.full((b,), i, device=device, dtype=torch.long),
infer_speedup, cond=cond
)
elif method == 'ddim':
if use_tqdm:
for i in tqdm(
reversed(range(0, t, infer_speedup)), desc='sample time step',
total=t // infer_speedup,
):
x = self.p_sample_ddim(
x, torch.full((b,), i, device=device, dtype=torch.long),
infer_speedup, cond=cond
)
else:
for i in reversed(range(0, t, infer_speedup)):
x = self.p_sample_ddim(
x, torch.full((b,), i, device=device, dtype=torch.long),
infer_speedup, cond=cond
)
elif method == 'unipc':
from .uni_pc import NoiseScheduleVP, UniPC, model_wrapper
# 1. Define the noise schedule.
noise_schedule = NoiseScheduleVP(schedule='discrete', betas=self.betas[:t])
# 2. Convert your discrete-time `model` to the continuous-time
# noise prediction model. Here is an example for a diffusion model
# `model` with the noise prediction type ("noise") .
def my_wrapper(fn):
def wrapped(x, t, **kwargs):
ret = fn(x, t, **kwargs)
if use_tqdm:
self.bar.update(1)
return ret
return wrapped
model_fn = model_wrapper(
my_wrapper(self.denoise_fn),
noise_schedule,
model_type="noise", # or "x_start" or "v" or "score"
model_kwargs={"cond": cond}
)
# 3. Define uni_pc and sample by multistep UniPC.
# You can adjust the `steps` to balance the computation
# costs and the sample quality.
uni_pc = UniPC(model_fn, noise_schedule, variant='bh2')
steps = t // infer_speedup
if use_tqdm:
self.bar = tqdm(desc="sample time step", total=steps)
x = uni_pc.sample(
x,
steps=steps,
order=2,
skip_type="time_uniform",
method="multistep",
)
if use_tqdm:
self.bar.close()
else:
raise NotImplementedError(method)
else:
if use_tqdm:
for i in tqdm(reversed(range(0, t)), desc='sample time step', total=t):
x = self.p_sample(x, torch.full((b,), i, device=device, dtype=torch.long), cond)
else:
for i in reversed(range(0, t)):
x = self.p_sample(x, torch.full((b,), i, device=device, dtype=torch.long), cond)
x = x.squeeze(1).transpose(1, 2) # [B, T, M]
return self.denorm_spec(x)
def norm_spec(self, x):
return (x - self.spec_min) / (self.spec_max - self.spec_min) * 2 - 1
def denorm_spec(self, x):
return (x + 1) / 2 * (self.spec_max - self.spec_min) + self.spec_min
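# Illustrative sketch (not part of the original file): norm_spec maps the
# range [spec_min, spec_max] linearly onto [-1, 1] and denorm_spec inverts it.
# The scalar functions below copy the two formulas above, using the
# constructor defaults spec_min=-12 and spec_max=2.

```python
def norm_spec(x, spec_min=-12.0, spec_max=2.0):
    """Map [spec_min, spec_max] linearly onto [-1, 1]."""
    return (x - spec_min) / (spec_max - spec_min) * 2 - 1


def denorm_spec(x, spec_min=-12.0, spec_max=2.0):
    """Inverse of norm_spec: map [-1, 1] back to [spec_min, spec_max]."""
    return (x + 1) / 2 * (spec_max - spec_min) + spec_min


roundtrip = denorm_spec(norm_spec(-3.7))
```

# Values outside [spec_min, spec_max] land outside [-1, 1]; p_mean_variance
# clamps the reconstructed sample to that interval.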
================================================
FILE: diffusion/diffusion_onnx.py
================================================
import math
from collections import deque
from functools import partial
from inspect import isfunction
import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torch.nn import Conv1d, Mish
from tqdm import tqdm
def exists(x):
return x is not None
def default(val, d):
if exists(val):
return val
return d() if isfunction(d) else d
def extract(a, t):
return a[t].reshape((1, 1, 1, 1))
def noise_like(shape, device, repeat=False):
def repeat_noise():
return torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1)))
def noise():
return torch.randn(shape, device=device)
return repeat_noise() if repeat else noise()
def linear_beta_schedule(timesteps, max_beta=0.02):
"""
linear schedule
"""
betas = np.linspace(1e-4, max_beta, timesteps)
return betas
def cosine_beta_schedule(timesteps, s=0.008):
"""
cosine schedule
as proposed in https://openreview.net/forum?id=-NEXDKk8gZ
"""
steps = timesteps + 1
x = np.linspace(0, steps, steps)
alphas_cumprod = np.cos(((x / steps) + s) / (1 + s) * np.pi * 0.5) ** 2
alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
return np.clip(betas, a_min=0, a_max=0.999)
beta_schedule = {
"cosine": cosine_beta_schedule,
"linear": linear_beta_schedule,
}
def extract_1(a, t):
return a[t].reshape((1, 1, 1, 1))
def predict_stage0(noise_pred, noise_pred_prev):
return (noise_pred + noise_pred_prev) / 2
def predict_stage1(noise_pred, noise_list):
return (noise_pred * 3
- noise_list[-1]) / 2
def predict_stage2(noise_pred, noise_list):
return (noise_pred * 23
- noise_list[-1] * 16
+ noise_list[-2] * 5) / 12
def predict_stage3(noise_pred, noise_list):
return (noise_pred * 55
- noise_list[-1] * 59
+ noise_list[-2] * 37
- noise_list[-3] * 9) / 24
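# Illustrative sketch (not part of the original file): predict_stage1 to
# predict_stage3 use the weights (3, -1)/2, (23, -16, 5)/12 and
# (55, -59, 37, -9)/24, which are the Adams-Bashforth coefficients of order
# 2 to 4. Each set sums to 1, so a constant noise history is left unchanged.
# The table AB_WEIGHTS and the helper combine_noise are names invented for
# this sketch; stage 0, which needs an extra model call, is left out.

```python
AB_WEIGHTS = {
    1: ([3, -1], 2),
    2: ([23, -16, 5], 12),
    3: ([55, -59, 37, -9], 24),
}


def combine_noise(noise_pred, history):
    """Combine the current prediction with up to three earlier ones.

    history holds earlier predictions, oldest first and newest last.
    """
    n = min(len(history), 3)
    if n == 0:
        raise ValueError("history is empty; stage 0 is not covered here")
    weights, denom = AB_WEIGHTS[n]
    terms = [noise_pred] + [history[-k] for k in range(1, n + 1)]
    return sum(w * v for w, v in zip(weights, terms)) / denom


constant = combine_noise(2.0, [2.0, 2.0, 2.0])
linear = combine_noise(3.0, [1.0, 2.0])
```

# For a linearly growing history (1, 2, 3) the order-3 rule extrapolates half
# a step ahead, which is how a multistep method estimates the slope over the
# next interval.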
class SinusoidalPosEmb(nn.Module):
def __init__(self, dim):
super().__init__()
self.dim = dim
self.half_dim = dim // 2
self.emb = 9.21034037 / (self.half_dim - 1)
self.emb = torch.exp(torch.arange(self.half_dim) * torch.tensor(-self.emb)).unsqueeze(0)
self.emb = self.emb.cpu()
def forward(self, x):
emb = self.emb * x
emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
return emb
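# Illustrative sketch (not part of the original file): the constant 9.21034037
# in SinusoidalPosEmb equals ln(10000), so the frequencies decay geometrically
# from 1 down to 1/10000. The list-based function below is a hypothetical
# scalar version of the forward pass.

```python
import math


def sinusoidal_embedding(step, dim):
    """Sine and cosine features of a scalar diffusion step."""
    half_dim = dim // 2
    scale = math.log(10000.0) / (half_dim - 1)
    freqs = [math.exp(-scale * i) for i in range(half_dim)]
    # Sines first, then cosines, matching the torch.cat order in forward().
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]


emb = sinusoidal_embedding(0.0, 8)
```

# Note that half_dim must be at least 2, otherwise the scale divides by zero.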
class ResidualBlock(nn.Module):
def __init__(self, encoder_hidden, residual_channels, dilation):
super().__init__()
self.residual_channels = residual_channels
self.dilated_conv = Conv1d(residual_channels, 2 * residual_channels, 3, padding=dilation, dilation=dilation)
self.diffusion_projection = nn.Linear(residual_channels, residual_channels)
self.conditioner_projection = Conv1d(encoder_hidden, 2 * residual_channels, 1)
self.output_projection = Conv1d(residual_channels, 2 * residual_channels, 1)
def forward(self, x, conditioner, diffusion_step):
diffusion_step = self.diffusion_projection(diffusion_step).unsqueeze(-1)
conditioner = self.conditioner_projection(conditioner)
y = x + diffusion_step
y = self.dilated_conv(y) + conditioner
gate, filter_1 = torch.split(y, [self.residual_channels, self.residual_channels], dim=1)
y = torch.sigmoid(gate) * torch.tanh(filter_1)
y = self.output_projection(y)
residual, skip = torch.split(y, [self.residual_channels, self.residual_channels], dim=1)
return (x + residual) / 1.41421356, skip
class DiffNet(nn.Module):
def __init__(self, in_dims, n_layers, n_chans, n_hidden):
super().__init__()
self.encoder_hidden = n_hidden
self.residual_layers = n_layers
self.residual_channels = n_chans
self.input_projection = Conv1d(in_dims, self.residual_channels, 1)
self.diffusion_embedding = SinusoidalPosEmb(self.residual_channels)
dim = self.residual_channels
self.mlp = nn.Sequential(
nn.Linear(dim, dim * 4),
Mish(),
nn.Linear(dim * 4, dim)
)
self.residual_layers = nn.ModuleList([
ResidualBlock(self.encoder_hidden, self.residual_channels, 1)
for i in range(self.residual_layers)
])
self.skip_projection = Conv1d(self.residual_channels, self.residual_channels, 1)
self.output_projection = Conv1d(self.residual_channels, in_dims, 1)
nn.init.zeros_(self.output_projection.weight)
def forward(self, spec, diffusion_step, cond):
x = spec.squeeze(0)
x = self.input_projection(x) # x [B, residual_channel, T]
x = F.relu(x)
# skip = torch.randn_like(x)
diffusion_step = diffusion_step.float()
diffusion_step = self.diffusion_embedding(diffusion_step)
diffusion_step = self.mlp(diffusion_step)
x, skip = self.residual_layers[0](x, cond, diffusion_step)
# noinspection PyTypeChecker
for layer in self.residual_layers[1:]:
x, skip_connection = layer.forward(x, cond, diffusion_step)
skip = skip + skip_connection
x = skip / math.sqrt(len(self.residual_layers))
x = self.skip_projection(x)
x = F.relu(x)
x = self.output_projection(x) # [B, in_dims, T]
return x.unsqueeze(1)
class AfterDiffusion(nn.Module):
def __init__(self, spec_max, spec_min, v_type='a'):
super().__init__()
self.spec_max = spec_max
self.spec_min = spec_min
self.type = v_type
def forward(self, x):
x = x.squeeze(1).permute(0, 2, 1)
mel_out = (x + 1) / 2 * (self.spec_max - self.spec_min) + self.spec_min
if self.type == 'nsf-hifigan-log10':
mel_out = mel_out * 0.434294
return mel_out.transpose(2, 1)
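# Illustrative sketch (not part of the original file): the factor 0.434294
# applied for 'nsf-hifigan-log10' matches log10(e), which suggests (this is an
# interpretation, not something the source states) that it rescales a
# natural-log mel to a base-10 log mel. The names LOG10_E and ln_to_log10 are
# assumptions made for this sketch.

```python
import math

LOG10_E = math.log10(math.e)  # about 0.4342945


def ln_to_log10(value_ln):
    """Convert a natural-log value to base 10: log10(x) = ln(x) * log10(e)."""
    return value_ln * LOG10_E


converted = ln_to_log10(math.log(100.0))
```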
class Pred(nn.Module):
def __init__(self, alphas_cumprod):
super().__init__()
self.alphas_cumprod = alphas_cumprod
def forward(self, x_1, noise_t, t_1, t_prev):
a_t = extract(self.alphas_cumprod, t_1).cpu()
a_prev = extract(self.alphas_cumprod, t_prev).cpu()
a_t_sq, a_prev_sq = a_t.sqrt().cpu(), a_prev.sqrt().cpu()
x_delta = (a_prev - a_t) * ((1 / (a_t_sq * (a_t_sq + a_prev_sq))) * x_1 - 1 / (
a_t_sq * (((1 - a_prev) * a_t).sqrt() + ((1 - a_t) * a_prev).sqrt())) * noise_t)
x_pred = x_1 + x_delta.cpu()
return x_pred
class GaussianDiffusion(nn.Module):
def __init__(self,
out_dims=128,
n_layers=20,
n_chans=384,
n_hidden=256,
timesteps=1000,
k_step=1000,
max_beta=0.02,
spec_min=-12,
spec_max=2):
super().__init__()
self.denoise_fn = DiffNet(out_dims, n_layers, n_chans, n_hidden)
self.out_dims = out_dims
        self.mel_bins = out_dims
        self.n_hidden = n_hidden
        betas = beta_schedule['linear'](timesteps, max_beta=max_beta)
        alphas = 1. - betas
        alphas_cumprod = np.cumprod(alphas, axis=0)
        alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1])
        timesteps, = betas.shape
        self.num_timesteps = int(timesteps)
        self.k_step = k_step
        self.noise_list = deque(maxlen=4)
        to_torch = partial(torch.tensor, dtype=torch.float32)
        self.register_buffer('betas', to_torch(betas))
        self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
        self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev))
        # calculations for diffusion q(x_t | x_{t-1}) and others
        self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))
        self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod)))
        self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod)))
        self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod)))
        self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1)))
        # calculations for posterior q(x_{t-1} | x_t, x_0)
        posterior_variance = betas * (1. - alphas_cumprod_prev) / (1. - alphas_cumprod)
        # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)
        self.register_buffer('posterior_variance', to_torch(posterior_variance))
        # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
        self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20))))
        self.register_buffer('posterior_mean_coef1', to_torch(
            betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod)))
        self.register_buffer('posterior_mean_coef2', to_torch(
            (1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod)))
        self.register_buffer('spec_min', torch.FloatTensor([spec_min])[None, None, :out_dims])
        self.register_buffer('spec_max', torch.FloatTensor([spec_max])[None, None, :out_dims])
        self.ad = AfterDiffusion(self.spec_max, self.spec_min)
        self.xp = Pred(self.alphas_cumprod)
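The buffers registered above all derive from the beta array. The following is a minimal pure-Python sketch of that derivation, not the repository's code: `linear_betas` and `schedule_terms` are hypothetical helper names, and the `1e-4` lower bound of the schedule is an assumption made for illustration. It also shows why the log of the posterior variance has to be clipped: the first entry is exactly zero.

```python
def linear_betas(timesteps, max_beta=0.02, min_beta=1e-4):
    """Evenly spaced betas; the 1e-4 lower bound is an assumption of this sketch."""
    step = (max_beta - min_beta) / (timesteps - 1)
    return [min_beta + i * step for i in range(timesteps)]


def schedule_terms(betas):
    """Return (alphas_cumprod, alphas_cumprod_prev, posterior_variance) as lists."""
    alphas_cumprod, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta  # cumulative product of alpha_t = 1 - beta_t
        alphas_cumprod.append(running)
    alphas_cumprod_prev = [1.0] + alphas_cumprod[:-1]
    posterior_variance = [
        b * (1.0 - a_prev) / (1.0 - a)
        for b, a_prev, a in zip(betas, alphas_cumprod_prev, alphas_cumprod)
    ]
    return alphas_cumprod, alphas_cumprod_prev, posterior_variance


betas = linear_betas(1000)
a_bar, a_bar_prev, post_var = schedule_terms(betas)
```

Because `alphas_cumprod_prev[0]` is 1, `post_var[0]` is 0, so taking its logarithm without the `1e-20` floor would produce negative infinity.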
    # extract() in this module is defined as extract(a, t); no shape argument is passed.
    def q_mean_variance(self, x_start, t):
        mean = extract(self.sqrt_alphas_cumprod, t) * x_start
        variance = extract(1. - self.alphas_cumprod, t)
        log_variance = extract(self.log_one_minus_alphas_cumprod, t)
        return mean, variance, log_variance
    def predict_start_from_noise(self, x_t, t, noise):
        return (
            extract(self.sqrt_recip_alphas_cumprod, t) * x_t -
            extract(self.sqrt_recipm1_alphas_cumprod, t) * noise
        )
    def q_posterior(self, x_start, x_t, t):
        posterior_mean = (
            extract(self.posterior_mean_coef1, t) * x_start +
            extract(self.posterior_mean_coef2, t) * x_t
        )
        posterior_variance = extract(self.posterior_variance, t)
        posterior_log_variance_clipped = extract(self.posterior_log_variance_clipped, t)
        return posterior_mean, posterior_variance, posterior_log_variance_clipped
    def p_mean_variance(self, x, t, cond):
        noise_pred = self.denoise_fn(x, t, cond=cond)
        x_recon = self.predict_start_from_noise(x, t=t, noise=noise_pred)
        x_recon.clamp_(-1., 1.)
        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
        return model_mean, posterior_variance, posterior_log_variance
    @torch.no_grad()
    def p_sample(self, x, t, cond, clip_denoised=True, repeat_noise=False):
        b, *_, device = *x.shape, x.device
        model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, cond=cond)
        noise = noise_like(x.shape, device, repeat_noise)
        # no noise when t == 0
        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
    @torch.no_grad()
    def p_sample_plms(self, x, t, interval, cond, clip_denoised=True, repeat_noise=False):
        """
        Use the PLMS method from
        [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778).
        """
        def get_x_pred(x, noise_t, t):
            a_t = extract(self.alphas_cumprod, t)
            a_prev = extract(self.alphas_cumprod, torch.max(t - interval, torch.zeros_like(t)))
            a_t_sq, a_prev_sq = a_t.sqrt(), a_prev.sqrt()
            x_delta = (a_prev - a_t) * ((1 / (a_t_sq * (a_t_sq + a_prev_sq))) * x - 1 / (
                a_t_sq * (((1 - a_prev) * a_t).sqrt() + ((1 - a_t) * a_prev).sqrt())) * noise_t)
            x_pred = x + x_delta
            return x_pred
        noise_list = self.noise_list
        noise_pred = self.denoise_fn(x, t, cond=cond)
        if len(noise_list) == 0:
            # warm-up step: no history yet, so average two model evaluations
            x_pred = get_x_pred(x, noise_pred, t)
            # clamp element-wise; the built-in max() is ambiguous for tensors with more than one element
            noise_pred_prev = self.denoise_fn(x_pred, torch.clamp(t - interval, min=0), cond=cond)
            noise_pred_prime = (noise_pred + noise_pred_prev) / 2
        elif len(noise_list) == 1:
            noise_pred_prime = (3 * noise_pred - noise_list[-1]) / 2
        elif len(noise_list) == 2:
            noise_pred_prime = (23 * noise_pred - 16 * noise_list[-1] + 5 * noise_list[-2]) / 12
        else:
            noise_pred_prime = (55 * noise_pred - 59 * noise_list[-1] + 37 * noise_list[-2] - 9 * noise_list[-3]) / 24
        x_prev = get_x_pred(x, noise_pred_prime, t)
        noise_list.append(noise_pred)
        return x_prev
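The branches of `p_sample_plms` blend the current noise estimate with up to three earlier ones using linear multistep (Adams-Bashforth) weights. A small scalar sketch of that blending follows; `PLMS_WEIGHTS` and `combine` are hypothetical names introduced here for illustration, and the warm-up branch (empty history, two model calls) is deliberately left out.

```python
from collections import deque

# Weights copied from the branches above, newest estimate first,
# keyed by how many earlier estimates are used.
PLMS_WEIGHTS = {
    1: (3 / 2, -1 / 2),
    2: (23 / 12, -16 / 12, 5 / 12),
    3: (55 / 24, -59 / 24, 37 / 24, -9 / 24),
}


def combine(noise_pred, history):
    """Blend the current estimate with up to three earlier ones (scalars here)."""
    if not history:
        raise ValueError("empty history: the warm-up step needs a second model call")
    recent = list(history)[-3:][::-1]  # newest first, like noise_list[-1], [-2], [-3]
    weights = PLMS_WEIGHTS[len(recent)]
    return weights[0] * noise_pred + sum(w * h for w, h in zip(weights[1:], recent))


history = deque([1.0, 2.0, 3.0], maxlen=4)
blended = combine(4.0, history)
```

Each weight tuple sums to 1, so a constant sequence of estimates is returned unchanged, while a linearly growing sequence is extrapolated half a step ahead.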
    def q_sample(self, x_start, t, noise=None):
        noise = default(noise, lambda: torch.randn_like(x_start))
        # extract() in this module is defined as extract(a, t); no shape argument is passed.
        return (
            extract(self.sqrt_alphas_cumprod, t) * x_start +
            extract(self.sqrt_one_minus_alphas_cumprod, t) * noise
        )
    def p_losses(self, x_start, t, cond, noise=None, loss_type='l2'):
        noise = default(noise, lambda: torch.randn_like(x_start))
        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
        x_recon = self.denoise_fn(x_noisy, t, cond)
        if loss_type == 'l1':
            loss = (noise - x_recon).abs().mean()
        elif loss_type == 'l2':
            loss = F.mse_loss(noise, x_recon)
        else:
            raise NotImplementedError()
        return loss
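`q_sample` and `predict_start_from_noise` are inverses of each other when the noise is known exactly. The scalar sketch below illustrates this with plain floats in place of tensors; the functions take the cumulative alpha directly as an argument, which is a simplification made for this example rather than the signatures used in the class.

```python
import math


def q_sample(x_start, alpha_bar_t, noise):
    """Forward noising: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return math.sqrt(alpha_bar_t) * x_start + math.sqrt(1.0 - alpha_bar_t) * noise


def predict_start_from_noise(x_t, alpha_bar_t, noise):
    """Invert q_sample for x0 given the noise that was added."""
    return math.sqrt(1.0 / alpha_bar_t) * x_t - math.sqrt(1.0 / alpha_bar_t - 1.0) * noise


x0, eps, a_bar = 0.25, -0.7, 0.36
x_t = q_sample(x0, a_bar, eps)
x0_rec = predict_start_from_noise(x_t, a_bar, eps)
```

During training the network only sees `x_t` and is asked to predict `eps`; the better that prediction, the closer the recovered `x0_rec` is to the clean spectrogram.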
    def org_forward(self,
                    condition,
                    init_noise=None,
                    gt_spec=None,
                    infer=True,
                    infer_speedup=100,
                    method='pndm',
                    k_step=1000,
                    use_tqdm=True):
        """
        conditioning diffusion, use fastspeech2 encoder output as the condition
        """
        cond = condition
        b, device = condition.shape[0], condition.device
        if not infer:
            spec = self.norm_spec(gt_spec)
            t = torch.randint(0, self.k_step, (b,), device=device).long()
            norm_spec = spec.transpose(1, 2)[:, None, :, :]  # [B, 1, M, T]
            return self.p_losses(norm_spec, t, cond=cond)
        else:
            shape = (cond.shape[0], 1, self.out_dims, cond.shape[2])
            if gt_spec is None:
                t = self.k_step
                if init_noise is None:
                    x = torch.randn(shape, device=device)
                else:
                    x = init_noise
            else:
                t = k_step
                norm_spec = self.norm_spec(gt_spec)
                norm_spec = norm_spec.transpose(1, 2)[:, None, :, :]
                x = self.q_sample(x_start=norm_spec, t=torch.tensor([t - 1], device=device).long())
            if method is not None and infer_speedup > 1:
                if method == 'dpm-solver':
                    from .dpm_solver_pytorch import (
                        DPM_Solver,
                        NoiseScheduleVP,
                        model_wrapper,
                    )
                    # 1. Define the noise schedule.
                    noise_schedule = NoiseScheduleVP(schedule='discrete', betas=self.betas[:t])
                    # 2. Convert your discrete-time `model` to the continuous-time
                    # noise prediction model. Here is an example for a diffusion model
                    # `model` with the noise prediction type ("noise") .
                    def my_wrapper(fn):
                        def wrapped(x, t, **kwargs):
                            ret = fn(x, t, **kwargs)
                            if use_tqdm:
                                self.bar.update(1)
                            return ret
                        return wrapped
                    model_fn = model_wrapper(
                        my_wrapper(self.denoise_fn),
                        noise_schedule,
                        model_type="noise",  # or "x_start" or "v" or "score"
                        model_kwargs={"cond": cond}
                    )
                    # 3. Define dpm-solver and sample by singlestep DPM-Solver.
                    # (We recommend singlestep DPM-Solver for unconditional sampling)
                    # You can adjust the `steps` to balance the computation
                    # costs and the sample quality.
                    dpm_solver = DPM_Solver(model_fn, noise_schedule)
                    steps = t // infer_speedup
                    if use_tqdm:
                        self.bar = tqdm(desc="sample time step", total=steps)
                    x = dpm_solver.sample(
                        x,
                        steps=steps,
                        order=3,
                        skip_type="time_uniform",
                        method="singlestep",
                    )
                    if use_tqdm:
                        self.bar.close()
                elif method == 'pndm':
                    self.noise_list = deque(maxlen=4)
                    if use_tqdm:
                        for i in tqdm(
                                reversed(range(0, t, infer_speedup)), desc='sample time step',
                                total=t // infer_speedup,
                        ):
                            x = self.p_sample_plms(
                                x, torch.full((b,), i, device=device, dtype=torch.long),
                                infer_speedup, cond=cond
                            )
                    else:
                        for i in reversed(range(0, t, infer_speedup)):
                            x = self.p_sample_plms(
                                x, torch.full((b,), i, device=device, dtype=torch.long),
                                infer_speedup, cond=cond
                            )
                else:
                    raise NotImplementedError(method)
            else:
                if use_tqdm:
                    for i in tqdm(reversed(range(0, t)), desc='sample time step', total=t):
                        x = self.p_sample(x, torch.full((b,), i, device=device, dtype=torch.long), cond)
                else:
                    for i in reversed(range(0, t)):
                        x = self.p_sample(x, torch.full((b,), i, device=device, dtype=torch.long), cond)
            x = x.squeeze(1).transpose(1, 2)  # [B, T, M]
            return self.denorm_spec(x).transpose(2, 1)
    def norm_spec(self, x):
        return (x - self.spec_min) / (self.spec_max - self.spec_min) * 2 - 1
    def denorm_spec(self, x):
        return (x + 1) / 2 * (self.spec_max - self.spec_min) + self.spec_min
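`norm_spec` maps mel values linearly from `[spec_min, spec_max]` to `[-1, 1]`, and `denorm_spec` undoes that mapping. A scalar sketch is given below; the bounds `SPEC_MIN` and `SPEC_MAX` are illustrative values chosen for this example, not values taken from any configuration file.

```python
def norm_spec(x, spec_min, spec_max):
    """Map [spec_min, spec_max] linearly onto [-1, 1]."""
    return (x - spec_min) / (spec_max - spec_min) * 2 - 1


def denorm_spec(x, spec_min, spec_max):
    """Inverse of norm_spec: map [-1, 1] back to [spec_min, spec_max]."""
    return (x + 1) / 2 * (spec_max - spec_min) + spec_min


SPEC_MIN, SPEC_MAX = -12.0, 2.0  # illustrative bounds only
normalized = norm_spec(-5.0, SPEC_MIN, SPEC_MAX)
restored = denorm_spec(normalized, SPEC_MIN, SPEC_MAX)
```

The sampler works entirely in the normalized range, which is why `p_mean_variance` can clamp the reconstructed sample to `[-1, 1]`.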
    def get_x_pred(self, x_1, noise_t, t_1, t_prev):
        a_t = extract(self.alphas_cumprod, t_1)
        a_prev = extract(self.alphas_cumprod, t_prev)
        a_t_sq, a_prev_sq = a_t.sqrt(), a_prev.sqrt()
        x_delta = (a_prev - a_t) * ((1 / (a_t_sq * (a_t_sq + a_prev_sq))) * x_1 - 1 / (
            a_t_sq * (((1 - a_prev) * a_t).sqrt() + ((1 - a_t) * a_prev).sqrt())) * noise_t)
        x_pred = x_1 + x_delta
        return x_pred
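The update in `get_x_pred` looks opaque, but expanding it shows it is algebraically the same as a deterministic DDIM step: recover the implied clean sample, then re-noise it to the earlier time with the same noise. The scalar sketch below checks this numerically; `ddim_step` is a hypothetical helper written for the comparison, and both functions take cumulative alphas directly instead of timestep indices.

```python
import math


def get_x_pred(x_t, noise, a_t, a_prev):
    """Scalar version of the transfer step; a_t and a_prev are cumulative alphas."""
    a_t_sq, a_prev_sq = math.sqrt(a_t), math.sqrt(a_prev)
    x_delta = (a_prev - a_t) * (
        x_t / (a_t_sq * (a_t_sq + a_prev_sq))
        - noise / (a_t_sq * (math.sqrt((1 - a_prev) * a_t) + math.sqrt((1 - a_t) * a_prev)))
    )
    return x_t + x_delta


def ddim_step(x_t, noise, a_t, a_prev):
    """Deterministic DDIM update written via the implied clean sample."""
    x0 = (x_t - math.sqrt(1 - a_t) * noise) / math.sqrt(a_t)
    return math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * noise


step_a = get_x_pred(-0.41, -0.7, 0.36, 0.64)
step_b = ddim_step(-0.41, -0.7, 0.36, 0.64)
```

The form used in `get_x_pred` avoids dividing a difference of nearly equal square roots, which is presumably why the update is written as `x + x_delta`.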
    def OnnxExport(self, project_name=None, init_noise=None, hidden_channels=256, export_denoise=True, export_pred=True, export_after=True):
        cond = torch.randn([1, self.n_hidden, 10]).cpu()
        if init_noise is None:
            x = torch.randn((1, 1, self.mel_bins, cond.shape[2]), dtype=torch.float32).cpu()
        else:
            x = init_noise
        pndms = 100
        org_y_x = self.org_forward(cond, init_noise=x)
        device = cond.device
        n_frames = cond.shape[2]
        step_range = torch.arange(0, self.k_step, pndms, dtype=torch.long, device=device).flip(0)
        plms_noise_stage = torch.tensor(0, dtype=torch.long, device=device)
        noise_list = torch.zeros((0, 1, 1, self.mel_bins, n_frames), device=device)
        ot = step_range[0]
        ot_1 = torch.full((1,), ot, device=device, dtype=torch.long)
        if export_denoise:
            torch.onnx.export(
                self.denoise_fn,
                (x.cpu(), ot_1.cpu(), cond.cpu()),
                f"{project_name}_denoise.onnx",
                input_names=["noise", "time", "condition"],
                output_names=["noise_pred"],
                dynamic_axes={
                    "noise": [3],
                    "condition": [2]
                },
                opset_version=16
            )
        for t in step_range:
            t_1 = torch.full((1,), t, device=device, dtype=torch.long)
            noise_pred = self.denoise_fn(x, t_1, cond)
            t_prev = t_1 - pndms
            t_prev = t_prev * (t_prev > 0)
            if plms_noise_stage == 0:
                if export_pred:
                    torch.onnx.export(
                        self.xp,
                        (x.cpu(), noise_pred.cpu(), t_1.cpu(), t_prev.cpu()),
                        f"{project_name}_pred.onnx",
                        input_names=["noise", "noise_pred", "time", "time_prev"],
                        output_names=["noise_pred_o"],
                        dynamic_axes={
                            "noise": [3],
                            "noise_pred": [3]
                        },
                        opset_version=16
                    )
                x_pred = self.get_x_pred(x, noise_pred, t_1, t_prev)
                noise_pred_prev = self.denoise_fn(x_pred, t_prev, cond=cond)
                noise_pred_prime = predict_stage0(noise_pred, noise_pred_prev)
            elif plms_noise_stage == 1:
                noise_pred_prime = predict_stage1(noise_pred, noise_list)
            elif plms_noise_stage == 2:
                noise_pred_prime = predict_stage2(noise_pred, noise_list)
            else:
                noise_pred_prime = predict_stage3(noise_pred, noise_list)
            noise_pred = noise_pred.unsqueeze(0)
            if plms_noise_stage < 3:
                noise_list = torch.cat((noise_list, noise_pred), dim=0)
                plms_noise_stage = plms_noise_stage + 1
            else:
                noise_list = torch.cat((noise_list[-2:], noise_pred), dim=0)
            x = self.get_x_pred(x, noise_pred_prime, t_1, t_prev)
        if export_after:
            torch.onnx.export(
                self.ad,
                x.cpu(),
                f"{project_name}_after.onnx",
                input_names=["x"],
                output_names=["mel_out"],
                dynamic_axes={
                    "x": [3]
                },
                opset_version=16
            )
        x = self.ad(x)
        print((x == org_y_x).all())
        return x
    def forward(self, condition=None, init_noise=None, pndms=None, k_step=None):
        cond = condition
        x = init_noise
        device = cond.device
        n_frames = cond.shape[2]
        step_range = torch.arange(0, k_step.item(), pndms.item(), dtype=torch.long, device=device).flip(0)
        plms_noise_stage = torch.tensor(0, dtype=torch.long, device=device)
        noise_list = torch.zeros((0, 1, 1, self.mel_bins, n_frames), device=device)
        for t in step_range:
            t_1 = torch.full((1,), t, device=device, dtype=torch.long)
            noise_pred = self.denoise_fn(x, t_1, cond)
            t_prev = t_1 - pndms
            t_prev = t_prev * (t_prev > 0)
            if plms_noise_stage == 0:
                x_pred = self.get_x_pred(x, noise_pred, t_1, t_prev)
                noise_pred_prev = self.denoise_fn(x_pred, t_prev, cond=cond)
                noise_pred_prime = predict_stage0(noise_pred, noise_pred_prev)
            elif plms_noise_stage == 1:
                noise_pred_prime = predict_stage1(noise_pred, noise_list)
            elif plms_noise_stage == 2:
                noise_pred_prime = predict_stage2(noise_pred, noise_list)
            else:
                noise_pred_prime = predict_stage3(noise_pred, noise_list)
            noise_pred = noise_pred.unsqueeze(0)
            if plms_noise_stage < 3:
                noise_list = torch.cat((noise_list, noise_pred), dim=0)
                plms_noise_stage = plms_noise_stage + 1
            else:
                noise_list = torch.cat((noise_list[-2:], noise_pred), dim=0)
            x = self.get_x_pred(x, noise_pred_prime, t_1, t_prev)
        x = self.ad(x)
        return x
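The loop in `forward` visits timesteps from the noisiest one down to 0 in strides of `pndms`, and pairs each step with a previous step that is clamped at 0 (the `t_prev * (t_prev > 0)` trick). A pure-Python sketch of that schedule follows; `plms_steps` is a hypothetical helper name used only for this illustration.

```python
def plms_steps(k_step, pndms):
    """Return (t, t_prev) pairs from the noisiest step down to 0."""
    ts = list(range(0, k_step, pndms))[::-1]  # same values as arange(...).flip(0)
    return [(t, max(t - pndms, 0)) for t in ts]


pairs = plms_steps(1000, 100)
```

With `k_step=1000` and `pndms=100` this gives ten denoising steps, the last of which maps step 0 onto itself.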
================================================
FILE: diffusion/dpm_solver_pytorch.py
================================================
import torch
class NoiseScheduleVP:
    def __init__(
            self,
            schedule='discrete',
            betas=None,
            alphas_cumprod=None,
            continuous_beta_0=0.1,
            continuous_beta_1=20.,
            dtype=torch.float32,
    ):
        r"""Create a wrapper class for the forward SDE (VP type).
        ***
        Update: We support discrete-time diffusion models by implementing a piecewise linear interpolation for log_alpha_t.
                We recommend using schedule='discrete' for discrete-time diffusion models, especially for high-resolution images.
        ***
        The forward SDE ensures that the conditional distribution q_{t|0}(x_t | x_0) = N ( alpha_t * x_0, sigma_t^2 * I ).
        We further define lambda_t = log(alpha_t) - log(sigma_t), which is the half-logSNR (described in the DPM-Solver paper).
        Therefore, we implement the functions for computing alpha_t, sigma_t and lambda_t. For t in [0, T], we have:
            log_alpha_t = self.marginal_log_mean_coeff(t)
            sigma_t = self.marginal_std(t)
            lambda_t = self.marginal_lambda(t)
        Moreover, as lambda(t) is an invertible function, we also support its inverse function:
            t = self.inverse_lambda(lambda_t)
        ===============================================================
        We support both discrete-time DPMs (trained on n = 0, 1, ..., N-1) and continuous-time DPMs (trained on t in [t_0, T]).
        1. For discrete-time DPMs:
            For discrete-time DPMs trained on n = 0, 1, ..., N-1, we convert the discrete steps to continuous time steps by:
                t_i = (i + 1) / N
            e.g. for N = 1000, we have t_0 = 1e-3 and T = t_{N-1} = 1.
            We solve the corresponding diffusion ODE from time T = 1 to time t_0 = 1e-3.
            Args:
                betas: A `torch.Tensor`. The beta array for the discrete-time DPM. (See the original DDPM paper for details)
                alphas_cumprod: A `torch.Tensor`. The cumprod alphas for the discrete-time DPM. (See the original DDPM paper for details)
            Note that we always have alphas_cumprod = cumprod(1 - betas). Therefore, we only need to set one of `betas` and `alphas_cumprod`.
            **Important**: Please pay special attention to the `alphas_cumprod` argument:
                `alphas_cumprod` is the \hat{alpha_n} array in the notation of DDPM. Specifically, DDPMs assume that
                    q_{t_n | 0}(x_{t_n} | x_0) = N ( \sqrt{\hat{alpha_n}} * x_0, (1 - \hat{alpha_n}) * I ).
                Therefore, the notation \hat{alpha_n} is different from the notation alpha_t in DPM-Solver. In fact, we have
                    alpha_{t_n} = \sqrt{\hat{alpha_n}},
                and
                    log(alpha_{t_n}) = 0.5 * log(\hat{alpha_n}).
        2. For continuous-time DPMs:
            We support the linear VPSDE for the continuous time setting. The hyperparameters for the noise
            schedule are the default settings in Yang Song's ScoreSDE:
            Args:
                continuous_beta_0: A `float` number. The smallest beta for the linear schedule.
                continuous_beta_1: A `float` number. The largest beta for the linear schedule.
            The ending time T of the forward process is fixed to 1. in this implementation.
        ===============================================================
        Args:
            schedule: A `str`. The noise schedule of the forward SDE. 'discrete' for discrete-time DPMs,
                    'linear' for continuous-time DPMs.
        Returns:
            A wrapper object of the forward SDE (VP type).
        ===============================================================
        Example:
        # For discrete-time DPMs, given betas (the beta array for n = 0, 1, ..., N - 1):
        >>> ns = NoiseScheduleVP('discrete', betas=betas)
        # For discrete-time DPMs, given alphas_cumprod (the \hat{alpha_n} array for n = 0, 1, ..., N - 1):
        >>> ns = NoiseScheduleVP('discrete', alphas_cumprod=alphas_cumprod)
        # For continuous-time DPMs (VPSDE), linear schedule:
        >>> ns = NoiseScheduleVP('linear', continuous_beta_0=0.1, continuous_beta_1=20.)
        """
        if schedule not in ['discrete', 'linear']:
            raise ValueError("Unsupported noise schedule {}. The schedule needs to be 'discrete' or 'linear'".format(schedule))
        self.schedule = schedule
        if schedule == 'discrete':
            if betas is not None:
                log_alphas = 0.5 * torch.log(1 - betas).cumsum(dim=0)
            else:
                assert alphas_cumprod is not None
                log_alphas = 0.5 * torch.log(alphas_cumprod)
            self.T = 1.
            self.log_alpha_array = self.numerical_clip_alpha(log_alphas).reshape((1, -1,)).to(dtype=dtype)
            self.total_N = self.log_alpha_array.shape[1]
            self.t_array = torch.linspace(0., 1., self.total_N + 1)[1:].reshape((1, -1)).to(dtype=dtype)
        else:
            self.T = 1.
            self.total_N = 1000
            self.beta_0 = continuous_beta_0
            self.beta_1 = continuous_beta_1
    def numerical_clip_alpha(self, log_alphas, clipped_lambda=-5.1):
        """
        For some beta schedules such as the cosine schedule, the log-SNR has numerical issues.
        We clip the log-SNR near t=T at -5.1 to ensure stability.
        This trick is very useful for diffusion models with the cosine schedule, such as i-DDPM, guided-diffusion and GLIDE.
        """
        log_sigmas = 0.5 * torch.log(1. - torch.exp(2. * log_alphas))
        lambs = log_alphas - log_sigmas
        idx = torch.searchsorted(torch.flip(lambs, [0]), clipped_lambda)
        if idx > 0:
            log_alphas = log_alphas[:-idx]
        return log_alphas
    def marginal_log_mean_coeff(self, t):
        """
        Compute log(alpha_t) of a given continuous-time label t in [0, T].
        """
        if self.schedule == 'discrete':
            return interpolate_fn(t.reshape((-1, 1)), self.t_array.to(t.device), self.log_alpha_array.to(t.device)).reshape((-1))
        elif self.schedule == 'linear':
            return -0.25 * t ** 2 * (self.beta_1 - self.beta_0) - 0.5 * t * self.beta_0
    def marginal_alpha(self, t):
        """
        Compute alpha_t of a given continuous-time label t in [0, T].
        """
        return torch.exp(self.marginal_log_mean_coeff(t))
    def marginal_std(self, t):
        """
        Compute sigma_t of a given continuous-time label t in [0, T].
        """
        return torch.sqrt(1. - torch.exp(2. * self.marginal_log_mean_coeff(t)))
    def marginal_lambda(self, t):
        """
        Compute lambda_t = log(alpha_t) - log(sigma_t) of a given continuous-time label t in [0, T].
        """
        log_mean_coeff = self.marginal_log_mean_coeff(t)
        log_std = 0.5 * torch.log(1. - torch.exp(2. * log_mean_coeff))
        return log_mean_coeff - log_std
    def inverse_lambda(self, lamb):
        """
        Compute the continuous-time label t in [0, T] of a given half-logSNR lambda_t.
        """
        if self.schedule == 'linear':
            tmp = 2. * (self.beta_1 - self.beta_0) * torch.logaddexp(-2. * lamb, torch.zeros((1,)).to(lamb))
            Delta = self.beta_0**2 + tmp
            return tmp / (torch.sqrt(Delta) + self.beta_0) / (self.beta_1 - self.beta_0)
        elif self.schedule == 'discrete':
            log_alpha = -0.5 * torch.logaddexp(torch.zeros((1,)).to(lamb.device), -2. * lamb)
            t = interpolate_fn(log_alpha.reshape((-1, 1)), torch.flip(self.log_alpha_array.to(lamb.device), [1]), torch.flip(self.t_array.to(lamb.device), [1]))
            return t.reshape((-1,))
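For the `'linear'` schedule, `marginal_lambda` and `inverse_lambda` are exact inverses in closed form. The following scalar sketch mirrors the two formulas with the `math` module instead of `torch`; the module-level constants and the function names `log_alpha` and `half_log_snr` are choices made for this example, with the beta values taken from the constructor defaults.

```python
import math

BETA_0, BETA_1 = 0.1, 20.0  # constructor defaults continuous_beta_0 / continuous_beta_1


def log_alpha(t):
    """log(alpha_t) for the linear VP schedule."""
    return -0.25 * t ** 2 * (BETA_1 - BETA_0) - 0.5 * t * BETA_0


def half_log_snr(t):
    """lambda_t = log(alpha_t) - log(sigma_t), with sigma_t^2 = 1 - alpha_t^2."""
    la = log_alpha(t)
    return la - 0.5 * math.log(1.0 - math.exp(2.0 * la))


def inverse_lambda(lamb):
    """Solve the quadratic in t that results from lambda_t = lamb."""
    tmp = 2.0 * (BETA_1 - BETA_0) * math.log1p(math.exp(-2.0 * lamb))
    delta = BETA_0 ** 2 + tmp
    return tmp / (math.sqrt(delta) + BETA_0) / (BETA_1 - BETA_0)


round_trip = [inverse_lambda(half_log_snr(t)) for t in (0.001, 0.25, 0.5, 1.0)]
```

The half-logSNR decreases monotonically in `t`, which is what makes the inverse well defined and lets the solver choose step sizes in lambda rather than in `t`.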
def model_wrapper(
        model,
        noise_schedule,
        model_type="noise",
        model_kwargs={},
        guidance_type="uncond",
        condition=None,
        unconditional_condition=None,
        guidance_scale=1.,
        classifier_fn=None,
        classifier_kwargs={},
):
    """Create a wrapper function for the noise prediction model.
    DPM-Solver needs to solve the continuous-time diffusion ODEs. For DPMs trained on discrete-time labels, we
    first need to wrap the model function into a noise prediction model that accepts the continuous time as the input.
    We support four types of diffusion model, selected by `model_type`:
        1. "noise": noise prediction model. (Trained by predicting noise).
        2. "x_start": data prediction model. (Trained by predicting the data x_0 at time 0).
        3. "v": velocity prediction model. (Trained by predicting the velocity).
            The derivation of the "v" prediction is detailed in Appendix D of [1], and it is used in Imagen-Video [2].
            [1] Salimans, Tim, and Jonathan Ho. "Progressive distillation for fast sampling of diffusion models."
                arXiv preprint arXiv:2202.00512 (2022).
            [2] Ho, Jonathan, et al. "Imagen Video: High Definition Video Generation with Diffusion Models."
                arXiv preprint arXiv:2210.02303 (2022).
        4. "score": marginal score function. (Trained by denoising score matching).
            Note that the score function and the noise prediction model follow a simple relationship:
            ```
                noise(x_t, t) = -sigma_t * score(x_t, t)
SYMBOL INDEX (1241 symbols across 104 files)
FILE: cluster/__init__.py
function get_cluster_model (line 5) | def get_cluster_model(ckpt_path):
function get_cluster_result (line 16) | def get_cluster_result(model, x, speaker):
function get_cluster_center_result (line 23) | def get_cluster_center_result(model, x,speaker):
function get_center (line 28) | def get_center(model, x,speaker):
FILE: cluster/kmeans.py
function _kpp (line 10) | def _kpp(data: torch.Tensor, k: int, sample_size: int = -1):
class KMeansGPU (line 51) | class KMeansGPU:
method __init__ (line 82) | def __init__(self, n_clusters, max_iter=200, tol=1e-4, verbose=0, mode...
method cos_sim (line 96) | def cos_sim(a, b):
method euc_sim (line 108) | def euc_sim(a, b):
method max_sim (line 117) | def max_sim(self, a, b):
method fit_predict (line 133) | def fit_predict(self, X):
FILE: cluster/train_cluster.py
function train_cluster (line 16) | def train_cluster(in_dir, n_clusters, use_minibatch=True, verbose=False,...
FILE: compress_model.py
function copyStateDict (line 9) | def copyStateDict(state_dict):
function removeOptimizer (line 21) | def removeOptimizer(config: str, input_model: str, ishalf: bool, output_...
FILE: data_utils.py
class TextAudioSpeakerLoader (line 18) | class TextAudioSpeakerLoader(torch.utils.data.Dataset):
method __init__ (line 25) | def __init__(self, audiopaths, hparams, all_in_mem: bool = False, vol_...
method get_audio (line 47) | def get_audio(self, filename):
method random_slice (line 94) | def random_slice(self, c, f0, spec, audio_norm, spk, uv, volume):
method __getitem__ (line 121) | def __getitem__(self, index):
method __len__ (line 127) | def __len__(self):
class TextAudioCollate (line 131) | class TextAudioCollate:
method __call__ (line 133) | def __call__(self, batch):
FILE: diffusion/data_loaders.py
function traverse_dir (line 13) | def traverse_dir(
function get_data_loaders (line 54) | def get_data_loaders(args, whole_audio=False):
class AudioDataset (line 98) | class AudioDataset(Dataset):
method __init__ (line 99) | def __init__(
method __getitem__ (line 208) | def __getitem__(self, file_idx):
method get_data (line 218) | def get_data(self, name_ext, data_buffer):
method __len__ (line 287) | def __len__(self):
FILE: diffusion/diffusion.py
function exists (line 12) | def exists(x):
function default (line 16) | def default(val, d):
function extract (line 22) | def extract(a, t, x_shape):
function noise_like (line 28) | def noise_like(shape, device, repeat=False):
function linear_beta_schedule (line 36) | def linear_beta_schedule(timesteps, max_beta=0.02):
function cosine_beta_schedule (line 44) | def cosine_beta_schedule(timesteps, s=0.008):
class GaussianDiffusion (line 63) | class GaussianDiffusion(nn.Module):
method __init__ (line 64) | def __init__(self,
method q_mean_variance (line 115) | def q_mean_variance(self, x_start, t):
method predict_start_from_noise (line 121) | def predict_start_from_noise(self, x_t, t, noise):
method q_posterior (line 127) | def q_posterior(self, x_start, x_t, t):
method p_mean_variance (line 136) | def p_mean_variance(self, x, t, cond):
method p_sample_ddim (line 146) | def p_sample_ddim(self, x, t, interval, cond):
method p_sample (line 158) | def p_sample(self, x, t, cond, clip_denoised=True, repeat_noise=False):
method p_sample_plms (line 167) | def p_sample_plms(self, x, t, interval, cond, clip_denoised=True, repe...
method q_sample (line 203) | def q_sample(self, x_start, t, noise=None):
method p_losses (line 210) | def p_losses(self, x_start, t, cond, noise=None, loss_type='l2'):
method forward (line 225) | def forward(self,
method norm_spec (line 392) | def norm_spec(self, x):
method denorm_spec (line 395) | def denorm_spec(self, x):
FILE: diffusion/diffusion_onnx.py
function exists (line 14) | def exists(x):
function default (line 18) | def default(val, d):
function extract (line 24) | def extract(a, t):
function noise_like (line 28) | def noise_like(shape, device, repeat=False):
function linear_beta_schedule (line 36) | def linear_beta_schedule(timesteps, max_beta=0.02):
function cosine_beta_schedule (line 44) | def cosine_beta_schedule(timesteps, s=0.008):
function extract_1 (line 63) | def extract_1(a, t):
function predict_stage0 (line 67) | def predict_stage0(noise_pred, noise_pred_prev):
function predict_stage1 (line 71) | def predict_stage1(noise_pred, noise_list):
function predict_stage2 (line 76) | def predict_stage2(noise_pred, noise_list):
function predict_stage3 (line 82) | def predict_stage3(noise_pred, noise_list):
class SinusoidalPosEmb (line 89) | class SinusoidalPosEmb(nn.Module):
method __init__ (line 90) | def __init__(self, dim):
method forward (line 98) | def forward(self, x):
class ResidualBlock (line 104) | class ResidualBlock(nn.Module):
method __init__ (line 105) | def __init__(self, encoder_hidden, residual_channels, dilation):
method forward (line 113) | def forward(self, x, conditioner, diffusion_step):
class DiffNet (line 129) | class DiffNet(nn.Module):
method __init__ (line 130) | def __init__(self, in_dims, n_layers, n_chans, n_hidden):
method forward (line 151) | def forward(self, spec, diffusion_step, cond):
class AfterDiffusion (line 172) | class AfterDiffusion(nn.Module):
method __init__ (line 173) | def __init__(self, spec_max, spec_min, v_type='a'):
method forward (line 179) | def forward(self, x):
class Pred (line 187) | class Pred(nn.Module):
method __init__ (line 188) | def __init__(self, alphas_cumprod):
method forward (line 192) | def forward(self, x_1, noise_t, t_1, t_prev):
class GaussianDiffusion (line 203) | class GaussianDiffusion(nn.Module):
method __init__ (line 204) | def __init__(self,
method q_mean_variance (line 259) | def q_mean_variance(self, x_start, t):
method predict_start_from_noise (line 265) | def predict_start_from_noise(self, x_t, t, noise):
method q_posterior (line 271) | def q_posterior(self, x_start, x_t, t):
method p_mean_variance (line 280) | def p_mean_variance(self, x, t, cond):
method p_sample (line 290) | def p_sample(self, x, t, cond, clip_denoised=True, repeat_noise=False):
method p_sample_plms (line 299) | def p_sample_plms(self, x, t, interval, cond, clip_denoised=True, repe...
method q_sample (line 335) | def q_sample(self, x_start, t, noise=None):
method p_losses (line 342) | def p_losses(self, x_start, t, cond, noise=None, loss_type='l2'):
method org_forward (line 357) | def org_forward(self,
method norm_spec (line 467) | def norm_spec(self, x):
method denorm_spec (line 470) | def denorm_spec(self, x):
method get_x_pred (line 473) | def get_x_pred(self, x_1, noise_t, t_1, t_prev):
method OnnxExport (line 482) | def OnnxExport(self, project_name=None, init_noise=None, hidden_channe...
method forward (line 574) | def forward(self, condition=None, init_noise=None, pndms=None, k_step=...
FILE: diffusion/dpm_solver_pytorch.py
class NoiseScheduleVP (line 4) | class NoiseScheduleVP:
method __init__ (line 5) | def __init__(
method numerical_clip_alpha (line 112) | def numerical_clip_alpha(self, log_alphas, clipped_lambda=-5.1):
method marginal_log_mean_coeff (line 125) | def marginal_log_mean_coeff(self, t):
method marginal_alpha (line 134) | def marginal_alpha(self, t):
method marginal_std (line 140) | def marginal_std(self, t):
method marginal_lambda (line 146) | def marginal_lambda(self, t):
method inverse_lambda (line 154) | def inverse_lambda(self, lamb):
function model_wrapper (line 168) | def model_wrapper(
class DPM_Solver (line 335) | class DPM_Solver:
method __init__ (line 336) | def __init__(
method dynamic_thresholding_fn (line 414) | def dynamic_thresholding_fn(self, x0, t):
method noise_prediction_fn (line 425) | def noise_prediction_fn(self, x, t):
method data_prediction_fn (line 431) | def data_prediction_fn(self, x, t):
method model_fn (line 442) | def model_fn(self, x, t):
method get_time_steps (line 451) | def get_time_steps(self, skip_type, t_T, t_0, N, device):
method get_orders_and_timesteps_for_singlestep_solver (line 480) | def get_orders_and_timesteps_for_singlestep_solver(self, steps, order,...
method denoise_to_zero_fn (line 539) | def denoise_to_zero_fn(self, x, s):
method dpm_solver_first_update (line 545) | def dpm_solver_first_update(self, x, s, t, model_s=None, return_interm...
method singlestep_dpm_solver_second_update (line 591) | def singlestep_dpm_solver_second_update(self, x, s, t, r1=0.5, model_s...
method singlestep_dpm_solver_third_update (line 672) | def singlestep_dpm_solver_third_update(self, x, s, t, r1=1./3., r2=2./...
method multistep_dpm_solver_second_update (line 793) | def multistep_dpm_solver_second_update(self, x, model_prev_list, t_pre...
method multistep_dpm_solver_third_update (line 851) | def multistep_dpm_solver_third_update(self, x, model_prev_list, t_prev...
method singlestep_dpm_solver_update (line 903) | def singlestep_dpm_solver_update(self, x, s, t, order, return_intermed...
method multistep_dpm_solver_update (line 929) | def multistep_dpm_solver_update(self, x, model_prev_list, t_prev_list,...
method dpm_solver_adaptive (line 953) | def dpm_solver_adaptive(self, x, order, t_T, t_0, h_init=0.05, atol=0....
method add_noise (line 1014) | def add_noise(self, x, t, noise=None):
method inverse (line 1034) | def inverse(self, x, steps=20, t_start=None, t_end=None, order=2, skip...
method sample (line 1049) | def sample(self, x, steps=20, t_start=None, t_end=None, order=2, skip_...
function interpolate_fn (line 1255) | def interpolate_fn(x, xp, yp):
function expand_dims (line 1297) | def expand_dims(v, dims):
FILE: diffusion/infer_gt_mel.py
class DiffGtMel (line 7) | class DiffGtMel:
method __init__ (line 8) | def __init__(self, project_path=None, device=None):
method flush_model (line 18) | def flush_model(self, project_path, ddsp_config=None):
method check_args (line 26) | def check_args(self, args1, args2):
method __call__ (line 35) | def __call__(self, audio, f0, hubert, volume, acc=1, spk_id=1, k_step=...
method infer (line 58) | def infer(self, audio, f0, hubert, volume, acc=1, spk_id=1, k_step=0, ...
FILE: diffusion/logger/saver.py
class Saver (line 15) | class Saver(object):
method __init__ (line 16) | def __init__(
method log_info (line 47) | def log_info(self, msg):
method log_value (line 70) | def log_value(self, dict):
method log_spec (line 74) | def log_spec(self, name, spec, spec_out, vmin=-14, vmax=3.5):
method log_audio (line 84) | def log_audio(self, dict):
method get_interval_time (line 88) | def get_interval_time(self, update=True):
method get_total_time (line 95) | def get_total_time(self, to_str=True):
method save_model (line 102) | def save_model(
method delete_model (line 130) | def delete_model(self, name='model', postfix=''):
method global_step_increment (line 142) | def global_step_increment(self):
FILE: diffusion/logger/utils.py
function traverse_dir (line 8) | def traverse_dir(
class DotDict (line 50) | class DotDict(dict):
method __getattr__ (line 51) | def __getattr__(*args):
function get_network_paras_amount (line 59) | def get_network_paras_amount(model_dict):
function load_config (line 69) | def load_config(path_config):
function save_config (line 76) | def save_config(path_config,config):
function to_json (line 81) | def to_json(path_params, path_json):
function convert_tensor_to_numpy (line 92) | def convert_tensor_to_numpy(tensor, is_squeeze=True):
function load_model (line 102) | def load_model(
FILE: diffusion/onnx_export.py
class DotDict (line 11) | class DotDict(dict):
method __getattr__ (line 12) | def __getattr__(*args):
function load_model_vocoder (line 20) | def load_model_vocoder(
class Unit2Mel (line 48) | class Unit2Mel(nn.Module):
method __init__ (line 49) | def __init__(
method forward (line 84) | def forward(self, units, mel2ph, f0, volume, g = None):
method init_spkembed (line 110) | def init_spkembed(self, units, f0, volume, spk_id = None, spk_mix_dict...
method OnnxExport (line 135) | def OnnxExport(self, project_name=None, init_noise=None, export_encode...
method ExportOnnx (line 171) | def ExportOnnx(self, project_name=None):
FILE: diffusion/solver.py
function test (line 13) | def test(args, model, vocoder, loader_test, saver):
function train (line 93) | def train(args, initial_global_step, model, optimizer, scheduler, vocode...
FILE: diffusion/uni_pc.py
class NoiseScheduleVP (line 6) | class NoiseScheduleVP:
method __init__ (line 7) | def __init__(
method marginal_log_mean_coeff (line 103) | def marginal_log_mean_coeff(self, t):
method marginal_alpha (line 117) | def marginal_alpha(self, t):
method marginal_std (line 123) | def marginal_std(self, t):
method marginal_lambda (line 129) | def marginal_lambda(self, t):
method inverse_lambda (line 137) | def inverse_lambda(self, lamb):
function model_wrapper (line 157) | def model_wrapper(
class UniPC (line 238) | class UniPC:
method __init__ (line 239) | def __init__(
method dynamic_thresholding_fn (line 270) | def dynamic_thresholding_fn(self, x0, t=None):
method noise_prediction_fn (line 281) | def noise_prediction_fn(self, x, t):
method data_prediction_fn (line 287) | def data_prediction_fn(self, x, t):
method model_fn (line 298) | def model_fn(self, x, t):
method get_time_steps (line 307) | def get_time_steps(self, skip_type, t_T, t_0, N, device):
method get_orders_and_timesteps_for_singlestep_solver (line 324) | def get_orders_and_timesteps_for_singlestep_solver(self, steps, order,...
method denoise_to_zero_fn (line 355) | def denoise_to_zero_fn(self, x, s):
method multistep_uni_pc_update (line 361) | def multistep_uni_pc_update(self, x, model_prev_list, t_prev_list, t, ...
method multistep_uni_pc_vary_update (line 370) | def multistep_uni_pc_vary_update(self, x, model_prev_list, t_prev_list...
method multistep_uni_pc_bh_update (line 473) | def multistep_uni_pc_bh_update(self, x, model_prev_list, t_prev_list, ...
method sample (line 592) | def sample(self, x, steps=20, t_start=None, t_end=None, order=2, skip_...
function interpolate_fn (line 681) | def interpolate_fn(x, xp, yp):
function expand_dims (line 723) | def expand_dims(v, dims):
FILE: diffusion/unit2mel.py
class DotDict (line 13) | class DotDict(dict):
method __getattr__ (line 14) | def __getattr__(*args):
function load_model_vocoder (line 22) | def load_model_vocoder(
class Unit2Mel (line 61) | class Unit2Mel(nn.Module):
method __init__ (line 62) | def __init__(
method init_spkembed (line 94) | def init_spkembed(self, units, f0, volume, spk_id = None, spk_mix_dict...
method init_spkmix (line 119) | def init_spkmix(self, n_spk):
method forward (line 131) | def forward(self, units, f0, volume, spk_id = None, spk_mix_dict = Non...
FILE: diffusion/vocoder.py
class Vocoder (line 8) | class Vocoder:
method __init__ (line 9) | def __init__(self, vocoder_type, vocoder_ckpt, device = None):
method extract (line 26) | def extract(self, audio, sample_rate, keyshift=0):
method infer (line 41) | def infer(self, mel, f0):
class NsfHifiGAN (line 47) | class NsfHifiGAN(torch.nn.Module):
method __init__ (line 48) | def __init__(self, model_path, device=None):
method sample_rate (line 65) | def sample_rate(self):
method hop_size (line 68) | def hop_size(self):
method dimension (line 71) | def dimension(self):
method extract (line 74) | def extract(self, audio, keyshift=0):
method forward (line 78) | def forward(self, mel, f0):
class NsfHifiGANLog10 (line 87) | class NsfHifiGANLog10(NsfHifiGAN):
method forward (line 88) | def forward(self, mel, f0):
FILE: diffusion/wavenet.py
class Conv1d (line 10) | class Conv1d(torch.nn.Conv1d):
method __init__ (line 11) | def __init__(self, *args, **kwargs):
class SinusoidalPosEmb (line 16) | class SinusoidalPosEmb(nn.Module):
method __init__ (line 17) | def __init__(self, dim):
method forward (line 21) | def forward(self, x):
class ResidualBlock (line 31) | class ResidualBlock(nn.Module):
method __init__ (line 32) | def __init__(self, encoder_hidden, residual_channels, dilation):
method forward (line 46) | def forward(self, x, conditioner, diffusion_step):
class WaveNet (line 64) | class WaveNet(nn.Module):
method __init__ (line 65) | def __init__(self, in_dims=128, n_layers=20, n_chans=384, n_hidden=256):
method forward (line 86) | def forward(self, spec, diffusion_step, cond):
FILE: edgetts/tts.py
function _main (line 21) | async def _main() -> None:
FILE: flask_api.py
function voice_change_model (line 20) | def voice_change_model():
FILE: flask_api_full_song.py
function wav2wav (line 13) | def wav2wav():
FILE: inference/infer_tool.py
function read_temp (line 28) | def read_temp(file_name):
function write_temp (line 51) | def write_temp(file_name, data):
function timeit (line 56) | def timeit(func):
function format_wav (line 66) | def format_wav(audio_path):
function get_end_file (line 73) | def get_end_file(dir_path, end):
function get_md5 (line 84) | def get_md5(content):
function fill_a_to_b (line 87) | def fill_a_to_b(a, b):
function mkdir (line 92) | def mkdir(paths: list):
function pad_array (line 97) | def pad_array(arr, target_length):
function split_list_by_n (line 108) | def split_list_by_n(list_collection, n, pre=0):
class F0FilterException (line 113) | class F0FilterException(Exception):
class Svc (line 116) | class Svc(object):
method __init__ (line 117) | def __init__(self, net_g_path, config_path,
method load_model (line 189) | def load_model(self, spk_mix_enable=False):
method get_unit_f0 (line 204) | def get_unit_f0(self, wav, tran, cluster_infer_ratio, speaker, f0_filt...
method infer (line 256) | def infer(self, speaker, tran, raw_path,
method clear_empty (line 342) | def clear_empty(self):
method unload_model (line 346) | def unload_model(self):
method slice_inference (line 356) | def slice_inference(self,
class RealTimeVC (line 498) | class RealTimeVC:
method __init__ (line 499) | def __init__(self):
method process (line 507) | def process(self, svc_model, speaker_id, f_pitch_change, input_wav_path,
FILE: inference/infer_tool_grad.py
function resize2d_f0 (line 19) | def resize2d_f0(x, target_len):
function get_f0 (line 27) | def get_f0(x, p_len,f0_up_key=0):
function clean_pitch (line 51) | def clean_pitch(input_pitch):
function plt_pitch (line 58) | def plt_pitch(input_pitch):
function f0_to_pitch (line 64) | def f0_to_pitch(ff):
function fill_a_to_b (line 69) | def fill_a_to_b(a, b):
function mkdir (line 75) | def mkdir(paths: list):
class VitsSvc (line 81) | class VitsSvc(object):
method __init__ (line 82) | def __init__(self):
method set_device (line 89) | def set_device(self, device):
method loadCheckpoint (line 95) | def loadCheckpoint(self, path):
method get_units (line 105) | def get_units(self, source, sr):
method get_unit_pitch (line 112) | def get_unit_pitch(self, in_path, tran):
method infer (line 121) | def infer(self, speaker_id, tran, raw_path):
method inference (line 133) | def inference(self,srcaudio,chara,tran,slice_db):
FILE: inference/slicer.py
class Slicer (line 6) | class Slicer:
method __init__ (line 7) | def __init__(self,
method _apply_slice (line 26) | def _apply_slice(self, waveform, begin, end):
method slice (line 33) | def slice(self, waveform):
function cut (line 120) | def cut(audio_path, db_thresh=-30, min_len=5000):
function chunks2audio (line 131) | def chunks2audio(audio_path, chunks):
FILE: inference_main.py
function main (line 14) | def main():
FILE: models.py
class ResidualCouplingBlock (line 15) | class ResidualCouplingBlock(nn.Module):
method __init__ (line 16) | def __init__(self,
method forward (line 45) | def forward(self, x, x_mask, g=None, reverse=False):
class TransformerCouplingBlock (line 54) | class TransformerCouplingBlock(nn.Module):
method __init__ (line 55) | def __init__(self,
method forward (line 85) | def forward(self, x, x_mask, g=None, reverse=False):
class Encoder (line 95) | class Encoder(nn.Module):
method __init__ (line 96) | def __init__(self,
method forward (line 117) | def forward(self, x, x_lengths, g=None):
class TextEncoder (line 128) | class TextEncoder(nn.Module):
method __init__ (line 129) | def __init__(self,
method forward (line 155) | def forward(self, x, x_mask, f0=None, noice_scale=1):
class DiscriminatorP (line 165) | class DiscriminatorP(torch.nn.Module):
method __init__ (line 166) | def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=...
method forward (line 180) | def forward(self, x):
class DiscriminatorS (line 202) | class DiscriminatorS(torch.nn.Module):
method __init__ (line 203) | def __init__(self, use_spectral_norm=False):
method forward (line 216) | def forward(self, x):
class MultiPeriodDiscriminator (line 230) | class MultiPeriodDiscriminator(torch.nn.Module):
method __init__ (line 231) | def __init__(self, use_spectral_norm=False):
method forward (line 239) | def forward(self, y, y_hat):
class SpeakerEncoder (line 255) | class SpeakerEncoder(torch.nn.Module):
method __init__ (line 256) | def __init__(self, mel_n_channels=80, model_num_layers=3, model_hidden...
method forward (line 262) | def forward(self, mels):
method compute_partial_slices (line 268) | def compute_partial_slices(self, total_frames, partial_frames, partial...
method embed_utterance (line 276) | def embed_utterance(self, mel, partial_frames=128, partial_hop=64):
class F0Decoder (line 296) | class F0Decoder(nn.Module):
method __init__ (line 297) | def __init__(self,
method forward (line 328) | def forward(self, x, norm_f0, x_mask, spk_emb=None):
class SynthesizerTrn (line 339) | class SynthesizerTrn(nn.Module):
method __init__ (line 344) | def __init__(self,
method EnableCharacterMix (line 456) | def EnableCharacterMix(self, n_speakers_map, device):
method forward (line 463) | def forward(self, c, f0, uv, spec, g=None, c_lengths=None, spec_length...
method infer (line 496) | def infer(self, c, f0, uv, g=None, noice_scale=0.35, seed=52468, predi...
FILE: modules/DSConv.py
class Depthwise_Separable_Conv1D (line 5) | class Depthwise_Separable_Conv1D(nn.Module):
method __init__ (line 6) | def __init__(
method forward (line 23) | def forward(self, input):
method weight_norm (line 26) | def weight_norm(self):
method remove_weight_norm (line 30) | def remove_weight_norm(self):
class Depthwise_Separable_TransposeConv1D (line 34) | class Depthwise_Separable_TransposeConv1D(nn.Module):
method __init__ (line 35) | def __init__(
method forward (line 53) | def forward(self, input):
method weight_norm (line 56) | def weight_norm(self):
method remove_weight_norm (line 60) | def remove_weight_norm(self):
function weight_norm_modules (line 65) | def weight_norm_modules(module, name = 'weight', dim = 0):
function remove_weight_norm_modules (line 72) | def remove_weight_norm_modules(module, name = 'weight'):
FILE: modules/F0Predictor/CrepeF0Predictor.py
class CrepeF0Predictor (line 7) | class CrepeF0Predictor(F0Predictor):
method __init__ (line 8) | def __init__(self,hop_length=512,f0_min=50,f0_max=1100,device=None,sam...
method compute_f0 (line 18) | def compute_f0(self,wav,p_len=None):
method compute_f0_uv (line 27) | def compute_f0_uv(self,wav,p_len=None):
FILE: modules/F0Predictor/DioF0Predictor.py
class DioF0Predictor (line 7) | class DioF0Predictor(F0Predictor):
method __init__ (line 8) | def __init__(self,hop_length=512,f0_min=50,f0_max=1100,sampling_rate=4...
method interpolate_f0 (line 15) | def interpolate_f0(self,f0):
method resize_f0 (line 39) | def resize_f0(self,x, target_len):
method compute_f0 (line 46) | def compute_f0(self,wav,p_len=None):
method compute_f0_uv (line 61) | def compute_f0_uv(self,wav,p_len=None):
FILE: modules/F0Predictor/F0Predictor.py
class F0Predictor (line 1) | class F0Predictor(object):
method compute_f0 (line 2) | def compute_f0(self,wav,p_len):
method compute_f0_uv (line 10) | def compute_f0_uv(self,wav,p_len):
FILE: modules/F0Predictor/FCPEF0Predictor.py
class FCPEF0Predictor (line 12) | class FCPEF0Predictor(F0Predictor):
method __init__ (line 13) | def __init__(self, hop_length=512, f0_min=50, f0_max=1100, dtype=torch...
method repeat_expand (line 28) | def repeat_expand(
method post_process (line 54) | def post_process(self, x, sampling_rate, f0, pad_to):
method compute_f0 (line 87) | def compute_f0(self, wav, p_len=None):
method compute_f0_uv (line 99) | def compute_f0_uv(self, wav, p_len=None):
FILE: modules/F0Predictor/HarvestF0Predictor.py
class HarvestF0Predictor (line 7) | class HarvestF0Predictor(F0Predictor):
method __init__ (line 8) | def __init__(self,hop_length=512,f0_min=50,f0_max=1100,sampling_rate=4...
method interpolate_f0 (line 15) | def interpolate_f0(self,f0):
method resize_f0 (line 38) | def resize_f0(self,x, target_len):
method compute_f0 (line 45) | def compute_f0(self,wav,p_len=None):
method compute_f0_uv (line 58) | def compute_f0_uv(self,wav,p_len=None):
FILE: modules/F0Predictor/PMF0Predictor.py
class PMF0Predictor (line 7) | class PMF0Predictor(F0Predictor):
method __init__ (line 8) | def __init__(self,hop_length=512,f0_min=50,f0_max=1100,sampling_rate=4...
method interpolate_f0 (line 15) | def interpolate_f0(self,f0):
method compute_f0 (line 40) | def compute_f0(self,wav,p_len=None):
method compute_f0_uv (line 57) | def compute_f0_uv(self,wav,p_len=None):
FILE: modules/F0Predictor/RMVPEF0Predictor.py
class RMVPEF0Predictor (line 12) | class RMVPEF0Predictor(F0Predictor):
method __init__ (line 13) | def __init__(self,hop_length=512,f0_min=50,f0_max=1100, dtype=torch.fl...
method repeat_expand (line 27) | def repeat_expand(
method post_process (line 53) | def post_process(self, x, sampling_rate, f0, pad_to):
method compute_f0 (line 85) | def compute_f0(self,wav,p_len=None):
method compute_f0_uv (line 97) | def compute_f0_uv(self,wav,p_len=None):
FILE: modules/F0Predictor/crepe.py
function repeat_expand (line 15) | def repeat_expand(
class BasePitchExtractor (line 54) | class BasePitchExtractor:
method __init__ (line 55) | def __init__(
method __call__ (line 76) | def __call__(self, x, sampling_rate=44100, pad_to=None):
method post_process (line 79) | def post_process(self, x, sampling_rate, f0, pad_to):
class MaskedAvgPool1d (line 115) | class MaskedAvgPool1d(nn.Module):
method __init__ (line 116) | def __init__(
method forward (line 132) | def forward(self, x, mask=None):
class MaskedMedianPool1d (line 183) | class MaskedMedianPool1d(nn.Module):
method __init__ (line 184) | def __init__(
method forward (line 203) | def forward(self, x, mask=None):
class CrepePitchExtractor (line 255) | class CrepePitchExtractor(BasePitchExtractor):
method __init__ (line 256) | def __init__(
method __call__ (line 289) | def __call__(self, x, sampling_rate=44100, pad_to=None):
FILE: modules/F0Predictor/fcpe/model.py
function l2_regularization (line 12) | def l2_regularization(model, l2_alpha):
class FCPE (line 20) | class FCPE(nn.Module):
method __init__ (line 21) | def __init__(
method forward (line 87) | def forward(self, mel, infer=True, gt_f0=None, return_hz_f0=False, cde...
method cents_decoder (line 121) | def cents_decoder(self, y, mask=True):
method cents_local_decoder (line 135) | def cents_local_decoder(self, y, mask=True):
method cent_to_f0 (line 154) | def cent_to_f0(self, cent):
method f0_to_cent (line 157) | def f0_to_cent(self, f0):
method gaussian_blurred_cent (line 160) | def gaussian_blurred_cent(self, cents): # cents: [B,N,1]
class FCPEInfer (line 167) | class FCPEInfer:
method __init__ (line 168) | def __init__(self, model_path, device=None, dtype=torch.float32):
method __call__ (line 198) | def __call__(self, audio, sr, threshold=0.05):
class Wav2Mel (line 206) | class Wav2Mel:
method __init__ (line 208) | def __init__(self, args, device=None, dtype=torch.float32):
method extract_nvstft (line 227) | def extract_nvstft(self, audio, keyshift=0, train=False):
method extract_mel (line 231) | def extract_mel(self, audio, sample_rate, keyshift=0, train=False):
method __call__ (line 252) | def __call__(self, audio, sample_rate, keyshift=0, train=False):
class DotDict (line 256) | class DotDict(dict):
method __getattr__ (line 257) | def __getattr__(*args):
FILE: modules/F0Predictor/fcpe/nvSTFT.py
function load_wav_to_torch (line 13) | def load_wav_to_torch(full_path, target_sr=None, return_empty_on_excepti...
function dynamic_range_compression (line 45) | def dynamic_range_compression(x, C=1, clip_val=1e-5):
function dynamic_range_decompression (line 48) | def dynamic_range_decompression(x, C=1):
function dynamic_range_compression_torch (line 51) | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
function dynamic_range_decompression_torch (line 54) | def dynamic_range_decompression_torch(x, C=1):
class STFT (line 57) | class STFT():
method __init__ (line 58) | def __init__(self, sr=22050, n_mels=80, n_fft=1024, win_size=1024, hop...
method get_mel (line 71) | def get_mel(self, y, keyshift=0, speed=1, center=False, train=False):
method __call__ (line 128) | def __call__(self, audiopath):
FILE: modules/F0Predictor/fcpe/pcmer.py
function softmax_kernel (line 12) | def softmax_kernel(data, *, projection_matrix, is_query, normalize_data=...
function orthogonal_matrix_chunk (line 47) | def orthogonal_matrix_chunk(cols, qr_uniform_q = False, device = None):
function exists (line 58) | def exists(val):
function empty (line 61) | def empty(tensor):
function default (line 64) | def default(val, d):
function cast_tuple (line 67) | def cast_tuple(val):
class PCmer (line 70) | class PCmer(nn.Module):
method __init__ (line 73) | def __init__(self,
method forward (line 94) | def forward(self, phone, mask=None):
class _EncoderLayer (line 108) | class _EncoderLayer(nn.Module):
method __init__ (line 116) | def __init__(self, parent: PCmer):
method forward (line 136) | def forward(self, phone, mask=None):
function calc_same_padding (line 145) | def calc_same_padding(kernel_size):
class Swish (line 151) | class Swish(nn.Module):
method forward (line 152) | def forward(self, x):
class Transpose (line 155) | class Transpose(nn.Module):
method __init__ (line 156) | def __init__(self, dims):
method forward (line 161) | def forward(self, x):
class GLU (line 164) | class GLU(nn.Module):
method __init__ (line 165) | def __init__(self, dim):
method forward (line 169) | def forward(self, x):
class DepthWiseConv1d (line 173) | class DepthWiseConv1d(nn.Module):
method __init__ (line 174) | def __init__(self, chan_in, chan_out, kernel_size, padding):
method forward (line 179) | def forward(self, x):
class ConformerConvModule (line 183) | class ConformerConvModule(nn.Module):
method __init__ (line 184) | def __init__(
method forward (line 209) | def forward(self, x):
function linear_attention (line 212) | def linear_attention(q, k, v):
function gaussian_orthogonal_random_matrix (line 228) | def gaussian_orthogonal_random_matrix(nb_rows, nb_columns, scaling = 0, ...
class FastAttention (line 257) | class FastAttention(nn.Module):
method __init__ (line 258) | def __init__(self, dim_heads, nb_features = None, ortho_scaling = 0, c...
method redraw_projection_matrix (line 280) | def redraw_projection_matrix(self):
method forward (line 285) | def forward(self, q, k, v):
class SelfAttention (line 304) | class SelfAttention(nn.Module):
method __init__ (line 305) | def __init__(self, dim, causal = False, heads = 8, dim_head = 64, loca...
method redraw_projection_matrix (line 328) | def redraw_projection_matrix(self):
method forward (line 332) | def forward(self, x, context = None, mask = None, context_mask = None,...
FILE: modules/F0Predictor/rmvpe/deepunet.py
class ConvBlockRes (line 7) | class ConvBlockRes(nn.Module):
method __init__ (line 8) | def __init__(self, in_channels, out_channels, momentum=0.01):
method forward (line 35) | def forward(self, x):
class ResEncoderBlock (line 42) | class ResEncoderBlock(nn.Module):
method __init__ (line 43) | def __init__(self, in_channels, out_channels, kernel_size, n_blocks=1,...
method forward (line 54) | def forward(self, x):
class ResDecoderBlock (line 63) | class ResDecoderBlock(nn.Module):
method __init__ (line 64) | def __init__(self, in_channels, out_channels, stride, n_blocks=1, mome...
method forward (line 84) | def forward(self, x, concat_tensor):
class Encoder (line 92) | class Encoder(nn.Module):
method __init__ (line 93) | def __init__(self, in_channels, in_size, n_encoders, kernel_size, n_bl...
method forward (line 108) | def forward(self, x):
class Intermediate (line 117) | class Intermediate(nn.Module):
method __init__ (line 118) | def __init__(self, in_channels, out_channels, n_inters, n_blocks, mome...
method forward (line 126) | def forward(self, x):
class Decoder (line 132) | class Decoder(nn.Module):
method __init__ (line 133) | def __init__(self, in_channels, n_decoders, stride, n_blocks, momentum...
method forward (line 142) | def forward(self, x, concat_tensors):
class TimbreFilter (line 148) | class TimbreFilter(nn.Module):
method __init__ (line 149) | def __init__(self, latent_rep_channels):
method forward (line 155) | def forward(self, x_tensors):
class DeepUnet (line 162) | class DeepUnet(nn.Module):
method __init__ (line 163) | def __init__(self, kernel_size, n_blocks, en_de_layers=5, inter_layers...
method forward (line 170) | def forward(self, x):
class DeepUnet0 (line 178) | class DeepUnet0(nn.Module):
method __init__ (line 179) | def __init__(self, kernel_size, n_blocks, en_de_layers=5, inter_layers...
method forward (line 186) | def forward(self, x):
FILE: modules/F0Predictor/rmvpe/inference.py
class RMVPE (line 11) | class RMVPE:
method __init__ (line 12) | def __init__(self, model_path, device=None, dtype = torch.float32, hop...
method mel2hidden (line 28) | def mel2hidden(self, mel):
method decode (line 35) | def decode(self, hidden, thred=0.03, use_viterbi=False):
method infer_from_audio (line 43) | def infer_from_audio(self, audio, sample_rate=16000, thred=0.05, use_v...
FILE: modules/F0Predictor/rmvpe/model.py
class E2E (line 9) | class E2E(nn.Module):
method __init__ (line 10) | def __init__(self, hop_length, n_blocks, n_gru, kernel_size, en_de_lay...
method forward (line 30) | def forward(self, x):
class E2E0 (line 43) | class E2E0(nn.Module):
method __init__ (line 44) | def __init__(self, n_blocks, n_gru, kernel_size, en_de_layers=5, inter...
method forward (line 63) | def forward(self, mel):
FILE: modules/F0Predictor/rmvpe/seq.py
class BiGRU (line 4) | class BiGRU(nn.Module):
method __init__ (line 5) | def __init__(self, input_features, hidden_features, num_layers):
method forward (line 9) | def forward(self, x):
class BiLSTM (line 13) | class BiLSTM(nn.Module):
method __init__ (line 14) | def __init__(self, input_features, hidden_features, num_layers):
method forward (line 18) | def forward(self, x):
FILE: modules/F0Predictor/rmvpe/spec.py
class MelSpectrogram (line 7) | class MelSpectrogram(torch.nn.Module):
method __init__ (line 8) | def __init__(
method forward (line 38) | def forward(self, audio, keyshift=0, speed=1, center=True):
FILE: modules/F0Predictor/rmvpe/utils.py
function cycle (line 12) | def cycle(iterable):
function summary (line 18) | def summary(model, file=sys.stdout):
function to_local_average_cents (line 64) | def to_local_average_cents(salience, center=None, thred=0.05):
function to_viterbi_cents (line 90) | def to_viterbi_cents(salience, thred=0.05):
FILE: modules/attentions.py
class FFT (line 12) | class FFT(nn.Module):
method __init__ (line 13) | def __init__(self, hidden_channels, filter_channels, n_heads, n_layers...
method forward (line 43) | def forward(self, x, x_mask, g = None):
class Encoder (line 73) | class Encoder(nn.Module):
method __init__ (line 74) | def __init__(self, hidden_channels, filter_channels, n_heads, n_layers...
method forward (line 95) | def forward(self, x, x_mask):
class Decoder (line 110) | class Decoder(nn.Module):
method __init__ (line 111) | def __init__(self, hidden_channels, filter_channels, n_heads, n_layers...
method forward (line 137) | def forward(self, x, x_mask, h, h_mask):
class MultiHeadAttention (line 161) | class MultiHeadAttention(nn.Module):
method __init__ (line 162) | def __init__(self, channels, out_channels, n_heads, p_dropout=0., wind...
method forward (line 198) | def forward(self, x, c, attn_mask=None):
method attention (line 208) | def attention(self, query, key, value, mask=None):
method _matmul_with_relative_values (line 241) | def _matmul_with_relative_values(self, x, y):
method _matmul_with_relative_keys (line 250) | def _matmul_with_relative_keys(self, x, y):
method _get_relative_embeddings (line 259) | def _get_relative_embeddings(self, relative_embeddings, length):
method _relative_position_to_absolute_position (line 274) | def _relative_position_to_absolute_position(self, x):
method _absolute_position_to_relative_position (line 291) | def _absolute_position_to_relative_position(self, x):
method _attention_bias_proximal (line 305) | def _attention_bias_proximal(self, length):
class FFN (line 317) | class FFN(nn.Module):
method __init__ (line 318) | def __init__(self, in_channels, out_channels, filter_channels, kernel_...
method forward (line 337) | def forward(self, x, x_mask):
method _causal_padding (line 347) | def _causal_padding(self, x):
method _same_padding (line 356) | def _same_padding(self, x):
FILE: modules/commons.py
function slice_pitch_segments (line 7) | def slice_pitch_segments(x, ids_str, segment_size=4):
function rand_slice_segments_with_pitch (line 15) | def rand_slice_segments_with_pitch(x, pitch, x_lengths=None, segment_siz...
function init_weights (line 25) | def init_weights(m, mean=0.0, std=0.01):
function get_padding (line 33) | def get_padding(kernel_size, dilation=1):
function convert_pad_shape (line 37) | def convert_pad_shape(pad_shape):
function intersperse (line 43) | def intersperse(lst, item):
function kl_divergence (line 49) | def kl_divergence(m_p, logs_p, m_q, logs_q):
function rand_gumbel (line 56) | def rand_gumbel(shape):
function rand_gumbel_like (line 62) | def rand_gumbel_like(x):
function slice_segments (line 67) | def slice_segments(x, ids_str, segment_size=4):
function rand_slice_segments (line 76) | def rand_slice_segments(x, x_lengths=None, segment_size=4):
function rand_spec_segments (line 86) | def rand_spec_segments(x, x_lengths=None, segment_size=4):
function get_timing_signal_1d (line 96) | def get_timing_signal_1d(
function add_timing_signal_1d (line 112) | def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4):
function cat_timing_signal_1d (line 118) | def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis...
function subsequent_mask (line 124) | def subsequent_mask(length):
function fused_add_tanh_sigmoid_multiply (line 130) | def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
function shift_1d (line 139) | def shift_1d(x):
function sequence_mask (line 144) | def sequence_mask(length, max_length=None):
function generate_path (line 151) | def generate_path(duration, mask):
function clip_grad_value_ (line 168) | def clip_grad_value_(parameters, clip_value, norm_type=2):
FILE: modules/enhancer.py
class Enhancer (line 10) | class Enhancer:
method __init__ (line 11) | def __init__(self, enhancer_type, enhancer_ckpt, device=None):
method enhance (line 25) | def enhance(self,
class NsfHifiGAN (line 80) | class NsfHifiGAN(torch.nn.Module):
method __init__ (line 81) | def __init__(self, model_path, device=None):
method sample_rate (line 89) | def sample_rate(self):
method hop_size (line 92) | def hop_size(self):
method forward (line 95) | def forward(self, audio, f0):
FILE: modules/losses.py
function feature_loss (line 4) | def feature_loss(fmap_r, fmap_g):
function discriminator_loss (line 15) | def discriminator_loss(disc_real_outputs, disc_generated_outputs):
function generator_loss (line 31) | def generator_loss(disc_outputs):
function kl_loss (line 43) | def kl_loss(z_p, logs_q, m_p, logs_p, z_mask):
FILE: modules/mel_processing.py
function dynamic_range_compression_torch (line 8) | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
function dynamic_range_decompression_torch (line 17) | def dynamic_range_decompression_torch(x, C=1):
function spectral_normalize_torch (line 26) | def spectral_normalize_torch(magnitudes):
function spectral_de_normalize_torch (line 31) | def spectral_de_normalize_torch(magnitudes):
function spectrogram_torch (line 40) | def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, cente...
function spec_to_mel_torch (line 67) | def spec_to_mel_torch(spec, n_fft, num_mels, sampling_rate, fmin, fmax):
function mel_spectrogram_torch (line 79) | def mel_spectrogram_torch(y, n_fft, num_mels, sampling_rate, hop_size, w...
FILE: modules/modules.py
function set_Conv1dModel (line 18) | def set_Conv1dModel(use_depthwise_conv):
class LayerNorm (line 23) | class LayerNorm(nn.Module):
method __init__ (line 24) | def __init__(self, channels, eps=1e-5):
method forward (line 32) | def forward(self, x):
class ConvReluNorm (line 38) | class ConvReluNorm(nn.Module):
method __init__ (line 39) | def __init__(self, in_channels, hidden_channels, out_channels, kernel_...
method forward (line 63) | def forward(self, x, x_mask):
class WN (line 73) | class WN(torch.nn.Module):
method __init__ (line 74) | def __init__(self, hidden_channels, kernel_size, dilation_rate, n_laye...
method forward (line 110) | def forward(self, x, x_mask, g=None, **kwargs):
method remove_weight_norm (line 140) | def remove_weight_norm(self):
class ResBlock1 (line 149) | class ResBlock1(torch.nn.Module):
method __init__ (line 150) | def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)):
method forward (line 172) | def forward(self, x, x_mask=None):
method remove_weight_norm (line 187) | def remove_weight_norm(self):
class ResBlock2 (line 194) | class ResBlock2(torch.nn.Module):
method __init__ (line 195) | def __init__(self, channels, kernel_size=3, dilation=(1, 3)):
method forward (line 205) | def forward(self, x, x_mask=None):
method remove_weight_norm (line 216) | def remove_weight_norm(self):
class Log (line 221) | class Log(nn.Module):
method forward (line 222) | def forward(self, x, x_mask, reverse=False, **kwargs):
class Flip (line 232) | class Flip(nn.Module):
method forward (line 233) | def forward(self, x, *args, reverse=False, **kwargs):
class ElementwiseAffine (line 242) | class ElementwiseAffine(nn.Module):
method __init__ (line 243) | def __init__(self, channels):
method forward (line 249) | def forward(self, x, x_mask, reverse=False, **kwargs):
class ResidualCouplingLayer (line 260) | class ResidualCouplingLayer(nn.Module):
method __init__ (line 261) | def __init__(self,
method forward (line 288) | def forward(self, x, x_mask, g=None, reverse=False):
class TransformerCouplingLayer (line 309) | class TransformerCouplingLayer(nn.Module):
method __init__ (line 310) | def __init__(self,
method forward (line 337) | def forward(self, x, x_mask, g=None, reverse=False):
FILE: onnx_export.py
function OnnxExport (line 11) | def OnnxExport(path=None):
FILE: onnx_export_old.py
function main (line 7) | def main(NetExport):
FILE: onnxexport/model_onnx.py
class ResidualCouplingBlock (line 16) | class ResidualCouplingBlock(nn.Module):
method __init__ (line 17) | def __init__(self,
method forward (line 41) | def forward(self, x, x_mask, g=None, reverse=False):
class Encoder (line 51) | class Encoder(nn.Module):
method __init__ (line 52) | def __init__(self,
method forward (line 73) | def forward(self, x, x_lengths, g=None):
class TextEncoder (line 84) | class TextEncoder(nn.Module):
method __init__ (line 85) | def __init__(self,
method forward (line 111) | def forward(self, x, x_mask, f0=None, z=None):
class DiscriminatorP (line 120) | class DiscriminatorP(torch.nn.Module):
method __init__ (line 121) | def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=...
method forward (line 135) | def forward(self, x):
class DiscriminatorS (line 157) | class DiscriminatorS(torch.nn.Module):
method __init__ (line 158) | def __init__(self, use_spectral_norm=False):
method forward (line 171) | def forward(self, x):
class F0Decoder (line 185) | class F0Decoder(nn.Module):
method __init__ (line 186) | def __init__(self,
method forward (line 217) | def forward(self, x, norm_f0, x_mask, spk_emb=None):
class SynthesizerTrn (line 228) | class SynthesizerTrn(nn.Module):
method __init__ (line 233) | def __init__(self,
method forward (line 312) | def forward(self, c, f0, mel2ph, uv, noise=None, g=None):
FILE: onnxexport/model_onnx_speaker_mix.py
class ResidualCouplingBlock (line 12) | class ResidualCouplingBlock(nn.Module):
method __init__ (line 13) | def __init__(self,
method forward (line 42) | def forward(self, x, x_mask, g=None, reverse=False):
class TransformerCouplingBlock (line 51) | class TransformerCouplingBlock(nn.Module):
method __init__ (line 52) | def __init__(self,
method forward (line 82) | def forward(self, x, x_mask, g=None, reverse=False):
class Encoder (line 92) | class Encoder(nn.Module):
method __init__ (line 93) | def __init__(self,
method forward (line 114) | def forward(self, x, x_lengths, g=None):
class TextEncoder (line 125) | class TextEncoder(nn.Module):
method __init__ (line 126) | def __init__(self,
method forward (line 152) | def forward(self, x, x_mask, f0=None, z=None):
class F0Decoder (line 162) | class F0Decoder(nn.Module):
method __init__ (line 163) | def __init__(self,
method forward (line 194) | def forward(self, x, norm_f0, x_mask, spk_emb=None):
class SynthesizerTrn (line 205) | class SynthesizerTrn(nn.Module):
method __init__ (line 210) | def __init__(self,
method export_chara_mix (line 324) | def export_chara_mix(self, speakers_mix):
method forward (line 334) | def forward(self, c, f0, mel2ph, uv, noise=None, g=None, vol = None):
FILE: preprocess_flist_config.py
function get_wav_duration (line 15) | def get_wav_duration(file_path):
FILE: preprocess_hubert_f0.py
function process_one (line 31) | def process_one(filename, hmodel, f0p, device, diff=False, mel_extractor...
function process_batch (line 106) | def process_batch(file_chunk, f0p, diff=False, mel_extractor=None, devic...
function parallel_process (line 119) | def parallel_process(filenames, num_processes, f0p, diff, mel_extractor,...
FILE: resample.py
function load_wav (line 13) | def load_wav(wav_path):
function trim_wav (line 17) | def trim_wav(wav, top_db=40):
function normalize_peak (line 21) | def normalize_peak(wav, threshold=1.0):
function resample_wav (line 28) | def resample_wav(wav, sr, target_sr):
function save_wav_to_path (line 32) | def save_wav_to_path(wav, save_path, sr):
function process (line 40) | def process(item):
function process_all_speakers (line 76) | def process_all_speakers():
FILE: train.py
function main (line 35) | def main():
function run (line 47) | def run(rank, n_gpus, hps):
function train_and_evaluate (line 135) | def train_and_evaluate(rank, epoch, hps, nets, optims, schedulers, scale...
function evaluate (line 276) | def evaluate(hps, generator, eval_loader, writer_eval):
FILE: train_diff.py
function parse_args (line 14) | def parse_args(args=None, namespace=None):
FILE: utils.py
function normalize_f0 (line 31) | def normalize_f0(f0, x_mask, uv, random_scale=True):
function plot_data_to_numpy (line 46) | def plot_data_to_numpy(x, y):
function f0_to_coarse (line 69) | def f0_to_coarse(f0):
function get_content (line 82) | def get_content(cmodel, y):
function get_f0_predictor (line 88) | def get_f0_predictor(f0_predictor,hop_length,sampling_rate,**kargs):
function get_speech_encoder (line 111) | def get_speech_encoder(speech_encoder,device=None,**kargs):
function load_checkpoint (line 155) | def load_checkpoint(checkpoint_path, model, optimizer=None, skip_optimiz...
function save_checkpoint (line 190) | def save_checkpoint(model, optimizer, learning_rate, iteration, checkpoi...
function clean_checkpoints (line 202) | def clean_checkpoints(path_to_models='logs/44k/', n_ckpts_to_keep=2, sor...
function summarize (line 227) | def summarize(writer, global_step, scalars={}, histograms={}, images={},...
function latest_checkpoint_path (line 238) | def latest_checkpoint_path(dir_path, regex="G_*.pth"):
function plot_spectrogram_to_numpy (line 246) | def plot_spectrogram_to_numpy(spectrogram):
function plot_alignment_to_numpy (line 272) | def plot_alignment_to_numpy(alignment, info=None):
function load_wav_to_torch (line 301) | def load_wav_to_torch(full_path):
function load_filepaths_and_text (line 306) | def load_filepaths_and_text(filename, split="|"):
function get_hparams (line 312) | def get_hparams(init=True):
function get_hparams_from_dir (line 342) | def get_hparams_from_dir(model_dir):
function get_hparams_from_file (line 353) | def get_hparams_from_file(config_path, infer_mode = False):
function check_git_hash (line 361) | def check_git_hash(model_dir):
function get_logger (line 381) | def get_logger(model_dir, filename="train.log"):
function repeat_expand_2d (line 396) | def repeat_expand_2d(content, target_len, mode = 'left'):
function repeat_expand_2d_left (line 402) | def repeat_expand_2d_left(content, target_len):
function repeat_expand_2d_other (line 420) | def repeat_expand_2d_other(content, target_len, mode = 'nearest'):
function mix_model (line 427) | def mix_model(model_paths,mix_rate,mode):
function change_rms (line 440) | def change_rms(data1, sr1, data2, sr2, rate): # 1 is the input audio, 2 is the output audio, rate is the proportion of 2...
function train_index (line 461) | def train_index(spk_name,root_dir = "dataset/44k/"): #from: RVC https:/...
class HParams (line 514) | class HParams():
method __init__ (line 515) | def __init__(self, **kwargs):
method keys (line 521) | def keys(self):
method items (line 524) | def items(self):
method values (line 527) | def values(self):
method __len__ (line 530) | def __len__(self):
method __getitem__ (line 533) | def __getitem__(self, key):
method __setitem__ (line 536) | def __setitem__(self, key, value):
method __contains__ (line 539) | def __contains__(self, key):
method __repr__ (line 542) | def __repr__(self):
method get (line 545) | def get(self,index):
class InferHParams (line 549) | class InferHParams(HParams):
method __init__ (line 550) | def __init__(self, **kwargs):
method __getattr__ (line 556) | def __getattr__(self,index):
class Volume_Extractor (line 560) | class Volume_Extractor:
method __init__ (line 561) | def __init__(self, hop_size = 512):
method extract (line 564) | def extract(self, audio): # audio: 2d tensor array
FILE: vdecoder/hifigan/env.py
class AttrDict (line 5) | class AttrDict(dict):
method __init__ (line 6) | def __init__(self, *args, **kwargs):
function build_env (line 11) | def build_env(config, config_name, path):
FILE: vdecoder/hifigan/models.py
function load_model (line 17) | def load_model(model_path, device='cuda'):
class ResBlock1 (line 36) | class ResBlock1(torch.nn.Module):
method __init__ (line 37) | def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
method forward (line 60) | def forward(self, x):
method remove_weight_norm (line 69) | def remove_weight_norm(self):
class ResBlock2 (line 76) | class ResBlock2(torch.nn.Module):
method __init__ (line 77) | def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
method forward (line 88) | def forward(self, x):
method remove_weight_norm (line 95) | def remove_weight_norm(self):
function padDiff (line 100) | def padDiff(x):
class SineGen (line 103) | class SineGen(torch.nn.Module):
method __init__ (line 119) | def __init__(self, samp_rate, harmonic_num=0,
method _f02uv (line 133) | def _f02uv(self, f0):
method _f02sine (line 138) | def _f02sine(self, f0_values):
method forward (line 197) | def forward(self, f0, upp=None):
class SourceModuleHnNSF (line 274) | class SourceModuleHnNSF(torch.nn.Module):
method __init__ (line 292) | def __init__(self, sampling_rate, harmonic_num=0, sine_amp=0.1,
method forward (line 307) | def forward(self, x, upp=None):
class Generator (line 323) | class Generator(torch.nn.Module):
method __init__ (line 324) | def __init__(self, h):
method OnnxExport (line 362) | def OnnxExport(self):
method forward (line 366) | def forward(self, x, f0, g=None):
method remove_weight_norm (line 396) | def remove_weight_norm(self):
class DiscriminatorP (line 406) | class DiscriminatorP(torch.nn.Module):
method __init__ (line 407) | def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=...
method forward (line 420) | def forward(self, x):
class MultiPeriodDiscriminator (line 442) | class MultiPeriodDiscriminator(torch.nn.Module):
method __init__ (line 443) | def __init__(self, periods=None):
method forward (line 450) | def forward(self, y, y_hat):
class DiscriminatorS (line 466) | class DiscriminatorS(torch.nn.Module):
method __init__ (line 467) | def __init__(self, use_spectral_norm=False):
method forward (line 481) | def forward(self, x):
class MultiScaleDiscriminator (line 494) | class MultiScaleDiscriminator(torch.nn.Module):
method __init__ (line 495) | def __init__(self):
method forward (line 507) | def forward(self, y, y_hat):
function feature_loss (line 526) | def feature_loss(fmap_r, fmap_g):
function discriminator_loss (line 535) | def discriminator_loss(disc_real_outputs, disc_generated_outputs):
function generator_loss (line 549) | def generator_loss(disc_outputs):
FILE: vdecoder/hifigan/nvSTFT.py
function load_wav_to_torch (line 12) | def load_wav_to_torch(full_path, target_sr=None, return_empty_on_excepti...
function dynamic_range_compression (line 44) | def dynamic_range_compression(x, C=1, clip_val=1e-5):
function dynamic_range_decompression (line 47) | def dynamic_range_decompression(x, C=1):
function dynamic_range_compression_torch (line 50) | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
function dynamic_range_decompression_torch (line 53) | def dynamic_range_decompression_torch(x, C=1):
class STFT (line 56) | class STFT():
method __init__ (line 57) | def __init__(self, sr=22050, n_mels=80, n_fft=1024, win_size=1024, hop...
method get_mel (line 70) | def get_mel(self, y, center=False):
method __call__ (line 104) | def __call__(self, audiopath):
FILE: vdecoder/hifigan/utils.py
function plot_spectrogram (line 10) | def plot_spectrogram(spectrogram):
function init_weights (line 22) | def init_weights(m, mean=0.0, std=0.01):
function apply_weight_norm (line 28) | def apply_weight_norm(m):
function get_padding (line 34) | def get_padding(kernel_size, dilation=1):
function load_checkpoint (line 38) | def load_checkpoint(filepath, device):
function save_checkpoint (line 46) | def save_checkpoint(filepath, obj):
function del_old_checkpoints (line 52) | def del_old_checkpoints(cp_dir, prefix, n_models=2):
function scan_checkpoint (line 62) | def scan_checkpoint(cp_dir, prefix):
FILE: vdecoder/hifiganwithsnake/alias/act.py
class Activation1d (line 13) | class Activation1d(nn.Module):
method __init__ (line 14) | def __init__(self,
method forward (line 28) | def forward(self, x):
class SnakeBeta (line 36) | class SnakeBeta(nn.Module):
method __init__ (line 54) | def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha...
method forward (line 79) | def forward(self, x):
class Mish (line 95) | class Mish(nn.Module):
method __init__ (line 102) | def __init__(self):
method forward (line 105) | def forward(self, x):
class SnakeAlias (line 109) | class SnakeAlias(nn.Module):
method __init__ (line 110) | def __init__(self,
method forward (line 125) | def forward(self, x, C=None):
FILE: vdecoder/hifiganwithsnake/alias/filter.py
function sinc (line 16) | def sinc(x: torch.Tensor):
function kaiser_sinc_filter1d (line 29) | def kaiser_sinc_filter1d(cutoff, half_width, kernel_size): # return filt...
class LowPassFilter1d (line 61) | class LowPassFilter1d(nn.Module):
method __init__ (line 62) | def __init__(self,
method forward (line 93) | def forward(self, x):
FILE: vdecoder/hifiganwithsnake/alias/resample.py
class UpSample1d (line 10) | class UpSample1d(nn.Module):
method __init__ (line 11) | def __init__(self, ratio=2, kernel_size=None, C=None):
method forward (line 38) | def forward(self, x, C=None):
class DownSample1d (line 57) | class DownSample1d(nn.Module):
method __init__ (line 58) | def __init__(self, ratio=2, kernel_size=None, C=None):
method forward (line 69) | def forward(self, x):
FILE: vdecoder/hifiganwithsnake/env.py
class AttrDict (line 5) | class AttrDict(dict):
method __init__ (line 6) | def __init__(self, *args, **kwargs):
function build_env (line 11) | def build_env(config, config_name, path):
FILE: vdecoder/hifiganwithsnake/models.py
function load_model (line 19) | def load_model(model_path, device='cuda'):
class ResBlock1 (line 38) | class ResBlock1(torch.nn.Module):
method __init__ (line 39) | def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5), C=N...
method forward (line 67) | def forward(self, x, DIM=None):
method remove_weight_norm (line 77) | def remove_weight_norm(self):
class ResBlock2 (line 84) | class ResBlock2(torch.nn.Module):
method __init__ (line 85) | def __init__(self, h, channels, kernel_size=3, dilation=(1, 3), C=None):
method forward (line 101) | def forward(self, x, DIM=None):
method remove_weight_norm (line 108) | def remove_weight_norm(self):
function padDiff (line 113) | def padDiff(x):
class SineGen (line 116) | class SineGen(torch.nn.Module):
method __init__ (line 132) | def __init__(self, samp_rate, harmonic_num=0,
method _f02uv (line 146) | def _f02uv(self, f0):
method _f02sine (line 151) | def _f02sine(self, f0_values):
method forward (line 210) | def forward(self, f0, upp=None):
class SourceModuleHnNSF (line 288) | class SourceModuleHnNSF(torch.nn.Module):
method __init__ (line 306) | def __init__(self, sampling_rate, harmonic_num=0, sine_amp=0.1,
method forward (line 321) | def forward(self, x, upp=None):
class Generator (line 337) | class Generator(torch.nn.Module):
method __init__ (line 338) | def __init__(self, h):
method OnnxExport (line 379) | def OnnxExport(self):
method forward (line 383) | def forward(self, x, f0, g=None):
method remove_weight_norm (line 415) | def remove_weight_norm(self):
class DiscriminatorP (line 425) | class DiscriminatorP(torch.nn.Module):
method __init__ (line 426) | def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=...
method forward (line 439) | def forward(self, x):
class MultiPeriodDiscriminator (line 461) | class MultiPeriodDiscriminator(torch.nn.Module):
method __init__ (line 462) | def __init__(self, periods=None):
method forward (line 469) | def forward(self, y, y_hat):
class DiscriminatorS (line 485) | class DiscriminatorS(torch.nn.Module):
method __init__ (line 486) | def __init__(self, use_spectral_norm=False):
method forward (line 500) | def forward(self, x):
class MultiScaleDiscriminator (line 513) | class MultiScaleDiscriminator(torch.nn.Module):
method __init__ (line 514) | def __init__(self):
method forward (line 526) | def forward(self, y, y_hat):
function feature_loss (line 545) | def feature_loss(fmap_r, fmap_g):
function discriminator_loss (line 554) | def discriminator_loss(disc_real_outputs, disc_generated_outputs):
function generator_loss (line 568) | def generator_loss(disc_outputs):
FILE: vdecoder/hifiganwithsnake/nvSTFT.py
function load_wav_to_torch (line 12) | def load_wav_to_torch(full_path, target_sr=None, return_empty_on_excepti...
function dynamic_range_compression (line 44) | def dynamic_range_compression(x, C=1, clip_val=1e-5):
function dynamic_range_decompression (line 47) | def dynamic_range_decompression(x, C=1):
function dynamic_range_compression_torch (line 50) | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
function dynamic_range_decompression_torch (line 53) | def dynamic_range_decompression_torch(x, C=1):
class STFT (line 56) | class STFT():
method __init__ (line 57) | def __init__(self, sr=22050, n_mels=80, n_fft=1024, win_size=1024, hop...
method get_mel (line 70) | def get_mel(self, y, center=False):
method __call__ (line 104) | def __call__(self, audiopath):
FILE: vdecoder/hifiganwithsnake/utils.py
function plot_spectrogram (line 10) | def plot_spectrogram(spectrogram):
function init_weights (line 22) | def init_weights(m, mean=0.0, std=0.01):
function apply_weight_norm (line 28) | def apply_weight_norm(m):
function get_padding (line 34) | def get_padding(kernel_size, dilation=1):
function load_checkpoint (line 38) | def load_checkpoint(filepath, device):
function save_checkpoint (line 46) | def save_checkpoint(filepath, obj):
function del_old_checkpoints (line 52) | def del_old_checkpoints(cp_dir, prefix, n_models=2):
function scan_checkpoint (line 62) | def scan_checkpoint(cp_dir, prefix):
FILE: vdecoder/nsf_hifigan/env.py
class AttrDict (line 5) | class AttrDict(dict):
method __init__ (line 6) | def __init__(self, *args, **kwargs):
function build_env (line 11) | def build_env(config, config_name, path):
FILE: vdecoder/nsf_hifigan/models.py
function load_model (line 17) | def load_model(model_path, device='cuda'):
function load_config (line 29) | def load_config(model_path):
class ResBlock1 (line 39) | class ResBlock1(torch.nn.Module):
method __init__ (line 40) | def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5)):
method forward (line 63) | def forward(self, x):
method remove_weight_norm (line 72) | def remove_weight_norm(self):
class ResBlock2 (line 79) | class ResBlock2(torch.nn.Module):
method __init__ (line 80) | def __init__(self, h, channels, kernel_size=3, dilation=(1, 3)):
method forward (line 91) | def forward(self, x):
method remove_weight_norm (line 98) | def remove_weight_norm(self):
class SineGen (line 103) | class SineGen(torch.nn.Module):
method __init__ (line 119) | def __init__(self, samp_rate, harmonic_num=0,
method _f02uv (line 130) | def _f02uv(self, f0):
method forward (line 137) | def forward(self, f0, upp):
class SourceModuleHnNSF (line 182) | class SourceModuleHnNSF(torch.nn.Module):
method __init__ (line 200) | def __init__(self, sampling_rate, harmonic_num=0, sine_amp=0.1,
method forward (line 215) | def forward(self, x, upp):
class Generator (line 221) | class Generator(torch.nn.Module):
method __init__ (line 222) | def __init__(self, h):
method forward (line 259) | def forward(self, x, f0):
method remove_weight_norm (line 280) | def remove_weight_norm(self):
class DiscriminatorP (line 290) | class DiscriminatorP(torch.nn.Module):
method __init__ (line 291) | def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=...
method forward (line 304) | def forward(self, x):
class MultiPeriodDiscriminator (line 326) | class MultiPeriodDiscriminator(torch.nn.Module):
method __init__ (line 327) | def __init__(self, periods=None):
method forward (line 334) | def forward(self, y, y_hat):
class DiscriminatorS (line 350) | class DiscriminatorS(torch.nn.Module):
method __init__ (line 351) | def __init__(self, use_spectral_norm=False):
method forward (line 365) | def forward(self, x):
class MultiScaleDiscriminator (line 378) | class MultiScaleDiscriminator(torch.nn.Module):
method __init__ (line 379) | def __init__(self):
method forward (line 391) | def forward(self, y, y_hat):
function feature_loss (line 410) | def feature_loss(fmap_r, fmap_g):
function discriminator_loss (line 419) | def discriminator_loss(disc_real_outputs, disc_generated_outputs):
function generator_loss (line 433) | def generator_loss(disc_outputs):
FILE: vdecoder/nsf_hifigan/nvSTFT.py
function load_wav_to_torch (line 13) | def load_wav_to_torch(full_path, target_sr=None, return_empty_on_excepti...
function dynamic_range_compression (line 45) | def dynamic_range_compression(x, C=1, clip_val=1e-5):
function dynamic_range_decompression (line 48) | def dynamic_range_decompression(x, C=1):
function dynamic_range_compression_torch (line 51) | def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
function dynamic_range_decompression_torch (line 54) | def dynamic_range_decompression_torch(x, C=1):
class STFT (line 57) | class STFT():
method __init__ (line 58) | def __init__(self, sr=22050, n_mels=80, n_fft=1024, win_size=1024, hop...
method get_mel (line 71) | def get_mel(self, y, keyshift=0, speed=1, center=False):
method __call__ (line 127) | def __call__(self, audiopath):
FILE: vdecoder/nsf_hifigan/utils.py
function plot_spectrogram (line 12) | def plot_spectrogram(spectrogram):
function init_weights (line 24) | def init_weights(m, mean=0.0, std=0.01):
function apply_weight_norm (line 30) | def apply_weight_norm(m):
function get_padding (line 36) | def get_padding(kernel_size, dilation=1):
function load_checkpoint (line 40) | def load_checkpoint(filepath, device):
function save_checkpoint (line 48) | def save_checkpoint(filepath, obj):
function del_old_checkpoints (line 54) | def del_old_checkpoints(cp_dir, prefix, n_models=2):
function scan_checkpoint (line 64) | def scan_checkpoint(cp_dir, prefix):
FILE: vencoder/CNHubertLarge.py
class CNHubertLarge (line 7) | class CNHubertLarge(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/chinese-hubert-large-fairseq-ckp...
method encoder (line 23) | def encoder(self, wav):
FILE: vencoder/ContentVec256L12_Onnx.py
class ContentVec256L12_Onnx (line 7) | class ContentVec256L12_Onnx(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/vec-256-layer-12.onnx", device=N...
method encoder (line 24) | def encoder(self, wav):
FILE: vencoder/ContentVec256L9.py
class ContentVec256L9 (line 7) | class ContentVec256L9(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/checkpoint_best_legacy_500.pt", ...
method encoder (line 23) | def encoder(self, wav):
FILE: vencoder/ContentVec256L9_Onnx.py
class ContentVec256L9_Onnx (line 7) | class ContentVec256L9_Onnx(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/vec-256-layer-9.onnx", device=No...
method encoder (line 22) | def encoder(self, wav):
FILE: vencoder/ContentVec768L12.py
class ContentVec768L12 (line 7) | class ContentVec768L12(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/checkpoint_best_legacy_500.pt", ...
method encoder (line 23) | def encoder(self, wav):
FILE: vencoder/ContentVec768L12_Onnx.py
class ContentVec768L12_Onnx (line 7) | class ContentVec768L12_Onnx(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/vec-768-layer-12.onnx", device=N...
method encoder (line 24) | def encoder(self, wav):
FILE: vencoder/ContentVec768L9_Onnx.py
class ContentVec768L9_Onnx (line 7) | class ContentVec768L9_Onnx(SpeechEncoder):
method __init__ (line 8) | def __init__(self,vec_path = "pretrain/vec-768-layer-9.onnx",device=No...
method encoder (line 24) | def encoder(self, wav):
FILE: vencoder/DPHubert.py
class DPHubert (line 7) | class DPHubert(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/DPHuBERT-sp0.75.pth", device=None):
method encoder (line 20) | def encoder(self, wav):
FILE: vencoder/HubertSoft.py
class HubertSoft (line 7) | class HubertSoft(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/hubert-soft-0d54a1f4.pt", device...
method encoder (line 19) | def encoder(self, wav):
FILE: vencoder/HubertSoft_Onnx.py
class HubertSoft_Onnx (line 7) | class HubertSoft_Onnx(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/hubert-soft.onnx", device=None):
method encoder (line 24) | def encoder(self, wav):
FILE: vencoder/WavLMBasePlus.py
class WavLMBasePlus (line 7) | class WavLMBasePlus(SpeechEncoder):
method __init__ (line 8) | def __init__(self, vec_path="pretrain/WavLM-Base+.pt", device=None):
method encoder (line 22) | def encoder(self, wav):
FILE: vencoder/WhisperPPG.py
class WhisperPPG (line 8) | class WhisperPPG(SpeechEncoder):
method __init__ (line 9) | def __init__(self, vec_path="pretrain/medium.pt", device=None):
method encoder (line 22) | def encoder(self, wav):
FILE: vencoder/WhisperPPGLarge.py
class WhisperPPGLarge (line 8) | class WhisperPPGLarge(SpeechEncoder):
method __init__ (line 9) | def __init__(self, vec_path="pretrain/large-v2.pt", device=None):
method encoder (line 22) | def encoder(self, wav):
FILE: vencoder/dphubert/components.py
function _init_transformer_params (line 24) | def _init_transformer_params(module):
class LayerNorm (line 54) | class LayerNorm(nn.LayerNorm):
method forward (line 57) | def forward(self, input: Tensor) -> Tensor:
class ConvLayerBlock (line 64) | class ConvLayerBlock(Module):
method __init__ (line 67) | def __init__(
method forward (line 94) | def forward(
method get_num_params_and_out_channels (line 122) | def get_num_params_and_out_channels(self, in_channels):
class FeatureExtractor (line 137) | class FeatureExtractor(Module):
method __init__ (line 145) | def __init__(
method forward (line 158) | def forward(
method get_num_params_and_final_out_channels (line 187) | def get_num_params_and_final_out_channels(self):
method prune (line 198) | def prune(self):
class FeatureProjection (line 238) | class FeatureProjection(Module):
method __init__ (line 249) | def __init__(
method forward (line 263) | def forward(self, x):
method get_num_params (line 276) | def get_num_params(self, in_features):
class ConvolutionalPositionalEmbedding (line 280) | class ConvolutionalPositionalEmbedding(Module):
method __init__ (line 289) | def __init__(
method __prepare_scriptable__ (line 309) | def __prepare_scriptable__(self):
method forward (line 319) | def forward(self, x):
class SelfAttention (line 336) | class SelfAttention(Module):
method __init__ (line 346) | def __init__(
method forward (line 379) | def forward(
method get_num_params (line 438) | def get_num_params(self):
method prune (line 451) | def prune(self):
class WavLMSelfAttention (line 486) | class WavLMSelfAttention(SelfAttention):
method __init__ (line 501) | def __init__(
method compute_bias (line 546) | def compute_bias(self, query_length: int, key_length: int) -> Tensor:
method _relative_positions_bucket (line 563) | def _relative_positions_bucket(self, relative_positions: Tensor, bidir...
method forward (line 602) | def forward(
method prune (line 661) | def prune(self):
class FeedForward (line 696) | class FeedForward(Module):
method __init__ (line 699) | def __init__(
method forward (line 726) | def forward(self, x):
method get_num_params (line 750) | def get_num_params(self):
method prune (line 763) | def prune(self):
class EncoderLayer (line 794) | class EncoderLayer(Module):
method __init__ (line 797) | def __init__(
method forward (line 814) | def forward(
method get_num_params (line 859) | def get_num_params(self):
class Transformer (line 868) | class Transformer(Module):
method __init__ (line 869) | def __init__(
method _preprocess (line 885) | def _preprocess(self, x: Tensor):
method forward (line 894) | def forward(
method get_intermediate_outputs (line 909) | def get_intermediate_outputs(
method get_num_params (line 929) | def get_num_params(self):
method prune (line 936) | def prune(self):
class Encoder (line 958) | class Encoder(Module):
method __init__ (line 959) | def __init__(
method _preprocess (line 968) | def _preprocess(
method forward (line 986) | def forward(
method extract_features (line 995) | def extract_features(
method get_num_params (line 1005) | def get_num_params(self, in_features):
method prune (line 1011) | def prune(self, conv_out_index):
function _get_feature_extractor (line 1020) | def _get_feature_extractor(
function _get_encoder (line 1097) | def _get_encoder(
function _get_wavlm_encoder (line 1289) | def _get_wavlm_encoder(
function _get_padding_mask (line 1387) | def _get_padding_mask(input: Tensor, lengths: Tensor) -> Tensor:
class GradMultiply (line 1401) | class GradMultiply(torch.autograd.Function):
method forward (line 1403) | def forward(ctx, x, scale):
method backward (line 1409) | def backward(ctx, grad):
FILE: vencoder/dphubert/hardconcrete.py
class HardConcrete (line 14) | class HardConcrete(nn.Module):
method __init__ (line 28) | def __init__(
method reset_parameters (line 70) | def reset_parameters(self):
method l0_norm (line 76) | def l0_norm(self) -> torch.Tensor:
method forward (line 85) | def forward(self) -> torch.Tensor:
method extra_repr (line 118) | def extra_repr(self) -> str:
method __repr__ (line 121) | def __repr__(self) -> str:
FILE: vencoder/dphubert/model.py
class Wav2Vec2Model (line 19) | class Wav2Vec2Model(Module):
method __init__ (line 44) | def __init__(
method extract_features (line 58) | def extract_features(
method get_num_params (line 109) | def get_num_params(self):
method prune (line 115) | def prune(self):
method forward (line 127) | def forward(
function wav2vec2_model (line 172) | def wav2vec2_model(**configs) -> Wav2Vec2Model:
function wav2vec2_model_original (line 181) | def wav2vec2_model_original(
function wav2vec2_base (line 364) | def wav2vec2_base(
function wav2vec2_large (line 422) | def wav2vec2_large(
function wav2vec2_large_lv60k (line 480) | def wav2vec2_large_lv60k(
function hubert_base (line 538) | def hubert_base(
function hubert_large (line 599) | def hubert_large(
function hubert_xlarge (line 657) | def hubert_xlarge(
function _init_hubert_pretrain_model (line 715) | def _init_hubert_pretrain_model(module):
function wavlm_model (line 736) | def wavlm_model(
function wavlm_base (line 865) | def wavlm_base(
function wavlm_large (line 917) | def wavlm_large(
FILE: vencoder/dphubert/pruning_utils.py
function prune_linear_layer (line 9) | def prune_linear_layer(layer: nn.Linear, index: torch.LongTensor, dim: s...
function prune_conv1d_layer (line 26) | def prune_conv1d_layer(layer: nn.Conv1d, index: torch.LongTensor, dim: s...
function prune_layer_norm (line 43) | def prune_layer_norm(layernorm: Union[nn.LayerNorm, nn.GroupNorm], index...
FILE: vencoder/dphubert/utils/import_huggingface_wavlm.py
function _get_config (line 18) | def _get_config(cfg):
function _get_config_wavlm (line 39) | def _get_config_wavlm(cfg):
function _build (line 66) | def _build(config, original):
function transform_wavlm_encoder_state (line 93) | def transform_wavlm_encoder_state(state: Dict[str, Any], encoder_num_lay...
function import_huggingface_model (line 100) | def import_huggingface_model(original: Module) -> Wav2Vec2Model:
FILE: vencoder/encoder.py
class SpeechEncoder (line 1) | class SpeechEncoder(object):
method __init__ (line 2) | def __init__(self, vec_path="pretrain/checkpoint_best_legacy_500.pt", ...
method encoder (line 8) | def encoder(self, wav):
FILE: vencoder/hubert/hubert_model.py
class Hubert (line 11) | class Hubert(nn.Module):
method __init__ (line 12) | def __init__(self, num_label_embeddings: int = 100, mask: bool = True):
method mask (line 31) | def mask(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
method encode (line 38) | def encode(
method logits (line 49) | def logits(self, x: torch.Tensor) -> torch.Tensor:
method forward (line 57) | def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
class HubertSoft (line 64) | class HubertSoft(Hubert):
method __init__ (line 65) | def __init__(self):
method units (line 69) | def units(self, wav: torch.Tensor) -> torch.Tensor:
class FeatureExtractor (line 75) | class FeatureExtractor(nn.Module):
method __init__ (line 76) | def __init__(self):
method forward (line 87) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class FeatureProjection (line 98) | class FeatureProjection(nn.Module):
method __init__ (line 99) | def __init__(self):
method forward (line 105) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class PositionalConvEmbedding (line 112) | class PositionalConvEmbedding(nn.Module):
method __init__ (line 113) | def __init__(self):
method forward (line 124) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class TransformerEncoder (line 130) | class TransformerEncoder(nn.Module):
method __init__ (line 131) | def __init__(
method forward (line 140) | def forward(
function _compute_mask (line 155) | def _compute_mask(
function hubert_soft (line 210) | def hubert_soft(
FILE: vencoder/hubert/hubert_model_onnx.py
class Hubert (line 11) | class Hubert(nn.Module):
method __init__ (line 12) | def __init__(self, num_label_embeddings: int = 100, mask: bool = True):
method mask (line 31) | def mask(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
method encode (line 38) | def encode(
method logits (line 49) | def logits(self, x: torch.Tensor) -> torch.Tensor:
class HubertSoft (line 58) | class HubertSoft(Hubert):
method __init__ (line 59) | def __init__(self):
method units (line 62) | def units(self, wav: torch.Tensor) -> torch.Tensor:
method forward (line 67) | def forward(self, x):
class FeatureExtractor (line 70) | class FeatureExtractor(nn.Module):
method __init__ (line 71) | def __init__(self):
method forward (line 82) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class FeatureProjection (line 93) | class FeatureProjection(nn.Module):
method __init__ (line 94) | def __init__(self):
method forward (line 100) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class PositionalConvEmbedding (line 107) | class PositionalConvEmbedding(nn.Module):
method __init__ (line 108) | def __init__(self):
method forward (line 119) | def forward(self, x: torch.Tensor) -> torch.Tensor:
class TransformerEncoder (line 125) | class TransformerEncoder(nn.Module):
method __init__ (line 126) | def __init__(
method forward (line 135) | def forward(
function _compute_mask (line 150) | def _compute_mask(
function hubert_soft (line 205) | def hubert_soft(
FILE: vencoder/wavlm/WavLM.py
function compute_mask_indices (line 35) | def compute_mask_indices(
class WavLMConfig (line 162) | class WavLMConfig:
method __init__ (line 163) | def __init__(self, cfg=None):
method update (line 216) | def update(self, cfg: dict):
class WavLM (line 220) | class WavLM(nn.Module):
method __init__ (line 221) | def __init__(
method apply_mask (line 271) | def apply_mask(self, x, padding_mask):
method forward_padding_mask (line 311) | def forward_padding_mask(
method extract_features (line 323) | def extract_features(
class ConvFeatureExtractionModel (line 378) | class ConvFeatureExtractionModel(nn.Module):
method __init__ (line 379) | def __init__(
method forward (line 483) | def forward(self, x, mask=None):
class TransformerEncoder (line 505) | class TransformerEncoder(nn.Module):
method __init__ (line 506) | def __init__(self, args):
method forward (line 562) | def forward(self, x, padding_mask=None, streaming_mask=None, layer=None):
method extract_features (line 570) | def extract_features(self, x, padding_mask=None, streaming_mask=None, ...
class TransformerSentenceEncoderLayer (line 613) | class TransformerSentenceEncoderLayer(nn.Module):
method __init__ (line 619) | def __init__(
method forward (line 675) | def forward(
FILE: vencoder/wavlm/modules.py
class TransposeLast (line 20) | class TransposeLast(nn.Module):
method __init__ (line 21) | def __init__(self, deconstruct_idx=None):
method forward (line 25) | def forward(self, x):
class Fp32LayerNorm (line 31) | class Fp32LayerNorm(nn.LayerNorm):
method __init__ (line 32) | def __init__(self, *args, **kwargs):
method forward (line 35) | def forward(self, input):
class Fp32GroupNorm (line 46) | class Fp32GroupNorm(nn.GroupNorm):
method __init__ (line 47) | def __init__(self, *args, **kwargs):
method forward (line 50) | def forward(self, input):
class GradMultiply (line 61) | class GradMultiply(torch.autograd.Function):
method forward (line 63) | def forward(ctx, x, scale):
method backward (line 69) | def backward(ctx, grad):
class SamePad (line 73) | class SamePad(nn.Module):
method __init__ (line 74) | def __init__(self, kernel_size, causal=False):
method forward (line 81) | def forward(self, x):
class Swish (line 87) | class Swish(nn.Module):
method __init__ (line 91) | def __init__(self):
method forward (line 96) | def forward(self, x):
class GLU_Linear (line 100) | class GLU_Linear(nn.Module):
method __init__ (line 101) | def __init__(self, input_dim, output_dim, glu_type="sigmoid", bias_in_...
method forward (line 121) | def forward(self, x):
function gelu_accurate (line 133) | def gelu_accurate(x):
function gelu (line 141) | def gelu(x: torch.Tensor) -> torch.Tensor:
function get_activation_fn (line 145) | def get_activation_fn(activation: str):
function init_bert_params (line 169) | def init_bert_params(module):
function quant_noise (line 204) | def quant_noise(module, p, block_size):
class MultiheadAttention (line 304) | class MultiheadAttention(nn.Module):
method __init__ (line 310) | def __init__(
method reset_parameters (line 396) | def reset_parameters(self):
method _relative_positions_bucket (line 418) | def _relative_positions_bucket(self, relative_positions, bidirectional...
method compute_bias (line 445) | def compute_bias(self, query_length, key_length):
method forward (line 458) | def forward(
method _append_prev_key_padding_mask (line 767) | def _append_prev_key_padding_mask(
method _get_input_buffer (line 810) | def _get_input_buffer(
method _set_input_buffer (line 820) | def _set_input_buffer(
method apply_sparse_mask (line 827) | def apply_sparse_mask(self, attn_weights, tgt_len: int, src_len: int, ...
FILE: vencoder/whisper/audio.py
function load_audio (line 22) | def load_audio(file: str, sr: int = SAMPLE_RATE):
function pad_or_trim (line 52) | def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):
function mel_filters (line 77) | def mel_filters(device, n_mels: int = N_MELS) -> torch.Tensor:
function log_mel_spectrogram (line 91) | def log_mel_spectrogram(audio: Union[str, np.ndarray, torch.Tensor], n_m...
FILE: vencoder/whisper/decoding.py
function detect_language (line 19) | def detect_language(model: "Whisper", mel: Tensor, tokenizer: Tokenizer ...
class DecodingOptions (line 72) | class DecodingOptions:
class DecodingResult (line 104) | class DecodingResult:
class Inference (line 116) | class Inference:
method logits (line 117) | def logits(self, tokens: Tensor, audio_features: Tensor) -> Tensor:
method rearrange_kv_cache (line 121) | def rearrange_kv_cache(self, source_indices) -> None:
method cleanup_caching (line 125) | def cleanup_caching(self) -> None:
class PyTorchInference (line 130) | class PyTorchInference(Inference):
method __init__ (line 131) | def __init__(self, model: "Whisper", initial_token_length: int):
method logits (line 137) | def logits(self, tokens: Tensor, audio_features: Tensor) -> Tensor:
method cleanup_caching (line 147) | def cleanup_caching(self):
method rearrange_kv_cache (line 154) | def rearrange_kv_cache(self, source_indices):
class SequenceRanker (line 160) | class SequenceRanker:
method rank (line 161) | def rank(self, tokens: List[List[Tensor]], sum_logprobs: List[List[flo...
class MaximumLikelihoodRanker (line 169) | class MaximumLikelihoodRanker(SequenceRanker):
method __init__ (line 175) | def __init__(self, length_penalty: Optional[float]):
method rank (line 178) | def rank(self, tokens: List[List[Tensor]], sum_logprobs: List[List[flo...
class TokenDecoder (line 195) | class TokenDecoder:
method reset (line 196) | def reset(self):
method update (line 199) | def update(self, tokens: Tensor, logits: Tensor, sum_logprobs: Tensor)...
method finalize (line 224) | def finalize(
class GreedyDecoder (line 249) | class GreedyDecoder(TokenDecoder):
method __init__ (line 250) | def __init__(self, temperature: float, eot: int):
method update (line 254) | def update(self, tokens: Tensor, logits: Tensor, sum_logprobs: Tensor)...
method finalize (line 271) | def finalize(self, tokens: Tensor, sum_logprobs: Tensor):
class BeamSearchDecoder (line 277) | class BeamSearchDecoder(TokenDecoder):
method __init__ (line 278) | def __init__(self, beam_size: int, eot: int, inference: Inference, pat...
method reset (line 288) | def reset(self):
method update (line 291) | def update(self, tokens: Tensor, logits: Tensor, sum_logprobs: Tensor)...
method finalize (line 347) | def finalize(self, preceding_tokens: Tensor, sum_logprobs: Tensor):
class LogitFilter (line 367) | class LogitFilter:
method apply (line 368) | def apply(self, logits: Tensor, tokens: Tensor) -> None:
class SuppressBlank (line 383) | class SuppressBlank(LogitFilter):
method __init__ (line 384) | def __init__(self, tokenizer: Tokenizer, sample_begin: int):
method apply (line 388) | def apply(self, logits: Tensor, tokens: Tensor):
class SuppressTokens (line 393) | class SuppressTokens(LogitFilter):
method __init__ (line 394) | def __init__(self, suppress_tokens: Sequence[int]):
method apply (line 397) | def apply(self, logits: Tensor, tokens: Tensor):
class ApplyTimestampRules (line 401) | class ApplyTimestampRules(LogitFilter):
method __init__ (line 402) | def __init__(
method apply (line 409) | def apply(self, logits: Tensor, tokens: Tensor):
class DecodingTask (line 444) | class DecodingTask:
method __init__ (line 450) | def __init__(self, model: "Whisper", options: DecodingOptions):
method _verify_options (line 499) | def _verify_options(self, options: DecodingOptions) -> DecodingOptions:
method _get_initial_tokens (line 512) | def _get_initial_tokens(self) -> Tuple[int]:
method _get_suppress_tokens (line 534) | def _get_suppress_tokens(self) -> Tuple[int]:
method _get_audio_features (line 557) | def _get_audio_features(self, mel: Tensor):
method _detect_language (line 575) | def _detect_language(self, audio_features: Tensor, tokens: Tensor):
method _main_loop (line 587) | def _main_loop(self, audio_features: Tensor, tokens: Tensor):
method run (line 619) | def run(self, mel: Tensor) -> List[DecodingResult]:
function decode (line 684) | def decode(model: "Whisper", mel: Tensor, options: DecodingOptions = Dec...
FILE: vencoder/whisper/model.py
class ModelDimensions (line 14) | class ModelDimensions:
class LayerNorm (line 27) | class LayerNorm(nn.LayerNorm):
method forward (line 28) | def forward(self, x: Tensor) -> Tensor:
class Linear (line 32) | class Linear(nn.Linear):
method forward (line 33) | def forward(self, x: Tensor) -> Tensor:
class Conv1d (line 39) | class Conv1d(nn.Conv1d):
method _conv_forward (line 40) | def _conv_forward(self, x: Tensor, weight: Tensor, bias: Optional[Tens...
function sinusoids (line 46) | def sinusoids(length, channels, max_timescale=10000):
class MultiHeadAttention (line 55) | class MultiHeadAttention(nn.Module):
method __init__ (line 56) | def __init__(self, n_state: int, n_head: int):
method forward (line 64) | def forward(
method qkv_attention (line 86) | def qkv_attention(self, q: Tensor, k: Tensor, v: Tensor, mask: Optiona...
class ResidualAttentionBlock (line 102) | class ResidualAttentionBlock(nn.Module):
method __init__ (line 103) | def __init__(self, n_state: int, n_head: int, cross_attention: bool = ...
method forward (line 116) | def forward(
class AudioEncoder (line 130) | class AudioEncoder(nn.Module):
method __init__ (line 131) | def __init__(self, n_mels: int, n_ctx: int, n_state: int, n_head: int,...
method forward (line 142) | def forward(self, x: Tensor):
class TextDecoder (line 164) | class TextDecoder(nn.Module):
method __init__ (line 165) | def __init__(self, n_vocab: int, n_ctx: int, n_state: int, n_head: int...
method forward (line 179) | def forward(self, x: Tensor, xa: Tensor, kv_cache: Optional[dict] = No...
class Whisper (line 199) | class Whisper(nn.Module):
method __init__ (line 200) | def __init__(self, dims: ModelDimensions):
method embed_audio (line 218) | def embed_audio(self, mel: torch.Tensor):
method logits (line 221) | def logits(self, tokens: torch.Tensor, audio_features: torch.Tensor):
method forward (line 224) | def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> Dict[str...
method device (line 228) | def device(self):
method is_multilingual (line 232) | def is_multilingual(self):
method install_kv_cache_hooks (line 235) | def install_kv_cache_hooks(self, cache: Optional[dict] = None):
FILE: vencoder/whisper/tokenizer.py
class Tokenizer (line 130) | class Tokenizer:
method encode (line 137) | def encode(self, text, **kwargs):
method decode (line 140) | def decode(self, token_ids: Union[int, List[int], np.ndarray, torch.Te...
method decode_with_timestamps (line 143) | def decode_with_timestamps(self, tokens) -> str:
method eot (line 161) | def eot(self) -> int:
method sot (line 166) | def sot(self) -> int:
method sot_lm (line 171) | def sot_lm(self) -> int:
method sot_prev (line 176) | def sot_prev(self) -> int:
method no_speech (line 181) | def no_speech(self) -> int:
method no_timestamps (line 186) | def no_timestamps(self) -> int:
method timestamp_begin (line 191) | def timestamp_begin(self) -> int:
method language_token (line 196) | def language_token(self) -> int:
method all_language_tokens (line 215) | def all_language_tokens(self) -> Tuple[int]:
method all_language_codes (line 227) | def all_language_codes(self) -> Tuple[str]:
method sot_sequence_including_notimestamps (line 232) | def sot_sequence_including_notimestamps(self) -> Tuple[int]:
method non_speech_tokens (line 237) | def non_speech_tokens(self) -> Tuple[int]:
method _get_single_token_id (line 267) | def _get_single_token_id(self, text) -> int:
function build_tokenizer (line 274) | def build_tokenizer(name: str = "gpt2"):
function get_tokenizer (line 295) | def get_tokenizer(
FILE: vencoder/whisper/utils.py
function make_safe (line 10) | def make_safe(string):
function make_safe (line 15) | def make_safe(string):
function exact_div (line 20) | def exact_div(x, y):
function str2bool (line 25) | def str2bool(string):
function optional_int (line 33) | def optional_int(string):
function optional_float (line 37) | def optional_float(string):
function compression_ratio (line 41) | def compression_ratio(text) -> float:
function format_timestamp (line 46) | def format_timestamp(seconds: float, always_include_hours: bool = False,...
class ResultWriter (line 63) | class ResultWriter:
method __init__ (line 66) | def __init__(self, output_dir: str):
method __call__ (line 69) | def __call__(self, result: dict, audio_path: str):
method write_result (line 76) | def write_result(self, result: dict, file: TextIO):
class WriteTXT (line 80) | class WriteTXT(ResultWriter):
method write_result (line 83) | def write_result(self, result: dict, file: TextIO):
class WriteVTT (line 88) | class WriteVTT(ResultWriter):
method write_result (line 91) | def write_result(self, result: dict, file: TextIO):
class WriteSRT (line 102) | class WriteSRT(ResultWriter):
method write_result (line 105) | def write_result(self, result: dict, file: TextIO):
class WriteTSV (line 118) | class WriteTSV(ResultWriter):
method write_result (line 129) | def write_result(self, result: dict, file: TextIO):
class WriteJSON (line 137) | class WriteJSON(ResultWriter):
method write_result (line 140) | def write_result(self, result: dict, file: TextIO):
function get_writer (line 144) | def get_writer(output_format: str, output_dir: str) -> Callable[[dict, T...
FILE: webUI.py
function upload_mix_append_file (line 43) | def upload_mix_append_file(files,sfiles):
function mix_submit_click (line 56) | def mix_submit_click(js,mode):
function updata_mix_info (line 71) | def updata_mix_info(files):
function modelAnalysis (line 82) | def modelAnalysis(model_path,config_path,cluster_model_path,device,enhan...
function modelUnload (line 132) | def modelUnload():
function vc_infer (line 142) | def vc_infer(output_format, sid, audio_path, truncated_basename, vc_tran...
function vc_fn (line 183) | def vc_fn(sid, input_audio, output_format, vc_transform, auto_f0,cluster...
function text_clear (line 213) | def text_clear(text):
function vc_fn2 (line 216) | def vc_fn2(_text, _lang, _gender, _rate, _volume, sid, output_format, vc...
function model_compression (line 244) | def model_compression(_model):
function scan_local_models (line 255) | def scan_local_models():
function local_model_refresh_fn (line 267) | def local_model_refresh_fn():
function debug_change (line 271) | def debug_change():
Condensed preview: 147 files, each showing path, character count, and a content snippet (full structured content: 1,064K chars).
[
{
"path": ".gitattributes",
"chars": 19,
"preview": "* text=auto eol=lf\n"
},
{
"path": ".github/ISSUE_TEMPLATE/ask_for_help.yaml",
"chars": 2927,
"preview": "name: 请求帮助\ndescription: 遇到了无法自行解决的错误\ntitle: '[Help]: '\nlabels: [ \"help wanted\" ]\n\nbody:\n - type: markdown\n attribute"
},
{
"path": ".github/ISSUE_TEMPLATE/ask_for_help_en_US.yaml",
"chars": 4070,
"preview": "name: Ask for help\ndescription: Encountered an error cannot be resolved by self\ntitle: '[Help]: '\nlabels: [ \"help wanted"
},
{
"path": ".github/ISSUE_TEMPLATE/bug_report.yaml",
"chars": 1760,
"preview": "name: 问题回报\ndescription: 遇到了BUG?!\ntitle: '[Bug]: '\nlabels: [ \"bug?\" ]\n\nbody:\n - type: markdown\n attributes:\n val"
},
{
"path": ".github/ISSUE_TEMPLATE/bug_report_en_US.yaml",
"chars": 2264,
"preview": "name: Bug report\ndescription: Encountered an bug?!\ntitle: '[Bug]: '\nlabels: [ \"bug?\" ]\n\nbody:\n - type: markdown\n att"
},
{
"path": ".github/ISSUE_TEMPLATE/config.yml",
"chars": 294,
"preview": "blank_issues_enabled: false\ncontact_links:\n - name: 讨论区 / Discussions\n url: https://github.com/svc-develop-team/so-v"
},
{
"path": ".github/ISSUE_TEMPLATE/default.md",
"chars": 268,
"preview": "---\nname: Default issue\nabout: 如果模板中没有你想发起的issue类型,可以选择此项,但这个issue也许会获得一个较低的处理优先级 / If there is no issue type you want t"
},
{
"path": ".github/workflows/reviewdog.yml",
"chars": 373,
"preview": "name: Ruff Autofix\non: [pull_request]\njobs:\n ruff:\n permissions:\n checks: write\n contents: read\n pull"
},
{
"path": ".github/workflows/ruff.yml",
"chars": 162,
"preview": "name: Ruff\non: [push, pull_request]\njobs:\n ruff:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v"
},
{
"path": ".gitignore",
"chars": 2508,
"preview": "\n# Created by https://www.toptal.com/developers/gitignore/api/python\n# Edit at https://www.toptal.com/developers/gitigno"
},
{
"path": ".ruff.toml",
"chars": 101,
"preview": "select = [\"E\", \"F\", \"I\"]\n\n# Never enforce `E501` (line length violations).\nignore = [\"E501\", \"E741\"]\n"
},
{
"path": "LICENSE",
"chars": 34523,
"preview": " GNU AFFERO GENERAL PUBLIC LICENSE\n Version 3, 19 November 2007\n\n Copyright (C)"
},
{
"path": "README.md",
"chars": 37104,
"preview": "<div align=\"center\">\r\n<img alt=\"LOGO\" src=\"https://avatars.githubusercontent.com/u/127122328?s=400&u=5395a98a4f945a3a50c"
},
{
"path": "README_zh_CN.md",
"chars": 21576,
"preview": "<div align=\"center\">\r\n<img alt=\"LOGO\" src=\"https://avatars.githubusercontent.com/u/127122328?s=400&u=5395a98a4f945a3a50c"
},
{
"path": "cluster/__init__.py",
"chars": 883,
"preview": "import torch\nfrom sklearn.cluster import KMeans\n\n\ndef get_cluster_model(ckpt_path):\n checkpoint = torch.load(ckpt_pat"
},
{
"path": "cluster/kmeans.py",
"chars": 7782,
"preview": "from time import time\r\n\r\nimport numpy as np\r\nimport pynvml\r\nimport torch\r\nfrom torch.nn.functional import normalize\r\n\r\n\r"
},
{
"path": "cluster/train_cluster.py",
"chars": 3029,
"preview": "import argparse\nimport logging\nimport os\nimport time\nfrom pathlib import Path\n\nimport numpy as np\nimport torch\nimport tq"
},
{
"path": "compress_model.py",
"chars": 2359,
"preview": "from collections import OrderedDict\r\n\r\nimport torch\r\n\r\nimport utils\r\nfrom models import SynthesizerTrn\r\n\r\n\r\ndef copyStat"
},
{
"path": "configs/diffusion.yaml",
"chars": 0,
"preview": ""
},
{
"path": "configs_template/config_template.json",
"chars": 1845,
"preview": "{\n \"train\": {\n \"log_interval\": 200,\n \"eval_interval\": 800,\n \"seed\": 1234,\n \"epochs\": 10000,\n \"learning_r"
},
{
"path": "configs_template/config_tiny_template.json",
"chars": 1843,
"preview": "{\n \"train\": {\n \"log_interval\": 200,\n \"eval_interval\": 800,\n \"seed\": 1234,\n \"epochs\": 10000,\n \"learning_r"
},
{
"path": "configs_template/diffusion_template.yaml",
"chars": 1638,
"preview": "data:\n sampling_rate: 44100\n block_size: 512 # Equal to hop_length\n duration: 2 # Audio duration during training, mus"
},
{
"path": "data_utils.py",
"chars": 7039,
"preview": "import os\nimport random\n\nimport numpy as np\nimport torch\nimport torch.utils.data\n\nimport utils\nfrom modules.mel_processi"
},
{
"path": "diffusion/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "diffusion/data_loaders.py",
"chars": 10916,
"preview": "import os\nimport random\n\nimport librosa\nimport numpy as np\nimport torch\nfrom torch.utils.data import Dataset\nfrom tqdm i"
},
{
"path": "diffusion/diffusion.py",
"chars": 17014,
"preview": "from collections import deque\nfrom functools import partial\nfrom inspect import isfunction\n\nimport numpy as np\nimport to"
},
{
"path": "diffusion/diffusion_onnx.py",
"chars": 24620,
"preview": "import math\nfrom collections import deque\nfrom functools import partial\nfrom inspect import isfunction\n\nimport numpy as "
},
{
"path": "diffusion/dpm_solver_pytorch.py",
"chars": 68246,
"preview": "import torch\n\n\nclass NoiseScheduleVP:\n def __init__(\n self,\n schedule='discrete',\n b"
},
{
"path": "diffusion/how to export onnx.md",
"chars": 201,
"preview": "- Open [onnx_export](onnx_export.py)\r\n- project_name = \"dddsp\" change \"project_name\" to your project name\r\n- model_path "
},
{
"path": "diffusion/infer_gt_mel.py",
"chars": 3020,
"preview": "import torch\nimport torch.nn.functional as F\n\nfrom diffusion.unit2mel import load_model_vocoder\n\n\nclass DiffGtMel:\n d"
},
{
"path": "diffusion/logger/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "diffusion/logger/saver.py",
"chars": 4091,
"preview": "'''\nauthor: wayn391@mastertones\n'''\n\nimport datetime\nimport os\nimport time\n\nimport matplotlib.pyplot as plt\nimport torch"
},
{
"path": "diffusion/logger/utils.py",
"chars": 3738,
"preview": "import json\nimport os\n\nimport torch\nimport yaml\n\n\ndef traverse_dir(\n root_dir,\n extensions,\n amount"
},
{
"path": "diffusion/onnx_export.py",
"chars": 9005,
"preview": "import os\r\n\r\nimport numpy as np\r\nimport torch\r\nimport torch.nn as nn\r\nimport torch.nn.functional as F\r\nimport yaml\r\nfrom"
},
{
"path": "diffusion/solver.py",
"chars": 7270,
"preview": "import time\n\nimport librosa\nimport numpy as np\nimport torch\nfrom torch import autocast\nfrom torch.cuda.amp import GradSc"
},
{
"path": "diffusion/uni_pc.py",
"chars": 32119,
"preview": "import math\n\nimport torch\n\n\nclass NoiseScheduleVP:\n def __init__(\n self,\n schedule='discrete',\n"
},
{
"path": "diffusion/unit2mel.py",
"chars": 6452,
"preview": "import os\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport yaml\n\nfrom .diffusion import GaussianDiffusion\nf"
},
{
"path": "diffusion/vocoder.py",
"chars": 3370,
"preview": "import torch\nfrom torchaudio.transforms import Resample\n\nfrom vdecoder.nsf_hifigan.models import load_config, load_model"
},
{
"path": "diffusion/wavenet.py",
"chars": 3794,
"preview": "import math\nfrom math import sqrt\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch.nn impo"
},
{
"path": "edgetts/tts.py",
"chars": 1524,
"preview": "import asyncio\r\nimport random\r\nimport sys\r\n\r\nimport edge_tts\r\nfrom edge_tts import VoicesManager\r\nfrom langdetect import"
},
{
"path": "edgetts/tts_voices.py",
"chars": 10773,
"preview": "#List of Supported Voices for edge_TTS\r\nSUPPORTED_VOICES = {\r\n 'zh-CN-XiaoxiaoNeural': 'zh-CN',\r\n 'zh-CN-XiaoyiNeu"
},
{
"path": "export_index_for_onnx.py",
"chars": 411,
"preview": "import os\nimport pickle\n\nimport faiss\n\npath = \"crs\"\nindexs_file_path = f\"checkpoints/{path}/feature_and_index.pkl\"\nindex"
},
{
"path": "flask_api.py",
"chars": 2274,
"preview": "import io\nimport logging\n\nimport soundfile\nimport torch\nimport torchaudio\nfrom flask import Flask, request, send_file\nfr"
},
{
"path": "flask_api_full_song.py",
"chars": 2070,
"preview": "import io\n\nimport numpy as np\nimport soundfile\nfrom flask import Flask, request, send_file\n\nfrom inference import infer_"
},
{
"path": "inference/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "inference/infer_tool.py",
"chars": 25169,
"preview": "import gc\nimport hashlib\nimport io\nimport json\nimport logging\nimport os\nimport pickle\nimport time\nfrom pathlib import Pa"
},
{
"path": "inference/infer_tool_grad.py",
"chars": 5574,
"preview": "import io\nimport logging\nimport os\n\nimport librosa\nimport numpy as np\nimport parselmouth\nimport soundfile\nimport torch\ni"
},
{
"path": "inference/slicer.py",
"chars": 6592,
"preview": "import librosa\nimport torch\nimport torchaudio\n\n\nclass Slicer:\n def __init__(self,\n sr: int,\n "
},
{
"path": "inference_main.py",
"chars": 7732,
"preview": "import logging\n\nimport soundfile\n\nfrom inference import infer_tool\nfrom inference.infer_tool import Svc\nfrom spkmix impo"
},
{
"path": "models.py",
"chars": 20466,
"preview": "import torch\nfrom torch import nn\nfrom torch.nn import Conv1d, Conv2d\nfrom torch.nn import functional as F\nfrom torch.nn"
},
{
"path": "modules/DSConv.py",
"chars": 3010,
"preview": "import torch.nn as nn\nfrom torch.nn.utils import remove_weight_norm, weight_norm\n\n\nclass Depthwise_Separable_Conv1D(nn.M"
},
{
"path": "modules/F0Predictor/CrepeF0Predictor.py",
"chars": 1392,
"preview": "import torch\n\nfrom modules.F0Predictor.crepe import CrepePitchExtractor\nfrom modules.F0Predictor.F0Predictor import F0Pr"
},
{
"path": "modules/F0Predictor/DioF0Predictor.py",
"chars": 2651,
"preview": "import numpy as np\nimport pyworld\n\nfrom modules.F0Predictor.F0Predictor import F0Predictor\n\n\nclass DioF0Predictor(F0Pred"
},
{
"path": "modules/F0Predictor/F0Predictor.py",
"chars": 421,
"preview": "class F0Predictor(object):\n def compute_f0(self,wav,p_len):\n '''\n input: wav:[signal_length]\n "
},
{
"path": "modules/F0Predictor/FCPEF0Predictor.py",
"chars": 4081,
"preview": "from typing import Union\r\n\r\nimport numpy as np\r\nimport torch\r\nimport torch.nn.functional as F\r\n\r\nfrom modules.F0Predicto"
},
{
"path": "modules/F0Predictor/HarvestF0Predictor.py",
"chars": 2510,
"preview": "import numpy as np\nimport pyworld\n\nfrom modules.F0Predictor.F0Predictor import F0Predictor\n\n\nclass HarvestF0Predictor(F0"
},
{
"path": "modules/F0Predictor/PMF0Predictor.py",
"chars": 2703,
"preview": "import numpy as np\nimport parselmouth\n\nfrom modules.F0Predictor.F0Predictor import F0Predictor\n\n\nclass PMF0Predictor(F0P"
},
{
"path": "modules/F0Predictor/RMVPEF0Predictor.py",
"chars": 3921,
"preview": "from typing import Union\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\n\nfrom modules.F0Predictor.F0Pr"
},
{
"path": "modules/F0Predictor/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "modules/F0Predictor/crepe.py",
"chars": 11345,
"preview": "from typing import Optional, Union\n\ntry:\n from typing import Literal\nexcept Exception:\n from typing_extensions imp"
},
{
"path": "modules/F0Predictor/fcpe/__init__.py",
"chars": 124,
"preview": "from .model import FCPEInfer # noqa: F401\r\nfrom .nvSTFT import STFT # noqa: F401\r\nfrom .pcmer import PCmer # noqa: F4"
},
{
"path": "modules/F0Predictor/fcpe/model.py",
"chars": 10524,
"preview": "import numpy as np\r\nimport torch\r\nimport torch.nn as nn\r\nimport torch.nn.functional as F\r\nfrom torch.nn.utils import wei"
},
{
"path": "modules/F0Predictor/fcpe/nvSTFT.py",
"chars": 5443,
"preview": "import os\n\nimport librosa\nimport numpy as np\nimport soundfile as sf\nimport torch\nimport torch.nn.functional as F\nimport "
},
{
"path": "modules/F0Predictor/fcpe/pcmer.py",
"chars": 13862,
"preview": "import math\nfrom functools import partial\n\nimport torch\nimport torch.nn.functional as F\nfrom einops import rearrange, re"
},
{
"path": "modules/F0Predictor/rmvpe/__init__.py",
"chars": 283,
"preview": "from .constants import * # noqa: F403\nfrom .inference import RMVPE # noqa: F401\nfrom .model import E2E, E2E0 # noqa: "
},
{
"path": "modules/F0Predictor/rmvpe/constants.py",
"chars": 139,
"preview": "SAMPLE_RATE = 16000\n\nN_CLASS = 360\n\nN_MELS = 128\nMEL_FMIN = 30\nMEL_FMAX = SAMPLE_RATE // 2\nWINDOW_LENGTH = 1024\nCONST = "
},
{
"path": "modules/F0Predictor/rmvpe/deepunet.py",
"chars": 7429,
"preview": "import torch\nimport torch.nn as nn\n\nfrom .constants import N_MELS\n\n\nclass ConvBlockRes(nn.Module):\n def __init__(self"
},
{
"path": "modules/F0Predictor/rmvpe/inference.py",
"chars": 2478,
"preview": "import torch\nimport torch.nn.functional as F\nfrom torchaudio.transforms import Resample\n\nfrom .constants import * # noq"
},
{
"path": "modules/F0Predictor/rmvpe/model.py",
"chars": 2538,
"preview": "from torch import nn\n\nfrom .constants import * # noqa: F403\nfrom .deepunet import DeepUnet, DeepUnet0\nfrom .seq import "
},
{
"path": "modules/F0Predictor/rmvpe/seq.py",
"chars": 648,
"preview": "import torch.nn as nn\n\n\nclass BiGRU(nn.Module):\n def __init__(self, input_features, hidden_features, num_layers):\n "
},
{
"path": "modules/F0Predictor/rmvpe/spec.py",
"chars": 2329,
"preview": "import numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom librosa.filters import mel\n\n\nclass MelSpectrogram(t"
},
{
"path": "modules/F0Predictor/rmvpe/utils.py",
"chars": 3716,
"preview": "import sys\nfrom functools import reduce\n\nimport librosa\nimport numpy as np\nimport torch\nfrom torch.nn.modules.module imp"
},
{
"path": "modules/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "modules/attentions.py",
"chars": 14178,
"preview": "import math\n\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nimport modules.commons as commons\nf"
},
{
"path": "modules/commons.py",
"chars": 5696,
"preview": "import math\n\nimport torch\nfrom torch.nn import functional as F\n\n\ndef slice_pitch_segments(x, ids_str, segment_size=4):\n "
},
{
"path": "modules/enhancer.py",
"chars": 4356,
"preview": "import numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom torchaudio.transforms import Resample\n\nfrom vdecode"
},
{
"path": "modules/losses.py",
"chars": 1276,
"preview": "import torch\n\n\ndef feature_loss(fmap_r, fmap_g):\n loss = 0\n for dr, dg in zip(fmap_r, fmap_g):\n for rl, gl in zip(d"
},
{
"path": "modules/mel_processing.py",
"chars": 2685,
"preview": "import torch\nimport torch.utils.data\nfrom librosa.filters import mel as librosa_mel_fn\n\nMAX_WAV_VALUE = 32768.0\n\n\ndef dy"
},
{
"path": "modules/modules.py",
"chars": 12197,
"preview": "import torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nimport modules.attentions as attentions\nimport m"
},
{
"path": "onnx_export.py",
"chars": 4206,
"preview": "import argparse\nimport json\n\nimport torch\n\nimport utils\nfrom onnxexport.model_onnx_speaker_mix import SynthesizerTrn\n\npa"
},
{
"path": "onnx_export_old.py",
"chars": 2181,
"preview": "import torch\n\nimport utils\nfrom onnxexport.model_onnx import SynthesizerTrn\n\n\ndef main(NetExport):\n path = \"SoVits4.0"
},
{
"path": "onnxexport/model_onnx.py",
"chars": 12211,
"preview": "import torch\nfrom torch import nn\nfrom torch.nn import Conv1d, Conv2d\nfrom torch.nn import functional as F\nfrom torch.nn"
},
{
"path": "onnxexport/model_onnx_speaker_mix.py",
"chars": 13830,
"preview": "import torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nimport modules.attentions as attentions\nimport m"
},
{
"path": "preprocess_flist_config.py",
"chars": 4870,
"preview": "import argparse\nimport json\nimport os\nimport re\nimport wave\nfrom random import shuffle\n\nfrom loguru import logger\nfrom t"
},
{
"path": "preprocess_hubert_f0.py",
"chars": 6870,
"preview": "import argparse\nimport logging\nimport os\nimport random\nfrom concurrent.futures import ProcessPoolExecutor\nfrom glob impo"
},
{
"path": "requirements.txt",
"chars": 336,
"preview": "ffmpeg-python\nFlask\nFlask_Cors\ngradio>=3.7.0\nnumpy==1.23.5\npyworld\nscipy==1.10.0\nSoundFile==0.12.1\ntorch\ntorchaudio\ntorc"
},
{
"path": "requirements_onnx_encoder.txt",
"chars": 373,
"preview": "Flask\r\nFlask_Cors\r\ngradio>=3.7.0\r\nnumpy==1.23.0\r\npyworld==0.2.5\r\nscipy==1.10.0\r\nSoundFile==0.12.1\r\ntorch==1.13.1\r\ntorcha"
},
{
"path": "requirements_win.txt",
"chars": 425,
"preview": "librosa==0.9.1\nfairseq==0.12.2\nffmpeg-python\nFlask==2.1.2\nFlask_Cors==3.0.10\ngradio>=3.7.0\nnumpy\nplaysound==1.3.0\nPyAudi"
},
{
"path": "resample.py",
"chars": 3332,
"preview": "import argparse\nimport concurrent.futures\nimport os\nfrom concurrent.futures import ProcessPoolExecutor\nfrom multiprocess"
},
{
"path": "sovits4_for_colab.ipynb",
"chars": 25240,
"preview": "{\n \"cells\": [\n {\n \"attachments\": {},\n \"cell_type\": \"markdown\",\n \"metadata\": {\n \"id\": \"2q0l56aFQhAM\"\n },\n \""
},
{
"path": "spkmix.py",
"chars": 447,
"preview": "# 角色混合轨道 编写规则:\n# 角色ID : [[起始时间1, 终止时间1, 起始数值1, 起始数值1], [起始时间2, 终止时间2, 起始数值2, 起始数值2]]\n# 起始时间和前一个的终止时间必须相同,第一个起始时间必须为0,最后一"
},
{
"path": "train.py",
"chars": 14210,
"preview": "import logging\nimport multiprocessing\nimport os\nimport time\n\nimport torch\nimport torch.distributed as dist\nimport torch."
},
{
"path": "train_diff.py",
"chars": 2592,
"preview": "import argparse\n\nimport torch\nfrom loguru import logger\nfrom torch.optim import lr_scheduler\n\nfrom diffusion.data_loader"
},
{
"path": "train_index.py",
"chars": 873,
"preview": "import argparse\nimport os\nimport pickle\n\nimport utils\n\nif __name__ == \"__main__\":\n parser = argparse.ArgumentParser()"
},
{
"path": "utils.py",
"chars": 20678,
"preview": "import argparse\nimport glob\nimport json\nimport logging\nimport os\nimport re\nimport subprocess\nimport sys\nimport traceback"
},
{
"path": "vdecoder/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "vdecoder/hifigan/env.py",
"chars": 394,
"preview": "import os\nimport shutil\n\n\nclass AttrDict(dict):\n def __init__(self, *args, **kwargs):\n super(AttrDict, self)._"
},
{
"path": "vdecoder/hifigan/models.py",
"chars": 21747,
"preview": "import json\nimport os\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch."
},
{
"path": "vdecoder/hifigan/nvSTFT.py",
"chars": 4420,
"preview": "import os\n\nimport librosa\nimport numpy as np\nimport soundfile as sf\nimport torch\nimport torch.utils.data\nfrom librosa.fi"
},
{
"path": "vdecoder/hifigan/utils.py",
"chars": 1866,
"preview": "import glob\nimport os\n\n# matplotlib.use(\"Agg\")\nimport matplotlib.pylab as plt\nimport torch\nfrom torch.nn.utils import we"
},
{
"path": "vdecoder/hifiganwithsnake/alias/__init__.py",
"chars": 242,
"preview": "# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0\n# LICENSE is in incl_licens"
},
{
"path": "vdecoder/hifiganwithsnake/alias/act.py",
"chars": 4470,
"preview": "# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0\n# LICENSE is in incl_licens"
},
{
"path": "vdecoder/hifiganwithsnake/alias/filter.py",
"chars": 4177,
"preview": "# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0\n# LICENSE is in incl_licens"
},
{
"path": "vdecoder/hifiganwithsnake/alias/resample.py",
"chars": 3226,
"preview": "# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0\n# LICENSE is in incl_licens"
},
{
"path": "vdecoder/hifiganwithsnake/env.py",
"chars": 394,
"preview": "import os\nimport shutil\n\n\nclass AttrDict(dict):\n def __init__(self, *args, **kwargs):\n super(AttrDict, self)._"
},
{
"path": "vdecoder/hifiganwithsnake/models.py",
"chars": 22720,
"preview": "import json\nimport os\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch."
},
{
"path": "vdecoder/hifiganwithsnake/nvSTFT.py",
"chars": 4420,
"preview": "import os\n\nimport librosa\nimport numpy as np\nimport soundfile as sf\nimport torch\nimport torch.utils.data\nfrom librosa.fi"
},
{
"path": "vdecoder/hifiganwithsnake/utils.py",
"chars": 1866,
"preview": "import glob\nimport os\n\n# matplotlib.use(\"Agg\")\nimport matplotlib.pylab as plt\nimport torch\nfrom torch.nn.utils import we"
},
{
"path": "vdecoder/nsf_hifigan/env.py",
"chars": 394,
"preview": "import os\nimport shutil\n\n\nclass AttrDict(dict):\n def __init__(self, *args, **kwargs):\n super(AttrDict, self)._"
},
{
"path": "vdecoder/nsf_hifigan/models.py",
"chars": 16384,
"preview": "import json\nimport os\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch."
},
{
"path": "vdecoder/nsf_hifigan/nvSTFT.py",
"chars": 5357,
"preview": "import os\n\nimport librosa\nimport numpy as np\nimport soundfile as sf\nimport torch\nimport torch.nn.functional as F\nimport "
},
{
"path": "vdecoder/nsf_hifigan/utils.py",
"chars": 1883,
"preview": "import glob\nimport os\n\nimport matplotlib\nimport matplotlib.pylab as plt\nimport torch\nfrom torch.nn.utils import weight_n"
},
{
"path": "vencoder/CNHubertLarge.py",
"chars": 1245,
"preview": "import torch\nfrom fairseq import checkpoint_utils\n\nfrom vencoder.encoder import SpeechEncoder\n\n\nclass CNHubertLarge(Spee"
},
{
"path": "vencoder/ContentVec256L12_Onnx.py",
"chars": 1175,
"preview": "import onnxruntime\nimport torch\n\nfrom vencoder.encoder import SpeechEncoder\n\n\nclass ContentVec256L12_Onnx(SpeechEncoder)"
},
{
"path": "vencoder/ContentVec256L9.py",
"chars": 1330,
"preview": "import torch\nfrom fairseq import checkpoint_utils\n\nfrom vencoder.encoder import SpeechEncoder\n\n\nclass ContentVec256L9(Sp"
},
{
"path": "vencoder/ContentVec256L9_Onnx.py",
"chars": 1242,
"preview": "import onnxruntime\nimport torch\n\nfrom vencoder.encoder import SpeechEncoder\n\n\nclass ContentVec256L9_Onnx(SpeechEncoder):"
},
{
"path": "vencoder/ContentVec768L12.py",
"chars": 1284,
"preview": "import torch\nfrom fairseq import checkpoint_utils\n\nfrom vencoder.encoder import SpeechEncoder\n\n\nclass ContentVec768L12(S"
},
{
"path": "vencoder/ContentVec768L12_Onnx.py",
"chars": 1187,
"preview": "import onnxruntime\nimport torch\n\nfrom vencoder.encoder import SpeechEncoder\n\n\nclass ContentVec768L12_Onnx(SpeechEncoder)"
},
{
"path": "vencoder/ContentVec768L9_Onnx.py",
"chars": 1185,
"preview": "import onnxruntime\nimport torch\n\nfrom vencoder.encoder import SpeechEncoder\n\n\nclass ContentVec768L9_Onnx(SpeechEncoder):"
},
{
"path": "vencoder/DPHubert.py",
"chars": 1041,
"preview": "import torch\n\nfrom vencoder.dphubert.model import wav2vec2_model\nfrom vencoder.encoder import SpeechEncoder\n\n\nclass DPHu"
},
{
"path": "vencoder/HubertSoft.py",
"chars": 977,
"preview": "import torch\n\nfrom vencoder.encoder import SpeechEncoder\nfrom vencoder.hubert import hubert_model\n\n\nclass HubertSoft(Spe"
},
{
"path": "vencoder/HubertSoft_Onnx.py",
"chars": 1176,
"preview": "import onnxruntime\nimport torch\n\nfrom vencoder.encoder import SpeechEncoder\n\n\nclass HubertSoft_Onnx(SpeechEncoder):\n "
},
{
"path": "vencoder/WavLMBasePlus.py",
"chars": 1215,
"preview": "import torch\n\nfrom vencoder.encoder import SpeechEncoder\nfrom vencoder.wavlm.WavLM import WavLM, WavLMConfig\n\n\nclass Wav"
},
{
"path": "vencoder/WhisperPPG.py",
"chars": 1198,
"preview": "import torch\n\nfrom vencoder.encoder import SpeechEncoder\nfrom vencoder.whisper.audio import log_mel_spectrogram, pad_or_"
},
{
"path": "vencoder/WhisperPPGLarge.py",
"chars": 1236,
"preview": "import torch\r\n\r\nfrom vencoder.encoder import SpeechEncoder\r\nfrom vencoder.whisper.audio import log_mel_spectrogram, pad_"
},
{
"path": "vencoder/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "vencoder/dphubert/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "vencoder/dphubert/components.py",
"chars": 58641,
"preview": "\"\"\"Building blocks for speech SSL models supporting pruning.\n\nOriginally from:\nhttps://github.com/pytorch/audio/blob/mai"
},
{
"path": "vencoder/dphubert/hardconcrete.py",
"chars": 4060,
"preview": "\"\"\"Implementation of the hard Concrete distribution.\n\nOriginally from:\nhttps://github.com/asappresearch/flop/blob/master"
},
{
"path": "vencoder/dphubert/model.py",
"chars": 38745,
"preview": "\"\"\"Speech SSL models supporting pruning.\n\nOriginally from:\nhttps://github.com/pytorch/audio/blob/main/torchaudio/models/"
},
{
"path": "vencoder/dphubert/pruning_utils.py",
"chars": 1869,
"preview": "\"\"\"Utility functions for pruning.\"\"\"\n\nfrom typing import Union\n\nimport torch\nimport torch.nn as nn\n\n\ndef prune_linear_la"
},
{
"path": "vencoder/dphubert/utils/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "vencoder/dphubert/utils/import_huggingface_wavlm.py",
"chars": 5564,
"preview": "\"\"\"Import Hugging Face transformers's wav2vec2.0 pretrained weights to torchaudios's format.\n\nOriginally from:\nhttps://g"
},
{
"path": "vencoder/encoder.py",
"chars": 364,
"preview": "class SpeechEncoder(object):\n def __init__(self, vec_path=\"pretrain/checkpoint_best_legacy_500.pt\", device=None):\n "
},
{
"path": "vencoder/hubert/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "vencoder/hubert/hubert_model.py",
"chars": 7329,
"preview": "import copy\nimport random\nfrom typing import Optional, Tuple\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functio"
},
{
"path": "vencoder/hubert/hubert_model_onnx.py",
"chars": 7160,
"preview": "import copy\nimport random\nfrom typing import Optional, Tuple\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functio"
},
{
"path": "vencoder/wavlm/WavLM.py",
"chars": 27546,
"preview": "# --------------------------------------------------------\n# WavLM: Large-Scale Self-Supervised Pre-training for Full "
},
{
"path": "vencoder/wavlm/modules.py",
"chars": 31947,
"preview": "# --------------------------------------------------------\n# WavLM: Large-Scale Self-Supervised Pre-training for Full "
},
{
"path": "vencoder/whisper/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "vencoder/whisper/audio.py",
"chars": 4062,
"preview": "from functools import lru_cache\nfrom typing import Union\n\nimport ffmpeg\nimport numpy as np\nimport torch\nimport torch.nn."
},
{
"path": "vencoder/whisper/decoding.py",
"chars": 30156,
"preview": "from dataclasses import dataclass, field\nfrom typing import TYPE_CHECKING, Dict, Iterable, List, Optional, Sequence, Tup"
},
{
"path": "vencoder/whisper/model.py",
"chars": 9661,
"preview": "from dataclasses import dataclass\nfrom typing import Dict, Iterable, Optional\n\nimport numpy as np\nimport torch\nimport to"
},
{
"path": "vencoder/whisper/tokenizer.py",
"chars": 9798,
"preview": "import os\nfrom dataclasses import dataclass\nfrom functools import lru_cache\nfrom typing import List, Optional, Tuple, Un"
},
{
"path": "vencoder/whisper/utils.py",
"chars": 5226,
"preview": "import json\nimport os\nimport sys\nimport zlib\nfrom typing import Callable, TextIO\n\nsystem_encoding = sys.getdefaultencodi"
},
{
"path": "wav_upload.py",
"chars": 850,
"preview": "import argparse\nimport os\nimport shutil\n\nfrom google.colab import files\n\nif __name__ == \"__main__\":\n parser = argpars"
},
{
"path": "webUI.py",
"chars": 21486,
"preview": "import glob\nimport json\nimport logging\nimport os\nimport re\nimport subprocess\nimport sys\nimport time\nimport traceback\nfro"
}
]
About this extraction
This page contains the full source code of the svc-develop-team/so-vits-svc GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 147 files (995.4 KB), approximately 266.6k tokens, and a symbol index with 1241 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.