Repository: fishaudio/fish-speech
Branch: main
Commit: 49985a34a704
Files: 153
Total size: 676.6 KB
Directory structure:
gitextract_ft5t8lt3/
├── .dockerignore
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── bug_report.yml
│ │ ├── config.yml
│ │ └── feature_request.yml
│ ├── pull_request_template.md
│ └── workflows/
│ ├── build-docker-image.yml
│ ├── docs.yml
│ └── stale.yml
├── .gitignore
├── .pre-commit-config.yaml
├── .project-root
├── .readthedocs.yaml
├── API_FLAGS.txt
├── LICENSE
├── README.md
├── awesome_webui/
│ ├── .gitignore
│ ├── README.md
│ ├── eslint.config.js
│ ├── index.html
│ ├── package.json
│ ├── src/
│ │ ├── App.tsx
│ │ ├── components/
│ │ │ └── ui/
│ │ │ ├── alert.tsx
│ │ │ ├── badge.tsx
│ │ │ ├── button.tsx
│ │ │ ├── card.tsx
│ │ │ ├── collapsible.tsx
│ │ │ ├── dialog.tsx
│ │ │ ├── label.tsx
│ │ │ ├── scroll-area.tsx
│ │ │ ├── separator.tsx
│ │ │ ├── slider.tsx
│ │ │ ├── switch.tsx
│ │ │ ├── textarea.tsx
│ │ │ └── toggle-group.tsx
│ │ ├── index.css
│ │ └── main.tsx
│ ├── tsconfig.app.json
│ ├── tsconfig.json
│ ├── tsconfig.node.json
│ └── vite.config.ts
├── compose.base.yml
├── compose.yml
├── docker/
│ └── Dockerfile
├── dockerfile.dev
├── docs/
│ ├── CNAME
│ ├── README.ar.md
│ ├── README.ja.md
│ ├── README.ko.md
│ ├── README.pt-BR.md
│ ├── README.zh.md
│ ├── ar/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ └── install.md
│ ├── en/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ ├── install.md
│ │ └── server.md
│ ├── ja/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ └── install.md
│ ├── ko/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ └── install.md
│ ├── pt/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ └── install.md
│ ├── requirements.txt
│ ├── stylesheets/
│ │ └── extra.css
│ └── zh/
│ ├── finetune.md
│ ├── index.md
│ ├── inference.md
│ └── install.md
├── entrypoint.sh
├── fish_speech/
│ ├── callbacks/
│ │ ├── __init__.py
│ │ └── grad_norm.py
│ ├── configs/
│ │ ├── base.yaml
│ │ ├── lora/
│ │ │ └── r_8_alpha_16.yaml
│ │ ├── modded_dac_vq.yaml
│ │ └── text2semantic_finetune.yaml
│ ├── content_sequence.py
│ ├── conversation.py
│ ├── datasets/
│ │ ├── concat_repeat.py
│ │ ├── protos/
│ │ │ ├── text-data.proto
│ │ │ ├── text_data_pb2.py
│ │ │ └── text_data_stream.py
│ │ ├── semantic.py
│ │ └── vqgan.py
│ ├── i18n/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── core.py
│ │ ├── locale/
│ │ │ ├── en_US.json
│ │ │ ├── es_ES.json
│ │ │ ├── ja_JP.json
│ │ │ ├── ko_KR.json
│ │ │ ├── pt_BR.json
│ │ │ └── zh_CN.json
│ │ └── scan.py
│ ├── inference_engine/
│ │ ├── __init__.py
│ │ ├── reference_loader.py
│ │ ├── utils.py
│ │ └── vq_manager.py
│ ├── models/
│ │ ├── dac/
│ │ │ ├── __init__.py
│ │ │ ├── inference.py
│ │ │ ├── modded_dac.py
│ │ │ └── rvq.py
│ │ └── text2semantic/
│ │ ├── __init__.py
│ │ ├── inference.py
│ │ ├── lit_module.py
│ │ ├── llama.py
│ │ └── lora.py
│ ├── scheduler.py
│ ├── text/
│ │ ├── __init__.py
│ │ └── clean.py
│ ├── tokenizer.py
│ ├── train.py
│ └── utils/
│ ├── __init__.py
│ ├── braceexpand.py
│ ├── context.py
│ ├── file.py
│ ├── instantiators.py
│ ├── logger.py
│ ├── logging_utils.py
│ ├── rich_utils.py
│ ├── schema.py
│ ├── spectrogram.py
│ └── utils.py
├── inference.ipynb
├── mkdocs.yml
├── pyproject.toml
├── pyrightconfig.json
└── tools/
├── api_client.py
├── api_server.py
├── llama/
│ ├── build_dataset.py
│ ├── eval_in_context.py
│ ├── merge_lora.py
│ └── quantize.py
├── run_webui.py
├── server/
│ ├── api_utils.py
│ ├── exception_handler.py
│ ├── inference.py
│ ├── model_manager.py
│ ├── model_utils.py
│ └── views.py
├── vqgan/
│ ├── create_train_split.py
│ └── extract_vq.py
└── webui/
├── __init__.py
├── inference.py
└── variables.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .dockerignore
================================================
# .dockerignore
# Git and version control
.git
.gitignore
.gitattributes
.gitmodules
# IDE and editor files
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store
Thumbs.db
# Python cache and build artifacts
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# Virtual environments
venv/
env/
ENV/
.venv/
.env/
# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/
.nox/
coverage.xml
*.cover
.hypothesis/
# Jupyter Notebook
.ipynb_checkpoints
*.ipynb
# Logs
*.log
logs/
# Temporary files
tmp/
temp/
*.tmp
*.temp
# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db
# Docker files (except the one being used)
docker/
Dockerfile*
docker-compose*.yml
.dockerignore
# Checkpoints and models (should be mounted)
checkpoints/
models/
*.pth
*.ckpt
*.safetensors
*.bin
# Reference voices (should be mounted)
references/
# Generated audio files
*.wav
*.mp3
*.flac
*.ogg
generated_audio.wav
fake.wav
fake.npy
# Cache directories
.cache/
cache/
.uv_cache/
# Development files
.env
.env.local
.env.development
.env.test
.env.production
# Test files
test_*.py
*_test.py
tests/
# CI/CD
.github/
.gitlab-ci.yml
.travis.yml
.circleci/
azure-pipelines.yml
# Monitoring and profiling
.prof
*.prof
# Backup files
*.bak
*.backup
*.old
# Large data files
*.csv
*.jsonl
*.parquet
*.h5
*.hdf5
# Audio processing temporary files
*.tmp.wav
*.temp.wav
# OLD:
# .github
# results
# data
# *.filelist
# /data_server/target
# checkpoints
# .venv
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.yml
================================================
name: "🕷️ Bug report"
description: |
Please follow this template carefully to ensure we can address your issue quickly.
Make sure to provide as much detail as possible, including logs and screenshots.
labels:
- bug
body:
- type: checkboxes
attributes:
label: Self Checks
description: "To ensure timely help, please confirm the following:"
options:
- label: This template is only for bug reports. For questions, please visit [Discussions](https://github.com/fishaudio/fish-speech/discussions).
required: true
- label: I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem. [English](https://speech.fish.audio/) [中文](https://speech.fish.audio/zh/) [日本語](https://speech.fish.audio/ja/) [Portuguese (Brazil)](https://speech.fish.audio/pt/)
required: true
- label: I have searched for existing issues, including closed ones. [Search issues](https://github.com/fishaudio/fish-speech/issues)
required: true
- label: I confirm that I am using English to submit this report (我已阅读并同意 [Language Policy](https://github.com/fishaudio/fish-speech/issues/515)).
required: true
- label: "[FOR CHINESE USERS] 请务必使用英文提交 Issue,否则会被关闭。谢谢!:)"
required: true
- label: "Please do not modify this template and fill in all required fields."
required: true
- type: dropdown
attributes:
label: Cloud or Self Hosted
multiple: true
options:
- Cloud
- Self Hosted (Docker)
- Self Hosted (Source)
validations:
required: true
- type: textarea
attributes:
label: Environment Details
description: "Provide details such as OS, Python version, and any relevant software or dependencies."
placeholder: e.g., macOS 13.5, Python 3.10, torch==2.4.1, Gradio 4.44.0
validations:
required: true
- type: textarea
attributes:
label: Steps to Reproduce
description: |
Include detailed steps, screenshots, and logs. Use the correct markdown syntax for code blocks.
placeholder: |
1. Run the command `python -m tools.api_client -t "xxxxx"`
2. Observe the console output error: `ModuleNotFoundError: No module named 'pyaudio'` (screenshots or logs are helpful)
validations:
required: true
- type: textarea
attributes:
label: ✔️ Expected Behavior
placeholder: Describe what you expected to happen.
validations:
required: false
- type: textarea
attributes:
label: ❌ Actual Behavior
placeholder: Describe what actually happened.
validations:
required: false
================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: false
contact_links:
- name: "\U0001F4E7 Discussions"
url: https://github.com/fishaudio/fish-speech/discussions
about: General discussion and requests for help from the community
================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.yml
================================================
name: "⭐ Feature or enhancement request"
description: Propose something new.
labels:
- enhancement
body:
- type: checkboxes
attributes:
label: Self Checks
description: "To make sure we get to you in time, please check the following :)"
options:
- label: I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find any relevant information that meets my needs. [English](https://speech.fish.audio/) [中文](https://speech.fish.audio/zh/) [日本語](https://speech.fish.audio/ja/) [Portuguese (Brazil)](https://speech.fish.audio/pt/)
required: true
- label: I have searched for existing issues, including closed ones. [Search issues](https://github.com/fishaudio/fish-speech/issues)
required: true
- label: I confirm that I am using English to submit this report (我已阅读并同意 [Language Policy](https://github.com/fishaudio/fish-speech/issues/515)).
required: true
- label: "[FOR CHINESE USERS] 请务必使用英文提交 Issue,否则会被关闭。谢谢!:)"
required: true
- label: "Please do not modify this template :) and fill in all the required fields."
required: true
- type: textarea
attributes:
label: 1. Is this request related to a challenge you're experiencing? Tell us your story.
description: |
Describe the specific problem or scenario you’re facing in detail. For example:
*"I was trying to use [feature] for [specific task], but encountered [issue]. This was frustrating because...."*
placeholder: Please describe the situation in as much detail as possible.
validations:
required: true
- type: textarea
attributes:
label: 2. What is your suggested solution?
description: |
Provide a clear description of the feature or enhancement you'd like to propose.
How would this feature solve your issue or improve the project?
placeholder: Describe your idea or proposed solution here.
validations:
required: true
- type: textarea
attributes:
label: 3. Additional context or comments
description: |
Any other relevant information, links, documents, or screenshots that provide clarity.
Use this section for anything not covered above.
placeholder: Add any extra details here.
validations:
required: false
- type: checkboxes
attributes:
label: 4. Can you help us with this feature?
description: |
Let us know if you're interested in contributing. This is not a commitment but a way to express interest in collaboration.
options:
- label: I am interested in contributing to this feature.
required: false
- type: markdown
attributes:
value: |
**Note:** Please submit only one request per issue to keep discussions focused and manageable.
================================================
FILE: .github/pull_request_template.md
================================================
**Does this PR add a new feature or fix a bug?**
Add feature / Fix bug.
**Is this pull request related to any issue? If yes, please link the issue.**
#xxx
================================================
FILE: .github/workflows/build-docker-image.yml
================================================
name: Build Docker Images
on:
push:
branches:
- main
tags:
- "v*"
jobs:
build:
runs-on: ubuntu-latest-16c64g
strategy:
matrix:
target: [webui, server]
backend: [cuda, cpu]
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Get Version
run: |
if [[ $GITHUB_REF == refs/tags/v* ]]; then
version=$(basename ${GITHUB_REF})
else
version=nightly
fi
echo "version=${version}" >> $GITHUB_ENV
echo "Current version: ${version}"
- name: Login to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USER }}
password: ${{ secrets.DOCKER_PAT }}
- name: Set platform for CPU builds
id: platform
run: |
if [ "${{ matrix.backend }}" = "cpu" ]; then
echo "platforms=linux/amd64,linux/arm64" >> $GITHUB_OUTPUT
else
echo "platforms=linux/amd64" >> $GITHUB_OUTPUT
fi
- name: Build and Push ${{ matrix.target }}-${{ matrix.backend }} Image
uses: docker/build-push-action@v6
with:
context: .
file: docker/Dockerfile
platforms: ${{ steps.platform.outputs.platforms }}
push: true
target: ${{ matrix.target }}
build-args: |
BACKEND=${{ matrix.backend }}
UV_EXTRA=${{ matrix.backend == 'cuda' && 'cu126' || 'cpu' }}
tags: |
fishaudio/fish-speech:${{ matrix.target }}-${{ matrix.backend }}-${{ env.version }}
fishaudio/fish-speech:${{ matrix.target }}-${{ matrix.backend }}
${{ (matrix.target == 'webui' && matrix.backend == 'cuda') && format('fishaudio/fish-speech:{0}', env.version) || '' }}
${{ (matrix.target == 'webui' && matrix.backend == 'cuda') && 'fishaudio/fish-speech:latest' || '' }}
outputs: type=image,oci-mediatypes=true,compression=zstd,compression-level=3,force-compression=true
cache-from: type=registry,ref=fishaudio/fish-speech:${{ matrix.target }}-${{ matrix.backend }}
cache-to: type=inline
update-readme:
runs-on: ubuntu-latest
needs: build
if: github.ref == 'refs/heads/main'
steps:
- name: Push README to Dockerhub
uses: peter-evans/dockerhub-description@v4
with:
username: ${{ secrets.DOCKER_USER }}
password: ${{ secrets.DOCKER_PAT }}
repository: fishaudio/fish-speech
================================================
FILE: .github/workflows/docs.yml
================================================
name: docs
on:
push:
branches:
- main
paths:
- 'docs/**'
- 'mkdocs.yml'
permissions:
contents: write
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Configure Git Credentials
run: |
git config user.name github-actions[bot]
git config user.email 41898282+github-actions[bot]@users.noreply.github.com
- uses: actions/setup-python@v5
with:
python-version: 3.x
- run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
- uses: actions/cache@v4
with:
key: mkdocs-material-${{ env.cache_id }}
path: .cache
restore-keys: |
mkdocs-material-
- run: pip install -r docs/requirements.txt
- run: mkdocs gh-deploy --force
================================================
FILE: .github/workflows/stale.yml
================================================
name: Close inactive issues
on:
schedule:
- cron: "0 0 * * *"
jobs:
close-issues:
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@v9
with:
days-before-issue-stale: 30
days-before-issue-close: 14
stale-issue-label: "stale"
stale-issue-message: "This issue is stale because it has been open for 30 days with no activity."
close-issue-message: "This issue was closed because it has been inactive for 14 days since being marked as stale."
days-before-pr-stale: 30
days-before-pr-close: 30
stale-pr-label: "stale"
stale-pr-message: "This PR is stale because it has been open for 30 days with no activity."
close-pr-message: "This PR was closed because it has been inactive for 30 days since being marked as stale."
repo-token: ${{ secrets.GITHUB_TOKEN }}
================================================
FILE: .gitignore
================================================
# =============================================================================
# Fish Speech - .gitignore
# =============================================================================
# Operating System Files
# -----------------------
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db
# IDEs and Editors
# ----------------
.vscode/
.idea/
*.swp
*.swo
*~
# Python
# ------
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# Virtual Environments
# --------------------
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
/fishenv/
# Project Dependencies
# --------------------
.pdm-python
/fish_speech.egg-info
# Data and Model Files
# --------------------
data/
results/
checkpoints/
references/
demo-audios/
example/
filelists/
*.filelist
# Audio Files
# -----------
*.wav
*.mp3
*.flac
*.ogg
*.m4a
# Data Files
# ----------
*.npy
*.npz
*.pkl
*.pickle
*.lab
/fish_speech/text/cmudict_cache.pickle
# Cache and Temporary Files
# --------------------------
/.cache/
/.gradio/
/.locale/
.pgx.*
*log
*.log
site/
# External Tools
# --------------
ffmpeg.exe
ffprobe.exe
/faster_whisper/
# Server Related
# --------------
/data_server/target/
# Test Files
# ----------
/*.test.sh
asr-label*
================================================
FILE: .pre-commit-config.yaml
================================================
ci:
autoupdate_schedule: monthly
repos:
- repo: https://github.com/pycqa/isort
rev: 8.0.1
hooks:
- id: isort
args: [--profile=black]
- repo: https://github.com/psf/black-pre-commit-mirror
rev: 26.1.0
hooks:
- id: black
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v6.0.0
hooks:
- id: end-of-file-fixer
- id: check-yaml
- id: check-json
- id: mixed-line-ending
args: ["--fix=lf"]
- id: check-added-large-files
args: ["--maxkb=5000"]
================================================
FILE: .project-root
================================================
================================================
FILE: .readthedocs.yaml
================================================
# Read the Docs configuration file for MkDocs projects
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
# Required
version: 2
# Set the version of Python and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.12"
mkdocs:
configuration: mkdocs.yml
# Optionally declare the Python requirements required to build your docs
python:
install:
- requirements: docs/requirements.txt
================================================
FILE: API_FLAGS.txt
================================================
# --infer
--api
--listen 0.0.0.0:8080 \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
================================================
FILE: LICENSE
================================================
# FISH AUDIO RESEARCH LICENSE AGREEMENT
**Last Updated: March 7, 2026**
## I. INTRODUCTION
This Agreement applies to any individual person or entity ("You", "Your" or "Licensee") that uses or distributes any portion or element of the Fish Audio Materials or Derivative Works thereof for any Research, Non-Commercial, or Commercial purpose. Capitalized terms not otherwise defined herein are defined in Section V below.
This Agreement is intended to allow research and non-commercial uses of the Materials free of charge. Any Commercial use of the Materials requires a separate license from Fish Audio.
By clicking "I Accept" or by using, distributing, or accessing any portion or element of the Fish Audio Materials or Derivative Works, You agree that You have read, understood and are bound by the terms of this Agreement. If You are acting on behalf of a company, organization or other entity, then "You" includes you and that entity, and You agree that You: (i) are an authorized representative of such entity with the authority to bind such entity to this Agreement, and (ii) You agree to the terms of this Agreement on that entity's behalf.
## II. RESEARCH & NON-COMMERCIAL USE LICENSE
Subject to the terms of this Agreement, Fish Audio grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under Fish Audio's intellectual property or other rights owned by Fish Audio embodied in the Fish Audio Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the Fish Audio Materials for any Research or Non-Commercial Purpose.
"Research Purpose" means academic or scientific advancement, and in each case, is not primarily intended for commercial advantage or monetary compensation to You or others.
"Non-Commercial Purpose" means any purpose other than a Research Purpose that is not primarily intended for commercial advantage or monetary compensation to You or others, such as personal use (i.e., hobbyist) or evaluation and testing.
## III. COMMERCIAL USE
**Any use of the Fish Audio Materials or Derivative Works for a Commercial Purpose requires a separate written license agreement from Fish Audio.** No commercial rights are granted under this Agreement.
"Commercial Purpose" means any purpose other than a Research Purpose or Non-Commercial Purpose that is primarily intended for or directed toward commercial advantage or monetary compensation to You or others, including but not limited to: (i) creating, modifying, or distributing Your product or service, including via a hosted service or application programming interface, (ii) Your business's or organization's internal operations, and (iii) any use in connection with a product or service for which You charge a fee or generate revenue, whether directly or indirectly.
To obtain a commercial license, please contact Fish Audio at:
- **Website:** [https://fish.audio](https://fish.audio)
- **Email:** business@fish.audio
## IV. GENERAL TERMS
Your Research and Non-Commercial License under this Agreement is subject to the following terms.
### a. Distribution & Attribution
If You distribute or make available the Fish Audio Materials or a Derivative Work to a third party, or a product or service that uses any portion of them, You shall: (i) provide a copy of this Agreement to that third party, (ii) retain the following attribution notice within a "Notice" text file distributed as a part of such copies: "This model is licensed under the Fish Audio Research License, Copyright © 39 AI, INC. All Rights Reserved.", and (iii) prominently display "Built with Fish Audio" on a related website, user interface, blogpost, about page, or product documentation.
If You create a Derivative Work, You may add your own attribution notice(s) to the "Notice" text file included with that Derivative Work, provided that You clearly indicate which attributions apply to the Fish Audio Materials and state in the "Notice" text file that You changed the Fish Audio Materials and how it was modified.
### b. Use Restrictions
Your use of the Fish Audio Materials and Derivative Works, including any output or results of the Fish Audio Materials or Derivative Works, must comply with applicable laws and regulations (including Trade Control Laws and equivalent regulations) and adhere to Fish Audio's Acceptable Use Policy, which is hereby incorporated by reference.
Furthermore, You will not use the Fish Audio Materials or Derivative Works, or any output or results of the Fish Audio Materials or Derivative Works, to create or improve any foundational generative AI model (excluding the Models or Derivative Works).
### c. Intellectual Property
**(i) Trademark License.** No trademark licenses are granted under this Agreement, and in connection with the Fish Audio Materials or Derivative Works, You may not use any name or mark owned by or associated with Fish Audio or any of its Affiliates, except as required under Section IV(a) herein.
**(ii) Ownership of Derivative Works.** As between You and Fish Audio, You are the owner of Derivative Works You create, subject to Fish Audio's ownership of the Fish Audio Materials and any Derivative Works made by or for Fish Audio.
**(iii) Ownership of Outputs.** As between You and Fish Audio, You own any outputs generated from the Models or Derivative Works to the extent permitted by applicable law.
**(iv) Disputes.** If You or Your Affiliate(s) institute litigation or other proceedings against Fish Audio (including a cross-claim or counterclaim in a lawsuit) alleging that the Fish Audio Materials, Derivative Works or associated outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by You, then any licenses granted to You under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Fish Audio from and against any claim by any third party arising out of or related to Your use or distribution of the Fish Audio Materials or Derivative Works in violation of this Agreement.
**(v) Feedback.** From time to time, You may provide Fish Audio with verbal and/or written suggestions, comments or other feedback related to Fish Audio's existing or prospective technology, products or services (collectively, "Feedback"). You are not obligated to provide Fish Audio with Feedback, but to the extent that You do, You hereby grant Fish Audio a perpetual, irrevocable, royalty-free, fully-paid, sub-licensable, transferable, non-exclusive, worldwide right and license to exploit the Feedback in any manner without restriction. Your Feedback is provided "AS IS" and You make no warranties whatsoever about any Feedback.
### d. Disclaimer of Warranty
UNLESS REQUIRED BY APPLICABLE LAW, THE FISH AUDIO MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OR LAWFULNESS OF USING OR REDISTRIBUTING THE FISH AUDIO MATERIALS, DERIVATIVE WORKS OR ANY OUTPUT OR RESULTS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE FISH AUDIO MATERIALS, DERIVATIVE WORKS AND ANY OUTPUT AND RESULTS.
### e. Limitation of Liability
IN NO EVENT WILL FISH AUDIO OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF FISH AUDIO OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
### f. Term and Termination
The term of this Agreement will commence upon Your acceptance of this Agreement or access to the Fish Audio Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Fish Audio may terminate this Agreement if You are in breach of any term or condition of this Agreement. Upon termination of this Agreement, You shall delete and cease use of any Fish Audio Materials or Derivative Works. Sections IV(d), (e), and (g) shall survive the termination of this Agreement.
### g. Governing Law
This Agreement will be governed by and construed in accordance with the laws of the United States and the State of California without regard to choice of law principles, and the UN Convention on Contracts for International Sale of Goods does not apply to this Agreement.
## V. DEFINITIONS
**"Affiliate(s)"** means any entity that directly or indirectly controls, is controlled by, or is under common control with the subject entity; for purposes of this definition, "control" means direct or indirect ownership or control of more than 50% of the voting interests of the subject entity.
**"Agreement"** means this Fish Audio Research License Agreement.
**"Derivative Work(s)"** means (a) any derivative work of the Fish Audio Materials as recognized by U.S. copyright laws and (b) any modifications to a Model, and any other model created which is based on or derived from the Model or the Model's output, including "fine tune" and "low-rank adaptation" models derived from a Model or a Model's output, but do not include the output of any Model.
**"Documentation"** means any specifications, manuals, documentation, and other written information provided by Fish Audio related to the Software or Models.
**"Fish Audio"** or **"we"** means 39 AI, INC. and its Affiliates.
**"Model(s)"** means, collectively, Fish Audio's proprietary models and algorithms, including machine-learning models, trained model weights and other elements of the foregoing.
**"Software"** means Fish Audio's proprietary software made available under this Agreement now or in the future.
**"Fish Audio Materials"** means, collectively, Fish Audio's proprietary Models, Software and Documentation (and any portion or combination thereof) made available under this Agreement.
**"Trade Control Laws"** means any applicable U.S. and non-U.S. export control and trade sanctions laws and regulations.
================================================
FILE: README.md
================================================
<div align="center">
<h1>Fish Speech</h1>
**English** | [简体中文](docs/README.zh.md) | [Portuguese](docs/README.pt-BR.md) | [日本語](docs/README.ja.md) | [한국어](docs/README.ko.md) | [العربية](docs/README.ar.md) <br>
<a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish-audio-s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish Audio S1 - Expressive Voice Cloning and Text-to-Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
<a href="https://trendshift.io/repositories/7014" target="_blank">
<img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
</a>
<br>
</div>
<br>
<div align="center">
<img src="https://count.getloli.com/get/@fish-speech?theme=asoul" /><br>
</div>
<br>
<div align="center">
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
<img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
</a>
<a target="_blank" href="https://hub.docker.com/r/fishaudio/fish-speech">
<img alt="Docker" src="https://img.shields.io/docker/pulls/fishaudio/fish-speech?style=flat-square&logo=docker"/>
</a>
<a target="_blank" href="https://pd.qq.com/s/bwxia254o">
<img alt="QQ Channel" src="https://img.shields.io/badge/QQ-blue?logo=tencentqq">
</a>
</div>
<div align="center">
<a target="_blank" href="https://huggingface.co/fishaudio/s2">
<img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
</a>
<a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
<img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
</a>
<a target="_blank" href="https://arxiv.org/abs/2603.08823">
<img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
</a>
</div>
> [!IMPORTANT]
> **License Notice**
> This codebase and its associated model weights are released under **[FISH AUDIO RESEARCH LICENSE](LICENSE)**. Please refer to [LICENSE](LICENSE) for more details. We will take action against any violation of the license.
> [!WARNING]
> **Legal Disclaimer**
> We assume no responsibility for any illegal use of this codebase. Please consult your local laws regarding the DMCA and related legislation.
## Quick Start
### For Humans
Here is the official documentation for Fish Audio S2. Follow the instructions to get started.
- [Installation](https://speech.fish.audio/install/)
- [Command Line Inference](https://speech.fish.audio/inference/#command-line-inference)
- [WebUI Inference](https://speech.fish.audio/inference/#webui-inference)
- [Server Inference](https://speech.fish.audio/server/)
- [Docker Setup](https://speech.fish.audio/install/#docker-setup)
> [!IMPORTANT]
> **For SGLang server, please read [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).**
### For LLM Agent
```
Install and configure Fish-Audio S2 by following the instructions here: https://speech.fish.audio/install/
```
## Fish Audio S2 Pro
**State-of-the-art multilingual text-to-speech (TTS) system, redefining the boundaries of voice generation.**
Fish Audio S2 Pro is the most advanced multimodal model developed by [Fish Audio](https://fish.audio/). Trained on over **10 million hours** of audio data covering more than **80 languages**, S2 Pro combines a **Dual-Autoregressive (Dual-AR)** architecture with reinforcement learning (RL) alignment to generate speech that is exceptionally natural, realistic, and emotionally rich, leading the competition among both open-source and closed-source systems.
The core strength of S2 Pro lies in its support for **sub-word level** fine-grained control of prosody and emotion using natural language tags (e.g., `[whisper]`, `[excited]`, `[angry]`), while natively supporting multi-speaker and multi-turn conversation generation.
Visit the [Fish Audio website](https://fish.audio/) for a live playground, or read our [technical report](https://arxiv.org/abs/2603.08823) and [blog post](https://fish.audio/blog/fish-audio-open-sources-s2/) for more details.
### Model Variants
| Model | Size | Availability | Description |
|------|------|-------------|-------------|
| S2-Pro | 4B parameters | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | Full-featured flagship model with maximum quality and stability |
More details about the model can be found in the [technical report](https://arxiv.org/abs/2603.08823).
## Benchmark Results
| Benchmark | Fish Audio S2 |
|------|------|
| Seed-TTS Eval — WER (Chinese) | **0.54%** (best overall) |
| Seed-TTS Eval — WER (English) | **0.99%** (best overall) |
| Audio Turing Test (with instruction) | **0.515** posterior mean |
| EmergentTTS-Eval — Win Rate | **81.88%** (highest overall) |
| Fish Instruction Benchmark — TAR | **93.3%** |
| Fish Instruction Benchmark — Quality | **4.51 / 5.0** |
| Multilingual (MiniMax Testset) — Best WER | **11 of 24** languages |
| Multilingual (MiniMax Testset) — Best SIM | **17 of 24** languages |
On Seed-TTS Eval, S2 achieves the lowest WER among all evaluated models including closed-source systems: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90), Seed-TTS (1.12/2.25). On the Audio Turing Test, 0.515 surpasses Seed-TTS (0.417) by 24% and MiniMax-Speech (0.387) by 33%. On EmergentTTS-Eval, S2 achieves particularly strong results in paralinguistics (91.61% win rate), questions (84.41%), and syntactic complexity (83.39%).
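
The relative margins quoted for the Audio Turing Test follow directly from the posterior means. A minimal sketch of that arithmetic is shown here; the helper name `relativeGainPercent` is illustrative and not part of this repository.

```typescript
// Relative improvement of `score` over `baseline`, rounded to a whole percentage.
function relativeGainPercent(score: number, baseline: number): number {
  return Math.round(((score - baseline) / baseline) * 100)
}

// Posterior means quoted in the benchmark summary.
const s2Score = 0.515
const seedTtsScore = 0.417
const miniMaxSpeechScore = 0.387

console.log(relativeGainPercent(s2Score, seedTtsScore)) // 24
console.log(relativeGainPercent(s2Score, miniMaxSpeechScore)) // 33
```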
## Highlights
<img src="./docs/assets/totalability.png" width=200%>
### Fine-Grained Inline Control via Natural Language
S2 Pro brings unprecedented "soul" to speech. Using simple `[tag]` syntax, you can precisely embed emotional instructions at any position in the text.
- **15,000+ Unique Tags Supported**: Not limited to fixed presets; S2 supports **free-form text descriptions**. Try `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]`.
- **Rich Emotion Library**:
`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`
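
When preparing prompts on the client side, it can help to see which tags a piece of text contains. The sketch below is one possible way to do that, assuming tags are delimited by square brackets and never nest, as in the examples above. `Segment` and `splitInlineTags` are hypothetical names; the model itself consumes the raw tagged text.

```typescript
// A piece of the prompt: either an inline control tag or plain text to be spoken.
type Segment = { kind: 'tag' | 'text'; value: string }

// Split a prompt such as "[whisper] Hello" into tags and plain-text runs.
// Assumption: tags look like [free-form description] and do not nest.
function splitInlineTags(input: string): Segment[] {
  const segments: Segment[] = []
  const pattern = /\[([^\[\]]+)\]/g
  let cursor = 0
  let match: RegExpExecArray | null
  while ((match = pattern.exec(input)) !== null) {
    if (match.index > cursor) {
      segments.push({ kind: 'text', value: input.slice(cursor, match.index) })
    }
    segments.push({ kind: 'tag', value: match[1].trim() })
    cursor = match.index + match[0].length
  }
  if (cursor < input.length) {
    segments.push({ kind: 'text', value: input.slice(cursor) })
  }
  return segments
}

const segments = splitInlineTags('[whisper] Hello [pause] world')
console.log(segments.filter((segment) => segment.kind === 'tag').map((segment) => segment.value))
```

Because the tag vocabulary is free-form, the helper deliberately does not validate tag names against a fixed list.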
### Innovative Dual-Autoregressive (Dual-AR) Architecture
S2 Pro adopts a master-slave Dual-AR architecture consisting of a decoder-only transformer and an RVQ audio codec (10 codebooks, ~21 Hz):
- **Slow AR (4B parameters)**: Operates along the time axis, predicting the primary semantic codebook.
- **Fast AR (400M parameters)**: Generates the remaining 9 residual codebooks at each time step, reconstructing exquisite acoustic details.
This asymmetric design achieves peak audio fidelity while significantly boosting inference speed.
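
To make the split between the two decoders concrete, the sketch below estimates how many codec tokens each one produces for a clip of a given length. It uses only the figures quoted above (10 codebooks, roughly 21 frames per second); the names `estimateTokenBudget` and `TokenBudget` are illustrative, and the frame rate is approximate, so the results are estimates.

```typescript
// Figures from the architecture description: 10 RVQ codebooks at roughly 21 frames per second.
const FRAME_RATE_HZ = 21 // approximate; quoted as "~21 Hz"
const NUM_CODEBOOKS = 10

type TokenBudget = {
  frames: number
  slowArTokens: number
  fastArTokens: number
  totalTokens: number
}

// Estimate the number of codec tokens needed for `audioSeconds` of speech.
function estimateTokenBudget(audioSeconds: number): TokenBudget {
  const frames = Math.round(audioSeconds * FRAME_RATE_HZ)
  return {
    frames,
    slowArTokens: frames, // one semantic token per frame (Slow AR)
    fastArTokens: frames * (NUM_CODEBOOKS - 1), // nine residual tokens per frame (Fast AR)
    totalTokens: frames * NUM_CODEBOOKS,
  }
}

// For 10 s of audio: 210 frames, 1890 Fast AR tokens, 2100 tokens in total.
const budget = estimateTokenBudget(10)
console.log(budget.frames, budget.fastArTokens, budget.totalTokens)
```

The Slow AR takes only one step per frame, while the smaller Fast AR handles the other nine codebooks, which is why the design is described as asymmetric.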
### Reinforcement Learning (RL) Alignment
S2 Pro uses **Group Relative Policy Optimization (GRPO)** for post-training alignment. The same model suite used for data cleaning and annotation serves directly as the reward models, which addresses the distribution mismatch between pre-training data and post-training objectives.
- **Multi-Dimensional Reward Signals**: Comprehensively evaluates semantic accuracy, instruction adherence, acoustic preference scoring, and timbre similarity to ensure every second of generated speech feels intuitive to humans.
### Extreme Streaming Performance (Powered by SGLang)
As the Dual-AR architecture is structurally isomorphic to standard LLMs, S2 Pro natively supports all SGLang inference acceleration features, including Continuous Batching, Paged KV Cache, CUDA Graph, and RadixAttention-based Prefix Caching.
**Performance on a single NVIDIA H200 GPU:**
- **Real-Time Factor (RTF)**: 0.195
- **Time-to-First-Audio (TTFA)**: ~100 ms
- **Extreme Throughput**: 3,000+ acoustic tokens/s while maintaining RTF < 0.5
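
The real-time factor is the ratio of wall-clock generation time to the duration of the generated audio, so values below 1 mean faster than real time. The sketch below applies that standard definition to the RTF quoted above; the function names are illustrative, and actual latency depends on hardware and load.

```typescript
// RTF = wall-clock generation time / duration of the generated audio.
function realTimeFactor(generationSeconds: number, audioSeconds: number): number {
  if (audioSeconds <= 0) {
    throw new RangeError('audioSeconds must be positive')
  }
  return generationSeconds / audioSeconds
}

// Expected generation time for a clip of `audioSeconds` at a given RTF.
function expectedGenerationSeconds(audioSeconds: number, rtf: number): number {
  return audioSeconds * rtf
}

// At the quoted RTF of 0.195, one minute of audio takes roughly 11.7 s to generate.
console.log(expectedGenerationSeconds(60, 0.195).toFixed(1))
```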
### Robust Multilingual Support
S2 Pro supports over 80 languages without requiring phonemes or language-specific preprocessing:
- **Tier 1**: Japanese (ja), English (en), Chinese (zh)
- **Tier 2**: Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)
- **Global Coverage**: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, xsl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, etc.
### Native Multi-Speaker Generation
<img src="./docs/assets/chattemplate.png" width=200%>
Fish Audio S2 accepts reference audio containing multiple speakers and represents each speaker's features with a `<|speaker:i|>` token. Speaker ID tokens in the text then select which voice speaks, so a single generation can include multiple speakers without uploading separate reference audio for each one.
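
A multi-speaker script could be assembled along the following lines. This is only a sketch: the `<|speaker:i|>` token form comes from the description above, but the zero-based indices, the one-turn-per-line layout, and the names `DialogueTurn`, `speakerToken`, and `renderDialogue` are assumptions made for illustration. Check the official inference documentation for the exact prompt layout.

```typescript
type DialogueTurn = { speaker: number; text: string }

// Build the speaker ID token for a given speaker index.
function speakerToken(index: number): string {
  if (!Number.isInteger(index) || index < 0) {
    throw new RangeError('speaker index must be a non-negative integer')
  }
  return `<|speaker:${index}|>`
}

// Assumed layout: one turn per line, each prefixed with its speaker token.
function renderDialogue(turns: DialogueTurn[]): string {
  return turns.map((turn) => `${speakerToken(turn.speaker)}${turn.text.trim()}`).join('\n')
}

console.log(
  renderDialogue([
    { speaker: 0, text: 'Did you hear the news?' },
    { speaker: 1, text: '[surprised] No, what happened?' },
  ]),
)
```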
### Multi-Turn Generation
With its expanded context window, the model can use earlier turns to make subsequently generated content more expressive, which makes dialogue sound more natural.
### Rapid Voice Cloning
Fish Audio S2 supports accurate voice cloning using short reference samples (typically 10-30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
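
As a rough illustration, a voice-cloning request pairs the text to synthesize with a reference clip and its transcript. The sketch below mirrors the JSON fields and default values that the bundled web UI (`awesome_webui/src/App.tsx`) posts to `/v1/tts`. The names `ReferenceClip`, `TtsRequest`, and `buildCloneRequest` are hypothetical, and the server may accept fields not shown here, so treat the server documentation as authoritative.

```typescript
type AudioFormat = 'mp3' | 'wav' | 'pcm' | 'opus'

type ReferenceClip = {
  text: string // transcript of the reference clip
  audio: string // base64-encoded audio bytes
}

type TtsRequest = {
  text: string
  references: ReferenceClip[]
  format: AudioFormat
  latency: 'normal' | 'balanced'
  normalize: boolean
  chunk_length: number
  max_new_tokens: number
  temperature: number
  top_p: number
  repetition_penalty: number
}

// Build a request body using the same defaults as the web UI's initial controls.
function buildCloneRequest(
  text: string,
  reference: ReferenceClip,
  format: AudioFormat = 'mp3',
): TtsRequest {
  if (reference.audio.length === 0) {
    throw new Error('reference audio must not be empty')
  }
  return {
    text,
    references: [reference],
    format,
    latency: 'normal',
    normalize: false,
    chunk_length: 1000,
    max_new_tokens: 2048,
    temperature: 0.9,
    top_p: 0.9,
    repetition_penalty: 1.05,
  }
}

const cloneRequest = buildCloneRequest('[whisper] This is my cloned voice.', {
  text: 'Transcript of the reference clip.',
  audio: 'UklGRg==', // placeholder base64 string, not real audio
})
// The web UI sends a body like this as JSON via POST to `/v1/tts`.
console.log(JSON.stringify(cloneRequest, null, 2))
```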
For SGLang Server usage, please refer to the [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
---
## Credits
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
- [Qwen3](https://github.com/QwenLM/Qwen3)
## Tech Report
```bibtex
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}
@misc{liao2026fishaudios2technical,
title={Fish Audio S2 Technical Report},
author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
year={2026},
eprint={2603.08823},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2603.08823},
}
```
================================================
FILE: awesome_webui/.gitignore
================================================
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*
node_modules
dist
dist-ssr
*.local
# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?
================================================
FILE: awesome_webui/README.md
================================================
# React + TypeScript + Vite
This template provides a minimal setup to get React working in Vite with HMR and some ESLint rules.
Currently, two official plugins are available:
- [@vitejs/plugin-react](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react) uses [Babel](https://babeljs.io/) (or [oxc](https://oxc.rs) when used in [rolldown-vite](https://vite.dev/guide/rolldown)) for Fast Refresh
- [@vitejs/plugin-react-swc](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react-swc) uses [SWC](https://swc.rs/) for Fast Refresh
## React Compiler
The React Compiler is currently not compatible with SWC. See [this issue](https://github.com/vitejs/vite-plugin-react/issues/428) for tracking the progress.
## Expanding the ESLint configuration
If you are developing a production application, we recommend updating the configuration to enable type-aware lint rules:
```js
export default defineConfig([
globalIgnores(['dist']),
{
files: ['**/*.{ts,tsx}'],
extends: [
// Other configs...
// Remove tseslint.configs.recommended and replace with this
tseslint.configs.recommendedTypeChecked,
// Alternatively, use this for stricter rules
tseslint.configs.strictTypeChecked,
// Optionally, add this for stylistic rules
tseslint.configs.stylisticTypeChecked,
// Other configs...
],
languageOptions: {
parserOptions: {
project: ['./tsconfig.node.json', './tsconfig.app.json'],
tsconfigRootDir: import.meta.dirname,
},
// other options...
},
},
])
```
You can also install [eslint-plugin-react-x](https://github.com/Rel1cx/eslint-react/tree/main/packages/plugins/eslint-plugin-react-x) and [eslint-plugin-react-dom](https://github.com/Rel1cx/eslint-react/tree/main/packages/plugins/eslint-plugin-react-dom) for React-specific lint rules:
```js
// eslint.config.js
import reactX from 'eslint-plugin-react-x'
import reactDom from 'eslint-plugin-react-dom'
export default defineConfig([
globalIgnores(['dist']),
{
files: ['**/*.{ts,tsx}'],
extends: [
// Other configs...
// Enable lint rules for React
reactX.configs['recommended-typescript'],
// Enable lint rules for React DOM
reactDom.configs.recommended,
],
languageOptions: {
parserOptions: {
project: ['./tsconfig.node.json', './tsconfig.app.json'],
tsconfigRootDir: import.meta.dirname,
},
// other options...
},
},
])
```
================================================
FILE: awesome_webui/eslint.config.js
================================================
import js from '@eslint/js'
import globals from 'globals'
import reactHooks from 'eslint-plugin-react-hooks'
import reactRefresh from 'eslint-plugin-react-refresh'
import tseslint from 'typescript-eslint'
import { defineConfig, globalIgnores } from 'eslint/config'
export default defineConfig([
globalIgnores(['dist']),
{
files: ['**/*.{ts,tsx}'],
extends: [
js.configs.recommended,
tseslint.configs.recommended,
reactHooks.configs.flat.recommended,
reactRefresh.configs.vite,
],
languageOptions: {
ecmaVersion: 2020,
globals: globals.browser,
},
},
])
================================================
FILE: awesome_webui/index.html
================================================
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Awesome WebUI</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>
================================================
FILE: awesome_webui/package.json
================================================
{
"name": "awesome_webui",
"private": true,
"version": "0.0.0",
"type": "module",
"scripts": {
"dev": "vite",
"build": "tsc -b && vite build",
"lint": "eslint .",
"preview": "vite preview"
},
"dependencies": {
"@radix-ui/react-collapsible": "^1.1.12",
"@radix-ui/react-dialog": "^1.1.15",
"@radix-ui/react-label": "^2.1.8",
"@radix-ui/react-scroll-area": "^1.2.10",
"@radix-ui/react-separator": "^1.1.8",
"@radix-ui/react-slider": "^1.3.6",
"@radix-ui/react-slot": "^1.2.4",
"@radix-ui/react-switch": "^1.2.6",
"@radix-ui/react-toggle-group": "^1.1.11",
"@tailwindcss/vite": "^4.2.1",
"class-variance-authority": "^0.7.1",
"clsx": "^2.1.1",
"lucide-react": "^0.577.0",
"react": "^19.2.0",
"react-dom": "^19.2.0",
"tailwind-merge": "^3.5.0",
"tailwindcss": "^4.2.1"
},
"devDependencies": {
"@eslint/js": "^9.39.1",
"@types/node": "^24.10.1",
"@types/react": "^19.2.7",
"@types/react-dom": "^19.2.3",
"@vitejs/plugin-react-swc": "^4.2.2",
"eslint": "^9.39.1",
"eslint-plugin-react-hooks": "^7.0.1",
"eslint-plugin-react-refresh": "^0.4.24",
"globals": "^16.5.0",
"typescript": "~5.9.3",
"typescript-eslint": "^8.48.0",
"vite": "^7.3.1"
}
}
================================================
FILE: awesome_webui/src/App.tsx
================================================
import { useEffect, useRef, useState } from 'react'
import {
AudioLines,
ChevronDown,
CircleAlert,
Copy,
Download,
FileText,
Info,
LoaderCircle,
Plus,
Settings2,
Upload,
} from 'lucide-react'
import { Alert, AlertDescription, AlertTitle } from '@/components/ui/alert'
import { Badge } from '@/components/ui/badge'
import { Button } from '@/components/ui/button'
import {
Card,
CardContent,
CardDescription,
CardHeader,
CardTitle,
} from '@/components/ui/card'
import {
Collapsible,
CollapsibleContent,
CollapsibleTrigger,
} from '@/components/ui/collapsible'
import {
Dialog,
DialogContent,
DialogDescription,
DialogFooter,
DialogHeader,
DialogTitle,
} from '@/components/ui/dialog'
import { Label } from '@/components/ui/label'
import { ScrollArea } from '@/components/ui/scroll-area'
import { Separator } from '@/components/ui/separator'
import { Slider } from '@/components/ui/slider'
import { Switch } from '@/components/ui/switch'
import { Textarea } from '@/components/ui/textarea'
import { ToggleGroup, ToggleGroupItem } from '@/components/ui/toggle-group'
type AudioFormat = 'mp3' | 'wav' | 'pcm' | 'opus'
type LatencyMode = 'normal' | 'balanced'
const defaultInputText = `[excited, joyful tone] We're going to DISNEY WORLD! [squeal of delight] I've been saving for [emphasis] three years [breathless] and finally, FINALLY we can go! The look on your face right now is worth every extra shift I worked!
[angry] After everything we've been through [break] I can't believe you would [emphasize] betray me like this. I gave you EVERYTHING! And now I'm left with nothing but memories and broken promises!`
type ControlsState = {
chunkLength: number
maxNewTokens: number
temperature: number
topP: number
repetitionPenalty: number
normalize: boolean
format: AudioFormat
latency: LatencyMode
}
type Metrics = {
textLength: number
ttftMs: number
receivedKb: number
}
type StatusState = {
tone: 'error' | 'info'
message: string
}
type ReferenceItem = {
id: number
name: string
audio: ArrayBuffer
text: string
previewUrl: string
}
type SpeakerGroup = {
id: number
references: ReferenceItem[]
}
type PendingReference = {
mode: 'create' | 'edit'
speakerId: number
referenceId?: number
name: string
audio?: ArrayBuffer
text: string
}
const initialControls: ControlsState = {
chunkLength: 1000,
maxNewTokens: 2048,
temperature: 0.9,
topP: 0.9,
repetitionPenalty: 1.05,
normalize: false,
format: 'mp3',
latency: 'normal',
}
const formatMimeMap: Record<AudioFormat, string> = {
mp3: 'audio/mpeg',
wav: 'audio/wav',
pcm: 'audio/pcm',
opus: 'audio/opus',
}
function createId() {
return Date.now() + Math.floor(Math.random() * 100000)
}
function arrayBufferToBase64(buffer: ArrayBuffer): string {
const bytes = new Uint8Array(buffer)
let binary = ''
for (let i = 0; i < bytes.byteLength; i++) {
binary += String.fromCharCode(bytes[i])
}
return btoa(binary)
}
function createSpeakerGroup(): SpeakerGroup {
return {
id: createId(),
references: [],
}
}
const initialSpeakerGroup = createSpeakerGroup()
function buildReferencesPayload(
speakerGroups: SpeakerGroup[],
includeBinaryAudio: boolean,
) {
return speakerGroups.flatMap((speakerGroup) =>
speakerGroup.references.map((reference) => ({
text: reference.text,
audio: includeBinaryAudio
? arrayBufferToBase64(reference.audio)
: '<audio binary data>',
})),
)
}
// Shared payload builder; `includeBinaryAudio` selects real base64 audio or a readable placeholder.
function buildPayload(
inputText: string,
controls: ControlsState,
speakerGroups: SpeakerGroup[],
includeBinaryAudio: boolean,
) {
return {
text: inputText,
chunk_length: controls.chunkLength,
max_new_tokens: controls.maxNewTokens,
format: controls.format,
latency: controls.latency,
normalize: controls.normalize,
references: buildReferencesPayload(speakerGroups, includeBinaryAudio),
temperature: controls.temperature,
top_p: controls.topP,
repetition_penalty: controls.repetitionPenalty,
}
}
// Payload shown in the request preview: audio is replaced by a placeholder string.
function buildPreviewPayload(
inputText: string,
controls: ControlsState,
speakerGroups: SpeakerGroup[],
) {
return buildPayload(inputText, controls, speakerGroups, false)
}
// Payload sent to the backend: reference audio is base64-encoded.
function buildRequestPayload(
inputText: string,
controls: ControlsState,
speakerGroups: SpeakerGroup[],
) {
return buildPayload(inputText, controls, speakerGroups, true)
}
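function createFileName(inputText: string) {
// Drop characters that are invalid in file names on common platforms (and tag brackets) before building the prefix.
const sanitized = inputText.replace(/[\\\/:*?"<>|[\]]/g, ' ').trim()
return sanitized.replace(/\s+/g, '-').slice(0, 24) || 'tts'
}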
function getErrorMessage(error: unknown) {
return error instanceof Error ? error.message : 'Unknown error'
}
function waitForSourceBuffer(sourceBuffer: SourceBuffer) {
if (!sourceBuffer.updating) {
return Promise.resolve()
}
return new Promise<void>((resolve) => {
const handleUpdateEnd = () => {
sourceBuffer.removeEventListener('updateend', handleUpdateEnd)
resolve()
}
sourceBuffer.addEventListener('updateend', handleUpdateEnd)
})
}
function canUseStreamingPlayback(format: AudioFormat) {
const mime = formatMimeMap[format]
return typeof window.MediaSource !== 'undefined' && MediaSource.isTypeSupported(mime)
}
type SettingSliderProps = {
label: string
value: number
min: number
max: number
step?: number
onValueChange: (value: number) => void
formatValue?: (value: number) => string
}
function SettingSlider({
label,
value,
min,
max,
step = 1,
onValueChange,
formatValue,
}: SettingSliderProps) {
return (
<div className="space-y-3">
<div className="flex items-center justify-between gap-4">
<Label>{label}</Label>
<span className="text-sm text-muted-foreground">
{formatValue ? formatValue(value) : value}
</span>
</div>
<Slider
value={[value]}
min={min}
max={max}
step={step}
onValueChange={(nextValue) => {
const current = nextValue[0]
if (typeof current === 'number') {
onValueChange(current)
}
}}
/>
</div>
)
}
function App() {
const [inputText, setInputText] = useState(defaultInputText)
const [controls, setControls] = useState(initialControls)
const [speakerGroups, setSpeakerGroups] = useState<SpeakerGroup[]>([initialSpeakerGroup])
const [pendingReference, setPendingReference] = useState<PendingReference | null>(null)
const [openSpeakerIds, setOpenSpeakerIds] = useState<number[]>([initialSpeakerGroup.id])
const [metrics, setMetrics] = useState<Metrics | null>(null)
const [isGenerating, setIsGenerating] = useState(false)
const [copyLabel, setCopyLabel] = useState('Copy')
const [isRequestPreviewOpen, setIsRequestPreviewOpen] = useState(false)
const [statusMessage, setStatusMessage] = useState<StatusState | null>(null)
const [downloadUrl, setDownloadUrl] = useState<string | null>(null)
const [downloadName, setDownloadName] = useState('generated-audio.mp3')
const audioRef = useRef<HTMLAudioElement | null>(null)
const fileInputRef = useRef<HTMLInputElement | null>(null)
const speakerGroupsRef = useRef<SpeakerGroup[]>([])
const uploadTargetSpeakerIdRef = useRef<number | null>(null)
const downloadUrlRef = useRef<string | null>(null)
const mediaSourceUrlRef = useRef<string | null>(null)
speakerGroupsRef.current = speakerGroups
useEffect(() => {
return () => {
speakerGroupsRef.current.forEach((speakerGroup) => {
speakerGroup.references.forEach((reference) => {
URL.revokeObjectURL(reference.previewUrl)
})
})
if (downloadUrlRef.current) {
URL.revokeObjectURL(downloadUrlRef.current)
}
if (mediaSourceUrlRef.current) {
URL.revokeObjectURL(mediaSourceUrlRef.current)
}
}
}, [])
function addSpeaker() {
const nextSpeaker = createSpeakerGroup()
setSpeakerGroups((current) => [...current, nextSpeaker])
setOpenSpeakerIds((current) => [...current, nextSpeaker.id])
}
function removeSpeaker(speakerId: number) {
setSpeakerGroups((current) => {
const targetSpeaker = current.find((speakerGroup) => speakerGroup.id === speakerId)
if (targetSpeaker) {
targetSpeaker.references.forEach((reference) => {
URL.revokeObjectURL(reference.previewUrl)
})
}
const next = current.filter((speakerGroup) => speakerGroup.id !== speakerId)
return next.length > 0 ? next : [createSpeakerGroup()]
})
setOpenSpeakerIds((current) => current.filter((currentSpeakerId) => currentSpeakerId !== speakerId))
if (pendingReference?.speakerId === speakerId) {
setPendingReference(null)
}
}
function addReference(speakerId: number, name: string, audio: ArrayBuffer, text: string) {
const previewUrl = URL.createObjectURL(new Blob([audio], { type: formatMimeMap.mp3 }))
setSpeakerGroups((current) =>
current.map((speakerGroup) =>
speakerGroup.id === speakerId
? {
...speakerGroup,
references: [
...speakerGroup.references,
{
id: createId(),
name,
audio,
text,
previewUrl,
},
],
}
: speakerGroup,
),
)
}
function removeReference(speakerId: number, referenceId: number) {
setSpeakerGroups((current) =>
current.map((speakerGroup) => {
if (speakerGroup.id !== speakerId) {
return speakerGroup
}
return {
...speakerGroup,
references: speakerGroup.references.filter((reference) => {
if (reference.id === referenceId) {
URL.revokeObjectURL(reference.previewUrl)
return false
}
return true
}),
}
}),
)
}
function updateReferenceText(speakerId: number, referenceId: number, text: string) {
setSpeakerGroups((current) =>
current.map((speakerGroup) =>
speakerGroup.id === speakerId
? {
...speakerGroup,
references: speakerGroup.references.map((reference) =>
reference.id === referenceId ? { ...reference, text } : reference,
),
}
: speakerGroup,
),
)
}
function clearDownloadUrl() {
if (downloadUrlRef.current) {
URL.revokeObjectURL(downloadUrlRef.current)
downloadUrlRef.current = null
}
setDownloadUrl(null)
}
function clearMediaSourceUrl() {
if (mediaSourceUrlRef.current) {
URL.revokeObjectURL(mediaSourceUrlRef.current)
mediaSourceUrlRef.current = null
}
}
async function handleReferenceUpload(event: React.ChangeEvent<HTMLInputElement>) {
const file = event.target.files?.[0]
const speakerId = uploadTargetSpeakerIdRef.current
event.target.value = ''
uploadTargetSpeakerIdRef.current = null
if (!file || typeof speakerId !== 'number') {
return
}
const audio = await file.arrayBuffer()
setPendingReference({
mode: 'create',
speakerId,
name: file.name,
audio,
text: '',
})
}
function savePendingReference() {
if (!pendingReference) {
return
}
if (pendingReference.mode === 'create' && pendingReference.audio) {
addReference(
pendingReference.speakerId,
pendingReference.name,
pendingReference.audio,
pendingReference.text,
)
}
if (pendingReference.mode === 'edit' && typeof pendingReference.referenceId === 'number') {
updateReferenceText(
pendingReference.speakerId,
pendingReference.referenceId,
pendingReference.text,
)
}
setPendingReference(null)
setStatusMessage(null)
}
async function copyRequestPreview() {
const requestPreview = JSON.stringify(
buildPreviewPayload(inputText, controls, speakerGroups),
null,
2,
)
try {
await navigator.clipboard.writeText(requestPreview)
setCopyLabel('Copied')
window.setTimeout(() => setCopyLabel('Copy'), 2000)
} catch (error) {
setStatusMessage({
tone: 'error',
message: `Failed to copy request preview: ${getErrorMessage(error)}`,
})
}
}
async function handleGenerateAudio() {
const audioElement = audioRef.current
if (!audioElement) {
return
}
const mime = formatMimeMap[controls.format]
const useStreamingPlayback = canUseStreamingPlayback(controls.format)
clearDownloadUrl()
clearMediaSourceUrl()
setMetrics(null)
setStatusMessage(null)
setIsGenerating(true)
try {
const response = await fetch('/v1/tts', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify(buildRequestPayload(inputText, controls, speakerGroups)),
})
if (!response.ok || !response.body) {
throw new Error(`Failed to generate audio (HTTP ${response.status})`)
}
const reader = response.body.getReader()
let mediaSource: MediaSource | null = null
if (useStreamingPlayback) {
mediaSource = new MediaSource()
const streamUrl = URL.createObjectURL(mediaSource)
mediaSourceUrlRef.current = streamUrl
audioElement.src = streamUrl
} else {
audioElement.removeAttribute('src')
audioElement.load()
}
const allChunks: ArrayBuffer[] = []
const playQueue: ArrayBuffer[] = []
let sourceBuffer: SourceBuffer | null = null
let readingDone = false
let receivedLength = 0
let ttftMs = -1
const startTime = performance.now()
if (mediaSource) {
const sourceReady = new Promise<void>((resolve, reject) => {
mediaSource.addEventListener(
'sourceopen',
() => {
try {
sourceBuffer = mediaSource.addSourceBuffer(mime)
const processQueue = async () => {
if (!sourceBuffer || !mediaSource) {
return
}
while (true) {
if (readingDone && playQueue.length === 0) {
await waitForSourceBuffer(sourceBuffer)
if (mediaSource.readyState === 'open') {
mediaSource.endOfStream()
}
break
}
const chunk = playQueue.shift()
if (!chunk) {
await new Promise<void>((resolveSleep) => {
window.setTimeout(resolveSleep, 50)
})
continue
}
await waitForSourceBuffer(sourceBuffer)
sourceBuffer.appendBuffer(chunk)
await waitForSourceBuffer(sourceBuffer)
}
}
void processQueue()
resolve()
} catch (error) {
reject(error)
}
},
{ once: true },
)
})
await sourceReady
}
try {
while (true) {
const { done, value } = await reader.read()
if (done) {
break
}
receivedLength += value.byteLength
if (ttftMs < 0) {
ttftMs = performance.now() - startTime
}
setMetrics({
textLength: inputText.length,
ttftMs,
receivedKb: Math.round(receivedLength / 1024),
})
const chunk = value.buffer.slice(value.byteOffset, value.byteOffset + value.byteLength)
playQueue.push(chunk)
allChunks.push(chunk)
if (useStreamingPlayback && audioElement.paused) {
void audioElement.play().catch(() => undefined)
}
}
} finally {
// Always signal completion so the playback queue loop stops polling, even if reading fails.
readingDone = true
}
const audioBlob = new Blob(allChunks, { type: mime })
const nextDownloadUrl = URL.createObjectURL(audioBlob)
downloadUrlRef.current = nextDownloadUrl
setDownloadUrl(nextDownloadUrl)
setDownloadName(`${createFileName(inputText)}.${controls.format}`)
if (!useStreamingPlayback) {
audioElement.src = nextDownloadUrl
audioElement.load()
setStatusMessage({
tone: 'info',
message: `Format "${controls.format}" cannot be streamed in this browser. The complete file has been loaded into the player and is ready to download.`,
})
}
} catch (error) {
setStatusMessage({
tone: 'error',
message: `Audio generation failed: ${getErrorMessage(error)}`,
})
} finally {
setIsGenerating(false)
}
}
const requestPreview = JSON.stringify(
buildPreviewPayload(inputText, controls, speakerGroups),
null,
2,
)
const totalReferenceCount = speakerGroups.reduce(
(count, speakerGroup) => count + speakerGroup.references.length,
0,
)
return (
<main className="min-h-screen bg-zinc-50">
<div className="mx-auto max-w-[1600px] px-3 py-3 sm:px-4 lg:px-5">
<div className="grid gap-4 xl:h-[calc(100vh-1.5rem)] xl:grid-cols-[minmax(0,1fr)_460px]">
<section className="grid gap-4 xl:min-h-0 xl:grid-rows-[minmax(0,1fr)_auto]">
<Card className="rounded-xl border-zinc-200 bg-white shadow-none xl:min-h-0 xl:flex xl:flex-col">
<CardHeader className="space-y-1 border-b border-zinc-100 px-4 py-4">
<div className="flex items-center gap-2 text-zinc-700">
<FileText className="size-4" />
<CardTitle>Input</CardTitle>
</div>
<CardDescription>
Enter the text to synthesize and inspect the outgoing request payload.
</CardDescription>
</CardHeader>
<CardContent className="space-y-4 px-4 pt-4 xl:min-h-0 xl:flex-1 xl:overflow-y-auto">
<div className="space-y-2">
<Label htmlFor="inputText">Input Text</Label>
<Textarea
id="inputText"
value={inputText}
onChange={(event) => setInputText(event.target.value)}
placeholder="Enter text to synthesize"
className="min-h-[220px] resize-y rounded-xl border-zinc-200 bg-white p-3 text-sm shadow-none focus-visible:ring-zinc-300 xl:min-h-[260px]"
/>
</div>
<Collapsible open={isRequestPreviewOpen} onOpenChange={setIsRequestPreviewOpen}>
<div className="rounded-xl border border-zinc-200 bg-zinc-50">
<div className="flex flex-col gap-2 p-3 sm:flex-row sm:items-center sm:justify-between">
<div>
<div className="text-sm font-medium text-zinc-900">Request Preview</div>
<div className="text-xs text-zinc-500">
Live snapshot of the payload sent to the backend.
</div>
</div>
<div className="flex items-center gap-2">
<Button
type="button"
variant="ghost"
size="sm"
className="border border-zinc-200 bg-white text-zinc-700 hover:bg-zinc-100"
onClick={copyRequestPreview}
>
<Copy className="size-3.5" />
{copyLabel}
</Button>
<CollapsibleTrigger asChild>
<Button
type="button"
variant="ghost"
size="sm"
className="border border-zinc-200 bg-white text-zinc-700 hover:bg-zinc-100"
>
{isRequestPreviewOpen ? 'Collapse' : 'Expand'}
<ChevronDown
className={`size-4 transition-transform ${
isRequestPreviewOpen ? 'rotate-180' : ''
}`}
/>
</Button>
</CollapsibleTrigger>
</div>
</div>
<CollapsibleContent>
<Separator className="bg-zinc-200" />
<div className="p-3 pt-3">
<ScrollArea className="h-56 min-w-0 rounded-lg border border-zinc-200 bg-white">
<pre className="max-w-full whitespace-pre-wrap break-all p-3 text-xs leading-5 text-zinc-700">
{requestPreview}
</pre>
</ScrollArea>
</div>
</CollapsibleContent>
</div>
</Collapsible>
<div className="space-y-4">
<Button
type="button"
size="lg"
className="h-11 rounded-lg bg-zinc-900 text-white hover:bg-zinc-800"
onClick={handleGenerateAudio}
disabled={isGenerating}
>
{isGenerating ? (
<LoaderCircle className="size-4 animate-spin" />
) : (
<AudioLines className="size-4" />
)}
{isGenerating ? 'Generating Audio...' : 'Generate Audio'}
</Button>
{statusMessage ? (
<Alert
variant={statusMessage.tone === 'error' ? 'destructive' : 'warning'}
className="rounded-lg"
>
<div className="flex items-start gap-3">
{statusMessage.tone === 'error' ? (
<CircleAlert className="mt-0.5 size-4 shrink-0" />
) : (
<Info className="mt-0.5 size-4 shrink-0" />
)}
<div>
<AlertTitle>
{statusMessage.tone === 'error' ? 'Error' : 'Notice'}
</AlertTitle>
<AlertDescription>{statusMessage.message}</AlertDescription>
</div>
</div>
</Alert>
) : null}
</div>
</CardContent>
</Card>
<Card className="rounded-xl border-zinc-200 bg-white shadow-none">
<CardHeader className="space-y-1 border-b border-zinc-100 px-4 py-4">
<div className="flex items-center gap-2 text-zinc-700">
<AudioLines className="size-4" />
<CardTitle>Output</CardTitle>
</div>
<CardDescription>
Stream the result when supported, then preview or download the final file.
</CardDescription>
</CardHeader>
<CardContent className="space-y-3 px-4 pt-4">
<audio
ref={audioRef}
controls
className="w-full rounded-lg border border-zinc-200 bg-white"
/>
<div className="flex flex-wrap gap-2">
{metrics ? (
<>
<Badge variant="outline" className="border-zinc-200 bg-white text-zinc-700">
Text length: {metrics.textLength}
</Badge>
<Badge variant="outline" className="border-zinc-200 bg-white text-zinc-700">
TTFT: {metrics.ttftMs.toFixed(2)} ms
</Badge>
<Badge variant="outline" className="border-zinc-200 bg-white text-zinc-700">
Received: {metrics.receivedKb} KB
</Badge>
</>
) : (
<Badge variant="outline" className="border-zinc-200 bg-white text-zinc-500">
No output yet
</Badge>
)}
</div>
<div className="flex justify-end">
{downloadUrl ? (
<Button
asChild
variant="outline"
className="border-zinc-200 bg-white text-zinc-800 hover:bg-zinc-100"
>
<a href={downloadUrl} download={downloadName}>
<Download className="size-4" />
Download
</a>
</Button>
) : null}
</div>
</CardContent>
</Card>
</section>
<aside className="grid gap-4 xl:min-h-0 xl:grid-rows-[minmax(0,1fr)_auto]">
<Card className="rounded-xl border-zinc-200 bg-white shadow-none xl:min-h-0 xl:flex xl:flex-col">
<CardHeader className="space-y-1 border-b border-zinc-100 px-4 py-4">
<div className="flex items-center gap-2 text-zinc-700">
<Upload className="size-4" />
<CardTitle>Reference Audio</CardTitle>
</div>
<CardDescription>
Build one or more speaker groups. Each speaker can have multiple reference clips.
</CardDescription>
</CardHeader>
<CardContent className="space-y-3 px-4 pt-4 xl:min-h-0 xl:flex xl:flex-1 xl:flex-col">
<div className="flex flex-wrap items-center justify-between gap-2">
<div className="flex items-center text-sm text-zinc-500">
{speakerGroups.length} speaker{speakerGroups.length === 1 ? '' : 's'} /{' '}
{totalReferenceCount} reference{totalReferenceCount === 1 ? '' : 's'}
</div>
<Button
type="button"
variant="outline"
className="border-zinc-200 bg-white hover:bg-zinc-100"
onClick={addSpeaker}
>
<Plus className="size-4" />
Add Speaker
</Button>
<input
ref={fileInputRef}
type="file"
accept="audio/*"
className="hidden"
onChange={handleReferenceUpload}
/>
</div>
<ScrollArea className="min-h-0 rounded-md xl:h-full xl:flex-1">
<div className="space-y-2">
{speakerGroups.length > 0 ? (
speakerGroups.map((speakerGroup, speakerIndex) => (
<Collapsible
key={speakerGroup.id}
open={openSpeakerIds.includes(speakerGroup.id)}
onOpenChange={(open) => {
setOpenSpeakerIds((current) =>
open
? [...current, speakerGroup.id]
: current.filter(
(currentSpeakerId) => currentSpeakerId !== speakerGroup.id,
),
)
}}
>
<div className="rounded-lg border border-zinc-200 bg-white">
<div className="flex flex-col gap-2 px-3 py-3 sm:flex-row sm:items-center sm:justify-between">
<div className="min-w-0">
<div className="text-sm font-medium text-zinc-900">
Speaker {speakerIndex}
</div>
<div className="text-xs text-zinc-500">
{speakerGroup.references.length} reference
{speakerGroup.references.length === 1 ? '' : 's'}
</div>
</div>
<div className="flex flex-wrap gap-2">
<Button
type="button"
variant="outline"
size="sm"
className="h-8 border-zinc-200 bg-white px-2.5 hover:bg-zinc-100"
onClick={() => {
uploadTargetSpeakerIdRef.current = speakerGroup.id
fileInputRef.current?.click()
}}
>
<Upload className="size-4" />
Upload
</Button>
{speakerGroups.length > 1 ? (
<Button
type="button"
variant="ghost"
size="sm"
className="h-8 px-2.5 text-zinc-500 hover:bg-zinc-100 hover:text-zinc-900"
onClick={() => removeSpeaker(speakerGroup.id)}
>
Remove
</Button>
) : null}
<CollapsibleTrigger asChild>
<Button
type="button"
variant="ghost"
size="sm"
className="h-8 px-2 text-zinc-500 hover:bg-zinc-100 hover:text-zinc-900"
>
<ChevronDown
className={`size-4 transition-transform ${
openSpeakerIds.includes(speakerGroup.id) ? 'rotate-180' : ''
}`}
/>
</Button>
</CollapsibleTrigger>
</div>
</div>
<CollapsibleContent>
<Separator className="bg-zinc-200" />
<div className="space-y-2 px-3 py-2.5">
{speakerGroup.references.length > 0 ? (
speakerGroup.references.map((reference) => (
<div
key={reference.id}
className="flex flex-col gap-2 rounded-md border border-zinc-200 bg-zinc-50 p-2 sm:flex-row sm:items-center"
>
<audio
controls
src={reference.previewUrl}
className="h-9 w-full min-w-0 rounded-md border border-zinc-200 bg-white sm:flex-1"
/>
<div className="flex gap-2 sm:shrink-0">
<Button
type="button"
variant="ghost"
size="sm"
className="h-8 border border-zinc-200 bg-white px-2.5 text-zinc-600 hover:bg-zinc-100 hover:text-zinc-900"
onClick={() =>
setPendingReference({
mode: 'edit',
speakerId: speakerGroup.id,
referenceId: reference.id,
name: reference.name,
text: reference.text,
})
}
>
Edit Text
</Button>
<Button
type="button"
variant="ghost"
size="sm"
className="h-8 border border-zinc-200 bg-white px-2.5 text-zinc-500 hover:bg-zinc-100 hover:text-zinc-900"
onClick={() =>
removeReference(speakerGroup.id, reference.id)
}
>
Remove
</Button>
</div>
</div>
))
) : (
<div className="px-1 py-3 text-sm text-zinc-500">
No references yet.
</div>
)}
</div>
</CollapsibleContent>
</div>
</Collapsible>
))
) : (
<div className="rounded-lg border border-dashed border-zinc-300 bg-white p-4 text-sm text-zinc-500">
No speaker groups configured yet.
</div>
)}
</div>
</ScrollArea>
</CardContent>
</Card>
<Card className="rounded-xl border-zinc-200 bg-white shadow-none">
<CardHeader className="space-y-1 border-b border-zinc-100 px-4 py-4">
<div className="flex items-center gap-2 text-zinc-700">
<Settings2 className="size-4" />
<CardTitle>Generation Settings</CardTitle>
</div>
<CardDescription>Adjust sampling and output parameters.</CardDescription>
</CardHeader>
<CardContent className="space-y-4 px-4 pt-4">
<div className="space-y-2">
<Label>Latency Mode</Label>
<ToggleGroup
type="single"
value={controls.latency}
className="grid grid-cols-2 gap-2"
onValueChange={(value) => {
if (value) {
setControls((current) => ({
...current,
latency: value as LatencyMode,
}))
}
}}
>
<ToggleGroupItem value="balanced" className="w-full">
balanced
</ToggleGroupItem>
<ToggleGroupItem value="normal" className="w-full">
normal
</ToggleGroupItem>
</ToggleGroup>
<p className="text-xs text-zinc-500">
Balanced uses incremental local decode for faster first audio. Normal waits for the
full LLM result, then decodes once.
</p>
</div>
<div className="space-y-2">
<Label>Format</Label>
<ToggleGroup
type="single"
value={controls.format}
className="grid grid-cols-4 gap-2"
onValueChange={(value) => {
if (value) {
setControls((current) => ({
...current,
format: value as AudioFormat,
}))
}
}}
>
<ToggleGroupItem value="mp3" className="w-full">
mp3
</ToggleGroupItem>
<ToggleGroupItem value="wav" className="w-full">
wav
</ToggleGroupItem>
<ToggleGroupItem value="pcm" className="w-full">
pcm
</ToggleGroupItem>
<ToggleGroupItem value="opus" className="w-full">
opus
</ToggleGroupItem>
</ToggleGroup>
</div>
<div className="flex items-center justify-between rounded-lg border border-zinc-200 bg-zinc-50 px-3 py-2.5">
<div className="space-y-1">
<Label htmlFor="normalize">Normalize</Label>
<p className="text-xs text-zinc-500">
Normalize text before synthesis to keep input formatting consistent.
</p>
</div>
<Switch
id="normalize"
checked={controls.normalize}
onCheckedChange={(checked) =>
setControls((current) => ({
...current,
normalize: checked,
}))
}
/>
</div>
<Separator className="bg-zinc-200" />
<SettingSlider
label="Chunk Length"
value={controls.chunkLength}
min={100}
max={1000}
onValueChange={(value) =>
setControls((current) => ({
...current,
chunkLength: value,
}))
}
/>
<SettingSlider
label="Max New Tokens"
value={controls.maxNewTokens}
min={256}
max={2048}
onValueChange={(value) =>
setControls((current) => ({
...current,
maxNewTokens: value,
}))
}
/>
<SettingSlider
label="Temperature"
value={controls.temperature}
min={0.8}
max={1}
step={0.01}
formatValue={(value) => value.toFixed(2)}
onValueChange={(value) =>
setControls((current) => ({
...current,
temperature: value,
}))
}
/>
<SettingSlider
label="Top P"
value={controls.topP}
min={0.8}
max={1}
step={0.01}
formatValue={(value) => value.toFixed(2)}
onValueChange={(value) =>
setControls((current) => ({
...current,
topP: value,
}))
}
/>
<SettingSlider
label="Repetition Penalty"
value={controls.repetitionPenalty}
min={1}
max={1.2}
step={0.01}
formatValue={(value) => value.toFixed(2)}
onValueChange={(value) =>
setControls((current) => ({
...current,
repetitionPenalty: value,
}))
}
/>
</CardContent>
</Card>
</aside>
</div>
</div>
<Dialog open={pendingReference !== null} onOpenChange={(open) => !open && setPendingReference(null)}>
<DialogContent className="border-zinc-200 bg-white">
<DialogHeader>
<DialogTitle>
{pendingReference?.mode === 'create' ? 'Save Reference Text' : 'Edit Reference Text'}
</DialogTitle>
<DialogDescription>
{pendingReference
? `Speaker ${speakerGroups.findIndex(
(speakerGroup) => speakerGroup.id === pendingReference.speakerId,
)}`
: ''}
</DialogDescription>
</DialogHeader>
<div className="space-y-3">
<div className="text-sm font-medium text-zinc-900">{pendingReference?.name}</div>
<Textarea
value={pendingReference?.text ?? ''}
onChange={(event) =>
setPendingReference((current) =>
current
? {
...current,
text: event.target.value,
}
: current,
)
}
placeholder="Enter reference text"
className="min-h-40 rounded-lg border-zinc-200 bg-white shadow-none focus-visible:ring-zinc-300"
/>
</div>
<DialogFooter>
<Button type="button" variant="ghost" onClick={() => setPendingReference(null)}>
Cancel
</Button>
<Button
type="button"
variant="outline"
className="border-zinc-200 bg-white hover:bg-zinc-100"
onClick={savePendingReference}
>
Save
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
</main>
)
}
export default App
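// Illustrative sketch, not part of the upstream component: the `onOpenChange` handler for
// speaker groups above appends the id when a panel opens and filters it out when it closes.
// `toggleOpenId` is a hypothetical pure helper with the same intent. Unlike the inline
// handler, it skips the append when the id is already present, so the list never holds
// duplicates.
export function toggleOpenId<T>(current: readonly T[], id: T, open: boolean): T[] {
  if (!open) {
    // Closing: drop every occurrence of the id.
    return current.filter((currentId) => currentId !== id)
  }
  // Opening: return a copy, appending the id only if it is missing.
  return current.includes(id) ? [...current] : [...current, id]
}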
================================================
FILE: awesome_webui/src/components/ui/alert.tsx
================================================
import * as React from 'react'
import { cva, type VariantProps } from 'class-variance-authority'
import { cn } from '@/lib/utils'
const alertVariants = cva('relative w-full rounded-lg border px-4 py-3 text-sm', {
variants: {
variant: {
default: 'bg-card text-card-foreground',
destructive: 'border-destructive/20 bg-destructive/5 text-destructive',
warning: 'border-amber-200 bg-amber-50 text-amber-900',
},
},
defaultVariants: {
variant: 'default',
},
})
function Alert({
className,
variant,
...props
}: React.ComponentProps<'div'> & VariantProps<typeof alertVariants>) {
return <div role="alert" className={cn(alertVariants({ variant }), className)} {...props} />
}
function AlertTitle({ className, ...props }: React.ComponentProps<'h5'>) {
return <h5 className={cn('mb-1 font-medium leading-none tracking-tight', className)} {...props} />
}
function AlertDescription({ className, ...props }: React.ComponentProps<'div'>) {
return <div className={cn('text-sm [&_p]:leading-relaxed', className)} {...props} />
}
export { Alert, AlertDescription, AlertTitle }
================================================
FILE: awesome_webui/src/components/ui/badge.tsx
================================================
/* eslint-disable react-refresh/only-export-components */
import * as React from 'react'
import { cva, type VariantProps } from 'class-variance-authority'
import { cn } from '@/lib/utils'
const badgeVariants = cva(
'inline-flex items-center rounded-md border px-2 py-0.5 text-xs font-medium transition-colors',
{
variants: {
variant: {
default: 'border-transparent bg-primary text-primary-foreground',
secondary: 'border-transparent bg-secondary text-secondary-foreground',
outline: 'text-foreground',
},
},
defaultVariants: {
variant: 'default',
},
},
)
function Badge({
className,
variant,
...props
}: React.ComponentProps<'div'> & VariantProps<typeof badgeVariants>) {
return <div className={cn(badgeVariants({ variant }), className)} {...props} />
}
export { Badge, badgeVariants }
================================================
FILE: awesome_webui/src/components/ui/button.tsx
================================================
/* eslint-disable react-refresh/only-export-components */
import * as React from 'react'
import { Slot } from '@radix-ui/react-slot'
import { cva, type VariantProps } from 'class-variance-authority'
import { cn } from '@/lib/utils'
const buttonVariants = cva(
'inline-flex items-center justify-center gap-2 whitespace-nowrap rounded-md text-sm font-medium transition-colors disabled:pointer-events-none disabled:opacity-50 outline-none focus-visible:ring-2 focus-visible:ring-ring/70 focus-visible:ring-offset-2 focus-visible:ring-offset-background',
{
variants: {
variant: {
default: 'bg-primary text-primary-foreground hover:bg-primary/90',
destructive: 'bg-destructive text-destructive-foreground hover:bg-destructive/90',
outline: 'border bg-card hover:bg-accent hover:text-accent-foreground',
secondary: 'bg-secondary text-secondary-foreground hover:bg-secondary/80',
ghost: 'hover:bg-accent hover:text-accent-foreground',
},
size: {
default: 'h-9 px-4 py-2',
sm: 'h-8 rounded-md px-3 text-xs',
lg: 'h-11 rounded-md px-6',
icon: 'size-9',
},
},
defaultVariants: {
variant: 'default',
size: 'default',
},
},
)
type ButtonProps = React.ComponentProps<'button'> &
VariantProps<typeof buttonVariants> & {
asChild?: boolean
}
function Button({ className, variant, size, asChild = false, ...props }: ButtonProps) {
const Comp = asChild ? Slot : 'button'
return <Comp className={cn(buttonVariants({ variant, size, className }))} {...props} />
}
export { Button, buttonVariants }
================================================
FILE: awesome_webui/src/components/ui/card.tsx
================================================
import * as React from 'react'
import { cn } from '@/lib/utils'
function Card({ className, ...props }: React.ComponentProps<'div'>) {
return (
<div
data-slot="card"
className={cn('rounded-xl border bg-card text-card-foreground shadow-sm', className)}
{...props}
/>
)
}
function CardHeader({ className, ...props }: React.ComponentProps<'div'>) {
return <div className={cn('flex flex-col space-y-1.5 p-6', className)} {...props} />
}
function CardTitle({ className, ...props }: React.ComponentProps<'div'>) {
return <div className={cn('text-base font-semibold leading-none tracking-tight', className)} {...props} />
}
function CardDescription({ className, ...props }: React.ComponentProps<'div'>) {
return <div className={cn('text-sm text-muted-foreground', className)} {...props} />
}
function CardContent({ className, ...props }: React.ComponentProps<'div'>) {
return <div className={cn('p-6 pt-0', className)} {...props} />
}
export { Card, CardContent, CardDescription, CardHeader, CardTitle }
================================================
FILE: awesome_webui/src/components/ui/collapsible.tsx
================================================
import * as CollapsiblePrimitive from '@radix-ui/react-collapsible'
const Collapsible = CollapsiblePrimitive.Root
const CollapsibleTrigger = CollapsiblePrimitive.CollapsibleTrigger
const CollapsibleContent = CollapsiblePrimitive.CollapsibleContent
export { Collapsible, CollapsibleContent, CollapsibleTrigger }
================================================
FILE: awesome_webui/src/components/ui/dialog.tsx
================================================
import * as React from 'react'
import * as DialogPrimitive from '@radix-ui/react-dialog'
import { X } from 'lucide-react'
import { cn } from '@/lib/utils'
const Dialog = DialogPrimitive.Root
const DialogTrigger = DialogPrimitive.Trigger
const DialogPortal = DialogPrimitive.Portal
const DialogClose = DialogPrimitive.Close
function DialogOverlay({
className,
...props
}: React.ComponentProps<typeof DialogPrimitive.Overlay>) {
return (
<DialogPrimitive.Overlay
className={cn('fixed inset-0 z-50 bg-black/40', className)}
{...props}
/>
)
}
function DialogContent({
className,
children,
...props
}: React.ComponentProps<typeof DialogPrimitive.Content>) {
return (
<DialogPortal>
<DialogOverlay />
<DialogPrimitive.Content
className={cn(
'fixed left-1/2 top-1/2 z-50 grid w-full max-w-lg -translate-x-1/2 -translate-y-1/2 gap-4 rounded-xl border bg-background p-6 shadow-lg duration-200',
className,
)}
{...props}
>
{children}
<DialogClose className="absolute right-4 top-4 rounded-sm opacity-70 transition-opacity hover:opacity-100 focus-visible:ring-2 focus-visible:ring-ring/70">
<X className="size-4" />
<span className="sr-only">Close</span>
</DialogClose>
</DialogPrimitive.Content>
</DialogPortal>
)
}
function DialogHeader({ className, ...props }: React.ComponentProps<'div'>) {
return <div className={cn('flex flex-col space-y-1.5 text-left', className)} {...props} />
}
function DialogFooter({ className, ...props }: React.ComponentProps<'div'>) {
return <div className={cn('flex flex-col-reverse gap-2 sm:flex-row sm:justify-end', className)} {...props} />
}
function DialogTitle({ className, ...props }: React.ComponentProps<typeof DialogPrimitive.Title>) {
return (
<DialogPrimitive.Title
className={cn('text-lg font-semibold leading-none tracking-tight', className)}
{...props}
/>
)
}
function DialogDescription({
className,
...props
}: React.ComponentProps<typeof DialogPrimitive.Description>) {
return (
<DialogPrimitive.Description
className={cn('text-sm text-muted-foreground', className)}
{...props}
/>
)
}
export {
Dialog,
DialogContent,
DialogDescription,
DialogFooter,
DialogHeader,
DialogTitle,
DialogTrigger,
}
================================================
FILE: awesome_webui/src/components/ui/label.tsx
================================================
import * as React from 'react'
import * as LabelPrimitive from '@radix-ui/react-label'
import { cn } from '@/lib/utils'
function Label({ className, ...props }: React.ComponentProps<typeof LabelPrimitive.Root>) {
return (
<LabelPrimitive.Root
className={cn('text-sm font-medium leading-none', className)}
{...props}
/>
)
}
export { Label }
================================================
FILE: awesome_webui/src/components/ui/scroll-area.tsx
================================================
import * as React from 'react'
import * as ScrollAreaPrimitive from '@radix-ui/react-scroll-area'
import { cn } from '@/lib/utils'
function ScrollArea({
className,
children,
...props
}: React.ComponentProps<typeof ScrollAreaPrimitive.Root>) {
return (
<ScrollAreaPrimitive.Root className={cn('relative overflow-hidden', className)} {...props}>
<ScrollAreaPrimitive.Viewport className="h-full w-full rounded-[inherit]">
{children}
</ScrollAreaPrimitive.Viewport>
<ScrollBar />
<ScrollAreaPrimitive.Corner />
</ScrollAreaPrimitive.Root>
)
}
function ScrollBar({
className,
orientation = 'vertical',
...props
}: React.ComponentProps<typeof ScrollAreaPrimitive.ScrollAreaScrollbar>) {
return (
<ScrollAreaPrimitive.ScrollAreaScrollbar
orientation={orientation}
className={cn(
'flex touch-none select-none p-px transition-colors',
orientation === 'vertical' && 'h-full w-2.5 border-l border-l-transparent',
orientation === 'horizontal' && 'h-2.5 flex-col border-t border-t-transparent',
className,
)}
{...props}
>
<ScrollAreaPrimitive.ScrollAreaThumb className="relative flex-1 rounded-full bg-border" />
</ScrollAreaPrimitive.ScrollAreaScrollbar>
)
}
export { ScrollArea, ScrollBar }
================================================
FILE: awesome_webui/src/components/ui/separator.tsx
================================================
import * as React from 'react'
import * as SeparatorPrimitive from '@radix-ui/react-separator'
import { cn } from '@/lib/utils'
function Separator({
className,
orientation = 'horizontal',
decorative = true,
...props
}: React.ComponentProps<typeof SeparatorPrimitive.Root>) {
return (
<SeparatorPrimitive.Root
decorative={decorative}
orientation={orientation}
className={cn(
'shrink-0 bg-border',
orientation === 'horizontal' ? 'h-px w-full' : 'h-full w-px',
className,
)}
{...props}
/>
)
}
export { Separator }
================================================
FILE: awesome_webui/src/components/ui/slider.tsx
================================================
import * as React from 'react'
import * as SliderPrimitive from '@radix-ui/react-slider'
import { cn } from '@/lib/utils'
function Slider({
className,
...props
}: React.ComponentProps<typeof SliderPrimitive.Root>) {
return (
<SliderPrimitive.Root
className={cn('relative flex w-full touch-none select-none items-center', className)}
{...props}
>
<SliderPrimitive.Track className="relative h-1.5 w-full grow overflow-hidden rounded-full bg-muted">
<SliderPrimitive.Range className="absolute h-full bg-primary" />
</SliderPrimitive.Track>
<SliderPrimitive.Thumb className="block size-4 rounded-full border border-primary/20 bg-background shadow-sm transition-colors focus-visible:ring-2 focus-visible:ring-ring/70 focus-visible:ring-offset-2 focus-visible:ring-offset-background disabled:pointer-events-none disabled:opacity-50" />
</SliderPrimitive.Root>
)
}
export { Slider }
================================================
FILE: awesome_webui/src/components/ui/switch.tsx
================================================
import * as React from 'react'
import * as SwitchPrimitive from '@radix-ui/react-switch'
import { cn } from '@/lib/utils'
function Switch({
className,
...props
}: React.ComponentProps<typeof SwitchPrimitive.Root>) {
return (
<SwitchPrimitive.Root
className={cn(
'peer inline-flex h-6 w-11 shrink-0 cursor-pointer items-center rounded-full border border-transparent bg-input shadow-xs transition-colors outline-none focus-visible:ring-2 focus-visible:ring-ring/70 focus-visible:ring-offset-2 focus-visible:ring-offset-background data-[state=checked]:bg-primary data-[state=unchecked]:bg-muted-foreground/30 disabled:cursor-not-allowed disabled:opacity-50',
className,
)}
{...props}
>
<SwitchPrimitive.Thumb
className={cn(
'pointer-events-none block size-5 rounded-full bg-background shadow-sm ring-0 transition-transform data-[state=checked]:translate-x-5 data-[state=unchecked]:translate-x-0',
)}
/>
</SwitchPrimitive.Root>
)
}
export { Switch }
================================================
FILE: awesome_webui/src/components/ui/textarea.tsx
================================================
import * as React from 'react'
import { cn } from '@/lib/utils'
function Textarea({ className, ...props }: React.ComponentProps<'textarea'>) {
return (
<textarea
className={cn(
'flex min-h-16 w-full rounded-lg border border-input bg-background px-3 py-2 text-sm shadow-xs outline-none transition-[color,box-shadow] placeholder:text-muted-foreground focus-visible:ring-2 focus-visible:ring-ring/70 disabled:cursor-not-allowed disabled:opacity-50',
className,
)}
{...props}
/>
)
}
export { Textarea }
================================================
FILE: awesome_webui/src/components/ui/toggle-group.tsx
================================================
import * as React from 'react'
import * as ToggleGroupPrimitive from '@radix-ui/react-toggle-group'
import { cva, type VariantProps } from 'class-variance-authority'
import { cn } from '@/lib/utils'
const toggleGroupItemVariants = cva(
'inline-flex items-center justify-center rounded-md text-sm font-medium transition-colors hover:bg-accent hover:text-accent-foreground focus-visible:ring-2 focus-visible:ring-ring/70 focus-visible:ring-offset-2 focus-visible:ring-offset-background disabled:pointer-events-none disabled:opacity-50 data-[state=on]:bg-primary data-[state=on]:text-primary-foreground border border-border bg-card',
{
variants: {
size: {
default: 'h-9 px-3',
sm: 'h-8 px-2.5 text-xs',
lg: 'h-10 px-4',
},
},
defaultVariants: {
size: 'default',
},
},
)
function ToggleGroup({
className,
...props
}: React.ComponentProps<typeof ToggleGroupPrimitive.Root>) {
return (
<ToggleGroupPrimitive.Root
className={cn('flex items-center gap-2', className)}
{...props}
/>
)
}
function ToggleGroupItem({
className,
size,
...props
}: React.ComponentProps<typeof ToggleGroupPrimitive.Item> &
VariantProps<typeof toggleGroupItemVariants>) {
return (
<ToggleGroupPrimitive.Item
className={cn(toggleGroupItemVariants({ size }), className)}
{...props}
/>
)
}
export { ToggleGroup, ToggleGroupItem }
================================================
FILE: awesome_webui/src/index.css
================================================
@import "tailwindcss";
:root {
--background: 0 0% 96%;
--foreground: 240 10% 3.9%;
--card: 0 0% 100%;
--card-foreground: 240 10% 3.9%;
--popover: 0 0% 100%;
--popover-foreground: 240 10% 3.9%;
--primary: 240 5.9% 10%;
--primary-foreground: 0 0% 98%;
--secondary: 240 4.8% 95.9%;
--secondary-foreground: 240 5.9% 10%;
--muted: 240 4.8% 95.9%;
--muted-foreground: 240 3.8% 46.1%;
--accent: 240 4.8% 95.9%;
--accent-foreground: 240 5.9% 10%;
--destructive: 0 72.2% 50.6%;
--destructive-foreground: 0 0% 98%;
--border: 240 5.9% 88%;
--input: 240 5.9% 88%;
--ring: 240 5% 64.9%;
--radius: 0.75rem;
}
@theme inline {
--color-background: hsl(var(--background));
--color-foreground: hsl(var(--foreground));
--color-card: hsl(var(--card));
--color-card-foreground: hsl(var(--card-foreground));
--color-popover: hsl(var(--popover));
--color-popover-foreground: hsl(var(--popover-foreground));
--color-primary: hsl(var(--primary));
--color-primary-foreground: hsl(var(--primary-foreground));
--color-secondary: hsl(var(--secondary));
--color-secondary-foreground: hsl(var(--secondary-foreground));
--color-muted: hsl(var(--muted));
--color-muted-foreground: hsl(var(--muted-foreground));
--color-accent: hsl(var(--accent));
--color-accent-foreground: hsl(var(--accent-foreground));
--color-destructive: hsl(var(--destructive));
--color-destructive-foreground: hsl(var(--destructive-foreground));
--color-border: hsl(var(--border));
--color-input: hsl(var(--input));
--color-ring: hsl(var(--ring));
--radius-sm: calc(var(--radius) - 4px);
--radius-md: calc(var(--radius) - 2px);
--radius-lg: var(--radius);
--radius-xl: calc(var(--radius) + 4px);
}
@layer base {
* {
@apply border-border;
}
html {
min-width: 320px;
}
body {
@apply bg-background text-foreground antialiased;
font-family: "Inter", "Avenir Next", "Segoe UI", sans-serif;
}
button,
input,
textarea {
font: inherit;
}
}
================================================
FILE: awesome_webui/src/main.tsx
================================================
import { StrictMode } from 'react'
import { createRoot } from 'react-dom/client'
import './index.css'
import App from './App.tsx'
createRoot(document.getElementById('root')!).render(
<StrictMode>
<App />
</StrictMode>,
)
================================================
FILE: awesome_webui/tsconfig.app.json
================================================
{
"compilerOptions": {
"tsBuildInfoFile": "./node_modules/.tmp/tsconfig.app.tsbuildinfo",
"target": "ES2022",
"useDefineForClassFields": true,
"lib": [
"ES2022",
"DOM",
"DOM.Iterable"
],
"module": "ESNext",
"types": [
"vite/client"
],
"skipLibCheck": true,
"baseUrl": ".",
"paths": {
"@/*": [
"./src/*"
]
},
"moduleResolution": "bundler",
"allowImportingTsExtensions": true,
"verbatimModuleSyntax": true,
"moduleDetection": "force",
"noEmit": true,
"jsx": "react-jsx",
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"erasableSyntaxOnly": true,
"noFallthroughCasesInSwitch": true,
"noUncheckedSideEffectImports": true
},
"include": [
"src"
]
}
================================================
FILE: awesome_webui/tsconfig.json
================================================
{
"files": [],
"references": [
{ "path": "./tsconfig.app.json" },
{ "path": "./tsconfig.node.json" }
]
}
================================================
FILE: awesome_webui/tsconfig.node.json
================================================
{
"compilerOptions": {
"tsBuildInfoFile": "./node_modules/.tmp/tsconfig.node.tsbuildinfo",
"target": "ES2023",
"lib": [
"ES2023"
],
"module": "ESNext",
"types": [
"node"
],
"skipLibCheck": true,
"baseUrl": ".",
"paths": {
"@/*": [
"./src/*"
]
},
"moduleResolution": "bundler",
"allowImportingTsExtensions": true,
"verbatimModuleSyntax": true,
"moduleDetection": "force",
"noEmit": true,
"strict": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"erasableSyntaxOnly": true,
"noFallthroughCasesInSwitch": true,
"noUncheckedSideEffectImports": true
},
"include": [
"vite.config.ts"
]
}
================================================
FILE: awesome_webui/vite.config.ts
================================================
import fs from 'node:fs'
import { defineConfig, type Plugin } from 'vite'
import react from '@vitejs/plugin-react-swc'
import tailwindcss from '@tailwindcss/vite'
import path from 'node:path'
function inlineEntryAssets(): Plugin {
let resolvedOutDir = ''
return {
name: 'inline-entry-assets',
apply: 'build',
configResolved(config) {
resolvedOutDir = path.resolve(config.root, config.build.outDir)
},
closeBundle() {
const indexHtmlPath = path.join(resolvedOutDir, 'index.html')
if (!fs.existsSync(indexHtmlPath)) {
return
}
const filesToDelete = new Set<string>()
// Keep the matched tag's letter case so inlined string contents are not altered.
const escapeInlineScript = (code: string) => code.replace(/<\/(script)/gi, '<\\/$1')
const escapeInlineStyle = (code: string) => code.replace(/<\/(style)/gi, '<\\/$1')
const normalizeFileName = (assetPath: string) =>
assetPath.replace(/^\//, '').replace(/^\.\//, '')
const readBuiltAsset = (assetPath: string) => {
const fileName = normalizeFileName(assetPath)
const absolutePath = path.join(resolvedOutDir, fileName)
if (!fs.existsSync(absolutePath)) {
return null
}
filesToDelete.add(absolutePath)
return fs.readFileSync(absolutePath, 'utf8')
}
let html = fs.readFileSync(indexHtmlPath, 'utf8')
html = html.replace(
/<link rel="modulepreload"[^>]+href="([^"]+)"[^>]*>/g,
(_fullMatch, href: string) => {
const absolutePath = path.join(resolvedOutDir, normalizeFileName(href))
if (fs.existsSync(absolutePath)) {
filesToDelete.add(absolutePath)
}
return ''
},
)
html = html.replace(
/<link rel="stylesheet"[^>]+href="([^"]+)"[^>]*>/g,
(fullMatch, href: string) => {
const assetSource = readBuiltAsset(href)
if (!assetSource) {
return fullMatch
}
return `<style>${escapeInlineStyle(assetSource)}</style>`
},
)
html = html.replace(
/<script type="module"[^>]+src="([^"]+)"[^>]*><\/script>/g,
(fullMatch, src: string) => {
const chunkCode = readBuiltAsset(src)
if (!chunkCode) {
return fullMatch
}
return `<script type="module">${escapeInlineScript(chunkCode)}</script>`
},
)
fs.writeFileSync(indexHtmlPath, html)
for (const filePath of filesToDelete) {
fs.rmSync(filePath, { force: true })
}
fs.rmSync(path.join(resolvedOutDir, 'vite.svg'), { force: true })
fs.rmSync(path.join(resolvedOutDir, 'assets'), { recursive: true, force: true })
},
}
}
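// Illustrative sketch, not part of the upstream config: why the plugin above escapes
// `</script` before inlining a chunk. An HTML parser ends a <script> element at the first
// `</script>` end tag it sees, even when that text sits inside a JavaScript string, so the
// inlined code must not contain the sequence. `escapeForInlineScript` is a hypothetical
// standalone helper built on the same idea as `escapeInlineScript` in `closeBundle`; the
// capture group keeps the tag's original letter case.
export function escapeForInlineScript(code: string): string {
  // Inside a JavaScript string literal `\/` evaluates to `/`, so the string's runtime
  // value is unchanged while the HTML parser no longer sees an end tag.
  return code.replace(/<\/(script)/gi, '<\\/$1')
}
// Example: escapeForInlineScript('const s = "</script>"') returns 'const s = "<\\/script>"'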
// https://vite.dev/config/
export default defineConfig({
plugins: [react(), tailwindcss(), inlineEntryAssets()],
publicDir: false,
resolve: {
alias: {
'@': path.resolve(__dirname, './src'),
},
},
build: {
assetsInlineLimit: Number.MAX_SAFE_INTEGER,
cssCodeSplit: false,
modulePreload: false,
rollupOptions: {
output: {
inlineDynamicImports: true,
},
},
},
server: {
proxy: {
'/v1': 'http://localhost:8888',
'/v2': 'http://localhost:8888',
'/health': 'http://localhost:8888',
},
},
})
================================================
FILE: compose.base.yml
================================================
services:
app-base:
build:
context: .
dockerfile: docker/Dockerfile
args:
BACKEND: ${BACKEND:-cuda} # or cpu
UV_VERSION: ${UV_VERSION:-0.8.15}
volumes:
- ./checkpoints:/app/checkpoints
- ./references:/app/references
environment:
COMPILE: ${COMPILE:-0}
# GPU (remove this block if CPU-only):
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
tty: true
stdin_open: true
================================================
FILE: compose.yml
================================================
name: fish-speech
services:
webui:
extends:
file: compose.base.yml
service: app-base
build:
target: webui
environment:
COMPILE: ${COMPILE:-0}
profiles: ["webui"]
ports:
- "${GRADIO_PORT:-7860}:7860"
server:
extends:
file: compose.base.yml
service: app-base
build:
target: server
environment:
COMPILE: ${COMPILE:-0}
profiles: ["server"]
ports:
- "${API_PORT:-8080}:8080"
================================================
FILE: docker/Dockerfile
================================================
# docker/Dockerfile
# IMPORTANT: The Docker images do not contain the checkpoints. Mount the checkpoints into the container at runtime.
# Build the image:
# docker build \
# --platform linux/amd64 \
# -f docker/Dockerfile \
# --build-arg BACKEND=[cuda, cpu] \
# --target [webui, server] \
# -t fish-speech-[webui, server]:[cuda, cpu] .
# e.g. for building the webui:
# docker build \
# --platform linux/amd64 \
# -f docker/Dockerfile \
# --build-arg BACKEND=cuda \
# --target webui \
# -t fish-speech-webui:cuda .
# e.g. for building the server:
# docker build \
# --platform linux/amd64 \
# -f docker/Dockerfile \
# --build-arg BACKEND=cuda \
# --target server \
# -t fish-speech-server:cuda .
# Multi-platform build:
# docker buildx build \
# --platform linux/amd64,linux/arm64 \
# -f docker/Dockerfile \
# --build-arg BACKEND=cpu \
# --target webui \
# -t fish-speech-webui:cpu .
# Running the image interactively:
# docker run \
# --gpus all \
# -v /path/to/fish-speech/checkpoints:/app/checkpoints \
# -e COMPILE=1 \ ... or -e COMPILE=0 \
# -it fish-speech-[webui, server]:[cuda, cpu]
# E.g. running the webui:
# docker run \
# --gpus all \
# -v ./checkpoints:/app/checkpoints \
# -e COMPILE=1 \
# -p 7860:7860 \
# fish-speech-webui:cuda
# E.g. running the server:
# docker run \
# --gpus all \
# -v ./checkpoints:/app/checkpoints \
# -p 8080:8080 \
# -it fish-speech-server:cuda
# Select the specific cuda version (see https://hub.docker.com/r/nvidia/cuda/)
ARG CUDA_VER=12.6.0
# Adapt the uv extra to fit the cuda version (one of [cu126, cu128, cu129])
ARG UV_EXTRA=cu126
ARG BACKEND=cuda
ARG UBUNTU_VER=24.04
ARG PY_VER=3.12
ARG UV_VERSION=0.8.15
# Create non-root user early for security
ARG USERNAME=fish
ARG USER_UID=1000
ARG USER_GID=1000
##############################################################
# Base stage per backend
##############################################################
# --- CUDA (x86_64) ---
FROM nvidia/cuda:${CUDA_VER}-cudnn-runtime-ubuntu${UBUNTU_VER} AS base-cuda
ENV DEBIAN_FRONTEND=noninteractive
# Install system dependencies in a single layer with cleanup
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
set -eux \
&& rm -f /etc/apt/apt.conf.d/docker-clean \
&& echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' >/etc/apt/apt.conf.d/keep-cache \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
python3-pip \
python3-dev \
git \
ca-certificates \
curl \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# --- CPU-only (portable x86_64) ---
FROM python:${PY_VER}-slim AS base-cpu
ENV UV_EXTRA=cpu
# Install system dependencies in a single layer with cleanup
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
set -eux \
&& rm -f /etc/apt/apt.conf.d/docker-clean \
&& echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' >/etc/apt/apt.conf.d/keep-cache \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
git \
ca-certificates \
curl \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
##############################################################
# UV stage
##############################################################
ARG UV_VERSION
FROM ghcr.io/astral-sh/uv:${UV_VERSION} AS uv-bin
##############################################################
# Shared app base stage
##############################################################
FROM base-${BACKEND} AS app-base
ARG PY_VER
ARG BACKEND
ARG USERNAME
ARG USER_UID
ARG USER_GID
ARG UV_VERSION
ARG UV_EXTRA
ENV BACKEND=${BACKEND} \
DEBIAN_FRONTEND=noninteractive \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
# System dependencies for audio processing
ARG DEPENDENCIES=" \
libsox-dev \
build-essential \
cmake \
libasound-dev \
portaudio19-dev \
libportaudio2 \
libportaudiocpp0 \
ffmpeg"
# Install system dependencies with caching and cleanup
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
set -eux \
&& rm -f /etc/apt/apt.conf.d/docker-clean \
&& echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' >/etc/apt/apt.conf.d/keep-cache \
&& apt-get update \
&& apt-get install -y --no-install-recommends ${DEPENDENCIES} \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Install specific uv version
COPY --from=uv-bin /uv /uvx /bin/
# RUN groupadd --gid ${USER_GID} ${USERNAME} \
# && useradd --uid ${USER_UID} --gid ${USER_GID} -m ${USERNAME} \
# && mkdir -p /app /home/${USERNAME}/.cache \
# && chown -R ${USERNAME}:${USERNAME} /app /home/${USERNAME}/.cache
# Create non-root user (or use existing user)
RUN set -eux; \
if getent group ${USER_GID} >/dev/null 2>&1; then \
echo "Group ${USER_GID} already exists"; \
else \
groupadd -g ${USER_GID} ${USERNAME}; \
fi; \
if id -u ${USER_UID} >/dev/null 2>&1; then \
echo "User ${USER_UID} already exists, using existing user"; \
EXISTING_USER=$(id -un ${USER_UID}); \
mkdir -p /app /home/${EXISTING_USER}/.cache; \
chown -R ${USER_UID}:${USER_GID} /app /home/${EXISTING_USER}/.cache; \
else \
useradd -m -u ${USER_UID} -g ${USER_GID} ${USERNAME}; \
mkdir -p /app /home/${USERNAME}/.cache; \
chown -R ${USERNAME}:${USERNAME} /app /home/${USERNAME}/.cache; \
fi
# Create references directory with proper permissions for the non-root user
RUN mkdir -p /app/references \
&& chown -R ${USER_UID}:${USER_GID} /app/references \
&& chmod 755 /app/references
# Set working directory
WORKDIR /app
# Copy dependency files first for better caching
COPY --chown=${USER_UID}:${USER_GID} pyproject.toml uv.lock README.md ./
# Switch to non-root user for package installation
USER ${USER_UID}:${USER_GID}
# Install Python dependencies (cacheable by lockfiles)
# Point uv at the mounted cache path so the cache mount is actually used, regardless of username
RUN --mount=type=cache,target=/tmp/uv-cache,uid=${USER_UID},gid=${USER_GID} \
    UV_CACHE_DIR=/tmp/uv-cache uv python pin ${PY_VER} \
    && UV_CACHE_DIR=/tmp/uv-cache uv sync --extra ${UV_EXTRA} --frozen --no-install-project
# Copy application code
COPY --chown=${USER_UID}:${USER_GID} . .
# Install the local package after copying source code
RUN uv sync --extra ${UV_EXTRA} --frozen
# Create common entrypoint script
RUN printf '%s\n' \
'#!/bin/bash' \
'set -euo pipefail' \
'' \
'# Set user info from build args' \
'USER_UID='${USER_UID} \
'USER_GID='${USER_GID} \
'' \
'# Logging function' \
'log() { echo "[$(date +"%Y-%m-%d %H:%M:%S")] $*" >&2; }' \
'' \
'# Validate environment' \
'validate_env() {' \
' if [ ! -d "/app/checkpoints" ]; then' \
' log "WARNING: /app/checkpoints directory not found. Please mount your checkpoints."' \
' fi' \
' if [ ! -d "/app/references" ]; then' \
' log "WARNING: /app/references directory not found. Please mount your references."' \
' else' \
' # Check if we can write to references directory' \
' if [ ! -w "/app/references" ]; then' \
' log "ERROR: Cannot write to /app/references directory. Please ensure the mounted directory has proper permissions for user with UID ${USER_UID}."' \
' log "You can fix this by running: sudo chown -R ${USER_UID}:${USER_GID} /path/to/your/references"' \
' exit 1' \
' fi' \
' fi' \
'}' \
'' \
'# Build device arguments' \
'build_device_args() {' \
' if [ "${BACKEND:-}" = "cpu" ]; then' \
' echo "--device cpu"' \
' fi' \
'}' \
'' \
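'# Build compile arguments' \
'build_compile_args() {' \
'    # Drop the first argument only when it is the literal "compile" keyword' \
'    if [ "${1:-}" = "compile" ]; then' \
'        echo "--compile"' \
'        shift' \
'    elif [ "${COMPILE:-}" = "1" ] || [ "${COMPILE:-}" = "true" ]; then' \
'        echo "--compile"' \
'    fi' \
'    echo "$@"' \
'}' \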
'' \
'# Health check function' \
'health_check() {' \
' local port=${1:-7860}' \
' local endpoint=${2:-/health}' \
' curl -f http://localhost:${port}${endpoint} 2>/dev/null || exit 1' \
'}' \
> /app/common.sh && chmod +x /app/common.sh
##############################################################
# App stages
##############################################################
# Gradio WebUI
FROM app-base AS webui
ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1
ARG GRADIO_SERVER_NAME="0.0.0.0"
ARG GRADIO_SERVER_PORT=7860
ARG LLAMA_CHECKPOINT_PATH="checkpoints/s2-pro"
ARG DECODER_CHECKPOINT_PATH="checkpoints/s2-pro/codec.pth"
ARG DECODER_CONFIG_NAME="modded_dac_vq"
# Expose port
EXPOSE ${GRADIO_SERVER_PORT}
# Set environment variables
ENV GRADIO_SERVER_NAME=${GRADIO_SERVER_NAME}
ENV GRADIO_SERVER_PORT=${GRADIO_SERVER_PORT}
ENV LLAMA_CHECKPOINT_PATH=${LLAMA_CHECKPOINT_PATH}
ENV DECODER_CHECKPOINT_PATH=${DECODER_CHECKPOINT_PATH}
ENV DECODER_CONFIG_NAME=${DECODER_CONFIG_NAME}
# Create webui entrypoint
RUN printf '%s\n' \
'#!/bin/bash' \
'source /app/common.sh' \
'' \
'log "Starting Fish Speech WebUI..."' \
'validate_env' \
'' \
'DEVICE_ARGS=$(build_device_args)' \
'COMPILE_ARGS=$(build_compile_args "$@")' \
'' \
'log "Device args: ${DEVICE_ARGS:-none}"' \
'log "Compile args: ${COMPILE_ARGS}"' \
'log "Server: ${GRADIO_SERVER_NAME}:${GRADIO_SERVER_PORT}"' \
'' \
'exec uv run tools/run_webui.py \' \
' --llama-checkpoint-path "${LLAMA_CHECKPOINT_PATH}" \' \
' --decoder-checkpoint-path "${DECODER_CHECKPOINT_PATH}" \' \
' --decoder-config-name "${DECODER_CONFIG_NAME}" \' \
' ${DEVICE_ARGS} ${COMPILE_ARGS}' \
> /app/start_webui.sh && chmod +x /app/start_webui.sh
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:${GRADIO_SERVER_PORT}/health || exit 1
ENTRYPOINT ["/app/start_webui.sh"]
# API Server
FROM app-base AS server
ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1
ARG API_SERVER_NAME="0.0.0.0"
ARG API_SERVER_PORT=8080
ARG LLAMA_CHECKPOINT_PATH="checkpoints/s2-pro"
ARG DECODER_CHECKPOINT_PATH="checkpoints/s2-pro/codec.pth"
ARG DECODER_CONFIG_NAME="modded_dac_vq"
# Expose port
EXPOSE ${API_SERVER_PORT}
# Set environment variables
ENV API_SERVER_NAME=${API_SERVER_NAME}
ENV API_SERVER_PORT=${API_SERVER_PORT}
ENV LLAMA_CHECKPOINT_PATH=${LLAMA_CHECKPOINT_PATH}
ENV DECODER_CHECKPOINT_PATH=${DECODER_CHECKPOINT_PATH}
ENV DECODER_CONFIG_NAME=${DECODER_CONFIG_NAME}
# Create server entrypoint
RUN printf '%s\n' \
'#!/bin/bash' \
'source /app/common.sh' \
'' \
'log "Starting Fish Speech API Server..."' \
'validate_env' \
'' \
'DEVICE_ARGS=$(build_device_args)' \
'COMPILE_ARGS=$(build_compile_args "$@")' \
'' \
'log "Device args: ${DEVICE_ARGS:-none}"' \
'log "Compile args: ${COMPILE_ARGS}"' \
'log "Server: ${API_SERVER_NAME}:${API_SERVER_PORT}"' \
'' \
'exec uv run tools/api_server.py \' \
' --listen "${API_SERVER_NAME}:${API_SERVER_PORT}" \' \
' --llama-checkpoint-path "${LLAMA_CHECKPOINT_PATH}" \' \
' --decoder-checkpoint-path "${DECODER_CHECKPOINT_PATH}" \' \
' --decoder-config-name "${DECODER_CONFIG_NAME}" \' \
' ${DEVICE_ARGS} ${COMPILE_ARGS}' \
> /app/start_server.sh && chmod +x /app/start_server.sh
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:${API_SERVER_PORT}/v1/health || exit 1
ENTRYPOINT ["/app/start_server.sh"]
# Development stage
FROM app-base AS dev
USER root
# Install development tools
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
apt-get update \
&& apt-get install -y --no-install-recommends \
vim \
htop \
strace \
gdb \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
USER ${USER_UID}:${USER_GID}
# Install development dependencies
RUN uv sync --extra ${UV_EXTRA} --dev
# Default to bash for development
ENTRYPOINT ["/bin/bash"]
================================================
FILE: dockerfile.dev
================================================
ARG VERSION=dev
ARG BASE_IMAGE=ghcr.io/fishaudio/fish-speech:${VERSION}
FROM ${BASE_IMAGE}
ARG TOOLS=" \
git \
curl \
build-essential \
ffmpeg \
libsm6 \
libxext6 \
libjpeg-dev \
zlib1g-dev \
aria2 \
zsh \
openssh-server \
sudo \
protobuf-compiler \
libasound-dev \
portaudio19-dev \
libportaudio2 \
libportaudiocpp0 \
cmake"
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
--mount=type=cache,target=/var/lib/apt,sharing=locked \
set -ex \
&& apt-get update \
&& apt-get -y install --no-install-recommends ${TOOLS}
# Install oh-my-zsh so your terminal looks nice
RUN sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended
# Set zsh as default shell
RUN chsh -s /usr/bin/zsh
ENV SHELL=/usr/bin/zsh
================================================
FILE: docs/CNAME
================================================
speech.fish.audio
================================================
FILE: docs/README.ar.md
================================================
<div align="center">
<h1>Fish Speech</h1>
[English](../README.md) | [简体中文](README.zh.md) | [Portuguese](README.pt-BR.md) | [日本語](README.ja.md) | [한국어](README.ko.md) | **العربية** <br>
<a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish-audio-s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish Audio S1 - Expressive Voice Cloning and Text-to-Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
<a href="https://trendshift.io/repositories/7014" target="_blank">
<img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
</a>
<br>
</div>
<br>
<div align="center">
<img src="https://count.getloli.com/get/@fish-speech?theme=asoul" /><br>
</div>
<br>
<div align="center">
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
<img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
</a>
<a target="_blank" href="https://hub.docker.com/r/fishaudio/fish-speech">
<img alt="Docker" src="https://img.shields.io/docker/pulls/fishaudio/fish-speech?style=flat-square&logo=docker"/>
</a>
<a target="_blank" href="https://pd.qq.com/s/bwxia254o">
<img alt="QQ Channel" src="https://img.shields.io/badge/QQ-blue?logo=tencentqq">
</a>
</div>
<div align="center">
<a target="_blank" href="https://huggingface.co/fishaudio/s2">
<img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
</a>
<a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
<img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
</a>
<a target="_blank" href="https://arxiv.org/abs/2603.08823">
<img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
</a>
</div>
> [!IMPORTANT]
> **إشعار الترخيص**
> يتم إصدار قاعدة الأكواد هذه وأوزان النماذج المرتبطة بها تحت **[FISH AUDIO RESEARCH LICENSE](../LICENSE)**. يرجى الرجوع إلى ملف [LICENSE](../LICENSE) لمزيد من التفاصيل.
> [!WARNING]
> **إخلاء المسؤولية القانونية**
> نحن لا نتحمل أي مسؤولية عن أي استخدام غير قانوني لقاعدة الأكواد. يرجى الرجوع إلى القوانين المحلية المتعلقة بـ DMCA والقوانين الأخرى ذات الصلة.
## البداية السريعة
### روابط التوثيق
هذا هو التوثيق الرسمي لـ Fish Audio S2، يرجى اتباع التعليمات للبدء بسهولة.
- [التثبيت](https://speech.fish.audio/ar/install/)
- [الاستدلال عبر خط الأوامر](https://speech.fish.audio/ar/inference/)
- [الاستدلال عبر واجهة الويب](https://speech.fish.audio/ar/inference/)
- [استدلال الخادم](https://speech.fish.audio/ar/server/)
- [نشر Docker](https://speech.fish.audio/ar/install/)
> [!IMPORTANT]
> **إذا كنت ترغب في استخدام خادم SGLang، فيرجى الرجوع إلى [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).**
### دليل وكيل LLM
```
يرجى قراءة https://speech.fish.audio/ar/install/ أولاً، وتثبيت وتكوين Fish Audio S2 وفقاً للوثائق.
```
## Fish Audio S2 Pro
**نظام تحويل النص إلى كلام (TTS) متعدد اللغات الرائد في الصناعة، والذي يعيد تعريف حدود توليد الصوت.**
Fish Audio S2 Pro هو أحدث طراز متعدد الوسائط تم تطويره بواسطة [Fish Audio](https://fish.audio/). تم تدريبه على أكثر من **10 ملايين ساعة** من البيانات الصوتية الهائلة، التي تغطي أكثر من **80 لغة** حول العالم. من خلال بنية **ثنائية الانحدار الذاتي (Dual-AR)** المبتكرة وتقنية توافق التعلم التعزيزي (RL)، يمكن لـ S2 Pro توليد كلام يتمتع بإحساس طبيعي وواقعي وعمق عاطفي كبير، مما يجعله رائداً في المنافسة بين الأنظمة المفتوحة والمغلقة المصدر.
تكمن القوة الضاربة لـ S2 Pro في دعمه للتحكم الدقيق للغاية في النبرة والعاطفة على مستوى **ما دون الكلمة (Sub-word Level)** من خلال وسوم اللغة الطبيعية (مثل `[whisper]` و `[excited]` و `[angry]`) ، مع دعم أصلي لتوليد متحدثين متعددين وحوارات متعددة الجولات بسياق طويل جداً.
تفضل بزيارة [موقع Fish Audio الرسمي](https://fish.audio/) الآن لتجربة العرض المباشر، أو اقرأ [تقريرنا الفني](https://arxiv.org/abs/2603.08823) و[مقال المدونة](https://fish.audio/blog/fish-audio-open-sources-s2/) للتعرف على المزيد.
### متغيرات النموذج
| النموذج | الحجم | التوفر | الوصف |
|------|------|-------------|-------------|
| S2-Pro | 4 مليار معلمة | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | النموذج الرائد كامل الميزات، مع أعلى جودة واستقرار |
لمزيد من التفاصيل حول النماذج، يرجى مراجعة [التقرير الفني](https://arxiv.org/abs/2411.01156).
## نتائج الاختبارات المرجعية (Benchmarks)
| الاختبار | Fish Audio S2 |
|------|------|
| Seed-TTS Eval — WER (الصينية) | **0.54%** (الأفضل إجمالاً) |
| Seed-TTS Eval — WER (الإنجليزية) | **0.99%** (الأفضل إجمالاً) |
| Audio Turing Test (مع التعليمات) | **0.515** متوسط خلفي (Posterior mean) |
| EmergentTTS-Eval — معدل الفوز | **81.88%** (الأعلى إجمالاً) |
| Fish Instruction Benchmark — TAR | **93.3%** |
| Fish Instruction Benchmark — الجودة | **4.51 / 5.0** |
| متعدد اللغات (MiniMax Testset) — أفضل WER | **11** لغة من أصل **24** |
| متعدد اللغات (MiniMax Testset) — أفضل SIM | **17** لغة من أصل **24** |
في تقييم Seed-TTS، حقق S2 أقل معدل خطأ في الكلمات (WER) بين جميع النماذج التي تم تقييمها (بما في ذلك الأنظمة مغلقة المصدر): Qwen3-TTS (0.77/1.24)، و MiniMax Speech-02 (0.99/1.90)، و Seed-TTS (1.12/2.25). وفي اختبار Audio Turing Test، سجل S2 قيمة 0.515 بزيادة قدرها 24% مقارنة بـ Seed-TTS (0.417) و 33% مقارنة بـ MiniMax-Speech (0.387). وفي EmergentTTS-Eval، تميز S2 بشكل خاص في أبعاد مثل اللغويات المصاحبة (معدل فوز 91.61%)، والجمل الاستفهامية (84.41%)، والتعقيد النحوي (83.39%).
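يمكن التحقق من النسب المذكورة في الفقرة السابقة بحساب بسيط. ما يلي رسم توضيحي مصغّر فقط: الدالة `relative_gain` اسم افتراضي لغرض الشرح وليست جزءاً من Fish Speech، أما القيم فمأخوذة من النص أعلاه.

```python
def relative_gain(score: float, baseline: float) -> float:
    """Return the relative improvement of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


# Audio Turing Test posterior means quoted in the paragraph above.
s2_score = 0.515
baselines = {"Seed-TTS": 0.417, "MiniMax-Speech": 0.387}

# relative_gain is a hypothetical helper used only for this illustration.
gains = {name: relative_gain(s2_score, value) for name, value in baselines.items()}

for name, gain in gains.items():
    print(f"S2 vs {name}: +{gain:.1f}%")
```

الناتج نحو 23.5% و33.1%، وهو ما يتوافق بعد التقريب مع نسبتي 24% و33% المذكورتين.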
## أبرز المميزات
<img src="./assets/totalability.png" width=200%>
### تحكم دقيق للغاية عبر اللغة الطبيعية
يمنح S2 Pro الصوت "روحاً" لا مثيل لها. من خلال صيغة `[tag]` البسيطة، يمكنك تضمين تعليمات عاطفية بدقة في أي موضع من النص.
- **دعم أكثر من 15,000 وسم فريد**: لا يقتصر على الإعدادات المسبقة الثابتة، بل يدعم **أوصاف النص الحر**. يمكنك تجربة `[whisper in small voice]` (همس بصوت منخفض)، أو `[professional broadcast tone]` (نبرة إذاعية احترافية)، أو `[pitch up]` (رفع طبقة الصوت).
- **مكتبة عواطف غنية**:
`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`
### بنية مبتكرة ثنائية الانحدار الذاتي (Dual-Autoregressive)
يعتمد S2 Pro بنية Dual-AR بنظام "رئيسي-تابع"، تتكون من Decoder-only Transformer وترميز صوتي RVQ (10 قواميس أكواد، بمعدل إطارات يبلغ حوالي 21 هرتز):
- **Slow AR (4 مليار معلمة)**: يعمل على طول المحور الزمني، ويتنبأ بقاموس الأكواد الدلالي الأساسي.
- **Fast AR (400 مليون معلمة)**: يولد الـ 9 قواميس المتبقية في كل خطوة زمنية، لاستعادة أدق التفاصيل الصوتية ببراعة.
يحقق هذا التصميم غير المتماثل أقصى درجات الدقة الصوتية مع زيادة سرعة الاستدلال بشكل كبير.
### توافق التعلم التعزيزي (RL Alignment)
يستخدم S2 Pro تقنية **Group Relative Policy Optimization (GRPO)** للتوافق بعد التدريب. نستخدم نفس مجموعة النماذج المستخدمة في تنظيف البيانات وتصنيفها مباشرة كنماذج مكافأة (Reward Model)، مما يحل بشكل مثالي مشكلة عدم التطابق بين توزيع بيانات ما قبل التدريب وأهداف ما بعد التدريب.
- **إشارات مكافأة متعددة الأبعاد**: تقييم شامل للدقة الدلالية، والقدرة على اتباع التعليمات، وتسجيل التفضيل الصوتي، وتماثل نبرة الصوت، لضمان أن كل ثانية من الكلام المولد تتوافق مع الحدس البشري.
### أداء استدلال تدفقي فائق (يعتمد على SGLang)
نظراً لأن بنية Dual-AR تتماثل هيكلياً مع بنية LLM القياسية، فإن S2 Pro يدعم أصلاً جميع ميزات تسريع الاستدلال في SGLang، بما في ذلك الدفعات المستمرة (Continuous Batching)، و Paged KV Cache، و CUDA Graph، والتخزين المؤقت للبادئة القائم على RadixAttention.
**أداء وحدة معالجة رسومات NVIDIA H200 واحدة:**
- **عامل الوقت الحقيقي (RTF)**: 0.195
- **تأخر الصوت الأول (TTFA)**: حوالي 100 مللي ثانية
- **إنتاجية فائقة السرعة**: تصل إلى 3000+ وسم صوتي/ثانية مع الحفاظ على RTF < 0.5
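يُعرَّف عامل الوقت الحقيقي (RTF) عادةً بأنه زمن التوليد مقسوماً على مدة الصوت الناتج، وبالتالي تعني القيمة الأقل من 1 أن التوليد أسرع من الزمن الحقيقي. المثال التالي رسم تقريبي يفترض هذا التعريف الشائع، وأسماء الدوال فيه افتراضية لغرض الشرح وليست جزءاً من Fish Speech.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF under the common definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.195) -> float:
    """Estimate the time needed to generate `audio_seconds` of speech.

    The default rtf is the single-H200 figure quoted in the list above;
    both function names are hypothetical and exist only for this sketch.
    """
    return audio_seconds * rtf


one_minute = estimated_synthesis_seconds(60.0)
print(f"60 s of audio at RTF 0.195 -> about {one_minute:.1f} s of compute")
```

وفق هذا الافتراض، يستغرق توليد دقيقة واحدة من الصوت نحو 11.7 ثانية عند RTF يساوي 0.195.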
### دعم قوي للغات المتعددة
يدعم S2 Pro أكثر من 80 لغة، مما يتيح تركيباً عالي الجودة دون الحاجة إلى وحدات صوتية (phonemes) أو معالجة محددة لكل لغة:
- **المستوى الأول (Tier 1)**: اليابانية (ja)، الإنجليزية (en)، الصينية (zh)
- **المستوى الثاني (Tier 2)**: الكورية (ko)، الإسبانية (es)، البرتغالية (pt)، العربية (ar)، الروسية (ru)، الفرنسية (fr)، الألمانية (de)
- **تغطية عالمية**: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, xsl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo والمزيد.
### توليد متحدثين متعددين أصلي
<img src="./assets/chattemplate.png" width=200%>
يسمح Fish Audio S2 للمستخدمين بتحميل عينة مرجعية تحتوي على متحدثين متعددين، وسيقوم النموذج بمعالجة ميزات كل متحدث عبر وسم `<|speaker:i|>`. بعد ذلك، يمكنك التحكم في أداء النموذج عبر وسم معرف المتحدث، مما يتيح لتوليد واحد أن يتضمن متحدثين متعددين. لم تعد هناك حاجة لتحميل عينة مرجعية منفصلة وتوليد صوت لكل متحدث على حدة كما كان في السابق.
### توليد حوارات متعددة الجولات
بفضل توسيع سياق النموذج، يمكن لنموذجنا الآن الاستفادة من المعلومات السابقة لتحسين التعبير في المحتوى المولد لاحقاً، مما يعزز من طبيعية المحتوى.
### استنساخ الصوت السريع
يدعم Fish Audio S2 استنساخاً دقيقاً للصوت باستخدام عينات مرجعية قصيرة (عادةً 10-30 ثانية). يلتقط النموذج نبرة الصوت وأسلوب الكلام والميول العاطفية، مما يولد أصواتاً مستنسخة واقعية ومتسقة دون الحاجة إلى ضبط دقيق إضافي.
لاستخدام خادم SGLang، يرجى الرجوع إلى [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
---
## شكر وتقدير
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
- [Qwen3](https://github.com/QwenLM/Qwen3)
## التقرير الفني
```bibtex
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}
@misc{liao2026fishaudios2technical,
title={Fish Audio S2 Technical Report},
author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
year={2026},
eprint={2603.08823},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2603.08823},
}
```
================================================
FILE: docs/README.ja.md
================================================
<div align="center">
<h1>Fish Speech</h1>
[English](../README.md) | [简体中文](README.zh.md) | [Portuguese](README.pt-BR.md) | **日本語** | [한국어](README.ko.md) | [العربية](README.ar.md) <br>
<a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish-audio-s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish Audio S1 - Expressive Voice Cloning and Text-to-Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
<a href="https://trendshift.io/repositories/7014" target="_blank">
<img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
</a>
<br>
</div>
<br>
<div align="center">
<img src="https://count.getloli.com/get/@fish-speech?theme=asoul" /><br>
</div>
<br>
<div align="center">
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
<img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
</a>
<a target="_blank" href="https://hub.docker.com/r/fishaudio/fish-speech">
<img alt="Docker" src="https://img.shields.io/docker/pulls/fishaudio/fish-speech?style=flat-square&logo=docker"/>
</a>
<a target="_blank" href="https://pd.qq.com/s/bwxia254o">
<img alt="QQ Channel" src="https://img.shields.io/badge/QQ-blue?logo=tencentqq">
</a>
</div>
<div align="center">
<a target="_blank" href="https://huggingface.co/fishaudio/s2">
<img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
</a>
<a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
<img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
</a>
<a target="_blank" href="https://arxiv.org/abs/2603.08823">
<img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
</a>
</div>
> [!IMPORTANT]
> **ライセンス注意事項**
> このコードベースおよび関連するモデルウェイトは **[FISH AUDIO RESEARCH LICENSE](../LICENSE)** の下でリリースされています。詳細については [LICENSE](../LICENSE) をご参照ください。
> [!WARNING]
> **法的免責事項**
> 私たちはコードベースの不法な使用について一切の責任を負いません。DMCA 及びその他の関連法律について、現地の法律をご参照ください。
## クイックスタート
### ドキュメント入口
Fish Audio S2 の公式ドキュメントです。以下からすぐに始められます。
- [インストール](https://speech.fish.audio/ja/install/)
- [コマンドライン推論](https://speech.fish.audio/ja/inference/)
- [WebUI 推論](https://speech.fish.audio/ja/inference/)
- [サーバー推論](https://speech.fish.audio/ja/server/)
- [Docker デプロイ](https://speech.fish.audio/ja/install/)
> [!IMPORTANT]
> **SGLang サーバーについては [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) を参照してください。**
### LLM Agent ガイド
```
https://speech.fish.audio/ja/install/ の手順に従って、Fish Audio S2 をインストール・設定してください。
```
## Fish Audio S2 Pro
**業界最先端の多言語テキスト読み上げ (TTS) システム。音声生成の限界を再定義します。**
Fish Audio S2 Pro は [Fish Audio](https://fish.audio/) が開発した最高峰のマルチモーダルモデルです。世界 **80 言語以上**、**1,000 万時間** を超える膨大な音声データで学習されています。革新的な **二重自己回帰 (Dual-AR)** アーキテクチャと強化学習 (RL) アライメント技術を組み合わせることで、極めて自然でリアル、かつ感情豊かな音声を生成し、オープンソースおよびクローズドソースの双方でリーダーシップを発揮しています。
S2 Pro の最大の特徴は、自然言語タグ(例:`[whisper]`、`[excited]`、`[angry]`)による韻律や感情の **サブワードレベル (Sub-word Level)** での極めて細やかなインライン制御が可能である点です。また、マルチスピーカー生成や長文コンテキストのマルチターン対話生成にもネイティブ対応しています。
今すぐ [Fish Audio 公式サイト](https://fish.audio/) でプレイグラウンドを体験するか、[技術レポート](https://arxiv.org/abs/2603.08823) や [ブログ記事](https://fish.audio/blog/fish-audio-open-sources-s2/) を読んで詳細を確認してください。
### モデルバリアント
| モデル | サイズ | 利用可能性 | 説明 |
|------|------|-------------|-------------|
| S2-Pro | 4B パラメータ | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | 品質と安定性を最大化した、フル機能のフラッグシップモデル |
モデルの詳細は[技術レポート](https://arxiv.org/abs/2411.01156)をご参照ください。
## ベンチマーク結果
| ベンチマーク | Fish Audio S2 |
|------|------|
| Seed-TTS Eval — WER(中国語) | **0.54%**(全体最良) |
| Seed-TTS Eval — WER(英語) | **0.99%**(全体最良) |
| Audio Turing Test(指示あり) | **0.515** 事後平均値 |
| EmergentTTS-Eval — 勝率 | **81.88%**(全体最高) |
| Fish Instruction Benchmark — TAR | **93.3%** |
| Fish Instruction Benchmark — 品質 | **4.51 / 5.0** |
| 多言語(MiniMax Testset)— 最良 WER | **24 言語中 11 言語** |
| 多言語(MiniMax Testset)— 最良 SIM | **24 言語中 17 言語** |
Seed-TTS Eval では、S2 はクローズドソースを含む全評価モデルの中で最小 WER を達成しました:Qwen3-TTS(0.77/1.24)、MiniMax Speech-02(0.99/1.90)、Seed-TTS(1.12/2.25)。Audio Turing Test では 0.515 を記録し、Seed-TTS(0.417)比で 24%、MiniMax-Speech(0.387)比で 33% 上回りました。EmergentTTS-Eval では、副言語情報(91.61%)、疑問文(84.41%)、統語的複雑性(83.39%)で特に高い成績を示しています。
## ハイライト
<img src="./assets/totalability.png" width=200%>
### 自然言語による細粒度インライン制御
S2 Pro は音声にこれまでにない「魂」を宿らせます。シンプルな `[tag]` 構文を使用して、テキスト内の任意の場所に感情の指示を正確に埋め込むことができます。
- **1万5,000以上のユニークタグに対応**:固定のプリセットに限定されず、**自由形式のテキスト記述** をサポートします。`[whisper in small voice]` (ささやき声で), `[professional broadcast tone]` (プロのナレーション風), `[pitch up]` (ピッチを上げる) などを試してみてください。
- **豊富な感情ライブラリ**:
`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`
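上記の `[tag]` 構文がテキスト中でどのように現れるかを示す、最小限のスケッチです。モデル内部の処理を再現したものではなく、`split_inline_tags` という関数名と正規表現は説明用に仮定したものです。

```python
import re

# Assumption for illustration: a tag is any bracketed span without nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_inline_tags(text: str) -> list[tuple[str, str]]:
    """Split text into ("tag", ...) and ("text", ...) segments, in order."""
    segments: list[tuple[str, str]] = []
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        plain = text[cursor:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        segments.append(("tag", match.group(1).strip()))
        cursor = match.end()
    tail = text[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = "[whisper] I have a secret. [pause] [excited] We won!"
segments = split_inline_tags(example)
for kind, value in segments:
    print(f"{kind:>4}: {value}")
```

タグは自由形式の記述も取り得るため、このスケッチでは固定のタグ一覧との照合は行っていません。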
### 革新的な二重自己回帰 (Dual-Autoregressive) アーキテクチャ
S2 Pro は、Decoder-only Transformer と RVQ オーディオコーデック(10 コードブック、約 21 Hz)で構成されるマスター・スレーブ型の Dual-AR アーキテクチャを採用しています:
- **Slow AR (4B パラメータ)**: 時間軸方向に動作し、核となるセマンティックコードブックを予測。
- **Fast AR (400M パラメータ)**: 各時間ステップで残り 9 個の残差コードブックを生成し、極めて繊細な音響ディテールを復元。
この非対称設計により、究極のオーディオ忠実度を維持しながら、推論速度を大幅に向上させています。
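上記の役割分担を、おもちゃのコードで示すと次のようになります。実際のニューラルネットワークの代わりに乱数を使っており、`CODEBOOK_SIZE` の値や関数名はすべて説明用の仮定です。本文に基づいているのは、コードブック数 10 とフレームレート約 21 Hz のみです。

```python
import random

NUM_CODEBOOKS = 10    # from the text: 10 RVQ codebooks
FRAME_RATE_HZ = 21    # from the text: roughly 21 Hz
CODEBOOK_SIZE = 1024  # assumption: illustrative value only


def slow_ar_step(history: list[list[int]], rng: random.Random) -> int:
    """Stand-in for the Slow AR: one semantic code per time step.

    A real model would condition on `history`; this toy version ignores it.
    """
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_code: int, rng: random.Random) -> list[int]:
    """Stand-in for the Fast AR: the 9 residual codes of a single frame."""
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS - 1)]


def generate_frames(seconds: float, seed: int = 0) -> list[list[int]]:
    """Outer loop runs along time; inner step fills the remaining codebooks."""
    rng = random.Random(seed)
    frames: list[list[int]] = []
    for _ in range(round(seconds * FRAME_RATE_HZ)):
        semantic = slow_ar_step(frames, rng)
        residual = fast_ar_step(semantic, rng)
        frames.append([semantic] + residual)
    return frames


frames = generate_frames(2.0)
print(len(frames), "frames,", len(frames) * NUM_CODEBOOKS, "codes in total")
```

この仮定のもとでは、2 秒の音声は 42 フレーム、合計 420 個のコードに相当します。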
### 強化学習 (RL) アライメント
S2 Pro は、事後学習アライメントに **Group Relative Policy Optimization (GRPO)** 技術を採用しています。データのクリーニングとアノテーションに使用したモデルセットをそのまま報酬モデル (Reward Model) として使用することで、事前学習データの分布と事後学習の目標との間のミスマッチを完璧に解決しました。
- **多次元の報酬信号**: 意味の正確性、指示追従性、音響的な好み、音色の類似性を総合的に評価し、生成される一秒一秒の音声が人間の直感に沿うようにしています。
### SGLang による究極のストリーミング推論性能
Dual-AR アーキテクチャは標準的な LLM 構造と同型であるため、S2 Pro は SGLang のすべての推論加速機能をネイティブにサポートしています。これには、Continuous Batching、Paged KV Cache、CUDA Graph、RadixAttention ベースの Prefix Caching が含まれます。
**NVIDIA H200 GPU 1枚でのパフォーマンス:**
- **リアルタイム係数 (RTF)**: 0.195
- **初回音声出力までの時間 (TTFA)**: 約 100 ms
- **極速スループット**: RTF < 0.5 を維持しつつ 3,000+ acoustic tokens/s
### 強力な多言語サポート
S2 Pro は 80 以上の言語をサポートしており、音素や特定の言語に対する前処理なしで高品質な合成を実現します:
- **第1層 (Tier 1)**: 日本語 (ja), 英語 (en), 中国語 (zh)
- **第2層 (Tier 2)**: 韓国語 (ko), スペイン語 (es), ポルトガル語 (pt), アラビア語 (ar), ロシア語 (ru), フランス語 (fr), ドイツ語 (de)
- **グローバルカバレッジ**: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, xsl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo など。
### ネイティブなマルチスピーカー生成
<img src="./assets/chattemplate.png" width=200%>
Fish Audio S2 では、複数のスピーカーを含む参照オーディオをアップロードでき、モデルは `<|speaker:i|>` トークンを介して各スピーカーの特徴を処理します。スピーカー ID トークンを使用してモデルの出力を制御することで、1回の生成に複数のスピーカーを混在させることが可能です。個別のスピーカーごとに参照オーディオをアップロードし直す手間はもう不要です。
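`<|speaker:i|>` トークンを使った台本の組み立て方を、仮のヘルパーで示します。インデックスを 0 から始めることや、トークンを各発話の先頭に置くことは説明用の仮定であり、正確な入力形式は公式ドキュメントで確認してください。

```python
def speaker_token(index: int) -> str:
    """Format a speaker ID token; the 0-based indexing is an assumption."""
    if index < 0:
        raise ValueError("speaker index must be non-negative")
    return f"<|speaker:{index}|>"


def build_dialogue(turns: list[tuple[int, str]]) -> str:
    """Join (speaker index, text) turns into one script, one turn per line.

    Hypothetical helper: the placement of the token before each turn is
    an illustration, not a confirmed Fish Audio S2 prompt format.
    """
    return "\n".join(f"{speaker_token(index)}{text}" for index, text in turns)


script = build_dialogue([
    (0, "Did you hear the news?"),
    (1, "[surprised] No, what happened?"),
    (0, "[whisper] I will tell you later."),
])
print(script)
```

このように 1 つの台本に複数の話者を並べることで、話者ごとに生成を分けずに済むという本文の説明に対応します。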
### マルチターン対話生成
コンテキストの拡張により、以前のターンの情報を利用して後続の生成内容の表現力を高めることができ、対話としての自然さが大幅に向上しました。
### 高速音声クローニング
Fish Audio S2 は、短い参照サンプル(通常 10〜30 秒)を使用した正確な音声クローニングをサポートしています。モデルは音色、話し方、感情を捉え、追加の微調整なしでリアルで一貫したクローン音声を生成します。
SGLang サーバーの利用については、[SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) を参照してください。
---
## 謝辞
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
- [Qwen3](https://github.com/QwenLM/Qwen3)
## 技術レポート
```bibtex
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}
@misc{liao2026fishaudios2technical,
title={Fish Audio S2 Technical Report},
author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
year={2026},
eprint={2603.08823},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2603.08823},
}
```
================================================
FILE: docs/README.ko.md
================================================
<div align="center">
<h1>Fish Speech</h1>
[English](../README.md) | [简体中文](README.zh.md) | [Portuguese](README.pt-BR.md) | [日本語](README.ja.md) | **한국어** | [العربية](README.ar.md) <br>
<a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish-audio-s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish Audio S1 - Expressive Voice Cloning and Text-to-Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
<a href="https://trendshift.io/repositories/7014" target="_blank">
<img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
</a>
<br>
</div>
<br>
<div align="center">
<img src="https://count.getloli.com/get/@fish-speech?theme=asoul" /><br>
</div>
<br>
<div align="center">
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
<img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
</a>
<a target="_blank" href="https://hub.docker.com/r/fishaudio/fish-speech">
<img alt="Docker" src="https://img.shields.io/docker/pulls/fishaudio/fish-speech?style=flat-square&logo=docker"/>
</a>
<a target="_blank" href="https://pd.qq.com/s/bwxia254o">
<img alt="QQ Channel" src="https://img.shields.io/badge/QQ-blue?logo=tencentqq">
</a>
</div>
<div align="center">
<a target="_blank" href="https://huggingface.co/fishaudio/s2">
<img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
</a>
<a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
<img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
</a>
<a target="_blank" href="https://arxiv.org/abs/2603.08823">
<img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
</a>
</div>
> [!IMPORTANT]
> **라이선스 고지**
> 이 코드베이스 및 관련 모델 가중치는 **[FISH AUDIO RESEARCH LICENSE](../LICENSE)** 에 따라 배포됩니다. 자세한 내용은 [LICENSE](../LICENSE)를 참조하십시오.
> [!WARNING]
> **법적 면책 조항**
> 당사는 코드베이스의 불법적인 사용에 대해 어떠한 책임도 지지 않습니다. 해당 지역의 DMCA 및 기타 관련 법률을 참조하십시오.
## 빠른 시작
### 문서 입구
Fish Audio S2의 공식 문서입니다. 지침에 따라 쉽게 시작하십시오.
- [설치](https://speech.fish.audio/ko/install/)
- [명령줄 추론](https://speech.fish.audio/ko/inference/)
- [WebUI 추론](https://speech.fish.audio/ko/inference/)
- [서버 추론](https://speech.fish.audio/ko/server/)
- [Docker 배포](https://speech.fish.audio/ko/install/)
> [!IMPORTANT]
> **SGLang 서버를 사용하려면 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)를 참조하십시오.**
### LLM Agent 가이드
```
먼저 https://speech.fish.audio/ko/install/ 을 읽고 문서에 따라 Fish Audio S2를 설치 및 구성하십시오.
```
## Fish Audio S2 Pro
**음성 생성의 경계를 재정의하는 업계 최고의 다국어 텍스트 음성 변환(TTS) 시스템.**
Fish Audio S2 Pro는 [Fish Audio](https://fish.audio/)에서 개발한 최첨단 멀티모달 모델입니다. 전 세계 **80개 이상의 언어**를 아우르는 **1,000만 시간** 이상의 방대한 오디오 데이터로 학습되었습니다. 혁신적인 **이중 자기회귀(Dual-AR)** 아키텍처와 강화 학습(RL) 정렬 기술을 통해 S2 Pro는 극도로 자연스럽고 사실적이며 감정이 풍부한 음성을 생성하며, 오픈 소스와 클로즈드 소스 경쟁 모두에서 선두를 달리고 있습니다.
S2 Pro의 핵심 강점은 자연어 태그(예: `[whisper]`, `[excited]`, `[angry]`)를 통해 운율과 감정을 **하위 단어 수준(Sub-word Level)**에서 매우 세밀하게 인라인 제어할 수 있다는 점입니다. 또한 다중 화자 생성 및 긴 컨텍스트의 다중 턴 대화 생성을 기본적으로 지원합니다.
지금 바로 [Fish Audio 공식 웹사이트](https://fish.audio/)에서 온라인 데모를 체험하거나, [기술 보고서](https://arxiv.org/abs/2603.08823) 및 [블로그 게시물](https://fish.audio/blog/fish-audio-open-sources-s2/)을 통해 자세히 알아보십시오.
### 모델 변형
| 모델 | 크기 | 가용성 | 설명 |
|------|------|-------------|-------------|
| S2-Pro | 4B 파라미터 | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | 최고의 품질과 안정성을 갖춘 모든 기능을 갖춘 플래그십 모델 |
모델에 대한 자세한 내용은 [기술 보고서](https://arxiv.org/abs/2411.01156)를 참조하십시오.
## 벤치마크 결과
| 벤치마크 | Fish Audio S2 |
|------|------|
| Seed-TTS Eval — WER(중국어) | **0.54%** (전체 최고) |
| Seed-TTS Eval — WER(영어) | **0.99%** (전체 최고) |
| Audio Turing Test (지침 포함) | **0.515** 후험 평균 |
| EmergentTTS-Eval — 승률 | **81.88%** (전체 최고) |
| Fish Instruction Benchmark — TAR | **93.3%** |
| Fish Instruction Benchmark — 품질 | **4.51 / 5.0** |
| 다국어 (MiniMax Testset) — 최고 WER | **24개 언어 중 11개** |
| 다국어 (MiniMax Testset) — 최고 SIM | **24개 언어 중 17개** |
Seed-TTS Eval에서 S2는 클로즈드 소스 시스템을 포함한 모든 평가 모델 중 가장 낮은 WER을 달성했습니다: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90), Seed-TTS (1.12/2.25). Audio Turing Test에서 S2의 0.515는 Seed-TTS (0.417) 대비 24%, MiniMax-Speech (0.387) 대비 33% 향상된 수치입니다. EmergentTTS-Eval에서 S2는 부차 언어학(91.61% 승률), 의문문(84.41%), 구문 복잡성(83.39%) 등의 측면에서 특히 두드러진 성과를 보였습니다.
## 하이라이트
<img src="./assets/totalability.png" width=200%>
### 자연어를 통한 초미세 인라인 제어
S2 Pro는 음성에 전례 없는 "영혼"을 부여합니다. 간단한 `[tag]` 구문을 사용하여 텍스트의 어느 위치에나 감정 지침을 정확하게 삽입할 수 있습니다.
- **15,000개 이상의 고유 태그 지원**: 고정된 사전 설정에 국한되지 않고 **자유 형식의 텍스트 설명**을 지원합니다. `[whisper in small voice]` (작은 목소리로 속삭임), `[professional broadcast tone]` (전문 방송 톤), `[pitch up]` (음높이 높임) 등을 시도해 보십시오.
- **풍부한 감정 라이브러리**:
`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`
### 혁신적인 이중 자기회귀 (Dual-Autoregressive) 아키텍처
S2 Pro는 Decoder-only Transformer와 RVQ 오디오 코덱(10개 코드북, 약 21Hz 프레임 속도)으로 구성된 마스터-슬레이브 방식의 Dual-AR 아키텍처를 채택했습니다.
- **Slow AR (4B 파라미터)**: 시간 축을 따라 작동하며 핵심 의미 코드북을 예측합니다.
- **Fast AR (400M 파라미터)**: 각 타임스텝에서 나머지 9개의 잔차 코드북을 생성하여 극도로 정교한 음향 세부 사항을 복원합니다.
이러한 비대칭 설계는 오디오의 최고 충실도를 보장하는 동시에 추론 속도를 대폭 향상시킵니다.
### 강화 학습 (RL) 정렬
S2 Pro는 사후 학습 정렬을 위해 **Group Relative Policy Optimization (GRPO)** 기술을 채택했습니다. 데이터 정제 및 주석 처리에 사용된 것과 동일한 모델 세트를 보상 모델(Reward Model)로 직접 사용함으로써 사전 학습 데이터 분포와 사후 학습 목표 간의 불일치 문제를 완벽하게 해결했습니다.
- **다차원 보상 신호**: 의미 체계의 정확성, 지침 준수 능력, 음향 선호도 점수 및 음색 유사성을 종합적으로 평가하여 생성된 음성의 매초가 인간의 직관에 부합하도록 보장합니다.
### SGLang 기반의 극한 스트리밍 추론 성능
Dual-AR 아키텍처는 표준 LLM 구조와 동형이므로 S2 Pro는 Continuous Batching, Paged KV Cache, CUDA Graph 및 RadixAttention 기반 Prefix Caching을 포함한 SGLang의 모든 추론 가속 기능을 기본적으로 지원합니다.
**단일 NVIDIA H200 GPU 성능 지표:**
- **실시간 계수 (RTF)**: 0.195
- **첫 음성 지연 (TTFA)**: 약 100 ms
- **초고속 처리량**: RTF < 0.5 유지 시 처리량 3,000+ acoustic tokens/s 달성
### 강력한 다국어 지원
S2 Pro는 음소나 특정 언어 처리가 필요 없는 고품질 합성을 80개 이상의 언어에서 지원합니다.
- **1계층 (Tier 1)**: 일본어 (ja), 영어 (en), 중국어 (zh)
- **2계층 (Tier 2)**: 한국어 (ko), 스페인어 (es), 포르투갈어 (pt), 아랍어 (ar), 러시아어 (ru), 프랑스어 (fr), 독일어 (de)
- **글로벌 커버리지**: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, xsl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo 등.
### 네이티브 다중 화자 생성
<img src="./assets/chattemplate.png" width=200%>
Fish Audio S2를 사용하면 사용자가 여러 화자가 포함된 참조 오디오를 업로드할 수 있으며, 모델은 `<|speaker:i|>` 토큰을 통해 각 화자의 특징을 처리합니다. 이후 화자 ID 토큰을 사용하여 모델의 표현을 제어함으로써 한 번의 생성에 여러 화자를 포함할 수 있습니다. 더 이상 화자마다 별도의 참조 오디오를 업로드하고 음성을 생성할 필요가 없습니다.
### 다중 턴 대화 생성
모델 컨텍스트 확장에 힘입어 이제 이전 정보의 도움을 받아 후속 생성 내용의 표현력을 높이고 콘텐츠의 자연스러움을 향상시킬 수 있습니다.
### 고속 음성 복제
Fish Audio S2는 짧은 참조 샘플(보통 10-30초)을 사용한 정확한 음성 복제를 지원합니다. 모델은 음색, 말하기 스타일 및 감정적 경향을 포착하여 추가적인 미세 조정 없이도 사실적이고 일관된 복제 음성을 생성합니다.
SGLang 서버 사용에 대해서는 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)를 참조하십시오.
---
## 감사의 말
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
- [Qwen3](https://github.com/QwenLM/Qwen3)
## 기술 보고서
```bibtex
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}
@misc{liao2026fishaudios2technical,
title={Fish Audio S2 Technical Report},
author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
year={2026},
eprint={2603.08823},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2603.08823},
}
```
================================================
FILE: docs/README.pt-BR.md
================================================
<div align="center">
<h1>Fish Speech</h1>
[English](../README.md) | [简体中文](README.zh.md) | **Portuguese** | [日本語](README.ja.md) | [한국어](README.ko.md) | [العربية](README.ar.md) <br>
<a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish-audio-s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish Audio S1 - Expressive Voice Cloning and Text-to-Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
<a href="https://trendshift.io/repositories/7014" target="_blank">
<img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
</a>
<br>
</div>
<br>
<div align="center">
<img src="https://count.getloli.com/get/@fish-speech?theme=asoul" /><br>
</div>
<br>
<div align="center">
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
<img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
</a>
<a target="_blank" href="https://hub.docker.com/r/fishaudio/fish-speech">
<img alt="Docker" src="https://img.shields.io/docker/pulls/fishaudio/fish-speech?style=flat-square&logo=docker"/>
</a>
<a target="_blank" href="https://pd.qq.com/s/bwxia254o">
<img alt="QQ Channel" src="https://img.shields.io/badge/QQ-blue?logo=tencentqq">
</a>
</div>
<div align="center">
<a target="_blank" href="https://huggingface.co/fishaudio/s2">
<img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
</a>
<a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
<img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
</a>
<a target="_blank" href="https://arxiv.org/abs/2603.08823">
<img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
</a>
</div>
> [!IMPORTANT]
> **Aviso de Licença**
> Este repositório de código e seus pesos de modelo associados são lançados sob a **[FISH AUDIO RESEARCH LICENSE](../LICENSE)**. Consulte [LICENSE](../LICENSE) para obter mais detalhes.
> [!WARNING]
> **Aviso Legal**
> Não nos responsabilizamos por qualquer uso ilegal deste repositório. Consulte as leis locais sobre DMCA e outras regulamentações relevantes.
## Início Rápido
### Links da Documentação
Esta é a documentação oficial do Fish Audio S2. Siga as instruções para começar facilmente.
- [Instalação](https://speech.fish.audio/install/)
- [Inferência por Linha de Comando](https://speech.fish.audio/inference/)
- [Inferência por WebUI](https://speech.fish.audio/inference/)
- [Inferência por Servidor](https://speech.fish.audio/server/)
- [Implantação Docker](https://speech.fish.audio/install/)
> [!IMPORTANT]
> **Caso deseje utilizar o SGLang Server, consulte o [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).**
### Guia para Agentes de LLM
```
Leia primeiro https://speech.fish.audio/install/ e siga a documentação para instalar e configurar o Fish Audio S2.
```
## Fish Audio S2 Pro
**O sistema de conversão de texto em fala (TTS) multilíngue líder do setor, redefinindo as fronteiras da geração de voz.**
O Fish Audio S2 Pro é o modelo multimodal mais avançado desenvolvido pela [Fish Audio](https://fish.audio/). Foi treinado com mais de **10 milhões de horas** de áudio, cobrindo mais de **80 idiomas**. Por meio de uma arquitetura inovadora **Dual-Autoregressive (Dual-AR)** e de alinhamento por aprendizado por reforço (RL), o S2 Pro gera fala extremamente natural, realista e rica em emoção, liderando tanto entre sistemas de código aberto quanto proprietários.
O grande diferencial do S2 Pro reside em seu suporte para controle inline de granularidade ultra-fina de prosódia e emoção ao nível de **sub-palavra (Sub-word Level)** via tags de linguagem natural (como `[whisper]`, `[excited]`, `[angry]`), além de suporte nativo para múltiplos falantes e geração de diálogos de múltiplos turnos com contexto ultra-longo.
Visite agora o [site oficial da Fish Audio](https://fish.audio/) para experimentar a demonstração online, ou leia nosso [relatório técnico](https://arxiv.org/abs/2603.08823) e [artigo no blog](https://fish.audio/blog/fish-audio-open-sources-s2/) para saber mais.
### Variantes de Modelo
| Modelo | Tamanho | Disponibilidade | Descrição |
|------|------|-------------|-------------|
| S2-Pro | 4B parâmetros | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | Modelo flagship completo, com máxima qualidade e estabilidade |
Para mais detalhes sobre os modelos, consulte o [relatório técnico](https://arxiv.org/abs/2411.01156).
## Resultados de Benchmark
| Benchmark | Fish Audio S2 |
|------|------|
| Seed-TTS Eval — WER (Chinês) | **0.54%** (Melhor geral) |
| Seed-TTS Eval — WER (Inglês) | **0.99%** (Melhor geral) |
| Audio Turing Test (Com instrução) | **0.515** Média posterior |
| EmergentTTS-Eval — Taxa de Vitória | **81.88%** (Maior geral) |
| Fish Instruction Benchmark — TAR | **93.3%** |
| Fish Instruction Benchmark — Qualidade | **4.51 / 5.0** |
| Multilíngue (MiniMax Testset) — Melhor WER | **11 de 24** idiomas |
| Multilíngue (MiniMax Testset) — Melhor SIM | **17 de 24** idiomas |
No Seed-TTS Eval, o S2 alcançou o menor WER entre todos os modelos avaliados (incluindo sistemas proprietários): Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90), Seed-TTS (1.12/2.25). No Audio Turing Test, o valor de 0.515 do S2 representa um aumento de 24% em relação ao Seed-TTS (0.417) e 33% em relação ao MiniMax-Speech (0.387). No EmergentTTS-Eval, o S2 destacou-se especialmente em dimensões como paralinguística (taxa de vitória de 91.61%), frases interrogativas (84.41%) e complexidade sintática (83.39%).
## Destaques
<img src="./assets/totalability.png" width=200%>
### Controle Inline de Granularidade Ultra-Fina via Linguagem Natural
O S2 Pro confere à voz uma "alma" sem precedentes. Com uma sintaxe simples de `[tag]`, você pode inserir instruções emocionais com precisão em qualquer posição do texto.
- **Suporte para mais de 15.000 tags únicas**: Não limitado a predefinições fixas, suporta **descrições textuais de formato livre**. Você pode tentar `[whisper in small voice]` (sussurrando), `[professional broadcast tone]` (tom de locução profissional) ou `[pitch up]` (aumentar o tom).
- **Rica biblioteca de emoções**:
`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`
### Arquitetura Inovadora Dual-Autoregressive (Dual-AR)
S2 Pro adota uma arquitetura Dual-AR mestre-escravo, consistindo de um Decoder-only Transformer e um codec de áudio RVQ (10 codebooks, cerca de 21 Hz de taxa de frames):
- **Slow AR (4B parâmetros)**: Atua ao longo do eixo temporal, prevendo o codebook semântico central.
- **Fast AR (400M parâmetros)**: Gera os 9 codebooks residuais restantes em cada passo de tempo, restaurando detalhes acústicos extremos com delicadeza.
Este design assimétrico garante fidelidade extrema ao áudio enquanto aumenta significativamente a velocidade de inferência.
### Alinhamento por Aprendizado por Reforço (RL Alignment)
S2 Pro utiliza a tecnologia **Group Relative Policy Optimization (GRPO)** para o alinhamento pós-treinamento. Utilizamos o mesmo conjunto de modelos para limpeza e anotação de dados diretamente como modelos de recompensa (Reward Model), resolvendo perfeitamente o problema de descasamento entre a distribuição dos dados de pré-treinamento e os objetivos de pós-treinamento.
- **Sinais de recompensa multidimensionais**: Avalia de forma abrangente a precisão semântica, a capacidade de seguir instruções, a pontuação de preferência acústica e a similaridade de timbre, garantindo que cada segundo de fala gerada esteja alinhado com a intuição humana.
### Desempenho de Inferência de Streaming Extremo (Baseado em SGLang)
Como a arquitetura Dual-AR é estruturalmente isomorfa à estrutura padrão de LLMs, o S2 Pro suporta nativamente todos os recursos de aceleração de inferência do SGLang, incluindo loteamento contínuo (Continuous Batching), Paged KV Cache, CUDA Graph e cache de prefixo baseado em RadixAttention.
**Desempenho em uma única GPU NVIDIA H200:**
- **Fator em Tempo Real (RTF)**: 0.195
- **Latência do Primeiro Áudio (TTFA)**: aprox. 100 ms
- **Taxa de Transferência Ultrarrápida**: Alcance de 3.000+ acoustic tokens/s mantendo RTF < 0.5
### Poderoso Suporte Multilíngue
S2 Pro suporta mais de 80 idiomas, possibilitando síntese de alta qualidade sem a necessidade de fonemas ou processamento específico por idioma:
- **Tier 1**: Japonês (ja), Inglês (en), Chinês (zh)
- **Tier 2**: Coreano (ko), Espanhol (es), Português (pt), Árabe (ar), Russo (ru), Francês (fr), Alemão (de)
- **Cobertura Global**: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, xsl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, etc.
### Geração Nativa Multi-falante
<img src="./assets/chattemplate.png" width=200%>
O Fish Audio S2 permite que os usuários enviem áudio de referência contendo múltiplos falantes, e o modelo processará as características de cada falante via o token `<|speaker:i|>`. Em seguida, você pode controlar o desempenho do modelo através do token de ID do falante, permitindo incluir múltiplos falantes em uma única geração. Não é mais necessário enviar áudios de referência separadamente para cada falante.
### Geração de Diálogos Multiturnos
Graças à expansão do contexto do modelo, nosso modelo agora pode aproveitar as informações prévias para aumentar a expressividade dos conteúdos gerados subsequentemente, elevando assim a naturalidade dos diálogos.
### Clonagem de Voz Rápida
O Fish Audio S2 suporta clonagem de voz precisa usando curtas amostras de referência (normalmente 10-30 segundos). O modelo captura o timbre, o estilo de fala e as tendências emocionais, gerando vozes clonadas realistas e consistentes sem necessidade de ajustes finos adicionais.
Caso deseje utilizar o SGLang Server, consulte o [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
---
## Agradecimentos
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
- [Qwen3](https://github.com/QwenLM/Qwen3)
## Relatório Técnico
```bibtex
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}
@misc{liao2026fishaudios2technical,
title={Fish Audio S2 Technical Report},
author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
year={2026},
eprint={2603.08823},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2603.08823},
}
```
================================================
FILE: docs/README.zh.md
================================================
<div align="center">
<h1>Fish Speech</h1>
[English](../README.md) | **简体中文** | [Portuguese](README.pt-BR.md) | [日本語](README.ja.md) | [한국어](README.ko.md) | [العربية](README.ar.md) <br>
<a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish-audio-s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish Audio S1 - Expressive Voice Cloning and Text-to-Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
<a href="https://trendshift.io/repositories/7014" target="_blank">
<img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
</a>
<br>
</div>
<br>
<div align="center">
<img src="https://count.getloli.com/get/@fish-speech?theme=asoul" /><br>
</div>
<br>
<div align="center">
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
<img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
</a>
<a target="_blank" href="https://hub.docker.com/r/fishaudio/fish-speech">
<img alt="Docker" src="https://img.shields.io/docker/pulls/fishaudio/fish-speech?style=flat-square&logo=docker"/>
</a>
<a target="_blank" href="https://pd.qq.com/s/bwxia254o">
<img alt="QQ Channel" src="https://img.shields.io/badge/QQ-blue?logo=tencentqq">
</a>
</div>
<div align="center">
<a target="_blank" href="https://huggingface.co/fishaudio/s2">
<img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
</a>
<a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
<img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
</a>
<a target="_blank" href="https://arxiv.org/abs/2603.08823">
<img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
</a>
</div>
> [!IMPORTANT]
> **许可证声明**
> 此代码库及其相关的模型权重均在 **[FISH AUDIO RESEARCH LICENSE](../LICENSE)** 下发布。更多详情请参考 [LICENSE](../LICENSE)。
> [!WARNING]
> **法律免责声明**
> 我们不对代码库的任何非法使用承担责任。请参考您当地关于 DMCA 和其他相关法律的法规。
## 快速开始
### 文档入口
这里是 Fish Audio S2 的官方文档,请按照说明轻松入门。
- [安装](https://speech.fish.audio/zh/install/)
- [命令行推理](https://speech.fish.audio/zh/inference/)
- [WebUI 推理](https://speech.fish.audio/zh/inference/)
- [服务端推理](https://speech.fish.audio/zh/server/)
- [Docker 部署](https://speech.fish.audio/zh/install/)
> [!IMPORTANT]
> **如需使用 SGLang Server,请参考 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md)。**
### LLM Agent 指南
```
请先阅读 https://speech.fish.audio/zh/install/ ,并按文档安装和配置 Fish Audio S2。
```
## Fish Audio S2 Pro
**行业顶尖的多语言文本转语音 (TTS) 系统,重新定义声音生成的边界。**
Fish Audio S2 Pro 是 [Fish Audio](https://fish.audio/) 开发的最先进的多模态模型。S2 Pro 训练自超过 **1000 万小时** 的海量音频数据,覆盖全球 **80 多种语言**。通过创新的 **双自回归 (Dual-AR)** 架构与强化学习 (RL) 对齐技术,S2 Pro 能生成极具自然感、真实感且情感饱满的语音,在开源与闭源竞争中均处于领先地位。
S2 Pro 的杀手锏在于支持通过自然语言标签(如 `[whisper]`、`[excited]`、`[angry]`)对韵律与情绪进行 **亚词级(Sub-word Level)** 的极细粒度行内控制,同时原生支持多说话人与超长上下文的多轮对话生成。
立即访问 [Fish Audio 官网](https://fish.audio/) 体验在线演示,或阅读我们的[技术报告](https://arxiv.org/abs/2603.08823)与[博客文章](https://fish.audio/blog/fish-audio-open-sources-s2/)深入了解。
### 模型变体
| 模型 | 大小 | 可用性 | 描述 |
|------|------|-------------|-------------|
| S2-Pro | 4B 参数 | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | 功能齐全的旗舰模型,具有最高质量和稳定性 |
有关模型的更多详情,请参见[技术报告](https://arxiv.org/abs/2411.01156)。
## 基准测试结果
| 基准 | Fish Audio S2 |
|------|------|
| Seed-TTS Eval — WER(中文) | **0.54%**(总体最佳) |
| Seed-TTS Eval — WER(英文) | **0.99%**(总体最佳) |
| Audio Turing Test(含指令) | **0.515** 后验均值 |
| EmergentTTS-Eval — 胜率 | **81.88%**(总体最高) |
| Fish Instruction Benchmark — TAR | **93.3%** |
| Fish Instruction Benchmark — 质量 | **4.51 / 5.0** |
| 多语言(MiniMax Testset)— 最佳 WER | **24** 种语言中的 **11** 种 |
| 多语言(MiniMax Testset)— 最佳 SIM | **24** 种语言中的 **17** 种 |
在 Seed-TTS Eval 上,S2 在所有已评估模型(包括闭源系统)中实现了最低 WER:Qwen3-TTS(0.77/1.24)、MiniMax Speech-02(0.99/1.90)、Seed-TTS(1.12/2.25)。在 Audio Turing Test 上,S2 的 0.515 相比 Seed-TTS(0.417)提升 24%,相比 MiniMax-Speech(0.387)提升 33%。在 EmergentTTS-Eval 中,S2 在副语言学(91.61% 胜率)、疑问句(84.41%)和句法复杂度(83.39%)等维度表现尤为突出。
## 亮点
<img src="./assets/totalability.png" width=200%>
### 通过自然语言进行极细粒度行内控制
S2 Pro 赋予了语音前所未有的“灵性”。通过简单的 `[tag]` 语法,你可以在文本的任何位置精准嵌入情感指令。
- **15,000+ 独特标签支持**:不局限于固定的预设,支持 **自由格式的文本描述**。你可以尝试 `[whisper in small voice]` (低声耳语), `[professional broadcast tone]` (专业播音腔), 或 `[pitch up]` (提高音调)。
- **丰富的情绪库**:
`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`
### 创新的双自回归 (Dual-Autoregressive) 架构
S2 Pro 采用了主从式 Dual-AR 架构,由 Decoder-only Transformer 与 RVQ 音频编解码器(10 个码本,约 21 Hz 帧率)组成:
- **Slow AR (4B 参数)**:沿时间轴工作,预测核心的语义码本。
- **Fast AR (400M 参数)**:在每个时间步生成剩余 9 个残差码本,细腻还原极致的音频细节。
这种非对称设计在保证音频极致保真度的同时,大幅提升了推理速度。
### 强化学习对齐 (RL Alignment)
S2 Pro 采用了 **Group Relative Policy Optimization (GRPO)** 技术进行后训练对齐。我们将用于数据清洗与标注的同一套模型直接作为奖励模型 (Reward Model),完美解决了预训练数据分布与后训练目标之间的不匹配问题。
- **多维奖励信号**:综合评估语义准确性、指令遵循能力、声学偏好评分以及音色相似度,确保生成的每一秒语音都符合人类直觉。
### 极致的流式推理性能 (基于 SGLang)
由于 Dual-AR 架构与标准 LLM 结构同构,S2 Pro 原生支持 SGLang 的所有推理加速特性,包括连续批处理 (Continuous Batching)、分页 KV Cache、CUDA Graph 与基于 RadixAttention 的前缀缓存。
**单张 NVIDIA H200 GPU 性能表现:**
- **实时因子 (RTF)**:0.195
- **首音延迟 (TTFA)**:约 100 ms
- **极速吞吐**:在保持 RTF < 0.5 时,吞吐量达到 3,000+ acoustic tokens/s
### 强大的多语言支持
S2 Pro 支持 80 多种语言,无需音素或特定语言的处理即可实现高质量合成:
- **第一梯队 (Tier 1)**:日语 (ja), 英语 (en), 中文 (zh)
- **第二梯队 (Tier 2)**:韩语 (ko), 西班牙语 (es), 葡萄牙语 (pt), 阿拉伯语 (ar), 俄语 (ru), 法语 (fr), 德语 (de)
- **全球覆盖**:sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, xsl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo 等。
### 原生多说话人生成
<img src="./assets/chattemplate.png" width=200%>
Fish Audio S2 允许用户上传包含多个说话人的参考音频,模型将通过 `<|speaker:i|>` 令牌处理每个说话人的特征。之后您可以通过说话人 ID 令牌控制模型的表现,从而实现一次生成中包含多个说话人。再也不需要像以前那样针对每个说话人都单独上传参考音频与生成语音了。
### 多轮对话生成
得益于模型上下文的扩展,我们的模型现在可以借助上文的信息提高后续生成内容的表现力,从而提升内容的自然度。
### 快速语音克隆
Fish Audio S2 支持使用短参考样本(通常为 10-30 秒)进行准确的语音克隆。模型可以捕捉音色、说话风格和情感倾向,无需额外微调即可生成逼真且一致的克隆语音。
如需使用 SGLang Server,请参考 [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md) 。
---
## 致谢
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
- [Qwen3](https://github.com/QwenLM/Qwen3)
## 技术报告
```bibtex
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}
@misc{liao2026fishaudios2technical,
title={Fish Audio S2 Technical Report},
author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
year={2026},
eprint={2603.08823},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2603.08823},
}
```
================================================
FILE: docs/ar/finetune.md
================================================
# الضبط الدقيق (Fine-tuning)
من الواضح أنك عندما فتحت هذه الصفحة، لم تكن راضيًا عن أداء النموذج المدرب مسبقًا في وضع zero-shot. أنت ترغب في إجراء ضبط دقيق لنموذج لتحسين أدائه على مجموعة البيانات الخاصة بك.
في الإصدار الحالي، ما عليك سوى إجراء الضبط الدقيق لجزء 'LLAMA'.
## الضبط الدقيق لـ LLAMA
### 1. إعداد مجموعة البيانات
```
.
├── SPK1
│ ├── 21.15-26.44.lab
│ ├── 21.15-26.44.mp3
│ ├── 27.51-29.98.lab
│ ├── 27.51-29.98.mp3
│ ├── 30.1-32.71.lab
│ └── 30.1-32.71.mp3
└── SPK2
├── 38.79-40.85.lab
└── 38.79-40.85.mp3
```
تحتاج إلى تحويل مجموعة البيانات الخاصة بك إلى التنسيق أعلاه ووضعها تحت مجلد `data`. يمكن أن يكون للملف الصوتي الامتدادات `.mp3`، `.wav`، أو `.flac`، ويجب أن يكون لملف التعليقات التوضيحية الامتداد `.lab`.
!!! info "تنسيق مجموعة البيانات"
يحتاج ملف التعليقات التوضيحية `.lab` فقط إلى احتواء النص المكتوب للمقطع الصوتي، دون الحاجة إلى تنسيق خاص. على سبيل المثال، إذا كان محتوى `hi.mp3` هو "مرحبًا، وداعًا"، فسيحتوي ملف `hi.lab` على سطر واحد من النص: "مرحبًا، وداعًا".
!!! warning "تحذير"
يوصى بتطبيق تسوية جهارة الصوت (loudness normalization) على مجموعة البيانات. يمكنك استخدام [fish-audio-preprocess](https://github.com/fishaudio/audio-preprocess) للقيام بذلك.
```bash
fap loudness-norm data-raw data --clean
```
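Before moving on to token extraction, it can help to confirm that the `data` folder really follows the layout described above. The following is a minimal sketch, not part of the fish-speech repository: `find_dataset_problems` is a hypothetical helper name, and the only assumptions it makes are the ones stated in this section (audio files end in `.mp3`, `.wav`, or `.flac`, and each has a sibling `.lab` file containing the transcript).

```python
from pathlib import Path

# Audio extensions listed in the dataset description above.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def find_dataset_problems(data_dir):
    """Hypothetical helper: return (audio files without a .lab, empty .lab files)."""
    missing_lab, empty_lab = [], []
    for audio in sorted(Path(data_dir).rglob("*")):
        if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        # "21.15-26.44.mp3" -> "21.15-26.44.lab" (only the last suffix is replaced).
        lab = audio.with_suffix(".lab")
        if not lab.is_file():
            missing_lab.append(audio)
        elif not lab.read_text(encoding="utf-8").strip():
            empty_lab.append(lab)
    return missing_lab, empty_lab
```

Calling `find_dataset_problems("data")` returns two lists of paths; both should be empty before the next step is run.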
### 2. الاستخراج الدفعي للرموز الدلالية (semantic tokens)
تأكد من أنك قمت بتنزيل أوزان VQGAN. إذا لم تكن قد فعلت، قم بتشغيل الأمر التالي:
```bash
huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
```
يمكنك بعد ذلك تشغيل الأمر التالي لاستخراج الرموز الدلالية:
```bash
python tools/vqgan/extract_vq.py data \
--num-workers 1 --batch-size 16 \
--config-name "modded_dac_vq" \
--checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
```
!!! note "ملاحظة"
يمكنك ضبط `--num-workers` و `--batch-size` لزيادة سرعة الاستخراج، ولكن يرجى التأكد من عدم تجاوز حد ذاكرة وحدة معالجة الرسومات (GPU) الخاصة بك.
سيقوم هذا الأمر بإنشاء ملفات `.npy` في مجلد `data`، كما هو موضح أدناه:
```
.
├── SPK1
│   ├── 21.15-26.44.lab
│   ├── 21.15-26.44.mp3
│   ├── 21.15-26.44.npy
│   ├── 27.51-29.98.lab
│   ├── 27.51-29.98.mp3
│   ├── 27.51-29.98.npy
│   ├── 30.1-32.71.lab
│   ├── 30.1-32.71.mp3
│   └── 30.1-32.71.npy
└── SPK2
    ├── 38.79-40.85.lab
    ├── 38.79-40.85.mp3
    └── 38.79-40.85.npy
```
### 3. Pack the dataset into protobuf
```bash
python tools/llama/build_dataset.py \
--input "data" \
--output "data/protos" \
--text-extension .lab \
--num-workers 16
```
After the command finishes, you should see the `protos` file in the `data` directory.
### 4. Finally, fine-tune with LoRA
Similarly, make sure you have downloaded the `LLAMA` weights. If not, run the following command:
```bash
huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
```
Finally, start fine-tuning by running the following command:
```bash
python fish_speech/train.py --config-name text2semantic_finetune \
project=$project \
+lora@model.model.lora_config=r_8_alpha_16
```
!!! note "Note"
    You can adjust training parameters such as `batch_size` and `gradient_accumulation_steps` to fit your GPU memory by editing `fish_speech/configs/text2semantic_finetune.yaml`.
!!! note "Note"
    On Windows, you can use `trainer.strategy.process_group_backend=gloo` to avoid `nccl` issues.
After training completes, see the [inference](inference.md) section to test your model.
!!! info "Info"
    By default, the model learns only the speaker's speech patterns, not the timbre. You still need to use prompts to keep the timbre stable.
    If you want the model to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.
After training, you need to convert the LoRA weights into regular weights before running inference.
```bash
python tools/llama/merge_lora.py \
--lora-config r_8_alpha_16 \
--base-weight checkpoints/openaudio-s1-mini \
--lora-weight results/$project/checkpoints/step_000000010.ckpt \
--output checkpoints/openaudio-s1-mini-yth-lora/
```
!!! note "Note"
    You can also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, as earlier checkpoints often perform better on out-of-distribution (OOD) data.
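To pick the earliest checkpoint that works for you, it helps to list the saved checkpoints in step order. The sketch below is an illustration rather than a repository tool: `checkpoints_by_step` is a hypothetical helper, and it assumes checkpoints follow the `step_<number>.ckpt` naming seen in the merge command above.

```python
import re
from pathlib import Path

# Matches names such as "step_000000010.ckpt" and captures the step number.
STEP_PATTERN = re.compile(r"step_(\d+)\.ckpt$")


def checkpoints_by_step(checkpoint_dir):
    """Return (step, path) pairs sorted from the earliest to the latest step."""
    found = []
    for path in Path(checkpoint_dir).glob("step_*.ckpt"):
        match = STEP_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)
```

You could then evaluate the checkpoints in the returned order and pass the first acceptable one to `--lora-weight`.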
================================================
FILE: docs/ar/index.md
================================================
<div align="center">
<h1>Fish Speech</h1>
<p><a href="../en/">English</a> | <a href="../zh/">简体中文</a> | <a href="../pt/">Portuguese</a> | <a href="../ja/">日本語</a> | <a href="../ko/">한국어</a> | <strong>العربية</strong></p>
<a href="https://www.producthunt.com/products/fish-speech?embed=true&utm_source=badge-top-post-badge&utm_medium=badge&utm_source=badge-fish-audio-s1" target="_blank"><img src="https://api.producthunt.com/widgets/embed-image/v1/top-post-badge.svg?post_id=1023740&theme=light&period=daily&t=1761164814710" alt="Fish Audio S1 - Expressive Voice Cloning and Text-to-Speech | Product Hunt" style="width: 250px; height: 54px;" width="250" height="54" /></a>
<a href="https://trendshift.io/repositories/7014" target="_blank">
<img src="https://trendshift.io/api/badge/repositories/7014" alt="fishaudio%2Ffish-speech | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/>
</a>
</div>
<br>
<div align="center">
<img src="https://count.getloli.com/get/@fish-speech?theme=asoul" /><br>
</div>
<br>
<div align="center">
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
<img alt="Discord" src="https://img.shields.io/discord/1214047546020728892?color=%23738ADB&label=Discord&logo=discord&logoColor=white&style=flat-square"/>
</a>
<a target="_blank" href="https://hub.docker.com/r/fishaudio/fish-speech">
<img alt="Docker" src="https://img.shields.io/docker/pulls/fishaudio/fish-speech?style=flat-square&logo=docker"/>
</a>
<a target="_blank" href="https://pd.qq.com/s/bwxia254o">
<img alt="QQ Channel" src="https://img.shields.io/badge/QQ-blue?logo=tencentqq">
</a>
</div>
<div align="center">
<a target="_blank" href="https://huggingface.co/fishaudio/s2">
<img alt="HuggingFace Model" src="https://img.shields.io/badge/🤗%20-models-orange"/>
</a>
<a target="_blank" href="https://fish.audio/blog/fish-audio-open-sources-s2/">
<img alt="Fish Audio Blog" src="https://img.shields.io/badge/Blog-Fish_Audio_S2-1f7a8c?style=flat-square&logo=readme&logoColor=white"/>
</a>
<a target="_blank" href="https://arxiv.org/abs/2603.08823">
<img alt="Paper | Technical Report" src="https://img.shields.io/badge/Paper-Technical_Report-b31b1b?style=flat-square"/>
</a>
</div>
!!! info "License notice"
    This codebase and its associated model weights are released under the **FISH AUDIO RESEARCH LICENSE**. Please refer to [LICENSE](https://github.com/fishaudio/fish-speech/blob/main/LICENSE) for more details.
!!! warning "Legal disclaimer"
    We do not accept any responsibility for illegal use of the codebase. Please check your local laws regarding the DMCA and other related laws.
## Quick start
### Start from the docs
This is the official documentation for Fish Audio S2. Use the links below to get started:
- [Installation](https://speech.fish.audio/ar/install/)
- [Command-line inference](https://speech.fish.audio/ar/inference/)
- [WebUI inference](https://speech.fish.audio/ar/inference/)
- [Server inference](https://speech.fish.audio/ar/server/)
- [Docker setup](https://speech.fish.audio/ar/install/)
> [!IMPORTANT]
> **For the SGLang server, see the [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).**
### LLM agent guide
```
Install and configure Fish Audio S2 by following the instructions at https://speech.fish.audio/ar/install/ .
```
## Fish Audio S2
**The best text-to-speech system among both open-source and closed-source systems**
Fish Audio S2 is the latest model from [Fish Audio](https://fish.audio/). Trained on more than 10 million hours of audio across about 50 languages, it combines reinforcement-learning alignment with a Dual-Autoregressive architecture to produce natural, realistic, and emotionally expressive speech.
S2 supports fine-grained inline control of prosody and emotion through natural-language tags such as `[laugh]`, `[whispers]`, and `[super happy]`, and natively supports multi-speaker and multi-turn dialogue generation.
You can try the model on the [Fish Audio website](https://fish.audio/), and read more in the [blog post](https://fish.audio/blog/fish-audio-open-sources-s2/) and the [technical report](https://arxiv.org/abs/2603.08823).
### Model variants
| Model | Size | Availability | Description |
|------|------|-------------|-------------|
| S2-Pro | 4B parameters | [HuggingFace](https://huggingface.co/fishaudio/s2-pro) | Full-featured flagship model with the highest quality and stability |
More details can be found in the [technical report](https://arxiv.org/abs/2411.01156).
## Benchmark results
| Benchmark | Fish Audio S2 |
|------|------|
| Seed-TTS Eval - WER (Chinese) | **0.54%** (best overall) |
| Seed-TTS Eval - WER (English) | **0.99%** (best overall) |
| Audio Turing Test (with instruction) | **0.515** posterior mean |
| EmergentTTS-Eval - win rate | **81.88%** (highest overall) |
| Fish Instruction Benchmark - TAR | **93.3%** |
| Fish Instruction Benchmark - quality | **4.51 / 5.0** |
| Multilingual (MiniMax Testset) - best WER | **11 of 24** languages |
| Multilingual (MiniMax Testset) - best SIM | **17 of 24** languages |
On Seed-TTS Eval, S2 achieved the lowest WER among all evaluated models, including closed-source systems: Qwen3-TTS (0.77/1.24), MiniMax Speech-02 (0.99/1.90), and Seed-TTS (1.12/2.25). On the Audio Turing Test, its score of 0.515 exceeds Seed-TTS (0.417) by 24% and MiniMax-Speech (0.387) by 33%. On EmergentTTS-Eval, S2 is particularly strong in paralinguistics (91.61%), questions (84.41%), and syntactic complexity (83.39%).
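The Audio Turing Test percentages are relative gains over each baseline score. As a quick check, the sketch below recomputes them from the scores quoted above; the helper name `relative_gain` is ours, not from the report.

```python
def relative_gain(score, baseline):
    """Relative improvement of score over baseline, in percent."""
    return (score - baseline) / baseline * 100


# Audio Turing Test scores quoted above.
print(round(relative_gain(0.515, 0.417)))  # 24 (vs. Seed-TTS)
print(round(relative_gain(0.515, 0.387)))  # 33 (vs. MiniMax-Speech)
```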
## Highlights
<img src="../assets/totalability.png" width=200%>
### Fine-grained inline control via natural language
Fish Audio S2 enables localized control over speech generation by embedding natural-language instructions directly at specific word or phrase positions in the text. Rather than relying on a fixed set of predefined tags, S2 accepts free-form text descriptions such as `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]`, enabling open-ended, word-level control of expression.
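To make the input format concrete, the sketch below separates bracketed instructions from the spoken text, for example to preview which instructions a script contains. It is only an illustration of the text format: `split_tags` is a hypothetical helper, and the model itself takes the tagged text directly.

```python
import re

# A bracketed instruction: "[" followed by any text without nested brackets, then "]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text):
    """Return (instructions, plain_text) for a text with inline bracketed instructions."""
    instructions = TAG_PATTERN.findall(text)
    # Replace each instruction with a space, then collapse repeated whitespace.
    plain_text = " ".join(TAG_PATTERN.sub(" ", text).split())
    return instructions, plain_text


tags, plain = split_tags("[whisper in small voice] I have a secret. [pitch up] Just kidding!")
print(tags)   # ['whisper in small voice', 'pitch up']
print(plain)  # I have a secret. Just kidding!
```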
### Dual-Autoregressive architecture
S2 builds on a decoder-only Transformer paired with an RVQ-based audio codec (10 codebooks at a frame rate of about 21 Hz). The Dual-AR architecture splits generation into two stages:
- **Slow AR** operates along the time axis and predicts the primary semantic codebook.
- **Fast AR** generates the remaining 9 residual codebooks at each time step to reconstruct fine acoustic detail.
This asymmetric design (4B parameters along the time axis and 400M along the depth axis) improves inference efficiency while preserving audio quality.
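The two-stage loop can be sketched as follows. This is a toy illustration of the control flow only, under stated assumptions: both "models" are random stand-ins, the vocabulary size of 1024 is arbitrary, and the function names are hypothetical rather than taken from the codebase.

```python
import random

NUM_CODEBOOKS = 10  # 1 semantic + 9 residual codebooks, as described above
FRAME_RATE_HZ = 21  # approximate frame rate, as described above
VOCAB_SIZE = 1024   # arbitrary illustrative value, not from the source


def slow_ar_step(frames, rng):
    """Stand-in for the Slow AR: one semantic token per time step.

    The real model conditions on the previous frames; this stub ignores them.
    """
    return rng.randrange(VOCAB_SIZE)


def fast_ar_step(semantic_token, rng):
    """Stand-in for the Fast AR: the 9 residual tokens of one time step.

    The real model conditions on the semantic token; this stub ignores it.
    """
    return [rng.randrange(VOCAB_SIZE) for _ in range(NUM_CODEBOOKS - 1)]


def generate(seconds, seed=0):
    """Generate one frame of 10 tokens per time step for the given duration."""
    rng = random.Random(seed)
    frames = []
    for _ in range(round(seconds * FRAME_RATE_HZ)):
        semantic = slow_ar_step(frames, rng)              # time axis
        residual = fast_ar_step(semantic, rng)            # depth axis
        frames.append([semantic] + residual)
    return frames


frames = generate(2)
print(len(frames), len(frames[0]))  # 42 10
```

At about 21 frames per second and 10 codebooks per frame, two seconds of audio correspond to roughly 42 slow steps and 420 tokens in total, which is why the per-frame work is delegated to the smaller Fast AR.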
### Reinforcement-learning alignment
S2 uses Group Relative Policy Optimization (GRPO) for post-training alignment. The same models that were used to filter and annotate the training data are reused directly as reward models during reinforcement learning, which eliminates the distribution mismatch between pre-training data and post-training objectives. The reward signal combines semantic accuracy, instruction adherence, acoustic preference scoring, and timbre similarity.
### Production streaming via SGLang
Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, S2 directly inherits SGLang's native serving optimizations, including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.
On a single NVIDIA H200 GPU:
- **Real-time factor (RTF):** 0.195
- **Time to first audio:** about 100 ms
- **Throughput:** more than 3,000 acoustic tokens/s while keeping RTF below 0.5
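The real-time factor is the ratio of processing time to audio duration, so values below 1 mean faster-than-real-time synthesis. The short sketch below works through what an RTF of 0.195 implies; the function name is ours, and the one-minute clip is only an example.

```python
def synthesis_seconds(audio_seconds, rtf):
    """Processing time needed to synthesize audio_seconds of speech at a given RTF."""
    return audio_seconds * rtf


# One minute of speech at the RTF quoted above.
print(f"{synthesis_seconds(60, 0.195):.1f} s")  # 11.7 s
# Equivalent speed relative to real time.
print(f"{1 / 0.195:.1f}x")                      # 5.1x
```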
### Multilingual support
Fish Audio S2 supports high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing. Supported languages include:
**English, Chinese, Japanese, Korean, Arabic, German, French...**
**And many more!**
The list is constantly expanding; check [Fish Audio](https://fish.audio/) for the latest releases.
### Native multi-speaker generation
<img src="../assets/chattemplate.png" width=200%>
Fish Audio S2 lets users upload reference audio that contains multiple speakers; the model handles each speaker's characteristics through the `<|speaker:i|>` token. You can then steer the model with speaker ID tokens, so a single generation can include multiple speakers. You no longer need to upload a separate reference file for each speaker.
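As a rough illustration of how speaker ID tokens could be placed in a script, the sketch below prefixes each dialogue turn with its `<|speaker:i|>` token. The helper `tag_dialogue` is hypothetical, and the exact spacing or template the model expects is an assumption; check the inference code for the authoritative format.

```python
def tag_dialogue(turns):
    """Prefix each (speaker_index, text) turn with its <|speaker:i|> token."""
    return "".join(f"<|speaker:{index}|>{text}" for index, text in turns)


script = tag_dialogue([(0, "Did you hear the news?"), (1, "Not yet, tell me.")])
print(script)
# <|speaker:0|>Did you hear the news?<|speaker:1|>Not yet, tell me.
```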
### Multi-turn dialogue generation
Thanks to its extended context, the model can now use earlier information to improve expressiveness in subsequently generated content, making the output sound more natural.
### Rapid voice cloning
Fish Audio S2 supports accurate voice cloning from a short reference sample (typically 10-30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
To use the SGLang server, see the [SGLang-Omni README](https://github.com/sgl-project/sglang-omni/blob/main/sglang_omni/models/fishaudio_s2_pro/README.md).
---
## Credits
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
- [Qwen3](https://github.com/QwenLM/Qwen3)
## Technical report
```bibtex
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}
@misc{liao2026fishaudios2technical,
title={Fish Audio S2 Technical Report},
author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
year={2026},
eprint={2603.08823},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2603.08823},
}
```
================================================
FILE: docs/ar/inference.md
================================================
# Inference
The Fish Audio S2 model requires a large amount of VRAM. We recommend a GPU with at least 24 GB of memory for inference.
## Download weights
First, download the model weights:
```bash
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro
```
## Command-line inference
!!! note
    If you plan to let the model choose a voice timbre at random, you can skip step 1.
### 1. Get VQ tokens from the reference audio
```bash
python fish_speech/models/dac/inference.py \
-i "test.wav" \
--checkpoint-path "checkpoints/s2-pro/codec.pth"
```
You should get `fake.npy` and `fake.wav`.
### 2. Generate semantic tokens from text:
```bash
python fish_speech/models/text2semantic/inference.py \
--text "The text you want to convert" \
--prompt-text "Your reference text" \
--prompt-tokens "fake.npy" \
# --compile
```
This command creates a `codes_N` file in the working directory, where N is an integer starting from 0.
!!! note
    You may want to use `--compile` to fuse CUDA kernels for faster inference. However, we recommend using our sglang inference acceleration instead.
    Conversely, if you do not plan to use acceleration, leave the `--compile` parameter commented out.
!!! info
    For GPUs that do not support bf16, you may need to use the `--half` parameter.
### 3. Generate audio from semantic tokens:
```bash
python fish_speech/models/dac/inference.py \
-i "codes_0.npy"
```
You will then get a `fake.wav` file.
## WebUI inference
### 1. Gradio WebUI
For compatibility, we still maintain the previous Gradio WebUI.
```bash
python tools/run_webui.py # add --compile if you need acceleration
```
### 2. Awesome WebUI
Awesome WebUI is a modern web interface built with TypeScript that offers richer features and a better user experience.
**Build the WebUI:**
You need Node.js and npm installed on your local machine or server.
1. Enter the `awesome_webui` directory:
```bash
cd awesome_webui
```
2. Install dependencies:
```bash
npm install
```
3. Build the WebUI:
```bash
npm run build
```
**Start the backend server:**
After building the WebUI, return to the project root and start the API server:
```bash
python tools/api_server.py --listen 0.0.0.0:8888 --compile
```
**Access:**
Once the server is running, you can access it in your browser at:
`http://localhost:8888/ui`
================================================
FILE: docs/ar/install.md
================================================
## Requirements
- GPU memory: 24 GB (for inference)
- System: Linux, WSL
## System setup
Fish Audio S2 supports multiple installation methods. Choose the one that best fits your development environment.
**Prerequisites**: Install the system dependencies for audio processing:
```bash
apt install portaudio19-dev libsox-dev ffmpeg
```
### Conda
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
# Install the GPU version (choose your CUDA version: cu126, cu128, cu129)
pip install -e .[cu129]
# Install the CPU-only version
pip install -e .[cpu]
# Default installation (uses the default PyTorch index)
pip install -e .
# If the installation fails because of pyaudio, consider running:
# conda install pyaudio
# then run pip install -e . again
```
### UV
UV provides a faster way to install dependencies:
```bash
# Install the GPU version (choose your CUDA version: cu126, cu128, cu129)
uv sync --python 3.12 --extra cu129
# Install the CPU-only version
uv sync --python 3.12 --extra cpu
```
### Intel Arc XPU support
For Intel Arc GPU users, install with XPU support as follows:
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
# Install the required C++ standard library
conda install libstdcxx -c conda-forge
# Install PyTorch with Intel XPU support
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu
# Install Fish Speech
pip install -e .
```
!!! warning
    The `compile` option is not supported on Windows and macOS. If you want to run with compilation, you need to install Triton yourself.
## Docker setup
The Fish Audio S2 model series offers multiple Docker deployment options to suit different needs. You can use the pre-built images from Docker Hub, build locally with Docker Compose, or build custom images manually.
We provide Docker images for both the WebUI and the API server, for both GPU (CUDA 12.6 by default) and CPU. To build locally, follow the instructions below. If you only want to use the pre-built images, go straight to the [inference guide](inference.md).
### Prerequisites
- Docker and Docker Compose installed
- NVIDIA Docker runtime installed (for GPU support)
- At least 24 GB of GPU memory for CUDA inference
### Using Docker Compose
For development or customization, you can use Docker Compose to build and run locally:
```bash
# First, clone the repository
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
# Start the WebUI with CUDA
docker compose --profile webui up
# Start the WebUI with compile optimization
COMPILE=1 docker compose --profile webui up
# Start the API server
docker compose --profile server up
# Start the API server with compile optimization
COMPILE=1 docker compose --profile server up
# CPU-only deployment
BACKEND=cpu docker compose --profile webui up
```
#### Docker Compose environment variables
You can customize the deployment with environment variables:
```bash
# Example .env file
BACKEND=cuda # or cpu
COMPILE=1 # enable compile optimization
GRADIO_PORT=7860 # WebUI port
API_PORT=8080 # API server port
UV_VERSION=0.8.15 # UV package manager version
```
The command builds the image and runs the container. You can access the WebUI at `http://localhost:7860` and the API server at `http://localhost:8080`.
### Manual Docker build
For advanced users who want to customize the build process:
```bash
# Build the WebUI image with CUDA support
docker build \
--platform linux/amd64 \
-f docker/Dockerfile \
--build-arg BACKEND=cuda \
--build-arg CUDA_VER=12.6.0 \
--build-arg UV_EXTRA=cu126 \
--target webui \
-t fish-speech-webui:cuda .
# Build the API server image with CUDA support
docker build \
--platform linux/amd64 \
-f docker/Dockerfile \
--build-arg BACKEND=cuda \
--build-arg CUDA_VER=12.6.0 \
--build-arg UV_EXTRA=cu126 \
--target server \
-t fish-speech-server:cuda .
# Build a CPU-only image (supports multiple platforms)
docker build \
--platform linux/amd64,linux/arm64 \
-f docker/Dockerfile \
--build-arg BACKEND=cpu \
--target webui \
-t fish-speech-webui:cpu .
# Build the development image
docker build \
--platform linux/amd64 \
-f docker/Dockerfile \
--build-arg BACKEND=cuda \
--target dev \
-t fish-speech-dev:cuda .
```
#### Build arguments
- `BACKEND`: `cuda` or `cpu` (default: `cuda`)
- `CUDA_VER`: CUDA version (default: `12.6.0`)
- `UV_EXTRA`: UV extra for CUDA (default: `cu126`)
- `UBUNTU_VER`: Ubuntu version (default: `24.04`)
- `PY_VER`: Python version (default: `3.12`)
### Volume mounts
Both methods require mounting the following directories:
- `./checkpoints:/app/checkpoints` - model weights directory
- `./references:/app/references` - reference audio files directory
### Environment variables
- `COMPILE=1` - enable `torch.compile` for faster inference (about 10x)
- `GRADIO_SERVER_NAME=0.0.0.0` - WebUI server host
- `GRADIO_SERVER_PORT=7860` - WebUI server port
- `API_SERVER_NAME=0.0.0.0` - API server host
- `API_SERVER_PORT=8080` - API server port
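The sketch below shows one way a launcher script could resolve these variables with the defaults listed above. It is an illustration only: `resolve_settings` is a hypothetical helper, not the logic in `entrypoint.sh`, and treating an unset `COMPILE` as disabled is an assumption.

```python
import os

# Defaults taken from the list above.
DEFAULTS = {
    "GRADIO_SERVER_NAME": "0.0.0.0",
    "GRADIO_SERVER_PORT": "7860",
    "API_SERVER_NAME": "0.0.0.0",
    "API_SERVER_PORT": "8080",
}


def resolve_settings(environ=None):
    """Merge environment variables with the documented defaults."""
    environ = os.environ if environ is None else environ
    settings = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    # Assumption: compilation is off unless COMPILE is exactly "1".
    settings["COMPILE"] = environ.get("COMPILE", "0") == "1"
    return settings
```

For example, `resolve_settings({"COMPILE": "1"})` keeps every host and port at its default and sets `COMPILE` to `True`.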
!!! note
    The Docker containers expect the model weights to be mounted at `/app/checkpoints`. Make sure you download the required model weights before starting the containers.
!!! warning
    GPU support requires the NVIDIA Docker runtime. For CPU-only deployment, remove the `--gpus all` flag and use the CPU images.
================================================
FILE: docs/en/finetune.md
================================================
gitextract_ft5t8lt3/
├── .dockerignore
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── bug_report.yml
│ │ ├── config.yml
│ │ └── feature_request.yml
│ ├── pull_request_template.md
│ └── workflows/
│ ├── build-docker-image.yml
│ ├── docs.yml
│ └── stale.yml
├── .gitignore
├── .pre-commit-config.yaml
├── .project-root
├── .readthedocs.yaml
├── API_FLAGS.txt
├── LICENSE
├── README.md
├── awesome_webui/
│ ├── .gitignore
│ ├── README.md
│ ├── eslint.config.js
│ ├── index.html
│ ├── package.json
│ ├── src/
│ │ ├── App.tsx
│ │ ├── components/
│ │ │ └── ui/
│ │ │ ├── alert.tsx
│ │ │ ├── badge.tsx
│ │ │ ├── button.tsx
│ │ │ ├── card.tsx
│ │ │ ├── collapsible.tsx
│ │ │ ├── dialog.tsx
│ │ │ ├── label.tsx
│ │ │ ├── scroll-area.tsx
│ │ │ ├── separator.tsx
│ │ │ ├── slider.tsx
│ │ │ ├── switch.tsx
│ │ │ ├── textarea.tsx
│ │ │ └── toggle-group.tsx
│ │ ├── index.css
│ │ └── main.tsx
│ ├── tsconfig.app.json
│ ├── tsconfig.json
│ ├── tsconfig.node.json
│ └── vite.config.ts
├── compose.base.yml
├── compose.yml
├── docker/
│ └── Dockerfile
├── dockerfile.dev
├── docs/
│ ├── CNAME
│ ├── README.ar.md
│ ├── README.ja.md
│ ├── README.ko.md
│ ├── README.pt-BR.md
│ ├── README.zh.md
│ ├── ar/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ └── install.md
│ ├── en/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ ├── install.md
│ │ └── server.md
│ ├── ja/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ └── install.md
│ ├── ko/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ └── install.md
│ ├── pt/
│ │ ├── finetune.md
│ │ ├── index.md
│ │ ├── inference.md
│ │ └── install.md
│ ├── requirements.txt
│ ├── stylesheets/
│ │ └── extra.css
│ └── zh/
│ ├── finetune.md
│ ├── index.md
│ ├── inference.md
│ └── install.md
├── entrypoint.sh
├── fish_speech/
│ ├── callbacks/
│ │ ├── __init__.py
│ │ └── grad_norm.py
│ ├── configs/
│ │ ├── base.yaml
│ │ ├── lora/
│ │ │ └── r_8_alpha_16.yaml
│ │ ├── modded_dac_vq.yaml
│ │ └── text2semantic_finetune.yaml
│ ├── content_sequence.py
│ ├── conversation.py
│ ├── datasets/
│ │ ├── concat_repeat.py
│ │ ├── protos/
│ │ │ ├── text-data.proto
│ │ │ ├── text_data_pb2.py
│ │ │ └── text_data_stream.py
│ │ ├── semantic.py
│ │ └── vqgan.py
│ ├── i18n/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── core.py
│ │ ├── locale/
│ │ │ ├── en_US.json
│ │ │ ├── es_ES.json
│ │ │ ├── ja_JP.json
│ │ │ ├── ko_KR.json
│ │ │ ├── pt_BR.json
│ │ │ └── zh_CN.json
│ │ └── scan.py
│ ├── inference_engine/
│ │ ├── __init__.py
│ │ ├── reference_loader.py
│ │ ├── utils.py
│ │ └── vq_manager.py
│ ├── models/
│ │ ├── dac/
│ │ │ ├── __init__.py
│ │ │ ├── inference.py
│ │ │ ├── modded_dac.py
│ │ │ └── rvq.py
│ │ └── text2semantic/
│ │ ├── __init__.py
│ │ ├── inference.py
│ │ ├── lit_module.py
│ │ ├── llama.py
│ │ └── lora.py
│ ├── scheduler.py
│ ├── text/
│ │ ├── __init__.py
│ │ └── clean.py
│ ├── tokenizer.py
│ ├── train.py
│ └── utils/
│ ├── __init__.py
│ ├── braceexpand.py
│ ├── context.py
│ ├── file.py
│ ├── instantiators.py
│ ├── logger.py
│ ├── logging_utils.py
│ ├── rich_utils.py
│ ├── schema.py
│ ├── spectrogram.py
│ └── utils.py
├── inference.ipynb
├── mkdocs.yml
├── pyproject.toml
├── pyrightconfig.json
└── tools/
├── api_client.py
├── api_server.py
├── llama/
│ ├── build_dataset.py
│ ├── eval_in_context.py
│ ├── merge_lora.py
│ └── quantize.py
├── run_webui.py
├── server/
│ ├── api_utils.py
│ ├── exception_handler.py
│ ├── inference.py
│ ├── model_manager.py
│ ├── model_utils.py
│ └── views.py
├── vqgan/
│ ├── create_train_split.py
│ └── extract_vq.py
└── webui/
├── __init__.py
├── inference.py
└── variables.py
SYMBOL INDEX (483 symbols across 65 files)
FILE: awesome_webui/src/App.tsx
type AudioFormat (line 47) | type AudioFormat = 'mp3' | 'wav' | 'pcm' | 'opus'
type LatencyMode (line 48) | type LatencyMode = 'normal' | 'balanced'
type ControlsState (line 54) | type ControlsState = {
type Metrics (line 65) | type Metrics = {
type StatusState (line 71) | type StatusState = {
type ReferenceItem (line 76) | type ReferenceItem = {
type SpeakerGroup (line 84) | type SpeakerGroup = {
type PendingReference (line 89) | type PendingReference = {
function createId (line 116) | function createId() {
function arrayBufferToBase64 (line 120) | function arrayBufferToBase64(buffer: ArrayBuffer): string {
function createSpeakerGroup (line 129) | function createSpeakerGroup(): SpeakerGroup {
function buildReferencesPayload (line 138) | function buildReferencesPayload(
function buildPreviewPayload (line 152) | function buildPreviewPayload(
function buildRequestPayload (line 171) | function buildRequestPayload(
function createFileName (line 190) | function createFileName(inputText: string) {
function getErrorMessage (line 195) | function getErrorMessage(error: unknown) {
function waitForSourceBuffer (line 199) | function waitForSourceBuffer(sourceBuffer: SourceBuffer) {
function canUseStreamingPlayback (line 214) | function canUseStreamingPlayback(format: AudioFormat) {
type SettingSliderProps (line 219) | type SettingSliderProps = {
function SettingSlider (line 229) | function SettingSlider({
function App (line 262) | function App() {
FILE: awesome_webui/src/components/ui/alert.tsx
function Alert (line 19) | function Alert({
function AlertTitle (line 27) | function AlertTitle({ className, ...props }: React.ComponentProps<'h5'>) {
function AlertDescription (line 31) | function AlertDescription({ className, ...props }: React.ComponentProps<...
FILE: awesome_webui/src/components/ui/badge.tsx
function Badge (line 23) | function Badge({
FILE: awesome_webui/src/components/ui/button.tsx
type ButtonProps (line 33) | type ButtonProps = React.ComponentProps<'button'> &
function Button (line 38) | function Button({ className, variant, size, asChild = false, ...props }:...
FILE: awesome_webui/src/components/ui/card.tsx
function Card (line 5) | function Card({ className, ...props }: React.ComponentProps<'div'>) {
function CardHeader (line 15) | function CardHeader({ className, ...props }: React.ComponentProps<'div'>) {
function CardTitle (line 19) | function CardTitle({ className, ...props }: React.ComponentProps<'div'>) {
function CardDescription (line 23) | function CardDescription({ className, ...props }: React.ComponentProps<'...
function CardContent (line 27) | function CardContent({ className, ...props }: React.ComponentProps<'div'...
FILE: awesome_webui/src/components/ui/dialog.tsx
function DialogOverlay (line 12) | function DialogOverlay({
function DialogContent (line 24) | function DialogContent({
function DialogHeader (line 49) | function DialogHeader({ className, ...props }: React.ComponentProps<'div...
function DialogFooter (line 53) | function DialogFooter({ className, ...props }: React.ComponentProps<'div...
function DialogTitle (line 57) | function DialogTitle({ className, ...props }: React.ComponentProps<typeo...
function DialogDescription (line 66) | function DialogDescription({
FILE: awesome_webui/src/components/ui/label.tsx
function Label (line 6) | function Label({ className, ...props }: React.ComponentProps<typeof Labe...
FILE: awesome_webui/src/components/ui/scroll-area.tsx
function ScrollArea (line 6) | function ScrollArea({
function ScrollBar (line 22) | function ScrollBar({
FILE: awesome_webui/src/components/ui/separator.tsx
function Separator (line 6) | function Separator({
FILE: awesome_webui/src/components/ui/slider.tsx
function Slider (line 6) | function Slider({
FILE: awesome_webui/src/components/ui/switch.tsx
function Switch (line 6) | function Switch({
FILE: awesome_webui/src/components/ui/textarea.tsx
function Textarea (line 5) | function Textarea({ className, ...props }: React.ComponentProps<'textare...
FILE: awesome_webui/src/components/ui/toggle-group.tsx
function ToggleGroup (line 23) | function ToggleGroup({
function ToggleGroupItem (line 35) | function ToggleGroupItem({
FILE: awesome_webui/vite.config.ts
function inlineEntryAssets (line 7) | function inlineEntryAssets(): Plugin {
FILE: fish_speech/callbacks/grad_norm.py
function grad_norm (line 15) | def grad_norm(
class GradNormMonitor (line 55) | class GradNormMonitor(Callback):
method __init__ (line 60) | def __init__(
method on_after_backward (line 77) | def on_after_backward(self, trainer: Trainer, model: LightningModule) ...
method log_sub_module_grad_norm (line 100) | def log_sub_module_grad_norm(
FILE: fish_speech/content_sequence.py
function restore_ndarray (line 14) | def restore_ndarray(obj, to_tensor: bool = False):
class BasePart (line 25) | class BasePart:
class VQPart (line 31) | class VQPart(BasePart):
method __post_init__ (line 35) | def __post_init__(self: "VQPart"):
class TextPart (line 41) | class TextPart(BasePart):
method __post_init__ (line 46) | def __post_init__(self: "TextPart"):
class AudioPart (line 53) | class AudioPart(BasePart):
method __post_init__ (line 57) | def __post_init__(self: "AudioPart"):
class EncodedMessage (line 63) | class EncodedMessage:
class ContentSequence (line 76) | class ContentSequence:
method __init__ (line 86) | def __init__(
method append (line 121) | def append(
method encode (line 154) | def encode(
method encode_for_inference (line 282) | def encode_for_inference(
method visualize (line 326) | def visualize(
FILE: fish_speech/conversation.py
class Message (line 20) | class Message:
class Conversation (line 33) | class Conversation:
method __init__ (line 36) | def __init__(self: "Conversation", messages: list[Message] | None = No...
method _build_content_sequence (line 39) | def _build_content_sequence(
method encode (line 79) | def encode(
method encode_for_inference (line 96) | def encode_for_inference(
method visualize (line 105) | def visualize(
method append (line 125) | def append(self: "Conversation", message: Message):
method to_content_sequence (line 128) | def to_content_sequence(
FILE: fish_speech/datasets/concat_repeat.py
class ConcatRepeatDataset (line 8) | class ConcatRepeatDataset(Dataset):
method cumsum (line 14) | def cumsum(sequence, repeats):
method __init__ (line 22) | def __init__(self, datasets: Iterable[Dataset], repeats: list[int]):
method __len__ (line 40) | def __len__(self):
method __getitem__ (line 43) | def __getitem__(self, idx):
FILE: fish_speech/datasets/protos/text_data_stream.py
function read_pb_stream (line 6) | def read_pb_stream(f):
function write_pb_stream (line 18) | def write_pb_stream(f, text_data):
function pack_pb_stream (line 24) | def pack_pb_stream(text_data):
function split_pb_stream (line 29) | def split_pb_stream(f):
FILE: fish_speech/datasets/semantic.py
function split_by_rank_worker (line 32) | def split_by_rank_worker(files):
class AutoTextSemanticInstructionIterableDataset (line 59) | class AutoTextSemanticInstructionIterableDataset(IterableDataset):
method __init__ (line 73) | def __init__(
method __iter__ (line 114) | def __iter__(self):
method init_mock_data_server (line 118) | def init_mock_data_server(self):
method sample_data (line 157) | def sample_data(self):
method pack_sentences (line 185) | def pack_sentences(
method augment (line 252) | def augment(self):
class AutoTextSemanticInstructionDataset (line 286) | class AutoTextSemanticInstructionDataset(Dataset):
method __init__ (line 300) | def __init__(
method _init_data (line 341) | def _init_data(self):
method __len__ (line 390) | def __len__(self):
method __getitem__ (line 393) | def __getitem__(self, idx):
method pack_sentences (line 396) | def pack_sentences(
class InterleaveDataset (line 464) | class InterleaveDataset(IterableDataset):
method __init__ (line 465) | def __init__(
method __iter__ (line 477) | def __iter__(self):
class TextDataCollator (line 495) | class TextDataCollator:
method __call__ (line 499) | def __call__(self, examples):
method batchify (line 522) | def batchify(self, examples, tokens_key="tokens", labels_key="labels"):
class SemanticDataModule (line 568) | class SemanticDataModule(LightningDataModule):
method __init__ (line 569) | def __init__(
method train_dataloader (line 595) | def train_dataloader(self):
method val_dataloader (line 604) | def val_dataloader(self):
FILE: fish_speech/datasets/vqgan.py
class VQGANDataset (line 16) | class VQGANDataset(Dataset):
method __init__ (line 17) | def __init__(
method __len__ (line 38) | def __len__(self):
method get_item (line 41) | def get_item(self, idx):
method __getitem__ (line 67) | def __getitem__(self, idx):
class VQGANCollator (line 79) | class VQGANCollator:
method __call__ (line 80) | def __call__(self, batch):
class VQGANDataModule (line 99) | class VQGANDataModule(LightningDataModule):
method __init__ (line 100) | def __init__(
method train_dataloader (line 116) | def train_dataloader(self):
method val_dataloader (line 126) | def val_dataloader(self):
FILE: fish_speech/i18n/core.py
function load_language_list (line 9) | def load_language_list(language):
class I18nAuto (line 16) | class I18nAuto:
method __init__ (line 17) | def __init__(self):
method __call__ (line 33) | def __call__(self, key):
method __repr__ (line 36) | def __repr__(self):
FILE: fish_speech/i18n/scan.py
function extract_i18n_strings (line 12) | def extract_i18n_strings(node):
FILE: fish_speech/inference_engine/__init__.py
class TTSInferenceEngine (line 22) | class TTSInferenceEngine(ReferenceLoader, VQManager):
method __init__ (line 24) | def __init__(
method inference (line 40) | def inference(self, req: ServeTTSRequest) -> Generator[InferenceResult...
method send_Llama_request (line 144) | def send_Llama_request(
method get_audio_segment (line 179) | def get_audio_segment(self, result: GenerateResponse) -> np.ndarray:
FILE: fish_speech/inference_engine/reference_loader.py
class ReferenceLoader (line 20) | class ReferenceLoader:
method __init__ (line 21) | def __init__(self) -> None:
method load_by_id (line 40) | def load_by_id(
method load_by_hash (line 75) | def load_by_hash(
method load_audio (line 109) | def load_audio(self, reference_audio: bytes | str, sr: int):
method list_reference_ids (line 131) | def list_reference_ids(self) -> list[str]:
method add_reference (line 167) | def add_reference(self, id: str, wav_file_path: str, reference_text: s...
method delete_reference (line 241) | def delete_reference(self, id: str) -> None:
FILE: fish_speech/inference_engine/utils.py
class InferenceResult (line 10) | class InferenceResult:
function wav_chunk_header (line 16) | def wav_chunk_header(
FILE: fish_speech/inference_engine/vq_manager.py
class VQManager (line 9) | class VQManager:
method __init__ (line 11) | def __init__(self):
method decode_vq_tokens (line 16) | def decode_vq_tokens(self, codes):
method encode_reference (line 24) | def encode_reference(self, reference_audio, enable_reference_audio):
FILE: fish_speech/models/dac/inference.py
function load_model (line 23) | def load_model(config_name, checkpoint_path, device="cuda"):
function main (line 71) | def main(input_path, output_path, config_name, checkpoint_path, device):
FILE: fish_speech/models/dac/modded_dac.py
class VQResult (line 19) | class VQResult:
function find_multiple (line 28) | def find_multiple(n: int, k: int) -> int:
class ModelArgs (line 35) | class ModelArgs:
method __post_init__ (line 52) | def __post_init__(self):
class KVCache (line 65) | class KVCache(nn.Module):
method __init__ (line 66) | def __init__(
method update (line 74) | def update(self, input_pos, k_val, v_val):
method clear_cache (line 88) | def clear_cache(self, prompt_len):
class Transformer (line 97) | class Transformer(nn.Module):
method __init__ (line 98) | def __init__(self, config: ModelArgs) -> None:
method setup_caches (line 123) | def setup_caches(self, max_batch_size, max_seq_length):
method forward (line 145) | def forward(
class TransformerBlock (line 174) | class TransformerBlock(nn.Module):
method __init__ (line 175) | def __init__(self, config: ModelArgs) -> None:
method forward (line 184) | def forward(
class Attention (line 198) | class Attention(nn.Module):
method __init__ (line 199) | def __init__(self, config: ModelArgs):
method _compute_conformer_pos_scores (line 225) | def _compute_conformer_pos_scores(self, q: Tensor, seqlen: int) -> Ten...
method forward (line 243) | def forward(
class FeedForward (line 308) | class FeedForward(nn.Module):
method __init__ (line 309) | def __init__(self, config: ModelArgs) -> None:
method forward (line 316) | def forward(self, x: Tensor) -> Tensor:
class RMSNorm (line 320) | class RMSNorm(nn.Module):
method __init__ (line 321) | def __init__(self, dim: int, eps: float = 1e-5):
method _norm (line 326) | def _norm(self, x):
method forward (line 329) | def forward(self, x: Tensor) -> Tensor:
class LayerScale (line 334) | class LayerScale(nn.Module):
method __init__ (line 335) | def __init__(
method forward (line 345) | def forward(self, x: Tensor) -> Tensor:
class WindowLimitedTransformer (line 349) | class WindowLimitedTransformer(Transformer):
method __init__ (line 354) | def __init__(
method make_window_limited_mask (line 380) | def make_window_limited_mask(
method make_mask (line 400) | def make_mask(
method forward (line 418) | def forward(
function precompute_freqs_cis (line 442) | def precompute_freqs_cis(
function apply_rotary_emb (line 455) | def apply_rotary_emb(x: Tensor, freqs_cis: Tensor) -> Tensor:
function init_weights (line 470) | def init_weights(m):
function unpad1d (line 476) | def unpad1d(x: torch.Tensor, paddings: tp.Tuple[int, int]):
function get_extra_padding_for_conv1d (line 485) | def get_extra_padding_for_conv1d(
function pad1d (line 495) | def pad1d(
class CausalConvNet (line 521) | class CausalConvNet(nn.Module):
method __init__ (line 522) | def __init__(
method forward (line 546) | def forward(self, x):
method weight_norm (line 554) | def weight_norm(self, name="weight", dim=0):
method remove_weight_norm (line 558) | def remove_weight_norm(self):
class CausalTransConvNet (line 563) | class CausalTransConvNet(nn.Module):
method __init__ (line 564) | def __init__(
method forward (line 574) | def forward(self, x):
method weight_norm (line 582) | def weight_norm(self, name="weight", dim=0):
method remove_weight_norm (line 586) | def remove_weight_norm(self):
function CausalWNConv1d (line 591) | def CausalWNConv1d(*args, **kwargs):
function CausalWNConvTranspose1d (line 595) | def CausalWNConvTranspose1d(*args, **kwargs):
class ResidualUnit (line 599) | class ResidualUnit(nn.Module):
method __init__ (line 600) | def __init__(self, dim: int = 16, dilation: int = 1, causal: bool = Fa...
method forward (line 612) | def forward(self, x):
class EncoderBlock (line 623) | class EncoderBlock(nn.Module):
method __init__ (line 624) | def __init__(
method forward (line 666) | def forward(self, x):
class Encoder (line 670) | class Encoder(nn.Module):
method __init__ (line 671) | def __init__(
method forward (line 708) | def forward(self, x):
class DecoderBlock (line 712) | class DecoderBlock(nn.Module):
method __init__ (line 713) | def __init__(
method forward (line 756) | def forward(self, x):
class Decoder (line 760) | class Decoder(nn.Module):
method __init__ (line 761) | def __init__(
method forward (line 800) | def forward(self, x):
class DAC (line 804) | class DAC(BaseModel, CodecMixin):
method __init__ (line 805) | def __init__(
method preprocess (line 863) | def preprocess(self, audio_data, sample_rate):
method encode (line 874) | def encode(
method from_indices (line 925) | def from_indices(self, indices: torch.Tensor):
method decode (line 929) | def decode(self, z: torch.Tensor):
method forward (line 948) | def forward(
FILE: fish_speech/models/dac/rvq.py
function unpad1d (line 13) | def unpad1d(x: torch.Tensor, paddings: tp.Tuple[int, int]):
function get_extra_padding_for_conv1d (line 22) | def get_extra_padding_for_conv1d(
function pad1d (line 32) | def pad1d(
class CausalConvNet (line 58) | class CausalConvNet(nn.Module):
method __init__ (line 59) | def __init__(
method forward (line 83) | def forward(self, x):
method weight_norm (line 91) | def weight_norm(self, name="weight", dim=0):
method remove_weight_norm (line 95) | def remove_weight_norm(self):
class CausalTransConvNet (line 100) | class CausalTransConvNet(nn.Module):
method __init__ (line 101) | def __init__(
method forward (line 111) | def forward(self, x):
method weight_norm (line 119) | def weight_norm(self, name="weight", dim=0):
method remove_weight_norm (line 123) | def remove_weight_norm(self):
class ConvNeXtBlock (line 129) | class ConvNeXtBlock(nn.Module):
method __init__ (line 143) | def __init__(
method forward (line 173) | def forward(self, x, apply_residual: bool = True):
class VQResult (line 195) | class VQResult:
class DownsampleResidualVectorQuantize (line 204) | class DownsampleResidualVectorQuantize(nn.Module):
method __init__ (line 205) | def __init__(
method _init_weights (line 288) | def _init_weights(self, m):
method forward (line 293) | def forward(
method decode (line 352) | def decode(self, indices: torch.Tensor):
FILE: fish_speech/models/text2semantic/inference.py
function multinomial_sample_one_no_sync (line 43) | def multinomial_sample_one_no_sync(probs_sort):
function logits_to_probs (line 54) | def logits_to_probs(
function sample (line 80) | def sample(
function decode_one_token_ar (line 96) | def decode_one_token_ar(
function decode_n_tokens (line 184) | def decode_n_tokens(
function generate (line 243) | def generate(
function init_model (line 362) | def init_model(checkpoint_path, device, precision, compile=False):
function load_codec_model (line 396) | def load_codec_model(codec_checkpoint_path, device, precision=torch.bflo...
function encode_audio (line 421) | def encode_audio(audio_path, codec, device):
function decode_to_audio (line 440) | def decode_to_audio(codes, codec):
class GenerateResponse (line 448) | class GenerateResponse:
function split_text_by_speaker (line 454) | def split_text_by_speaker(text: str) -> list[str]:
function group_turns_into_batches (line 485) | def group_turns_into_batches(
function generate_long (line 523) | def generate_long(
class WrappedGenerateResponse (line 737) | class WrappedGenerateResponse:
class GenerateRequest (line 743) | class GenerateRequest:
function launch_thread_safe_queue (line 748) | def launch_thread_safe_queue(
function main (line 839) | def main(
FILE: fish_speech/models/text2semantic/lit_module.py
class TextToSemantic (line 16) | class TextToSemantic(L.LightningModule):
method __init__ (line 17) | def __init__(
method forward (line 29) | def forward(self, x):
method on_save_checkpoint (line 32) | def on_save_checkpoint(self, checkpoint):
method configure_optimizers (line 43) | def configure_optimizers(self) -> OptimizerLRScheduler:
method get_batch_logps (line 76) | def get_batch_logps(
method _step (line 109) | def _step(self, batch, batch_idx, stage: str):
method get_accuracy (line 193) | def get_accuracy(self, logits, labels):
method training_step (line 206) | def training_step(self, batch, batch_idx):
method validation_step (line 209) | def validation_step(self, batch, batch_idx):
FILE: fish_speech/models/text2semantic/llama.py
function find_multiple (line 21) | def find_multiple(n: int, k: int) -> int:
class BaseModelArgs (line 28) | class BaseModelArgs:
method __post_init__ (line 65) | def __post_init__(self):
method from_pretrained (line 76) | def from_pretrained(path: str):
method _from_fish_qwen3_omni (line 102) | def _from_fish_qwen3_omni(data: dict) -> "DualARModelArgs":
method save (line 145) | def save(self, path: str):
class NaiveModelArgs (line 151) | class NaiveModelArgs(BaseModelArgs):
class DualARModelArgs (line 156) | class DualARModelArgs(BaseModelArgs):
method __post_init__ (line 169) | def __post_init__(self):
class KVCache (line 196) | class KVCache(nn.Module):
method __init__ (line 197) | def __init__(
method update (line 205) | def update(self, input_pos, k_val, v_val):
class TransformerForwardResult (line 218) | class TransformerForwardResult:
class BaseTransformerForwardResult (line 224) | class BaseTransformerForwardResult:
function _remap_fish_qwen3_omni_keys (line 229) | def _remap_fish_qwen3_omni_keys(weights: OrderedDict) -> OrderedDict:
class BaseTransformer (line 249) | class BaseTransformer(nn.Module):
method __init__ (line 250) | def __init__(
method setup_caches (line 307) | def setup_caches(
method embed (line 326) | def embed(self, inp: Tensor) -> Tensor:
method forward (line 347) | def forward(
method forward_generate (line 390) | def forward_generate(
method _init_weights (line 468) | def _init_weights(self, module):
method from_pretrained (line 480) | def from_pretrained(
method save_pretrained (line 595) | def save_pretrained(self, path: str, drop_lora: bool = False):
class NaiveTransformer (line 613) | class NaiveTransformer(BaseTransformer):
method __init__ (line 614) | def __init__(self, config: NaiveModelArgs) -> None:
method decode (line 626) | def decode(self, result: BaseTransformerForwardResult) -> TransformerF...
method forward (line 641) | def forward(
method forward_generate (line 652) | def forward_generate(
class DualARTransformer (line 659) | class DualARTransformer(BaseTransformer):
method __init__ (line 660) | def __init__(self, config: NaiveModelArgs) -> None:
method setup_caches (line 707) | def setup_caches(
method forward (line 723) | def forward(
method forward_generate_fast (line 798) | def forward_generate_fast(
method forward_generate (line 818) | def forward_generate(
class TransformerBlock (line 830) | class TransformerBlock(nn.Module):
method __init__ (line 831) | def __init__(self, config: BaseModelArgs, use_sdpa: bool = True) -> None:
method forward (line 838) | def forward(
class Attention (line 846) | class Attention(nn.Module):
method __init__ (line 847) | def __init__(self, config: BaseModelArgs, use_sdpa: bool = True):
method load_hook (line 876) | def load_hook(self, state_dict, prefix, *args):
method forward (line 883) | def forward(
method eq_scaled_dot_product_attention (line 947) | def eq_scaled_dot_product_attention(
class FeedForward (line 978) | class FeedForward(nn.Module):
method __init__ (line 979) | def __init__(self, config: BaseModelArgs) -> None:
method forward (line 985) | def forward(self, x: Tensor) -> Tensor:
class RMSNorm (line 989) | class RMSNorm(nn.Module):
method __init__ (line 990) | def __init__(self, dim: int, eps: float = 1e-5):
method _norm (line 995) | def _norm(self, x):
method forward (line 998) | def forward(self, x: Tensor) -> Tensor:
function precompute_freqs_cis (line 1003) | def precompute_freqs_cis(seq_len: int, n_elem: int, base: int = 10000) -...
function apply_rotary_emb (line 1025) | def apply_rotary_emb(x: Tensor, freqs_cis: Tensor) -> Tensor:
FILE: fish_speech/models/text2semantic/lora.py
class LoraConfig (line 7) | class LoraConfig:
function _replace_embedding (line 13) | def _replace_embedding(old_embed, lora_config):
function setup_lora (line 25) | def setup_lora(model, lora_config):
function get_merged_state_dict (line 81) | def get_merged_state_dict(model):
FILE: fish_speech/scheduler.py
function get_cosine_schedule_with_warmup_lr_lambda (line 4) | def get_cosine_schedule_with_warmup_lr_lambda(
function get_constant_schedule_with_warmup_lr_lambda (line 28) | def get_constant_schedule_with_warmup_lr_lambda(
FILE: fish_speech/text/clean.py
function clean_text (line 24) | def clean_text(text):
FILE: fish_speech/tokenizer.py
class FishTokenizer (line 55) | class FishTokenizer:
method __init__ (line 56) | def __init__(self, model_path: str):
method vocab_size (line 91) | def vocab_size(self):
method pad_token_id (line 95) | def pad_token_id(self):
method eos_token_id (line 99) | def eos_token_id(self):
method get_token_id (line 102) | def get_token_id(self, token: str) -> int:
method encode (line 105) | def encode(
method decode (line 118) | def decode(self, tokens: Union[List[int], int], **kwargs) -> str:
method save_pretrained (line 121) | def save_pretrained(self, path: str):
method from_pretrained (line 125) | def from_pretrained(cls, path: str):
method __getattr__ (line 128) | def __getattr__(self, name):
FILE: fish_speech/train.py
function train (line 36) | def train(cfg: DictConfig) -> tuple[dict, dict]:
function main (line 135) | def main(cfg: DictConfig) -> Optional[float]:
FILE: fish_speech/utils/braceexpand.py
class UnbalancedBracesError (line 15) | class UnbalancedBracesError(ValueError):
function braceexpand (line 26) | def braceexpand(pattern: str, escape: bool = True) -> Iterator[str]:
function parse_pattern (line 105) | def parse_pattern(pattern: str, escape: bool) -> Iterator[str]:
function parse_expression (line 144) | def parse_expression(expr: str, escape: bool) -> Optional[Iterable[str]]:
function parse_sequence (line 156) | def parse_sequence(seq: str, escape: bool) -> Optional[Iterator[str]]:
function make_int_range (line 187) | def make_int_range(left: str, right: str, incr: Optional[str] = None) ->...
function make_char_range (line 200) | def make_char_range(left: str, right: str, incr: Optional[str] = None) -...
FILE: fish_speech/utils/context.py
function autocast_exclude_mps (line 6) | def autocast_exclude_mps(
FILE: fish_speech/utils/file.py
function get_latest_checkpoint (line 27) | def get_latest_checkpoint(path: Path | str) -> Path | None:
function audio_to_bytes (line 41) | def audio_to_bytes(file_path):
function read_ref_text (line 49) | def read_ref_text(ref_text):
function list_files (line 57) | def list_files(
function load_filelist (line 89) | def load_filelist(path: Path | str) -> list[tuple[Path, str, str, str]]:
FILE: fish_speech/utils/instantiators.py
function instantiate_callbacks (line 13) | def instantiate_callbacks(callbacks_cfg: DictConfig) -> List[Callback]:
function instantiate_loggers (line 33) | def instantiate_loggers(logger_cfg: DictConfig) -> List[Logger]:
FILE: fish_speech/utils/logger.py
class RankedLogger (line 7) | class RankedLogger(logging.LoggerAdapter):
method __init__ (line 10) | def __init__(
method log (line 27) | def log(
FILE: fish_speech/utils/logging_utils.py
function log_hyperparameters (line 7) | def log_hyperparameters(object_dict: dict) -> None:
FILE: fish_speech/utils/rich_utils.py
function print_config_tree (line 16) | def print_config_tree(
function enforce_tags (line 82) | def enforce_tags(cfg: DictConfig, save_to_file: bool = False) -> None:
FILE: fish_speech/utils/schema.py
class ServeVQPart (line 15) | class ServeVQPart(BaseModel):
class ServeTextPart (line 20) | class ServeTextPart(BaseModel):
class ServeAudioPart (line 25) | class ServeAudioPart(BaseModel):
class ServeRequest (line 30) | class ServeRequest(BaseModel):
class ServeVQGANEncodeRequest (line 42) | class ServeVQGANEncodeRequest(BaseModel):
class ServeVQGANEncodeResponse (line 47) | class ServeVQGANEncodeResponse(BaseModel):
class ServeVQGANDecodeRequest (line 51) | class ServeVQGANDecodeRequest(BaseModel):
class ServeVQGANDecodeResponse (line 55) | class ServeVQGANDecodeResponse(BaseModel):
class ServeReferenceAudio (line 60) | class ServeReferenceAudio(BaseModel):
method decode_audio (line 65) | def decode_audio(cls, values):
method __repr__ (line 77) | def __repr__(self) -> str:
class ServeTTSRequest (line 81) | class ServeTTSRequest(BaseModel):
class Config (line 105) | class Config:
class AddReferenceRequest (line 110) | class AddReferenceRequest(BaseModel):
class AddReferenceResponse (line 116) | class AddReferenceResponse(BaseModel):
class ListReferencesResponse (line 122) | class ListReferencesResponse(BaseModel):
class DeleteReferenceResponse (line 128) | class DeleteReferenceResponse(BaseModel):
class UpdateReferenceResponse (line 134) | class UpdateReferenceResponse(BaseModel):
FILE: fish_speech/utils/spectrogram.py
class LinearSpectrogram (line 7) | class LinearSpectrogram(nn.Module):
method __init__ (line 8) | def __init__(
method forward (line 27) | def forward(self, y: Tensor) -> Tensor:
class LogMelSpectrogram (line 62) | class LogMelSpectrogram(nn.Module):
method __init__ (line 63) | def __init__(
method compress (line 102) | def compress(self, x: Tensor) -> Tensor:
method decompress (line 105) | def decompress(self, x: Tensor) -> Tensor:
method apply_mel_scale (line 108) | def apply_mel_scale(self, x: Tensor) -> Tensor:
method forward (line 111) | def forward(
FILE: fish_speech/utils/utils.py
function extras (line 16) | def extras(cfg: DictConfig) -> None:
function task_wrapper (line 46) | def task_wrapper(task_func: Callable) -> Callable:
function get_metric_value (line 100) | def get_metric_value(metric_dict: dict, metric_name: str) -> float:
function set_seed (line 120) | def set_seed(seed: int):
FILE: tools/api_client.py
function parse_args (line 16) | def parse_args():
FILE: tools/api_server.py
class API (line 29) | class API(ExceptionHandler):
method __init__ (line 30) | def __init__(self):
method initialize_app (line 81) | async def initialize_app(self, app: Kui):
FILE: tools/llama/build_dataset.py
function task_generator_folder (line 23) | def task_generator_folder(root: Path, text_extension: str):
function task_generator_filelist (line 55) | def task_generator_filelist(filelist):
function run_task (line 65) | def run_task(task):
function main (line 127) | def main(input, output, num_workers, text_extension, shard_size):
FILE: tools/llama/eval_in_context.py
function smooth (line 16) | def smooth(
function analyze_one_model (line 30) | def analyze_one_model(loader, config, weight, max_length):
function main (line 115) | def main():
FILE: tools/llama/merge_lora.py
function merge (line 21) | def merge(lora_config, base_weight, lora_weight, output):
FILE: tools/llama/quantize.py
function dynamically_quantize_per_channel (line 22) | def dynamically_quantize_per_channel(x, quant_min, quant_max, target_dty...
function get_group_qparams (line 57) | def get_group_qparams(w, n_bit=4, groupsize=128):
function pack_scales_and_zeros (line 78) | def pack_scales_and_zeros(scales, zeros):
function unpack_scales_and_zeros (line 95) | def unpack_scales_and_zeros(scales_and_zeros):
function group_quantize_tensor_from_qparams (line 101) | def group_quantize_tensor_from_qparams(w, scales, zeros, n_bit=4, groups...
function group_quantize_tensor (line 130) | def group_quantize_tensor(w, n_bit=4, groupsize=128):
function group_dequantize_tensor_from_qparams (line 137) | def group_dequantize_tensor_from_qparams(
function group_dequantize_tensor (line 157) | def group_dequantize_tensor(w_int32, scales_and_zeros, n_bit=4, groupsiz...
class QuantHandler (line 164) | class QuantHandler:
method __init__ (line 165) | def __init__(self, mod):
method create_quantized_state_dict (line 168) | def create_quantized_state_dict(self) -> "StateDict":
method convert_for_runtime (line 171) | def convert_for_runtime(self) -> "nn.Module":
function replace_linear_weight_only_int8_per_channel (line 178) | def replace_linear_weight_only_int8_per_channel(module):
class WeightOnlyInt8QuantHandler (line 190) | class WeightOnlyInt8QuantHandler:
method __init__ (line 191) | def __init__(self, mod):
method create_quantized_state_dict (line 195) | def create_quantized_state_dict(self):
method convert_for_runtime (line 207) | def convert_for_runtime(self):
class WeightOnlyInt8Linear (line 212) | class WeightOnlyInt8Linear(torch.nn.Module):
method __init__ (line 218) | def __init__(
method forward (line 235) | def forward(self, input: torch.Tensor) -> torch.Tensor:
function prepare_int4_weight_and_scales_and_zeros (line 242) | def prepare_int4_weight_and_scales_and_zeros(weight_bf16, groupsize, inn...
function linear_forward_int4 (line 252) | def linear_forward_int4(x, weight_int4pack, scales_and_zeros, out_featur...
function _check_linear_int4_k (line 263) | def _check_linear_int4_k(k, groupsize=1, inner_k_tiles=1):
function replace_linear_int4 (line 267) | def replace_linear_int4(module, groupsize, inner_k_tiles, padding):
class WeightOnlyInt4QuantHandler (line 300) | class WeightOnlyInt4QuantHandler:
method __init__ (line 301) | def __init__(self, mod, groupsize=128, inner_k_tiles=8, padding=True):
method create_quantized_state_dict (line 310) | def create_quantized_state_dict(self):
method convert_for_runtime (line 353) | def convert_for_runtime(self):
class WeightOnlyInt4Linear (line 358) | class WeightOnlyInt4Linear(torch.nn.Module):
method __init__ (line 364) | def __init__(
method forward (line 410) | def forward(self, input: torch.Tensor) -> torch.Tensor:
function generate_folder_name (line 421) | def generate_folder_name():
function quantize (line 440) | def quantize(checkpoint_path: Path, mode: str, groupsize: int, timestamp...
FILE: tools/run_webui.py
function parse_args (line 22) | def parse_args():
FILE: tools/server/api_utils.py
function parse_args (line 21) | def parse_args():
class MsgPackRequest (line 46) | class MsgPackRequest(HttpRequest):
method data (line 47) | async def data(
function inference_async (line 72) | async def inference_async(req: ServeTTSRequest, engine: TTSInferenceEngi...
function buffer_to_async_generator (line 79) | async def buffer_to_async_generator(buffer):
function get_content_type (line 83) | def get_content_type(audio_format):
function wants_json (line 96) | def wants_json(req):
function format_response (line 116) | def format_response(response: BaseModel, status_code=200):
FILE: tools/server/exception_handler.py
class ExceptionHandler (line 7) | class ExceptionHandler:
method http_exception_handler (line 9) | async def http_exception_handler(self, exc: HTTPException):
method other_exception_handler (line 20) | async def other_exception_handler(self, exc: Exception):
FILE: tools/server/inference.py
function inference_wrapper (line 12) | def inference_wrapper(req: ServeTTSRequest, engine: TTSInferenceEngine):
FILE: tools/server/model_manager.py
class ModelManager (line 11) | class ModelManager:
method __init__ (line 12) | def __init__(
method load_llama_model (line 56) | def load_llama_model(
method load_decoder_model (line 72) | def load_decoder_model(self, config_name, checkpoint_path, device) -> ...
method warm_up (line 80) | def warm_up(self, tts_inference_engine) -> None:
FILE: tools/server/model_utils.py
function batch_encode (line 17) | def batch_encode(model, audios_list: list[bytes]):
function cached_vqgan_batch_encode (line 55) | def cached_vqgan_batch_encode(model, audios: list[bytes]):
function batch_vqgan_decode (line 61) | def batch_vqgan_decode(model, features):
FILE: tools/server/views.py
class WebUI (line 62) | class WebUI(HttpView):
method get (line 64) | async def get(cls):
class Health (line 76) | class Health(HttpView):
method get (line 78) | async def get(cls):
method post (line 82) | async def post(cls):
function vqgan_encode (line 87) | async def vqgan_encode(req: Annotated[ServeVQGANEncodeRequest, Body(excl...
function vqgan_decode (line 116) | async def vqgan_decode(req: Annotated[ServeVQGANDecodeRequest, Body(excl...
function tts (line 147) | async def tts(req: Annotated[ServeTTSRequest, Body(exclusive=True)]):
function add_reference (line 209) | async def add_reference(
function list_references (line 290) | async def list_references():
function delete_reference (line 319) | async def delete_reference(reference_id: str = Body(...)):
function update_reference (line 381) | async def update_reference(
FILE: tools/vqgan/create_train_split.py
function main (line 20) | def main(root, val_ratio, val_count, filelist, min_duration, max_duration):
FILE: tools/vqgan/extract_vq.py
function get_model (line 48) | def get_model(
function process_batch (line 80) | def process_batch(files: list[Path], model) -> float:
function main (line 143) | def main(
FILE: tools/webui/__init__.py
function build_app (line 9) | def build_app(inference_fct: Callable, theme: str = "light") -> gr.Blocks:
FILE: tools/webui/inference.py
function inference_wrapper (line 9) | def inference_wrapper(
function get_reference_audio (line 58) | def get_reference_audio(reference_audio: str, reference_text: str) -> list:
function build_html_error_message (line 69) | def build_html_error_message(error: Any) -> str:
function get_inference_wrapper (line 81) | def get_inference_wrapper(engine) -> Callable:
Condensed preview: 153 files, each showing path, character count, and a content snippet (full structured content: 732K chars).
[
{
"path": ".dockerignore",
"chars": 1589,
"preview": "# .dockerignore\n\n# Git and version control\n.git\n.gitignore\n.gitattributes\n.gitmodules\n\n# IDE and editor files\n.vscode/\n."
},
{
"path": ".github/ISSUE_TEMPLATE/bug_report.yml",
"chars": 2743,
"preview": "name: \"🕷️ Bug report\"\ndescription: |\n Please follow this template carefully to ensure we can address your issue quickly"
},
{
"path": ".github/ISSUE_TEMPLATE/config.yml",
"chars": 207,
"preview": "blank_issues_enabled: false\ncontact_links:\n - name: \"\\U0001F4E7 Discussions\"\n url: https://github.com/fishaudio/fish"
},
{
"path": ".github/ISSUE_TEMPLATE/feature_request.yml",
"chars": 2943,
"preview": "name: \"⭐ Feature or enhancement request\"\ndescription: Propose something new.\nlabels:\n - enhancement\nbody:\n - type: che"
},
{
"path": ".github/pull_request_template.md",
"chars": 157,
"preview": "**Is this PR adding new feature or fix a BUG?**\n\nAdd feature / Fix BUG.\n\n**Is this pull request related to any issue? If"
},
{
"path": ".github/workflows/build-docker-image.yml",
"chars": 2617,
"preview": "name: Build Docker Images\n\non:\n push:\n branches:\n - main\n tags:\n - \"v*\"\n\njobs:\n build:\n runs-on: ub"
},
{
"path": ".github/workflows/docs.yml",
"chars": 822,
"preview": "name: docs\non:\n push:\n branches:\n - main\n paths:\n - 'docs/**'\n - 'mkdocs.yml'\n\npermissions:\n cont"
},
{
"path": ".github/workflows/stale.yml",
"chars": 959,
"preview": "name: Close inactive issues\non:\n schedule:\n - cron: \"0 0 * * *\"\n\njobs:\n close-issues:\n runs-on: ubuntu-latest\n "
},
{
"path": ".gitignore",
"chars": 1344,
"preview": "# =============================================================================\n# Fish Speech - .gitignore\n# ==========="
},
{
"path": ".pre-commit-config.yaml",
"chars": 548,
"preview": "ci:\n autoupdate_schedule: monthly\n\nrepos:\n - repo: https://github.com/pycqa/isort\n rev: 8.0.1\n hooks:\n - id"
},
{
"path": ".project-root",
"chars": 0,
"preview": ""
},
{
"path": ".readthedocs.yaml",
"chars": 438,
"preview": "# Read the Docs configuration file for MkDocs projects\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html f"
},
{
"path": "API_FLAGS.txt",
"chars": 204,
"preview": "# --infer\n--api\n--listen 0.0.0.0:8080 \\\n--llama-checkpoint-path \"checkpoints/openaudio-s1-mini\" \\\n--decoder-checkpoint-p"
},
{
"path": "LICENSE",
"chars": 10359,
"preview": "# FISH AUDIO RESEARCH LICENSE AGREEMENT\n\n**Last Updated: March 7, 2026**\n\n## I. INTRODUCTION\n\nThis Agreement applies to "
},
{
"path": "README.md",
"chars": 11659,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n**English** | [简体中文](docs/README.zh.md) | [Portuguese](docs/README.pt-BR.md) "
},
{
"path": "awesome_webui/.gitignore",
"chars": 253,
"preview": "# Logs\nlogs\n*.log\nnpm-debug.log*\nyarn-debug.log*\nyarn-error.log*\npnpm-debug.log*\nlerna-debug.log*\n\nnode_modules\ndist\ndis"
},
{
"path": "awesome_webui/README.md",
"chars": 2520,
"preview": "# React + TypeScript + Vite\n\nThis template provides a minimal setup to get React working in Vite with HMR and some ESLin"
},
{
"path": "awesome_webui/eslint.config.js",
"chars": 616,
"preview": "import js from '@eslint/js'\nimport globals from 'globals'\nimport reactHooks from 'eslint-plugin-react-hooks'\nimport reac"
},
{
"path": "awesome_webui/index.html",
"chars": 285,
"preview": "<!doctype html>\n<html lang=\"en\">\n\n<head>\n <meta charset=\"UTF-8\" />\n <meta name=\"viewport\" content=\"width=device-width,"
},
{
"path": "awesome_webui/package.json",
"chars": 1290,
"preview": "{\n \"name\": \"awesome_webui\",\n \"private\": true,\n \"version\": \"0.0.0\",\n \"type\": \"module\",\n \"scripts\": {\n \"dev\": \"vit"
},
{
"path": "awesome_webui/src/App.tsx",
"chars": 42735,
"preview": "import { useEffect, useRef, useState } from 'react'\nimport {\n AudioLines,\n ChevronDown,\n CircleAlert,\n Copy,\n Downl"
},
{
"path": "awesome_webui/src/components/ui/alert.tsx",
"chars": 1116,
"preview": "import * as React from 'react'\nimport { cva, type VariantProps } from 'class-variance-authority'\n\nimport { cn } from '@/"
},
{
"path": "awesome_webui/src/components/ui/badge.tsx",
"chars": 862,
"preview": "/* eslint-disable react-refresh/only-export-components */\nimport * as React from 'react'\nimport { cva, type VariantProps"
},
{
"path": "awesome_webui/src/components/ui/button.tsx",
"chars": 1625,
"preview": "/* eslint-disable react-refresh/only-export-components */\nimport * as React from 'react'\nimport { Slot } from '@radix-ui"
},
{
"path": "awesome_webui/src/components/ui/card.tsx",
"chars": 1041,
"preview": "import * as React from 'react'\n\nimport { cn } from '@/lib/utils'\n\nfunction Card({ className, ...props }: React.Component"
},
{
"path": "awesome_webui/src/components/ui/collapsible.tsx",
"chars": 313,
"preview": "import * as CollapsiblePrimitive from '@radix-ui/react-collapsible'\n\nconst Collapsible = CollapsiblePrimitive.Root\nconst"
},
{
"path": "awesome_webui/src/components/ui/dialog.tsx",
"chars": 2369,
"preview": "import * as React from 'react'\nimport * as DialogPrimitive from '@radix-ui/react-dialog'\nimport { X } from 'lucide-react"
},
{
"path": "awesome_webui/src/components/ui/label.tsx",
"chars": 366,
"preview": "import * as React from 'react'\nimport * as LabelPrimitive from '@radix-ui/react-label'\n\nimport { cn } from '@/lib/utils'"
},
{
"path": "awesome_webui/src/components/ui/scroll-area.tsx",
"chars": 1318,
"preview": "import * as React from 'react'\nimport * as ScrollAreaPrimitive from '@radix-ui/react-scroll-area'\n\nimport { cn } from '@"
},
{
"path": "awesome_webui/src/components/ui/separator.tsx",
"chars": 588,
"preview": "import * as React from 'react'\nimport * as SeparatorPrimitive from '@radix-ui/react-separator'\n\nimport { cn } from '@/li"
},
{
"path": "awesome_webui/src/components/ui/slider.tsx",
"chars": 937,
"preview": "import * as React from 'react'\nimport * as SliderPrimitive from '@radix-ui/react-slider'\n\nimport { cn } from '@/lib/util"
},
{
"path": "awesome_webui/src/components/ui/switch.tsx",
"chars": 1041,
"preview": "import * as React from 'react'\nimport * as SwitchPrimitive from '@radix-ui/react-switch'\n\nimport { cn } from '@/lib/util"
},
{
"path": "awesome_webui/src/components/ui/textarea.tsx",
"chars": 548,
"preview": "import * as React from 'react'\n\nimport { cn } from '@/lib/utils'\n\nfunction Textarea({ className, ...props }: React.Compo"
},
{
"path": "awesome_webui/src/components/ui/toggle-group.tsx",
"chars": 1423,
"preview": "import * as React from 'react'\nimport * as ToggleGroupPrimitive from '@radix-ui/react-toggle-group'\nimport { cva, type V"
},
{
"path": "awesome_webui/src/index.css",
"chars": 2003,
"preview": "@import \"tailwindcss\";\n\n:root {\n --background: 0 0% 96%;\n --foreground: 240 10% 3.9%;\n --card: 0 0% 100%;\n --card-fo"
},
{
"path": "awesome_webui/src/main.tsx",
"chars": 230,
"preview": "import { StrictMode } from 'react'\nimport { createRoot } from 'react-dom/client'\nimport './index.css'\nimport App from '."
},
{
"path": "awesome_webui/tsconfig.app.json",
"chars": 816,
"preview": "{\n \"compilerOptions\": {\n \"tsBuildInfoFile\": \"./node_modules/.tmp/tsconfig.app.tsbuildinfo\",\n \"target\": \"ES2022\",\n"
},
{
"path": "awesome_webui/tsconfig.json",
"chars": 119,
"preview": "{\n \"files\": [],\n \"references\": [\n { \"path\": \"./tsconfig.app.json\" },\n { \"path\": \"./tsconfig.node.json\" }\n ]\n}\n"
},
{
"path": "awesome_webui/tsconfig.node.json",
"chars": 725,
"preview": "{\n \"compilerOptions\": {\n \"tsBuildInfoFile\": \"./node_modules/.tmp/tsconfig.node.tsbuildinfo\",\n \"target\": \"ES2023\","
},
{
"path": "awesome_webui/vite.config.ts",
"chars": 3278,
"preview": "import fs from 'node:fs'\nimport { defineConfig, type Plugin } from 'vite'\nimport react from '@vitejs/plugin-react-swc'\ni"
},
{
"path": "compose.base.yml",
"chars": 558,
"preview": "services:\n app-base:\n build:\n context: .\n dockerfile: docker/Dockerfile\n args:\n BACKEND: ${BAC"
},
{
"path": "compose.yml",
"chars": 476,
"preview": "name: fish-speech\n\nservices:\n webui:\n extends:\n file: compose.base.yml\n service: app-base\n build:\n "
},
{
"path": "docker/Dockerfile",
"chars": 12538,
"preview": "# docker/Dockerfile\n\n# IMPORTANT: The docker images do not contain the checkpoints. You need to mount the checkpoints to"
},
{
"path": "dockerfile.dev",
"chars": 1074,
"preview": "ARG VERSION=dev\nARG BASE_IMAGE=ghcr.io/fishaudio/fish-speech:${VERSION}\n\nFROM ${BASE_IMAGE}\n\nARG TOOLS=\" \\"
},
{
"path": "docs/CNAME",
"chars": 18,
"preview": "speech.fish.audio\n"
},
{
"path": "docs/README.ar.md",
"chars": 11481,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n[English](../README.md) | [简体中文](README.zh.md) | [Portuguese](README.pt-BR.md"
},
{
"path": "docs/README.ja.md",
"chars": 9153,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n[English](../README.md) | [简体中文](README.zh.md) | [Portuguese](README.pt-BR.md"
},
{
"path": "docs/README.ko.md",
"chars": 9290,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n[English](../README.md) | [简体中文](README.zh.md) | [Portuguese](README.pt-BR.md"
},
{
"path": "docs/README.pt-BR.md",
"chars": 12525,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n[English](../README.md) | [简体中文](README.zh.md) | **Portuguese** | [日本語](READM"
},
{
"path": "docs/README.zh.md",
"chars": 8485,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n[English](../README.md) | **简体中文** | [Portuguese](README.pt-BR.md) | [日本語](RE"
},
{
"path": "docs/ar/finetune.md",
"chars": 4271,
"preview": "# الضبط الدقيق (Fine-tuning)\n\nمن الواضح أنك عندما فتحت هذه الصفحة، لم تكن راضيًا عن أداء النموذج المدرب مسبقًا في وضع ze"
},
{
"path": "docs/ar/index.md",
"chars": 9991,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n<p><a href=\"../en/\">English</a> | <a href=\"../zh/\">简体中文</a> | <a href=\"../pt/"
},
{
"path": "docs/ar/inference.md",
"chars": 2259,
"preview": "# الاستنتاج\n\nيتطلب نموذج Fish Audio S2 ذاكرة فيديو (VRAM) كبيرة. نوصي باستخدام وحدة معالجة رسومات (GPU) بسعة 24 جيجابايت"
},
{
"path": "docs/ar/install.md",
"chars": 5518,
"preview": "## المتطلبات\n\n- ذاكرة وحدة معالجة الرسومات (GPU): 24 جيجابايت (للاستدلال)\n- النظام: Linux, WSL\n\n## إعداد النظام\n\nيدعم Fi"
},
{
"path": "docs/en/finetune.md",
"chars": 4276,
"preview": "# Fine-tuning\n\n!!! warning \n We highly do note recoomand users to do fine-tuning on an RL trained model. Fine-tuning "
},
{
"path": "docs/en/index.md",
"chars": 10598,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n<p><strong>English</strong> | <a href=\"../zh/\">简体中文</a> | <a href=\"../pt/\">Po"
},
{
"path": "docs/en/inference.md",
"chars": 2400,
"preview": "# Inference\n\nThe Fish Audio S2 model requires a large amount of VRAM. We recommend using a GPU with at least 24GB for in"
},
{
"path": "docs/en/install.md",
"chars": 5538,
"preview": "## Requirements\n\n- GPU Memory: 24GB (Inference)\n- System: Linux, WSL\n\n## System Setup\n\nFish Audio S2 supports multiple i"
},
{
"path": "docs/en/server.md",
"chars": 1258,
"preview": "# Server\n\nThis page covers server-side inference for Fish Audio S2, plus quick links for WebUI inference and Docker depl"
},
{
"path": "docs/ja/finetune.md",
"chars": 3263,
"preview": "# ファインチューニング\n\nこのページを開いたということは、明らかに、事前学習済みモデルのゼロショット性能に満足していないということでしょう。データセットでより良い性能を発揮するようにモデルをファインチューニングしたいとお考えのはずです。\n"
},
{
"path": "docs/ja/index.md",
"chars": 8100,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n<p><a href=\"../en/\">English</a> | <a href=\"../zh/\">简体中文</a> | <a href=\"../pt/"
},
{
"path": "docs/ja/inference.md",
"chars": 1801,
"preview": "# 推論\n\nFish Audio S2 モデルは大きなビデオメモリを必要とします。推論には少なくとも 24GB の GPU を使用することをお勧めします。\n\n## 重みのダウンロード\n\nまず、モデルの重みをダウンロードする必要があります:\n"
},
{
"path": "docs/ja/install.md",
"chars": 4429,
"preview": "## 必要条件\n\n- GPUメモリ: 24GB (推論時)\n- システム: Linux, WSL\n\n## システムセットアップ\n\nFish Audio S2は複数のインストール方法をサポートしています。ご自身の開発環境に最も適した方法をお選"
},
{
"path": "docs/ko/finetune.md",
"chars": 3274,
"preview": "# 미세 조정 (Fine-tuning)\n\n이 페이지를 열었다는 것은, 사전 훈련된 모델의 제로샷(zero-shot) 성능에 만족하지 못했다는 의미일 것입니다. 여러분의 데이터셋에서 더 나은 성능을 내도록 모델을 미세"
},
{
"path": "docs/ko/index.md",
"chars": 8120,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n<p><a href=\"../en/\">English</a> | <a href=\"../zh/\">简体中文</a> | <a href=\"../pt/"
},
{
"path": "docs/ko/inference.md",
"chars": 1751,
"preview": "# 추론\n\nFish Audio S2 모델은 큰 비디오 메모리(VRAM)가 필요합니다. 추론을 위해 최소 24GB 이상의 GPU를 사용하는 것을 권장합니다.\n\n## 가중치 다운로드\n\n먼저 모델 가중치를 다운로드해야 합"
},
{
"path": "docs/ko/install.md",
"chars": 4320,
"preview": "## 요구 사양\n\n- GPU 메모리: 24GB (추론 시)\n- 시스템: Linux, WSL\n\n## 시스템 설정\n\nFish Audio S2는 다양한 설치 방법을 지원합니다. 자신의 개발 환경에 가장 적합한 방법을 선택"
},
{
"path": "docs/pt/finetune.md",
"chars": 4551,
"preview": "# Ajuste Fino (Fine-tuning)\n\nObviamente, ao abrir esta página, você não estava satisfeito com o desempenho do modelo pré"
},
{
"path": "docs/pt/index.md",
"chars": 10706,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n<p><a href=\"../en/\">English</a> | <a href=\"../zh/\">简体中文</a> | <strong>Portugu"
},
{
"path": "docs/pt/inference.md",
"chars": 2589,
"preview": "# Inferência\n\nO modelo Fish Audio S2 requer uma grande quantidade de VRAM. Recomendamos o uso de uma GPU com pelo menos "
},
{
"path": "docs/pt/install.md",
"chars": 6123,
"preview": "## Requisitos\n\n- Memória da GPU: 24GB (Inferência)\n- Sistema: Linux, WSL\n\n## Configuração do Sistema\n\nO Fish Audio S2 su"
},
{
"path": "docs/requirements.txt",
"chars": 58,
"preview": "mkdocs-material\nmkdocs-static-i18n[material]\nmkdocs[i18n]\n"
},
{
"path": "docs/stylesheets/extra.css",
"chars": 35,
"preview": ".md-grid {\n max-width: 1440px; \n}\n"
},
{
"path": "docs/zh/finetune.md",
"chars": 2803,
"preview": "# 微调\n\n显然, 当你打开这个页面的时候, 你已经对预训练模型 zero-shot 的效果不算满意. 你想要微调一个模型, 使得它在你的数据集上表现更好. \n\n在目前版本,你只需要微调'LLAMA'部分即可.\n\n## LLAMA 微调\n"
},
{
"path": "docs/zh/index.md",
"chars": 7442,
"preview": "<div align=\"center\">\n<h1>Fish Speech</h1>\n\n<p><a href=\"../en/\">English</a> | <strong>简体中文</strong> | <a href=\"../pt/\">Po"
},
{
"path": "docs/zh/inference.md",
"chars": 1528,
"preview": "# 推理\n\nFish Audio S2 模型需要较大的显存,我们推荐您使用至少24GB的GPU进行推理。\n\n## 下载权重\n\n首先您需要下载模型权重:\n\n```bash\nhf download fishaudio/s2-pro --loca"
},
{
"path": "docs/zh/install.md",
"chars": 3846,
"preview": "## 系统要求\n\n- GPU 显存:24GB(用于推理)\n- 系统:Linux、WSL\n\n## 系统设置\n\nFish Audio S2 支持多种安装方式。请选择最适合你当前开发环境的方案。\n\n**前置依赖**:先安装音频处理所需的系统依赖:"
},
{
"path": "entrypoint.sh",
"chars": 171,
"preview": "#!/bin/bash\n\nCUDA_ENABLED=${CUDA_ENABLED:-true}\nDEVICE=\"\"\n\nif [ \"${CUDA_ENABLED}\" != \"true\" ]; then\n DEVICE=\"--device"
},
{
"path": "fish_speech/callbacks/__init__.py",
"chars": 70,
"preview": "from .grad_norm import GradNormMonitor\n\n__all__ = [\"GradNormMonitor\"]\n"
},
{
"path": "fish_speech/callbacks/grad_norm.py",
"chars": 3436,
"preview": "from typing import Optional, Union\n\nimport lightning.pytorch as pl\nimport torch\nfrom lightning import LightningModule, T"
},
{
"path": "fish_speech/configs/base.yaml",
"chars": 2544,
"preview": "# Base configuration for training a model\npaths:\n run_dir: results/${project}\n ckpt_dir: ${paths.run_dir}/checkpoints\n"
},
{
"path": "fish_speech/configs/lora/r_8_alpha_16.yaml",
"chars": 98,
"preview": "_target_: fish_speech.models.text2semantic.lora.LoraConfig\nr: 8\nlora_alpha: 16\nlora_dropout: 0.01\n"
},
{
"path": "fish_speech/configs/modded_dac_vq.yaml",
"chars": 1376,
"preview": "_target_: fish_speech.models.dac.modded_dac.DAC\n# Model setup\nsample_rate: 44100\nencoder_dim: 64\nencoder_rates: [2, 4, 8"
},
{
"path": "fish_speech/configs/text2semantic_finetune.yaml",
"chars": 2074,
"preview": "defaults:\n - base\n - _self_\n\nproject: text2semantic_finetune_dual_ar\nmax_length: 4096\npretrained_ckpt_path: checkpoint"
},
{
"path": "fish_speech/content_sequence.py",
"chars": 14345,
"preview": "from dataclasses import dataclass, field\nfrom typing import List, Literal, Union\n\nimport numpy as np\nimport torch\n\nfrom "
},
{
"path": "fish_speech/conversation.py",
"chars": 5602,
"preview": "from copy import deepcopy\nfrom dataclasses import dataclass, field\nfrom typing import Literal\n\nimport torch\nfrom transfo"
},
{
"path": "fish_speech/datasets/concat_repeat.py",
"chars": 1498,
"preview": "import bisect\nimport random\nfrom typing import Iterable\n\nfrom torch.utils.data import Dataset, IterableDataset\n\n\nclass C"
},
{
"path": "fish_speech/datasets/protos/text-data.proto",
"chars": 392,
"preview": "syntax = \"proto3\";\n\npackage text_data;\n\nmessage Semantics {\n repeated uint32 values = 1;\n}\n\nmessage Sentence {\n re"
},
{
"path": "fish_speech/datasets/protos/text_data_pb2.py",
"chars": 1759,
"preview": "# -*- coding: utf-8 -*-\n# Generated by the protocol buffer compiler. DO NOT EDIT!\n# source: text-data.proto\n# Protobuf "
},
{
"path": "fish_speech/datasets/protos/text_data_stream.py",
"chars": 781,
"preview": "import struct\n\nfrom .text_data_pb2 import TextData\n\n\ndef read_pb_stream(f):\n while True:\n buf = f.read(4)\n "
},
{
"path": "fish_speech/datasets/semantic.py",
"chars": 20516,
"preview": "import random\nfrom dataclasses import dataclass\nfrom itertools import chain\nfrom pathlib import Path\nfrom random import "
},
{
"path": "fish_speech/datasets/vqgan.py",
"chars": 4012,
"preview": "from dataclasses import dataclass\nfrom pathlib import Path\nfrom typing import Optional\n\nimport librosa\nimport numpy as n"
},
{
"path": "fish_speech/i18n/README.md",
"chars": 1473,
"preview": "## i18n Folder Attribution\n\nThe `i18n` folder within the `fish_speech` directory contains files initially sourced from t"
},
{
"path": "fish_speech/i18n/__init__.py",
"chars": 43,
"preview": "from .core import i18n\n\n__all__ = [\"i18n\"]\n"
},
{
"path": "fish_speech/i18n/core.py",
"chars": 1036,
"preview": "import json\nimport locale\nfrom pathlib import Path\n\nI18N_FILE_PATH = Path(__file__).parent / \"locale\"\nDEFAULT_LANGUAGE ="
},
{
"path": "fish_speech/i18n/locale/en_US.json",
"chars": 8103,
"preview": "{\n \"16-mixed is recommended for 10+ series GPU\": \"16-mixed is recommended for 10+ series GPU\",\n \"5 to 10 seconds of re"
},
{
"path": "fish_speech/i18n/locale/es_ES.json",
"chars": 8949,
"preview": "{\n \"16-mixed is recommended for 10+ series GPU\": \"se recomienda 16-mixed para GPU de la serie 10+\",\n \"5 to 10 seconds "
},
{
"path": "fish_speech/i18n/locale/ja_JP.json",
"chars": 6544,
"preview": "{\n \"16-mixed is recommended for 10+ series GPU\": \"10シリーズ以降のGPUには16-mixedをお勧めします\",\n \"5 to 10 seconds of reference audio"
},
{
"path": "fish_speech/i18n/locale/ko_KR.json",
"chars": 6569,
"preview": "{\n \"16-mixed is recommended for 10+ series GPU\": \"10+ 시리즈 GPU에는 16-mixed를 권장합니다.\",\n \"5 to 10 seconds of reference audi"
},
{
"path": "fish_speech/i18n/locale/pt_BR.json",
"chars": 9480,
"preview": "{\n \"5 to 10 seconds of reference audio, useful for specifying speaker.\": \"5 a 10 segundos de áudio de referência, útil "
},
{
"path": "fish_speech/i18n/locale/zh_CN.json",
"chars": 6047,
"preview": "{\n \"16-mixed is recommended for 10+ series GPU\": \"10+ 系列 GPU 建议使用 16-mixed\",\n \"5 to 10 seconds of reference audio, use"
},
{
"path": "fish_speech/i18n/scan.py",
"chars": 3751,
"preview": "import ast\nimport glob\nimport json\nfrom collections import OrderedDict\nfrom pathlib import Path\n\nfrom loguru import logg"
},
{
"path": "fish_speech/inference_engine/__init__.py",
"chars": 6260,
"preview": "import gc\nimport queue\nfrom typing import Generator\n\nimport numpy as np\nimport torch\nfrom loguru import logger\n\nfrom fis"
},
{
"path": "fish_speech/inference_engine/reference_loader.py",
"chars": 9094,
"preview": "import io\nfrom hashlib import sha256\nfrom pathlib import Path\nfrom typing import Callable, Literal, Tuple\n\nimport torch\n"
},
{
"path": "fish_speech/inference_engine/utils.py",
"chars": 685,
"preview": "import io\nimport wave\nfrom dataclasses import dataclass\nfrom typing import Literal, Optional, Tuple\n\nimport numpy as np\n"
},
{
"path": "fish_speech/inference_engine/vq_manager.py",
"chars": 1952,
"preview": "from typing import Callable\n\nimport torch\nfrom loguru import logger\n\nfrom fish_speech.models.dac.modded_dac import DAC\n\n"
},
{
"path": "fish_speech/models/dac/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "fish_speech/models/dac/inference.py",
"chars": 3895,
"preview": "from pathlib import Path\n\nimport click\nimport hydra\nimport numpy as np\nimport pyrootutils\nimport soundfile as sf\nimport "
},
{
"path": "fish_speech/models/dac/modded_dac.py",
"chars": 34850,
"preview": "import math\nimport typing as tp\nfrom dataclasses import dataclass\nfrom typing import List, Optional, Union\n\nimport numpy"
},
{
"path": "fish_speech/models/dac/rvq.py",
"chars": 13143,
"preview": "import math\nimport typing as tp\nfrom dataclasses import dataclass\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.fu"
},
{
"path": "fish_speech/models/text2semantic/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "fish_speech/models/text2semantic/inference.py",
"chars": 31126,
"preview": "import os\nimport queue\nimport re\nimport threading\nimport time\nimport traceback\nfrom copy import deepcopy\nfrom dataclasse"
},
{
"path": "fish_speech/models/text2semantic/lit_module.py",
"chars": 6935,
"preview": "from typing import Any, Optional\n\nimport lightning as L\nimport torch\nimport torch.nn.functional as F\nfrom lightning.pyto"
},
{
"path": "fish_speech/models/text2semantic/llama.py",
"chars": 35982,
"preview": "import dataclasses\nimport json\nimport math\nfrom collections import OrderedDict\nfrom dataclasses import dataclass\nfrom pa"
},
{
"path": "fish_speech/models/text2semantic/lora.py",
"chars": 2891,
"preview": "from dataclasses import dataclass\n\nimport loralib as lora\n\n\n@dataclass\nclass LoraConfig:\n r: int\n lora_alpha: floa"
},
{
"path": "fish_speech/scheduler.py",
"chars": 1101,
"preview": "import math\n\n\ndef get_cosine_schedule_with_warmup_lr_lambda(\n current_step: int,\n *,\n num_warmup_steps: int | f"
},
{
"path": "fish_speech/text/__init__.py",
"chars": 56,
"preview": "from .clean import clean_text\n\n__all__ = [\"clean_text\"]\n"
},
{
"path": "fish_speech/text/clean.py",
"chars": 828,
"preview": "import re\n\nSYMBOLS_MAPPING = {\n \"‘\": \"'\",\n \"’\": \"'\",\n}\n\nREPLACE_SYMBOL_REGEX = re.compile(\n \"|\".join(re.escape("
},
{
"path": "fish_speech/tokenizer.py",
"chars": 3948,
"preview": "import json\nimport logging\nfrom pathlib import Path\nfrom typing import TYPE_CHECKING, List, Union\n\nimport torch\nfrom tra"
},
{
"path": "fish_speech/train.py",
"chars": 4470,
"preview": "import os\n\nos.environ[\"USE_LIBUV\"] = \"0\"\nimport sys\nfrom typing import Optional\n\nimport hydra\nimport lightning as L\nimpo"
},
{
"path": "fish_speech/utils/__init__.py",
"chars": 706,
"preview": "from .braceexpand import braceexpand\nfrom .context import autocast_exclude_mps\nfrom .file import get_latest_checkpoint\nf"
},
{
"path": "fish_speech/utils/braceexpand.py",
"chars": 6724,
"preview": "\"\"\"\nBash-style brace expansion\nCopied from: https://github.com/trendels/braceexpand/blob/main/src/braceexpand/__init__.p"
},
{
"path": "fish_speech/utils/context.py",
"chars": 287,
"preview": "from contextlib import nullcontext\n\nimport torch\n\n\ndef autocast_exclude_mps(\n device_type: str, dtype: torch.dtype\n) "
},
{
"path": "fish_speech/utils/file.py",
"chars": 3354,
"preview": "import os\nfrom pathlib import Path\nfrom typing import Union\n\nfrom loguru import logger\nfrom natsort import natsorted\n\nAU"
},
{
"path": "fish_speech/utils/instantiators.py",
"chars": 1514,
"preview": "from typing import List\n\nimport hydra\nfrom omegaconf import DictConfig\nfrom pytorch_lightning import Callback\nfrom pytor"
},
{
"path": "fish_speech/utils/logger.py",
"chars": 2467,
"preview": "import logging\nfrom typing import Mapping, Optional\n\nfrom lightning_utilities.core.rank_zero import rank_prefixed_messag"
},
{
"path": "fish_speech/utils/logging_utils.py",
"chars": 1384,
"preview": "from lightning.pytorch.utilities import rank_zero_only\n\nfrom fish_speech.utils import logger as log\n\n\n@rank_zero_only\nde"
},
{
"path": "fish_speech/utils/rich_utils.py",
"chars": 3105,
"preview": "from pathlib import Path\nfrom typing import Sequence\n\nimport rich\nimport rich.syntax\nimport rich.tree\nfrom hydra.core.hy"
},
{
"path": "fish_speech/utils/schema.py",
"chars": 3912,
"preview": "import base64\nimport os\nimport queue\nfrom dataclasses import dataclass\nfrom typing import Literal\n\nimport torch\nfrom pyd"
},
{
"path": "fish_speech/utils/spectrogram.py",
"chars": 3325,
"preview": "import torch\nimport torchaudio.functional as F\nfrom torch import Tensor, nn\nfrom torchaudio.transforms import MelScale\n\n"
},
{
"path": "fish_speech/utils/utils.py",
"chars": 4283,
"preview": "import random\nimport warnings\nfrom importlib.util import find_spec\nfrom typing import Callable\n\nimport numpy as np\nimpor"
},
{
"path": "inference.ipynb",
"chars": 4869,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Fish Speech\"\n ]\n },\n {\n \"ce"
},
{
"path": "mkdocs.yml",
"chars": 3940,
"preview": "site_name: Fish Audio\nsite_description: Targeting SOTA TTS solutions.\nsite_url: https://speech.fish.audio\n\n# Repository\n"
},
{
"path": "pyproject.toml",
"chars": 2555,
"preview": "[project]\nname = \"fish-speech\"\nversion = \"2.0.0\"\nauthors = [\n {name = \"Fish Audio\", email = \"oss@fish.audio\"},\n]\ndesc"
},
{
"path": "pyrightconfig.json",
"chars": 63,
"preview": "{\n \"exclude\": [\n \"data\",\n \"filelists\"\n ]\n}\n"
},
{
"path": "tools/api_client.py",
"chars": 7082,
"preview": "import argparse\nimport base64\nimport time\nimport wave\n\nimport ormsgpack\nimport pyaudio\nimport requests\nfrom pydub import"
},
{
"path": "tools/api_server.py",
"chars": 3788,
"preview": "import re\nfrom threading import Lock\n\nimport pyrootutils\nimport uvicorn\nfrom kui.asgi import (\n Depends,\n FactoryC"
},
{
"path": "tools/llama/build_dataset.py",
"chars": 4910,
"preview": "import itertools\nimport os\nimport re\nfrom collections import defaultdict\nfrom functools import partial\nfrom multiprocess"
},
{
"path": "tools/llama/eval_in_context.py",
"chars": 4635,
"preview": "import pyrootutils\nimport torch\nimport torch.nn.functional as F\nfrom matplotlib import pyplot as plt\nfrom transformers i"
},
{
"path": "tools/llama/merge_lora.py",
"chars": 3369,
"preview": "import shutil\nfrom copy import deepcopy\nfrom pathlib import Path\n\nimport click\nimport hydra\nimport torch\nfrom hydra impo"
},
{
"path": "tools/llama/quantize.py",
"chars": 16589,
"preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\nimport datetime\nimport shutil\n\n# This source"
},
{
"path": "tools/run_webui.py",
"chars": 3377,
"preview": "import os\nfrom argparse import ArgumentParser\nfrom pathlib import Path\n\nimport pyrootutils\nimport torch\nfrom loguru impo"
},
{
"path": "tools/server/api_utils.py",
"chars": 4429,
"preview": "from argparse import ArgumentParser\nfrom http import HTTPStatus\nfrom typing import Annotated, Any\n\nimport ormsgpack\nfrom"
},
{
"path": "tools/server/exception_handler.py",
"chars": 729,
"preview": "import traceback\nfrom http import HTTPStatus\n\nfrom kui.asgi import HTTPException, JSONResponse\n\n\nclass ExceptionHandler:"
},
{
"path": "tools/server/inference.py",
"chars": 1352,
"preview": "from http import HTTPStatus\n\nimport numpy as np\nfrom kui.asgi import HTTPException\n\nfrom fish_speech.inference_engine im"
},
{
"path": "tools/server/model_manager.py",
"chars": 2974,
"preview": "import torch\nfrom loguru import logger\n\nfrom fish_speech.inference_engine import TTSInferenceEngine\nfrom fish_speech.mod"
},
{
"path": "tools/server/model_utils.py",
"chars": 2643,
"preview": "import io\nimport re\n\nimport librosa\nimport torch\nimport torchaudio\nfrom cachetools import LRUCache, cached\n\nCACHE_MAXSIZ"
},
{
"path": "tools/server/views.py",
"chars": 16594,
"preview": "import io\nimport os\nimport re\nimport shutil\nimport tempfile\nimport time\nfrom http import HTTPStatus\nfrom pathlib import "
},
{
"path": "tools/vqgan/create_train_split.py",
"chars": 3008,
"preview": "import math\nfrom pathlib import Path\nfrom random import Random\n\nimport click\nfrom loguru import logger\nfrom pydub import"
},
{
"path": "tools/vqgan/extract_vq.py",
"chars": 6688,
"preview": "import os\nimport subprocess as sp\nimport sys\nimport time\nfrom datetime import timedelta\nfrom functools import lru_cache\n"
},
{
"path": "tools/webui/__init__.py",
"chars": 6109,
"preview": "from typing import Callable\n\nimport gradio as gr\n\nfrom fish_speech.i18n import i18n\nfrom tools.webui.variables import HE"
},
{
"path": "tools/webui/inference.py",
"chars": 2123,
"preview": "import html\nfrom functools import partial\nfrom typing import Any, Callable\n\nfrom fish_speech.i18n import i18n\nfrom fish_"
},
{
"path": "tools/webui/variables.py",
"chars": 605,
"preview": "from fish_speech.i18n import i18n\n\nHEADER_MD = f\"\"\"# Fish Speech\n\n{i18n(\"A text-to-speech model based on VQ-GAN and Llam"
}
]
About this extraction
This page contains the full source code of the fishaudio/fish-speech GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 153 files (676.6 KB), approximately 188.9k tokens, and a symbol index with 483 extracted functions, classes, methods, constants, and types.