Repository: zhao-kun/VibeVoiceFusion Branch: main Commit: b3766532d8b0 Files: 202 Total size: 10.3 MB Directory structure: gitextract__vtvi0gi/ ├── .dockerignore ├── .gitignore ├── CHANGELOG.md ├── CHANGELOG_zh.md ├── Dockerfile ├── README.md ├── README_zh.md ├── backend/ │ ├── .gitignore │ ├── README.md │ ├── __init__.py │ ├── api/ │ │ ├── __init__.py │ │ ├── datasets.py │ │ ├── dialog_sessions.py │ │ ├── generation.py │ │ ├── openai_compat.py │ │ ├── preset_voices.py │ │ ├── projects.py │ │ ├── quick_generate.py │ │ ├── speakers.py │ │ ├── tasks.py │ │ └── training.py │ ├── app.py │ ├── config.py │ ├── i18n/ │ │ ├── __init__.py │ │ ├── en.json │ │ └── zh.json │ ├── inference/ │ │ ├── inference.py │ │ └── quick_generate_inference.py │ ├── run.py │ ├── scripts/ │ │ ├── generate_cantonese_training_dataset.py │ │ ├── generate_mcv_cantonese_training_dataset.py │ │ ├── generate_training_dataset.py │ │ └── migrate_dataset_paths.py │ ├── services/ │ │ ├── __init__.py │ │ ├── dataset_service.py │ │ ├── dialog_session_service.py │ │ ├── openai_compat_service.py │ │ ├── preset_voice_service.py │ │ ├── project_service.py │ │ ├── quick_generate_service.py │ │ ├── speaker_service.py │ │ ├── training_service.py │ │ └── voice_gerneration_service.py │ ├── task_manager/ │ │ ├── inference_task.py │ │ ├── quick_generate_task.py │ │ ├── task.py │ │ └── training_task.py │ ├── training/ │ │ ├── engine.py │ │ └── state.py │ └── utils/ │ ├── __init__.py │ ├── dialog_validator.py │ ├── file_handler.py │ └── tensorboard_reader.py ├── compose.yml ├── config/ │ ├── __init__.py │ └── configuration_vibevoice.py ├── demo/ │ ├── README_AUDIO_DENOISE.md │ ├── audio_denoise_deepfilter.py │ ├── audio_denose.py │ ├── convert_model.py │ ├── list_modules.py │ ├── local_file_inference.py │ ├── train.py │ ├── verify_dataset.py │ ├── view_tensorfile.py │ └── vram_offload_animation.py ├── docs/ │ ├── APIs.md │ ├── DATASET_PATH_FIX.md │ ├── DOCKER_REBUILD.md │ ├── develop_thoughts.md │ ├── model_components_analysis.md │ ├── multi-generation-ui-design.md │ ├── offloading.md │ ├── openai-compatible-api.md │ ├── preset-voice-feature.md │ ├── processor.md │ ├── quick-generate-feature.md │ └── vibevoice_inference_architecture.md ├── frontend/ │ ├── .gitignore │ ├── README.md │ ├── app/ │ │ ├── dataset/ │ │ │ ├── detail/ │ │ │ │ └── page.tsx │ │ │ └── page.tsx │ │ ├── fine-tuning/ │ │ │ └── page.tsx │ │ ├── generate-voice/ │ │ │ └── page.tsx │ │ ├── globals.css │ │ ├── layout.tsx │ │ ├── page.tsx │ │ ├── quick-generate/ │ │ │ └── page.tsx │ │ ├── speaker-role/ │ │ │ └── page.tsx │ │ └── voice-editor/ │ │ └── page.tsx │ ├── components/ │ │ ├── AudioPlayer.tsx │ │ ├── AudioUploader.tsx │ │ ├── CreateDatasetModal.tsx │ │ ├── CurrentGeneration.tsx │ │ ├── CurrentTraining.tsx │ │ ├── DatasetCard.tsx │ │ ├── DatasetItemModal.tsx │ │ ├── DatasetItemRow.tsx │ │ ├── DialogEditor.tsx │ │ ├── DialogPreview.tsx │ │ ├── GenerationForm.tsx │ │ ├── GenerationHistory.tsx │ │ ├── ImportDatasetModal.tsx │ │ ├── InlineAudioPlayer.tsx │ │ ├── LayoutWrapper.tsx │ │ ├── Navigation.tsx │ │ ├── PresetVoiceManager.tsx │ │ ├── PresetVoiceSelector.tsx │ │ ├── ProjectSelector.tsx │ │ ├── QuickGenerateHistory.tsx │ │ ├── QuickGenerateNavigation.tsx │ │ ├── SessionManager.tsx │ │ ├── SpeakerList.tsx │ │ ├── SpeakerRoleManager.tsx │ │ ├── SpeakerSelector.tsx │ │ ├── TextEditor.tsx │ │ ├── TrainingForm.tsx │ │ ├── TrainingHistory.tsx │ │ ├── TrainingMetricsChart.tsx │ │ ├── VoicePreview.tsx │ │ └── VoiceRecorder.tsx │ ├── eslint.config.mjs │ ├── lib/ │ │ ├── 
DatasetContext.tsx │ │ ├── DatasetItemsContext.tsx │ │ ├── GenerationContext.tsx │ │ ├── GlobalTaskContext.tsx │ │ ├── PresetVoiceContext.tsx │ │ ├── ProjectContext.tsx │ │ ├── SessionContext.tsx │ │ ├── SpeakerRoleContext.tsx │ │ ├── TrainingContext.tsx │ │ ├── api.ts │ │ ├── audioUtils.ts │ │ └── i18n/ │ │ ├── LanguageContext.tsx │ │ ├── config.ts │ │ └── locales/ │ │ ├── en.json │ │ └── zh.json │ ├── next.config.ts │ ├── package.json │ ├── postcss.config.mjs │ ├── public/ │ │ ├── icon-preview.html │ │ ├── icon-rect-preview.html │ │ └── site.webmanifest │ ├── scripts/ │ │ └── generate-version.js │ ├── tsconfig.json │ └── types/ │ ├── dialog.ts │ ├── generation.ts │ ├── preset.ts │ ├── project.ts │ ├── quickGenerate.ts │ ├── speaker.ts │ ├── task.ts │ └── training.ts ├── pyproject.toml ├── rebuild.sh ├── test_generation_offloading.py ├── test_offloading.py ├── tests/ │ ├── test_logging.py │ ├── test_lora_network.py │ └── test_training_service.py ├── tokenizer/ │ ├── tokenizer.json │ ├── tokenizer_config.json │ └── vocab.json ├── util/ │ ├── LOGGING_README.md │ ├── __init__.py │ ├── float8_scale.py │ ├── logger.py │ ├── logger_examples.py │ ├── model_utils.py │ ├── rand_init.py │ ├── safetensors_util.py │ └── vibevoice_norm.py └── vibevoice/ ├── __init__.py ├── configs/ │ ├── qwen2.5_1.5b_64k.json │ └── qwen2.5_7b_32k.json ├── generation/ │ ├── __init__.py │ └── visitor.py ├── lora/ │ ├── __init__.py │ └── lora_network.py ├── modular/ │ ├── __init__.py │ ├── adaptive_offload.py │ ├── custom_offloading_utils.py │ ├── modeling_vibevoice.py │ ├── modeling_vibevoice_inference.py │ ├── modular_vibevoice_diffusion_head.py │ ├── modular_vibevoice_qwen.py │ ├── modular_vibevoice_text_tokenizer.py │ ├── modular_vibevoice_tokenizer.py │ └── streamer.py ├── processor/ │ ├── __init__.py │ ├── vibevoice_processor.py │ └── vibevoice_tokenizer_processor.py ├── schedule/ │ ├── __init__.py │ ├── dpm_solver.py │ └── timestep_sampler.py ├── scripts/ │ ├── __init__.py │ └── convert_nnscaler_checkpoint_to_transformers.py └── training/ ├── dataset.py ├── fake_trainer.py ├── summary_visitor.py ├── trainer.py └── trainer_visitor.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .dockerignore ================================================ # This .dockerignore is mostly irrelevant since we don't COPY from local context # The Dockerfile clones everything from GitHub # However, we keep this to prevent accidentally copying if someone modifies the Dockerfile # Everything - we clone from GitHub instead * # Exception: Allow Dockerfile itself for reference (not actually copied in our Dockerfile) !Dockerfile ================================================ FILE: .gitignore ================================================ .claude CLAUDE.md .vscode/ uv.lock media/ demo/example/ # Byte-compiled / optimized / DLL files __pycache__/ *.py[codz] *$py.class # C extensions *.so # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ share/python-wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. 
*.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .nox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover *.py.cover .hypothesis/ .pytest_cache/ cover/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 db.sqlite3-journal # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder .pybuilder/ target/ # Jupyter Notebook .ipynb_checkpoints # IPython profile_default/ ipython_config.py # pyenv # For a library or package, you might want to ignore these files since the code is # intended to run in multiple environments; otherwise, check them in: # .python-version # pipenv # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. # However, in case of collaboration, if having platform-specific dependencies or dependencies # having no cross-platform support, pipenv may install dependencies that don't work, or not # install all needed dependencies. #Pipfile.lock # UV # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control. # This is especially recommended for binary packages to ensure reproducibility, and is more # commonly ignored for libraries. #uv.lock # poetry # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. # This is especially recommended for binary packages to ensure reproducibility, and is more # commonly ignored for libraries. # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control #poetry.lock #poetry.toml # pdm # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. # pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python. # https://pdm-project.org/en/latest/usage/project/#working-with-version-control #pdm.lock #pdm.toml .pdm-python .pdm-build/ # pixi # Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control. #pixi.lock # Pixi creates a virtual environment in the .pixi directory, just like venv module creates one # in the .venv directory. It is recommended not to include this directory in version control. .pixi # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm __pypackages__/ # Celery stuff celerybeat-schedule celerybeat.pid # SageMath parsed files *.sage.py # Environments .env .envrc .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ .dmypy.json dmypy.json # Pyre type checker .pyre/ # pytype static type analyzer .pytype/ # Cython debug symbols cython_debug/ # PyCharm # JetBrains specific template is maintained in a separate JetBrains.gitignore that can # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore # and can be added to the global gitignore or merged into this file. For a more nuclear # option (not recommended) you can uncomment the following to ignore the entire idea folder. #.idea/ # Abstra # Abstra is an AI-powered process automation framework. # Ignore directories containing user credentials, local state, and settings. 
# Learn more at https://abstra.io/docs .abstra/ # Visual Studio Code # Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore # that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore # and can be added to the global gitignore or merged into this file. However, if you prefer, # you could uncomment the following to ignore the entire vscode folder # .vscode/ # Ruff stuff: .ruff_cache/ # PyPI configuration file .pypirc # Cursor # Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to # exclude from AI features like autocomplete and code analysis. Recommended for sensitive data # refer to https://docs.cursor.com/context/ignore-files .cursorignore .cursorindexingignore # Marimo marimo/_static/ marimo/_lsp/ __marimo__/ models *.txt *.pt *.swp *.safetensors outputs/ # Backend backend/.env backend/uploads/ backend/__pycache__/ backend/**/__pycache__/ !backend/models/ !frontend/lib/ # Workspace workspace/ outputs_offloading/ demo/datasets/ tensorboard_logs/ nohup.out docs/images/vibevoice_architecture_1.svg ================================================ FILE: CHANGELOG.md ================================================ # Changelog All notable changes to VibeVoice will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). ## [v2.2.0] - 2026-01-09 ### Added - **Quick Generation**: Simplified voice generation workflow without project context - One-click generation with preset voice selection - Support for multiple voice prompt files - Multiple prompt voices displayed in generation history detail view - Task indicator for quick generation status - Per-item progress tracking for each generating voice - **Navigation Enhancement**: Click logo to return to home page ### Fixed - Task indicator display bugs in quick generation - Card status not updating in time - Styling consistency for deleting generation history --- ## [v2.1.0] - 2025-12-28 ### Added - **Narration Mode Editor**: New editing mode for single-speaker narration content - Support for changing narrator - Plain text editing without speaker prefixes - **Preset Voices Management**: Manage preset voice samples for quick speaker creation - **Auto Version Generation**: Frontend version automatically generated from git ### Fixed - Offloading config save issue during inference - Offloading config error - Duplicated speaker ID issue - Missing tags in Docker repository - Dockerfile dependency errors --- ## [v2.0.0] - 2025-12-19 ### Added - **Fine-Tuning Support**: Full LoRA training workflow with real-time metrics - Training page with live progress bars and configuration options - Training metrics charts (Loss/LR/Timing) with 5-second auto-refresh - Support for layer offloading presets (Balanced/Aggressive/Extreme) - Gradient accumulation steps and checkpoint saving per epoch - TensorBoard metrics reader for training visualization - **Dataset Management**: Complete dataset CRUD operations - Dataset list and detail pages with pagination - Import/Export functionality for datasets - JSONL format for efficient line-by-line operations - Scripts for generating datasets from Mozilla Common Voice and KeSpeech - **Multi-Generation**: Batch generation with different random seeds - Generate 2-20 audio variations in a single request - Per-item progress tracking with individual audio players - Expandable history 
view with aggregate statistics - **LoRA Inference**: Apply trained LoRA models during voice generation - Select LoRA model from training output directory - Configurable LoRA weight (0-1] - **Unified Task API**: Single endpoint for checking any running task (inference or training) - **Preset Voice Feature**: Quick speaker creation from preset voice samples - **Audio Denoising**: Scripts for audio denoising with DeepFilter - **Dataset Processing Scripts**: - Script for ASR-SCCantDuSC (Scripted Chinese Cantonese Daily-use Speech Corpus) - Script for Mozilla Common Voice datasets ### Changed - Improved training completion UI with better status display - Enhanced training history list to display all information regardless of success/failure - Better estimated training time calculation - Increased file upload limits (500MB configurable) - Project-scoped current generation and training API endpoints ### Fixed - Training metadata update error - Generated voices having same name in batch generation - Seeds reset issue with multi-generation - CUDA resource cleanup when training finishes or fails - Invalid audio and voice_prompts field values in datasets.jsonl - Delete training history validation - OOM error handling with specific error messages - Various npm build errors and UI style issues ## [v1.0.0] - 2025-11-14 ### Added - **Core TTS Model**: AR + diffusion architecture for multi-speaker text-to-speech synthesis - Float8 inference support for optimized performance - Mono model file inference support - **Full-Stack Web Application**: - Next.js frontend with responsive UI - Flask backend with RESTful API - Static export for production deployment - **Project Management**: Create and manage multiple voice projects - **Speaker Voice Management**: - Upload and manage speaker voices - Voice recording directly in browser - Auto-assigned speaker names ("Speaker 1", etc.) 
- **Dialog Editor**: 4-panel layout for creating and editing dialog sessions - Clickable session names in generation history - Session navigation to voice editor - **Voice Generation**: - Live progress monitoring - Generation history with pagination - Audio playback and download - Task icon notification in navigation - **Layer Offloading**: VRAM optimization for GPU memory constraints - Configurable number of layers for GPU/CPU - Async transfers with ThreadPoolExecutor - Smart cache clearing for performance - **Internationalization (i18n)**: Full bilingual support - English and Chinese languages - Auto-detection via browser settings - Persistence in localStorage - **Docker Support**: - Dockerfile for containerized deployment - GPU support with nvidia-docker - **Documentation**: - Comprehensive API documentation - Architecture diagrams with Mermaid - Offloading configuration guide ### Fixed - Invisible text color in browser dark theme - Frontend project selection issues - Refresh page navigation bugs - Layout issues in various components - Scripts not starting with ID 1 - Various typos and documentation errors [v2.2.0]: https://github.com/zhao-kun/vibevoice/compare/v2.1.0...v2.2.0 [v2.1.0]: https://github.com/zhao-kun/vibevoice/compare/v2.0.0...v2.1.0 [v2.0.0]: https://github.com/zhao-kun/vibevoice/compare/v1.0.0...v2.0.0 [v1.0.0]: https://github.com/zhao-kun/vibevoice/releases/tag/v1.0.0 ================================================ FILE: CHANGELOG_zh.md ================================================ # 更新日志 VibeVoice 的所有重要更改都将记录在此文件中。 本文档格式基于 [Keep a Changelog](https://keepachangelog.com/zh-CN/1.0.0/), 并且本项目遵循 [语义化版本](https://semver.org/lang/zh-CN/)。 ## [v2.2.0] - 2026-01-09 ### 新增 - **快速生成**: 无需项目上下文的简化语音生成流程 - 一键生成,支持预设音色选择 - 支持多个音色提示文件 - 生成历史详情中显示多个提示音色 - 快速生成任务状态指示器 - 每个生成语音的独立进度跟踪 - **导航增强**: 点击 Logo 返回首页 ### 修复 - 快速生成中的任务指示器显示问题 - 卡片状态未及时更新 - 删除生成历史的样式一致性问题 --- ## [v2.1.0] - 2025-12-28 ### 新增 - **旁白模式编辑器**: 单人朗读内容的新编辑模式 - 支持切换朗读者 - 无需说话人前缀的纯文本编辑 - **预设音色管理**: 管理预设音色样本,快速创建说话人 - **自动版本生成**: 前端版本号自动从 git 生成 ### 修复 - 推理过程中的 Offloading 配置保存问题 - Offloading 配置错误 - 说话人 ID 重复问题 - Docker 仓库缺少标签 - Dockerfile 依赖错误 --- ## [v2.0.0] - 2025-12-19 ### 新增 - **微调支持**: 完整的 LoRA 训练工作流,支持实时指标监控 - 训练页面,带有实时进度条和配置选项 - 训练指标图表(Loss/LR/Timing),5 秒自动刷新 - 支持层卸载预设(均衡/激进/极限) - 梯度累积步数和每轮检查点保存 - TensorBoard 指标读取器用于训练可视化 - **数据集管理**: 完整的数据集 CRUD 操作 - 数据集列表和详情页面,支持分页 - 数据集导入/导出功能 - JSONL 格式,高效逐行操作 - 从 Mozilla Common Voice 和 KeSpeech 生成数据集的脚本 - **批量生成**: 使用不同随机种子批量生成 - 单次请求生成 2-20 个音频变体 - 每个项目的进度跟踪和独立音频播放器 - 可展开的历史视图,显示汇总统计 - **LoRA 推理**: 在语音生成时应用训练好的 LoRA 模型 - 从训练输出目录选择 LoRA 模型 - 可配置的 LoRA 权重 (0-1] - **统一任务 API**: 单一接口检查任何运行中的任务(推理或训练) - **预设音色功能**: 从预设音色样本快速创建说话人 - **音频降噪**: 使用 DeepFilter 的音频降噪脚本 - **数据集处理脚本**: - ASR-SCCantDuSC(粤语日常用语语音语料库)处理脚本 - Mozilla Common Voice 数据集处理脚本 ### 变更 - 改进训练完成 UI,更好的状态显示 - 增强训练历史列表,无论成功或失败都显示所有信息 - 更准确的预估训练时间计算 - 增加文件上传限制(500MB,可配置) - 项目范围的当前生成和训练 API 端点 ### 修复 - 训练元数据更新错误 - 批量生成中生成的语音名称相同 - 批量生成时种子重置问题 - 训练完成或失败时的 CUDA 资源清理 - datasets.jsonl 中无效的音频和 voice_prompts 字段值 - 删除训练历史的验证问题 - OOM 错误处理,提供具体错误信息 - 各种 npm 构建错误和 UI 样式问题 ## [v1.0.0] - 2025-11-14 ### 新增 - **核心 TTS 模型**: AR + 扩散架构的多说话人文本转语音合成 - Float8 推理支持,优化性能 - 单体模型文件推理支持 - **全栈 Web 应用**: - Next.js 前端,响应式 UI - Flask 后端,RESTful API - 静态导出用于生产部署 - **项目管理**: 创建和管理多个语音项目 - **说话人音色管理**: - 上传和管理说话人音色 - 浏览器内直接录音 - 自动分配说话人名称("说话人 1" 等) - **对话编辑器**: 四面板布局,创建和编辑对话会话 - 生成历史中可点击的会话名称 - 会话导航到语音编辑器 - **语音生成**: - 实时进度监控 - 生成历史,支持分页 - 音频播放和下载 - 导航栏任务图标通知 - **层卸载**: GPU 显存优化 - 可配置的 GPU/CPU 层数 - ThreadPoolExecutor 异步传输 - 智能缓存清理以提升性能 - **国际化 
(i18n)**: 完整的双语支持 - 英语和中文 - 通过浏览器设置自动检测 - localStorage 持久化 - **Docker 支持**: - Dockerfile 容器化部署 - nvidia-docker GPU 支持 - **文档**: - 完整的 API 文档 - Mermaid 架构图 - 层卸载配置指南 ### 修复 - 浏览器深色主题中不可见的文字颜色 - 前端项目选择问题 - 页面刷新导航问题 - 各组件布局问题 - 脚本未从 ID 1 开始 - 各种拼写错误和文档错误 [v2.2.0]: https://github.com/zhao-kun/vibevoice/compare/v2.1.0...v2.2.0 [v2.1.0]: https://github.com/zhao-kun/vibevoice/compare/v2.0.0...v2.1.0 [v2.0.0]: https://github.com/zhao-kun/vibevoice/compare/v1.0.0...v2.0.0 [v1.0.0]: https://github.com/zhao-kun/vibevoice/releases/tag/v1.0.0 ================================================ FILE: Dockerfile ================================================ # Multi-stage Dockerfile for VibeVoice # This Dockerfile is completely self-contained and requires no local source code # All source code is cloned from GitHub during build ARG BASE_IMAGE=nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 ARG WORKDIR=/workspace/zhao-kun/vibevoice ARG GITHUB_REPO=https://github.com/zhao-kun/vibevoice.git ARG GITHUB_BRANCH=main ############################################# # Stage 1: Download model from HuggingFace ############################################# FROM python:3.10-slim AS model-downloader # Install huggingface-cli RUN pip install --no-cache-dir huggingface-hub[cli] # Set working directory WORKDIR /tmp/models # Download model (float8_e4m3fn only) using huggingface-cli RUN hf download zhaokun/vibevoice-large \ vibevoice7b_float8_e4m3fn.safetensors \ --local-dir /tmp/models/VibeVoice-large RUN hf download zhaokun/vibevoice-large \ vibevoice7b_bf16.safetensors \ --local-dir /tmp/models/VibeVoice-large ############################################# # Stage 2: Clone Repository and Build Frontend ############################################# FROM node:20-alpine AS source-and-frontend ARG GITHUB_REPO ARG GITHUB_BRANCH ARG CACHE_BUST=unknown # Install git RUN apk add --no-cache git # Cache bust: Force rebuild from here when CACHE_BUST changes RUN echo "Cache bust: ${CACHE_BUST}" # Clone repository (shallow clone, then unshallow to get full history for git describe) WORKDIR /build RUN git clone --depth 1 --branch ${GITHUB_BRANCH} ${GITHUB_REPO} vibevoice && \ cd /build/vibevoice && \ git fetch --unshallow origin && \ git fetch --tags origin && \ git checkout main && \ git rev-parse HEAD > backend/version.txt # Build frontend WORKDIR /build/vibevoice/frontend # Install dependencies RUN npm ci # Build frontend RUN npm run build # Verify build output RUN ls -la out/ ############################################# # Stage 3: Create Python Virtual Environment ############################################# FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS python-builder ARG WORKDIR=/workspace/zhao-kun/vibevoice ENV DEBIAN_FRONTEND=noninteractive RUN apt-get update --allow-releaseinfo-change --yes && \ apt-get upgrade --yes && \ apt install --yes --no-install-recommends \ bash \ libgl1 \ software-properties-common \ ffmpeg \ zip \ unzip \ iputils-ping \ libtcmalloc-minimal4 \ net-tools \ vim \ p7zip-full && \ rm -rf /var/lib/apt/lists/* RUN add-apt-repository ppa:deadsnakes/ppa RUN apt-get update --allow-releaseinfo-change --yes && \ apt install python3.10-dev python3.10-venv python3-pip \ build-essential git curl -y --no-install-recommends && \ ln -s /usr/bin/python3.10 /usr/bin/python && \ rm /usr/bin/python3 && \ ln -s /usr/bin/python3.10 /usr/bin/python3 && \ apt-get clean && rm -rf /var/lib/apt/lists/* && \ echo "en_US.UTF-8 UTF-8" > /etc/locale.gen # Create working directory at EXACT runtime path RUN mkdir -p ${WORKDIR} WORKDIR 
${WORKDIR} # Copy source code from frontend stage COPY --from=source-and-frontend /build/vibevoice . # Create virtual environment at runtime path (critical for absolute paths in venv) RUN python3.10 -m venv ${WORKDIR}/venv # Upgrade pip and install dependencies RUN ${WORKDIR}/venv/bin/pip install --no-cache-dir --upgrade pip setuptools wheel && \ ${WORKDIR}/venv/bin/pip install --no-cache-dir . RUN rm -rf ${WORKDIR}/frontend ############################################# # Stage 4: Final Image ############################################# FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS builder ARG WORKDIR=/workspace/zhao-kun/vibevoice ENV DEBIAN_FRONTEND=noninteractive ENV PYTHONUNBUFFERED=1 RUN apt-get update --allow-releaseinfo-change --yes && \ apt-get upgrade --yes && \ apt install --yes --no-install-recommends \ bash \ libgl1 \ software-properties-common \ ffmpeg \ zip \ unzip \ iputils-ping \ libtcmalloc-minimal4 \ net-tools \ vim \ p7zip-full && \ rm -rf /var/lib/apt/lists/* RUN add-apt-repository ppa:deadsnakes/ppa RUN apt-get update --allow-releaseinfo-change --yes && \ apt install python3.10-dev python3.10-venv python3-pip \ build-essential git curl -y --no-install-recommends && \ ln -s /usr/bin/python3.10 /usr/bin/python && \ rm /usr/bin/python3 && \ ln -s /usr/bin/python3.10 /usr/bin/python3 && \ apt-get clean && rm -rf /var/lib/apt/lists/* && \ echo "en_US.UTF-8 UTF-8" > /etc/locale.gen # Copy downloaded model from model-downloader stage RUN mkdir -p /tmp/models/ COPY --from=model-downloader /tmp/models/VibeVoice-large /tmp/models # Create working directory at EXACT same path as build stage RUN mkdir -p ${WORKDIR} WORKDIR ${WORKDIR} # Copy virtual environment from python-builder stage (with preserved absolute paths) COPY --from=python-builder ${WORKDIR} . RUN mkdir -p ${WORKDIR}/models/vibevoice/ && ln -s /tmp/models/vibevoice7b_float8_e4m3fn.safetensors ${WORKDIR}/models/vibevoice/ RUN mkdir -p ${WORKDIR}/models/vibevoice/ && ln -s /tmp/models/vibevoice7b_bf16.safetensors ${WORKDIR}/models/vibevoice/ # Copy frontend build from source-and-frontend stage RUN mkdir -p ${WORKDIR}/backend/dist COPY --from=source-and-frontend /build/vibevoice/frontend/out ${WORKDIR}/backend/dist # Create workspace directory for runtime data RUN mkdir -p ${WORKDIR}/workspace # Expose port EXPOSE 9527 # Health check HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \ CMD curl -f http://localhost:9527/health || exit 1 # Use venv python explicitly (critical - do not rely on PATH) CMD ["/workspace/zhao-kun/vibevoice/venv/bin/python", "backend/run.py"] ================================================ FILE: README.md ================================================ # VibeVoiceFusion
*VibeVoiceFusion Logo*

**A Complete Web Application for Multi-Speaker Voice Generation**

*Built on Microsoft's VibeVoice Model*

[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE) [![Python](https://img.shields.io/badge/python-3.9+-blue.svg?logo=python)](https://www.python.org/) [![TypeScript](https://img.shields.io/badge/typescript-5.0+-blue.svg?logo=typescript)](https://www.typescriptlang.org/) [![Docker](https://img.shields.io/badge/docker-ready-brightgreen.svg)](Dockerfile) [![Docker Hub](https://img.shields.io/badge/Docker%20Hub-vibevoicefusion-blue?logo=docker)](https://hub.docker.com/r/zhaokundev/vibevoicefusion) [![Docker Pulls](https://img.shields.io/docker/pulls/zhaokundev/vibevoicefusion?logo=docker)](https://hub.docker.com/r/zhaokundev/vibevoicefusion) [![Image Size](https://img.shields.io/docker/image-size/zhaokundev/vibevoicefusion/latest?logo=docker)](https://hub.docker.com/r/zhaokundev/vibevoicefusion)

[English](README.md) | [简体中文](README_zh.md)

[Features](#features) • [Demo Samples](#demo-samples) • [Get Started](#get-started) • [Documentation](#documentation) • [Community](#community) • [Contributing](#contributing)
--- ## Overview ### Purpose VibeVoiceFusion is a **web application** for generating high-quality, multi-speaker synthetic speech with voice cloning capabilities. Built on Microsoft's VibeVoice model (AR + diffusion architecture), this project provides a complete full-stack solution with voice generation, LoRA fine-tuning, dataset management, batch generation, and advanced VRAM optimization features. **Key Goals:** - Provide a user-friendly interface for voice generation without requiring coding knowledge - Enable efficient multi-speaker dialog synthesis with distinct voice characteristics - Support LoRA fine-tuning for custom voice adaptation and style transfer - Generate multiple audio variations in batch with different random seeds - Optimize memory usage for consumer-grade GPUs (10GB+ VRAM) - Support bilingual workflows (English/Chinese) - Offer both web UI and CLI interfaces for different use cases
*Video Introduction*
### Principle

VibeVoice combines **autoregressive (AR)** and **diffusion** techniques for text-to-speech synthesis:

1. **Text Processing**: Input text is tokenized and processed through a Qwen-based language model backbone
2. **Voice Encoding**: Reference voice samples are encoded into acoustic and semantic embeddings
3. **AR Generation**: The model autoregressively generates speech tokens conditioned on text and voice embeddings
4. **Diffusion Refinement**: A DPM-Solver-based diffusion head converts tokens to high-quality audio waveforms
5. **Voice Cloning**: The unified processor preserves speaker characteristics from reference audio samples

**Technical Highlights:**

- **Model Architecture**: Qwen backbone + VAE acoustic tokenizer + semantic encoder + diffusion head
- **Quantization**: Float8 (FP8 E4M3FN) support for ~50% VRAM reduction with minimal quality loss
- **Layer Offloading**: Dynamic CPU/GPU memory management for running on limited VRAM
- **Attention Mechanism**: PyTorch native SDPA for maximum compatibility

### Features

#### Quick Generation

- **One-Click Generation**: Generate voice without creating projects, speakers, or sessions
- **Voice Source Options**:
  - Upload custom audio files (WAV, MP3, M4A, FLAC, WebM) - up to 4 files
  - Select from preset voice samples with language/gender filters
- **Auto Mode Detection**: Automatically detects dialogue vs narration format
- **Multi-Voice Support**: Use up to 4 voice prompts for generation
- **Generation History**: Persistent history with expandable details, bulk delete
- **Per-Item Progress**: Real-time progress tracking for each generating voice

#### Complete Web Application

- **Project Management**: Organize voice generation projects with metadata and descriptions
- **Speaker/Voice Management**:
  - Upload and manage reference voice samples (WAV, MP3, M4A, FLAC, WebM)
  - Audio preview with playback controls
  - Voice file replacement with automatic cache-busting
  - Audio trimming functionality
- **Dialog Editor**:
  - Visual editor with drag-and-drop line reordering
  - Text editor mode for bulk editing
  - Support for multi-speaker dialogs (up to 4+ speakers)
  - **Narration mode** for single-speaker content (audiobooks, articles, podcasts)
  - Real-time preview and validation
- **Generation System**:
  - Queue-based task management (prevents GPU conflicts)
  - Real-time progress monitoring with live updates
  - Configurable parameters (CFG scale, random seed, model precision)
  - **Multi-Generation**: Generate 2-20 audio variations in a single batch with different seeds
  - LoRA model support with configurable weight (0-1]
  - Generation history with filtering, sorting, and pagination
  - Audio playback and download for completed generations

#### LoRA Fine-Tuning

- **Dataset Management**:
  - Create and manage training datasets with audio/text pairs
  - Import datasets from ZIP archives or local folders
  - JSONL format for efficient data handling
  - Pagination and search for large datasets
  - Export datasets for backup or sharing
- **Training System**:
  - LoRA (Low-Rank Adaptation) fine-tuning for voice customization
  - Configurable training parameters (epochs, learning rate, LoRA rank, batch size)
  - Layer offloading support for training on consumer GPUs
  - Real-time training progress with tqdm-style progress bar
  - Live training metrics charts (Loss, Learning Rate, Timing)
  - TensorBoard integration for detailed metrics
  - Training history with status tracking (Prepare, Training, Completed, Failed)
  - OOM detection with helpful suggestions for recovery
- **LoRA Model Usage**:
  - Select trained LoRA models during voice generation
  - Configurable LoRA weight for blending with base model
  - Multiple LoRA files per training job (epoch checkpoints + final)

#### VRAM Optimization

- **Layer Offloading**: Move transformer layers between CPU/GPU to reduce VRAM requirements
  - **Balanced** (12 GPU / 16 CPU layers): ~5GB VRAM savings, ~2.0x slower - RTX 3060 16GB, 4070
  - **Aggressive** (8 GPU / 20 CPU layers): ~6GB VRAM savings, ~2.5x slower - RTX 3060 12GB, 4060
  - **Extreme** (4 GPU / 24 CPU layers): ~7GB VRAM savings, ~3.5x slower - RTX 3080 10GB (minimum)
- **Float8 Quantization**: Reduce model size from ~14GB to ~7GB with comparable quality (supported on RTX 40-series and newer cards)
- **Adaptive Configuration**: Automatic VRAM estimation and optimal layer distribution

**VRAM Requirements:**

| Configuration | GPU Layers | VRAM Usage | Speed | Target Hardware |
|--------------|-----------|------------|-------|-----------------|
| No offloading | 28 | 11-14GB | 1.0x | RTX 4090, A100, 3090 |
| Balanced | 12 | 6-8GB | 0.70x | RTX 4070, 3080 16GB |
| Aggressive | 8 | 5-7GB | 0.55x | RTX 3060 12GB |
| Extreme | 4 | 4-5GB | 0.40x | RTX 3080 10GB |

> Float8 quantization is only supported on RTX 40- and 50-series NVIDIA cards.

#### Internationalization

- **Full Bilingual Support**: Complete English/Chinese UI with 360+ translation keys
- **Auto-Detection**: Automatically detects browser language on first visit
- **Persistent Preference**: Language selection saved in localStorage
- **Backend i18n**: API error messages and responses translated to the user's language

#### Docker Deployment

- **Multi-Stage Build**: Optimized Dockerfile with frontend build, Python venv, and model download
- **Self-Contained**: Clones from GitHub and builds entirely from source
- **HuggingFace Integration**: Automatically downloads the model weights from HuggingFace during build

#### Additional Features

- **Responsive Design**: Mobile-friendly interface with Tailwind CSS
- **Real-Time Updates**: WebSocket-free polling with smart update intervals (2s active, 60s background)
- **Audio Cache-Busting**: Ensures audio updates are immediately reflected
- **Toast Notifications**: User-friendly feedback for all operations
- **Dark Mode Ready**: Modern UI with consistent styling
- **Accessibility**: Keyboard navigation and ARIA labels

---

## Demo Samples

Listen to voice generation samples created with VibeVoiceFusion.
Click the links below to download and play:

### Single Speaker

**🎧 [Pandora's Box Story (BFloat16 Model)](https://raw.githubusercontent.com/zhao-kun/VibeVoiceFusion/main/demo/outputs/1p_pandora_box_bf16.wav)**

*Generated with the bfloat16 precision model - full quality, 14GB VRAM*

**🎧 [Pandora's Box Story (Float8 Model)](https://raw.githubusercontent.com/zhao-kun/VibeVoiceFusion/main/demo/outputs/1p_pandora_box_float8_e4m3fn.wav)**

*Generated with float8 quantization - optimized for 7GB VRAM with comparable quality*

### Multi-Speaker (3 Speakers)

**🎭 [东邪西毒 - 西游版 (Journey to the West Version)](https://raw.githubusercontent.com/zhao-kun/VibeVoiceFusion/main/demo/outputs/东邪西毒-西游版.wav)**

*Multi-speaker dialog with distinct voice characteristics for each character*

---

## Get Started

### Prerequisites

- **Python**: 3.9 or higher
- **Node.js**: 16.x or higher (for frontend development)
- **CUDA**: Compatible GPU with CUDA support (recommended)
- **VRAM**: Minimum 6GB for extreme offloading, 14GB recommended for best performance
- **Docker**: Optional, for containerized deployment

### Installation

#### Option 1: Docker (Recommended for Production)

Build the Docker image:

```bash
# Clone the repository
git clone https://github.com/zhao-kun/vibevoicefusion.git
cd vibevoicefusion

# Build the Docker image
docker compose build vibevoice
```

After the build succeeds, run:

```bash
docker run -d \
  --name vibevoicefusion \
  --gpus all \
  -p 9527:9527 \
  -v $(pwd)/workspace:/workspace/zhao-kun/vibevoice/workspace \
  zhaokundev/vibevoicefusion:latest
```

Access the application at `http://localhost:9527`

**The Docker image is also available on Docker Hub, so you can launch VibeVoiceFusion directly with the following commands:**

```bash
docker pull zhaokundev/vibevoicefusion
docker run -d \
  --name vibevoicefusion \
  --gpus all \
  -p 9527:9527 \
  -v $(pwd)/workspace:/workspace/zhao-kun/vibevoice/workspace \
  zhaokundev/vibevoicefusion:latest
```

**Build Time**: 18-28 minutes | **Image Size**: ~12-15GB

#### Option 2: Manual Installation

**1. Install Backend Dependencies**

```bash
# Clone the repository
git clone https://github.com/zhao-kun/vibevoice.git
cd vibevoice

# Install Python package
pip install -e .
```

**2. Download Pre-trained Model**

Download from HuggingFace (choose one):

- **Float8 (Recommended)**: [vibevoice7b_float8_e4m3fn.safetensors](https://huggingface.co/zhaokun/vibevoice-large/blob/main/vibevoice7b_float8_e4m3fn.safetensors) (~7GB; supported on RTX 40-series and newer cards)
- **BFloat16 (Full Precision)**: [vibevoice7b_bf16.safetensors](https://huggingface.co/zhaokun/vibevoice-large/blob/main/vibevoice7b_bf16.safetensors) (~14GB)
- **Config**: [config.json](https://huggingface.co/zhaokun/vibevoice-large/blob/main/config.json)

Place files in `./models/vibevoice/`

**3. Install Frontend Dependencies** (for development)

```bash
cd frontend
npm install
```

**4. Build Frontend** (for production)

```bash
cd frontend
npm run build
cp -r out/* ../backend/dist/
```

### Usage

#### Web Application (Recommended)

**Production Mode** (single server):

```bash
# Start backend server (serves both API and frontend)
python backend/run.py

# Access at http://localhost:9527
```

**Development Mode** (separate servers):

```bash
# Terminal 1: Start backend API
python backend/run.py  # http://localhost:9527

# Terminal 2: Start frontend dev server
cd frontend
npm run dev  # http://localhost:3000
```

### Quick Generation (No Project Required)

For quick testing without setting up projects:
Click **"Quick Generate"** from the home page 2. **Select Voice Source**: - **Upload**: Drag & drop your audio file (supports up to 4 files) - **Preset**: Choose from preset voices with language/gender filters 3. **Enter Text**: Type dialogue (`Speaker 1: Hello`) or narration (plain text) 4. **Configure**: Set seed, batch size (2-20), and offloading options 5. **Generate**: Click generate and monitor per-item progress 6. **Download**: Play or download generated audio **Key Points:** - Auto-detects dialogue vs narration mode - All speakers use the same voice in dialogue mode - No LoRA support (base model only) - History persists across sessions --- ### Complete Workflow Guide This guide walks you through the complete process of creating multi-speaker voice generation from start to finish. #### Step 1: Create a Project Start by creating a new project or selecting an existing one. Projects help organize your voice generation work with metadata and descriptions.
*Screenshot: Project Management (create and manage projects from the home page)*
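If you prefer scripting this step, the backend also exposes project CRUD over REST (see [docs/APIs.md](docs/APIs.md) for the authoritative routes). Below is a minimal sketch using `requests`, assuming a `POST /api/v1/projects` endpoint that accepts `name` and `description`; the route and payload are illustrative, not confirmed:

```python
import requests

# Base URL taken from the frontend config (NEXT_PUBLIC_API_URL); adjust the port if needed.
API_BASE = "http://localhost:9527/api/v1"

# Hypothetical endpoint and fields for illustration; check docs/APIs.md for the real contract.
resp = requests.post(
    f"{API_BASE}/projects",
    json={"name": "Podcast Episode 1", "description": "Demo project"},
    timeout=10,
)
resp.raise_for_status()
print("Created project:", resp.json())
```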
**Actions:**

- Click "Create New Project" card
- Enter a project name (e.g., "Podcast Episode 1")
- Optionally add a description
- Click "Create Project"

The project will be automatically selected and you'll be navigated to the Speaker Role page.

#### Step 2: Add Speakers and Upload Voice Samples

Upload reference voice samples for each speaker. The system supports various audio formats (WAV, MP3, M4A, FLAC, WebM).
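Clean, correctly trimmed reference audio matters more than the container format (see the tips below the next screenshot). Here is a minimal preprocessing sketch using `librosa` and `soundfile`, which are common audio libraries but not necessarily dependencies of this project; the 24kHz target rate is an assumption, not a requirement stated by the app:

```python
import librosa
import soundfile as sf

# Illustrative helper; librosa/soundfile are widely used audio libraries,
# not part of this project's documented toolchain.
def prepare_voice_sample(src: str, dst: str, sr: int = 24000, top_db: int = 30) -> float:
    """Load any supported format, downmix to mono, resample, and trim silence."""
    audio, _ = librosa.load(src, sr=sr, mono=True)            # decode + resample
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)   # strip leading/trailing silence
    sf.write(dst, trimmed, sr)
    return len(trimmed) / sr                                   # duration in seconds

seconds = prepare_voice_sample("raw_take.m4a", "speaker1.wav")
print(f"Saved speaker1.wav ({seconds:.1f}s); aim for roughly 5-15s of clean speech.")
```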
*Screenshot: Speaker Management (upload and manage voice samples for each speaker)*
**Actions:**

- Click "Add New Speaker" button
- The speaker will be automatically named (e.g., "Speaker 1", "Speaker 2")
- Click "Upload Voice" to select a reference audio file (3-30 seconds recommended)
- Preview the uploaded voice using the audio player
- Repeat for additional speakers (supports 2-4+ speakers)

**Tips:**

- Use clean audio with minimal background noise
- 5-15 seconds of speech is ideal for voice cloning
- Each speaker needs a unique voice sample
- You can replace voice files later by clicking "Change Voice"

#### Step 3: Create and Edit Dialog

Create a dialog session and write the multi-speaker conversation. The dialog editor supports drag-and-drop reordering and real-time preview.
*Screenshot: Dialog Editor (multi-speaker dialog editor with visual and text modes)*
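The editor validates the `Speaker N: text` convention shown in the format example below. A minimal sketch of that check for your own scripts (an illustrative helper, not the project's actual validator in `backend/utils/dialog_validator.py`):

```python
import re

LINE_RE = re.compile(r"^Speaker (\d+):\s*(.+)$")

def parse_dialog(text: str) -> list[tuple[int, str]]:
    """Split dialog text into (speaker_id, line) pairs; reject malformed lines."""
    parsed = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines between turns are fine
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"Line {lineno} is not in 'Speaker N: text' form: {line!r}")
        parsed.append((int(match.group(1)), match.group(2)))
    return parsed

script = "Speaker 1: Welcome to our podcast!\nSpeaker 2: Thanks for having me."
print(parse_dialog(script))  # [(1, 'Welcome to our podcast!'), (2, 'Thanks for having me.')]
```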
**Actions:**

- Click "Create New Session" in the session list
- Enter a session name (e.g., "Chapter 1")
- In the dialog editor, add lines for each speaker:
  - Select a speaker from the dropdown
  - Enter the dialog text
  - Click "Add Line" or press Enter
- Reorder lines by dragging the handle icons
- Use "Text Editor" mode for bulk editing
- Click "Save" to persist your changes

**Dialog Format (Text Mode):**

```
Speaker 1: Welcome to our podcast!
Speaker 2: Thanks for having me. It's great to be here.
Speaker 1: Let's dive into today's topic.
```

**Narration Mode:**

For single-speaker content like audiobooks, articles, or podcasts, use **Narration Mode**:

1. When creating a new session, toggle to "Narration" mode
2. Select a narrator voice from your uploaded speakers
3. Enter plain text without `Speaker N:` prefixes
4. Each paragraph will be spoken by the selected narrator

```
This is the first paragraph of your narration.

This is the second paragraph. No speaker formatting needed.

The narrator voice you selected will read all the text.
```

**Features:**

- Visual editor with drag-and-drop
- Text editor for bulk editing
- Real-time preview
- Copy and download functionality
- Format validation
- **Narration mode** for single-speaker content

#### Step 4: Generate Voice

Configure generation parameters and start the voice synthesis process. Monitor real-time progress and manage generation history.
*Screenshot: Voice Generation (generation interface with parameters, live progress, and history)*
**Actions:**

- Navigate to "Generate Voice" page
- Select a dialog session from the dropdown
- Configure parameters:
  - **Model Type**:
    - `float8_e4m3fn` (recommended): 7GB VRAM, faster loading
    - `bfloat16`: 14GB VRAM, full precision
  - **CFG Scale** (1.0-2.0): Controls generation adherence to text
    - Lower (1.0-1.3): More natural, varied
    - Higher (1.5-2.0): More controlled, may sound robotic
    - Default: 1.3
  - **Random Seed**: Any positive integer for reproducibility
  - **Offloading** (optional): Enable if VRAM < 14GB
    - **Balanced**: 12 GPU layers, ~5GB savings, 2.0x slower (RTX 3070 12GB, 4070)
    - **Aggressive**: 8 GPU layers, ~6GB savings, 2.5x slower (RTX 3080 12GB)
    - **Extreme**: 4 GPU layers, ~7GB savings, 3.5x slower (minimum 10GB VRAM)
- Click "Start Generation"

**Real-Time Monitoring:**

- Progress bar shows completion percentage
- Phase indicators: Preprocessing → Inferencing → Saving
- Live token generation count
- Estimated time remaining
*Screenshot: Voice Generation in progress (generation interface with parameters, live progress, and history)*
**Generation History:** - View all past generations with status (completed, failed, running) - Filter and sort by date, status, or session - Play generated audio inline - Download WAV files - Delete unwanted generations - View detailed metrics (tokens, duration, RTF, VRAM usage) #### Command-Line Interface For CLI-based generation without the web UI: ```bash python demo/local_file_inference.py \ --model_file ./models/vibevoice/vibevoice7b_float8_e4m3fn.safetensors \ --txt_path demo/text_examples/1p_pandora_box.txt \ --speaker_names zh-007 \ --output_dir ./outputs \ --dtype float8_e4m3fn \ --cfg_scale 1.3 \ --seed 42 ``` **CLI Arguments:** - `--model_file`: Path to model `.safetensors` file - `--config`: Path to `config.json` (optional) - `--txt_path`: Input text file with speaker-labeled dialog - `--speaker_names`: Speaker name(s) for voice file mapping - `--output_dir`: Output directory for generated audio - `--device`: `cuda`, `mps`, or `cpu` (auto-detected) - `--dtype`: `float8_e4m3fn` or `bfloat16` - `--cfg_scale`: Classifier-Free Guidance scale (default: 1.3) - `--seed`: Random seed for reproducibility ### Configuration #### Backend Configuration Environment variables (optional): ```bash export WORKSPACE_DIR=/path/to/workspace # Default: ./workspace export FLASK_DEBUG=False # Production mode ``` #### Frontend Configuration Development API URL (`frontend/.env.local`): ```bash NEXT_PUBLIC_API_URL=http://localhost:9527/api/v1 ``` --- ## Documentation ### Architecture Overview ``` vibevoice/ ├── backend/ # Flask API server │ ├── api/ # REST API endpoints │ │ ├── projects.py # Project CRUD │ │ ├── speakers.py # Speaker management │ │ ├── dialog_sessions.py # Dialog CRUD │ │ ├── generation.py # Voice generation │ │ ├── dataset.py # Dataset management │ │ └── training.py # LoRA training │ ├── services/ # Business logic layer │ ├── models/ # Data models │ ├── task_manager/ # Background task queue │ ├── inference/ # Inference engine │ ├── training/ # Training engine & state management │ ├── i18n/ # Backend translations │ └── dist/ # Frontend static files (production) ├── frontend/ # Next.js web application │ ├── app/ # Next.js pages │ │ ├── page.tsx # Home/Project selector │ │ ├── quick-generate/ # Quick generation (no project) │ │ ├── speaker-role/ # Speaker management │ │ ├── voice-editor/ # Dialog editor │ │ ├── generate-voice/ # Generation page │ │ ├── dataset/ # Dataset management │ │ └── fine-tuning/ # LoRA training page │ ├── components/ # React components │ ├── lib/ # Context providers & utilities │ │ ├── ProjectContext.tsx │ │ ├── SessionContext.tsx │ │ ├── SpeakerRoleContext.tsx │ │ ├── GenerationContext.tsx │ │ ├── TrainingContext.tsx │ │ ├── GlobalTaskContext.tsx │ │ ├── i18n/ # Frontend translations │ │ └── api.ts # API client │ └── types/ # TypeScript type definitions └── vibevoice/ # Core inference library ├── modular/ # Model implementations │ ├── custom_offloading_utils.py # Layer offloading │ └── adaptive_offload.py # Auto VRAM config ├── processor/ # Input processing └── schedule/ # Diffusion scheduling ``` ### API Reference For complete API documentation including request/response examples, see [docs/APIs.md](docs/APIs.md). 
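As a quick taste of the API, here is a hedged sketch that polls the unified task endpoint mentioned in the changelog ("single endpoint for checking any running task"). The `GET /api/v1/tasks/current` route and the response field names are assumptions for illustration; consult docs/APIs.md for the real path and shape. The 2-second interval mirrors the app's own "2s active" polling:

```python
import time
import requests

API_BASE = "http://localhost:9527/api/v1"  # base URL from the frontend config

def wait_for_idle(poll_seconds: float = 2.0, timeout: float = 600.0) -> dict:
    """Poll the (hypothetical) unified task endpoint until no task is running."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(f"{API_BASE}/tasks/current", timeout=10)
        resp.raise_for_status()
        task = resp.json()
        # Field names below are assumptions; the API may report progress differently.
        if not task or task.get("status") in ("completed", "failed", None):
            return task
        print(f"{task.get('type', 'task')}: {task.get('progress', '?')}%")
        time.sleep(poll_seconds)
    raise TimeoutError("Task did not finish in time")
```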
### Workspace Structure ``` workspace/ ├── projects.json # All projects metadata ├── _quick-generate/ # Quick generation storage │ ├── voices/ # Uploaded voice samples │ ├── outputs/ # Generated audio files │ └── history.json # Generation history └── {project-id}/ ├── voices/ │ ├── speakers.json # Speaker metadata │ └── {uuid}.wav # Voice files ├── scripts/ │ ├── sessions.json # Session metadata │ └── {uuid}.txt # Dialog text files ├── output/ │ ├── generation.json # Generation metadata │ └── {request_id}.wav # Generated audio files ├── datasets/ │ ├── datasets.json # Dataset metadata │ └── {dataset-id}/ │ ├── datasets.jsonl # Dataset items (one JSON per line) │ ├── audio/ # Audio files │ └── voice_prompts/ # Voice prompt files └── training/ ├── training_history.json # Training job metadata └── lora_output/ └── {lora-name}/ ├── model_epoch_*.safetensors # Checkpoint files └── model_final.safetensors # Final model ``` ### Performance Benchmarks **RTX 4090 (24GB VRAM):** | Configuration | VRAM | Generation Time | RTF | Quality | |--------------|------|-----------------|-----|---------| | BFloat16, No offload | 14GB | 15s (50s audio) | 0.30x | Excellent | | Float8, No offload | 7GB | 16s (50s audio) | 0.32x | Excellent | **RTX 3060 12GB:** | Configuration | VRAM | Generation Time | RTF | Quality | |--------------|------|-----------------|-----|---------| | Float8, Balanced | 7GB | 30s (50s audio) | 0.60x | Excellent | | Float8, Aggressive | 6GB | 40s (50s audio) | 0.80x | Good | *RTF (Real-Time Factor) < 1.0 means faster than real-time* --- ## Community ### Getting Help - **Issues**: [GitHub Issues](https://github.com/zhao-kun/vibevoice/issues) - Bug reports and feature requests - **Discussions**: [GitHub Discussions](https://github.com/zhao-kun/vibevoice/discussions) - Questions and community support ### Showcase Share your projects and experiences: - **Demo Audio**: Submit your generated samples to the showcase - **Use Cases**: Share how you're using VibeVoice - **Improvements**: Contribute optimizations and enhancements ### Responsible AI **Important**: This project is for **research and development** purposes only. #### Risks - **Deepfakes & Impersonation**: Synthetic speech can be misused for fraud or disinformation - **Voice Cloning Ethics**: Always obtain explicit consent before cloning voices - **Biases**: Model may inherit biases from training data - **Unexpected Outputs**: Generated audio may contain artifacts or inaccuracies #### Guidelines **DO:** - Clearly disclose when audio is AI-generated - Obtain explicit consent for voice cloning - Use responsibly for legitimate purposes - Respect privacy and intellectual property - Follow all applicable laws and regulations **DO NOT:** - Create deepfakes or impersonation without consent - Spread disinformation or misleading content - Use for fraud, scams, or malicious purposes - Violate laws or ethical guidelines **By using this software, you agree to use it ethically and responsibly.** --- ## Contributing We welcome contributions from the community! Here's how you can help: ### Ways to Contribute 1. **Report Bugs**: Open an issue with detailed reproduction steps 2. **Suggest Features**: Propose new features via GitHub issues 3. **Submit Pull Requests**: - Fix bugs - Add features - Improve documentation - Add translations 4. **Improve Documentation**: Help make the project more accessible 5. 
**Share Use Cases**: Show how you're using VibeVoice ### Testing ```bash # Backend tests (when available) pytest tests/ # Frontend tests (when available) cd frontend npm test # Manual testing # 1. Create project # 2. Add speakers # 3. Create dialog # 4. Generate voice # 5. Verify output quality ``` --- ## License This project follows the same license terms as the original Microsoft VibeVoice repository. Please refer to the [LICENSE](LICENSE) file for details. ### Third-Party Licenses - **Frontend**: React, Next.js, Tailwind CSS (MIT License) - **Backend**: Flask, PyTorch (Various open-source licenses) - **Model Weights**: Microsoft VibeVoice (subject to Microsoft's terms) --- ## Acknowledgments - **Microsoft Research**: Original VibeVoice model and architecture - **ComfyUI**: Float8 casting techniques inspiration - **kohya-ss/musubi-tuner**: Offloading implementation and LoRA network reference - **[voicepowered-ai/VibeVoice-finetuning](https://github.com/voicepowered-ai/VibeVoice-finetuning)**: Training dataloader implementation - **HuggingFace**: Model hosting and distribution - **Open Source Community**: Libraries and frameworks that made this possible --- ## Citation If you use this implementation in your research, please cite both this project and the original VibeVoice paper: ```bibtex @software{vibevoice_webapp_2024, title={VibeVoice: Complete Web Application for Multi-Speaker Voice Generation}, author={Zhao, Kun}, year={2024}, url={https://github.com/zhao-kun/vibevoice} } @article{vibevoice2024, title={VibeVoice: Unified Autoregressive and Diffusion for Speech Generation}, author={Microsoft Research}, year={2024} } ``` --- ## Troubleshooting ### CUDA Out of Memory ```bash # Try Float8 model --dtype float8_e4m3fn # Enable layer offloading in web UI # Or use CLI with manual configuration ``` ### Audio Quality Issues ```bash # Adjust CFG scale (try 1.0 - 2.0) --cfg_scale 1.5 # Use higher precision model --dtype bfloat16 ``` ### Port Already in Use ```bash # Change port in backend/run.py app.run(host='0.0.0.0', port=9528) ``` ### Frontend Build Errors ```bash cd frontend rm -rf node_modules .next npm install npm run build ``` ---
**Made by the VibeVoice Community** [Back to Top](#vibevoicefusion)
================================================ FILE: README_zh.md ================================================ # VibeVoiceFusion
*VibeVoiceFusion Logo*

**完整的多说话人语音生成 Web 应用**

*基于 Microsoft VibeVoice 模型构建*

[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE) [![Python](https://img.shields.io/badge/python-3.9+-blue.svg?logo=python)](https://www.python.org/) [![TypeScript](https://img.shields.io/badge/typescript-5.0+-blue.svg?logo=typescript)](https://www.typescriptlang.org/) [![Docker](https://img.shields.io/badge/docker-ready-brightgreen.svg)](Dockerfile) [![Docker Hub](https://img.shields.io/badge/Docker%20Hub-vibevoicefusion-blue?logo=docker)](https://hub.docker.com/r/zhaokundev/vibevoicefusion) [![Docker Pulls](https://img.shields.io/docker/pulls/zhaokundev/vibevoicefusion?logo=docker)](https://hub.docker.com/r/zhaokundev/vibevoicefusion) [![Image Size](https://img.shields.io/docker/image-size/zhaokundev/vibevoicefusion/latest?logo=docker)](https://hub.docker.com/r/zhaokundev/vibevoicefusion)

[English](README.md) | [简体中文](README_zh.md)

[功能特性](#功能特性) • [演示样本](#演示样本) • [快速开始](#快速开始) • [文档](#文档) • [社区](#社区) • [贡献](#贡献)
--- ## 概述 ### 项目目的 VibeVoiceFusion 是一个**Web 应用**,用于生成高质量、多说话人的合成语音,具备声音克隆功能。基于微软的 VibeVoice 模型(AR + 扩散架构),本项目提供完整的全栈解决方案,包含语音生成、LoRA 微调、数据集管理、批量生成和先进的显存优化功能。 **核心目标:** - 提供无需编程知识的友好界面进行语音生成 - 支持高效的多说话人对话合成,保持不同说话人的独特声音特征 - 支持 LoRA 微调,实现自定义声音适配和风格迁移 - 批量生成多个音频变体,使用不同的随机种子 - 优化显存使用,支持消费级 GPU(10GB+ 显存) - 支持双语工作流(英语/中文) - 提供 Web 界面和命令行界面以适应不同使用场景
*Video Introduction*
### 技术原理 VibeVoice 结合**自回归(AR)**和**扩散**技术进行文本转语音合成: 1. **文本处理**:输入文本经过分词并通过基于 Qwen 的语言模型主干网络处理 2. **声音编码**:参考语音样本被编码为声学和语义嵌入 3. **AR 生成**:模型基于文本和声音嵌入自回归生成语音 token 4. **扩散细化**:基于 DPM-Solver 的扩散头将 token 转换为高质量音频波形 5. **声音克隆**:统一处理器从参考音频样本中保留说话人特征 **技术亮点:** - **模型架构**:Qwen 主干网络 + VAE 声学分词器 + 语义编码器 + 扩散头 - **量化技术**:Float8 (FP8 E4M3FN) 支持,显存减少约 50%,质量损失极小 - **层卸载**:动态 CPU/GPU 内存管理,可在有限显存上运行 - **注意力机制**:PyTorch 原生 SDPA,最大化兼容性 ### 功能特性 #### 快速生成 - **一键生成**:无需创建项目、说话人或会话即可生成语音 - **音源选项**: - 上传自定义音频文件(WAV、MP3、M4A、FLAC、WebM)- 最多 4 个文件 - 从预设音色样本中选择,支持语言/性别筛选 - **自动模式检测**:自动识别对话格式与旁白格式 - **多音色支持**:支持使用最多 4 个音色提示进行生成 - **生成历史**:持久化历史记录,支持展开详情和批量删除 - **逐项进度**:实时跟踪每个生成语音的进度 #### 完整的 Web 应用 - **项目管理**:使用元数据和描述组织语音生成项目 - **说话人/声音管理**: - 上传和管理参考语音样本(WAV、MP3、M4A、FLAC、WebM) - 音频预览与播放控制 - 声音文件替换,自动缓存清除 - 音频裁剪功能 - **对话编辑器**: - 可视化编辑器,支持拖拽重排对话行 - 文本编辑模式用于批量编辑 - 支持多说话人对话(最多 4+ 个说话人) - **旁白模式**支持单说话人内容(有声书、文章、播客) - 实时预览和验证 - **生成系统**: - 基于队列的任务管理(防止 GPU 冲突) - 实时进度监控与动态更新 - 可配置参数(CFG scale、随机种子、模型精度) - **批量生成**:单次生成 2-20 个音频变体,使用不同随机种子 - LoRA 模型支持,可配置权重 (0-1] - 生成历史记录,支持过滤、排序和分页 - 完成的生成可播放和下载 #### LoRA 微调 - **数据集管理**: - 创建和管理包含音频/文本对的训练数据集 - 支持从 ZIP 压缩包或本地文件夹导入数据集 - JSONL 格式高效处理数据 - 大型数据集支持分页和搜索 - 导出数据集用于备份或分享 - **训练系统**: - LoRA(低秩适应)微调,实现声音定制 - 可配置训练参数(训练轮数、学习率、LoRA 秩、批量大小) - 支持层卸载,可在消费级 GPU 上训练 - 实时训练进度,带 tqdm 风格进度条 - 实时训练指标图表(损失、学习率、时间) - TensorBoard 集成,查看详细指标 - 训练历史记录,状态跟踪(准备中、训练中、已完成、失败) - OOM 检测,提供恢复建议 - **LoRA 模型使用**: - 在语音生成时选择已训练的 LoRA 模型 - 可配置 LoRA 权重,与基础模型混合 - 每个训练任务支持多个 LoRA 文件(epoch 检查点 + 最终模型) #### 显存优化 - **层卸载**:在 CPU/GPU 之间移动 Transformer 层以减少显存需求 - **平衡模式** (12 GPU / 16 CPU 层):约 5GB 显存节省,约 2.0 倍慢 - RTX 3060 16GB、4070 - **激进模式** (8 GPU / 20 CPU 层):约 6GB 显存节省,约 2.5 倍慢 - RTX 3080 12GB、4060 - **极限模式** (4 GPU / 24 CPU 层):约 7GB 显存节省,约 3.5 倍慢 - RTX 3080 10GB(最低配置) - **Float8 量化**:将模型大小从约 14GB 减少到约 7GB,质量相当 (RTX 40系及以上显卡支持) - **自适应配置**:自动显存估算和最优层分配 **显存需求:** | 配置 | GPU 层数 | 显存占用 | 速度 | 目标硬件 | |------|---------|---------|------|---------| | 无卸载 | 28 | 11-14GB | 1.0x | RTX 4090、A100、3090 | | 平衡 | 12 | 6-8GB | 0.70x | RTX 4070、3080 12GB | | 激进 | 8 | 5-7GB | 0.55x | RTX 3060 12GB | | 极限 | 4 | 4-5GB | 0.40x | RTX 3080 10GB | > Float8 量化,只有RTX 40系及以上显卡支持。 #### 国际化 - **完整双语支持**:完整的英文/中文界面,360+ 翻译键 - **自动检测**:首次访问自动检测浏览器语言 - **持久化偏好**:语言选择保存在 localStorage - **后端国际化**:API 错误消息和响应翻译为用户语言 #### Docker 部署 - **多阶段构建**:优化的 Dockerfile,包含前端构建、Python venv 和模型下载 - **自包含**:从 GitHub 克隆并完全从源代码构建 - **HuggingFace 集成**:构建过程中自动下载模型文件(约 3-4GB) #### 其他功能 - **响应式设计**:使用 Tailwind CSS 的移动友好界面 - **实时更新**:无需 WebSocket 的智能轮询更新间隔(活动时 2 秒,后台 60 秒) - **音频缓存清除**:确保音频更新立即反映 - **Toast 通知**:所有操作的用户友好反馈 - **深色模式就绪**:现代化一致的样式 - **可访问性**:键盘导航和 ARIA 标签 --- ## 演示样本 试听使用 VibeVoiceFusion 生成的语音样本。点击下方链接下载并播放: ### 单说话人 **🎧 [大话西游(BFloat16 模型)](https://raw.githubusercontent.com/zhao-kun/VibeVoiceFusion/main/demo/outputs/1p_pandora_box_bf16.wav)** *使用 bfloat16 精度模型生成 - 完整质量,需要 14GB 显存* **🎧 [大话西游(Float8 模型)](https://raw.githubusercontent.com/zhao-kun/VibeVoiceFusion/main/demo/outputs/1p_pandora_box_float8_e4m3fn.wav)** *使用 float8 量化生成 - 优化至 7lGB 显存,质量相当, RTX 40系及以上显卡支持* ### 多说话人(3 个说话人) **🎭 [东邪西毒 - 西游版](https://raw.githubusercontent.com/zhao-kun/VibeVoiceFusion/main/demo/outputs/东邪西毒-西游版.wav)** *多说话人对话,每个角色具有独特的声音特征* --- ## 快速开始 ### 前置要求 - **Python**:3.9 或更高版本 - **Node.js**:16.x 或更高版本(用于前端开发) - **CUDA**:支持 CUDA 的 GPU(推荐) - **显存**:极限卸载最少 6GB,推荐 14GB 以获得最佳性能 - **Docker**:可选,用于容器化部署 ### 安装 #### 方式 1:Docker(生产环境推荐) 构建Docker镜像 ```bash # 克隆仓库 git clone https://github.com/zhao-kun/vibevoicefusion.git cd vibevoicefusion # 使用 Docker Compose 构建和运行 docker compose build 
vibevoice ``` 运行Docker 容器 ```bash docker run -d \ --name vibevoice \ --gpus all \ -p 9527:9527 \ -v $(pwd)/workspace:/workspace/zhao-kun/vibevoice/workspace \ zhaokundev/vibevoicefusion:latest ``` 在 `http://localhost:9527` 访问应用 **Docker 镜像已经上传至Docker Hub, 你可以通过如下的命令直接启动vibevoicefusion 服务。** ```bash docker pull zhaokundev/vibevoicefusion docker run -d \ --name vibevoicefusion \ --gpus all \ -p 9527:9527 \ -v $(pwd)/workspace:/workspace/zhao-kun/vibevoice/workspace \ zhaokundev/vibevoicefusion:latest ``` **构建时间**:18-28 分钟 | **镜像大小**:约 12-15GB #### 方式 2:手动安装 **1. 安装后端依赖** ```bash # 克隆仓库 git clone https://github.com/zhao-kun/vibevoice.git cd vibevoice # 安装 Python 包 pip install -e . ``` **2. 下载预训练模型** 从 HuggingFace 下载(选择其一): - **Float8(推荐)**:[vibevoice7b_float8_e4m3fn.safetensors](https://huggingface.co/zhaokun/vibevoice-large/blob/main/vibevoice7b_float8_e4m3fn.safetensors)(约 7GB) - **BFloat16(全精度)**:[vibevoice7b_bf16.safetensors](https://huggingface.co/zhaokun/vibevoice-large/blob/main/vibevoice7b_bf16.safetensors)(约 14GB) - **配置文件**:[config.json](https://huggingface.co/zhaokun/vibevoice-large/blob/main/config.json) 将文件放置在 `./models/vibevoice/` 目录 **3. 安装前端依赖**(用于开发) ```bash cd frontend npm install ``` **4. 构建前端**(用于生产) ```bash cd frontend npm run build cp -r out/* ../backend/dist/ ``` ### 使用方法 #### Web 应用(推荐) **生产模式**(单服务器): ```bash # 启动后端服务器(同时提供 API 和前端) python backend/run.py # 在 http://localhost:9527 访问 ``` **开发模式**(分离服务器): ```bash # 终端 1:启动后端 API python backend/run.py # http://localhost:9527 # 终端 2:启动前端开发服务器 cd frontend npm run dev # http://localhost:3000 ``` ### 快速生成(无需项目) 快速测试,无需设置项目: 1. 从主页点击 **"快速生成"** 2. **选择音源**: - **上传**:拖放音频文件(最多支持 4 个文件) - **预设**:从预设音色中选择,支持语言/性别筛选 3. **输入文本**:输入对话格式(`Speaker 1: 你好`)或旁白格式(纯文本) 4. **配置参数**:设置种子、批量大小(2-20)和卸载选项 5. **生成**:点击生成并监控每项进度 6. **下载**:播放或下载生成的音频 **要点:** - 自动检测对话模式与旁白模式 - 对话模式下所有说话人使用相同音色 - 不支持 LoRA(仅基础模型) - 历史记录跨会话持久保存 --- ### 完整工作流程指南 本指南将带您完成从头到尾创建多说话人语音生成的完整过程。 #### 步骤 1:创建项目 首先创建一个新项目或选择现有项目。项目帮助您使用元数据和描述组织语音生成工作。
*[Screenshot: Project management. Create and manage projects from the home page.]*
**Steps:**

- Click the "Create New Project" card
- Enter a project name (for example, "Podcast Episode 1")
- Optionally add a description
- Click "Create Project"

The project is selected automatically, and you are taken to the speaker-role page.

#### Step 2: Add Speakers and Upload Voice Samples

Upload a reference audio sample for each speaker. The system supports multiple audio formats (WAV, MP3, M4A, FLAC, WebM).
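Speakers can also be added through the API's multipart upload endpoint (see the Speakers API in `backend/README.md`); a sketch in which the project ID and the sample file name are placeholders for your own values:

```bash
curl -X POST http://localhost:9527/api/v1/projects/my-project/speakers \
  -F "name=Alice" \
  -F "description=Main host" \
  -F "voice_file=@./alice_sample.wav"
```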
*[Screenshot: Speaker management. Upload and manage voice samples for each speaker.]*
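If a reference recording is longer than needed, you can trim it before uploading; a minimal sketch using ffmpeg (not part of this project; the file names are hypothetical), targeting the 5-15 second range recommended in the tips below:

```bash
# Keep a 10-second slice starting at the 5-second mark
ffmpeg -i raw_recording.wav -ss 5 -t 10 trimmed_sample.wav
```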

**Steps:**

- Click the "Add New Speaker" button
- Speakers are named automatically (for example, "Speaker 1", "Speaker 2")
- Click "Upload Voice" and select a reference audio file (3-30 seconds recommended)
- Preview uploaded voices with the audio player
- Repeat to add more speakers (2-4+ speakers supported)

**Tips:**

- Use clean audio with minimal background noise
- 5-15 seconds of speech works best for voice cloning
- Each speaker needs a distinct voice sample
- You can replace an audio file later by clicking "Change Voice"

#### Step 3: Create and Edit the Dialog

Create a dialog session and write the multi-speaker dialog. The dialog editor supports drag-and-drop reordering and live preview.
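The `Speaker N:` text format shown below is also what the command-line workflow later in this README consumes, so you can draft a script as a plain file; a minimal sketch (the file name is hypothetical):

```bash
cat > my_dialog.txt <<'EOF'
Speaker 1: Welcome to our podcast!
Speaker 2: Thanks for having me. Great to be here.
Speaker 1: Let's dive into today's topic.
EOF
```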
*[Screenshot: Dialog editor. Multi-speaker dialog editor with visual and text modes.]*
**Steps:**

- Click "Create New Session" in the session list
- Enter a session name (for example, "Chapter 1")
- Add dialog lines for each speaker in the dialog editor:
  - Pick a speaker from the dropdown
  - Type the line's text
  - Click "Add Line" or press Enter
- Reorder lines by dragging the handle icon
- Use "Text Editor" mode for bulk edits
- Click "Save" to save your changes

**Dialog format (text mode):**

```
Speaker 1: Welcome to our podcast!
Speaker 2: Thanks for having me. Great to be here.
Speaker 1: Let's dive into today's topic.
```

**Narration mode:**

For single-speaker content such as audiobooks, articles, and podcasts, use **narration mode**:

1. When creating a new session, switch to "Narration" mode
2. Choose a narrator voice from the uploaded speakers
3. Enter plain text without `Speaker N:` prefixes
4. The selected narrator voice reads all of the text

```
This is the first paragraph of narration.
This is the second. No speaker format is needed.
Your selected narrator voice reads all of the text.
```

**Features:**

- Visual editor with drag-and-drop
- Text editor for bulk edits
- Live preview
- Copy and download
- Format validation
- **Narration mode** for single-speaker content

#### Step 4: Generate Speech

Configure the generation parameters and start speech synthesis. Monitor progress in real time and manage the generation history.
*[Screenshot: Voice generation. Generation interface with parameter configuration, real-time progress, and history.]*
**Steps:**

- Go to the "Generate Voice" page
- Choose a dialog session from the dropdown
- Configure the parameters:
  - **Model type**:
    - `float8_e4m3fn` (recommended): 7GB VRAM, faster to load
    - `bfloat16`: 14GB VRAM, full precision
  - **CFG scale** (1.0-2.0): controls how closely generation follows the text
    - Lower (1.0-1.3): more natural and varied
    - Higher (1.5-2.0): more controlled, can sound mechanical
    - Default: 1.3
  - **Random seed**: any positive integer, for reproducibility
  - **Offloading** (optional): enable if you have less than 14GB of VRAM
    - **Balanced**: 12 GPU layers, saves about 5GB, 2.0x slower (RTX 3060 16GB, 4070)
    - **Aggressive**: 8 GPU layers, saves about 6GB, 2.5x slower (RTX 3060 12GB)
    - **Extreme**: 4 GPU layers, saves about 7GB, 3.5x slower (10GB VRAM minimum)
- Click "Start Generation"

**Real-time monitoring:**

- Progress bar with completion percentage
- Stage indicator: preprocessing → inference → saving
- Live token-generation count
- Estimated time remaining
*[Screenshot: Voice generation in progress. Generation interface with parameter configuration, real-time progress, and history.]*
**Generation history:**

- View all past generations with their status (completed, failed, running)
- Filter and sort by date, status, or session
- Play generated audio inline
- Download WAV files
- Delete unwanted generations
- Inspect detailed metrics (tokens, duration, RTF, VRAM usage)

#### Command-Line Interface

To generate without the web UI:

```bash
python demo/local_file_inference.py \
  --model_file ./models/vibevoice/vibevoice7b_float8_e4m3fn.safetensors \
  --txt_path demo/text_examples/1p_pandora_box.txt \
  --speaker_names zh-007 \
  --output_dir ./outputs \
  --dtype float8_e4m3fn \
  --cfg_scale 1.3 \
  --seed 42
```

**CLI arguments:**

- `--model_file`: path to the model `.safetensors` file
- `--config`: path to `config.json` (optional)
- `--txt_path`: dialog text file with speaker labels
- `--speaker_names`: speaker names mapped to voice files
- `--output_dir`: output directory for generated audio
- `--device`: `cuda`, `mps`, or `cpu` (auto-detected)
- `--dtype`: `float8_e4m3fn` or `bfloat16`
- `--cfg_scale`: classifier-free guidance scale (default: 1.3)
- `--seed`: random seed for reproducibility
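As a variant of the example above, the optional flags can be combined; a sketch that runs the full-precision model with an explicit config path and device (the paths assume the layout from the installation step):

```bash
python demo/local_file_inference.py \
  --model_file ./models/vibevoice/vibevoice7b_bf16.safetensors \
  --config ./models/vibevoice/config.json \
  --txt_path demo/text_examples/1p_pandora_box.txt \
  --speaker_names zh-007 \
  --output_dir ./outputs \
  --dtype bfloat16 \
  --device cuda \
  --cfg_scale 1.5 \
  --seed 123
```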
### Configuration

#### Backend Configuration

Environment variables (optional):

```bash
export WORKSPACE_DIR=/path/to/workspace  # default: ./workspace
export FLASK_DEBUG=False                 # production mode
```

#### Frontend Configuration

Development API URL (`frontend/.env.local`):

```bash
NEXT_PUBLIC_API_URL=http://localhost:9527/api/v1
```

---

## Documentation

### Architecture Overview

```
vibevoice/
├── backend/                    # Flask API server
│   ├── api/                    # REST API endpoints
│   │   ├── projects.py         # Project CRUD
│   │   ├── speakers.py         # Speaker management
│   │   ├── dialog_sessions.py  # Dialog CRUD
│   │   ├── generation.py       # Voice generation
│   │   ├── datasets.py         # Dataset management
│   │   └── training.py         # LoRA training
│   ├── services/               # Business-logic layer
│   ├── models/                 # Data models
│   ├── task_manager/           # Background task queue
│   ├── inference/              # Inference engine
│   ├── training/               # Training engine and state management
│   ├── i18n/                   # Backend translations
│   └── dist/                   # Frontend static files (production)
├── frontend/                   # Next.js web application
│   ├── app/                    # Next.js pages
│   │   ├── page.tsx            # Home / project selector
│   │   ├── quick-generate/     # Quick Generate (no project required)
│   │   ├── speaker-role/       # Speaker management
│   │   ├── voice-editor/       # Dialog editor
│   │   ├── generate-voice/     # Generation page
│   │   ├── dataset/            # Dataset management
│   │   └── fine-tuning/        # LoRA training page
│   ├── components/             # React components
│   ├── lib/                    # Context providers and utilities
│   │   ├── ProjectContext.tsx
│   │   ├── SessionContext.tsx
│   │   ├── SpeakerRoleContext.tsx
│   │   ├── GenerationContext.tsx
│   │   ├── TrainingContext.tsx
│   │   ├── GlobalTaskContext.tsx
│   │   ├── i18n/               # Frontend translations
│   │   └── api.ts              # API client
│   └── types/                  # TypeScript type definitions
└── vibevoice/                  # Core inference library
    ├── modular/                # Model implementations
    │   ├── custom_offloading_utils.py  # Layer offloading
    │   └── adaptive_offload.py         # Automatic VRAM configuration
    ├── processor/              # Input processing
    └── schedule/               # Diffusion scheduling
```

### API Reference

For complete API documentation with request/response examples, see [docs/APIs.md](docs/APIs.md).

### Workspace Structure

```
workspace/
├── projects.json                  # Metadata for all projects
├── _quick-generate/               # Quick Generate storage
│   ├── voices/                    # Uploaded voice samples
│   ├── outputs/                   # Generated audio files
│   └── history.json               # Generation history
└── {project-id}/
    ├── voices/
    │   ├── speakers.json          # Speaker metadata
    │   └── {uuid}.wav             # Voice files
    ├── scripts/
    │   ├── sessions.json          # Session metadata
    │   └── {uuid}.txt             # Dialog text files
    ├── output/
    │   ├── generation.json        # Generation metadata
    │   └── {request_id}.wav       # Generated audio files
    ├── datasets/
    │   ├── datasets.json          # Dataset metadata
    │   └── {dataset-id}/
    │       ├── datasets.jsonl     # Dataset entries (one JSON per line)
    │       ├── audio/             # Audio files
    │       └── voice_prompts/     # Voice prompt files
    └── training/
        ├── training_history.json  # Training task metadata
        └── lora_output/
            └── {lora-name}/
                ├── model_epoch_*.safetensors  # Checkpoint files
                └── model_final.safetensors    # Final model
```

### Performance Benchmarks

**RTX 4090 (24GB VRAM):**

| Configuration | VRAM | Generation time | RTF | Quality |
|---------------|------|-----------------|-----|---------|
| BFloat16, no offloading | 14GB | 15s (for 50s of audio) | 0.30x | Excellent |
| Float8, no offloading | 7GB | 16s (for 50s of audio) | 0.32x | Excellent |

**RTX 3060 12GB:**

| Configuration | VRAM | Generation time | RTF | Quality |
|---------------|------|-----------------|-----|---------|
| Float8, balanced | 7GB | 30s (for 50s of audio) | 0.60x | Excellent |
| Float8, aggressive | 6GB | 40s (for 50s of audio) | 0.80x | Good |

*RTF (real-time factor) below 1.0 means generation is faster than real time; for example, taking 15 seconds to generate 50 seconds of audio gives an RTF of 15/50 = 0.30.*

---

## Community

### Getting Help

- **Issues**: [GitHub Issues](https://github.com/zhao-kun/vibevoice/issues) for bug reports and feature requests
- **Discussions**: [GitHub Discussions](https://github.com/zhao-kun/vibevoice/discussions) for questions and community support

### Showcase

Share your projects and experiences:

- **Demo audio**: submit your generated samples to the showcase
- **Use cases**: share how you use VibeVoice
- **Improvements**: contribute optimizations and enhancements

### Responsible AI

**Important**: this project is intended for **research and development** purposes only.

#### Risks

- **Deepfakes and impersonation**: synthetic speech can be misused for fraud or disinformation
- **Voice-cloning ethics**: always obtain explicit consent before cloning a voice
- **Bias**: the model may inherit biases from its training data
- **Unexpected output**: generated audio may contain artifacts or inaccuracies

#### Guidelines

**Do:**

- Clearly disclose that audio is AI-generated
- Obtain explicit consent for voice cloning
- Use the software responsibly and for legitimate purposes
- Respect privacy and intellectual-property rights
- Comply with all applicable laws and regulations

**Don't:**

- Create deepfakes or impersonations without consent
- Spread disinformation or misleading content
- Use the software for fraud, scams, or malicious purposes
- Violate laws or ethical guidelines

**By using this software you agree to use it ethically and responsibly.**

---

## Contributing

Community contributions are welcome! Here is how you can help:

### Ways to Contribute

1. **Report bugs**: open an issue with detailed reproduction steps
2. **Suggest features**: propose new features through GitHub issues
3. **Submit pull requests**:
   - Fix bugs
   - Add features
   - Improve documentation
   - Add translations
4. **Improve documentation**: help make the project more accessible
5. **Share use cases**: show how you use VibeVoice

### Testing

```bash
# Backend tests (where available)
pytest tests/

# Frontend tests (where available)
cd frontend
npm test

# Manual testing
# 1. Create a project
# 2. Add speakers
# 3. Create a dialog
# 4. Generate speech
# 5. Verify output quality
```

---

## License

This project follows the same license terms as the original Microsoft VibeVoice repository. See the [LICENSE](LICENSE) file for details.

### Third-Party Licenses

- **Frontend**: React, Next.js, Tailwind CSS (MIT license)
- **Backend**: Flask, PyTorch (various open-source licenses)
- **Model weights**: Microsoft VibeVoice (subject to Microsoft's terms)

---

## Acknowledgements

- **Microsoft Research**: the original VibeVoice model and architecture
- **ComfyUI**: inspiration for the Float8 conversion technique
- **kohya-ss/musubi-tuner**: reference for the offloading and LoRA network implementations
- **[voicepowered-ai/VibeVoice-finetuning](https://github.com/voicepowered-ai/VibeVoice-finetuning)**: training data-loader implementation
- **HuggingFace**: model hosting and distribution
- **The open-source community**: the many libraries and frameworks that make this project possible

---

## Citation

If you use this implementation in your research, please cite both this project and the original VibeVoice paper:

```bibtex
@software{vibevoice_webapp_2024,
  title={VibeVoice: Complete Web Application for Multi-Speaker Voice Generation},
  author={Zhao, Kun},
  year={2024},
  url={https://github.com/zhao-kun/vibevoice}
}

@article{vibevoice2024,
  title={VibeVoice: Unified Autoregressive and Diffusion for Speech Generation},
  author={Microsoft Research},
  year={2024}
}
```

---

## Troubleshooting

### CUDA Out of Memory

```bash
# Try the Float8 model
--dtype float8_e4m3fn

# Enable layer offloading in the web UI,
# or configure it manually for the CLI
```

### Audio Quality Issues

```bash
# Adjust the CFG scale (try 1.0-2.0)
--cfg_scale 1.5

# Use the higher-precision model
--dtype bfloat16
```

### Port Already in Use

```bash
# Pick a different port via the environment (see backend/README.md),
# or edit the app.run(...) call in backend/run.py
export FLASK_PORT=9528
```

### Frontend Build Errors

```bash
cd frontend
rm -rf node_modules .next
npm install
npm run build
```

---
**Made by the VibeVoice community**

[Back to top](#vibevoice)
================================================
FILE: backend/.gitignore
================================================
.env


================================================
FILE: backend/README.md
================================================
# VibeVoice Backend

Flask-based REST API backend for the VibeVoice speech generation system.

## Architecture

```
backend/
├── api/                     # API endpoints (blueprints)
│   ├── __init__.py          # Main API blueprint
│   ├── projects.py          # Project management endpoints
│   ├── speakers.py          # Speaker role endpoints
│   ├── dialog_sessions.py   # Dialog session endpoints
│   └── generation.py        # Voice generation endpoints (TBD)
├── models/                  # Data models
│   ├── project.py           # Project dataclass
│   ├── speaker.py           # SpeakerRole dataclass
│   └── dialog_session.py    # DialogSession dataclass
├── services/                # Business logic services
│   ├── project_service.py
│   ├── speaker_service.py
│   └── dialog_session_service.py
├── utils/                   # Utility functions
│   ├── file_handler.py
│   └── dialog_validator.py
├── app.py                   # Flask application factory
├── config.py                # Configuration management
├── run.py                   # Development server
└── .env.example             # Environment variables template
```

## Setup

### Install Dependencies

The backend shares dependencies with the main VibeVoice project. Install from the project root:

```bash
pip install -e .
```

### Configuration

1. Copy the example environment file:

```bash
cp backend/.env.example backend/.env
```

2. Edit `backend/.env` with your configuration

### Running the Development Server

From the project root:

```bash
python backend/run.py
```

Or using the module:

```bash
python -m backend.run
```

The server will start at `http://localhost:9527` by default.

## API Endpoints

### Health Check

```
GET /health
```

Returns server health status.

### API Base

```
GET /
```

Returns API information.

### Test Endpoint

```
GET /api/v1/ping
```

Simple ping endpoint for testing.

### Projects API

#### List Projects

```
GET /api/v1/projects
```

Returns all projects with metadata.

**Response:**

```json
{
  "projects": [
    {
      "id": "my-project",
      "name": "My Project",
      "description": "Project description",
      "created_at": "2025-10-22T03:18:58.969507",
      "updated_at": "2025-10-22T03:18:58.969507"
    }
  ],
  "count": 1
}
```

#### Get Project

```
GET /api/v1/projects/<project_id>
```

Get a specific project by ID.

#### Create Project

```
POST /api/v1/projects
Content-Type: application/json

{
  "name": "Project Name",
  "description": "Optional description"
}
```

Creates a new project directory with subdirectories (`voices/`, `scripts/`, `outputs/`) and adds a metadata entry.

**Response:** HTTP 201 with project data

#### Update Project

```
PUT /api/v1/projects/<project_id>
Content-Type: application/json

{
  "name": "Updated Name",
  "description": "Updated description"
}
```

Updates project metadata (name and/or description).

#### Delete Project

```
DELETE /api/v1/projects/<project_id>
```

Deletes the project directory and removes it from metadata.

**Response:**

```json
{
  "message": "Project deleted successfully",
  "project_id": "my-project"
}
```
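With the development server running, the endpoints above can be exercised from the shell; a quick smoke test, reusing the example project ID from the responses above:

```bash
curl http://localhost:9527/health
curl http://localhost:9527/api/v1/projects
curl -X DELETE http://localhost:9527/api/v1/projects/my-project
```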
## Configuration

Environment variables (see `.env.example`):

- `FLASK_ENV`: Environment (development/production/testing)
- `FLASK_HOST`: Server host (default: 0.0.0.0)
- `FLASK_PORT`: Server port (default: 9527)
- `FLASK_DEBUG`: Enable debug mode (default: true)
- `SECRET_KEY`: Flask secret key
- `CORS_ORIGINS`: Allowed CORS origins (comma-separated)
- `WORKSPACE_DIR`: Root directory for all projects (default: ./workspace)
- `MODEL_PATH`: Path to the VibeVoice model
- `MODEL_DEVICE`: Device for model inference (cuda/cpu)
- `UPLOAD_FOLDER`: Directory for uploaded files
- `MAX_CONTENT_LENGTH`: Maximum file upload size

## Development

### Project Structure

- **api/**: REST API endpoints organized by resource
- **models/**: Data models and schemas
- **services/**: Business logic and model integration
- **utils/**: Helper functions and utilities

### Adding New Endpoints

1. Create a new file in `api/` (e.g., `api/myresource.py`)
2. Define routes using Flask blueprints
3. Import and register the blueprint in `api/__init__.py`

## Project Directory Structure

Each project is stored as a directory under `WORKSPACE_DIR`:

```
workspace/
├── projects.json            # Metadata for all projects
└── my-project/              # Project directory (named by project ID)
    ├── voices/              # Speaker voice samples
    │   ├── speakers.json    # Speaker metadata
    │   └── *.wav            # Voice sample files
    ├── scripts/             # Dialog scripts
    │   ├── sessions.json    # Dialog session metadata
    │   └── *.txt            # Dialog text files
    └── outputs/             # Generated audio files (TBD)
```

### Speakers API

#### List Speakers

```
GET /api/v1/projects/<project_id>/speakers
```

Returns all speaker roles for a project.

**Response:**

```json
{
  "speakers": [
    {
      "speaker_id": "Speaker 1",
      "name": "Alice",
      "description": "Main host",
      "voice_filename": "abc123.wav",
      "created_at": "2025-10-22T...",
      "updated_at": "2025-10-22T..."
    }
  ],
  "count": 1
}
```

#### Get Speaker

```
GET /api/v1/projects/<project_id>/speakers/<speaker_id>
```

Get a specific speaker by ID (e.g., "Speaker 1").

#### Add Speaker

```
POST /api/v1/projects/<project_id>/speakers
Content-Type: multipart/form-data

name: Speaker Name
description: Speaker description (optional)
voice_file: