Repository: PKU-YuanGroup/Open-Sora-Plan
Branch: main
Commit: f7fa604f4e3a
Files: 171
Total size: 1.3 MB

Directory structure:
gitextract_la9ona01/

├── .github/
│   └── workflows/
│       └── docker_build.yml
├── .gitignore
├── LICENSE
├── README.md
├── docs/
│   ├── Contribution_Guidelines.md
│   ├── Prompt_Refiner.md
│   ├── Report-v1.0.0-cn.md
│   ├── Report-v1.0.0.md
│   ├── Report-v1.1.0.md
│   ├── Report-v1.2.0.md
│   ├── Report-v1.3.0.md
│   ├── Report-v1.5.0.md
│   ├── Report-v1.5.0_cn.md
│   └── VAE.md
├── examples/
│   ├── cond_pix_path.txt
│   ├── cond_prompt.txt
│   ├── rec_image.py
│   ├── rec_video.py
│   └── sora.txt
├── opensora/
│   ├── __init__.py
│   ├── acceleration/
│   │   ├── __init__.py
│   │   ├── communications.py
│   │   └── parallel_states.py
│   ├── adaptor/
│   │   ├── __init__.py
│   │   ├── bf16_optimizer.py
│   │   ├── engine.py
│   │   ├── modules.py
│   │   ├── stage_1_and_2.py
│   │   ├── utils.py
│   │   └── zp_manager.py
│   ├── dataset/
│   │   ├── __init__.py
│   │   ├── inpaint_dataset.py
│   │   ├── t2v_datasets.py
│   │   ├── transform.py
│   │   └── virtual_disk.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── causalvideovae/
│   │   │   ├── __init__.py
│   │   │   ├── dataset/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── ddp_sampler.py
│   │   │   │   ├── transform.py
│   │   │   │   └── video_dataset.py
│   │   │   ├── eval/
│   │   │   │   ├── cal_fvd.py
│   │   │   │   ├── cal_lpips.py
│   │   │   │   ├── cal_psnr.py
│   │   │   │   ├── cal_ssim.py
│   │   │   │   ├── eval.py
│   │   │   │   ├── fvd/
│   │   │   │   │   ├── styleganv/
│   │   │   │   │   │   └── fvd.py
│   │   │   │   │   └── videogpt/
│   │   │   │   │       ├── fvd.py
│   │   │   │   │       └── pytorch_i3d.py
│   │   │   │   └── script/
│   │   │   │       ├── cal_clip_score.sh
│   │   │   │       ├── cal_fvd.sh
│   │   │   │       ├── cal_lpips.sh
│   │   │   │       ├── cal_psnr.sh
│   │   │   │       └── cal_ssim.sh
│   │   │   ├── model/
│   │   │   │   ├── __init__.py
│   │   │   │   ├── configuration_videobase.py
│   │   │   │   ├── dataset_videobase.py
│   │   │   │   ├── ema_model.py
│   │   │   │   ├── losses/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── discriminator.py
│   │   │   │   │   ├── lpips.py
│   │   │   │   │   └── perceptual_loss.py
│   │   │   │   ├── modeling_videobase.py
│   │   │   │   ├── modules/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── attention.py
│   │   │   │   │   ├── block.py
│   │   │   │   │   ├── conv.py
│   │   │   │   │   ├── normalize.py
│   │   │   │   │   ├── ops.py
│   │   │   │   │   ├── quant.py
│   │   │   │   │   ├── resnet_block.py
│   │   │   │   │   ├── updownsample.py
│   │   │   │   │   └── wavelet.py
│   │   │   │   ├── registry.py
│   │   │   │   ├── trainer_videobase.py
│   │   │   │   ├── utils/
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── distrib_utils.py
│   │   │   │   │   ├── module_utils.py
│   │   │   │   │   ├── scheduler_utils.py
│   │   │   │   │   ├── video_utils.py
│   │   │   │   │   └── wavelet_utils.py
│   │   │   │   └── vae/
│   │   │   │       ├── __init__.py
│   │   │   │       ├── modeling_causalvae.py
│   │   │   │       └── modeling_wfvae.py
│   │   │   ├── sample/
│   │   │   │   └── rec_video_vae.py
│   │   │   └── utils/
│   │   │       ├── __init__.py
│   │   │       ├── dataset_utils.py
│   │   │       ├── downloader.py
│   │   │       └── video_utils.py
│   │   ├── diffusion/
│   │   │   ├── __init__.py
│   │   │   ├── common.py
│   │   │   └── opensora_v1_3/
│   │   │       ├── __init__.py
│   │   │       ├── modeling_inpaint.py
│   │   │       ├── modeling_opensora.py
│   │   │       └── modules.py
│   │   ├── frame_interpolation/
│   │   │   ├── cfgs/
│   │   │   │   └── AMT-G.yaml
│   │   │   ├── interpolation.py
│   │   │   ├── networks/
│   │   │   │   ├── AMT-G.py
│   │   │   │   ├── __init__.py
│   │   │   │   └── blocks/
│   │   │   │       ├── __init__.py
│   │   │   │       ├── feat_enc.py
│   │   │   │       ├── ifrnet.py
│   │   │   │       ├── multi_flow.py
│   │   │   │       └── raft.py
│   │   │   ├── readme.md
│   │   │   └── utils/
│   │   │       ├── __init__.py
│   │   │       ├── build_utils.py
│   │   │       ├── dist_utils.py
│   │   │       ├── flow_utils.py
│   │   │       └── utils.py
│   │   ├── prompt_refiner/
│   │   │   ├── inference.py
│   │   │   ├── merge.py
│   │   │   └── train.py
│   │   └── text_encoder/
│   │       ├── __init__.py
│   │       ├── clip.py
│   │       └── t5.py
│   ├── npu_config.py
│   ├── sample/
│   │   ├── caption_refiner.py
│   │   ├── pipeline_inpaint.py
│   │   ├── pipeline_opensora.py
│   │   ├── rec_image.py
│   │   ├── rec_video.py
│   │   └── sample.py
│   ├── serve/
│   │   ├── gradio_utils.py
│   │   ├── gradio_web_server.py
│   │   ├── gradio_web_server_i2v.py
│   │   └── style.css
│   ├── train/
│   │   ├── train_causalvae.py
│   │   ├── train_inpaint.py
│   │   └── train_t2v_diffusers.py
│   └── utils/
│       ├── communications.py
│       ├── dataset_utils.py
│       ├── downloader.py
│       ├── ema.py
│       ├── ema_utils.py
│       ├── freeinit_utils.py
│       ├── lora_utils.py
│       ├── mask_utils.py
│       ├── parallel_states.py
│       ├── sample_utils.py
│       └── utils.py
├── pyproject.toml
└── scripts/
    ├── accelerate_configs/
    │   ├── ddp_config.yaml
    │   ├── deepspeed_zero2_config.yaml
    │   ├── deepspeed_zero2_offload_config.yaml
    │   ├── deepspeed_zero3_config.yaml
    │   ├── deepspeed_zero3_offload_config.yaml
    │   ├── default_config.yaml
    │   ├── hostfile
    │   ├── zero2.json
    │   ├── zero2_npu.json
    │   ├── zero2_offload.json
    │   ├── zero3.json
    │   └── zero3_offload.json
    ├── causalvae/
    │   ├── eval.sh
    │   ├── prepare_eval.sh
    │   ├── rec_image.sh
    │   ├── rec_video.sh
    │   ├── train.sh
    │   └── wfvae_4dim.json
    ├── slurm/
    │   └── placeholder
    ├── text_condition/
    │   ├── gpu/
    │   │   ├── sample_inpaint_v1_3.sh
    │   │   ├── sample_t2v_v1_3.sh
    │   │   ├── train_inpaint_v1_3.sh
    │   │   └── train_t2v_v1_3.sh
    │   └── npu/
    │       ├── sample_inpaint_v1_3.sh
    │       ├── sample_t2v_v1_3.sh
    │       ├── train_inpaint_v1_3.sh
    │       └── train_t2v_v1_3.sh
    ├── train_configs/
    │   └── mask_config.yaml
    └── train_data/
        └── merge_data.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/docker_build.yml
================================================
name: docker-build

on:
  workflow_dispatch:
  push:
    branches:
      - "main"
    paths:
      - "docker/Dockerfile"

jobs:
  build-Open-Sora:
    runs-on: ubuntu-latest
    steps:
      -
        name: Checkout
        uses: actions/checkout@v4
      -
        name: Set up QEMU
        uses: docker/setup-qemu-action@v3
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      -
        name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      -
        name: Build and push Open-Sora image
        uses: docker/build-push-action@v5
        with:
          context: .
          file: ./docker/Dockerfile
          push: true
          platforms: linux/amd64, linux/arm64, linux/s390x, linux/ppc64le
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/open-sora

================================================
FILE: .gitignore
================================================
ucf101_stride4x4x4
__pycache__
*.mp4
.ipynb_checkpoints
*.pth
UCF-101/
results/
build/
opensora.egg-info/
wandb/
.idea
*.ipynb
*.jpg
*.mp3
*.safetensors
*.mp4
*.png
*.gif
*.pth
*.pt
cache_dir/
wandb/
test*
sample_video*/
512*
720*
1024*
*debug*
private*
.deepspeed_env
256*
sample_image*/
taming*
*test*
sft*
flash*
65x256*
alpha_vae
*node*
cache/
Open-Sora-Plan_models/
sample_image*cfg*
*tmp*
*pymp*
check.py
bucket.py
whileinf.py
validation_dir/
runs/
samples/
inpaint*/
bs32x8x1*
*tmp*
*pymp*
check.py
bucket.py
whileinf.py
bs4x8x16_*
*.zip
*validation/
bs1x8x32*
bs16x8x1*
bs8x8x2*
bs8x8x1*
bs8x8x8*
bs1x8x16*
checklora.py
dim4todim8.py
*vae8_any*320x320*
samples/
runs/
*validation/
training_log*txt
filter_motion*
json2*.py
motionfun*
res_dist*
filter_json_aes_m*
stage2*.json
kernel_meta
ge_check_op.json
WFVAE_DISTILL_FORMAL
read_video*
bs32x8x2*
filter_json_aes_m*
json2json*
makenpu_json*
*make_small_json*
*schedule_noise*
test*
gpu_profiling*
gyy_dense*
torchelasti*
*VEnhancer*
*spdemo*
i2v.txt
*run_i2v*
*curope*
any*
*nomotion*
log*
*svg
*k8s*
*rf*
*lzj*
final*
opensora/train/*debug.py


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) Rabbitpre Intelligence Ltd

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================


<h1 align="left"> <a href="">Open-Sora Plan</a></h1>

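This project aims to create a simple and scalable repo to reproduce [Sora](https://openai.com/sora) (from OpenAI, though we prefer to call it "ClosedAI").

This project hopes to reproduce Sora through the power of the open-source community. It was jointly initiated by the PKU-Rabbitpre AIGC Joint Lab, with in-depth contributions from Rabbitpre, Huawei, Peng Cheng Laboratory, and partners in the open-source community.

The current v1.5 release is **trained entirely on Huawei Ascend (a pure Ascend version)**. Pull requests and usage are welcome!

We are iterating quickly on new versions and welcome more collaborators and algorithm engineers to join us: [Algorithm Engineer Recruitment - Rabbitpre Intelligence.pdf](https://github.com/user-attachments/files/19107972/-.pdf)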

<h5 align="left">

[![arXiv](https://img.shields.io/badge/Arxiv-Open--Sora%20Plan-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2412.00131)
[![arXiv](https://img.shields.io/badge/Arxiv-Helios-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2603.04379)
[![arXiv](https://img.shields.io/badge/Arxiv-WF--VAE-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2411.17459)
[![License](https://img.shields.io/badge/License-MIT-yellow)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/LICENSE)  <br>
[![Discord badge](https://img.shields.io/badge/Discord-join-blueviolet?logo=discord)](https://discord.gg/DFZg5678)
[![WeChat badge](https://img.shields.io/badge/微信-加入-green?logo=wechat)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/issues/53#issuecomment-1987226516)
[![Twitter](https://img.shields.io/badge/-Twitter@LinBin46984-black?logo=twitter&logoColor=1D9BF0)](https://x.com/LinBin46984/status/1795018003345510687) 
[![Modelers](https://img.shields.io/badge/%E9%AD%94%E4%B9%90-%E6%A8%A1%E5%9E%8B%E4%BD%93%E9%AA%8C-blue)](https://modelers.cn/spaces/MindSpore-Lab/Open_Sora_Plan) <br>
[![GitHub repo stars](https://img.shields.io/github/stars/PKU-YuanGroup/Open-Sora-Plan?style=flat&logo=github&logoColor=whitesmoke&label=Stars)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/stargazers)&#160;
[![GitHub repo forks](https://img.shields.io/github/forks/PKU-YuanGroup/Open-Sora-Plan?style=flat&logo=github&logoColor=whitesmoke&label=Forks)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/network)&#160;
[![GitHub repo watchers](https://img.shields.io/github/watchers/PKU-YuanGroup/Open-Sora-Plan?style=flat&logo=github&logoColor=whitesmoke&label=Watchers)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/watchers)&#160;
[![GitHub repo size](https://img.shields.io/github/repo-size/PKU-YuanGroup/Open-Sora-Plan?style=flat&logo=github&logoColor=whitesmoke&label=Repo%20Size)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/archive/refs/heads/main.zip) <br>
[![GitHub repo contributors](https://img.shields.io/github/contributors-anon/PKU-YuanGroup/Open-Sora-Plan?style=flat&label=Contributors)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/graphs/contributors) 
[![GitHub Commit](https://img.shields.io/github/commit-activity/m/PKU-YuanGroup/Open-Sora-Plan?label=Commit)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/commits/main/)
[![Pr](https://img.shields.io/github/issues-pr-closed-raw/PKU-YuanGroup/Open-Sora-Plan.svg?label=Merged+PRs&color=green)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/pulls)
[![GitHub issues](https://img.shields.io/github/issues/PKU-YuanGroup/Open-Sora-Plan?color=critical&label=Issues)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/issues?q=is%3Aopen+is%3Aissue)
[![GitHub closed issues](https://img.shields.io/github/issues-closed/PKU-YuanGroup/Open-Sora-Plan?color=success&label=Issues)](https://github.com/PKU-YuanGroup/Open-Sora-Plan/issues?q=is%3Aissue+is%3Aclosed)
</h5>
<a href="https://trendshift.io/repositories/8280" target="_blank"><img src="https://trendshift.io/api/badge/repositories/8280" alt="PKU-YuanGroup%2FOpen-Sora-Plan | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<h5 align="left"> If you like our project, please give us a star ⭐ on GitHub for the latest updates.</h5>


# 📣 News

* **[2026.03.08]** 👋👋👋 We introduce [Helios](https://github.com/PKU-YuanGroup/Helios), a breakthrough video generation model that achieves minute-scale, high-quality video synthesis at **19.5 FPS on a single H100** GPU, without relying on conventional long-video anti-drifting strategies or standard video acceleration techniques. Check out the [Technical Report](https://huggingface.co/papers/2603.04379)!
* **[2025.06.05]** 🔥🔥🔥 We release version 1.5.0, our most powerful model! By introducing a **higher-compression WFVAE** and an improved sparse DiT architecture, **SUV**, we achieve performance **comparable to HunyuanVideo (Open-Source)** using an 8B-scale model and 40 million video samples. Version 1.5.0 is **trained and run for inference entirely on Ascend 910-series accelerators**. Please check the [mindspeed_mmdit](https://github.com/PKU-YuanGroup/Open-Sora-Plan/tree/mindspeed_mmdit) branch for our new code and [Report-v1.5.0.md](docs/Report-v1.5.0.md) for our report. The GPU version is coming soon.
* **[2024.12.03]** ⚡️ We released our [arxiv paper](https://arxiv.org/abs/2412.00131) and WF-VAE [paper](https://arxiv.org/abs/2411.17459) for v1.3. The next more powerful version is coming soon.
* **[2024.10.16]** 🎉 We released version 1.3.0, featuring: **WFVAE**, **prompt refiner**, **data filtering strategy**, **sparse attention**, and **bucket training strategy**. We also support 93x480p within **24G VRAM**. More details can be found at our latest [report](docs/Report-v1.3.0.md).
* **[2024.08.13]** 🎉 We are launching Open-Sora Plan v1.2.0 **I2V** model, which is based on Open-Sora Plan v1.2.0. The current version supports image-to-video generation and transition generation (the starting and ending frames conditions for video generation). Check out the Image-to-Video section in this [report](https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/docs/Report-v1.2.0.md#training-image-to-video-diffusion-model).
* **[2024.07.24]** 🔥🔥🔥 v1.2.0 is here! Utilizing a 3D full attention architecture instead of 2+1D. We released a true 3D video diffusion model trained on 4s 720p. Check out our latest [report](docs/Report-v1.2.0.md).
* **[2024.05.27]** 🎉 We are launching Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out our latest [report](docs/Report-v1.1.0.md). Thanks to [ShareGPT4Video's](https://sharegpt4video.github.io/) capability to annotate long videos.
* **[2024.04.09]** 🤝 Excited to share our latest exploration on metamorphic time-lapse video generation: [MagicTime](https://github.com/PKU-YuanGroup/MagicTime), which learns real-world physics knowledge from time-lapse videos.
* **[2024.04.07]** 🎉🎉🎉 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our [report](docs/Report-v1.0.0.md). Thanks to HUAWEI NPU for supporting us.
* **[2024.03.27]** 🚀🚀🚀 We release the report of [VideoCausalVAE](docs/CausalVideoVAE.md), which supports both images and videos. We present our reconstructed video in this demonstration as follows. The text-to-video model is on the way.
* **[2024.03.01]** 🤗 We launched a plan to reproduce Sora, called Open-Sora Plan! Welcome to **watch** 👀 this repository for the latest updates.

# 😍 Gallery

Text-to-Video Generation of Open-Sora Plan v1.5.0.
### Youtube:
[![Demo Video of Open-Sora Plan V1.5.0](https://github.com/user-attachments/assets/130bbba2-3ded-4092-92ef-b65b673cb1a6)](https://youtu.be/IiWTdx2EHCY)
### Bilibili:
[![Demo Video of Open-Sora Plan V1.5.0](https://github.com/user-attachments/assets/130bbba2-3ded-4092-92ef-b65b673cb1a6)](https://www.bilibili.com/video/BV1X77tzxE3b/)

# 😮 Highlights

Open-Sora Plan shows excellent performance in video generation.

### 🔥 WFVAE with higher performance and compression
- Despite its 8×8×8 downsampling rate, WFVAE achieves a higher PSNR than the VAE used in Wan2.1, which lowers the training cost of the DiT built on it.

### 🚀 More powerful sparse DiT
- SUV, a more powerful sparse attention architecture, achieves performance close to dense DiT while providing a speedup of over 35%.

<p align="center">
    <img src="https://s21.ax1x.com/2024/07/22/pk7cob8.png" width="650" style="margin-bottom: 0.2;"/>
</p>

# 🐳 Resource

| Version | Architecture |  Diffusion Model | CausalVideoVAE | Data | Prompt Refiner |
|:---|:---|:---|:---|:---|:---|
| v1.5.0 | SUV (Skiparse 3D) | [121x576x1024](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.5.0/blob/main/MindSpeed/model_ema.pt)[5] | [Anysize_8x8x8_32dim](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.5.0/blob/main/MindSpeed/wfvae_888_dim32.ckpt) | - | - |
| v1.3.0 [4] | Skiparse 3D | [Anysize in 93x640x640](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/any93x640x640)[3], [Anysize in 93x640x640_i2v](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/any93x640x640_i2v)[3] | [Anysize](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/vae)| [prompt_refiner](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/prompt_refiner) | [checkpoint](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/prompt_refiner)|
| v1.2.0 | Dense 3D | [93x720p](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x720p), [29x720p](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/29x720p)[1], [93x480p](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x480p)[1,2], [29x480p](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/29x480p), [1x480p](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/1x480p), [93x480p_i2v](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x480p_i2v) | [Anysize](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/vae)| [Annotations](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0) | - |
| v1.1.0 | 2+1D | [221x512x512](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main/221x512x512), [65x512x512](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main/65x512x512) |[Anysize](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main/vae) |[Data and Annotations](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0)| - |
| v1.0.0 | 2+1D | [65x512x512](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/65x512x512), [65x256x256](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/65x256x256), [17x256x256](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/17x256x256) | [Anysize](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/vae) | [Data and Annotations](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.0.0)| - |

> [1] Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.

> [2] We fine-tuned 3.5k steps from 93×720p to get 93×480p for community research use.

> [3] The model is trained on arbitrary resolutions with stride=32, so keep the inference resolution a multiple of 32. The number of frames must be 4n+1, e.g. 93, 77, 61, 45, 29, 1 (image).

> [4] Model weights are also available at [OpenMind](https://modelers.cn/models/linbin/Open-Sora-Plan-v1.3.0) and [WiseModel](https://wisemodel.cn/models/PKU-YUAN/Open-Sora-Plan-v1.3.0).

> [5] The current model weights are only compatible with the NPU + MindSpeed-MM framework. Model weights are also available at [modelers](https://modelers.cn/models/PKU-YUAN-Group/Open-Sora-Plan-v1.5.0/tree/main/MindSpeed).
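
The shape constraints in note [3] (resolution a multiple of 32, frame count of the form 4n+1) can be checked before launching inference. The helper below is a minimal sketch, not part of the repository: the function names `is_valid_v1_3_shape` and `snap_down` are hypothetical and exist only for illustration.

```python
def is_valid_v1_3_shape(num_frames: int, height: int, width: int, stride: int = 32) -> bool:
    """Return True if the shape satisfies note [3]: H and W multiples of `stride`, frames = 4n+1."""
    frames_ok = num_frames >= 1 and (num_frames - 1) % 4 == 0
    resolution_ok = (
        height > 0 and width > 0 and height % stride == 0 and width % stride == 0
    )
    return frames_ok and resolution_ok


def snap_down(num_frames: int, height: int, width: int, stride: int = 32):
    """Round a requested shape down to the nearest shape that passes the check above."""
    frames = max(1, (num_frames - 1) // 4 * 4 + 1)
    snapped_h = max(stride, height // stride * stride)
    snapped_w = max(stride, width // stride * stride)
    return frames, snapped_h, snapped_w


if __name__ == "__main__":
    print(is_valid_v1_3_shape(93, 640, 640))
    print(snap_down(96, 650, 500))
```

Rounding down rather than up is a design choice for this sketch: it never asks for more frames or pixels than the caller requested.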

> [!Warning]
>
> <div align="left">
> <b>
> 🚨 For version 1.2.0, we no longer support 2+1D models.
> </b>
> </div>

# ⚙️ How to start

### GPU
coming soon...
### NPU
Please check out the **[mindspeed_mmdit](https://github.com/PKU-YuanGroup/Open-Sora-Plan/tree/mindspeed_mmdit)** branch and follow the README.md for configuration.

# 📖 Technical report
Please check [Report-v1.5.0.md](docs/Report-v1.5.0.md).

# 💡 How to Contribute
We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better!

For more details, please refer to the [Contribution Guidelines](docs/Contribution_Guidelines.md).

# 👍 Acknowledgement and Related Work
* [Allegro](https://github.com/rhymes-ai/Allegro): Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input based on our Open-Sora Plan. The significance of open-source is becoming increasingly tangible.
* [Latte](https://github.com/Vchitect/Latte): It is a wonderful 2+1D video generation model.
* [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha): Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
* [ShareGPT4Video](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4Video): Improving Video Understanding and Generation with Better Captions.
* [VideoGPT](https://github.com/wilson1yan/VideoGPT): Video Generation using VQ-VAE and Transformers.
* [DiT](https://github.com/facebookresearch/DiT): Scalable Diffusion Models with Transformers.
* [FiT](https://github.com/whlzy/FiT): Flexible Vision Transformer for Diffusion Model.
* [Positional Interpolation](https://arxiv.org/abs/2306.15595): Extending Context Window of Large Language Models via Positional Interpolation.


# 🔒 License
* See [LICENSE](LICENSE) for details.

## ✨ Star History

[![Star History](https://api.star-history.com/svg?repos=PKU-YuanGroup/Open-Sora-Plan)](https://star-history.com/#PKU-YuanGroup/Open-Sora-Plan&Date)


# ✏️ Citing


```bibtex
@article{lin2024open,
  title={Open-Sora Plan: Open-Source Large Video Generation Model},
  author={Lin, Bin and Ge, Yunyang and Cheng, Xinhua and Li, Zongjian and Zhu, Bin and Wang, Shaodong and He, Xianyi and Ye, Yang and Yuan, Shenghai and Chen, Liuhan and others},
  journal={arXiv preprint arXiv:2412.00131},
  year={2024}
}
```
```bibtex
@article{helios,
  title={Helios: Real Real-Time Long Video Generation Model},
  author={Yuan, Shenghai and Yin, Yuanyang and Li, Zongjian and Huang, Xinwei and Yang, Xiao and Yuan, Li},
  journal={arXiv preprint arXiv:2603.04379},
  year={2026}
}
```
```bibtex
@article{li2024wf,
  title={WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model},
  author={Li, Zongjian and Lin, Bin and Ye, Yang and Chen, Liuhan and Cheng, Xinhua and Yuan, Shenghai and Yuan, Li},
  journal={arXiv preprint arXiv:2411.17459},
  year={2024}
}
```

# 🤝 Community contributors

<a href="https://github.com/PKU-YuanGroup/Open-Sora-Plan/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=PKU-YuanGroup/Open-Sora-Plan" />
</a>



================================================
FILE: docs/Contribution_Guidelines.md
================================================
# Contributing to the Open-Sora Plan Community

The Open-Sora Plan open-source community is a collaborative initiative driven by the community, emphasizing a commitment to being free and void of exploitation. Organized spontaneously by community members, we invite you to contribute to the Open-Sora Plan open-source community and help elevate it to new heights!

## Submitting a Pull Request (PR)

As a contributor, before submitting your request, kindly follow these guidelines:

1. Start by checking the [Open-Sora Plan GitHub](https://github.com/PKU-YuanGroup/Open-Sora-Plan/pulls) to see if there are any open or closed pull requests related to your intended submission. Avoid duplicating existing work.

2. [Fork](https://github.com/PKU-YuanGroup/Open-Sora-Plan/fork) the [open-sora plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan) repository and download your forked repository to your local machine.

   ```bash
   git clone [your-forked-repository-url]
   ```

3. Add the original Open-Sora Plan repository as a remote to sync with the latest updates:

   ```bash
   git remote add upstream https://github.com/PKU-YuanGroup/Open-Sora-Plan
   ```

4. Sync the code from the main repository to your local machine, and then push it back to your forked remote repository.

   ```bash
   # Pull the latest code from the upstream branch
   git fetch upstream
   
   # Switch to the main branch
   git checkout main
   
   # Merge the updates from the upstream branch into main, synchronizing the local main branch with the upstream
   git merge upstream/main
   
   # Additionally, sync the local main branch to the remote branch of your forked repository
   git push origin main
   ```


   > Note: Sync the code from the main repository before each submission.

5. Create a branch in your forked repository for your changes, ensuring the branch name is meaningful.

   ```bash
   git checkout -b my-docs-branch main
   ```

6. While making modifications and committing changes, adhere to our [Commit Message Format](#commit-message-format).

   ```bash
   git commit -m "[docs]: xxxx"
   ```

7. Push your changes to your GitHub repository.

   ```bash
   git push origin my-docs-branch
   ```

8. Submit a pull request to `Open-Sora-Plan:main` on the GitHub repository page.

## Commit Message Format

Commit messages must include both `<type>` and `<summary>` sections.

```bash
[<type>]: <summary>
  │        │
  │        └─⫸ Briefly describe your changes, without ending with a period.
  │
  └─⫸ Commit Type: |docs|feat|fix|refactor|
```

### Type 

* **docs**: Modify or add documents.
* **feat**: Introduce a new feature.
* **fix**: Fix a bug.
* **refactor**: Restructure code, excluding new features or bug fixes.

### Summary

Describe modifications in English, without ending with a period.

> e.g., git commit -m "[docs]: add a contributing.md file"
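
The format above can also be checked mechanically, for example in a local commit hook. The following is a minimal sketch under that assumption; the function name `check_commit_message` is hypothetical and the repository does not ship such a check.

```python
import re

# The four commit types listed in the guideline above.
COMMIT_TYPES = ("docs", "feat", "fix", "refactor")

# "[<type>]: <summary>" with a non-empty summary on the first line.
COMMIT_RE = re.compile(r"^\[(%s)\]: (\S.*)$" % "|".join(COMMIT_TYPES))


def check_commit_message(message: str) -> bool:
    """Return True if the first line follows "[<type>]: <summary>" and does not end with a period."""
    first_line = message.splitlines()[0] if message else ""
    match = COMMIT_RE.match(first_line)
    if match is None:
        return False
    return not match.group(2).rstrip().endswith(".")


if __name__ == "__main__":
    print(check_commit_message("[docs]: add a contributing.md file"))
```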

This guideline is borrowed from [minisora](https://github.com/mini-sora/minisora). We sincerely thank the MiniSora authors for their awesome templates.


================================================
FILE: docs/Prompt_Refiner.md
================================================
## Data

We have open-sourced our dataset of 32,555 pairs, which includes Chinese data. The dataset is available [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/prompt_refiner). The details can be found [here](https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/docs/Report-v1.3.0.md#prompt-refiner).

In fact, it is a JSON file with the following structure.

```
[
  {
    "instruction": "Refine the sentence: \"A newly married couple sharing a piece of there wedding cake.\" to contain subject description, action, scene description. (Optional: camera language, light and shadow, atmosphere) and conceive some additional actions to make the sentence more dynamic. Make sure it is a fluent sentence, not nonsense.",
    "input": "",
    "output": "The newlywed couple, dressed in elegant attire..."
  },
  ...
]
```
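
Before training, it can help to confirm that every record has the three keys shown above. This is a minimal sketch that assumes only the structure in the example; the helper names `validate_refiner_pairs` and `load_refiner_pairs` are hypothetical and not part of the repository.

```python
import json

# Keys that appear in every record of the example above.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_refiner_pairs(records):
    """Raise ValueError if `records` is not a list of dicts carrying all required keys."""
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {index} is not a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record {index} is missing keys: {missing}")
    return records


def load_refiner_pairs(path):
    """Load the JSON file at `path` and validate its structure."""
    with open(path, encoding="utf-8") as handle:
        return validate_refiner_pairs(json.load(handle))
```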

## Train

`--data_path` is the path to the prepared JSON file.
`--model_path` is the directory containing the LLaMA 3.1 weights, including `config.json` and some weight files.
`--lora_out_path` is the path where the LoRA model will be saved.

```
cd opensora/models/prompt_refiner
CUDA_VISIBLE_DEVICES=0 python train.py \
    --data_path path/to/data.json \
    --model_path path/to/llama_model \
    --lora_out_path path/to/save/lora_model
```

## Merge

`--base_path` is the directory containing the LLaMA 3.1 weights, including `config.json` and some weight files.
`--lora_in_path` is the directory containing the pre-trained LoRA model.
`--lora_out_path` is the path for the merged model.

```
cd opensora/models/prompt_refiner
CUDA_VISIBLE_DEVICES=0 python merge.py \
    --base_path path/to/llama_model \
    --lora_in_path path/to/save/lora_model \
    --lora_out_path path/to/save/merge_model
```

## Inference

`--model_path` is the directory containing the weights (LLaMA 3.1 or merged Lora weight), including `config.json` and some weight files.
`--prompt` is the text you want to input, which will be refined.

```
cd opensora/models/prompt_refiner
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --mode_path path/to/model \
    --prompt "text to refine"
```

================================================
FILE: docs/Report-v1.0.0-cn.md
================================================
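# Technical Report v1.0.0

In March 2024, we launched Open-Sora-Plan, an open-source project that aims to reproduce OpenAI's [Sora](https://openai.com/sora). As a foundational open-source framework, it supports training video generation models, including unconditional video generation, class-conditional video generation, and text-to-video generation.

**Today, we are excited to present Open-Sora-Plan v1.0.0, which significantly improves video generation quality and text control.**

Compared with previous video generation models, Open-Sora-Plan v1.0.0 has the following improvements:

1. **Efficient training and inference with CausalVideoVAE**. We compress videos by 4×8×8 along the temporal and spatial dimensions.
2. **Joint image-video training for better visual quality**. CausalVideoVAE treats the first frame as an image, so it naturally supports encoding both images and videos. This lets the diffusion model capture more spatiotemporal detail and improve quality.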
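
To make the 4×8×8 compression concrete, the sketch below computes the latent shape for a given input. It assumes that the first frame is encoded on its own as an image and that the remaining frames are compressed 4× in time, which is how a causal video VAE is usually laid out; the function name `latent_shape` is hypothetical.

```python
def latent_shape(num_frames: int, height: int, width: int,
                 t_stride: int = 4, s_stride: int = 8):
    """Latent (T, H, W) for a causal VAE with 4x temporal and 8x8 spatial compression.

    Assumption: the first frame is kept as a standalone image latent, so only
    the remaining num_frames - 1 frames are compressed by t_stride.
    """
    if num_frames < 1 or (num_frames - 1) % t_stride != 0:
        raise ValueError("num_frames must be of the form t_stride * n + 1")
    if height % s_stride != 0 or width % s_stride != 0:
        raise ValueError("height and width must be multiples of s_stride")
    return ((num_frames - 1) // t_stride + 1, height // s_stride, width // s_stride)


if __name__ == "__main__":
    print(latent_shape(65, 512, 512))  # a 65-frame 512x512 video
    print(latent_shape(1, 512, 512))   # a single image
```

Under this assumption a single image is the special case `num_frames = 1`, which is why images and videos can share one encoder.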


### Open-Source Release
We open-source Open-Sora-Plan to promote further development of the video generation community. The code, data, and models are all publicly available.
- Online demo: Hugging Face [![hf_space](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/LanguageBind/Open-Sora-Plan-v1.0.0), [![Replicate demo and cloud API](https://replicate.com/camenduru/open-sora-plan-512x512/badge)](https://replicate.com/camenduru/open-sora-plan-512x512) and [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/camenduru/Open-Sora-Plan-jupyter/blob/main/Open_Sora_Plan_jupyter.ipynb). Thanks to [@camenduru](https://github.com/camenduru) for strongly supporting our work! 🤝
- Code: all training scripts and sampling code.
- Models: both the diffusion model and CausalVideoVAE are available [here](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0).
- Data: all raw videos and their captions are available [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.0.0).
  
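## Results

Open-Sora-Plan v1.0.0 supports joint image-video training. Here we show reconstruction and generation results for both videos and images:

720×1280 **video reconstruction**. Because of GitHub's limits, the original videos are hosted at: [1](https://streamable.com/gqojal), [2](https://streamable.com/6nu3j8).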

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/c100bb02-2420-48a3-9d7b-4608a41f14aa

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/8aa8f587-d9f1-4e8b-8a82-d3bf9ba91d68

**Image reconstruction** at 1536×1024

<img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/1684c3ec-245d-4a60-865c-b8946d788eb9" width="45%"/> <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/46ef714e-3e5b-492c-aec4-3793cb2260b5" width="45%"/>

**Text-to-video generation** at 65×1024×1024

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/2641a8aa-66ac-4cda-8279-86b2e6a6e011

**Text-to-video generation** at 65×512×512

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/37e3107e-56b3-4b09-8920-fa1d8d144b9e


**Text-to-image generation** at 512×512

![download](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/491d72bc-e762-48ff-bdcc-cc69350f56d6)

## Detailed Technical Report

### CausalVideoVAE

#### Model Structure

![image](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/e3c8b35d-a217-4d96-b2e9-5c248a2859c8)

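The CausalVideoVAE architecture inherits from the [Stable-Diffusion Image VAE](https://github.com/CompVis/stable-diffusion/tree/main). To ensure that the pretrained weights of the image VAE can be applied seamlessly to the video VAE, the model structure is designed as follows:

1. **CausalConv3D**: Converting Conv2D to CausalConv3D enables joint training on images and videos. CausalConv3D treats the first frame specially because it cannot access subsequent frames. For more details, see https://github.com/PKU-YuanGroup/Open-Sora-Plan/pull/145

2. **Initialization**: There are two common [methods](https://github.com/hassony2/inflated_convnets_pytorch/blob/master/src/inflate.py#L5) for expanding Conv2D to Conv3D: average initialization and center initialization. We instead use a specific method, tail initialization. It ensures that the model can reconstruct images, and even videos, directly without any training.

#### Training Details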

<img width="833" alt="image" src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/9ffb6dc4-23f6-4274-a066-bbebc7522a14">

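We show the loss curves of two initialization methods at 17×256×256. The yellow curve is the loss with tail initialization, and the blue curve is the loss with center initialization. As the figure shows, tail initialization performs better on the loss curve. We also found that center initialization leads to error accumulation, which causes collapse over long durations.

#### Inference Tricks
Although the VAE is always frozen while training the diffusion model, we still cannot afford the cost of CausalVideoVAE. In our experiments, 80 GB of GPU memory can only infer one 256×512×512 or 32×1024×1024 video at half precision, which limits scaling to longer and higher-resolution videos. We therefore adopt tile convolution, which can infer videos of arbitrary duration or resolution with nearly constant memory.

### Data Construction
We define a high-quality video dataset by two core principles: (1) no content-unrelated watermarks; (2) high-quality text annotations.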

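**For principle 1**, we crawled about 40k videos from open-source websites (CC0 license): 1,234 from [mixkit](https://mixkit.co/), 7,408 from [pexels](https://www.pexels.com/), and 31,616 from [pixabay](https://pixabay.com/). Using the scene-change splitting script provided by [Panda70M](https://github.com/snap-research/Panda-70M/blob/main/splitting/README.md), we cut these videos into about 434k video clips. In fact, according to our splitting results, 99% of the videos crawled from these websites contain a single scene. We also found that more than 60% of the crawled data are landscape videos. More details can be found [here](https://github.com/PKU-YuanGroup/Open-Sora-Dataset).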

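**For principle 2**, it is hard to crawl a large quantity of high-quality text annotations directly from the internet. We therefore use mature image captioning models to obtain high-quality dense captions. We ran ablation experiments on two multimodal large models: [ShareGPT4V-Captioner-7B](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/README.md) and [LLaVA-1.6-34B](https://github.com/haotian-liu/LLaVA). The former is built specifically for caption generation, while the latter is a general-purpose multimodal large model. In our ablations, their captioning quality was comparable. However, their inference speed on the A800 differs considerably: 40 s/it at batch size 12 for ShareGPT4V-Captioner-7B versus 15 s/it at batch size 1 for LLaVA-1.6-34B. We open-source all [captions and raw videos](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.0.0).

| Model | Avg length | Max | Std |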
|---|---|---|---|
| ShareGPT4V-Captioner-7B | 170.0827524529121 |  467 | 53.689967539537776 | 
| LLaVA-1.6-34B | 141.75851073472666 |  472 | 48.52492072346965 | 

### Training the Diffusion Model
Similar to previous work, we use a multi-stage cascaded training approach, which consumed 2,048 A800 GPU hours in total. We found that joint training with images significantly accelerates convergence and improves visual quality, consistent with [Latte](https://github.com/Vchitect/Latte). Our training cost is listed below.

| Name | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| Training video size | 17×256×256 |  65×256×256 | 65×512×512 |  65×1024×1024 | 
| Compute (#A800 GPU x #hours) | 32 × 40 |  32 × 18 |  32 × 6 |  In training | 
| Checkpoint | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/17x256x256) | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/65x256x256) |  [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/65x512x512) |  In training | 
| Log | [wandb](https://api.wandb.ai/links/linbin/p6n3evym) |  [wandb](https://api.wandb.ai/links/linbin/t2g53sew) |  [wandb](https://api.wandb.ai/links/linbin/uomr0xzb) | In training | 
| Training data | ~40k videos |  ~40k videos |  ~40k videos |  ~40k videos | 

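## Next Release Preview
### CausalVideoVAE
The CausalVideoVAE v1.0.0 that we have released has two main drawbacks: **motion blur** and **gridding effect**. We have made a series of improvements to CausalVideoVAE that lower its inference cost and strengthen its performance. For now we call it the preview version, and it will be released in the next version.

**1-minute 720×1280 video reconstruction**. Because of GitHub's limits, the videos are hosted here: [original video](https://streamable.com/u4onbb), [reconstructed video](https://streamable.com/qt8ncc).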

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/cdcfa9a3-4de0-42d4-94c0-0669710e407b

We randomly selected 100 samples from the Kinetics-400 validation set for evaluation. The results are shown in the table below:

|  | SSIM↑ | LPIPS↓ | PSNR↑ | FLOLPIPS↓ |
|---|---|---|---|---|
| v1.0.0 | 0.829 |  0.106 |  27.171 |  0.119 | 
| Preview | 0.877 |  0.064 |  29.695 |  0.070 | 

#### Motion Blur

| **v1.0.0** | **Preview** |
| --- | --- |
| ![6862cae0-b1b6-48d1-bd11-84348cf42b42](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/f815636f-fb38-4891-918b-50b1f9aa086d)  | ![9189da06-ef2c-42e6-ad34-bd702a6f538e](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/1e413f50-a785-485a-9851-a1449f952f1c)  |

#### Gridding Effect

| **v1.0.0** | **Preview** |
| --- | --- |
| ![img](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/7fec5bed-3c83-4ee9-baef-4a3dacafc658)  | ![img](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/4f41b432-a3ef-484e-a492-8afd8a691bf7)  |

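### Data Construction

**Data source**: As mentioned above, more than 60% of our dataset consists of landscape videos, which means our open-domain video generation ability is limited. However, most current large-scale open-source datasets are crawled from YouTube; although they contain many videos, we are concerned about whether the quality of the videos themselves is adequate. We will therefore keep collecting high-quality datasets, and we welcome recommendations from the open-source community.

**Caption generation pipeline**: As training video duration increases, we have to consider more effective video captioning methods rather than multimodal image models. We are developing a new video caption generation pipeline that supports long videos well. Stay tuned.

### Training the Diffusion Model
Although v1.0.0 shows promising results, we are still some distance from Sora. Our upcoming work focuses on three aspects:

1. **Training with dynamic resolution and duration**: We aim to develop techniques for training models at varying resolutions and durations, making training more flexible and adaptable.

2. **Longer video generation**: We will explore ways to extend the model's generation capability so that it can produce longer videos beyond the current limits.

3. **More conditional control**: We aim to strengthen the model's conditional control, giving users more options and more control over the generated videos.

In addition, by carefully inspecting the generated videos, we found some implausible speckles and abnormal flow, which are caused by the limited performance of CausalVideoVAE, as mentioned above. In future experiments, we will retrain a diffusion model with a stronger VAE.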


================================================
FILE: docs/Report-v1.0.0.md
================================================
# Report v1.0.0

In March 2024, we launched a plan called Open-Sora-Plan, which aims to reproduce the OpenAI [Sora](https://openai.com/sora) through an open-source framework. As a foundational open-source framework, it enables training of video generation models, including Unconditioned Video Generation, Class Video Generation, and Text-to-Video Generation.

**Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities.**

Compared with previous video generation models, Open-Sora-Plan v1.0.0 has several improvements:

1. **Efficient training and inference with CausalVideoVAE**. We apply a spatial-temporal compression to the videos by 4×8×8.
2. **Joint image-video training for better quality**. Our CausalVideoVAE considers the first frame as an image, allowing for the simultaneous encoding of both images and videos in a natural manner. This allows the diffusion model to grasp more spatial-visual details to improve visual quality.
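
To make the 4×8×8 compression concrete, the following minimal sketch computes the latent grid size for a given input. It assumes that the first frame is encoded on its own (as an image) and that the remaining frames are compressed in groups of four, so valid frame counts have the form 4k + 1; the function name `latent_shape` is illustrative and not part of the repository.

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8):
    """Latent grid size for a causal video VAE with 4x8x8 compression.

    Assumption: the first frame is encoded alone (as an image) and the
    remaining frames are compressed in groups of `t_stride`.
    """
    if (frames - 1) % t_stride != 0:
        raise ValueError("frames must be of the form t_stride * k + 1")
    if height % s_stride or width % s_stride:
        raise ValueError("height and width must be multiples of s_stride")
    return ((frames - 1) // t_stride + 1, height // s_stride, width // s_stride)


# A single image, and two of the training sizes used in this report.
for shape in [(1, 512, 512), (17, 256, 256), (65, 512, 512)]:
    print(shape, "->", latent_shape(*shape))
```

Under this assumption a single image maps to one latent frame, which is why images and videos can share the same encoder.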

### Open-Source Release
We open-source Open-Sora-Plan to facilitate future development of video generation in the community. The code, data, and models are publicly available.
- Demo: Hugging Face demo [![hf_space](https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg)](https://huggingface.co/spaces/LanguageBind/Open-Sora-Plan-v1.0.0). 🤝 Enjoy the [![Replicate demo and cloud API](https://replicate.com/camenduru/open-sora-plan-512x512/badge)](https://replicate.com/camenduru/open-sora-plan-512x512) and [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/camenduru/Open-Sora-Plan-jupyter/blob/main/Open_Sora_Plan_jupyter.ipynb), created by [@camenduru](https://github.com/camenduru), who generously supports our research!
- Code: All training scripts and sample scripts.
- Model: Both Diffusion Model and CausalVideoVAE [here](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0).
- Data: Both raw videos and captions [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.0.0).

## Gallery

Open-Sora-Plan v1.0.0 supports joint training of images and videos. Here, we present the capabilities of Video/Image Reconstruction and Generation:

### CausalVideoVAE Reconstruction

**Video Reconstruction** with 720×1280. Since GitHub cannot host large videos, the originals are available here: [1](https://streamable.com/gqojal), [2](https://streamable.com/6nu3j8).

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/c100bb02-2420-48a3-9d7b-4608a41f14aa

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/8aa8f587-d9f1-4e8b-8a82-d3bf9ba91d68

**Image Reconstruction** in 1536×1024.

<img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/1684c3ec-245d-4a60-865c-b8946d788eb9" width="45%"/> <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/46ef714e-3e5b-492c-aec4-3793cb2260b5" width="45%"/>

**Text-to-Video Generation** with 65×1024×1024

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/2641a8aa-66ac-4cda-8279-86b2e6a6e011

**Text-to-Video Generation** with 65×512×512

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/37e3107e-56b3-4b09-8920-fa1d8d144b9e


**Text-to-Image Generation** with 512×512

![download](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/491d72bc-e762-48ff-bdcc-cc69350f56d6)

## Detailed Technical Report

### CausalVideoVAE

#### Model Structure

![image](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/e3c8b35d-a217-4d96-b2e9-5c248a2859c8)

The CausalVideoVAE architecture inherits from the [Stable-Diffusion Image VAE](https://github.com/CompVis/stable-diffusion/tree/main). To ensure that the pretrained weights of the Image VAE can be seamlessly applied to the Video VAE, the model structure has been designed as follows:

1. **CausalConv3D**: Converting Conv2D to CausalConv3D enables joint training of image and video data. CausalConv3D applies a special treatment to the first frame, as it does not have access to subsequent frames. For more specific details, please refer to https://github.com/PKU-YuanGroup/Open-Sora-Plan/pull/145

2. **Initialization**: There are two common [methods](https://github.com/hassony2/inflated_convnets_pytorch/blob/master/src/inflate.py#L5) to expand Conv2D to Conv3D: average initialization and center initialization. But we employ a specific initialization method (tail initialization). This initialization method ensures that without any training, the model is capable of directly reconstructing images, and even videos.
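
The effect of tail initialization can be illustrated with a toy one-dimensional example over the time axis. This is only a sketch: it assumes the causal convolution left-pads the sequence by repeating the first frame, and it replaces the Conv2D weight with a single scalar `w`. The helper names `causal_temporal_conv` and `inflate` are hypothetical.

```python
def causal_temporal_conv(frames, kernel):
    """Causal 1D convolution over time on a list of per-frame values.

    Assumption: the sequence is left-padded by repeating the first frame
    len(kernel) - 1 times, so output[t] depends only on frames <= t.
    """
    k = len(kernel)
    padded = [frames[0]] * (k - 1) + list(frames)
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(frames))]


def inflate(weight_2d, k, mode):
    """Inflate a scalar stand-in for a Conv2D weight into a length-k temporal kernel."""
    if mode == "tail":      # all weight on the current (last) tap
        return [0.0] * (k - 1) + [weight_2d]
    if mode == "center":    # all weight on the middle tap
        kernel = [0.0] * k
        kernel[k // 2] = weight_2d
        return kernel
    if mode == "average":   # weight spread evenly over time
        return [weight_2d / k] * k
    raise ValueError(mode)


frames = [1.0, 2.0, 4.0, 8.0]   # toy per-frame features
w = 0.5                         # toy pretrained 2D weight
per_frame_2d = [w * x for x in frames]

tail_out = causal_temporal_conv(frames, inflate(w, 3, "tail"))
center_out = causal_temporal_conv(frames, inflate(w, 3, "center"))
print("2D per frame:", per_frame_2d)
print("tail init   :", tail_out)
print("center init :", center_out)
```

In this toy setting the tail-initialized kernel reproduces the per-frame 2D result exactly, while the center-initialized kernel reads the previous frame instead of the current one.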

#### Training Details

<img width="833" alt="image" src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/9ffb6dc4-23f6-4274-a066-bbebc7522a14">

We present the loss curves for two distinct initialization methods under 17×256×256. The yellow curve represents the loss using tail initialization, while the blue curve corresponds to the loss from center initialization. As shown in the graph, tail initialization demonstrates better performance on the loss curve. Additionally, we found that center initialization leads to error accumulation, causing the model to collapse over extended durations.

#### Inference Tricks
Despite the VAE in Diffusion training being frozen, we still find it challenging to afford the cost of the CausalVideoVAE. In our case, with 80GB of GPU memory, we can only infer a video of either 256×512×512 or 32×1024×1024 resolution using half-precision, which limits our ability to scale up to longer and higher-resolution videos. Therefore, we adopt tile convolution, which allows us to infer videos of arbitrary duration or resolution with nearly constant memory usage.
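
The idea behind tiling can be sketched as a planner that splits a frame into overlapping tiles of bounded size, so that peak memory depends on the tile size rather than on the full resolution. The tile size of 256 and overlap of 32 below are illustrative assumptions rather than values taken from the repository, and the blending of overlapping regions is omitted.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of overlapping tiles that cover [0, length)."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)   # last tile is flush with the border
    return starts


def plan_tiles(height, width, tile=256, overlap=32):
    """(top, left, bottom, right) boxes; each box is at most tile x tile."""
    return [(t, l, min(t + tile, height), min(l + tile, width))
            for t in tile_starts(height, tile, overlap)
            for l in tile_starts(width, tile, overlap)]


boxes = plan_tiles(1024, 1024)
print(len(boxes), "tiles, first:", boxes[0], "last:", boxes[-1])
```

Each tile would then be encoded or decoded independently, and the same planning can be applied along the time axis.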

### Data Construction
We define a high-quality video dataset based on two core principles: (1) No content-unrelated watermarks. (2) High-quality and dense captions.

**For principle 1**, we crawled approximately 40,000 videos from open-source websites under the CC0 license. Specifically, we obtained 1,234 videos from [mixkit](https://mixkit.co/), 7,408 videos from [pexels](https://www.pexels.com/), and 31,616 videos from [pixabay](https://pixabay.com/). These videos adhere to the principle of having no content-unrelated watermarks. Using the scene-change splitting script provided by [Panda70M](https://github.com/snap-research/Panda-70M/blob/main/splitting/README.md), we divided these videos into approximately 434,000 video clips. In fact, based on our clipping results, 99% of the videos obtained from these online sources contain a single scene. Additionally, we observed that over 60% of the crawled data consists of landscape videos. More details can be found [here](https://github.com/PKU-YuanGroup/Open-Sora-Dataset).

**For principle 2**, it is challenging to directly crawl a large quantity of high-quality dense captions from the internet. Therefore, we use a mature image-captioning model to obtain high-quality dense captions. We conducted ablation experiments on two multimodal large models: [ShareGPT4V-Captioner-7B](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/README.md) and [LLaVA-1.6-34B](https://github.com/haotian-liu/LLaVA). The former is specifically designed for caption generation, while the latter is a general-purpose multimodal large model. In our ablation experiments, the two were comparable in captioning performance. However, their inference speed on the A800 GPU differs significantly: 40 s/it at batch size 12 for ShareGPT4V-Captioner-7B versus 15 s/it at batch size 1 for LLaVA-1.6-34B. We open-source all annotations [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.0.0). Some caption-length statistics are shown below; we set the model's maximum length to 300, which covers almost 99% of the samples.

| Name | Avg length | Max | Std |
|---|---|---|---|
| ShareGPT4V-Captioner-7B | 170.0827524529121 |  467 | 53.689967539537776 | 
| LLaVA-1.6-34B | 141.75851073472666 |  472 | 48.52492072346965 | 
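
Statistics of this kind, together with the share of samples that fit under a length limit, can be computed as in the sketch below. The lengths in `toy_lengths` are made-up values for illustration, not the released captions, and the use of the population standard deviation is an assumption since the report does not say which variant was used.

```python
import statistics


def length_stats(lengths, max_len=300):
    """Summary statistics of caption lengths plus coverage at `max_len`."""
    covered = sum(1 for n in lengths if n <= max_len)
    return {
        "avg": statistics.fmean(lengths),
        "max": max(lengths),
        "std": statistics.pstdev(lengths),   # population std (assumption)
        "coverage": covered / len(lengths),  # share of captions not truncated
    }


toy_lengths = [120, 150, 170, 200, 310]   # illustrative values only
stats = length_stats(toy_lengths)
print(stats)
```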

### Training Diffusion Model
Similar to previous work, we employ a multi-stage cascaded training approach, which consumes a total of 2,528 A800 GPU hours. We found that joint training with images significantly accelerates model convergence and enhances visual perception, aligning with the findings of [Latte](https://github.com/Vchitect/Latte). Below is our training card:

| Name | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| Training Video Size | 17×256×256 |  65×256×256 | 65×512×512 |  65×1024×1024 | 
| Compute (#A800 GPU x #Hours) | 32 × 40 |  32 × 22 |  32 × 17 |  Under training | 
| Checkpoint | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/17x256x256) | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/65x256x256) |  [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.0.0/tree/main/65x512x512) |  Under training | 
| Log | [wandb](https://api.wandb.ai/links/linbin/p6n3evym) |  [wandb](https://api.wandb.ai/links/linbin/t2g53sew) |  [wandb](https://api.wandb.ai/links/linbin/uomr0xzb) | Under training | 
| Training Data | ~40k videos |  ~40k videos |  ~40k videos |  ~40k videos | 
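
As a quick check, the total of 2,528 A800 GPU hours follows from the compute row of the table for the three completed stages:

```python
# (number of A800 GPUs, hours) for each completed stage, from the table above.
stages = {"Stage 1": (32, 40), "Stage 2": (32, 22), "Stage 3": (32, 17)}

gpu_hours = {name: gpus * hours for name, (gpus, hours) in stages.items()}
total = sum(gpu_hours.values())
print(gpu_hours, "total:", total)
```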

## Next Release Preview
### CausalVideoVAE
Currently, the released version of CausalVideoVAE (v1.0.0) has two main drawbacks: **motion blurring** and **gridding effect**. We have made a series of improvements to CausalVideoVAE to reduce its inference cost and enhance its performance. We are currently referring to this enhanced version as the "preview version," which will be released in the next update. Preview reconstruction is as follows:

**1 min Video Reconstruction with 720×1280**. Since GitHub cannot host large videos, we put them here: [origin video](https://streamable.com/u4onbb), [reconstruction video](https://streamable.com/qt8ncc). 

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/cdcfa9a3-4de0-42d4-94c0-0669710e407b

We randomly selected 100 samples from the validation set of Kinetics-400 for evaluation, and the results are presented in the following table:

|  | SSIM↑ | LPIPS↓ | PSNR↑ | FLOLPIPS↓ |
|---|---|---|---|---|
| v1.0.0 | 0.829 |  0.106 |  27.171 |  0.119 | 
| Preview | 0.877 |  0.064 |  29.695 |  0.070 | 
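
For readers unfamiliar with the metrics, PSNR is derived from the mean squared error between the original and the reconstruction. The sketch below uses the standard definition on flat lists of 8-bit pixel values; it is not the repository's evaluation script (`cal_psnr.py`), whose exact preprocessing is not described here.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf   # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


ref = [0, 64, 128, 255]
rec = [2, 62, 130, 253]   # every pixel is off by 2, so the MSE is 4
print(f"PSNR: {psnr(ref, rec):.2f} dB")
```

Because the scale is logarithmic, a gain of about 2.5 dB, as between v1.0.0 and the preview, corresponds to a substantially lower mean squared error.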

#### Motion Blurring

| **v1.0.0** | **Preview** |
| --- | --- |
| ![6862cae0-b1b6-48d1-bd11-84348cf42b42](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/f815636f-fb38-4891-918b-50b1f9aa086d)  | ![9189da06-ef2c-42e6-ad34-bd702a6f538e](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/1e413f50-a785-485a-9851-a1449f952f1c)  |

#### Gridding Effect

| **v1.0.0** | **Preview** |
| --- | --- |
| ![img](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/7fec5bed-3c83-4ee9-baef-4a3dacafc658)  | ![img](https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/88202804/4f41b432-a3ef-484e-a492-8afd8a691bf7)  |

### Data Construction

**Data source**. As mentioned earlier, over 60% of our dataset consists of landscape videos. This implies that our ability to generate videos in other domains is limited. However, most of the current large-scale open-source datasets are primarily obtained through web scraping from platforms like YouTube. While these datasets provide a vast quantity of videos, we have concerns about the quality of the videos themselves. Therefore, we will continue to collect high-quality datasets and also welcome recommendations from the open-source community. We are launching an Open-Sora-Dataset project; see the details at [Open-Sora-Dataset](https://github.com/PKU-YuanGroup/Open-Sora-Dataset).

**Caption Generation Pipeline**. As the video duration increases, we need to consider more efficient methods for video caption generation instead of relying solely on large multimodal image models. We are currently developing a new video caption generation pipeline that provides robust support for long videos. We are excited to share more details with you in the near future. Stay tuned!

### Training Diffusion Model
Although v1.0.0 has shown promising results, we acknowledge that we still have a ways to go to reach the level of Sora. In our upcoming work, we will primarily focus on three aspects:

1. **Training support for dynamic resolution and duration**: We aim to develop techniques that enable training models with varying resolutions and durations, allowing for more flexible and adaptable training processes.

2. **Support for longer video generation**: We will explore methods to extend the generation capabilities of our models, enabling them to produce longer videos beyond the current limitations.

3. **Enhanced conditional control**: We seek to enhance the conditional control capabilities of our models, providing users with more options and control over the generated videos.

Furthermore, through careful observation of the generated videos, we have noticed some physically implausible speckles and abnormal flow. This can be attributed to the limited performance of CausalVideoVAE, as mentioned earlier. In future experiments, we plan to retrain a diffusion model using a more powerful version of CausalVideoVAE to address these issues.


================================================
FILE: docs/Report-v1.1.0.md
================================================
# Report v1.1.0

In April 2024, we launched Open-Sora-Plan v1.0.0, featuring a simple and efficient design along with remarkable performance in text-to-video generation. Its data and model have already been adopted as a foundation in numerous research projects.

**Today, we are excited to present Open-Sora-Plan v1.1.0, which significantly improves video generation quality and duration.**

Compared to the previous version, Open-Sora-Plan v1.1.0 brings the following improvements:

1. **Better compressed visual representations**. We optimized the CausalVideoVAE architecture, which now has stronger performance and higher inference efficiency.
2. **Generate higher quality, longer videos**. We used higher quality visual data and captions by [ShareGPT4Video](https://sharegpt4video.github.io/), enabling the model to better understand the workings of the world.

Along with performance improvements, Open-Sora-Plan v1.1.0 maintains the minimalist design and data efficiency of v1.0.0. Remarkably, we found that v1.1.0 exhibits similar performance to the Sora base model, indicating that our version's evolution aligns with the scaling law demonstrated by Sora.

### Open-Source Release
We open-source Open-Sora-Plan to facilitate future development of video generation in the community. The code, data, and models will be made publicly available.
- Demo: Hugging Face demo [here](https://huggingface.co/spaces/LanguageBind/Open-Sora-Plan-v1.1.0).
- Code: All training scripts and sample scripts.
- Model: Both Diffusion Model and CausalVideoVAE [here](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.1.0).
- Data: Both raw videos and captions [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0).

## Gallery


### 221×512×512 Text-to-Video Generation

| 221×512×512 (9.2s) | 221×512×512 (9.2s) | 221×512×512 (9.2s) | 221×512×512 (9.2s) |
| --- | --- | --- | --- |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/6d18f344-f7da-44eb-9e07-77813f6b5e90" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/71f75e72-e9ee-4ce7-b8ea-d2d45a6f367e" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/80430eae-a3b4-4f24-b448-0db2919327d6" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/dc217894-a8c8-4174-a42c-acd2811f61f5" width=224> |
| This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage ... |  a cat wearing sunglasses and working as a lifeguard at pool. |  Photorealistic closeup video of two pirate ships battling each other as they sail ... | A movie trailer featuring the adventures of the 30 year old spaceman wearing a red wool ...  |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/f9c5823f-aa03-40ee-8335-684684f5c842" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/f207d798-8988-45b0-b836-347e499ee000" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/d40c38dc-9f26-4591-8163-c7089e1553e3" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/67b40ce1-135d-4ae2-a1ee-5a0866461eb2" width=224> |
| A snowy forest landscape with a dirt road running through it. The road is flanked by ... |  Drone shot along the Hawaii jungle coastline, sunny day. Kayaks in the water. | Alpacas wearing knit wool sweaters, graffiti background, sunglasses.  | The camera rotates around a large stack of vintage televisions all showing different ...  |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/42542b4e-b1b8-49b8-ada4-7bcfde8e1453" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/4eee2619-a5ca-4a32-b350-e65d6220c8f7" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/e6152c93-4edf-4569-8d17-1effd87a7780" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/ec78d855-07c7-4421-896a-3c46a83ec129" width=224> |
| A drone camera circles around a beautiful historic church built on a rocky outcropping ... | Aerial view of Santorini during the blue hour, showcasing the stunning architecture ...  |  A robot dog explores the surface of Mars, kicking up red dust as it investigates  ... | An aerial shot of a lighthouse standing tall on a rocky cliff, its beacon cutting ...  |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/98897215-eae9-49f3-8fdb-df1d4b74d435" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/10e0f93d-925f-4b38-8205-7d89f49195f1" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/5453cf37-29ac-423d-9fb2-05f23416ca3e" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/77928824-6705-4d83-a7b4-43fdb74790bf" width=224> |
| 3D animation of a small, round, fluffy creature with big, expressive eyes explores ... |  A corgi vlogging itself in tropical Maui. |  A single drop of liquid metal falls from a floating orb, landing on a mirror-like ... | The video presents an abstract composition centered around a hexagonal shape adorned ...  |

### 65×512×512 Text-to-Video Generation

| 65×512×512 (2.7s) | 65×512×512 (2.7s) | 65×512×512 (2.7s) | 65×512×512 (2.7s) |
| --- | --- | --- | --- |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/a0601f20-579c-4e2e-832c-5763546718cc" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/55229eca-de3a-476b-930b-13a35eb5db30" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/74b9e7a1-0fa4-4f0d-8faf-0c84d97b11b5" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/66b06822-2652-453e-80fb-dd2988b730ce" width=224> |
| Extreme close-up of chicken and green pepper kebabs grilling on a barbeque with flames. | 3D animation of a small, round, fluffy creature with big, expressive eyes explores a ...   |  A corgi vlogging itself in tropical Maui. |  In a studio, there is a painting depicting a ship sailing through the rough sea. |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/4d27ff13-e725-4602-bf17-90df2c0d8005" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/049fd4db-f2fe-4633-ab62-1dda8268e090" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/aba01259-60f2-49ef-aa33-e738dc8c9a49" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/98178397-61b0-4200-9ab9-9bef2728ed98" width=224> |
| A robot dog trots down a deserted alley at night, its metallic paws clinking softly ... | A solitary spider weaves its web in a quiet corner. The web shimmers and glows with ...  |  A lone surfer rides a massive wave, skillfully maneuvering through the surf. The water ... |  A solitary cheetah sprints across the savannah, its powerful muscles propelling it ... |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/3964dfcd-d1b4-406b-916c-d4a702184a27" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/2a5fd1c0-9304-46e4-af35-5ffcc718bf08" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/73e1304a-6241-4e19-9dde-f2b1032edefd" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/c82b822b-cadd-4f11-bcc3-29e3651c02c0" width=224> |
| A solitary astronaut plants a flag on an alien planet covered in crystal formations ... |  At dawn's first light, a spaceship slowly exits the edge of the galaxy against a ...|  A dapper puppy in a miniature suit, basking in the afternoon sun, adjusting his tie ... |  A wise old elephant painting abstract art with its trunk, each stroke a burst of color ... |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/8edfa249-272c-4773-8728-12686527771e" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/33338d8e-5ea7-4b57-9e94-97f3ee404033" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/227fa6b6-801b-438b-9b30-cd3a4e0a7f2f" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/b5d5ebee-ab13-4e32-8b79-e6cb8f08d4a0" width=224> |
| In an ornate, historical hall, a massive tidal wave peaks and begins to crash. Two ... | A Shiba Inu dog wearing a beret and black turtleneck.  | A painting of a boat on water comes to life, with waves crashing and the boat becoming ...  | Many spotted jellyfish pulsating under water. Their bodies are transparent and glowing ...  |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/fe8e9450-5a80-435d-b050-6b2fe11cdf53" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/7d9335df-817d-479d-9ea7-615cb50c66b8" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/34c1dda6-e420-4edd-bbcd-38ff8e542ec6" width=224> |  |
| An animated hedgehog with distinctive spiky hair and large eyes is seen exploring a ... | An animated rabbit in a playful pink snowboarding outfit is carving its way down a ...  | A person clad in a space suit with a helmet and equipped with a chest light and arm ...  |   |

### 65×512×512 Video Editing

| generated 65×512×512 (2.7s) | edited 65×512×512 (2.7s) |
| --- | --- |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/edb8e8c2-5eef-4c90-85fb-6adb035067c3" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/32d93845-8904-4f8f-832e-37eba2ceb542" width=224>  |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/a70d0d91-0d61-4aa4-9520-6e4a6c477f12" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/7913e06f-1e0b-4d06-8233-72c882c6abfe" width=224>  |
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/0e614ec0-fba0-4f42-a343-d4607966dd40" width=224> | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/c531d930-410c-4614-890d-bae8013f33c2" width=224>  |

### 512×512 Text-to-Image Generation

 <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/e44b7f8a-5da2-49c2-87c4-52ea680ad43b" width=512> 

## Detailed Technical Report

### CausalVideoVAE

#### Model Structure

As the number of frames increases, the encoder overhead of CausalVideoVAE gradually rises. When training with 257 frames, 80GB of VRAM is insufficient for the VAE to encode the video. Therefore, we reduced the number of CausalConv3D layers, retaining only the last two stages of CausalConv3D in the encoder. This change significantly lowers the overhead while maintaining nearly the same performance. Note that we only modified the encoder; the decoder still retains all CausalConv3D layers, as training the Diffusion Model does not require the decoder.

<img width="722" alt="vaemodel" src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/b6f8a638-cfb2-40f3-94be-45af8dbad18e">

We compare the computational overhead of the two versions by testing the forward inference of the encoder on the H100.

| Version | 129×256×256 |   | 257×256×256 | | 513×256×256 | |
|---|---|---|---|---|---|---|
|  |  Peak Mem. |  Speed  | Peak Mem. |  Speed  |Peak Mem. |  Speed  |
| v1.0.0 |  22G |   2.9 it/s  | OOM |  -   | OOM |   -  |
| v1.1.0 |  18G |  4.9 it/s   | 34G  |  2.5 it/s   | 61G |   1.2 it/s   |


#### Temporal Module

<img width="480" alt="vaemodel" src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/7c8a4263-b7d1-4edc-a60c-801ae9b4344f">

In v1.0.0, our temporal module had only TemporalAvgPool. TemporalAvgPool leads to the loss of high-frequency information in the video, such as details and edges. To address this issue, we improved this module in v1.1.0. As shown in the figure above, we introduced convolution and added learnable weights, allowing different branches to decouple different features. When we omit CausalConv3D, the reconstructed video is very blurry; conversely, when we omit TemporalAvgPool, it becomes very sharp.

|  | SSIM↑ | LPIPS↓ | PSNR↑ |
|---|---|---|---|
| Base | 0.850 |  0.091 |  28.047 |
| + Frames | 0.868 |  0.070 |  28.829 | 
| + Reset mixed factor | 0.873 |  0.070 |  29.140 | 





#### Training Details

Similar to v1.0.0, we initialized from the Latent Diffusion's VAE and used tail initialization. For CausalVideoVAE, we trained for 100k steps in the first stage with a video shape of 9×256×256. Subsequently, we increased the frame count from 9 to 25 and found that this significantly improved the model's performance. It is important to clarify that we enabled the mixed factor during both the first and second stages, with the mixing weight a = sigmoid(mixed factor) reaching 0.88 at the end of training, indicating the model's tendency to retain low-frequency information. In the third stage, we reinitialized the mixed factor to 0.5 (sigmoid(0.5) = 0.6225), which further enhanced the model's capabilities.
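
The role of the mixed factor can be sketched with a toy temporal downsampler that blends an average-pooling branch with a convolution branch. The blend `a * avg_pool(x) + (1 - a) * conv(x)` with `a = sigmoid(mixed_factor)` is an assumption, consistent with the statement that a large weight favours low-frequency information; the window of two frames and the difference kernel are illustrative choices, not the module's real configuration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of sigmoid: the raw mixed factor that yields weight p."""
    return math.log(p / (1.0 - p))


def mixed_temporal_downsample(frames, conv_kernel, mixed_factor):
    """Blend a stride-2 average pool with a stride-2 convolution over time.

    Assumption: out = a * avg_pool(x) + (1 - a) * conv(x), a = sigmoid(mixed_factor).
    """
    a = sigmoid(mixed_factor)
    out = []
    for t in range(0, len(frames) - 1, 2):
        window = frames[t:t + 2]
        pooled = sum(window) / 2.0                                   # low-frequency branch
        convolved = sum(w * x for w, x in zip(conv_kernel, window))  # learnable branch
        out.append(a * pooled + (1.0 - a) * convolved)
    return out


frames = [1.0, 3.0, 5.0, 7.0]
diff_kernel = [-1.0, 1.0]   # toy high-frequency (difference) filter
print(mixed_temporal_downsample(frames, diff_kernel, mixed_factor=0.0))
print(f"sigmoid(0.5) = {sigmoid(0.5):.4f}, raw factor for a = 0.88: {logit(0.88):.2f}")
```

Resetting the mixed factor to 0.5 moves the weight from 0.88 back to about 0.62, which gives the convolution branch more influence at the start of the third stage.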

#### Loss Function

We found that using GAN loss helps retain high-frequency information and alleviates grid artifacts. Additionally, we observed that switching from 2D GAN to 3D GAN provides further improvements.

| GAN Loss/Step | SSIM↑ | LPIPS↓ | PSNR↑ |
|---|---|---|---|
| 2D/80k | 0.879 |  0.068 |  29.480 |
| 3D/80k | 0.882 |  0.067 |  29.890 | 

#### Inference Tricks
We introduced a method called **temporal rollback tiled convolution**, a tiling approach specifically designed for CausalVideoVAE. Specifically, all windows except the first one discard the first frame because the first frame in a window is treated as an image, while the remaining frames should be treated as video frames.

<img width="633" alt="tiled_temp" src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/0a06011e-1d6c-410a-9f1c-82c4122a018a">

We tested the speed on the H100 with a window size of 65×256×256.

| Version | 129×256×256 Peak Mem. | 129×256×256 Speed | 257×256×256 Peak Mem. | 257×256×256 Speed | 513×256×256 Peak Mem. | 513×256×256 Speed |
|---|---|---|---|---|---|---|
| 4×8×8 |  10G |   1.3 s/it  | 10G | 2.6 s/it   | 10G |   5.3 s/it  |
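The windowing behind temporal rollback tiling can be sketched as follows. This is one possible interpretation, not the repository's implementation: it assumes each window after the first starts one frame earlier (on the last frame of the previous window) so that its first, image-like frame can be discarded. The function names are hypothetical.

```python
def rollback_windows(num_frames: int, window: int = 65):
    """Return (start, end, drop_first) per temporal window, with `end` exclusive.

    Assumption: every window after the first rolls back by one frame and
    discards the output of that first frame.
    """
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end, start > 0))
        if end >= num_frames:
            break
        start = end - 1  # roll back one frame for the next window
    return windows


def kept_frames(windows) -> int:
    """Count the frames that survive after dropping rolled-back first frames."""
    return sum(end - start - (1 if drop else 0) for start, end, drop in windows)


for total in (129, 257, 513):
    wins = rollback_windows(total)
    print(total, len(wins), kept_frames(wins))
```

With a window of 65 frames, every input length in the table is covered exactly once, which matches the constant peak memory reported across the three sizes.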

### Data Construction
Since Open-Sora-Plan supports joint training of images and videos, our data collection is divided into two parts: images and videos. Images do not need to originate from videos; they are independent datasets. We spent approximately 32×240 H100 hours generating image and video captions, and all of this is **open source**!

#### Image-Text Collection Pipeline
We obtained 11 million image-text pairs from [Pixart-Alpha](https://huggingface.co/datasets/PixArt-alpha/SAM-LLaVA-Captions10M), with captions generated by [LLaVA](https://github.com/haotian-liu/LLaVA). Additionally, we utilized the high-quality OCR dataset [Anytext-3M](https://github.com/tyxsspa/AnyText), which pairs each image with corresponding OCR characters. However, these captions were insufficient to describe the entire image, so we used [InternVL-1.5](https://github.com/OpenGVLab/InternVL) for supplementary descriptions. Since T5 only supports English, we filtered for English data, which constitutes about half of the complete dataset. Furthermore, we selected high-quality images from [Laion-5B](https://laion.ai/blog/laion-5b/) to enhance human-like generation quality. The selection criteria included high resolution, high aesthetic scores, and watermark-free images containing people.

Here, we are open-sourcing the prompt used for InternVL-1.5:
```
# for anytext-3m
Combine this rough caption: "{}", analyze the image in a comprehensive and detailed manner. "{}" can be recognized in the image.
# for human-160k
Analyze the image in a comprehensive and detailed manner.
```

| Name | Image Source | Text Captioner | Num pair |
|---|---|---|---|
| SAM-11M | [SAM](https://ai.meta.com/datasets/segment-anything/) |  [LLaVA](https://github.com/haotian-liu/LLaVA) |  11,185,255 |
| Anytext-3M-en | [Anytext](https://github.com/tyxsspa/AnyText) |  [InternVL-1.5](https://github.com/OpenGVLab/InternVL) |  1,886,137 | 
| Human-160k | [Laion](https://laion.ai/blog/laion-5b/) |  [InternVL-1.5](https://github.com/OpenGVLab/InternVL) |  162,094 | 


#### Video-Text Collection Pipeline
In v1.0.0, we sampled one frame from each video to generate captions. However, as video length increased, a single frame could not adequately describe the entire video's content or temporal movements. Therefore, we used a video captioner to generate captions for the entire video clip. Specifically, we used [ShareGPT4Video](https://sharegpt4video.github.io/), which effectively covers temporal information and describes the entire video content. The v1.1.0 video dataset comprises approximately 3k hours, compared to only 300 hours in v1.0.0. As before, we have open-sourced all text annotations and videos (both under the CC0 license), which can be found [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main).



| Name | Hours | Num frames | Num pair |
|---|---|---|---|
| [Mixkit](https://mixkit.co/) | 42.0h |  65 |  54,735 |
|   |  |  513 |  1,997 | 
| [Pixabay](https://pixabay.com/) | 353.3h |  65 | 601,513 |
|   |  |  513 |  51,483 | 
| [Pexel](https://www.pexels.com/) | 2561.9h |  65 |  3,832,666 |
|   |  |  513 |  271,782 | 

### Training Diffusion Model
Similar to our previous work, we employed a multi-stage cascaded training method. Below is our training card:

#### Stage 1

We initially believed that the performance of the diffusion model would improve with longer training. Surprisingly, by observing the [logs](https://api.wandb.ai/links/linbin/o76j03j4), we found that videos generated at 50k steps were of higher quality than those at 70-100k steps. In fact, extensive sampling revealed that checkpoints at 40-60k steps outperformed those at 80-100k steps. Quantitatively, 50k steps correspond to approximately 2 epochs of training. It is currently unclear whether this is due to overfitting on a small dataset or the limited capacity of the 2+1D model.

#### Stage 2

In the second stage, we used Huawei Ascend computing power for training. This stage's training and inference were fully supported by Huawei. We conducted sequence parallel training and inference on a large-scale cluster, distributing one sample across eight ranks. Models trained on Huawei Ascend can also be loaded into GPUs and generate videos of the same quality.


#### Stage 3

In the third stage, we further increased the frame count to 513 frames, approximately 21 seconds at 24 FPS. However, this stage presents several challenges, such as ensuring temporal consistency in the 2+1D model over long durations and whether the current amount of data is sufficient. We are still training the model for this stage and continuously monitoring its progress.

| Name | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| Training Video Size | 65×512×512 |  221×512×512 | 513×512×512 |
| Compute (#Num x #Hours) | 80 H100 × 72 | 512 Ascend × 72 |  Under Training |
| Checkpoint | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main/65x512x512) | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.1.0/tree/main/221x512x512) | Under Training |
| Log | [wandb](https://api.wandb.ai/links/linbin/o76j03j4) | - |  - |
| Training Data | ~3k hours videos + 13M images |  |  |

### Video Editing

The recently proposed [ReVideo](https://mc-e.github.io/project/ReVideo/) achieves accurate video editing by modifying the first frame and applying motion control within the edited area. Although it achieves excellent video editing performance, the editing length is limited by the base model [SVD](https://github.com/Stability-AI/generative-models). Open-Sora, as a fundamental model for long-video generation, can compensate for this issue. Currently, we are collaborating with the ReVideo team to use Open-Sora as the base model for long video editing. Some preliminary results are shown [here](). 

The initial version still needs improvement in several aspects. In the future, we will continue to explore integration with ReVideo to develop improved long-video editing models.

## Failed Case and Discussion

Despite the promising results of v1.1.0, there remains a gap between our model and Sora. Here, we present some failure cases and discuss them.

### CausalVideoVAE

Despite the significant performance improvement of VAE in v1.1.0 over v1.0.0, we still encounter failures in challenging cases, such as sand dunes and leaves. The video on the left shows the reconstructed video downsampled by a factor of 4 in time, while the video on the right is downsampled by a factor of 2. Both exhibit jitter when reconstructing fine-grained features. This indicates that reducing temporal downsampling alone cannot fully resolve the jitter issue.

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/1a87d6d8-4bf1-4b4e-83bb-84870c5c3a11

https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/1a87d6d8-4bf1-4b4e-83bb-84870c5c3a11

### Diffusion Model

#### Semantic distortion

On the left is a video generated by v1.1.0 showing a puppy in the snow. In this video, the puppy's head exhibits semantic distortion, indicating that the model struggles to correctly identify which head belongs to which dog. On the right is a video generated by Sora's [base model](https://openai.com/index/video-generation-models-as-world-simulators/). We observe that Sora's early base model also experienced semantic distortion issues. This suggests that we may achieve better results by scaling up the model and increasing the amount of training data.

Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.

| Ours | Sora Base×1 | Sora Base×4 | Sora Base×32 |
|---|---|---|---|
| <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/1d456168-afad-4e22-ae3b-fc28eca935e8" width=224>  |<img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/c4ca99d9-9492-45c8-a75e-6efe21c330aa" width=224>  |<img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/b7b52894-f58b-4e64-858b-015247108b8b" width=224>  | <img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/dcaf793b-1da4-4cc1-9c17-8a46d55e80e6" width=224> |

#### Limited dynamics

The primary difference between videos and images lies in their dynamic nature, where objects undergo a series of changes across consecutive frames. However, the videos generated by v1.1.0 still contain many instances of limited dynamics. Upon reviewing a large number of training videos, we found that while web-crawled videos have high visual quality, they are often filled with meaningless close-up shots. These close-ups typically show minimal movement or are even static. On the left, we present a generated video of a bird, while on the right is a training video we found, which is almost static. There are many similar videos in the dataset from stock footage sites.

Prompt: This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage and red chest. Its crest is made of delicate, lacy feathers, while its eye is a striking red color. The bird's head is tilted slightly to the side, giving the impression of it looking regal and majestic. The background is blurred, drawing attention to the bird's striking appearance.


| Ours | Raw video |
|---|---|
|<img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/7ffb6bc6-b52c-488e-9f29-d7d90bda44d6" width=224>  |<img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/95fbbca4-e206-42f6-8c6b-af1063c442c6" width=224>  | 

#### Negative prompt

We found that using negative prompts can significantly improve video quality, even though we did not explicitly tag the training data with different labels. On the left is a video sampled using a negative prompt, while on the right is a video generated without a negative prompt. This suggests that we may need to incorporate more prior knowledge into the training data. For example, when a video has a watermark, we should note "watermark" in the corresponding caption. When a video's bitrate is too low, we should add more tags to distinguish it from high-quality videos, such as "low quality" or "blurry." We believe that explicitly injecting these priors can help the model differentiate between the vast amounts of pretraining data (low quality) and the smaller amounts of fine-tuning data (high quality), thereby generating higher quality videos.

Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.

Negative Prompt: distorted, discontinuous, ugly, blurry, low resolution, motionless, static, low quality


| With Negative Prompt | Without Negative Prompt |
|---|---|
|<img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/1d456168-afad-4e22-ae3b-fc28eca935e8" width=224>  |<img src="https://github.com/PKU-YuanGroup/Open-Sora-Plan/assets/62638829/7ad17f96-bfab-455a-830d-0daebccaf6fb" width=224>  | 

## Future Work

In our future work, we will focus on two main areas: (1) data scaling and (2) model design. Once we have a robust baseline model, we will extend it to handle variable durations and conditional control models.

### Data Scaling

#### Data source

As mentioned earlier, our dataset is entirely sourced from stock footage websites. Although these videos are of high quality, many consist of close-up shots of specific areas, resulting in slow motion in the videos. We believe this is one of the main reasons for the limited dynamics observed. Therefore, we will continue to collect datasets from diverse sources to address this issue.

#### Data volume

In v1.1.0, our dataset comprises only ~3k hours of video. We are actively collecting more data and anticipate that the video dataset for the next version will reach ~100k hours. We welcome recommendations from the open-source community for additional datasets.

### Model Design

#### CausalVideoVAE
In our internal testing, we found that the jitter in reconstructing fine-grained features cannot be completely resolved even without temporal downsampling. Therefore, we need to reconsider how to mitigate video jitter as much as possible while still supporting both images and videos. We will introduce a more powerful CausalVideoVAE in the next version.

#### Diffusion Model
In v1.1.0, we found that 2+1D models can generate higher-quality videos in short durations. However, for long videos, they tend to exhibit discontinuities and inconsistencies. Therefore, we will explore more possibilities in model architecture to address this issue.


================================================
FILE: docs/Report-v1.2.0.md
================================================
# Report v1.2.0

In May 2024, we launched Open-Sora-Plan v1.1.0, featuring a 2+1D model architecture that could be quickly utilized for exploratory training in text-to-video generation tasks. However, when handling dense visual tokens, the 2+1D architecture could not simultaneously process spatial and temporal dimensions. Therefore, we transitioned to **a 3D full attention architecture**, which better captures the joint spatial-temporal features. Although this version is experimental, it advances video generation architecture to a new realm, leading us to release it as v1.2.0.

Compared to previous video generation models, Open-Sora-Plan v1.2.0 offers the following improvements:

1. **Better compressed visual representations**. We optimized the structure of CausalVideoVAE, which now delivers enhanced performance and higher inference efficiency.
2. **Better video generation architecture**. Instead of 2+1D, we use a diffusion model with a 3D full attention architecture, which provides a better understanding of the world.


### Open-Source Release
We open-source Open-Sora-Plan to facilitate future development of video generation in the community. Code, data, and models are publicly available.
- Code: All training scripts and sample scripts.
- Model: Both Diffusion Model and CausalVideoVAE [here](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0).
- Data: Filtered data [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0).


## Gallery

93×1280×720 Text-to-Video Generation. The video quality has been compressed for playback on GitHub.

<table class="center">
<tr>
  <td><video src="https://github.com/user-attachments/assets/1c84bc92-d585-46c9-ae7c-e5f79cefea88" autoplay></td>
</tr>
</table>

## Detailed Technical Report

### CausalVideoVAE

#### Model Structure

The VAE in version 1.2.0 maintains the overall architecture of the previous version but merges the temporal and spatial downsampling layers. In version 1.1.0, we performed spatial downsampling (stride=1,2,2) followed by temporal downsampling (stride=2,1,1). In version 1.2.0, we conduct both spatial and temporal downsampling simultaneously (stride=2,2,2) and perform spatial-temporal upsampling in the decoder (interpolate_factor=2,2,2).

Due to the absence of additional convolutions during downsampling and upsampling, this method more seamlessly inherits the weights from the SD2.1 VAE, leading to improved initialization of our VAE.

<img src="https://s21.ax1x.com/2024/07/24/pkHrHx0.png" width=768>


#### Training Details


As with v1.1.0, we initialize from the [SD2.1 VAE](https://huggingface.co/stabilityai/sd-vae-ft-mse) using tail initialization. We perform the first phase of training on the Kinetics-400 (K400) video dataset, then use the EMA weights from this phase to initialize the second phase, which is fine-tuned on high-quality data (collected in v1.1.0). All training is conducted on 25-frame 256×256 videos using **one A100 node**.


| Training stage | Dataset | Training steps  |
|---|---|---|
| 1 |  K400 |  200,000 |
| 2 |  collected in v1.1.0 |  450,000   |


#### Evaluation

We evaluated our VAE on the validation sets of two video datasets: [Webvid](https://github.com/m-bain/webvid) and [Panda70m](https://github.com/snap-research/Panda-70M/), and compared it with our [v1.1.0](https://github.com/PKU-YuanGroup/Open-Sora-Plan/blob/main/docs/Report-v1.1.0.md), [SD2.1 VAE](https://huggingface.co/stabilityai/sd-vae-ft-mse), [CV-VAE](https://github.com/AILab-CVC/CV-VAE), and [Open-Sora's VAE](https://github.com/hpcaitech/Open-Sora). The Webvid validation set contains 5k videos, while the Panda70m validation set has 6k videos. The videos were resized to 256 pixels on the short side, center-cropped to 256x256, and then 33 consecutive frames were extracted. We used PSNR, SSIM, and LPIPS metrics, and measured the encoding speed on an A100 GPU. The specific results are as follows:


**WebVid**

| Model | Compress Ratio | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| SD2-1 VAE | 1x8x8 | 30.19 | 0.8379 | 0.0568 |
| SVD VAE | 1x8x8 |<ins>31.15</ins> |<ins>0.8686</ins> | **0.0547** | 
| CV-VAE | 4x8x8 | 30.76 | 0.8566 | 0.0803 |
| Open-Sora VAE | 4x8x8 | 31.12 | 0.8569 | 0.1003 |
|  Open-Sora Plan v1.1 | 4x8x8 | 30.26 | 0.8597 |<ins>0.0551</ins> |
|  Open-Sora Plan v1.2  | 4x8x8| **31.16** | **0.8694** | 0.0586 |

**Panda70M**

| Model | Compress Ratio | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| SD2-1 VAE | 1x8x8 |30.40 | 0.8894 | 0.0396 |
| SVD VAE | 1x8x8 |<ins>31.00</ins> | **0.9058** | **0.0379** | 
| CV-VAE  | 4x8x8| 29.57 | 0.8795 | 0.0673 |
| Open-Sora VAE | 4x8x8 | **31.06** | 0.8969 | 0.0666 |
| Open-Sora Plan v1.1 | 4x8x8 | 29.16 | 0.8844 | 0.0481 |
|  Open-Sora Plan v1.2  | 4x8x8| 30.49 |<ins>0.8970</ins> |<ins>0.0454</ins>|

**Encode Time on A100**

|Input Size| CV-VAE | Open-Sora | Open-Sora Plan v1.1 | Open-Sora Plan v1.2 |
|---|---|---|---|---|
| 33x256x256 | 0.186 | 0.147 |<ins>0.104</ins> | **0.102** |
| 81x256x256 | 0.465 | 0.357 |<ins>0.243</ins> | **0.242** |

### Training Text-to-Video Diffusion Model

#### Model Structure

The most significant change is that we **replaced all 2+1D Transformer blocks with 3D full attention blocks**. Each video is first processed by a patch embedding layer, which downsamples the spatial dimensions by a factor of 2. The video is then flattened into a one-dimensional sequence across the frame, width, and height dimensions. We replaced [T5-XXL](https://huggingface.co/DeepFloyd/t5-v1_1-xxl) with [mT5-XXL](https://huggingface.co/google/mt5-xxl) to enhance multilingual adaptation. Additionally, we incorporated RoPE.
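To give a sense of the sequence lengths that 3D full attention has to handle, the following sketch estimates the token count after VAE compression, patch embedding, and flattening. It rests on assumptions not stated in this paragraph: a 4×8×8 VAE compression ratio (as listed in the evaluation tables), a causal temporal mapping of `(frames - 1) / 4 + 1` latent frames, and a patch size of 2 applied only to the spatial dimensions. The function name is hypothetical.

```python
def token_count(frames: int, height: int, width: int,
                vae_ratio=(4, 8, 8), spatial_patch: int = 2) -> int:
    """Estimate the flattened sequence length for one video (hypothetical helper).

    Assumptions: causal VAE keeps the first frame and compresses the rest by
    vae_ratio[0]; the patch embedding halves height and width only.
    """
    t = (frames - 1) // vae_ratio[0] + 1
    h = height // vae_ratio[1] // spatial_patch
    w = width // vae_ratio[2] // spatial_patch
    return t * h * w


# 93 frames at 1280×720 (width × height) versus 29 frames at the same resolution.
print(token_count(93, 720, 1280))
print(token_count(29, 720, 1280))
```

Because full attention cost grows quadratically with sequence length, a threefold increase in tokens from 29 to 93 frames translates into a much larger increase in attention compute.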


#### Sequence Parallelism

Due to the high computational complexity of 3D full attention, we must split each video across multiple GPUs for parallel processing when training on long-duration, high-resolution videos. We control the number of GPUs used for a video sample by adjusting the batch size on a node. For example, with `sp_size=8` and `train_sp_batch_size=4`, 2 GPUs are used for a single sample. **We support sequence parallelism for both training and inference**.

**Training on 93×720p**, we report speed on H100.

| GPU (sp_size) | Batch size | Enable sp | train_sp_batch_size | Speed | Steps per day |
|---|---|---|---|---|---|
|8|8|×|-|100s/step|~850|
|8|-|√|4|53s/step|~1600|
|8|-|√|2|27s/step|~3200|
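The arithmetic behind the table can be reproduced with a short sketch. The relation `sp_size / train_sp_batch_size` is inferred from the example above (8 and 4 giving 2 GPUs per sample) and is an assumption about how the two settings interact; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def gpus_per_sample(sp_size: int, train_sp_batch_size: int) -> int:
    """GPUs sharing one video sample, inferred from the sp_size=8, batch=4 example."""
    if sp_size % train_sp_batch_size != 0:
        raise ValueError("sp_size must be divisible by train_sp_batch_size")
    return sp_size // train_sp_batch_size


def steps_per_day(seconds_per_step: float) -> float:
    """Convert a measured step time into training steps per day."""
    return SECONDS_PER_DAY / seconds_per_step


for batch, sec in ((4, 53), (2, 27)):
    print(gpus_per_sample(8, batch), round(steps_per_day(sec)))
```

The rounded results agree with the approximate figures in the table: about 850 steps per day at 100 s/step, about 1600 at 53 s/step, and 3200 at 27 s/step.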

**Inference on 29×720p and 93×720p**, we report speed on H100.

| Size | 1 GPU | 8 GPUs | 
|---|---|---|
|29×720p|420s/100step|80s/100step|
|93×720p|3400s/100step|450s/100step|

#### Dynamic training

Deep neural networks are typically trained using batched inputs. For efficient hardware processing, batch shapes are fixed, leading to a fixed data size. This requires either cropping or padding images to a uniform size, both of which have drawbacks: cropping discards information and degrades performance, while padding is computationally inefficient. Generally, there are three methods for training with arbitrary token counts: Patch n' Pack, bucket, and pad-mask.



**Patch n' Pack** ([NaViT](https://arxiv.org/abs/2307.06304)): bypasses the fixed sequence length limitation by combining tokens from multiple samples into a new sample. This approach allows variable-resolution images while maintaining aspect ratios by packaging multiple samples together, thereby reducing training time and enhancing performance and flexibility. However, this method involves significant code modifications and requires re-adaptation when exploring different model architectures in fields with unstable model designs.


**Bucket** ([Pixart-alpha](https://arxiv.org/abs/2310.00426), [Open-Sora](https://github.com/hpcaitech/Open-Sora)): This method packages data of different resolutions into buckets and samples batches from each bucket to ensure the same resolution within each batch. It requires minimal code modifications to the model, mainly adjusting the data sampling strategy.

**Pad-mask** ([FiT](https://arxiv.org/abs/2402.12376), our v1.0/v1.1): This method sets a maximum resolution and pads all data to this resolution, generating a corresponding mask. Although the approach is straightforward, it is computationally inefficient.

We believe that current video generation models are still in an exploratory phase. Extensive modifications to model code during this period can incur unnecessary development costs. The pad-mask method, while straightforward, is computationally inefficient and can waste resources in video, which involves dense computations. Ultimately, we chose the bucket strategy, which requires no modifications to the model code. Next, we will explain how our bucket strategy supports arbitrary lengths and resolutions. For simplicity, we will use video duration as an example:

<img src="https://s21.ax1x.com/2024/07/24/pkHr4aQ.png" width=768>


We define a megabatch as the total data processed in a single step across all GPUs. A megabatch can be divided into multiple batches, with each batch corresponding to the data processed by a single GPU.

**Sort by frame**: The first step is to count the number of frames in all video data and sort them. This step aims to group similar data together, with sorting being one method to achieve this.

**Group megabatch**: Next, all data is divided into groups, each forming a megabatch. Since all data is pre-sorted, most videos within a megabatch have the same number of frames. However, there will always be boundary cases, such as having both 61-frame and 1-frame videos in a single megabatch.

**Re-organize megabatch**: We re-organize these special megabatches, which constitute only a small proportion of the total. We randomly replace the minority data in the megabatch with the majority data, thus re-organizing it into a megabatch with a uniform frame count.

**Shuffle megabatch**: To ensure data randomness, we shuffle both within each megabatch and between different megabatches.

When supporting dynamic resolutions, we simply replace each sample's frame sequence with (frame × height × width). This method ensures that the data dimension processed by each GPU in every step is the same, preventing situations where GPU1 waits for GPU0 to finish processing a longer video. Moreover, it is entirely decoupled from the model code, serving as a plug-and-play video sampling strategy.
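The four steps above (sort, group, re-organize, shuffle) can be sketched as a standalone sampler. This is a minimal illustration rather than the repository's sampler: the function name, the choice to drop an incomplete final megabatch, and the use of a seeded `random.Random` are all assumptions made for the example.

```python
import random
from collections import Counter


def build_megabatches(frame_counts, megabatch_size, seed=0):
    """Return lists of sample indices, each list sharing one frame count."""
    rng = random.Random(seed)

    # Sort by frame: order sample indices by their frame count.
    order = sorted(range(len(frame_counts)), key=lambda i: frame_counts[i])

    # Group megabatch: consecutive chunks (incomplete tail dropped, an assumption).
    last_start = len(order) - megabatch_size + 1
    megabatches = [order[i:i + megabatch_size]
                   for i in range(0, last_start, megabatch_size)]

    result = []
    for mb in megabatches:
        # Re-organize megabatch: replace minority samples with majority ones.
        majority, _ = Counter(frame_counts[i] for i in mb).most_common(1)[0]
        keep = [i for i in mb if frame_counts[i] == majority]
        mb = [i if frame_counts[i] == majority else rng.choice(keep) for i in mb]
        # Shuffle within the megabatch.
        rng.shuffle(mb)
        result.append(mb)

    # Shuffle between megabatches.
    rng.shuffle(result)
    return result


counts = [1] * 5 + [29] * 6 + [61] * 5
for mb in build_megabatches(counts, megabatch_size=4):
    print([counts[i] for i in mb])
```

For dynamic resolutions, the same sketch would apply with the sort key replaced by a (frame, height, width) tuple, since the sampler only needs a comparable shape per sample.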


#### Training stage

Similar to previous work, we use a multi-stage training approach. With the 3D DiT architecture, all parameters can be transferred from images to videos without loss. To explore training costs, all parameters of the diffusion model are trained from scratch. Therefore, we first train a text-to-image model, using the training strategy from [Pixart-alpha](https://arxiv.org/abs/2310.00426).

The video model is initialized with weights from a 480p image model. We first train on 480p videos with 29 frames. Next, we adapt the weights to 720p resolution, training on approximately 6 million higher-quality (HQ) samples from Panda70M, filtered for aesthetic quality and motion. We then refine the model with an even higher-quality (HQ) subset of 1 million samples. Finally, we use filtered data (collected in v1.1.0) to fine-tune on 93-frame 720p videos. Below is our training card. We release the annotation file [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/anno_json).

| Name | Stage 1 | Stage 2 | Stage 3 | Stage 4 |Stage 5 |
|---|---|---|---|---|---|
| Training Video Size | 1×320×240 |  1×640×480 | 29×640×480 |  29×1280×720 | 93×1280×720 |
| Training Step| 146k |  200k | 30k | 21k | 3k |
| Compute (#Num x #Hours) | 32 Ascend × 81 | 32 Ascend × 142 |  128 Ascend × 38 | 256 H100 × 64 | 256 H100 × 84 |
| Checkpoint | - | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/1x480p) | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/29x480p) | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/29x720p) | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x720p) |
| Log | - | - | [wandb](https://api.wandb.ai/links/1471742727-Huawei/trdu2kba) | [wandb](https://api.wandb.ai/links/linbin/vvxvcd7s) | [wandb](https://api.wandb.ai/links/linbin/easg3qkl) |
| Training Data | [10M SAM](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/sam_image_11185255_resolution.json) | 5M internal image data | [6M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ6M.json) | [6M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ6M.json) | [1M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ1M.json) and [100k HQ data](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/anno_json) (collected in v1.1.0) |

Additionally, we fine-tuned 3.5k steps from the final 93×720p to get [93×480p](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x480p) for community research use.

### Training Image-to-Video Diffusion Model

#### Model Structure

<img src="https://s21.ax1x.com/2024/08/12/pApZZJf.png">

To reuse the weights of the Text-to-Video model, our Image-to-Video model is inspired by the Stable Diffusion Inpainting Model and adopts a strategy based on frame-level inpainting. By incorporating three types of information—original noise, masked video, and mask—under different control frame conditions, our model can generate coherent videos while ensuring flexibility in its usage.

Compared to the denoiser of the Text-to-Video model, the Inpainting Model's denoiser changes only the number of channels in the `conv in` layer. To ensure the model starts from a good prior, we introduce the masked video and mask information through zero initialization. Under the 2+1D structure, this frame-level inpainting strategy is noticeably less stable. We believe this is because the 2+1D structure cannot establish long-range information dependencies, and relying solely on attention in the temporal dimension makes it difficult to capture information changes under frame control. In Text-to-Video tasks, this phenomenon is not as evident because all frames share the same text prompt embedding. However, in Image-to-Video tasks, simply concatenating images in the channel dimension does not ensure the model can accurately capture changes between frames. This is because the model cannot directly replicate image information from the channels to reduce the loss, and the 2+1D structure's interaction solely on the temporal axis does not let the model discern which information from the control frames can be used, especially when there are significant differences between frames. Therefore, without shared image-semantic information, the control frame information might not be effectively conveyed to each frame.
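The effect of zero-initializing the added input channels of `conv in` can be shown with a toy example. The sketch below uses a plain channel-mixing layer (equivalent to a 1×1×1 convolution at a single position) as a stand-in for the real layer; the helper names and the channel counts are hypothetical.

```python
def extend_conv_in(weight, extra_in_channels):
    """Append zero-initialized input channels to a [out][in] weight matrix."""
    return [row + [0.0] * extra_in_channels for row in weight]


def channel_mix(weight, x):
    """Apply the weights to one position's channel vector (toy 1x1x1 convolution)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Pretrained Text-to-Video weights over 2 noise channels (toy sizes).
t2v_weight = [[1.0, 2.0], [3.0, 4.0]]
noise = [0.5, -1.0]

# Inpainting input: noise channels followed by masked-video and mask channels.
masked_video_and_mask = [9.0, 7.0, 1.0]
inpaint_weight = extend_conv_in(t2v_weight, len(masked_video_and_mask))

print(channel_mix(t2v_weight, noise))
print(channel_mix(inpaint_weight, noise + masked_video_and_mask))
```

At initialization the extended layer produces the same output as the original one, so the inpainting model starts from the Text-to-Video behavior and learns to use the new channels gradually.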

##### About Semantic Adapter

In previous models based on the Unet 2+1D architecture, it is necessary to input the control frames into the CLIP model to obtain semantic embeddings. These semantic embeddings are then injected into the denoiser through cross-attention. The structure that extracts CLIP embeddings and injects them into the denoiser is commonly referred to as a semantic adapter.

In the 2+1D architecture, the semantic adapter is commonly present. Additionally, papers like [DynamiCrafter](https://arxiv.org/abs/2310.12190) have pointed out that incorporating the semantic adapter helps maintain stability in the generated videos. We believe this is because the 2+1D structure lacks the ability to establish long-range information dependencies, and relying solely on attention in the temporal dimension makes it difficult to capture information changes under frame control. In the Text-to-Video task, this phenomenon is not as evident because all frames share the same text prompt embedding. However, in the Image-to-Video task, without shared semantic information, it may lead to the inability to effectively transfer control frame information to each individual frame.

<center>
<figure>
    <img src="https://github.com/user-attachments/assets/06df193a-fe89-42c1-8c01-fb3b7c2be0e3" height=400 />
	<img src="https://github.com/user-attachments/assets/09906df4-ab9f-443d-8d38-512e16075b0c" height=400 />
</figure>
</center>

We conducted a simple comparison of the performance of using the Inpainting Model under the 2+1D structure (Open-Sora Plan v1.1, left in the figure) versus the 3D structure (Open-Sora Plan v1.2, right in the figure). With the same number of optimization steps, the probability of unstable visual performance in the 2+1D structure was significantly higher than in the 3D structure. Even at convergence, the 2+1D structure's visual stability was still inferior to that of the 3D structure, and it was even worse than the early training stages of the 3D structure.

## Future Work and Discussion

#### CausalVideoVAE
We observed that high-frequency motion information in videos tends to exhibit jitter, and increasing training duration and data volume does not significantly alleviate this issue. In videos, compressing the duration while maintaining the original latent dimension can lead to significant information loss. A more robust VAE will be released in the next version.

#### Diffusion Model
We replaced T5 with mT5 to enhance multilingual capabilities, but this capability is limited as our training data is currently only in English. The multilingual ability primarily comes from the mT5 mapping space. We will explore additional text encoders and expand the data in the next steps.

Our model performs well in character consistency, likely because Panda70M is a character-centric dataset. However, it still shows poor performance in text consistency and object generalization. We suspect this may be due to the limited amount of data the model has seen, as evidenced by the non-convergence of the loss in the final stage. **We hope to collaborate with the open-source community to optimize the 3D DiT architecture.**

================================================
FILE: docs/Report-v1.3.0.md
================================================
# Report v1.3.0

In August 2024, we released Open-Sora-Plan v1.2.0, transitioning to a 3D full attention architecture, which enhanced the capture of joint spatial-temporal features. However, the substantial computational cost made it unsustainable, and the lack of a clear training strategy hindered continuous progress along a focused path.

In version 1.3.0, Open-Sora-Plan introduced the following five key features:

**1. A more powerful and cost-efficient WFVAE.** We decompose video into several sub-bands using wavelet transforms, naturally capturing information across different frequency domains, leading to more efficient and robust VAE learning.

**2. Prompt Refiner.** A large language model designed to refine short text inputs.

**3. High-quality data cleaning strategy.** The cleaned panda70m dataset retains only 27% of the original data.

**4. DiT with new sparse attention.** A more cost-effective and efficient learning approach.

**5. Dynamic resolution and dynamic duration.** This enables more efficient utilization of videos with varying lengths (treating a single frame as an image).

### Open-Source Release
We open-source Open-Sora-Plan to facilitate future development of video generation in the community. Code, data, and models will be made publicly available.
- Code: All training scripts and sample scripts.
- Model: Both Diffusion Model and CausalVideoVAE [here](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0).
- Data: The data of prompt refiner is [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/prompt_refiner).

## Gallery

Text & Image to Video Generation. 

[![Demo Video of Open-Sora Plan V1.3](https://github.com/user-attachments/assets/4ff1d873-3dde-4905-a907-dbff51174c20)](https://www.bilibili.com/video/BV1KR2fYPEF5/?spm_id_from=333.999.0.0&vd_source=cfda99203e659100629b465161f1d87d)

## Detailed Technical Report

### WF-VAE

As video generation models move toward higher resolutions and longer durations, the computational cost of video VAEs grows exponentially, becoming unsustainable. Most related work addresses this by using tiling to reduce inference memory consumption. However, in high-resolution, long-duration scenarios, tiling significantly increases inference time. Additionally, since tiling is lossy for latents, it can lead to visual artifacts such as shadows or flickering in the generated videos. We therefore introduce WF-VAE, a new model designed to address these problems.

#### Model Structure

<center>
<figure>
	<img width="899" alt="SCR-20241023-tzct" src="https://github.com/user-attachments/assets/03615e1d-2633-4247-af0b-d93e2a935e3e">
</figure>
</center>

The compression rate fundamentally determines the quality of VAE-reconstructed videos. We analyzed the energy and entropy of different subbands obtained through wavelet transform and found that most of the energy in videos is concentrated in the low-frequency bands. Moreover, by replacing the `LLL` subband of the VAE-reconstructed video with the original video's `LLL` subband, we observed a significant improvement in the spatiotemporal quality of the videos.

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/533666a6-05be-4584-8b14-86f01d0471dd" height=250 />
</figure>
</center>

In previous VAE architectures, the lack of a "highway" for transmitting the dominant energy during video compression meant that this pathway had to be gradually established during model training, leading to redundancy in model parameters and structure. Therefore, in our model design, we created a more efficient transmission path for the LLL subband energy, significantly simplifying the model architecture, reducing inference time, and lowering memory consumption.
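
The energy observation above can be illustrated with a one-level 3D wavelet split. The following is a minimal sketch, not the WF-VAE implementation: it assumes an orthonormal Haar wavelet (the text does not state which wavelet is used), a single-channel `(T, H, W)` array, and a synthetic smooth clip in place of real video. The helper names `haar_split` and `haar3d_subbands` are hypothetical.

```python
import numpy as np

def haar_split(x, axis):
    """One-level orthonormal Haar split along `axis` (its length must be even)."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)

def haar3d_subbands(video):
    """Split a (T, H, W) array into eight subbands keyed 'LLL', 'LLH', ..., 'HHH'.

    The three letters refer to the T, H and W axes in that order.
    """
    bands = {"": video}
    for axis in range(3):
        split = {}
        for name, arr in bands.items():
            low, high = haar_split(arr, axis)
            split[name + "L"] = low
            split[name + "H"] = high
        bands = split
    return bands

# Synthetic, slowly varying clip (an assumption standing in for natural video).
t, h, w = np.meshgrid(np.arange(8), np.arange(32), np.arange(32), indexing="ij")
video = np.sin(0.1 * t) + np.cos(0.05 * h) + np.sin(0.05 * w)

bands = haar3d_subbands(video)
total_energy = sum(float((b ** 2).sum()) for b in bands.values())
lll_fraction = float((bands["LLL"] ** 2).sum()) / total_energy
print(f"share of energy in LLL: {lll_fraction:.4f}")
```

Because the transform is orthonormal, the subband energies sum to the energy of the input, so the printed share is directly comparable across subbands.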

#### Training Details

More details will be provided in the forthcoming paper.

#### Ablation Study

We ran the ablation experiments on the K400 training and validation sets using 8xH100 GPUs. The latent dimension was fixed at 4. We observed that as model parameters increased, there was still room for improvement in reconstruction metrics. GroupNorm showed instability during training, performing worse than LayerNorm on PSNR but better on LPIPS.

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/ed880143-72d1-4316-a1d4-5fdfc5ed155a" height=200 />
	<img src="https://github.com/user-attachments/assets/303954c3-73ee-44f3-9897-d3d14b37b27e" height=200 />
</figure>
</center>

#### Performance

The following metrics were tested on H100 with float32 precision. For fairness, tiling was disabled for all models, and direct inference was performed.

<center>
<figure>
	<img width="765" alt="SCR-20241023-tzwz" src="https://github.com/user-attachments/assets/f7d4f225-5d22-4152-90ad-32716884ae6c">
</figure>
</center>


#### Evaluation

We evaluated PSNR and LPIPS on the Panda70M test set at 256 pixels and 33 frames. In the open-source WF-VAE-S (8-dim), our encoder was distilled from the 8-dim OD-VAE, resulting in some metric degradation compared to direct training.


| Latent Dim | Model | Params |  PSNR |  LPIPS | 
|---|---|---|---|---|
| 4 | OD-VAE (Our VAE in v1.2.0) | 94M + 144M | 30.311 | 0.043 |
| 4 | WFVAE-S | 38M + 108M | 30.579 | 0.044 |
| 8 | WFVAE-S (Distillation) | 38M + 108M | 31.764 | 0.050 |

#### Causal Cache

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/59cb0543-225b-45a3-a4a6-429e5e753167" height=200 />
</figure>
</center>


To address the issue of tiling, we replaced GroupNorm with LayerNorm and introduced a novel method called **Causal Cache**, enabling lossless temporal block-wise inference.

First, we replaced GroupNorm with LayerNorm and utilized the properties of CausalConv3D to achieve lossless inference through temporal dimension chunking. In each layer of CausalConv3D, we cache the information from the previous few frames to maintain continuity during the convolution sliding operation for the next temporal chunk, thereby enabling lossless processing. As illustrated, we use a kernel size of 3 and a stride of 1 as an example:

**Initial Chunk (chunk idx=0):** For the first time chunk, we perform standard causal padding to support joint processing of images and videos. After the convolution operation, we cache the last two frames of this chunk into the causal cache in preparation for the next chunk's inference.

**Subsequent Chunks (chunk idx=1 and beyond):** Starting from the second time chunk, we no longer use causal padding. Instead, we concatenate the cached causal information from the previous chunk to the front of the current chunk. We continue to cache the last two frames of the current input into the causal cache for use in subsequent chunks.
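
The two cases above can be sketched on the temporal axis alone. This is a simplified illustration rather than the repository's `CausalConv3D`: it convolves only over time, uses kernel size 3 and stride 1 as in the example, and assumes that causal padding repeats the first frame. The function names are hypothetical.

```python
import numpy as np

def _valid_conv(padded, kernel, out_len):
    """Stride-1 temporal convolution without padding (cross-correlation form)."""
    k = len(kernel)
    return np.stack([sum(kernel[j] * padded[i + j] for j in range(k))
                     for i in range(out_len)])

def causal_conv_full(x, kernel):
    """Reference: process all frames at once with causal padding in front."""
    k = len(kernel)
    padded = np.concatenate([np.repeat(x[:1], k - 1, axis=0), x], axis=0)
    return _valid_conv(padded, kernel, len(x))

def causal_conv_chunked(x, kernel, chunk_size):
    """Process frames chunk by chunk, carrying a causal cache between chunks."""
    k = len(kernel)  # assumes k >= 2
    cache = None
    outputs = []
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        if cache is None:
            # Initial chunk: standard causal padding (assumed: repeat first frame).
            prefix = np.repeat(chunk[:1], k - 1, axis=0)
        else:
            # Later chunks: prepend the cached frames instead of padding.
            prefix = cache
        padded = np.concatenate([prefix, chunk], axis=0)
        outputs.append(_valid_conv(padded, kernel, len(chunk)))
        # Keep the last k - 1 input frames (two frames for kernel size 3).
        cache = padded[-(k - 1):]
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4, 4))   # (T, H, W)
kernel = [0.2, 0.3, 0.5]
full = causal_conv_full(frames, kernel)
chunked = causal_conv_chunked(frames, kernel, chunk_size=4)
print(np.allclose(full, chunked))
```

The chunked result matches the single-pass result because every output frame sees exactly the same input frames in both cases, which is what makes the chunking lossless.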

## Prompt Refiner

User-provided captions are typically fewer than 10 words, whereas the text annotations in the current training data are often dense. This inconsistency between training and inference may result in poor visual quality and weak text alignment. We categorize captions into four types:

(1) Short captions from real user input; we collected 11k from [COCO](https://cocodataset.org/#home).

(2) Captions composed of multiple tags; we collected 5k from [DiffusionDB](https://github.com/poloclub/diffusiondb).

(3) Medium-length captions generated by large language models; 3k sourced from [JourneyDB](https://github.com/JourneyDB/JourneyDB).

(4) Ultra-long, surrealist captions, sourced from Sora/Vidu/Pika/Veo and approximately 0.5k generated by GPT.

We used ChatGPT to rewrite the above captions, with the following instructions provided to ChatGPT:

```
rewrite the sentence to contain subject description action, scene description. 
Optional: camera language, light and shadow, atmosphere and
conceive some additional actions to make the sentence more dynamic,
make sure it is a fluent sentence, not nonsense.
```

Finally, we performed LoRA fine-tuning using [LLaMa 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), completing the training in just 30 minutes with a single H100. We fine-tuned for only 1 epoch, using a batch size of 32 and a LoRA rank of 64. The log can be found [here](https://api.wandb.ai/links/1471742727-Huawei/p5xmkft5). We open-sourced the data [here](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.3.0/tree/main/prompt_refiner).

### Data Construction

We randomly sampled from the original Panda70m dataset and found many videos to be static, contain multiple subtitles, or suffer from motion blur. Additionally, the captions in Panda70m did not always accurately describe the video content. To address this, we designed a video filtering pipeline, which retained approximately 27% of the videos after processing.

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/90f9d386-ff2e-465a-b013-a9e7151afaf8" height=400 />
</figure>
</center>


#### Jump Cut and Detect Motion

We used [LPIPS](https://github.com/richzhang/PerceptualSimilarity) frame-skipping to compute inter-frame semantic similarity, identifying anomalies as cut points and taking the mean as the motion score. We found that videos with motion scores below 0.001 were nearly static, while those above 0.3 exhibited significant jitter and flicker. After applying this method, we manually reviewed 2k videos and concluded that the cut detection accuracy was sufficient for pre-training requirements.
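
A rough sketch of the motion-score step follows. It is only illustrative: the default distance is a mean absolute pixel difference standing in for LPIPS, so the 0.001 and 0.3 thresholds are meaningful only when a real LPIPS function is passed as `distance_fn`. The cut-detection rule is not specified in the text and is therefore left out. All function names are hypothetical.

```python
import numpy as np

def frame_distances(frames, skip=1, distance_fn=None):
    """Distances between frames sampled every `skip` frames."""
    if distance_fn is None:
        # Stand-in metric, NOT LPIPS; replace with a perceptual distance in practice.
        distance_fn = lambda a, b: float(np.mean(np.abs(a - b)))
    sampled = frames[::skip]
    return np.array([distance_fn(sampled[i], sampled[i + 1])
                     for i in range(len(sampled) - 1)])

def motion_score(frames, skip=1, distance_fn=None):
    """Mean inter-frame distance, used as the motion score."""
    return float(frame_distances(frames, skip, distance_fn).mean())

def classify_motion(score, low=0.001, high=0.3):
    """Thresholds quoted in the text: below `low` is static, above `high` is jittery."""
    if score < low:
        return "static"
    if score > high:
        return "jitter"
    return "ok"

still_clip = np.zeros((8, 4, 4))
print(classify_motion(motion_score(still_clip, skip=2)))
```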

#### OCR

We estimated the average position of subtitles on common video platforms to be around 18%. Consequently, we set the maximum crop threshold to 20% of the video's original dimensions and used [EasyOCR](https://github.com/JaidedAI/EasyOCR) to detect subtitles (sampling one frame per second). However, not all videos have subtitles or printed text located at the edges; this method may miss text appearing in central areas, such as in advertisement videos or speeches. Nonetheless, we cannot assume that the presence of text in a video necessitates filtering it out, as certain texts in specific contexts can be meaningful. We leave such judgments to aesthetic considerations.

#### Aesthetic

As before, we used the [Laion aesthetic predictor](https://github.com/christophschuhmann/improved-aesthetic-predictor) for evaluation. Based on the visualization [website](http://captions.christoph-schuhmann.de/aesthetic_viz_laion_sac+logos+ava1-l14-linearMSE-en-2.37B.html), we determined that a score of 4.75 serves as a suitable threshold, effectively filtering out excessive text while retaining high-quality aesthetics. We will add an additional aesthetic prompt, such as `A high-aesthetic scene, ` for data with a score above 6.25.

#### Video Quality

Some old photos or videos have very low bit rates, resulting in blurry visual effects even at 480P resolution, often resembling a mosaic appearance. Aesthetic filtering struggles to exclude these videos, as it resizes images to 224 resolution. We aim to establish a metric for assessing absolute video quality, independent of the visual content itself, focusing solely on compression artifacts, low bit rates, and jitter. We employed the technical prediction score from [DOVER](https://github.com/VQAssessment/DOVER) and excluded videos with scores below 0.

#### Recheck Motion

Since some videos contain subtitles, variations in the subtitles may lead to inaccurate motion values. Therefore, we re-evaluated the motion values and discarded static videos.
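
The thresholds quoted in the filtering steps above can be collected into one predicate. This is a sketch under assumptions: the `ClipStats` record and its field names are hypothetical, the scores are assumed to be precomputed by LPIPS, the Laion aesthetic predictor, and DOVER, and the boundary handling (inclusive or exclusive) is a guess where the text does not say.

```python
from dataclasses import dataclass

@dataclass
class ClipStats:
    """Hypothetical per-clip record of precomputed scores."""
    motion: float      # mean LPIPS distance between skipped frames
    aesthetic: float   # Laion aesthetic predictor score
    technical: float   # DOVER technical score
    caption: str

def keep_clip(stats: ClipStats) -> bool:
    """Apply the motion, aesthetic and video-quality thresholds from the text."""
    return (
        0.001 <= stats.motion <= 0.3   # neither static nor jittery
        and stats.aesthetic >= 4.75    # aesthetic threshold
        and stats.technical >= 0       # DOVER technical score not below 0
    )

def with_aesthetic_prompt(stats: ClipStats) -> str:
    """Prepend the aesthetic prompt for clips scoring above 6.25."""
    if stats.aesthetic > 6.25:
        return "A high-aesthetic scene, " + stats.caption
    return stats.caption

clip = ClipStats(motion=0.05, aesthetic=6.5, technical=0.2, caption="a boat drifts on a lake.")
print(keep_clip(clip), with_aesthetic_prompt(clip))
```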

#### Captioning

We used [QWen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) for video annotation.

```
Please describe the content of this video in as much detail as possible, 
including the objects, scenery, animals, characters, and camera movements within the video. 
Do not include '\n' in your response. 
Please start the description with the video content directly. 
Please describe the content of the video and the changes that occur, in chronological order.
```

However, the 7B model tends to generate certain prefixes, such as "This video" or "The video." We compiled a list of all irrelevant opening strings and removed them.

```
    'The video depicts ', 
    'The video captures ', 
    'In the video, ', 
    'The video showcases ', 
    'The video features ', 
    'The video is ', 
    'The video appears to be ', 
    'The video shows ', 
    'The video begins with ', 
    'The video displays ', 
    'The video begins in ', 
    'The video consists of ', 
    'The video opens with ', 
    'The video opens on ', 
    'The video appears to capture ', 
    'The video appears to show ', 
    "The video appears to depict ", 
    "The video opens in ", 
    "The video appears to focus closely on ", 
    "The video starts with ", 
    "The video begins inside ", 
    "The video presents ", 
    "The video takes place in ", 
    "The video appears to showcase ", 
    "The video appears to display ", 
    "The video appears to focus on ", 
    "The video appears to feature "
```
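
Removing these openings can be sketched as a simple prefix match. The list below is abbreviated to a few entries from the one above, and the helper name `strip_opening` is hypothetical. The sketch does not re-capitalize the remaining text, since the report does not say whether that is done.

```python
# Abbreviated copy of the prefix list above.
OPENINGS = [
    'The video depicts ',
    'In the video, ',
    'The video shows ',
    'The video appears to focus closely on ',
    'The video takes place in ',
]

def strip_opening(caption, openings=OPENINGS):
    """Remove the first matching opening string, if any."""
    # Try longer prefixes first so a longer match is never shadowed by a shorter one.
    for prefix in sorted(openings, key=len, reverse=True):
        if caption.startswith(prefix):
            return caption[len(prefix):]
    return caption

print(strip_opening("The video shows a dog running across a beach."))
```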

### Training Text-to-Video Diffusion Model

#### Framework

##### Skiparse (Skip-Sparse) Attention

In video generation models, alternating 2+1D spatial-temporal blocks is a commonly used approach, yet these models lack long-range modeling, limiting their performance ceiling. Consequently, models like  [CogVideoX](https://arxiv.org/abs/2408.06072), [Meta Movie Gen](https://ai.meta.com/research/movie-gen/), and Open-Sora Plan v1.2 employ **Full 3D Attention** as a denoiser, achieving substantially improved visual fidelity and motion quality compared to 2+1D models. This approach, however, requires calculating attention across all tokens in each clip encoding, which significantly raises training costs. For instance, Open-Sora Plan v1.2, training a 2.7-billion-parameter model, takes **100 seconds per step at 93x720p and over 15 seconds per step at 93x480p**, severely constraining scalability under limited computational resources.

To accelerate training while ensuring adequate performance, we propose the **Skiparse (Skip-Sparse) Attention** method. Specifically, under a fixed sparse ratio $$k$$, we organize candidate tokens for attention through two alternating skip-gather methods. This approach keeps the attention operation global while effectively reducing FLOPs, enabling faster training of 3D Attention models. In our experiments, applying Skiparse with sparse ratio $$k=4$$ to a 2.7B model reduced training time to **42 seconds per step at 93x720p and 8 seconds per step at 93x480p**.

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/186377ca-26b2-4f0f-af42-ae6c846eebcb" />
</figure>
</center>

**Skiparse DiT modifies only the Attention component** within the Transformer Block, using two alternating Skip Sparse Transformer Blocks. With sparse ratio $$k$$, the sequence length in the attention operation reduces to $$\frac{1}{k}$$ of the original, and batch size increases by $$k$$-fold, lowering the theoretical complexity of self-attention to $$\frac{1}{k}$$ of the original, while cross-attention complexity remains unchanged. Due to GPU/NPU parallel processing, increasing the batch size by $$k$$-fold does not linearly decrease speed to $$\frac{1}{k}$$, resulting in a performance boost that exceeds theoretical expectations.
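
The $$\frac{1}{k}$$ figure follows from the quadratic cost of self-attention in sequence length. A small arithmetic sketch, assuming the usual cost model of $$2BN^2d$$ multiply-adds for the two attention matrix products and a hypothetical head dimension `dim`:

```python
def self_attention_cost(batch, seq_len, dim):
    """Multiply-adds for QK^T and the attention-weighted sum over V."""
    return 2 * batch * seq_len ** 2 * dim

def skiparse_self_attention_cost(batch, seq_len, dim, k):
    """Sequence length shrinks to N / k while batch size grows k-fold."""
    assert seq_len % k == 0
    return self_attention_cost(batch * k, seq_len // k, dim)

seq_len = 24 * 32 * 32   # token count of a 24x32x32 latent
dim = 64                 # hypothetical value, cancels out in the ratio
k = 4
full_cost = self_attention_cost(1, seq_len, dim)
sparse_cost = skiparse_self_attention_cost(1, seq_len, dim, k)
print(sparse_cost / full_cost)
```

The ratio is $$k \cdot (N/k)^2 / N^2 = 1/k$$, independent of the batch size and head dimension.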

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/80f9470a-8afe-4588-a22c-e8c576fea9b6" />
</figure>
</center>

In Single Skip mode, the elements located at positions $$[0, k, 2k, 3k, ...]$$ ,  $$[1, k+1, 2k+1, 3k+1, ...]$$ , ..., $$[k-1, 2k-1, 3k-1, ...]$$ are grouped into the same scope (with each list forming one scope of elements). The figure above, using $$k=2$$ as an example, illustrates this organizational structure. This concept is straightforward, as each token performs attention with tokens spaced $$k-1$$ apart.
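
The Single Skip grouping can be written as a reshape. The sketch below assumes a flattened token sequence whose length is divisible by $$k$$ and uses hypothetical function names. It shows both the index pattern and the rearrangement of a `(B, N, C)` tensor into `k` times as many shorter sequences.

```python
import numpy as np

def single_skip_scopes(n, k):
    """Row r lists the token indices r, r + k, r + 2k, ... of one scope."""
    assert n % k == 0
    return np.arange(n).reshape(n // k, k).T

def single_skip_gather(x, k):
    """(B, N, C) -> (k * B, N // k, C): each scope becomes its own batch entry."""
    b, n, c = x.shape
    return x.reshape(b, n // k, k, c).transpose(2, 0, 1, 3).reshape(k * b, n // k, c)

def single_skip_scatter(y, k):
    """Inverse of `single_skip_gather`: restore the original token order."""
    kb, m, c = y.shape
    b = kb // k
    return y.reshape(k, b, m, c).transpose(1, 2, 0, 3).reshape(b, m * k, c)

print(single_skip_scopes(8, 2))

tokens = np.arange(2 * 8 * 1).reshape(2, 8, 1)
gathered = single_skip_gather(tokens, k=2)
restored = single_skip_scatter(gathered, k=2)
```

Attention is then applied to `gathered` as an ordinary batched operation, and `single_skip_scatter` puts the outputs back in sequence order.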

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/5880f667-7a06-4e7f-8e44-2e1cfb9209b8" />
</figure>
</center>

In Group Skip mode, elements at positions $$[(0, 1, ..., k-1), (k^2, k^2+1, ..., k^2+k-1), (2k^2, 2k^2+1, ..., 2k^2+k-1), ...]$$ , $$[(k, k+1, ..., 2k-1), (k^2+k, k^2+k+1, ..., k^2+2k-1), (2k^2+k, 2k^2+k+1, ..., 2k^2+2k-1), ...]$$ , ..., $$[(k^2-k, k^2-k+1, ..., k^2-1), (2k^2-k, 2k^2-k+1, ..., 2k^2-1), (3k^2-k, 3k^2-k+1, ..., 3k^2-1), ...]$$ are grouped together as a scope (with each list forming a scope). This arrangement may seem complex numerically, so the figure above may help in understanding it.

In this pattern, we first **group adjacent tokens** in segments of length $$k$$ , then **bundle these groups** with other groups that are spaced $$k-1$$ groups apart into a single scope. For example, in $$[(0, 1, ..., k-1), (k^2, k^2+1, ..., k^2+k-1), (2k^2, 2k^2+1, ..., 2k^2+k-1), ...]$$ , each set of indices in parentheses represents a group. Each group is then connected with another group that is offset by $$k-1$$ groups, forming one scope.

Since the last index of the first group is $$k-1$$ , the first token in the next group to be linked will be at index $$k-1+k(k-1)+1=k^2$$ . Following this pattern, you can determine the indices for each scope in this configuration.
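
The Group Skip indices can be generated the same way. A minimal sketch, assuming the ideal case where the sequence length is divisible by $$k^2$$ (no padding); `group_skip_scopes` is a hypothetical name:

```python
import numpy as np

def group_skip_scopes(n, k):
    """Row g lists the token indices of scope g in Group Skip mode."""
    assert n % (k * k) == 0
    # Axes: [block of k*k tokens, group within block, offset within group].
    idx = np.arange(n).reshape(n // (k * k), k, k)
    # Scope g collects group g from every block, i.e. groups g, g + k, g + 2k, ...
    return idx.transpose(1, 0, 2).reshape(k, n // k)

print(group_skip_scopes(16, 2))
print(group_skip_scopes(27, 3))
```

For $$k=3$$ the last scope starts with indices $$(6, 7, 8)$$, that is $$(k^2-k, ..., k^2-1)$$, and the second group of the first scope starts at index $$k^2 = 9$$.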

##### Why "Skiparse"?

The 2+1D DiT models temporal understanding only along the time axis of a single spatial location, theoretically and practically limiting performance. In real-world scenarios, changes at a specific spatial location are typically influenced not by prior content at that same location but by content across all spatial locations at preceding times. This constraint makes it challenging for 2+1D DiT to model complex physical dynamics accurately.

Full 3D Attention represents global attention, allowing any spatial position at any time to access information from any other position across all times, aligning well with real-world physical modeling. However, this approach is time-consuming and inefficient, as visual information often contains considerable redundancy, making it unnecessary to establish attention across all spatiotemporal tokens.

**An ideal spatiotemporal modeling approach should employ attention that minimizes the overhead from redundant visual information while capturing the complexities of the dynamic physical world**. Reducing redundancy requires avoiding connections among all tokens, yet global spatiotemporal attention remains essential for modeling complex physical interactions.

To achieve a balance between 2+1D efficiency and Full 3D’s strong spatiotemporal modeling, we developed Skiparse Attention.  This approach provides global spatiotemporal attention within each block, with each block having the same “receptive field”. The use of "group" operations also introduces a degree of locality, aligning well with visual tasks.

Interestingly, once you understand the Skiparse Attention mechanism, you’ll notice that **the attention in 2+1D DiT corresponds to a sparse ratio of $$k=HW$$  (since $$T \ll HW$$ , making the "skip" in Group Skip negligible), while Full 3D DiT corresponds to a sparse ratio of $$k=1$$.** In Skiparse Attention, $$k$$ is typically chosen to be close to 1, yet far smaller than $$HW$$ , making it a 3D Attention that approaches the effectiveness of Full 3D Attention.

In Skiparse Attention, Single Skip is a straightforward operation, easily understood by most. Within Group Skip, the Group operation is also intuitive, serving as a means to model local information. However, **Group Skip involves not only grouping but also skipping**—particularly between groups—which is often overlooked. This oversight frequently leads researchers to confuse Skiparse Attention with a Skip + Window Attention approach. The key difference lies in even-numbered blocks: Window Attention only groups tokens without skipping between groups. The distinctions among these attention methods are illustrated in the figure below, which shows the attention scopes for self-attention only, with dark tokens representing the tokens involved in each attention calculation.

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/62d0c75a-7e1d-458e-9faf-ae394e8ddd34" />
</figure>
</center>

To understand why nearly global attention is necessary and why Skiparse Attention theoretically approximates Full 3D Attention more closely than other common methods, we introduce the concept of **Average Attention Distance**. It is defined as follows: for any two tokens, if it takes $$m$$ attention operations to establish a connection between them, the attention distance is $$m$$ . The average attention distance of an attention method is then the mean of the attention distances across all token pairs within a tensor, representing that method's overall connectivity efficiency.

For example, in Full 3D Attention, any token can connect with any other token in just one attention operation, resulting in an average attention distance of 1.

In 2+1D Attention, the process is somewhat more complex, though still straightforward to understand. In all configurations above, any two different tokens can connect with an attention distance between 1 and 2 (Note that we define the attention distance between a token and itself as zero). Thus, for the other three attention methods, we can first identify which tokens have an attention distance of 1. Subsequently, tokens with an attention distance of 2 can be determined, allowing us to calculate the average attention distance.

In the $$2N$$ Block, attention operates over the $$(H, W)$$ dimensions, where tokens within this region have an attention distance of 1. In the $$2N+1$$ Block, attention operates along the $$(T)$$ dimension, also assigning an attention distance of 1 for these tokens. The total number of tokens with an attention distance of 1 in this case is $$HW + T - 2$$ (excluding the token itself, hence $$(HW + T - 1) - 1 = HW + T - 2$$).

Therefore, in 2+1D Attention, the average attention distance (AVG Attention Distance) is:

$$
\begin{aligned}
	d&=\frac{1}{THW}\left[ 1\times 0+\left( HW+T-2 \right) \times 1+\left[ THW-\left( HW+T-1 \right) \right] \times 2 \right]\\
	&=2-\left( \frac{1}{T}+\frac{1}{HW} \right)\\
\end{aligned}
$$

In Skip+Window Attention, aside from the token itself, there are $$\frac{THW}{k} - 1$$ tokens with an attention distance of 1 in the $$2N$$ Block, and $$k - 1$$ tokens with an attention distance of 1 in the $$2N+1$$ Block. Thus, the total number of tokens with an attention distance of 1 is $$\frac{THW}{k} + k - 2$$.

Therefore, in Skip+Window Attention, the average attention distance (AVG Attention Distance) is:

$$
\begin{aligned}
	d&=\frac{1}{THW}\left[ 1\times 0+\left( \frac{THW}{k}+k-2 \right) \times 1+\left[ THW-\left( \frac{THW}{k}+k-1 \right) \right] \times 2 \right]\\
	&=2-\left( \frac{1}{k}+\frac{k}{THW} \right)\\
\end{aligned}
$$

In Skiparse Attention, aside from the token itself, $$\frac{THW}{k} - 1$$ tokens have an attention distance of 1 in the $$2N$$ Block, and $$\frac{THW}{k} - 1$$ tokens have an attention distance of 1 in the $$2N+1$$ Block. Notably, $$\frac{THW}{k^2} - 1$$ tokens can establish an attention distance of 1 in both blocks and should not be counted twice.

Therefore, in Skiparse Attention, the average attention distance (AVG Attention Distance) is:

$$
\begin{aligned}
	d&=\frac{1}{THW}\left[ 1\times 0+\left[ \frac{2THW}{k}-2-\left( \frac{THW}{k^2}-1 \right) \right] \times 1+\left[ THW-\left( \frac{2THW}{k}-\frac{THW}{k^2} \right) \right] \times 2 \right]\\
	&=2-\frac{2}{k}+\frac{1}{k^2}-\frac{1}{THW}\\
	&\approx 2-\frac{2}{k}+\frac{1}{k^2}\quad \left( 1\ll THW \right)\\
\end{aligned}
$$

In fact, in the Group Skip of the $$2N+1$$ Block, the actual sequence length is $$k\lceil \frac{THW}{k^2} \rceil$$ rather than $$\frac{THW}{k}$$. The prior calculation assumes the ideal case where $$k \ll THW$$ and $$k$$ divides $$THW$$ exactly, yielding $$k\lceil \frac{THW}{k^2} \rceil = k \cdot \frac{THW}{k^2} = \frac{THW}{k}$$. In practical applications, excessively large $$k$$ values are typically avoided, making this derivation a reasonably accurate approximation for general use.
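
The closed form can be checked by brute force on a small sequence in the ideal case. The sketch below assumes that the sequence length $$n = THW$$ is divisible by $$k^2$$, that Single Skip scopes are defined by `i % k`, and that Group Skip scopes are defined by `(i // k) % k`, as described earlier. The function name is hypothetical.

```python
import numpy as np

def skiparse_avg_distance_bruteforce(n, k):
    """Average attention distance over all token pairs, self-pairs included."""
    assert n % (k * k) == 0
    idx = np.arange(n)
    same_single = (idx[:, None] % k) == (idx[None, :] % k)
    same_group = ((idx[:, None] // k) % k) == ((idx[None, :] // k) % k)
    one_hop = same_single | same_group
    # Two tokens are two hops apart if they share an intermediate token.
    two_hop = (one_hop.astype(int) @ one_hop.astype(int)) > 0
    dist = np.full((n, n), np.inf)
    dist[two_hop] = 2
    dist[one_hop] = 1
    np.fill_diagonal(dist, 0)
    assert np.isfinite(dist).all()   # every pair connects within two operations
    return float(dist.mean())

n, k = 64, 4
brute = skiparse_avg_distance_bruteforce(n, k)
closed_form = 2 - 2 / k + 1 / k ** 2 - 1 / n
print(brute, closed_form)
```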

Specifically, when $$k = HW$$ and padding is disregarded, since $$T \ll HW$$, group skip attention reduces to window attention with a window size of $$HW$$. Given that padding does not affect the final computation, Skiparse Attention is equivalent to 2+1D Attention when $$k = HW$$.

For the commonly used resolution of 93x512x512, using a causal VAE with a 4x8x8 compression rate and a DiT with a 1x2x2 patch embedding, we obtain a latent shape of 24x32x32 before applying attention. The AVG Attention Distance for different calculation methods would then be as follows:

|                        | Full 3D Attention | 2+1D  Attention |
| ---------------------- | ----------------- | --------------- |
| AVG Attention Distance | 1                 | 1.957           |

|                        | Skip + Window Attention(k=2) | Skip + Window Attention(k=4) | Skip + Window Attention(k=6) | Skip + Window Attention(k=8) |
| ---------------------- | ---------------------------- | ---------------------------- | ---------------------------- | ---------------------------- |
| AVG Attention Distance | 1.500                        | 1.750                        | 1.833                        | 1.875                        |

|                        | Skiparse Attention(k=2) | Skiparse Attention(k=4) | Skiparse Attention(k=6) | Skiparse Attention(k=8) |
| ---------------------- | ----------------------- | ----------------------- | ----------------------- | ----------------------- |
| AVG Attention Distance | 1.250                   | 1.563                   | 1.694                   | 1.766                   |
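
The values in these tables can be recomputed from the closed forms derived above. In this sketch, the temporal length is assumed to follow $$(93-1)/4+1$$ for the causal VAE, which reproduces the stated 24 latent frames. The Skiparse column uses the approximate form without the $$\frac{1}{THW}$$ term. Function names are hypothetical.

```python
def latent_shape(frames, height, width, vae=(4, 8, 8), patch=(1, 2, 2)):
    """Token grid after the causal VAE and patch embedding."""
    t = ((frames - 1) // vae[0] + 1) // patch[0]   # assumed causal temporal rule
    h = height // vae[1] // patch[1]
    w = width // vae[2] // patch[2]
    return t, h, w

def avg_dist_2p1d(t, hw):
    return 2 - (1 / t + 1 / hw)

def avg_dist_skip_window(n, k):
    return 2 - (1 / k + k / n)

def avg_dist_skiparse(k):
    # Approximation for 1 << THW, as in the last line of the derivation.
    return 2 - 2 / k + 1 / k ** 2

T, H, W = latent_shape(93, 512, 512)
N = T * H * W
print("latent shape:", (T, H, W))
print("2+1D:", f"{avg_dist_2p1d(T, H * W):.4f}")
for k in (2, 4, 6, 8):
    print(k, f"{avg_dist_skip_window(N, k):.4f}", f"{avg_dist_skiparse(k):.4f}")
```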

In 2+1D Attention, the average attention distance is 1.957, larger than that of Skip + Window Attention and Skiparse Attention at commonly used sparse ratios. While Skip + Window Attention achieves a shorter average attention distance than 2+1D Attention, its modeling capacity remains limited due to the locality of attention in its 2N+1 blocks. Skiparse Attention, which has the shortest average attention distance at any given sparse ratio, applies global attention in both 2N and 2N+1 blocks, making its spatiotemporal modeling capabilities closer to Full 3D Attention than the other two non-Full 3D methods.

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/80ca6d70-5033-454b-883f-11d12d140360" width=600/>
</figure>
</center>

The figure above shows how Skiparse Attention’s AVG Attention Distance changes with sparse ratio $$k$$.

We can summarize the characteristics of these attention types as follows:

|                                    | Full 3D Attention | 2+1D  Attention                  | Skip + Window Attention                 | Skiparse Attention                                           |
| ---------------------------------- | ----------------- | -------------------------------- | --------------------------------------- | ------------------------------------------------------------ |
| Speed                              | Slow              | Fast                             | Depending on $$k$$                    | Depending on $$k$$                                         |
| Spatiotemporal modeling capability | Strong            | Weak                             | Weak                                    | Approaches Full 3D                                           |
| Is attention global?               | Yes               | No                               | Half of the attention blocks are global | Yes                                                          |
| Computation load per block         | Equal             | Not Equal                        | Not Equal                               | Equal                                                        |
| AVG Attention Distance             | 1                 | $$2-(\frac{1}{T}+\frac{1}{HW})$$ | $$2-(\frac{1}{k}+\frac{k}{THW})$$       | $$2-\frac{2}{k}+\frac{1}{k^2},1<k\ll THW$$                   |

Considering both computational load and AVG Attention Distance, we select Skiparse with $$k = 4$$, replacing the first and last two blocks with Full 3D Attention to enhance performance.

Overall, we retained the architecture from version 1.2 but incorporated the Skiparse Attention module.


<center>
<figure>
	<img src="https://github.com/user-attachments/assets/af21a577-e5a8-46ac-8be4-0cd08cddb6c6" height=350 />
</figure>
</center>

#### Dynamic training

Overall, we maintained the bucket strategy from v1.2.0, pre-defining the shape of each video during training and aggregating data of the same shape through a sampler. Finally, the dataloader retrieves data based on our aggregated indices.

In our early implementation, we specified `--max_width`, `--max_height`, `--min_width`, and `--min_height`. While this allows for specifying arbitrary resolutions within a certain range, this approach can easily lead to OOM issues during video training. For instance, for a 720P (720×1280) video, if the maximum dimensions are set to 720, the video would be scaled to 405×720. However, if there are square videos with resolutions greater than 720, they would be scaled to 720×720. Most videos are non-square, and to prevent OOM, we need to reserve GPU memory for the rarer square case, which leads to significant computational waste. Therefore, we recommend using `--max_token` and `--min_token` to bound the token count instead, as this aligns better with the Transformer architecture.
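
The imbalance described above can be made concrete with a little arithmetic. This sketch is not the repository's bucketing code: `fit_to_max_side` mimics the side-length limit, while `fit_to_max_area` is a hypothetical area-based limit, used here because the token count is proportional to pixel area for a fixed VAE and patch size. The 320x320 budget is only an example value.

```python
import math

def fit_to_max_side(height, width, max_side):
    """Downscale so that the longer side is at most `max_side`."""
    scale = min(1.0, max_side / max(height, width))
    return round(height * scale), round(width * scale)

def fit_to_max_area(height, width, max_area):
    """Downscale so that height * width is at most `max_area` (hypothetical rule)."""
    scale = min(1.0, math.sqrt(max_area / (height * width)))
    return int(height * scale), int(width * scale)

wide = fit_to_max_side(720, 1280, 720)      # landscape 720P clip
square = fit_to_max_side(1080, 1080, 720)   # square clip larger than 720
print(wide, wide[0] * wide[1])
print(square, square[0] * square[1])

budget = 320 * 320
wide_area = fit_to_max_area(720, 1280, budget)
square_area = fit_to_max_area(1080, 1080, budget)
print(wide_area, square_area)
```

With the side-length limit, the square clip carries roughly 1.8 times as many pixels as the landscape clip, so memory must be reserved for the square case. With an area limit, both stay within the same budget.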

#### Training scheduler

We replaced the eps-pred loss with the v-pred loss and enabled ZeroSNR. Videos are resampled to 16 FPS for training.
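
For readers unfamiliar with these two terms, the sketch below shows one widely used formulation: rescaling a beta schedule so that the final step has zero signal-to-noise ratio, and the v-prediction target $$v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,x_0$$. The report does not give its exact schedule or code, so the linear beta schedule, the 1000 steps, and the function names are assumptions for illustration only.

```python
import numpy as np

def rescale_zero_terminal_snr(betas):
    """Shift and scale sqrt(alpha_bar) so that the last step has zero SNR."""
    alphas_bar_sqrt = np.sqrt(np.cumprod(1.0 - betas))
    first, last = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
    # Move the last value to 0 while keeping the first value unchanged.
    alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = np.concatenate([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

def v_target(x0, eps, alpha_bar_t):
    """Velocity target regressed by a v-prediction model."""
    return np.sqrt(alpha_bar_t) * eps - np.sqrt(1.0 - alpha_bar_t) * x0

betas = np.linspace(1e-4, 0.02, 1000)        # assumed schedule
new_betas = rescale_zero_terminal_snr(betas)
new_alphas_bar = np.cumprod(1.0 - new_betas)
print(new_alphas_bar[-1])                    # terminal alpha_bar after rescaling

x0 = np.array([1.0, -2.0])
eps = np.array([0.5, 0.5])
print(v_target(x0, eps, new_alphas_bar[-1]))
```

At the terminal step the input is pure noise, so predicting the noise carries no information about the sample. The v target equals $$-x_0$$ there, which is why zero terminal SNR is normally paired with v-prediction.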

**Stage 1**: We initialized from the image weights of version 1.2.0 and trained on images at a resolution of 1x320x320. The objective of this phase was to fine-tune the 3D dense attention model into a sparse attention model. The entire fine-tuning process involved approximately 100k steps, with a batch size of 1024 and a learning rate of 2e-5. The image data was primarily sourced from SAM in version 1.2.0.


**Stage 2**: We trained the model jointly on images and videos, with a maximum resolution of 93x320x320. The entire fine-tuning process involved approximately 300k steps, with a batch size of 1024 and a learning rate of 2e-5. The image data was primarily sourced from SAM in version 1.2.0, while the video data consisted of the unfiltered Panda70m. In fact, the model had nearly converged around 100k steps, and by 300k steps, there were no significant gains. Subsequently, we performed data cleaning and caption rewriting, with further data analysis discussed at the end.

**Stage 3**: We fine-tuned the model using our filtered Panda70m dataset, with a fixed resolution of 93x352x640. The entire fine-tuning process involved approximately 30k steps, with a batch size of 1024 and a learning rate of 1e-5.

### Training Image-to-Video Diffusion Model

#### Framework

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/41e22292-8d8b-469e-940a-6e5ae00bf620" />
</figure>
</center>

In terms of framework, Open-Sora Plan v1.3 continues to use the Inpainting model architecture from Open-Sora Plan v1.2.

#### Data processing

For data processing, Open-Sora Plan v1.3 introduces two new mask types: an all-1 mask and an all-0 mask. This brings the total number of mask types in the Inpainting Model to six.

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/f31b222e-811c-49b9-839c-f72fb85c4ee4" />
</figure>
</center>

In the figure above, black indicates retained frames, while white denotes discarded frames. The corresponding frame strategies are as follows:

- **Clear**: Retain all frames.
- **T2V**: Discard all frames.
- **I2V**: Retain only the first frame; discard the rest.
- **Transition**: Retain only the first and last frames; discard the rest.
- **Continuation**: Retain the first $$n$$ frames; discard the rest.
- **Random**: Retain $$n$$ randomly selected frames; discard the rest.
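
The six strategies can be sketched as a function that returns a per-frame boolean array. To avoid guessing which of the all-1 and all-0 masks corresponds to which strategy, the sketch uses `True` for a retained frame and `False` for a discarded one. The function name and signature are hypothetical.

```python
import numpy as np

def keep_frames(kind, num_frames, n=1, rng=None):
    """Boolean array over frames: True = retained, False = discarded."""
    keep = np.zeros(num_frames, dtype=bool)
    if kind == "clear":
        keep[:] = True
    elif kind == "t2v":
        pass                                   # nothing retained
    elif kind == "i2v":
        keep[0] = True
    elif kind == "transition":
        keep[0] = True
        keep[-1] = True
    elif kind == "continuation":
        keep[:n] = True
    elif kind == "random":
        if rng is None:
            rng = np.random.default_rng()
        keep[rng.choice(num_frames, size=n, replace=False)] = True
    else:
        raise ValueError(f"unknown mask type: {kind}")
    return keep

for kind in ("clear", "t2v", "i2v", "transition", "continuation", "random"):
    print(kind, keep_frames(kind, 8, n=3, rng=np.random.default_rng(0)).astype(int))
```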

#### Progressive training

The Open-Sora Plan v1.3 uses more data for training and employs a progressive training approach to help the model understand frame-based inpainting tasks.

Since the Inpainting Model supports various mask inputs, different mask inputs correspond to tasks of varying difficulty levels. Therefore, we can first teach the model simple tasks, such as random masks, allowing it to develop a basic capability for frame-based inpainting before gradually increasing the proportion of more challenging tasks. It is important to note that at different training stages, we ensure that at least 5% of the data the model sees pertains to T2V tasks, which is aimed at enhancing the model's understanding of prompts.

The model weights are initialized from the T2V model with zero initialization. The batch size is fixed at 256, and the learning rate is set to 1e-5, using a two-stage training approach.

**Stage 1**: Any resolution and duration within 93x102400 (320x320), using unfiltered motion and aesthetic low-quality data:

(1) Step 1: t2v 10%, continuation 40%, random mask 40%, clear 10%. Ensure that at least 50% of the frames are retained during continuation and random mask, training with 4 million samples.

(2) Step 2: t2v 10%, continuation 40%, random mask 40%, clear 10%. Ensure that at least 25% of the frames are retained during continuation and random mask, training with 4 million samples.

(3) Step 3: t2v 10%, continuation 40%, random mask 40%, clear 10%. Ensure that at least 12.5% of the frames are retained during continuation and random mask, training with 4 million samples.

(4) Step 4: t2v 10%, continuation 25%, random mask 60%, clear 5%. Ensure that at least 12.5% of the frames are retained during continuation and random mask, training with 4 million samples.

(5) Step 5: t2v 10%, continuation 25%, random mask 60%, clear 5%, training with 8 million samples.

(6) Step 6: t2v 10%, continuation 10%, random mask 20%, i2v 40%, transition 20%, training with 16 million samples.

(7) Step 7: t2v 5%, continuation 5%, random mask 10%, i2v 40%, transition 40%, training with 10 million samples.

**Stage 2:** Any resolution and duration within 93x236544 (e.g., 480x480, 640x352, 352x640), using filtered motion and aesthetic high-quality data:

t2v 5%, continuation 5%, random mask 10%, i2v 40%, transition 40%, training with 15 million samples.
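
The Stage 1 schedule can be written down as data and sampled from, as in the sketch below. The task mixes and minimum retained-frame ratios are copied from the steps above; the sampling function, its name, and the uniform choice of the retained-frame count `n` are assumptions made for illustration.

```python
import math
import random

# (task mix, minimum retained-frame ratio for continuation/random; None if not stated)
STAGE1_STEPS = [
    ({"t2v": 0.10, "continuation": 0.40, "random": 0.40, "clear": 0.10}, 0.5),
    ({"t2v": 0.10, "continuation": 0.40, "random": 0.40, "clear": 0.10}, 0.25),
    ({"t2v": 0.10, "continuation": 0.40, "random": 0.40, "clear": 0.10}, 0.125),
    ({"t2v": 0.10, "continuation": 0.25, "random": 0.60, "clear": 0.05}, 0.125),
    ({"t2v": 0.10, "continuation": 0.25, "random": 0.60, "clear": 0.05}, None),
    ({"t2v": 0.10, "continuation": 0.10, "random": 0.20, "i2v": 0.40, "transition": 0.20}, None),
    ({"t2v": 0.05, "continuation": 0.05, "random": 0.10, "i2v": 0.40, "transition": 0.40}, None),
]

def sample_task(step_index, num_frames, rng):
    """Pick a mask type for one sample and, where relevant, how many frames to retain."""
    mix, min_keep = STAGE1_STEPS[step_index]
    task = rng.choices(list(mix), weights=list(mix.values()), k=1)[0]
    n = None
    if task in ("continuation", "random"):
        # Assumption: n is drawn uniformly between the lower bound and num_frames - 1.
        lowest = 1 if min_keep is None else math.ceil(min_keep * num_frames)
        n = rng.randint(lowest, num_frames - 1)
    return task, n
```

Every mix keeps the T2V share at 5% or more, which matches the constraint stated earlier in this section.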

#### About the Semantic Adapter

We conducted further experiments on the Semantic Adapter module and compared the video quality of Image-to-Video under various Image Encoders, including [Clip](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [Dino v2](https://huggingface.co/timm/vit_large_patch14_dinov2.lvd142m). We also attempted strategies such as directly injecting image embeddings into cross-attention or extracting features from the Image Encoder using Qformer before injecting them into cross-attention. 

Under various strategies, we did not observe significant performance improvements; the impact on video quality was much smaller than that of the dataset. Therefore, we decided not to include the Semantic Adapter in Open-Sora Plan v1.3.

#### Noise Injection Strategy for Conditional Images

Research such as [CogVideoX](https://arxiv.org/abs/2408.06072) and [Stable Video Diffusion](https://stability.ai/stable-video) has indicated that adding a certain amount of noise to conditional images can enhance the generalization capability of I2V models and achieve a greater range of motion. We therefore adopt this strategy in Open-Sora Plan v1.3, following the approach of [CogVideoX](https://arxiv.org/abs/2408.06072).
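
A minimal sketch of this kind of noise injection is shown below. The log-normal noise scale and its default parameters (`log_sigma_mean=-3.0`, `log_sigma_std=0.5`) are assumptions chosen for illustration and are not confirmed by this report; the image is represented as a flat list of pixel values to keep the example dependency-free.

```python
import math
import random

def noise_conditional_image(pixels, rng, log_sigma_mean=-3.0, log_sigma_std=0.5):
    """Add Gaussian noise with a randomly drawn scale to a conditional image.

    The scale sigma is sampled log-normally (assumed parameters), so most
    samples receive only light noise while a few receive noticeably more.
    """
    sigma = math.exp(rng.gauss(log_sigma_mean, log_sigma_std))
    noisy = [p + sigma * rng.gauss(0.0, 1.0) for p in pixels]
    return noisy, sigma
```

In training, the noised image would replace the clean conditional frame; at inference the clean image can be used directly.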

### The implementation of Skiparse Attention

Skiparse is theoretically easy to understand and straightforward to implement. Its implementation mainly relies on the rearrange operation, which reduces the sequence length of latents before entering `F.scaled_dot_product_attention()`. Aside from this adjustment, no other modifications are made. For simplicity, the following discussion focuses solely on the self-attention part, excluding the attention mask.

The pseudocode implementation of Single Skip is as follows:

```python
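# Pseudocode: Q, K, V (projections) and add_rope are placeholders.
# x.shape: (B, N, C); sparse_k must divide N
from einops import rearrange
import torch.nn.functional as F

def single_skip_rearrange(x, sparse_k):
    # Fold the k skip sub-sequences into the batch dimension
    return rearrange(x, 'b (g k) d -> (k b) g d', k=sparse_k)

def reverse_sparse(x, sparse_k):
    # Restore the original token order
    return rearrange(x, '(k b) g d -> b (g k) d', k=sparse_k)

q, k, v = Q(x), K(x), V(x)
q = add_rope(q)
k = add_rope(k)
q = single_skip_rearrange(q, sparse_k)
k = single_skip_rearrange(k, sparse_k)
v = single_skip_rearrange(v, sparse_k)
hidden_states = F.scaled_dot_product_attention(q, k, v)
output = reverse_sparse(hidden_states, sparse_k)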
```

The core of the Skiparse operation lies in "rearranging the sequence", which corresponds to the Single Skip operation in the pseudocode:

```python
rearrange(x, 'b (g k) d -> (k b) g d', k=sparse_k)
```

This operation can be understood as a combination of a reshape and a transpose operation:

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/e42c4dd5-ee95-42a8-b8c6-8cb803cd7e12" height=300/>
</figure>
</center>

In this way, $$k$$ sub-sequences can be created, and $$k$$ can be moved to the batch dimension, allowing the Attention mechanism to compute the sub-sequences in parallel.
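
To make the reshape-plus-transpose view concrete, the sketch below applies Single Skip to a plain list of token indices for a single sample (batch size 1). The function names are hypothetical; the index mapping follows the `'b (g k) d -> (k b) g d'` pattern above.

```python
def single_skip(tokens, k):
    """Split a sequence into k sub-sequences; token i lands in sub-sequence i % k."""
    assert len(tokens) % k == 0
    g = len(tokens) // k
    rows = [tokens[i * k:(i + 1) * k] for i in range(g)]   # reshape to (g, k)
    return [[row[j] for row in rows] for j in range(k)]     # transpose to (k, g)

def reverse_single_skip(subseqs):
    """Interleave the sub-sequences back into the original order."""
    k, g = len(subseqs), len(subseqs[0])
    return [subseqs[j][i] for i in range(g) for j in range(k)]
```

With six tokens and `k=2`, `single_skip(list(range(6)), 2)` yields `[[0, 2, 4], [1, 3, 5]]`, and `reverse_single_skip` restores `[0, 1, 2, 3, 4, 5]`.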

Understanding Single Skip makes Group Skip easy to comprehend as well; it simply adds a grouping operation before the Skip. Its pseudocode is as follows:

```python
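# Pseudocode: Q, K, V (projections) and add_rope are placeholders.
# x.shape: (B, N, C); sparse_k ** 2 must divide N
from einops import rearrange
import torch.nn.functional as F

def group_skip_rearrange(x, sparse_k):
    # Group every k tokens, then skip over groups; fold the k sub-sequences into the batch dimension
    return rearrange(x, 'b (n m k) d -> (m b) (n k) d', m=sparse_k, k=sparse_k)

def reverse_sparse(x, sparse_k):
    # Restore the original token order
    return rearrange(x, '(m b) (n k) d -> b (n m k) d', m=sparse_k, k=sparse_k)

q, k, v = Q(x), K(x), V(x)
q = add_rope(q)
k = add_rope(k)
q = group_skip_rearrange(q, sparse_k)
k = group_skip_rearrange(k, sparse_k)
v = group_skip_rearrange(v, sparse_k)
hidden_states = F.scaled_dot_product_attention(q, k, v)
output = reverse_sparse(hidden_states, sparse_k)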
```

Every $$k^2$$ tokens form a repetition, and every $$k$$ tokens form a group. To help everyone better understand this operation, the following figure illustrates the situation when $$k=3$$:

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/5e777862-d03c-4c7e-8ffc-1e1e9234b84e"/>
</figure>
</center>
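
The same index-level view works for Group Skip. In the sketch below (hypothetical function names, batch size 1), token `i` is assigned to sub-sequence `(i // k) % k`, which is what the `'b (n m k) d -> (m b) (n k) d'` pattern does.

```python
def group_skip(tokens, k):
    """Token i goes to sub-sequence (i // k) % k; order inside a sub-sequence is preserved."""
    assert len(tokens) % (k * k) == 0
    subseqs = [[] for _ in range(k)]
    for i, tok in enumerate(tokens):
        subseqs[(i // k) % k].append(tok)
    return subseqs

def reverse_group_skip(subseqs):
    """Re-interleave groups of k tokens back into the original order."""
    k = len(subseqs)
    repetitions = len(subseqs[0]) // k   # number of k*k-token repetitions
    out = []
    for rep in range(repetitions):
        for m in range(k):
            out.extend(subseqs[m][rep * k:(rep + 1) * k])
    return out
```

For 18 tokens and `k=3`, the three sub-sequences are `[0, 1, 2, 9, 10, 11]`, `[3, 4, 5, 12, 13, 14]`, and `[6, 7, 8, 15, 16, 17]`.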

It is important to note that RoPE must be applied before the Skiparse operation and cannot be applied after it, because the rearranged sequence no longer preserves the original token positions.


## Future Work and Discussion

### CausalVideoVAE
For videos, increasing the compression ratio while maintaining the original latent dimension leads to significant information loss. Therefore, it is a trend to increase the latent dimension to achieve higher compression ratios. A more advanced VAE will be released in the next version.

### Diffusion Model
The current 2B model in version 1.3.0 shows performance saturation during the later stages of training. However, it does not perform well in understanding physical laws (e.g., a cup overflowing with milk, a car moving forward, or a person walking). We have 4 hypotheses regarding this issue:

#### The current data domain is too narrow.

We randomly sampled 2,000 videos from Panda70m and conducted manual verification, finding that less than 1% featured cars in motion, and there were even fewer than 10 videos of people walking. Approximately 80% of the videos consist of half-body conversations with multiple people in front of the camera. Therefore, we speculate that the narrow data domain of Panda70m restricts the model's ability to generate many scenarios. We plan to collect more data in the next version.

#### Joint training of images and videos

Models such as [Open-Sora v1.2](https://github.com/hpcaitech/Open-Sora), [EasyAnimate v4](https://github.com/aigc-apps/EasyAnimate), and [Vchitect-2.0](https://github.com/Vchitect/Vchitect-2.0) can easily generate high-visual-quality videos, possibly because they directly inherit image weights ([Pixart-Sigma](https://pixart-alpha.github.io/PixArt-sigma-project/), [HunyuanDiT](https://github.com/Tencent/HunyuanDiT), [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers)). They train the model with a small amount of video data to learn how to flow along the time axis based on 2D images. In contrast, we trained the image model from scratch with only 10M-level data, which is far from sufficient. We have two hypotheses regarding the training strategy: (1) start joint training from scratch, with images significantly outnumbering videos; (2) first train a high-quality image model and then use joint training, with a higher proportion of videos at that stage. Considering the learning path and training costs, the second approach may offer more decoupling, while the first aligns better with scaling laws.

#### The model still needs to scale

By observing the differences between [CogVideoX-2B](https://github.com/THUDM/CogVideo) and its 5B variant, we can clearly see that the 5B model understands more physical laws than the 2B model. We speculate that instead of spending excessive effort designing for smaller models, it may be more effective to leverage scaling laws to solve these issues. In the next version, we will scale up the model to explore the boundaries of video generation.

We currently have two plans: one is to continue using the Deepspeed/FSDP approach, sharding the EMA and text encoder across ranks with Zero3, which is sufficient for training 10-15B models. The other is to adopt [MindSpeed](https://gitee.com/ascend/MindSpeed) for various parallel strategies, enabling us to scale the model up to 30B.

#### Supervised loss in training

Whether flow-based models are more suitable than v-pred models remains uncertain and requires further ablation studies to determine.

### How else can "Skiparse" skip?

The sparse method we use is theoretically and practically straightforward; however, its implementation treats the original video data purely as a one-dimensional sequence, neglecting the 2D spatial priors. Thus, we extended Skiparse to create Skiparse-2D, which is better suited for 2D Visuals.

<center>
<figure>
	<img src="https://github.com/user-attachments/assets/44bd5284-b4c0-4a9d-9f2e-5acbb2e3450f" height=500/>
</figure>
</center>

In Skiparse-2D, a sparse ratio of $$k$$ represents the sparsity along the $$h$$ or $$w$$ direction. In terms of the number of tokens involved in attention computation, it is equivalent to the square of the sparse ratio in Skiparse-1D.
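
The token-count equivalence can be checked with a small sketch. It assumes that Skiparse-2D with ratio $$k$$ assigns the token at grid position `(y, x)` to the sub-grid `(y % k, x % k)`; this reading of the figure, and the function names, are assumptions for illustration.

```python
def skip_2d(height, width, k):
    """Group grid positions into k*k sub-grids by skipping along both h and w."""
    assert height % k == 0 and width % k == 0
    groups = {}
    for y in range(height):
        for x in range(width):
            groups.setdefault((y % k, x % k), []).append((y, x))
    return groups

def tokens_per_attention_1d(num_tokens, k):
    """Tokens attended together under Skiparse-1D with sparse ratio k."""
    return num_tokens // k
```

On a 4×6 grid with a 2D ratio of 2, each of the four sub-grids holds 6 tokens, the same as Skiparse-1D with ratio 4 on the flattened 24-token sequence.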

We conducted basic experiments comparing Skiparse-1D and Skiparse-2D. Under identical experimental settings, Skiparse-2D showed no improvement over Skiparse-1D in terms of loss or sampling results. Additionally, Skiparse-2D is less flexible to implement than Skiparse-1D. Therefore, we opted to use the Skiparse-1D approach for training in Open-Sora Plan v1.3.

Nevertheless, given our limited experimentation, the feasibility of Skiparse-2D remains worth exploring. Intuitively, Skiparse-2D better aligns with the spatial characteristics of visuals, and as sparse ratio $$k$$ increases, its approach intuitively approximates that of 2+1D. We therefore encourage interested researchers in the community to pursue further exploration in this area.


================================================
FILE: docs/Report-v1.5.0.md
================================================
## Report v1.5.0

In October 2024, we released Open-Sora Plan v1.3.0, introducing the sparse attention structure, Skiparse Attention, to the field of video generation for the first time. Additionally, we adopted the efficient WFVAE, significantly reducing encoding time and memory usage during training.

In Open-Sora Plan v1.5.0, we introduce several key updates to enhance the framework:

1、Improved Sparse DiT, SUV. Building on Skiparse Attention, we extend sparse DiT into a U-shaped sparse structure. This design preserves speed advantages while enabling sparse DiT to achieve performance comparable to dense DiT.

2、Higher-compression WFVAE. In Open-Sora Plan v1.5.0, we explore a WFVAE with an 8×8×8 downsampling rate. Its performance is comparable to that of the 4×8×8 VAEs widely adopted in the community, while the latent shape is halved and the attention sequence length is shortened.

3、Data and model scaling. In Open-Sora Plan v1.5.0, we collect 1.1 billion high-quality images and 40 million high-quality videos. The model is scaled up to 8.5 billion parameters, resulting in strong overall performance.

4、Simplified Adaptive Gradient Clipping strategy. Compared to the more complex batch-dropping method in version 1.3.0, version 1.5.0 maintains a simple adaptive gradient norm threshold for clipping, making it more compatible with various parallel training strategies.

Open-Sora Plan v1.5.0 is trained and run for inference entirely on Ascend 910-series accelerators, using the mindspeed-mm framework to support parallel training strategies.

### Open-Source Release

Open-Sora Plan v1.5.0 is open-sourced with the following components:

1、All training and inference code. You can also find the implementation of Open-Sora Plan v1.5.0 in the official [MindSpeed-MM](https://gitee.com/ascend/MindSpeed-MM) repository.

2、The WFVAE weights with 8×8×8 compression, along with the 8.5B SUV denoiser weights.

## Detailed Technical Report

### Data collection and processing

Our dataset includes 1.1B images from [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B), [COYO-700M](https://github.com/kakaobrain/coyo-dataset), and [LAION-Aesthetics](https://laion.ai/blog/laion-aesthetics/), with no filtering applied aside from resolution checks. The video data are drawn from [Panda-70M](https://github.com/snap-research/Panda-70M) and internal sources, and filtered using the same protocol as in Open-Sora Plan v1.3.0, yielding 40M high-quality videos.

### Adaptive Grad Clipping

In Open-Sora Plan v1.3.0, we introduced an Adaptive Grad Clipping strategy based on discarding gradient-abnormal batches. While highly stable, this method involves overly complex execution logic. In Open-Sora Plan v1.5.0, we optimize the strategy by maintaining the gradient norm threshold via an exponential moving average (EMA). Gradients exceeding the threshold are clipped accordingly. This approach effectively extends the fixed threshold of 1.0, which is commonly used in large-scale models, into a dynamic, training-dependent threshold.

```python
'''
	moving_avg_max_grad_norm: the maximum gradient norm maintained via EMA
	moving_avg_max_grad_norm_var: the variance of the maximum gradient norm maintained via EMA
	clip_threshold: the gradient clipping threshold computed using the 3-sigma rule
	ema_decay: the EMA decay coefficient, typically set to 0.99.
	grad_norm: grad norm at the current step 
'''
clip_threshold = moving_avg_max_grad_norm + 3.0 * (moving_avg_max_grad_norm_var ** 0.5)
if grad_norm <= clip_threshold:
    # If the gradient norm is below the clipping threshold, the parameters are updated normally at this step, and both the moving_avg_max_grad_norm and moving_avg_max_grad_norm_var are updated accordingly.
    moving_avg_max_grad_norm = ema_decay * moving_avg_max_grad_norm + (1 - ema_decay) * grad_norm
    max_grad_norm_var = (moving_avg_max_grad_norm - grad_norm) ** 2
    moving_avg_max_grad_norm_var = ema_decay * moving_avg_max_grad_norm_var + (1 - ema_decay) * max_grad_norm_var
    # update weights...
else:
    # If the gradient norm exceeds the clipping threshold, the gradients are first clipped to reduce the norm to the threshold value before updating the parameters.
    clip_coef = clip_threshold / grad_norm # scaling factor, smaller than 1 in this branch
    grads = grads * clip_coef # scale grads so that their norm equals clip_threshold
    # update weights...
```

Compared to the strategy in v1.3.0, this approach is simpler to implement and effectively addresses the issue of loss spikes that occur in the later stages of diffusion training when the gradient norm is significantly below 1.0.
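
For readers who want something executable, the logic above can be wrapped in a small stateful helper. This is a sketch rather than the released implementation: the class name, the initial statistics (`init_norm`, `init_var`), and the convention of returning a scale factor to multiply the gradients by are assumptions.

```python
class AdaptiveGradClipper:
    """Tracks an EMA of the gradient norm and its variance; clips at mean + 3 sigma."""

    def __init__(self, init_norm=1.0, init_var=0.0, ema_decay=0.99, num_sigma=3.0):
        self.avg = init_norm        # EMA of the gradient norm (assumed initial value)
        self.var = init_var         # EMA of its squared deviation (assumed initial value)
        self.ema_decay = ema_decay
        self.num_sigma = num_sigma

    def step(self, grad_norm):
        """Return the factor to multiply gradients by (1.0 when no clipping is needed)."""
        threshold = self.avg + self.num_sigma * self.var ** 0.5
        if grad_norm <= threshold:
            # Normal step: update the running statistics, leave gradients untouched.
            self.avg = self.ema_decay * self.avg + (1 - self.ema_decay) * grad_norm
            deviation = (self.avg - grad_norm) ** 2
            self.var = self.ema_decay * self.var + (1 - self.ema_decay) * deviation
            return 1.0
        # Abnormal step: shrink gradients to the threshold; statistics stay unchanged.
        return threshold / grad_norm
```

In a training loop, the returned factor would be applied to every gradient tensor before the optimizer step. Because spikes do not update the statistics, a single outlier does not raise the threshold for later steps.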

### WFVAE with 8x8x8 compression

In version 1.5.0, we increase the temporal compression rate of the VAE from 4× to 8×, reducing the latent shape to half that of the previous version. This enables the generation of videos with higher frame counts.
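
The effect on the latent shape can be illustrated with a short calculation. It assumes a causal video VAE in which the first frame is encoded on its own, so that the latent frame count is `(frames - 1) // t_down + 1`; that formula and the function name are assumptions, not stated in this section.

```python
def latent_shape(frames, height, width, t_down, s_down=8):
    """Latent (T, H, W) for a causal video VAE (assumed formula for T)."""
    return ((frames - 1) // t_down + 1, height // s_down, width // s_down)

# 121x576x1024 video: 8x8x8 versus 4x8x8 compression
shape_8x = latent_shape(121, 576, 1024, t_down=8)
shape_4x = latent_shape(121, 576, 1024, t_down=4)
```

Under this assumption a 121-frame clip yields 16 latent frames with 8× temporal compression versus 31 with 4×, which is roughly the halving described above.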

| Model             | THW(C)        | PSNR         | LPIPS         | rFVD         |
| ----------------- | ------------- | ------------ | ------------- | ------------ |
| CogVideoX         | 4x8x8 (16)    | <u>36.38</u> | 0.0243        | <u>50.33</u> |
| StepVideo         | 8x16x16 (16)  | 33.61        | 0.0337        | 113.68       |
| LTXVideo          | 8x32x32 (128) | 33.84        | 0.0380        | 150.87       |
| Wan2.1            | 4x8x8 (16)    | 35.77        | **0.0197**    | **46.05**    |
| Ours (WF-VAE-M)   | 8x8x8 (32)    | **36.91**    | <u>0.0205</u> | 52.53        |

**Test on an open-domain dataset with 1K samples.**

For more details on WFVAE, please refer to [WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model](https://arxiv.org/abs/2411.17459).

### Training Text-to-Video Diffusion Model

#### Framework - SUV: A Sparse U-shaped Diffusion Transformer For Fast Video Generation

In Open-Sora Plan v1.3.0, we discuss the strengths and weaknesses of Full 3D Attention and 2+1D Attention. Based on their characteristics, we propose Skiparse Attention, a novel global sparse attention mechanism.

Under a predefined sparsity $k$, Skiparse Attention selects a subsequence of length $\frac{1}{k}$ of the original sequence in an alternating Single-Skip and Group-Skip pattern for attention interaction. This design approximates the effect of Full 3D Attention. As the sparsity increases, the selected positions become more widely spaced; as it decreases, the positions become more concentrated. Regardless of the sparsity, Skiparse Attention remains global.

In Open-Sora Plan v1.5.0, we interpret this sparse interaction pattern as a form of token-level information downsampling. Sparser Skiparse Attention performs more semantic-level interactions, while denser Skiparse Attention captures fine-grained information. Following the multi-scale design principle in neural networks, we introduce Skiparse Attention with U-shaped sparsity variation: low-sparsity Skiparse Attention is used in shallow layers, with Full 3D Attention applied at the shallowest layer, and high-sparsity Skiparse Attention in deeper layers. Inspired by the UNet architecture, we further incorporate long skip connections between stages with identical sparsity. This U-shaped DiT architecture based on Skiparse Attention is referred to as **SUV**.

![SUV](https://github.com/user-attachments/assets/6eb54e37-7077-4746-a4c6-9b7165dd48fe)

In Open-Sora Plan v1.5.0, we adopt an SUV architecture based on MMDiT. Skiparse Attention is applied to the video latents, while the text embeddings are only repeated to align with the skiparse-processed latent shape, without any sparsification.

The SUV architecture offers the following advantages:

1、SUV is the first sparsification method proven effective for video generation. Our ablation studies show that it achieves performance comparable to dense DiT within approximately the same number of training steps. Moreover, it can be applied during both pretraining and inference. Tests on the Ascend 910B platform at a shape of 121×576×1024 show that SUV runs over 35% faster than dense DiT, with the attention operation alone more than 45% faster.

2、Unlike UNet structures that explicitly downsample feature maps and cause information loss, the U-shaped structure of SUV operates on attention. The shape of the feature map remains unchanged, preserving information while altering only the granularity of token-level interactions.

3、Skiparse Attention and SUV only change the attention computation during the forward pass instead of modifying model weights. This allows dynamic adjustment of sparsity throughout training: lower sparsity for image or low-resolution video training, and higher sparsity for high-resolution video training. As a result, attention FLOPs grow approximately linearly with sequence length.
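
The scaling argument in the last point can be checked with a rough count of query-key pairs. The sketch below ignores projections, MLPs, and text tokens, and the function name is hypothetical: with sparsity $k$, a sequence of length $N$ is split into $k$ sub-sequences of length $N/k$, giving $k \cdot (N/k)^2 = N^2/k$ pairs.

```python
def attention_pairs(seq_len, sparsity):
    """Number of query-key pairs when the sequence is split into `sparsity` sub-sequences."""
    sub_len = seq_len // sparsity
    return sparsity * sub_len * sub_len

dense_short = attention_pairs(1024, 1)    # dense attention, short sequence
dense_long = attention_pairs(2048, 1)     # dense attention, sequence doubled
sparse_long = attention_pairs(2048, 2)    # sequence doubled and sparsity doubled
```

Doubling the sequence length quadruples the dense cost, but if the sparsity is doubled along with it the cost only doubles, which is the approximately linear growth described above.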


A more detailed analysis of the SUV architecture will be released in a future arXiv update.

#### Training Stage

Our training consists of two stages: Text-to-Image and Text-to-Video.

#### Text-to-Image

Previous studies have shown that image weights trained on synthetic data may negatively impact video training. Therefore, in the v1.5.0 update, we choose to train image weights using a much larger corpus of real-world data, totaling 1.1B images. Since image data come in various resolutions, whereas videos are primarily in a 9:16 aspect ratio, we adopt multi-resolution training for images using five common aspect ratios—(1,1), (3,4), (4,3), (9,16), and (16,9)—along with the Min-Max Token Strategy. In contrast, video training is conducted using a fixed 9:16 resolution.

The difference between Skiparse Attention and Full Attention lies in the token sequences involved in the forward computation; the required weights remain identical. Therefore, we can first train the model using Dense MMDiT with Full 3D Attention, and then fine-tune it to the Sparse MMDiT mode after sufficient training.

**Image-Stage-1:** Training is conducted using 512 Ascend 910B accelerators. We train a randomly initialized Dense MMDiT on 256²-pixel images with multi-resolution enabled. The learning rate is set to 1e-4, with a batch size of 8096. This stage runs for a total of 225k steps.

**Image-Stage-2:** Training is conducted using 384 Ascend 910B accelerators. We train on 384²-pixel images with multi-resolution still enabled. The learning rate remains 1e-4, the batch size is 6144, and training lasts for 150k steps.

**Image-Stage-3:** Training is conducted using 256 Ascend 910B accelerators. We train on 288×512 images at a forced (fixed) resolution. The learning rate is 1e-4, the batch size is 4096, and training lasts for 110k steps. This stage completes the Dense MMDiT training.

**Image-Stage-4:** Training is conducted using 256 Ascend 910B accelerators. We initialize the SUV model using the pretrained weights from Dense MMDiT, with skip connections zero-initialized to ensure that the model can produce non-noise outputs at the start. In practice, zero-shot inference reveals that the generated images contain meaningful low-frequency structures. Our experiments confirm that fine-tuning from Dense DiT to SUV converges quickly. This stage uses a fixed resolution of 288×512, a learning rate of 1e-4, a batch size of 4096, and is trained for approximately 160k steps.

#### Text-to-Video

For video training, we fix the aspect ratio at 9:16 and train solely on video data instead of jointly with image data. All training in this stage is performed using 512 Ascend 910B accelerators.

**Video-Stage-1:** Starting from the SUV weights pretrained during the Text-to-Image phase, we train on videos with a shape of 57×288×512 for about 40k steps. The setup includes a learning rate of 6e-5, TP/SP parallelism of 2, gradient accumulation set to 2, a micro batch size of 2, and a global batch size of 1024. Videos are trained at 24 fps, representing approximately 2.4 seconds (57/24 ≈ 2.4s) of content per sample. This stage marks the initial adaptation from image-based to video-based weights, for which shorter video clips are intentionally selected to ensure stable initialization.

**Video-Stage-2:** We further train on videos with a shape of 57×288×512 for 45k steps, keeping the learning rate, TP/SP parallelism, and gradient accumulation settings unchanged. However, the training frame rate is reduced to 12 fps, corresponding to ~4.8 seconds of video content per sample (57/12 ≈ 4.8s). This stage aims to enhance temporal learning without increasing sequence length, serving as preparation for later high-frame-count training.

**Video-Stage-3:** We train on videos with a shape of 121×288×512 for approximately 25k steps. The learning rate is adjusted to 4e-5, with TP/SP parallelism set to 4, gradient accumulation steps set to 2, a micro batch size of 4, and a global batch size of 1024. In this stage, we revert to a training frame rate of 24 fps.

**Video-Stage-4:** We conduct training on videos with a shape of 121×576×1024 for a total of 16k + 9k steps. The learning rates are set to 2e-5 and 1e-5 for the two phases, respectively. TP/SP parallelism is configured as 4, with gradient accumulation steps set to 4, a micro batch size of 1, and a global batch size of 512.

**Video-Stage-5:** We train on a high-quality subset of the dataset for 5k steps, using a learning rate of 1e-5. TP/SP parallelism is set to 4, with gradient accumulation steps of 4, a micro batch size of 1, and a global batch size of 512.
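
The batch-size and clip-length figures quoted for the video stages can be cross-checked with simple arithmetic. The sketch assumes that a TP/SP size of `s` means `s` accelerators share one model replica, so the data-parallel degree is `512 / s`; the function names are hypothetical.

```python
def global_batch_size(num_devices, tp_sp, grad_accum, micro_batch):
    """Global batch = data-parallel replicas x accumulation steps x micro batch."""
    data_parallel = num_devices // tp_sp   # assumed: one replica spans tp_sp devices
    return data_parallel * grad_accum * micro_batch

def clip_seconds(num_frames, fps):
    """Duration of content covered by one training sample."""
    return num_frames / fps

stage1 = global_batch_size(512, tp_sp=2, grad_accum=2, micro_batch=2)
stage3 = global_batch_size(512, tp_sp=4, grad_accum=2, micro_batch=4)
stage4 = global_batch_size(512, tp_sp=4, grad_accum=4, micro_batch=1)
```

Under this assumption the three settings give 1024, 1024, and 512, matching the global batch sizes listed above, and 57 frames correspond to 2.375 s at 24 fps and 4.75 s at 12 fps.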

#### Performance on VBench

| Model                      | Parameters | Total Score   | Quality Score | Semantic Score | **aesthetic quality** |
| -------------------------- | ---------- | ------------- | ------------- | -------------- | --------------------- |
| Mochi-1                    | 10B        | 80.13%        | 82.64%        | 70.08%         | 56.94%                |
| CogvideoX-2B               | 2B         | 80.91%        | 82.18%        | 75.83%         | 60.82%                |
| CogvideoX-5B               | 5B         | 81.61%        | 82.75%        | 77.04%         | 61.98%                |
| Step-Video-T2V             | 30B        | 81.83%        | <u>84.46%</u> | 71.28%         | 61.23%                |
| CogvideoX1.5-5B            | 5B         | 82.17%        | 82.78%        | **79.76%**     | 62.79%                |
| Gen-3                      | -          | 82.32%        | 84.11%        | 75.17%         | <u>63.34%</u>         |
| HunyuanVideo (Open-Source) | 13B        | **83.24%**    | **85.09%**    | 75.82%         | 60.36%                |
| Open-Sora Plan v1.5.0      | 8B         | <u>83.02%</u> | 84.24%        | <u>78.18%</u>  | **66.89%**            |


### Training Image-to-Video Diffusion Model

Coming Soon...

### Future Work

Currently, open-source models such as Wan2.1 have achieved performance comparable to closed-source commercial counterparts. Given the gap in computing resources and data availability compared to industry-scale efforts, the future development of the Open-Sora Plan will focus on the following directions:

1、Latents Cache.

In the training process of Text-to-Video models, the data must be processed through two key modules—the Variational Autoencoder (VAE) and the Text Encoder—to extract features from both video/images and their corresponding prompts. These encoded features serve as inputs to the training model. However, in existing industry practices, feature encoding is redundantly performed on the multimodal training dataset during every training epoch. This leads to additional computational overhead and significantly prolongs the total training time.

Specifically, in conventional training pipelines, the VAE and Text Encoder modules are typically kept resident in GPU memory to perform feature encoding in real time during each epoch. While this ensures on-the-fly encoding, it also results in persistently high GPU memory usage, becoming a major bottleneck for training efficiency. This issue is exacerbated when handling large-scale datasets or complex models, where memory constraints further limit model capacity and training speed.

To address the above issue, we propose an optimization strategy that replaces repeated feature computation with feature lookup. The core idea is to decouple feature encoding from model training. Specifically, during pretraining or the first training epoch, we compute and store the most computationally expensive text prompt features in external high-performance storage. During subsequent training, the model directly loads these precomputed features from storage, avoiding redundant encoding operations. This design significantly reduces computational overhead and GPU memory usage, allowing more memory to be allocated to model training.
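
The "lookup instead of recompute" idea can be sketched with a tiny on-disk cache. Everything here is illustrative: `FeatureCache`, `fake_text_encoder`, and the pickle-per-prompt layout are hypothetical stand-ins for a real text encoder and a high-performance storage backend.

```python
import hashlib
import os
import pickle
import tempfile

def fake_text_encoder(prompt):
    """Stand-in for an expensive text encoder (hypothetical)."""
    return [float(len(word)) for word in prompt.split()]

class FeatureCache:
    """Compute features once per key, then serve them from disk."""

    def __init__(self, cache_dir, encode_fn):
        self.cache_dir = cache_dir
        self.encode_fn = encode_fn
        self.misses = 0                      # how many times the encoder actually ran
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest + ".pkl")

    def get(self, key):
        path = self._path(key)
        if os.path.exists(path):             # later epochs: plain lookup
            with open(path, "rb") as f:
                return pickle.load(f)
        self.misses += 1                     # first epoch: encode and store
        features = self.encode_fn(key)
        with open(path, "wb") as f:
            pickle.dump(features, f)
        return features
```

Once every prompt has been seen, the encoder no longer needs to stay resident in accelerator memory, which is where the memory saving comes from.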

Using the configuration below, we compare the training time per epoch and per step, as well as memory usage, before and after applying the feature caching strategy. Experimental results show that storing precomputed features reduces multi-epoch training time by approximately 30% and frees up around 20% of GPU memory resources.

| **Configuration** |                 **Details**                 |
| :---------------: | :-----------------------------------------: |
|       Model       | Open-Sora Plan v1.5.0 (2B-level parameters) |
|      Dataset      |         100K images and 10K videos          |
|   Accelerators    |             8× Nvidia A800 GPUs             |
|  Feature Storage  |         Huawei OceanStor AI Storage         |

Test cases:

| **Training Stage** | **Test Type**          | **Batch Size** | **Time per Step** | **Time per Epoch** | **Memory Usage** |
| ------------------ | ---------------------- | -------------- | ----------------- | ------------------ | ---------------- |
| Low-Res Images     | General Method         | 64             | 6.53s             | 21 min 12s         | 56 GB            |
|                    | Feature Caching Method | 64             | 4.10s             | 13 min 19s         | 40 GB            |
|                    | General Method         | 128            | 12.78s            | 20 min 39s         | 74 GB            |
|                    | Feature Caching Method | 128            | 7.81s             | 12 min 38s         | 50 GB            |
| Low-Res Videos     | General Method         | 8              | 8.90s             | 26 min 23s         | 68 GB            |
|                    | Feature Caching Method | 8              | 7.78s             | 23 min 05s         | 51 GB            |
| High-Res Videos    | General Method         | 4              | 17.00s            | 101 min            | 71 GB            |
|                    | Feature Caching Method | 4              | 16.00s            | 97 min             | 57 GB            |
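
The relative savings per row can be derived from the table with a few lines of arithmetic. The numbers below are copied from the table above; the dictionary layout and function names are just one convenient way to organize them.

```python
# (stage, batch size) -> ((step time general, cached) in s, (memory general, cached) in GB)
MEASUREMENTS = {
    ("Low-Res Images", 64): ((6.53, 4.10), (56, 40)),
    ("Low-Res Images", 128): ((12.78, 7.81), (74, 50)),
    ("Low-Res Videos", 8): ((8.90, 7.78), (68, 51)),
    ("High-Res Videos", 4): ((17.00, 16.00), (71, 57)),
}

def percent_reduction(before, after):
    """Relative saving of `after` versus `before`, in percent."""
    return 100.0 * (before - after) / before

def summarize():
    """Map each test case to (step-time reduction %, memory reduction %)."""
    return {
        key: (percent_reduction(*step), percent_reduction(*mem))
        for key, (step, mem) in MEASUREMENTS.items()
    }
```

On these measurements the step-time saving is largest for the image stages and smallest for high-resolution videos, while the memory saving is in the range of about 20% to 32% across all rows.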

2、Improved DiT pretraining with sparse or linear attention. In v1.3.0, we introduce the first DiT pretrained with sparse attention in the community. This is extended in v1.5.0 into the SUV architecture, enabling sparse DiT to achieve performance comparable to its dense counterpart. While sparse and linear attention have demonstrated significant success in the LLM domain, their application in video generation remains underexplored. In future versions, we plan to further investigate the integration of sparse and linear attention into video generation models.

3、MoE-based DiT. Since the release of [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), the MoE (Mixture-of-Experts) paradigm has become a common approach for scaling LLMs to larger parameter sizes. Currently, open-source video generation models are capped at around 14B parameters, which is still relatively small compared to the 100B+ scales in the LLM field. Incorporating MoE into the DiT architecture, and exploring its combination with sparse and linear attention, is a future direction under consideration by the Open-Sora Plan team.

4、Unified video generation models for both generation and understanding. The March release of GPT-4o demonstrates that unified architectures combining generation and understanding can offer fundamentally different capabilities compared to purely generative models. In the video domain, we should similarly anticipate the potential breakthroughs that such unified generative models might bring.

5、Enhancing Image-to-Video generation models. Current approaches in this field still largely follow either the SVD paradigm or the inpainting-based paradigm adopted since Open-Sora Plan v1.2.0. Both approaches require extensive fine-tuning of pretrained Text-to-Video models. From a practical standpoint, Text-to-Video is more aligned with academic exploration, while Image-to-Video is more relevant to real-world production scenarios. As a result, developing a new paradigm for Image-to-Video will be a key focus for the Open-Sora Plan team moving forward.



================================================
FILE: docs/Report-v1.5.0_cn.md
================================================
## Report v1.5.0

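In October 2024 we released Open-Sora Plan v1.3.0, which introduced a sparse attention structure, skiparse attention, to the video generation field for the first time. We also adopted the efficient WFVAE, which greatly reduced encoding time and memory usage during training.

Open-Sora Plan v1.5.0 introduces several key updates:

1、A better sparse DiT: SUV. Building on skiparse attention, we extend the sparse DiT to a structure whose sparsity varies in a U shape, so that the sparse DiT reaches performance close to the dense DiT while keeping its speed advantage.

2、A WFVAE with a higher compression ratio. In Open-Sora Plan v1.5.0 we tried a WFVAE with 8x8x8 downsampling. It matches the performance of the 4x8x8 VAEs widely used in the community while halving the latent shape, which shortens the attention sequence.

3、Data and model scaling. For Open-Sora Plan v1.5.0 we collected 1.1B high-quality images and 40M high-quality videos and scaled the model to 8.5B parameters. The resulting model shows strong performance.

4、A simpler Adaptive Grad Clipping. Compared with the more complex strategy in v1.3.0 of discarding contaminated batches, in v1.5.0 we simply maintain an adaptive grad norm threshold and clip to it, which adapts better to the needs of various parallelism strategies.

Open-Sora Plan v1.5.0 was trained and run for inference entirely on Ascend 910-series accelerators, using the mindspeed-mm training framework to adapt the parallelism strategies.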

### Open-Source Release

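The open-source release of Open-Sora Plan v1.5.0 includes:

1、All training and inference code. You can also find the implementation of Open-Sora Plan v1.5.0 in the official [MindSpeed-MM](https://gitee.com/ascend/MindSpeed-MM) repository.

2、The WFVAE weights with 8x8x8 downsampling and the 8.5B SUV denoiser weights.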

## Detailed Technical Report

### Data collection and processing

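We collected 1.1B images in total from Recap-DataComp-1B, Coyo700M, and Laion-aesthetic. For image data we apply no filtering other than resolution. Our video data comes from Panda70M and other in-house data. For video data we filter with the same processing strategy as Open-Sora Plan v1.3.0, ending up with 40M high-quality videos.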

### Adaptive Grad Clipping

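In Open-Sora Plan v1.3.0 we introduced an Adaptive Grad Clipping strategy based on discarding batches with abnormal gradients. That strategy is very stable, but its execution logic is overly complex. In Open-Sora Plan v1.5.0 we therefore simplify it: we maintain the grad norm threshold with an EMA and, whenever the grad norm exceeds the threshold, clip the gradients so that their norm falls back to the threshold. In essence, this extends the constant grad norm threshold of 1.0 commonly used for large models into a threshold that changes dynamically as training progresses.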

```python
'''
	moving_avg_max_grad_norm: maximum grad norm maintained via EMA
	moving_avg_max_grad_norm_var: variance of the maximum grad norm maintained via EMA
	clip_threshold: gradient clipping threshold computed with the 3-sigma rule
	ema_decay: EMA decay coefficient, typically 0.99
	grad_norm: grad norm at the current step
'''
clip_threshold = moving_avg_max_grad_norm + 3.0 * (moving_avg_max_grad_norm_var ** 0.5)
if grad_norm <= clip_threshold:
    # The grad norm is below the clipping threshold: update the parameters normally at this step,
    # and update moving_avg_max_grad_norm and moving_avg_max_grad_norm_var.
    moving_avg_max_grad_norm = ema_decay * moving_avg_max_grad_norm + (1 - ema_decay) * grad_norm
    max_grad_norm_var = (moving_avg_max_grad_norm - grad_norm) ** 2
    moving_avg_max_grad_norm_var = ema_decay * moving_avg_max_grad_norm_var + (1 - ema_decay) * max_grad_norm_var
    # update parameters ...
else:
    # The grad norm exceeds the clipping threshold: first clip the grads so that the grad norm
    # drops to clip_threshold, then update the parameters.
    clip_coef = grad_norm / clip_threshold
    grads = clip(grads, clip_coef) # clip the grads
    # update parameters ...
```

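Compared with the strategy in v1.3.0, this one is simpler to implement and copes well with the loss spikes that still occur late in diffusion training, when the grad norm is far below 1.0.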
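
The pseudocode above can be turned into a small runnable sketch. The class name `AdaptiveGradClipper`, the plain-list gradients, and the initial values are assumptions made for illustration and are not taken from the repository. Here the clipped gradients are scaled by `threshold / grad_norm`, which is one way to read the `clip(grads, clip_coef)` placeholder.

```python
import math

class AdaptiveGradClipper:
    """Hypothetical helper: EMA-tracked grad-norm threshold with 3-sigma clipping."""

    def __init__(self, init_norm=1.0, init_var=0.0, ema_decay=0.99):
        self.avg = init_norm        # plays the role of moving_avg_max_grad_norm
        self.var = init_var         # plays the role of moving_avg_max_grad_norm_var
        self.ema_decay = ema_decay

    def threshold(self):
        return self.avg + 3.0 * math.sqrt(self.var)

    def clip(self, grads):
        """Return (grads, norm_after). EMA statistics are updated only on unclipped steps."""
        grad_norm = math.sqrt(sum(g * g for g in grads))
        limit = self.threshold()
        if grad_norm <= limit:
            d = self.ema_decay
            self.avg = d * self.avg + (1 - d) * grad_norm
            self.var = d * self.var + (1 - d) * (self.avg - grad_norm) ** 2
            return list(grads), grad_norm
        scale = limit / grad_norm   # strictly below 1 on this branch
        return [g * scale for g in grads], limit


clipper = AdaptiveGradClipper(init_norm=2.0)
normal, normal_norm = clipper.clip([0.6, 0.8])      # norm ~1.0, below the threshold
stats_before = (clipper.avg, clipper.var)
spiked, spiked_norm = clipper.clip([30.0, 40.0])    # norm 50.0, far above the threshold
```

In this sketch a spiking step is rescaled to the current threshold and leaves the EMA statistics untouched, so a single outlier does not inflate later thresholds.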

### WFVAE with 8x8x8 downsampling

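In v1.5.0 we raise the VAE's temporal compression from 4x to 8x, so that for a video of the same original size the latent shape is half that of the previous version. This lets us generate videos with more frames.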
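
A quick calculation illustrates the effect on sequence length. It assumes a causal video VAE that keeps the first frame and compresses the remaining frames, i.e. `T' = (T - 1) // t_down + 1`; that formula and the helper names are assumptions for illustration, and any patchification inside the DiT is ignored.

```python
def latent_shape(frames, height, width, t_down, s_down=8):
    # Assumed causal-VAE rule: first frame kept, remaining frames compressed by t_down.
    return ((frames - 1) // t_down + 1, height // s_down, width // s_down)

def num_latent_positions(shape):
    t, h, w = shape
    return t * h * w

shape_4x = latent_shape(121, 576, 1024, t_down=4)   # previous temporal compression
shape_8x = latent_shape(121, 576, 1024, t_down=8)   # v1.5.0 temporal compression
ratio = num_latent_positions(shape_8x) / num_latent_positions(shape_4x)
```

Under these assumptions a 121x576x1024 video maps to 31 latent frames at 4x and 16 latent frames at 8x, so the number of latent positions drops to roughly half.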

| Model             | THW(C)        | PSNR         | LPIPS         | rFVD         |
| ----------------- | ------------- | ------------ | ------------- | ------------ |
| CogVideoX         | 4x8x8 (16)    | <u>36.38</u> | 0.0243        | <u>50.33</u> |
| StepVideo         | 8x16x16 (16)  | 33.61        | 0.0337        | 113.68       |
| LTXVideo          | 8x32x32 (128) | 33.84        | 0.0380        | 150.87       |
| Wan2.1            | 4x8x8 (16)    | 35.77        | **0.0197**    | **46.05**    |
| Ours (WF-VAE-M)   | 8x8x8 (32)    | **36.91**    | <u>0.0205</u> | 52.53        |

**Test on an open-domain dataset with 1K samples.**

For details on WFVAE, see [WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model](https://arxiv.org/abs/2411.17459).

### Training Text-to-Video Diffusion Model

#### Framework —— SUV: A Sparse U-shaped Diffusion Transformer For Fast Video Generation

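In Open-Sora Plan v1.3.0 we discussed the strengths and weaknesses of Full 3D Attention and 2+1D Attention, and combined their characteristics to propose Skiparse Attention, a new kind of global sparse attention.

Given a pre-specified sparse ratio $k$, Skiparse Attention alternates between Single Skip and Group Skip to select subsequences that are $\frac{1}{k}$ of the original sequence length for attention, thereby approximating the effect of Full 3D Attention. The larger the sparse ratio, the more sparsely the subsequence is positioned in the original sequence; the smaller the sparse ratio, the more densely. Whatever the sparse ratio, Skiparse Attention is always global.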
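
To make the two selection patterns concrete, the sketch below lists which token indices end up in each subsequence. It is an illustration under assumptions: the grouping rule (Single Skip takes every k-th token, Group Skip takes every k-th block of k consecutive tokens) follows the v1.3.0 description as we read it, and the function names are hypothetical rather than the repository's API.

```python
def single_skip(seq_len, k):
    """Subsequence r holds tokens r, r + k, r + 2k, ..."""
    return [list(range(r, seq_len, k)) for r in range(k)]

def group_skip(seq_len, k):
    """Split tokens into blocks of k consecutive tokens; subsequence r holds every k-th block."""
    assert seq_len % (k * k) == 0, "sketch assumes seq_len divisible by k * k"
    blocks = [list(range(start, start + k)) for start in range(0, seq_len, k)]
    return [[i for block in blocks[r::k] for i in block] for r in range(k)]

single = single_skip(8, 2)
group = group_skip(8, 2)
```

Each subsequence has `seq_len / k` tokens and spans the whole sequence, which is why the pattern stays global for any sparse ratio.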

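In Open-Sora Plan v1.5.0 we view this sparse interaction as a form of information downsampling over tokens: sparser Skiparse Attention is a more semantic-level interaction, while denser Skiparse Attention is a more fine-grained one. Following the multi-scale design principle of neural networks, we introduce Skiparse Attention with U-shaped varying sparsity into the network: shallow layers use low-sparsity Skiparse Attention, with Full 3D Attention in the shallowest layers, and deep layers use high-sparsity Skiparse Attention. In particular, by analogy with UNet, we add Long Skip Connections between stages of the same sparsity. We call this U-shaped DiT based on Skiparse Attention SUV.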

![SUV](https://github.com/user-attachments/assets/6eb54e37-7077-4746-a4c6-9b7165dd48fe)
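
The U-shaped layout can be sketched as a symmetric list of per-stage sparse ratios plus the stage pairs joined by long skip connections. The ratios `[1, 2, 4]` are hypothetical values chosen for illustration (with 1 standing for Full 3D Attention), and both helper names are assumptions; the actual stage configuration is not stated here.

```python
def u_shaped_schedule(first_half):
    """Mirror the first half (excluding its last entry) to get a symmetric schedule."""
    return first_half + first_half[-2::-1]

def long_skip_pairs(schedule):
    """Pair stage i with its mirror stage, which has the same sparse ratio."""
    n = len(schedule)
    return [(i, n - 1 - i) for i in range(n // 2)]

schedule = u_shaped_schedule([1, 2, 4])   # hypothetical sparse ratios per stage
pairs = long_skip_pairs(schedule)
```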

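In Open-Sora Plan v1.5.0 we adopt an SUV architecture based on MMDiT. Video latents go through the skiparse attention operation; text embeddings are only repeated to align with the latent shape after the skiparse operation, without any sparsification.

The SUV architecture has the following advantages:

1、SUV is the first sparsification method shown to be effective on video generation models. Our ablation experiments show that, for the same number of training steps, it reaches performance close to the dense DiT, and it can be applied in both pretraining and inference. On a 910B test platform with a video shape of 121x576x1024, SUV inference is more than 35% faster than the Dense DiT, and the attention part alone is more than 45% faster.

2、Whereas the UNet structure explicitly downsamples feature maps and thereby loses information, SUV's U-shaped structure acts on attention. The feature map shape does not change, so no information is lost; only the granularity of information exchange between tokens changes.

3、Skiparse Attention and SUV do not change the weight shapes; they only change how attention is computed in the forward pass. This lets us adjust the sparsity dynamically as training progresses, using lower sparsity for image or low-resolution video training and higher sparsity for high-resolution video training, to obtain FLOPs that grow approximately linearly with sequence length.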
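
The near-linear scaling in point 3 follows from simple counting: with sparse ratio k, attention runs over k subsequences of length N/k, so the number of query-key pairs is about N²/k. The sketch below only counts pairs; it is not a FLOPs model of the full network, and the function name is hypothetical.

```python
def attention_pairs(seq_len, k):
    """Query-key pairs when seq_len tokens are split into k subsequences of equal length."""
    sub_len = seq_len // k
    return k * sub_len * sub_len

dense_growth = attention_pairs(2048, 1) / attention_pairs(1024, 1)    # k fixed at 1
sparse_growth = attention_pairs(2048, 4) / attention_pairs(1024, 2)   # k doubled with seq_len
```

With dense attention, doubling the sequence length quadruples the pair count; if the sparse ratio is doubled along with the sequence length, the pair count only doubles.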

A more detailed analysis of the SUV architecture will be posted to arXiv later.

#### Training Stage

Our training consists of two stages: Text-to-Image and Text-to-Video.

#### Text-to-Image

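Previous work has shown that image weights trained on synthetic data may affect the results of video training. In v1.5.0 we therefore train the image weights on a larger domain of real data, collecting 1.1B images in total. Because images come in many resolutions while videos are mainly 9:16, we train the image weights with multi-resolution enabled (five common aspect ratios: (1,1), (3,4), (4,3), (9,16), (16,9)) and with the Min-Max Token Strategy, and we train on video at a fixed resolution with a fixed 9:16 aspect ratio.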
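
One common way to implement multi-resolution training is to assign each image to the closest aspect-ratio bucket. The sketch below shows that assignment only; it treats the tuples as (height, width), which matches the 288x512 video resolution being called 9:16, and it does not model the Min-Max Token Strategy. The function name and the log-distance criterion are assumptions, not the repository's implementation.

```python
import math

ASPECT_RATIOS = [(1, 1), (3, 4), (4, 3), (9, 16), (16, 9)]  # assumed (height, width)

def nearest_bucket(height, width):
    """Pick the bucket whose height/width ratio is closest in log space."""
    target = math.log(height / width)
    return min(ASPECT_RATIOS, key=lambda r: abs(math.log(r[0] / r[1]) - target))

landscape = nearest_bucket(288, 512)
square = nearest_bucket(1000, 1000)
photo = nearest_bucket(768, 1024)
```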

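Skiparse Attention differs from Full Attention only in which token sequences take part in the forward computation; the weights required are exactly the same. We can therefore first train a Dense MMDiT with Full 3D Attention and, once it is sufficiently trained, fine-tune it into Sparse MMDiT mode.

**Image-Stage-1:** Trained on 512 Ascend 910B. We train a randomly initialized Dense MMDiT on images at around 256^2 px resolution with multi-resolution enabled. The learning rate is 1e-4 and the batch size is 8096. We train for 225k steps in total in this stage.

**Image-Stage-2:** Trained on 384 Ascend 910B, on images at around 384^2 px with multi-resolution enabled. The learning rate is 1e-4 and the batch size is 6144, for 150k steps in total.

**Image-Stage-3:** Trained on 256 Ascend 910B at a fixed resolution of 288x512. The learning rate is 1e-4 and the batch size is 4096, for 110k steps in total. This completes the Dense MMDiT phase.

**Image-Stage-4:** Trained on 256 Ascend 910B. We initialize SUV from the Dense MMDiT weights, with the skip connections zero-initialized so that the initial SUV weights can already produce non-noise images. In fact, images from zero-shot inference contain some low-frequency information, and we verified that fine-tuning from Dense DiT to SUV converges quickly. This stage uses a fixed resolution of 288x512, a learning rate of 1e-4, and a batch size of 4096, for about 160k steps in total.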

#### Text-to-Video

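For video training we fix the aspect ratio at 9:16 and do not use joint video-image training; we train on video data only. All stages below run on 512 Ascend 910B.

**Video-Stage-1:** Starting from the SUV weights obtained in the Text-to-Image stage, we train on 57x288x512 videos for about 40k steps, with a learning rate of 6e-5, a TP/SP parallel degree of 2, 2 gradient accumulation steps, a micro batch size of 2, and a global batch size of 1024. In this stage the train fps is 24, i.e. about 57/24≈2.4s of video content. As the first stage in transferring from image weights to video weights, we chose shorter videos as a good initialization.

**Video-Stage-2:** We again train on 57x288x512 videos, for 45k steps, keeping the learning rate, TP/SP parallel degree, and gradient accumulation settings unchanged but changing the train fps to 12, which corresponds to 57/12≈4.8s of original video. This stage aims to improve temporal learning without increasing the sequence length, in preparation for the later high-frame-count stages.

**Video-Stage-3:** We train on 121x288x512 videos for about 25k steps, with the learning rate adjusted to 4e-5, a TP/SP parallel degree of 4, 2 gradient accumulation steps, a micro batch size of 4, and a global batch size of 1024. In this stage we return to a train fps of 24.

**Video-Stage-4:** We train on 121x576x1024 videos for 16k + 9k steps, with learning rates of 2e-5 and 1e-5 respectively, a TP/SP parallel degree of 4, 4 gradient accumulation steps, a micro batch size of 1, and a global batch size of 512.

**Video-Stage-5:** We train on a high-quality subset of the data for 5k steps, with a learning rate of 1e-5, a TP/SP parallel degree of 4, 4 gradient accumulation steps, a micro batch size of 1, and a global batch size of 512.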
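
The batch-size figures in these stages are consistent with the usual relation between device count, model-parallel degree, micro batch size, and gradient accumulation. The sketch below assumes that the devices in one TP/SP group process the same samples, so the data-parallel size is the device count divided by the TP/SP degree; the function names are hypothetical.

```python
def global_batch_size(num_devices, tp_sp_degree, micro_batch_size, grad_accum_steps):
    data_parallel_size = num_devices // tp_sp_degree
    return data_parallel_size * micro_batch_size * grad_accum_steps

def clip_seconds(num_frames, train_fps):
    return num_frames / train_fps

stage1 = global_batch_size(512, 2, 2, 2)
stage3 = global_batch_size(512, 4, 4, 2)
stage4 = global_batch_size(512, 4, 1, 4)
stage1_seconds = clip_seconds(57, 24)
stage2_seconds = clip_seconds(57, 12)
```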

#### Performance on VBench

| Model                      | Parameters | Total Score   | Quality Score | Semantic Score | **aesthetic quality** |
| -------------------------- | ---------- | ------------- | ------------- | -------------- | --------------------- |
| Mochi-1                    | 10B        | 80.13%        | 82.64%        | 70.08%         | 56.94%                |
| CogvideoX-2B               | 2B         | 80.91%        | 82.18%        | 75.83%         | 60.82%                |
| CogvideoX-5B               | 5B         | 81.61%        | 82.75%        | 77.04%         | 61.98%                |
| Step-Video-T2V             | 30B        | 81.83%        | <u>84.46%</u> | 71.28%         | 61.23%                |
| CogvideoX1.5-5B            | 5B         | 82.17%        | 82.78%        | **79.76%**     | 62.79%                |
| Gen-3                      | -          | 82.32%        | 84.11%        | 75.17%         | <u>63.34%</u>         |
| HunyuanVideo (Open-Source) | 13B        | **83.24%**    | **85.09%**    | 75.82%         | 60.36%                |
| Open-Sora Plan v1.5.0      | 8B         | <u>83.02%</u> | 84.24%        | <u>78.18%</u>  | **66.89%**            |

### Training Image-to-Video Diffusion Model

Coming Soon...

### Future Work

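The open-source community now has models whose performance is comparable to closed-source commercial ones, such as Wan2.1. Given that our compute and data still fall short of what companies have, the Open-Sora Plan team's directions for future improvement are:

1、Latents Cache.

When training a Text2Video model, the training data must pass through two key modules, the variational autoencoder (VAE) and the text encoder, to encode the features of the videos/images and their corresponding prompts. These encoded features are the inputs to the subsequent training process. In common industry training setups, however, every epoch recomputes the feature encoding for the multimodal training set, which adds computational overhead and significantly lengthens overall training time.

Specifically, in the conventional training pipeline the VAE and text encoder usually stay resident in GPU memory so that they can encode features on the fly in every epoch. This guarantees real-time encoding but keeps GPU memory usage high, making it one of the main bottlenecks for training efficiency. The problem gets worse with large datasets or complex models, where tight memory limits the model's parameter count and training speed.

To address this, we propose an optimization that replaces feature computation with lookup. The core idea is to decouple feature encoding from model training. Concretely, the prompt features, which are the most expensive to compute, are calculated before training or during the first epoch and saved to external high-performance file storage. In later training, the model reads these precomputed features directly from storage, avoiding repeated encoding. This significantly reduces wasted computation and substantially lowers GPU memory usage, leaving more memory available for model training.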
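
The lookup-instead-of-compute idea can be sketched with a small on-disk cache keyed by a hash of the prompt. Everything here is illustrative: `FeatureCache`, the pickle file layout, and `fake_text_encoder` are hypothetical stand-ins, not the storage format or encoder used by Open-Sora Plan.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

class FeatureCache:
    """Hypothetical cache: encode each prompt once, then read the feature from disk."""

    def __init__(self, root, encode_fn):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.encode_fn = encode_fn

    def _path(self, prompt):
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.pkl"

    def get(self, prompt):
        path = self._path(prompt)
        if path.exists():                      # later epochs: lookup only
            with path.open("rb") as f:
                return pickle.load(f)
        feature = self.encode_fn(prompt)       # first epoch: compute and store
        with path.open("wb") as f:
            pickle.dump(feature, f)
        return feature


calls = []

def fake_text_encoder(prompt):
    # Stand-in for an expensive text encoder; records how often it is invoked.
    calls.append(prompt)
    return [float(len(prompt))]

cache = FeatureCache(tempfile.mkdtemp(), fake_text_encoder)
first = cache.get("a rocket ascends")
second = cache.get("a rocket ascends")
```

Once every prompt has been cached, the encoder no longer needs to stay in accelerator memory during training, which is where the memory saving described above comes from.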

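Using the configuration below, we measured training statistics for a single epoch and a single step before and after enabling feature storage. The experiments show that feature storage **shortens multi-epoch training time by about 30% and frees about 20% of GPU memory.**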

|  Configuration  |            Details             |
| :-------------: | :----------------------------: |
|      Model      | Open-Sora Plan v1.5.0, 2B scale |
|     Dataset     |   100K images and 10K videos   |
|   GPU server    |         8x Nvidia A800         |
| Feature storage |  Huawei OceanStor AI storage   |

Test results:

| Training stage  | Setup             | Batch Size | Time per step | Time per epoch | GPU memory |
| --------------- | ----------------- | ---------- | ------------- | -------------- | ---------- |
| Low-res images  | Standard approach | 64         | 6.53s         | 21min12s       | 56GB       |
|                 | Feature storage   | 64         | 4.10s         | 13min19s       | 40GB       |
|                 | Standard approach | 128        | 12.78s        | 20min39s       | 74GB       |
|                 | Feature storage   | 128        | 7.81s         | 12min38s       | 50GB       |
| Low-res videos  | Standard approach | 8          | 8.90s         | 26min23s       | 68GB       |
|                 | Feature storage   | 8          | 7.78s         | 23min05s       | 51GB       |
| High-res videos | Standard approach | 4          | 17s           | 101min         | 71GB       |
|                 | Feature storage   | 4          | 16s           | 97min          | 57GB       |
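
The relative savings can be read off the table with a one-line formula. The snippet below transcribes the per-step time and memory figures from the table and computes the fractional reduction for each setting; the dictionary keys are labels chosen here for illustration.

```python
def reduction(before, after):
    """Fractional reduction relative to the standard approach."""
    return (before - after) / before

# (step seconds, memory GB) for standard approach vs feature storage, from the table above
measurements = {
    "low-res images, bs 64": ((6.53, 56), (4.10, 40)),
    "low-res images, bs 128": ((12.78, 74), (7.81, 50)),
    "low-res videos, bs 8": ((8.90, 68), (7.78, 51)),
    "high-res videos, bs 4": ((17.0, 71), (16.0, 57)),
}

savings = {
    name: (reduction(base[0], cached[0]), reduction(base[1], cached[1]))
    for name, (base, cached) in measurements.items()
}
```

Each entry of `savings` is a pair of fractions for step time and memory, which makes it easy to compare the settings against the summary figures quoted above.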

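2、Improved DiT pretraining with sparse or linear attention. In v1.3.0 we introduced the first DiT pretrained with sparse attention in the community, and in v1.5.0 we extended it into the SUV architecture, enabling the sparse DiT to achieve performance comparable to the Dense DiT. Sparse and linear attention have been very successful in the LLM field, but their application in video generation remains underexplored. In future versions we will further explore sparse and linear attention for video generation.

3、MoE-based DiT. Since the release of Mixtral 8x7B, the LLM field has commonly used MoE to scale models to larger parameter counts. Open-source video models are currently capped at around 14B parameters, which is still small compared with the 100B+ scale of LLMs. Introducing MoE into the DiT architecture, and combining MoE with sparse and linear attention, is a direction the Open-Sora Plan team is considering for the future.

4、Unified video models for both generation and understanding. The March update of GPT-4o showed that generative models with a unified generation-and-understanding architecture can gain capabilities fundamentally different from those of purely generative models. In the video domain we should likewise look forward to what a unified generative model can bring.

5、Better Image-to-Video models. The Image-to-Video field still largely follows either the SVD paradigm or the inpainting paradigm adopted since Open-Sora Plan v1.2.0. Both require long fine-tuning on top of Text-to-Video model weights. From a practical standpoint, Text-to-Video is closer to academic exploration, while Image-to-Video is closer to real production environments. A new paradigm for Image-to-Video will therefore be a key direction for the Open-Sora Plan team going forward.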



================================================
FILE: docs/VAE.md
================================================

### Data preparation
Organizing the training data is simple: place all videos under one directory, at any nesting depth, and they will be collected recursively. This makes it convenient to train on multiple datasets.
``` shell
Training Dataset
|——sub_dataset1
    |——sub_sub_dataset1
        |——video1.mp4
        |——video2.mp4
        ......
    |——sub_sub_dataset2
        |——video3.mp4
        |——video4.mp4
        ......
|——sub_dataset2
    |——video5.mp4
    |——video6.mp4
    ......
|——video7.mp4
|——video8.mp4
```
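
For reference, recursive collection of this kind can be sketched with `pathlib`. This is an illustration only: `list_videos` and the `.mp4`-only filter are assumptions, and the repository's own dataset loader may apply different rules.

```python
import tempfile
from pathlib import Path

def list_videos(root, extensions=(".mp4",)):
    """Return every video file under root, at any nesting depth, in sorted order."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in extensions
    )

# Build a tiny directory tree shaped like the layout above and scan it.
demo_root = Path(tempfile.mkdtemp())
for rel in [
    "sub_dataset1/sub_sub_dataset1/video1.mp4",
    "sub_dataset2/video5.mp4",
    "video7.mp4",
    "notes.txt",
]:
    path = demo_root / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()

videos = list_videos(demo_root)
```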

### Training
``` shell
bash scripts/causalvae/train.sh
```
We introduce the important args for training.

| Argparse | Usage |
|:---|:---|
|_Training size_||
|`--num_frames`|The number of frames used from each training video|
|`--resolution`|The resolution of the input to the VAE|
|`--batch_size`|The local batch size on each GPU|
|`--sample_rate`|The frame interval used when loading training videos|
|_Data processing_||
|`--video_path`|/path/to/dataset|
|_Load weights_||
|`--model_name`| `CausalVAE` or `WFVAE`|
|`--model_config`|/path/to/config.json. The model config of the VAE. Use this parameter to train from scratch.|
|`--pretrained_model_name_or_path`|A directory containing a model checkpoint and its config. This loads only the weights, not the optimizer state.|
|`--resume_from_checkpoint`|/path/to/checkpoint. Resumes training from the checkpoint, including the weights and the optimizer state.|

### Inference

``` shell
bash scripts/causalvae/rec_video.sh
```
We introduce the important args for inference.

| Argparse | Usage |
|:---|:---|
|_Output video size_||
|`--num_frames`|The number of frames of the reconstructed video|
|`--height`|The height of the reconstructed video|
|`--width`|The width of the reconstructed video|
|_Data processing_||
|`--video_path`|The path to the original video|
|`--rec_path`|The path to the reconstructed video|
|_Load weights_||
|`--ae_path`|/path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config.json|
|_Other_||
|`--enable_tiling`|Use tiling to handle high-resolution and long-duration videos|
|`--save_memory`|Reduce memory use during inference at a slight cost in quality|


### Evaluation

The evaluation process consists of two steps:

1. Reconstruct videos in batches: `bash scripts/causalvae/prepare_eval.sh`
2. Evaluate video metrics: `bash scripts/causalvae/eval.sh`

To simplify the evaluation, environment variables are used for control. For step 1 (`bash scripts/causalvae/prepare_eval.sh`):

```bash
# Experiment name
EXP_NAME=wfvae
# Video parameters
SAMPLE_RATE=1
NUM_FRAMES=33
RESOLUTION=256
# Model weights
CKPT=ckpt
# Select subset size (0 for full set)
SUBSET_SIZE=0
# Dataset directory
DATASET_DIR=test_video
```

For step 2 (`scripts/causalvae/eval.sh`):

```bash
# Experiment name
EXP_NAME=wfvae-4dim
# Video parameters
SAMPLE_RATE=1
NUM_FRAMES=33
RESOLUTION=256
# Evaluation metric
METRIC=lpips
# Select subset size (0 for full set)
SUBSET_SIZE=0
# Path to the ground truth videos, which can be saved during video reconstruction by setting `--output_origin`
ORIGIN_DIR=video_gen/${EXP_NAME}_sr${SAMPLE_RATE}_nf${NUM_FRAMES}_res${RESOLUTION}_subset${SUBSET_SIZE}/origin
# Path to the reconstructed videos
RECON_DIR=video_gen/${EXP_NAME}_sr${SAMPLE_RATE}_nf${NUM_FRAMES}_res${RESOLUTION}_subset${SUBSET_SIZE}
```
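
The two directory variables follow a fixed naming pattern built from the other settings. The Python sketch below mirrors that pattern so the expected paths can be checked before running the scripts; `eval_dirs` is a hypothetical helper, and whether step 1 writes to exactly this pattern should be confirmed against `prepare_eval.sh`.

```python
def eval_dirs(exp_name, sample_rate, num_frames, resolution, subset_size, root="video_gen"):
    """Return (RECON_DIR, ORIGIN_DIR) following the pattern used in the step 2 settings."""
    recon_dir = (
        f"{root}/{exp_name}_sr{sample_rate}_nf{num_frames}"
        f"_res{resolution}_subset{subset_size}"
    )
    return recon_dir, f"{recon_dir}/origin"

recon_dir, origin_dir = eval_dirs("wfvae-4dim", 1, 33, 256, 0)
```

Because `EXP_NAME` is part of both paths, the value used for evaluation has to match the one used when the videos were reconstructed.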

================================================
FILE: examples/cond_pix_path.txt
================================================
examples/test_img1.png
examples/test_img2.png
examples/test_img3.png

================================================
FILE: examples/cond_prompt.txt
================================================
A rocket ascends slowly into the sky.
Along the coast, variously sized boats float on the lake.
The landscape at sunset is profound and expansive.

================================================
FILE: examples/rec_image.py
================================================
import sys
sys.path.append(".")
from PIL import Image
import torch
from torchvision.transforms import ToTensor, Compose, Resize, Normalize, Lambda
from torch.nn import functional as F
import argparse
import numpy as np
from opensora.models.causalvideovae import ae_wrapper

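def preprocess(image: Image.Image, short_size: int = 128) -> torch.Tensor:
    # Convert a PIL image to a tensor in [-1, 1], resize its short side,
    # then add batch and time axes so the VAE sees a one-frame video.
    transform = Compose(
        [
            ToTensor(),
            Lambda(lambda x: 2. * x - 1.),
            Resize(size=short_size),
        ]
    )
    outputs = transform(image)
    outputs = outputs.unsqueeze(0).unsqueeze(2)  # (C, H, W) -> (1, C, 1, H, W)
    return outputs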

def main(args: argparse.Namespace):
    image_path = args.image_path
    short_size = args.short_size
    device = args.device
    kwarg = {}
    
    # vae = getae_wrapper(args.ae)(args.model_path, subfolder="vae", cache_dir='cache_dir', **kwarg).to(device)
    vae = ae_wrapper[args.ae](args.ae_path, **kwarg).eval().to(device)
    if args.enable_tiling:
        vae.vae.enable_tiling()
        vae.vae.tile_overlap_factor = args.tile_overlap_factor
    vae.eval()
    vae = vae.to(device)
    vae = vae.half()
    
    with torch.no_grad():
        x_vae = preprocess(Image.open(image_path), short_size)
        x_vae = x_vae.to(device, dtype=torch.float16)  # b c t h w
        latents = vae.encode(x_vae)
        latents = latents.to(torch.float16)
        image_recon = vae.decode(latents)  # b t c h w
    x = image_recon[0, 0, :, :, :]
    x = x.squeeze()
    x = x.detach().cpu().numpy()
    x = np.clip(x, -1, 1)
    x = (x + 1) / 2
    x = (255*x).astype(np.uint8)
    x = x.transpose(1,2,0)
    image = Image.fromarray(x)
    image.save(args.rec_path)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--image_path', type=str, default='')
    parser.add_argument('--rec_path', type=str, default='')
    parser.add_argument('--ae', type=str, default='')
    parser.add_argument('--ae_path', type=str, default='')
    parser.add_argument('--model_path', type=str, default='results/pretrained')
    parser.add_argument('--short_size', type=int, default=336)
    parser.add_argument('--device', type=str, default='cuda')
    parser.add_argument('--tile_overlap_factor', type=float, default=0.25)
    parser.add_argument('--enable_tiling', action='store_true')
    
    args = parser.parse_args()
    main(args)

================================================
FILE: examples/rec_video.py
================================================
import math
import random
import argparse
from typing import Optional

import cv2
import numpy as np
import numpy.typing as npt
import torch
from PIL import Image
from decord import VideoReader, cpu
from torch.nn import functional as F
from torchvision.transforms import Lambda, Compose
import sys
sys.path.append(".")
from opensora.models.causalvideovae import ae_wrapper
from opensora.dataset.transform import ToTensorVideo, CenterCropResizeVideo


def array_to_video(image_array: npt.NDArray, fps: float = 30.0, output_file: str = 'output_video.mp4') -> None:
    height, width, channels = image_array[0].shape
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    video_writer = cv2.VideoWriter(output_file, fourcc, float(fps), (width, height))

    for image in image_array:
        image_rgb = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
        video_writer.write(image_rgb)

    video_writer.release()


def custom_to_video(x: torch.Tensor, fps: float = 2.0, output_file: str = 'output_video.mp4') -> None:
    x = x.detach().cpu()
    x = torch.clamp(x, -1, 1)
    x = (x + 1) / 2
    x = x.permute(0, 2, 3, 1).float().numpy()
    x = (255 * x).astype(np.uint8)
    array_to_video(x, fps=fps, output_file=output_file)
    return


def read_video(video_path: str, num_frames: int, sample_rate: int) -> torch.Tensor:
    decord_vr = VideoReader(video_path, ctx=cpu(0))
    total_frames = len(decord_vr)
    sample_frames_len = sample_rate * num_frames

    # if total_frames > sample_frames_len:
    #     s = random.randint(0, total_frames - sample_frames_len - 1)
    #     s = 0
    #     e = s + sample_frames_len
    #     num_frames = num_frames
    # else:
    # s = 0
    # e = total_frames
    # num_frames = int(total_frames / sample_frames_len * num_frames)
    s = 0
    e = sample_frames_len
    print(f'sampling {num_frames} frames from the first {sample_frames_len} frames of', video_path,
            f'({total_frames} frames in total)')

    frame_id_list = np.linspace(s, e - 1, num_frames, dtype=int)
    video_data = decord_vr.get_batch(frame_id_list).asnumpy()
    video_data = torch.from_numpy(video_data)
    video_data = video_data.permute(3, 0, 1, 2)  # (T, H, W, C) -> (C, T, H, W)
    return video_data


def preprocess(video_data: torch.Tensor, height: int = 128, width: int = 128) -> torch.Tensor:
    transform = Compose(
        [
            ToTensorVideo(),
            CenterCropResizeVideo((height, width)),
            Lambda(lambda x: 2. * x - 1.)
        ]
    )

    video_outputs = transform(video_data)
    video_outputs = torch.unsqueeze(video_outputs, 0)

    return video_outputs


def main(args: argparse.Namespace):
    device = args.device
    kwarg = {}
    # vae = getae_wrapper(args.ae)(args.model_path, subfolder="vae", cache_dir='cache_dir', **kwarg).to(device)
    # vae = CausalVAEModelWrapper(args.ae_path, **kwarg).to(device)
    vae = ae_wrapper[args.ae](args.ae_path, **kwarg).eval().to(device)
    if args.enable_tiling:
        vae.vae.enable_tiling()
        vae.vae.tile_overlap_factor = args.tile_overlap_factor
        # vae.vae.tile_sample_min_size = 512
        # vae.vae.tile_latent_min_size = 64
        # vae.vae.tile_sample_min_size_t = 29
        # vae.vae.tile_latent_min_size_t = 8
        # if args.save_memory:
        #     vae.vae.tile_sample_min_size = 256
        #     vae.vae.tile_latent_min_size = 32
        #     vae.vae.tile_sample_min_size_t = 9
        #     vae.vae.tile_latent_min_size_t = 3
    dtype = torch.bfloat16
    vae.eval()
    vae = vae.to(device, dtype=dtype)
    
    with torch.no_grad():
        x_vae = preprocess(read_video(args.video_path, args.num_frames, args.sample_rate), args.height,
                           args.width)
        print("input shape", x_vae.shape)
        x_vae = x_vae.to(device, dtype=dtype)  # b c t h w
        # for i in range(10000):
        latents = vae.encode(x_vae)
        latents = latents.to(dtype)
        video_recon = vae.decode(latents)  # b t c h w
        print("recon shape", video_recon.shape)


    
    # vae = vae.half()
    # from tqdm import tqdm
    # with torch.no_grad():
    #     x_vae = torch.rand(1, 3, 93, 720, 1280)
    #     print(x_vae.shape)
    #     x_vae = x_vae.to(device, dtype=torch.float16)  # b c t h w
    #     # x_vae = x_vae.to(device)  # b c t h w
    #     for i in tqdm(range(100000)):
    #         latents = vae.encode(x_vae)
    #     print(latents.shape)
    #     latents = latents.to(torch.float16)
    #     video_recon = vae.decode(latents)  # b t c h w
    #     print(video_recon.shape)


    custom_to_video(video_recon[0], fps=args.fps, output_file=args.rec_path)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--video_path', type=str, default='')
    parser.add_argument('--rec_path', type=str, default='')
    parser.add_argument('--ae', type=str, default='')
    parser.add_argument('--ae_path', type=str, default='')
    parser.add_argument('--model_path', type=str, default='results/pretrained')
    parser.add_argument('--fps', type=int, default=30)
    parser.add_argument('--height', type=int, default=336)
    parser.add_argument('--width', type=int, default=336)
    parser.add_argument('--num_frames', type=int, default=100)
    parser.add_argument('--sample_rate', type=int, default=1)
    parser.add_argument('--device', type=str, default="cuda")
    parser.add_argument('--tile_overlap_factor', type=float, default=0.25)
    parser.add_argument('--tile_sample_min_size', type=int, default=512)
    parser.add_argument('--tile_sample_min_size_t', type=int, default=33)
    parser.add_argument('--tile_sample_min_size_dec', type=int, default=256)
    parser.add_argument('--tile_sample_min_size_dec_t', type=int, default=33)
    parser.add_argument('--enable_tiling', action='store_true')
    parser.add_argument('--save_memory', action='store_true')

    args = parser.parse_args()
    main(args)

================================================
FILE: examples/sora.txt
================================================
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field
A movie trailer featuring the adventures of the 30 year old spaceman wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.
Drone view of waves crashing against the rugged cliffs along Big Sur's garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff's edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff's edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.
Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.
A gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.
This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage and red chest. Its crest is made of delicate, lacy feathers, while its eye is a striking red color. The bird's head is tilted slightly to the side, giving the impression of it looking regal and majestic. The background is blurred, drawing attention to the bird's striking appearance.
Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee.
A young man at his 20s is sitting on a piece of cloud in the sky, reading a book.
A petri dish with a bamboo forest growing within it that has tiny red pandas running around.
The camera rotates around a large stack of vintage televisions all showing different programs-1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc, set inside a large New York museum gallery.
3D animation of a small, round, fluffy creature with big, expressive eyes explores a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bushy, striped tail. It hops along a sparkling stream, its eyes wide with wonder. The forest is alive with magical elements: flowers that glow and change colors, trees with leaves in shades of purple and silver, and small floating lights that resemble fireflies. The creature stops to interact playfully with a group of tiny, fairy-like beings dancing around a mushroom ring. The creature looks up in awe at a large, glowing tree that seems to be the heart of the forest.
Historical footage of California during the gold rush.
A close up view of a glass sphere that has a zen garden within it. There is a small dwarf in the sphere who is raking the zen garden and creating patterns in the sand.
Extreme close up of a 24 year old woman's eye blinking, standing in Marrakech during magic hour, cinematic film shot in 70mm, depth of field, vivid colors, cinematic.
A cartoon kangaroo disco dances.
A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.
A cat waking up its sleeping owner demanding breakfast. The owner tries to ignore the cat, but the cat tries new tactics and finally the owner pulls out a secret stash of treats from under the pillow to hold the cat off a little longer.
Borneo wildlife on the Kinabatangan River
A Chinese Lunar New Year celebration video with Chinese Dragon.
The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains with a clear blue sky above with wispy clouds.
Reflections in the window of a train traveling through the Tokyo suburbs.
A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast, the view showcases historic and magnificent architectural details and tiered pathways and patios, waves are seen crashing against the rocks below as the view overlooks the horizon of the coastal waters and hilly landscapes of the Amalfi Coast Italy, several distant people are seen walking and enjoying vistas on patios of the dramatic ocean views, the warm glow of the afternoon sun creates a magical and romantic feeling to the scene, the view is stunning captured with beautiful photography
A large orange octopus is seen resting on the bottom of the ocean floor, blending in with the sandy and rocky terrain. Its tentacles are spread out around its body, and its eyes are closed. The octopus is unaware of a king crab that is crawling towards it from behind a rock, its claws raised and ready to attack. The crab is brown and spiny, with long legs and antennae. The scene is captured from a wide angle, showing the vastness and depth of the ocean. The water is clear and blue, with rays of sunlight filtering through. The shot is sharp and crisp, with a high dynamic range. The octopus and the crab are in focus, while the background is slightly blurred, creating a depth of field effect.
A flock of paper airplanes flutters through a dense jungle, weaving around trees as if they were migrating birds.
A beautiful silhouette animation shows a wolf howling at the moon, feeling lonely, until it finds its pack.
New York City submerged like Atlantis. Fish, whales, sea turtles and sharks swim through the streets of New York.
A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.
Tour of an art gallery with many beautiful works of art in different styles.
Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.
A stop motion animation of a flower growing out of the windowsill of a suburban house.
The story of a robot's life in a cyberpunk setting.
An extreme close-up of a gray-haired man with a beard in his 60s, he is deep in thought pondering the history of the universe as he sits at a cafe in Paris, his eyes focus on people offscreen as they walk as he sits mostly motionless, he is dressed in a wool coat suit coat with a button-down shirt, he wears a brown beret and glasses and has a very professorial appearance, and at the end he offers a subtle closed-mouth smile as if he found the answer to the mystery of life, the lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film.
Basketball through hoop then explodes
Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care
A grandmother with neatly combed grey hair stands behind a colorful birthday cake with numerous candles at a wood dining room table, expression is one of pure joy and happiness with a happy glow in her eye. She leans forward and blows out the candles with a gentle puff, the cake has pink frosting and sprinkles and the candles cease to flicker, the grandmother wears a light blue blouse adorned with floral patterns, several happy friends and family sitting at the table can be seen celebrating, out of focus. The scene is beautifully captured, cinematic, showing a 3/4 view of the grandmother and the dining room. Warm color tones and soft lighting enhance the mood
Step-printing scene of a person running, cinematic film shot in 35mm
Five gray wolf pups frolicking and chasing each other around a remote gravel road, surrounded by grass. The pups run and leap, chasing each other, and nipping at each other, playing.
Tiltshift of a construction site filled with workers, equipment, and heavy machinery.
A giant, towering cloud in the shape of a man looms over the earth. The cloud man shoots lightning bolts down to the earth.
A Samoyed and a Golden Retriever dog are playfully romping through a futuristic neon city at night. The neon lights emitted from the nearby buildings glisten off of their fur.
The Glenfinnan Viaduct is a historic railway bridge in Scotland, UK, that crosses over the west highland line between the towns of Mallaig and Fort William. It is a stunning sight as a steam train leaves the bridge, traveling over the arch-covered viaduct. The landscape is dotted with lush greenery and rocky mountains, creating a picturesque backdrop for the train journey. The sky is blue and the sun is shining, making for a beautiful day to explore this majestic spot.
The camera directly faces colorful buildings in Burano, Italy. An adorable Dalmatian looks through a window on a building on the ground floor. Many people are walking and cycling along the canal streets in front of the buildings.
An adorable happy otter confidently stands on a surfboard wearing a yellow lifejacket, riding along turquoise tropical waters near lush tropical islands, 3D digital render art style.
This close-up shot of a chameleon showcases its striking color changing capabilities. The background is blurred, drawing attention to the animal's striking appearance.
A corgi vlogging itself in tropical Maui.
A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something. Its eyes are wide and happy as it jogs forward, scanning the branches, flowers, and leaves as it walks. The path is narrow as it makes its way between all the plants. The scene is captured from a ground-level angle, following the cat closely, giving a low and intimate perspective. The image is cinematic with warm tones and a grainy texture. The scattered daylight between the leaves and plants above creates a warm contrast, accentuating the cat's orange fur. The shot is clear and sharp, with a shallow depth of field.
Aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes. The caldera views are breathtaking, and the lighting creates a beautiful, serene atmosphere.
Tiltshift of a construction site filled with workers, equipment, and heavy machinery.

================================================
FILE: opensora/__init__.py
================================================
# 

================================================
FILE: opensora/acceleration/__init__.py
================================================


================================================
FILE: opensora/acceleration/communications.py
================================================
import torch
import torch.distributed as dist
from einops import rearrange
from opensora.acceleration.parallel_states import hccl_info, lccl_info, enable_LCCL
try:
    from lcalib.functional import lcal_all2allvc
except ImportError:  # lcalib is an optional dependency
    lcal_all2allvc = None

def broadcast(input_: torch.Tensor):
    sp_size = hccl_info.world_size
    src = hccl_info.rank // sp_size * sp_size
    dist.broadcast(input_, src=src, group=hccl_info.group)

_COUNT = 0
def _all_to_all(
    input_: torch.Tensor,
    scatter_dim: int,
    gather_dim: int,
):
    group = hccl_info.group
    sp_size = hccl_info.world_size
    input_list = [t.contiguous() for t in torch.tensor_split(input_, sp_size, scatter_dim)]
    output_list = [torch.empty_like(input_list[0]) for _ in range(sp_size)]
    dist.all_to_all(output_list, input_list, group=group)
    return torch.cat(output_list, dim=gather_dim).contiguous()

def _single_all_to_all(
    input_: torch.Tensor,
    scatter_dim: int,
    gather_dim: int,
    enable_HCCL=False,
):
    if enable_LCCL:
        sp_size = lccl_info.world_size
    else:
        sp_size = hccl_info.world_size
    inp_shape = list(input_.shape)
    inp_shape[scatter_dim] = inp_shape[scatter_dim] // sp_size
    if scatter_dim < 1:
        input_t = input_.reshape(
            [sp_size, inp_shape[scatter_dim]] + \
            inp_shape[scatter_dim + 1:]
        )
    else:
        # transpose groups of heads with the seq-len parallel dimension, so that we can scatter them!
        input_t = input_.reshape(
            [-1, sp_size, inp_shape[scatter_dim]] + \
            inp_shape[scatter_dim + 1:]
        ).transpose(0, 1).contiguous()

    output = torch.empty_like(input_t)
    if enable_LCCL and not enable_HCCL:
        matrix_count = torch.ones([sp_size, sp_size], dtype=torch.int64, device=input_t.device) * (
                    input_t.numel() // sp_size)
        lcal_all2allvc(input_t, output, matrix_count, lccl_info.group)
    else:
        dist.all_to_all_single(output, input_t, group=hccl_info.group)
    # if scattering the seq-dim, transpose the heads back to the original dimension
    if scatter_dim < 1:
        output = output.transpose(0, 1).contiguous()

    return output.reshape(
        inp_shape[: gather_dim] + [inp_shape[gather_dim] * sp_size, ] + inp_shape[gather_dim + 1:])


class _AllToAll(torch.autograd.Function):
    """All-to-all communication.

    Args:
        input_: input tensor
        scatter_dim: dimension that is split across ranks
        gather_dim: dimension along which received chunks are concatenated
        all_to_all_func: callable performing the exchange (e.g. _all_to_all)
    """

    @staticmethod
    def forward(ctx, input_, scatter_dim, gather_dim, all_to_all_func):
        ctx.scatter_dim = scatter_dim
        ctx.gather_dim = gather_dim
        ctx.all_to_all = all_to_all_func
        output = ctx.all_to_all(input_, scatter_dim, gather_dim)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        grad_output = ctx.all_to_all(
            grad_output,
            ctx.gather_dim,
            ctx.scatter_dim,
        )
        return (
            grad_output,
            None,
            None,
            None,
        )

def all_to_all_SBH(
    input_: torch.Tensor,
    scatter_dim: int = 1,
    gather_dim: int = 0,
):
    return _AllToAll.apply(input_, scatter_dim, gather_dim, _single_all_to_all)

def all_to_all_BSND(
    input_: torch.Tensor,
    scatter_dim: int = 2,
    gather_dim: int = 1,
):
    return _AllToAll.apply(input_, scatter_dim, gather_dim, _all_to_all)


def prepare_parallel_data(
        hidden_states, 
        encoder_hidden_states, 
        attention_mask, 
        encoder_attention_mask, 
        pooled_projections=None, 
        ):
    def all_to_all(
            hidden_states, 
            encoder_hidden_states, 
            attention_mask, 
            encoder_attention_mask, 
            pooled_projections, 
            ):
        # hidden_states          (b c t h w)   -gather0-> (sp*b c t h w)   -scatter2-> (sp*b c t//sp h w)
        # encoder_hidden_states  (b sp l/sp d) -gather0-> (sp*b sp l/sp d) -scatter1-> (sp*b 1 l/sp d)
        # attention_mask         (b t*sp h w)  -gather0-> (sp*b t*sp h w)  -scatter1-> (sp*b t h w)
        # encoder_attention_mask (b sp l)      -gather0-> (sp*b sp l)      -scatter1-> (sp*b 1 l)
        # pooled_projections     (b sp d)      -gather0-> (sp*b sp d)      -scatter1-> (sp*b 1 d)
        hidden_states = _single_all_to_all(hidden_states, scatter_dim=2, gather_dim=0, enable_HCCL=True)
        encoder_hidden_states = _single_all_to_all(encoder_hidden_states, scatter_dim=1, gather_dim=0, enable_HCCL=True)
        attention_mask = _single_all_to_all(attention_mask, scatter_dim=1, gather_dim=0, enable_HCCL=True)
        encoder_attention_mask = _single_all_to_all(encoder_attention_mask, scatter_dim=1, gather_dim=0, enable_HCCL=True)
        if pooled_projections is not None:
            pooled_projections = _single_all_to_all(pooled_projections, scatter_dim=1, gather_dim=0, enable_HCCL=True)

        return hidden_states, encoder_hidden_states, attention_mask, encoder_attention_mask, pooled_projections

    sp_size = hccl_info.world_size
    frame = hidden_states.shape[2]
    assert frame % sp_size == 0, "frame should be a multiple of sp_size"

    encoder_hidden_states = rearrange(
        encoder_hidden_states, 'b 1 (n x) h -> b n x h',
        n=sp_size, x=encoder_hidden_states.shape[2]//sp_size
        ).contiguous()
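    # pooled_projections defaults to None, so only repeat it when it is provided
    if pooled_projections is not None:
        pooled_projections = pooled_projections.repeat(1, sp_size, 1)
    hidden_states, encoder_hidden_states, attention_mask, encoder_attention_mask, pooled_projections = all_to_all(
        hidden_states,
        encoder_hidden_states,
        attention_mask.repeat(1, sp_size, 1, 1),
        encoder_attention_mask.repeat(1, sp_size, 1),
        pooled_projections,
        )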

    return hidden_states, encoder_hidden_states, attention_mask, encoder_attention_mask, pooled_projections
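
The exchange performed by `_all_to_all` and the shape bookkeeping in `_single_all_to_all` can be hard to picture from the collective calls alone. The following is a single-process sketch in plain Python (no `torch`, no process group). The helper names `split_columns`, `simulate_all_to_all`, and `all_to_all_output_shape` are hypothetical and not part of the repository. It assumes each rank holds a 2-D list whose rows are local sequence positions and whose columns are heads, and that the scattered dimension divides evenly by the group size.

```python
def split_columns(matrix, parts):
    """Split a 2-D list into `parts` equal column blocks."""
    width = len(matrix[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in matrix] for p in range(parts)]


def simulate_all_to_all(per_rank):
    """Stand-in for a scatter-columns / gather-rows all-to-all.

    per_rank[r] is the matrix held by rank r before the exchange.
    Returns the matrix each rank would hold afterwards.
    """
    sp_size = len(per_rank)
    # Each rank prepares one column block for every peer (the "input_list").
    send = [split_columns(m, sp_size) for m in per_rank]
    outputs = []
    for dst in range(sp_size):
        # Rank `dst` receives block `dst` from every source rank, in rank order.
        received = [send[src][dst] for src in range(sp_size)]
        # Stacking the rows plays the role of concatenating along gather_dim.
        outputs.append([row for block in received for row in block])
    return outputs


def all_to_all_output_shape(shape, scatter_dim, gather_dim, sp_size):
    """Shape arithmetic only: scatter_dim shrinks and gather_dim grows by sp_size."""
    out = list(shape)
    if out[scatter_dim] % sp_size != 0:
        raise ValueError("scatter dimension must be divisible by sp_size")
    out[scatter_dim] //= sp_size
    out[gather_dim] *= sp_size
    return out


rank0 = [[0, 1, 2, 3]]      # sequence position 0, heads 0..3
rank1 = [[10, 11, 12, 13]]  # sequence position 1, heads 0..3
out = simulate_all_to_all([rank0, rank1])
print(out)
print(all_to_all_output_shape([2, 8, 16, 32, 32], scatter_dim=2, gather_dim=0, sp_size=4))
```

After the exchange every rank sees the full sequence but only its own slice of heads, which is the layout that lets attention run over the whole sequence within a sequence-parallel group.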

================================================
FILE: opensora/acceleration/parallel_states.py
================================================
import torch
import torch_npu
import torch.distributed as dist
import os
try:
    from lcalib.functional import lcal_initialize
    enable_LCCL = True
except ImportError:  # lcalib is an optional dependency
    lcal_initialize = None
    enable_LCCL = False
class COMM_INFO:
    def __init__(self):
        self.group = None
        self.world_size = 0
        self.rank = -1

lccl_info = COMM_INFO()
hccl_info = COMM_INFO()
_SEQUENCE_PARALLEL_STATE = False
def initialize_sequence_parallel_state(sequence_parallel_size):
    global _SEQUENCE_PARALLEL_STATE
    if sequence_parallel_size > 1:
        _SEQUENCE_PARALLEL_STATE = True
        initialize_sequence_parallel_group(sequence_parallel_size)

def set_sequence_parallel_state(state):
    global _SEQUENCE_PARALLEL_STATE
    _SEQUENCE_PARALLEL_STATE = state

def get_sequence_parallel_state():
    return _SEQUENCE_PARALLEL_STATE

def initialize_sequence_parallel_group(sequence_parallel_size):
    """Initialize the sequence parallel group."""
    rank = int(os.getenv('RANK', '0'))
    world_size = int(os.getenv("WORLD_SIZE", '1'))
    assert world_size % sequence_parallel_size == 0, "world_size must be divisible by sequence_parallel_size"
    # hccl
    hccl_info.world_size = sequence_parallel_size
    hccl_info.rank = rank
    num_sequence_parallel_groups: int = world_size // sequence_parallel_size
    for i in range(num_sequence_parallel_groups):
        ranks = range(i * sequence_parallel_size, (i + 1) * sequence_parallel_size)
        group = dist.new_group(ranks)
        if rank in ranks:
            hccl_info.group = group

    if enable_LCCL:
        assert sequence_parallel_size == 8, "sequence_parallel_size should be 8 when enable_LCCL is True"
        rank %= sequence_parallel_size
        lccl_info.world_size = sequence_parallel_size
        lccl_info.group = lcal_initialize(rank, sequence_parallel_size)
        lccl_info.rank = rank

def destroy_sequence_parallel_group():
    """Destroy the sequence parallel group."""
    dist.destroy_process_group()
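
The rank grouping in `initialize_sequence_parallel_group` and the source-rank formula used by `broadcast` in communications.py can be checked without a distributed launch. The following is a minimal sketch in plain Python. The helper names `sequence_parallel_groups` and `broadcast_source` are hypothetical and not part of the repository, and the sketch assumes ranks are assigned to groups in contiguous blocks, as the loop above does.

```python
def sequence_parallel_groups(world_size, sp_size):
    """Return the contiguous rank blocks that would form each group."""
    if world_size % sp_size != 0:
        raise ValueError("world_size must be divisible by sequence_parallel_size")
    return [
        list(range(i * sp_size, (i + 1) * sp_size))
        for i in range(world_size // sp_size)
    ]


def broadcast_source(rank, sp_size):
    """First rank of the block that `rank` belongs to."""
    return rank // sp_size * sp_size


groups = sequence_parallel_groups(world_size=8, sp_size=4)
print(groups)
print(broadcast_source(rank=6, sp_size=4))
```

With 8 ranks and a group size of 4, ranks 4 to 7 share one group, and rank 4 is the rank they would all broadcast from.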


================================================
FILE: opensora/adaptor/__init__.py
================================================


================================================
FILE: opensora/adaptor/bf16_optimizer.py
================================================
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

from collections import OrderedDict
import torch
import sys
import os
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
from deepspeed import comm as dist
from deepspeed.runtime.constants import PIPE_REPLICATED
from deepspeed.runtime import ZeROOptimizer
from packaging import version as pkg_version

from deepspeed.git_version_info import version
from deepspeed.runtime.utils import (get_global_norm_of_tensors, clip_tensors_by_global_norm, DummyOptim,
                                     align_dense_tensors, all_gather_dp_groups, bwc_tensor_model_parallel_rank,
                                     is_model_parallel_parameter, see_memory_usage, graph_process)

from deepspeed.utils import link_hp_params, fragment_address
from deepspeed.checkpoint import enable_universal_checkpoint
from deepspeed.checkpoint.constants import (DS_VERSION, PARTITION_COUNT, BASE_OPTIMIZER_STATE,
                                            SINGLE_PARTITION_OF_FP32_GROUPS, CLIP_GRAD, GROUP_PADDINGS,
                                            PARAM_SLICE_MAPPINGS)

setattr(sys.modules[__name__], 'fragment_address', fragment_address)


def contigous_flatten(tensors):
    return _flatten_dense_tensors([tensor.contiguous() for tensor in tensors])


class BF16_Optimizer(ZeROOptimizer):

    def __init__(self,
                 init_optimizer,
                 param_names,
                 mpu=None,
                 clip_grad=0.0,
                 norm_type=2,
                 allgather_bucket_size=5000000000,
                 dp_process_group=None,
                 timers=None,
                 grad_acc_dtype=None,
                 graph_harvesting=False):
        # super().__init__()
        # base_class = ZeROOptimizer.__bases__[0]
        # # Directly call the base class's __init_

SYMBOL INDEX (1372 symbols across 99 files)

FILE: examples/rec_image.py
  function preprocess (line 11) | def preprocess(video_data: torch.Tensor, short_size: int = 128) -> torch...
  function main (line 23) | def main(args: argparse.Namespace):

FILE: examples/rec_video.py
  function array_to_video (line 20) | def array_to_video(image_array: npt.NDArray, fps: float = 30.0, output_f...
  function custom_to_video (line 32) | def custom_to_video(x: torch.Tensor, fps: float = 2.0, output_file: str ...
  function read_video (line 42) | def read_video(video_path: str, num_frames: int, sample_rate: int) -> to...
  function preprocess (line 68) | def preprocess(video_data: torch.Tensor, height: int = 128, width: int =...
  function main (line 83) | def main(args: argparse.Namespace):

FILE: opensora/acceleration/communications.py
  function broadcast (line 10) | def broadcast(input_: torch.Tensor):
  function _all_to_all (line 16) | def _all_to_all(
  function _single_all_to_all (line 28) | def _single_all_to_all(
  class _AllToAll (line 67) | class _AllToAll(torch.autograd.Function):
    method forward (line 78) | def forward(ctx, input_, scatter_dim, gather_dim, all_to_all_func):
    method backward (line 86) | def backward(ctx, grad_output):
  function all_to_all_SBH (line 99) | def all_to_all_SBH(
  function all_to_all_BSND (line 106) | def all_to_all_BSND(
  function prepare_parallel_data (line 114) | def prepare_parallel_data(

FILE: opensora/acceleration/parallel_states.py
  class COMM_INFO (line 11) | class COMM_INFO:
    method __init__ (line 12) | def __init__(self):
  function initialize_sequence_parallel_state (line 20) | def initialize_sequence_parallel_state(sequence_parallel_size):
  function set_sequence_parallel_state (line 26) | def set_sequence_parallel_state(state):
  function get_sequence_parallel_state (line 30) | def get_sequence_parallel_state():
  function initialize_sequence_parallel_group (line 33) | def initialize_sequence_parallel_group(sequence_parallel_size):
  function destroy_sequence_parallel_group (line 55) | def destroy_sequence_parallel_group():

FILE: opensora/adaptor/bf16_optimizer.py
  function contigous_flatten (line 30) | def contigous_flatten(tensors):
  class BF16_Optimizer (line 34) | class BF16_Optimizer(ZeROOptimizer):
    method __init__ (line 36) | def __init__(self,
    method _setup_for_real_optimizer (line 98) | def _setup_for_real_optimizer(self):
    method _enable_universal_checkpoint (line 178) | def _enable_universal_checkpoint(self):
    method _create_param_mapping (line 182) | def _create_param_mapping(self):
    method _link_all_hp_params (line 194) | def _link_all_hp_params(self):
    method initialize_optimizer_states (line 212) | def initialize_optimizer_states(self):
    method _split_flat_tensor (line 233) | def _split_flat_tensor(self, flat_tensor, num_elem_list):
    method _update_storage_to_flattened_tensor (line 244) | def _update_storage_to_flattened_tensor(self, tensor_list, flat_tensor):
    method _flatten_dense_tensors_aligned (line 249) | def _flatten_dense_tensors_aligned(self, tensor_list, alignment):
    method step (line 253) | def step(self, closure=None):
    method backward (line 277) | def backward(self, loss, update_hp_grads=True, clear_lp_grads=False, *...
    method update_hp_grads (line 293) | def update_hp_grads(self, clear_lp_grads=False):
    method get_grads_for_reduction (line 322) | def get_grads_for_reduction(self):
    method get_grads_for_norm (line 326) | def get_grads_for_norm(self, for_clipping=False):
    method update_lp_params (line 346) | def update_lp_params(self):
    method clear_hp_grads (line 361) | def clear_hp_grads(self):
    method clear_lp_grads (line 368) | def clear_lp_grads(self):
    method state_dict (line 375) | def state_dict(self):
    method _restore_from_bit16_weights (line 388) | def _restore_from_bit16_weights(self):
    method refresh_fp32_params (line 394) | def refresh_fp32_params(self):
    method load_state_dict (line 397) | def load_state_dict(self,
    method _load_legacy_checkpoint (line 408) | def _load_legacy_checkpoint(self, state_dict_list, load_optimizer_stat...
    method _load_universal_checkpoint (line 431) | def _load_universal_checkpoint(self, checkpoint_folder, load_optimizer...
    method param_groups (line 435) | def param_groups(self):
    method _load_hp_checkpoint_state (line 439) | def _load_hp_checkpoint_state(self, checkpoint_dir):
  function _get_padded_tensor (line 452) | def _get_padded_tensor(src_tensor, size):

FILE: opensora/adaptor/engine.py
  function split_half_float_double_sparse (line 123) | def split_half_float_double_sparse(tensors):
  class EngineTimers (line 142) | class EngineTimers(object):
    method __init__ (line 145) | def __init__(self, enable_micro_timers, enable_global_timers):
  class DeepSpeedEngine (line 177) | class DeepSpeedEngine(Module):
    method __init__ (line 180) | def __init__(
    method destroy (line 365) | def destroy(self):
    method _get_model_parameters (line 369) | def _get_model_parameters(self):
    method get_batch_info (line 391) | def get_batch_info(self):
    method set_train_batch_size (line 407) | def set_train_batch_size(self, train_batch_size):
    method set_train_micro_batch_size (line 425) | def set_train_micro_batch_size(self, micro_batch_size):
    method set_data_post_process_func (line 436) | def set_data_post_process_func(self, post_process_func):
    method set_custom_curriculum_learning_schedule (line 440) | def set_custom_curriculum_learning_schedule(self, schedule_func_dict):
    method get_global_grad_norm (line 444) | def get_global_grad_norm(self) -> float:
    method __getattr__ (line 456) | def __getattr__(self, name):
    method checkpoint_tag_validation_enabled (line 471) | def checkpoint_tag_validation_enabled(self):
    method checkpoint_tag_validation_fail (line 474) | def checkpoint_tag_validation_fail(self):
    method elasticity_enabled (line 477) | def elasticity_enabled(self):
    method is_elastic_model_parallel_supported (line 480) | def is_elastic_model_parallel_supported(self):
    method pld_enabled (line 488) | def pld_enabled(self):
    method pld_params (line 491) | def pld_params(self):
    method pld_theta (line 494) | def pld_theta(self):
    method pld_gamma (line 497) | def pld_gamma(self):
    method eigenvalue_enabled (line 500) | def eigenvalue_enabled(self):
    method eigenvalue_verbose (line 503) | def eigenvalue_verbose(self):
    method eigenvalue_max_iter (line 506) | def eigenvalue_max_iter(self):
    method eigenvalue_tol (line 509) | def eigenvalue_tol(self):
    method eigenvalue_stability (line 512) | def eigenvalue_stability(self):
    method eigenvalue_gas_boundary_resolution (line 515) | def eigenvalue_gas_boundary_resolution(self):
    method eigenvalue_layer_name (line 518) | def eigenvalue_layer_name(self):
    method eigenvalue_layer_num (line 521) | def eigenvalue_layer_num(self):
    method curriculum_enabled_legacy (line 524) | def curriculum_enabled_legacy(self):
    method curriculum_params_legacy (line 527) | def curriculum_params_legacy(self):
    method data_efficiency_enabled (line 530) | def data_efficiency_enabled(self):
    method data_efficiency_config (line 533) | def data_efficiency_config(self):
    method data_sampling_enabled (line 536) | def data_sampling_enabled(self):
    method data_sampling_config (line 539) | def data_sampling_config(self):
    method curriculum_learning_enabled (line 542) | def curriculum_learning_enabled(self):
    method curriculum_learning_config (line 545) | def curriculum_learning_config(self):
    method random_ltd_enabled (line 548) | def random_ltd_enabled(self):
    method random_ltd_config (line 551) | def random_ltd_config(self):
    method random_ltd_initialize (line 554) | def random_ltd_initialize(self):
    method wall_clock_breakdown (line 575) | def wall_clock_breakdown(self):
    method flops_profiler_enabled (line 578) | def flops_profiler_enabled(self):
    method flops_profiler_recompute_fwd_factor (line 581) | def flops_profiler_recompute_fwd_factor(self):
    method flops_profiler_profile_step (line 584) | def flops_profiler_profile_step(self):
    method flops_profiler_module_depth (line 590) | def flops_profiler_module_depth(self):
    method flops_profiler_top_modules (line 593) | def flops_profiler_top_modules(self):
    method flops_profiler_detailed (line 596) | def flops_profiler_detailed(self):
    method flops_profiler_output_file (line 601) | def flops_profiler_output_file(self):
    method memory_breakdown (line 604) | def memory_breakdown(self):
    method autotuning_enabled (line 607) | def autotuning_enabled(self):
    method autotuning_start_profile_step (line 610) | def autotuning_start_profile_step(self):
    method autotuning_end_profile_step (line 613) | def autotuning_end_profile_step(self):
    method autotuning_metric_path (line 616) | def autotuning_metric_path(self):
    method autotuning_model_info_path (line 622) | def autotuning_model_info_path(self):
    method autotuning_metric (line 628) | def autotuning_metric(self):
    method autotuning_profile_model_info (line 631) | def autotuning_profile_model_info(self):
    method sparse_gradients_enabled (line 636) | def sparse_gradients_enabled(self):
    method train_batch_size (line 639) | def train_batch_size(self):
    method train_micro_batch_size_per_gpu (line 642) | def train_micro_batch_size_per_gpu(self):
    method optimizer_name (line 645) | def optimizer_name(self):
    method optimizer_params (line 648) | def optimizer_params(self):
    method optimizer_legacy_fusion (line 651) | def optimizer_legacy_fusion(self):
    method scheduler_name (line 654) | def scheduler_name(self):
    method scheduler_params (line 657) | def scheduler_params(self):
    method quantize_training (line 660) | def quantize_training(self):
    method zero_optimization (line 675) | def zero_optimization(self):
    method zero_allow_untested_optimizer (line 678) | def zero_allow_untested_optimizer(self):
    method zero_force_ds_cpu_optimizer (line 681) | def zero_force_ds_cpu_optimizer(self):
    method zero_reduce_scatter (line 684) | def zero_reduce_scatter(self):
    method zero_overlap_comm (line 687) | def zero_overlap_comm(self):
    method zero_offload_optimizer (line 690) | def zero_offload_optimizer(self):
    method zero_offload_param (line 693) | def zero_offload_param(self):
    method zero_use_cpu_optimizer (line 696) | def zero_use_cpu_optimizer(self):
    method zero_cpu_offload (line 701) | def zero_cpu_offload(self):
    method zero_partial_offload (line 706) | def zero_partial_offload(self):
    method zero_sub_group_size (line 709) | def zero_sub_group_size(self):
    method zero_optimization_stage (line 712) | def zero_optimization_stage(self):
    method mics_shard_size (line 715) | def mics_shard_size(self):
    method zero_reduce_bucket_size (line 718) | def zero_reduce_bucket_size(self):
    method zero_multi_rank_bucket_allreduce (line 721) | def zero_multi_rank_bucket_allreduce(self):
    method zero_allgather_bucket_size (line 724) | def zero_allgather_bucket_size(self):
    method zero_optimization_partition_gradients (line 727) | def zero_optimization_partition_gradients(self):
    method zero_optimization_partition_weights (line 730) | def zero_optimization_partition_weights(self):
    method is_first_weights_partition_group (line 733) | def is_first_weights_partition_group(self):
    method zero_contiguous_gradients (line 740) | def zero_contiguous_gradients(self):
    method zero_load_from_fp32_weights (line 743) | def zero_load_from_fp32_weights(self):
    method zero_elastic_checkpoint (line 746) | def zero_elastic_checkpoint(self):
    method zero_max_live_parameters (line 749) | def zero_max_live_parameters(self):
    method zero_max_reuse_distance (line 752) | def zero_max_reuse_distance(self):
    method zero_prefetch_bucket_size (line 755) | def zero_prefetch_bucket_size(self):
    method zero_param_persistence_threshold (line 758) | def zero_param_persistence_threshold(self):
    method zero_model_persistence_threshold (line 761) | def zero_model_persistence_threshold(self):
    method zero_gather_16bit_weights_on_model_save (line 764) | def zero_gather_16bit_weights_on_model_save(self):
    method zero_grad_hooks (line 767) | def zero_grad_hooks(self):
    method zero_legacy_stage1 (line 770) | def zero_legacy_stage1(self):
    method zero_ignore_unused_parameters (line 773) | def zero_ignore_unused_parameters(self):
    method graph_harvesting (line 776) | def graph_harvesting(self):
    method fp16_enabled (line 779) | def fp16_enabled(self):
    method bfloat16_enabled (line 782) | def bfloat16_enabled(self):
    method fp16_master_weights_and_gradients (line 785) | def fp16_master_weights_and_gradients(self):
    method amp_enabled (line 788) | def amp_enabled(self):
    method amp_params (line 791) | def amp_params(self):
    method fp16_auto_cast (line 794) | def fp16_auto_cast(self):
    method loss_scale (line 797) | def loss_scale(self):
    method gradient_accumulation_steps (line 800) | def gradient_accumulation_steps(self):
    method use_node_local_storage (line 803) | def use_node_local_storage(self):
    method load_universal_checkpoint (line 806) | def load_universal_checkpoint(self):
    method communication_data_type (line 810) | def communication_data_type(self):
    method communication_data_type (line 824) | def communication_data_type(self, value):
    method postscale_gradients (line 827) | def postscale_gradients(self):
    method gradient_predivide_factor (line 830) | def gradient_predivide_factor(self):
    method steps_per_print (line 833) | def steps_per_print(self):
    method zero_allgather_partitions (line 836) | def zero_allgather_partitions(self):
    method zero_round_robin_gradients (line 839) | def zero_round_robin_gradients(self):
    method zero_hpz_partition_size (line 842) | def zero_hpz_partition_size(self):
    method zero_quantized_weights (line 845) | def zero_quantized_weights(self):
    method zero_quantized_nontrainable_weights (line 848) | def zero_quantized_nontrainable_weights(self):
    method zero_quantized_gradients (line 851) | def zero_quantized_gradients(self):
    method dump_state (line 854) | def dump_state(self):
    method gradient_clipping (line 857) | def gradient_clipping(self):
    method dynamic_loss_scale (line 860) | def dynamic_loss_scale(self):
    method initial_dynamic_scale (line 863) | def initial_dynamic_scale(self):
    method dynamic_loss_scale_args (line 866) | def dynamic_loss_scale_args(self):
    method swap_tensor_config (line 869) | def swap_tensor_config(self):
    method aio_config (line 872) | def aio_config(self):
    method get_data_types (line 875) | def get_data_types(self):
    method _optimizer_has_ckpt_event_prologue (line 892) | def _optimizer_has_ckpt_event_prologue(self):
    method _optimizer_has_ckpt_event_epilogue (line 895) | def _optimizer_has_ckpt_event_epilogue(self):
    method _configure_lr_scheduler (line 898) | def _configure_lr_scheduler(self, client_lr_scheduler):
    method _configure_checkpointing (line 914) | def _configure_checkpointing(self, dist_init_required):
    method _scheduler_from_config (line 943) | def _scheduler_from_config(self, optimizer):
    method _set_distributed_vars (line 960) | def _set_distributed_vars(self, args):
    method _configure_with_arguments (line 973) | def _configure_with_arguments(self, args, mpu):
    method _do_args_sanity_check (line 991) | def _do_args_sanity_check(self, args):
    method _is_supported_optimizer (line 1005) | def _is_supported_optimizer(self, optimizer_name):
    method _supported_optims (line 1008) | def _supported_optims(self):
    method _do_sanity_check (line 1022) | def _do_sanity_check(self):
    method _broadcast_model (line 1042) | def _broadcast_model(self):
    method __check_params (line 1061) | def __check_params(model: Module, dtype: torch.dtype) -> None:
    method _set_client_model (line 1068) | def _set_client_model(self, model):
    method _configure_distributed_model (line 1075) | def _configure_distributed_model(self, model):
    method _check_for_duplicates (line 1142) | def _check_for_duplicates(self, optimizer):
    method _do_optimizer_sanity_check (line 1155) | def _do_optimizer_sanity_check(self, basic_optimizer):
    method _configure_optimizer (line 1210) | def _configure_optimizer(self, client_optimizer, model_parameters):
    method _configure_basic_optimizer (line 1258) | def _configure_basic_optimizer(self, model_parameters):
    method _configure_compression_scheduler (line 1356) | def _configure_compression_scheduler(self):
    method _configure_random_ltd_scheduler (line 1359) | def _configure_random_ltd_scheduler(self, configs):
    method _configure_quantization (line 1362) | def _configure_quantization(self):
    method _configure_fp16_optimizer (line 1394) | def _configure_fp16_optimizer(self, optimizer):
    method _configure_bf16_optimizer (line 1445) | def _configure_bf16_optimizer(self, optimizer):
    method _configure_zero_optimizer (line 1466) | def _configure_zero_optimizer(self, optimizer):
    method _return_mics_optimizer (line 1608) | def _return_mics_optimizer(self, basic_optimizer, timers):
    method _configure_eigenvalue (line 1641) | def _configure_eigenvalue(self):
    method _configure_progressive_layer_drop (line 1654) | def _configure_progressive_layer_drop(self):
    method _configure_curriculum_scheduler_legacy (line 1659) | def _configure_curriculum_scheduler_legacy(self):
    method is_map_style_dataset (line 1664) | def is_map_style_dataset(obj):
    method is_iterable_style_dataset (line 1668) | def is_iterable_style_dataset(obj):
    method dataloader_drop_last (line 1671) | def dataloader_drop_last(self):
    method was_step_applied (line 1674) | def was_step_applied(self) -> bool:
    method deepspeed_io (line 1684) | def deepspeed_io(self,
    method train (line 1745) | def train(self, mode=True):
    method eval (line 1751) | def eval(self):
    method _scale_loss_by_gas (line 1757) | def _scale_loss_by_gas(self, prescaled_loss):
    method forward (line 1776) | def forward(self, *inputs, **kwargs):
    method _cast_inputs_half (line 1861) | def _cast_inputs_half(self, inputs):
    method print_forward_breakdown (line 1877) | def print_forward_breakdown(self, fwd_time):
    method allreduce_gradients (line 1901) | def allreduce_gradients(self, bucket_size=MEMORY_OPT_ALLREDUCE_SIZE):
    method backward (line 1920) | def backward(self, loss, allreduce_gradients=True, release_loss=False,...
    method is_gradient_accumulation_boundary (line 2002) | def is_gradient_accumulation_boundary(self):
    method set_gradient_accumulation_boundary (line 2018) | def set_gradient_accumulation_boundary(self, is_boundary):
    method zero_grad (line 2042) | def zero_grad(self):
    method clip_fp32_gradients (line 2049) | def clip_fp32_gradients(self):
    method _take_model_step (line 2052) | def _take_model_step(self, lr_kwargs, block_eigenvalue={}):
    method step (line 2118) | def step(self, lr_kwargs=None):
    method _start_timers (line 2224) | def _start_timers(self, timer_names):
    method _stop_timers (line 2228) | def _stop_timers(self, timer_names):
    method _autotuning_exit (line 2235) | def _autotuning_exit(self):
    method _write_monitor (line 2259) | def _write_monitor(self):
    method _get_optimizer_param (line 2290) | def _get_optimizer_param(self, param_name):
    method get_lr (line 2301) | def get_lr(self):
    method get_type (line 2304) | def get_type(self):
    method get_mom (line 2307) | def get_mom(self):
    method get_pld_theta (line 2313) | def get_pld_theta(self):
    method _report_progress (line 2319) | def _report_progress(self, step):
    method allreduce_bucket (line 2324) | def allreduce_bucket(self, bucket, dp_group):
    method allreduce_and_copy (line 2349) | def allreduce_and_copy(self, small_bucket, dp_group):
    method allreduce_no_retain (line 2354) | def allreduce_no_retain(self, bucket, dp_group, numel_per_bucket=50000...
    method _get_gradients_for_reduction (line 2367) | def _get_gradients_for_reduction(self):
    method _reduce_non_expert_gradients (line 2398) | def _reduce_non_expert_gradients(self, grads, elements_per_buffer):
    method _reduce_expert_gradients (line 2413) | def _reduce_expert_gradients(self, expert_grads, elements_per_buffer):
    method buffered_allreduce_fallback (line 2426) | def buffered_allreduce_fallback(self, grads=None, elements_per_buffer=...
    method sparse_allreduce_no_retain (line 2438) | def sparse_allreduce_no_retain(self, bucket, dp_group):
    method sparse_allreduce_bucket (line 2447) | def sparse_allreduce_bucket(self, bucket, dp_group):
    method sparse_allreduce (line 2453) | def sparse_allreduce(self, sparse, dp_group):
    method sparse_all_gather (line 2479) | def sparse_all_gather(self, value, dp_group):
    method all_gather_scalar (line 2506) | def all_gather_scalar(self, value, dp_group):
    method module_state_dict (line 2511) | def module_state_dict(self, destination=None, prefix="", keep_vars=Fal...
    method load_moe_state_dict (line 2525) | def load_moe_state_dict(checkpoint_path,
    method load_module_state_dict (line 2580) | def load_module_state_dict(self, checkpoint, strict=True, custom_load_...
    method _get_zero_ckpt_prefix (line 2611) | def _get_zero_ckpt_prefix(self, dp_rank, bf16_mode):
    method _get_rank_zero_ckpt_name (line 2614) | def _get_rank_zero_ckpt_name(self, checkpoints_path, tag, mp_rank, dp_...
    method _get_zero_ckpt_name (line 2623) | def _get_zero_ckpt_name(self, checkpoints_path, tag):
    method _get_ckpt_name (line 2629) | def _get_ckpt_name(self, checkpoints_path, tag, mp_placeholder=None):
    method _get_optimizer_ckpt_name (line 2651) | def _get_optimizer_ckpt_name(self, checkpoints_path, tag, expp_rank):
    method _get_expert_ckpt_name (line 2658) | def _get_expert_ckpt_name(checkpoints_path, layer_id, expert_id, tag, ...
    method _get_all_ckpt_names (line 2670) | def _get_all_ckpt_names(self, checkpoints_path, tag):
    method load_checkpoint (line 2679) | def load_checkpoint(self,
    method _load_checkpoint (line 2756) | def _load_checkpoint(self,
    method _load_zero_checkpoint (line 2895) | def _load_zero_checkpoint(self, load_dir, tag, load_optimizer_states=T...
    method update_optimizer_step (line 2933) | def update_optimizer_step(self, step):
    method _get_mp_rank_zero_checkpoint_names (line 2951) | def _get_mp_rank_zero_checkpoint_names(self, load_dir, tag, mp_rank, d...
    method _get_all_zero_checkpoint_names (line 2963) | def _get_all_zero_checkpoint_names(self, load_dir, tag, bf16_mode):
    method _get_all_zero_checkpoint_state_dicts (line 2981) | def _get_all_zero_checkpoint_state_dicts(self, zero_ckpt_names):
    method _get_all_zero_checkpoints (line 3001) | def _get_all_zero_checkpoints(self, load_dir, tag):
    method _checkpoint_tag_validation (line 3014) | def _checkpoint_tag_validation(self, tag):
    method save_checkpoint (line 3031) | def save_checkpoint(self, save_dir, tag=None, client_state={}, save_la...
    method _get_non_moe_state_dict (line 3107) | def _get_non_moe_state_dict(self, full_state_dict):
    method _save_moe_checkpoint (line 3117) | def _save_moe_checkpoint(self, save_dir, tag, client_state={}, exclude...
    method _create_checkpoint_file (line 3228) | def _create_checkpoint_file(self, save_dir, tag, zero_checkpoint):
    method _create_zero_checkpoint_files (line 3240) | def _create_zero_checkpoint_files(self, save_dir, tag):
    method _save_checkpoint (line 3251) | def _save_checkpoint(self, save_dir, tag, client_state={}, exclude_fro...
    method _get_buffer_names (line 3294) | def _get_buffer_names(self):
    method _get_param_shape_func (line 3315) | def _get_param_shape_func(self, param):
    method _get_param_fragment_func (line 3318) | def _get_param_fragment_func(self, param):
    method _get_zero_frozen_param_attributes (line 3321) | def _get_zero_frozen_param_attributes(self, attr_func):
    method _get_zero_param_shapes (line 3334) | def _get_zero_param_shapes(self):
    method _get_shared_params (line 3376) | def _get_shared_params(self):
    method _copy_recovery_script (line 3416) | def _copy_recovery_script(self, save_path):
    method _change_recovery_script_permissions (line 3425) | def _change_recovery_script_permissions(self, dst):
    method _save_zero_checkpoint (line 3435) | def _save_zero_checkpoint(self, save_path, tag):
    method _zero3_consolidated_16bit_state_dict (line 3445) | def _zero3_consolidated_16bit_state_dict(self):
    method save_fp16_model (line 3509) | def save_fp16_model(self, save_dir, save_filename="pytorch_model.bin"):
    method save_16bit_model (line 3514) | def save_16bit_model(self, save_dir, save_filename="pytorch_model.bin"):
    method empty_partition_cache (line 3561) | def empty_partition_cache(self):

FILE: opensora/adaptor/modules.py
  function fp32_layer_norm_forward (line 6) | def fp32_layer_norm_forward(self, inputs: torch.Tensor) -> torch.Tensor:
  function fp32_silu_forward (line 12) | def fp32_silu_forward(self, inputs: torch.Tensor) -> torch.Tensor:
  function fp32_gelu_forward (line 16) | def fp32_gelu_forward(self, inputs: torch.Tensor) -> torch.Tensor:
  function replace_with_fp32_forwards (line 20) | def replace_with_fp32_forwards():

FILE: opensora/adaptor/stage_1_and_2.py
  function input (line 49) | def input(msg):
  function split_half_float_double (line 53) | def split_half_float_double(tensors):
  function isclose (line 67) | def isclose(a, b, rtol=1e-09, atol=0.0):
  function lcm (line 71) | def lcm(x, y):
  function get_alignment_padding (line 76) | def get_alignment_padding(tensor_list, alignment):
  function move_to_cpu (line 82) | def move_to_cpu(tensor_list):
  function print_rank_msg (line 87) | def print_rank_msg(msg):
  function _get_padded_tensor (line 91) | def _get_padded_tensor(src_tensor, size):
  function contigous_flatten (line 100) | def contigous_flatten(tensors):
  function all_gather_into_tensor_dp_groups (line 104) | def all_gather_into_tensor_dp_groups(groups_flat, partitioned_param_grou...
  class DeepSpeedZeroOptimizer (line 121) | class DeepSpeedZeroOptimizer(ZeROOptimizer):
    method __init__ (line 133) | def __init__(self,
    method _enable_universal_checkpoint (line 561) | def _enable_universal_checkpoint(self):
    method _create_param_mapping (line 565) | def _create_param_mapping(self):
    method _link_all_hp_params (line 577) | def _link_all_hp_params(self):
    method is_moe_group (line 598) | def is_moe_group(self, group):
    method _configure_moe_settings (line 601) | def _configure_moe_settings(self):
    method _update_model_bit16_weights (line 628) | def _update_model_bit16_weights(self, group_index):
    method _round_robin_reorder (line 639) | def _round_robin_reorder(self, tensor_list, num_partitions):
    method _release_ipg_buffers (line 662) | def _release_ipg_buffers(self):
    method initialize_optimizer_states (line 668) | def initialize_optimizer_states(self):
    method reduce_gradients (line 694) | def reduce_gradients(self, pipeline_parallel=False):
    method get_first_param_index (line 720) | def get_first_param_index(self, group_id, param_group, partition_id):
    method initialize_gradient_partitioning_data_structures (line 727) | def initialize_gradient_partitioning_data_structures(self):
    method independent_gradient_partition_epilogue (line 751) | def independent_gradient_partition_epilogue(self):
    method reset_partition_gradient_structures (line 799) | def reset_partition_gradient_structures(self):
    method initialize_gradient_partition (line 809) | def initialize_gradient_partition(self, i, param_group, partition_id):
    method overlapping_partition_gradients_reduce_epilogue (line 860) | def overlapping_partition_gradients_reduce_epilogue(self):
    method fill_grad_accum_attribute (line 863) | def fill_grad_accum_attribute(self):
    method get_gradient_for_reduction (line 874) | def get_gradient_for_reduction(self, param):
    method get_param_gradient_attribute (line 880) | def get_param_gradient_attribute(self, param):
    method clear_grad_attribute (line 884) | def clear_grad_attribute(self, param):
    method create_reduce_and_remove_grad_hooks (line 890) | def create_reduce_and_remove_grad_hooks(self):
    method get_param_id (line 907) | def get_param_id(self, param):
    method report_ipg_memory_usage (line 911) | def report_ipg_memory_usage(self, tag, param_elems):
    method flatten_dense_tensors_aligned (line 919) | def flatten_dense_tensors_aligned(self, tensor_list, alignment):
    method reduce_independent_p_g_buckets_and_remove_grads (line 923) | def reduce_independent_p_g_buckets_and_remove_grads(self, param, i):
    method print_rank_0 (line 962) | def print_rank_0(self, message):
    method gradient_reduction_w_predivide (line 966) | def gradient_reduction_w_predivide(self, tensor):
    method allreduce_and_copy_with_multiple_ranks (line 993) | def allreduce_and_copy_with_multiple_ranks(self,
    method allreduce_and_scatter (line 1005) | def allreduce_and_scatter(self, bucket, numel_per_bucket=500000000, lo...
    method average_tensor (line 1033) | def average_tensor(self, tensor):
    method get_grad_position (line 1138) | def get_grad_position(self, group_id, tensor_list, first_offset, parti...
    method update_overflow_tracker_for_param_grad (line 1163) | def update_overflow_tracker_for_param_grad(self, param):
    method _get_offload_gradient_dict (line 1168) | def _get_offload_gradient_dict(self):
    method async_accumulate_grad_in_cpu_via_gpu (line 1178) | def async_accumulate_grad_in_cpu_via_gpu(self, param):
    method set_norm_for_param_grad (line 1227) | def set_norm_for_param_grad(self, param):
    method set_norm_for_param_grad_in_gpu (line 1240) | def set_norm_for_param_grad_in_gpu(self, param):
    method async_inplace_copy_grad_to_fp32_buffer_from_gpu (line 1255) | def async_inplace_copy_grad_to_fp32_buffer_from_gpu(self, param):
    method complete_grad_norm_calculation_for_cpu_offload (line 1273) | def complete_grad_norm_calculation_for_cpu_offload(self, params):
    method copy_grads_in_partition (line 1316) | def copy_grads_in_partition(self, param):
    method reduce_ipg_grads (line 1352) | def reduce_ipg_grads(self):
    method reduce_ready_partitions_and_remove_grads (line 1410) | def reduce_ready_partitions_and_remove_grads(self, param, i):
    method zero_reduced_gradients (line 1414) | def zero_reduced_gradients(self, partition_id, i):
    method flatten_and_print (line 1426) | def flatten_and_print(self, message, tensors, start=0, n=5):
    method get_grads_to_reduce (line 1434) | def get_grads_to_reduce(self, i, partition_id):
    method sequential_execution (line 1459) | def sequential_execution(self, function, message, group=None):
    method set_none_gradients_to_zero (line 1469) | def set_none_gradients_to_zero(self, i, partition_id):
    method allreduce_bucket (line 1476) | def allreduce_bucket(self, bucket, rank=None, log=None, divide=True, p...
    method _clear_previous_reduced_grads (line 1510) | def _clear_previous_reduced_grads(self):
    method allreduce_and_copy (line 1517) | def allreduce_and_copy(self, small_bucket, rank=None, log=None, divide...
    method allreduce_no_retain (line 1539) | def allreduce_no_retain(
    method buffered_reduce_fallback (line 1563) | def buffered_reduce_fallback(self, rank, grads, elements_per_buffer=50...
    method get_data_parallel_partitions (line 1575) | def get_data_parallel_partitions(self, tensor, group_id):
    method get_partition_info (line 1595) | def get_partition_info(self, tensor_list, partition_size, partition_id):
    method zero_grad (line 1626) | def zero_grad(self, set_to_none=True):
    method _model_parallel_all_reduce (line 1643) | def _model_parallel_all_reduce(self, tensor, op):
    method get_grad_norm_direct (line 1651) | def get_grad_norm_direct(self, gradients, params, norm_type=2):
    method get_flat_partition (line 1704) | def get_flat_partition(self, tensor_list, first_offset, partition_size...
    method free_grad_in_param_list (line 1744) | def free_grad_in_param_list(self, param_list):
    method reset_cpu_buffers (line 1749) | def reset_cpu_buffers(self):
    method set_lr (line 1753) | def set_lr(self, lr):
    method get_lr (line 1758) | def get_lr(self):
    method override_loss_scale (line 1762) | def override_loss_scale(self, loss_scale):
    method scaled_global_norm (line 1768) | def scaled_global_norm(self, norm_type=2):
    method get_bit16_param_group (line 1785) | def get_bit16_param_group(self, group_no):
    method _optimizer_step (line 1790) | def _optimizer_step(self, group_no):
    method step (line 1802) | def step(self, closure=None):
    method update_lp_params (line 1924) | def update_lp_params(self):
    method _average_expert_grad_norms (line 1936) | def _average_expert_grad_norms(self, norm_groups):
    method unscale_and_clip_grads (line 1946) | def unscale_and_clip_grads(self, grad_groups_flat, total_norm):
    method _check_overflow (line 1963) | def _check_overflow(self, partition_gradients=True):
    method has_overflow_serial (line 1967) | def has_overflow_serial(self, params, is_grad_list=False):
    method has_overflow_partitioned_grads_serial (line 1974) | def has_overflow_partitioned_grads_serial(self):
    method has_overflow (line 1981) | def has_overflow(self, partition_gradients=True):
    method _has_inf_or_nan (line 2007) | def _has_inf_or_nan(x, j=None):
    method backward (line 2027) | def backward(self, loss, retain_graph=False):
    method check_overflow (line 2062) | def check_overflow(self, partition_gradients=True):
    method _update_scale (line 2065) | def _update_scale(self, has_overflow=False):
    method _get_state (line 2069) | def _get_state(self):
    method _set_state (line 2072) | def _set_state(self, value):
    method _get_param_groups (line 2079) | def _get_param_groups(self):
    method _set_param_groups (line 2082) | def _set_param_groups(self, value):
    method _get_loss_scale (line 2088) | def _get_loss_scale(self):
    method _set_loss_scale (line 2094) | def _set_loss_scale(self, value):
    method _get_groups_without_padding (line 2102) | def _get_groups_without_padding(self, groups_with_padding):
    method _get_state_without_padding (line 2111) | def _get_state_without_padding(self, state_with_padding, padding):
    method _get_base_optimizer_state (line 2124) | def _get_base_optimizer_state(self):
    method state_dict (line 2133) | def state_dict(self):
    method _restore_from_elastic_fp32_weights (line 2179) | def _restore_from_elastic_fp32_weights(self, all_state_dict):
    method _restore_from_bit16_weights (line 2198) | def _restore_from_bit16_weights(self):
    method refresh_fp32_params (line 2205) | def refresh_fp32_params(self):
    method _partition_base_optimizer_state (line 2209) | def _partition_base_optimizer_state(self, state_key, all_partition_sta...
    method _restore_base_optimizer_state (line 2220) | def _restore_base_optimizer_state(self, base_optimizer_group_states):
    method get_ep_ranks (line 2233) | def get_ep_ranks(self, rank=0, group_name=None):
    method _restore_elastic_base_optimizer_state (line 2245) | def _restore_elastic_base_optimizer_state(self, all_state_dict):
    method load_state_dict (line 2270) | def load_state_dict(self,
    method _load_universal_checkpoint (line 2281) | def _load_universal_checkpoint(self, checkpoint_folder, load_optimizer...
    method param_groups (line 2285) | def param_groups(self):
    method _load_hp_checkpoint_state (line 2289) | def _load_hp_checkpoint_state(self, checkpoint_dir):
    method _load_global_state (line 2308) | def _load_global_state(self, sd):
    method _load_legacy_checkpoint (line 2326) | def _load_legacy_checkpoint(self, state_dict_list, load_optimizer_stat...
  function _handle_overflow (line 2415) | def _handle_overflow(cpu_sum, x, i):
  function estimate_zero2_model_states_mem_needs (line 2427) | def estimate_zero2_model_states_mem_needs(total_params,
  function model_to_params (line 2444) | def model_to_params(model):
  function estimate_zero2_model_states_mem_needs_all_live (line 2450) | def estimate_zero2_model_states_mem_needs_all_live(model,
  function estimate_zero2_model_states_mem_needs_all_cold (line 2480) | def estimate_zero2_model_states_mem_needs_all_cold(total_params,

FILE: opensora/adaptor/utils.py
  class DummyOptim (line 39) | class DummyOptim():
    method __init__ (line 45) | def __init__(self, params):
  function graph_process (line 53) | def graph_process(replay_first_step, func, *args, **kwargs):
  function noop_decorator (line 71) | def noop_decorator(func):
  class noop_context (line 75) | class noop_context(object):
    method __init__ (line 77) | def __init__(self):
    method __enter__ (line 80) | def __enter__(self):
    method __exit__ (line 83) | def __exit__(self, exc_type, exc_val, exc_tb):
  function ensure_directory_exists (line 87) | def ensure_directory_exists(filename):
  function set_random_seed (line 97) | def set_random_seed(seed):
  function is_model_parallel_parameter (line 110) | def is_model_parallel_parameter(p) -> bool:
  function bwc_tensor_model_parallel_rank (line 120) | def bwc_tensor_model_parallel_rank(mpu=None):
  function copy_to_device (line 158) | def copy_to_device(item, device, criterion_func):
  function move_to_device (line 182) | def move_to_device(item, device, criterion_func):
  class CheckOverflow (line 208) | class CheckOverflow(object):
    method __init__ (line 211) | def __init__(self, param_groups=None, mpu=None, zero_reduce_scatter=Fa...
    method check_using_norm (line 224) | def check_using_norm(self, norm_group, reduce_overflow=True):
    method check (line 243) | def check(self, param_groups=None):
    method has_overflow_serial (line 262) | def has_overflow_serial(self, params):
    method has_overflow (line 268) | def has_overflow(self, params, has_moe_params=None):
    method _has_inf_or_nan (line 299) | def _has_inf_or_nan(x, i):
  function _handle_overflow (line 320) | def _handle_overflow(cpu_sum, x, i):
  function get_global_norm (line 332) | def get_global_norm(norm_list):
  function clip_grad_norm_ (line 342) | def clip_grad_norm_(parameters, max_norm, norm_type=2, mpu=None):
  function get_grad_norm (line 407) | def get_grad_norm(parameters, norm_type=2, mpu=None):
  function get_grad_zeros (line 463) | def get_grad_zeros(parameters, mpu=None):
  function get_weight_norm (line 503) | def get_weight_norm(parameters, norm_type=2, mpu=None):
  function prefix_sum_inc (line 559) | def prefix_sum_inc(weights):
  function partition_uniform (line 572) | def partition_uniform(num_items, num_parts):
  function partition_balanced (line 593) | def partition_balanced(weights, num_parts):
  class PartitionedTensor (line 634) | class PartitionedTensor:
    method __init__ (line 636) | def __init__(self, tensor, group, partition_meta=None):
    method from_meta (line 648) | def from_meta(cls, meta, local_part, group, device=get_accelerator().d...
    method _partition_tensor (line 673) | def _partition_tensor(self, tensor):
    method full (line 681) | def full(self, device=None):
    method to_meta (line 700) | def to_meta(self):
    method data (line 717) | def data(self):
    method local_size (line 720) | def local_size(self):
    method full_size (line 723) | def full_size(self):
  function memory_status (line 731) | def memory_status(msg, print_rank=-1, reset_max=False):
  function get_ma_status (line 770) | def get_ma_status():
  function empty_cache (line 776) | def empty_cache():
  function see_memory_usage (line 781) | def see_memory_usage(message, force=False):
  function call_to_str (line 805) | def call_to_str(base, *args, **kwargs):
  function get_only_unique_item (line 827) | def get_only_unique_item(items):
  function clip_gradients (line 836) | def clip_gradients(parameters, max_norm=1.0, global_grad_norm=None, mpu=...
  function get_global_norm_of_tensors (line 855) | def get_global_norm_of_tensors(input_tensors, norm_type=2, mpu=None, use...
  function clip_tensors_by_global_norm (line 908) | def clip_tensors_by_global_norm(input_tensors, max_norm=1.0, global_norm...
  function align_dense_tensors (line 942) | def align_dense_tensors(tensor_list, alignment):
  function all_gather_into_tensor_dp_groups (line 956) | def all_gather_into_tensor_dp_groups(groups_flat, partitioned_param_grou...
  function all_gather_dp_groups (line 969) | def all_gather_dp_groups(groups_flat, partitioned_param_groups, zp_proce...
  class TLinear (line 1006) | class TLinear(torch.nn.Linear):
    method __init__ (line 1008) | def __init__(self, orig_layer, name=""):
    method _fwd (line 1015) | def _fwd(self, input):
    method _fwd_bias_add (line 1018) | def _fwd_bias_add(self, input):
    method forward (line 1021) | def forward(self, input):
  function get_inactive_params (line 1025) | def get_inactive_params(param_list):
  function required_torch_version (line 1031) | def required_torch_version(min_version=None, max_version=None):

FILE: opensora/adaptor/zp_manager.py
  class ZPManager (line 6) | class ZPManager(object):
    method __init__ (line 7) | def __init__(self, zp_size=8):
    method init_group (line 15) | def init_group(self):

FILE: opensora/dataset/__init__.py
  function getdataset (line 19) | def getdataset(args):

FILE: opensora/dataset/inpaint_dataset.py
  function type_ratio_normalize (line 44) | def type_ratio_normalize(mask_type_ratio_dict):
  class Inpaint_dataset (line 53) | class Inpaint_dataset(T2V_dataset):
    method __init__ (line 54) | def __init__(self, args, resize_transform, transform, temporal_sample,...
    method __getitem__ (line 86) | def __getitem__(self, idx):
    method get_data (line 100) | def get_data(self, idx):
    method drop (line 111) | def drop(self, text, is_video=True):
    method get_video (line 126) | def get_video(self, idx):
    method get_image (line 200) | def get_image(self, idx):

FILE: opensora/dataset/t2v_datasets.py
  function filter_json_by_existed_files (line 42) | def filter_json_by_existed_files(directory, data, postfix=".mp4"):
  function random_video_noise (line 56) | def random_video_noise(t, c, h, w):
  class SingletonMeta (line 62) | class SingletonMeta(type):
    method __call__ (line 68) | def __call__(cls, *args, **kwargs):
  class DataSetProg (line 75) | class DataSetProg(metaclass=SingletonMeta):
    method __init__ (line 76) | def __init__(self):
    method set_cap_list (line 84) | def set_cap_list(self, num_workers, cap_list, n_elements):
    method get_item (line 100) | def get_item(self, work_info):
  function find_closest_y (line 113) | def find_closest_y(x, vae_stride_t=4, model_ds_t=1):
  function filter_resolution (line 128) | def filter_resolution(h, w, max_h_div_w_ratio=17/16, min_h_div_w_ratio=8...
  function read_parquet (line 133) | def read_parquet(path):
  class DecordDecoder (line 140) | class DecordDecoder(object):
    method __init__ (line 141) | def __init__(self, url, num_threads=1):
    method get_avg_fps (line 149) | def get_avg_fps(self):
    method get_num_frames (line 152) | def get_num_frames(self):
    method get_height (line 155) | def get_height(self):
    method get_width (line 158) | def get_width(self):
    method get_batch (line 162) | def get_batch(self, frame_indices):
  class T2V_dataset (line 172) | class T2V_dataset(Dataset):
    method __init__ (line 173) | def __init__(self, args, transform, temporal_sample, tokenizer_1, toke...
    method set_checkpoint (line 219) | def set_checkpoint(self, n_used_elements):
    method __len__ (line 223) | def __len__(self):
    method __getitem__ (line 226) | def __getitem__(self, idx):
    method get_data (line 239) | def get_data(self, idx):
    method get_video (line 246) | def get_video(self, idx):
    method get_image (line 307) | def get_image(self, idx):
    method define_frame_index (line 360) | def define_frame_index(self, data):
    method decord_read (line 568) | def decord_read(self, video_data):
    method opencv_read (line 597) | def opencv_read(self, video_data):
    method get_actual_frame (line 625) | def get_actual_frame(self, fps, start_frame_idx, clip_total_frames, pa...

FILE: opensora/dataset/transform.py
  function _is_tensor_video_clip (line 12) | def _is_tensor_video_clip(clip):
  function center_crop_arr (line 22) | def center_crop_arr(pil_image, image_size):
  function crop (line 43) | def crop(clip, i, j, h, w):
  function resize (line 53) | def resize(clip, target_size, interpolation_mode):
  function resize_scale (line 59) | def resize_scale(clip, target_size, interpolation_mode):
  function resized_crop (line 67) | def resized_crop(clip, i, j, h, w, size, interpolation_mode="bilinear"):
  function center_crop (line 87) | def center_crop(clip, crop_size):
  function center_crop_using_short_edge (line 100) | def center_crop_using_short_edge(clip):
  function center_crop_th_tw (line 116) | def center_crop_th_tw(clip, th, tw, top_crop):
  function random_shift_crop (line 137) | def random_shift_crop(clip):
  function to_tensor (line 159) | def to_tensor(clip):
  function to_tensor_after_resize (line 175) | def to_tensor_after_resize(clip):
  function normalize (line 187) | def normalize(clip, mean, std, inplace=False):
  function hflip (line 207) | def hflip(clip):
  class RandomCropVideo (line 219) | class RandomCropVideo:
    method __init__ (line 220) | def __init__(self, size):
    method __call__ (line 226) | def __call__(self, clip):
    method get_params (line 237) | def get_params(self, clip):
    method __repr__ (line 252) | def __repr__(self) -> str:
  function get_params (line 256) | def get_params(h, w, stride):
  class SpatialStrideCropVideo (line 265) | class SpatialStrideCropVideo:
    method __init__ (line 266) | def __init__(self, stride):
    method __call__ (line 269) | def __call__(self, clip):
    method __repr__ (line 282) | def __repr__(self) -> str:
  function longsideresize (line 285) | def longsideresize(h, w, size, skip_low_resolution):
  function maxhwresize (line 300) | def maxhwresize(ori_height, ori_width, max_hxw):
  class LongSideResizeVideo (line 310) | class LongSideResizeVideo:
    method __init__ (line 316) | def __init__(
    method __call__ (line 326) | def __call__(self, clip):
    method __repr__ (line 341) | def __repr__(self) -> str:
  class MaxHWResizeVideo (line 345) | class MaxHWResizeVideo:
    method __init__ (line 351) | def __init__(
    method __call__ (line 359) | def __call__(self, clip):
    method __repr__ (line 374) | def __repr__(self) -> str:
  class CenterCropResizeVideo (line 378) | class CenterCropResizeVideo:
    method __init__ (line 384) | def __init__(
    method __call__ (line 396) | def __call__(self, clip):
    method __repr__ (line 409) | def __repr__(self) -> str:
  class UCFCenterCropVideo (line 413) | class UCFCenterCropVideo:
    method __init__ (line 419) | def __init__(
    method __call__ (line 433) | def __call__(self, clip):
    method __repr__ (line 445) | def __repr__(self) -> str:
  class KineticsRandomCropResizeVideo (line 449) | class KineticsRandomCropResizeVideo:
    method __init__ (line 454) | def __init__(
    method __call__ (line 468) | def __call__(self, clip):
  class CenterCropVideo (line 474) | class CenterCropVideo:
    method __init__ (line 475) | def __init__(
    method __call__ (line 489) | def __call__(self, clip):
    method __repr__ (line 500) | def __repr__(self) -> str:
  class NormalizeVideo (line 504) | class NormalizeVideo:
    method __init__ (line 513) | def __init__(self, mean, std, inplace=False):
    method __call__ (line 518) | def __call__(self, clip):
    method __repr__ (line 525) | def __repr__(self) -> str:
  class ToTensorVideo (line 529) | class ToTensorVideo:
    method __init__ (line 535) | def __init__(self):
    method __call__ (line 538) | def __call__(self, clip):
    method __repr__ (line 547) | def __repr__(self) -> str:
  class ToTensorAfterResize (line 551) | class ToTensorAfterResize:
    method __init__ (line 557) | def __init__(self):
    method __call__ (line 560) | def __call__(self, clip):
    method __repr__ (line 569) | def __repr__(self) -> str:
  class RandomHorizontalFlipVideo (line 574) | class RandomHorizontalFlipVideo:
    method __init__ (line 581) | def __init__(self, p=0.5):
    method __call__ (line 584) | def __call__(self, clip):
    method __repr__ (line 595) | def __repr__(self) -> str:
  class TemporalRandomCrop (line 602) | class TemporalRandomCrop(object):
    method __init__ (line 609) | def __init__(self, size):
    method __call__ (line 612) | def __call__(self, total_frames):
  class DynamicSampleDuration (line 618) | class DynamicSampleDuration(object):
    method __init__ (line 625) | def __init__(self, t_stride, extra_1):
    method __call__ (line 629) | def __call__(self, t, h, w):
  function add_masking_notice (line 752) | def add_masking_notice(caption):
  function add_webvid_watermark_notice (line 758) | def add_webvid_watermark_notice(caption):
  function add_aesthetic_notice_video (line 762) | def add_aesthetic_notice_video(caption, aesthetic_score):
  function add_aesthetic_notice_image (line 773) | def add_aesthetic_notice_image(caption, aesthetic_score):
  function add_high_aesthetic_notice_image (line 782) | def add_high_aesthetic_notice_image(caption):
  function add_high_aesthetic_notice_image_human (line 786) | def add_high_aesthetic_notice_image_human(caption):
  function basic_clean (line 790) | def basic_clean(text):
  function whitespace_clean (line 796) | def whitespace_clean(text):
  function clean_youtube (line 802) | def clean_youtube(text, is_tags=False):
  function clean_vidal (line 815) | def clean_vidal(text):
  function calculate_statistics (line 825) | def calculate_statistics(data):

FILE: opensora/dataset/virtual_disk.py
  class SuppressStdout (line 10) | class SuppressStdout:
    method __new__ (line 13) | def __new__(cls, *args, **kwargs):
    method __enter__ (line 18) | def __enter__(self):
    method __exit__ (line 22) | def __exit__(self, exc_type, exc_value, traceback):
  class ObsConnection (line 29) | class ObsConnection:
    method __init__ (line 35) | def __init__(self):
    method connect (line 44) | def connect(self, obs):
  class VirtualDisk (line 57) | class VirtualDisk:
    method __init__ (line 64) | def __init__(self, storage_dir, size="1G", obs="/home/opensora/obsutil...
    method _convert_size_to_bytes (line 80) | def _convert_size_to_bytes(self, size):
    method create_ramdisk (line 95) | def create_ramdisk(self):
    method load_index (line 109) | def load_index(self):
    method save_index (line 119) | def save_index(self):
    method unmount_ramdisk (line 131) | def unmount_ramdisk(self):
    method is_tmpfs_mounted (line 146) | def is_tmpfs_mounted(self):
    method get_data (line 156) | def get_data(self, key):
    method del_data (line 190) | def del_data(self, local_path):
    method download_and_convert_to_pickle (line 193) | def download_and_convert_to_pickle(self, bucket, object_name, local_pa...
    method ensure_storage_limit (line 208) | def ensure_storage_limit(self):
    method get_total_storage_size (line 221) | def get_total_storage_size(self):

FILE: opensora/models/causalvideovae/__init__.py
  class CausalVAEModelWrapper (line 15) | class CausalVAEModelWrapper(nn.Module):
    method __init__ (line 16) | def __init__(self, model_path, subfolder=None, cache_dir=None, use_ema...
    method encode (line 20) | def encode(self, x):
    method decode (line 23) | def decode(self, x):
    method dtype (line 28) | def dtype(self):
  class WFVAEModelWrapper (line 31) | class WFVAEModelWrapper(nn.Module):
    method __init__ (line 32) | def __init__(self, model_path, subfolder=None, cache_dir=None, **kwargs):
    method encode (line 38) | def encode(self, x):
    method decode (line 42) | def decode(self, x):
    method dtype (line 48) | def dtype(self):

FILE: opensora/models/causalvideovae/dataset/ddp_sampler.py
  class CustomDistributedSampler (line 9) | class CustomDistributedSampler(Sampler[T_co]):
    method __init__ (line 58) | def __init__(self, dataset: Dataset, num_replicas: Optional[int] = None,
    method __iter__ (line 93) | def __iter__(self) -> Iterator[T_co]:
    method __len__ (line 123) | def __len__(self) -> int:
    method set_epoch (line 126) | def set_epoch(self, epoch: int) -> None:
    method state_dict (line 137) | def state_dict(self) -> dict:
    method load_state_dict (line 144) | def load_state_dict(self, state_dict: dict) -> None:

FILE: opensora/models/causalvideovae/dataset/transform.py
  function _is_tensor_video_clip (line 7) | def _is_tensor_video_clip(clip):
  function center_crop_arr (line 17) | def center_crop_arr(pil_image, image_size):
  function crop (line 38) | def crop(clip, i, j, h, w):
  function resize (line 48) | def resize(clip, target_size, interpolation_mode):
  function resize_scale (line 54) | def resize_scale(clip, target_size, interpolation_mode):
  function resized_crop (line 62) | def resized_crop(clip, i, j, h, w, size, interpolation_mode="bilinear"):
  function center_crop (line 82) | def center_crop(clip, crop_size):
  function center_crop_using_short_edge (line 95) | def center_crop_using_short_edge(clip):
  function random_shift_crop (line 110) | def random_shift_crop(clip):
  function to_tensor (line 132) | def to_tensor(clip):
  function normalize (line 148) | def normalize(clip, mean, std, inplace=False):
  function hflip (line 168) | def hflip(clip):
  class RandomCropVideo (line 180) | class RandomCropVideo:
    method __init__ (line 181) | def __init__(self, size):
    method __call__ (line 187) | def __call__(self, clip):
    method get_params (line 198) | def get_params(self, clip):
    method __repr__ (line 213) | def __repr__(self) -> str:
  class SpatialStrideCropVideo (line 217) | class SpatialStrideCropVideo:
    method __init__ (line 218) | def __init__(self, stride):
    method __call__ (line 221) | def __call__(self, clip):
    method get_params (line 232) | def get_params(self, clip):
    method __repr__ (line 239) | def __repr__(self) -> str:
  class LongSideResizeVideo (line 242) | class LongSideResizeVideo:
    method __init__ (line 248) | def __init__(
    method __call__ (line 258) | def __call__(self, clip):
    method __repr__ (line 279) | def __repr__(self) -> str:
  class CenterCropResizeVideo (line 282) | class CenterCropResizeVideo:
    method __init__ (line 288) | def __init__(
    method __call__ (line 302) | def __call__(self, clip):
    method __repr__ (line 315) | def __repr__(self) -> str:
  class UCFCenterCropVideo (line 319) | class UCFCenterCropVideo:
    method __init__ (line 325) | def __init__(
    method __call__ (line 339) | def __call__(self, clip):
    method __repr__ (line 351) | def __repr__(self) -> str:
  class KineticsRandomCropResizeVideo (line 355) | class KineticsRandomCropResizeVideo:
    method __init__ (line 360) | def __init__(
    method __call__ (line 374) | def __call__(self, clip):
  class CenterCropVideo (line 380) | class CenterCropVideo:
    method __init__ (line 381) | def __init__(
    method __call__ (line 395) | def __call__(self, clip):
    method __repr__ (line 406) | def __repr__(self) -> str:
  class NormalizeVideo (line 410) | class NormalizeVideo:
    method __init__ (line 419) | def __init__(self, mean, std, inplace=False):
    method __call__ (line 424) | def __call__(self, clip):
    method __repr__ (line 431) | def __repr__(self) -> str:
  class ToTensorVideo (line 435) | class ToTensorVideo:
    method __init__ (line 441) | def __init__(self):
    method __call__ (line 444) | def __call__(self, clip):
    method __repr__ (line 453) | def __repr__(self) -> str:
  class RandomHorizontalFlipVideo (line 457) | class RandomHorizontalFlipVideo:
    method __init__ (line 464) | def __init__(self, p=0.5):
    method __call__ (line 467) | def __call__(self, clip):
    method __repr__ (line 478) | def __repr__(self) -> str:
  class TemporalRandomCrop (line 485) | class TemporalRandomCrop(object):
    method __init__ (line 492) | def __init__(self, size):
    method __call__ (line 495) | def __call__(self, total_frames):
  class DynamicSampleDuration (line 501) | class DynamicSampleDuration(object):
    method __init__ (line 508) | def __init__(self, t_stride, extra_1):
    method __call__ (line 512) | def __call__(self, t, h, w):

FILE: opensora/models/causalvideovae/dataset/video_dataset.py
  class DecordInit (line 19) | class DecordInit(object):
    method __init__ (line 20) | def __init__(self, num_threads=1):
    method __call__ (line 24) | def __call__(self, filename):
    method __repr__ (line 30) | def __repr__(self):
  function TemporalRandomCrop (line 38) | def TemporalRandomCrop(total_frames, size):
  function _format_video_shape (line 44) | def _format_video_shape(video, time_compress=4, spatial_compress=8):
  class TrainVideoDataset (line 63) | class TrainVideoDataset(data.Dataset):
    method __init__ (line 66) | def __init__(
    method _make_dataset (line 98) | def _make_dataset(self):
    method __len__ (line 118) | def __len__(self):
    method __getitem__ (line 121) | def __getitem__(self, idx):
    method decord_read (line 132) | def decord_read(self, path):
  function resize (line 151) | def resize(x, resolution):
  class ValidVideoDataset (line 163) | class ValidVideoDataset(data.Dataset):
    method __init__ (line 166) | def __init__(
    method _make_dataset (line 192) | def _make_dataset(self, real_video_dir):
    method __len__ (line 212) | def __len__(self):
    method __getitem__ (line 215) | def __getitem__(self, index):
    method _load_video (line 228) | def _load_video(self, video_path, sample_rate=None):

FILE: opensora/models/causalvideovae/eval/cal_fvd.py
  function trans (line 5) | def trans(x):
  function calculate_fvd (line 15) | def calculate_fvd(videos1, videos2, device, method='styleganv'):
  function main (line 67) | def main():

FILE: opensora/models/causalvideovae/eval/cal_lpips.py
  function trans (line 15) | def trans(x):
  function calculate_lpips (line 25) | def calculate_lpips(videos1, videos2, device):
  function main (line 82) | def main():

FILE: opensora/models/causalvideovae/eval/cal_psnr.py
  function img_psnr_cuda (line 6) | def img_psnr_cuda(img1, img2):
  function img_psnr (line 18) | def img_psnr(img1, img2):
  function trans (line 30) | def trans(x):
  function calculate_psnr (line 33) | def calculate_psnr(videos1, videos2):
  function main (line 84) | def main():

FILE: opensora/models/causalvideovae/eval/cal_ssim.py
  function ssim (line 6) | def ssim(img1, img2):
  function calculate_ssim_function (line 26) | def calculate_ssim_function(img1, img2):
  function trans (line 44) | def trans(x):
  function calculate_ssim (line 47) | def calculate_ssim(videos1, videos2):
  function main (line 99) | def main():

FILE: opensora/models/causalvideovae/eval/eval.py
  class EvalDataset (line 26) | class EvalDataset(ValidVideoDataset):
    method __init__ (line 27) | def __init__(
    method _make_dataset (line 61) | def _make_dataset(self, real_video_dir):
    method __len__ (line 72) | def __len__(self):
    method __getitem__ (line 75) | def __getitem__(self, index):
  function calculate_common_metric (line 85) | def calculate_common_metric(args, dataloader, device):
  function main (line 110) | def main():
  function parse_args (line 147) | def parse_args():

FILE: opensora/models/causalvideovae/eval/fvd/styleganv/fvd.py
  function load_i3d_pretrained (line 9) | def load_i3d_pretrained(device=torch.device('cpu')):
  function get_feats (line 21) | def get_feats(videos, detector, device, bs=10):
  function get_fvd_feats (line 31) | def get_fvd_feats(videos, i3d, device, bs=10):
  function preprocess_single (line 38) | def preprocess_single(video, resolution=224, sequence_length=None):
  function compute_stats (line 75) | def compute_stats(feats: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
  function frechet_distance (line 81) | def frechet_distance(feats_fake: np.ndarray, feats_real: np.ndarray) -> ...

FILE: opensora/models/causalvideovae/eval/fvd/videogpt/fvd.py
  function load_i3d_pretrained (line 8) | def load_i3d_pretrained(device=torch.device('cpu')):
  function preprocess_single (line 21) | def preprocess_single(video, resolution, sequence_length=None):
  function preprocess (line 51) | def preprocess(videos, target_resolution=224):
  function get_fvd_logits (line 62) | def get_fvd_logits(videos, i3d, device, bs=10):
  function _symmetric_matrix_square_root (line 68) | def _symmetric_matrix_square_root(mat, eps=1e-10):
  function trace_sqrt_product (line 74) | def trace_sqrt_product(sigma, sigma_v):
  function cov (line 80) | def cov(m, rowvar=False):
  function frechet_distance (line 113) | def frechet_distance(x1, x2):
  function get_logits (line 128) | def get_logits(i3d, videos, device, bs=10):

FILE: opensora/models/causalvideovae/eval/fvd/videogpt/pytorch_i3d.py
  class MaxPool3dSamePadding (line 7) | class MaxPool3dSamePadding(nn.MaxPool3d):
    method compute_pad (line 9) | def compute_pad(self, dim, s):
    method forward (line 15) | def forward(self, x):
  class Unit3D (line 37) | class Unit3D(nn.Module):
    method __init__ (line 39) | def __init__(self, in_channels,
    method compute_pad (line 71) | def compute_pad(self, dim, s):
    method forward (line 78) | def forward(self, x):
  class InceptionModule (line 107) | class InceptionModule(nn.Module):
    method __init__ (line 108) | def __init__(self, in_channels, out_channels, name):
    method forward (line 127) | def forward(self, x):
  class InceptionI3d (line 135) | class InceptionI3d(nn.Module):
    method __init__ (line 172) | def __init__(self, num_classes=400, spatial_squeeze=True,
    method replace_logits (line 290) | def replace_logits(self, num_classes):
    method build (line 301) | def build(self):
    method forward (line 305) | def forward(self, x):
    method extract_features (line 318) | def extract_features(self, x):

FILE: opensora/models/causalvideovae/model/configuration_videobase.py
  class VideoBaseConfiguration (line 7) | class VideoBaseConfiguration(ConfigMixin):
    method __init__ (line 11) | def __init__(self, **kwargs):
    method to_dict (line 14) | def to_dict(self) -> Dict[str, Any]:
    method to_yaml_file (line 25) | def to_yaml_file(self, yaml_path: str):
    method load_from_yaml (line 30) | def load_from_yaml(cls: T, yaml_path: str) -> T:
    method load_from_dict (line 39) | def load_from_dict(cls: T, config_dict: Dict[str, Any]) -> T:

FILE: opensora/models/causalvideovae/model/dataset_videobase.py
  function TemporalRandomCrop (line 15) | def TemporalRandomCrop(total_frames, size):
  function resize (line 35) | def resize(x, resolution):
  class VideoDataset (line 48) | class VideoDataset(data.Dataset):
    method __init__ (line 52) | def __init__(self, video_folder, sequence_length, image_folder=None, t...
    method _make_dataset (line 71) | def _make_dataset(self):
    method __len__ (line 77) | def __len__(self):
    method __getitem__ (line 80) | def __getitem__(self, idx):
    method decord_read (line 91) | def decord_read(self, path):

FILE: opensora/models/causalvideovae/model/ema_model.py
  class EMA (line 1) | class EMA:
    method __init__ (line 2) | def __init__(self, model, decay):
    method register (line 8) | def register(self):
    method update (line 13) | def update(self):
    method apply_shadow (line 19) | def apply_shadow(self):
    method restore (line 25) | def restore(self):

FILE: opensora/models/causalvideovae/model/losses/discriminator.py
  function weights_init (line 6) | def weights_init(m):
  function weights_init_conv (line 14) | def weights_init_conv(m):
  class NLayerDiscriminator3D (line 24) | class NLayerDiscriminator3D(nn.Module):
    method __init__ (line 26) | def __init__(self, input_nc=1, ndf=64, n_layers=3, use_actnorm=False):
    method forward (line 71) | def forward(self, input):

FILE: opensora/models/causalvideovae/model/losses/lpips.py
  class LPIPS (line 9) | class LPIPS(nn.Module):
    method __init__ (line 11) | def __init__(self, use_dropout=True):
    method load_from_pretrained (line 25) | def load_from_pretrained(self, name="vgg_lpips"):
    method from_pretrained (line 31) | def from_pretrained(cls, name="vgg_lpips"):
    method forward (line 39) | def forward(self, input, target):
  class ScalingLayer (line 55) | class ScalingLayer(nn.Module):
    method __init__ (line 56) | def __init__(self):
    method forward (line 61) | def forward(self, inp):
  class NetLinLayer (line 65) | class NetLinLayer(nn.Module):
    method __init__ (line 67) | def __init__(self, chn_in, chn_out=1, use_dropout=False):
  class vgg16 (line 74) | class vgg16(torch.nn.Module):
    method __init__ (line 75) | def __init__(self, requires_grad=False, pretrained=True):
    method forward (line 98) | def forward(self, X):
  function normalize_tensor (line 114) | def normalize_tensor(x,eps=1e-10):
  function spatial_average (line 119) | def spatial_average(x, keepdim=True):

FILE: opensora/models/causalvideovae/model/losses/perceptual_loss.py
  function hinge_d_loss (line 8) | def hinge_d_loss(logits_real, logits_fake):
  function vanilla_d_loss (line 15) | def vanilla_d_loss(logits_real, logits_fake):
  function hinge_d_loss_with_exemplar_weights (line 23) | def hinge_d_loss_with_exemplar_weights(logits_real, logits_fake, weights):
  function adopt_weight (line 33) | def adopt_weight(weight, global_step, threshold=0, value=0.0):
  function measure_perplexity (line 39) | def measure_perplexity(predicted_indices, n_embed):
  function l1 (line 49) | def l1(x, y):
  function l2 (line 53) | def l2(x, y):
  class LPIPSWithDiscriminator3D (line 57) | class LPIPSWithDiscriminator3D(nn.Module):
    method __init__ (line 58) | def __init__(
    method calculate_adaptive_weight (line 97) | def calculate_adaptive_weight(self, nll_loss, g_loss, last_layer=None):
    method forward (line 108) | def forward(

FILE: opensora/models/causalvideovae/model/modeling_videobase.py
  class VideoBaseAE (line 12) | class VideoBaseAE(ModelMixin, ConfigMixin):
    method __init__ (line 15) | def __init__(self, *args, **kwargs) -> None:
    method encode (line 18) | def encode(self, x: torch.Tensor, *args, **kwargs):
    method decode (line 21) | def decode(self, encoding: torch.Tensor, *args, **kwargs):
    method num_training_steps (line 25) | def num_training_steps(self) -> int:
    method from_pretrained (line 42) | def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union...

FILE: opensora/models/causalvideovae/model/modules/attention.py
  class AttnBlock3D (line 16) | class AttnBlock3D(Block):
    method __init__ (line 18) | def __init__(self, in_channels):
    method forward (line 28) | def forward(self, x):
  class AttnBlock3DFix (line 54) | class AttnBlock3DFix(nn.Module):
    method __init__ (line 58) | def __init__(self, in_channels, norm_type="groupnorm"):
    method forward (line 68) | def forward(self, x):

FILE: opensora/models/causalvideovae/model/modules/block.py
  class Block (line 3) | class Block(nn.Module):
    method __init__ (line 4) | def __init__(self, *args, **kwargs) -> None:

FILE: opensora/models/causalvideovae/model/modules/conv.py
  class Conv2d (line 18) | class Conv2d(nn.Conv2d):
    method __init__ (line 19) | def __init__(
    method forward (line 48) | def forward(self, x):
  class CausalConv3d (line 53) | class CausalConv3d(Block):
    method __init__ (line 54) | def __init__(
    method forward (line 88) | def forward(self, x):
  class CausalConv3d_GC (line 113) | class CausalConv3d_GC(CausalConv3d):
    method __init__ (line 114) | def __init__(
    method forward (line 124) | def forward(self, x):

FILE: opensora/models/causalvideovae/model/modules/normalize.py
  class GroupNorm (line 6) | class GroupNorm(Block):
    method __init__ (line 7) | def __init__(self, num_channels, num_groups=32, eps=1e-6, *args, **kwa...
    method forward (line 12) | def forward(self, x):
  class LayerNorm (line 15) | class LayerNorm(Block):
    method __init__ (line 16) | def __init__(self, num_channels, eps=1e-6, *args, **kwargs) -> None:
    method forward (line 19) | def forward(self, x):
  function Normalize (line 30) | def Normalize(in_channels, num_groups=32, norm_type="groupnorm"):

FILE: opensora/models/causalvideovae/model/modules/ops.py
  function video_to_image (line 4) | def video_to_image(func):
  function nonlinearity (line 23) | def nonlinearity(x):
  function cast_tuple (line 26) | def cast_tuple(t, length=1):
  function shift_dim (line 29) | def shift_dim(x, src_dim=-1, dest_dim=-1, make_contiguous=True):

FILE: opensora/models/causalvideovae/model/modules/quant.py
  class Codebook (line 8) | class Codebook(nn.Module):
    method __init__ (line 9) | def __init__(self, n_codes, embedding_dim):
    method _tile (line 19) | def _tile(self, x):
    method _init_embeddings (line 28) | def _init_embeddings(self, z):
    method forward (line 42) | def forward(self, z):
    method dictionary_lookup (line 98) | def dictionary_lookup(self, encodings):

FILE: opensora/models/causalvideovae/model/modules/resnet_block.py
  class ResnetBlock2D (line 16) | class ResnetBlock2D(Block):
    method __init__ (line 17) | def __init__(
    method forward (line 51) | def forward(self, x):
  class ResnetBlock3D (line 75) | class ResnetBlock3D(Block):
    method __init__ (line 76) | def __init__(
    method forward (line 105) | def forward(self, x):
  class ResnetBlock3D_GC (line 128) | class ResnetBlock3D_GC(Block):
    method __init__ (line 129) | def __init__(
    method forward (line 158) | def forward(self, x):
    method _forward (line 161) | def _forward(self, x):

FILE: opensora/models/causalvideovae/model/modules/updownsample.py
  class Upsample (line 18) | class Upsample(Block):
    method __init__ (line 19) | def __init__(self, in_channels, out_channels):
    method forward (line 30) | def forward(self, x):
  class Downsample (line 36) | class Downsample(Block):
    method __init__ (line 37) | def __init__(self, in_channels, out_channels, undown=False):
    method forward (line 56) | def forward(self, x):
  class SpatialDownsample2x (line 79) | class SpatialDownsample2x(Block):
    method __init__ (line 80) | def __init__(
    method forward (line 102) | def forward(self, x):
  class SpatialUpsample2x_GC (line 108) | class SpatialUpsample2x_GC(Block):
    method __init__ (line 109) | def __init__(
    method forward (line 130) | def forward(self, x):
  class SpatialUpsample2x (line 140) | class SpatialUpsample2x(Block):
    method __init__ (line 141) | def __init__(
    method forward (line 162) | def forward(self, x):
  class TimeDownsample2x (line 171) | class TimeDownsample2x(Block):
    method __init__ (line 172) | def __init__(
    method forward (line 186) | def forward(self, x):
  class TimeUpsample2x (line 201) | class TimeUpsample2x(Block):
    method __init__ (line 202) | def __init__(
    method forward (line 208) | def forward(self, x):
  class TimeDownsampleRes2x (line 215) | class TimeDownsampleRes2x(Block):
    method __init__ (line 216) | def __init__(
    method forward (line 235) | def forward(self, x):
  class TimeUpsampleRes2x (line 253) | class TimeUpsampleRes2x(Block):
    method __init__ (line 254) | def __init__(
    method forward (line 267) | def forward(self, x):
  class Spatial2xTime2x3DDownsample (line 281) | class Spatial2xTime2x3DDownsample(Block):
    method __init__ (line 282) | def __init__(self, in_channels, out_channels):
    method forward (line 286) | def forward(self, x):
  class Spatial2x3DDownsample (line 292) | class Spatial2x3DDownsample(Block):
    method __init__ (line 293) | def __init__(self, in_channels, out_channels):
    method forward (line 297) | def forward(self, x):
  class Spatial2x3DUpsample (line 304) | class Spatial2x3DUpsample(Block):
    method __init__ (line 305) | def __init__(self, in_channels, out_channels):
    method forward (line 309) | def forward(self, x):
  class Spatial2xTime2x3DUpsample (line 313) | class Spatial2xTime2x3DUpsample(Block):
    method __init__ (line 314) | def __init__(
    method forward (line 327) | def forward(self, x):

FILE: opensora/models/causalvideovae/model/modules/wavelet.py
  class HaarWaveletTransform3D (line 15) | class HaarWaveletTransform3D(nn.Module):
    method __init__ (line 16) | def __init__(self, *args, **kwargs) -> None:
    method forward (line 61) | def forward(self, x):
  class InverseHaarWaveletTransform3D (line 109) | class InverseHaarWaveletTransform3D(nn.Module):
    method __init__ (line 110) | def __init__(self, enable_cached=False, *args, **kwargs) -> None:
    method forward (line 140) | def forward(self, coeffs):
  class HaarWaveletTransform2D (line 227) | class HaarWaveletTransform2D(nn.Module):
    method __init__ (line 228) | def __init__(self):
    method forward (line 236) | def forward(self, x):
  class InverseHaarWaveletTransform2D (line 246) | class InverseHaarWaveletTransform2D(nn.Module):
    method __init__ (line 247) | def __init__(self):
    method forward (line 255) | def forward(self, coeffs):

FILE: opensora/models/causalvideovae/model/registry.py
  class ModelRegistry (line 1) | class ModelRegistry:
    method register (line 5) | def register(cls, model_name):
    method get_model (line 12) | def get_model(cls, model_name):

FILE: opensora/models/causalvideovae/model/trainer_videobase.py
  class VideoBaseTrainer (line 9) | class VideoBaseTrainer(Trainer):
    method _save (line 11) | def _save(self, output_dir: Optional[str] = None, state_dict=None):

FILE: opensora/models/causalvideovae/model/utils/distrib_utils.py
  class DiagonalGaussianDistribution (line 4) | class DiagonalGaussianDistribution(object):
    method __init__ (line 5) | def __init__(self, parameters, deterministic=False):
    method sample (line 15) | def sample(self):
    method kl (line 19) | def kl(self, other=None):
    method nll (line 33) | def nll(self, sample, dims=[1,2,3]):
    method mode (line 41) | def mode(self):

FILE: opensora/models/causalvideovae/model/utils/module_utils.py
  function resolve_str_to_obj (line 6) | def resolve_str_to_obj(str_val, append=True):
  function create_instance (line 13) | def create_instance(module_class_str: str, **kwargs):

FILE: opensora/models/causalvideovae/model/utils/scheduler_utils.py
  function cosine_scheduler (line 3) | def cosine_scheduler(step, max_steps, value_base=1, value_end=0):

FILE: opensora/models/causalvideovae/model/utils/video_utils.py
  function tensor_to_video (line 4) | def tensor_to_video(x):

FILE: opensora/models/causalvideovae/model/utils/wavelet_utils.py
  class HaarWaveletTransform3D (line 7) | class HaarWaveletTransform3D(nn.Module):
    method __init__ (line 8) | def __init__(self, *args, **kwargs) -> None:
    method forward (line 53) | def forward(self, x):
  class InverseHaarWaveletTransform3D (line 89) | class InverseHaarWaveletTransform3D(nn.Module):
    method __init__ (line 90) | def __init__(self, enable_cached=False, *args, **kwargs) -> None:
    method forward (line 120) | def forward(self, coeffs):
  class HaarWaveletTransform2D (line 179) | class HaarWaveletTransform2D(nn.Module):
    method __init__ (line 180) | def __init__(self):
    method forward (line 187) | def forward(self, x):
  class InverseHaarWaveletTransform2D (line 197) | class InverseHaarWaveletTransform2D(nn.Module):
    method __init__ (line 198) | def __init__(self):
    method forward (line 205) | def forward(self, coeffs):

FILE: opensora/models/causalvideovae/model/vae/modeling_causalvae.py
  class Encoder (line 22) | class Encoder(nn.Module):
    method __init__ (line 23) | def __init__(
    method forward (line 127) | def forward(self, x):
  class Decoder (line 151) | class Decoder(nn.Module):
    method __init__ (line 152) | def __init__(
    method forward (line 247) | def forward(self, z):
  class CausalVAEModel (line 272) | class CausalVAEModel(VideoBaseAE):
    method __init__ (line 274) | def __init__(
    method get_encoder (line 391) | def get_encoder(self):
    method get_decoder (line 396) | def get_decoder(self):
    method encode (line 401) | def encode(self, x):
    method decode (line 415) | def decode(self, z):
    method forward (line 427) | def forward(self, input, sample_posterior=True):
    method on_train_start (line 436) | def on_train_start(self):
    method get_last_layer (line 439) | def get_last_layer(self):
    method blend_v (line 445) | def blend_v(
    method blend_h (line 455) | def blend_h(
    method tiled_encode (line 465) | def tiled_encode(self, x):
    method tiled_decode (line 491) | def tiled_decode(self, x):
    method tiled_encode2d (line 518) | def tiled_encode2d(self, x, return_moments=False):
    method tiled_decode2d (line 560) | def tiled_decode2d(self, z):
    method enable_tiling (line 602) | def enable_tiling(self, use_tiling: bool = True):
    method disable_tiling (line 605) | def disable_tiling(self):
    method init_from_ckpt (line 608) | def init_from_ckpt(self, path, ignore_keys=list()):

FILE: opensora/models/causalvideovae/model/vae/modeling_wfvae.py
  class Encoder (line 34) | class Encoder(VideoBaseAE):
    method __init__ (line 37) | def __init__(
    method forward (line 128) | def forward(self, x):
  class Decoder (line 153) | class Decoder(VideoBaseAE):
    method __init__ (line 156) | def __init__(
    method forward (line 286) | def forward(self, z):
  class WFVAEModel (line 316) | class WFVAEModel(VideoBaseAE):
    method __init__ (line 319) | def __init__(
    method get_encoder (line 389) | def get_encoder(self):
    method get_decoder (line 394) | def get_decoder(self):
    method _empty_causal_cached (line 399) | def _empty_causal_cached(self, parent):
    method _set_causal_cached (line 404) | def _set_causal_cached(self, enable_cached=True):
    method _set_cache_offset (line 409) | def _set_cache_offset(self, modules, cache_offset=0):
    method _set_first_chunk (line 415) | def _set_first_chunk(self, is_first_chunk=True):
    method build_chunk_start_end (line 420) | def build_chunk_start_end(self, t, decoder_mode=False):
    method encode (line 432) | def encode(self, x):
    method tile_encode (line 448) | def tile_encode(self, x):
    method decode (line 464) | def decode(self, z):
    method tile_decode (line 478) | def tile_decode(self, x):
    method forward (line 504) | def forward(self, input, sample_posterior=True):
    method get_last_layer (line 513) | def get_last_layer(self):
    method enable_tiling (line 519) | def enable_tiling(self, use_tiling: bool = True):
    method disable_tiling (line 523) | def disable_tiling(self):
    method init_from_ckpt (line 526) | def init_from_ckpt(self, path, ignore_keys=list()):

FILE: opensora/models/causalvideovae/sample/rec_video_vae.py
  function main (line 15) | def main(args: argparse.Namespace):

FILE: opensora/models/causalvideovae/utils/dataset_utils.py
  function is_image_file (line 10) | def is_image_file(filename):
  class DecordInit (line 13) | class DecordInit(object):
    method __init__ (line 16) | def __init__(self, num_threads=1):
    method __call__ (line 20) | def __call__(self, filename):
    method __repr__ (line 31) | def __repr__(self):
  function pad_to_multiple (line 37) | def pad_to_multiple(number, ds_stride):

FILE: opensora/models/causalvideovae/utils/downloader.py
  function gdown_download (line 9) | def gdown_download(id, fname, cache_dir=None):

FILE: opensora/models/causalvideovae/utils/video_utils.py
  function array_to_video (line 7) | def array_to_video(
  function custom_to_video (line 21) | def custom_to_video(
  function read_video (line 32) | def read_video(video_path: str, num_frames: int, sample_rate: int) -> to...
  function tensor_to_video (line 57) | def tensor_to_video(x):

FILE: opensora/models/diffusion/common.py
  class PatchEmbed2D (line 20) | class PatchEmbed2D(nn.Module):
    method __init__ (line 23) | def __init__(
    method forward (line 36) | def forward(self, latent):
  class PositionGetter3D (line 44) | class PositionGetter3D(object):
    method __init__ (line 47) | def __init__(self, ):
    method __call__ (line 50) | def __call__(self, b, t, h, w, device):
  class RoPE3D (line 66) | class RoPE3D(torch.nn.Module):
    method __init__ (line 68) | def __init__(self, freq=10000.0, F0=1.0, interpolation_scale_thw=(1, 1...
    method get_cos_sin (line 77) | def get_cos_sin(self, D, seq_len, device, dtype, interpolation_scale=1):
    method rotate_half (line 89) | def rotate_half(x):
    method apply_rope1d (line 93) | def apply_rope1d(self, tokens, pos1d, cos, sin):
    method forward (line 101) | def forward(self, tokens, positions):

FILE: opensora/models/diffusion/opensora_v1_3/modeling_inpaint.py
  function zero_module (line 16) | def zero_module(module):
  class OpenSoraInpaint_v1_3 (line 22) | class OpenSoraInpaint_v1_3(OpenSoraT2V):
    method __init__ (line 26) | def __init__(
    method _init_patched_inputs_for_inpainting (line 88) | def _init_patched_inputs_for_inpainting(self):
    method _operate_on_patched_inputs (line 114) | def _operate_on_patched_inputs(self, hidden_states, encoder_hidden_sta...
  function OpenSoraInpaint_v1_3_2B_122 (line 142) | def OpenSoraInpaint_v1_3_2B_122(**kwargs):

FILE: opensora/models/diffusion/opensora_v1_3/modeling_opensora.py
  class OpenSoraT2V_v1_3 (line 26) | class OpenSoraT2V_v1_3(ModelMixin, ConfigMixin):
    method __init__ (line 30) | def __init__(
    method _init_patched_inputs (line 65) | def _init_patched_inputs(self):
    method _set_gradient_checkpointing (line 114) | def _set_gradient_checkpointing(self, module, value=False):
    method forward (line 118) | def forward(
    method _operate_on_patched_inputs (line 253) | def _operate_on_patched_inputs(self, hidden_states, encoder_hidden_sta...
    method _get_output_for_patched_inputs (line 270) | def _get_output_for_patched_inputs(
  function OpenSoraT2V_v1_3_2B_122 (line 291) | def OpenSoraT2V_v1_3_2B_122(**kwargs):

FILE: opensora/models/diffusion/opensora_v1_3/modules.py
  class Attention (line 30) | class Attention(Attention_):
    method __init__ (line 31) | def __init__(
    method prepare_sparse_mask (line 42) | def prepare_sparse_mask(attention_mask, encoder_attention_mask, sparse...
    method prepare_attention_mask (line 88) | def prepare_attention_mask(
  class OpenSoraAttnProcessor2_0 (line 128) | class OpenSoraAttnProcessor2_0:
    method __init__ (line 133) | def __init__(self, interpolation_scale_thw=(1, 1, 1),
    method _init_rope (line 145) | def _init_rope(self, interpolation_scale_thw):
    method _sparse_1d (line 149) | def _sparse_1d(self, x, frame, height, width):
    method _reverse_sparse_1d (line 166) | def _reverse_sparse_1d(self, x, frame, height, width, pad_len):
    method _sparse_1d_kv (line 178) | def _sparse_1d_kv(self, x):
    method __call__ (line 185) | def __call__(
  class BasicTransformerBlock (line 316) | class BasicTransformerBlock(nn.Module):
    method __init__ (line 317) | def __init__(
    method forward (line 395) | def forward(

FILE: opensora/models/frame_interpolation/interpolation.py
  function init (line 35) | def init(device="cuda"):
  function get_input_video_from_path (line 56) | def get_input_video_from_path(input_path, device="cuda"):
  function load_model (line 103) | def load_model(ckpt_path, device="cuda"):
  function interpolater (line 118) | def interpolater(model, inputs, scale, padder, iters=1):
  function write (line 150) | def write(outputs, input_path, output_path, frame_rate=30):

FILE: opensora/models/frame_interpolation/networks/AMT-G.py
  class Model (line 23) | class Model(nn.Module):
    method __init__ (line 24) | def __init__(self,
    method _get_updateblock (line 55) | def _get_updateblock(self, cdim, scale_factor=None):
    method _corr_scale_lookup (line 61) | def _corr_scale_lookup(self, corr_fn, coord, flow0, flow1, embt, downs...
    method forward (line 76) | def forward(self, img0, img1, embt, scale_factor=1.0, eval=False, **kw...

FILE: opensora/models/frame_interpolation/networks/blocks/feat_enc.py
  class BottleneckBlock (line 5) | class BottleneckBlock(nn.Module):
    method __init__ (line 6) | def __init__(self, in_planes, planes, norm_fn='group', stride=1):
    method forward (line 52) | def forward(self, x):
  class ResidualBlock (line 64) | class ResidualBlock(nn.Module):
    method __init__ (line 65) | def __init__(self, in_planes, planes, norm_fn='group', stride=1):
    method forward (line 106) | def forward(self, x):
  class SmallEncoder (line 117) | class SmallEncoder(nn.Module):
    method __init__ (line 118) | def __init__(self, output_dim=128, norm_fn='batch', dropout=0.0):
    method _make_layer (line 157) | def _make_layer(self, dim, stride=1):
    method forward (line 166) | def forward(self, x):
  class BasicEncoder (line 191) | class BasicEncoder(nn.Module):
    method __init__ (line 192) | def __init__(self, output_dim=128, norm_fn='batch', dropout=0.0):
    method _make_layer (line 232) | def _make_layer(self, dim, stride=1):
    method forward (line 241) | def forward(self, x):
  class LargeEncoder (line 267) | class LargeEncoder(nn.Module):
    method __init__ (line 268) | def __init__(self, output_dim=128, norm_fn='batch', dropout=0.0):
    method _make_layer (line 309) | def _make_layer(self, dim, stride=1):
    method forward (line 318) | def forward(self, x):

FILE: opensora/models/frame_interpolation/networks/blocks/ifrnet.py
  function resize (line 7) | def resize(x, scale_factor):
  function convrelu (line 10) | def convrelu(in_channels, out_channels, kernel_size=3, stride=1, padding...
  class ResBlock (line 16) | class ResBlock(nn.Module):
    method __init__ (line 17) | def __init__(self, in_channels, side_channels, bias=True):
    method forward (line 39) | def forward(self, x):
  class Encoder (line 55) | class Encoder(nn.Module):
    method __init__ (line 56) | def __init__(self, channels, large=False):
    method forward (line 70) | def forward(self, in_x):
  class InitDecoder (line 78) | class InitDecoder(nn.Module):
    method __init__ (line 79) | def __init__(self, in_ch, out_ch, skip_ch) -> None:
    method forward (line 86) | def forward(self, f0, f1, embt):
  class IntermediateDecoder (line 94) | class IntermediateDecoder(nn.Module):
    method __init__ (line 95) | def __init__(self, in_ch, out_ch, skip_ch) -> None:
    method forward (line 102) | def forward(self, ft_, f0, f1, flow0_in, flow1_in):

FILE: opensora/models/frame_interpolation/networks/blocks/multi_flow.py
  function multi_flow_combine (line 10) | def multi_flow_combine(comb_block, img0, img1, flow0, flow1,
  class MultiFlowDecoder (line 46) | class MultiFlowDecoder(nn.Module):
    method __init__ (line 47) | def __init__(self, in_ch, skip_ch, num_flows=3):
    method forward (line 56) | def forward(self, ft_, f0, f1, flow0, flow1):

FILE: opensora/models/frame_interpolation/networks/blocks/raft.py
  function resize (line 6) | def resize(x, scale_factor):
  function bilinear_sampler (line 10) | def bilinear_sampler(img, coords, mask=False):
  function coords_grid (line 27) | def coords_grid(batch, ht, wd, device):
  class SmallUpdateBlock (line 35) | class SmallUpdateBlock(nn.Module):
    method __init__ (line 36) | def __init__(self, cdim, hidden_dim, flow_dim, corr_dim, fc_dim,
    method forward (line 67) | def forward(self, net, flow, corr):
  class BasicUpdateBlock (line 88) | class BasicUpdateBlock(nn.Module):
    method __init__ (line 89) | def __init__(self, cdim, hidden_dim, flow_dim, corr_dim, corr_dim2,
    method forward (line 121) | def forward(self, net, flow, corr):
  class BidirCorrBlock (line 142) | class BidirCorrBlock:
    method __init__ (line 143) | def __init__(self, fmap1, fmap2, num_levels=4, radius=4):
    method __call__ (line 165) | def __call__(self, coords0, coords1):
    method corr (line 200) | def corr(fmap1, fmap2):

FILE: opensora/models/frame_interpolation/utils/build_utils.py
  function base_build_fn (line 4) | def base_build_fn(module, cls, params):
  function build_from_cfg (line 9) | def build_from_cfg(config):

FILE: opensora/models/frame_interpolation/utils/dist_utils.py
  function get_world_size (line 5) | def get_world_size():
  function get_global_rank (line 17) | def get_global_rank():
  function get_local_rank (line 29) | def get_local_rank():
  function get_master_ip (line 41) | def get_master_ip():

FILE: opensora/models/frame_interpolation/utils/flow_utils.py
  function warp (line 8) | def warp(img, flow):
  function make_colorwheel (line 19) | def make_colorwheel():
  function flow_uv_to_colors (line 66) | def flow_uv_to_colors(u, v, convert_to_bgr=False):
  function flow_to_image (line 101) | def flow_to_image(flow_uv, clip_flow=None, convert_to_bgr=False):

FILE: opensora/models/frame_interpolation/utils/utils.py
  class AverageMeter (line 12) | class AverageMeter():
    method __init__ (line 13) | def __init__(self):
    method reset (line 16) | def reset(self):
    method update (line 22) | def update(self, val, n=1):
  class AverageMeterGroups (line 29) | class AverageMeterGroups:
    method __init__ (line 30) | def __init__(self) -> None:
    method update (line 33) | def update(self, dict, n=1):
    method reset (line 39) | def reset(self, name=None):
    method avg (line 48) | def avg(self, name):
  class InputPadder (line 54) | class InputPadder:
    method __init__ (line 56) | def __init__(self, dims, divisor=16):
    method pad (line 62) | def pad(self, *inputs):
    method unpad (line 68) | def unpad(self, *inputs):
    method _unpad (line 74) | def _unpad(self, x):
  function img2tensor (line 80) | def img2tensor(img):
  function tensor2img (line 86) | def tensor2img(img_t):
  function seed_all (line 91) | def seed_all(seed):
  function read (line 98) | def read(file):
  function write (line 109) | def write(file, data):
  function readPFM (line 120) | def readPFM(file):
  function writePFM (line 158) | def writePFM(file, image, scale=1):
  function readFlow (line 188) | def readFlow(name):
  function readImage (line 206) | def readImage(name):
  function writeImage (line 216) | def writeImage(name, data):
  function writeFlow (line 222) | def writeFlow(name, flow):
  function readFloat (line 230) | def readFloat(name):
  function writeFloat (line 255) | def writeFloat(name, data):
  function check_dim_and_resize (line 281) | def check_dim_and_resize(tensor_list):

FILE: opensora/models/prompt_refiner/inference.py
  function get_output (line 6) | def get_output(prompt):
  function parse_args (line 27) | def parse_args():

FILE: opensora/models/prompt_refiner/merge.py
  function get_lora_model (line 8) | def get_lora_model(base_model_path, lora_model_input_path, lora_model_ou...
  function get_model_result (line 19) | def get_model_result(base_model_path, fintune_model_path):
  function parse_args (line 72) | def parse_args():

FILE: opensora/models/prompt_refiner/train.py
  function parse_args (line 12) | def parse_args():
  function process_func (line 28) | def process_func(example):

FILE: opensora/models/text_encoder/__init__.py
  function get_text_warpper (line 15) | def get_text_warpper(text_encoder_name):

FILE: opensora/models/text_encoder/clip.py
  class CLIPWrapper (line 10) | class CLIPWrapper(nn.Module):
    method __init__ (line 11) | def __init__(self, args, **kwargs):
    method forward (line 21) | def forward(self, input_ids, attention_mask):

FILE: opensora/models/text_encoder/t5.py
  class T5Wrapper (line 10) | class T5Wrapper(nn.Module):
    method __init__ (line 11) | def __init__(self, args, **kwargs):
    method forward (line 17) | def forward(self, input_ids, attention_mask):

FILE: opensora/npu_config.py
  function compress_video (line 28) | def compress_video(input_file, output_file, out_size):
  function set_run_dtype (line 44) | def set_run_dtype(x, dtype=None):
  class NPUConfig (line 58) | class NPUConfig:
    method __init__ (line 61) | def __init__(self):
    method get_total_cores (line 120) | def get_total_cores(self):
    method bind_thread_to_cpu (line 128) | def bind_thread_to_cpu(self):
    method replace_methods (line 145) | def replace_methods(self, target_class, source_class, skip_fcns=[], on...
    method get_attention_mask (line 163) | def get_attention_mask(self, attention_mask, repeat_num):
    method set_current_run_dtype (line 169) | def set_current_run_dtype(self, variables):
    method restore_dtype (line 175) | def restore_dtype(self, x):
    method get_output_video_path (line 180) | def get_output_video_path(self, name):
    method get_node_id (line 184) | def get_node_id(self):
    method get_node_size (line 187) | def get_node_size(self):
    method get_local_rank (line 190) | def get_local_rank(self):
    method get_pickle_path (line 193) | def get_pickle_path(self, file_name):
    method free_mm (line 196) | def free_mm(self):
    method __del__ (line 201) | def __del__(self):
    method try_load_pickle (line 204) | def try_load_pickle(self, file_name, function):
    method try_get_vid_path (line 223) | def try_get_vid_path(self, file, out_size=1024):
    method npu_format_cast (line 230) | def npu_format_cast(self, x):
    method calc_grad_norm (line 233) | def calc_grad_norm(self, model):
    method _run (line 253) | def _run(self, operator, x, tmp_dtype, out_dtype=None, out_nd_format=F...
    method run_group_norm (line 268) | def run_group_norm(self, operator, x):
    method run_layer_norm (line 271) | def run_layer_norm(self, operator, x):
    method print_tensor_stats (line 274) | def print_tensor_stats(self, tensor, name="Tensor", rank=None):
    method run_conv3d (line 295) | def run_conv3d(self, operator, x, out_dtype):
    method run_pool_2d (line 298) | def run_pool_2d(self, operator, x):
    method run_pad_2d (line 301) | def run_pad_2d(self, operator, x, pad, mode="constant"):
    method seed_everything (line 311) | def seed_everything(self, seed=100):
    method print_with_rank (line 321) | def print_with_rank(self, msg, rank=0, save=False):
    method print_msg (line 327) | def print_msg(self, msg, on=True, rank=None):
    method save_loss (line 332) | def save_loss(self, filename, rank=0):
    method run_attention (line 338) | def run_attention(self, query, key, value, atten_mask, input_layout, h...
    method scaled_dot_product_attention (line 353) | def scaled_dot_product_attention(self, query, key, value, input_layout...
    method print_tensor_with_rank (line 393) | def print_tensor_with_rank(self, name, tensor, rank=[0], dim_print_cnt...

FILE: opensora/sample/caption_refiner.py
  class OpenSoraCaptionRefiner (line 13) | class OpenSoraCaptionRefiner(nn.Module):
    method __init__ (line 14) | def __init__(self, args, dtype, device):
    method get_refiner_output (line 24) | def get_refiner_output(self, prompt):

FILE: opensora/sample/pipeline_inpaint.py
  function is_video_file (line 44) | def is_video_file(file_path):
  function is_image_file (line 49) | def is_image_file(file_path):
  function open_image (line 54) | def open_image(file_path):
  function open_video (line 58) | def open_video(file_path, start_frame_idx, num_frames, frame_interval=1):
  function get_pixel_values (line 77) | def get_pixel_values(file_path, num_frames):
  class OpenSoraInpaintPipeline (line 87) | class OpenSoraInpaintPipeline(OpenSoraPipeline):
    method __init__ (line 89) | def __init__(
    method check_inputs (line 116) | def check_inputs(
    method get_resize_transform (line 181) | def get_resize_transform(
    method get_video_transform (line 205) | def get_video_transform(self):
    method get_mask_type_cond_indices (line 213) | def get_mask_type_cond_indices(self, mask_type, conditional_pixel_valu...
    method get_masked_pixel_values_mask (line 237) | def get_masked_pixel_values_mask(
    method __call__ (line 293) | def __call__(

FILE: opensora/sample/pipeline_opensora.py
  class OpenSoraPipelineOutput (line 32) | class OpenSoraPipelineOutput(BaseOutput):
  function rescale_noise_cfg (line 36) | def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
  function retrieve_timesteps (line 51) | def retrieve_timesteps(
  class OpenSoraPipeline (line 110) | class OpenSoraPipeline(DiffusionPipeline):
    method __init__ (line 127) | def __init__(
    method encode_prompt (line 149) | def encode_prompt(
    method prepare_extra_step_kwargs (line 323) | def prepare_extra_step_kwargs(self, generator, eta):
    method check_inputs (line 340) | def check_inputs(
    method prepare_latents (line 420) | def prepare_latents(self, batch_size, num_channels_latents, num_frames...
    method guidance_scale (line 445) | def guidance_scale(self):
    method guidance_rescale (line 449) | def guidance_rescale(self):
    method do_classifier_free_guidance (line 453) | def do_classifier_free_guidance(self):
    method num_timesteps (line 457) | def num_timesteps(self):
    method interrupt (line 461) | def interrupt(self):
    method __call__ (line 465) | def __call__(
    method decode_latents (line 743) | def decode_latents(self, latents):

FILE: opensora/sample/rec_image.py
  function preprocess (line 11) | def preprocess(video_data: torch.Tensor, short_size: int = 128) -> torch...
  function main (line 23) | def main(args: argparse.Namespace):

FILE: opensora/sample/rec_video.py
  function array_to_video (line 20) | def array_to_video(image_array: npt.NDArray, fps: float = 30.0, output_f...
  function custom_to_video (line 32) | def custom_to_video(x: torch.Tensor, fps: float = 2.0, output_file: str ...
  function read_video (line 42) | def read_video(video_path: str, num_frames: int, sample_rate: int) -> to...
  function preprocess (line 68) | def preprocess(video_data: torch.Tensor, height: int = 128, width: int =...
  function main (line 83) | def main(args: argparse.Namespace):

FILE: opensora/serve/gradio_utils.py
  function randomize_seed_fn (line 22) | def randomize_seed_fn(seed: int, randomize_seed: bool) -> int:

FILE: opensora/serve/gradio_web_server.py
  function generate (line 24) | def generate(

FILE: opensora/serve/gradio_web_server_i2v.py
  function generate (line 24) | def generate(

FILE: opensora/train/train_causalvae.py
  function set_random_seed (line 34) | def set_random_seed(seed):
  function ddp_setup (line 41) | def ddp_setup():
  function setup_logger (line 45) | def setup_logger(rank):
  function check_unused_params (line 70) | def check_unused_params(model):
  function set_requires_grad_optimizer (line 77) | def set_requires_grad_optimizer(optimizer, requires_grad):
  function total_params (line 82) | def total_params(model):
  function get_exp_name (line 88) | def get_exp_name(args):
  function set_train (line 92) | def set_train(modules):
  function set_eval (line 97) | def set_eval(modules):
  function set_modules_requires_grad (line 102) | def set_modules_requires_grad(modules, requires_grad):
  function save_checkpoint (line 107) | def save_checkpoint(
  function valid (line 134) | def valid(global_rank, rank, model, val_dataloader, precision, args):
  function gather_valid_result (line 195) | def gather_valid_result(psnr_list, lpips_list, video_log_list, rank, wor...
  function train (line 210) | def train(args):
  function main (line 571) | def main():

FILE: opensora/train/train_inpaint.py
  class ProgressInfo (line 85) | class ProgressInfo:
    method __init__ (line 86) | def __init__(self, global_step, train_loss=0.0):
  function main (line 95) | def main(args):

FILE: opensora/train/train_t2v_diffusers.py
  function log_validation (line 82) | def log_validation(args, model, vae, text_encoder, tokenizer, accelerato...
  class ProgressInfo (line 155) | class ProgressInfo:
    method __init__ (line 156) | def __init__(self, global_step, train_loss=0.0):
  function main (line 165) | def main(args):

FILE: opensora/utils/communications.py
  function broadcast (line 6) | def broadcast(input_: torch.Tensor):
  function _all_to_all (line 12) | def _all_to_all(
  function _single_all_to_all (line 24) | def _single_all_to_all(
  class _AllToAll (line 56) | class _AllToAll(torch.autograd.Function):
    method forward (line 67) | def forward(ctx, input_, scatter_dim, gather_dim, all_to_all_func):
    method backward (line 75) | def backward(ctx, grad_output):
  function all_to_all_SBH (line 88) | def all_to_all_SBH(
  function all_to_all_BSND (line 95) | def all_to_all_BSND(
  function prepare_parallel_data (line 103) | def prepare_parallel_data(

FILE: opensora/utils/dataset_utils.py
  function is_image_file (line 18) | def is_image_file(filename):
  class DecordInit (line 21) | class DecordInit(object):
    method __init__ (line 24) | def __init__(self, num_threads=1):
    method __call__ (line 28) | def __call__(self, filename):
    method __repr__ (line 39) | def __repr__(self):
  function pad_to_multiple (line 45) | def pad_to_multiple(number, ds_stride):
  class Collate (line 53) | class Collate:
    method __init__ (line 54) | def __init__(self, args):
    method package (line 72) | def package(self, batch):
    method __call__ (line 86) | def __call__(self, batch):
    method process (line 99) | def process(self, batch_tubes, input_ids_1, cond_mask_1, input_ids_2, ...
  function group_data_fun (line 181) | def group_data_fun(lengths, generator=None):
  function last_group_data_fun (line 198) | def last_group_data_fun(shuffled_megabatches, lengths):
  function split_to_even_chunks (line 237) | def split_to_even_chunks(megabatch, lengths, world_size, batch_size):
  function get_length_grouped_indices (line 260) | def get_length_grouped_indices(lengths, batch_size, world_size, gradient...
  class LengthGroupedSampler (line 327) | class LengthGroupedSampler(Sampler):
    method __init__ (line 333) | def __init__(
    method __len__ (line 356) | def __len__(self):
    method __iter__ (line 359) | def __iter__(self):

FILE: opensora/utils/downloader.py
  function gdown_download (line 9) | def gdown_download(id, fname, cache_dir=None):

FILE: opensora/utils/ema.py
  class EMAModel (line 23) | class EMAModel:
    method __init__ (line 28) | def __init__(
    method extract_ema_kwargs (line 110) | def extract_ema_kwargs(cls, kwargs):
    method from_pretrained (line 129) | def from_pretrained(cls, path, model_cls) -> "EMAModel":
    method save_pretrained (line 139) | def save_pretrained(self, path):
    method get_decay (line 154) | def get_decay(self, optimization_step: int) -> float:
    method step (line 174) | def step(self, parameters: Iterable[torch.nn.Parameter]):
    method copy_to (line 211) | def copy_to(self, parameters: Iterable[torch.nn.Parameter]) -> None:
    method to (line 225) | def to(self, device=None, dtype=None) -> None:
    method state_dict (line 237) | def state_dict(self) -> dict:
    method store (line 256) | def store(self, parameters: Iterable[torch.nn.Parameter]) -> None:
    method restore (line 265) | def restore(self, parameters: Iterable[torch.nn.Parameter]) -> None:
    method load_state_dict (line 283) | def load_state_dict(self, state_dict: dict) -> None:

FILE: opensora/utils/ema_utils.py
  class EMAModel (line 11) | class EMAModel(diffuser_EMAModel):
    method __init__ (line 12) | def __init__(self, parameters, **kwargs):
    method from_pretrained (line 17) | def from_pretrained(cls, path, model_cls, lora_config, model_base) -> ...
    method save_pretrained (line 36) | def save_pretrained(self, path):

FILE: opensora/utils/freeinit_utils.py
  function freq_mix_3d (line 6) | def freq_mix_3d(x, noise, LPF):
  function get_freq_filter (line 34) | def get_freq_filter(shape, device, filter_type, n, d_s, d_t):
  function gaussian_low_pass_filter (line 56) | def gaussian_low_pass_filter(shape, d_s=0.25, d_t=0.25):
  function butterworth_low_pass_filter (line 77) | def butterworth_low_pass_filter(shape, n=4, d_s=0.25, d_t=0.25):
  function ideal_low_pass_filter (line 99) | def ideal_low_pass_filter(shape, d_s=0.25, d_t=0.25):
  function box_low_pass_filter (line 120) | def box_low_pass_filter(shape, d_s=0.25, d_t=0.25):

FILE: opensora/utils/lora_utils.py
  function maybe_zero_3 (line 8) | def maybe_zero_3(param, ignore_status=False, name=None):
  function get_peft_state_maybe_zero_3 (line 22) | def get_peft_state_maybe_zero_3(named_params, bias):

FILE: opensora/utils/mask_utils.py
  class MaskType (line 20) | class MaskType(Enum):
  function save_mask_to_video (line 31) | def save_mask_to_video(mask, save_path='mask.mp4', fps=24):
  function read_video (line 40) | def read_video(video_path):
  class BaseMaskGenerator (line 51) | class BaseMaskGenerator(ABC):
    method create_system_mask (line 53) | def create_system_mask(self, num_frames, height, width, device, dtype):
    method process (line 59) | def process(self, mask):
    method __call__ (line 63) | def __call__(self, num_frames=None, height=None, width=None, device='c...
  class T2IVMaskGenerator (line 67) | class T2IVMaskGenerator(BaseMaskGenerator):
    method process (line 68) | def process(self, mask):
  class I2VMaskGenerator (line 72) | class I2VMaskGenerator(BaseMaskGenerator):
    method process (line 73) | def process(self, mask):
  class TransitionMaskGenerator (line 77) | class TransitionMaskGenerator(BaseMaskGenerator):
    method process (line 78) | def process(self, mask):
  class ContinuationMaskGenerator (line 83) | class ContinuationMaskGenerator(BaseMaskGenerator):
    method __init__ (line 85) | def __init__(self, min_clear_ratio=0.0, max_clear_ratio=1.0):
    method process (line 92) | def process(self, mask):
  class ClearMaskGenerator (line 98) | class ClearMaskGenerator(BaseMaskGenerator):
    method process (line 99) | def process(self, mask):
  class RandomTemporalMaskGenerator (line 103) | class RandomTemporalMaskGenerator(BaseMaskGenerator):
    method __init__ (line 105) | def __init__(self, min_clear_ratio=0.0, max_clear_ratio=1.0):
    method process (line 112) | def process(self, mask):
  class MaskProcessor (line 120) | class MaskProcessor:
    method __init__ (line 121) | def __init__(
    method init_mask_generators (line 136) | def init_mask_generators(self):
    method get_mask (line 146) | def get_mask(self, mask_generator_type, num_frames, height, width, dev...
    method __call__ (line 149) | def __call__(self, pixel_values, mask_type=None, mask_type_ratio_dict=...
  class MaskCompressor (line 168) | class MaskCompressor:
    method __init__ (line 169) | def __init__(self, ae_stride_h=8, ae_stride_w=8, ae_stride_t=4, **kwar...
    method __call__ (line 174) | def __call__(self, mask):
  class BaseNoiseAdder (line 196) | class BaseNoiseAdder(ABC):
    method add_noise (line 199) | def add_noise(self, mask_pixel_values, mask):
    method __call__ (line 202) | def __call__(self, mask_pixel_values, mask):
  class GaussianNoiseAdder (line 205) | class GaussianNoiseAdder(BaseNoiseAdder):
    method __init__ (line 206) | def __init__(self, mean=-3.0, std=0.5, clear_ratio=0.05):
    method add_noise (line 212) | def add_noise(self, masked_pixel_values, mask):

FILE: opensora/utils/parallel_states.py
  class COMM_INFO (line 5) | class COMM_INFO:
    method __init__ (line 6) | def __init__(self):
  function initialize_sequence_parallel_state (line 13) | def initialize_sequence_parallel_state(sequence_parallel_size):
  function set_sequence_parallel_state (line 19) | def set_sequence_parallel_state(state):
  function get_sequence_parallel_state (line 23) | def get_sequence_parallel_state():
  function initialize_sequence_parallel_group (line 26) | def initialize_sequence_parallel_group(sequence_parallel_size):
  function destroy_sequence_parallel_group (line 42) | def destroy_sequence_parallel_group():

FILE: opensora/utils/sample_utils.py
  function get_scheduler (line 38) | def get_scheduler(args):
  function prepare_pipeline (line 82) | def prepare_pipeline(args, dtype, device):
  function init_gpu_env (line 166) | def init_gpu_env(args):
  function init_npu_env (line 180) | def init_npu_env(args):
  function save_video_grid (line 195) | def save_video_grid(video, nrow=None):
  function run_model_and_save_samples (line 222) | def run_model_and_save_samples(args, pipeline, caption_refiner_model=Non...
  function run_model_and_save_samples_npu (line 417) | def run_model_and_save_samples_npu(args, pipeline, caption_refiner_model...
  function get_args (line 443) | def get_args():

FILE: opensora/utils/utils.py
  function to_2tuple (line 39) | def to_2tuple(x):
  function explicit_uniform_sampling (line 47) | def explicit_uniform_sampling(T, n, rank, bsz, device):
  function get_grad_norm (line 77) | def get_grad_norm(
  function clip_grad_norm_ (line 115) | def clip_grad_norm_(
  function get_experiment_dir (line 172) | def get_experiment_dir(root_dir, args):
  function get_precision (line 188) | def get_precision(args):
  function create_logger (line 201) | def create_logger(logging_dir):
  function create_tensorboard (line 221) | def create_tensorboard(tensorboard_dir):
  function write_tensorboard (line 232) | def write_tensorboard(writer, *args):
  function get_npu_power (line 240) | def get_npu_power():
  function monitor_npu_power (line 261) | def monitor_npu_power():
  function update_ema (line 272) | def update_ema(ema_model, model, decay=0.9999):
  function requires_grad (line 284) | def requires_grad(model, flag=True):
  function cleanup (line 292) | def cleanup():
  function set_seed (line 300) | def set_seed(seed, rank, device_specific=True):
  function setup_distributed (line 311) | def setup_distributed(backend="nccl", port=None):
  function collect_env (line 352) | def collect_env():
  function text_preprocessing (line 375) | def text_preprocessing(text, support_Chinese=True):
  function basic_clean (line 380) | def basic_clean(text):
  function clean_caption (line 385) | def clean_caption(caption, support_Chinese=True):

Condensed preview: 171 files, each showing path, character count, and a content snippet (full structured content: 1,397K chars).

[
  {
    "path": ".github/workflows/docker_build.yml",
    "chars": 931,
    "preview": "name: docker-build\n\non:\n  workflow_dispatch:\n  push:\n    branches:\n      - \"main\"\n    paths:\n      - \"docker/Dockerfile\""
  },
  {
    "path": ".gitignore",
    "chars": 1103,
    "preview": "ucf101_stride4x4x4\n__pycache__\n*.mp4\n.ipynb_checkpoints\n*.pth\nUCF-101/\nresults/\nbuild/\nopensora.egg-info/\nwandb/\n.idea\n*"
  },
  {
    "path": "LICENSE",
    "chars": 1078,
    "preview": "MIT License\n\nCopyright (c) Rabbitpre Intelligence Ltd\n\nPermission is hereby granted, free of charge, to any person obtai"
  },
  {
    "path": "README.md",
    "chars": 14628,
    "preview": "\n\n<h1 align=\"left\"> <a href=\"\">Open-Sora Plan</a></h1>\n\nThis project aims to create a simple and scalable repo, to repro"
  },
  {
    "path": "docs/Contribution_Guidelines.md",
    "chars": 3020,
    "preview": "# Contributing to the Open-Sora Plan Community\n\nThe Open-Sora Plan open-source community is a collaborative initiative d"
  },
  {
    "path": "docs/Prompt_Refiner.md",
    "chars": 2140,
    "preview": "## Data\n\nWe have open-sourced our dataset of 32,555 pairs, which includes Chinese data. The dataset is available [here]("
  },
  {
    "path": "docs/Report-v1.0.0-cn.md",
    "chars": 7222,
    "preview": "# 技术报告 v1.0.0\n\n在2024年3月，我们推出了Open-Sora-Plan，一个旨在复现OpenAI [Sora](https://openai.com/sora)的开源计划。它作为一个基础的开源框架，能够训练视频生成模型包括无"
  },
  {
    "path": "docs/Report-v1.0.0.md",
    "chars": 13247,
    "preview": "# Report v1.0.0\n\nIn March 2024, we launched a plan called Open-Sora-Plan, which aims to reproduce the OpenAI [Sora](http"
  },
  {
    "path": "docs/Report-v1.1.0.md",
    "chars": 27002,
    "preview": "# Report v1.1.0\n\nIn April 2024, we launched Open-Sora-Plan v1.0.0, featuring a simple and efficient design along with re"
  },
  {
    "path": "docs/Report-v1.2.0.md",
    "chars": 18040,
    "preview": "# Report v1.2.0\n\nIn May 2024, we launched Open-Sora-Plan v1.1.0, featuring a 2+1D model architecture that could be quick"
  },
  {
    "path": "docs/Report-v1.3.0.md",
    "chars": 43185,
    "preview": "# Report v1.3.0\n\nIn August 2024, we released Open-Sora-Plan v1.2.0, transitioning to a 3D full attention architecture, w"
  },
  {
    "path": "docs/Report-v1.5.0.md",
    "chars": 20888,
    "preview": "## Report v1.5.0\r\n\r\nIn October 2024, we released Open-Sora Plan v1.3.0, introducing the sparse attention structure, Skip"
  },
  {
    "path": "docs/Report-v1.5.0_cn.md",
    "chars": 11033,
    "preview": "## Report v1.5.0\r\n\r\n在2024年的10月，我们发布了Open-Sora Plan v1.3.0，第一次将一种稀疏化的attention结构——skiparse attention引入video generation领域。"
  },
  {
    "path": "docs/VAE.md",
    "chars": 3366,
    "preview": "\n### Data prepare\nThe organization of the training data is easy. We only need to put all the videos recursively in a dir"
  },
  {
    "path": "examples/cond_pix_path.txt",
    "chars": 68,
    "preview": "examples/test_img1.png\nexamples/test_img2.png\nexamples/test_img3.png"
  },
  {
    "path": "examples/cond_prompt.txt",
    "chars": 146,
    "preview": "A rocket ascends slowly into the sky.\nAlong the coast, variously sized boats float on the lake.\nThe landscape at sunset "
  },
  {
    "path": "examples/rec_image.py",
    "chars": 2376,
    "preview": "import sys\nsys.path.append(\".\")\nfrom PIL import Image\nimport torch\nfrom torchvision.transforms import ToTensor, Compose,"
  },
  {
    "path": "examples/rec_video.py",
    "chars": 5931,
    "preview": "import math\nimport random\nimport argparse\nfrom typing import Optional\n\nimport cv2\nimport numpy as np\nimport numpy.typing"
  },
  {
    "path": "examples/sora.txt",
    "chars": 11586,
    "preview": "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black lea"
  },
  {
    "path": "opensora/__init__.py",
    "chars": 2,
    "preview": "# "
  },
  {
    "path": "opensora/acceleration/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "opensora/acceleration/communications.py",
    "chars": 5884,
    "preview": "import torch\nimport torch.distributed as dist\nfrom einops import rearrange\nfrom opensora.acceleration.parallel_states im"
  },
  {
    "path": "opensora/acceleration/parallel_states.py",
    "chars": 1993,
    "preview": "import torch\nimport torch_npu\nimport torch.distributed as dist\nimport os\ntry:\n    from lcalib.functional import lcal_ini"
  },
  {
    "path": "opensora/adaptor/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "opensora/adaptor/bf16_optimizer.py",
    "chars": 20331,
    "preview": "# Copyright (c) Microsoft Corporation.\n# SPDX-License-Identifier: Apache-2.0\n\n# DeepSpeed Team\n\nfrom collections import "
  },
  {
    "path": "opensora/adaptor/engine.py",
    "chars": 166017,
    "preview": "# Copyright (c) Microsoft Corporation.\n# SPDX-License-Identifier: Apache-2.0\n\n# DeepSpeed Team\n\nimport os\nimport re\nimpo"
  },
  {
    "path": "opensora/adaptor/modules.py",
    "chars": 899,
    "preview": "import torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\n\ndef fp32_layer_norm_forward(self, inputs: torch"
  },
  {
    "path": "opensora/adaptor/stage_1_and_2.py",
    "chars": 120962,
    "preview": "# Copyright (c) Microsoft Corporation.\n# SPDX-License-Identifier: Apache-2.0\n\n# DeepSpeed Team\n\nimport torch\nimport os\ni"
  },
  {
    "path": "opensora/adaptor/utils.py",
    "chars": 39237,
    "preview": "# Copyright (c) Microsoft Corporation.\n# SPDX-License-Identifier: Apache-2.0\n\n# DeepSpeed Team\n\"\"\"\nCopyright NVIDIA/Mega"
  },
  {
    "path": "opensora/adaptor/zp_manager.py",
    "chars": 885,
    "preview": "import torch\nimport os\nimport torch.distributed as dist\n\n\nclass ZPManager(object):\n    def __init__(self, zp_size=8):\n  "
  },
  {
    "path": "opensora/dataset/__init__.py",
    "chars": 5033,
    "preview": "from torchvision.transforms import Compose\nfrom transformers import AutoTokenizer, AutoImageProcessor\n\nfrom torchvision "
  },
  {
    "path": "opensora/dataset/inpaint_dataset.py",
    "chars": 11260,
    "preview": "import time\nimport traceback\n\ntry:\n    import torch_npu\n    from opensora.npu_config import npu_config\nexcept:\n    torch"
  },
  {
    "path": "opensora/dataset/t2v_datasets.py",
    "chars": 29485,
    "preview": "import time\nimport traceback\n\ntry:\n    import torch_npu\n    from opensora.npu_config import npu_config\nexcept:\n    torch"
  },
  {
    "path": "opensora/dataset/transform.py",
    "chars": 30532,
    "preview": "import torch\nimport random\nimport numbers\nfrom torchvision.transforms import RandomCrop, RandomResizedCrop\nimport statis"
  },
  {
    "path": "opensora/dataset/virtual_disk.py",
    "chars": 7467,
    "preview": "import subprocess\nimport json\nimport pickle\nfrom collections import OrderedDict\nfrom opensora.npu_config import npu_conf"
  },
  {
    "path": "opensora/models/__init__.py",
    "chars": 68,
    "preview": "from .causalvideovae import CausalVAEModelWrapper, WFVAEModelWrapper"
  },
  {
    "path": "opensora/models/causalvideovae/__init__.py",
    "chars": 3860,
    "preview": "from torchvision.transforms import Lambda\nfrom .model.vae import CausalVAEModel, WFVAEModel\nfrom einops import rearrange"
  },
  {
    "path": "opensora/models/causalvideovae/dataset/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "opensora/models/causalvideovae/dataset/ddp_sampler.py",
    "chars": 6493,
    "preview": "import math\nfrom typing import TypeVar, Optional, Iterator\n\nimport torch\nfrom torch.utils.data import Sampler, Dataset\ni"
  },
  {
    "path": "opensora/models/causalvideovae/dataset/transform.py",
    "chars": 18157,
    "preview": "import torch\nimport random\nimport numbers\nfrom torchvision.transforms import RandomCrop, RandomResizedCrop\n\n\ndef _is_ten"
  },
  {
    "path": "opensora/models/causalvideovae/dataset/video_dataset.py",
    "chars": 8184,
    "preview": "import os.path as osp\nimport random\nfrom glob import glob\nfrom torchvision import transforms\nimport numpy as np\nimport t"
  },
  {
    "path": "opensora/models/causalvideovae/eval/cal_fvd.py",
    "chars": 2540,
    "preview": "import numpy as np\nimport torch\nfrom tqdm import tqdm\n\ndef trans(x):\n    # if greyscale images add channel\n    if x.shap"
  },
  {
    "path": "opensora/models/causalvideovae/eval/cal_lpips.py",
    "chars": 2786,
    "preview": "import numpy as np\nimport torch\nfrom tqdm import tqdm\nimport math\n\nimport torch\nimport lpips\n\nspatial = True         # R"
  },
  {
    "path": "opensora/models/causalvideovae/eval/cal_psnr.py",
    "chars": 2510,
    "preview": "import numpy as np\nimport torch\nfrom tqdm import tqdm\nimport math\n\ndef img_psnr_cuda(img1, img2):\n    # [0,1]\n    # comp"
  },
  {
    "path": "opensora/models/causalvideovae/eval/cal_ssim.py",
    "chars": 3506,
    "preview": "import numpy as np\nimport torch\nfrom tqdm import tqdm\nimport cv2\n \ndef ssim(img1, img2):\n    C1 = 0.01 ** 2\n    C2 = 0.0"
  },
  {
    "path": "opensora/models/causalvideovae/eval/eval.py",
    "chars": 6170,
    "preview": "import os\nfrom argparse import ArgumentDefaultsHelpFormatter, ArgumentParser\nimport numpy as np\nimport torch\nfrom torch."
  },
  {
    "path": "opensora/models/causalvideovae/eval/fvd/styleganv/fvd.py",
    "chars": 3116,
    "preview": "import torch\nimport os\nimport math\nimport torch.nn.functional as F\n\n# https://github.com/universome/fvd-comparison\n\n\ndef"
  },
  {
    "path": "opensora/models/causalvideovae/eval/fvd/videogpt/fvd.py",
    "chars": 5368,
    "preview": "import torch\nimport os\nimport math\nimport torch.nn.functional as F\nimport numpy as np\nimport einops\n\ndef load_i3d_pretra"
  },
  {
    "path": "opensora/models/causalvideovae/eval/fvd/videogpt/pytorch_i3d.py",
    "chars": 13385,
    "preview": "# Original code from https://github.com/piergiaj/pytorch-i3d\nimport torch\nimport torch.nn as nn\nimport torch.nn.function"
  },
  {
    "path": "opensora/models/causalvideovae/eval/script/cal_clip_score.sh",
    "chars": 584,
    "preview": "# clip_score cross modality\npython eval_clip_score.py \\\n    --real_path path/to/image \\\n    --generated_path path/to/tex"
  },
  {
    "path": "opensora/models/causalvideovae/eval/script/cal_fvd.sh",
    "chars": 248,
    "preview": "python eval_common_metric.py \\\n    --real_video_dir path/to/imageA\\\n    --generated_video_dir path/to/imageB \\\n    --bat"
  },
  {
    "path": "opensora/models/causalvideovae/eval/script/cal_lpips.sh",
    "chars": 218,
    "preview": "python eval_common_metric.py \\\n    --real_video_dir path/to/imageA\\\n    --generated_video_dir path/to/imageB \\\n    --bat"
  },
  {
    "path": "opensora/models/causalvideovae/eval/script/cal_psnr.sh",
    "chars": 251,
    "preview": "\npython eval_common_metric.py \\\n    --real_video_dir /data/xiaogeng_liu/data/video1 \\\n    --generated_video_dir /data/xi"
  },
  {
    "path": "opensora/models/causalvideovae/eval/script/cal_ssim.sh",
    "chars": 250,
    "preview": "python eval_common_metric.py \\\n    --real_video_dir /data/xiaogeng_liu/data/video1 \\\n    --generated_video_dir /data/xia"
  },
  {
    "path": "opensora/models/causalvideovae/model/__init__.py",
    "chars": 88,
    "preview": "from .registry import ModelRegistry\nfrom .vae import (\n    CausalVAEModel, WFVAEModel\n)\n"
  },
  {
    "path": "opensora/models/causalvideovae/model/configuration_videobase.py",
    "chars": 1657,
    "preview": "import json\nimport yaml\nfrom typing import TypeVar, Dict, Any\nfrom diffusers import ConfigMixin\n\nT = TypeVar('T', bound="
  },
  {
    "path": "opensora/models/causalvideovae/model/dataset_videobase.py",
    "chars": 4263,
    "preview": "import os.path as osp\nimport random\nfrom glob import glob\n\nfrom torchvision import transforms\nimport numpy as np\nimport "
  },
  {
    "path": "opensora/models/causalvideovae/model/ema_model.py",
    "chars": 1027,
    "preview": "class EMA:\n    def __init__(self, model, decay):\n        self.model = model\n        self.decay = decay\n        self.shad"
  },
  {
    "path": "opensora/models/causalvideovae/model/losses/__init__.py",
    "chars": 54,
    "preview": "from .perceptual_loss import LPIPSWithDiscriminator3D\n"
  },
  {
    "path": "opensora/models/causalvideovae/model/losses/discriminator.py",
    "chars": 5100,
    "preview": "import functools\nimport torch.nn as nn\nfrom ..modules.conv import CausalConv3d\nfrom einops import rearrange\n\ndef weights"
  },
  {
    "path": "opensora/models/causalvideovae/model/losses/lpips.py",
    "chars": 4824,
    "preview": "\"\"\"Stripped version of https://github.com/richzhang/PerceptualSimilarity/tree/master/models\"\"\"\n\nimport torch\nimport torc"
  },
  {
    "path": "opensora/models/causalvideovae/model/losses/perceptual_loss.py",
    "chars": 7913,
    "preview": "import torch\nfrom torch import nn\nimport torch.nn.functional as F\nfrom .lpips import LPIPS\nfrom einops import rearrange\n"
  },
  {
    "path": "opensora/models/causalvideovae/model/modeling_videobase.py",
    "chars": 1982,
    "preview": "import torch\nfrom diffusers import ModelMixin, ConfigMixin\nfrom torch import nn\nimport os\nimport json\nfrom diffusers.con"
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/__init__.py",
    "chars": 173,
    "preview": "from .block import Block\nfrom .attention import *\nfrom .conv import *\nfrom .normalize import *\nfrom .resnet_block import"
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/attention.py",
    "chars": 4159,
    "preview": "import torch.nn as nn\nimport torch.nn.functional as F\nfrom .normalize import Normalize\nfrom .conv import CausalConv3d\nim"
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/block.py",
    "chars": 137,
    "preview": "import torch.nn as nn\n\nclass Block(nn.Module):\n    def __init__(self, *args, **kwargs) -> None:\n        super().__init__"
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/conv.py",
    "chars": 3816,
    "preview": "try:\n    import torch_npu\n    from opensora.npu_config import npu_config\nexcept:\n    torch_npu = None\n    npu_config = N"
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/normalize.py",
    "chars": 1333,
    "preview": "import torch\nimport torch.nn as nn\nfrom .block import Block\nfrom einops import rearrange\n\nclass GroupNorm(Block):\n    de"
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/ops.py",
    "chars": 1514,
    "preview": "import torch\nfrom einops import rearrange\n\ndef video_to_image(func):\n    def wrapper(self, x, *args, **kwargs):\n        "
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/quant.py",
    "chars": 3618,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.distributed as dist\nimport numpy as np\nimport torch.nn.functional as F\nf"
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/resnet_block.py",
    "chars": 5588,
    "preview": "try:\n    import torch_npu\n    from opensora.npu_config import npu_config\nexcept:\n    torch_npu = None\n    npu_config = N"
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/updownsample.py",
    "chars": 12194,
    "preview": "from typing import Union, Tuple\nfrom collections import deque\nimport torch\nimport torch.nn as nn\nimport torch.nn.functio"
  },
  {
    "path": "opensora/models/causalvideovae/model/modules/wavelet.py",
    "chars": 11716,
    "preview": "import torch\nimport torch.nn.functional as F\nimport torch.nn as nn\nfrom ..modules import CausalConv3d\nfrom ..modules.ops"
  },
  {
    "path": "opensora/models/causalvideovae/model/registry.py",
    "chars": 329,
    "preview": "class ModelRegistry:\n    _models = {}\n\n    @classmethod\n    def register(cls, model_name):\n        def decorator(model_c"
  },
  {
    "path": "opensora/models/causalvideovae/model/trainer_videobase.py",
    "chars": 966,
    "preview": "from transformers import Trainer\nimport torch.nn.functional as F\nfrom typing import Optional\nimport os\nimport torch\nfrom"
  },
  {
    "path": "opensora/models/causalvideovae/model/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "opensora/models/causalvideovae/model/utils/distrib_utils.py",
    "chars": 1653,
    "preview": "import torch\nimport numpy as np\n\nclass DiagonalGaussianDistribution(object):\n    def __init__(self, parameters, determin"
  },
  {
    "path": "opensora/models/causalvideovae/model/utils/module_utils.py",
    "chars": 574,
    "preview": "import importlib\n\nModule = str\nMODULES_BASE = \"opensora.models.causalvideovae.model.modules.\"\n\ndef resolve_str_to_obj(st"
  },
  {
    "path": "opensora/models/causalvideovae/model/utils/scheduler_utils.py",
    "chars": 260,
    "preview": "import torch\n\ndef cosine_scheduler(step, max_steps, value_base=1, value_end=0):\n    step = torch.tensor(step)\n    cosine"
  },
  {
    "path": "opensora/models/causalvideovae/model/utils/video_utils.py",
    "chars": 256,
    "preview": "import torch\nimport numpy as np\n\ndef tensor_to_video(x):\n    x = (x * 2 - 1).detach().cpu()\n    x = torch.clamp(x, -1, 1"
  },
  {
    "path": "opensora/models/causalvideovae/model/utils/wavelet_utils.py",
    "chars": 10096,
    "preview": "import torch\nimport torch.nn.functional as F\nimport torch.nn as nn\nfrom ..modules import CausalConv3d\nfrom einops import"
  },
  {
    "path": "opensora/models/causalvideovae/model/vae/__init__.py",
    "chars": 135,
    "preview": "from .modeling_causalvae import CausalVAEModel\nfrom .modeling_wfvae import WFVAEModel\nfrom einops import rearrange\nfrom "
  },
  {
    "path": "opensora/models/causalvideovae/model/vae/modeling_causalvae.py",
    "chars": 23523,
    "preview": "\ntry:\n    import torch_npu\n    from opensora.npu_config import npu_config\nexcept:\n    torch_npu = None\n    npu_config = "
  },
  {
    "path": "opensora/models/causalvideovae/model/vae/modeling_wfvae.py",
    "chars": 19452,
    "preview": "try:\n    import torch_npu\n    from opensora.npu_config import npu_config\nexcept:\n    torch_npu = None\n    npu_config = N"
  },
  {
    "path": "opensora/models/causalvideovae/sample/rec_video_vae.py",
    "chars": 4261,
    "preview": "import argparse\nfrom tqdm import tqdm\nimport torch\nimport sys\nfrom torch.utils.data import DataLoader, Subset\nimport os\n"
  },
  {
    "path": "opensora/models/causalvideovae/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "opensora/models/causalvideovae/utils/dataset_utils.py",
    "chars": 1318,
    "preview": "import math\nfrom einops import rearrange\nimport decord\nfrom torch.nn import functional as F\nimport torch\n\n\nIMG_EXTENSION"
  },
  {
    "path": "opensora/models/causalvideovae/utils/downloader.py",
    "chars": 492,
    "preview": "import gdown\nimport os\n\nopensora_cache_home = os.path.expanduser(\n    os.getenv(\"OPENSORA_HOME\", os.path.join(\"~/.cache\""
  },
  {
    "path": "opensora/models/causalvideovae/utils/video_utils.py",
    "chars": 2096,
    "preview": "import torch\nimport numpy as np\nimport numpy.typing as npt\nimport cv2\nfrom decord import VideoReader, cpu\n\ndef array_to_"
  },
  {
    "path": "opensora/models/diffusion/__init__.py",
    "chars": 568,
    "preview": "from .opensora_v1_3.modeling_opensora import OpenSora_v1_3_models\nfrom .opensora_v1_3.modeling_inpaint import OpenSoraIn"
  },
  {
    "path": "opensora/models/diffusion/common.py",
    "chars": 5139,
    "preview": "import torch\nfrom einops import rearrange, repeat\nfrom typing import Any, Dict, Optional, Tuple\nimport torch\nimport torc"
  },
  {
    "path": "opensora/models/diffusion/opensora_v1_3/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "opensora/models/diffusion/opensora_v1_3/modeling_inpaint.py",
    "chars": 6069,
    "preview": "import os\nimport numpy as np\nfrom torch import nn\nimport torch\nfrom einops import rearrange, repeat\nfrom typing import A"
  },
  {
    "path": "opensora/models/diffusion/opensora_v1_3/modeling_opensora.py",
    "chars": 16654,
    "preview": "import os\nimport numpy as np\nfrom torch import nn\nimport torch\nfrom einops import rearrange, repeat\nfrom typing import A"
  },
  {
    "path": "opensora/models/diffusion/opensora_v1_3/modules.py",
    "chars": 18752,
    "preview": "import torch\nfrom einops import rearrange, repeat\nfrom typing import Any, Dict, Optional, Tuple\nimport torch.nn.function"
  },
  {
    "path": "opensora/models/frame_interpolation/cfgs/AMT-G.yaml",
    "chars": 113,
    "preview": "\nseed: 2023\n\nnetwork:\n  name: networks.AMT-G.Model\n  params:\n    corr_radius: 3\n    corr_lvls: 4\n    num_flows: 5"
  },
  {
    "path": "opensora/models/frame_interpolation/interpolation.py",
    "chars": 6643,
    "preview": "# this script is modified from https://github.com/MCG-NKU/AMT/blob/main/demos/demo_2x.py\nfrom json import load\nimport os"
  },
  {
    "path": "opensora/models/frame_interpolation/networks/AMT-G.py",
    "chars": 8009,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom networks.blocks.raft import (\n    coords_grid,\n "
  },
  {
    "path": "opensora/models/frame_interpolation/networks/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "opensora/models/frame_interpolation/networks/blocks/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "opensora/models/frame_interpolation/networks/blocks/feat_enc.py",
    "chars": 11300,
    "preview": "import torch\nimport torch.nn as nn\n\n\nclass BottleneckBlock(nn.Module):\n    def __init__(self, in_planes, planes, norm_fn"
  },
  {
    "path": "opensora/models/frame_interpolation/networks/blocks/ifrnet.py",
    "chars": 4260,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom utils.flow_utils import warp\n\n\ndef resize(x, sca"
  },
  {
    "path": "opensora/models/frame_interpolation/networks/blocks/multi_flow.py",
    "chars": 3000,
    "preview": "import torch\nimport torch.nn as nn\nfrom utils.flow_utils import warp\nfrom networks.blocks.ifrnet import (\n    convrelu, "
  },
  {
    "path": "opensora/models/frame_interpolation/networks/blocks/raft.py",
    "chars": 8113,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\ndef resize(x, scale_factor):\n    return F.interpola"
  },
  {
    "path": "opensora/models/frame_interpolation/readme.md",
    "chars": 1200,
    "preview": "#### Frame Interpolation\r\n\r\nWe use AMT as our frame interpolation model. (Thanks [AMT](https://github.com/MCG-NKU/AMT)) "
  },
  {
    "path": "opensora/models/frame_interpolation/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "opensora/models/frame_interpolation/utils/build_utils.py",
    "chars": 323,
    "preview": "import importlib\n\n\ndef base_build_fn(module, cls, params):\n    return getattr(importlib.import_module(\n                 "
  },
  {
    "path": "opensora/models/frame_interpolation/utils/dist_utils.py",
    "chars": 1460,
    "preview": "import os\nimport torch\n\n\ndef get_world_size():\n    \"\"\"Find OMPI world size without calling mpi functions\n    :rtype: int"
  },
  {
    "path": "opensora/models/frame_interpolation/utils/flow_utils.py",
    "chars": 4325,
    "preview": "import numpy as np\nimport torch\nfrom PIL import ImageFile\nimport torch.nn.functional as F\nImageFile.LOAD_TRUNCATED_IMAGE"
  },
  {
    "path": "opensora/models/frame_interpolation/utils/utils.py",
    "chars": 8257,
    "preview": "import re\nimport sys\nimport torch\nimport random\nimport numpy as np\nfrom PIL import ImageFile\nimport torch.nn.functional "
  },
  {
    "path": "opensora/models/prompt_refiner/inference.py",
    "chars": 1804,
    "preview": "from transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\nfrom tqdm import tqdm\nimport argparse\n\ndef get"
  },
  {
    "path": "opensora/models/prompt_refiner/merge.py",
    "chars": 3279,
    "preview": "import os\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom peft import PeftModel\nimport torch\nimport ar"
  },
  {
    "path": "opensora/models/prompt_refiner/train.py",
    "chars": 3344,
    "preview": "from datasets import Dataset\nimport pandas as pd\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, DataColla"
  },
  {
    "path": "opensora/models/text_encoder/__init__.py",
    "chars": 762,
    "preview": "from opensora.models.text_encoder.clip import CLIPWrapper\nfrom opensora.models.text_encoder.t5 import T5Wrapper\n\ntext_en"
  },
  {
    "path": "opensora/models/text_encoder/clip.py",
    "chars": 985,
    "preview": "import torch\nfrom torch import nn\nfrom transformers import CLIPTextModelWithProjection\n\ntry:\n    import torch_npu\nexcept"
  },
  {
    "path": "opensora/models/text_encoder/t5.py",
    "chars": 676,
    "preview": "import torch\nfrom torch import nn\nfrom transformers import T5EncoderModel\n\ntry:\n    import torch_npu\nexcept:\n    torch_n"
  },
  {
    "path": "opensora/npu_config.py",
    "chars": 15829,
    "preview": "import math\nimport mmap\nimport os\nimport pickle\nimport random\nimport numpy as np\nimport torch\nimport subprocess\nimport s"
  },
  {
    "path": "opensora/sample/caption_refiner.py",
    "chars": 1629,
    "preview": "import torch\nfrom torch import nn\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\n\n\nTEMPLATE = \"\"\"\nRefine "
  },
  {
    "path": "opensora/sample/pipeline_inpaint.py",
    "chars": 28844,
    "preview": "\nimport inspect\nimport os\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\nfrom dataclasses import datacl"
  },
  {
    "path": "opensora/sample/pipeline_opensora.py",
    "chars": 35666,
    "preview": "\nimport inspect\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\nfrom dataclasses import dataclass\nimport"
  },
  {
    "path": "opensora/sample/rec_image.py",
    "chars": 2377,
    "preview": "import sys\nsys.path.append(\".\")\nfrom PIL import Image\nimport torch\nfrom torchvision.transforms import ToTensor, Compose,"
  },
  {
    "path": "opensora/sample/rec_video.py",
    "chars": 5952,
    "preview": "import math\nimport random\nimport argparse\nfrom typing import Optional\n\nimport cv2\nimport numpy as np\nimport numpy.typing"
  },
  {
    "path": "opensora/sample/sample.py",
    "chars": 1465,
    "preview": "import os\nimport torch\ntry:\n    import torch_npu\n    from opensora.npu_config import npu_config\nexcept:\n    torch_npu = "
  },
  {
    "path": "opensora/serve/gradio_utils.py",
    "chars": 15266,
    "preview": "import random\n\nimport imageio\nimport uuid\nimport torch\n\nimport numpy as np\n\n\nPOS_PROMPT = \"\"\"\n    high quality, high aes"
  },
  {
    "path": "opensora/serve/gradio_web_server.py",
    "chars": 10555,
    "preview": "import gradio as gr\nimport os\nimport torch\nfrom einops import rearrange\nimport torch.distributed as dist\nfrom torchvisio"
  },
  {
    "path": "opensora/serve/gradio_web_server_i2v.py",
    "chars": 11290,
    "preview": "import gradio as gr\nimport os\nimport torch\nfrom einops import rearrange\nimport torch.distributed as dist\nfrom torchvisio"
  },
  {
    "path": "opensora/serve/style.css",
    "chars": 41,
    "preview": ".gradio-container{width:1280px!important}"
  },
  {
    "path": "opensora/train/train_causalvae.py",
    "chars": 24304,
    "preview": "import os\nimport torch\nimport torch.distributed as dist\nfrom torch.nn.parallel import DistributedDataParallel as DDP\n\nfr"
  },
  {
    "path": "opensora/train/train_inpaint.py",
    "chars": 53905,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n\n# This source code is licensed under the li"
  },
  {
    "path": "opensora/train/train_t2v_diffusers.py",
    "chars": 60704,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n\n# This source code is licensed under the li"
  },
  {
    "path": "opensora/utils/communications.py",
    "chars": 5453,
    "preview": "import torch\nimport torch.distributed as dist\nfrom einops import rearrange\nfrom opensora.utils.parallel_states import nc"
  },
  {
    "path": "opensora/utils/dataset_utils.py",
    "chars": 18434,
    "preview": "import math\nfrom einops import rearrange\nimport decord\nfrom torch.nn import functional as F\nimport torch\nfrom typing imp"
  },
  {
    "path": "opensora/utils/downloader.py",
    "chars": 492,
    "preview": "import gdown\nimport os\n\nopensora_cache_home = os.path.expanduser(\n    os.getenv(\"OPENSORA_HOME\", os.path.join(\"~/.cache\""
  },
  {
    "path": "opensora/utils/ema.py",
    "chars": 13521,
    "preview": "import contextlib\nimport copy\nimport random\nfrom typing import Any, Dict, Iterable, List, Optional, Union\n\nfrom diffuser"
  },
  {
    "path": "opensora/utils/ema_utils.py",
    "chars": 2531,
    "preview": "\nfrom peft import get_peft_model, PeftModel\nimport os\nfrom copy import deepcopy\nimport torch\nimport json\nfrom diffusers."
  },
  {
    "path": "opensora/utils/freeinit_utils.py",
    "chars": 4663,
    "preview": "import torch\nimport torch.fft as fft\nimport math\n\n\ndef freq_mix_3d(x, noise, LPF):\n    \"\"\"\n    Noise reinitialization.\n\n"
  },
  {
    "path": "opensora/utils/lora_utils.py",
    "chars": 1645,
    "preview": "\nfrom peft import get_peft_model, PeftModel\nimport os\nfrom copy import deepcopy\nimport torch\nimport json\n\ndef maybe_zero"
  },
  {
    "path": "opensora/utils/mask_utils.py",
    "chars": 9861,
    "preview": "\nfrom math import floor, ceil\nfrom abc import ABC, abstractmethod\nimport cv2\nimport torch\nimport torch.nn.functional as "
  },
  {
    "path": "opensora/utils/parallel_states.py",
    "chars": 1495,
    "preview": "import torch\nimport torch.distributed as dist\nimport os\n\nclass COMM_INFO:\n    def __init__(self):\n        self.group = N"
  },
  {
    "path": "opensora/utils/sample_utils.py",
    "chars": 21776,
    "preview": "from diffusers.schedulers import (\n    DDIMScheduler, DDPMScheduler, PNDMScheduler,\n    EulerDiscreteScheduler, DPMSolve"
  },
  {
    "path": "opensora/utils/utils.py",
    "chars": 18127,
    "preview": "import os\n\nimport torch\n\nimport os\nimport math\nimport torch\nimport logging\nimport random\nimport subprocess\nimport numpy "
  },
  {
    "path": "pyproject.toml",
    "chars": 1922,
    "preview": "[build-system]\nrequires = [\"setuptools>=61.0\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"opensora\"\nvers"
  },
  {
    "path": "scripts/accelerate_configs/ddp_config.yaml",
    "chars": 230,
    "preview": "compute_environment: LOCAL_MACHINE\ndistributed_type: MULTI_GPU\nfsdp_config: {}\nmachine_rank: 0\nmain_process_ip: null\nmai"
  },
  {
    "path": "scripts/accelerate_configs/deepspeed_zero2_config.yaml",
    "chars": 324,
    "preview": "compute_environment: LOCAL_MACHINE\ndistributed_type: DEEPSPEED\ndeepspeed_config:\n deepspeed_config_file: scripts/acceler"
  },
  {
    "path": "scripts/accelerate_configs/deepspeed_zero2_offload_config.yaml",
    "chars": 331,
    "preview": "compute_environment: LOCAL_MACHINE\ndistributed_type: DEEPSPEED\ndeepspeed_config:\n deepspeed_config_file: scripts/acceler"
  },
  {
    "path": "scripts/accelerate_configs/deepspeed_zero3_config.yaml",
    "chars": 323,
    "preview": "compute_environment: LOCAL_MACHINE\ndistributed_type: DEEPSPEED\ndeepspeed_config:\n deepspeed_config_file: scripts/acceler"
  },
  {
    "path": "scripts/accelerate_configs/deepspeed_zero3_offload_config.yaml",
    "chars": 331,
    "preview": "compute_environment: LOCAL_MACHINE\ndistributed_type: DEEPSPEED\ndeepspeed_config:\n deepspeed_config_file: scripts/acceler"
  },
  {
    "path": "scripts/accelerate_configs/default_config.yaml",
    "chars": 265,
    "preview": "compute_environment: LOCAL_MACHINE\ndistributed_type: MULTI_GPU\nfsdp_config: {}\nmachine_rank: 0\nmain_process_ip: null\nmai"
  },
  {
    "path": "scripts/accelerate_configs/hostfile",
    "chars": 332,
    "preview": "100.64.24.30 slots=8\n100.64.24.6 slots=8\n100.64.24.7 slots=8\n100.64.24.8 slots=8\n100.64.24.10 slots=8\n100.64.24.11 slots"
  },
  {
    "path": "scripts/accelerate_configs/zero2.json",
    "chars": 621,
    "preview": "{\n    \"fp16\": {\n        \"enabled\": false,\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial_s"
  },
  {
    "path": "scripts/accelerate_configs/zero2_npu.json",
    "chars": 640,
    "preview": "{\n    \"fp16\": {\n        \"enabled\": false,\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial_s"
  },
  {
    "path": "scripts/accelerate_configs/zero2_offload.json",
    "chars": 760,
    "preview": "{\r\n    \"fp16\": {\r\n        \"enabled\": \"auto\",\r\n        \"loss_scale\": 0,\r\n        \"loss_scale_window\": 1000,\r\n        \"ini"
  },
  {
    "path": "scripts/accelerate_configs/zero3.json",
    "chars": 864,
    "preview": "{\n    \"fp16\": {\n        \"enabled\": \"auto\",\n        \"loss_scale\": 0,\n        \"loss_scale_window\": 1000,\n        \"initial_"
  },
  {
    "path": "scripts/accelerate_configs/zero3_offload.json",
    "chars": 955,
    "preview": "{\n  \"fp16\": {\n    \"enabled\": \"auto\",\n    \"loss_scale\": 0,\n    \"loss_scale_window\": 1000,\n    \"initial_scale_power\": 16,\n"
  },
  {
    "path": "scripts/causalvae/eval.sh",
    "chars": 690,
    "preview": "EXP_NAME=wfvae-4dim\nSAMPLE_RATE=1\nNUM_FRAMES=33\nRESOLUTION=256\nMETRIC=lpips\nSUBSET_SIZE=0\nORIGIN_DIR=video_gen/${EXP_NAM"
  },
  {
    "path": "scripts/causalvae/prepare_eval.sh",
    "chars": 787,
    "preview": "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7\nDATASET_DIR=test_video\nEXP_NAME=wfvae\nSAMPLE_RATE=1\nNUM_FRAMES=33\nRESOLUTION"
  },
  {
    "path": "scripts/causalvae/rec_image.sh",
    "chars": 290,
    "preview": "CUDA_VISIBLE_DEVICES=0 python examples/rec_image.py \\\n    --ae WFVAEModel_D8_4x8x8 \\\n    --ae_path \"/storage/lcm/WF-VAE/"
  },
  {
    "path": "scripts/causalvae/rec_video.sh",
    "chars": 381,
    "preview": "CUDA_VISIBLE_DEVICES=1 python examples/rec_video.py \\\n    --ae WFVAEModel_D8_4x8x8 \\\n    --ae_path \"/storage/lcm/WF-VAE/"
  },
  {
    "path": "scripts/causalvae/train.sh",
    "chars": 1421,
    "preview": "export WANDB_PROJECT=WFVAE\nexport CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7\nexport GLOO_SOCKET_IFNAME=bond0\nexport NCCL_SOCKE"
  },
  {
    "path": "scripts/causalvae/wfvae_4dim.json",
    "chars": 836,
    "preview": "{\n    \"_class_name\": \"WFVAEModel\",\n    \"_diffusers_version\": \"0.30.2\",\n    \"base_channels\": 128,\n    \"connect_res_layer_"
  },
  {
    "path": "scripts/slurm/placeholder",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "scripts/text_condition/gpu/sample_inpaint_v1_3.sh",
    "chars": 958,
    "preview": "\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node 8 --master_port 29513 \\\n    -m opensora.sampl"
  },
  {
    "path": "scripts/text_condition/gpu/sample_t2v_v1_3.sh",
    "chars": 933,
    "preview": "\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node 8 --master_port 29514 \\\n    -m opensora.sampl"
  },
  {
    "path": "scripts/text_condition/gpu/train_inpaint_v1_3.sh",
    "chars": 2055,
    "preview": "\nexport HF_DATASETS_OFFLINE=1 \nexport TRANSFORMERS_OFFLINE=1\nexport PDSH_RCMD_TYPE=ssh\n# NCCL setting\nexport GLOO_SOCKET"
  },
  {
    "path": "scripts/text_condition/gpu/train_t2v_v1_3.sh",
    "chars": 1974,
    "preview": "\nexport HF_DATASETS_OFFLINE=1 \nexport TRANSFORMERS_OFFLINE=1\nexport PDSH_RCMD_TYPE=ssh\n# NCCL setting\nexport GLOO_SOCKET"
  },
  {
    "path": "scripts/text_condition/npu/sample_inpaint_v1_3.sh",
    "chars": 1065,
    "preview": "\nexport TASK_QUEUE_ENABLE=0\ntorchrun --nnodes=1 --nproc_per_node 8 --master_port 29522 \\\n    -m opensora.sample.sample \\"
  },
  {
    "path": "scripts/text_condition/npu/sample_t2v_v1_3.sh",
    "chars": 770,
    "preview": "\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nnodes=1 --nproc_per_node 8 --master_port 29513 \\\n    -m opensora.sampl"
  },
  {
    "path": "scripts/text_condition/npu/train_inpaint_v1_3.sh",
    "chars": 2217,
    "preview": "\nexport PROJECT=$PROJECT_NAME\n# export PROJECT='test'\nexport HF_DATASETS_OFFLINE=1 \nexport TRANSFORMERS_OFFLINE=1\n\nexpor"
  },
  {
    "path": "scripts/text_condition/npu/train_t2v_v1_3.sh",
    "chars": 2018,
    "preview": "\nexport PROJECT=$PROJECT_NAME\n# export PROJECT='test'\nexport HF_DATASETS_OFFLINE=1 \nexport TRANSFORMERS_OFFLINE=1\n\nexpor"
  },
  {
    "path": "scripts/train_configs/mask_config.yaml",
    "chars": 257,
    "preview": "# mask processor args\nmin_clear_ratio: 0.0\nmax_clear_ratio: 1.0 \n\n# mask_type_ratio_dict_video\nmask_type_ratio_dict_vide"
  },
  {
    "path": "scripts/train_data/merge_data.txt",
    "chars": 137,
    "preview": "/storage/dataset/recap_datacomp_1b_data/output,/storage/anno_pkl/img_nocn_res160_pkl/recap_64part_filter_aes_res160_pkl/"
  }
]

About this extraction

This page contains the full source code of the PKU-YuanGroup/Open-Sora-Plan GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 171 files (1.3 MB), approximately 332.9k tokens, and a symbol index with 1372 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
